Skip to content

feat: add fr_FR locale to nemotron personas datasets#468

Merged
johnnygreco merged 5 commits intomainfrom
add-fr-fr-locale
Mar 31, 2026
Merged

feat: add fr_FR locale to nemotron personas datasets#468
johnnygreco merged 5 commits intomainfrom
add-fr-fr-locale

Conversation

@johnnygreco
Copy link
Copy Markdown
Contributor

@johnnygreco johnnygreco commented Mar 25, 2026

Summary

  • Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES, which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonaRepository, PersonSamplerParams validation, and the download service
  • Add 7 France-specific PII fields to dataset_based_person_fields.py: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement
  • Update person sampling docs with fr_FR locale listing, NGC download example, and field reference
  • Update persona repository tests for 8 locales

Verification

After downloading the fr_FR dataset, a 36-test suite was run against the person sampler to validate end-to-end behavior. Tests covered:

  • PersonSamplerParams validationfr_FR accepted as locale, works with personas toggle, sex/city/age_range filters, and correct people_gen_key routing
  • PII-only generation (PeopleGenFromDataset) — correct record count, fr_FR locale on all records, common PII fields present, all 7 France-specific fields present (commune, departement, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant, household_type), persona fields absent, sex/city filtering works, no unexpected fields in output
  • Generation with synthetic personas — persona fields (career_goals_and_ambitions, detailed_persona, Big Five traits) coexist with PII and France-specific fields, no unknown fields leak into output
  • Derived fields behavior — no SSN or phone number generated (US-only), email generated for adults and None for children, region preserved (not renamed to state), France-specific fields survive the generate_and_insert_derived_fields pipeline
  • End-to-end PersonSampler via SamplerRegistry — full pipeline with and without personas, sex and city filtering

All 36 tests passed.

Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES
and add 7 France-specific PII fields: first_name_heritage, name_heritage,
is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
@johnnygreco johnnygreco requested a review from a team as a code owner March 25, 2026 20:52
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR registers the fr_FR (France) locale in the Nemotron Personas dataset ecosystem. Adding a new locale is a well-understood pattern in this codebase, and the implementation follows the same shape used for existing locales (en_IN, ja_JP, pt_BR, etc.).

Key changes:

  • NEMOTRON_PERSONAS_DATASET_SIZES gains a fr_FR entry (2.71 GB), which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, validation, the persona repository, and the CLI download command via a new LOCALES_WITH_MANAGED_DATASETS_STR constant — a nice DRY improvement that also fixes a pre-existing bug where the CLI help text omitted en_SG and pt_BR.
  • 7 France-specific PII fields are added to PII_FIELDS in dataset_based_person_fields.py, consistent with the locale-specific sections already present for Japan, Brazil, and India.
  • Docs, NGC download examples, and the parameter table are updated.
  • All test counts are bumped to 8; fr_FR assertions are added in test_persona_repository.py, test_download_service.py, and test_determine_locales_with_all_flag — but test_run_personas_with_all_flag in test_download_controller.py is missing an explicit fr_FR membership check (minor).

Confidence Score: 5/5

Safe to merge — no logic errors or correctness issues; only a minor test-completeness nit remains.

All remaining feedback is P2 (style/completeness). The single missing fr_FR assertion in test_run_personas_with_all_flag is covered indirectly by the count assertion, so there is no correctness risk. The core implementation is clean, follows established patterns, and is backed by a 36-test verification suite.

packages/data-designer/tests/cli/controllers/test_download_controller.py — missing explicit fr_FR assertion in test_run_personas_with_all_flag.

Important Files Changed

Filename Overview
packages/data-designer-config/src/data_designer/config/utils/constants.py Adds fr_FR entry to NEMOTRON_PERSONAS_DATASET_SIZES and introduces LOCALES_WITH_MANAGED_DATASETS_STR to DRY up the locale list used in help text and error messages.
packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py Adds 7 France-specific PII fields to PII_FIELDS; correctly follows the locale-specific section pattern used for Brazil, Japan, and India.
packages/data-designer/src/data_designer/cli/commands/download.py Replaces hardcoded (and previously incomplete) locale list in CLI help text with LOCALES_WITH_MANAGED_DATASETS_STR; net improvement that also fixes omission of en_SG and pt_BR.
packages/data-designer/tests/cli/controllers/test_download_controller.py Count bumped to 8 and fr_FR added to test_determine_locales_with_all_flag, but test_run_personas_with_all_flag is missing an explicit fr_FR assertion in its downloaded-locales list.
packages/data-designer/tests/cli/repositories/test_persona_repository.py Count updated to 8 and fr_FR added to the expected locale set; changes are complete and consistent.
packages/data-designer/tests/cli/services/test_download_service.py Count updated to 8 and fr_FR membership assertion added; looks correct.
docs/concepts/person_sampling.md Adds fr_FR to the supported locales list, NGC download example, France-specific field reference table, and the parameter description table; all consistent with code changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["NEMOTRON_PERSONAS_DATASET_SIZES\n+ fr_FR: 2.71 GB"] --> B["LOCALES_WITH_MANAGED_DATASETS\nlist of keys"]
    A --> C["LOCALES_WITH_MANAGED_DATASETS_STR\njoined string"]
    B --> D["PersonSamplerParams\nlocale validator"]
    B --> E["PersonaRepository\nregistry"]
    C --> D
    C --> F["CLI download help text\n--locale flag"]
    E --> G["DownloadService\nget_available_locales()"]
    E --> H["DownloadController\n_determine_locales()"]
    I["PII_FIELDS\n+ 7 fr_FR fields"] --> J["PeopleGenFromDataset\nfield allow-list"]
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer/tests/cli/controllers/test_download_controller.py
Line: 88-99

Comment:
**Missing `fr_FR` assertion in all-locales download test**

The test was updated to expect 8 locales and explicitly verifies 7 of them, but the new `fr_FR` locale is never asserted to be in `downloaded_locales`. The parallel test `test_determine_locales_with_all_flag` does include `assert "fr_FR" in result` (line 224), so this is an inconsistency. While the count check indirectly covers it, the explicit assertion would be consistent with the style used elsewhere.

```suggestion
    # Verify all 8 locales were downloaded
    assert mock_download.call_count == 8

    # Verify each locale was downloaded
    downloaded_locales = [call[0][0] for call in mock_download.call_args_list]
    assert "en_US" in downloaded_locales
    assert "en_IN" in downloaded_locales
    assert "en_SG" in downloaded_locales
    assert "fr_FR" in downloaded_locales
    assert "hi_Deva_IN" in downloaded_locales
    assert "hi_Latn_IN" in downloaded_locales
    assert "ja_JP" in downloaded_locales
    assert "pt_BR" in downloaded_locales
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "Merge branch 'main' into add-fr-fr-local..." | Re-trigger Greptile

Update hardcoded locale counts from 7 to 8 and add fr_FR assertions
in download controller and download service tests.
The --locale help text was hardcoded and already stale (missing en_SG,
pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays
in sync automatically.
Centralise the comma-joined locale list so it is defined once in
constants and reused in the CLI help text, PersonSamplerParams field
description, and locale validation error message.
Copy link
Copy Markdown
Contributor

@nabinchha nabinchha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥖 🇫🇷 🚀

Copy link
Copy Markdown
Contributor

@nabinchha nabinchha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the updated review below!

@nabinchha
Copy link
Copy Markdown
Contributor

Nice work on this one, @johnnygreco — clean addition with a great opportunistic refactor. Here are my thoughts.

Summary

This PR registers the France locale (fr_FR, 2.71 GB) in the Nemotron Personas system, adds 7 France-specific PII fields, and centralizes the comma-joined locale string into LOCALES_WITH_MANAGED_DATASETS_STR — eliminating the previously hardcoded (and already stale) locale list in the CLI help text. The implementation matches the stated intent cleanly.

Findings

Warnings — Worth addressing

packages/data-designer/tests/cli/controllers/test_download_controller.py:91-99 — Missing fr_FR assertion in test_run_personas_with_all_flag

  • What: The test bumps the expected count from 7 to 8 and asserts every other locale is in downloaded_locales, but never asserts "fr_FR" in downloaded_locales.
  • Why: The count check (== 8) would catch a missing locale, but the per-locale assertions are there to make failures readable — omitting fr_FR breaks the pattern and means a test that downloads the wrong 8th locale would still pass. The sibling test test_determine_locales_with_all_flag (line 226) does assert fr_FR, so this looks like an oversight.
  • Suggestion: Add assert "fr_FR" in downloaded_locales between the en_SG and hi_Deva_IN assertions.

Suggestions — Take it or leave it

packages/data-designer/tests/cli/repositories/test_persona_repository.py:53-59test_get_by_code_all_locales doesn't cover fr_FR, en_SG, or pt_BR

  • What: The test cases only cover 5 of the 8 locales. This is pre-existing (en_SG and pt_BR were already missing), but since the test name says "all locales" and this PR adds another one, it's a natural time to fill the gap.
  • Why: The test name implies exhaustive coverage — adding the missing entries would make it match reality.
  • Suggestion: Could add fr_FR, en_SG, and pt_BR to the test cases. Totally fine to skip or do in a follow-up since it's pre-existing.

What Looks Good

  • The LOCALES_WITH_MANAGED_DATASETS_STR refactor is a great improvement. The CLI help text was already stale (missing en_SG, pt_BR, fr_FR), and centralizing the string eliminates this entire class of staleness bugs. Nice proactive fix.

  • Clean, minimal changeset. Adding a locale touches exactly the files it should — the constant, the PII fields list, the docs, and the tests. No unnecessary changes, no scope creep.

  • Thorough test updates. All three test files are updated with the new count and fr_FR assertions. The coverage for the new locale is solid.

Verdict

Needs changes — One warning: the missing fr_FR assertion in test_run_personas_with_all_flag. Quick one-liner fix.

@johnnygreco johnnygreco merged commit 0d10bf8 into main Mar 31, 2026
47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants