feat: add fr_FR locale to nemotron personas datasets #468
Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES and add 7 France-specific PII fields: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
Greptile Summary
This PR registers the France locale (fr_FR). Key changes:
| Filename | Overview |
|---|---|
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Adds fr_FR entry to NEMOTRON_PERSONAS_DATASET_SIZES and introduces LOCALES_WITH_MANAGED_DATASETS_STR to DRY up the locale list used in help text and error messages. |
| packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py | Adds 7 France-specific PII fields to PII_FIELDS; correctly follows the locale-specific section pattern used for Brazil, Japan, and India. |
| packages/data-designer/src/data_designer/cli/commands/download.py | Replaces hardcoded (and previously incomplete) locale list in CLI help text with LOCALES_WITH_MANAGED_DATASETS_STR; net improvement that also fixes omission of en_SG and pt_BR. |
| packages/data-designer/tests/cli/controllers/test_download_controller.py | Count bumped to 8 and fr_FR added to test_determine_locales_with_all_flag, but test_run_personas_with_all_flag is missing an explicit fr_FR assertion in its downloaded-locales list. |
| packages/data-designer/tests/cli/repositories/test_persona_repository.py | Count updated to 8 and fr_FR added to the expected locale set; changes are complete and consistent. |
| packages/data-designer/tests/cli/services/test_download_service.py | Count updated to 8 and fr_FR membership assertion added; looks correct. |
| docs/concepts/person_sampling.md | Adds fr_FR to the supported locales list, NGC download example, France-specific field reference table, and the parameter description table; all consistent with code changes. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["NEMOTRON_PERSONAS_DATASET_SIZES\n+ fr_FR: 2.71 GB"] --> B["LOCALES_WITH_MANAGED_DATASETS\nlist of keys"]
A --> C["LOCALES_WITH_MANAGED_DATASETS_STR\njoined string"]
B --> D["PersonSamplerParams\nlocale validator"]
B --> E["PersonaRepository\nregistry"]
C --> D
C --> F["CLI download help text\n--locale flag"]
E --> G["DownloadService\nget_available_locales()"]
E --> H["DownloadController\n_determine_locales()"]
I["PII_FIELDS\n+ 7 fr_FR fields"] --> J["PeopleGenFromDataset\nfield allow-list"]
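The `PII_FIELDS` to allow-list edge in the flowchart can be illustrated with a minimal sketch. Only the seven France-specific field names come from this PR; the common field names, the `filter_record` helper, and the sample record values are hypothetical, and the real `PeopleGenFromDataset` may apply the allow-list differently.

```python
# The 7 France-specific field names are taken from the PR description.
FR_FR_PII_FIELDS = [
    "first_name_heritage",
    "name_heritage",
    "is_first_gen_immigrant",
    "household_type",
    "monthly_income_eur",
    "commune",
    "departement",
]

# Assumption: placeholder common fields, not the project's actual list.
COMMON_PII_FIELDS = ["first_name", "last_name", "sex", "age"]

PII_FIELDS = COMMON_PII_FIELDS + FR_FR_PII_FIELDS


def filter_record(record: dict, allowed: list[str]) -> dict:
    """Hypothetical helper: keep only keys that appear in the allow-list."""
    allowed_set = set(allowed)
    return {key: value for key, value in record.items() if key in allowed_set}


# Made-up sample record; "internal_id" stands in for a field that must not leak.
raw_record = {
    "first_name": "Camille",
    "last_name": "Martin",
    "sex": "Female",
    "age": 34,
    "first_name_heritage": "French",
    "name_heritage": "French",
    "is_first_gen_immigrant": False,
    "household_type": "couple_with_children",
    "monthly_income_eur": 2400,
    "commune": "Lyon",
    "departement": "Rhône",
    "internal_id": "abc123",
}

filtered = filter_record(raw_record, PII_FIELDS)
```

Under these assumptions, a field that is missing from the allow-list is silently dropped, which is why the new France-specific names have to be registered before they can appear in sampled output.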
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/data-designer/tests/cli/controllers/test_download_controller.py
Line: 88-99
Comment:
**Missing `fr_FR` assertion in all-locales download test**
The test was updated to expect 8 locales and explicitly verifies 7 of them, but the new `fr_FR` locale is never asserted to be in `downloaded_locales`. The parallel test `test_determine_locales_with_all_flag` does include `assert "fr_FR" in result` (line 224), so this is an inconsistency. While the count check indirectly covers it, the explicit assertion would be consistent with the style used elsewhere.
```suggestion
# Verify all 8 locales were downloaded
assert mock_download.call_count == 8
# Verify each locale was downloaded
downloaded_locales = [call[0][0] for call in mock_download.call_args_list]
assert "en_US" in downloaded_locales
assert "en_IN" in downloaded_locales
assert "en_SG" in downloaded_locales
assert "fr_FR" in downloaded_locales
assert "hi_Deva_IN" in downloaded_locales
assert "hi_Latn_IN" in downloaded_locales
assert "ja_JP" in downloaded_locales
assert "pt_BR" in downloaded_locales
```
How can I resolve this? If you propose a fix, please make it concise.
Reviews (5): Last reviewed commit: "Merge branch 'main' into add-fr-fr-local..."
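As an alternative to one assertion per locale, the check in the review comment could be written as a single set comparison. The sketch below is not the PR's test: `download_all` is a hypothetical stand-in for the controller's all-locales path, and only the eight locale names are taken from the suggestion above.

```python
from unittest.mock import MagicMock

# The eight locale names listed in the review suggestion.
EXPECTED_LOCALES = {
    "en_US",
    "en_IN",
    "en_SG",
    "fr_FR",
    "hi_Deva_IN",
    "hi_Latn_IN",
    "ja_JP",
    "pt_BR",
}


def download_all(download_fn, locales):
    """Hypothetical stand-in: call the download function once per locale."""
    for locale in sorted(locales):
        download_fn(locale)


mock_download = MagicMock()
download_all(mock_download, EXPECTED_LOCALES)

# Each entry in call_args_list exposes its positional arguments via .args.
downloaded_locales = [call.args[0] for call in mock_download.call_args_list]

assert mock_download.call_count == len(EXPECTED_LOCALES)
assert set(downloaded_locales) == EXPECTED_LOCALES
```

A set comparison fails by name when a newly registered locale is not downloaded, so the count and the membership checks cannot drift apart the way they did here.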
Update hardcoded locale counts from 7 to 8 and add fr_FR assertions in download controller and download service tests.
The --locale help text was hardcoded and already stale (missing en_SG, pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays in sync automatically.
Centralise the comma-joined locale list so it is defined once in constants and reused in the CLI help text, PersonSamplerParams field description, and locale validation error message.
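The centralisation described in these commits can be sketched roughly as follows. The three constant names and the fr_FR size come from this PR; the sizes of the other locales are placeholders, and the `", "` separator, the `validate_locale` helper, and the help-text wording are assumptions made for illustration.

```python
# Only the fr_FR size is from the PR; every other size is a placeholder.
NEMOTRON_PERSONAS_DATASET_SIZES = {
    "en_US": "<size>",
    "en_IN": "<size>",
    "en_SG": "<size>",
    "fr_FR": "2.71 GB",
    "hi_Deva_IN": "<size>",
    "hi_Latn_IN": "<size>",
    "ja_JP": "<size>",
    "pt_BR": "<size>",
}

# Both derived constants come from the dict keys, so adding a locale
# to the dict updates the list and the joined string together.
LOCALES_WITH_MANAGED_DATASETS = list(NEMOTRON_PERSONAS_DATASET_SIZES.keys())
LOCALES_WITH_MANAGED_DATASETS_STR = ", ".join(LOCALES_WITH_MANAGED_DATASETS)

# Hypothetical help text built from the shared string.
LOCALE_HELP = f"Locale to download. Supported: {LOCALES_WITH_MANAGED_DATASETS_STR}"


def validate_locale(locale: str) -> str:
    """Hypothetical validator that reuses the shared string in its error."""
    if locale not in LOCALES_WITH_MANAGED_DATASETS:
        raise ValueError(
            f"Unsupported locale {locale!r}. "
            f"Expected one of: {LOCALES_WITH_MANAGED_DATASETS_STR}"
        )
    return locale
```

With this shape, the help text and the validation error cannot go stale independently, which is the failure mode the hardcoded help string had.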
Nice work on this one, @johnnygreco: clean addition with a great opportunistic refactor. Here are my thoughts.
Summary
This PR registers the France locale (fr_FR).
Findings
Warnings (worth addressing)
Suggestions (take it or leave it)
What Looks Good
Verdict
Needs changes. One warning: the missing
Summary
- Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES, which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonaRepository, PersonSamplerParams validation, and the download service
- Add 7 France-specific PII fields in dataset_based_person_fields.py: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement
- Document the fr_FR locale listing, NGC download example, and field reference

Verification
After downloading the fr_FR dataset, a 36-test suite was run against the person sampler to validate end-to-end behavior. Tests covered:
- PersonSamplerParams validation: fr_FR accepted as locale, works with personas toggle, sex/city/age_range filters, and correct people_gen_key routing
- PeopleGenFromDataset: correct record count, fr_FR locale on all records, common PII fields present, all 7 France-specific fields present (commune, departement, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant, household_type), persona fields absent, sex/city filtering works, no unexpected fields in output
- Persona fields (career_goals_and_ambitions, detailed_persona, Big Five traits) coexist with PII and France-specific fields, no unknown fields leak into output
- None for children, region preserved (not renamed to state), France-specific fields survive the generate_and_insert_derived_fields pipeline
- PersonSampler via SamplerRegistry: full pipeline with and without personas, sex and city filtering

All 36 tests passed.