Make the FRS dataset build deterministic (seed all RNG draws) by MaxGhenis · Pull Request #425 · PolicyEngine/policyengine-uk-data

MaxGhenis · 2026-06-04T10:01:27Z

Problem

Several assignments in the FRS dataset build drew from the unseeded global numpy RNG, so otherwise-identical builds produced different datasets — non-reproducible by construction. The most damaging instance: property_purchased (which households are charged stamp duty) intermittently spiked the first income decile's effective tax rate to 251%, failing test_first_decile_tax_rate_reasonable and blocking releases for ~2 weeks.

Fix — seed every randomness source on the build path

frs.py property_purchased — draw the ~3.85% purchaser sample from a seeded default_rng(0) instead of global np.random.
imputations/capital_gains.py — quantile draws for the capital-gains amount imputation now come from a seeded default_rng(0) (these feed person-level capital_gains → CGT revenue).
frs.py BRMA assignment — both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator.

The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random is childcare/takeup_rate.py, which is not reached by the dataset build (test-only); left for separate cleanup. Every randomness source on the live FRS build path is now seeded, so the same inputs always produce the same dataset.
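A minimal sketch of the seeding pattern listed above, not the actual code in frs.py. The helper names, the `household_id` and `brma` columns, and the shuffle-then-take-first approach are assumptions for illustration; only `default_rng(0)`, the ~3.85% rate, and passing a Generator as `random_state` come from this PR.

```python
import numpy as np
import pandas as pd

PURCHASE_RATE = 0.0385  # approximate purchaser share quoted in this PR


def assign_property_purchased(n_households: int, seed: int = 0) -> np.ndarray:
    """Hypothetical helper: flag roughly PURCHASE_RATE of households."""
    # A local, seeded Generator: independent of the global np.random state.
    rng = np.random.default_rng(seed)
    return rng.random(n_households) < PURCHASE_RATE


def pick_one_per_household(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: choose one candidate row per household."""
    rng = np.random.default_rng(seed)
    # pandas accepts a numpy Generator as random_state.
    shuffled = df.sample(frac=1, random_state=rng)
    # After the seeded shuffle, the first row of each group is the pick.
    return shuffled.groupby("household_id").head(1).sort_index()


if __name__ == "__main__":
    candidates = pd.DataFrame(
        {"household_id": [1, 1, 2, 2, 2, 3], "brma": list("abcdef")}
    )
    print(assign_property_purchased(10_000).mean())
    print(pick_one_per_household(candidates))
```

Creating the Generator inside each function keeps every assignment independent of call order, at the cost of reusing the same stream wherever the same seed is passed.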

Pairs with

policyengine-uk#1752 ("Default property_purchased to False to stop phantom stamp duty"): flips property_purchased's default True → False (fail-safe for any household the build doesn't set; this is what fixes the 251% test once released).
policyengine-core#500 ("Forbid randomness inside variable formulas"): forbids randomness inside formulas at the engine level, so this class of bug can't recur in model code.

Test plan

Seeded draws reproducible across runs; pandas accepts a Generator as random_state (verified, pandas 2.3.3 / numpy 2.1.3).
Independent review: no unseeded randomness remains on the build path.
Full data rebuild green once policyengine-uk#1752 releases.
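One way to check the first item is to fingerprint two builds and compare them. The sketch below uses a toy stand-in for the build; `toy_build`, `fingerprint`, and the column names are hypothetical and do not exist in policyengine-uk-data.

```python
import hashlib

import numpy as np
import pandas as pd


def toy_build(seed: int = 0, n: int = 1_000) -> pd.DataFrame:
    """Hypothetical stand-in for the dataset build, with two random columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {
            "household_id": np.arange(n),
            "property_purchased": rng.random(n) < 0.0385,
            "capital_gains_quantile": rng.random(n),
        }
    )


def fingerprint(df: pd.DataFrame) -> str:
    """Hash every row (index included) into a single hex digest."""
    row_hashes = pd.util.hash_pandas_object(df, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()


if __name__ == "__main__":
    # Same seed and inputs: the fingerprints should match.
    print(fingerprint(toy_build()) == fingerprint(toy_build()))
    # A different seed should give a different fingerprint.
    print(fingerprint(toy_build(0)) == fingerprint(toy_build(1)))
```

A digest comparison only says whether two builds differ, not where; a column-by-column comparison would be needed to locate a remaining unseeded draw.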

The data build set property_purchased via unseeded np.random.random(), so every build drew a different vector of purchasers. That made the dataset non-reproducible and intermittently spiked the first income decile's effective tax rate (the draw occasionally marked too many high-property, low-income households as purchasers), failing test_first_decile_tax_rate_reasonable and blocking releases. Draw from a seeded numpy Generator (default_rng(0)) instead of the global RNG, whose state depends on whatever ran earlier in the build. Same FRS input now always yields the same ~3.85% purchaser assignment. Pairs with the policyengine-uk fix flipping property_purchased's default to False, which fail-safes any household this build does not explicitly set.
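The dependence on "whatever ran earlier in the build" can be illustrated with a toy example. The function names here are hypothetical, and `np.random.seed(123)` is used only to make the demonstration itself repeatable; the real build did not seed the global state at all.

```python
import numpy as np

RATE = 0.0385  # approximate purchaser share quoted in this PR
N = 10_000


def flags_from_global_rng(n: int) -> np.ndarray:
    """Draws from the shared global RNG, so earlier draws shift the result."""
    return np.random.random(n) < RATE


def flags_from_seeded_rng(n: int, seed: int = 0) -> np.ndarray:
    """Draws from a fresh seeded Generator, unaffected by earlier draws."""
    return np.random.default_rng(seed).random(n) < RATE


# Run 1: the assignment is the first consumer of the global stream.
np.random.seed(123)
first = flags_from_global_rng(N)

# Run 2: same starting state, but an unrelated step draws 5 numbers first.
np.random.seed(123)
np.random.random(5)
second = flags_from_global_rng(N)

# Seeded version: extra global draws in between make no difference.
seeded_a = flags_from_seeded_rng(N)
np.random.random(5)
seeded_b = flags_from_seeded_rng(N)

print("global RNG assignments identical:", np.array_equal(first, second))
print("seeded assignments identical:", np.array_equal(seeded_a, seeded_b))
```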

Independent review found the property_purchased seed was necessary but not sufficient for a reproducible build: two more assignments drew from the unseeded global numpy RNG.

- imputations/capital_gains.py: quantile draws for the capital gains amount imputation now come from a seeded default_rng(0), so capital gains (and CGT revenue) are reproducible.
- frs.py BRMA assignment: both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator instead of the global RNG.

The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random is childcare/takeup_rate.py, which is not reached by the dataset build (test-only); left for separate cleanup. Broadened the changelog to reflect that the whole FRS build is now deterministic.
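For the capital-gains item, a seeded quantile draw might look like the sketch below. This is an assumption about the general shape of such an imputation (uniform quantiles mapped through an empirical distribution with `np.quantile`), not the code in imputations/capital_gains.py; `impute_capital_gains` and `reference_amounts` are hypothetical names.

```python
import numpy as np


def impute_capital_gains(
    reference_amounts: np.ndarray, n_people: int, seed: int = 0
) -> np.ndarray:
    """Hypothetical helper: sample amounts via seeded uniform quantiles."""
    rng = np.random.default_rng(seed)
    # One uniform quantile in [0, 1) per person, from the seeded Generator.
    quantiles = rng.random(n_people)
    # Map each quantile to an amount in the reference distribution.
    return np.quantile(reference_amounts, quantiles)


if __name__ == "__main__":
    reference = np.array([0.0, 100.0, 1_000.0, 50_000.0])
    print(impute_capital_gains(reference, n_people=5))
```

Because the quantiles are the only random input, seeding them is enough to make the imputed amounts repeat exactly for a fixed reference distribution.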

MaxGhenis added 2 commits June 4, 2026 11:00

MaxGhenis changed the title from "Make property_purchased assignment deterministic (seeded RNG)" to "Make the FRS dataset build deterministic (seed all RNG draws)" Jun 4, 2026

MaxGhenis merged commit 843293c into main Jun 4, 2026
4 checks passed

MaxGhenis deleted the fix/deterministic-property-purchased branch June 4, 2026 14:02
