
Make the FRS dataset build deterministic (seed all RNG draws) #425

Merged
MaxGhenis merged 2 commits into main from fix/deterministic-property-purchased
Jun 4, 2026

MaxGhenis commented Jun 4, 2026

Problem

Several assignments in the FRS dataset build drew from the unseeded global numpy RNG, so otherwise-identical builds produced different datasets. The most damaging instance was property_purchased (which households are charged stamp duty): its draw intermittently spiked the first income decile's effective tax rate to 251%, failing test_first_decile_tax_rate_reasonable and blocking releases for ~2 weeks.

Fix — seed every randomness source on the build path

  • frs.py property_purchased — draw the ~3.85% purchaser sample from a seeded default_rng(0) instead of global np.random.
  • imputations/capital_gains.py — quantile draws for the capital-gains amount imputation now come from a seeded default_rng(0) (these feed person-level capital_gains → CGT revenue).
  • frs.py BRMA assignment — both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator.
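
As a rough illustration of the first bullet, a seeded purchaser draw might look like the sketch below. The function name `assign_property_purchased`, the constant name `PURCHASE_RATE`, and the uniform-threshold comparison are assumptions for illustration; only `default_rng(0)` and the ~3.85% share come from this PR.

```python
import numpy as np

# Approximate purchaser share quoted in this PR (the constant name is illustrative).
PURCHASE_RATE = 0.0385


def assign_property_purchased(n_households: int, seed: int = 0) -> np.ndarray:
    """Return one boolean purchaser flag per household from a local, seeded Generator."""
    rng = np.random.default_rng(seed)  # independent of the global np.random state
    return rng.random(n_households) < PURCHASE_RATE


first = assign_property_purchased(10_000)
np.random.random(1_000)  # earlier build steps consuming the global RNG...
second = assign_property_purchased(10_000)  # ...do not affect the seeded draw
print(np.array_equal(first, second))  # True
```

Because the Generator is created inside the function, the result depends only on the seed and the number of households, not on whatever ran earlier in the process.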

The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random use is in childcare/takeup_rate.py, which is not reached by the dataset build (test-only) and is left for separate cleanup. Every randomness source on the live FRS build path is now seeded, so the same inputs always produce the same dataset.
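
One way to check the "same inputs, same dataset" property is to fingerprint the output of two builds and compare. The sketch below is a toy stand-in, not the real build: `toy_build`, `fingerprint`, the seed values, and the reference gains values are all hypothetical, and the quantile step only loosely mimics an inverse-CDF style imputation.

```python
import hashlib

import numpy as np


def toy_build(n: int = 1_000) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical stand-in for a dataset build with two seeded random steps."""
    # Step 1: boolean flag drawn from a local seeded Generator (seed is illustrative).
    purchased = np.random.default_rng(0).random(n) < 0.0385
    # Step 2: uniform quantile draws mapped through a made-up reference distribution.
    quantiles = np.random.default_rng(1).random(n)
    reference_gains = np.array([0.0, 500.0, 2_000.0, 10_000.0, 50_000.0])
    gains = np.quantile(reference_gains, quantiles)
    return purchased, gains


def fingerprint(arrays: tuple[np.ndarray, ...]) -> str:
    """SHA-256 over the raw bytes of each array, in order."""
    digest = hashlib.sha256()
    for arr in arrays:
        digest.update(np.ascontiguousarray(arr).tobytes())
    return digest.hexdigest()


run_a = fingerprint(toy_build())
np.random.seed(123)
np.random.random(10)  # perturb the global RNG between builds
run_b = fingerprint(toy_build())
print(run_a == run_b)  # True
```

A byte-level hash is strict: it catches any change in values, dtype, or ordering, which is the behaviour wanted when the claim is exact reproducibility rather than statistical similarity.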

Pairs with

Test plan

  • Seeded draws reproducible across runs; pandas accepts a Generator as random_state (verified, pandas 2.3.3 / numpy 2.1.3).
  • Independent review: no unseeded randomness remains on the build path.
  • Full data rebuild green once #1752 releases.
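
The first test-plan item can be reproduced in miniature. The sketch below assumes a pandas version that accepts a numpy `Generator` as `random_state` (the PR reports verifying this on pandas 2.3.3 / numpy 2.1.3); the `rents` table and the `sample_twice` helper are made up for illustration and do not reflect the real BRMA data or code.

```python
import numpy as np
import pandas as pd

# Made-up rent table standing in for the BRMA lookup.
rents = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B", "C", "C"],
        "brma": ["a1", "a2", "b1", "b2", "c1", "c2"],
        "rent": [550.0, 610.0, 720.0, 695.0, 480.0, 505.0],
    }
)


def sample_twice(seed: int = 0):
    """Run two chained .sample() calls that share one seeded Generator."""
    rng = np.random.default_rng(seed)
    candidates = rents.sample(n=3, random_state=rng)  # first draw advances rng
    pick = candidates.sample(n=1, random_state=rng)   # second draw continues the stream
    return candidates, pick


candidates_1, pick_1 = sample_twice()
candidates_2, pick_2 = sample_twice()
print(candidates_1.equals(candidates_2) and pick_1.equals(pick_2))  # True
```

Sharing one Generator across both calls means the second draw depends on the first having run, so the pair is reproducible as a sequence; reordering or removing a call would change later draws.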

MaxGhenis added 2 commits June 4, 2026 11:00
The data build set property_purchased via unseeded np.random.random(),
so every build drew a different vector of purchasers. That made the
dataset non-reproducible and intermittently spiked the first income
decile's effective tax rate (the draw occasionally marked too many
high-property, low-income households as purchasers), failing
test_first_decile_tax_rate_reasonable and blocking releases.

Draw from a seeded numpy Generator (default_rng(0)) instead of the
global RNG, whose state depends on whatever ran earlier in the build.
Same FRS input now always yields the same ~3.85% purchaser assignment.

Pairs with the policyengine-uk fix flipping property_purchased's
default to False, which fail-safes any household this build does not
explicitly set.

Independent review found the property_purchased seed was necessary but
not sufficient for a reproducible build: two more assignments drew from
the unseeded global numpy RNG.

- imputations/capital_gains.py: quantile draws for the capital gains
  amount imputation now come from a seeded default_rng(0), so capital
  gains (and CGT revenue) are reproducible.
- frs.py BRMA assignment: both pandas .sample() calls (region/category
  rent sampling and the household-level pick) now take a seeded
  random_state generator instead of the global RNG.

The SPI synthetic sampling (income.py) was already seeded. The only
remaining unseeded np.random is childcare/takeup_rate.py, which is not
reached by the dataset build (test-only); left for separate cleanup.
Broadened the changelog to reflect that the whole FRS build is now
deterministic.
MaxGhenis changed the title from "Make property_purchased assignment deterministic (seeded RNG)" to "Make the FRS dataset build deterministic (seed all RNG draws)" Jun 4, 2026
MaxGhenis merged commit 843293c into main Jun 4, 2026
4 checks passed
MaxGhenis deleted the fix/deterministic-property-purchased branch June 4, 2026 14:02