Make the FRS dataset build deterministic (seed all RNG draws) #425
Merged
The data build set property_purchased via unseeded np.random.random(), so every build drew a different vector of purchasers. That made the dataset non-reproducible and intermittently spiked the first income decile's effective tax rate (the draw occasionally marked too many high-property, low-income households as purchasers), failing test_first_decile_tax_rate_reasonable and blocking releases. Draw from a seeded numpy Generator (default_rng(0)) instead of the global RNG, whose state depends on whatever ran earlier in the build. Same FRS input now always yields the same ~3.85% purchaser assignment. Pairs with the policyengine-uk fix flipping property_purchased's default to False, which fail-safes any household this build does not explicitly set.
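A minimal sketch of the seeded draw described above, not the build's actual code. The function name, the household count, and the exact rate constant are assumptions made for illustration; only default_rng(0) and the roughly 3.85% share come from the commit message.

```python
import numpy as np

# Approximate share quoted in the commit message; the exact constant is an assumption.
PURCHASE_RATE = 0.0385


def assign_property_purchased(n_households: int, seed: int = 0) -> np.ndarray:
    """Return one boolean purchaser flag per household (hypothetical helper)."""
    # A local Generator has its own state, so earlier global np.random calls cannot shift it.
    rng = np.random.default_rng(seed)
    return rng.random(n_households) < PURCHASE_RATE


first = assign_property_purchased(20_000)
np.random.random(1_000)  # unrelated global-RNG use, standing in for earlier build steps
second = assign_property_purchased(20_000)

# The two assignments match even though the global RNG state changed in between.
print(bool((first == second).all()), round(float(first.mean()), 4))
```

The design point is that the result depends only on the seed and the number of households, not on what ran earlier in the process.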
Independent review found the property_purchased seed was necessary but not sufficient for a reproducible build: two more assignments drew from the unseeded global numpy RNG.
- imputations/capital_gains.py: quantile draws for the capital gains amount imputation now come from a seeded default_rng(0), so capital gains (and CGT revenue) are reproducible.
- frs.py BRMA assignment: both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator instead of the global RNG.
The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random is childcare/takeup_rate.py, which is not reached by the dataset build (test-only); left for separate cleanup. Broadened the changelog to reflect that the whole FRS build is now deterministic.
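As a rough illustration of seeded quantile draws for an amount imputation, the sketch below uses inverse-CDF sampling against a small reference array. The helper name, the reference amounts, and the use of np.quantile are assumptions; the real imputation in imputations/capital_gains.py may map quantiles to amounts differently.

```python
import numpy as np


def impute_amounts(reference_amounts: np.ndarray, n_people: int, seed: int = 0) -> np.ndarray:
    """Draw uniform quantiles from a seeded Generator and look them up in a reference distribution."""
    rng = np.random.default_rng(seed)
    quantiles = rng.random(n_people)  # uniform on [0, 1)
    # Stand-in for the model's quantile function (an assumption, not the build's method).
    return np.quantile(reference_amounts, quantiles)


# Made-up reference amounts, for illustration only.
reference = np.array([0.0, 500.0, 2_000.0, 10_000.0, 50_000.0])

run_a = impute_amounts(reference, 1_000)
run_b = impute_amounts(reference, 1_000)

# Identical draws imply an identical imputed total.
print(np.array_equal(run_a, run_b), float(run_a.sum()) == float(run_b.sum()))
```

Because the quantiles are the only random input, fixing them fixes every downstream aggregate computed from the imputed amounts.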
Problem
Several assignments in the FRS dataset build drew from the unseeded global numpy RNG, so otherwise-identical builds produced different datasets: non-reproducible by construction. The most damaging instance:
- property_purchased (which households are charged stamp duty) intermittently spiked the first income decile's effective tax rate to 251%, failing test_first_decile_tax_rate_reasonable and blocking releases for ~2 weeks.
Fix: seed every randomness source on the build path
- frs.py property_purchased: draw the ~3.85% purchaser sample from a seeded default_rng(0) instead of global np.random.
- imputations/capital_gains.py: quantile draws for the capital-gains amount imputation now come from a seeded default_rng(0) (these feed person-level capital_gains → CGT revenue).
- frs.py BRMA assignment: both pandas .sample() calls (region/category rent sampling and the household-level pick) now take a seeded random_state generator.
The SPI synthetic sampling (income.py) was already seeded. The only remaining unseeded np.random is childcare/takeup_rate.py, which is not reached by the dataset build (test-only); left for separate cleanup. Every randomness source on the live FRS build path is now seeded, so the same inputs always produce the same dataset.
Pairs with
The policyengine-uk fix flipping property_purchased's default True → False (a fail-safe for any household the build doesn't set; this is what fixes the 251% test once released).
Test plan
Generator as random_state (verified, pandas 2.3.3 / numpy 2.1.3).
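One way to check the Generator-as-random_state behaviour locally is sketched below. The table, its column names, and the sample_brma helper are hypothetical stand-ins for the BRMA assignment in frs.py; the only thing taken from the PR is that both .sample() calls receive a seeded numpy Generator.

```python
import numpy as np
import pandas as pd

# Hypothetical rent table; column names and values are illustrative only.
rents = pd.DataFrame(
    {
        "brma": ["A", "B", "C", "D", "E", "F"],
        "region": ["NORTH", "NORTH", "SOUTH", "SOUTH", "WALES", "WALES"],
        "rent": [450.0, 520.0, 900.0, 1100.0, 480.0, 510.0],
    }
)


def sample_brma(frame: pd.DataFrame, seed: int = 0) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Two chained .sample() calls sharing one seeded Generator (hypothetical helper)."""
    # A fresh generator per call means every run starts from the same state.
    rng = np.random.default_rng(seed)
    shortlist = frame.sample(n=3, random_state=rng)  # stand-in for region/category rent sampling
    pick = shortlist.sample(n=1, random_state=rng)   # stand-in for the household-level pick
    return shortlist, pick


shortlist_a, pick_a = sample_brma(rents)
shortlist_b, pick_b = sample_brma(rents)

# Both runs select the same rows in the same order.
print(shortlist_a.equals(shortlist_b), pick_a.equals(pick_b))
```

Note that a Generator advances its state on each draw, so passing the same generator object to a second run would give different rows; reproducibility here comes from constructing it from the seed at the start of each run.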