Skip to content

Add survey real-data validation against R using federal survey datasets#267

Merged
igerber merged 4 commits intomainfrom
survey-real-data-validation
Apr 4, 2026
Merged

Add survey real-data validation against R using federal survey datasets#267
igerber merged 4 commits intomainfrom
survey-real-data-validation

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 4, 2026

Summary

  • Validate diff-diff's survey variance against R's survey package using three real federal survey datasets
  • Suite A (API): 8 tests — TSL variance with strata, FPC, subpopulations, covariates, and Fay's BRR replicates from R's canonical apistrat dataset
  • Suite B (NHANES): 4 tests + 1 skipped — TSL with real CDC strata + PSU + nest=TRUE, using ACA young adult coverage provision (2007-08 vs 2015-16)
  • Suite C (RECS): 3 tests — JK1 replicate weight variance with 60 real EIA replicate columns
  • All metrics (ATT, SE, df, CI) match R to machine precision (< 1e-10 differences)
  • Add real-data section to survey tutorial (Section 10) demonstrating NHANES ACA DiD with actual CDC data
  • Document results in docs/benchmarks.rst with reproduction instructions
  • Exclude large data files from AI review diff to avoid Codex input limit

Methodology references

  • N/A — no estimator or math changes. This PR adds validation tests, benchmark scripts, and documentation only.

Validation

  • Tests added: tests/test_survey_real_data.py (15 tests + 1 skip)
  • R benchmark scripts: benchmarks/R/benchmark_realdata_{api,nhanes,recs}.R
  • Tutorial updated: docs/tutorials/16_survey_did.ipynb (Section 10: real NHANES data)
  • Results validated against published ACA literature (Antwi et al. 2013, Sommers 2012)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes
  • NHANES data is public-use (no geographic identifiers, no individual names)
  • RECS data is public-use microdata from EIA

Generated with Claude Code

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

Overall Assessment
✅ Looks good

No unmitigated P0/P1 findings. The PR does not change estimator implementations, weighting math, variance formulas, identification checks, or defaults; the main issues are P2/P3 gaps in the new validation harness and in how its guarantees are documented. Static review only: pytest was not available in this environment, so I could not execute the new suite here.

Executive Summary

  • No estimator-level methodology deviation was introduced. The changed code is test/docs/benchmark scaffolding, and the new subpopulation tests follow the Methodology Registry’s domain-estimation pattern.
  • P2: the skipped NHANES repeated-cross-section CallawaySantAnna path is wired to inconsistent time scales between the R benchmark generator and the Python test, so it is not reproducible as written.
  • P2: the docs/changelog claim full machine-precision ATT/SE/df/CI parity across 15 tests, but the pytest module only enforces a subset of those metrics and uses tolerances that are far looser than that claim.
  • P3: realdata is added as a marker, but default pytest still runs it because only slow is excluded.
  • P3: the AI review workflow now omits full diffs for benchmark JSON/CSV fixtures and tutorial notebooks, creating a blind spot for future review automation.

Methodology

  • P2. Impact: the new NHANES RC-DiD cross-validation path is not aligned to the estimator’s time/first_treat contract. The R script remaps to period_cs = period + 1 and first_treat_cs = 2 before calling did::att_gt, but the Python test calls CallawaySantAnna(panel=False) on the original period=0/1 data while keeping first_treat=2. In CallawaySantAnna, first_treat is interpreted on the same scale as time, and overall aggregation only uses post-treatment cells, so this wiring leaves cohort g=2 with no post-treatment cells on the Python side. It is currently masked only because the committed NHANES golden file does not contain b5_cs_rc. Locations: benchmarks/R/benchmark_realdata_nhanes.R:172, benchmarks/R/benchmark_realdata_nhanes.R:203, benchmarks/data/real/nhanes_aca_subset.csv:1, tests/test_survey_real_data.py:396, diff_diff/staggered.py:1539, diff_diff/staggered_aggregation.py:56. Concrete fix: export/use the remapped period_cs and first_treat_cs values in the JSON fixture and Python test, or remap to 1/2 in the Python test before calling fit(). If B5 is intentionally unsupported, remove it from the claimed validation suite rather than leaving a mismatched dormant path.
  • No other methodology findings. I did not find any estimator-code change that conflicts with docs/methodology/REGISTRY.md.

Code Quality

  • No findings.

Performance

  • P3. Impact: the new file is tagged realdata, but default pytest selection still runs it because addopts excludes only slow. If that is intentional, the marker is informational only; if not, the suite routing is inconsistent and baseline CI cost rises unnecessarily. Locations: tests/test_survey_real_data.py:54, pyproject.toml:99. Concrete fix: either change default addopts to exclude realdata and run these in a dedicated job, or mark them slow if they are meant to be opt-in.

Maintainability

  • P3. Impact: the AI PR review workflow now strips all benchmark JSON/CSV diffs and tutorial notebooks from the prompt. Those are the substantive artifacts in this PR, so future automated review will only see filenames, not the actual fixture/notebook changes. Location: ai_pr_review.yml:150. Concrete fix: narrow the exclusion to raw/generated assets only, or append compact summaries for excluded files (row counts, column names, hashes, notebook cell/output counts) so review automation still has visibility.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 4, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 4, 2026
Echo a note into the prompt listing any excluded data files so Codex
knows its diff is partial. Keeps PR #267 in sync with #268.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
igerber and others added 2 commits April 4, 2026 15:07
Validate diff-diff's survey variance estimation against R's survey package
using three real-world datasets: California API (strata+FPC), NHANES ACA
young adult coverage (strata+PSU+nest), and RECS 2020 (JK1 replicate
weights). All 15 tests match R to machine precision (<1e-10 differences).

Includes R benchmark scripts, Python download scripts, golden value JSON
files, and a real-data section in the survey tutorial demonstrating the
ACA dependent coverage provision DiD on actual CDC data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…move B5

- Tighten tolerances from 1-5% to 1e-8 (matching observed < 1e-10 gaps)
- Add missing df/CI assertions to A4, A5, A7, B2, B3, B4
- Remove dormant B5 CallawaySantAnna test (time-scale mismatch, R can't
  produce golden values for 2-period RC-DiD)
- C3 DEFF: change to smoke test (different naive baselines vs R)
- Narrow AI review workflow exclusion to benchmarks/data/real/ only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the survey-real-data-validation branch from 1984199 to eca4ac3 Compare April 4, 2026 19:07
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 4, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: eca4ac34fa233863fccfb1dd2d1ab763e05c99e4


Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

  • Prior issues were partially addressed: the Python suite removed NHANES B5 and tightened numerical tolerances from percent-level slack to 1e-8.
  • Severity P1 [Newly identified]: the new A5 API subpopulation validation explicitly waives df/CI parity against R, even though the Methodology Registry still says subpopulation survey df_survey should match survey::degf() after subset(). That leaves a real inference discrepancy undocumented and untested.
  • Severity P2: the docs/changelog still claim machine-precision ATT/SE/df/CI parity across all 15 tests, but several scenarios still assert only subsets of those outputs.
  • Severity P2: the NHANES R benchmark script still contains the optional B5 path while the committed golden JSON/tests no longer do, so the published “rerun the benchmark, then run the tests” workflow remains environment-dependent.
  • Severity P3: the new realdata marker is not excluded by default, and the AI review workflow now hides notebook diffs without surfacing a summary.
  • Static review only: I could not run the suite here because the sandbox Python environment is missing dependencies (numpy import failed).

Methodology

Code Quality

  • No findings.

Performance

  • Severity P3. Impact: realdata is registered but default pytest still runs it because addopts excludes only slow; the new marker is informational only. References: pyproject.toml:99, tests/test_survey_real_data.py:55. Concrete fix: add not realdata to default addopts, or also mark the suite slow.

Maintainability

  • Severity P3. Impact: the AI PR review workflow now excludes docs/tutorials/*.ipynb in addition to the large benchmark assets and removed the explicit excluded-file notice, so future automated review will miss substantive tutorial/notebook changes in this feature. References: .github/workflows/ai_pr_review.yml:150. Concrete fix: restore an excluded-file summary (hashes, notebook cell counts, fixture key lists) or stop dropping notebook source diffs entirely.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Resolve the A5 subpopulation inference contract: either align SurveyDesign.subpopulation() survey df/CI with the benchmarked R domain behavior and then assert A5 df/CI in the real-data test, or add a Registry **Note:** / **Deviation from R:** entry and remove all “Exact” / machine-precision df/CI claims for A5.
  2. Make NHANES benchmark reproduction deterministic by deleting or hard-skipping B5 in benchmarks/R/benchmark_realdata_nhanes.R, or by moving it to a separate optional benchmark not referenced by the committed fixture/docs.
  3. Bring the public validation claims into line with what CI actually checks, or add the missing assertions for A7/C1/C2/C3 (and clarify the narrower scope of A4/A5).

…ssertions

P1: Document subpopulation df deviation in REGISTRY.md — subpopulation()
preserves all strata (conservative, per Lumley 2004) while R's subset()
drops empty strata. ATT and SE match; only df differs.

P2: Add CI assertions to A7, C1, C2. Remove B5 from NHANES R script
(was generating non-deterministic output). Narrow docs/changelog claims
to reflect what each test actually asserts (A4 ATT-only, A5 ATT/SE-only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 4, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 26c16458b4d0f59198458cb98d0ab9bc5c5e6451


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2.

Executive Summary

  • Prior P1 methodology issues appear resolved. The A5 subpopulation df difference is now explicitly documented as a Deviation from R in the registry, and the dormant NHANES B5 path was removed from both the R benchmark and Python suite.
  • The main remaining issue is a documentation/test-scope mismatch for A4: the public benchmark docs still imply exact df parity there, but the test only enforces ATT parity and a finite positive SE.
  • The AI PR review workflow still excludes notebook diffs entirely, so future automated reviews will miss substantive tutorial changes like this PR’s new NHANES section.
  • The committed notebook outputs include local absolute filesystem paths in stderr, which is low-severity but unnecessary environment leakage.
  • Static review only. I could not run pytest here because pytest is not installed in the environment.

Methodology

No unmitigated methodology findings. The previously raised A5 subpopulation df divergence is now documented as a Deviation from R in docs/methodology/REGISTRY.md:2367, and that matches the current subpopulation() implementation in diff_diff/survey.py:422. The non-deterministic NHANES B5 path was also removed from both the R benchmark and Python suite in benchmarks/R/benchmark_realdata_nhanes.R:162 and tests/test_survey_real_data.py:430.

Code Quality

No findings.

Performance

  • Severity P3. Impact: the new realdata marker is informational only; default pytest still runs these tests because addopts filters only not slow, so the new suite becomes part of ordinary runs whenever the fixtures exist. References: pyproject.toml:99, pyproject.toml:100, tests/test_survey_real_data.py:56. Concrete fix: add not realdata to default addopts, or also mark this module slow if the intended behavior is opt-in execution.

Maintainability

Tech Debt

No findings.

Security

Documentation/Tests

  • Severity P2. Impact: the public validation docs still overstate what CI actually enforces for A4. The benchmark page says all comparisons match to machine precision and marks A4 df as exact, but the test only checks ATT parity plus finite/>0 SE, so a TWFE df regression would not be caught by CI. References: docs/benchmarks.rst:599, docs/benchmarks.rst:657, docs/survey-roadmap.md:294, tests/test_survey_real_data.py:200. Concrete fix: either add an A4 assertion for result.survey_metadata.df_survey == r["df"] (and any other intended comparable metric), or narrow the docs to ATT-only validation for A4 and qualify the opening “all comparisons” sentence accordingly.

Remove /Users/... and /var/folders/... paths from committed notebook
warning outputs. These are UserWarning tracebacks that leaked local
environment metadata.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 4, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 4, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: ceee94c1043c57d3df7c9f2d67cdfa337161308d


Overall Assessment

Looks good.

Highest unmitigated severity: P2.

Executive Summary

  • No unmitigated methodology defects in the estimator code path. The notable A5 subpopulation df divergence is now explicitly documented in the methodology registry, so it is correctly treated as a documented deviation rather than a blocker.
  • The main remaining issue is still a docs/test-scope mismatch for A4: public docs still describe broader R parity than the committed TWFE test actually enforces.
  • The AI review workflow now explicitly excludes notebook diffs, so future automated reviews will miss substantive tutorial changes like this PR’s new NHANES section.
  • The earlier notebook-output path-leakage concern appears resolved; I did not find the prior absolute local/temp paths in the committed notebook text.
  • Static review only; pytest is not installed in this environment, so I did not execute the new suite.

Methodology

Code Quality

  • No findings.

Performance

  • Severity P3 (informational). Impact: if realdata was intended to make the new suite opt-in, the current pytest defaults do not do that; default runs still include these tests whenever fixtures are present because only slow is excluded. Concrete fix: if opt-in behavior is desired, add not realdata to default addopts, or also mark the module slow. Refs: pyproject.toml:L99-L104, tests/test_survey_real_data.py:L55-L56

Maintainability

  • Severity P3. Impact: the AI PR review workflow now hard-excludes docs/tutorials/*.ipynb from the compiled diff, so future automated reviews will miss substantive tutorial changes entirely; this PR’s new NHANES section is exactly the kind of change that becomes invisible. Concrete fix: include a notebook cell-source summary in the prompt (for example via nbconvert/jupytext), or restore an explicit excluded-file summary so reviewers know content was omitted. Refs: .github/workflows/ai_pr_review.yml:L150-L156, docs/tutorials/16_survey_did.ipynb:L1509-L1529

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the public validation docs still overstate the scope of A4/TWFE parity. The benchmark page says all comparisons match to machine precision and still labels A4 df as exact, while the test only asserts ATT equality plus a finite/positive SE; the roadmap summary likewise says ATT/SE/df/CI match across the API variants before listing A4/A5 exceptions. That leaves room for a TWFE df/CI regression to slip through CI while docs continue to promise exact parity. Concrete fix: either add explicit A4 assertions for every comparable quantity the docs claim, or narrow the docs/roadmap language to ATT-only for A4 and “all directly comparable metrics” instead of blanket machine-precision parity. Refs: docs/benchmarks.rst:L599-L602, docs/benchmarks.rst:L657-L661, docs/benchmarks.rst:L734-L747, docs/survey-roadmap.md:L294-L303, tests/test_survey_real_data.py:L200-L217

@igerber igerber merged commit 7f385dd into main Apr 4, 2026
14 checks passed
@igerber igerber deleted the survey-real-data-validation branch April 4, 2026 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant