Add bootstrap lonely_psu adjust, CV on estimates, weight trimming, and pretrends+survey by igerber · Pull Request #260 · igerber/diff-diff

igerber · 2026-04-03T15:08:31Z

Summary

Implement lonely_psu="adjust" for survey-aware bootstrap by pooling singleton PSUs into a combined pseudo-stratum (multiplier and Rao-Wu paths)
Add coef_var property (SE / |estimate|) to all 12 results classes with CV display in summary() output
Add trim_weights() utility to prep.py for capping extreme survey weights (absolute or quantile-based)
Thread survey weights through ImputationDiD pretrends: weighted demeaning, WLS, design-based VCV, survey df in F-test

Methodology references (required if estimator / math changes)

Method name(s): Lonely PSU adjustment in bootstrap, coefficient of variation, survey-weighted pre-trend test
Paper / source link(s): Rust & Rao (1996) "Variance Estimation for Complex Surveys Using Replication Techniques"; Borusyak, Jaravel & Spiess (2024) Test 1 (Equation 9)
Any intentional deviations from the source (and why): Bootstrap lonely_psu="adjust" pools singleton PSUs into a pseudo-stratum rather than the TSL approach of centering around the global mean. This is the bootstrap analogue — generates nonzero variance contribution from singletons while preserving the weight-generation architecture.

Validation

Tests added/updated: tests/test_survey_phase8.py (16 new tests: 3 bootstrap adjust, 5 CV, 5 trim_weights, 3 pretrends+survey)
All existing ImputationDiD (80+) and bootstrap (100+) tests pass with no regressions

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

8d: Implement lonely_psu="adjust" for survey-aware bootstrap. Singleton PSUs are pooled into a combined pseudo-stratum for weight generation in both multiplier (generate_survey_multiplier_weights_batch) and Rao-Wu (generate_rao_wu_weights) paths. This matches the analytical TSL "adjust" behavior of centering around the global mean. 8e-i: Add coef_var property (SE / |estimate|) to all 12 results classes with CV display in summary() output. Returns NaN when estimate is 0 or SE is non-finite. Used by federal agencies (NCHS) for publication standards (CV < 30%). 8e-ii: Add trim_weights() utility to prep.py for capping extreme survey weights via absolute upper bound or quantile-based threshold. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Thread survey weights through _compute_lead_coefficients(): - Pass survey weights to _iterative_demean() for weighted demeaning - Pass survey weights to solve_ols() for WLS point estimates - Replace cluster-robust VCV with compute_survey_vcov() for design-based inference (strata/PSU/FPC) - Use survey df in safe_inference() for t-distribution critical values - Adjust F-test denominator df in _pretrend_test() to use survey df Persist resolved_survey and survey_weights in _fit_data dict so _pretrend_test() can access them post-fit. Subset resolved_survey to Omega_0 via dataclasses.replace() before passing to lead regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T15:08:36Z

/ai-review

github-actions · 2026-04-03T15:18:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f997ece20a03b1def9aff3c11d4caa6a2c0fe285

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

Executive Summary

The new ImputationDiD survey-pretrend path replaces the old NotImplementedError, but it does so with inconsistent survey-design handling: it naively subsets the resolved design for Omega_0, even though the survey subsystem says subpopulation analysis should retain the full design with zeroed weights, and it then mixes different survey-df sources for lead-level vs joint-test inference.
lonely_psu="adjust" is now enabled in the shared survey bootstrap helpers, but both the Methodology Registry and the SurveyDesign docstring still document a different contract. That is an undocumented variance-method change affecting both multiplier-bootstrap and Rao-Wu estimators.
The new lonely-PSU tests only cover the multiplier / CallawaySantAnna path. The changed Rao-Wu branch used by SunAbraham, SyntheticDiD, and TROP is not exercised.
The repo still contains the old survey-pretrends exception test, so the branch now has conflicting test expectations. I could not run pytest locally because pytest is not installed in this environment.

Methodology

P1 Affected method: ImputationDiD pre-period lead regression / pretrend test (BJS Test 1 / Equation 9). The registry still says pretrends=True with survey_design is unsupported at docs/methodology/REGISTRY.md#L867, but the new implementation slices the resolved survey object down to Omega_0 via dc_replace at diff_diff/imputation.py#L1731 and diff_diff/imputation.py#L2234. That conflicts with the survey subsystem’s own guidance that subpopulation analysis should keep the full design and zero weights outside the domain at diff_diff/survey.py#L430. The same code then uses resolved_survey_0.df_survey for per-lead inference at diff_diff/imputation.py#L2140, but uses the full design df for the joint F-test at diff_diff/imputation.py#L2281; df_survey itself is derived from stored n_psu / n_strata metadata at diff_diff/survey.py#L590. Impact: lead-level and joint pretrend inference can silently use inconsistent design semantics, and the survey VCV block also reconstructs residuals from possibly-NaN dropped coefficients at diff_diff/imputation.py#L2132. Concrete fix: either keep this path unsupported for now, or re-implement it as a true untreated-domain survey analysis using one consistent resolved design object for compute_survey_vcov() and both df calls, preferably via SurveyDesign.subpopulation(), and use the residuals returned by solve_ols rather than recomputing them from the coefficient vector.
P1 Affected methods: survey multiplier bootstrap (CallawaySantAnna, ImputationDiD, TwoStageDiD) and Rao-Wu bootstrap (SunAbraham, SyntheticDiD, TROP). SurveyDesign still documents lonely_psu="adjust" as “center around grand mean” at diff_diff/survey.py#L52, and the registry still says survey bootstrap only supports remove / certainty and should raise for adjust at docs/methodology/REGISTRY.md#L2217. The new code instead pools singleton PSUs into a pseudo-stratum in both the multiplier helper at diff_diff/bootstrap_utils.py#L540 and the Rao-Wu helper at diff_diff/bootstrap_utils.py#L649. Because these are shared helpers, the change reaches callers such as diff_diff/imputation_bootstrap.py#L317, diff_diff/two_stage_bootstrap.py#L285, diff_diff/sun_abraham.py#L1519, diff_diff/synthetic_did.py#L770, and diff_diff/trop_global.py#L1173. Impact: this is an untracked methodology deviation in variance construction, not a documented implementation choice. Concrete fix: either restore the previous NotImplementedError, or add a Note / Note (deviation...) to REGISTRY.md and the survey docstrings that formally defines the pseudo-stratum bootstrap rule, its rationale, and every estimator it affects.

Code Quality

No material findings beyond the methodology issues above.

Performance

No material findings in the changed paths.

Maintainability

P3 trim_weights is imported at the package root at diff_diff/init.py#L97 but never added to __all__ in the public export list starting at diff_diff/init.py#L306. Impact: the new helper is only half-exposed as public API. Concrete fix: add "trim_weights" to __all__ and a small import smoke test.

Tech Debt

No separate tech-debt-only findings. I found no TODO.md entry mitigating the P1 issues above, so they remain unmitigated under the review rubric.

Security

No findings.

Documentation/Tests

P1 The new lonely_psu="adjust" tests only cover the multiplier helper / CallawaySantAnna path at tests/test_survey_phase8.py#L789 and tests/test_survey_phase8.py#L813. Nothing exercises the changed Rao-Wu branch at diff_diff/bootstrap_utils.py#L603, even though that branch now feeds SunAbraham, SyntheticDiD, and TROP through diff_diff/sun_abraham.py#L1519, diff_diff/synthetic_did.py#L770, and diff_diff/trop_global.py#L1173. Impact: per the project checklist, a new parameter/bootstrap interaction is landing without coverage on all affected code paths. Concrete fix: add direct generate_rao_wu_weights() tests for mixed/all-singleton strata and FPC/no-FPC, plus at least one estimator-level regression through a Rao-Wu consumer.
P2 The old contract test still expects NotImplementedError for survey pretrends at tests/test_pretrends_event_study.py#L737, while the new tests only assert no-raise / finite outputs at tests/test_survey_phase8.py#L1073. Impact: the suite now has conflicting expectations and still does not pin down the actual survey-design behavior introduced here. Concrete fix: replace the stale exception test with a regression that asserts the intended survey-domain treatment and df convention, including a case where some PSUs or strata disappear from Omega_0.
Verification note: I could not execute the test suite locally because pytest is not installed in this environment.

Path to Approval

Rework the ImputationDiD survey-pretrend implementation to use one consistent untreated-domain survey design, rather than naive Omega_0 subsetting, and use that same design for compute_survey_vcov() and both lead-level / joint-test df calculations.
Either revert survey-bootstrap lonely_psu="adjust" to NotImplementedError, or document the pseudo-stratum rule in docs/methodology/REGISTRY.md and the survey docstrings as an explicit deviation, naming every affected estimator.
Add Rao-Wu coverage for lonely_psu="adjust": direct helper tests plus at least one estimator-level regression through a Rao-Wu consumer; add a multiplier-bootstrap regression through another shared-helper consumer if support is intended there too.
Replace the stale survey-pretrend exception test with a regression that checks the supported behavior and the chosen df/design convention.

P1: Use consistent full-design df_survey for both per-lead inference and F-test in ImputationDiD pretrends. Add explicit survey_df parameter to _compute_lead_coefficients() passed from the full resolved_survey. P1: Document bootstrap lonely_psu="adjust" pseudo-stratum pooling in REGISTRY.md with deviation note listing all affected estimators. P1: Add Rao-Wu lonely_psu="adjust" test (direct helper test). P2: Fix stale test_imputation_pretrends_survey_raises in test_pretrends_event_study.py — now expects success. P2: Update REGISTRY.md pretrends+survey note to reflect support. P3: Add trim_weights to __all__ in __init__.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T17:02:14Z

/ai-review

github-actions · 2026-04-03T17:09:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 713f261e4259346c70e09708cc4ee9c8fd8d050a

Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

The prior documentation blocker around lonely_psu="adjust" is largely resolved: the pseudo-stratum bootstrap rule is now documented in the registry, so that earlier methodology-deviation finding is mitigated at docs/methodology/REGISTRY.md:L2218-L2223.
The remaining blocker is the new ImputationDiD survey-pretrend support for replicate-weight designs. In fit(), the pre-period Equation 9 lead regression silently drops survey weighting for replicate surveys because resolved_survey is stripped before the pretrend path runs at diff_diff/imputation.py:L536-L550 and diff_diff/imputation.py:L1724-L1767.
The same feature then breaks replicate inference in two ways: the replicate-refit override now feeds negative-horizon pretrend keys through a treated-horizon ATT refit that cannot produce them at diff_diff/imputation.py:L583-L687, and pretrend_test() falls back to compute_survey_vcov() even for replicate designs at diff_diff/imputation.py:L2233-L2273 and diff_diff/survey.py:L1399-L1438.
Test coverage improved, but it still misses the combinations that expose the blocker: replicate-weight ImputationDiD(pretrends=True) and an estimator-level Rao-Wu consumer for lonely_psu="adjust". The new tests only cover direct Rao-Wu helper output and non-replicate survey pretrends at tests/test_survey_phase8.py:L915-L949, tests/test_survey_phase8.py:L1111-L1177, and tests/test_pretrends_event_study.py:L729-L750.

Methodology

Severity P0. Impact: ImputationDiD pre-period lead coefficients are wrong for replicate-weight survey designs. The registry now says pretrends=True with survey_design is supported for the Equation 9 lead regression at docs/methodology/REGISTRY.md:L860-L867, and separately documents replicate-weight ImputationDiD variance as estimator-level two-stage refit at docs/methodology/REGISTRY.md:L2254-L2316. But fit() passes resolved_survey=None into _aggregate_event_study() whenever uses_replicate_variance is true at diff_diff/imputation.py:L536-L550. The new pretrend branch only threads survey weights into _compute_lead_coefficients() when both survey_weights and resolved_survey are present at diff_diff/imputation.py:L1724-L1767, so replicate-weight surveys silently fall back to unweighted demeaning / OLS at diff_diff/imputation.py:L2068-L2108. Concrete fix: explicitly branch on resolved_survey.uses_replicate_variance; either reject pretrends=True for replicate-weight surveys for now, or implement a replicate-weight lead-regression refit that actually uses the survey weights and replicate design.
Severity P0. Impact: replicate-weight pretrend inference is internally inconsistent and can collapse to all-NaN. The replicate override now derives _sorted_rel_times from the expanded event-study output, which includes the new negative-horizon pretrend keys at diff_diff/imputation.py:L583-L590. _refit_imp() only recomputes treated-horizon ATT cells from tau_r at diff_diff/imputation.py:L608-L651, so negative horizons become NaN on every replicate, and compute_replicate_refit_variance() drops any replicate vector with any non-finite component at diff_diff/survey.py:L1807-L1885. Separately, pretrend_test() now subsets the replicate design and passes it into _compute_lead_coefficients() at diff_diff/imputation.py:L2233-L2273, which unconditionally swaps in compute_survey_vcov() whenever resolved_survey_0 is present at diff_diff/imputation.py:L2134-L2139. compute_survey_vcov() is the TSL sandwich and ignores replicate columns entirely at diff_diff/survey.py:L1399-L1438. Concrete fix: keep negative-horizon lead coefficients out of the treated ATT replicate vector, and compute pretrend SE/F-test for replicate surveys via replicate lead-regression refits rather than compute_survey_vcov().

Code Quality

No material findings beyond the methodology defects above.

Performance

No material findings in the changed paths.

Maintainability

No unmitigated maintainability findings. The prior trim_weights export gap is fixed in diff_diff/init.py:L92-L99 and diff_diff/init.py:L306-L314.

Tech Debt

No mitigating TODO.md entry covers the P0 issues above, and these would not be mitigatable by tracking anyway under the review rubric.

Security

No findings.

Documentation/Tests

Severity P1. Impact: the new tests still do not cover the combinations that expose the blocker, so the new supported behavior can regress silently. The added survey-pretrend tests exercise only non-replicate survey designs at tests/test_survey_phase8.py:L1111-L1177 and tests/test_pretrends_event_study.py:L729-L750. The Rao-Wu adjust addition still has only a direct helper test, not an estimator-level regression through SunAbraham, SyntheticDiD, or TROP, at tests/test_survey_phase8.py:L915-L949. Concrete fix: add BRR/JK1/SDR replicate-weight regressions for ImputationDiD(pretrends=True) covering both fit(..., aggregate="event_study") and pretrend_test(), plus one end-to-end Rao-Wu estimator test with lonely_psu="adjust".

Path to Approval

In ImputationDiD, either re-block pretrends=True for replicate-weight survey designs, or implement a dedicated replicate-weight Equation 9 lead-regression path that uses the full-sample and replicate survey weights consistently.
Remove negative-horizon pretrend keys from the existing treated-horizon replicate ATT refit vector, or extend that refit so it explicitly re-estimates the lead regression per replicate before forming the replicate vcov.
Make pretrend_test() use the same replicate-weight methodology as fit() for survey replicate designs; do not route replicate designs through compute_survey_vcov().
Add regression tests for SurveyDesign(replicate_weights=..., replicate_method=...) with ImputationDiD(pretrends=True) and one estimator-level Rao-Wu lonely_psu="adjust" consumer path.

Verification note: pytest is not installed in this environment (pytest: command not found), so this review is static only.

Replicate-weight survey designs need per-replicate lead regression refits for pretrends, which are not yet implemented. Reject with NotImplementedError for now — analytical survey designs (strata/PSU/FPC) remain supported. Fixes: replicate designs silently falling back to unweighted pretrends, negative-horizon keys contaminating the replicate refit vector, and compute_survey_vcov being called for replicate designs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T17:14:52Z

/ai-review

github-actions · 2026-04-03T17:27:10Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 0d916630782a55a4614966e4cb1139cbf35020bf

Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

The prior methodology-documentation blocker for lonely_psu="adjust" is resolved: the registry now explicitly documents the pooled pseudo-stratum bootstrap rule, so that deviation is mitigated under the review rubric at docs/methodology/REGISTRY.md:L2217.
The earlier replicate-weight event-study blocker is only partially fixed. fit() now re-blocks pretrends=True for replicate designs in event-study/all paths at diff_diff/imputation.py:L242, but the public pretrend_test() path still accepts replicate-weight fits and silently computes Taylor-linearized rather than replicate-based inference.
The new analytical-survey pretrend path has a second SE bug: it recomputes residuals from a coefficient vector that may contain NaNs, which breaks the existing weighted rank-deficiency contract and can turn otherwise-estimable survey pretrend SEs/F-tests into all-NaN.
The new coef_var feature is minorly wrong for exact-zero SEs: every implementation returns NaN instead of 0.0.
Static review only. I could not execute the new tests locally because the available Python environment is missing numpy/pandas.

Methodology

Affected methods: ImputationDiD pretrend Test 1 / Equation 9 and survey-aware bootstrap lonely-PSU handling. Cross-checking the cited BJS paper, Test 1 is an untreated-only OLS lead regression estimated on untreated observations and tested with a heteroskedasticity-/cluster-robust Wald test. (academic.oup.com)

Severity P0. Impact: replicate-weight survey pretrend_test() is still silently wrong. The new fit-time guard only blocks pretrends=True for aggregate in ("event_study", "all") at diff_diff/imputation.py:L242, but users can still fit a replicate-weight survey model and call the public method at diff_diff/imputation_results.py:L411. _pretrend_test() subsets the replicate design and passes it into _compute_lead_coefficients() at diff_diff/imputation.py:L2245, and _compute_lead_coefficients() unconditionally swaps in compute_survey_vcov() whenever resolved_survey_0 is present at diff_diff/imputation.py:L2146. But compute_survey_vcov() is the Taylor-linearized sandwich path, not replicate-weight variance, by construction at diff_diff/survey.py:L1399. The same code then uses replicate df_survey for coefficient inference and the F-test at diff_diff/imputation.py:L2285 and diff_diff/imputation.py:L2300. That yields a Taylor-linearized numerator with replicate-weight denominator df on a path the registry says is not implemented at docs/methodology/REGISTRY.md:L867. Concrete fix: until a full replicate lead-regression refit exists, make _pretrend_test() raise NotImplementedError for resolved_survey.uses_replicate_variance; if full support is intended, implement per-replicate Equation 9 refits and a replicate-based Wald test.
Severity P1. Impact: the new analytical-survey lead-regression SE path breaks on weighted rank-deficient fits. After solve_ols() returns, the survey override recomputes residuals as y_dm - X_dm @ coefficients at diff_diff/imputation.py:L2150. That is unsafe because solve_ols() intentionally returns finite residuals even when dropped columns get NaN coefficients; it rebuilds fitted values from kept columns only at diff_diff/linalg.py:L674 and diff_diff/linalg.py:L868. In the new survey path, any dropped coefficient can therefore poison the residual vector and turn the survey VCV, per-horizon SEs, and F-test into all-NaN even when the identifiable lead coefficients are fine. Concrete fix: reuse result[1] from solve_ols() when calling compute_survey_vcov(), or reconstruct fitted values from non-NaN coefficients only, and add a survey + collinear-covariate regression test.
Severity P3. Impact: no action required. The bootstrap lonely_psu="adjust" implementation now pools singleton PSUs into a pseudo-stratum at diff_diff/bootstrap_utils.py:L540 and diff_diff/bootstrap_utils.py:L649, whereas the analytical TSL adjust path centers singleton PSU scores on the global mean at diff_diff/survey.py:L1351. Because the registry now documents that bootstrap rule as an intentional analogue at docs/methodology/REGISTRY.md:L2217, I am treating it as a documented deviation, not a defect.

Code Quality

Severity P2. Impact: the new coef_var property is wrong for exact-zero SEs. Every new implementation requires se > 0, so an estimate with se == 0 reports NaN/omits CV instead of returning the mathematically correct 0.0; representative instances are diff_diff/results.py:L108, diff_diff/imputation_results.py:L156, and diff_diff/staggered_results.py:L144. Concrete fix: change the guard to allow finite zero SEs and return 0.0 when estimate != 0 and se == 0.

Performance

No material findings in the changed paths.

Maintainability

No material findings beyond the correctness issues above.

Tech Debt

No separate tech-debt finding. TODO.md does not track the P0/P1 issues above, and those would not be mitigable there under the review rubric.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests do not cover the two remaining broken paths. The added survey pretrend tests cover analytical-survey success at tests/test_survey_phase8.py:L1114 and one fit-time replicate rejection at tests/test_survey_phase8.py:L1179, but there is no regression for results.pretrend_test() after a replicate-weight fit and no survey-pretrend test with a dropped covariate/lead column. Concrete fix: add one test that pretrend_test() raises on a BRR/JK replicate design after fit(), and one analytical-survey test with a rank-deficient nuisance regressor that still expects finite lead SEs for identified horizons.

Path to Approval

Block replicate-weight survey designs in _pretrend_test() unless and until a true replicate-weight Equation 9 refit path is implemented; do not route replicate designs through compute_survey_vcov().
Fix _compute_lead_coefficients() to use the finite residuals already returned by solve_ols() (or equivalent kept-column reconstruction) before calling compute_survey_vcov().
Add regression tests for results.pretrend_test() on replicate-weight survey fits and for analytical-survey pretrends with rank-deficient covariates.

P0: Block replicate-weight survey designs in _pretrend_test() — the public method was still accessible after a replicate fit, routing through compute_survey_vcov (TSL) instead of replicate variance. P1: Use residuals from solve_ols (result[1]) instead of recomputing y_dm - X_dm @ coefficients in the survey VCV path. solve_ols already handles rank-deficient fits by rebuilding fitted values from kept columns only, so residuals are finite even with NaN coefficients. P2: Add regression test for pretrend_test() rejection on replicate fits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T17:52:53Z

/ai-review

github-actions · 2026-04-03T18:02:25Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 84143550e5561011c498d5c5b62f800bdc8a21e6

Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0.

Executive Summary

The prior replicate-weight blocker is fixed: fit(pretrends=True) now rejects replicate-weight survey designs, and the public pretrend path rejects them too at diff_diff/imputation.py:L242 and diff_diff/imputation.py:L2198.
The prior residual-handling issue is only partially resolved. The new survey pretrend path correctly reuses solve_ols() residuals at diff_diff/imputation.py:L2153, but it still calls compute_survey_vcov() on the full possibly singular design matrix.
[Newly identified] analytical-survey pretrend_test() can still emit a finite joint-test p-value when survey df is nonpositive by coercing df_survey to 1, while coefficient-level inference on the same path correctly treats that df as undefined.
The new bootstrap lonely_psu="adjust" pooling rule is documented in the Methodology Registry, so under the review rubric it is informational, not a defect.
coef_var remains minorly wrong for exact-zero SEs across the new results classes.
Static review only. I could not execute the test suite here because pytest is unavailable and the local Python environment is missing numpy.

Methodology

Affected methods: Borusyak-Jaravel-Spiess Test 1 / Equation 9 pretrend regression and survey replication/bootstrap handling. In the paper, Test 1 is an OLS regression on untreated observations only, with joint significance tested via a heteroskedasticity- and cluster-robust Wald test; Rust & Rao is the cited general reference on jackknife/BRR/bootstrap replication methods for complex surveys. (academic.oup.com)

Severity P0 [Newly identified]. Impact: analytical-survey pretrend_test() can silently report a finite joint-test p-value when the full-design survey df is nonpositive. The per-horizon survey path sends survey_df into safe_inference() at diff_diff/imputation.py:L2170, and safe_inference() returns all-NaN inference when df <= 0 at diff_diff/utils.py:L178. But the joint F-test branch replaces the same denominator df with max(resolved_survey.df_survey, 1) at diff_diff/imputation.py:L2313, even though analytical survey df is defined as n_psu - n_strata at diff_diff/survey.py:L619 and the registry says this path should use the full-design df_survey at docs/methodology/REGISTRY.md:L867. That yields internally inconsistent and misleading statistical output on designs with n_psu == n_strata. Concrete fix: do not coerce nonpositive df_survey to 1; return p_value = np.nan for the joint test (or implement a documented asymptotic fallback consistently), and add a regression test with one PSU per stratum.
Severity P1 [Newly identified]. Impact: survey-weighted pretrends still fail on rank-deficient lead/covariate designs. The new code now uses solve_ols() residuals, but it still passes the full demeaned design matrix into compute_survey_vcov() at diff_diff/imputation.py:L2146. compute_survey_vcov() solves against full X'WX and raises on singular matrices at diff_diff/survey.py:L1450. Elsewhere, the library’s survey OLS paths first reduce to the kept columns when coefficients are NaN, then expand the vcov back out, e.g. diff_diff/estimators.py:L1318. Concrete fix: mirror that kept-column reduction/expand-back pattern in _compute_lead_coefficients(), then add an analytical-survey regression test with a duplicate covariate or aliased lead.
Severity P3. Impact: no action required. The bootstrap lonely_psu="adjust" implementation pools singleton PSUs into a pseudo-stratum at diff_diff/bootstrap_utils.py:L540 and diff_diff/bootstrap_utils.py:L649. That is explicitly documented as a bootstrap analogue in the registry at docs/methodology/REGISTRY.md:L2217, so it is a documented deviation, not a defect.

Code Quality

Severity P2. Impact: exact-zero SE estimates still report coef_var = NaN instead of 0.0. Representative implementations guard on se > 0 at diff_diff/results.py:L108, diff_diff/imputation_results.py:L156, and diff_diff/staggered_results.py:L144; the same pattern appears across the new coef_var properties. Concrete fix: allow finite se == 0 and return 0.0 when the estimate is finite and nonzero.

Performance

No material findings in the changed paths.

Maintainability

No separate maintainability finding beyond the rank-deficient survey-vcov bug above.

Tech Debt

No separate tech-debt finding. TODO.md does not already track the P0/P1 issues above, so they remain unmitigated.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new tests cover the analytical-survey happy path and replicate rejection, but they do not cover the two remaining broken edges. The added survey tests are at tests/test_survey_phase8.py:L1114, tests/test_survey_phase8.py:L1179, and tests/test_survey_phase8.py:L1215; the covariate pretrend test at tests/test_pretrends_event_study.py:L533 is unweighted and does not hit the new survey path. Concrete fix: add one analytical-survey test with a duplicate covariate/aliased lead, and one analytical-survey test with n_psu == n_strata asserting the joint-test p-value contract.

Path to Approval

Fix _pretrend_test() so nonpositive analytical df_survey does not get coerced to 1; return NaN joint-test inference (or a documented asymptotic fallback) and add a regression test with one PSU per stratum.
Fix _compute_lead_coefficients() to compute survey vcov on kept columns only when solve_ols() drops rank-deficient columns, then expand back to the full matrix; add a survey-pretrend regression test with duplicate covariates or an aliased lead.

P0: Don't coerce nonpositive survey df_survey to 1 in pretrend F-test. Return NaN p-value when df_survey <= 0 (variance unidentified). P1: Reduce to kept (finite-coefficient) columns before calling compute_survey_vcov in _compute_lead_coefficients, then expand VCV back to full matrix with NaN for dropped columns. Prevents singular X'WX on rank-deficient lead/covariate designs. P2: Change coef_var guard from se > 0 to se >= 0 across all 12 results classes. SE=0 with finite nonzero estimate (e.g., FPC census) now correctly returns CV=0.0 instead of NaN. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T18:14:57Z

/ai-review

github-actions · 2026-04-03T18:24:28Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 01e9a8fd48fd42e2e7593c9f74cfe32bc7e9fa5d

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1.

This re-review is materially better than the last one: the prior nonpositive-df_survey joint-test bug and the rank-deficient survey-VCV issue both look fixed in diff_diff/imputation.py:L2153-L2165 and diff_diff/imputation.py:L2321-L2332. The remaining blocker is a newly identified survey-pretrend variance issue.

Executive Summary

Previous blocker checks appear resolved: the survey pretrend joint test now returns NaN when df_survey <= 0, and the lead-regression survey VCV now reduces to kept columns before calling compute_survey_vcov().
Severity P1 [Newly identified]. The new analytical-survey pretrend path still handles Omega_0 as a physically subsetted survey design, and it asks solve_ols() for a cluster-robust VCV before replacing it with survey VCV. On valid survey-domain inputs this can change design-based SEs, and on one-PSU untreated domains it can raise instead of returning survey-style NaN inference. (r-survey.r-forge.r-project.org)
Severity P3. The new lonely_psu="adjust" bootstrap rule is explicitly documented in the Methodology Registry, so under this rubric it is informational rather than a defect.
Severity P2. coef_var still returns NaN instead of 0.0 for exact-zero SEs across the new result classes.
Static review only. I could not run the tests here because the local Python environment is missing numpy.

Methodology

Affected methods: Borusyak-Jaravel-Spiess Test 1 / Equation 9 pretrend regression, its new analytical-survey extension, and survey-aware bootstrap lonely-PSU handling. In the paper, Test 1 estimates gamma by OLS on untreated observations only and uses a heteroskedasticity- and cluster-robust Wald test; Rust & Rao is a general reference on replication-based variance methods for complex surveys. (academic.oup.com)

Severity P1 [Newly identified]. Impact: the new analytical-survey pretrend implementation violates the survey subpopulation contract. It slices weights/strata/PSUs/FPC down to Omega_0 at diff_diff/imputation.py:L1736-L1761 and diff_diff/imputation.py:L2268-L2293, even though the survey module itself says subpopulation analysis should preserve the full design and zero out excluded observations at diff_diff/survey.py:L427-L432. Official survey docs describe the same requirement: subsetting a survey design should keep the original strata/cluster information rather than treat the domain as a new sample. That means the new code can silently change lonely-PSU handling, FPC behavior, and lead-coefficient SEs whenever some PSU/stratum has no untreated observations. The same path also requests a cluster-robust VCV from solve_ols() at diff_diff/imputation.py:L2106-L2121, so a one-PSU untreated domain can error before the survey VCV path runs. Concrete fix: preserve full survey structure outside Omega_0 using zero weights (or equivalent full-length zero-padded score arrays), and when survey VCV is intended call solve_ols(..., return_vcov=False, cluster_ids=None) and compute only compute_survey_vcov() afterward. (r-survey.r-forge.r-project.org)
Severity P3. Impact: no action required. The pooled-singleton implementation in diff_diff/bootstrap_utils.py:L498-L552 and diff_diff/bootstrap_utils.py:L603-L671 is explicitly documented as a deviation in docs/methodology/REGISTRY.md:L2217-L2223, so under the review rubric it is informational rather than blocking. (pubmed.ncbi.nlm.nih.gov)

Code Quality

Severity P2. Impact: the new coef_var properties treat finite se == 0 as undefined, so exact-zero-SE results report NaN instead of 0.0 and omit the CV line from summary(). Representative copies of the same guard are in diff_diff/results.py:L107-L114, diff_diff/imputation_results.py:L155-L162, and diff_diff/staggered_results.py:L144-L150. Concrete fix: allow finite zero SEs and return 0.0 when the estimate is finite and nonzero, then add a regression test for se=0.0.

Performance

No material findings in the changed paths.

Maintainability

No separate maintainability findings beyond the survey-pretrend issue above.

Tech Debt

No mitigating entry for the remaining issues is present in TODO.md, so they remain unmitigated under the stated rubric.

Security

No findings.

Documentation/Tests

Severity P2. Impact: the new survey-pretrend tests in tests/test_survey_phase8.py:L1114-L1177 cover the happy path and replicate-weight rejection, but they do not cover the remaining broken edge: an analytical-survey pretrend domain where Omega_0 drops a PSU/stratum or collapses to one untreated PSU. The CV tests in tests/test_survey_phase8.py:L960-L1006 also miss the exact-zero-SE case. Concrete fix: add one analytical-survey pretrend test with an always-treated PSU/stratum and one coef_var(se=0.0) regression test.

Path to Approval

Rework the analytical-survey pretrend path so Omega_0 is handled as a survey subpopulation, not a naively subsetted design, and skip the intermediate cluster-robust VCV computation when survey VCV is the intended backend.
Add regression coverage for:
1. an analytical-survey pretrend case where an always-treated PSU/stratum is absent from Omega_0, and
2. a one-untreated-PSU analytical-survey case that should return handled NaN inference rather than raising.

P1: Skip cluster_ids in solve_ols when survey VCV will replace the cluster-robust VCV anyway. Prevents errors on single-PSU untreated domains where cluster-robust VCV would fail before survey VCV runs. P2: Add test_coef_var_zero_se regression (SE=0 -> CV=0.0). Defer subpopulation-based Omega_0 survey design to TODO.md as medium priority — current physical subsetting is correct for the common case but can change lonely-PSU/FPC behavior when some PSU/stratum has no untreated observations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T21:15:20Z

/ai-review

github-actions · 2026-04-03T21:25:28Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c0580dfa6661e2822be3ac35b84147122edbb5cc

Overall Assessment

Needs changes

Executive Summary

This re-review is materially better than the last one: the earlier rank-deficient survey-VCV and nonpositive-df_survey issues in ImputationDiD look fixed at diff_diff/imputation.py:L2158 and diff_diff/imputation.py:L2326.
P1 remains in the new analytical-survey pretrend support: the code still physically subsets the survey design to Omega_0 before design-based variance is computed, at diff_diff/imputation.py:L1736 and diff_diff/imputation.py:L2271.
The new TODO entry at TODO.md:L57 correctly acknowledges that limitation, but under this rubric it does not mitigate a P1 incorrect-variance/methodology issue.
The new lonely_psu="adjust" bootstrap behavior is explicitly documented in the Methodology Registry at docs/methodology/REGISTRY.md:L2217, so I’m treating it as P3 informational rather than a blocker.
The earlier zero-SE coef_var concern appears resolved in the new properties, e.g. diff_diff/results.py:L108.
Static review only: I could not execute the test suite here because pytest, numpy, pandas, and scipy are not installed.

Methodology

P1 Impact: The new support note at docs/methodology/REGISTRY.md:L867 says analytical survey pretrends are supported, but the implementation still constructs a physically subsetted ResolvedSurveyDesign with dc_replace(...) in both the fit-time pretrend path and _pretrend_test() (diff_diff/imputation.py:L1736, diff_diff/imputation.py:L2271) and then feeds that reduced design into compute_survey_vcov() at diff_diff/imputation.py:L2160. Your own survey API documents that subpopulation analysis should preserve the full design with zeroed weights at diff_diff/survey.py:L427. BJS Test 1 is an OLS/Wald procedure on untreated observations only, but survey-domain analysis should still retain the original cluster/strata design information rather than physically dropping out-of-domain rows. That means a PSU/stratum with no untreated observations can still flip lonely-PSU/FPC handling and change the reported survey SEs and p-values for the new feature. (academic.oup.com) Concrete fix: keep the full survey design and zero weights outside Omega_0 (or equivalently zero-pad the score/residual arrays back to full length) before calling compute_survey_vcov() in both call sites; if that is not being implemented in this PR, restore the previous rejection instead of claiming support.
P3 Impact: No blocker. The pooled-singleton bootstrap rule for lonely_psu="adjust" is explicitly documented in the registry at docs/methodology/REGISTRY.md:L2217, with representative implementations at diff_diff/bootstrap_utils.py:L498 and diff_diff/bootstrap_utils.py:L603. Concrete fix: none required for approval.

Code Quality

No material findings beyond the methodology issue above.

Performance

No material findings.

Maintainability

No separate maintainability finding, but the duplicated survey-subsetting blocks in diff_diff/imputation.py:L1736 and diff_diff/imputation.py:L2271 should be centralized once the P1 is fixed.

Tech Debt

P3 Impact: The new tracked item at TODO.md:L57 is useful bookkeeping, but it does not mitigate the P1 above because this rubric does not allow TODO-tracking to downgrade incorrect variance/methodology issues. Concrete fix: keep the TODO, but still fix or revert the feature before approval.

Security

No findings.

Documentation/Tests

P2 Impact: The new survey-pretrend tests cover happy-path acceptance and replicate-weight rejection (tests/test_survey_phase8.py:L1130, tests/test_survey_phase8.py:L1195), but they still do not cover the domain case that triggers the remaining bug: a full-design PSU/stratum that contributes zero untreated observations to Omega_0. Concrete fix: add one analytical-survey regression test with an always-treated PSU/stratum, or equivalent domain exclusion, and assert that both fit-time pre-period SEs and _pretrend_test() use full-design subpopulation semantics rather than singleton handling after physical subsetting.
Static review only; I could not run the tests in this environment because the Python scientific stack is unavailable.

Path to Approval

Replace the dc_replace(...) survey subsetting in both pretrend code paths with full-design subpopulation handling before compute_survey_vcov(), or revert the new analytical-survey pretrend support until that implementation exists.
Add a regression test where at least one full-design PSU/stratum is absent from Omega_0, covering both event-study pre-period coefficients and _pretrend_test().

Replace physical Omega_0 subsetting with zero-padded full-design approach: scores and residuals from the lead regression are zero-padded back to full-panel length before calling compute_survey_vcov(), so the full PSU/strata structure is preserved for design-based variance. This follows the standard survey subpopulation convention (Lumley 2004, R survey::subset.survey.design) where out-of-domain observations contribute zero to the estimating equation but retain their design information for correct variance estimation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T21:32:39Z

/ai-review

github-actions · 2026-04-03T21:42:24Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 383b23b725b9ff3376175ba7ceea135184dc7871

Overall Assessment

⚠️ Needs changes

Executive Summary

The prior methodology blocker on ImputationDiD survey pretrends looks fixed: the lead-regression path now preserves the full survey design and zero-pads Omega_0 scores before survey VCV computation.
The new lonely_psu="adjust" bootstrap rule is documented in the Methodology Registry, so I’m treating that change as P3 informational rather than a defect.
P1 remains in the new trim_weights() utility: quantile-based trimming silently turns the entire weight column into NaN when any input weight is missing.
The new survey-pretrend tests still do not directly cover the previously requested domain edge case where a full-design PSU/stratum contributes no untreated Omega_0 observations.
Static review only; I could not run the test suite here because numpy is not installed in this environment.

Methodology

Severity: P3. Impact: The earlier survey-pretrend methodology issue appears resolved. The new path in diff_diff/imputation.py:L1736, diff_diff/imputation.py:L2140, and diff_diff/imputation.py:L2277 now keeps the full design and zero-pads Omega_0 scores before compute_survey_vcov(), matching the updated registry note at docs/methodology/REGISTRY.md:L867. Likewise, the pooled-singleton bootstrap rule in diff_diff/bootstrap_utils.py:L540 and diff_diff/bootstrap_utils.py:L649 is explicitly documented at docs/methodology/REGISTRY.md:L2217, so under this rubric it is informational, not blocking. Concrete fix: none. BJS frames pre-trend testing as OLS on untreated observations only, and the survey package defines survey.lonely.psu="adjust" as centering lonely strata at the population mean while also applying that option during replicate-design construction. (academic.oup.com)

Code Quality

Severity: P1. Impact: trim_weights() computes the quantile cap from the raw weight vector at diff_diff/prep.py:L1278 and then applies it with np.minimum at diff_diff/prep.py:L1283. If weight_col contains any NaN, np.quantile returns NaN, and the subsequent trim step poisons the entire output column with NaN. That is missing edge-case handling in a new public API and can silently corrupt weights before estimation. Concrete fix: either reject non-finite weights up front or use np.nanquantile plus masked trimming so existing missing rows stay missing instead of converting all rows to NaN; add a regression test with one missing weight and quantile=....

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The new resolved entry at TODO.md:L57 matches an actual code fix rather than deferring an active correctness issue.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new survey-pretrend tests at tests/test_survey_phase8.py:L1130 and tests/test_survey_phase8.py:L1173 only exercise the shared fixture from tests/test_survey_phase8.py:L48, where every stratum still has untreated observations. That leaves the exact regression scenario from the previous review untested: a full-design PSU/stratum with zero Omega_0 rows, which is what the new zero-padded full-design variance path is supposed to fix. Concrete fix: add one analytical-survey regression test with an always-treated PSU/stratum, or equivalent domain exclusion, and assert finite pre-period SEs plus valid pretrend_test() output.
Severity: P3. Impact: The registry was updated, but the in-code docstrings in diff_diff/imputation.py:L2027 and diff_diff/imputation.py:L2206 still describe the lead regression and F-test as cluster-robust only. That is now incomplete for analytical survey designs. Concrete fix: update those docstrings/comments to mention survey-weighted demeaning, design-based VCV, and df_survey when a full survey design is present.

Path to Approval

Harden trim_weights() against missing/non-finite weights in the quantile path, and add a regression test proving that one NaN input does not poison the full trimmed column.
Add a survey-pretrend regression test where at least one full-design PSU/stratum has no untreated Omega_0 observations, covering both pre-period event-study effects and pretrend_test().

Use np.nanquantile instead of np.quantile so NaN weights don't poison the quantile computation and corrupt the entire output column. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T21:47:36Z

/ai-review

github-actions · 2026-04-03T21:55:49Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8fc20baca08dd8758db55d91245951be16b9f532

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings remain; the highest remaining severity is P2.

Executive Summary

The previous P1 on trim_weights() NaN poisoning is fixed in diff_diff/prep.py:L1278-L1285 and covered by tests/test_survey_phase8.py:L1121-L1128.
The prior survey-pretrend methodology blocker appears resolved: analytical survey pretrends now carry the full design into variance estimation and use survey df in the pretrend F-test at diff_diff/imputation.py:L1736-L1762, diff_diff/imputation.py:L2136-L2172, and diff_diff/imputation.py:L2206-L2328.
The new lonely_psu="adjust" bootstrap behavior is explicitly documented in docs/methodology/REGISTRY.md:L2217-L2224, so under this rubric it is P3-informational, not a defect.
One previous non-blocking gap remains: the new survey-pretrend tests still do not hit the exact zero-Omega_0 PSU/stratum case that the zero-padding path was added to handle.
Static review only here: pytest is not installed in this environment (pytest: command not found).

Methodology

Severity: P3-informational. Impact: The ImputationDiD survey-pretrend path now preserves the paper’s Test 1 structure while swapping in a documented analytical-survey variance path; that resolves the earlier undocumented mismatch rather than introducing a new one. The registry note at docs/methodology/REGISTRY.md:L867-L867 matches the implementation in diff_diff/imputation.py:L1736-L1762, diff_diff/imputation.py:L2136-L2172, and diff_diff/imputation.py:L2206-L2328. Borusyak, Jaravel, and Spiess describe Test 1 / Equation 9 as a regression on untreated observations only. citeturn0search2 Concrete fix: none.
Severity: P3-informational. Impact: The pooled-singleton bootstrap rule for lonely_psu="adjust" in diff_diff/bootstrap_utils.py:L497-L550 and diff_diff/bootstrap_utils.py:L603-L671 is a documented registry deviation, so it is non-blocking under the stated review rules. Rust and Rao is a general replication-techniques reference, not a repo-specific pooled-singleton specification. citeturn1search0 Concrete fix: none.

Code Quality

No findings. The prior trim_weights() quantile/NaN corruption issue is fixed in diff_diff/prep.py:L1278-L1285 and regression-tested in tests/test_survey_phase8.py:L1121-L1128.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The previous coverage gap remains. The new survey-pretrend tests in tests/test_survey_phase8.py:L1141-L1204 and tests/test_pretrends_event_study.py:L730-L750 verify supported survey cases, but they still do not exercise a full-design PSU/stratum with zero untreated Omega_0 rows, which is the highest-risk branch in the new zero-padding variance path at diff_diff/imputation.py:L1736-L1762 and diff_diff/imputation.py:L2136-L2165. Concrete fix: add a regression test with at least one always-treated PSU/stratum, or an equivalent domain restriction that removes all untreated Omega_0 observations from one design PSU, and assert finite pre-period SEs plus valid pretrend_test() output.
Severity: P3. Impact: Some in-code docs still describe the old behavior. diff_diff/imputation.py:L2027-L2035, diff_diff/imputation.py:L2208-L2212, and diff_diff/imputation_results.py:L413-L417 still say pretrend inference is cluster-robust only, and diff_diff/bootstrap_utils.py:L430-L444 / diff_diff/bootstrap_utils.py:L561-L569 still describe purely within-stratum bootstrap generation even though pooled-singleton adjust is now supported. Concrete fix: update those docstrings to mention analytical-survey pretrend inference and the pooled-singleton bootstrap path, including the skipped FPC scaling on pooled singletons.

P2: Add test_pretrends_survey_always_treated_psu — regression test with a PSU/stratum that has mostly treated obs (sparse untreated domain). Exercises the zero-padding subpopulation path. P3: Update docstrings in _compute_lead_coefficients, _pretrend_test, pretrend_test (results), and both bootstrap helper functions to describe survey VCV support and pooled-singleton adjust behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T23:17:06Z

/ai-review

github-actions · 2026-04-03T23:25:10Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4cbe488a602b100acb5326d03adb61a699d5bcc0

Overall Assessment

⚠️ Needs changes

One unmitigated P1 remains in the new lonely_psu="adjust" survey-bootstrap path.

Executive Summary

The previous survey-pretrend methodology blocker looks resolved: the new analytical survey pretrend path is implemented in diff_diff/imputation.py:1736, diff_diff/imputation.py:2104, and diff_diff/imputation.py:2209, and the deviation is now documented in docs/methodology/REGISTRY.md:867.
P1: lonely_psu="adjust" bootstrap still silently drops the singleton contribution when the design has exactly one singleton stratum total.
The pooled-singleton bootstrap deviation itself is documented in docs/methodology/REGISTRY.md:2217; the defect is the undocumented one-singleton fallback, not the documented 2+-singleton pooling rule.
The previous P2 coverage gap on the zero-Omega_0 survey-pretrend branch is still not actually covered by the new tests.
Static review only: the provided Python environment could parse the changed files, but it did not have project runtime deps (numpy), so I could not execute the test suite.

Methodology

Severity: P1. Impact: In diff_diff/bootstrap_utils.py:544 and diff_diff/bootstrap_utils.py:656, lonely_psu="adjust" only gives a nonzero bootstrap contribution when there are at least two singleton PSUs to pool. With exactly one singleton stratum total, the multiplier path sets that PSU’s bootstrap weights to zero and the Rao-Wu path leaves its base weights unchanged, so the singleton contributes zero bootstrap variance. That is not just an implementation choice: the library’s analytical adjust path in diff_diff/survey.py:1250 still uses grand-mean centering, and standard survey-package adjust semantics do the same rather than dropping the singleton. This can silently understate SEs for any affected bootstrap inference path. Concrete fix: in both bootstrap generators, either implement a nonzero single-singleton fallback that is consistent with the advertised adjust semantics, or explicitly reject that configuration with a clear error/warning and matching registry note instead of returning understated variance. (r-survey.r-forge.r-project.org)
Severity: P3. Impact: The broader pretrends + analytical survey_design change is now a documented deviation rather than an undocumented methodology drift. The imputation paper’s estimator is still fit on untreated observations only, and the repo now explicitly documents the survey-specific WLS/design-based extension plus the zero-padded full-design variance construction. Concrete fix: none. (academic.oup.com)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The earlier survey-pretrend TODO is appropriately marked resolved in TODO.md:57.

Security

No findings.

Documentation/Tests

Severity: P2. Impact: The new bootstrap tests do not isolate the one-singleton design that triggers the P1 above. tests/test_survey_phase8.py:789 only checks that a full estimator run has positive SE with one singleton plus several regular strata, so it would still pass even if the singleton were silently treated like remove/certainty. tests/test_survey_phase8.py:813 and tests/test_survey_phase8.py:861 cover two-singleton and all-singleton designs, not the single-singleton edge case. Concrete fix: add direct multiplier and Rao-Wu tests for exactly one singleton stratum plus at least one regular stratum, and assert that the singleton is not silently given zero/base-weight variance under adjust.
Severity: P2. Impact: the prior zero-Omega_0 survey-pretrend coverage gap is still open. tests/test_survey_phase8.py:1206 says it exercises “a PSU/stratum that has no untreated obs,” but it sets first_treat = 2, so every PSU in stratum 0 still has period 1 in Omega_0; the zero-padding branch added in diff_diff/imputation.py:2139 is still not hit. Concrete fix: add a case with first_treat = 1 or an equivalent domain restriction so one PSU/stratum has zero untreated rows, then assert finite pre-period SEs and valid _pretrend_test() output.
Severity: P3. Impact: the top bullets in docs/methodology/REGISTRY.md:860 and docs/methodology/REGISTRY.md:863 still say pre-period coefficients use cluster-robust SEs/Wald variance, even though the note at docs/methodology/REGISTRY.md:867 now allows design-based survey VCV. Concrete fix: reword those bullets to say “cluster-robust by default; design-based when analytical survey_design is present.”

Path to Approval

Fix the single-total-singleton lonely_psu="adjust" bootstrap behavior in both diff_diff/bootstrap_utils.py:544 and diff_diff/bootstrap_utils.py:656 so it no longer silently drops that singleton’s variance contribution.
Add regression tests for that exact configuration in both bootstrap families, with assertions that inspect the singleton’s contribution directly rather than only checking that the overall estimator SE is positive.

P1: Emit UserWarning when lonely_psu="adjust" has only 1 singleton stratum total in both multiplier and Rao-Wu bootstrap. Singleton gets zero variance (same as remove) since pooling requires 2+ singletons. Matches R survey package behavior. P2: Fix test_pretrends_survey_always_treated_psu to use first_treat=1 so stratum 0 has genuinely zero untreated rows, exercising the zero-padding branch. P2: Add single-singleton warning tests for both bootstrap families. P3: Update REGISTRY bullets at lines 860/863 to mention design-based survey VCV alongside cluster-robust. Document single-singleton bootstrap fallback behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-03T23:36:44Z

/ai-review

github-actions · 2026-04-03T23:43:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5bec03c543fc08bd398a33cbf3e874e48df33ca1

Overall Assessment

✅ Looks good

Executive Summary

All prior P1+ blockers appear resolved or non-blocking under this rubric: the ImputationDiD survey-pretrend path is now explicitly documented as a deviation in the Methodology Registry and implemented consistently in the touched code, and the single-singleton bootstrap fallback is now documented and directly tested.
The prior zero-Omega_0 survey-pretrend coverage gap looks closed by tests/test_survey_phase8.py:L1278-L1329.
P2: trim_weights() accepts contradictory bounds (lower > upper, or lower above the quantile-derived cap) and silently returns weights above the requested upper cap; this needs a validation guard in diff_diff/prep.py:L1270-L1288.
P3: the new registry note overstates source alignment for the one-singleton lonely_psu="adjust" bootstrap fallback. The behavior is documented, so it is non-blocking here, but the “matches R’s survey package behavior” claim is not supported by the upstream materials I checked. citeturn0view3turn2view0turn0view4
Static review only: this sandbox lacks numpy and pytest, so I could not execute the touched tests.

Methodology

Severity: P3. Impact: The ImputationDiD survey-pretrend change is now a documented deviation rather than an undocumented methodology drift. Equation 9 / Test 1 in Borusyak, Jaravel, and Spiess is the untreated-observation lead regression, and the PR now documents its survey-specific extension in docs/methodology/REGISTRY.md:L853-L867 and implements that path in diff_diff/imputation.py:L1736-L1765 and diff_diff/imputation.py:L2009-L2343. Concrete fix: none. citeturn0view0
Severity: P3. Impact: docs/methodology/REGISTRY.md:L2217-L2226 says the one-singleton bootstrap fallback “matches R’s survey package behavior,” but the R survey materials I checked do not support that specific claim: the lonely-PSU docs describe adjust as grand-mean centering, jknweights(..., lonely.psu="adjust") does not collapse to remove-style zero contribution, and the bootstrap helper subbootstratum() has no explicit one-singleton remove fallback. The library behavior itself is documented and warned in diff_diff/bootstrap_utils.py:L544-L567 and diff_diff/bootstrap_utils.py:L667-L700, so this is informational rather than blocking under the stated rubric. Concrete fix: reword the registry/comment text to describe this as a library-specific documented fallback unless a primary source can be cited for the R-equivalence claim. citeturn0view3turn2view0turn0view4

Code Quality

Severity: P2. Impact: diff_diff/prep.py:L1270-L1288 validates upper vs quantile, but never validates lower <= upper after the quantile cap is materialized. With upper=5, lower=6 (or a quantile that resolves below lower), the function silently returns weights at 6, exceeding the requested upper cap and violating the function’s own “caps weights at specified thresholds” contract. Concrete fix: after resolving upper from quantile, raise ValueError when both bounds are present and lower > upper, and add a regression test next to tests/test_survey_phase8.py:L1146-L1202.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The resolved survey-pretrend entry added in TODO.md:L57 is appropriate.

Security

No findings.

Documentation/Tests

No findings. The previously open coverage gaps called out in the prior review appear addressed by the new one-singleton bootstrap tests in tests/test_survey_phase8.py:L917-L983 and the zero-untreated-domain survey-pretrend test in tests/test_survey_phase8.py:L1278-L1329.

P2: Validate lower <= upper in trim_weights after quantile resolution. Raise ValueError when lower exceeds upper to prevent silent corruption. P3: Reword single-singleton bootstrap fallback as library-specific documented behavior (not R equivalence — R's analytical adjust uses grand-mean centering, but the bootstrap single-singleton analogue is not defined in the literature). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-04T00:53:43Z

/ai-review

github-actions · 2026-04-04T01:01:18Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f065afe25a025dbb632a0e8dac7c60682bcc0f7f

Overall Assessment

✅ Looks good

Executive Summary

Previous re-review items appear addressed: the survey-pretrend work is marked resolved in TODO.md:57, the registry now frames the one-singleton bootstrap fallback as library-specific in docs/methodology/REGISTRY.md:2217, and the earlier lower > upper bug in trim_weights() is fixed in diff_diff/prep.py:1283.
Against the Methodology Registry and in-code references, I did not find any new unmitigated P0/P1 methodology defects in the estimator/variance changes.
P2: trim_weights() still accepts non-finite or negative caps and can silently emit invalid weights instead of failing fast in diff_diff/prep.py:1278 and diff_diff/prep.py:1289.
P3: the new survey-pretrend path is not directly regression-tested for survey+covariates or survey+n_bootstrap>0, even though those combinations use distinct logic in diff_diff/imputation.py:2009 and diff_diff/imputation_bootstrap.py:201.
Static review only: this sandbox does not have numpy/pytest, so I could not execute the touched tests.

Methodology

P3 Impact: None. The statistically meaningful departures in this diff are documented rather than implicit: survey-weighted ImputationDiD pretrends are recorded in docs/methodology/REGISTRY.md:867 and implemented in diff_diff/imputation.py:2009, and bootstrap lonely_psu="adjust" pooling plus the one-singleton fallback are documented in docs/methodology/REGISTRY.md:2217 and implemented in diff_diff/bootstrap_utils.py:544 / diff_diff/bootstrap_utils.py:667. Concrete fix: none.

Code Quality

P2 Impact: trim_weights() now rejects contradictory bounds, but it still accepts invalid cap values. If upper/lower are np.nan or negative, the clipping at diff_diff/prep.py:1289 and diff_diff/prep.py:1291 can silently produce all-NaN or negative weights. Concrete fix: validate upper and lower with np.isfinite(...) and >= 0 after resolving the quantile cap, then raise ValueError before applying np.minimum/np.maximum.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The prior survey-pretrend limitation is properly tracked as resolved in TODO.md:57.

Security

No findings.

Documentation/Tests

P3 Impact: The new survey-pretrend coverage in tests/test_survey_phase8.py:1221 and tests/test_survey_phase8.py:1339 exercises analytical no-covariate fits and replicate-design rejection, but not survey+covariates or analytical survey+bootstrap. Those are separate execution paths in diff_diff/imputation.py:2070 and diff_diff/imputation_bootstrap.py:201. Concrete fix: add one test with SurveyDesign(weights, strata, psu) + covariates=["x1"] + pretrends=True, and one with the same survey design + n_bootstrap>0, asserting finite pre-period SEs and that bootstrap does not overwrite the analytical pre-period inference.

P2: Validate upper/lower are finite and non-negative in trim_weights. Reject NaN/negative/inf caps before applying np.minimum/np.maximum. P3: Add survey+covariates pretrend test and survey+bootstrap pretrend test exercising the distinct code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber · 2026-04-04T01:04:21Z

/ai-review

github-actions · 2026-04-04T01:17:55Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 4fbb82de620ac83364a0a51cb2a5b745815007ea

Overall Assessment

✅ Looks good

Executive Summary

Previous re-review items are substantially addressed: the survey-pretrends item is now marked resolved in TODO.md:L57, the singleton-bootstrap fallback is documented in docs/methodology/REGISTRY.md:L2217-L2227, and trim_weights() now rejects non-finite or negative caps in diff_diff/prep.py:L1283-L1289.
I did not find any unmitigated P0/P1 issues in the estimator, variance, weighting, or default-behavior changes.
P3: the methodology deltas in this PR are documented extensions/deviations, not undocumented drift.
P3: the new survey+bootstrap pretrend regression test is still partial; it checks finiteness, but not the key invariant that bootstrap leaves analytical pre-period inference unchanged.
P3: the trim_weights() cap-validation fix is in code, but the exact invalid-cap cases from the prior review are still not directly regression-tested.
Verification note: pytest is not installed in this environment, so this is a static review; I did run a syntax-only compile() pass over the changed Python files, and that succeeded.

Methodology

P3 Impact: None. Test 1 in Borusyak, Jaravel, and Spiess is an untreated-sample OLS lead regression with a heteroskedasticity- and cluster-robust Wald test, while the PR’s survey path in diff_diff/imputation.py:L2009-L2337 extends that to survey-weighted demeaning/WLS plus design-based VCV and survey-df F inference. The bootstrap lonely_psu="adjust" path in diff_diff/bootstrap_utils.py:L424-L702 also departs from the survey package’s documented grand-mean adjustment for lonely PSUs by pooling singleton PSUs. Both departures are explicitly documented in docs/methodology/REGISTRY.md:L853-L867 and docs/methodology/REGISTRY.md:L2217-L2227, so they are mitigated rather than blockers. Concrete fix: none. (academic.oup.com)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The previously deferred survey-pretrends item is explicitly marked resolved in TODO.md:L57.

Security

No findings.

Documentation/Tests

P3 Impact: The new survey+bootstrap pretrend test at tests/test_survey_phase8.py:L1308-L1328 only checks that pre-period SEs are finite. It does not verify the invariant enforced by diff_diff/imputation_bootstrap.py:L201-L204 and relied on by diff_diff/imputation.py:L808-L825 that bootstrap must not overwrite analytical pre-period inference. A regression that started writing bootstrap SEs into negative horizons could still pass. Concrete fix: mirror tests/test_pretrends_event_study.py:L561-L606 for the survey path by fitting once with n_bootstrap=0 and once with n_bootstrap>0, then asserting identical pre-period effects and SEs.
P3 Impact: The invalid-cap validation added in diff_diff/prep.py:L1283-L1289 closes the prior correctness issue, but the new tests in tests/test_survey_phase8.py:L1146-L1210 still do not exercise the exact regression surface (upper=np.nan, negative upper, infinite lower, negative lower). Concrete fix: add explicit pytest.raises cases for those four inputs.

igerber and others added 2 commits April 3, 2026 10:59

Fix trim_weights NaN poisoning and add regression test

8fc20ba

Use np.nanquantile instead of np.quantile so NaN weights don't poison the quantile computation and corrupt the entire output column. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

igerber merged commit cd4f5a6 into main Apr 4, 2026
14 checks passed

igerber deleted the survey-maturity-2 branch April 4, 2026 10:26

igerber mentioned this pull request Apr 4, 2026

Bump version to 2.8.4 #262

Merged

Conversation

igerber commented Apr 3, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 3, 2026

Uh oh!

igerber commented Apr 4, 2026

Uh oh!

github-actions bot commented Apr 4, 2026

Uh oh!

igerber commented Apr 4, 2026

Uh oh!

github-actions bot commented Apr 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant