All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- HAD trends_lin=True linear-trend detrending mode on HeterogeneousAdoptionDiD.fit(aggregate="event_study"), joint_pretrends_test, and joint_homogeneity_test. Mirrors R DIDHAD::did_had(..., trends_lin=TRUE) (paper Eq. 17 / Eq. 18 / page 32 joint-Stute homogeneity-with-trends). The per-group linear-trend slope is estimated as Y[g, F-1] - Y[g, F-2] and applied as a (t - base) × slope adjustment to per-event-time outcome evolutions. Requires F ≥ 3 (the panel must contain F-2). The "consumed" placebo at our event-time e=-2 is auto-dropped (R reduces the max placebo lag by 1, with the same effect). Mutually exclusive with survey weighting (survey_design / survey / weights): raises NotImplementedError per feedback_per_method_survey_element_contract (weighted slope estimator not derived from the paper; tracked in TODO.md as a follow-up). Bit-exact backcompat for trends_lin=False (default). Patch-level (additive keyword-only kwarg).
- HAD R-package end-to-end parity test vs DIDHAD v2.0.0 (Credible-Answers/did_had) on the design="continuous_at_zero" (Design 1') surface. New parity fixture benchmarks/data/did_had_golden.json generated by benchmarks/R/generate_did_had_golden.R covers 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin). The harness explicitly forces HeterogeneousAdoptionDiD(design="continuous_at_zero") because R did_had always evaluates the local-linear at d=0 regardless of dose distribution; our default design="auto" may legitimately choose continuous_near_d_lower or mass_point on dose distributions with boundary density bounded away from zero (e.g., Beta(2,2)) and thereby diverge from R numerically. That divergence is methodologically defensible but out of scope for this parity test. The Python parity test tests/test_did_had_parity.py asserts point estimate / SE / CI bounds at atol=1e-8 and the Yatchew T-stat at atol=1e-10 after a documented × G/(G-1) finite-sample convention shift. Two intentional convention deviations from R, documented in docs/methodology/REGISTRY.md: (a) we report the bias-corrected point estimate (modern CCF 2018 convention; R's Estimate column reports the conventional estimate with the bias-corrected CI separately; our att matches R's CI midpoint); (b) Yatchew uses paper Appendix E's literal (1/G) variance-denominator convention while R uses base-R var()'s (1/(N-1)) sample-variance convention (parity is bit-exact after the × G/(G-1) shift). Yatchew on placebos with R's mean-independence null (order=0) is not yet exposed in our yatchew_hr_test (we currently support only the linearity null) and is skipped in the parity test; tracked as a TODO follow-up.
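The × G/(G-1) shift in deviation (b) is the standard relation between a 1/G and a 1/(G-1) variance denominator. A minimal sketch in plain Python, using made-up numbers rather than anything from the parity fixture, and showing only the variance relation (how the factor enters the full Yatchew statistic is not reproduced here):

```python
import statistics

# Illustrative values only; not taken from did_had_golden.json.
x = [0.2, 0.5, 0.9, 1.4, 2.0]
G = len(x)

var_pop = statistics.pvariance(x)    # 1/G denominator
var_sample = statistics.variance(x)  # 1/(G-1) denominator, as in R's var()

# The two conventions differ by exactly the factor G/(G-1).
shifted = var_pop * G / (G - 1)
print(abs(shifted - var_sample) < 1e-12)
```
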
- Rust dependency upgrades: bumped rand 0.8 → 0.10 and rand_xoshiro 0.6 → 0.8 in the Rust backend (the two crates are coupled through rand_core and must move together). MSRV bumped from Rust 1.84 to 1.85 to satisfy the new dependency requirements. Three call sites in rust/src/bootstrap.rs updated for the rand 0.9 API rename: gen::<bool>() → random::<bool>(), gen::<f64>() → random::<f64>(), gen_range(0..6) → random_range(0..6). Webb wild bootstrap byte stream shifted as a side effect: rand 0.9 reworked the internal algorithm for random_range (improved rejection sampling), so Xoshiro256PlusPlus::seed_from_u64(seed) followed by random_range(0..6) consumes RNG bytes differently than the old gen_range(0..6) did. Distributional properties of Webb weights are unchanged (still uniform over the 6-point support); aggregate inference (SE, p-values, CI) converges to the same values for any reasonable n_bootstrap. Rademacher and Mammen byte streams are bit-identical to the prior release. Anyone with a saved Rust+Webb baseline pinning specific seeded results will see different numbers; the regression test suite uses within-build seed-reproducibility (not cross-version baselines), so all internal tests pass unchanged. New regression guard TestRustBackend::test_bootstrap_weights_bit_identity_snapshot pins fixed-seed weights for all three weight types, so any future RNG drift fails loudly with a localized error message.
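To illustrate why a changed index sampler shifts seeded draws without changing the distribution, here is a small Python sketch of Webb six-point weights. It is not the Rust backend: the helper name, the ordering of the support points, and the use of Python's own generator are assumptions made for illustration.

```python
import math
import random

# Webb six-point support: +/-sqrt(1/2), +/-1, +/-sqrt(3/2), each with probability 1/6.
# The ordering of the points here is an assumption, not the backend's layout.
WEBB_POINTS = [
    -math.sqrt(1.5), -1.0, -math.sqrt(0.5),
    math.sqrt(0.5), 1.0, math.sqrt(1.5),
]

def draw_webb_weights(n, seed):
    """Draw n Webb weights by picking a uniform index in 0..5 (hypothetical helper)."""
    rng = random.Random(seed)
    return [WEBB_POINTS[rng.randrange(6)] for _ in range(n)]

# The support has mean 0 and variance 1 no matter how the index is sampled,
# so swapping the sampling algorithm changes the stream, not the moments.
support_mean = sum(WEBB_POINTS) / 6
support_var = sum(p * p for p in WEBB_POINTS) / 6

# Same seed within one build gives identical draws.
first = draw_webb_weights(10, seed=42)
second = draw_webb_weights(10, seed=42)
print(first == second)
```

The same reasoning explains the entry's split between what moved (seed-pinned Webb values) and what did not (aggregate inference at a reasonable n_bootstrap).
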
3.3.1 - 2026-04-25
- HAD survey-design API consolidated to a single survey_design= kwarg across all 8 HAD surfaces: HeterogeneousAdoptionDiD.fit, did_had_pretest_workflow, qug_test, stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test. Matches the rest of the library (ContinuousDiD, EfficientDiD, ChaisemartinDHaultfoeuille already used survey_design=). On data-in surfaces (HAD.fit, workflow, joint data-in wrappers) survey_design= accepts a SurveyDesign instance (column references resolved against data at fit time, same convention as the rest of the library). On the three array-in linearity helpers (stute_test, yatchew_hr_test, stute_joint_pretest) survey_design= accepts a pre-resolved ResolvedSurveyDesign; passing a SurveyDesign raises TypeError with migration guidance to make_pweight_design(arr) (pweight-only) or pre-resolution. qug_test is the 8th surface and accepts the same kwarg signature for consistency, but all non-None values raise NotImplementedError per the Phase 4.5 C0 permanent deferral (no migration path; the qug-specific mutex error reflects this). New public helper make_pweight_design(weights: np.ndarray) -> ResolvedSurveyDesign exported from the diff_diff top level for the pweight-only convenience on the three array-in linearity helpers (formerly the private survey._make_trivial_resolved, kept as a permanent private alias); validates 1-D input at the front door. Three-way mutex (survey_design + survey + weights) extends the prior 2-way (survey + weights): at most one may be non-None per call. Patch-level addition (additive new kwarg + permanent alias for the helper; no breaking changes this release).
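The three-way mutex can be pictured with a small stand-alone check. This is a sketch only: check_survey_mutex is a hypothetical helper, and the ValueError type and message are assumptions rather than the library's actual error.

```python
def check_survey_mutex(survey_design=None, survey=None, weights=None):
    """Return the name of the single supplied knob, or None (hypothetical helper)."""
    supplied = [
        name
        for name, value in (
            ("survey_design", survey_design),
            ("survey", survey),
            ("weights", weights),
        )
        # Compare against None explicitly; array-like weights have no clean truth value.
        if value is not None
    ]
    if len(supplied) > 1:
        # Assumed error type for illustration.
        raise ValueError(
            "Pass at most one of survey_design=, survey=, weights=; got: "
            + ", ".join(supplied)
        )
    return supplied[0] if supplied else None

print(check_survey_mutex(weights=[1.0, 2.0]))
```
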
- HeterogeneousAdoptionDiD.fit(survey=, weights=), did_had_pretest_workflow(survey=, weights=), and the 6 HAD pretest helpers' survey= / weights= kwargs are deprecated in favor of the canonical survey_design=. Emits DeprecationWarning with migration guidance; the deprecated kwargs continue to route through the unchanged legacy back-end paths so numerical results are identical to pre-PR (bit-exact regression locked by parity tests in tests/test_had_dual_knob_deprecation.py). Both survey= and weights= will be removed in the next minor release. Carve-out for qug_test: the deprecation is kwarg-name-consolidation only; qug_test permanently rejects all non-None survey_design / survey / weights values (Phase 4.5 C0 deferral) and make_pweight_design(arr) is NOT a valid migration target. The deprecation warning text on qug_test is qug-specific and points users to did_had_pretest_workflow(..., survey_design=...) for survey-aware HAD pretesting (which skips the QUG step under survey).
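The general shape of such a deprecation shim, and a way to observe the warning in a test, might look like the sketch below. resolve_survey_kwargs is a hypothetical function and the warning text is invented; the library's real message and routing are not reproduced here.

```python
import warnings

def resolve_survey_kwargs(survey_design=None, survey=None, weights=None):
    """Hypothetical shim: warn on a legacy kwarg and return (knob_name, value)."""
    for name, value in (("survey", survey), ("weights", weights)):
        if value is not None:
            warnings.warn(
                f"{name}= is deprecated; pass survey_design= instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            # The legacy value is handed back unchanged, so downstream
            # numerics would not depend on which kwarg name was used.
            return name, value
    return "survey_design", survey_design

# Capture the warning explicitly, since DeprecationWarning is often filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    knob, value = resolve_survey_kwargs(weights=[1.0, 2.0, 3.0])

print(knob, len(caught))
```

Callers who want to find remaining legacy call sites before the kwargs are removed can run their test suite with DeprecationWarning promoted to an error.
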
- ChaisemartinDHaultfoeuille.by_path + controls (DID^X residualization): the per-baseline OLS residualization (Web Appendix Section 1.2) is now compatible with by_path=k. The residualization runs once on the first-differenced outcome BEFORE path enumeration, so all four downstream surfaces (analytical per-path SE, bootstrap SE, per-path placebos, per-path joint sup-t bands) consume the residualized Y_mat automatically (Frisch-Waugh-Lovell). Per-period effects remain unadjusted, consistent with the existing controls + per-period DID contract (per-period DID does not support residualization). Failed-stratum baselines (rank-deficient X) zero out N_mat for affected groups, which the path enumeration treats as ineligible per its existing convention. Deviation from R on multi-baseline switcher panels (point estimates): R did_multiplegt_dyn(..., by_path, controls) re-runs the per-baseline OLS residualization on each path's restricted subsample (path's switchers + same-baseline not-yet-treated controls), so its residualization coefficients vary per path when switchers have different baseline values. Our global-residualization architecture coincides with R on single-baseline switcher panels (every switcher shares the same D_{g,1}), where per-path point estimates match R exactly. On multi-baseline panels, point estimates can diverge; the estimator emits a UserWarning at fit-time when this configuration is detected so practitioners do not silently consume estimates that disagree with R. SE inherits the cross-path cohort-sharing SE deviation from R documented for path_effects: bootstrap SE, placebo SE, and sup-t crit are Monte Carlo / joint-distribution analogs of the same residualized analytical IF and carry the same deviation. R-parity confirmed against did_multiplegt_dyn(..., by_path=3, controls="X1") via the new multi_path_reversible_by_path_controls single-baseline golden-value scenario (per-path point estimates match R to a measured rtol of ~1e-11 across all path × horizon cells on this one-observation-per-cell scenario; per-path SE within ~6.5% of R, well inside the Phase 2 multi-horizon envelope). Cell-aggregated panels with multiple observations per (g, t) also follow our equal-cell-weighting first stage rather than R's N_gt-weighted first stage, per the existing DID^X cell-weighting deviation documented in docs/methodology/REGISTRY.md Note (Phase 3 DID^X covariate adjustment). Gate at chaisemartin_dhaultfoeuille.py:988-992 removed; by_path docstring updated to add the new compatibility paragraph (with the multi-baseline caveat) and remove controls from the incompatible list. R-parity test at tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathControls; cross-surface inheritance + multi-baseline UserWarning regression-tested at tests/test_chaisemartin_dhaultfoeuille.py::TestByPathControls (analytical + bootstrap + placebo + sup-t + to_dataframe(level="by_path") cband columns + multi-baseline warning). See docs/methodology/REGISTRY.md § ChaisemartinDHaultfoeuille Note (Phase 3 by_path ...) → "Per-path covariate residualization (DID^X)" for the full contract.
- HAD linearity-family pretests under survey (Phase 4.5 C).
stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow now accept weights= / survey= keyword-only kwargs. The Stute family uses a PSU-level Mammen multiplier bootstrap via bootstrap_utils.generate_survey_multiplier_weights_batch (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an (n_bootstrap, n_psu) Mammen multiplier matrix, broadcast to per-obs perturbation eta_obs[g] = eta_psu[psu(g)], weighted OLS refit, weighted CvM via the new _cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses closed-form weighted OLS + pweight-sandwich variance components (no bootstrap): sigma2_lin = sum(w·eps²)/sum(w), sigma2_diff = sum(w_avg·diff²)/(2·sum(w)) with arithmetic-mean pair weights w_avg_g = (w_g+w_{g-1})/2, sigma4_W = sum(w_avg·prod)/sum(w_avg), T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W. All three Yatchew components reduce bit-exactly to the unweighted formulas at w=ones(G) (locked at atol=1e-14 by direct helper test). The pweight weights= shortcut routes through a synthetic trivial ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the same kernel handles both entry paths. did_had_pretest_workflow(..., survey=, weights=) removes the Phase 4.5 C0 NotImplementedError, dispatches to the survey-aware sub-tests, skips the QUG step with UserWarning (per C0 deferral), sets qug=None on the report, and appends a "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the verdict. HADPretestReport.qug retyped from QUGTestResults to Optional[QUGTestResults]; summary() / to_dict() / to_dataframe() updated to None-tolerant rendering. Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth, reciprocal-guard discipline); support is a parallel follow-up after this PR. Stratified designs (SurveyDesign(strata=...)) also raise NotImplementedError on the Stute family: the within-stratum demean + sqrt(n_h/(n_h-1)) correction that the HAD sup-t bootstrap applies to match the Binder-TSL stratified target has not been derived for the Stute CvM functional, so applying raw multipliers from generate_survey_multiplier_weights_batch directly to residual perturbations would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows survey support to pweight-only, PSU-only (SurveyDesign(weights=, psu=)), and FPC-only (SurveyDesign(weights=, fpc=)) designs; stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row weights= / survey=col aggregated to per-unit via existing HAD helpers _aggregate_unit_weights / _aggregate_unit_resolved_survey (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See docs/methodology/REGISTRY.md § "QUG Null Test", Note (Phase 4.5 C) for the full methodology.
- ChaisemartinDHaultfoeuille.by_path + n_bootstrap > 0 joint sup-t bands: per-path joint sup-t simultaneous confidence intervals across horizons 1..L_max within each path. A single shared (n_bootstrap, n_eligible) multiplier weight matrix (using the estimator's configured bootstrap_weights: Rademacher / Mammen / Webb) is drawn per path and broadcast across all horizons of that path, producing correlated bootstrap distributions across horizons. The path-specific critical value c_p = quantile(max_l |t_l|, 1 - α) is used to construct symmetric joint bands effect_l ± c_p · se_l per horizon.
Surfaced on results.path_sup_t_bands (dict keyed by path tuple, each entry with crit_value / alpha / n_bootstrap / method / n_valid_horizons); as cband_conf_int per horizon entry on path_effects[path]["horizons"][l]; and as cband_lower / cband_upper columns on results.to_dataframe(level="by_path") (mirrors the OVERALL level="event_study" schema; positive-horizon rows of banded paths get populated values, placebo / unbanded / empty-window rows get NaN). Gates: a path needs >= 2 valid horizons (finite bootstrap SE > 0) AND a strict majority (more than 50%) of finite sup-t draws to receive a band. Empty-state contract: path_sup_t_bands is None when not requested; {} when requested but no path passes both gates. Methodology asymmetry vs OVERALL event_study_sup_t_bands: the per-path sup-t draws a fresh shared weight matrix per path AFTER the per-path SE bootstrap block has already populated results.path_ses via independent per-(path, horizon) draws, which is asymptotically equivalent to OVERALL's self-consistent reuse but NOT bit-identical. Documented intentional choice to preserve RNG-state isolation for existing per-path SE seed-reproducibility tests. Inherits the cross-path cohort-sharing SE deviation from R documented for path_effects. Deviation from R: did_multiplegt_dyn does not provide joint / sup-t bands at any surface; this is a Python-only methodology extension consistent with the existing OVERALL sup-t bands (also Python-only). Bands cover joint inference WITHIN a single path across horizons; they do NOT provide simultaneous coverage across paths. Pre-audit fix bundled: stale "Phase 2 placeholder" docstring on the existing sup_t_bands field updated to the actual contract description. Tests at tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSupTBands (@pytest.mark.slow). See docs/methodology/REGISTRY.md § ChaisemartinDHaultfoeuille Note (Phase 3 by_path per-path joint sup-t bands) for the full contract.
- ChaisemartinDHaultfoeuille.by_path + placebo=True: per-path backward-horizon placebos DID^{pl}_{path, l} for l = 1..L_max. The same per-path SE convention used for the event-study (joiners/leavers IF precedent: switcher-side contributions zeroed for non-path groups; cohort structure and control pool unchanged; plug-in SE with path-specific divisor N^{pl}_{l, path}) is applied to backward horizons via the new switcher_subset_mask parameter on _compute_per_group_if_placebo_horizon. Surfaced on results.path_placebo_event_study[path][-l] (negative-int inner keys mirroring placebo_event_study); summary() renders the rows alongside per-path event-study horizons; to_dataframe(level="by_path") emits negative-horizon rows alongside the existing positive-horizon rows. Bootstrap (when n_bootstrap > 0) propagates per-(path, lag) percentile CI / p-value through the same _bootstrap_one_target dispatch as the per-path event-study, with the canonical NaN-on-invalid contract enforced on the new surface (PR #364 library-wide invariant). SE inherits the cross-path cohort-sharing deviation from R documented for path_effects (full-panel cohort-centered plug-in vs R's per-path re-run): tracks R within tolerance on single-path-cohort panels, diverges materially on cohort-mixed panels. The bootstrap SE is a Monte Carlo analog of the analytical SE and inherits the same deviation. R-parity confirmed at tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo on the new multi_path_reversible_by_path_placebo scenario (point estimates exact match; SE within Phase-2 envelope rtol ≤ 5%); positive analytical + bootstrap invariants at tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo (and the gated ::TestBootstrap subclass).
See docs/methodology/REGISTRY.md § ChaisemartinDHaultfoeuille Note (Phase 3 by_path ...) → "Per-path placebos" for the full contract.
- Tutorial 19: dCDH for Marketing Pulse Campaigns (docs/tutorials/19_dcdh_marketing_pulse.ipynb): end-to-end practitioner walkthrough on a 60-market reversible-treatment panel covering the TWFE decomposition diagnostic (twowayfeweights), DCDH Phase 1 (DID_M, joiners-vs-leavers, single-lag placebo), the L_max multi-horizon event study with multiplier bootstrap, a stakeholder communication template, and drift guards. README listing for Tutorial 17 (Brand Awareness Survey) backfilled in the same edit. Cross-link from docs/practitioner_decision_tree.rst § "Reversible Treatment" added.
3.3.0 - 2026-04-25
- SyntheticDiD(variance_method="placebo") SE now uses R-default warm-start matching synthdid:::placebo_se. R's placebo loop seeds Frank-Wolfe per draw with weights.boot$omega = sum_normalize(weights$omega[ind[1:N0_placebo]]) (fit-time ω subsetted + renormalized) and the fit-time weights$lambda. Python previously used uniform cold-start, producing finite-iter convergence-pattern drift on a handful of draws relative to R's reference SE. New _placebo_variance_se kwargs init_omega / init_lambda thread fit-time weights through the existing two-pass FW dispatcher; on the global FW optimum the values are init-independent (strictly convex objective), so the change is a finite-iter parity fix, not a methodology change. Existing placebo SE values shift by sub-percent on most panels; the bit-identity baseline pin in TestScaleEquivariance::test_baseline_parity_small_scale[placebo] was rebased from 0.29385822261006445 to 0.293840360160448. New R-parity test tests/test_methodology_sdid.py::TestJackknifeSERParity::test_placebo_se_matches_r asserts SE matches R's vcov(method="placebo") to within 1e-8 using R's exact permutation sequence (recorded by benchmarks/R/generate_sdid_placebo_parity_fixture.R into tests/data/sdid_placebo_indices_r.json). The _placebo_indices kwarg on _placebo_variance_se is the test seam; not part of the public API.
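The warm-start seed described above (subset the fit-time ω by the permuted placebo-control indices, then renormalize) can be sketched in a few lines. The function names and numbers below are illustrative, indices are 0-based where R's are 1-based, and the uniform fallback for an all-zero subset is an assumption of this sketch rather than a statement about the library.

```python
def sum_normalize(values):
    """Rescale weights to sum to 1; assumed uniform fallback when the sum is 0."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def placebo_init_omega(fit_omega, ind, n0_placebo):
    """Take fit-time omega at the first n0_placebo permuted indices and renormalize."""
    subset = [fit_omega[i] for i in ind[:n0_placebo]]
    return sum_normalize(subset)

fit_omega = [0.5, 0.3, 0.2, 0.0]  # illustrative fit-time unit weights
ind = [2, 0, 3, 1]                # illustrative permutation of control indices
init = placebo_init_omega(fit_omega, ind, n0_placebo=3)
print(init)
```

Because the Frank-Wolfe objective is strictly convex, a seed like this only affects where a finite number of iterations ends up, which is why the entry frames the change as a parity fix.
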
- qug_test and did_had_pretest_workflow survey-aware NotImplementedError gates (Phase 4.5 C0 decision gate). qug_test(d, *, survey=None, weights=None) and did_had_pretest_workflow(..., *, survey=None, weights=None) now accept the two kwargs as keyword-only with default None. Passing either as non-None raises NotImplementedError with an educational message naming the methodology rationale and pointing users to joint Stute (Phase 4.5 C, planned) as the survey-compatible alternative. Mutex guard on survey= + weights= mirrors HeterogeneousAdoptionDiD.fit() at had.py:2890. QUG-under-survey is permanently deferred: the test statistic uses extreme order statistics D_{(1)}, D_{(2)}, which are NOT smooth functionals of the empirical CDF, so standard survey machinery (Binder-TSL linearization, Rao-Wu rescaled bootstrap, Krieger-Pfeffermann (1997) EDF tests) does not yield a calibrated test; under cluster sampling the Exp(1)/Exp(1) limit law's independence assumption breaks; and the EVT-under-unequal-probability-sampling literature (Quintos et al. 2001, Beirlant et al.) addresses tail-index estimation, not boundary tests. The workflow's gate is temporary. Phase 4.5 C will close it for the linearity-family pretests, with the mechanism varying by test: Rao-Wu rescaled bootstrap for stute_test and the joint variants (stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test); weighted OLS residuals + weighted variance estimator for yatchew_hr_test (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based). Sister pretests (stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test) keep their closed signatures in this release; Phase 4.5 C will add kwargs and implementation together to avoid API churn. Unweighted qug_test(d) and did_had_pretest_workflow(...) calls are bit-exact with pre-PR behavior (kwargs are keyword-only after *; positional path unchanged). New tests at tests/test_had_pretests.py::TestQUGTest (5 rejection / mutex / message / regression tests) and the new TestHADPretestWorkflowSurveyGuards class (6 tests covering both kwarg paths, mutex, methodology pointer, both aggregate paths, and unweighted regression). See docs/methodology/REGISTRY.md § "QUG Null Test", Note (Phase 4.5 C0) for the full methodology rationale plus a sketch of the (out-of-scope) theoretical bridge that combines endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006), survey-aware functional CLTs (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017), and tail-empirical-process theory (Drees 2003); this is publishable methodology research, not engineering work.
- HeterogeneousAdoptionDiD mass-point survey= / weights= + event-study aggregate="event_study" survey composition + multiplier-bootstrap sup-t simultaneous confidence band (Phase 4.5 B). Closes the two Phase 4.5 A NotImplementedError gates: design="mass_point" + weights/survey and aggregate="event_study" + weights/survey. Weighted 2SLS sandwich in _fit_mass_point_2sls follows the Wooldridge 2010 Ch. 12 pweight convention (w² in the HC1 meat, w·u in the CR1 cluster score, weighted bread Z'WX); HC1 and CR1 ("stata" se_type) bit-parity with estimatr::iv_robust(..., weights=, clusters=) at atol=1e-10 (new cross-language golden at benchmarks/data/estimatr_iv_robust_golden.json, generated by benchmarks/R/generate_estimatr_iv_robust_golden.R; estimatr added to benchmarks/R/requirements.R). _fit_mass_point_2sls gains weights= + return_influence= kwargs and now always returns a 3-tuple (beta, se, psi), where psi is the per-unit IF on the β̂-scale, scaled so compute_survey_if_variance(psi, trivial_resolved) ≈ V_HC1[1,1] at atol=1e-10 (PR #359 IF scale convention applied uniformly; no sum(psi²) claims).
Event-study per-horizon variance: the survey= path composes Binder-TSL via compute_survey_if_variance; the weights= shortcut uses the analytical weighted-robust SE (continuous: CCT-2014 bc_fit.se_robust / |den|; mass-point: weighted 2SLS pweight sandwich from _fit_mass_point_2sls, HC1 / classical / CR1). survey_metadata / variance_formula / effective_dose_mean populated in both regimes (previously hardcoded None at had.py:3366). New multiplier-bootstrap sup-t: _sup_t_multiplier_bootstrap reuses diff_diff.bootstrap_utils.generate_survey_multiplier_weights_batch for PSU-level draws with stratum centering + sqrt(n_h/(n_h-1)) small-sample correction + FPC scaling + lonely-PSU handling. On the weights= shortcut, sup-t calibration is routed through a synthetic trivial ResolvedSurveyDesign so the centered + small-sample-corrected branch fires uniformly. This targets the analytical HC1 variance family (compute_survey_if_variance(IF, trivial) ≈ V_HC1 per the PR #359 IF scale invariant) rather than the raw sum(ψ²) = ((n-1)/n) · V_HC1 that unit-level Rademacher multipliers would produce on the HC1-scaled IF. Perturbations: delta = weights @ IF with NO (1/n) prefactor (matching the staggered_bootstrap.py:373 idiom), normalized by per-horizon analytical SE, (1-alpha)-quantile of the sup-t distribution. At H=1 the quantile reduces to Φ⁻¹(1 − alpha/2), which is ≈ 1.96 at alpha = 0.05, up to MC noise (regression-locked by TestSupTReducesToNormalAtH1). HeterogeneousAdoptionDiD.__init__ gains n_bootstrap: int = 999 and seed: Optional[int] = None (CS-parity singular seed); fit() gains cband: bool = True (only consulted on weighted event-study). HeterogeneousAdoptionDiDEventStudyResults extended with variance_formula, effective_dose_mean, cband_low, cband_high, cband_crit_value, cband_method, cband_n_bootstrap (all None on unweighted fits); surfaced in to_dict, to_dataframe, summary, __repr__. Unweighted event-study with cband=False preserves pre-Phase 4.5 B numerical output bit-exactly (stability invariant, locked by regression tests). Zero-weight subpopulation convention carries over from PR #359 (filter for design decisions; preserve full ResolvedSurveyDesign for variance). Non-pweight SurveyDesigns (aweight, fweight, replicate designs) raise NotImplementedError on both new paths (reciprocal-guard discipline). Pretest surfaces (qug_test, stute_test, yatchew_hr_test, joint variants, did_had_pretest_workflow) remain unweighted in this release (deferred to Phase 4.5 C / C0). See docs/methodology/REGISTRY.md § HeterogeneousAdoptionDiD "Weighted 2SLS (Phase 4.5 B)", "Event-study survey composition", and "Sup-t multiplier bootstrap" for derivations and invariants.
- PanelProfile.outcome_shape and PanelProfile.treatment_dose extensions + llms-autonomous.txt worked examples (Wave 2 of the AI-agent enablement track). profile_panel(...) now populates two new optional sub-dataclasses on the returned PanelProfile: outcome_shape: Optional[OutcomeShape] (numeric outcomes only; exposes n_distinct_values, pct_zeros, value_min / value_max, skewness and excess_kurtosis (NaN-safe; None when n_distinct_values < 3 or variance is zero), is_integer_valued, is_count_like (heuristic: integer-valued AND has zeros AND right-skewed AND > 2 distinct values AND non-negative support, i.e. value_min >= 0; flags WooldridgeDiD QMLE consideration over linear OLS; the non-negativity clause aligns the routing signal with WooldridgeDiD(method="poisson")'s hard rejection of negative outcomes at wooldridge.py:1105-1109), is_bounded_unit ([0, 1] support)) and treatment_dose: Optional[TreatmentDoseShape] (continuous treatments only; exposes n_distinct_doses, has_zero_dose, dose_min / dose_max / dose_mean over non-zero doses).
Both OutcomeShape and TreatmentDoseShape are mostly descriptive context. profile_panel does not see the separate first_treat column that ContinuousDiD.fit() consumes; the estimator's actual fit-time gates key off first_treat (defines never-treated controls as first_treat == 0, force-zeroes nonzero dose on those rows with a UserWarning, and rejects negative dose only among treated units first_treat > 0; see continuous_did.py:276-327 and :348-360). In the canonical ContinuousDiD setup (Callaway, Goodman-Bacon, Sant'Anna 2024), the dose D_i is time-invariant per unit and first_treat is a separate column the caller supplies (not derived from the dose column). Under that setup, several facts on the dose column predict fit() outcomes: PanelProfile.has_never_treated (proxies P(D=0) > 0 because the canonical convention ties first_treat == 0 to D_i == 0); PanelProfile.treatment_varies_within_unit == False (the actual fit-time gate at lines 222-228, holds regardless of first_treat); PanelProfile.is_balanced (the actual fit-time gate at lines 329-338); absence of the duplicate_unit_time_rows alert (silent last-row-wins overwrite, must deduplicate before fit); and treatment_dose.dose_min > 0 (predicts the strictly-positive-treated-dose requirement at lines 287-294 because treated units carry their constant dose across all periods). When has_never_treated == False (no zero-dose controls but all observed doses non-negative), ContinuousDiD does not apply (Remark 3.1 lowest-dose-as-control is not implemented); HeterogeneousAdoptionDiD IS a routing alternative on this branch (HAD's own contract requires non-negative dose, which is satisfied). When dose_min <= 0 (negative treated doses), ContinuousDiD does not apply AND HeterogeneousAdoptionDiD is not a fallback: HAD also raises on negative post-period dose (had.py:1450-1459); the applicable alternative is linear DiD with the treatment as a signed continuous covariate. Re-encoding the treatment column is an agent-side preprocessing choice that changes the estimand and is not documented in REGISTRY as a supported fallback. The estimator's force-zero coercion on first_treat == 0 rows with nonzero dose is implementation behavior for inconsistent inputs, not a documented method for manufacturing never-treated controls. The agent must validate the supplied first_treat column independently, because profile_panel does not see it. The shape extensions provide distributional context (effect-size range, count-shape detection) that supplements but does not replace those gates. Both fields are None when their classification gate is not met (e.g., treatment_dose is None for binary treatments). to_dict() serializes the nested dataclasses as JSON-compatible nested dicts. New exports: OutcomeShape, TreatmentDoseShape from top-level diff_diff. llms-autonomous.txt gains a new §5 "Worked examples" section with three end-to-end PanelProfile -> reasoning -> validation walkthroughs (binary staggered with never-treated controls, continuous dose with zero baseline, count-shaped outcome) plus §2 field-reference subsections for the new shape fields and §4.7 / §4.11 cross-references for outcome-shape considerations. Existing §5-§8 of the autonomous guide are renumbered to §6-§9. Descriptive only; no recommender language inside the worked examples.
- HeterogeneousAdoptionDiD.fit(survey=..., weights=...) on continuous-dose paths (Phase 4.5 survey support). The continuous_at_zero (paper Design 1') and continuous_near_d_lower (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: weights=<array> (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and survey=SurveyDesign(weights, strata, psu, fpc) (design-based inference via Binder-TSL variance using the existing compute_survey_if_variance helper at diff_diff/survey.py:1802).
Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). HeterogeneousAdoptionDiDResults.survey_metadata is a repo-standard SurveyMetadata dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (variance_formula label, effective_dose_mean) are separate top-level result fields. to_dict() surfaces the full SurveyMetadata object plus variance_formula + effective_dose_mean; summary() renders variance_formula, effective_n, effective_dose_mean, and (when the survey= path is used) df_survey; __repr__ surfaces variance_formula + effective_dose_mean when present. The HAD mass_point design and aggregate="event_study" path raise NotImplementedError under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged: no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at atol=1e-14 on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at atol=1e-12, and (3) Monte Carlo oracle consistency on known-τ DGPs. _nprobust_port.lprobust gains weights= and return_influence= (used internally by the Binder-TSL path); bias_corrected_local_linear removes the Phase 1c NotImplementedError on weights= and forwards it. Auto-bandwidth selection remains unweighted in this release; pass h / b explicitly for weight-aware bandwidths. See docs/methodology/REGISTRY.md § HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
- stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test + StuteJointResult (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core stute_joint_pretest is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, OLS(Y_t − Y_base ~ 1) per pre-period) and linearity (step 3 joint, OLS(Y_t − Y_base ~ 1 + D) per post-period). Sum-of-CvMs aggregation (S_joint = Σ_k S_k); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 did_had_pretest_workflow previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See docs/methodology/REGISTRY.md § HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
- did_had_pretest_workflow(aggregate="event_study"): multi-period dispatch on balanced ≥3-period panels. Runs QUG at F + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base F-1), pretrends_joint=None and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with UserWarning; ValueError when first_treat_col=None and the panel is staggered). The data-in wrappers joint_pretrends_test and joint_homogeneity_test also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. HADPretestReport extended with pretrends_joint, homogeneity_joint, and aggregate fields; serialization methods (summary, to_dict, to_dataframe, __repr__) preserve the Phase 3 output bit-exactly on aggregate="overall" (no aggregate key, no header row, no schema drift) and only surface the new fields on aggregate="event_study".
- ChaisemartinDHaultfoeuille.by_path: per-path event-study disaggregation, mirroring R did_multiplegt_dyn(..., by_path=k).
Passingby_path=k(positive int) to the estimator reports separateDID_{path,l}+ SE + inference for the top-k most common observed treatment paths in the window[F_g-1, F_g-1+L_max], answering the practitioner question "is a single pulse enough, or do you need sustained exposure?" across paths like(0,1,0,0)vs(0,1,1,0)vs(0,1,1,1). The per-path SE follows the joiners-only / leavers-only IF precedent (switcher-side contribution zeroed for non-path groups; control pool and cohort structure unchanged; plug-in SE with path-specific divisor). Requiresdrop_larger_lower=False(multi-switch groups are the object of interest) andL_max >= 1. Binary treatment only in this release; combinations withtrends_linear,trends_nonparam,heterogeneity,design2,honest_did, andsurvey_designraiseNotImplementedErrorand are deferred to follow-up PRs (n_bootstrap > 0,placebo=True, joint sup-t bands, andcontrolsare now supported — see the dedicated entries elsewhere in[Unreleased]). Results exposeresults.path_effects: Dict[Tuple[int, ...], Dict[str, Any]]andresults.to_dataframe(level="by_path"); the summary grows a "Treatment-Path Disaggregation" block. Ties in path frequency are broken lexicographically on the path tuple for deterministic ranking. Overflow (by_path > n_observed_paths) returns all observed paths with aUserWarning. Seedocs/methodology/REGISTRY.md§ChaisemartinDHaultfoeuilleNote (Phase 3 by_path per-path event-study disaggregation)for the full contract.ChaisemartinDHaultfoeuille.by_path+n_bootstrap > 0— bootstrap SE for per-path event-study effects. The top-k paths are enumerated once on the observed data (R-faithful path-stability semantics: matchesdid_multiplegt_dyn(..., by_path=k, bootstrap=B), confirmed empirically againstDIDmultiplegtDYN 2.3.3), and the existing multiplier bootstrap (bootstrap_weights ∈ {"rademacher", "mammen", "webb"}) runs per(path, horizon)target via the shared_bootstrap_one_target/compute_effect_bootstrap_statshelpers. 
Point estimates are unchanged from the analytical path. Bootstrap SE replaces the analytical SE inpath_effects[path]["horizons"][l]["se"], andp_value/conf_intpropagate the bootstrap percentile statistics (library Round-10 convention, same asoverall/joiners/leavers/multi_horizon);t_statis SE-derived viasafe_inferenceper the anti-pattern rule. Interpretation is conditional on the observed path set — practitioners wanting unconditional inference capturing path-selection uncertainty need a pairs-bootstrap (no R precedent). SE inherits the analytical cross-path cohort-sharing deviation: bootstrap input is the same full-panel cohort-centered path IF as the analytical path, so the bootstrap SE is a Monte Carlo analog of the analytical SE and inherits the existing analytical-path divergence from R on mixed-path cohorts (see REGISTRY.md for the full mechanism). On single-path-cohort panels, bootstrap and analytical SE both track R up to the Phase 2 envelope. Deviation from R (CI method): R's per-path bootstrap CI is normal-theory around the bootstrap SE (half-width ≈1.96·se); ours is the bootstrap percentile CI, intentionally diverging from R to keep the dCDH inference surface internally consistent across all bootstrap targets. Positive regressions attests/test_chaisemartin_dhaultfoeuille.py::TestByPathBootstrap(@pytest.mark.slow): point-estimate invariance, finite SE on non-degenerate panels, bootstrap-vs-analytical SE within 30% rtol on cohort-clean panels, degenerate-cohort NaN propagation, Rademacher/Mammen/Webb parity, seed reproducibility, and percentile-vs-normal-theory CI pinning. Seedocs/methodology/REGISTRY.md§ChaisemartinDHaultfoeuilleNote (Phase 3 by_path ...)→ Bootstrap SE for the full write-up.- R-parity for
ChaisemartinDHaultfoeuille.by_pathagainstDIDmultiplegtDYN 2.3.3. Two new scenarios inbenchmarks/data/dcdh_dynr_golden_values.jsongenerated fromdid_multiplegt_dyn(..., by_path=k):mixed_single_switch_by_path(2 paths,by_path=2) andmulti_path_reversible_by_path(4 observed paths,by_path=3, via a new deterministic multi-path DGP pattern in the R generator). Per-path point estimates and per-path switcher counts match R exactly; per-path SE matches within the Phase 2 multi-horizon SE envelope (observed rtol ≤ 10.2% on the 2-path scenario, ≤ 4.2% on the 4-path scenario). Parity tests live attests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath, matching paths by tuple label via set-equality (robust to R's undocumented frequency-tie tiebreak) and cross-checking per-path switcher counts before SE comparison. Deviation documented: cross-path cohort sharing — our full-panel cohort-centered plug-in vs R's per-path re-run diverges materially when a(D_{g,1}, F_g, S_g)cohort spans multiple observed paths; the two coincide when every cohort is single-path. The parity scenarios are constructed to keep cohorts single-path (scenario 13 by design, scenario 14 via path-assignment-deterministic-on-F_g). Seedocs/methodology/REGISTRY.md§ChaisemartinDHaultfoeuilleNote (Phase 3 by_path...)for the full write-up. profile_panel()utility +llms-autonomous.txtreference guide (agent-facing) — newdiff_diff.profile_panel(df, *, unit, time, treatment, outcome)returns a frozenPanelProfiledataclass of structural facts (panel balance, treatment-type classification —"binary_absorbing"/"binary_non_absorbing"/"continuous"/"categorical", cohort structure, outcome characteristics, and atuple[Alert, ...]of factual observations)..to_dict()returns a JSON-serializable view. 
Paired with a new bundled"autonomous"variant onget_llm_guide()—get_llm_guide("autonomous")returns a reference-shaped guide (distinct from the existing workflow-prose"practitioner"variant) with §1 audience disclaimer, §2PanelProfilefield reference, §3 embedded 17-estimator × 9-design-feature support matrix, §4 per-design-feature reasoning citing Baker et al. (2025) and Roth / Sant'Anna (2023), §5 post-fit validation index, §6 BR/DR schema reference, §7 citations, §8 intentional omissions. Both pieces are bundled inside the wheel (no GitHub / RTD dependency at runtime);diff_diff/__init__.pymodule docstring leads with an agent-entry block listingprofile_panel,get_llm_guide("autonomous"),get_llm_guide("practitioner"), andBusinessReportsohelp(diff_diff)surfaces them. Descriptive, not opinionated —profile_panelalerts never recommend a specific estimator, and the guide enumerates trade-offs rather than dispatching. Exports:profile_panel,PanelProfile,Alertfrom top-leveldiff_diff.target_parameterblock in BR/DR schemas (experimental; schema version bumped to 2.0) —BUSINESS_REPORT_SCHEMA_VERSIONandDIAGNOSTIC_REPORT_SCHEMA_VERSIONbumped from"1.0"to"2.0"because the new"no_scalar_by_design"value on theheadline.status/headline_metric.statusenum (dCDHtrends_linear=True, L_max>=2configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-leveltarget_parameterblock naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields:name,definition,aggregation(machine-readable dispatch tag),headline_attribute(raw result attribute),reference(citation pointer). BR's summary emits the shortnameright after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. 
Per-estimator dispatch is sourced from REGISTRY.md and lives in the newdiff_diff/_reporting_helpers.py::describe_target_parameter. A few branches read fit-time config (EfficientDiDResults.pt_assumption,StackedDiDResults.clean_control,ChaisemartinDHaultfoeuilleResults.L_max/covariate_residuals/linear_trends_effects); others emit a fixed tag (the fit-timeaggregatekwarg on CS / Imputation / TwoStage / Wooldridge does not change theoverall_attscalar — disambiguating horizon / group tables is tracked under gap #9). Seedocs/methodology/REPORTING.md"Target parameter" section.- SyntheticDiD coverage Monte Carlo calibration table added to
docs/methodology/REGISTRY.md §SyntheticDiD: rejection rates at α ∈ {0.01, 0.05, 0.10} across placebo / bootstrap / jackknife on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at benchmarks/data/sdid_coverage.json (500 seeds × B=200), regenerable via benchmarks/python/coverage_sdid.py.
- SyntheticDiD variance_method="bootstrap" now runs the paper-faithful refit bootstrap with R-default warm-start. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw using the fit-time normalized-scale zeta (Arkhangelsky et al. (2021) Algorithm 2 step 2), matching the behavior of R's default synthdid::vcov(method="bootstrap") (which rebinds attr(estimate, "opts") so the renormalized ω serves as Frank-Wolfe initialization). The Python path threads that warm-start through compute_sdid_unit_weights(..., init_weights=_sum_normalize(ω̂[boot_control_idx])) and compute_time_weights(..., init_weights=λ̂) on each bootstrap draw. compute_sdid_unit_weights and compute_time_weights gain a new init_weights kwarg; when provided, the Rust top-level fast-path is skipped in favor of the Python two-pass dispatcher (whose inner FW calls still dispatch to Rust). Without this kwarg both helpers remain backward-compatible and keep the Rust fast-path. The previous fixed-weight bootstrap path is removed entirely: it was not paper-faithful and, despite prior documentation claiming otherwise, also did not match R's default bootstrap (the previous R-parity test fixture invoked synthdid_estimate(weights=...) without rebinding opts, which silently runs fixed-weight, so the 1e-10 parity was between two paths both wrong in the same direction). Coverage MC at the new artifact above quantifies the correctness fix on 3 representative null DGPs. Users' existing variance_method="bootstrap" fits will return materially different SE / p-value / CI values on the next release: same enum name, corrected semantics. Bootstrap is now ~5–30× slower per fit than the old fixed-weight shortcut (panel-size dependent; warm-start converges faster than cold-start, so the slowdown is less than the prior 10–100× estimate). The PR #349 follow-on bullets below (analytical p-value dispatch, sqrt((r-1)/r) SE formula, retry-to-B contract) all carry over to the refit path unchanged.
- SyntheticDiD
variance_method="bootstrap" now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's synthdid::vcov() convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. is_significant and significance_stars are derived from p_value and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
- SyntheticDiD bootstrap SE formula applies the sqrt((r-1)/r) correction, matching R's synthdid and the placebo SE formula.
- SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite τ_b) until exactly n_bootstrap valid replicates are accumulated, matching R's synthdid::bootstrap_sample and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (20 × n_bootstrap) prevents pathological-input hangs.
- TROP global bootstrap SE backend parity under fixed seed: Rust and Python backends now produce bit-identical bootstrap SE under the same
seed. Previously Rust'sbootstrap_trop_variance_globalseededrand_xoshiro::Xoshiro256PlusPlusper replicate while Python's fallback consumednumpy.random.default_rng(PCG64), producing ~28% SE divergence on tiny panels underseed=42. Fixed by extracting a sharedstratified_bootstrap_indiceshelper indiff_diff/bootstrap_utils.pythat pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes the bootstrap-RNG half of silent-failures audit finding #23 (grid-search half closed in PR #348; local-method methodology half closed by the two Fixed entries below). Local-method TROP also adopts the Python-canonical index contract for the RNG layer here. - TROP local-method Rust weight-matrix no longer normalized —
rust/src/trop.rs::compute_weight_matrixno longer divides time-weights or unit-weights by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (line 2037:[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)) both specify raw-exponential weights; Python's_compute_observation_weightswas already REGISTRY-compliant. User-visible effect: Rust local-method ATT values may shift for any fit withlambda_nn < infinity— normalizing the weight-matrix inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. Forlambda_nn = infinity(factor model disabled) outputs are unchanged because uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on this boundary; both backends now converge on the same REGISTRY-compliant selection. - TROP local-method Python
_compute_observation_weightsnow uses the function-argumentY, Dand treats all non-target units as donors — two coupled changes that bring Python structurally in line with Rust and the paper's Eq. 2/3:- Removed the
if self._precomputed is not None:branch that silently substitutedself._precomputed["Y"]/["D"]/["time_dist_matrix"](original-panel cache populated during main fit) for the function-argumentY, D. Under bootstrap,_fit_with_fixed_lambdacomputes freshY, Dfrom the resampledboot_dataand passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumedy_boot, d_boot). - Removed the
valid_control_at_t = D[t, :] == 0target-period donor gate that zeroedω_jfor any unitjtreated at the target period (other than the target unit itself). Per REGISTRY Eq. 2/3 and Rust'scompute_weight_matrix,ω_j = exp(-λ_unit × dist(j, i))for allj ≠ i; treated-cell exclusion happens via the(1 − W_{js})factor applied inside_estimate_model. Same-cohort donors now contribute via their pre-treatment rows. Empirically the main-fit ATT is unchanged on tested fixtures because same-cohort pre-treatment observations are exactly absorbed by their own unit fixed effectalpha_jwithout propagating intomu,beta, or other units' parameters — so this change is structural alignment rather than a numerical shift in output. Users on same-cohort panels with very few controls may still see tiny differences in edge cases; the newtest_local_method_same_cohort_donor_parityregression guards the aligned behavior. Together with the normalization fix above, TROP local-method backend parity on the main-fit ATT is regime-dependent:atol=rtol=1e-14forlambda_nn=inf(no nuclear-norm regularization, uniform weight scaling leaves the WLS argmin invariant) andatol=1e-10for finitelambda_nn(FISTA inner loop + BLAS reduction ordering introduce sub-1e-10 roundoff across Rustfaervs numpy paths). Bootstrap SE parity is asserted atatol=1e-5to accommodate ~1e-7 roundoff between Rust'sestimate_modelmatrix factorization and numpy'slstsqthat accumulates across per-replicate fits; sub-1e-14 bootstrap parity is tracked as a follow-up inTODO.mdunder "unify Rust local-method solver path". Closes silent-failures audit finding #23 (local-method half; the RNG half closed in PR #354 and the grid-search half in PR #348).
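The invariance argument in the two TROP local-method entries above can be checked numerically. The following is a minimal sketch, not the library's code: `raw_unit_weights`, `wls_solve`, and `ridge_solve` are hypothetical helpers, and a ridge penalty is used here only as a simple stand-in for the nuclear-norm penalty. It illustrates that rescaling all weights by a constant leaves an unpenalized WLS solution unchanged but shifts a penalized one.

```python
import numpy as np


def raw_unit_weights(dist, lambda_unit):
    """Unnormalized exponential weights: omega_j = exp(-lambda_unit * dist_j)."""
    return np.exp(-lambda_unit * np.asarray(dist, dtype=float))


def wls_solve(X, y, w):
    """Unpenalized weighted least squares via row-scaling by sqrt(w)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def ridge_solve(X, y, w, lam):
    """Weighted least squares with a ridge penalty (stand-in for a penalized fit)."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X + lam * np.eye(X.shape[1]), XtW @ y)


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=30)
dist = rng.uniform(0.0, 3.0, size=30)

w_raw = raw_unit_weights(dist, lambda_unit=0.5)
w_norm = w_raw / w_raw.sum()  # the normalization the Rust path used to apply

# No penalty: a uniform rescaling of the weights cancels out of the argmin.
beta_raw = wls_solve(X, y, w_raw)
beta_norm = wls_solve(X, y, w_norm)

# With a penalty: shrinking the weights makes the penalty relatively stronger.
ridge_raw = ridge_solve(X, y, w_raw, lam=1.0)
ridge_norm = ridge_solve(X, y, w_norm, lam=1.0)
```

Here `beta_raw` and `beta_norm` agree to floating-point tolerance, while `ridge_raw` and `ridge_norm` do not, which mirrors why only fits with finite `lambda_nn` are expected to shift.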
did_had_pretest_workflow(aggregate="event_study")verdict no longer emits the "paper step 2 deferred to Phase 3 follow-up" caveat — the joint pre-trends Stute test closes that gap. The two-periodaggregate="overall"path retains the existing caveat since the joint variant does not apply to single-pre-period panels. Downstream code that greps verdict strings for the Phase 3 caveat will see it suppressed on the event-study path.- SyntheticDiD bootstrap no longer supports survey designs (capability regression in PR #351, restored in PR #355 — see Added/Changed entries directly below). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the PR #351 paper-faithful refit bootstrap initially rejected all survey designs (including pweight-only) with
NotImplementedError. PR #355 restores the capability via a weighted-FW + Rao-Wu composition; the lock-out window applies only to the v3.2.x line that ships PR #351 alone (without PR #355). For how Rao-Wu rescaled weights are composed with Frank-Wolfe re-estimation, see docs/methodology/REGISTRY.md §SyntheticDiD Note (survey + bootstrap composition).
- SDID
variance_method="bootstrap"survey support restored via a hybrid pairs-bootstrap + Rao-Wu rescaling composed with a weighted Frank-Wolfe kernel. Each bootstrap draw first performs the unit-level pairs-bootstrap resampling specified by Arkhangelsky et al. (2021) Algorithm 2 (boot_idx = rng.choice(n_total)), and then applies Rao-Wu rescaled per-unit weights (Rao & Wu 1988) sliced over the resampled units — NOT a standalone Rao-Wu bootstrap. New Rust kernelsc_weight_fw_weighted(and_with_convergencesibling) accepts a per-coordinatereg_weightsargument so the FW objective becomesmin ||A·ω - b||² + ζ²·Σ_j reg_w[j]·ω[j]². New Python helperscompute_sdid_unit_weights_surveyandcompute_time_weights_surveythread per-control survey weights through the two-pass sparsify-refit dispatcher (column-scaling Y byrwfor the loss,reg_weights=rwfor the penalty on the unit-weights side; weighted column-centering + row-scaling Y bysqrt(rw)for the loss with uniform reg on the time-weights side)._bootstrap_sesurvey branch composes the per-drawrw(Rao-Wu rescaling for full designs, constantw_controlfor pweight-only fits) with the weighted-FW helpers, then composesω_eff = rw·ω/Σ(rw·ω)for the SDID estimator. Coverage MC artifact extended with astratified_surveyDGP (BRFSS-style: N=40, strata=2, PSU=2/stratum); the bootstrap row's near-nominal calibration is the validation gate (target rejection ∈ [0.02, 0.10] at α=0.05). New regression tests acrosstest_methodology_sdid.py::TestBootstrapSE(single-PSU short-circuit, full-design and pweight-only succeeds-tests, zero-treated-mass retry, deterministic Rao-Wu × boot_idx slice) andtest_survey_phase5.py::TestSyntheticDiDSurvey(full-design ↔ pweight-only SE differs assertion). See REGISTRY.md §SyntheticDiDNote (survey + bootstrap composition)for the full objective and the argmin-set caveat.
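The post-optimization composition ω_eff = rw·ω/Σ(rw·ω) described above is easy to illustrate in isolation. This is a minimal sketch under stated assumptions: `compose_effective_weights` is a hypothetical name, and raising on zero effective mass is an illustrative choice, not a description of how the library handles that case.

```python
import numpy as np


def compose_effective_weights(omega, rw):
    """Combine Frank-Wolfe unit weights with per-unit survey weights.

    omega: unit weights from the (weighted) Frank-Wolfe fit.
    rw:    per-control survey weights (Rao-Wu rescaled, or constant for
           a pweight-only design).
    """
    omega = np.asarray(omega, dtype=float)
    rw = np.asarray(rw, dtype=float)
    prod = rw * omega
    total = prod.sum()
    if not np.isfinite(total) or total <= 0.0:
        # Assumption for this sketch: treat zero effective mass as an error.
        raise ValueError("effective weight mass is zero or non-finite")
    return prod / total


omega = np.array([0.5, 0.3, 0.2])

# A unit whose rescaled weight is zero drops out; the rest renormalize.
omega_eff = compose_effective_weights(omega, rw=np.array([1.0, 2.0, 0.0]))

# Constant survey weights leave omega unchanged.
omega_const = compose_effective_weights(omega, rw=np.full(3, 3.0))
```

With constant `rw` the composition is the identity on ω, which is consistent with the changelog's statement that pweight-only fits use a constant per-control weight.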
- SDID bootstrap SE values under survey fits now differ numerically from the v3.2.x line that shipped PR #351 alone: the fit no longer raises NotImplementedError, and instead returns the weighted-FW + Rao-Wu SE. Non-survey fits are unaffected (the bootstrap dispatcher routes only the survey branch through the new _survey helpers; non-survey fits continue to call the existing compute_sdid_unit_weights / compute_time_weights and stay bit-identical at rel=1e-14 on the _BASELINE["bootstrap"] regression). SDID's placebo and jackknife paths still reject strata/PSU/FPC on the v3.2.x line; full-design support for those methods lands separately in the entries below.
- SDID variance_method="placebo" and "jackknife" now support strata/PSU/FPC designs. Closes the last SDID survey gap. All three variance methods (bootstrap from PR #355, plus placebo and jackknife here) now handle full survey designs. New private methods SyntheticDiD._placebo_variance_se_survey and _jackknife_se_survey route the full-design path through method-specific allocators:
  - Placebo: stratified permutation (Pesarin 2001). Each draw samples pseudo-treated indices uniformly without replacement from controls within each stratum containing actual treated units; non-treated strata contribute their controls unconditionally. The weighted Frank-Wolfe kernel from PR #355 (compute_sdid_unit_weights_survey / compute_time_weights_survey) re-estimates ω and λ per draw with per-control survey weights threaded into both loss and regularization; post-optimization composition ω_eff = rw·ω/Σ(rw·ω). Arkhangelsky Algorithm 4 SE formula unchanged.
  - Jackknife: PSU-level leave-one-out with stratum aggregation (Rust & Rao 1996). SE² = Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)² with f_h = n_h_sampled / fpc[h] (population-count FPC form). λ held fixed across LOOs; ω subsetted, composed with rw, renormalized. Strata with n_h < 2 silently skipped (matches R survey::svyjkn with lonely_psu="remove" / "certainty"; "adjust" raises NotImplementedError). Full-census strata (f_h ≥ 1) short-circuit to zero contribution before any LOO feasibility check. SE = 0 is returned for legitimate zero variance (e.g., every stratum full-census); SE = NaN with a targeted UserWarning is reserved for undefined cases: all strata skipped, or any delete-one replicate in a non-full-census contributing stratum is undefined (all-treated-in-one-PSU LOO, kept ω_eff / w_treated mass zero, estimator raises). Unstratified single-PSU short-circuits to NaN.
  - Fit-time feasibility guards (placebo): ValueError on stratum-level infeasibility with targeted messages distinguishing three cases: Case B (treated-containing stratum has zero controls), Case C (fewer controls than treated in a treated stratum), Case D (every treated stratum is exact-count n_c_h == n_t_h → permutation support is 1, null distribution collapses). Partial-permutation fallback rejected because it would silently change the null-distribution semantics.
  - Gate relaxed: the fit-time guard at synthetic_did.py:352-369 that rejected placebo/jackknife + strata/PSU/FPC is removed. Replicate-weight designs remain rejected (separate methodology: replicate variance is closed-form and would double-count with Rao-Wu-like rescaling). Non-survey and pweight-only paths bit-identical by construction: the new code is gated on resolved_survey_unit.(strata|psu|fpc) is not None.
  - Coverage MC: benchmarks/data/sdid_coverage.json extended with jackknife on stratified_survey. Bootstrap validates near-nominal (α=0.05 rejection = 0.058, SE/trueSD = 1.13). Jackknife reported with an anti-conservatism caveat: with only 2 PSUs per stratum the stratified jackknife formula has 1 effective DoF per stratum, a well-documented limitation of Rust & Rao (1996); se_over_truesd ≈ 0.46 on this DGP. Users needing tight SE calibration with few PSUs should prefer variance_method="bootstrap". Placebo is structurally infeasible on the existing stratified_survey DGP (its cohort packs into one stratum with 0 never-treated units, by design a bootstrap-suited DGP); the placebo survey path is exercised via unit tests on a feasible fixture.
  - Regression tests across tests/test_survey_phase5.py: two new classes TestSDIDSurveyPlaceboFullDesign and TestSDIDSurveyJackknifeFullDesign. Placebo: pseudo-treated-stratum contract, Case B / Case C front-door guards with targeted-message regression, SE-differs-from-pweight-only, deterministic dispatch. Jackknife: stratum-aggregation self-consistency, FPC magnitude regression (2-stratum handcrafted panel asserts SE_fpc == SE_nofpc · sqrt(1-f) at rtol=1e-10), single-PSU-stratum skip, unstratified short-circuit, all-strata-skipped warning + NaN, SE-differs-from-pweight-only, deterministic dispatch. Existing test_full_design_placebo_raises and test_full_design_jackknife_raises flipped to _succeeds assertions. All 19 existing pweight-only and non-survey placebo/jackknife tests pass unchanged (bit-identity preserved via the new-path gating).
  - Allocator asymmetry (documented in REGISTRY): placebo ignores the PSU axis (unit-level within-stratum permutation, the classical stratified permutation test; PSU-level permutation on few PSUs is near-degenerate); jackknife respects PSU (PSU-level LOO is the canonical survey jackknife). Both respect strata. See docs/methodology/REGISTRY.md §SyntheticDiD Note (survey + placebo composition) and Note (survey + jackknife composition).
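The stratified jackknife formula quoted in the Jackknife item above, SE² = Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)², can be sketched as follows. This is an illustration only: `stratified_jackknife_se` is a hypothetical function that takes already-computed delete-one estimates, and the ordering of the singleton-skip and full-census checks is an assumption of this sketch rather than a statement about the library.

```python
import math

import numpy as np


def stratified_jackknife_se(tau_loo, fpc=None):
    """Stratum-aggregated delete-one-PSU jackknife SE.

    tau_loo: dict mapping stratum -> delete-one estimates, one per PSU.
    fpc:     optional dict mapping stratum -> population PSU count.
    """
    total = 0.0
    n_contributing = 0
    for h, taus in tau_loo.items():
        taus = np.asarray(taus, dtype=float)
        n_h = taus.size
        if n_h < 2:
            continue  # singleton stratum: skipped
        n_contributing += 1
        f_h = 0.0 if fpc is None else n_h / fpc[h]
        if f_h >= 1.0:
            continue  # full-census stratum: legitimate zero contribution
        dev = taus - taus.mean()
        total += (1.0 - f_h) * (n_h - 1) / n_h * float(dev @ dev)
    if n_contributing == 0:
        return math.nan  # every stratum skipped: variance undefined
    return math.sqrt(total)


loo = {"a": [1.0, 3.0], "b": [2.0, 2.0]}
se_nofpc = stratified_jackknife_se(loo)
se_fpc = stratified_jackknife_se(loo, fpc={"a": 4, "b": 4})  # f_h = 0.5
se_census = stratified_jackknife_se(loo, fpc={"a": 2, "b": 2})
se_undefined = stratified_jackknife_se({"a": [1.0]})
```

On this toy input the FPC version equals the no-FPC version times sqrt(1 - f), the same magnitude relationship the regression test in the entry above asserts.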
3.2.0 - 2026-04-19
BusinessReportandDiagnosticReport(experimental preview) (PR #318) - practitioner-ready output layer.BusinessReport(results, ...)produces plain-English narrative summaries (.summary(),.full_report(),.export_markdown(),.to_dict()) from any of the 16 fitted result types.DiagnosticReport(results, ...)orchestrates the existing diagnostic battery (parallel trends, pre-trends power, HonestDiD sensitivity, Goodman-Bacon, heterogeneity, design-effect, EPV) plus estimator-native diagnostics for SyntheticDiD (pre_treatment_fit, weight concentration, in-time placebo, zeta sensitivity) and TROP (factor-model fit metrics). Both classes expose an AI-legibleto_dict()schema (single source of truth; prose renders from the dict). BR auto-constructs DR by default so summaries mention pre-trends, robustness, and design-effect findings in one call. Seedocs/methodology/REPORTING.mdfor methodology deviations including the no-traffic-light-gates decision, pre-trends verdict thresholds (0.05 / 0.30), and power-aware phrasing driven bycompute_pretrends_power. Both schemas are marked experimental in this release - wording, verdict thresholds, and schema shape will change; do not anchor downstream tooling on them yet.- Kernel / local-linear / nonparametric infrastructure (PRs #327, #335) - bandwidth selector, local linear regression, HC2 / Bell-McCaffrey variance helpers, and a port of R
nprobust's point-estimate path. Foundation for the upcomingHeterogeneousAdoptionDiDestimator (de Chaisemartin, Ciccia, D'Haultfœuille & Knau 2024 — "DiD with no untreated group"). Released as internal modules with full test coverage (tests/test_bandwidth_selector.py,tests/test_local_linear.py,tests/test_linalg_hc2_bm.py,tests/test_nprobust_port.py); the user-facing estimator ships in a later phase. - Cell-period IF allocator for dCDH survey variance (Class A contract) (PR #323) - replaces the group-level allocator
ψ_i = ψ_g * (w_i / W_g) with a cell-period allocator ψ_i = ψ_g * (w_i / W_{g, out_idx}) on the post-period cell for the DID_l replicate-weight ATT path. This is the allocator shape that the v3.2.0 heterogeneity and bootstrap extensions below build on. Documents the post-period attribution convention in REGISTRY.md with a hand-computed row-sum identity test.
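A small numeric sketch may help make the two allocators and the row-sum identity concrete. The function names are hypothetical, and one reading is assumed here: under the cell-period allocator, observations outside the post-period cell receive zero mass. Treat this as an illustration of the formulas, not of the library's internals.

```python
import numpy as np


def group_level_allocate(psi_g, w):
    """psi_i = psi_g * w_i / W_g over every observation in the group."""
    w = np.asarray(w, dtype=float)
    return psi_g * w / w.sum()


def cell_period_allocate(psi_g, w, period, out_period):
    """psi_i = psi_g * w_i / W_{g, out_idx} on the post-period cell only.

    Assumption of this sketch: observations in other cells get zero.
    """
    w = np.asarray(w, dtype=float)
    in_cell = np.asarray(period) == out_period
    psi = np.zeros_like(w)
    psi[in_cell] = psi_g * w[in_cell] / w[in_cell].sum()
    return psi


psi_g = 0.7
w = np.array([1.0, 2.0, 3.0, 4.0])
period = np.array([0, 0, 1, 1])  # two observations per period in one group

psi_group = group_level_allocate(psi_g, w)
psi_cell = cell_period_allocate(psi_g, w, period, out_period=1)

# Row-sum identity: both allocations add back up to the group-level psi_g.
row_sum_group = psi_group.sum()
row_sum_cell = psi_cell.sum()
```

Both row sums equal `psi_g`, so any variance that first aggregates to the group level sees the same value under either allocator; the two differ only in where the mass sits within the group.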
aggregate_surveystratum-PSU scaffolding precompute — the per-cell Taylor-series variance insideaggregate_surveyno longer rebuilds stratum-PSU scaffolding on every cell. A frozen_PsuScaffolding(strata codes, global PSU codes unique across strata, per-stratum counts and FPC ratios, singleton mask, static legitimate-zero counts and variance-computable flag) is precomputed once per design at the top ofaggregate_surveyand threaded through_cell_mean_varianceto a new_compute_if_variance_fastpath that replaces the per-stratum pandas groupby with two vectorizednp.bincountpasses. BRFSS-shaped 50-state × 10-year × 1M-row microdata → state-year panel drops from ~24s to sub-2s under both backends (the path is pure Python, so Python and Rust track each other). Numerical output is preserved to sub-ULP tolerance; seven-case equivalence tests (TestAggregateSurveyScaffolding) assertassert_allclose(atol=1e-14, rtol=1e-14)between fast and legacy paths across stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all threelonely_psumodes (remove / certainty / adjust). Replicate-weight designs continue to route throughcompute_replicate_if_varianceunchanged._compute_stratified_psu_meatis untouched — all other TSL callers (DiD / TWFE / CS / etc.) are unaffected.
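The idea of replacing a per-stratum groupby with np.bincount passes can be sketched as below. This is a simplified illustration with a hypothetical function name: it assumes integer PSU codes that are unique across strata, no FPC, and that singleton strata are dropped, and it is not the `_compute_if_variance_fast` implementation.

```python
import numpy as np


def psu_variance_bincount(psi, psu_codes, psu_stratum, n_strata):
    """Stratified between-PSU variance of a total, using bincount aggregation.

    psi:         per-observation influence-function values.
    psu_codes:   global PSU code of each observation (unique across strata).
    psu_stratum: stratum code of each PSU, indexed by PSU code.
    """
    psu_stratum = np.asarray(psu_stratum)
    n_psu = psu_stratum.size
    # Pass 1: observations -> PSU totals.
    z = np.bincount(psu_codes, weights=psi, minlength=n_psu)
    # Pass 2: PSU totals -> per-stratum counts, means, and centered sums of squares.
    n_h = np.bincount(psu_stratum, minlength=n_strata)
    sum_h = np.bincount(psu_stratum, weights=z, minlength=n_strata)
    mean_h = np.divide(sum_h, n_h, out=np.zeros(n_strata), where=n_h > 0)
    dev = z - mean_h[psu_stratum]
    ss_h = np.bincount(psu_stratum, weights=dev**2, minlength=n_strata)
    ok = n_h >= 2  # assumption: singleton strata contribute nothing
    return float(np.sum(n_h[ok] / (n_h[ok] - 1) * ss_h[ok]))


psi = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
psu_codes = np.array([0, 0, 1, 2, 2, 3])
psu_stratum = np.array([0, 0, 1, 1])  # PSUs 0,1 in stratum 0; PSUs 2,3 in stratum 1

variance = psu_variance_bincount(psi, psu_codes, psu_stratum, n_strata=2)
```

Because the stratum and PSU codes depend only on the design, they can be computed once and reused for every cell, which is the point of the precomputed scaffolding described above.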
- Add Zenodo DOI badge to README; upgrade the BibTeX citation block with the concept DOI (
10.5281/zenodo.19646175) and list author as Isaac Gerber (matching CITATION.cff). CITATION.cff carries the concept DOI as its top-level doi: field. Zenodo auto-mints a versioned DOI for every release, but the CFF file tracks the concept DOI only, so it doesn't need a follow-up edit per release. DOI was minted by Zenodo when v3.1.3 was released.
- ChaisemartinDHaultfoeuille heterogeneity + within-group-varying PSU/strata now supported under Binder TSL - fit(heterogeneity=..., survey_design=...) no longer raises NotImplementedError when the resolved design's PSU or strata vary across the cells of a group. On the Binder TSL branch (compute_survey_if_variance), the heterogeneity WLS coefficient IF is expanded to observation level via the cell-period allocator ψ_i = ψ_g * (w_i / W_{g, out_idx}) on the post-period cell, the DID_l post-period single-cell convention shipped in v3.1.x. Under PSU=group the PSU-level Binder TSL variance is byte-identical to the previous release (PSU-level aggregate telescopes to ψ_g); under within-group-varying PSU, mass lands in the post-period PSU of the transition. The Rao-Wu replicate-weight branch (compute_replicate_if_variance) retains the legacy group-level allocator ψ_i = ψ_g * (w_i / W_g): replicate variance computes θ_r = sum_i ratio_ir * ψ_i at observation level and is therefore not PSU-telescoping, so the cell-period allocator would silently change the replicate SE whenever a replicate column's ratios vary within group (e.g., per-row replicate matrices). Replicate + heterogeneity fits therefore produce byte-identical SE to the previous release, and the newly-unblocked heterogeneity= + within-group-varying PSU combination is unreachable under replicate designs by construction (SurveyDesign rejects replicate_weights combined with explicit strata/psu/fpc).
- ChaisemartinDHaultfoeuille.fit(survey_design=..., n_bootstrap > 0) now supports within-group-varying PSU - the PSU-level Hall-Mammen wild multiplier bootstrap has been extended from a group-level PSU map (one multiplier per group) to a cell-level PSU map (one multiplier per (g, t) cell's PSU). A dispatcher in _compute_dcdh_bootstrap detects PSU-within-group-constant regimes (including PSU=group auto-inject and strictly-coarser PSU with within-group constancy) and routes them through the legacy group-level path so the bootstrap SE is bit-identical to the previous release (guarded by the new test_bootstrap_se_matches_pre_pr4_baseline and the pre-existing test_auto_inject_bit_identical_to_group_level). Under within-group-varying PSU, a group contributing cells to multiple PSUs receives independent multiplier draws per PSU, which is the correct Hall-Mammen wild PSU clustering at cell granularity. Multi-horizon bootstraps draw a single shared (n_bootstrap, n_psu) PSU-level weight matrix per block and broadcast per-horizon via each horizon's cell-to-PSU map, so the sup-t simultaneous confidence band remains a valid joint distribution. Closes the last NotImplementedError gate in the dCDH survey contract; replicate-weight variance and n_bootstrap > 0 remain mutually exclusive by construction.
  Scope note: panels with terminal missingness where the terminally-missing group is in a cohort whose other groups still contribute at the missing period now raise a targeted ValueError on every survey variance path that uses the cell-period allocator: Binder TSL with within-group-varying PSU, Rao-Wu replicate-weight ATT (which always uses the cell allocator per the Class A contract shipped in PR #323), and the cell-level wild PSU bootstrap. Cohort-recentering leaks centered IF mass onto cells with no positive-weight observations, which the cell-period allocator cannot attach to any observation/PSU. This closes a silent mass-drop bug the cell-period allocator introduced across all three paths in v3.1.x; pre-process the panel to remove terminal missingness (drop late-exit groups or trim to a balanced sub-panel) as the documented workaround. For Binder TSL only, using an explicit psu=<group_col> routes through the legacy group-level allocator where the row-sum identity makes the two allocators statistically equivalent. Replicate-weight ATT and within-group-varying-PSU bootstrap have no such allocator fallback; the panel itself must be pre-processed. PSU-within-group-constant Binder TSL (including PSU=group auto-inject) is unaffected.
- Performance review: practitioner-scale scenarios + benchmark harness extension (PR #333) - new
docs/performance-scenarios.md documents 5-7 realistic practitioner workflows (marketing lift, geo-experiment, BRFSS state-policy, dCDH reversible treatment) grounded in the practitioner docs and the paper literature, not cookie-cutter textbook data. benchmarks/speed_review/ extended with practitioner-scale scripts and per-backend bit-identity baselines. Baselines refreshed against current main. Finding: the biggest leverage areas are bootstrap resampling loops and per-replicate survey-design rebuilds in the bootstrap path; documented in docs/performance-plan.md for follow-up optimization PRs.
- Wall-clock timing tests excluded from default CI (PRs #330, #336) - TestCallawaySantAnnaSEAccuracy.test_timing_performance and TestPerformanceRegression marked @pytest.mark.slow, removing false-positive CI failures from runner-noise variance (BLAS path variation, neighbor VM contention). Tests remain runnable via pytest -m slow for ad-hoc local benchmarking; the perf-review harness above is the principled replacement for CI-gated performance tracking.
- Silent-failures audit: axis A (PR #334) — minor solver paths numerical-precision / scale-fragility closeouts, completing the SDID extreme-Y-scale work started in v3.1.2.
- Silent-failures audit: axis C & J (PR #339) - B-spline derivative warning scope broadened; SurveyPowerConfig stale-cache wording narrowed.
- Silent-failures audit: axis E (PR #331) - row-drop counters surfaced across estimator paths so silent validator row-drops leave an explicit count on the result.
- Silent-failures audit: axis G (PR #337) - Rust vs Python backend edge-case parity tests added for rank-deficient, extreme-scale, and constant-column inputs.
- SyntheticDiD diagnostic Y-normalization parity (PR #328) - extends the PR #312 catastrophic-cancellation fix from the main fit path into SyntheticDiDResults.in_time_placebo() and .sensitivity_to_zeta_omega(). Diagnostics now apply the same Y_shift / Y_scale normalization the main fit uses, pass zeta / Y_scale and a normalized min_decrease into Frank-Wolfe, then rescale att / pre_fit_rmse back to original-Y units.
- TROP bootstrap failure-rate guards (PR #324) - alternating-minimization bootstrap loops now emit a UserWarning on silent high-failure-rate runs (LOOCV and bootstrap aggregation paths both covered); attempt-count-based warning replaces the previous observation-count denominator that could silently mask sparse runs.
- simulate_power() failure-count surface + narrow except clause (PR #326) - power-simulation replicate loop narrows the exception whitelist from except Exception to estimation/data-path failures (TypeError and friends now propagate instead of being silently absorbed), and surfaces n_simulation_failures on SimulationPowerResults. Failure count included in summary() and to_dict().
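The two influence-function allocators named in the dCDH heterogeneity entry above can be compared on a toy group. This is a minimal sketch, not the library's implementation: the row layout (`g`, `t`, `w`) and both function names are hypothetical, and assigning zero mass to rows outside the post-period cell is our reading of "mass lands in the post-period PSU of the transition".

```python
from collections import defaultdict

def allocate_group_level(psi_g, rows):
    """Legacy allocator: psi_i = psi_g * (w_i / W_g), W_g summed over all rows of g."""
    total = defaultdict(float)
    for r in rows:
        total[r["g"]] += r["w"]
    return [psi_g[r["g"]] * r["w"] / total[r["g"]] for r in rows]

def allocate_cell_period(psi_g, out_idx, rows):
    """Cell-period allocator: psi_i = psi_g * (w_i / W_{g, out_idx}) on the
    post-period cell of each group; rows in other periods get zero (assumption)."""
    cell_total = defaultdict(float)
    for r in rows:
        if r["t"] == out_idx[r["g"]]:
            cell_total[r["g"]] += r["w"]
    out = []
    for r in rows:
        if r["t"] == out_idx[r["g"]]:
            out.append(psi_g[r["g"]] * r["w"] / cell_total[r["g"]])
        else:
            out.append(0.0)
    return out

# Toy data: one group "a" observed in periods 1 and 2; period 2 is the post-period cell.
rows = [
    {"g": "a", "t": 1, "w": 1.0},
    {"g": "a", "t": 2, "w": 3.0},
    {"g": "a", "t": 2, "w": 1.0},
]
psi_g = {"a": 2.0}
out_idx = {"a": 2}

group_alloc = allocate_group_level(psi_g, rows)
cell_alloc = allocate_cell_period(psi_g, out_idx, rows)
```

Both allocations sum to ψ_g within the group, which is why a PSU equal to the group sees the same PSU-level total under either allocator; they differ only in how the mass is spread across rows, which matters once PSUs vary within a group.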
3.1.3 - 2026-04-18
- Replicate-weight variance and PSU-level bootstrap for dCDH (PR #311) - ChaisemartinDHaultfoeuille now accepts variance_method="replicate" for BRR / Fay / JK1 / JKn / SDR inference, and PSU-level multiplier bootstrap when survey_design.psu is set. Adds df-aware inference (reduced effective df under replicate variance; propagated through delta / HonestDiD surfaces) plus group-level PSU map construction. Validated via per-cohort aggregation, shared-draw multi-horizon bootstrap alignment, and cross-surface df consistency.
- Zenodo DOI auto-minting configuration (PR #321) - .zenodo.json at repo root defines release metadata so the next GitHub Release automatically mints a Zenodo DOI (concept DOI + versioned DOI). Also adds a top-level LICENSE file for Zenodo archival.
- Silent sparse→dense lstsq fallback in ImputationDiD and TwoStageDiD (PR #319) - when the sparse solver fails and the dense fallback runs, the estimator now emits a UserWarning instead of silently switching paths. Regression tests assert the dense fallback SEs remain usable.
- Non-convergence signaling in TROP alternating-minimization solvers (PR #317) - the global- and local-TROP solvers now emit a UserWarning when the alternating-minimization loop exits without meeting tolerance, including LOOCV and bootstrap aggregation paths. Warnings aggregate at top-level call sites to avoid log spam.
- /bump-version skill updates CITATION.cff (PR #320) - internal release-management tooling now keeps CITATION.cff version: and date-released: in sync with the other version surfaces. Resolves a single RELEASE_DATE upfront (from the CHANGELOG header if pre-populated, else today's date) and threads it through all date-bearing files; fixes drift that caused v3.1.2 to ship with CITATION.cff still pinned at 3.1.1.
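The RELEASE_DATE resolution rule in the entry above (CHANGELOG header if pre-populated, else today's date) might look roughly like the sketch below. The function name `resolve_release_date` and the accepted header shapes (`## [3.1.3] - 2026-04-18` or the bare `3.1.3 - 2026-04-18`) are assumptions for illustration; the actual skill is internal tooling and is not shown in this changelog.

```python
import re
from datetime import date

def resolve_release_date(changelog_text, version, today=None):
    """Return the date on the changelog header for `version`, else today's date."""
    # Assumed header shapes: optional leading '#' marks and optional [brackets].
    pattern = re.compile(
        r"^(?:#+\s*)?\[?" + re.escape(version) + r"\]?\s*-\s*(\d{4}-\d{2}-\d{2})\s*$",
        re.MULTILINE,
    )
    match = pattern.search(changelog_text)
    if match:
        return date.fromisoformat(match.group(1))
    # Header not pre-populated: fall back to the current date.
    return today if today is not None else date.today()
```

Resolving the date once and passing the result to every file writer is what prevents the per-file drift described above.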
3.1.2 - 2026-04-18
- SyntheticDiD catastrophic cancellation at extreme Y scale (PR #312) - the Frank-Wolfe weight solver lost precision when outcome magnitudes were very large or very small; results are now numerically stable across scales.
- Non-convergence signaling in FE imputation alternating-projection solvers (PR #314) - ImputationDiD, TwoStageDiD, and shared within_transform now emit a UserWarning when the alternating-projection / weighted-demean loop exits without meeting the tolerance. max_iter and tol are documented on within_transform.
- Non-convergence signaling in SyntheticDiD Frank-Wolfe solver (PR #315) - the numpy-path Frank-Wolfe SC weight solver now emits a UserWarning when the loop exits without meeting min_decrease. Wrapper-level and max_iter=0 regression tests added.
- Refresh ROADMAP.md to drop top-level phase numbering and reflect shipped state through v3.1.1 (PR #313). Absorbs dCDH into the Current State estimator list; adds Recently Shipped summary; reorganizes open work as Shipping Next / Under Consideration / AI-Agent Track / Long-term. Updates docs/business-strategy.md, docs/survey-roadmap.md, docs/practitioner_decision_tree.rst, docs/choosing_estimator.rst, docs/api/chaisemartin_dhaultfoeuille.rst, README.md, and diff_diff/guides/llms-full.txt to remove stale phase-deferral language now that the deferred items have shipped.
- Bump the SyntheticDiD(lambda_reg=...) and SyntheticDiD(zeta=...) deprecation warnings' removal target from v3.1 to v4.0.0. Removing public kwargs in a patch / minor release would violate Semantic Versioning; the deprecation stays warning-only throughout the 3.x line and will be removed in the next major release. Use zeta_omega / zeta_lambda instead.
3.1.1 - 2026-04-16
- Jackknife variance estimation for SyntheticDiD - variance_method='jackknife' implements the delete-one-unit jackknife from Arkhangelsky et al. (2021) Section 5. Supports both standard and survey-weighted jackknife with automatic pweight propagation. Validated against R synthdid package.
- LinkedIn carousel for dCDH estimator announcement (carousel/diff-diff-dcdh-carousel.pdf)
3.1.0 - 2026-04-14
- dCDH Phase 3: Complete feature set for ChaisemartinDHaultfoeuille - three sub-releases completing the estimator:
  - Phase 3a (PR #300): Placebo SE via multiplier bootstrap (resolves Phase 1 deferral), non-binary treatment support with crossing-cell detection and automatic cell dropping, R parity SE assertions tightened
  - Phase 3b (PR #302): Covariate adjustment via controls parameter (OLS residualization, Design 2 per-period path for non-binary treatment), group-specific linear trends via trends_linear=True (absorbs group-specific slopes before DiD), R DIDmultiplegtDYN parity tests for covariates and trends
  - Phase 3c (PR #303): HonestDiD sensitivity analysis integration - honest_did() method on results with automatic event-study-to-sensitivity bridge, support trimming for non-consecutive horizons, l_vec target specification, Delta-RM and Delta-SD smoothness bounds
- ROADMAP.md updated: dCDH Phase 3 items marked shipped
3.0.2 - 2026-04-12
- ChaisemartinDHaultfoeuille (alias DCDH) - de Chaisemartin & D'Haultfœuille estimator for non-absorbing (reversible) treatments. The only modern staggered DiD estimator that handles treatment switching on AND off. Implements DID_M from AER 2020, validated against R DIDmultiplegtDYN v2.3.3. Ships Phases 1 and 2:
  - Phase 1: headline DID_M with analytical SE, joiners/leavers decompositions, single-lag placebo, multiplier bootstrap, TWFE decomposition diagnostic
  - Phase 2: multi-horizon event study (L_max), dynamic placebos, normalized estimator, cost-benefit aggregate (Lemma 4), sup-t simultaneous confidence bands, plot_event_study() integration
- twowayfeweights() - standalone TWFE decomposition diagnostic (Theorem 1, AER 2020)
- generate_reversible_did_data() - reversible-treatment panel data generator with 7 switch patterns
- Survey-aware power analysis - analytical helpers (compute_power(), compute_mde(), compute_sample_size()) accept a deff parameter for design-effect adjustment. Simulation helpers (simulate_power, simulate_mde, simulate_sample_size) accept a survey_config (SurveyPowerConfig) that generates data with complex survey structure and injects a SurveyDesign into each simulated fit.
- aggregate_survey() second_stage_weights parameter - choose "pweight" (default, population weights) or "aweight" (precision weights). pweight output is compatible with all survey-capable estimators; aweight is opt-in for GLS efficiency with estimators marked Full in the survey support matrix.
- conditional_pt parameter on generate_survey_did_data() - simulates scenarios where unconditional parallel trends fail but conditional PT holds after covariate adjustment
- Tutorial 18: Geo-Experiment Analysis (18_geo_experiments.ipynb) - SyntheticDiD walkthrough for marketing analytics: simulated DMA panel, 5 treated markets, fit + diagnostics + stakeholder summary
- Practitioner decision tree (docs/practitioner_decision_tree.rst) - "which method fits my business problem?" guide
- Practitioner getting started guide (docs/practitioner_getting_started.rst) - end-to-end walkthrough with terminology bridge
- JOSS paper (paper.md, paper.bib) - software paper for Journal of Open Source Software submission
- CONTRIBUTORS.md - author and contributor credit
- Standalone CI Gate workflow (.github/workflows/ci-gate.yml) - doc-only PRs no longer block on path-filtered test workflows
- aggregate_survey() default second-stage weights changed from aweight (precision) to pweight (population). Users who need the old precision-weighting behavior can pass second_stage_weights="aweight".
- README "For Data Scientists" section with practitioner-facing links and aggregate_survey() documentation
- CITATION.cff updated with version and release date
- ROADMAP.md updated: B1a-d marked done, B2b marked done, B3d marked shipped, dCDH entry updated with correct citations
- Doc-only PRs no longer block indefinitely on CI Gate (standalone gate workflow runs on all PRs regardless of path filters)
- aggregate_survey() docs no longer overclaim universal estimator compatibility - explicitly document aweight/pweight restrictions per the survey support matrix
3.0.1 - 2026-04-07
- aggregate_survey() - new function in diff_diff.prep that bridges individual-level survey microdata to geographic-period panels for DiD estimation. Computes design-based cell means and precision weights using domain estimation (Lumley 2004), with SRS fallback for small cells. Returns a panel DataFrame and pre-configured SurveyDesign for second-stage estimation. Supports both TSL and replicate-weight variance.
- Python 3.14 support - upgraded PyO3 from 0.22 to 0.28, updated CI and publish workflow matrices, bumped Rust MSRV to 1.84 for faer 0.24 compatibility.
- Updated README Python support matrix to include 3.14
- Fix domain estimation zero-padding for correct design-based cell variance
- Fix SRS fallback weight normalization for scale invariance across replicate designs
- Validate numeric dtype for outcomes/covariates before aggregation (nullable dtype support)
- Validate grouping columns for NaN values
3.0.0 - 2026-04-07
v3.0 completes the survey support roadmap: all 16 estimators (15 inference-level +
BaconDecomposition diagnostic) now accept survey_design. See v2.8.0–v2.9.1 entries
for the full feature history leading to this release.
- Remove bootstrap_weight_type parameter from CallawaySantAnna - use bootstrap_weights instead (deprecated since v1.0.1)
- Remove TROP method="twostep" alias - use method="local" (deprecated since v2.7.2)
- Remove TROP method="joint" alias - use method="global" (deprecated since v2.7.2)
- CallawaySantAnna(bootstrap_weight_type="mammen") → CallawaySantAnna(bootstrap_weights="mammen")
- TROP(method="twostep") → TROP(method="local")
- TROP(method="joint") → TROP(method="global")
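For codebases that build estimator arguments dynamically, the three renames above could be applied mechanically before calling the 3.0 constructors. The helper below is a hypothetical sketch (`migrate_kwargs` and both lookup tables are not part of the library); it only encodes the mappings listed in this entry.

```python
import warnings

# Keyword renames from the 3.0.0 notes: {estimator name: {old kwarg: new kwarg}}.
RENAMED_KWARGS = {
    "CallawaySantAnna": {"bootstrap_weight_type": "bootstrap_weights"},
}
# Removed value aliases: {estimator name: {kwarg: {old value: new value}}}.
RENAMED_VALUES = {
    "TROP": {"method": {"twostep": "local", "joint": "global"}},
}

def migrate_kwargs(estimator_name, kwargs):
    """Return a copy of kwargs with pre-3.0 names and aliases translated."""
    key_map = RENAMED_KWARGS.get(estimator_name, {})
    value_map = RENAMED_VALUES.get(estimator_name, {})
    migrated = {}
    for key, value in kwargs.items():
        new_key = key_map.get(key, key)
        new_value = value
        # Only string values can be aliases; this also avoids hashing lists or arrays.
        if isinstance(value, str):
            new_value = value_map.get(new_key, {}).get(value, value)
        if new_key != key or new_value != value:
            warnings.warn(
                f"{estimator_name}: {key}={value!r} was removed in 3.0; "
                f"using {new_key}={new_value!r}",
                DeprecationWarning,
                stacklevel=2,
            )
        migrated[new_key] = new_value
    return migrated
```

A call such as `migrate_kwargs("TROP", {"method": "twostep"})` yields `{"method": "local"}`, which can then be unpacked into the estimator constructor.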
- SyntheticDiD lambda_reg and zeta parameters formally scheduled for removal in v3.1 - use zeta_omega / zeta_lambda instead
- Internal attribute bootstrap_weight_type renamed to bootstrap_weights in bootstrap mixin and StaggeredTripleDifference for consistency
- TROP set_params() now validates method against ("local", "global") - previously only validated in __init__
- Documentation updated: all survey gap notes for WooldridgeDiD removed, ROADMAP Phase 10 items marked shipped
2.9.1 - 2026-04-06
- Survey theory document (docs/methodology/survey-theory.md) - formal justification for design-based variance estimation with modern DiD influence functions, citing Binder (1983), Rao & Wu (1988), Shao (1996)
- Research-grade survey DGP - 8 new parameters on generate_survey_did_data(): icc, weight_cv, informative_sampling, heterogeneous_te_by_strata, te_covariate_interaction, covariate_effects, strata_sizes, return_true_population_att. All backward-compatible.
- R validation expansion - 4 additional estimators cross-validated against R's survey::svyglm(): ImputationDiD, StackedDiD, SunAbraham, TripleDifference. Survey R validation coverage now 8 of 16 estimators.
- LinkedIn carousel for Wooldridge ETWFE estimator announcement
- Survey tutorial rewritten: leads with "Why Survey Design Matters" section showing flat-weight vs design-based comparison with known ground truth, coverage simulation, and false pre-trend detection rates
- Documentation refresh: ROADMAP.md, llms.txt, llms-full.txt, llms-practitioner.txt, choosing_estimator.rst updated for v2.9.0 — added WooldridgeDiD and StaggeredTripleDifference, DDD flowchart branch, standardized estimator counts, qualified survey claims
- Survey roadmap updated: Phase 10a-10d marked shipped, conditional PT noted for 10e
- Fix stale "EfficientDiD covariates + survey not supported" note in choosing_estimator.rst
- Fix WooldridgeDiD described as "ASF-based" for OLS path (OLS uses direct coefficients; ASF only for logit/Poisson)
- Fix dead StaggeredTripleDifference API link in llms.txt
- Fix survey example attribute: .design_effect not .deff in llms-full.txt
- Fix subpopulation() example to show tuple unpacking in llms-full.txt
- Remove 8 resolved items from TODO.md
2.9.0 - 2026-04-04
- WooldridgeDiD (ETWFE) estimator - Extended Two-Way Fixed Effects from Wooldridge (2025, 2023). Supports OLS, logit, and Poisson QMLE paths with ASF-based ATT and delta-method SEs. Four aggregation types (simple, group, calendar, event) matching Stata jwdid_estat. Alias: ETWFE. (PR #216, thanks @wenddymacro)
- EfficientDiD survey + covariates - doubly robust covariate path now threads survey weights through all four nuisance estimation stages (outcome regression, propensity ratio sieve, inverse propensity sieve, kernel-smoothed conditional Omega*). Previously raised NotImplementedError.
- Survey real-data validation (Phase 9) - 15 cross-validation tests against R's survey package using three real federal survey datasets:
  - API (R survey package): TSL variance with strata, FPC, subpopulations, covariates, and Fay's BRR replicates
  - NHANES (CDC/NCHS): TSL variance with strata + PSU + nest=TRUE, validating the ACA young adult coverage provision DiD
  - RECS 2020 (U.S. EIA): JK1 replicate weight variance with 60 pre-computed replicate columns
  - ATT, SE, df, and CI match R to machine precision (< 1e-10) where directly comparable; known deviations documented in REGISTRY.md (TWFE SE differs due to unit FE absorption; subpopulation df differs due to strata preservation)
- Label-gated CI - test workflows now require ready-for-ci label before running, reducing wasted CI during AI review rounds. AI review workflow always runs.
- Documentation dependency map (docs/doc-deps.yaml) - maps source files to impacted documentation. New /docs-impact skill flags which docs need updating when source files change.
- WooldridgeDiD: full interacted covariate basis (D_g × X, f_t × X) for OLS path
- /submit-pr, /push-pr-update, /pre-merge-check, /docs-check skills updated for label-gated CI and doc-deps workflow
- Fix WooldridgeDiD OLS unbalanced demeaning and nonlinear never-treated identification
- Fix WooldridgeDiD Poisson dropped-cell bug and anticipation propagation
- Fix EfficientDiD IF-scale mismatch in survey aggregation and zero-weight never-treated guard
- Fix bootstrap clustering and delta-method reduced space in WooldridgeDiD
2.8.4 - 2026-04-04
- SDR replicate method (Phase 8a) - Successive Difference Replication for ACS PUMS users. SurveyDesign(replicate_method="SDR") with variance formula V = 4/R * sum((theta_r - theta)^2).
- FPC support for ImputationDiD and TwoStageDiD (Phase 8b) - finite population correction now threaded through TSL variance for both estimators.
- Lonely PSU "adjust" in bootstrap (Phase 8d) - lonely_psu="adjust" now works with survey-aware bootstrap (previously raised NotImplementedError). Uses Rust & Rao (1996) grand-mean centering.
- CV on estimates (Phase 8e) - coef_var property on all results objects (SE/estimate). Handles edge cases (SE=0, estimate=0).
- Weight trimming utility (Phase 8e) - trim_weights(data, weight_col, upper=None, lower=None, quantile=None) in prep.py for capping extreme survey weights.
- ImputationDiD pretrends + survey (Phase 8e) - pre-trends F-test now survey-aware using subpopulation approach for correct variance under complex designs.
- Updated ImputationDiD tutorial to demonstrate pretrends=True event study
- Updated survey tutorial: narrative improvements, chart rendering fixes
- Fix survey pretrend F-test df calculation and rank-deficient survey VCV handling
- Fix trim_weights NaN poisoning when weight column contains missing values
- Fix single-singleton PSU warning for lonely_psu="adjust"
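The SDR variance formula quoted in the Phase 8a entry above, V = 4/R * sum((theta_r - theta)^2), is short enough to state directly. The sketch below is a standalone illustration with a hypothetical function name; it takes replicate estimates as plain floats and does not call the library.

```python
def sdr_variance(theta_full, theta_reps):
    """Successive Difference Replication variance: (4 / R) * sum((theta_r - theta)^2).

    theta_full: estimate computed with the full-sample weights.
    theta_reps: estimates recomputed under each of the R replicate weight columns.
    """
    n_reps = len(theta_reps)
    if n_reps == 0:
        raise ValueError("at least one replicate estimate is required")
    squared_devs = sum((theta_r - theta_full) ** 2 for theta_r in theta_reps)
    return 4.0 / n_reps * squared_devs

# Four replicate estimates, each 0.5 away from the full-sample estimate of 1.0.
variance = sdr_variance(1.0, [1.5, 0.5, 1.5, 0.5])
standard_error = variance ** 0.5
```

Note that deviations are taken around the full-sample estimate rather than the mean of the replicates, as the formula in the entry specifies.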
2.8.3 - 2026-04-02
- Silent operation warnings - 8 operations that previously altered analysis results without informing the user now emit UserWarning:
  - TROP lstsq → pseudo-inverse numerical fallback
  - TwoStageDiD NaN masking of unidentified fixed effects (zeroed out with treatment indicator)
  - TwoStageDiD always-treated unit removal (sample size change)
  - CallawaySantAnna silent (g,t) pair skipping (zero treated or control observations)
  - TROP missing treatment indicator fill with 0 (control)
  - Rust → Python backend fallback (previously debug log only)
  - Survey weight normalization (pweights/aweights rescaled to mean=1)
  - np.inf → 0 never-treated convention conversion
- ImputationDiD pre-period event study coefficients — pre-treatment "effects" (should be ~0 under parallel trends) for visual pre-trends assessment, following BJS (2024) Test 1
- TwoStageDiD pre-period event study coefficients — same pre-trends extension
- Replicate weight expansion to 7 additional estimators: DifferenceInDifferences, TwoWayFixedEffects, MultiPeriodDiD, SunAbraham, StackedDiD, ImputationDiD, TwoStageDiD (coverage: 4/13 → 11/13)
- ImputationDiD pre-period coefficients use BJS Test 1 (impute Y(0) for treated units in pre-treatment periods)
- SunAbraham replicate weights use full interaction-weighted refit per replicate with cohort-level SEs
- Fix zero-weight demeaning safety in replicate weight paths
- Fix df_survey writeback for rank-deficient replicate designs (df=0)
- Fix ImputationDiD balance_e zero-qualifying-cohort fallback in pretrends path
- Fix survey zero-mass (g,t) skip warning gap
- Fix SunAbraham positional assignment in replicate loop
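The pattern behind the 2.8.3 silent-operation warnings is "do the adjustment, but say so". As one illustration, the survey weight normalization item (weights rescaled to mean=1) could be sketched as below. The function name is hypothetical, and warning only when a rescale actually changes the weights is an assumption, not documented library behavior.

```python
import warnings

def normalize_weights(weights):
    """Rescale weights to mean 1, warning when the input is actually altered."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    mean = total / n
    if abs(mean - 1.0) > 1e-12:
        # Surface the change instead of silently modifying user input.
        warnings.warn(
            f"weights rescaled to mean=1 (original mean was {mean:g})",
            UserWarning,
            stacklevel=2,
        )
    return [w / mean for w in weights]
```

Callers who consider the rescale expected can filter the warning explicitly with the standard `warnings` machinery, which keeps the opt-out visible in their own code.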
2.8.2 - 2026-04-02
- EPV diagnostics for propensity score logit - events-per-variable (EPV) checks with Peduzzi convention (predictors excluding intercept) for CallawaySantAnna IPW/DR, TripleDifference IPW/DR, and StaggeredTripleDifference
- epv_summary() / epv_diagnostics on post-fit results for CallawaySantAnna, TripleDifference, and StaggeredTripleDifference
- diagnose_propensity() pre-estimation helper on CallawaySantAnna
- EPV summary block in TripleDifference summary() output
- epv_threshold parameter for propensity score estimation - warns on low EPV (default) or escalates via rank_deficient_action="error"
- Default propensity score fallback behavior: safer defaults with method-specific warning messages
- EPV denominator uses predictor count excluding intercept (Peduzzi et al. 1996 convention)
- Fix TripleDifference survey-weighted fallback propensity score
- Fix NaN cache poisoning in propensity score estimation
- Fix epv_summary column schema on empty results
- Fix SDDD EPV: use min-EPV across comparison cohorts with cache diagnostic propagation
- Fix diagnose_propensity np.inf handling
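The EPV check described in the 2.8.2 entries can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: "events" is taken as the smaller of the treated and control counts, the default threshold of 10 is the usual rule of thumb rather than a value given in this changelog, and both function names are hypothetical. The predictor count excludes the intercept, as the entry specifies.

```python
import warnings

def events_per_variable(n_treated, n_control, n_predictors):
    """EPV = events / predictors, with the intercept excluded from the count."""
    if n_predictors < 1:
        raise ValueError("need at least one non-intercept predictor")
    # Assumption: the 'event' count is the size of the smaller class.
    return min(n_treated, n_control) / n_predictors

def check_epv(n_treated, n_control, n_predictors, epv_threshold=10.0, action="warn"):
    """Warn (default) or raise when EPV falls below the threshold."""
    epv = events_per_variable(n_treated, n_control, n_predictors)
    if epv < epv_threshold:
        message = f"low EPV for propensity logit: {epv:.2f} < {epv_threshold:g}"
        if action == "error":
            raise ValueError(message)
        warnings.warn(message, UserWarning, stacklevel=2)
    return epv
```

With 30 treated units, 200 controls, and 5 covariates, the sketch gives an EPV of 6, which would trigger the warning under the assumed threshold.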
2.8.1 - 2026-04-01
- Survey-aware DiD tutorial (docs/tutorials/16_survey_did.ipynb) - Phase 7c complete. Full workflow with strata, PSU, FPC, replicate weights, subpopulation analysis, and DEFF diagnostics. Includes generate_survey_did_data() DGP function.
- Survey R cross-validation - benchmark scripts and tests comparing TSL variance against R's survey::svyglm() for basic DiD and TWFE with full survey designs (strata, PSU, FPC). Committed JSON fixtures for CI without R.
- HonestDiD methodology review and validation - 478 lines of methodology tests, paper review document, rewritten optimal FLCI with first-difference reparameterization.
- StaggeredTripleDifference survey support - full SurveyDesign integration with strata/PSU/FPC, replicate weights, and survey-aware bootstrap.
- HonestDiD: rewrite optimal FLCI with proper first-difference reparameterization and centrosymmetric LP optimization
- HonestDiD: use conf_int from results instead of hardcoded 1.96*se in event study plots
- Survey tutorial cross-referenced from choosing_estimator.rst and quickstart.rst
- Fix HonestDiD identified set computation and inference (F1-F6 from Rambachan & Roth 2023)
- Fix FLCI slope count (T not T-1) and constraint formula
- Fix NaN CI misclassification as significant (P0 finding)
- Fix M=0 linear extrapolation and survey df folded nct in REGISTRY.md
- Fix replicate-weight scale invariance and BRR test fixtures
- Fix JK1 populated-PSU guard and narrow warning filter
2.8.0 - 2026-03-31
- Staggered Triple Difference estimator (Ortiz-Villavicencio & Sant'Anna 2025)
- StaggeredTripleDifference class with group-time ATT(g,t) for DDD designs with staggered adoption
- Event study aggregation, pre-treatment placebo effects, multiplier bootstrap inference
- R benchmark validation against triplediff package
- DGP function generate_staggered_ddd_data() for simulation and testing
- Survey Phase 7a: CS IPW/DR + covariates + survey
- DRDID panel nuisance-estimation IF corrections (PS + OR) under survey weights
- Survey-weighted propensity score estimation and outcome regression
- IFs account for nuisance parameter estimation uncertainty (Sant'Anna & Zhao 2020, Theorem 3.1)
- Survey Phase 7b: Repeated cross-sections
- CallawaySantAnna(panel=False) for repeated cross-section surveys (BRFSS, ACS, CPS)
- Cross-sectional DRDID: reg matches DRDID::reg_did_rc, dr matches DRDID::drdid_rc, ipw matches DRDID::std_ipw_did_rc
- Survey weights, covariates, and all estimation methods supported
- Survey Phase 7d: HonestDiD + survey variance
- Survey df and full event-study VCV from IF vectors propagated to sensitivity analysis
- t-distribution critical values with survey degrees of freedom
- Bootstrap/replicate designs fall back to diagonal VCV with warning
- Plotly visualization styling: thread marker, markersize, linewidth, capsize, ci_linewidth kwargs through plotly backends (previously silently ignored)
- AI agent discoverability for practitioner guide
- HonestDiD now raises ValueError on non-consecutive event-time grid (was warning)
- HonestDiD validates full grid around reference period
- Panel IPW/DR PS correction scaling matches R's H/n, asy_rep/n, colMeans convention
- RC IF normalization follows R's psi convention with explicit phi conversion
- Fix HonestDiD reference-aware pre/post split for varying-base event studies
- Fix HonestDiD _estimate_max_pre_violation to use reference-aware pre_periods
- Fix panel M2 gradient scaling for IPW/DR nuisance IF corrections
- Fix VCV index alignment for repeated cross-section aggregation
- Fix replicate-weight df propagation: return per-statistic df instead of mutating shared state
- Fix WIF population consistency: zero df first_treat for ineligible units
- Fix bootstrap RCS cohort-mass weighting and stale event-study VCV reset
2.7.6 - 2026-03-28
- AI practitioner guardrails based on Baker et al. (2025) "Difference-in-Differences Designs: A Practitioner's Guide"
- practitioner.py module with 8-step workflow enforcement for AI agents
- Estimator-specific handlers ensuring correct diagnostic ordering (pre-trends before estimation, Bacon decomposition before estimator selection)
- docs/llms.txt, docs/llms-practitioner.txt, docs/llms-full.txt for AI agent discoverability
- Evaluation rubric (docs/practitioner-guide-evaluation.md) with correctness-aware scoring
- Survey Phase 6: Advanced features
- Survey-aware bootstrap for all 8 bootstrap-using estimators (PSU-level multiplier for CS/Imputation/TwoStage/Continuous/Efficient; Rao-Wu rescaled for SA/SyntheticDiD/TROP)
- Replicate weight variance estimation (BRR, Fay's BRR, JK1, JKn) for OLS-based and IF-based estimators
- Per-coefficient DEFF diagnostics comparing survey vs SRS variance
- Subpopulation analysis via SurveyDesign.subpopulation() preserving full design structure
- CS analytical expansion: strata/PSU/FPC for aggregated SEs via compute_survey_if_variance()
- TROP cross-classified pseudo-strata for survey-aware bootstrap
- Estimator-specific guidance for parallel trends tests and placebo checks (no shared templates)
- SDiD and TROP split into separate decision tree branches in practitioner workflow
- Fix replicate weight df calculation using pivoted QR rank with R-compatible tolerance
- Fix replicate IF variance score scaling for EfficientDiD, TripleDiff, ContinuousDiD
- Fix panel-to-unit replicate weight propagation and normalization
- Fix CS zero-mass return type and vectorized guard for survey paths
- Fix solve_logit effective-sample validation for zero-weight designs
- Fix subpopulation mask validation and EfficientDiD bootstrap guard
2.7.5 - 2026-03-23
- Phase 4 survey support for ImputationDiD, TwoStageDiD, and CallawaySantAnna estimators
- ImputationDiD/TwoStageDiD: analytical survey inference with weights, strata, and PSU (FPC not supported; bootstrap+survey deferred)
- CallawaySantAnna: weights-only analytical IF/WIF inference matching R did::wif() (strata/PSU/FPC deferred)
- Survey-aware aggregation for group-time, event-study, and overall ATT
- EfficientDiD enhancements: doubly robust covariates path, sieve inverse propensity (Eq 3.12), conditional Omega*
- Cluster-robust SEs for EfficientDiD with last-cohort control and Hausman pretest
- Enhanced visualizations: synth weights, staircase, dose-response, group-time heatmap, plotly backend
- Local AI review skill (/ai-review-local) with Responses API, delta-diff re-review, and cost visibility
- Add plotly optional dependency group (pip install diff-diff[plotly])
- Migrate AI local review from Chat Completions to Responses API
- Split TROP estimator into mixin modules (trop_local.py, trop_global.py) for maintainability
- Refactor visualization.py into visualization/ subpackage
- Improve review script: full-file context, content-first parsing, tiered matching, fingerprint stability
- Fix CallawaySantAnna reg+cov control IF normalization and survey df calculation
- Fix TripleDifference TSL double-weighting and RA nuisance linearization with survey weights
- Fix ContinuousDiD bread normalization, fweight TSL scaling, and weighted-mass IF linearization
- Fix BaconDecomposition exact-weight survey unit_share and empty-cell guard
- Fix SunAbraham survey weight floor in overall ATT aggregation
- Fix plotly event study for non-numeric periods, heatmap masking, color parser
2.7.4 - 2026-03-21
- Survey/sampling weights support (survey_design parameter) for DifferenceInDifferences and TwoWayFixedEffects
- Taylor-series linearization (TSL) variance estimation with stratified multi-stage designs
- Probability weights (pweight), frequency weights (fweight), and analytic weights (aweight)
- Finite population correction (FPC) support
- PSU-based clustering with lonely PSU handling
- New diff_diff/survey.py module with SurveyDesign and compute_survey_vcov
- EfficientDiD validation tests against Chen, Sant'Anna & Xie (2025) using HRS dataset
- HRS validation fixture with provenance documentation
- Shared DGP helper in tests/helpers/edid_dgp.py
- Simulation-based power analysis for all registry-backed estimators (MDE, sample size, power curves); unregistered estimators supported via custom data_generator and result_extractor
- Extend power analysis to support all registry-backed estimators with result_extractor parameter
- Update power analysis tutorial with simulation-based features
- Reject absorb + fixed_effects combination (FWL violation) in both survey and non-survey paths
- TWFE cluster-as-PSU injection for no-PSU survey designs
- Non-unique PSU labels across strata with nest=False
- FPC validation moved to compute_survey_vcov for effective PSU structure
- Survey HC1 meat formula and weighted rank-deficiency handling
- Zero-SE inference, full-census FPC, fweight contract corrections
- Bootstrap+survey fallback in MultiPeriodDiD
- DDD _snap_nfloor mismatch and n_per_cell suppression scope
2.7.3 - 2026-03-19
- Add aarch64 Linux wheel builds to publish workflow
- Improve documentation information architecture
- Fix silent interpreter skip and consolidate Linux jobs in publish workflow
2.7.2 - 2026-03-18
- SEO infrastructure: meta tags, sitemap, llms.txt/llms-full.txt for AI discoverability
- Rename TROP method="twostep" to method="local"; "twostep" deprecated, removal in v3.0
- Rename internal TROP _joint_* methods to _global_* for consistency
- Fix TROPResults schema: report unit counts not observation counts
- Fix llms-full.txt accuracy and dynamic canonical URLs
2.7.1 - 2026-03-15
- Replace BFGS logit with IRLS for propensity score estimation in CallawaySantAnna
- Reject pscore_trim=0.0 to prevent infinite IPW weights
- Honor rank_deficient_action="error" in propensity score paths
- Validate pscore_trim at fit() to guard against set_params bypass
- Mark slow tests (@pytest.mark.slow) and exclude by default for faster local iteration
- Use per-class slow markers in test_trop.py for faster pure Python CI
- Vectorize Sun-Abraham bootstrap resampling loop for improved performance
2.7.0 - 2026-03-15
- EfficientDiD estimator (EfficientDiD) implementing Chen, Sant'Anna & Xie (2025) efficient DiD
- CallawaySantAnna event study SEs (WIF-based) and simultaneous confidence bands (sup-t)
- R comparison tests for event-study SEs and cband critical values
- Non-finite outcome validation in EfficientDiD.fit()
- CallawaySantAnna speed benchmarks with baseline results
- Estimator alias documentation in README, quickstart, and API docs
- BREAKING: TROP nuclear norm solver step size fix. The proximal gradient threshold for the L matrix (both method="global" and method="twostep" with finite lambda_nn) was over-shrinking singular values by a factor of 2. The soft-thresholding threshold was λ_nn/max(δ) when the correct value is λ_nn/(2·max(δ)), derived from the Lipschitz constant L_f=2·max(δ) of the quadratic gradient. This fix produces higher-rank L matrices and closer agreement with exact convex optimization solutions. Users with finite lambda_nn will observe different ATT estimates. Added FISTA/Nesterov acceleration to the twostep inner solver for faster L convergence.
- Add (1-W) weight masking to TROP global method, rename joint→global
- Optimize CallawaySantAnna covariate path with Cholesky and pscore caching
- Update Codex AI review model from gpt-5.2-codex to gpt-5.4
- Fix CallawaySantAnna event study SEs (missing WIF) and simultaneous confidence bands
- Fix analytical and bootstrap WIF pg scaling to use global N
- Fix TROP nuclear norm solver threshold scaling for non-uniform weights
- Fix stale coefficients in TROP global low-rank solver and NaN bootstrap poisoning
- Fix NaN-cell preservation in CallawaySantAnna balance_e aggregation
- Fix not-yet-treated cache keys and dropped-cell warning
- Fix rank-deficiency handling with Cholesky rank checks and reduced-column solve
- Fix Rust convergence criterion, n_valid_treated consistency, and NaN bootstrap SE
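The effect of the 2.7.0 threshold correction (λ_nn/max(δ) versus λ_nn/(2·max(δ))) can be seen in a small sketch of singular-value soft-thresholding. The helper name, the toy matrix `M`, and the weight matrix `delta` are assumptions chosen for illustration; this is the generic proximal operator of the nuclear norm, not the library's solver.

```python
import numpy as np

def soft_threshold_singular_values(M, threshold):
    """Hypothetical helper: shrink each singular value of M by `threshold`, floored at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - threshold, 0.0)
    return (U * s_shrunk) @ Vt

lambda_nn = 1.0
delta = np.array([[0.5, 1.0], [0.25, 0.75]])   # toy weights, max(δ) = 1.0
M = np.diag([3.0, 0.8])                        # toy matrix with singular values 3.0 and 0.8

old_threshold = lambda_nn / delta.max()            # pre-fix threshold
new_threshold = lambda_nn / (2.0 * delta.max())    # corrected threshold

L_old = soft_threshold_singular_values(M, old_threshold)
L_new = soft_threshold_singular_values(M, new_threshold)

rank_old = np.linalg.matrix_rank(L_old)
rank_new = np.linalg.matrix_rank(L_new)
```

In this toy case the larger pre-fix threshold zeroes out the second singular value, while the corrected threshold keeps it, which is the "higher-rank L" behavior the entry describes.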
2.6.1 - 2026-03-08
- Short aliases for all estimators (e.g., DiD, TWFE, EventStudy, CS, SDiD)
- Update roadmap for v2.6.0: reflect completed work and refresh priorities
- Add ContinuousDiD to ReadTheDocs API reference and choosing guide
- Add SPT identification caveat and data requirements per review
- Add time-invariant dose requirement to data requirements
- Fix alias docs wording: clarify TROP has no alias
- Fix ContinuousDiD SE method: influence function, not delta method
- Fix methodology doc: influence functions, not delta method for ContinuousDiD SEs
- Fix dollar sign escaping in continuous DiD tutorial
- Fix continuous DiD tutorial formatting: escape dollar signs and split chart cell
- Fix methodology claims and slide numbering per PR review
2.6.0 - 2026-02-22
- Continuous DiD estimator (ContinuousDiD) implementing Callaway, Goodman-Bacon & Sant'Anna (2024) for continuous treatment dose-response analysis
- ContinuousDiDResults with dose-response curves and event-study effects
- DoseResponseCurve with bootstrap p-values
- Analytical and bootstrap event-study SEs
- P(D=0) warning for low-probability control groups
- Stacked DiD tutorial (Tutorial 13) with Q-weight computation walkthrough
- Clarify aggregate Q-weight computation for unbalanced panels in Stacked DiD tutorial
- Replace SunAbraham manual bootstrap stats with NaN-gated utility
- Fix not-yet-treated control mask to respect anticipation parameter in ContinuousDiD
- Guard non-finite original_effect in compute_effect_bootstrap_stats
- Fix bootstrap NaN propagation for rank-deficient cells
- Fix NaN propagation in rank-deficient spline predictions
- Guard bootstrap NaN propagation: SE/CI/p-value all NaN when SE invalid
- Fix bootstrap ACRT^{glob} centering bug
- Fix bootstrap percentile inference and analytical event-study SE scaling
- Fix control group bug and dose validation in ContinuousDiD
2.5.0 - 2026-02-19
- Stacked DiD estimator (StackedDiD) implementing Wing, Freedman & Hollingsworth (2024) with corrective Q-weights for compositional balance across event times
- Sub-experiment construction per adoption cohort with clean (never-yet-treated) controls
- IC1/IC2 trimming for compositional balance across event times
- Q-weights for aggregate, population, or sample share estimands (Table 1)
- WLS event study regression via sqrt(w) transformation
- stacked_did() convenience function
- R benchmark scripts for Stacked DiD validation (benchmarks/R/benchmark_stacked_did.R)
- Comprehensive test suite for Stacked DiD (tests/test_stacked_did.py)
- NaN inference handling in pure Python mode for edge cases
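The "WLS via sqrt(w) transformation" item in 2.5.0 relies on a standard identity: scaling each row of the design matrix and outcome by the square root of its weight and running OLS gives the weighted least-squares coefficients. The sketch below checks this on simulated data; all names and values are illustrative assumptions, not the StackedDiD implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)            # positive observation weights

# OLS on sqrt(w)-scaled data
sw = np.sqrt(w)
beta_sqrt, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Direct weighted normal equations: (X' W X) beta = X' W y
beta_wls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
```

Both routes return the same coefficients up to floating-point error, so any OLS routine can be reused for weighted estimation.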
2.4.3 - 2026-02-19
- Rewrite TripleDifference estimator to match R's triplediff::ddd(): all 3 estimation methods (DR, IPW, RA) now use three-DiD decomposition with influence function SE, achieving <0.001% relative difference from R across all 24 comparisons (4 DGPs × 3 methods × 2 covariate settings)
- Validate cluster column in TripleDifference for proper cluster-robust SEs
- Handle non-finite influence function propagation in TripleDifference edge cases
- Propensity score fallback uses Hessian-based SE when score optimization fails
- Improved R-squared consistency across estimation methods
- Fix low cell count warning and overlap detection in TripleDifference IPW
- Fix cluster SE computation to use functional (groupby) approach instead of loop
- Fix rank deficiency handling in TripleDifference regression adjustment
- 91 methodology verification tests for TripleDifference (tests/test_methodology_triple_diff.py)
- R benchmark scripts for triple difference validation (benchmarks/R/benchmark_triplediff.R)
- Update METHODOLOGY_REVIEW.md to reflect completed TripleDifference review
2.4.2 - 2026-02-18
- Conditional BLAS linking for Rust backend — Apple Accelerate on macOS, OpenBLAS on Linux. Pre-built wheels now use platform-optimized BLAS for matrix-vector and matrix-matrix operations across all Rust-accelerated code paths (weights, OLS, TROP). Windows continues using pure Rust (no external dependencies). Improves Rust backend performance at larger scales.
- rust_backend_info() diagnostic function in diff_diff._backend: reports compile-time BLAS feature status (blas, accelerate, openblas)
- Rust SDID backend performance regression at scale — Frank-Wolfe solver was 3-10x slower than pure Python at 1k+ scale
- Gram-accelerated FW loop for time weights: precomputes A^T@A, reducing per-iteration cost from O(N×T0) to O(T0) (~100x speedup per iteration at 5k scale)
- Allocation-free FW loop for unit weights: 1 GEMV per iteration (was 3), zero heap allocations (was ~8)
- Dispatch based on problem dimensions: Gram path when T0 < N, standard path when T0 >= N
- Rust backend now faster than pure Python at all scales
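The Gram-accelerated Frank-Wolfe idea from 2.4.2 can be sketched in NumPy for a plain simplex-constrained least-squares problem. This is a simplified illustration under stated assumptions (no regularization, a fixed 2/(k+2) step size, a hypothetical function name); the Rust solver's actual objective and step rule are not reproduced here. The point is that once G = AᵀA is precomputed, each iteration touches only length-T0 vectors.

```python
import numpy as np

def frank_wolfe_gram(A, b, n_iter=200):
    """Hypothetical sketch: minimize 0.5*||A w - b||^2 over the simplex using a precomputed Gram matrix."""
    G = A.T @ A                  # T0 x T0, computed once
    c = A.T @ b                  # length T0, computed once
    T0 = A.shape[1]
    w = np.full(T0, 1.0 / T0)    # start at the simplex center
    Gw = G @ w                   # running value of G @ w
    for k in range(n_iter):
        grad = Gw - c                        # gradient A'(A w - b) without touching A
        i = int(np.argmin(grad))             # best simplex vertex
        gamma = 2.0 / (k + 2.0)              # standard diminishing step size
        w *= (1.0 - gamma)
        w[i] += gamma
        Gw = (1.0 - gamma) * Gw + gamma * G[:, i]   # O(T0) update of G @ w
    return w, Gw

rng = np.random.default_rng(4)
A = rng.normal(size=(1000, 8))   # N = 1000 rows, T0 = 8 columns
b = rng.normal(size=1000)
w, Gw = frank_wolfe_gram(A, b)
```

Because the loop never multiplies by the N×T0 matrix A, the per-iteration cost depends on T0 only, which matches the dispatch rule of using the Gram path when T0 < N.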
2.4.1 - 2026-02-17
- Tutorial notebook for Two-Stage DiD (Gardner 2022) (docs/tutorials/12_two_stage_did.ipynb)
- Module splits for large files: ImputationDiD, TwoStageDiD, and TROP each split into separate results and bootstrap submodules
- Migrated remaining inline inference computations to safe_inference() utility
- Replaced @ operator with np.dot() at observation-dimension sites to avoid Apple M4 BLAS warnings
- Updated TODO.md and ROADMAP.md for accuracy post-v2.4.0
- Matplotlib import guards added to tutorials 11 and 12
- Various bug fixes from code quality cleanup (diagnostics, estimators, linalg, staggered, sun_abraham, synthetic_did, triple_diff)
2.4.0 - 2026-02-16
- Gardner (2022) Two-Stage DiD estimator (TwoStageDiD)
- Two-stage estimator: (1) estimate unit+time FE on untreated obs, (2) regress residualized outcomes on treatment indicators
- TwoStageDiDResults with overall ATT, event study, group effects, per-observation treatment effects
- TwoStageBootstrapResults for multiplier bootstrap inference on GMM influence function
- two_stage_did() convenience function for quick estimation
- Point estimates identical to ImputationDiD; different variance estimator (GMM sandwich vs. conservative)
- No finite-sample adjustments (raw asymptotic sandwich, matching R did2s)
- Proposition 5 detection for unidentified long-run horizons without never-treated units
- Workflow improvements to reduce PR review rounds
- Zero-observation horizons/cohorts producing se=0 instead of NaN in TwoStageDiD
- Edge case fixes for TwoStageDiD (PR review feedback)
- Grep PCRE patterns updated to use POSIX character classes
2.3.2 - 2026-02-16
- Python 3.13 support with upper version cap (>=3.9,<3.14)
- Sun-Abraham methodology review (PR #153)
- IW aggregation weights now use event-time observation counts (not group sizes)
- Normalize np.inf never-treated encoding before treatment group detection
- Add R benchmark scripts and methodology-aligned tests
- Use rank_deficient_action and np.errstate instead of broad RuntimeWarning filter in SDID tutorial
- Sun-Abraham bootstrap NaN propagation for non-finite ATT estimates
- Sun-Abraham df_adjustment off-by-one in analytical SE computation
- CI pandas compatibility for SunAbraham bootstrap inference
- SyntheticDiD tutorial: eliminate pre-treatment fit warnings
2.3.1 - 2026-02-15
- Fix docs/PyPI version mismatch (issue #146) — RTD now builds versioned docs from source
- Fix RTD docs build failure caused by Rust/maturin compilation timeout on ReadTheDocs
- Remove Rust outer-loop variance estimation for SyntheticDiD (placebo and bootstrap)
- Fixes SE mismatch between pure Python and Rust backends (different RNG sequences)
- Fixes Rust performance regression at 1k+ scale (memory bandwidth saturation from rayon parallelism)
- Inner Frank-Wolfe weight computation still uses Rust when available
- Re-run SyntheticDiD benchmarks against R after Frank-Wolfe methodology rewrite
- Updated docs/benchmarks.rst SDID validation results, performance tables, and known differences
- ATT now matches R to < 1e-10 (previously 0.3% diff) since both use Frank-Wolfe optimizer
2.3.0 - 2026-02-09
- Borusyak-Jaravel-Spiess (2024) Imputation DiD estimator (ImputationDiD)
- Efficient imputation estimator for staggered DiD designs
- OLS on untreated observations for unit+time FE, impute counterfactual Y(0), aggregate
- Conservative variance (Theorem 3) with aux_partition parameter for SE tightness
- Pre-trend test (Equation 9) via results.pretrend_test()
- Percentile bootstrap inference
- Influence-function bootstrap with sparse variance and weight/covariate fixes
- Absorbing-treatment validation for non-constant first_treat
- Empty event-study warning for unidentified long-run horizons
- /paper-review skill for academic paper methodology extraction
- /read-feedback-revise skill for addressing PR review comments
- --pr flag for /review-plan skill to review plans posted as PR comments
- --updated flag for /review-plan skill for re-reviewing revised plans
- MultiPeriodDiD vs R (fixest) benchmark for cross-language validation
- Shortened test suite runtime with parallel execution and reduced iterations
- TWFE within-transformation bug identified during methodology review
- TWFE: added non-{0,1} binary time warning, ATT invariance tests, and R fixture caching
- TWFE: single-pass demeaning, HC1 test fix, fixest coeftable comparison
- MultiPeriodDiD: added unit FE and NaN guard for R comparison benchmark
- Removed tracked PDF from repo and gitignored papers directory
2.2.1 - 2026-02-07
- MultiPeriodDiD: Full event-study specification (BREAKING)
- Treatment × period interactions now created for ALL periods (pre and post), not just post-treatment
- Pre-period coefficients available for parallel trends assessment
- Default reference period changed from first to last pre-period (e=-1 convention) with FutureWarning for one release cycle
- period_effects dict now contains both pre and post period effects
- to_dataframe() includes is_post column
- summary() output now shows pre-period effects section
- t_stat uses np.isfinite(se) and se > 0 guard (consistent with other estimators)
- Time-varying treatment warning when unit is provided and treatment varies within units (guides users toward ever-treated indicator D_i)
- unit parameter to MultiPeriodDiD.fit() for staggered adoption detection
- reference_period and interaction_indices attributes on MultiPeriodDiDResults
- pre_period_effects and post_period_effects convenience properties on results
- Pre-period section in summary() output with reference period indicator
- ValueError when reference_period is set to a post-treatment period
- Staggered adoption warning when treatment timing varies across units (with unit param)
- Informative KeyError when accessing reference period via get_effect()
- TROP variance_method parameter: Jackknife variance estimation removed. Bootstrap (the only method specified in Athey et al. 2025) is now always used. The variance_method field has also been removed from TROPResults.
- TROP max_loocv_samples parameter: Control observation subsampling removed from LOOCV tuning parameter selection. Equation 5 of Athey et al. (2025) explicitly sums over ALL control observations where D=0; the previous subsampling (default 100) was not specified in the paper. LOOCV now uses all control observations, making tuning fully deterministic. Inner LOOCV loops in the Rust backend are parallelized to compensate for the increased observation count.
- HonestDiD: filter non-finite period effects from MultiPeriodDiD results (prevents NaN propagation into sensitivity bounds; raises ValueError when no finite pre- or post-period effects remain)
- HonestDiD VCV extraction: now uses interaction sub-VCV instead of full regression VCV (via interaction_indices period → column index mapping)
- MultiPeriodDiD: avg_se guard now checks np.isfinite() (matches per-period pattern; prevents avg_t_stat=0/avg_p_value=1 when variance is infinite)
- HonestDiD: extraction now uses explicit pre-then-post ordering instead of sorted period labels (prevents misclassification when period labels don't sort chronologically)
- Backend-aware test parameter scaling for pure Python CI performance
- Lower TROP stratified bootstrap threshold floor from 11 to 5 for pure Python CI
2.2.0 - 2026-01-27
- Windows wheel builds using pure-Rust faer library for linear algebra (PR #115)
- Eliminates external BLAS/LAPACK dependencies (no OpenBLAS or Intel MKL required)
- Enables cross-platform wheel builds for Linux, macOS, and Windows
- Simplifies installation on all platforms
- Rust backend migrated from nalgebra/ndarray to faer (PR #115)
- OLS solver now uses faer's SVD implementation
- Robust variance estimation uses faer's matrix operations
- TROP distance calculations use faer primitives
- Maintains numerical parity with existing NumPy backend
- Rust backend numerical stability improvements (PR #115)
- Improved singular matrix detection with condition number checks
- NaN propagation in variance-covariance estimation
- Fallback to Python backend on numerical instability with warning
- Underdetermined SVD handling (n < k case)
- macOS CI compatibility for Python 3.14 with PYO3_USE_ABI3_FORWARD_COMPATIBILITY
2.1.9 - 2026-01-26
- Unified LOOCV for TROP joint method with Rust acceleration (PR #113)
- Leave-one-out cross-validation for rank and regularization parameter selection
- Rust backend provides significant speedup for LOOCV grid search
- TROP joint method Rust/Python parity (PR #113)
- Fixed valid_count bug in LOOCV computation
- Proper NaN exclusion for units with no valid pre-period data
- Zero weight assignment for units missing pre-period data
- Jackknife variance estimation fixes
- Staggered adoption validation and simultaneous adoption enforcement
- Treated-pre NaN handling improvements
- LOOCV subsampling fix for Python-only path
2.1.8 - 2026-01-25
- /push-pr-update skill for committing and pushing PR revisions
- Commits local changes to current branch and pushes to remote
- Triggers AI code review automatically
- Robust handling for fork repos, unpushed commits, and upstream tracking
- TROP estimator methodology alignment (PR #110)
- Aligned with paper methodology (Equation 5, D matrix semantics)
- NaN propagation and LOOCV warnings improvements
- Rust backend test alignment with new loocv_grid_search return signature
- LOOCV cycling, D matrix validation fixes
- Final estimation infinity handling and edge case fixes
- Absorbing-state gap detection and n_post_periods fix
- /submit-pr skill improvements (PR #111)
- Case-insensitive secret scanning with POSIX ERE regex
- Verify origin ref exists before push
- Dynamic default branch detection with fallback
- Robust handling for unpushed commits, fork repos
- Files count display in PR summary
2.1.7 - 2026-01-25
- plot_event_study reference period normalization behavior
- Effects are now only normalized when reference_period is explicitly provided
- Auto-inferred reference periods only apply hollow marker styling (no normalization)
- Reference period SE is set to NaN during normalization (constraint, not estimate)
- Updated docstring to clarify explicit vs auto-inferred behavior
- Refactored visualization tests to reuse cs_results fixture for better performance
2.1.6 - 2026-01-24
- Methodology verification tests for DifferenceInDifferences estimator
- Comprehensive test suite validating all REGISTRY.md requirements
- Tests for formula interface, coefficient extraction, rank deficiency handling
- Singleton cluster variance estimation behavioral tests
- REGISTRY.md documentation improvements
- Clarified singleton cluster formula notation (u_i² X_i X_i' instead of ambiguous residual² × X'X)
- Verified DifferenceInDifferences behavior against documented requirements
2.1.5 - 2026-01-22
- METHODOLOGY_REVIEW.md tracking document for methodology review progress
- Review status summary table for all 12 estimators
- Detailed notes template for each estimator by category
- Review process guidelines with checklist and priority ordering
- base_period parameter for CallawaySantAnna pre-treatment effect computation
- "varying" (default): Pre-treatment uses t-1 as base (consecutive comparisons)
- "universal": All comparisons use g-anticipation-1 as base
- Matches R did::att_gt() base_period parameter
- Pre-merge-check skill (/pre-merge-check) for automated PR validation
- Pattern checks for NaN handling consistency
- Context-specific checklist generation
- Tutorial 02 improvements: Added pre-trends section, clarified base_period interaction with anticipation
- Not-yet-treated control group now properly excludes cohort g when computing ATT(g,t)
- Aggregation t_stat uses NaN (not 0.0) when SE is non-finite or zero
- Bootstrap inference for pre-treatment effects with base_period="varying"
- NaN propagation for empty post-treatment effects in CallawaySantAnna
- Grep word boundary pattern in pre-merge-check skill
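The 2.1.5 rule that a t-statistic is NaN (not 0.0) when the SE is non-finite or zero can be written as a small guard. The function name `safe_t_stat` is a hypothetical stand-in used only for this sketch; it mirrors the "finite and strictly positive SE" condition described in this changelog rather than any specific library function.

```python
import math

def safe_t_stat(effect, se):
    """Hypothetical guard: return effect/se only when se is finite and strictly positive."""
    if math.isfinite(se) and se > 0:
        return effect / se
    return math.nan

t_ok = safe_t_stat(1.0, 0.5)             # ordinary case
t_zero_se = safe_t_stat(1.0, 0.0)        # zero SE: undefined, so NaN
t_nan_se = safe_t_stat(1.0, math.nan)    # NaN SE propagates as NaN
t_inf_se = safe_t_stat(1.0, math.inf)    # infinite SE is treated as invalid
```

Returning NaN rather than 0.0 keeps an undefined test statistic from being read as "no effect" downstream.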
2.1.4 - 2026-01-20
- Development checklists and workflow improvements in CLAUDE.md
- Estimator inheritance map showing class hierarchy for get_params/set_params
- Test writing guidelines for fallback paths, parameters, and warnings
- Checklists for adding parameters and warning/error handling
- R-style rank deficiency handling across all estimators
- rank_deficient_action parameter: "warn" (default), "error", or "silent"
- Dropped columns have NaN coefficients (like R's lm())
- VCoV matrix has NaN for rows/cols of dropped coefficients
- Propagated to all estimators: DifferenceInDifferences, MultiPeriodDiD, TwoWayFixedEffects, CallawaySantAnna, SunAbraham, TripleDifference, TROP, SyntheticDiD
- get_params() now includes rank_deficient_action parameter (fixes sklearn cloning)
- NaN vcov fallback in Rust backend for rank-deficient matrices
- MultiPeriodDiD vcov/df computation for rank-deficient designs
- Average ATT inference for rank-deficient designs
- Rank tolerance aligned with R's lm() default for consistent behavior
2.1.3 - 2026-01-19
- TROP estimator paper conformance issues (Athey et al. 2025)
- Control set now includes pre-treatment observations of eventually-treated units (Issue A)
- Unit distance computation excludes target period per Equation 3 (Issue B)
- Nuclear norm update uses weighted proximal gradient instead of unweighted soft-thresholding (Issue C)
- Bootstrap sampling now stratifies by treatment status per Algorithm 3 (Issue D)
- TROP Rust backend alignment with paper specification
- Weight normalization to sum to 1 (probability weights)
- Weighted proximal gradient for L update with step size η ≤ 1/max(W)
- Cleaned up unused parameters from TROP Rust API
- Removed control_unit_idx and unit_dist_matrix from public functions
- Per-observation distances now computed dynamically (more accurate, slightly slower)
2.1.2 - 2026-01-19
- Consolidated DGP functions in prep.py for all supported DiD designs
- generate_did_data() - Basic 2x2 DiD data generation
- generate_staggered_data() - Staggered adoption data for Callaway-Sant'Anna/Sun-Abraham
- generate_factor_data() - Factor model data for TROP/SyntheticDiD
- generate_ddd_data() - Triple Difference (DDD) design data
- generate_panel_data() - Panel data with optional parallel trends violations
- generate_event_study_data() - Event study data with simultaneous treatment
- Clean up development tracking files for v2.1.1 release
- Removed completed items from TODO.md (now tracked in CHANGELOG)
- Updated ROADMAP.md version numbers and removed shipped TROP section
- Updated prep.py line count in Large Module Files table (1338 → 1993)
2.1.1 - 2026-01-19
- Rust backend acceleration for TROP estimator delivering 5-20x overall speedup
- compute_unit_distance_matrix - Parallel pairwise RMSE computation for donor matching
- loocv_grid_search - Parallel leave-one-out cross-validation across 180 parameter combinations
- bootstrap_trop_variance - Parallel bootstrap variance estimation
- Automatic fallback to Python when Rust backend unavailable
- Logging for Rust fallback events to aid debugging
- /bump-version skill for release management
- Updates version in __init__.py, pyproject.toml, and rust/Cargo.toml
- Generates CHANGELOG entries from git commits
- Adds comparison links automatically
- /review-pr skill for code review workflow
- TROP estimator performance optimizations (Python backend)
- Vectorized distance matrix computation using NumPy broadcasting
- Extracted tuning constants to module-level for clarity
- Added TROPTuningParams TypedDict for parameter documentation
- Tutorial notebook validation errors in 10_trop.ipynb
- Pre-existing RuntimeWarnings in CallawaySantAnna bootstrap (documented)
- TROP pre_periods parameter handling for edge cases
2.1.0 - 2026-01-17
- Triply Robust Panel (TROP) estimator implementing Athey, Imbens, Qu & Viviano (2025)
- TROP class combining three robustness components:
- Factor model adjustment via SVD (removes unobserved confounders with factor structure)
- Synthetic control style unit weights
- SDID style time weights
- TROPResults dataclass with ATT, factors, loadings, unit/time weights
- trop() convenience function for quick estimation
- Automatic rank selection methods: cross-validation ('cv'), information criterion ('ic'), elbow detection ('elbow')
- Bootstrap and placebo-based variance estimation
- Full integration with existing infrastructure (exports in __init__.py, sklearn-compatible API)
- Tutorial notebook: docs/tutorials/10_trop.ipynb
- Comprehensive test suite: tests/test_trop.py
Reference: Athey, S., Imbens, G. W., Qu, Z., & Viviano, D. (2025). "Triply Robust Panel Estimators." Working Paper. arXiv:2508.21536
2.0.3 - 2026-01-17
- Rust backend performance optimizations delivering up to 32x speedup for bootstrap operations
- Bootstrap weight generation now 16x faster on average (up to 32x for Webb distribution)
- Direct Array2 allocation eliminates intermediate Vec<Vec<f64>> (~50% memory reduction)
- Rayon chunk size tuning (min_len=64) reduces parallel scheduling overhead
- Webb distribution uses lookup table instead of 6-way if-else chain
- LinearRegression helper class in linalg.py for code deduplication
- High-level OLS wrapper with unified coefficient extraction and inference
- Used by DifferenceInDifferences, TwoWayFixedEffects, SunAbraham, TripleDifference
- Provides InferenceResult dataclass for coefficient-level statistics
- Cholesky factorization for symmetric positive-definite matrix inversion in Rust backend
- ~2x faster than LU decomposition for well-conditioned matrices
- Automatic fallback to LU for near-singular or indefinite matrices
- Vectorized variance computation in Rust backend
- HC1 meat computation: X' @ (X * e²) via BLAS instead of O(n×k²) loop
- Score computation: broadcast multiplication instead of O(n×k) loop
- Static BLAS linking options in rust/Cargo.toml
- openblas-static and intel-mkl-static features for standalone distribution
- Eliminates runtime BLAS dependency at cost of larger binary size
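The vectorized meat computation X' @ (X * e²) mentioned in 2.0.3 is algebraically the same as summing e_i² x_i x_i' over observations. The NumPy sketch below (a Python stand-in for the Rust code, with simulated inputs as assumptions) compares the two forms; any HC1 finite-sample scaling would be applied afterwards and is not shown.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = rng.normal(size=(n, k))   # design matrix (illustrative)
e = rng.normal(size=n)        # residuals (illustrative)

# Explicit loop: sum over observations of e_i^2 * x_i x_i'
meat_loop = np.zeros((k, k))
for i in range(n):
    meat_loop += e[i] ** 2 * np.outer(X[i], X[i])

# Vectorized form: scale each row of X by e_i^2, then one matrix product
meat_vec = X.T @ (X * (e ** 2)[:, None])
```

The vectorized form hands the whole accumulation to a single matrix multiplication, which is where a BLAS-backed implementation gains its speed.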
2.0.2 - 2026-01-15
- CallawaySantAnna SE computation now exactly matches R's did package
- Fixed weight influence function (wif) formula for "simple" aggregation
- Corrected pg computation: uses n_g / n_all (matching R) instead of n_g / total_treated
- Fixed wif iteration: iterates over keepers (post-treatment pairs) with individual ATT(g,t) values
- SE difference reduced from ~2.5% to <0.01% vs R's did package (essentially exact match)
- Point estimates unchanged; all existing tests pass
2.0.1 - 2026-01-13
- Shared within-transformation utilities in utils.py
- demean_by_group() - One-way fixed effects demeaning
- within_transform() - Two-way (unit + time) FE transformation
- Reduces code duplication across estimators.py, twfe.py, sun_abraham.py, bacon.py
- DataFrame fragmentation warning - Build columns in batch instead of iteratively
- Reverted untested Rust backend optimizations (Cholesky factorization, reduced allocations) - these will be re-added when proper testing infrastructure is available
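For readers unfamiliar with the two-way (unit + time) within-transformation named in 2.0.1, here is a minimal sketch for a balanced panel stored as a units × periods array. The function `two_way_demean` is a hypothetical illustration; the signatures of the library's demean_by_group() and within_transform() are not shown in this changelog and are not assumed here.

```python
import numpy as np

def two_way_demean(Y):
    """Hypothetical helper: subtract unit and time means, add back the grand mean (balanced panel only)."""
    unit_means = Y.mean(axis=1, keepdims=True)   # one mean per unit (row)
    time_means = Y.mean(axis=0, keepdims=True)   # one mean per period (column)
    return Y - unit_means - time_means + Y.mean()

rng = np.random.default_rng(2)
alpha = rng.normal(size=(5, 1))    # unit fixed effects
gamma = rng.normal(size=(1, 4))    # time fixed effects
noise = rng.normal(size=(5, 4))

Y = alpha + gamma + noise
Y_tilde = two_way_demean(Y)
pure_fe_residual = two_way_demean(alpha + gamma)   # additive FE are removed entirely
```

On a balanced panel the transformed outcome has zero mean within every unit and every period, so additive unit and time effects drop out before the regression is run.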
2.0.0 - 2026-01-12
- Optional Rust backend for accelerated computation
- 4-8x speedup for SyntheticDiD and bootstrap operations
- Parallel bootstrap weight generation (Rademacher, Mammen, Webb)
- Accelerated OLS solver using OpenBLAS/MKL
- Cluster-robust variance estimation
- Synthetic control weight optimization with simplex projection
- Pre-built wheels for Linux x86_64 and macOS ARM64
- Pure Python fallback for all other platforms
- diff_diff/_backend.py - Backend detection and configuration module
- HAS_RUST_BACKEND flag exported in main package
- DIFF_DIFF_BACKEND environment variable for backend control:
- 'auto' (default) - Use Rust if available, fall back to Python
- 'python' - Force pure Python mode
- 'rust' - Force Rust mode (fails if unavailable)
- Rust source code in rust/ directory
- rust/src/lib.rs - PyO3 module definition
- rust/src/bootstrap.rs - Parallel bootstrap weight generation
- rust/src/linalg.rs - OLS solver and robust variance estimation
- rust/src/weights.rs - Synthetic control weights and simplex projection
- Rust backend test suite - tests/test_rust_backend.py for equivalence testing
- Package version bumped from 1.4.0 to 2.0.0 (major version for new backend)
- CI/CD updated to build Rust extensions with maturin
- ReadTheDocs now installs from PyPI (pre-built wheels with Rust backend)
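A short usage sketch for the DIFF_DIFF_BACKEND variable introduced in 2.0.0 follows. It assumes the variable is read when the package is imported, which this changelog does not state explicitly, so the variable is set before the import; the import is wrapped so the snippet also runs where the package is not installed.

```python
import os

# Assumption: the backend choice is read at import time, so set it first.
os.environ["DIFF_DIFF_BACKEND"] = "python"   # one of 'auto', 'python', 'rust'

try:
    import diff_diff
    # HAS_RUST_BACKEND is described above as exported in the main package.
    backend_flag = getattr(diff_diff, "HAS_RUST_BACKEND", None)
    print("HAS_RUST_BACKEND:", backend_flag)
except ImportError:
    backend_flag = None
    print("diff_diff is not installed in this environment")
```

Setting the variable in the shell before launching Python would have the same effect under the same assumption.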
1.4.0 - 2026-01-11
- Unified linear algebra backend (diff_diff/linalg.py)
- solve_ols() - Optimized OLS solver using scipy's gelsy LAPACK driver
- compute_robust_vcov() - Vectorized (clustered) robust variance-covariance
- Single optimization point for all estimators; prepares for future Rust backend
- New tests/test_linalg.py with comprehensive tests
- Major performance improvements - All estimators now significantly faster
- BasicDiD/TWFE @ 10K: 0.835s → 0.011s (76x faster, now 4.2x faster than R)
- CallawaySantAnna @ 10K: 2.234s → 0.109s (20x faster, now 7.2x faster than R)
- All results numerically identical to previous versions
- CallawaySantAnna optimizations (staggered.py)
- Pre-computed wide-format outcome matrix and cohort masks
- Vectorized ATT(g,t) computation using numpy operations (23x faster)
- Batch bootstrap weight generation
- Vectorized multiplier bootstrap using matrix operations (26x faster)
- TWFE optimization (twfe.py)
- Cached groupby indexes for within-transformation
- All estimators migrated to unified linalg.py backend
- estimators.py, twfe.py, staggered.py, triple_diff.py, synthetic_did.py, sun_abraham.py, utils.py
- Rank-deficient design matrices: The new gelsy LAPACK driver handles rank-deficient matrices gracefully (returning a least-norm solution) rather than raising an explicit error. Previously, DifferenceInDifferences would raise ValueError("Design matrix is rank-deficient"). Users relying on this error for collinearity detection should validate their design matrices separately. Results remain numerically correct for well-specified models.
1.3.1 - 2026-01-10
- SyntheticDiD placebo-based variance estimation matching R's synthdid package methodology
- New variance_method parameter with options "bootstrap" (default) and "placebo"
- Placebo method implements Algorithm 4 from Arkhangelsky et al. (2021):
- Randomly permutes control unit indices
- Designates N₁ controls as pseudo-treated (matching actual treated count)
- Renormalizes original unit weights for remaining pseudo-controls
- Computes SDID estimate with renormalized weights
- Repeats for n_bootstrap replications
- SE = sqrt((r-1)/r) × sd(estimates)
- Provides methodological parity with R's synthdid::vcov(method = "placebo")
- n_bootstrap parameter now used for both bootstrap and placebo replications
- SyntheticDiDResults now tracks variance_method and n_bootstrap attributes
- Results summary displays variance method and replications count
Reference: Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088-4118.
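The placebo standard-error formula in 1.3.1, SE = sqrt((r-1)/r) × sd(estimates), can be checked with the standard library. The sketch assumes sd denotes the sample standard deviation (divisor r-1), as in R's sd(); the function name and the placebo estimates are made-up illustrations, not library output.

```python
import math
import statistics

def placebo_se(estimates):
    """Hypothetical helper: sqrt((r-1)/r) times the sample standard deviation of placebo estimates."""
    r = len(estimates)
    return math.sqrt((r - 1) / r) * statistics.stdev(estimates)

# Made-up placebo estimates, for illustration only
estimates = [0.12, -0.05, 0.30, 0.08, -0.11]
se = placebo_se(estimates)

# Under the sample-sd assumption, the scaling converts the divisor from r-1 to r
population_sd = statistics.pstdev(estimates)
```

Under that assumption the factor sqrt((r-1)/r) simply turns the sample standard deviation into the population (divisor r) standard deviation of the placebo estimates.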
1.3.0 - 2026-01-09
- Triple Difference (DDD) estimator implementing Ortiz-Villavicencio & Sant'Anna (2025)
- TripleDifference class for DDD designs where treatment requires two criteria (group AND partition)
- TripleDifferenceResults dataclass with ATT, SEs, cell means, and diagnostics
- triple_difference() convenience function for quick estimation
- Three estimation methods: regression adjustment (reg), inverse probability weighting (ipw), and doubly robust (dr)
- Proper covariate handling (unlike naive DDD implementations that difference two DiDs)
- Propensity score trimming for IPW/DR methods
- Cluster-robust standard errors support
- Tutorial notebook: docs/tutorials/08_triple_diff.ipynb
Reference: Ortiz-Villavicencio, M., & Sant'Anna, P. H. C. (2025). "Better Understanding Triple Differences Estimators." Working Paper. arXiv:2505.09942
1.2.1 - 2026-01-08
- Expanded test coverage for edge cases:
- Wild bootstrap with very few clusters (< 5), including 2-3 cluster scenarios
- Unbalanced panels with missing periods across units
- Single treated unit scenarios for DiD and Synthetic DiD
- Perfect collinearity detection (validates clear error messages)
- CallawaySantAnna with single treatment cohort
- SyntheticDiD with insufficient pre-treatment periods
- Refactored CallawaySantAnna bootstrap: Extracted _compute_effect_bootstrap_stats() helper method for cleaner code and reduced duplication in bootstrap statistics computation.
1.2.0 - 2026-01-07
- Pre-Trends Power Analysis (Roth 2022) for assessing informativeness of pre-trends tests
- PreTrendsPower class for computing power and minimum detectable violation (MDV)
- PreTrendsPowerResults dataclass with power, MDV, and test statistics
- PreTrendsPowerCurve for power curves across violation magnitudes
- compute_pretrends_power() and compute_mdv() convenience functions
- Multiple violation types: linear, constant, last_period, custom
- Integration with Honest DiD via sensitivity_to_honest_did() method
- plot_pretrends_power() visualization for power curves
- Tutorial notebook: docs/tutorials/07_pretrends_power.ipynb
- Full API documentation: docs/api/pretrends.rst
Reference: Roth, J. (2022). "Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends." American Economic Review: Insights, 4(3), 305-322.
- Reference period handling in pre-trends analysis: Fixed bug where reference period was incorrectly assigned avg_se instead of being excluded from power calculations. Now properly excludes the omitted reference period from the joint Wald test.
1.1.1 - 2026-01-06
- SyntheticDiD bootstrap error handling: Bootstrap now raises a clear ValueError when all iterations fail, instead of silently returning SE=0.0. Added warnings for edge cases (single successful iteration, high failure rate).
- Diagnostics module error handling: Improved error messages in permutation_test() and leave_one_out_test() with actionable guidance. Added warnings when a significant share of iterations fail. Enhanced run_all_placebo_tests() to return structured error info including error type.
- Code deduplication: Extracted wild bootstrap inference logic to shared _run_wild_bootstrap_inference() method in DifferenceInDifferences base class, used by both DifferenceInDifferences and TwoWayFixedEffects.
- Type hints: Added missing type hints to nested functions:
- compute_trend() in utils.py
- neg_log_likelihood() and gradient() in staggered.py
- format_label() in prep.py
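The bootstrap failure handling described in this release can be pictured with a minimal sketch. This is not the library's implementation: the function name bootstrap_se, its arguments, the 5% failure threshold, and the choice to return NaN when only one iteration succeeds are all assumptions made for illustration.

```python
import math
import warnings


def bootstrap_se(estimates, n_requested, max_failure_rate=0.05):
    """Standard error from successful bootstrap replicates (hypothetical helper).

    estimates: point estimates from the iterations that succeeded.
    n_requested: total number of iterations that were attempted.
    """
    n_ok = len(estimates)
    if n_ok == 0:
        # Fail loudly instead of returning a misleading SE of 0.0.
        raise ValueError("All bootstrap iterations failed; cannot compute a standard error.")
    if n_ok == 1:
        # Assumption: a single replicate has no spread, so report NaN and warn.
        warnings.warn("Only one bootstrap iteration succeeded; SE is undefined.")
        return float("nan")
    failure_rate = 1.0 - n_ok / n_requested
    if failure_rate > max_failure_rate:
        warnings.warn(f"{failure_rate:.1%} of bootstrap iterations failed.")
    mean = sum(estimates) / n_ok
    # Sample variance with the usual n - 1 denominator.
    var = sum((e - mean) ** 2 for e in estimates) / (n_ok - 1)
    return math.sqrt(var)
```

With three successful replicates 1.0, 2.0, 3.0 out of three requested, the sketch returns 1.0 without warnings; with an empty list it raises ValueError.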
1.1.0 - 2026-01-05
- Sun-Abraham (2021) interaction-weighted estimator for staggered DiD
- SunAbraham class implementing saturated regression approach
- SunAbrahamResults with event study effects, cohort weights, and overall ATT
- SABootstrapResults for bootstrap inference (SEs, CIs, p-values)
- Support for never_treated and not_yet_treated control groups
- Analytical and cluster-robust standard errors
- Multiplier bootstrap with Rademacher, Mammen, or Webb weights
- Integration with plot_event_study() visualization
- Useful robustness check alongside Callaway-Sant'Anna
Reference: Sun, L., & Abraham, S. (2021). "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics, 225(2), 175-199.
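The three multiplier-bootstrap weight distributions named in this release are standard and can be written down directly. The sketch below uses only the Python standard library; the names WEIGHTS, moments, and draw_weights are hypothetical and are not part of the package API. Each distribution has mean 0 and variance 1, which the moments helper confirms from the support points.

```python
import math
import random

SQRT5 = math.sqrt(5.0)

# Each distribution is a list of (value, probability) pairs.
WEIGHTS = {
    # Rademacher: +1 or -1 with equal probability.
    "rademacher": [(-1.0, 0.5), (1.0, 0.5)],
    # Mammen two-point distribution.
    "mammen": [
        (-(SQRT5 - 1.0) / 2.0, (SQRT5 + 1.0) / (2.0 * SQRT5)),
        ((SQRT5 + 1.0) / 2.0, (SQRT5 - 1.0) / (2.0 * SQRT5)),
    ],
    # Webb six-point: +/- sqrt(1/2), +/- 1, +/- sqrt(3/2), each with probability 1/6.
    "webb": [
        (sign * math.sqrt(k / 2.0), 1.0 / 6.0)
        for sign in (-1.0, 1.0)
        for k in (1, 2, 3)
    ],
}


def moments(name):
    """Exact mean and variance of a weight distribution from its support."""
    dist = WEIGHTS[name]
    mean = sum(v * p for v, p in dist)
    var = sum((v - mean) ** 2 * p for v, p in dist)
    return mean, var


def draw_weights(name, n, seed=0):
    """Draw n i.i.d. weights from the named distribution."""
    rng = random.Random(seed)
    values, probs = zip(*WEIGHTS[name])
    return rng.choices(values, weights=probs, k=n)
```

In a multiplier bootstrap, one such weight is typically drawn per unit or cluster and multiplied into that unit's contribution to the estimator before re-aggregating.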
1.0.2 - 2026-01-04
- Refactored estimators.py to reduce module size
- Moved TwoWayFixedEffects to diff_diff/twfe.py
- Moved SyntheticDiD to diff_diff/synthetic_did.py
- Backward compatible re-exports maintained in estimators.py
- Fixed ReadTheDocs version display by importing from package __version__
1.0.1 - 2026-01-04
- Tech debt cleanup (Tier 1 + Tier 2)
- Improved code organization and documentation
- Fixed minor issues identified in tech debt review
1.0.0 - 2026-01-04
- Goodman-Bacon decomposition for TWFE diagnostics
- BaconDecomposition class for decomposing TWFE into weighted 2x2 comparisons
- Comparison2x2 dataclass for individual comparisons (treated_vs_never, earlier_vs_later, later_vs_earlier)
- BaconDecompositionResults with weights and estimates by comparison type
- bacon_decompose() convenience function
- plot_bacon() visualization for decomposition results
- Integration via TwoWayFixedEffects.decompose() method
- Power analysis for study design
- PowerAnalysis class for analytical power calculations
- PowerResults and SimulationPowerResults dataclasses
- compute_mde(), compute_power(), compute_sample_size() convenience functions
- simulate_power() for Monte Carlo simulation-based power analysis
- plot_power_curve() visualization for power analysis
- Tutorial notebook: docs/tutorials/06_power_analysis.ipynb
- Callaway-Sant'Anna multiplier bootstrap for inference
- CSBootstrapResults with standard errors, confidence intervals, p-values
- Rademacher, Mammen, and Webb weight distributions
- Bootstrap inference for all aggregation methods
- Troubleshooting guide in documentation
- Standard error computation guide explaining SE differences across estimators
- Updated package status to Production/Stable (was Alpha)
- SyntheticDiD bootstrap now warns when >5% of iterations fail
- Silent bootstrap failures in SyntheticDiD now produce warnings
- CallawaySantAnna covariate adjustment for conditional parallel trends
- Outcome regression (estimation_method='reg')
- Inverse probability weighting (estimation_method='ipw')
- Doubly robust estimation (estimation_method='dr')
- Pass covariates via covariates parameter in fit()
- Honest DiD sensitivity analysis (Rambachan & Roth 2023)
- HonestDiD class for computing bounds under parallel trends violations
- Relative magnitudes restriction (DeltaRM) - bounds post-treatment violations by pre-treatment
- Smoothness restriction (DeltaSD) - bounds second differences of trend violations
- Combined restrictions (DeltaSDRM)
- FLCI and C-LF confidence interval methods
- Breakdown value computation via breakdown_value()
- Sensitivity analysis over M grid via sensitivity_analysis()
- HonestDiDResults and SensitivityResults dataclasses
- compute_honest_did() convenience function
- plot_sensitivity() for sensitivity analysis visualization
- plot_honest_event_study() for event study with honest CIs
- Tutorial notebook: docs/tutorials/05_honest_did.ipynb
- API documentation site with Sphinx
- Full API reference auto-generated from docstrings
- "Which estimator should I use?" decision guide
- Comparison with R packages (did, HonestDiD)
- Getting started / quickstart guide
- Updated mypy configuration for better numpy type compatibility
- Modernized ruff configuration to use [tool.ruff.lint] section
- Fixed 21 ruff linting issues (import ordering, unused variables, ambiguous names)
- Fixed 94 mypy type checking issues (Optional types, numpy type casts, assertions)
- Added missing return statement in run_placebo_test()
- Wild cluster bootstrap for valid inference with few clusters
- Rademacher weights (default, good for most cases)
- Webb's 6-point distribution (recommended for <10 clusters)
- Mammen's two-point distribution
- WildBootstrapResults dataclass
- wild_bootstrap_se() utility function
- Integration with DifferenceInDifferences and TwoWayFixedEffects via inference='wild_bootstrap'
- Placebo tests module (diff_diff.diagnostics)
- placebo_timing_test() - fake treatment timing test
- placebo_group_test() - fake treatment group test
- permutation_test() - permutation-based inference
- leave_one_out_test() - sensitivity to individual treated units
- run_placebo_test() - unified dispatcher for all test types
- run_all_placebo_tests() - comprehensive diagnostic suite
- PlaceboTestResults dataclass
- Tutorial notebooks in docs/tutorials/
- 01_basic_did.ipynb - Basic 2x2 DiD, formula interface, covariates, fixed effects, wild bootstrap
- 02_staggered_did.ipynb - Staggered adoption with Callaway-Sant'Anna
- 03_synthetic_did.ipynb - Synthetic DiD with unit/time weights
- 04_parallel_trends.ipynb - Parallel trends testing and diagnostics
- Comprehensive test coverage (380+ tests)
- Callaway-Sant'Anna estimator for staggered difference-in-differences
- CallawaySantAnna class with group-time ATT(g,t) estimation
- Support for never_treated and not_yet_treated control groups
- Aggregation methods: simple, group, calendar, event_study
- CallawaySantAnnaResults with group-time effects and aggregations
- GroupTimeEffect dataclass for individual effects
- Event study visualization via plot_event_study()
- Works with MultiPeriodDiDResults, CallawaySantAnnaResults, or DataFrames
- Publication-ready formatting with customization options
- Group effects visualization via plot_group_effects()
- Parallel trends testing utilities
- check_parallel_trends() - simple slope-based test
- check_parallel_trends_robust() - Wasserstein distance test
- equivalence_test_trends() - TOST equivalence test
- Synthetic Difference-in-Differences (SyntheticDiD)
- Unit weight optimization for synthetic control
- Time weight computation for pre-treatment periods
- Placebo-based and bootstrap inference
- SyntheticDiDResults with weight accessors
- Multi-period DiD (MultiPeriodDiD)
- Event-study style estimation with period-specific effects
- MultiPeriodDiDResults with period_effects dictionary
- PeriodEffect dataclass for individual period results
- Data preparation utilities (diff_diff.prep)
- generate_did_data() - synthetic data generation
- make_treatment_indicator() - create treatment from categorical/numeric
- make_post_indicator() - create post-treatment indicator
- wide_to_long() - reshape wide to long format
- balance_panel() - ensure balanced panel data
- validate_did_data() - data validation
- summarize_did_data() - summary statistics by group
- create_event_time() - event time for staggered designs
- aggregate_to_cohorts() - aggregate to cohort means
- rank_control_units() - rank controls by similarity
- Two-Way Fixed Effects (TwoWayFixedEffects)
- Within-transformation for unit and time fixed effects
- Efficient handling of high-dimensional fixed effects via absorb
- Fixed effects support in base DifferenceInDifferences
- fixed_effects parameter for dummy variable approach
- absorb parameter for within-transformation approach
- Cluster-robust standard errors
- cluster parameter for cluster-robust inference
- Formula interface
- R-style formulas like "outcome ~ treated * post"
- Support for covariates in formulas
- Initial release
- Basic Difference-in-Differences (DifferenceInDifferences)
- sklearn-like API with fit() method
- Column name interface for outcome, treatment, time
- Heteroskedasticity-robust (HC1) standard errors
- DiDResults dataclass with ATT, SE, p-value, confidence intervals
- summary() and print_summary() methods
- to_dict() and to_dataframe() export methods
- is_significant and significance_stars properties
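For orientation, the quantity behind the basic 2x2 estimator in the initial release can be computed from four cell means. The sketch below is a plain-Python illustration of that arithmetic only; the function did_2x2 and the example data are hypothetical and do not use or represent the package's DifferenceInDifferences API.

```python
from statistics import mean


def did_2x2(rows):
    """2x2 difference-in-differences from (outcome, treated, post) rows.

    treated and post are 0/1 indicators. The estimate is
    (treated post - treated pre) - (control post - control pre).
    """
    cells = {(g, t): [] for g in (0, 1) for t in (0, 1)}
    for outcome, treated, post in rows:
        cells[(treated, post)].append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])


# Made-up example data: two observations per cell.
example_rows = [
    (10.0, 1, 0), (12.0, 1, 0),  # treated, pre:   mean 11
    (15.0, 1, 1), (17.0, 1, 1),  # treated, post:  mean 16
    (8.0, 0, 0), (10.0, 0, 0),   # control, pre:   mean 9
    (10.0, 0, 1), (12.0, 0, 1),  # control, post:  mean 11
]
att = did_2x2(example_rows)  # (16 - 11) - (11 - 9) = 3.0
```

The same number is the coefficient on the treated-by-post interaction in an OLS regression of the outcome on treated, post, and their product, which is the form the formula "outcome ~ treated * post" expresses.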