EMR-ACH turns LLM forecasting into a structured analysis-matrix problem. Instead of asking a model to pick a hypothesis from a question and a bag of articles in a single forward pass, we decompose the task into three small, auditable steps that mirror Heuer's Analysis of Competing Hypotheses method from the intelligence community.
- Authors: Yehudit Aperstein, Ben Remez, Alexander Apartsin
- Paper: apartsinprojects.github.io/EMR-ACH/paper/index.html
- Tag: v2.1-data-ready
Given a Forecast Dossier (FD), which bundles a question, a fixed hypothesis set, and a leakage-checked pool of retrieved articles, EMR-ACH builds two matrices:

| Matrix | Shape | Cell meaning |
|---|---|---|
| Analysis matrix | articles × indicators | Does article $i$ mention indicator $j$? |
| Influence matrix | indicators × hypotheses | How strongly does indicator $j$ support hypothesis $k$? |

The score for hypothesis $h_k$ is a diagnosticity-weighted aggregate of the form $s_k = \sum_j d_j \, a_j \, m_{jk}$, with diagnosticity weight $d_j$ for indicator $j$, article-level evidence $a_j$ pooled from the analysis matrix, and influence $m_{jk}$ of indicator $j$ on hypothesis $k$; the predicted label is $\arg\max_k s_k$.
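As a rough illustration of that aggregation (not the repository implementation), the sketch below assumes a binary analysis matrix `A`, a signed influence matrix `M`, precomputed diagnosticity weights `d`, and simple mean-pooling of article evidence per indicator:

```python
import numpy as np

def pick_hypothesis(A: np.ndarray, M: np.ndarray, d: np.ndarray,
                    hypothesis_set: list[str]) -> str:
    """Illustrative EMR-ACH-style aggregation.

    A : (n_articles, n_indicators)   binary analysis matrix -- does article i mention indicator j?
    M : (n_indicators, n_hypotheses) influence matrix       -- how strongly indicator j supports hypothesis k
    d : (n_indicators,)              diagnosticity weights
    """
    a = A.mean(axis=0)                        # per-indicator evidence a_j, mean-pooled over articles (assumption)
    s = (d * a) @ M                           # s_k = sum_j d_j * a_j * m_jk
    return hypothesis_set[int(np.argmax(s))]  # one label per FD, no probability distribution

# toy usage: 3 articles, 2 indicators, 2 hypotheses
A = np.array([[1, 0], [1, 1], [0, 1]])
M = np.array([[+1.0, -0.5], [-0.2, +0.8]])
d = np.array([0.9, 0.7])
print(pick_hypothesis(A, M, d, ["Comply", "Surprise"]))
```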
A multi-agent adversarial variant adds one advocate agent per hypothesis and lets a judge agent override the aggregation. No probability distribution, no calibration: every system in the comparison emits a single label per FD.
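Structurally, the adversarial variant is a thin control loop around any chat model. The sketch below only shows the shape of that loop; `call_llm` is a hypothetical helper standing in for whatever client the pipeline actually uses, and the prompts are placeholders:

```python
from typing import Callable

def adversarial_pick(question: str, hypotheses: list[str], evidence: str,
                     call_llm: Callable[[str], str]) -> str:
    """One advocate per hypothesis, one judge; the judge may override the matrix aggregation."""
    briefs = []
    for h in hypotheses:
        # Each advocate argues for exactly one hypothesis against the shared evidence pool.
        briefs.append(call_llm(
            f"Argue that the answer to '{question}' is '{h}'. "
            f"Cite only the evidence below.\n\n{evidence}"))
    # The judge reads every brief and must commit to a single label from the fixed set.
    verdict = call_llm(
        f"You are the judge. Pick exactly one hypothesis from {hypotheses} "
        "given these advocate briefs:\n\n" + "\n\n---\n\n".join(briefs))
    # Map the free-text verdict back onto the label set; fall back to the first hypothesis.
    return next((h for h in hypotheses if h.lower() in verdict.lower()), hypotheses[0])
```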
Pipeline: see Figure 1 in the paper.
LLMs collapse "enumerate hypotheses + score evidence + commit" into one forward pass. Even trained human analysts struggle with this; ACH was designed exactly to split it into small, auditable sub-decisions. EMR-ACH applies the same decomposition to LLMs:
- Indicator generation is contrastive. Each indicator is forced to differentiate at least two hypotheses; vague "applies-to-everything" indicators get filtered out.
- Scoring is local and concrete. "Does this article mention military exercises along the border?" is reliably answerable; "what will happen?" is not.
- Diagnostic weighting is automatic. Indicators that don't separate hypotheses get $d_j \approx 0$ and contribute no noise to the argmax (one possible weighting is sketched after this list).
- Multi-agent debate is targeted. One advocate per hypothesis and one judge, not free-form back-and-forth.
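One simple way to get that behaviour, assuming diagnosticity is derived from how much an indicator's influence row spreads across hypotheses (the repository may compute it differently):

```python
import numpy as np

def diagnosticity(M: np.ndarray) -> np.ndarray:
    """d_j is high when indicator j's influence differs across hypotheses, ~0 when it applies to everything.

    M : (n_indicators, n_hypotheses) influence matrix.
    """
    spread = M.max(axis=1) - M.min(axis=1)   # 0 when the row is flat across hypotheses
    return spread / (spread.max() + 1e-9)    # normalise to [0, 1]

M = np.array([[+0.9, -0.9],    # highly diagnostic indicator
              [+0.5, +0.5]])   # applies-to-everything indicator
print(diagnosticity(M))        # ≈ [1.0, 0.0]
```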
The full ablation in paper/index.html §6.4 isolates each contribution.
We need a benchmark that measures forecasting capability without leaking the answer. Two structural contaminations had to go:
- Resolution leakage: questions whose resolution date falls before the model's training cutoff, so the outcome may already sit in the training data.
- Article-pool leakage: retrieved articles published after the simulated forecasting decision.
The EMR-ACH benchmark is built around three FD tracks under one schema, with leakage handled by construction:
| Track | n FDs (v2.1) | Domain | Primary target |
|---|---|---|---|
| forecastbench | 134 | Public-interest forecasting markets | Comply / Surprise |
| gdelt-cameo | 5,975 | Geopolitical event intensity | Comply / Surprise (also Peace / Tension / Violence) |
| earnings | 185 | S&P 500 earnings | Comply / Surprise (also Beat / Meet / Miss) |
Both leakage invariants above (resolution and article-pool) are verified at publish time and re-asserted at evaluation.
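A minimal sketch of that check, using the FD fields shown in the example at the end of this README, a placeholder training-cutoff date, and an assumed published_at field on article records:

```python
from datetime import datetime, timezone

TRAINING_CUTOFF = datetime(2024, 6, 1, tzinfo=timezone.utc)  # placeholder, not the actual model cutoff

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def assert_no_leakage(fd: dict, articles_by_id: dict) -> None:
    # Resolution leakage: the question must resolve after the model's training cutoff.
    assert parse(fd["resolution_date"]) > TRAINING_CUTOFF, fd["id"]
    # Article-pool leakage: every retrieved article must predate the simulated forecasting decision.
    forecast_point = parse(fd["forecast_point"])
    for aid in fd["article_ids"]:
        # "published_at" is an assumed field name on article records.
        assert parse(articles_by_id[aid]["published_at"]) <= forecast_point, aid
```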
The benchmark exists to compare EMR-ACH against a battery of nine reproducible baselines (B1-B9) on a shared pick-only response contract — full details in benchmark/evaluation/BASELINES.md. A curated gold subset under each cutoff ships fully self-contained: schema, validators, example loaders, LICENSE, CITATION inline.
Full schema: docs/FORECAST_DOSSIER.md.
EMR-ACH's evidence layer is article-level by default, but every benchmark cutoff also ships an Event Timeline Dossier: articles distilled into atomic dated facts that can be deduplicated, linked, and reasoned over independently.
| Stage | Script | What it does |
|---|---|---|
| 1 | articles_to_facts.py | LLM extraction of (subject, predicate, time, polarity) tuples |
| 2 | etd_dedup.py | Date-bucketed FAISS kNN canonicalisation (8 sec for 78k facts) |
| 3 | etd_link.py | Per-cutoff linkage to FDs via primary_article_id |
Headline ablation: B3 articles-only vs B10 hybrid (articles + facts) vs B10b facts-only — paper Table 6.
Every baseline returns one hypothesis label per FD (no probability distribution). Multi-sample methods aggregate via plurality vote; ties broken by hypothesis_set order.
| ID | Method | Calls / FD | Reference |
|---|---|---|---|
| B1 | Direct prompting | 1 | Brown et al. 2020 |
| B2 | Chain-of-Thought (ACH-style) | 1 | Wei et al. 2022 |
| B3 | RAG-only | 1 | Lewis et al. 2020 |
| B4 | Self-Consistency | | Wang et al. 2022 |
| B5 | Multi-Agent Debate | | Du et al. 2023 |
| B6 | Tree of Thoughts | | Yao et al. 2023 |
| B7 | Reflexion | | Shinn et al. 2023 |
| B8 | Verbalized Confidence (deprecated) | 1 | Lin et al. 2022 |
| B9 | Heterogeneous LLM Ensemble | | Jiang et al. 2023 |
| — | Majority-class reference | 0 | this work |
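The plurality-vote rule above is small enough to show in full; this sketch assumes samples have already been decoded into labels, and the fallback for an empty vote is an assumption rather than documented behaviour:

```python
from collections import Counter

def plurality_vote(samples: list[str], hypothesis_set: list[str]) -> str:
    """Pick the most frequent label; ties broken by hypothesis_set order."""
    counts = Counter(s for s in samples if s in hypothesis_set)
    best = max(counts.values(), default=0)
    if best == 0:
        return hypothesis_set[0]  # no valid sample: fall back to the first hypothesis (assumption)
    # Iterate in hypothesis_set order so the earlier hypothesis wins a tie.
    return next(h for h in hypothesis_set if counts.get(h, 0) == best)

print(plurality_vote(["Surprise", "Comply", "Surprise", "Comply"], ["Comply", "Surprise"]))  # Comply (tie)
```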
```bash
# Build a benchmark cutoff (one command, full pipeline)
python scripts/build_benchmark.py --cutoff 2026-01-01 \
    --benchmarks forecastbench,earnings \
    --horizon-days 14 --lookback-days 30 \
    --embedder openai --openai-mode batch

# Build a self-contained gold subset
python scripts/build_gold_subset.py --cutoff 2026-01-01 \
    --min-articles 5 --min-distinct-days 3 --min-source-diversity 2

# Run EMR-ACH on the gold subset
python scripts/eval/emrach_on_gold.py \
    --gold-dir benchmark/data/2026-01-01-h14-ccnews-gold \
    --mode batch

# Smoke a baseline
cd benchmark && python -m evaluation.baselines.runner \
    --method b1_direct \
    --fds data/2026-01-01-gold/forecasts.jsonl \
    --articles data/2026-01-01-gold/articles.jsonl \
    --smoke 3 --sync

# Full B1-B9 Batch API run (≈ $0.30 for 10 methods × 80 FDs)
python -m evaluation.baselines.runner --method b1_direct --batch ...
```

```text
EMR-ACH/
├── paper/index.html ← HTML paper (Tables 1-7, Appendices A-G)
├── paper/figures/ ← 8 paper figures, embedded in §3 + §6
├── docs/PROJECT_SPEC.md ← single source of truth
├── docs/V2_2_REFACTOR_BACKLOG.md ← 80-item living backlog
├── docs/FORECAST_DOSSIER.md ← FD schema + invariants
├── docs/EMRACH_IMPLEMENTATION_AUDIT.md ← method-side component audit
├── benchmark/
│ ├── DATASET.md ← schema + EDA + horizon/lookback config
│ ├── configs/ ← default_config.yaml, baselines.yaml
│ ├── schema/ ← FD + article + ETD JSON Schemas
│ ├── evaluation/baselines/ ← B1..B9 + runner
│ └── data/{cutoff}/ + {cutoff}-gold/ ← published benchmark + gold subset
├── src/
│ ├── pipeline/ ← EMR-ACH method modules (indicators, retrieval, multi-agent)
│ └── common/openai_embeddings.py ← Batch API helper
├── scripts/ ← every pipeline stage as a CLI
└── tests/ ← invariants + leakage regression
```
| Tag | Status | What it is |
|---|---|---|
| v2.1-data-ready | shipped | 6,294 FDs, 28,945 articles, 81-FD gold. Horizon=0 retrospective. Tagged 8ffba6f. |
| v2.2-h14 (in progress) | local | Horizon=14, lookback=30, fb+earnings. Reuse-first (89 FDs) + CC-News rebuild (89 FDs, 1,209 articles, 5,956 ETD facts). EMR-ACH runs end-to-end on the v2.2 gold subset. |
| v2.2-data-ready (next) | TBD | After Phase-3 batch fills paper Tables 5 / 6 / 7 with real numbers. |
The pipeline is fully reproducible from public sources:
- forecastbench ← upstream ForecastBench repo (resolved questions only).
- gdelt-cameo ← GDELT 2.0 KG (publicly hosted).
- earnings ← yfinance + Finnhub + EDGAR + GDELT-slug + NYT/Guardian editorial.
- CC-News ← Common Crawl (data.commoncrawl.org/crawl-data/CC-NEWS/{YYYY}/{MM}/).
No GPU required. Every stage routes through OpenAI Batch API or CPU pipelines (FAISS-CPU). Total reproducibility budget per cutoff-month is under $10.
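For orientation, a single stage's Batch API round-trip looks roughly like the following; the request-file shape is the generic OpenAI Batch format, and the model name and payloads are placeholders rather than what the pipeline actually submits.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one /v1/chat/completions request per FD into a JSONL file.
with open("requests.jsonl", "w") as f:
    for fd_id, prompt in [("fd_001", "…")]:            # placeholder payloads
        f.write(json.dumps({
            "custom_id": fd_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",            # placeholder model name
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and submit the batch; it completes asynchronously within the window.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```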
Aperstein Y., Remez B., Apartsin A.
Evidence Matrix Reasoning: Structured Hypothesis Selection via Analysis of Competing Hypotheses.
2026. https://github.com/ApartsinProjects/EMR-ACH (tag v2.1-data-ready)
Code: MIT. Third-party data (GDELT, Yahoo Finance, ForecastBench upstream, Common Crawl, NYT, Guardian) retains its original upstream licence.

{ "id": "earn_AAPL_2026-01-30", "benchmark": "earnings", "question": "Will Apple's Q1 FY26 EPS surprise be Beat / Meet / Miss?", "hypothesis_set": ["Comply", "Surprise"], "ground_truth": "Comply", "forecast_point": "2026-01-16T00:00:00Z", // resolution_date − 14d "resolution_date": "2026-01-30T20:30:00Z", "article_ids": ["art_a995…", "art_e394…", ...], // every one passes leakage "prior_state_30d": "guidance reaffirmed; consensus 2.10 EPS", "fd_type": "stability", "metadata": {"ticker": "AAPL", "x_multiclass_gt": "Meet", ...} }