EMR-ACH turns LLM forecasting into a structured analysis-matrix problem. Instead of asking a model to pick a hypothesis from a question and a bag of articles in a single forward pass, we decompose the task into three small, auditable steps that mirror Heuer's Analysis of Competing Hypotheses method from the intelligence community.
- Authors: Yehudit Aperstein, Ben Remez, Alexander Apartsin
- Paper: apartsinprojects.github.io/EMR-ACH/paper/index.html
- Tag: v2.1-data-ready
Given a Forecast Dossier (FD), which bundles a question, a fixed hypothesis set, and a leakage-checked pool of retrieved articles, EMR-ACH builds two matrices:

| Matrix | Shape | Cell meaning |
|---|---|---|
| Analysis matrix | articles × indicators | Does article $i$ mention indicator $j$? |
| Influence matrix | indicators × hypotheses | How strongly does indicator $j$ support hypothesis $k$? |

The score for hypothesis $h_k$ is a diagnosticity-weighted aggregate of the form $s_k = \sum_j d_j \, a_j \, m_{jk}$, with diagnosticity weight $d_j$ for indicator $j$, article-level evidence $a_j$ pooled from the analysis matrix, and influence $m_{jk}$ of indicator $j$ on hypothesis $k$; the predicted label is $\arg\max_k s_k$.
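As a rough illustration of that aggregation (not the repository implementation), the sketch below assumes a binary analysis matrix `A`, a signed influence matrix `M`, precomputed diagnosticity weights `d`, and simple mean-pooling of article evidence per indicator:

```python
import numpy as np

def pick_hypothesis(A: np.ndarray, M: np.ndarray, d: np.ndarray,
                    hypothesis_set: list[str]) -> str:
    """Illustrative EMR-ACH-style aggregation.

    A : (n_articles, n_indicators)   binary analysis matrix -- does article i mention indicator j?
    M : (n_indicators, n_hypotheses) influence matrix       -- how strongly indicator j supports hypothesis k
    d : (n_indicators,)              diagnosticity weights
    """
    a = A.mean(axis=0)                        # per-indicator evidence a_j, mean-pooled over articles (assumption)
    s = (d * a) @ M                           # s_k = sum_j d_j * a_j * m_jk
    return hypothesis_set[int(np.argmax(s))]  # one label per FD, no probability distribution

# toy usage: 3 articles, 2 indicators, 2 hypotheses
A = np.array([[1, 0], [1, 1], [0, 1]])
M = np.array([[+1.0, -0.5], [-0.2, +0.8]])
d = np.array([0.9, 0.7])
print(pick_hypothesis(A, M, d, ["Comply", "Surprise"]))
```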
A multi-agent adversarial variant adds one advocate agent per hypothesis and lets a judge agent override the aggregation. No probability distribution, no calibration: every system in the comparison emits a single label per FD.
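Structurally, the adversarial variant is a thin control loop around any chat model. The sketch below only shows the shape of that loop; `call_llm` is a hypothetical helper standing in for whatever client the pipeline actually uses, and the prompts are placeholders:

```python
from typing import Callable

def adversarial_pick(question: str, hypotheses: list[str], evidence: str,
                     call_llm: Callable[[str], str]) -> str:
    """One advocate per hypothesis, one judge; the judge may override the matrix aggregation."""
    briefs = []
    for h in hypotheses:
        # Each advocate argues for exactly one hypothesis against the shared evidence pool.
        briefs.append(call_llm(
            f"Argue that the answer to '{question}' is '{h}'. "
            f"Cite only the evidence below.\n\n{evidence}"))
    # The judge reads every brief and must commit to a single label from the fixed set.
    verdict = call_llm(
        f"You are the judge. Pick exactly one hypothesis from {hypotheses} "
        "given these advocate briefs:\n\n" + "\n\n---\n\n".join(briefs))
    # Map the free-text verdict back onto the label set; fall back to the first hypothesis.
    return next((h for h in hypotheses if h.lower() in verdict.lower()), hypotheses[0])
```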
Pipeline: see Figure 1 in the paper.
LLMs collapse "enumerate hypotheses + score evidence + commit" into one forward pass. Even trained human analysts struggle with this; ACH was designed exactly to split it into small, auditable sub-decisions. EMR-ACH applies the same decomposition to LLMs:
- Indicator generation is contrastive. Each indicator is forced to differentiate at least two hypotheses; vague "applies-to-everything" indicators get filtered out.
- Scoring is local and concrete. "Does this article mention military exercises along the border?" is reliably answerable; "what will happen?" is not.
- Diagnostic weighting is automatic. Indicators that don't separate hypotheses get $d_j \approx 0$ and contribute no noise to the argmax (one possible weighting is sketched after this list).
- Multi-agent debate is targeted. One advocate per hypothesis and one judge, not free-form back-and-forth.
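One simple way to get that behaviour, assuming diagnosticity is derived from how much an indicator's influence row spreads across hypotheses (the repository may compute it differently):

```python
import numpy as np

def diagnosticity(M: np.ndarray) -> np.ndarray:
    """d_j is high when indicator j's influence differs across hypotheses, ~0 when it applies to everything.

    M : (n_indicators, n_hypotheses) influence matrix.
    """
    spread = M.max(axis=1) - M.min(axis=1)   # 0 when the row is flat across hypotheses
    return spread / (spread.max() + 1e-9)    # normalise to [0, 1]

M = np.array([[+0.9, -0.9],    # highly diagnostic indicator
              [+0.5, +0.5]])   # applies-to-everything indicator
print(diagnosticity(M))        # ≈ [1.0, 0.0]
```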
The full ablation in paper/index.html §6.4 isolates each contribution.
We need a benchmark that measures forecasting capability without leaking the answer. Two structural contaminations had to go:
- Resolution leakage: questions whose resolution date falls before the model's training cutoff, so the outcome may already sit in the training data.
- Article-pool leakage: retrieved articles published after the simulated forecasting decision.
The EMR-ACH benchmark is built around three FD tracks under one schema, with leakage handled by construction:
| Track | n FDs (v2.1) | Domain | Primary target |
|---|---|---|---|
| forecastbench | 134 | Public-interest forecasting markets | Comply / Surprise |
| gdelt-cameo | 5,975 | Geopolitical event intensity | Comply / Surprise (also Peace / Tension / Violence) |
| earnings | 185 | S&P 500 earnings | Comply / Surprise (also Beat / Meet / Miss) |
Both leakage invariants above (resolution and article-pool) are verified at publish time and re-asserted at evaluation.
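A minimal sketch of that check, using the FD fields shown in the example at the end of this README, a placeholder training-cutoff date, and an assumed published_at field on article records:

```python
from datetime import datetime, timezone

TRAINING_CUTOFF = datetime(2024, 6, 1, tzinfo=timezone.utc)  # placeholder, not the actual model cutoff

def parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def assert_no_leakage(fd: dict, articles_by_id: dict) -> None:
    # Resolution leakage: the question must resolve after the model's training cutoff.
    assert parse(fd["resolution_date"]) > TRAINING_CUTOFF, fd["id"]
    # Article-pool leakage: every retrieved article must predate the simulated forecasting decision.
    forecast_point = parse(fd["forecast_point"])
    for aid in fd["article_ids"]:
        # "published_at" is an assumed field name on article records.
        assert parse(articles_by_id[aid]["published_at"]) <= forecast_point, aid
```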
The benchmark exists to compare EMR-ACH against a battery of nine reproducible baselines (B1-B9) on a shared pick-only response contract — full details in benchmark/evaluation/BASELINES.md. A curated gold subset under each cutoff ships fully self-contained: schema, validators, example loaders, LICENSE, CITATION inline.
Full schema: docs/FORECAST_DOSSIER.md.
EMR-ACH's evidence layer is article-level by default, but every benchmark cutoff also ships an Event Timeline Dossier: articles distilled into atomic dated facts that can be deduplicated, linked, and reasoned over independently.
| Stage | Script | What it does |
|---|---|---|
| 1 | articles_to_facts.py | LLM extraction of (subject, predicate, time, polarity) tuples |
| 2 | etd_dedup.py | Date-bucketed FAISS kNN canonicalisation (8 sec for 78k facts) |
| 3 | etd_link.py | Per-cutoff linkage to FDs via primary_article_id |
Headline ablation: B3 articles-only vs B10 hybrid (articles + facts) vs B10b facts-only — paper Table 6.
Every baseline returns one hypothesis label per FD (no probability distribution). Multi-sample methods aggregate via plurality vote; ties broken by hypothesis_set order.
| ID | Method | Calls / FD | Reference |
|---|---|---|---|
| B1 | Direct prompting | 1 | Brown et al. 2020 |
| B2 | Chain-of-Thought (ACH-style) | 1 | Wei et al. 2022 |
| B3 | RAG-only | 1 | Lewis et al. 2020 |
| B4 | Self-Consistency | | Wang et al. 2022 |
| B5 | Multi-Agent Debate | | Du et al. 2023 |
| B6 | Tree of Thoughts | | Yao et al. 2023 |
| B7 | Reflexion | | Shinn et al. 2023 |
| B8 | Verbalized Confidence (deprecated) | 1 | Lin et al. 2022 |
| B9 | Heterogeneous LLM Ensemble | | Jiang et al. 2023 |
| — | Majority-class reference | 0 | this work |
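The plurality-vote rule above is small enough to show in full; this sketch assumes samples have already been decoded into labels, and the fallback for an empty vote is an assumption rather than documented behaviour:

```python
from collections import Counter

def plurality_vote(samples: list[str], hypothesis_set: list[str]) -> str:
    """Pick the most frequent label; ties broken by hypothesis_set order."""
    counts = Counter(s for s in samples if s in hypothesis_set)
    best = max(counts.values(), default=0)
    if best == 0:
        return hypothesis_set[0]  # no valid sample: fall back to the first hypothesis (assumption)
    # Iterate in hypothesis_set order so the earlier hypothesis wins a tie.
    return next(h for h in hypothesis_set if counts.get(h, 0) == best)

print(plurality_vote(["Surprise", "Comply", "Surprise", "Comply"], ["Comply", "Surprise"]))  # Comply (tie)
```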
```bash
# Build a benchmark cutoff (one command, full pipeline)
python scripts/build_benchmark.py --cutoff 2026-01-01 \
    --benchmarks forecastbench,earnings \
    --horizon-days 14 --lookback-days 30 \
    --embedder openai --openai-mode batch

# Build a self-contained gold subset
python scripts/build_gold_subset.py --cutoff 2026-01-01 \
    --min-articles 5 --min-distinct-days 3 --min-source-diversity 2

# Run EMR-ACH on the gold subset
python scripts/eval/emrach_on_gold.py \
    --gold-dir benchmark/data/2026-01-01-h14-ccnews-gold \
    --mode batch

# Smoke a baseline
cd benchmark && python -m evaluation.baselines.runner \
    --method b1_direct \
    --fds data/2026-01-01-gold/forecasts.jsonl \
    --articles data/2026-01-01-gold/articles.jsonl \
    --smoke 3 --sync

# Full B1-B9 Batch API run (≈ $0.30 for 10 methods × 80 FDs)
python -m evaluation.baselines.runner --method b1_direct --batch ...
```

```text
EMR-ACH/
├── paper/index.html ← HTML paper (Tables 1-7, Appendices A-G)
├── paper/figures/ ← 8 paper figures, embedded in §3 + §6
├── docs/PROJECT_SPEC.md ← single source of truth
├── docs/V2_2_REFACTOR_BACKLOG.md ← 80-item living backlog
├── docs/FORECAST_DOSSIER.md ← FD schema + invariants
├── docs/EMRACH_IMPLEMENTATION_AUDIT.md ← method-side component audit
├── benchmark/
│ ├── DATASET.md ← schema + EDA + horizon/lookback config
│ ├── configs/ ← default_config.yaml, baselines.yaml
│ ├── schema/ ← FD + article + ETD JSON Schemas
│ ├── evaluation/baselines/ ← B1..B9 + runner
│ └── data/{cutoff}/ + {cutoff}-gold/ ← published benchmark + gold subset
├── src/
│ ├── pipeline/ ← EMR-ACH method modules (indicators, retrieval, multi-agent)
│ └── common/openai_embeddings.py ← Batch API helper
├── scripts/ ← every pipeline stage as a CLI
└── tests/ ← invariants + leakage regression
```
| Tag | Status | What it is |
|---|---|---|
| v2.1-data-ready | shipped | 6,294 FDs, 28,945 articles, 81-FD gold. Horizon=0 retrospective. Tagged 8ffba6f. |
| v2.2-h14 (in progress) | local | Horizon=14, lookback=30, fb+earnings. Reuse-first (89 FDs) + CC-News rebuild (89 FDs, 1,209 articles, 5,956 ETD facts). EMR-ACH runs end-to-end on the v2.2 gold subset. |
| v2.2-data-ready (next) | TBD | After Phase-3 batch fills paper Tables 5 / 6 / 7 with real numbers. |
The pipeline is fully reproducible from public sources:
- forecastbench ← upstream ForecastBench repo (resolved questions only).
- gdelt-cameo ← GDELT 2.0 KG (publicly hosted).
- earnings ← yfinance + Finnhub + EDGAR + GDELT-slug + NYT/Guardian editorial.
- CC-News ← Common Crawl (data.commoncrawl.org/crawl-data/CC-NEWS/{YYYY}/{MM}/).
No GPU required. Every stage routes through OpenAI Batch API or CPU pipelines (FAISS-CPU). Total reproducibility budget per cutoff-month is under $10.
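For orientation, a single stage's Batch API round-trip looks roughly like the following; the request-file shape is the generic OpenAI Batch format, and the model name and payloads are placeholders rather than what the pipeline actually submits.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Write one /v1/chat/completions request per FD into a JSONL file.
with open("requests.jsonl", "w") as f:
    for fd_id, prompt in [("fd_001", "…")]:            # placeholder payloads
        f.write(json.dumps({
            "custom_id": fd_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",            # placeholder model name
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and submit the batch; it completes asynchronously within the window.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(input_file_id=batch_file.id,
                              endpoint="/v1/chat/completions",
                              completion_window="24h")
print(batch.id, batch.status)
```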
Aperstein Y., Remez B., Apartsin A.
Evidence Matrix Reasoning: Structured Hypothesis Selection via Analysis of Competing Hypotheses.
2026. https://github.com/ApartsinProjects/EMR-ACH (tag v2.1-data-ready)
Code: MIT. Third-party data (GDELT, Yahoo Finance, ForecastBench upstream, Common Crawl, NYT, Guardian) retains its original upstream licence.

{ "id": "earn_AAPL_2026-01-30", "benchmark": "earnings", "question": "Will Apple's Q1 FY26 EPS surprise be Beat / Meet / Miss?", "hypothesis_set": ["Comply", "Surprise"], "ground_truth": "Comply", "forecast_point": "2026-01-16T00:00:00Z", // resolution_date − 14d "resolution_date": "2026-01-30T20:30:00Z", "article_ids": ["art_a995…", "art_e394…", ...], // every one passes leakage "prior_state_30d": "guidance reaffirmed; consensus 2.10 EPS", "fd_type": "stability", "metadata": {"ticker": "AAPL", "x_multiclass_gt": "Meet", ...} }