This repository contains the codebase of the end-to-end evaluation framework for MMDeepResearch-Bench (the MMDR benchmark).
- FLAE (Formula-LLM Adaptive Evaluation): Measures report quality (readability, insightfulness, structure).
- TRACE (Trustworthy Retrieval-Aligned Citation Evaluation): Verifies citation support and claim-URL alignment.
- VEF (Visual Evidence Fidelity): A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
- MOSAIC (Multimodal Support-Aligned Integrity Check): Validates consistency between generated text and visual artifacts (Charts, Diagrams, Photos).
- Smart Resume: Skips already-completed tasks to reduce time and API cost.
- Graceful Stop: Safe shutdown via CLI (`stop`, `exit`) or `Ctrl+C`, ensuring partial results are flushed.
- Precision Debugging: Run a single case with `--quiz_first` or `--quiz_index`.
- Multi-Provider Support: Google Gemini, Azure OpenAI, OpenRouter.
```bash
git clone https://github.com/YourUsername/MMDR.git
cd MMDR
pip install -r requirements.txt
cp env.txt .env
```

Example `.env` (adjust to your providers/models):
```env
# --- Roles ---
MMDR_REPORT_PROVIDER=gemini      # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure        # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
```

Run the first question only to confirm API + paths:
```bash
python run_pipeline.py --quiz_first
```

Process all tasks in `quiz.jsonl`:
```bash
python run_pipeline.py --run_id experiment_v1
```

Re-run a single item by 1-based index:
```bash
python run_pipeline.py --quiz_index 5 --run_id debug_q5
```

Control concurrency with `--max_workers`:

```bash
python run_pipeline.py --max_workers 4
```

| Command | Action |
|---|---|
| `stop` + Enter | Safely stop after current tasks finish; saves outputs |
| `Ctrl+C` | Triggers the same graceful shutdown behavior |
Outputs are written to `reports_runs/<RUN_ID>/`:
```text
reports_runs/experiment_v1/
├── reports/                     # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl      # detailed logs (scores/errors/timings)
├── summary/
│   ├── experiment_v1.json       # machine-readable aggregated metrics
│   └── experiment_v1.txt        # human-readable summary
└── mm/                          # multimodal intermediate artifacts
```
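To inspect a finished run programmatically, the per-task records in `results/*.jsonl` can be loaded with a small helper like the one below. This is a sketch: the record schema (field names inside each JSON line) is defined by the pipeline and should be checked against your actual output.

```python
import json
from pathlib import Path

def parse_jsonl(text: str) -> list[dict]:
    """Parse newline-delimited JSON records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def load_run_results(run_dir: str) -> list[dict]:
    # Collect every record from results/*.jsonl under the run directory.
    records = []
    for path in sorted(Path(run_dir, "results").glob("*.jsonl")):
        records.extend(parse_jsonl(path.read_text(encoding="utf-8")))
    return records
```

For example, `load_run_results("reports_runs/experiment_v1")` would return one dict per evaluated task.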
The pipeline outputs three aggregate scores and one final combined score:
| Aggregate | Full Name | Sub-metrics (Leaderboard) |
|---|---|---|
| GEN | General Quality (FLAE) | Read. (Readability), Insh. (Insightfulness), Stru. (Structure), Coherence |
| EVI | Evidence Quality (TRACE) | Con. (Concordance), Cov. (Coverage), Fid. (Fidelity), Diversity |
| MM | Multimodal Quality (MOSAIC) | Sem. (Semantic), Vef. (Faithfulness), Acc. (Data Accuracy), VQA (VQA Score) |
| FINAL_MMDR | Weighted combination of above | -- |
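For illustration, a weighted combination of the three aggregates could be computed as below. The actual weights for FINAL_MMDR are defined in the pipeline's scoring code; the equal weights here are an assumption, not the benchmark's official values.

```python
def combine_final(gen: float, evi: float, mm: float,
                  weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Weighted combination of the GEN, EVI, and MM aggregates.

    The default equal weighting is illustrative only; substitute the
    weights actually used by the pipeline.
    """
    w_gen, w_evi, w_mm = weights
    return w_gen * gen + w_evi * evi + w_mm * mm
```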
All sub-metrics are available in the output JSON file under `aggregates.{research|all}.submetrics`:
```text
submetrics.general  -> general.R, general.I, general.S, general.C, ...
submetrics.evidence -> evidence.E_con, evidence.E_cov, evidence.E_fid, evidence.E_div, ...
submetrics.mm       -> mm.avg_metric_by_dim.semantic, .faithful, .data_accuracy, .vqa_score, ...
```
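These dotted paths can be resolved against the summary JSON with a small helper. A sketch under the assumption that the file nests plain dicts along the paths shown above; verify the exact keys against your own `summary/<RUN_ID>.json`:

```python
import json

def get_path(obj: dict, dotted: str):
    """Walk a nested dict following a dotted key path,
    e.g. 'aggregates.research.submetrics.general'."""
    for key in dotted.split("."):
        obj = obj[key]
    return obj

def load_submetrics(summary_path: str, scope: str = "research") -> dict:
    # scope is "research" or "all", matching aggregates.{research|all}.
    with open(summary_path, encoding="utf-8") as f:
        summary = json.load(f)
    return get_path(summary, f"aggregates.{scope}.submetrics")
```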
For detailed computation logic, see:

- `scoring_general.py` -- GEN (FLAE)
- `scoring_evidence.py` -- EVI (TRACE)
- `mm_router5_aggregate.py` -- MM (MOSAIC)
- `accuracy.py` -- VEF (verification gating)
If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:
```bibtex
@misc{huang2026mmdeepresearchbenchbenchmarkmultimodaldeep,
      title={MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents},
      author={Peizhou Huang and Zixuan Zhong and Zhongwei Wan and Donghao Zhou and Samiul Alam and Xin Wang and Zexin Li and Zhihao Dou and Li Zhu and Jing Xiong and Chaofan Tao and Yan Xu and Dimitrios Dimitriadis and Tuo Zhang and Mi Zhang},
      year={2026},
      eprint={2601.12346},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.12346},
}
```

If you run MMDR-Bench and obtain interesting results, please submit them through our Google Form:
Submit results and feedback via Google Form
We welcome reports on:
- new model results
- reproduction logs
- implementation issues
- suggestions for future benchmark extensions
This project is released under the Apache-2.0 License. See LICENSE.