
# MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents


This repository hosts the end-to-end evaluation framework for MMDeepResearch-Bench (the MMDR benchmark).


## ✨ Key Features

### 🔬 Innovative Metrics for Grounded Research Quality

- **FLAE (Formula-LLM Adaptive Evaluation):** Measures report quality (readability, insightfulness, structure).
- **TRACE (Trustworthy Retrieval-Aligned Citation Evaluation):** Verifies citation support and claim-URL alignment.
  - **VEF (Visual Evidence Fidelity):** A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
- **MOSAIC (Multimodal Support-Aligned Integrity Check):** Validates consistency between generated text and visual artifacts (charts, diagrams, photos).

### 🛠️ Engineering & Usability

- **Smart Resume:** Skips already-completed tasks to reduce time and API cost.
- **Graceful Stop:** Safe shutdown via CLI (`stop`, `exit`) or `Ctrl+C`, ensuring partial results are flushed.
- **Precision Debugging:** Run a single case with `--quiz_first` or `--quiz_index`.
- **Multi-Provider Support:** Google Gemini, Azure OpenAI, OpenRouter.
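The "Smart Resume" idea above can be sketched in a few lines. This is an illustrative reconstruction, not the repository's actual implementation: the JSONL results layout and the `task_id` field name are assumptions.

```python
# Hypothetical sketch of Smart Resume: skip tasks whose results are
# already recorded in the run's JSONL log. File layout and field names
# ("task_id") are assumptions for illustration only.
import json
from pathlib import Path

def completed_ids(results_path: Path) -> set:
    """Collect task ids already present in a JSONL results file."""
    done = set()
    if results_path.exists():
        for line in results_path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line).get("task_id"))
    return done

def pending_tasks(tasks: list, results_path: Path) -> list:
    """Return only the tasks that have no recorded result yet."""
    done = completed_ids(results_path)
    return [t for t in tasks if t["task_id"] not in done]
```

On a re-run, only the pending subset is dispatched, which is what saves time and API cost.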

## 📦 Installation

### 1) Clone

```bash
git clone https://github.com/YourUsername/MMDR.git
cd MMDR
```

### 2) Install dependencies

```bash
pip install -r requirements.txt
```

## ⚙️ Configuration

### 1) Create `.env`

```bash
cp env.txt .env
```

### 2) Edit `.env`

Example (adjust to your providers/models):

```bash
# --- Roles ---
MMDR_REPORT_PROVIDER=gemini       # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure         # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
```
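For reference, here is a minimal stdlib-only sketch of how such a `.env` file can be parsed into environment variables. The real pipeline may well use `python-dotenv` instead; this is an assumption, shown only to make the file format concrete.

```python
# Minimal .env reader (illustrative; the actual pipeline's loader may
# differ). Handles KEY=VALUE lines, blank lines, and '#' comments.
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines; blanks and '#' comments are ignored."""
    with open(path) as fh:
        for raw in fh:
            line = raw.split("#", 1)[0].strip()  # drop inline comments
            if "=" in line:
                key, _, value = line.partition("=")
                # setdefault: real environment variables win over .env
                os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
report_provider = os.environ.get("MMDR_REPORT_PROVIDER", "gemini")
```

Note that inline comments such as `# gemini | azure | openrouter` are stripped before the value is stored.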

## 🚀 Usage

### 1) Quick verification (recommended first run)

Run the first question only, to confirm API access and paths:

```bash
python run_pipeline.py --quiz_first
```

### 2) Full batch run

Process all tasks in `quiz.jsonl`:

```bash
python run_pipeline.py --run_id experiment_v1
```

### 3) Targeted debugging

Re-run a single item by 1-based index:

```bash
python run_pipeline.py --quiz_index 5 --run_id debug_q5
```

### 4) Parallel mode

```bash
python run_pipeline.py --max_workers 4
```

## 🎮 Runtime Controls

| Command | Action |
| --- | --- |
| `stop` + Enter | Safely stop after current tasks finish; saves outputs |
| `Ctrl+C` | Triggers the same graceful shutdown behavior |

## 📂 Output Structure

Outputs are written to `reports_runs/<RUN_ID>/`:

```
reports_runs/experiment_v1/
├── reports/                  # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl   # detailed logs (scores/errors/timings)
├── summary/
│   ├── experiment_v1.json    # machine-readable aggregated metrics
│   └── experiment_v1.txt     # human-readable summary
└── mm/                       # multimodal intermediate artifacts
```
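The per-task JSONL log can be inspected programmatically after a run. A small sketch, assuming the log is one JSON object per line; the `error` field name is an assumption used here for illustration:

```python
# Stream the per-task JSONL log and tally entries; "error" as a field
# name is an assumption, not confirmed by the repository.
import json
from pathlib import Path

def summarize_results(jsonl_path):
    """Return (total entries, entries that recorded an error)."""
    total = errors = 0
    for line in Path(jsonl_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        total += 1
        if record.get("error"):
            errors += 1
    return total, errors
```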

## 📊 Metrics Explanation

The pipeline outputs three aggregate scores and one final combined score:

| Aggregate | Full Name | Sub-metrics (Leaderboard) |
| --- | --- | --- |
| GEN | General Quality (FLAE) | Read. (Readability), Insh. (Insightfulness), Stru. (Structure), Coherence |
| EVI | Evidence Quality (TRACE) | Con. (Concordance), Cov. (Coverage), Fid. (Fidelity), Diversity |
| MM | Multimodal Quality (MOSAIC) | Sem. (Semantic), Vef. (Faithfulness), Acc. (Data Accuracy), VQA (VQA Score) |
| FINAL_MMDR | Weighted combination of the above | -- |
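The shape of the final combination is simple, though the actual weights live in the scoring scripts. The values below are placeholders, not the benchmark's real weights:

```python
# Illustrative only: FINAL_MMDR is a weighted combination of the three
# aggregates. The weights here are invented placeholders; see the
# scoring scripts for the values actually used.
def final_mmdr(gen, evi, mm, weights=(0.4, 0.3, 0.3)):
    """Combine GEN, EVI, and MM into one score with given weights."""
    w_gen, w_evi, w_mm = weights
    return w_gen * gen + w_evi * evi + w_mm * mm
```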

All sub-metrics are available in the output JSON file under `aggregates.{research|all}.submetrics`:

```
submetrics.general   ->  general.R, general.I, general.S, general.C, ...
submetrics.evidence  ->  evidence.E_con, evidence.E_cov, evidence.E_fid, evidence.E_div, ...
submetrics.mm        ->  mm.avg_metric_by_dim.semantic, .faithful, .data_accuracy, .vqa_score, ...
```
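Navigating those paths in the summary JSON is a two-key lookup. A sketch, assuming only the `aggregates.{research|all}.submetrics` layout described above (the concrete metric values in the test are invented):

```python
# Pull the sub-metrics dict out of the machine-readable summary JSON,
# following the aggregates.{research|all}.submetrics layout.
import json

def get_submetrics(summary_path, split="all"):
    """Return the submetrics dict for one split ('research' or 'all')."""
    with open(summary_path) as fh:
        summary = json.load(fh)
    return summary["aggregates"][split]["submetrics"]
```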

For detailed computation logic, see:

- `scoring_general.py` -- GEN (FLAE)
- `scoring_evidence.py` -- EVI (TRACE)
- `mm_router5_aggregate.py` -- MM (MOSAIC)
- `accuracy.py` -- VEF (verification gating)

## 🧾 Citation

If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:

```bibtex
@misc{huang2026mmdeepresearchbenchbenchmarkmultimodaldeep,
      title={MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents},
      author={Peizhou Huang and Zixuan Zhong and Zhongwei Wan and Donghao Zhou and Samiul Alam and Xin Wang and Zexin Li and Zhihao Dou and Li Zhu and Jing Xiong and Chaofan Tao and Yan Xu and Dimitrios Dimitriadis and Tuo Zhang and Mi Zhang},
      year={2026},
      eprint={2601.12346},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.12346},
}
```

## 📬 Contact and Community Results

If you run MMDR-Bench and obtain interesting results, please submit them through our Google Form:

Submit results and feedback via Google Form

We welcome reports on:

- new model results
- reproduction logs
- implementation issues
- suggestions for future benchmark extensions

## 📜 License

This project is released under the Apache-2.0 License. See LICENSE.
