agent-benchmark

Here are 20 public repositories matching this topic...

ai-agents-reality-check

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.

  • Updated Apr 2, 2026
  • Python

dojo.md

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

  • Updated Mar 13, 2026
  • TypeScript

AI Arena is a competitive evaluation framework in which multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and ELO.

  • Updated Apr 5, 2026
  • Python

Multimodal evaluation benchmark for AI agents in real-world field operations across 16 trades (HVAC, electrical, plumbing, roofing, solar, mining, oil & gas, marine, telecom, automotive, construction, and more). 194 cases; scores retrieval, code citation, jurisdiction, safety, trajectory, multi-turn, speed; 5-layer contamination defense.

  • Updated Apr 19, 2026
  • Python

🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.

  • Updated Apr 20, 2026
  • Python

BenchClaw - Multi-Dimensional AI Agent Benchmark. Connect any LLM agent (Claude, GPT, Gemini, Kimi, Qwen, DeepSeek...) to the P2PCLAW network and get scored on 10 dimensions + Tribunal IQ. Works as a VS Code/Cursor/Windsurf extension, CLI, browser extension, Claude skill, Pinokio app, or plain copy-paste prompt.

  • Updated Apr 18, 2026
  • HTML
