🔴 Priority: HIGH | Type: Enhancement
1. SUMMARY
- No automated benchmarking suite exists to measure and track SQL generation accuracy against standard datasets (Spider, WikiSQL, BIRD).
- Impact: Cannot quantify accuracy claims (README states "90% on complex queries"), detect accuracy regressions, or compare model versions objectively.
2. SYSTEM CONTEXT
```
benchmarks/                      # NEW - proposed structure
├── datasets/
│   ├── spider/                  # Spider dataset (cross-domain)
│   ├── wikisql/                 # WikiSQL dataset (single-table)
│   └── custom/                  # Custom domain-specific tests
├── runners/
│   ├── accuracy_runner.py       # Run benchmarks, collect metrics
│   └── report_generator.py      # Generate reports/dashboards
├── results/
│   └── {timestamp}/             # Historical results
└── configs/
    └── benchmark_config.yaml    # Dataset configs, thresholds
```
Current state:
- README claims "90% accuracy on complex queries" with no reproducible proof
- No baseline metrics for comparison
- No regression detection in CI
3. CURRENT STATE (with code)
📄 File: README.md:244-250
## Benchmarks
| Metric | Traditional Pipeline | Agent Approach |
|--------|---------------------|----------------|
| Simple Queries | 85% accuracy | 92% accuracy |
| Complex Queries | 60% accuracy | 90% accuracy |
| Silent Failures | 15% | <1% |
These claims are not backed by reproducible benchmarks.
📄 File: docs/IMPLEMENTATION_PLAN.md (excerpt)
- Accuracy improvements via self-correction
- No formal benchmark suite defined
The implementation plan mentions accuracy improvements but defines no benchmarking infrastructure.
4. PROPOSED SOLUTION
Create a comprehensive benchmarking suite that:
- Supports standard Text2SQL datasets (Spider, WikiSQL, BIRD)
- Measures execution accuracy (exact match, execution match)
- Tracks metrics over time and across model versions
- Integrates with CI for regression detection
📄 File: benchmarks/runners/accuracy_runner.py (NEW)
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BenchmarkResult:
    dataset: str
    total_queries: int
    exact_match: int
    execution_match: int
    exact_match_pct: float
    execution_match_pct: float
    avg_confidence: float
    avg_latency_ms: float
    errors: list[dict]


class AccuracyBenchmark:
    def __init__(self, agent_engine, dataset_path: Path):
        self.agent = agent_engine
        self.dataset = self._load_dataset(dataset_path)

    async def run(self) -> BenchmarkResult:
        results = []
        for example in self.dataset:
            results.append(await self._evaluate_single(example))
        return self._aggregate_results(results)

    async def _evaluate_single(self, example: dict) -> dict:
        generated = await self.agent.generate_sql(
            natural_query=example["question"],
            database_id=example["db_id"],
            execute=False,
        )
        return {
            "exact_match": self._normalize_sql(generated.sql)
            == self._normalize_sql(example["query"]),
            "execution_match": await self._check_execution_match(
                generated.sql, example["query"], example["db_id"]
            ),
            "confidence": generated.confidence,
            "latency_ms": generated.execution_time_ms,
        }

    # _load_dataset, _normalize_sql, _check_execution_match, and
    # _aggregate_results elided here; see the utils modules below.
```
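The runner leans on a `_normalize_sql` helper, slated for `benchmarks/utils/sql_normalize.py`. A deliberately naive sketch of that normalization; real harnesses such as Spider's evaluator parse the query rather than string-munging, so treat this as a starting point:

```python
import re


def normalize_sql(sql: str) -> str:
    """Naively canonicalize SQL for exact-match comparison.

    Lowercases, collapses runs of whitespace, strips a trailing
    semicolon, and removes spaces around commas and parentheses.
    """
    sql = sql.strip().rstrip(";").strip()
    sql = re.sub(r"\s+", " ", sql).lower()
    sql = re.sub(r"\s*([(),])\s*", r"\1", sql)
    return sql
```

Exact match via string normalization is known to under-count correct queries (aliases, join order); that is why the config below tracks execution match alongside it.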
📄 File: benchmarks/configs/benchmark_config.yaml (NEW)
```yaml
datasets:
  spider:
    path: benchmarks/datasets/spider
    split: dev
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.70
      execution_match: 0.80
  wikisql:
    path: benchmarks/datasets/wikisql
    split: test
    metrics: [exact_match, execution_match]
    thresholds:
      exact_match: 0.80
      execution_match: 0.85

reporting:
  output_dir: benchmarks/results
  format: [json, markdown, html]
  compare_with_baseline: true
```
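Regression detection then reduces to comparing a run's aggregated metrics against these thresholds and failing CI when any floor is crossed. A sketch; the function name and the plain-dict interface (rather than YAML parsing) are assumptions:

```python
def check_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the run passes.

    `metrics` holds observed scores (e.g. {"exact_match": 0.72}) and
    `thresholds` the minimum acceptable values from benchmark_config.yaml.
    A metric listed in the config but absent from the run also fails.
    """
    failures = []
    for name, minimum in thresholds.items():
        observed = metrics.get(name)
        if observed is None:
            failures.append(f"{name}: no result reported")
        elif observed < minimum:
            failures.append(f"{name}: {observed:.3f} below threshold {minimum:.3f}")
    return failures
```

A CI wrapper would call this per dataset and exit non-zero when any list is non-empty.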
5. IMPLEMENTATION CHECKLIST
Phase 1: Infrastructure
- Create the `benchmarks/` directory structure and package (`benchmarks/__init__.py`)
- Add `scripts/download_datasets.sh` to fetch Spider, WikiSQL, and BIRD
- Add dataset loading utilities (`benchmarks/utils/dataset_loader.py`)
Phase 2: Core Benchmark Runner
- Implement the `AccuracyBenchmark` class (`benchmarks/runners/accuracy_runner.py`)
- Implement SQL normalization for exact-match comparison (`benchmarks/utils/sql_normalize.py`)
Phase 3: Reporting & CI
- Implement `report_generator.py` (JSON, Markdown, HTML output)
- Add the `scripts/run_benchmarks.py` CLI and the scheduled `.github/workflows/benchmark.yml`
- Update README.md with reproducible, benchmark-backed results
Phase 4: Advanced Metrics
- Track the confidence and latency figures already collected in `BenchmarkResult` over time
- Compare runs against historical baselines for regression detection
6. FILES TO MODIFY TABLE
| File | Lines | Action | Description |
|------|-------|--------|-------------|
| benchmarks/__init__.py | NEW | Create | Benchmark package |
| benchmarks/runners/accuracy_runner.py | NEW | Create | Core benchmark runner |
| benchmarks/runners/report_generator.py | NEW | Create | Report generation |
| benchmarks/utils/sql_normalize.py | NEW | Create | SQL normalization for comparison |
| benchmarks/utils/dataset_loader.py | NEW | Create | Dataset loading utilities |
| benchmarks/configs/benchmark_config.yaml | NEW | Create | Benchmark configuration |
| scripts/download_datasets.sh | NEW | Create | Dataset download script |
| scripts/run_benchmarks.py | NEW | Create | CLI for running benchmarks |
| .github/workflows/benchmark.yml | NEW | Create | Scheduled benchmark workflow |
| README.md | TBD | Modify | Add benchmark badge and results |
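The scheduled workflow listed above might look roughly like the following; the cron cadence and the `run_benchmarks.py` flags are placeholders, not an existing interface:

```yaml
# .github/workflows/benchmark.yml -- sketch; adapt to the project's CI setup
name: benchmarks
on:
  schedule:
    - cron: "0 3 * * 1"   # weekly, Monday 03:00 UTC
  workflow_dispatch: {}    # allow manual runs
jobs:
  accuracy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: bash scripts/download_datasets.sh
      - run: python scripts/run_benchmarks.py --config benchmarks/configs/benchmark_config.yaml
```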
7. RISK ASSESSMENT
| Risk | Impact | Mitigation |
|------|--------|------------|
| Dataset licensing issues | 🟡 | Use permissively licensed datasets; document licenses |
| Benchmarks are slow (full dataset) | 🟡 | Support subset mode; run full benchmarks on schedule |
| Execution match requires real DBs | 🟡 | Set up benchmark databases; use Docker containers |
| Results vary with model updates | 🟢 | Track historical results; require approval for regressions |
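The "subset mode" mitigation can be as simple as deterministic sampling, so fast CI runs stay comparable across commits. The function name and default seed below are illustrative:

```python
import random


def sample_subset(examples: list, k: int, seed: int = 42) -> list:
    """Deterministically sample k examples for a fast CI benchmark run.

    A fixed seed keeps the subset stable across runs, so per-commit
    results are comparable; pass k >= len(examples) for the full set.
    """
    if k >= len(examples):
        return list(examples)
    rng = random.Random(seed)
    return rng.sample(examples, k)
```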
8. RELATED CONTEXT
- README.md:244-250 (unsubstantiated benchmark claims)
- docs/IMPLEMENTATION_PLAN.md (accuracy goals without benchmark infrastructure)