A real-world RL environment that trains AI agents to review Python code: finding bugs, classifying them, and suggesting fixes.
Code review is a task software engineers perform daily, and companies such as Meta, Google, and Microsoft invest heavily in AI-assisted coding and review tools (for example Copilot, CodeRabbit, and Codex). Yet, to our knowledge, there is no standardized RL environment for training code review agents. This environment aims to fill that gap.
Real-world utility: Train agents that can catch syntax errors, logic bugs, and security vulnerabilities before code ships to production.
Agent receives Python code with a bug
↓
Agent submits: bug_line, bug_type, bug_description, suggested_fix
↓
Grader evaluates with decomposed sub-scores (0.0-1.0)
↓
Agent gets feedback, can refine review (up to 3 attempts per episode)

| Difficulty | Task Count | What Agent Must Find | Example |
|---|---|---|---|
| Easy | 15 | Syntax errors | Missing colons, misspelled keywords, unclosed brackets |
| Medium | 15 | Logic bugs | Off-by-one errors, wrong conditions, missing edge cases |
| Hard | 10 | Security vulnerabilities | SQL injection (CWE-89), XSS (CWE-79), path traversal (CWE-22) |
Total: 40 unique code review tasks
{
  "bug_line": 7,
  "bug_type": "sql_injection",
  "bug_description": "User input directly interpolated into SQL query via f-string",
  "suggested_fix": "cursor.execute('SELECT * FROM users WHERE name = ?', (name,))"
}

{
  "task_id": "security_001",
  "difficulty": "hard",
  "description": "Find the security vulnerability in this database query function",
  "code_snippet": "import sqlite3\n\ndef get_user(username):\n ...",
  "feedback": "Score: 0.85/1.0. Breakdown: {line_accuracy: 1.0, ...}",
  "grader_score": 0.85,
  "grader_breakdown": {"line_accuracy": 1.0, "vuln_classification": 0.7, ...},
  "steps_remaining": 2,
  "done": false,
  "reward": 0.85
}

Grading is decomposed, deterministic, and hierarchically gated:

Easy (syntax errors):

| Sub-Score | Weight | Measures |
|---|---|---|
| line_accuracy | 30% | Did the agent find the correct line? |
| type_accuracy | 30% | Did the agent classify the error type? |
| fix_quality | 40% | Did the agent suggest a working fix? |

Medium (logic bugs):

| Sub-Score | Weight | Measures |
|---|---|---|
| line_accuracy | 20% | Did the agent find the buggy line? |
| type_accuracy | 30% | Did the agent classify the bug? |
| explanation_quality | 10% | Did the agent explain the bug clearly? |
| fix_quality | 40% | Did the agent suggest a correct fix? |

Hard (security vulnerabilities):

| Sub-Score | Weight | Measures |
|---|---|---|
| line_accuracy | 15% | Did the agent find the vulnerable line? |
| vuln_classification | 25% | Did the agent classify the vulnerability? |
| cwe_awareness | 10% | Did the agent reference the CWE standard? |
| explanation_quality | 10% | Did the agent explain the security impact? |
| fix_quality | 40% | Did the agent suggest a secure fix? |

Hierarchical gating: If line_accuracy = 0.0, fix_quality is multiplied by a penalty factor of 0.2-0.3, so a fix proposed for the wrong line earns only a fraction of its credit.
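
As a rough illustration of how the weighted sub-scores and the gating rule could combine, here is a minimal sketch for the hard tier. The weights come from the table above; the function name `combine_scores` and the single fixed penalty of 0.25 (chosen from the stated 0.2-0.3 range) are assumptions for illustration, not the logic in graders.py.

```python
# Weights for the hard tier, copied from the sub-score table.
HARD_WEIGHTS = {
    "line_accuracy": 0.15,
    "vuln_classification": 0.25,
    "cwe_awareness": 0.10,
    "explanation_quality": 0.10,
    "fix_quality": 0.40,
}

# Assumption: one fixed penalty inside the documented 0.2-0.3 range.
GATE_PENALTY = 0.25


def combine_scores(sub_scores, weights=HARD_WEIGHTS, gate_penalty=GATE_PENALTY):
    """Return the weighted total, scaling fix_quality down when the line is wrong."""
    gated = dict(sub_scores)  # copy so the caller's breakdown is not mutated
    if gated.get("line_accuracy", 0.0) == 0.0:
        gated["fix_quality"] = gated.get("fix_quality", 0.0) * gate_penalty
    total = sum(weight * gated.get(name, 0.0) for name, weight in weights.items())
    return round(total, 4)


perfect = {name: 1.0 for name in HARD_WEIGHTS}
wrong_line = dict(perfect, line_accuracy=0.0)
print(combine_scores(perfect), combine_scores(wrong_line))
```

With this penalty, a review that gets everything right except the line keeps only a quarter of its fix_quality credit, in addition to losing the line_accuracy weight.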
- Partial progress signals at every step (not just end-of-episode)
- Multi-attempt episodes: Agent gets 3 tries to refine its review
- Best score tracked: The highest score across attempts is the episode reward
- Invalid action handling: Malformed inputs return reward=-0.1 (not a crash)
- Penalizes undesirable behavior: Empty or nonsense responses score near 0.0
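
The reward properties listed above can be sketched as a small bookkeeping class. This is only an illustration: `EpisodeTracker` is a hypothetical name, and it assumes that a malformed action consumes one of the three attempts, which the text does not specify.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3        # from the text: up to 3 attempts per episode
INVALID_REWARD = -0.1   # from the text: malformed inputs return reward=-0.1


@dataclass
class EpisodeTracker:
    """Track per-step rewards and the best score within one episode."""

    best_score: float = 0.0
    attempts_used: int = 0

    @property
    def done(self):
        return self.attempts_used >= MAX_ATTEMPTS

    def record(self, grader_score):
        """Record one attempt; pass None for a malformed action."""
        if self.done:
            raise RuntimeError("episode is already finished")
        # Assumption: invalid actions also count as an attempt.
        self.attempts_used += 1
        if grader_score is None:
            return INVALID_REWARD
        self.best_score = max(self.best_score, grader_score)
        return grader_score


tracker = EpisodeTracker()
step_rewards = [tracker.record(score) for score in (0.4, None, 0.85)]
print(step_rewards, tracker.best_score, tracker.done)
```

Each step returns its own reward, so the agent gets a signal on every attempt, while the episode-level result is the best score seen.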
Tested with seed=42 for reproducibility:
| Agent | Easy | Medium | Hard | Average |
|---|---|---|---|---|
| Heuristic (random guess) | 0.09 | 0.06 | 0.03 | 0.06 |
| GPT-4o-mini | TBD | TBD | TBD | TBD |
The heuristic baseline scores near zero, which suggests the environment is non-trivial and requires genuine code understanding to solve.
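
For intuition about what a seeded random-guess agent looks like, here is a minimal sketch. It is not the code in baseline.py: `random_guess` and the `BUG_TYPES` list are hypothetical (only "sql_injection" appears in this document), and the action fields follow the example action shown earlier.

```python
import random

# Hypothetical label set; only "sql_injection" is taken from the text.
BUG_TYPES = ["syntax_error", "off_by_one", "sql_injection"]


def random_guess(code_snippet, rng):
    """Return an action that guesses a line and a bug type at random."""
    line_count = max(1, len(code_snippet.splitlines()))
    return {
        "bug_line": rng.randint(1, line_count),
        "bug_type": rng.choice(BUG_TYPES),
        "bug_description": "",
        "suggested_fix": "",
    }


snippet = "def add(a, b)\n    return a + b\n"
# Seeding a dedicated Random instance makes the guesses repeatable.
first = random_guess(snippet, random.Random(42))
second = random_guess(snippet, random.Random(42))
print(first == second)
```

Because such an agent leaves the description and fix empty, it can only collect partial credit from lucky line or type guesses.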
- Python 3.10+
- uv package manager (recommended) or pip
git clone <repo-url>
cd code_review_env
uv sync

# Start the server
uv run server
# In another terminal, run the baseline
python baseline.py --heuristic-only

# Or build and run with Docker
docker build -t code-review-env -f server/Dockerfile .
docker run -p 7860:7860 code-review-env

# Run the OpenAI baseline (requires an API key)
export OPENAI_API_KEY="sk-your-key-here"
python baseline.py --model gpt-4o-mini

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /reset | POST | Start a new code review episode |
| /step | POST | Submit a code review action |
| /state | GET | Get episode metadata (no ground truth) |
| /schema | GET | Get action/observation JSON schemas |
| /tasks | GET | List all tasks + action schema |
| /grader | GET | Get grader info and scoring structure |
| /baseline | POST/GET | Run heuristic baseline and return scores |
| /baseline-trigger-inference-script | POST/GET | Alias for /baseline |
| /docs | GET | Swagger API documentation |

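
To show how a client might talk to these endpoints, here is a minimal sketch using only the standard library. Several details are assumptions: the base URL reuses port 7860 from the docker run command, the helper names `build_post` and `send` are hypothetical, and wrapping the action under an "action" key is a guess at the request body. The /schema endpoint is the place to confirm the real shape.

```python
import json
import urllib.request

# Assumption: the server listens on the port published by the docker run command.
BASE_URL = "http://localhost:7860"


def build_post(path, payload):
    """Build (but do not send) a JSON POST request for the given endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


action = {
    "bug_line": 7,
    "bug_type": "sql_injection",
    "bug_description": "User input directly interpolated into SQL query via f-string",
    "suggested_fix": "cursor.execute('SELECT * FROM users WHERE name = ?', (name,))",
}
# Assumption: /step accepts the action wrapped under an "action" key.
step_request = build_post("/step", {"action": action})
print(step_request.get_method(), step_request.full_url)
```

With a server running, `send(build_post("/reset", {}))` followed by `send(step_request)` would start an episode and submit one review, provided the payload assumptions hold.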
- Agents MAY call: reset(), step(), state(), /tasks
- Agents MUST NOT call: /baseline, /grader during evaluation
- state() returns episode_id and step_count ONLY, with no ground truth
- Invalid actions return reward=-0.1 and a helpful feedback message (no crash)

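
The last rule, returning a penalty and feedback instead of crashing, could look roughly like the sketch below. It uses plain Python rather than the Pydantic models in models.py, and it assumes all four action fields are required and that line numbers start at 1; `validate_action` is a hypothetical helper.

```python
INVALID_REWARD = -0.1  # from the text: invalid actions return reward=-0.1

# Assumption: every field of the example action is required.
REQUIRED_FIELDS = {
    "bug_line": int,
    "bug_type": str,
    "bug_description": str,
    "suggested_fix": str,
}


def _invalid(message):
    """Build the observation fragment returned for a malformed action."""
    return {"reward": INVALID_REWARD, "feedback": message}


def validate_action(action):
    """Return None if the action is well formed, else a penalty observation."""
    if not isinstance(action, dict):
        return _invalid("Action must be a JSON object.")
    for name, expected in REQUIRED_FIELDS.items():
        value = action.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return _invalid(f"Field '{name}' must be of type {expected.__name__}.")
    if action["bug_line"] < 1:
        return _invalid("Field 'bug_line' must be 1 or greater.")
    return None


print(validate_action({"bug_line": "seven"}))
```

Returning a structured observation keeps the episode alive, so the agent can read the feedback and correct its next attempt.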
code_review_env/
├── models.py                 # Pydantic Action & Observation models
├── graders.py                # Decomposed grading logic (separated from env)
├── client.py                 # HTTP client for remote interaction
├── baseline.py               # Baseline inference script (heuristic + OpenAI)
├── tasks/
│   ├── syntax_errors.json    # 15 easy tasks
│   ├── logic_bugs.json       # 15 medium tasks
│   └── security_vulns.json   # 10 hard tasks
├── server/
│   ├── code_review_env_environment.py  # Core reset/step/state logic
│   ├── app.py                # FastAPI app + REST endpoints
│   └── Dockerfile            # Container definition
├── tests/
│   └── test_environment.py   # 43 comprehensive tests
├── openenv.yaml              # Environment manifest
└── pyproject.toml            # Package metadata

BSD-style license. See LICENSE file.