riyanmohmmeed-dev/RL-Code-Review-Arena

Code Review Agent Environment 🔍

A real-world RL environment that trains AI agents to review Python code: finding bugs, classifying them, and suggesting fixes.

Why Code Review?

Code review is a task most software engineers perform daily. Large companies such as Meta, Google, and Microsoft invest heavily in AI-assisted development, and tools such as Copilot, CodeRabbit, and Codex are increasingly applied to code review. Yet there is no widely adopted, standardized RL environment for training code review agents. This environment aims to fill that gap.

Real-world utility: Train agents that can catch syntax errors, logic bugs, and security vulnerabilities before code ships to production.

How It Works

Agent receives Python code with a bug
    ↓
Agent submits: bug_line, bug_type, bug_description, suggested_fix
    ↓
Grader evaluates with decomposed sub-scores (0.0-1.0)
    ↓
Agent gets feedback, can refine review (up to 3 attempts per episode)
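
The flow above can be sketched from the agent's side as follows. This is a minimal illustration, not the repository's client: `run_episode`, `guesser`, and the `FakeEnv` stand-in are hypothetical names, and the observation keys are assumed to match the example observation shown in this README. `FakeEnv` also assumes an episode ends on a perfect score, which the README does not state.

```python
from typing import Any, Callable, Dict

MAX_ATTEMPTS = 3  # from the README: up to 3 attempts per episode

Obs = Dict[str, Any]


def run_episode(reset: Callable[[], Obs],
                step: Callable[[Obs], Obs],
                propose_review: Callable[[Obs], Obs]) -> float:
    """Run one episode and return the best grader score seen."""
    obs = reset()
    best = 0.0
    for _ in range(MAX_ATTEMPTS):
        action = propose_review(obs)   # agent reads the code and any prior feedback
        obs = step(action)             # environment grades the review
        best = max(best, obs.get("grader_score", 0.0))
        if obs.get("done"):
            break
    return best


class FakeEnv:
    """Hypothetical stand-in for the real server, used only to exercise the loop."""

    def __init__(self) -> None:
        self.steps_remaining = MAX_ATTEMPTS

    def reset(self) -> Obs:
        self.steps_remaining = MAX_ATTEMPTS
        return {"code_snippet": "def f(x)\n    return x", "feedback": "", "done": False}

    def step(self, action: Obs) -> Obs:
        self.steps_remaining -= 1
        score = 1.0 if action.get("bug_line") == 1 else 0.0  # toy grader: bug is on line 1
        done = score == 1.0 or self.steps_remaining == 0
        return {"grader_score": score, "feedback": f"Score: {score}",
                "steps_remaining": self.steps_remaining, "done": done}


def guesser(obs: Obs) -> Obs:
    """Toy agent: guesses line 2 first, then switches to line 1 once it has feedback."""
    line = 1 if obs.get("feedback") else 2
    return {"bug_line": line, "bug_type": "missing_colon",
            "bug_description": "Function definition lacks a trailing colon",
            "suggested_fix": "def f(x):"}


env = FakeEnv()
best_score = run_episode(env.reset, env.step, guesser)
```

Passing `reset` and `step` as callables keeps the loop independent of the transport, so the same function could be driven by an HTTP client or an in-process environment.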

Task Difficulty Progression

| Difficulty | Task Count | What Agent Must Find | Examples |
|------------|------------|----------------------|----------|
| Easy | 15 | Syntax errors | Missing colons, misspelled keywords, unclosed brackets |
| Medium | 15 | Logic bugs | Off-by-one errors, wrong conditions, missing edge cases |
| Hard | 10 | Security vulnerabilities | SQL injection (CWE-89), XSS (CWE-79), path traversal (CWE-22) |

Total: 40 unique code review tasks

Action & Observation Spaces

Action (What the agent sends)

{
    "bug_line": 7,
    "bug_type": "sql_injection",
    "bug_description": "User input directly interpolated into SQL query via f-string",
    "suggested_fix": "cursor.execute('SELECT * FROM users WHERE name = ?', (name,))"
}
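
As an illustration of how such an action might be validated before grading, here is a minimal sketch. The repository's `models.py` is described as using Pydantic; this version uses stdlib dataclasses to stay dependency-free. `ReviewAction` and `parse_action` are hypothetical names, the field types are inferred from the example above, and 1-based line numbering is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Field names come from the example action; the types are inferred, not confirmed.
REQUIRED_FIELDS = {
    "bug_line": int,
    "bug_type": str,
    "bug_description": str,
    "suggested_fix": str,
}


@dataclass(frozen=True)
class ReviewAction:
    bug_line: int
    bug_type: str
    bug_description: str
    suggested_fix: str


def parse_action(payload: Dict[str, Any]) -> ReviewAction:
    """Return a ReviewAction, or raise ValueError describing the first problem found."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")
    if payload["bug_line"] < 1:  # assumption: line numbers are 1-based
        raise ValueError("bug_line must be >= 1")
    return ReviewAction(**{name: payload[name] for name in REQUIRED_FIELDS})


action = parse_action({
    "bug_line": 7,
    "bug_type": "sql_injection",
    "bug_description": "User input directly interpolated into SQL query via f-string",
    "suggested_fix": "cursor.execute('SELECT * FROM users WHERE name = ?', (name,))",
})
```

An environment could catch the `ValueError` and turn its message into feedback instead of crashing.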

Observation (What the agent receives)

{
    "task_id": "security_001",
    "difficulty": "hard",
    "description": "Find the security vulnerability in this database query function",
    "code_snippet": "import sqlite3\n\ndef get_user(username):\n    ...",
    "feedback": "Score: 0.85/1.0. Breakdown: {line_accuracy: 1.0, ...}",
    "grader_score": 0.85,
    "grader_breakdown": {"line_accuracy": 1.0, "vuln_classification": 0.7, ...},
    "steps_remaining": 2,
    "done": false,
    "reward": 0.85
}

Grading System

Grading is decomposed, deterministic, and hierarchically gated:

Easy (Syntax Errors)

| Sub-Score | Weight | Measures |
|-----------|--------|----------|
| line_accuracy | 30% | Did the agent find the correct line? |
| type_accuracy | 30% | Did the agent classify the error type? |
| fix_quality | 40% | Did the agent suggest a working fix? |

Medium (Logic Bugs)

| Sub-Score | Weight | Measures |
|-----------|--------|----------|
| line_accuracy | 20% | Did the agent find the buggy line? |
| type_accuracy | 30% | Did the agent classify the bug? |
| explanation_quality | 10% | Did the agent explain the bug clearly? |
| fix_quality | 40% | Did the agent suggest a correct fix? |

Hard (Security Vulnerabilities)

| Sub-Score | Weight | Measures |
|-----------|--------|----------|
| line_accuracy | 15% | Did the agent find the vulnerable line? |
| vuln_classification | 25% | Did the agent classify the vulnerability? |
| cwe_awareness | 10% | Did the agent reference the CWE standard? |
| explanation_quality | 10% | Did the agent explain the security impact? |
| fix_quality | 40% | Did the agent suggest a secure fix? |

Hierarchical gating: if line_accuracy is 0.0, fix_quality is multiplied by a penalty factor of 0.2 to 0.3, so a fix proposed for the wrong line earns little credit.
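
To make the weighting and gating concrete, here is a minimal sketch of how the sub-scores for a hard task could be combined. It is not the code in `graders.py`: `combine_scores` is a hypothetical name, and the gate factor of 0.25 is an assumed value picked from the stated 0.2-0.3 range.

```python
from typing import Dict

# Weights copied from the "Hard (Security Vulnerabilities)" table.
HARD_WEIGHTS: Dict[str, float] = {
    "line_accuracy": 0.15,
    "vuln_classification": 0.25,
    "cwe_awareness": 0.10,
    "explanation_quality": 0.10,
    "fix_quality": 0.40,
}

GATE_FACTOR = 0.25  # assumption: any value in the README's 0.2-0.3 range would fit


def combine_scores(sub_scores: Dict[str, float],
                   weights: Dict[str, float] = HARD_WEIGHTS,
                   gate_factor: float = GATE_FACTOR) -> float:
    """Weighted sum of sub-scores, with fix_quality penalized when the line is wrong."""
    gated = dict(sub_scores)
    if gated.get("line_accuracy", 0.0) == 0.0:
        # Hierarchical gate: a fix for the wrong line keeps only a fraction of its credit.
        gated["fix_quality"] = gated.get("fix_quality", 0.0) * gate_factor
    total = sum(weight * gated.get(name, 0.0) for name, weight in weights.items())
    return round(min(max(total, 0.0), 1.0), 4)


perfect = combine_scores({name: 1.0 for name in HARD_WEIGHTS})
wrong_line = combine_scores({**{name: 1.0 for name in HARD_WEIGHTS}, "line_accuracy": 0.0})
```

With these assumptions, a review that is perfect except for the line number loses the 0.15 line weight plus three quarters of the 0.40 fix weight.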

Reward Design

  • Partial progress signals at every step (not just end-of-episode)
  • Multi-attempt episodes: Agent gets 3 tries to refine its review
  • Best score tracked: The highest score across attempts is the episode reward
  • Invalid action handling: Malformed inputs return reward=-0.1 (not a crash)
  • Penalizes undesirable behavior: Empty or nonsense responses score near 0.0
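
The bookkeeping described in these bullets might look like the following sketch. `EpisodeRewards` is a hypothetical name, and two details are assumptions the README does not confirm: that the per-step reward equals the grader score (as in the example observation) and that an invalid action consumes one of the three attempts.

```python
from typing import Optional

INVALID_ACTION_REWARD = -0.1  # from the README
MAX_ATTEMPTS = 3              # from the README


class EpisodeRewards:
    """Tracks per-step rewards and the best score across attempts in one episode."""

    def __init__(self) -> None:
        self.attempts = 0
        self.best_score = 0.0

    def record(self, grader_score: Optional[float]) -> float:
        """Record one attempt and return its step reward.

        Pass None for a malformed action that could not be graded.
        """
        self.attempts += 1  # assumption: invalid actions also use up an attempt
        if grader_score is None:
            return INVALID_ACTION_REWARD
        self.best_score = max(self.best_score, grader_score)
        return grader_score

    @property
    def done(self) -> bool:
        return self.attempts >= MAX_ATTEMPTS


tracker = EpisodeRewards()
step_rewards = [tracker.record(0.4), tracker.record(None), tracker.record(0.85)]
```

Here the step rewards give the partial-progress signal, while `best_score` holds the episode-level result.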

Baseline Scores

Tested with seed=42 for reproducibility:

| Agent | Easy | Medium | Hard | Average |
|-------|------|--------|------|---------|
| Heuristic (random guess) | 0.09 | 0.06 | 0.03 | 0.06 |
| GPT-4o-mini | TBD | TBD | TBD | TBD |

The heuristic baseline scores near zero, which suggests the environment cannot be solved by guessing and requires genuine code understanding.

Setup & Usage

Prerequisites

  • Python 3.10+
  • uv package manager (recommended) or pip

Installation

git clone <repo-url>
cd code_review_env
uv sync

Run Locally

# Start the server
uv run server

# In another terminal, run the baseline
python baseline.py --heuristic-only

Run with Docker

docker build -t code-review-env -f server/Dockerfile .
docker run -p 7860:7860 code-review-env

Run Baseline with OpenAI

export OPENAI_API_KEY="sk-your-key-here"
python baseline.py --model gpt-4o-mini

API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check |
| /reset | POST | Start a new code review episode |
| /step | POST | Submit a code review action |
| /state | GET | Get episode metadata (no ground truth) |
| /schema | GET | Get action/observation JSON schemas |
| /tasks | GET | List all tasks + action schema |
| /grader | GET | Get grader info and scoring structure |
| /baseline | POST/GET | Run heuristic baseline and return scores |
| /baseline-trigger-inference-script | POST/GET | Alias for /baseline |
| /docs | GET | Swagger API documentation |
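
For readers who want to call these endpoints without the bundled `client.py`, the sketch below builds requests with the standard library only. `build_request` and `call` are hypothetical helpers. The base URL assumes the port from the Docker example, and sending the action as the bare JSON body of `/step` is also an assumption; the server may expect a different envelope, so check `/schema` first.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:7860"  # assumption: port taken from the Docker example


def build_request(path: str, payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST request."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def call(path: str, payload: Optional[Dict[str, Any]] = None, timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Requests are only constructed here; nothing is sent until call() is used.
health_req = build_request("/health")
reset_req = build_request("/reset", {})
step_req = build_request("/step", {"bug_line": 7, "bug_type": "sql_injection"})
```

With a server running, `call("/health")` followed by `call("/reset", {})` and `call("/step", action)` would exercise the main loop.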

Agent Contract

  • ✅ Agents MAY call: reset(), step(), state(), /tasks
  • ❌ Agents MUST NOT call: /baseline, /grader during evaluation
  • 🔒 state() returns episode_id and step_count ONLY, with no ground truth
  • ⚠️ Invalid actions return reward=-0.1 and a helpful feedback message (no crash)
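
An agent harness could enforce the contract on its own side with a small guard such as the one below. This is a hypothetical sketch: `assert_allowed` is not part of the repository, and treating `/baseline-trigger-inference-script` as forbidden is an inference from the endpoint table, which lists it as an alias for `/baseline`.

```python
# Paths the contract says agents must not call during evaluation.
FORBIDDEN_DURING_EVAL = frozenset({
    "/baseline",
    "/grader",
    "/baseline-trigger-inference-script",  # inferred: alias for /baseline
})


def assert_allowed(path: str, evaluating: bool = True) -> str:
    """Return the normalized path, or raise PermissionError if the contract forbids it."""
    # Drop any query string, then normalize leading and trailing slashes.
    normalized = "/" + path.split("?", 1)[0].strip("/")
    if evaluating and normalized in FORBIDDEN_DURING_EVAL:
        raise PermissionError(f"{normalized} must not be called during evaluation")
    return normalized


step_path = assert_allowed("/step")
```

Calling this guard before every request makes an accidental contract violation fail loudly on the agent side.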

Project Structure

code_review_env/
├── models.py                  # Pydantic Action & Observation models
├── graders.py                 # Decomposed grading logic (separated from env)
├── client.py                  # HTTP client for remote interaction
├── baseline.py                # Baseline inference script (heuristic + OpenAI)
├── tasks/
│   ├── syntax_errors.json     # 15 easy tasks
│   ├── logic_bugs.json        # 15 medium tasks
│   └── security_vulns.json    # 10 hard tasks
├── server/
│   ├── code_review_env_environment.py  # Core reset/step/state logic
│   ├── app.py                 # FastAPI app + REST endpoints
│   └── Dockerfile             # Container definition
├── tests/
│   └── test_environment.py    # 43 comprehensive tests
├── openenv.yaml               # Environment manifest
└── pyproject.toml             # Package metadata

License

BSD-style license. See LICENSE file.
