Code Review Agent Environment 🔍

A real-world RL environment that trains AI agents to review Python code — finding bugs, classifying them, and suggesting fixes.

Why Code Review?

Code review is a task software engineers perform daily. Large technology companies invest heavily in AI-assisted code review, and tools such as Copilot, CodeRabbit, and Codex are already in wide use. Yet, to our knowledge, there is no standardized RL environment for training code review agents. This environment aims to fill that gap.

Real-world utility: Train agents that can catch syntax errors, logic bugs, and security vulnerabilities before code ships to production.

How It Works

Agent receives Python code with a bug
    ↓
Agent submits: bug_line, bug_type, bug_description, suggested_fix
    ↓
Grader evaluates with decomposed sub-scores (0.0-1.0)
    ↓
Agent gets feedback, can refine review (up to 3 attempts per episode)
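The flow above can be sketched as a small driver loop. This is a minimal illustration, not the environment's own code: `policy` and `step` are hypothetical callables standing in for the agent and for a call to the server, and the observation keys are taken from the example observation later in this README.

```python
from typing import Callable, Dict, List

MAX_ATTEMPTS = 3  # from this README: up to 3 attempts per episode


def run_episode(
    first_observation: Dict,
    policy: Callable[[Dict], Dict],
    step: Callable[[Dict], Dict],
) -> List[float]:
    """Run one review episode and return the grader score of each attempt.

    `policy` maps an observation to an action dict; `step` submits the
    action and returns the next observation. Both are hypothetical stand-ins.
    """
    observation = first_observation
    scores: List[float] = []
    for _ in range(MAX_ATTEMPTS):
        action = policy(observation)   # review based on the code and last feedback
        observation = step(action)     # grader scores the submitted review
        scores.append(observation["grader_score"])
        if observation["done"]:        # the environment ended the episode early
            break
    return scores
```

Passing the previous observation back into `policy` is what lets the agent use the grader's feedback to refine its next attempt.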

Task Difficulty Progression

Difficulty	Task Count	What Agent Must Find	Example
Easy	15	Syntax errors	Missing colons, misspelled keywords, unclosed brackets
Medium	15	Logic bugs	Off-by-one errors, wrong conditions, missing edge cases
Hard	10	Security vulnerabilities	SQL injection (CWE-89), XSS (CWE-79), path traversal (CWE-22)

Total: 40 unique code review tasks

Action & Observation Spaces

Action (What the agent sends)

{
    "bug_line": 7,
    "bug_type": "sql_injection",
    "bug_description": "User input directly interpolated into SQL query via f-string",
    "suggested_fix": "cursor.execute('SELECT * FROM users WHERE name = ?', (name,))"
}
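A client can check an action locally before sending it. The sketch below uses a standard-library dataclass rather than the Pydantic models in `models.py`, so the class name `ReviewAction` and its validation rules (for example, treating `bug_line` as a 1-based line number) are assumptions, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReviewAction:
    """Hypothetical mirror of the action payload shown above."""

    bug_line: int
    bug_type: str
    bug_description: str
    suggested_fix: str

    def __post_init__(self) -> None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.bug_line, bool) or not isinstance(self.bug_line, int):
            raise TypeError("bug_line must be an integer")
        # Assumption: line numbers are 1-based.
        if self.bug_line < 1:
            raise ValueError("bug_line must be >= 1")
        for name in ("bug_type", "bug_description", "suggested_fix"):
            if not isinstance(getattr(self, name), str):
                raise TypeError(f"{name} must be a string")

    def to_json(self) -> str:
        """Serialize to the JSON object shape shown above."""
        return json.dumps(asdict(self))


action = ReviewAction(
    bug_line=7,
    bug_type="sql_injection",
    bug_description="User input directly interpolated into SQL query via f-string",
    suggested_fix="cursor.execute('SELECT * FROM users WHERE name = ?', (name,))",
)
payload = action.to_json()
```

The authoritative field list and types are whatever `/schema` returns; this sketch only catches obviously malformed actions early.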

Observation (What the agent receives)

{
    "task_id": "security_001",
    "difficulty": "hard",
    "description": "Find the security vulnerability in this database query function",
    "code_snippet": "import sqlite3\n\ndef get_user(username):\n    ...",
    "feedback": "Score: 0.85/1.0. Breakdown: {line_accuracy: 1.0, ...}",
    "grader_score": 0.85,
    "grader_breakdown": {"line_accuracy": 1.0, "vuln_classification": 0.7, ...},
    "steps_remaining": 2,
    "done": false,
    "reward": 0.85
}

Grading System

Grading is decomposed, deterministic, and hierarchically gated:

Easy (Syntax Errors)

Sub-Score	Weight	Measures
line_accuracy	30%	Did the agent find the correct line?
type_accuracy	30%	Did the agent classify the error type?
fix_quality	40%	Did the agent suggest a working fix?

Medium (Logic Bugs)

Sub-Score	Weight	Measures
line_accuracy	20%	Did the agent find the buggy line?
type_accuracy	30%	Did the agent classify the bug?
explanation_quality	10%	Did the agent explain the bug clearly?
fix_quality	40%	Did the agent suggest a correct fix?

Hard (Security Vulnerabilities)

Sub-Score	Weight	Measures
line_accuracy	15%	Did the agent find the vulnerable line?
vuln_classification	25%	Did the agent classify the vulnerability?
cwe_awareness	10%	Did the agent reference the CWE standard?
explanation_quality	10%	Did the agent explain the security impact?
fix_quality	40%	Did the agent suggest a secure fix?

Hierarchical gating: If line_accuracy = 0.0, fix_quality is multiplied by a penalty factor of 0.2-0.3, so a fix aimed at the wrong line earns only a fraction of its credit.
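The weighted sum and the gating rule can be illustrated as follows. The weights are copied from the tables above; the function name `total_score`, the fixed gate factor of 0.25 (chosen from the stated 0.2-0.3 range), and the final rounding are assumptions for illustration, and the real logic in `graders.py` may differ.

```python
from typing import Dict

# Weights copied from the three tables above.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "easy": {"line_accuracy": 0.30, "type_accuracy": 0.30, "fix_quality": 0.40},
    "medium": {
        "line_accuracy": 0.20,
        "type_accuracy": 0.30,
        "explanation_quality": 0.10,
        "fix_quality": 0.40,
    },
    "hard": {
        "line_accuracy": 0.15,
        "vuln_classification": 0.25,
        "cwe_awareness": 0.10,
        "explanation_quality": 0.10,
        "fix_quality": 0.40,
    },
}

GATE_FACTOR = 0.25  # assumption: a single value inside the documented 0.2-0.3 range


def total_score(
    difficulty: str,
    sub_scores: Dict[str, float],
    gate_factor: float = GATE_FACTOR,
) -> float:
    """Combine sub-scores into one score in [0.0, 1.0] with hierarchical gating."""
    weights = WEIGHTS[difficulty]
    # Missing sub-scores count as 0.0.
    scores = {name: sub_scores.get(name, 0.0) for name in weights}
    # Gating: a fix for the wrong line keeps only a fraction of its credit.
    if scores["line_accuracy"] == 0.0:
        scores["fix_quality"] *= gate_factor
    total = sum(weights[name] * scores[name] for name in weights)
    return round(min(max(total, 0.0), 1.0), 4)
```

Under these assumptions, an easy-task review with the wrong line but a correct type and a perfect fix would score 0.40 rather than the ungated 0.70, which shows how gating discourages fixes that are not tied to the right location.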

Reward Design

Partial progress signals at every step (not just end-of-episode)
Multi-attempt episodes: Agent gets 3 tries to refine its review
Best score tracked: The highest score across attempts is the episode reward
Invalid action handling: Malformed inputs return reward=-0.1 (not a crash)
Penalizes undesirable behavior: Empty or nonsense responses score near 0.0
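The reward rules above can be expressed in a few lines. This is a sketch of the bookkeeping only: the function names are hypothetical, a malformed action is represented here by `None`, and whether the real environment clamps an all-invalid episode to a different value is not stated in this README.

```python
from typing import List, Optional

INVALID_ACTION_REWARD = -0.1  # from this README


def step_reward(grader_score: Optional[float]) -> float:
    """Reward for one attempt: the grader score, or the fixed penalty
    when the action was malformed (represented here as None)."""
    if grader_score is None:
        return INVALID_ACTION_REWARD
    return grader_score


def episode_reward(step_rewards: List[float]) -> float:
    """Episode reward: the best reward seen across all attempts."""
    if not step_rewards:
        raise ValueError("an episode needs at least one attempt")
    return max(step_rewards)


# One malformed attempt followed by two graded attempts.
rewards = [step_reward(None), step_reward(0.4), step_reward(0.85)]
best = episode_reward(rewards)
```

Because the best attempt is kept, a weak or invalid first try does not lower the final episode reward, which leaves the agent free to explore across its three attempts.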

Baseline Scores

Tested with seed=42 for reproducibility:

Agent	Easy	Medium	Hard	Average
Heuristic (random guess)	0.09	0.06	0.03	0.06
GPT-4o-mini	TBD	TBD	TBD	TBD

The heuristic baseline scores near zero, which suggests the environment is non-trivial and cannot be solved by guessing alone.
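For intuition, a seeded random-guess agent might look like the sketch below. It is not the code in `baseline.py`: the function name is hypothetical, and of the `BUG_TYPES` labels only `sql_injection` appears in this README; the others are placeholders.

```python
import random
from typing import Dict

# Hypothetical label set; only "sql_injection" is taken from this README.
BUG_TYPES = ["syntax_error", "off_by_one", "sql_injection"]


def random_review(code_snippet: str, rng: random.Random) -> Dict:
    """Guess a line and a bug type at random, with no explanation or fix."""
    line_count = max(1, len(code_snippet.splitlines()))
    return {
        "bug_line": rng.randint(1, line_count),  # inclusive on both ends
        "bug_type": rng.choice(BUG_TYPES),
        "bug_description": "",
        "suggested_fix": "",
    }


# A dedicated Random instance keeps runs reproducible for a given seed.
rng = random.Random(42)
guess = random_review("def f(x)\n    return x + 1\n", rng)
```

An agent like this can only earn credit on line and type accuracy by chance and supplies no fix, which is consistent with the near-zero baseline row above.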

Setup & Usage

Prerequisites

Python 3.10+
uv package manager (recommended) or pip

Installation

git clone <repo-url>
cd code_review_env
uv sync

Run Locally

# Start the server
uv run server

# In another terminal, run the baseline
python baseline.py --heuristic-only

Run with Docker

docker build -t code-review-env -f server/Dockerfile .
docker run -p 7860:7860 code-review-env

Run Baseline with OpenAI

export OPENAI_API_KEY="sk-your-key-here"
python baseline.py --model gpt-4o-mini

API Endpoints

Endpoint	Method	Description
`/health`	GET	Health check
`/reset`	POST	Start a new code review episode
`/step`	POST	Submit a code review action
`/state`	GET	Get episode metadata (no ground truth)
`/schema`	GET	Get action/observation JSON schemas
`/tasks`	GET	List all tasks + action schema
`/grader`	GET	Get grader info and scoring structure
`/baseline`	POST/GET	Run heuristic baseline and return scores
`/baseline-trigger-inference-script`	POST/GET	Alias for /baseline
`/docs`	GET	Swagger API documentation
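The endpoints above can be called with the standard library alone. The following is a sketch under stated assumptions: the host is assumed to be `localhost`, the port is taken from the Docker example, and the request body shapes (an empty object for `/reset`, the bare action object for `/step`) are guesses that should be checked against `/schema` or `/docs`.

```python
import json
import urllib.request
from typing import Dict, Optional

BASE_URL = "http://localhost:7860"  # port from the Docker example; host is an assumption


def build_request(path: str, payload: Optional[Dict] = None) -> urllib.request.Request:
    """Build a GET request when payload is None, otherwise a JSON POST."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, payload: Optional[Dict] = None, timeout: float = 10.0) -> Dict:
    """Send the request and decode the JSON response."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a running server (body shapes are assumptions):
#   call("/health")
#   observation = call("/reset", {})
#   observation = call("/step", {"bug_line": 7, "bug_type": "sql_injection",
#                                "bug_description": "...", "suggested_fix": "..."})
```

Separating `build_request` from `call` makes the request construction testable without a live server.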

Agent Contract

✅ Agents MAY call: reset(), step(), state(), /tasks
❌ Agents MUST NOT call: /baseline, /grader during evaluation
🔒 state() returns episode_id and step_count ONLY — no ground truth
⚠️ Invalid actions return reward=-0.1 and a helpful feedback message (no crash)
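One way to keep training code within this contract is a client-side allowlist. The sketch below is an illustration, not part of the project: the function name is hypothetical, and rejecting every path outside the four permitted ones (including harmless endpoints such as `/health`) is a deliberately strict assumption.

```python
# Paths the contract above explicitly permits during evaluation.
ALLOWED_PATHS = frozenset({"/reset", "/step", "/state", "/tasks"})


def check_endpoint(path: str) -> str:
    """Return the normalized path if agents may call it, else raise.

    Strict by assumption: anything not explicitly allowed is rejected,
    which also covers /baseline, its alias, and /grader.
    """
    # Drop any query string, then normalize leading and trailing slashes.
    normalized = "/" + path.split("?", 1)[0].strip("/")
    if normalized not in ALLOWED_PATHS:
        raise PermissionError(f"agents must not call {normalized} during evaluation")
    return normalized
```

Calling `check_endpoint` before every request means a misconfigured agent fails fast on the client side instead of silently reaching the grader or baseline endpoints.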

Project Structure

code_review_env/
├── models.py                  # Pydantic Action & Observation models
├── graders.py                 # Decomposed grading logic (separated from env)
├── client.py                  # HTTP client for remote interaction
├── baseline.py                # Baseline inference script (heuristic + OpenAI)
├── tasks/
│   ├── syntax_errors.json     # 15 easy tasks
│   ├── logic_bugs.json        # 15 medium tasks
│   └── security_vulns.json    # 10 hard tasks
├── server/
│   ├── code_review_env_environment.py  # Core reset/step/state logic
│   ├── app.py                 # FastAPI app + REST endpoints
│   └── Dockerfile             # Container definition
├── tests/
│   └── test_environment.py    # 43 comprehensive tests
├── openenv.yaml               # Environment manifest
└── pyproject.toml             # Package metadata

License

BSD-style license. See LICENSE file.
