SusVibes is a comprehensive benchmark and evaluation pipeline designed to expose the security vulnerabilities in code generated by AI agents when solving real-world software engineering tasks. This framework provides a standardized evaluation pipeline to assess the quality and security implications of agent-generated code across diverse software engineering domains.
The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source software projects. Agent solutions are evaluated for functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weakness types from the Common Weakness Enumeration (CWE). On average, the code base for each task has 160k lines of code across 867 files, and a solution requires over 170 lines of code, providing a realistic evaluation for coding agents.
- Clone the repository:
    git clone https://github.com/LeiLiLab/susvibes.git
    cd susvibes
- Install Python dependencies:
    conda create -n sv python=3.11
    conda activate sv
    pip install -r requirements.txt
    pip install -e .
- The SusVibes dataset is placed under the datasets/ directory for convenient usage:
    datasets/
    └── susvibes_dataset.jsonl

The SusVibes dataset contains task information with the following key fields:
- instance_id: Unique identifier for each task from real-world projects, formatted as repo-owner__repo-name_commit-id
- image_name: Pre-built Docker image containing the development environment of each task
- problem_statement: Natural language description of the task for agent input
- Other metadata and evaluation specifications are omitted here.
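As a rough illustration of working with these fields, the sketch below reads the JSONL file and collects the distinct Docker images it references. The helper names `load_tasks` and `unique_images` are hypothetical and not part of SusVibes; the only assumptions taken from the text above are the file location and the `instance_id`, `image_name`, and `problem_statement` keys.

```python
import json
from pathlib import Path

# Location taken from the directory layout described above.
DATASET_PATH = Path("datasets/susvibes_dataset.jsonl")


def load_tasks(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                tasks.append(json.loads(line))
    return tasks


def unique_images(tasks):
    """Return the distinct image_name values in first-seen order."""
    return list(dict.fromkeys(task["image_name"] for task in tasks))
```

Calling `load_tasks(DATASET_PATH)` from the repository root should then give a list of dictionaries, and `unique_images(...)` the set of images that would need to be pulled before running an agent.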
- Prepare the environment:
  - Pull Docker images specified in the image_name field:
      docker pull <image_name>
  - The project code which the task operates on is located at /project within each Docker container
- Execute your agent:
  - Feed the problem_statement to your agent
  - Let the agent generate code solutions within the containerized environment
- Format predictions: Save your agent's outputs in JSONL format with the following structure:
    { "instance_id": "repo-owner__repo-name_commit-id", "model_name_or_path": "your-model-name", "model_patch": "the-implementation-patch" }
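The prediction format can be produced with a few lines of Python. This is a minimal sketch, not the project's own tooling: `write_predictions` is a hypothetical helper, and the exact form of the patch text (for example a unified diff taken inside /project) is an assumption, since the README only calls it the implementation patch.

```python
import json

# The three keys shown in the structure above.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def write_predictions(predictions, path):
    """Write one JSON object per line, checking the required keys first."""
    with open(path, "w", encoding="utf-8") as f:
        for pred in predictions:
            missing = [key for key in REQUIRED_KEYS if key not in pred]
            if missing:
                raise ValueError(f"prediction is missing keys: {missing}")
            # json.dumps escapes newlines inside the patch, so each
            # prediction stays on a single line of the JSONL file.
            f.write(json.dumps(pred) + "\n")
```

A call such as `write_predictions([{"instance_id": ..., "model_name_or_path": ..., "model_patch": ...}], "predictions.jsonl")` then yields a file that can be passed to the evaluation step; the output filename here is only an example.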
For an example of how to run Kimi CLI on SusVibes, see the tutorial.
Note: SusVibes evaluation can be resource intensive. The recommended hardware for an accurate evaluation is at least 400GB of free storage, 4GB of RAM and 4 CPU cores per parallel worker, on an x86_64 machine.
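To turn the per-worker figures into a worker count, one can take the smaller of the CPU-bound and RAM-bound limits. The sketch below is an illustrative calculation only; `suggest_max_workers` is a hypothetical helper, and it deliberately leaves storage out because the note does not say how the 400GB figure scales with workers.

```python
# Per-worker figures quoted in the note above.
RAM_GB_PER_WORKER = 4
CORES_PER_WORKER = 4


def suggest_max_workers(cpu_cores, ram_gb):
    """Return a worker count that fits both the CPU and RAM budgets."""
    by_cpu = cpu_cores // CORES_PER_WORKER
    by_ram = int(ram_gb // RAM_GB_PER_WORKER)
    # Always allow at least one worker, even on a small machine.
    return max(1, min(by_cpu, by_ram))
```

For instance, a machine with 20 cores and 64GB of RAM is CPU-bound at 5 workers, while 16 cores with 8GB of RAM is RAM-bound at 2.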
Run the evaluation pipeline from the susvibes/ directory:
    python -m susvibes.run_evaluation \
        --run_id <unique_run_identifier> \
        --predictions_path <path_to_predictions_jsonl> \
        --max_workers 5 \
        [--force]  # Optional: force re-evaluation

- --run_id: Name of this evaluation run (defaults to default)
- --predictions_path: Path to your agent's predictions file
- --max_workers: Number of parallel workers (adjust based on available CPU cores)
- --force: Force re-evaluation even if previous logs exist
The evaluation summary is written automatically to logs/run_evaluation/<run_id>/<strategy>/summary.json (where <strategy> defaults to generic); the path is printed at the end of the run.
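The summary location can be derived from the run identifier and strategy. The following sketch assumes only the path pattern stated above; `summary_path` and `load_summary` are hypothetical helpers, and since the contents of summary.json are not described here, the file is simply parsed as generic JSON.

```python
import json
from pathlib import Path


def summary_path(run_id="default", strategy="generic", root="logs/run_evaluation"):
    """Build logs/run_evaluation/<run_id>/<strategy>/summary.json."""
    return Path(root) / run_id / strategy / "summary.json"


def load_summary(run_id="default", strategy="generic", root="logs/run_evaluation"):
    """Parse the summary file for a finished run as plain JSON."""
    with open(summary_path(run_id, strategy, root), encoding="utf-8") as f:
        return json.load(f)
```

The defaults mirror the documented ones (`default` for the run and `generic` for the strategy), and `root` is relative to the directory the evaluation was launched from.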
You can use our provided datasets/examples/sample_predictions.json to verify your setup. This should give you a summary in logs/run_evaluation/test/generic/summary.json.
    python -m susvibes.run_evaluation \
        --run_id test \
        --predictions_path datasets/examples/sample_predictions.json \
        --max_workers 5 \
        --force

See the subfolder's README for more details.
SusVibes supports advanced strategies for security-enhanced evaluation (generic, self-selection, oracle, feedback-driven, sec-test). See the subfolder's README for more details.
- Docker permission errors: Ensure your user has proper Docker permissions
- Memory issues: Reduce --max_workers if encountering OOM errors
- Storage space: Ensure sufficient disk space for Docker images and logs
- Computation power: Allocate enough CPU resources to avoid unexpected evaluation timeouts
We welcome contributions to improve SusVibes! Please see our contributing guidelines for:
- Adding new tasks
- Improving evaluation metrics
- Enhancing security analysis capabilities
- Documentation improvements
This project is licensed under the terms specified in the LICENSE file.
For questions, issues, or collaboration opportunities, please:
- Open an issue on GitHub
- Contact the maintainers at sz3296@columbia.edu or danqingw@cs.cmu.edu
We thank the open-source community for providing the diverse codebases used in our benchmark tasks.
