SusVibes: Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

🌟 Overview

SusVibes is a benchmark and evaluation pipeline that exposes security vulnerabilities in code generated by AI agents solving real-world software engineering tasks. It provides a standardized way to assess both the quality and the security of agent-generated code across diverse software engineering domains.

The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source software projects. Agent solutions are evaluated for functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weakness types from the Common Weakness Enumeration (CWE). On average, each task's codebase has 160k lines of code across 867 files, and a solution requires over 170 lines of code, which makes for a realistic evaluation of coding agents.

⚡ Installation

🐍 Clone and Set Up Environment

Clone the repository:

```
git clone https://github.com/LeiLiLab/susvibes.git
cd susvibes
```

Install Python dependencies:

```
conda create -n sv python=3.11
conda activate sv
pip install -r requirements.txt
pip install -e .
```

The SusVibes dataset is provided under the datasets/ directory:

```
datasets/
└── susvibes_dataset.jsonl
```

📊 Overview of SusVibes' Dataset

The SusVibes dataset contains task information with the following key fields:

- instance_id: Unique identifier for each task from real-world projects, formatted as repo-owner__repo-name_commit-id
- image_name: Pre-built Docker image containing the development environment of each task
- problem_statement: Natural language description of the task for agent input

Other metadata and evaluation specifications are omitted here.
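
As a rough illustration of how these fields might be read, the sketch below loads a JSONL file and splits an instance_id into its parts. The helper names load_tasks and split_instance_id are hypothetical and not part of SusVibes. The split assumes the owner contains no double underscore and the commit id contains no underscore.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    tasks = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


def split_instance_id(instance_id):
    """Split 'repo-owner__repo-name_commit-id' into (owner, repo, commit).

    Assumption: the first double underscore ends the owner and the last
    single underscore starts the commit id.
    """
    owner, rest = instance_id.split("__", 1)
    repo, commit = rest.rsplit("_", 1)
    return owner, repo, commit
```

With the repository layout described above, a call such as load_tasks("datasets/susvibes_dataset.jsonl") would return a list of dictionaries whose keys include instance_id, image_name, and problem_statement.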

🧪 Evaluation Guidelines

🔧 Step 1: Harness Coding Agent in Completing SusVibes' Tasks

Prepare the environment:
- Pull Docker images specified in the image_name field:
```
docker pull <image_name>
```
- The project code that the task operates on is located at /project within each Docker container

Execute your agent:
- Feed the problem_statement to your agent
- Let the agent generate code solutions within the containerized environment
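
The image-pulling step can be scripted. The following is a minimal sketch, assuming tasks is a list of dictionaries with an image_name key and that the docker CLI is on your PATH. The function names are hypothetical, and the run parameter exists only so the command runner can be swapped out.

```python
import subprocess


def unique_images(tasks):
    """Return image names in first-seen order, without duplicates."""
    seen = {}
    for task in tasks:
        seen.setdefault(task["image_name"], None)
    return list(seen)


def pull_images(images, run=subprocess.run):
    """Run `docker pull <image_name>` for each image; a failed pull raises."""
    for image in images:
        run(["docker", "pull", image], check=True)
```

De-duplicating first avoids repeated pulls in case several tasks point at the same image. How the agent is then started inside each container depends on the agent and is left out of this sketch.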

Format predictions: Save your agent's outputs in JSONL format with the following structure:

```
{
  "instance_id": "repo-owner__repo-name_commit-id",
  "model_name_or_path": "your-model-name",
  "model_patch": "the-implementation-patch"
}
```
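
One possible way to produce such a file is sketched below. The helpers make_prediction and write_predictions are hypothetical names. The sketch assumes one JSON object per line, as JSONL implies, and treats the patch as an opaque string produced by your agent.

```python
import json


def make_prediction(instance_id, model_name, patch):
    """Build one prediction record with the three fields listed above."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,
    }


def write_predictions(predictions, path):
    """Write each record as a single JSON line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in predictions:
            handle.write(json.dumps(record) + "\n")
```

json.dumps escapes newlines inside the patch, so a multi-line patch still occupies exactly one line of the output file.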

For an example of running Kimi CLI on SusVibes, see the tutorial.

📈 Step 2: Evaluation

Note: SusVibes evaluation can be resource-intensive. For an accurate evaluation, we recommend at least 400GB of free storage, plus 4GB of RAM and 4 CPU cores per parallel worker, on an x86_64 machine.
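
As a worked example of the per-worker figures, the sketch below estimates a worker count from a machine's RAM and core count. It reads the note as 4GB of RAM and 4 CPU cores per worker, with storage counted once for the whole run. The function name is hypothetical.

```python
def suggested_max_workers(ram_gb, cpu_cores, ram_per_worker_gb=4, cores_per_worker=4):
    """Return the worker count allowed by the scarcer of RAM and CPU.

    Never returns less than 1, so an under-provisioned machine still
    gets a single worker (which may then run slowly or time out).
    """
    by_ram = int(ram_gb // ram_per_worker_gb)
    by_cpu = int(cpu_cores // cores_per_worker)
    return max(1, min(by_ram, by_cpu))
```

For instance, a machine with 64GB of RAM and 16 cores is limited by its CPU: 16 / 4 gives 4 workers, even though the RAM alone would allow 16.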

Run the evaluation pipeline from the susvibes/ directory:

```
python -m susvibes.run_evaluation \
  --run_id <unique_run_identifier> \
  --predictions_path <path_to_predictions_jsonl> \
  --max_workers 5 \
  [--force]  # Optional: force re-evaluation
```

Parameters:

- --run_id: Name of this evaluation run (default value: default)
- --predictions_path: Path to your agent's predictions file
- --max_workers: Number of parallel workers (adjust to the CPU cores and RAM available)
- --force: Force re-evaluation even if previous logs exist
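
If you launch evaluations from a script, the flags above can be assembled programmatically. This is a minimal sketch that uses only the four documented flags. The function evaluation_command is a hypothetical helper, and it only builds the argument list without running anything.

```python
import sys


def evaluation_command(predictions_path, run_id="default", max_workers=5, force=False):
    """Build the argument list for `python -m susvibes.run_evaluation`."""
    cmd = [
        sys.executable, "-m", "susvibes.run_evaluation",
        "--run_id", run_id,
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
    ]
    if force:
        cmd.append("--force")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the susvibes/ directory. Using sys.executable keeps the evaluation in the same Python environment as the calling script.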

The evaluation summary is written automatically to logs/run_evaluation/<run_id>/<strategy>/summary.json (where <strategy> defaults to generic); the path is printed at the end of the run.
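
The summary location follows directly from the run id and strategy, so it can be computed rather than copied from the console. The sketch below uses hypothetical helper names and makes no assumption about the keys inside summary.json. It simply returns the parsed JSON.

```python
import json
from pathlib import Path


def summary_path(run_id="default", strategy="generic", root="logs/run_evaluation"):
    """Return <root>/<run_id>/<strategy>/summary.json as a Path."""
    return Path(root) / run_id / strategy / "summary.json"


def load_summary(run_id="default", strategy="generic", root="logs/run_evaluation"):
    """Parse the summary file for a finished run."""
    with summary_path(run_id, strategy, root).open(encoding="utf-8") as handle:
        return json.load(handle)
```

The root is relative, so these helpers should be called from the same working directory the evaluation was run in.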

✅ Verify Setup with Examples:

Use the provided datasets/examples/sample_predictions.json to verify your setup. The command below should produce a summary at logs/run_evaluation/test/generic/summary.json.

```
python -m susvibes.run_evaluation \
  --run_id test \
  --predictions_path datasets/examples/sample_predictions.json \
  --max_workers 5 \
  --force
```

🛠️ Advanced Usage

🗂️ Task Creation at Scale

See the subfolder's README for more details.

🛡️ Advanced Strategies

SusVibes supports advanced strategies for security-enhanced evaluation (generic, self-selection, oracle, feedback-driven, sec-test). See the subfolder's README for more details.

❓ Troubleshooting

Common Issues

- Docker permission errors: Ensure your user has permission to run Docker commands
- Memory issues: Reduce --max_workers if you encounter out-of-memory (OOM) errors
- Storage space: Ensure sufficient disk space for Docker images and logs
- Compute capacity: Allocate enough CPU cores per worker to avoid unexpected evaluation timeouts
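
A quick storage check before a long run can save a failed evaluation. The sketch below compares free space against the 400GB recommended for evaluation. The function names are hypothetical, and the path to check is an assumption: Docker often stores images on a different filesystem than your working directory, so point it at wherever your Docker data lives.

```python
import shutil


def free_space_gb(path="."):
    """Free space on the filesystem containing `path`, in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def meets_storage_recommendation(free_gb, recommended_gb=400):
    """True if the free space reaches the recommended amount."""
    return free_gb >= recommended_gb
```

A typical call would be meets_storage_recommendation(free_space_gb("/path/to/docker/data")), with the path replaced by your own Docker data directory.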

🤝 Contributing

We welcome contributions to improve SusVibes! Please see our contributing guidelines for:

- Adding new tasks
- Improving evaluation metrics
- Enhancing security analysis capabilities
- Documentation improvements

📄 License

This project is licensed under the terms specified in the LICENSE file.

📬 Contact

For questions, issues, or collaboration opportunities, please:

- Open an issue on GitHub
- Contact the maintainers at sz3296@columbia.edu or danqingw@cs.cmu.edu

🙏 Acknowledgments

We thank the open-source community for providing the diverse codebases used in our benchmark tasks.
