SusVibes: Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks

📃 Paper | 🏆 Leaderboard

🌟 Overview

SusVibes is a benchmark and evaluation pipeline designed to expose security vulnerabilities in the code that AI agents generate when solving real-world software engineering tasks. It provides a standardized way to assess the quality and security implications of agent-generated code across diverse software engineering domains.

The benchmark consists of 200 realistic coding tasks constructed from 108 existing open-source software projects. Agent solutions are evaluated for functional correctness and security using dynamic tests in execution environments. SusVibes covers 77 weaknesses from the Common Weakness Enumeration (CWE). On average, each task's codebase has 160k lines of code and 867 files, and a solution requires over 170 lines of code, providing a realistic evaluation for coding agents.

⚡ Installation

🐍 Clone and Set Up Environment

  1. Clone the repository:
git clone https://github.com/LeiLiLab/susvibes.git
cd susvibes
  2. Install Python dependencies:
conda create -n sv python=3.11
conda activate sv
pip install -r requirements.txt
pip install -e .
  3. The SusVibes dataset is placed under the datasets/ directory for convenient usage:
datasets/
└── susvibes_dataset.jsonl

📊 Overview of SusVibes' Dataset

The SusVibes dataset contains task information with the following key fields:

  • instance_id: Unique identifier for each task from real-world projects, formatted as repo-owner__repo-name_commit-id
  • image_name: Pre-built Docker image containing the development environment of each task
  • problem_statement: Natural language description of the task for agent input
  • Other metadata and evaluation specifications are omitted here.
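
The fields above can be read with nothing but the standard library. The following is a minimal sketch, assuming the dataset file holds one JSON object per line and that every record carries the instance_id and image_name fields listed above; the helper names load_tasks and group_by_image are illustrative and not part of SusVibes.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_tasks(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    tasks = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                tasks.append(json.loads(line))
    return tasks


def group_by_image(tasks):
    """Map each image_name to the instance_ids that use it."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["image_name"]].append(task["instance_id"])
    return dict(groups)
```

Calling load_tasks("datasets/susvibes_dataset.jsonl") from the repository root would return the task records as a list of dictionaries, and each record's problem_statement can then be passed to an agent.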

🧪 Evaluation Guidelines

🔧 Step 1: Run Your Coding Agent on SusVibes Tasks

  1. Prepare the environment:

    • Pull Docker images specified in the image_name field:
    docker pull <image_name>
    • The project code that the task operates on is located at /project inside each Docker container
  2. Execute your agent:

    • Feed the problem_statement to your agent
    • Let the agent generate code solutions within the containerized environment
  3. Format predictions: Save your agent's outputs in JSONL format with the following structure:

    {
      "instance_id": "repo-owner__repo-name_commit-id",
      "model_name_or_path": "your-model-name",
      "model_patch": "the-implementation-patch"
    }
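
As a rough illustration of the prediction format above, the sketch below writes a list of records to a JSONL file and rejects any record that lacks one of the three keys shown. The function name write_predictions and the REQUIRED_KEYS constant are hypothetical helpers, not part of the SusVibes package.

```python
import json
from pathlib import Path

# The three keys shown in the prediction structure above.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def write_predictions(records, out_path):
    """Write prediction records as JSONL (one JSON object per line)."""
    lines = []
    for record in records:
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"prediction is missing keys: {missing}")
        lines.append(json.dumps(record, ensure_ascii=False))
    # Terminate every record with a newline so the file is valid JSONL.
    Path(out_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Validating the keys before writing surfaces a malformed record at the point where it is produced, rather than later during evaluation.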

For an example of how to run Kimi CLI on SusVibes, see the tutorial.
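
The image-pulling part of Step 1 can be scripted. This is a minimal sketch, assuming tasks is a list of dictionaries parsed from the dataset, each with an image_name field; distinct_images and pull_images are illustrative names, and the only external command used is docker pull as shown above.

```python
import subprocess


def distinct_images(tasks):
    """Return the image_name values in first-seen order, without duplicates."""
    return list(dict.fromkeys(task["image_name"] for task in tasks))


def pull_images(image_names, dry_run=False):
    """Run `docker pull` for each image; with dry_run, only return the commands."""
    commands = [["docker", "pull", name] for name in image_names]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if a pull fails.
            subprocess.run(command, check=True)
    return commands
```

Passing dry_run=True returns the commands without running them, which is a cheap way to review what would be downloaded before committing disk space.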

📈 Step 2: Evaluation

Note: SusVibes evaluation can be resource intensive. For an accurate evaluation, we recommend at least 400GB of free storage, 4GB of RAM, and 4 CPU cores per parallel worker on an x86_64 machine.
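
One way to turn the recommendation above into a starting value for --max_workers is sketched below. It assumes the 4-core figure applies per worker and treats the 400GB figure as a machine-wide total, which the note does not state explicitly. RAM is not checked because the standard library has no portable call for it. All names here are illustrative.

```python
import os
import shutil

CORES_PER_WORKER = 4        # per-worker figure from the note above
MIN_FREE_STORAGE_GB = 400   # assumed here to be a machine-wide total


def workers_for_cores(cpu_count, cores_per_worker=CORES_PER_WORKER):
    """Number of workers that each get a full share of cores, never below 1."""
    return max(1, cpu_count // cores_per_worker)


def free_storage_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def has_enough_storage(path=".", minimum_gb=MIN_FREE_STORAGE_GB):
    """True if the filesystem holding `path` meets the storage recommendation."""
    return free_storage_gb(path) >= minimum_gb
```

For example, workers_for_cores(os.cpu_count() or 1) gives a conservative worker count for the current machine; on a host with fewer than 4 cores it still returns 1, so timeouts remain possible there.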

Run the evaluation pipeline from the susvibes/ directory:

python -m susvibes.run_evaluation \
  --run_id <unique_run_identifier> \
  --predictions_path <path_to_predictions_jsonl> \
  --max_workers 5 \
  [--force]  # Optional: force re-evaluation

Parameters:

  • --run_id: Name of this evaluation run (defaults to default)
  • --predictions_path: Path to your agent's predictions file
  • --max_workers: Number of parallel workers (adjust based on available CPU cores)
  • --force: Force re-evaluation even if previous logs exist
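
When the evaluation is launched from another script, the command can be assembled programmatically. The sketch below uses only the module name and the four flags documented above; build_eval_command and run_evaluation are hypothetical wrapper names, and the call assumes it is made from the same working directory and Python environment as the manual command.

```python
import subprocess
import sys


def build_eval_command(run_id, predictions_path, max_workers, force=False):
    """Assemble the argument list for `python -m susvibes.run_evaluation`."""
    command = [
        sys.executable, "-m", "susvibes.run_evaluation",
        "--run_id", str(run_id),
        "--predictions_path", str(predictions_path),
        "--max_workers", str(max_workers),
    ]
    if force:
        command.append("--force")
    return command


def run_evaluation(run_id, predictions_path, max_workers, force=False):
    """Run the evaluation and raise CalledProcessError on a non-zero exit."""
    subprocess.run(
        build_eval_command(run_id, predictions_path, max_workers, force),
        check=True,
    )
```

Building the command as a list rather than a shell string avoids quoting problems when the predictions path contains spaces.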

The evaluation summary is written automatically to logs/run_evaluation/<run_id>/<strategy>/summary.json (where <strategy> defaults to generic); the path is printed at the end of the run.
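
The summary location described above can be derived and read as follows. This is a sketch that assumes the logs/ directory is relative to the directory the evaluation was run from. It makes no assumption about the fields inside summary.json and simply returns the parsed JSON. The function names are illustrative.

```python
import json
from pathlib import Path


def summary_path(run_id="default", strategy="generic", logs_root="logs/run_evaluation"):
    """Build logs/run_evaluation/<run_id>/<strategy>/summary.json."""
    return Path(logs_root) / run_id / strategy / "summary.json"


def load_summary(run_id="default", strategy="generic", logs_root="logs/run_evaluation"):
    """Return the parsed summary, or None if the run has not produced one yet."""
    path = summary_path(run_id, strategy, logs_root)
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```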

✅ Verify Setup with Examples:

You can use our provided datasets/examples/sample_predictions.json to verify your setup. This should give you a summary in logs/run_evaluation/test/generic/summary.json.

python -m susvibes.run_evaluation \
  --run_id test \
  --predictions_path datasets/examples/sample_predictions.json \
  --max_workers 5 \
  --force

🛠️ Advanced Usage

🗂️ Task Creation at Scale

See the subfolder's README for more details.

🛡️ Advanced Strategies

SusVibes supports advanced strategies for security-enhanced evaluation (generic, self-selection, oracle, feedback-driven, sec-test). See the subfolder's README for more details.

❓ Troubleshooting

Common Issues

  1. Docker permission errors: Ensure your user has proper Docker permissions
  2. Memory issues: Reduce --max_workers if encountering OOM errors
  3. Storage space: Ensure sufficient disk space for Docker images and logs
  4. Compute capacity: Allocate enough CPU resources to avoid unexpected evaluation timeouts
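
For the first item in the list above, a quick pre-flight check can distinguish a missing Docker CLI from a daemon the current user cannot reach. The sketch below relies on docker info exiting with a non-zero status when the daemon is unreachable, which includes permission failures; docker_status is a hypothetical helper name.

```python
import shutil
import subprocess


def docker_status(timeout=30):
    """Classify the Docker setup as 'missing', 'unreachable', or 'ok'."""
    if shutil.which("docker") is None:
        return "missing"  # the docker CLI is not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unreachable"
    # A non-zero exit usually means the daemon is down or access was denied.
    return "ok" if result.returncode == 0 else "unreachable"
```

A result of "unreachable" on a machine where Docker is installed often points to the permission issue described in item 1.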

🤝 Contributing

We welcome contributions to improve SusVibes! Please see our contributing guidelines for:

  • Adding new tasks
  • Improving evaluation metrics
  • Enhancing security analysis capabilities
  • Documentation improvements

📄 License

This project is licensed under the terms specified in the LICENSE file.

📬 Contact

For questions, issues, or collaboration opportunities, please:

🙏 Acknowledgments

We thank the open-source community for providing the diverse codebases used in our benchmark tasks.
