Automated Performance Bug Report Classification

Bug tracking systems receive thousands of issue reports, the vast majority of which are not performance-related. Manually triaging these to identify performance bugs — memory leaks, latency regressions, CPU/GPU bottlenecks — is time-consuming and requires domain expertise.

This project trains and evaluates four classifiers on GitHub issue reports from five open-source deep learning frameworks, using a domain-aware text feature pipeline combining TF-IDF, character n-grams, title-specific features, and a hand-crafted performance keyword lexicon.

Datasets: PyTorch · TensorFlow · Keras · Apache MXNet · Caffe (3,712 reports total)
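
To make the lexicon and title-specific parts of the pipeline more concrete, here is a minimal sketch of how keyword features might be derived from an issue's title and body. The keyword set, the function names, and the feature names are all hypothetical; the project's actual lexicon and feature definitions may differ.

```python
import re

# Hypothetical lexicon for illustration; the project's real keyword list is not shown here.
PERF_KEYWORDS = {
    "slow", "latency", "memory", "leak", "oom",
    "bottleneck", "speed", "performance",
}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def lexicon_features(title: str, body: str) -> dict[str, float]:
    """Count lexicon hits separately for the title and the body of a report."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    title_hits = sum(token in PERF_KEYWORDS for token in title_tokens)
    body_hits = sum(token in PERF_KEYWORDS for token in body_tokens)
    return {
        "title_hits": float(title_hits),
        "body_hits": float(body_hits),
        # Normalise by body length so long reports are not favoured.
        "body_hit_ratio": body_hits / len(body_tokens) if body_tokens else 0.0,
        "title_has_keyword": 1.0 if title_hits else 0.0,
    }


features = lexicon_features(
    "Training is slow after upgrade",
    "GPU memory usage keeps growing; looks like a memory leak.",
)
print(features)
```

Features like these would typically be concatenated with the TF-IDF and character n-gram vectors before being passed to a classifier.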

Getting Started

Python 3.13 is required.

Clone the Repository

git clone https://github.com/Ayush272002/Automated-Performance-Bug-Report-Classification.git
cd Automated-Performance-Bug-Report-Classification

Install Dependencies

Using uv (recommended):

pip3 install uv
uv sync

Using pip:

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip3 install -r requirements.txt

Usage

# Run all models on all 5 projects (30 repeats)
uv run main.py

# Run a single model alongside the baseline
uv run main.py logreg
uv run main.py linearsvc
uv run main.py cnn_w2v

# Run baseline only
uv run main.py baseline

Replace uv run with python3 if not using uv.

Results are printed to the terminal and saved to results/run_<model>.log.


Results

All three proposed models (Logistic Regression, LinearSVC, and the CNN) significantly outperform the Gaussian Naive Bayes baseline (p < 0.001, Â₁₂ = 1.0). Logistic Regression and LinearSVC are statistically equivalent to each other, which suggests that the shared feature pipeline, rather than the choice of linear classifier, is the primary driver of performance.

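| Project    | Baseline F1 | LogReg F1 | LinearSVC F1 | CNN F1 |
|------------|-------------|-----------|--------------|--------|
| PyTorch    | 0.5624      | 0.8197    | 0.7805       | 0.7304 |
| TensorFlow | 0.5388      | 0.8672    | 0.8596       | 0.8293 |
| Keras      | 0.5412      | 0.8154    | 0.8059       | 0.7643 |
| MXNet      | 0.5159      | 0.8167    | 0.7805       | 0.6507 |
| Caffe      | 0.4611      | 0.8000    | 0.7751       | 0.5330 |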
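
As a quick way to read the F1 figures above, the following sketch averages each model's F1 across the five projects and picks the best model per project. The numbers are copied from the table; the helper names are illustrative and not part of the repository.

```python
# F1 scores copied from the results table, in the order of MODELS.
MODELS = ("Baseline", "LogReg", "LinearSVC", "CNN")
F1 = {
    "PyTorch":    (0.5624, 0.8197, 0.7805, 0.7304),
    "TensorFlow": (0.5388, 0.8672, 0.8596, 0.8293),
    "Keras":      (0.5412, 0.8154, 0.8059, 0.7643),
    "MXNet":      (0.5159, 0.8167, 0.7805, 0.6507),
    "Caffe":      (0.4611, 0.8000, 0.7751, 0.5330),
}


def mean_f1_by_model(table):
    """Average each model's F1 over all projects (unweighted)."""
    n = len(table)
    return {
        model: sum(row[i] for row in table.values()) / n
        for i, model in enumerate(MODELS)
    }


def best_model_per_project(table):
    """Return the model with the highest F1 for each project."""
    return {
        project: MODELS[max(range(len(MODELS)), key=lambda i: row[i])]
        for project, row in table.items()
    }


means = mean_f1_by_model(F1)
best = best_model_per_project(F1)
for model, value in means.items():
    print(f"{model:<10} {value:.4f}")
print(best)
```

Logistic Regression has the highest F1 in every row of the table. The CNN falls furthest behind on MXNet and Caffe.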

Full results including Accuracy, Precision, Recall, AUC and statistical tests are in results/run_all.log.
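
For readers unfamiliar with the effect size quoted with the results, the sketch below assumes Â₁₂ denotes the Vargha-Delaney A measure: the probability that a randomly chosen run of one model scores higher than a randomly chosen run of another, with ties counted as half. The function name and the sample values are illustrative only and are not taken from the project's code or logs.

```python
def vargha_delaney_a12(treatment, control):
    """Estimate P(treatment > control) + 0.5 * P(treatment == control)."""
    if not treatment or not control:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for x in treatment:
        for y in control:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5  # ties contribute half a win
    return wins / (len(treatment) * len(control))


# Made-up per-run F1 scores, for illustration only.
proposed_f1 = [0.81, 0.83, 0.80, 0.82]
baseline_f1 = [0.55, 0.57, 0.54, 0.56]

a12 = vargha_delaney_a12(proposed_f1, baseline_f1)
print(a12)
```

Under this definition, a value of 1.0 means every run of the proposed model scored higher than every baseline run, while 0.5 means neither model tends to score higher.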


Project Structure

├── main.py                  # Entry point — CLI, logging, statistical tests
├── runner.py                # Shared experiment loop and metrics
├── br_classification.py     # Baseline: Gaussian Naive Bayes + TF-IDF
├── logreg.py                # Logistic Regression
├── linearsvc.py             # LinearSVC with Platt scaling
├── cnn_w2v.py               # CNN with Word2Vec embeddings
├── datasets/                # CSV datasets (one per project)
├── results/                 # Output logs and CSVs
├── report/                  # LaTeX report
├── manual/                  # User manual
└── replication/             # Replication instructions
