Bug tracking systems receive thousands of issue reports, the vast majority of which are not performance-related. Manually triaging them to identify performance bugs (memory leaks, latency regressions, CPU/GPU bottlenecks) is time-consuming and requires domain expertise.
This project trains and evaluates four classifiers on GitHub issue reports from five open-source deep learning frameworks, using a domain-aware text feature pipeline combining TF-IDF, character n-grams, title-specific features, and a hand-crafted performance keyword lexicon.
Datasets: PyTorch · TensorFlow · Keras · Apache MXNet · Caffe (3,712 reports total)
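As a rough illustration of two of the feature families named above (the performance keyword lexicon and character n-grams), here is a minimal standard-library sketch. The keyword set, function names, and feature names are hypothetical and chosen only for illustration; they are not taken from the project's code, which may build these features differently.

```python
import re
from collections import Counter

# Hypothetical lexicon: the project's actual keyword list is not shown in this README.
PERF_KEYWORDS = {"slow", "latency", "memory", "leak", "oom", "speed", "performance", "bottleneck"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexicon_features(title, body):
    """Count lexicon hits separately in the title and the body of a report."""
    title_hits = sum(tok in PERF_KEYWORDS for tok in tokenize(title))
    body_hits = sum(tok in PERF_KEYWORDS for tok in tokenize(body))
    return {
        "title_hits": title_hits,
        "body_hits": body_hits,
        "title_has_keyword": int(title_hits > 0),
    }


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of length n in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


feats = lexicon_features("Memory leak in DataLoader", "Training gets slow after a few epochs.")
grams = char_ngrams("leak")
```

Keeping title hits separate from body hits reflects the idea of title-specific features: a keyword in the title is likely a stronger signal than the same keyword buried in a long description.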
Python 3.13 is strictly required.
git clone https://github.com/Ayush272002/Automated-Performance-Bug-Report-Classification.git
cd Automated-Performance-Bug-Report-Classification

Using uv (recommended):
pip3 install uv
uv sync

Using pip:
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip3 install -r requirements.txt

# Run all models on all 5 projects (30 repeats)
uv run main.py
# Run a single model alongside the baseline
uv run main.py logreg
uv run main.py linearsvc
uv run main.py cnn_w2v
# Run baseline only
uv run main.py baseline

Replace `uv run` with `python3` if not using uv.
Results are printed to the terminal and saved to results/run_<model>.log.
All three proposed models significantly outperform the Gaussian Naive Bayes baseline (p < 0.001, Â₁₂ = 1.0). Logistic Regression and LinearSVC are statistically equivalent to each other, which suggests that the feature pipeline, rather than the choice of linear classifier, is the primary driver of performance.
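To make the effect size concrete: assuming Â₁₂ here denotes the Vargha-Delaney statistic, it estimates the probability that a score drawn from one group exceeds a score drawn from the other, with ties counted as half. A value of 1.0 means every score in the first group beat every score in the second. The sketch below is a plain-Python version of that definition; the function name and sample scores are made up for illustration and are not the project's data or code.

```python
def vargha_delaney_a12(xs, ys):
    """Return the Vargha-Delaney A12 effect size for xs versus ys.

    A12 = (#pairs with x > y + 0.5 * #pairs with x == y) / (len(xs) * len(ys))
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Illustrative, made-up F1 scores from repeated runs (not real results).
proposed_scores = [0.81, 0.83, 0.80]
baseline_scores = [0.55, 0.57, 0.54]

a12 = vargha_delaney_a12(proposed_scores, baseline_scores)
```

With these sample values every proposed score exceeds every baseline score, so the statistic reaches its maximum; two identical samples would give 0.5.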
| Project | Baseline F1 | LogReg F1 | LinearSVC F1 | CNN F1 |
|---|---|---|---|---|
| PyTorch | 0.5624 | 0.8197 | 0.7805 | 0.7304 |
| TensorFlow | 0.5388 | 0.8672 | 0.8596 | 0.8293 |
| Keras | 0.5412 | 0.8154 | 0.8059 | 0.7643 |
| MXNet | 0.5159 | 0.8167 | 0.7805 | 0.6507 |
| Caffe | 0.4611 | 0.8000 | 0.7751 | 0.5330 |
Full results, including Accuracy, Precision, Recall, AUC, and statistical tests, are in `results/run_all.log`.
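For a quick per-model summary, the F1 values in the table above can be averaged across the five projects. This unweighted mean is a convenience calculation added here, not a figure reported by the project, and it ignores differences in dataset size.

```python
from statistics import mean

# F1 scores copied from the table above, in the order
# PyTorch, TensorFlow, Keras, MXNet, Caffe.
f1_by_model = {
    "Baseline": [0.5624, 0.5388, 0.5412, 0.5159, 0.4611],
    "LogReg": [0.8197, 0.8672, 0.8154, 0.8167, 0.8000],
    "LinearSVC": [0.7805, 0.8596, 0.8059, 0.7805, 0.7751],
    "CNN": [0.7304, 0.8293, 0.7643, 0.6507, 0.5330],
}

# Unweighted mean F1 per model across the five projects.
mean_f1 = {model: mean(scores) for model, scores in f1_by_model.items()}

# Model with the highest mean F1.
best_model = max(mean_f1, key=mean_f1.get)

for model, score in mean_f1.items():
    print(f"{model:<10} {score:.4f}")
```

On these numbers Logistic Regression has the highest mean F1, with LinearSVC close behind and the CNN falling off most sharply on MXNet and Caffe.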
├── main.py # Entry point — CLI, logging, statistical tests
├── runner.py # Shared experiment loop and metrics
├── br_classification.py # Baseline: Gaussian Naive Bayes + TF-IDF
├── logreg.py # Logistic Regression
├── linearsvc.py # LinearSVC with Platt scaling
├── cnn_w2v.py # CNN with Word2Vec embeddings
├── datasets/ # CSV datasets (one per project)
├── results/ # Output logs and CSVs
├── report/ # LaTeX report
├── manual/ # User manual
└── replication/ # Replication instructions