This project fine-tunes microsoft/deberta-v3-large for utterance-level emotion classification on the MELD dataset and several synthetic / augmented variants.
It includes:
- a training / evaluation pipeline built with PyTorch Lightning
- a set of data-generation scripts using Qwen to synthesize extra MELD-style utterances
- hyperparameter sweeps driven by Weights & Biases (W&B)
- scripts to evaluate checkpoints independently with scikit-learn and to export prediction files
From the repo root (the folder containing config.py and train.py):
.
├── config.py                     # Central config + label mapping
├── data_module.py                # LightningDataModule for text classification
├── model.py                      # LightningModule (DeBERTa-v3-large classifier)
├── train.py                      # CLI entry point for training / testing
├── eval_dev.py                   # Independent evaluation / prediction script
├── slurm_sweep.sh                # Slurm wrapper for W&B sweeps (HPC)
├── sweep_deberta.yaml            # Base sweep config (MELD train split)
├── sweep_deberta_phase2.yaml     # Sweep on synthetic-only training data
├── sweep_deberta_phase3.yaml     # Sweeps on augmented splits (250/500/1000)
├── run-sweep-dataset-explanation.txt
├── requirements.txt              # Python dependencies
├── poetry.lock                   # Frozen dependency versions (optional)
├── ensemble/
│   ├── ensemble2.py
│   └── ensemble.sh
├── datasets/
│   ├── train_sent_emo.csv
│   ├── dev_sent_emo.csv
│   ├── test_sent_emo.csv
│   ├── synth_train_sent_emo.csv
│   ├── synth_train_augmented.csv
│   ├── synth_train_hundo_p_take3_dedup.csv
│   ├── synth_train_hundo_p_take3_double_dedup.csv
│   ├── synth_train_triple_filtered.csv
│   ├── train_emotions_aug_250.csv
│   ├── train_emotions_aug_500.csv
│   └── train_emotions_aug_1000.csv
├── synthetic_gen/
│   └── MELD_DATA/
│       ├── synth_data.py                 # Generate synthetic utterances with Qwen
│       ├── synth_data_fix.py             # Clean / patch first synthetic pass
│       ├── synth_data_fix_2.py           # Further cleaning / schema fixes
│       ├── dedupe_synth_tfidf.py         # Remove near-duplicate synthetic rows
│       ├── dedupe_cross_class_tfidf.py   # Cross-class deduping
│       ├── augment_train_with_synth.py   # Mix real + synthetic for new splits
│       ├── emotionCounter.py             # Class frequency summaries
│       └── meld_synth.sh / *_big_gpu*.sh # Slurm helpers for large Qwen jobs
└── README.md                             # (optional, if you add one later)
You can run everything either locally or on an HPC node. The basic recipe is the
same: create a venv, activate it, and install from requirements.txt.
From the repo root:
python3 -m venv .ftmb        # or any name you like
source .ftmb/bin/activate

On some clusters you may need to load a Python module first, e.g.:

module load python/3.11.6    # example; use whatever your cluster provides

With the venv active:
pip install --upgrade pip
pip install -r requirements.txt

The key libraries are: torch/torchvision, pytorch-lightning, transformers, datasets, pandas, scikit-learn, wandb, matplotlib, and tqdm.
If you prefer Poetry, the pinned environment is in poetry.lock and you can
recreate it with:
poetry install

(The course instructions only require that requirements.txt works.)
Sweeps require a W&B API key:
Create a free account at wandb.ai, copy your API key from your account settings, and log in once from the shell:

wandb login
All training and evaluation runs use CSV files under datasets/.
The original MELD data has been split into three CSVs:
- train_sent_emo.csv
- dev_sent_emo.csv
- test_sent_emo.csv
Each of these has at least the following columns:
- Utterance: the input text (one utterance per row)
- Emotion: the gold label, one of anger, disgust, fear, joy, neutral, sadness, surprise
Additional columns like Dialogue_ID, Utterance_ID, Speaker may be present
but are ignored by the model.
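A quick way to verify a new split matches this schema is a small pandas check (a hypothetical helper, not part of the repo):

```python
import io

import pandas as pd

# The seven MELD emotion labels used throughout this project.
EMOTIONS = {"anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"}

def check_meld_schema(csv_path_or_buffer) -> pd.DataFrame:
    """Load a MELD-style split and assert the columns the model relies on."""
    df = pd.read_csv(csv_path_or_buffer)
    missing = {"Utterance", "Emotion"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    unknown = set(df["Emotion"].unique()) - EMOTIONS
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return df

# Tiny in-memory example with the same two required columns.
sample = io.StringIO("Utterance,Emotion\nOh my God!,surprise\nI'm fine.,neutral\n")
df = check_meld_schema(sample)
print(len(df))  # 2
```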
Synthetic examples are generated by Qwen and written to:
- synth_train_sent_emo.csv (raw synthetic data)
- various deduplicated versions:
  - synth_train_hundo_p_take3_dedup.csv
  - synth_train_hundo_p_take3_double_dedup.csv
  - synth_train_triple_filtered.csv

Final, balanced(ish) training sets used for sweeps:

- train_emotions_aug_250.csv - +250 synthetic examples per class
- train_emotions_aug_500.csv - +500 per class (used in the Athena sweep)
- train_emotions_aug_1000.csv - +1000 per class (also used in Athena)
See run-sweep-dataset-explanation.txt for the full narrative of which sweep
uses which file.
More information about TF-IDF vectorization and deduplication based on within-class and cross-class similarity can be found in the dataset-level README located at /ADV_NLP_PROJECT/ADV_NLP_PROJECT/MELD_DATA/README.md.
All of these use the same schema as the MELD splits: Utterance and Emotion
are the only required columns.
train.py is the main entry point. It accepts paths and hyperparameters via
argparse (and defaults are wired up in config.py). The most important
arguments:
- --train_path, --val_path, --test_path
- --text_col, --label_col (default to Utterance and Emotion)
- --tokenizer_name (default: microsoft/deberta-v3-large)
- --max_epochs, --batch_size, --max_lr, --weight_decay, --dropout
- --label_smoothing, --flooding_val
- --accelerator, --devices, --precision, --num_workers
The model itself is defined in model.py (BertClassifier), and
data_module.py defines TextClassificationDataModule.
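Purely as an illustration of the shape of the classification head (hidden size 1024 matches deberta-v3-large, but everything else here is a toy stand-in, not the repo's actual code):

```python
import torch
from torch import nn

NUM_CLASSES = 7  # the seven MELD emotions

class ClassifierHead(nn.Module):
    """Illustrative head: pooled encoder output -> dropout -> linear logits.
    The real BertClassifier in model.py wraps the full DeBERTa-v3-large
    encoder; here a random tensor stands in for the pooled encoder output."""
    def __init__(self, hidden_size: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, NUM_CLASSES)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.out(self.dropout(pooled))

head = ClassifierHead()
logits = head(torch.randn(4, 1024))  # batch of 4 fake "pooled" vectors
print(logits.shape)                  # torch.Size([4, 7])
```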
From the repo root, with venv active and a GPU available:
python -u train.py \
--train_path datasets/train_emotions_aug_500.csv \
--val_path datasets/dev_sent_emo.csv \
--test_path datasets/test_sent_emo.csv \
--text_col Utterance \
--label_col Emotion \
--tokenizer_name microsoft/deberta-v3-large \
--accelerator gpu \
--devices 1 \
--precision bf16-mixed \
--num_workers 1 \
--batch_size 32 \
--max_epochs 30

This will:

- Fine-tune DeBERTa-v3-large on the selected training set.
- Run validation and report metrics (especially val_f1_macro).
- Run a final test pass and log test_acc and test_loss.
- Save:
  - the best checkpoint (by val_f1_macro) to checkpoints/run_<id-date>/best.ckpt
  - test predictions and probabilities under preds/run_<id-date>/
EarlyStopping and ModelCheckpoint are configured in train.py with
val_f1_macro as the monitored metric.
Note, however, that full fine-tuning of DeBERTa-v3-large is very demanding on
typical consumer GPUs. For training on a cluster, see section 5.2, Running the
sweep on an HPC Node.
Sweeps are configured as YAML files:
- sweep_deberta.yaml - original MELD train split
- sweep_deberta_phase2.yaml - sweeps on synthetic-only data
- sweep_deberta_phase3.yaml - sweeps on the train_emotions_aug_* splits
Each sweep file:
- specifies the metric (val_f1_macro, goal maximize)
- defines search spaces for max_lr, dropout, weight_decay, label_smoothing, flooding_val, max_epochs, etc.
- pins dataset paths via command: ... --train_path ... --val_path ...
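A minimal sweep file in this style might look like the following (illustrative values and ranges, not the repo's exact configuration):

```yaml
program: train.py
method: bayes
metric:
  name: val_f1_macro
  goal: maximize
parameters:
  max_lr:
    distribution: log_uniform_values
    min: 1e-6
    max: 5e-5
  dropout:
    values: [0.1, 0.2, 0.3]
  weight_decay:
    values: [0.0, 0.01, 0.1]
command:
  - ${env}
  - python
  - ${program}
  - --train_path
  - datasets/train_emotions_aug_500.csv
  - --val_path
  - datasets/dev_sent_emo.csv
  - ${args}
```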
From the repo root, with W&B logged in:
wandb sweep sweep_deberta_phase3.yaml

W&B will print a sweep ID of the form:
tar-xvf/modernbert-meld/u1l2r0v1
Use slurm_sweep.sh to launch one or more agents. You do not need to copy
or modify Python files; Slurm just runs whatever is in the repo at submit time.
Example: start 40 runs on Athena/Hickory (one GPU):
sbatch slurm_sweep.sh tar-xvf/modernbert-meld/u1l2r0v1 40

NOTE: replace u1l2r0v1 with whatever sweep ID is returned by wandb sweep sweep_deberta_phase3.yaml.
slurm_sweep.sh:
- reserves a single GPU and some RAM
- sets cache directories (HF_HOME, WANDB_DIR, etc.) on node-local scratch
- runs wandb agent --count <COUNT> <SWEEP_ID>
You can monitor jobs with:
squeue -u $USER

and cancel with:

scancel <JOBID>

To double-check metrics, use eval_dev.py. It loads
a checkpoint, runs it on a CSV split, and recomputes metrics with
scikit-learn.
Basic usage:
python -u eval_dev.py \
--ckpt_path checkpoints/run_1651577-20251205-120956/best.ckpt \
--data_path datasets/dev_sent_emo.csv \
--text_col Utterance \
--label_col Emotion \
--tokenizer_name microsoft/deberta-v3-large \
--batch_size 32 \
--out_prefix eval_dev_run_1651577

Outputs:

- eval_dev_run_1651577_metrics.json - per-class precision/recall/F1, macro/micro averages, confusion matrix, etc. (all via scikit-learn)
- eval_dev_run_1651577_predictions.csv - one row per utterance with:
  - original text
  - gold label
  - predicted label
  - predicted probability for each class
Inside model.py, the Lightning module collects preds and labels as:
- preds: integer class IDs (argmax over softmax probabilities)
- labels: gold class IDs from the dataset
Both are fed directly into a clean torchmetrics F1 (macro) during training,
and eval_dev.py recomputes F1 from scratch using sklearn.metrics.f1_score
for independent verification.
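The independent check amounts to something like this (toy arrays standing in for real predictions):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy stand-ins for what eval_dev.py collects: gold labels and class
# probabilities for a 3-class problem (MELD itself has 7 classes).
labels = np.array([0, 1, 2, 2, 1, 0])
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
    [0.50, 0.30, 0.20],   # a mistake: gold is 2, argmax is 0
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
])
preds = probs.argmax(axis=1)  # same argmax step the Lightning module uses

macro_f1 = f1_score(labels, preds, average="macro")
print(round(macro_f1, 3))  # 0.822
```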
The ADV_NLP_PROJ/MELD_DATA folder documents how the synthetic splits were
created. Very high-level overview:

1. Generate synthetic utterances
   synth_data_fix_2.py calls a Qwen model to generate new utterances conditioned on existing MELD context and labels, writing them to synth_train_sent_emo.csv.

2. Deduplicate
   dedupe_synth_tfidf.py removes near-duplicate synthetic rows within the same class using TF-IDF + cosine similarity. dedupe_cross_class_tfidf.py removes cross-class near-duplicates (same text but different label).

3. Augment MELD training set
   augment_train_with_synth.py merges the cleaned synthetic examples with train_sent_emo.csv to build balanced training sets with +250, +500, or +1000 synthetic examples per class. These are written to:
   - train_emotions_aug_250.csv
   - train_emotions_aug_500.csv
   - train_emotions_aug_1000.csv

4. Utility scripts
   emotionCounter.py prints per-class counts for any CSV so you can verify that the balancing worked as intended (or just to see what the counts are, if you're curious).
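The TF-IDF deduplication idea can be sketched as follows (simplified: the real scripts group by class, read and write CSVs, and use a tuned cutoff; the low threshold here is only so the tiny toy corpus triggers it):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_mask(texts, threshold):
    """Return a keep-mask: False for rows too similar to an earlier kept row."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf)
    keep, kept_idx = [], []
    for i in range(len(texts)):
        if any(sims[i, j] >= threshold for j in kept_idx):
            keep.append(False)        # drop: near-duplicate of a kept row
        else:
            keep.append(True)
            kept_idx.append(i)
    return keep

texts = [
    "I can't believe you did that!",
    "I cannot believe you did that!",   # near-duplicate of the first
    "Let's get some coffee.",
]
# Low threshold for this toy corpus; a real run would use a higher cutoff.
mask = near_duplicate_mask(texts, threshold=0.6)
print(mask)  # [True, False, True]
```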
- Training / sweep logs: text logs under logs/, one .out and .err per Slurm job.
- Model checkpoints: under checkpoints/run_*. Each run directory contains:
  - best.ckpt - best epoch by val_f1_macro
  - optional last.ckpt if you enable saving the last epoch
- Predictions / probabilities: under preds/run_*:
  - test_preds.csv
  - test_probs.pkl (float32 numpy array or pandas frame)
- W&B: all metrics, curves, and hyperparameters are logged to the configured W&B project (modernbert-meld).
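For reference, soft-voting over several runs' probability arrays (the kind stored in test_probs.pkl) reduces to a few lines; this toy uses 3 classes and fake arrays, and ensemble2.py's actual scheme may differ:

```python
import numpy as np

# Stand-ins for test_probs.pkl arrays from two runs: shape (n_examples, n_classes).
probs_run_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
probs_run_b = np.array([[0.4, 0.5, 0.1], [0.1, 0.6, 0.3]])

# Soft-voting ensemble: average class probabilities, then argmax.
avg = (probs_run_a + probs_run_b) / 2
ensemble_preds = avg.argmax(axis=1)
print(ensemble_preds.tolist())  # [0, 1]
```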
1. Clone this repo to your home or scratch space.

2. Create / activate the venv:

   python3 -m venv .ftmb   # or whatever you want to name the venv; it doesn't have to be .ftmb
   source .ftmb/bin/activate
   pip install -r requirements.txt

3. Run a small verification training (1-3 epochs) on any training split:

   python -u train.py \
     --train_path datasets/train_emotions_aug_250.csv \
     --val_path datasets/dev_sent_emo.csv \
     --test_path datasets/test_sent_emo.csv \
     --text_col Utterance \
     --label_col Emotion \
     --max_epochs 3 \
     --batch_size 16 \
     --accelerator gpu --devices 1 --precision bf16-mixed

4. (Optional) Launch a sweep once the single run works:

   wandb sweep sweep_deberta_phase3.yaml
   sbatch slurm_sweep.sh tar-xvf/modernbert-meld/<SWEEP_ID> 40

5. Evaluate the best checkpoint with eval_dev.py to confirm F1, by running:

   sbatch eval_dev.sh <CHECKPOINT_PATH> <DATA_PATH> <OUT_PREFIX>

   (paths are relative)

6. Ensemble predictions: move the CSV files produced in the previous step into the ensemble/ folder, make sure requirements.txt is installed, then run:

   sbatch ensemble.sh

The code in this repository is released under the MIT License: you're free to use, modify, and distribute it with attribution.
The MELD dataset is a separate work with its own license. MELD is derived from
dialogue in the television series Friends and is distributed under the
GNU General Public License v3.0. See the
official MELD repository for the full license
text, dataset terms, and citation requirements. The MIT license on this code does
not extend to MELD or to any data derived from it (including the synthetic and
augmented splits in datasets/, which are conditioned on MELD content).
If you use MELD via this repo, please cite the original paper:
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2019). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://aclanthology.org/P19-1050/