
DeBERTa Text Emotion Classification (MELD + Synthetic Data)

This project fine-tunes microsoft/deberta-v3-large for utterance-level emotion classification on the MELD dataset and several synthetic / augmented variants.

It includes:

  • a training / evaluation pipeline built with PyTorch Lightning
  • a set of data-generation scripts using Qwen to synthesize extra MELD-style utterances
  • hyperparameter sweeps driven by Weights & Biases (W&B)
  • scripts to evaluate checkpoints independently with scikit-learn and to export prediction files

1. Project layout

From the repo root (the folder containing config.py and train.py):

.
├── config.py                     # Central config + label mapping
├── data_module.py                # LightningDataModule for text classification
├── model.py                      # LightningModule (DeBERTa-v3-large classifier)
├── train.py                      # CLI entry point for training / testing
├── eval_dev.py                   # Independent evaluation / prediction script
├── slurm_sweep.sh                # Slurm wrapper for W&B sweeps (HPC)
├── sweep_deberta.yaml            # Base sweep config (MELD train split)
├── sweep_deberta_phase2.yaml     # Sweep on synthetic-only training data
├── sweep_deberta_phase3.yaml     # Sweeps on augmented splits (250/500/1000)
├── run-sweep-dataset-explanation.txt
├── requirements.txt              # Python dependencies
├── poetry.lock                   # Frozen dependency versions (optional)
├── ensemble/
│   ├── ensemble2.py
│   └── ensemble.sh
├── datasets/
│   ├── train_sent_emo.csv
│   ├── dev_sent_emo.csv
│   ├── test_sent_emo.csv
│   ├── synth_train_sent_emo.csv
│   ├── synth_train_augmented.csv
│   ├── synth_train_hundo_p_take3_dedup.csv
│   ├── synth_train_hundo_p_take3_double_dedup.csv
│   ├── synth_train_triple_filtered.csv
│   ├── train_emotions_aug_250.csv
│   ├── train_emotions_aug_500.csv
│   └── train_emotions_aug_1000.csv
└── synthetic_gen/
    └── MELD_DATA/
        ├── synth_data.py                 # Generate synthetic utterances with Qwen
        ├── synth_data_fix.py             # Clean / patch first synthetic pass
        ├── synth_data_fix_2.py           # Further cleaning / schema fixes
        ├── dedupe_synth_tfidf.py         # Remove near-duplicate synthetic rows
        ├── dedupe_cross_class_tfidf.py   # Cross-class deduping
        ├── augment_train_with_synth.py   # Mix real + synthetic for new splits
        ├── emotionCounter.py             # Class frequency summaries
        ├── meld_synth.sh / *_big_gpu*.sh # Slurm helpers for large Qwen jobs
        └── README.md (optional, if you add one later)

2. Environment setup

You can run everything either locally or on an HPC node. The basic recipe is the same: create a venv, activate it, and install from requirements.txt.

2.1. Create and activate a venv

From the repo root:

python3 -m venv .ftmb           # or any name you like
source .ftmb/bin/activate

On some clusters you may need to load a Python module first, e.g.:

module load python/3.11.6   # example; use whatever your cluster provides

2.2. Install Python dependencies

With the venv active:

pip install --upgrade pip
pip install -r requirements.txt

The key libraries are:

  • torch / torchvision
  • pytorch-lightning
  • transformers
  • datasets
  • pandas
  • scikit-learn
  • wandb
  • matplotlib, tqdm

If you prefer Poetry, the pinned environment is in poetry.lock and you can recreate it with:

poetry install

(but the course instructions only require that requirements.txt works.)

2.3. Weights & Biases (W&B)

Sweeps require a W&B API key:

Log in once from the command line:

wandb login

This prompts for your API key (available at https://wandb.ai/authorize) and caches it. For non-interactive Slurm jobs, you can instead export the WANDB_API_KEY environment variable.


3. Data: input formats and derived splits

All training and evaluation runs use CSV files under datasets/.

3.1. Core MELD splits

The original MELD data has been split into three CSVs:

  • train_sent_emo.csv
  • dev_sent_emo.csv
  • test_sent_emo.csv

Each of these has at least the following columns:

  • Utterance: the input text (one utterance per row)
  • Emotion: the gold label, one of
    anger, disgust, fear, joy, neutral, sadness, surprise

Additional columns like Dialogue_ID, Utterance_ID, Speaker may be present but are ignored by the model.
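
For a quick sanity check that a split has the expected schema (and to eyeball the class balance), a few lines of pandas are enough. A minimal sketch; pandas is already in requirements.txt:

import pandas as pd

df = pd.read_csv("datasets/train_sent_emo.csv")
assert {"Utterance", "Emotion"} <= set(df.columns), "missing required columns"
print(df["Emotion"].value_counts())  # counts for the 7 emotion labels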

3.2. Synthetic and augmented splits

Synthetic examples are generated by Qwen and written to:

  • synth_train_sent_emo.csv (raw synthetic data)
  • various deduplicated versions:
    • synth_train_hundo_p_take3_dedup.csv
    • synth_train_hundo_p_take3_double_dedup.csv
    • synth_train_triple_filtered.csv

Final, balanced(ish) training sets used for sweeps:

  • train_emotions_aug_250.csv - +250 synthetic examples per class
  • train_emotions_aug_500.csv - +500 per class (used in the Athena sweep)
  • train_emotions_aug_1000.csv - +1000 per class (also used in Athena)

See run-sweep-dataset-explanation.txt for the full narrative of which sweep uses which file.

More information about TF-IDF vectorization and deduplication based on in-class and cross-class similarity can be found in the dataset-level README at synthetic_gen/MELD_DATA/README.md.

All of these use the same schema as the MELD splits: Utterance and Emotion are the only required columns.


4. Running the training pipeline

4.1. CLI arguments

train.py is the main entry point. It accepts paths and hyperparameters via argparse (and defaults are wired up in config.py). The most important arguments:

  • --train_path, --val_path, --test_path
  • --text_col, --label_col (default to Utterance and Emotion)
  • --tokenizer_name (default: microsoft/deberta-v3-large)
  • --max_epochs, --batch_size, --max_lr, --weight_decay, --dropout
  • --label_smoothing, --flooding_val
  • --accelerator, --devices, --precision, --num_workers

The model itself is defined in model.py (BertClassifier), and data_module.py defines TextClassificationDataModule.
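
Under the hood, the data module tokenizes each utterance with the Hugging Face tokenizer named by --tokenizer_name. Roughly (a sketch, not the repo's exact code; the padding/truncation settings shown are assumptions):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
batch = tok(
    ["You know what's weird?", "I'm so happy for you!"],
    padding=True,       # pad to the longest utterance in the batch
    truncation=True,    # clip anything past the model's max length
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # (2, max_seq_len_in_batch)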

4.2. Single run (no sweep)

From the repo root, with venv active and a GPU available:

python -u train.py \
  --train_path datasets/train_emotions_aug_500.csv \
  --val_path   datasets/dev_sent_emo.csv \
  --test_path  datasets/test_sent_emo.csv \
  --text_col Utterance \
  --label_col Emotion \
  --tokenizer_name microsoft/deberta-v3-large \
  --accelerator gpu \
  --devices 1 \
  --precision bf16-mixed \
  --num_workers 1 \
  --batch_size 32 \
  --max_epochs 30

This will:

  1. Fine-tune DeBERTa-v3-large on the selected training set.
  2. Run validation and report metrics (especially val_f1_macro).
  3. Run a final test pass and log test_acc and test_loss.
  4. Save:
    • best checkpoint (by val_f1_macro) to
      checkpoints/run_<id-date>/best.ckpt
    • test predictions and probabilities under
      preds/run_<id-date>/.

early_stopping and ModelCheckpoint are configured in train.py using val_f1_macro as the monitor.
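
For reference, the callback wiring looks roughly like this (a sketch; the exact patience, dirpath, and other arguments live in train.py):

from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    monitor="val_f1_macro",
    mode="max",        # higher macro F1 is better
    filename="best",
    save_top_k=1,
)
early_stop_cb = EarlyStopping(monitor="val_f1_macro", mode="max", patience=3)  # patience here is an assumption
# trainer = pl.Trainer(callbacks=[checkpoint_cb, early_stop_cb], ...)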

Be warned: this is a heavy run for most consumer-grade GPUs. For training on a cluster, see section 5.2, "Running the sweep on an HPC node".

5. Hyperparameter sweeps (W&B + Slurm)

Sweeps are configured as YAML files:

  • sweep_deberta.yaml - original MELD train split
  • sweep_deberta_phase2.yaml - sweeps on synthetic-only data
  • sweep_deberta_phase3.yaml - sweeps on the train_emotions_aug_* splits

Each sweep file:

  • specifies the metric (val_f1_macro, goal maximize)
  • defines search spaces for max_lr, dropout, weight_decay, label_smoothing, flooding_val, max_epochs, etc.
  • pins dataset paths via command: ... --train_path ... --val_path ...
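
Put together, a sweep file has roughly this shape (an illustrative sketch in the format W&B expects, not a copy of the repo's configs; the real search spaces live in the sweep_deberta*.yaml files):

program: train.py
method: bayes
metric:
  name: val_f1_macro
  goal: maximize
parameters:
  max_lr:
    values: [0.00001, 0.00002, 0.00005]
  dropout:
    values: [0.1, 0.2, 0.3]
command:
  - ${env}
  - python
  - ${program}
  - --train_path=datasets/train_emotions_aug_500.csv
  - --val_path=datasets/dev_sent_emo.csv
  - ${args}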

5.1. Creating a sweep

From the repo root, with W&B logged in:

wandb sweep sweep_deberta_phase3.yaml

W&B will print a sweep ID of the form:

tar-xvf/modernbert-meld/u1l2r0v1

5.2. Running the sweep on an HPC node

Use slurm_sweep.sh to launch one or more agents. You do not need to copy or modify Python files; Slurm just runs whatever is in the repo at submit time.

Example: start 40 runs on Athena/Hickory (one GPU):

sbatch slurm_sweep.sh tar-xvf/modernbert-meld/u1l2r0v1 40

NOTE: replace u1l2r0v1 with whatever sweep ID is returned by wandb sweep sweep_deberta_phase3.yaml.

slurm_sweep.sh:

  • reserves a single GPU and some RAM
  • sets cache directories (HF_HOME, WANDB_DIR, etc.) on node-local scratch
  • runs wandb agent --count <COUNT> <SWEEP_ID>

You can monitor jobs with:

squeue -u $USER

and cancel with:

scancel <JOBID>

6. Independent evaluation and F1 sanity checks

To double-check metrics, use eval_dev.py. It loads a checkpoint, runs it on a CSV split, and recomputes metrics with scikit-learn.

Basic usage:

python -u eval_dev.py \
  --ckpt_path checkpoints/run_1651577-20251205-120956/best.ckpt \
  --data_path datasets/dev_sent_emo.csv \
  --text_col Utterance \
  --label_col Emotion \
  --tokenizer_name microsoft/deberta-v3-large \
  --batch_size 32 \
  --out_prefix eval_dev_run_1651577

Outputs:

  • eval_dev_run_1651577_metrics.json - per-class precision/recall/F1, macro/micro averages, confusion matrix, etc. (all via scikit-learn)
  • eval_dev_run_1651577_predictions.csv - one row per utterance with:
    • original text
    • gold label
    • predicted label
    • predicted probability for each class

Inside model.py, the Lightning module collects preds and labels as:

  • preds: integer class IDs (argmax over softmax probabilities)
  • labels: gold class IDs from the dataset

Both are fed directly into a clean torchmetrics F1 (macro) during training, and eval_dev.py recomputes F1 from scratch using sklearn.metrics.f1_score for independent verification.
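
The same check is easy to reproduce by hand from the exported predictions file (a sketch; the gold/pred column names below are assumptions, so match them to the actual CSV header):

import pandas as pd
from sklearn.metrics import classification_report, f1_score

df = pd.read_csv("eval_dev_run_1651577_predictions.csv")
# "gold" and "pred" are placeholder column names; check the real header
print("macro F1:", f1_score(df["gold"], df["pred"], average="macro"))
print(classification_report(df["gold"], df["pred"]))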


7. Data-generation scripts (synthetic_gen/MELD_DATA)

The synthetic_gen/MELD_DATA folder documents how the synthetic splits were created. Very high-level overview:

  1. Generate synthetic utterances
    synth_data.py calls a Qwen model to generate new utterances conditioned on existing MELD context and labels, writing them to synth_train_sent_emo.csv; synth_data_fix.py and synth_data_fix_2.py then clean and patch that first pass.

  2. Deduplicate

    • dedupe_synth_tfidf.py removes near-duplicate synthetic rows within the same class using TF-IDF + cosine similarity (see the sketch after this list).
    • dedupe_cross_class_tfidf.py removes cross-class near-duplicates (same text but different label).
  3. Augment MELD training set
    augment_train_with_synth.py merges the cleaned synthetic examples with train_sent_emo.csv to build balanced training sets with +250, +500, or +1000 synthetic examples per class. These are written to:

    • train_emotions_aug_250.csv
    • train_emotions_aug_500.csv
    • train_emotions_aug_1000.csv
  4. Utility scripts
    emotionCounter.py prints per-class counts for any CSV so you can verify that the balancing worked as intended (or just to see what the counts are, if you're curious).
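
The in-class deduplication in step 2 boils down to TF-IDF vectors plus a cosine-similarity threshold. A rough sketch in that spirit (the real script's threshold and bookkeeping may differ; the 0.9 cutoff is an assumption):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("datasets/synth_train_sent_emo.csv")
kept = []
for label, group in df.groupby("Emotion"):
    # pairwise similarities within one emotion class
    sims = cosine_similarity(TfidfVectorizer().fit_transform(group["Utterance"]))
    keep_idx = []
    for i in range(len(group)):
        # keep a row only if it isn't too close to an already-kept row
        if all(sims[i, j] < 0.9 for j in keep_idx):
            keep_idx.append(i)
    kept.append(group.iloc[keep_idx])
pd.concat(kept).to_csv("synth_train_dedup_sketch.csv", index=False)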


8. Logs and artifacts

  • Training / sweep logs: text logs under logs/, one .out and .err per Slurm job.
  • Model checkpoints: under checkpoints/run_*. Each run directory contains:
    • best.ckpt - best epoch by val_f1_macro
    • optional last.ckpt if you enable saving the last epoch
  • Predictions / probabilities: under preds/run_*:
    • test_preds.csv
    • test_probs.pkl (float32 numpy array or pandas frame)
  • W&B: all metrics, curves, and hyperparameters are logged to the configured W&B project (modernbert-meld).
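
To poke at a run's saved outputs afterwards (file names as above; loading test_probs.pkl as a standard pickle is an assumption):

import pickle
import pandas as pd

run_dir = "preds/run_1651577-20251205-120956"  # example run ID from section 6
preds = pd.read_csv(f"{run_dir}/test_preds.csv")
with open(f"{run_dir}/test_probs.pkl", "rb") as f:
    probs = pickle.load(f)  # float32 array/frame, one row per test utterance
print(preds.head())
print(probs.shape)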

9. Minimal quick-start

  1. Clone this repo to your home or scratch space.

  2. Create / activate venv:

    python3 -m venv .ftmb           # or any name you like
    source .ftmb/bin/activate
    pip install -r requirements.txt
  3. Run a small verification training (1-3 epochs) on any training split:

    python -u train.py \
      --train_path datasets/train_emotions_aug_250.csv \
      --val_path   datasets/dev_sent_emo.csv \
      --test_path  datasets/test_sent_emo.csv \
      --text_col Utterance \
      --label_col Emotion \
      --max_epochs 3 \
      --batch_size 16 \
      --accelerator gpu --devices 1 --precision bf16-mixed
  4. (Optional) Launch a sweep once the single run works:

    wandb sweep sweep_deberta_phase3.yaml
    sbatch slurm_sweep.sh tar-xvf/modernbert-meld/<SWEEP_ID> 40
  5. Evaluate the best checkpoint with eval_dev.py to confirm F1. On the cluster, run:

sbatch eval_dev.sh <CHECKPOINT_PATH> <DATA_PATH> <OUT_PREFIX>

(paths are relative to the repo root)

  6. Ensemble predictions: move the prediction CSV files produced in the prior step into the ensemble/ folder, make sure the dependencies from requirements.txt are installed, then run:
sbatch ensemble.sh 

10. License and dataset attribution

The code in this repository is released under the MIT License; you're free to use, modify, and distribute it with attribution.

The MELD dataset is a separate work with its own license. MELD is derived from dialogue in the television series Friends and is distributed under the GNU General Public License v3.0. See the official MELD repository for the full license text, dataset terms, and citation requirements. The MIT license on this code does not extend to MELD or to any data derived from it (including the synthetic and augmented splits in datasets/, which are conditioned on MELD content).

If you use MELD via this repo, please cite the original paper:

Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2019). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. https://aclanthology.org/P19-1050/

About

EMILY: EMotion Identification with Lightning, Y'all