jayn2u/lab_clip

lab-clip

OpenCLIP fine-tuning toolkit for text-based person re-identification (ReID). Supports contrastive training and triplet loss training with hard negative captions. Designed to be used as a person search engine.

Repository Layout

lab_clip/
  src/
    data.py       # dataset loading, ReIDDataset, TripletReIDDataset, collate functions
    losses.py     # multi-positive InfoNCE loss, symmetric CLIP loss, triplet loss
    retrieval.py  # encode_image_loader, encode_text_loader, retrieval_metrics
    training.py   # contrastive_step, evaluate, optimizer, scheduler, AverageMeter
  train.py              # single-GPU contrastive fine-tuning
  train_ddp.py          # multi-GPU DDP contrastive fine-tuning
  train_triplet.py      # single-GPU triplet loss fine-tuning
  train_ddp_triplet.py  # multi-GPU DDP triplet loss fine-tuning
  tune.py               # Optuna hyperparameter search for contrastive training
  tune_triplet.py       # Optuna hyperparameter search for triplet training
  test.py               # standard retrieval evaluation (Rank-k, mAP)
  test_triplet.py       # hard negative evaluation (triplet accuracy, margin)
  env/.env              # local environment variables (not tracked by git)
  pyproject.toml

Setup

Requires Python 3.12 and uv.

uv sync

Environment

Create env/.env with the following variables:

DATASET_ROOT="/path/to/datasets"
NEGATIVE_REID_DATASET_PATH="/path/to/negative-captions/outputs"
CUDA_VISIBLE_DEVICES=0

DATASET_ROOT must contain the following directory structure:

DATASET_ROOT/
  CUHK-PEDES/
    reid_raw.json
    reid_raw_negative_gemma4:e4b.json   # RTX 5070 Ti
    reid_raw_negative_gemma4:26b.json   # A6000
    imgs/
  RSTPReid/
    data_captions.json
    data_captions_negative_gemma4:e4b.json   # RTX 5070 Ti
    data_captions_negative_gemma4:26b.json   # A6000
    imgs/
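
A quick layout check can catch a wrong DATASET_ROOT before a long run starts. The following is a minimal sketch, not part of the repository: `EXPECTED_LAYOUT` and `find_missing_entries` are hypothetical names, and the GPU-specific negative annotation files are deliberately left out of the check.

```python
from pathlib import Path

# Expected entries per dataset, copied from the layout above. The negative
# annotation files depend on the GPU, so this sketch does not check them.
EXPECTED_LAYOUT = {
    "CUHK-PEDES": ["reid_raw.json", "imgs"],
    "RSTPReid": ["data_captions.json", "imgs"],
}


def find_missing_entries(dataset_root: str) -> list[str]:
    """Return relative paths that are expected under dataset_root but absent."""
    root = Path(dataset_root)
    missing = []
    for dataset_dir, entries in EXPECTED_LAYOUT.items():
        for entry in entries:
            if not (root / dataset_dir / entry).exists():
                missing.append(f"{dataset_dir}/{entry}")
    return missing
```

Calling `find_missing_entries(os.environ["DATASET_ROOT"])` returns an empty list when both datasets are in place.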

GPU-Based Negative Annotation File Selection

Triplet training automatically detects the current GPU and selects the appropriate negative annotation file.

| GPU | Negative Model | Annotation File Suffix |
| --- | --- | --- |
| RTX 5070 Ti | gemma4:e4b | *_negative_gemma4:e4b.json |
| A6000 | gemma4:26b | *_negative_gemma4:26b.json |

No manual path configuration is required. To add a new GPU, update GPU_NEGATIVE_MODEL_MAP and NEGATIVE_ANNOTATION_FILES in src/data.py.
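
The selection logic can be pictured roughly as follows. This is a sketch under assumptions: the two dictionaries are hypothetical stand-ins for GPU_NEGATIVE_MODEL_MAP and NEGATIVE_ANNOTATION_FILES, whose real structure in src/data.py is not shown here, and the substring match on the device name is a guess at how detection might work.

```python
# Hypothetical stand-ins for GPU_NEGATIVE_MODEL_MAP and
# NEGATIVE_ANNOTATION_FILES; the real structures in src/data.py may differ.
GPU_TO_NEGATIVE_MODEL = {
    "RTX 5070 Ti": "gemma4:e4b",
    "A6000": "gemma4:26b",
}
ANNOTATION_STEMS = {
    "cuhk-pedes": "reid_raw",
    "rstpreid": "data_captions",
}


def negative_annotation_filename(gpu_name: str, dataset: str) -> str:
    """Map a GPU device name and a dataset key to a negative annotation file name."""
    for gpu_key, model in GPU_TO_NEGATIVE_MODEL.items():
        # Substring match, since device names usually carry a vendor prefix.
        if gpu_key.lower() in gpu_name.lower():
            return f"{ANNOTATION_STEMS[dataset]}_negative_{model}.json"
    raise ValueError(f"No negative model registered for GPU: {gpu_name!r}")
```

In a PyTorch process the device name would typically come from `torch.cuda.get_device_name(0)`. An unknown GPU raises an error in this sketch, which is one reasonable way to surface a missing map entry.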

Supported Datasets

| Dataset | Key | Task |
| --- | --- | --- |
| CUHK-PEDES | cuhk-pedes | text-based person ReID |
| RSTPReid | rstpreid | text-based person ReID |

test.py additionally supports mscoco, cc12m, cifar100, and imagenet for general retrieval evaluation.

Training

Contrastive Training (Single GPU)

Fine-tunes OpenCLIP with multi-positive symmetric InfoNCE loss. Same-person captions within a batch are all treated as positives.

uv run python train.py \
  --dataset cuhk-pedes \
  --epochs 5 \
  --batch-size 64 \
  --lr 1e-5

Contrastive Training (Multi-GPU DDP)

torchrun --nproc_per_node=2 train_ddp.py \
  --dataset cuhk-pedes \
  --epochs 5 \
  --batch-size 256

Triplet Training (Single GPU)

Fine-tunes with triplet loss, using the image as the anchor, the original caption as the positive, and an LLM-generated hard negative caption as the negative.

Loss: mean(relu(sim(img, neg) - sim(img, pos) + margin))

uv run python train_triplet.py \
  --dataset cuhk-pedes \
  --epochs 5 \
  --batch-size 64 \
  --margin 0.2

The negative annotation file is selected automatically based on the current GPU.

Triplet Training (Multi-GPU DDP)

torchrun --nproc_per_node=2 train_ddp_triplet.py \
  --dataset cuhk-pedes \
  --epochs 5 \
  --batch-size 256 \
  --margin 0.2

Common Training Arguments

| Argument | Default | Description |
| --- | --- | --- |
| --dataset | required | cuhk-pedes or rstpreid |
| --epochs | 5 | number of training epochs |
| --batch-size | 64 | per-GPU batch size |
| --lr | 1e-5 | AdamW learning rate |
| --weight-decay | 0.2 | AdamW weight decay |
| --warmup-ratio | 0.05 | linear warmup fraction of total steps |
| --accum-steps | 1 | gradient accumulation steps |
| --grad-clip-norm | 1.0 | gradient clipping max norm |
| --caption-mode | all | all expands each caption; random samples one per image |
| --margin | 0.2 | triplet loss margin (triplet training only) |
| --model-name | ViT-B-16 | OpenCLIP model name |
| --pretrained | laion2b_s34b_b88k | OpenCLIP pretrained weight tag |
| --output-dir | auto | artifact save directory |
| --resume | | checkpoint path to resume training |
| --save-every | 0 | save checkpoint every N epochs (0 disables) |
| --no-amp | | disable CUDA mixed precision |
| --no-grad-checkpointing | | disable gradient checkpointing |

The checkpoints best.pt (best validation score) and last.pt are saved under --output-dir.
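
Because --batch-size is per GPU, the effective batch size also depends on --accum-steps and the number of processes. The sketch below shows how these arguments plausibly combine into step counts. `schedule_summary` is a hypothetical helper, and the rounding choices (keeping the last partial batch, truncating the warmup step count) are assumptions rather than documented behavior.

```python
import math


def schedule_summary(num_samples, batch_size, accum_steps, epochs,
                     warmup_ratio, world_size=1):
    """Derive the effective batch size and step counts from training arguments."""
    # One optimizer step consumes accum_steps micro-batches on every GPU.
    effective_batch = batch_size * accum_steps * world_size
    # Assumption: the last partial batch of an epoch is kept, hence ceil.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # Assumption: the warmup step count is truncated toward zero.
    warmup_steps = int(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }
```

For example, a hypothetical 64,000 training samples with the defaults (batch size 64, 5 epochs, warmup ratio 0.05) would give 5,000 optimizer steps, of which the first 250 are warmup.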

Hyperparameter Tuning

Optuna-based tuning that launches training as a subprocess for each trial.

Contrastive

uv run python tune.py \
  --dataset cuhk-pedes \
  --n-trials 100 \
  --epochs 5

Triplet

uv run python tune_triplet.py \
  --dataset cuhk-pedes \
  --n-trials 100 \
  --epochs 5

Tuned parameters and their search spaces:

| Parameter | Search Space |
| --- | --- |
| batch_size | [32, 64, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048] |
| accum_steps | [1, 2, 4, 8, 16] (max effective batch = 32768) |
| lr | [1e-6, 2e-6, 5e-6, 1e-5] |
| weight_decay | [0.05, 0.1, 0.2, 0.3] |
| warmup_ratio | [0.0, 0.15] continuous |
| grad_clip_norm | [0.0, 0.5, 1.0, 5.0] |
| caption_mode | [all, random] |
| margin | [0.1, 0.2, 0.3, 0.5] (triplet only) |

Unless --lr is set, the learning rate is sampled only from the small values listed above; a fixed --lr above 1e-5 is rejected. For a new study, every batch_size x accum_steps pair is enqueued once, starting from the largest physical batch size, before normal Optuna sampling begins. The best checkpoint is symlinked to output-root/best.pt. Use --reset-study to delete the existing study and the trial artifacts under output-root before starting again.
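
The enqueued grid can be enumerated as in the sketch below. `initial_trial_queue` is a hypothetical name, and the order of accum_steps within one batch size is an assumption; only the "largest physical batch size first" rule comes from the description above.

```python
from itertools import product

# Search spaces copied from the table above.
BATCH_SIZES = [32, 64, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048]
ACCUM_STEPS = [1, 2, 4, 8, 16]


def initial_trial_queue():
    """List every batch_size x accum_steps pair, largest physical batch first."""
    pairs = product(sorted(BATCH_SIZES, reverse=True), ACCUM_STEPS)
    return [{"batch_size": b, "accum_steps": a} for b, a in pairs]


queue = initial_trial_queue()
# Largest effective batch in the grid: 2048 * 16 = 32768.
max_effective_batch = max(t["batch_size"] * t["accum_steps"] for t in queue)
```

The grid holds 11 x 5 = 55 pairs. With Optuna, each dictionary could be passed to `study.enqueue_trial(...)` so that those trials run before the sampler takes over.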

Evaluation

Standard Retrieval (Rank-k, mAP)

Evaluates text-to-image retrieval quality over the full gallery. Use this to measure overall person search performance.

# Baseline (pretrained, no fine-tuning)
uv run python test.py --dataset cuhk-pedes --model baseline

# Fine-tuned checkpoint
uv run python test.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Output metrics: top1, top5, top10, mAP
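
For readers unfamiliar with these metrics, the sketch below computes Rank-k accuracy and mAP from a text-to-image similarity matrix. It is a dependency-free illustration, not the repository's retrieval_metrics function: `retrieval_scores` and its arguments are hypothetical, and it assumes a gallery image counts as a match when it shares the query's person ID.

```python
def retrieval_scores(similarity, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Compute Rank-k accuracy and mAP for text-to-image retrieval.

    similarity[q][g] is the score between text query q and gallery image g.
    """
    hits = {k: 0 for k in ks}
    average_precisions = []
    for scores, query_id in zip(similarity, query_ids):
        # Gallery indices sorted by descending similarity.
        order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        matches = [gallery_ids[g] == query_id for g in order]
        for k in ks:
            # Rank-k counts a query as a hit if any of its top-k results match.
            hits[k] += any(matches[:k])
        # Average precision: mean of the precision at each matching rank.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        average_precisions.append(sum(precisions) / len(precisions) if precisions else 0.0)
    n = len(query_ids)
    result = {f"top{k}": hits[k] / n for k in ks}
    result["mAP"] = sum(average_precisions) / n
    return result
```

Rank-k only asks whether a correct image appears among the first k results, while mAP also rewards ranking every correct image of that person highly.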

Hard Negative Evaluation (Triplet Accuracy)

Evaluates how well the model distinguishes a positive caption from a hard negative caption for the same image. Use this to measure localized attribute-level discrimination (e.g., "black coat" vs "white coat").

# Baseline
uv run python test_triplet.py --dataset cuhk-pedes --model baseline

# Fine-tuned checkpoint
uv run python test_triplet.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Output metrics:

| Metric | Description |
| --- | --- |
| triplet_accuracy | fraction where sim(img, pos) > sim(img, neg) |
| pos_sim_mean | mean cosine similarity between image and positive caption |
| neg_sim_mean | mean cosine similarity between image and negative caption |
| margin_mean | mean of pos_sim - neg_sim |
| margin_std | standard deviation of the margin |
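
Given per-triplet similarities, these metrics reduce to a few lines. The sketch below is illustrative only: `triplet_metrics` is a hypothetical name, and the population standard deviation is an assumption, since the description above does not say whether the population or sample form is used.

```python
from statistics import mean, pstdev


def triplet_metrics(pos_sims, neg_sims):
    """Summarize how well positives outscore hard negatives for the same images."""
    margins = [p - n for p, n in zip(pos_sims, neg_sims)]
    return {
        # Fraction of triplets where the positive caption scores strictly higher.
        "triplet_accuracy": sum(p > n for p, n in zip(pos_sims, neg_sims)) / len(margins),
        "pos_sim_mean": mean(pos_sims),
        "neg_sim_mean": mean(neg_sims),
        "margin_mean": mean(margins),
        # Assumption: population standard deviation.
        "margin_std": pstdev(margins),
    }
```

A high triplet_accuracy with a small margin_mean would suggest the model ranks the captions correctly but only by a narrow gap.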

Recommended Evaluation Workflow

Run both evaluations on both the baseline and the fine-tuned model to capture the full picture.

uv run python test.py           --dataset cuhk-pedes --model baseline
uv run python test_triplet.py   --dataset cuhk-pedes --model baseline

uv run python test.py           --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt
uv run python test_triplet.py   --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Results are saved as JSON under results/.

Loss Functions

Multi-Positive Symmetric InfoNCE (train.py, train_ddp.py)

All captions sharing the same person_id within a batch are treated as positives. Computed symmetrically in both image-to-text and text-to-image directions.
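
One common way to write such a loss is given below. This is an assumed formulation, not one confirmed against src/losses.py: it takes temperature-scaled similarities $s_{ij} = \langle v_i, t_j \rangle / \tau$ between L2-normalized image features $v_i$ and text features $t_j$, a positive set $P(i) = \{ j : \text{person\_id}_j = \text{person\_id}_i \}$, and averages the log-likelihood over positives. The actual implementation may weight positives or combine the two directions differently.

```latex
\mathcal{L}_{\text{i2t}} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|}
  \sum_{j \in P(i)} \log \frac{\exp(s_{ij})}{\sum_{k=1}^{N} \exp(s_{ik})},
\qquad
\mathcal{L}_{\text{t2i}} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|}
  \sum_{j \in P(i)} \log \frac{\exp(s_{ji})}{\sum_{k=1}^{N} \exp(s_{ki})},
\qquad
\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)
```

When every person appears once in the batch, each $P(i)$ contains only $i$ and the expression reduces to the standard symmetric CLIP loss.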

Triplet Loss (train_triplet.py, train_ddp_triplet.py)

loss = mean(relu(sim(image, neg_caption) - sim(image, pos_caption) + margin))

Image is the anchor. Positive and negative captions are 1-to-1 paired from the negative annotation JSON. All features are L2-normalized before similarity computation.
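
The formula can be traced on plain Python lists, as in the sketch below. It is a dependency-free illustration with hypothetical function names; the training scripts presumably operate on batched tensors instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity, computed as the dot product of L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def triplet_loss(images, pos_captions, neg_captions, margin=0.2):
    """mean(relu(sim(image, neg) - sim(image, pos) + margin)) over the batch."""
    terms = [
        max(0.0, cosine(img, neg) - cosine(img, pos) + margin)
        for img, pos, neg in zip(images, pos_captions, neg_captions)
    ]
    return sum(terms) / len(terms)
```

A triplet contributes zero loss once the positive caption beats the negative by at least the margin, so only hard or violated triplets drive the gradient.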
