OpenCLIP fine-tuning toolkit for text-based person re-identification (ReID). Supports contrastive training and triplet loss training with hard negative captions. Designed to be used as a person search engine.
lab_clip/
    src/
        data.py # dataset loading, ReIDDataset, TripletReIDDataset, collate functions
        losses.py # multi-positive InfoNCE loss, symmetric CLIP loss, triplet loss
        retrieval.py # encode_image_loader, encode_text_loader, retrieval_metrics
        training.py # contrastive_step, evaluate, optimizer, scheduler, AverageMeter
    train.py # single-GPU contrastive fine-tuning
    train_ddp.py # multi-GPU DDP contrastive fine-tuning
    train_triplet.py # single-GPU triplet loss fine-tuning
    train_ddp_triplet.py # multi-GPU DDP triplet loss fine-tuning
    tune.py # Optuna hyperparameter search for contrastive training
    tune_triplet.py # Optuna hyperparameter search for triplet training
    test.py # standard retrieval evaluation (Rank-k, mAP)
    test_triplet.py # hard negative evaluation (triplet accuracy, margin)
    env/.env # local environment variables (not tracked by git)
    pyproject.toml
Requires Python 3.12 and uv.
uv sync

Create env/.env with the following variables:
DATASET_ROOT="/path/to/datasets"
NEGATIVE_REID_DATASET_PATH="/path/to/negative-captions/outputs"
CUDA_VISIBLE_DEVICES=0

DATASET_ROOT must contain the following directory structure:
DATASET_ROOT/
    CUHK-PEDES/
        reid_raw.json
        reid_raw_negative_gemma4:e4b.json # RTX 5070 Ti
        reid_raw_negative_gemma4:26b.json # A6000
        imgs/
    RSTPReid/
        data_captions.json
        data_captions_negative_gemma4:e4b.json # RTX 5070 Ti
        data_captions_negative_gemma4:26b.json # A6000
        imgs/
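
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the toolkit: `REQUIRED_ENTRIES` and `find_missing_entries` are hypothetical names, and the GPU-specific negative annotation files are deliberately left out of the check.

```python
from pathlib import Path

# Entries each dataset directory is expected to contain, taken from the tree above.
# The GPU-specific negative annotation files are intentionally not checked here.
REQUIRED_ENTRIES = {
    "CUHK-PEDES": ["reid_raw.json", "imgs"],
    "RSTPReid": ["data_captions.json", "imgs"],
}


def find_missing_entries(dataset_root: str) -> list[str]:
    """Return the relative paths under dataset_root that do not exist (hypothetical helper)."""
    root = Path(dataset_root)
    missing = []
    for dataset_dir, entries in REQUIRED_ENTRIES.items():
        for entry in entries:
            if not (root / dataset_dir / entry).exists():
                missing.append(f"{dataset_dir}/{entry}")
    return missing
```

An empty list means both datasets have their base annotation file and image directory.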
Triplet training automatically detects the current GPU and selects the appropriate negative annotation file.
| GPU | Negative Model | Annotation File Suffix |
|---|---|---|
| RTX 5070 Ti | gemma4:e4b | *_negative_gemma4:e4b.json |
| A6000 | gemma4:26b | *_negative_gemma4:26b.json |
No manual path configuration is required. To add a new GPU, update GPU_NEGATIVE_MODEL_MAP and NEGATIVE_ANNOTATION_FILES in src/data.py.
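
The selection logic might look roughly like the sketch below. The names `GPU_NEGATIVE_MODEL_MAP` and `NEGATIVE_ANNOTATION_FILES` come from src/data.py, but their shapes here are assumptions, and `select_negative_annotation` is a hypothetical helper. In practice the GPU name would likely come from `torch.cuda.get_device_name(0)`; it is passed in as a plain string here so the sketch runs without a GPU.

```python
# Assumed shapes: the real GPU_NEGATIVE_MODEL_MAP and NEGATIVE_ANNOTATION_FILES
# in src/data.py may be structured differently.
GPU_NEGATIVE_MODEL_MAP = {
    "RTX 5070 Ti": "gemma4:e4b",
    "A6000": "gemma4:26b",
}
NEGATIVE_ANNOTATION_FILES = {
    "cuhk-pedes": "CUHK-PEDES/reid_raw_negative_{model}.json",
    "rstpreid": "RSTPReid/data_captions_negative_{model}.json",
}


def select_negative_annotation(gpu_name: str, dataset: str) -> str:
    """Map a GPU name to the negative annotation path relative to DATASET_ROOT."""
    for gpu_key, model in GPU_NEGATIVE_MODEL_MAP.items():
        # Substring match, because device names usually carry a vendor prefix.
        if gpu_key in gpu_name:
            return NEGATIVE_ANNOTATION_FILES[dataset].format(model=model)
    raise ValueError(f"No negative model registered for GPU: {gpu_name!r}")
```

Under these assumptions, supporting a new GPU means adding one entry to the first dictionary and making sure the matching annotation files exist.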
| Dataset | Key | Task |
|---|---|---|
| CUHK-PEDES | cuhk-pedes | text-based person ReID |
| RSTPReid | rstpreid | text-based person ReID |
test.py additionally supports mscoco, cc12m, cifar100, and imagenet for general retrieval evaluation.
Fine-tunes OpenCLIP with multi-positive symmetric InfoNCE loss. Same-person captions within a batch are all treated as positives.
uv run python train.py \
--dataset cuhk-pedes \
--epochs 5 \
--batch-size 64 \
--lr 1e-5

torchrun --nproc_per_node=2 train_ddp.py \
--dataset cuhk-pedes \
--epochs 5 \
--batch-size 256

Fine-tunes with triplet loss, using the image as the anchor, the original caption as the positive, and an LLM-generated hard negative caption as the negative.
Loss: mean(relu(sim(img, neg) - sim(img, pos) + margin))
uv run python train_triplet.py \
--dataset cuhk-pedes \
--epochs 5 \
--batch-size 64 \
--margin 0.2

The negative annotation file is selected automatically based on the current GPU.
torchrun --nproc_per_node=2 train_ddp_triplet.py \
--dataset cuhk-pedes \
--epochs 5 \
--batch-size 256 \
--margin 0.2

| Argument | Default | Description |
|---|---|---|
| --dataset | required | cuhk-pedes or rstpreid |
| --epochs | 5 | number of training epochs |
| --batch-size | 64 | per-GPU batch size |
| --lr | 1e-5 | AdamW learning rate |
| --weight-decay | 0.2 | AdamW weight decay |
| --warmup-ratio | 0.05 | linear warmup fraction of total steps |
| --accum-steps | 1 | gradient accumulation steps |
| --grad-clip-norm | 1.0 | gradient clipping max norm |
| --caption-mode | all | all expands each caption; random samples one per image |
| --margin | 0.2 | triplet loss margin (triplet training only) |
| --model-name | ViT-B-16 | OpenCLIP model name |
| --pretrained | laion2b_s34b_b88k | OpenCLIP pretrained weight tag |
| --output-dir | auto | artifact save directory |
| --resume | — | checkpoint path to resume training |
| --save-every | 0 | save checkpoint every N epochs (0 disables) |
| --no-amp | — | disable CUDA mixed precision |
| --no-grad-checkpointing | — | disable gradient checkpointing |
The checkpoints best.pt (best validation score) and last.pt are saved under --output-dir.
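
To make --warmup-ratio and --accum-steps concrete, here is a small sketch of how they could translate into a learning-rate multiplier and an effective batch size. Both function names are hypothetical. The behaviour after warmup is not specified in this README, so the sketch simply holds the multiplier at 1.0; the actual scheduler in src/training.py may decay it instead.

```python
def warmup_factor(step: int, total_steps: int, warmup_ratio: float = 0.05) -> float:
    """LR multiplier during linear warmup; 1.0 afterwards (an assumption)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps == 0 or step >= warmup_steps:
        return 1.0
    # Ramp linearly so that the very first step already has a non-zero LR.
    return (step + 1) / warmup_steps


def effective_batch_size(batch_size: int, accum_steps: int = 1, world_size: int = 1) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * accum_steps * world_size
```

With the defaults and 1000 total steps, warmup covers the first 50 steps. A two-GPU DDP run with --batch-size 256 and --accum-steps 1 would see 512 samples per optimizer step.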
Optuna-based tuning that launches training as a subprocess for each trial.
uv run python tune.py \
--dataset cuhk-pedes \
--n-trials 100 \
--epochs 5

uv run python tune_triplet.py \
--dataset cuhk-pedes \
--n-trials 100 \
--epochs 5

Tuned parameters and their search spaces:

| Parameter | Search Space |
|---|---|
| batch_size | [32, 64, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048] |
| accum_steps | [1, 2, 4, 8, 16] (max effective batch = 32768) |
| lr | [1e-6, 2e-6, 5e-6, 1e-5] |
| weight_decay | [0.05, 0.1, 0.2, 0.3] |
| warmup_ratio | [0.0, 0.15] continuous |
| grad_clip_norm | [0.0, 0.5, 1.0, 5.0] |
| caption_mode | [all, random] |
| margin | [0.1, 0.2, 0.3, 0.5] (triplet only) |
Unless --lr is set, the learning rate is searched only over the small values listed in the search space; a fixed --lr above 1e-5 is rejected. For a new study, every batch_size x accum_steps pair is enqueued once, starting from the largest physical batch size, before normal Optuna sampling begins. The best checkpoint is symlinked to output-root/best.pt. Use --reset-study to delete the existing study and the trial artifacts under output-root before starting again.
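
The enqueue order described here can be sketched as follows. This only illustrates the ordering, not the code in tune.py: `initial_trial_params` is a hypothetical helper, and the order of accum_steps within one batch size is an assumption.

```python
from itertools import product

# Search-space values copied from the table of tuned parameters.
BATCH_SIZES = [32, 64, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048]
ACCUM_STEPS = [1, 2, 4, 8, 16]


def initial_trial_params() -> list[dict[str, int]]:
    """Every batch_size x accum_steps pair once, largest physical batch first."""
    pairs = product(sorted(BATCH_SIZES, reverse=True), ACCUM_STEPS)
    return [{"batch_size": b, "accum_steps": a} for b, a in pairs]
```

Each dictionary could then be handed to Optuna's `study.enqueue_trial(...)` so that these 55 configurations run before the sampler takes over.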
Evaluates text-to-image retrieval quality over the full gallery. Use this to measure overall person search performance.
# Baseline (pretrained, no fine-tuning)
uv run python test.py --dataset cuhk-pedes --model baseline
# Fine-tuned checkpoint
uv run python test.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Output metrics: top1, top5, top10, mAP
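
For reference, Rank-k and mAP for text-to-image retrieval can be computed along the lines of the sketch below. It is written in plain Python for readability and is not the `retrieval_metrics` function in src/retrieval.py. `sketch_retrieval_metrics` is a hypothetical name, and reporting values as fractions in [0, 1] is an assumption.

```python
def sketch_retrieval_metrics(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """sim[i][j] is the similarity between text query i and gallery image j."""
    hits = {k: 0 for k in ks}
    ap_sum = 0.0
    for row, qid in zip(sim, query_ids):
        # Gallery indices sorted by descending similarity to this query.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        matches = [gallery_ids[j] == qid for j in order]
        for k in ks:
            # Rank-k counts a hit if any of the top k images shows the right person.
            hits[k] += any(matches[:k])
        # Average precision: mean of precision at each rank holding a correct image.
        found, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                found += 1
                precisions.append(found / rank)
        ap_sum += sum(precisions) / len(precisions) if precisions else 0.0
    n = len(query_ids)
    metrics = {f"top{k}": hits[k] / n for k in ks}
    metrics["mAP"] = ap_sum / n
    return metrics
```

Rank-k only asks whether at least one correct image appears in the top k, while mAP also rewards ranking all images of the same person near the top.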
Evaluates how well the model distinguishes a positive caption from a hard negative caption for the same image. Use this to measure localized attribute-level discrimination (e.g., "black coat" vs "white coat").
# Baseline
uv run python test_triplet.py --dataset cuhk-pedes --model baseline
# Fine-tuned checkpoint
uv run python test_triplet.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Output metrics:

| Metric | Description |
|---|---|
| triplet_accuracy | fraction where sim(img, pos) > sim(img, neg) |
| pos_sim_mean | mean cosine similarity between image and positive caption |
| neg_sim_mean | mean cosine similarity between image and negative caption |
| margin_mean | mean of pos_sim - neg_sim |
| margin_std | standard deviation of the margin |
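
Given per-sample cosine similarities, the metrics above reduce to a few lines. The sketch below is illustrative only: `sketch_triplet_metrics` is a hypothetical name, and the population standard deviation is an assumption, since the document does not say which variant test_triplet.py reports.

```python
from statistics import mean, pstdev


def sketch_triplet_metrics(pos_sims, neg_sims):
    """pos_sims[i] and neg_sims[i] are sim(img_i, pos_i) and sim(img_i, neg_i)."""
    margins = [p - n for p, n in zip(pos_sims, neg_sims)]
    return {
        # A triplet counts as correct when the positive caption scores strictly higher.
        "triplet_accuracy": mean([1.0 if m > 0 else 0.0 for m in margins]),
        "pos_sim_mean": mean(pos_sims),
        "neg_sim_mean": mean(neg_sims),
        "margin_mean": mean(margins),
        # Population standard deviation (assumed; the sample variant is equally plausible).
        "margin_std": pstdev(margins),
    }
```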
Run both evaluations on the baseline and on the fine-tuned model to compare overall retrieval quality and hard negative discrimination side by side.
uv run python test.py --dataset cuhk-pedes --model baseline
uv run python test_triplet.py --dataset cuhk-pedes --model baseline
uv run python test.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt
uv run python test_triplet.py --dataset cuhk-pedes --model artifacts/cuhk-pedes_triplet/best.pt

Results are saved as JSON under results/.
Multi-positive InfoNCE loss: all captions sharing the same person_id within a batch are treated as positives. The loss is computed symmetrically in both the image-to-text and text-to-image directions.
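
A plain-Python sketch of a symmetric multi-positive InfoNCE loss follows. It is meant to show the idea, not to mirror src/losses.py. It assumes a square batch of image-caption pairs with one person_id per pair, logits already scaled by the temperature, and averaging of the log-probabilities over each row's positives; the real implementation may aggregate positives differently.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def multi_positive_infonce(logits, person_ids):
    """logits[i][j]: scaled similarity between image i and caption j in one batch."""
    n = len(person_ids)

    def one_direction(mat):
        total = 0.0
        for i, row in enumerate(mat):
            log_probs = log_softmax(row)
            # Every column with the same person_id is a positive, including i itself.
            positives = [j for j in range(n) if person_ids[j] == person_ids[i]]
            total -= sum(log_probs[j] for j in positives) / len(positives)
        return total / n

    # The text-to-image direction uses the transposed logit matrix.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (one_direction(logits) + one_direction(transposed))
```

With all person_ids distinct, this reduces to the standard symmetric CLIP loss, where only the diagonal entries are positives.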
Triplet loss: loss = mean(relu(sim(image, neg_caption) - sim(image, pos_caption) + margin))
The image is the anchor. Positive and negative captions are paired one-to-one in the negative annotation JSON. All features are L2-normalized before the similarity is computed.
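
The formula can be traced with a small plain-Python sketch. The helper names are hypothetical, and features are plain lists here rather than tensors, so this shows the arithmetic only and is not the code in src/losses.py.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(image_feats, pos_feats, neg_feats, margin=0.2):
    """mean(relu(sim(image, neg) - sim(image, pos) + margin)) over 1-to-1 triplets."""
    losses = []
    for img, pos, neg in zip(image_feats, pos_feats, neg_feats):
        img, pos, neg = l2_normalize(img), l2_normalize(pos), l2_normalize(neg)
        # The loss is zero once the positive beats the negative by at least the margin.
        losses.append(max(0.0, dot(img, neg) - dot(img, pos) + margin))
    return sum(losses) / len(losses)
```

A triplet contributes nothing once sim(image, pos_caption) exceeds sim(image, neg_caption) by the margin, so training concentrates on captions the model still confuses.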