Code for the 2025 Master’s research project “Calibrating Vision–Language Models in Few–Shot Settings” (Software Engineering and IT, ÉTS Montréal). Author: Paul Merceur
CLIP‑GP explores few‑shot adaptation of CLIP using multiple text templates per class. It provides a small set of lightweight adapters that keep CLIP encoders frozen and train only small heads:
- Adapter (default): visual projection with optional learned template weighting
- GP Template Weighter: Gaussian‑Process weights over templates (RBF/Matern/Linear kernels), used alone or to initialize other adapters
- CLIP‑Adapter: 2‑layer MLP on image features (with optional GP‑derived prototypes)
- TaskRes: learn a residual on top of frozen text features (optionally initialized from GP prototypes)
- CoOp: learnable prompt tokens (context optimization)
- Tip‑Adapter / Tip‑Adapter‑F: cache‑based adapter with optional trainable linear head
All trainers report accuracy and calibration metrics (ECE, AECE), and write a compact metrics.json for downstream aggregation and plotting.
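As a rough illustration of the two calibration metrics, the sketch below computes ECE with equal-width confidence bins and AECE with equal-mass (adaptive) bins. It is a minimal pure-Python version written for this README, not the repository's implementation: the function name `calibration_error`, the default of 10 bins, and the half-open bin convention are all assumptions.

```python
def calibration_error(confidences, correct, n_bins=10, adaptive=False):
    """Weighted mean |confidence - accuracy| gap over bins.

    adaptive=False: equal-width bins on [0, 1] (ECE).
    adaptive=True:  equal-mass bins over the sorted samples (AECE).
    Illustrative only; the repository's binning details may differ.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    if n == 0:
        return 0.0
    if adaptive:
        # Split the sorted samples into n_bins chunks of (nearly) equal size.
        bins = [pairs[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    else:
        bins = [[] for _ in range(n_bins)]
        for conf, hit in pairs:
            # Confidence 1.0 is clamped into the last bin.
            idx = min(int(conf * n_bins), n_bins - 1)
            bins[idx].append((conf, hit))
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        total += len(members) / n * abs(avg_conf - accuracy)
    return total


# Four predictions at 90% confidence, three of them correct.
confs = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 1, 0]
ece = calibration_error(confs, hits, n_bins=10)
aece = calibration_error(confs, hits, n_bins=2, adaptive=True)
```

Both values are fractions in [0, 1]; lower means the predicted confidence tracks the observed accuracy more closely.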
- Create and activate a Python environment (Python ≥3.8), then install dependencies:
pip install -r requirements.txt
- Prepare datasets as described in DATASETS.md. Set your dataset root via --root or edit the scripts’ DATA/DATA_ROOT variables.
The primary entrypoint is the wrapper script scripts/run_experiment.sh, which schedules single runs or grids and writes results in a consistent layout:
./scripts/run_experiment.sh <experiment_yaml> [<experiment_name>] [--devices "0,1" --jobs-per-gpu 1 --verbose]
# example
./scripts/run_experiment.sh configs/trainers/gp_small.yaml my_experiment --devices "0,1" --jobs-per-gpu 1
Notes:
- This script calls the Python runner (python -m utils.hparam_search) with your YAML.
- Outputs are written under output/<experiment>/<dataset>/<config_signature>/seed<seed>/.
- Combine --devices and --jobs-per-gpu to distribute runs across GPUs.
See the “Experiment sweeps (grids) and scheduling” section for a minimal YAML example.
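To make the interaction between --devices and --jobs-per-gpu concrete, here is a small sketch of one way runs could be spread over GPU worker slots. It assumes static round-robin assignment; the actual runner may schedule dynamically, and the helper names `worker_slots` and `assign_round_robin` are hypothetical.

```python
def worker_slots(devices, jobs_per_gpu):
    """Turn a device string such as "0,1" into (gpu_id, job_index) slots."""
    gpu_ids = [part.strip() for part in devices.split(",") if part.strip()]
    return [(gpu, job) for gpu in gpu_ids for job in range(jobs_per_gpu)]


def assign_round_robin(runs, devices, jobs_per_gpu=1):
    """Distribute runs over slots in round-robin order (an assumed policy)."""
    slots = worker_slots(devices, jobs_per_gpu)
    if not slots:
        raise ValueError("no devices given")
    plan = {slot: [] for slot in slots}
    for index, run in enumerate(runs):
        plan[slots[index % len(slots)]].append(run)
    return plan


# Three runs on two GPUs with one job per GPU.
plan = assign_round_robin(["run_a", "run_b", "run_c"], "0,1", jobs_per_gpu=1)
```

With `--devices "0,1" --jobs-per-gpu 2` the same logic would yield four slots, two per GPU.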
For single runs, use the native CLI (train.py). You can mix YAML config files and command‑line overrides.
# Baseline Adapter (visual projection + template prototypes)
python train.py \
--root /path/to/datasets \
--dataset Caltech101 \
--shots 4 \
--backbone RN50 \
--config-file configs/trainers/baseline.yaml \
--output-dir output/demo/baseline
# Adapter with GP template weighting
python train.py \
--root /path/to/datasets \
--dataset Caltech101 \
--shots 4 \
--backbone RN50 \
--config-file configs/trainers/gp.yaml \
--use-gp --num-templates 8 --gp-lr 1e-3 --gp-beta 1e-3 \
--output-dir output/demo/gp
# CLIP‑Adapter
python train.py \
--root /path/to/datasets \
--dataset Caltech101 \
--shots 4 \
--backbone RN50 \
--config-file configs/trainers/clip_adapter.yaml \
--output-dir output/demo/clip_adapter
# TaskRes (optionally preceded by GP pre‑training of prototypes)
python train.py \
--root /path/to/datasets \
--dataset Caltech101 \
--shots 4 \
--backbone RN50 \
--config-file configs/trainers/taskres.yaml \
--output-dir output/demo/taskres
Notes:
- Prefer YAMLs in configs/trainers/ to set the trainer and defaults. You can still override any field via CLI.
- Important adapter options (subset): --num-templates, --use-gp, --gp-lr, --gp-beta, --gp-num-mc-samples-train, --gp-num-mc-samples-eval, --l2-lambda, --freeze-visual-proj.
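The sketch below shows one plausible way dotted-key overrides could be merged on top of YAML defaults, assuming overrides arrive as alternating keys and values in the style `TRAINER.ADAPTER.L2_LAMBDA 0.1` used in this README. The helpers `parse_value` and `apply_overrides` are hypothetical and are not taken from the repository.

```python
import ast


def parse_value(text):
    """Interpret "8", "0.1", "True" as Python literals; keep other text as str."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text


def apply_overrides(config, opts):
    """Apply alternating KEY VALUE pairs to a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Create intermediate sections that the YAML did not define.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return config


# Defaults as they might be loaded from a YAML file.
defaults = {"TRAINER": {"ADAPTER": {"L2_LAMBDA": 0.01}}}
merged = apply_overrides(
    defaults,
    ["TRAINER.ADAPTER.L2_LAMBDA", "0.1", "TRAINER.ADAPTER.NUM_TEMPLATES", "8"],
)
```

Values given on the command line win over the YAML defaults, and keys absent from the YAML are added.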
Use the experiment runner to expand grids and schedule runs evenly across GPUs, either through the wrapper script or by calling the runner directly:
# Wrapper script (recommended)
./scripts/run_experiment.sh configs/trainers/gp_small.yaml my_experiment --devices "0,1" --jobs-per-gpu 1
# Or call the runner directly
python -m utils.hparam_search \
--config-file configs/trainers/gp_small.yaml \
--devices "0,1" \
--jobs-per-gpu 1 \
--experiment-name my_experiment
Minimal YAML structure (example):
name: gp_small
datasets: [caltech101, oxford_pets]
seeds: [1, 2, 3]
shots: [1, 2, 4, 8]
output_root: output
template: "{experiment}/{dataset}/{sig}/seed{seed}"
grid:
  TRAINER.ADAPTER.USE_GP: [True]
  TRAINER.ADAPTER.NUM_TEMPLATES: [8, 32]
  OPTIM.LR: [0.001]
  MODEL.BACKBONE.NAME: ["RN50"]
The runner writes each trial under:
output/<experiment>/<dataset>/<config_signature>/seed<seed>/
Each run creates metrics.json, log.txt, and stores the resolved config.json.
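The following sketch illustrates how a grid like the one in the example YAML could be expanded into individual trials and mapped onto the output layout. It is an approximation under stated assumptions: the signature scheme (a short SHA-1 hash of the settings) and the choice to fold `shots` into the signature are guesses, and `expand_grid`, `config_signature`, and `trial_dirs` are hypothetical names.

```python
import hashlib
import itertools
import json


def expand_grid(grid):
    """Yield one override dict per point in the Cartesian product of the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def config_signature(settings):
    """Hypothetical signature: first 8 hex digits of a hash of the settings."""
    blob = json.dumps(settings, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()[:8]


def trial_dirs(spec, experiment):
    """List the output directory of every (dataset, shots, config, seed) trial."""
    template = spec.get("template", "{experiment}/{dataset}/{sig}/seed{seed}")
    root = spec.get("output_root", "output")
    dirs = []
    for dataset in spec["datasets"]:
        for shots in spec["shots"]:
            for overrides in expand_grid(spec["grid"]):
                # Assumption: the shot count is part of the config signature.
                sig = config_signature(dict(overrides, shots=shots))
                for seed in spec["seeds"]:
                    rel = template.format(
                        experiment=experiment, dataset=dataset, sig=sig, seed=seed
                    )
                    dirs.append(root + "/" + rel)
    return dirs


spec = {
    "datasets": ["caltech101", "oxford_pets"],
    "seeds": [1, 2, 3],
    "shots": [1, 2, 4, 8],
    "output_root": "output",
    "template": "{experiment}/{dataset}/{sig}/seed{seed}",
    "grid": {
        "TRAINER.ADAPTER.USE_GP": [True],
        "TRAINER.ADAPTER.NUM_TEMPLATES": [8, 32],
        "OPTIM.LR": [0.001],
        "MODEL.BACKBONE.NAME": ["RN50"],
    },
}
dirs = trial_dirs(spec, "my_experiment")
```

Under these assumptions the example YAML expands to 2 datasets × 4 shot counts × 2 grid points × 3 seeds = 48 trials.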
After runs finish, aggregate and plot:
python scripts/aggregate_results.py <experiment_name> \
[--grouped] [--show-zero-shot]
It reads:
output/<experiment>/<dataset>/<config_signature>/seed*/metrics.json
and prints per‑dataset summaries and averages across datasets. It also saves:
- Plots: output/<experiment>/_plots/perf_per_shots/ and .../_plots/acc_vs_ece/
- Tables (per dataset and averaged): output/<experiment>/_tables/
Flags:
- --grouped: group multiple configs into single lines using GROUP_SUBSTRINGS in scripts/aggregate_results.py
- --show-zero-shot: plot zero‑shot performance as stars at shot=0
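For readers who want to post-process results themselves, the sketch below shows the core of such an aggregation: collect every metrics.json under an experiment directory and report the mean and standard deviation across seeds. It relies only on the `dataset`, `method`, `shots`, and `metrics` fields of the metrics.json schema. Grouping by method rather than by config signature is a simplifying assumption, and `load_records` and `summarize` are hypothetical helpers, not functions from scripts/aggregate_results.py.

```python
import glob
import json
import os
import statistics
from collections import defaultdict


def load_records(experiment_dir):
    """Read every <dataset>/<config_signature>/seed*/metrics.json file."""
    pattern = os.path.join(experiment_dir, "*", "*", "seed*", "metrics.json")
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


def summarize(records, metric_names=("top1_acc", "ece", "aece")):
    """Return {(dataset, method, shots): {metric: (mean, std)}} across seeds."""
    groups = defaultdict(list)
    for record in records:
        key = (record["dataset"], record["method"], record["shots"])
        groups[key].append(record["metrics"])
    summary = {}
    for key, runs in groups.items():
        summary[key] = {}
        for name in metric_names:
            values = [run[name] for run in runs]
            # Population standard deviation, so a single seed gives 0.0.
            summary[key][name] = (statistics.mean(values), statistics.pstdev(values))
    return summary


# Typical use: summary = summarize(load_records("output/my_experiment"))
```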
metrics.json schema (minimal):
{
"dataset": "Caltech101",
"shots": 1,
"seed": 1,
"method": "baseline|gp|clip-adapter|coop|cocoop|tipa|tipaf",
"backbone": "RN50",
"zero_shot": {"top1_acc": 0.0, "ece": 0.0, "aece": 0.0, "calibration": {...}},
"metrics": {"top1_acc": 0.0, "ece": 0.0, "aece": 0.0},
"config": {...},
"output_dir": "output/...",
"train_time_s": 0.0
}
- Adapter (default): visual projection with L2 regularization; template prototypes from multiple prompts
  - Template weighting (non‑GP): --train-template-weights or --use-linear-template-weighting
  - Weight init: --template-init-method {uniform,val_weighted,top3,minmax}
  - Shared weights across classes: --shared-template-weights
  - Regularization: --l2-lambda, --freeze-visual-proj
- GP Template Weighter: Gaussian‑Process over per‑template logits
  - Enable: --use-gp
  - Kernel: --gp-kernel-type {rbf,linear} (Matern available via config)
  - Hyper‑params: --gp-lr, --gp-beta, --gp-num-mc-samples-train, --gp-num-mc-samples-eval, --gp-pca-dim
- CLIP‑Adapter: 2‑layer MLP on image features blending adapted/original features
  - Key flags: ADAPTER.CLIP_ADAPTER_REDUCTION, ADAPTER.CLIP_ADAPTER_RATIO, ADAPTER.CLIP_ADAPTER_LR, ADAPTER.CLIP_ADAPTER_EPOCHS
- TaskRes: learn residuals on top of base text features
  - Scale: ADAPTER.TASKRES_RESIDUAL_SCALE
  - Optional GP to initialize base prototypes before residual learning
- CoOp: prompt learning with learnable context tokens
  - Flags: ADAPTER.N_CTX, ADAPTER.CTX_INIT, ADAPTER.CSC
- Tip‑Adapter / Tip‑Adapter‑F: cache‑based (with optional trainable linear head)
  - Flags: ADAPTER.TIP_ADAPTER_TRAINABLE, ADAPTER.TIP_ADAPTER_LR, ADAPTER.TIP_ADAPTER_EPOCHS
Most of these are also exposed via YAML in configs/trainers/ and can be overridden on the CLI using the OPTS style, e.g.:
... TRAINER.ADAPTER.L2_LAMBDA 0.1 TRAINER.ADAPTER.NUM_TEMPLATES 8
- Frozen CLIP encoders with template‑based text prototypes
- Optional GP weighting over templates with KL regularization
- Visual projection with L2 or alternative adapters (CLIP‑Adapter, TaskRes, CoOp, Tip‑Adapter)
- Robust evaluation: top‑1 accuracy, macro‑F1 (if sklearn available), ECE, AECE
- Clean experiment runner for grids and reproducible outputs
This repository is based on the CLAP project (CVPR’24), “A Closer Look at the Few‑Shot Adaptation of Large Vision‑Language Models.” The original code and paper materials are available in the CLAP repository on GitHub.
This version removes heavyweight dependencies and provides a minimal, PyTorch‑native training and data pipeline with simple configs and scripts.