A small, hardware-honest LoRA / QLoRA fine-tuning pipeline targeting Jetson AGX Orin 64GB on JetPack 6.2.2 / L4T r36.5.0 (CUDA 12.6, sm_87 / Ampere, aarch64).
Train a LoRA on a current model (Gemma 4, Qwen 3.x, ...), check it actually learned
something, and serve the result — all on the Orin itself. See docs/versions.md for
why each dependency is pinned the way it is.
- Jetson AGX Orin 64GB on JetPack 6.x (verified on 6.2.2 / L4T r36.5.0)
- Docker with the NVIDIA container runtime (standard on JetPack)
- Disk for the HF model cache: tens of GB for small models, ~60GB+ for a 30B-class base
# One-time: the compose file joins an external docker network (so a chat UI or
# other containers can reach the inference server by name) and binds host dirs
# for the HF cache + saved adapters (default ./.data/*, override in .env).
docker network create ai-network
mkdir -p .data/{hf-cache,adapters}
cp .env.example .env # optional -- defaults work; set HF_TOKEN for gated models
# Build the image.
docker compose build finetune
# Pre-flight: ~10s, catches arg-name drift across transformers/peft/trl
# without touching the GPU or downloading any large model.
docker compose run --rm finetune python -m tests.preflight
# Smoke train (Qwen3-0.6B, 10 steps, a few minutes).
docker compose run --rm -e MODEL=Qwen/Qwen3-0.6B finetune python -m finetune.train

Successful smoke-test output ends with [done] adapter saved to adapters/smoke-bf16.
That proves the chain (image -> CUDA -> torch -> transformers -> peft -> trl -> save)
end-to-end on this hardware.
# Inference smoke test: load the saved adapter, generate one completion.
# QUANTIZE must match the training run (bf16 -> bf16, hqq -> hqq).
docker compose run --rm finetune python -m finetune.infer

That proves the save round-trips: the adapter loads on top of the base, runs on the GPU, and emits text. It does NOT prove the fine-tune learned anything; ten smoke steps won't move the model meaningfully.
# bf16 LoRA on a small modern model (~E4B class). Fits on 64GB without quantization.
docker compose run --rm \
-e MODE=full -e MODEL=google/gemma-4-E4B-it \
finetune python -m finetune.train
# HQQ 4-bit LoRA for a large model (30B class). Requires HF_TOKEN for gated weights.
# HQQ_NBITS=8 switches the frozen base to near-lossless 8-bit (~2x the memory).
docker compose run --rm \
-e MODE=full -e QUANTIZE=hqq -e MODEL=google/gemma-4-31B-it \
finetune python -m finetune.train

Adapters land in the adapters volume under <mode>-<quantize>/ (the host
directory is set by ADAPTERS_DIR in .env, default ./.data/adapters; hqq
runs encode the bit-width, e.g. full-hqq4 / full-hqq8).
Long runs are stoppable and resumable: SAVE_STEPS=N checkpoints every N steps
and RESUME_FROM_CHECKPOINT=auto continues from the latest checkpoint with
optimizer/scheduler state intact (see Knobs).
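
As a rough illustration of what RESUME_FROM_CHECKPOINT=auto has to do, the sketch below picks the newest checkpoint-<step> directory under an output dir. The checkpoint-<step> naming is the Hugging Face Trainer convention; the helper name latest_checkpoint is hypothetical, and this is not necessarily how finetune.train resolves auto.

```python
import re
from pathlib import Path
from typing import Optional

# Hugging Face Trainer writes checkpoints as "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint dir with the highest step, or None if there is none.

    Hypothetical helper for illustration; the repo may resolve "auto" differently.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

The numeric comparison matters: a plain string sort would rank checkpoint-200 above checkpoint-1000. transformers ships a comparable helper, get_last_checkpoint in transformers.trainer_utils, which may be what a real implementation would reach for.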
Targeting Gemma? The staged plan to get a LoRA onto google/gemma-4-31B-it
and serve it in Q8 — plus scripts/prove_gemma.sh, a one-paste validation of the
cheapest first step (bf16 on E4B) — is in docs/gemma.md.
Training rows are JSONL with a messages list (user/assistant turns).
training_data/ is a gitignored drop-in spot for your own files — only its
.gitkeep is tracked, so nothing you put there gets committed:
docker compose run --rm \
-e MODE=full -e MODEL=... \
-e DATA=training_data/mycorpus.jsonl \
finetune python -m finetune.train

Both DATA and OUTPUT_DIR are interpreted relative to the in-container repo
root (/workspace, the .:/workspace mount in docker-compose.yml).
Without DATA set, training uses the committed sample corpus in data/
(see The sample corpus) — the repo is fully
self-contained for development and testing.
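
For reference, a training row in the shape described above might look like the one built below, alongside a small validator you could run over your own file before training. Only the messages list with user/assistant turns comes from this README; the role and content key names follow the common chat-message convention and are an assumption here, as are the helper name validate_row and the example text.

```python
import json

ALLOWED_ROLES = {"user", "assistant"}  # assumption: other roles are out of scope here


def validate_row(line: str) -> list:
    """Parse one JSONL line and return its messages; raise ValueError if malformed."""
    row = json.loads(line)
    if not isinstance(row, dict):
        raise ValueError("each line must be a JSON object")
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("row needs a non-empty 'messages' list")
    for msg in messages:
        if not isinstance(msg, dict):
            raise ValueError("each message must be an object")
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("each message needs string 'content'")
    return messages


# Illustrative row, not taken from the shipped corpus.
example = json.dumps({
    "messages": [
        {"role": "user", "content": "Start a scene in a lighthouse."},
        {"role": "assistant", "content": "The lamp gutters. What does Mara do?"},
    ]
})
print(len(validate_row(example)), "messages OK")
```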
The sample corpus trains a recognizable output format — choose-your-own-adventure scenes that end with "What does X do?" and three numbered choices — so you can tell at a glance whether training changed the model.
finetune.infer doubles as a side-by-side eval harness. With EVAL_PROMPTS
set it iterates a JSONL of prompts and writes completions to OUTPUT_FILE.
With BASELINE=1 it skips the adapter so you get the base model's answers
on the same prompts with the same seed. Diff the two files.
# 1) Baseline -- before any training. Set MODEL because no adapter exists yet.
docker compose run --rm \
-e BASELINE=1 -e MODEL=Qwen/Qwen3-0.6B \
-e EVAL_PROMPTS=data/eval_prompts.jsonl \
-e OUTPUT_FILE=data/eval_baseline.jsonl \
finetune python -m finetune.infer
# 2) Train on the corpus.
docker compose run --rm \
-e MODE=full -e MODEL=Qwen/Qwen3-0.6B \
finetune python -m finetune.train
# 3) Adapter run -- same prompts, same seed. ADAPTER_DIR carries the base id.
docker compose run --rm \
-e ADAPTER_DIR=adapters/full-bf16 \
-e EVAL_PROMPTS=data/eval_prompts.jsonl \
-e OUTPUT_FILE=data/eval_adapter.jsonl \
finetune python -m finetune.infer
# 4) Eyeball the diff.
diff -u data/eval_baseline.jsonl data/eval_adapter.jsonl | less

The five prompts in data/eval_prompts.jsonl are intentionally distinct from
any training row, so a hit on the CYOA shape is the adapter generalizing, not
memorizing. To read more than two eval files at once,
scripts/compare_evals.py FILE1.jsonl FILE2.jsonl ... prints them aligned
per prompt (missing files are skipped).
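
One way to turn "did the format change" into a number is a crude shape check on each completion. The sketch below treats a completion as CYOA-shaped when it contains a "What does ... do?" question followed by choices numbered 1 to 3. The regexes, the accepted numbering styles ("1." or "1)"), and the function name are assumptions; the real corpus format may differ in detail.

```python
import re

# Assumption: the question reads "What does <name> do?" and the choices are
# numbered 1, 2, 3, each at the start of its own line.
_QUESTION_RE = re.compile(r"What does .+? do\?")
_CHOICE_RE = re.compile(r"^\s*([1-3])[.)]\s+\S", re.MULTILINE)


def looks_like_cyoa(completion: str) -> bool:
    """True if the text has the closing question followed by choices 1, 2, 3 in order."""
    question = _QUESTION_RE.search(completion)
    if question is None:
        return False
    tail = completion[question.end():]
    return _CHOICE_RE.findall(tail) == ["1", "2", "3"]
```

Counting how many completions pass in the baseline file versus the adapter file gives a quick before/after score. The field that holds the completion text in those files is not specified here, so the loading step is left out.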
finetune.serve is a minimal OpenAI-compatible server (stdlib only) that loads
a saved adapter on top of its base — the same loader as training, so anything
you trained, you can serve. Streams over SSE when the client asks for it.
# .env: point SERVE_ADAPTER_DIR at your adapter (set it explicitly empty to
# serve the bare base model), then bring up the persistent service:
docker compose up -d serve
curl -s localhost:8000/v1/chat/completions \
-d '{"messages": [{"role": "user", "content": "hello"}]}'

Point any OpenAI client at http://localhost:8000/v1, or, from a container on
the same ai-network (e.g. open-webui), at http://ft-serve:8000/v1. Sampling
params in the request (temperature, top_p/top_k/min_p,
repetition/frequency penalty) are passed through to generate.
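
A stdlib-only Python equivalent of the curl call, as a sketch. The endpoint path and message shape come from the example above; the Content-Type header, the sampling values in the usage note, and the choices[0].message.content response path are assumptions based on the OpenAI chat-completions convention rather than anything confirmed about finetune.serve.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000/v1",
                       **sampling) -> urllib.request.Request:
    """Build a POST to /chat/completions; sampling kwargs ride along in the body."""
    payload = {"messages": [{"role": "user", "content": prompt}], **sampling}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the first completion's text (assumed response shape)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the serve container up, chat("hello", temperature=0.7) should return one completion. From another container on ai-network, pass base_url="http://ft-serve:8000/v1".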
Two things make it livable on a memory-tight box:
- Memory fit. A near-fitting bf16 model silently offloads a few GB to CPU
  and decodes 10–25x slower; set SERVE_GPU_MAX_MEMORY (e.g. 56GiB) to force
  it fully on-GPU.
- Fault handling. Generations are serialized (concurrent requests stack
  their peak memory); a clean OOM returns 503 and the server keeps going; a
  fatal device fault makes the process exit so the compose restart policy
  brings up a fresh one, since a poisoned CUDA context is not recoverable
  in-process. Optional request bounds (max-token ceiling, prompt/context
  limits -> 413) and allocator tuning are documented in .env.example.
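
To judge whether a base model is "near-fitting" before loading it, a back-of-envelope weight estimate helps. The sketch below counts weights only (2 bytes per parameter for bf16, nbits/8 bytes for a quantized base). It ignores quantization metadata, activations, the KV cache, and the fact that the Orin's 64GB is shared with the CPU, so treat the result as a lower bound. The function name and the 31e9 parameter count are illustrative assumptions.

```python
def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_param / 8 / 2**30


for label, bits in (("bf16", 16), ("8-bit", 8), ("4-bit", 4)):
    # 31e9 is a nominal parameter count for a 30B-class model, not a measured one.
    print(f"{label:>5}: {weight_gib(31e9, bits):5.1f} GiB")
```

Under these assumptions a 30B-class base needs roughly 58 GiB for bf16 weights alone, which is consistent with the 30B-class example using QUANTIZE=hqq, and with 8-bit costing about twice the memory of 4-bit.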
To take the result to Ollama (or anything llama.cpp-based),
scripts/export_ollama.sh runs the whole trip CPU-only: bake the adapter into
a standalone model (finetune.merge), convert to GGUF, quantize (default
Q4_K_M), and ollama create — with the sharp edges handled and documented
in the script header (Ollama can't apply LoRAs at load time for newer
architectures, so merging is mandatory; an uncapped context can OOM the load):
MODEL=<base id> ADAPTER_DIR=adapters/full-bf16 scripts/export_ollama.sh

Each step skips itself when its output already exists, so the script is
re-runnable; use finetune.merge directly if you only want the merged HF
model.
data/ ships a synthetic, self-contained corpus: ~1000 generated CYOA rows
plus a small general-instruction slice (databricks-dolly-15k) mixed in as a
rehearsal defense against catastrophic forgetting. Rebuild it (deterministic —
fixed seeds give identical files) inside the container; build_instruct.py
needs datasets + Hugging Face access, which live in the image:
docker compose run --rm finetune python scripts/build_cyoa.py # -> data/cyoa.jsonl (~1000 CYOA rows)
docker compose run --rm finetune python scripts/build_instruct.py # -> data/instruct.jsonl (~180 rows; needs HF)
docker compose run --rm finetune python scripts/build_dataset.py # -> data/sample_data.jsonl (CYOA + ~15% instruct)

If build_instruct.py hasn't run (e.g. no Hugging Face access),
build_dataset.py writes a CYOA-only file and warns, so the repo stays
trainable.
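
The mixing step can be pictured as a seeded concatenate-and-shuffle with a fallback when the instruct slice is missing. This is a sketch of the idea, not the contents of scripts/build_dataset.py: the function name, the seed value, and the in-memory row lists are all assumptions.

```python
import random
import warnings
from typing import Optional


def mix_rows(cyoa: list, instruct: Optional[list], seed: int = 0) -> list:
    """Combine CYOA rows with an optional instruct slice and shuffle deterministically."""
    if not instruct:
        # Mirrors the documented fallback: warn and continue with CYOA only.
        warnings.warn("no instruct rows found; writing a CYOA-only dataset")
        instruct = []
    rows = list(cyoa) + list(instruct)
    # A private RNG keeps the result independent of global random state:
    # the same seed and inputs always give the same order.
    random.Random(seed).shuffle(rows)
    return rows
```

With ~1000 CYOA rows and ~180 instruct rows, the instruct share is 180 / 1180, or about 15% of the mixed file.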
Dockerfile # base image + pip pins (see docs/versions.md)
docker-compose.yml # finetune (training/one-offs) + serve (persistent server)
.env.example # every knob, documented
finetune/
config.py # env vars -> typed Config
data.py # load .jsonl -> chat-templated text
model.py # tokenizer + model with bf16 or HQQ branch
lora.py # LoRA adapter shape
train.py # entrypoint -- wires the above into SFTTrainer
infer.py # one completion, or batch eval via EVAL_PROMPTS
serve.py # OpenAI-compatible server for a saved adapter
merge.py # bake an adapter into a standalone model dir
tests/preflight.py # fast arg-drift check across transformers/peft/trl
data/cyoa.jsonl # ~1000 CYOA rows (generated by scripts/build_cyoa.py)
data/sample_data.jsonl # training file: cyoa.jsonl + instruct.jsonl, shuffled
data/eval_prompts.jsonl # 5 held-out CYOA prompts for before/after comparison
training_data/ # gitignored drop-in for your own data (only .gitkeep tracked)
scripts/build_cyoa.py # CYOA generator -> data/cyoa.jsonl
scripts/build_instruct.py # general-instruction slice -> data/instruct.jsonl
scripts/build_dataset.py # mix cyoa + instruct -> data/sample_data.jsonl
scripts/compare_evals.py # print N eval JSONL files side by side per prompt
scripts/export_ollama.sh # adapter -> merge -> GGUF -> quantize -> ollama create
scripts/prove_1p7b.sh # staged Qwen3-1.7B proof (free check -> train -> eval)
scripts/prove_gemma.sh # staged Gemma Phase 0 proof (prereq -> train -> eval)
scripts/gemma_check.py # cheap Gemma gate: token/license, template, target modules
inference/Modelfile # Ollama prompt-only setup (separate from training)
docs/versions.md # why each version is pinned the way it is
docs/gemma.md # staged Gemma bring-up plan (Phase 0 + HQQ nbits knob)
All env-driven; defaults in finetune/config.py. Serving knobs (SERVE_*)
are documented in .env.example.
| Var | Default | Values |
|---|---|---|
| MODE | smoke | smoke (10 steps, no save) / full (epochs) |
| QUANTIZE | bf16 | bf16 (no quant) / hqq (HQQ, bit-width via HQQ_NBITS) |
| HQQ_NBITS | 4 | 4 / 8; HQQ base bit-width (only used when QUANTIZE=hqq) |
| MODEL | per-mode default | any HF causal-LM repo |
| DATA | data/sample_data.jsonl | path to JSONL with messages rows |
| OUTPUT_DIR | adapters/<mode>-<quantize> | adapter save dir |
| LEARNING_RATE | 1e-4 | SFT learning rate |
| LORA_R | 16 | LoRA rank |
| LORA_ALPHA | 16 | LoRA alpha (alpha/r = 1.0) |
| LORA_DROPOUT | 0.05 | LoRA dropout |
| MAX_LENGTH | per-mode (smoke 512 / full 2048) | cap sequence length to trim peak memory |
| PER_DEVICE_TRAIN_BATCH_SIZE | 1 | per-forward micro-batch |
| GRADIENT_ACCUMULATION_STEPS | 8 (full) | effective batch = micro-batch × this |
| MAX_STEPS | (none) | cap a full run at N optimizer steps |
| SAVE_STEPS | per-epoch | checkpoint every N steps instead (long runs) |
| RESUME_FROM_CHECKPOINT | (none) | auto = latest checkpoint in OUTPUT_DIR, or a path |
| GRADIENT_CHECKPOINTING | on for hqq | 1/0 to force on/off (e.g. on for a big bf16) |
| HF_TOKEN | (none) | required for gated models (Gemma, Llama) |
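
The "env vars -> typed Config" idea can be sketched with a dataclass. This is not the repo's finetune/config.py: the class layout and the from_env name are assumptions, only a handful of knobs are covered, and the defaults are copied from the table above.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    """Hypothetical subset of the knobs, parsed once from an env mapping."""
    mode: str = "smoke"
    quantize: str = "bf16"
    hqq_nbits: int = 4
    lora_r: int = 16
    lora_alpha: int = 16
    learning_rate: float = 1e-4
    output_dir: str = ""

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "Config":
        mode = env.get("MODE", "smoke")
        quantize = env.get("QUANTIZE", "bf16")
        nbits = int(env.get("HQQ_NBITS", "4"))
        if mode not in ("smoke", "full"):
            raise ValueError(f"MODE must be smoke or full, got {mode!r}")
        if quantize not in ("bf16", "hqq"):
            raise ValueError(f"QUANTIZE must be bf16 or hqq, got {quantize!r}")
        if nbits not in (4, 8):
            raise ValueError(f"HQQ_NBITS must be 4 or 8, got {nbits}")
        # hqq runs encode the bit-width in the directory name, e.g. full-hqq4.
        tag = f"hqq{nbits}" if quantize == "hqq" else quantize
        return cls(
            mode=mode,
            quantize=quantize,
            hqq_nbits=nbits,
            lora_r=int(env.get("LORA_R", "16")),
            lora_alpha=int(env.get("LORA_ALPHA", "16")),
            learning_rate=float(env.get("LEARNING_RATE", "1e-4")),
            output_dir=env.get("OUTPUT_DIR", f"adapters/{mode}-{tag}"),
        )
```

Calling Config.from_env(os.environ) at startup would fail fast on a mistyped value instead of surfacing the mistake partway through a long run.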
- No bitsandbytes. No working sm_87 wheel exists on PyPI; the one pre-built
  dustynv/bitsandbytes image is too old for current transformers. HQQ fills
  the quantization role (4- or 8-bit via HQQ_NBITS) -- pure Python kernels,
  no sm_87 build chain.
- attn_implementation="eager". Flash-attn aarch64 wheels are unreliable and
  SDPA occasionally hits sm_87 kernel gaps. Eager is slower but works.
- r36.4 image on an r36.5 host is supported (verified on this hardware).
  NVIDIA has not published an r36.5 container line yet.
- Qwen3 thinking mode is disabled in both training and inference.
  enable_thinking=False is passed to apply_chat_template in finetune.data
  and finetune.infer. With pre-written assistant turns the template wouldn't
  synthesize <think> blocks at training time anyway, but inference is the hot
  path where Qwen3 otherwise prefixes its output with a reasoning trace the
  CYOA task has no use for.