Finetuning — LoRA pipeline for Jetson AGX Orin

A small, hardware-honest LoRA / QLoRA fine-tuning pipeline targeting Jetson AGX Orin 64GB on JetPack 6.2.2 / L4T r36.5.0 (CUDA 12.6, sm_87 / Ampere, aarch64).

Train a LoRA on a current model (Gemma 4, Qwen 3.x, ...), check it actually learned something, and serve the result — all on the Orin itself. See docs/versions.md for why each dependency is pinned the way it is.

What you need

Jetson AGX Orin 64GB on JetPack 6.x (verified on 6.2.2 / L4T r36.5.0)
Docker with the NVIDIA container runtime (standard on JetPack)
Disk for the HF model cache: tens of GB for small models, ~60GB+ for a 30B-class base

Quick start

# One-time: the compose file joins an external docker network (so a chat UI or
# other containers can reach the inference server by name) and binds host dirs
# for the HF cache + saved adapters (default ./.data/*, override in .env).
docker network create ai-network
mkdir -p .data/{hf-cache,adapters}
cp .env.example .env     # optional -- defaults work; set HF_TOKEN for gated models

# Build the image.
docker compose build finetune

# Pre-flight: ~10s, catches arg-name drift across transformers/peft/trl
# without touching the GPU or downloading any large model.
docker compose run --rm finetune python -m tests.preflight

# Smoke train (Qwen3-0.6B, 10 steps, a few minutes).
docker compose run --rm -e MODEL=Qwen/Qwen3-0.6B finetune python -m finetune.train

Successful smoke-test output ends with [done] adapter saved to adapters/smoke-bf16. That proves the chain (image -> CUDA -> torch -> transformers -> peft -> trl -> save) end-to-end on this hardware.

# Inference smoke test: load the saved adapter, generate one completion.
# QUANTIZE must match the training run (bf16 -> bf16, hqq -> hqq).
docker compose run --rm finetune python -m finetune.infer

That proves the save round-trips: adapter loads on top of the base, runs on GPU, and emits text. It does NOT prove the fine-tune learned anything — ten smoke steps won't move the model meaningfully.

Real runs

# bf16 LoRA on a small modern model (~E4B class). Fits on 64GB without quantization.
docker compose run --rm \
  -e MODE=full -e MODEL=google/gemma-4-E4B-it \
  finetune python -m finetune.train

# HQQ 4-bit LoRA for a large model (30B class). Requires HF_TOKEN for gated weights.
# HQQ_NBITS=8 switches the frozen base to near-lossless 8-bit (~2x the memory).
docker compose run --rm \
  -e MODE=full -e QUANTIZE=hqq -e MODEL=google/gemma-4-31B-it \
  finetune python -m finetune.train
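
As a rough sanity check on why the 30B-class run needs quantization on a 64GB board, the weight footprint of the frozen base can be estimated as parameter count times bits per weight. The sketch below assumes a round 31 billion parameters (read off the model name, not measured) and counts weights only. It ignores quantization metadata, activations, the LoRA weights, and optimizer state, so treat the numbers as lower bounds.

```python
# Back-of-envelope weight footprint for the frozen base model.
# Assumption: 31e9 parameters, taken from the "31B" in the model id.

def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Weights only, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 31e9  # hypothetical round figure, not a measured count

for label, bits in [("bf16", 16), ("hqq 8-bit", 8), ("hqq 4-bit", 4)]:
    print(f"{label:>10}: ~{weight_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions bf16 weights alone come to about 62 GB, 8-bit to about 31 GB, and 4-bit to about 15.5 GB, which is consistent with the "~2x the memory" note for HQQ_NBITS=8.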

Adapters land in the adapters volume under <mode>-<quantize>/ (the host directory is set by ADAPTERS_DIR in .env, default ./.data/adapters; hqq runs encode the bit-width, e.g. full-hqq4 / full-hqq8).

Long runs are stoppable and resumable: SAVE_STEPS=N checkpoints every N steps and RESUME_FROM_CHECKPOINT=auto continues from the latest checkpoint with optimizer/scheduler state intact (see Knobs).
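
To illustrate what RESUME_FROM_CHECKPOINT=auto has to resolve: the Hugging Face Trainer writes checkpoints as checkpoint-<step> directories inside the output dir, so "latest" means the highest step number, compared numerically. The helper below is a stdlib sketch of that lookup, not the code finetune.train uses; latest_checkpoint is a hypothetical name.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

_CKPT = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint-<step> subdir with the highest step, or None."""
    if not output_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for child in output_dir.iterdir():
        m = _CKPT.match(child.name)
        if m and child.is_dir() and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), child
    return best_path

# Demo on a throwaway directory: steps compare as integers, so 1000 beats 900.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "full-bf16"
    for step in (100, 900, 1000):
        (out / f"checkpoint-{step}").mkdir(parents=True)
    print(latest_checkpoint(out).name)  # checkpoint-1000
```

transformers also ships get_last_checkpoint in transformers.trainer_utils for the same purpose; whether this repo calls it is not stated here.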

Targeting Gemma? The staged plan to get a LoRA onto google/gemma-4-31B-it and serve it in Q8 — plus scripts/prove_gemma.sh, a one-paste validation of the cheapest first step (bf16 on E4B) — is in docs/gemma.md.

Your own data

Training rows are JSONL with a messages list (user/assistant turns). training_data/ is a gitignored drop-in spot for your own files — only its .gitkeep is tracked, so nothing you put there gets committed:

docker compose run --rm \
  -e MODE=full -e MODEL=... \
  -e DATA=training_data/mycorpus.jsonl \
  finetune python -m finetune.train

Both DATA and OUTPUT_DIR are interpreted relative to the in-container repo root (/workspace, the .:/workspace mount in docker-compose.yml).

Without DATA set, training uses the committed sample corpus in data/ (see The sample corpus) — the repo is fully self-contained for development and testing.
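
A quick way to catch malformed rows before spending GPU time is to validate the file up front. The checker below is a sketch built only on what this README states (one JSON object per line, with a messages list of user/assistant turns). The role and content key names follow the common chat-messages convention, and both the acceptance of a system role and the "last turn must be the assistant" rule are assumptions, so check finetune/data.py for what the loader really requires.

```python
import json
import tempfile
from pathlib import Path

ALLOWED_ROLES = {"system", "user", "assistant"}  # "system" is an assumption

def validate_rows(path: Path) -> list[str]:
    """Return human-readable problems; an empty list means the file looks OK."""
    problems = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: not valid JSON ({exc.msg})")
                continue
            msgs = row.get("messages") if isinstance(row, dict) else None
            if not isinstance(msgs, list) or not msgs:
                problems.append(f"line {lineno}: missing or empty 'messages' list")
                continue
            for i, m in enumerate(msgs):
                if not isinstance(m, dict) or m.get("role") not in ALLOWED_ROLES:
                    problems.append(f"line {lineno}: message {i} has a bad role")
                elif not isinstance(m.get("content"), str):
                    problems.append(f"line {lineno}: message {i} has no text content")
            # Assumption: a supervised row should end on the assistant's reply.
            if isinstance(msgs[-1], dict) and msgs[-1].get("role") != "assistant":
                problems.append(f"line {lineno}: last turn is not from the assistant")
    return problems

# Demo: one well-formed row and one row with an empty messages list.
good = {"messages": [{"role": "user", "content": "Open the door."},
                     {"role": "assistant", "content": "The door creaks open."}]}
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "mycorpus.jsonl"
    sample.write_text(json.dumps(good) + "\n" + '{"messages": []}\n', encoding="utf-8")
    for problem in validate_rows(sample):
        print(problem)
```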

Before / after: did the fine-tune actually teach anything?

The sample corpus trains a recognizable output format: choose-your-own-adventure (CYOA) scenes that end with "What does X do?" and three numbered choices. That makes it easy to tell at a glance whether training changed the model.

finetune.infer doubles as a side-by-side eval harness. With EVAL_PROMPTS set it iterates a JSONL of prompts and writes completions to OUTPUT_FILE. With BASELINE=1 it skips the adapter so you get the base model's answers on the same prompts with the same seed. Diff the two files.

# 1) Baseline -- before any training. Set MODEL because no adapter exists yet.
docker compose run --rm \
  -e BASELINE=1 -e MODEL=Qwen/Qwen3-0.6B \
  -e EVAL_PROMPTS=data/eval_prompts.jsonl \
  -e OUTPUT_FILE=data/eval_baseline.jsonl \
  finetune python -m finetune.infer

# 2) Train on the corpus.
docker compose run --rm \
  -e MODE=full -e MODEL=Qwen/Qwen3-0.6B \
  finetune python -m finetune.train

# 3) Adapter run -- same prompts, same seed. No MODEL needed here: the adapter
#    in ADAPTER_DIR carries the base model id.
docker compose run --rm \
  -e ADAPTER_DIR=adapters/full-bf16 \
  -e EVAL_PROMPTS=data/eval_prompts.jsonl \
  -e OUTPUT_FILE=data/eval_adapter.jsonl \
  finetune python -m finetune.infer

# 4) Eyeball the diff.
diff -u data/eval_baseline.jsonl data/eval_adapter.jsonl | less

The five prompts in data/eval_prompts.jsonl are intentionally distinct from any training row, so a hit on the CYOA shape is the adapter generalizing, not memorizing. To read more than two eval files at once, scripts/compare_evals.py FILE1.jsonl FILE2.jsonl ... prints them aligned per prompt (missing files are skipped).
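
For more than a handful of prompts, a mechanical check can complement eyeballing the diff. The sketch below counts how many completions show the format the sample corpus trains (a "What does X do?" question followed by three numbered choices). The exact choice layout ("1." or "1)" at the start of a line) and the name of the field holding the generated text in the output JSONL (completion here) are assumptions; adjust both to what finetune.infer actually writes.

```python
import json
import re

QUESTION = re.compile(r"What does .+? do\?")
# Assumption: each choice starts its own line as "1.", "2)", and so on.
CHOICE = re.compile(r"^\s*([1-3])[.)]\s+\S", re.MULTILINE)

def looks_like_cyoa(text: str) -> bool:
    """True if the text has the question, then choices 1, 2, 3 in order."""
    q = QUESTION.search(text)
    if not q:
        return False
    numbers = [m.group(1) for m in CHOICE.finditer(text, q.end())]
    return numbers == ["1", "2", "3"]

def cyoa_rate(jsonl_lines, field="completion") -> float:
    """Fraction of rows whose `field` text matches the CYOA shape."""
    rows = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not rows:
        return 0.0
    return sum(looks_like_cyoa(r.get(field, "")) for r in rows) / len(rows)

# Demo with in-memory rows standing in for the two eval files.
baseline = [json.dumps({"completion": "Mara could run, or maybe hide."})]
adapter = [json.dumps(
    {"completion": "The torch gutters.\nWhat does Mara do?\n1. Run\n2. Hide\n3. Fight"}
)]
print(cyoa_rate(baseline), cyoa_rate(adapter))  # 0.0 1.0
```

An open file handle works as jsonl_lines too, so the same function can be pointed at data/eval_baseline.jsonl and data/eval_adapter.jsonl once the field name is confirmed.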

Serve the result

finetune.serve is a minimal OpenAI-compatible server (stdlib only) that loads a saved adapter on top of its base — the same loader as training, so anything you trained, you can serve. Streams over SSE when the client asks for it.

# .env: point SERVE_ADAPTER_DIR at your adapter (set it explicitly empty to
# serve the bare base model), then bring up the persistent service:
docker compose up -d serve

curl -s localhost:8000/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'

Point any OpenAI client at http://localhost:8000/v1 — or, from a container on the same ai-network (e.g. open-webui), at http://ft-serve:8000/v1. Sampling params in the request (temperature, top_p/top_k/min_p, repetition/frequency penalty) are passed through to generate.
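
For a client without third-party dependencies, the request and response handling can be sketched with the standard library. This assumes the server follows the usual OpenAI chat-completions shapes (choices[0].message.content for a full reply, data: lines carrying choices[0].delta.content and a final data: [DONE] when streaming); the README only says "OpenAI-compatible", so verify against finetune/serve.py. All function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"

def build_chat_request(prompt, stream=False, **sampling):
    """Build (but do not send) a POST to /chat/completions."""
    payload = {"messages": [{"role": "user", "content": prompt}],
               "stream": stream, **sampling}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_text(response: dict) -> str:
    """Pull the reply out of a non-streaming OpenAI-style response body."""
    return response["choices"][0]["message"]["content"]

def iter_sse_deltas(lines):
    """Yield text pieces from OpenAI-style SSE lines ('data: {...}')."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]

def chat(prompt, **sampling):
    """Send one non-streaming request; needs the serve container running."""
    with urllib.request.urlopen(build_chat_request(prompt, **sampling)) as resp:
        return extract_text(json.load(resp))
```

With the server up, chat("hello", temperature=0.7, top_p=0.9) would return the reply text. For streaming, pass stream=True to build_chat_request and feed the open response object to iter_sse_deltas.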

Two things make it livable on a memory-tight box:

Memory fit. A near-fitting bf16 model silently offloads a few GB to CPU and decodes 10–25x slower; set SERVE_GPU_MAX_MEMORY (e.g. 56GiB) to force it fully on-GPU.
Fault handling. Generations are serialized, because concurrent requests would stack their peak memory. A clean OOM returns 503 and the server keeps going. A fatal device fault makes the process exit so the compose restart policy brings up a fresh one, since a poisoned CUDA context is not recoverable in-process. Optional request bounds (max-token ceiling, prompt/context limits -> 413) and allocator tuning are documented in .env.example.

To take the result to Ollama (or anything llama.cpp-based), scripts/export_ollama.sh runs the whole trip CPU-only: bake the adapter into a standalone model (finetune.merge), convert to GGUF, quantize (default Q4_K_M), and ollama create. The sharp edges are handled and documented in the script header: Ollama can't apply LoRAs at load time for newer architectures, so merging is mandatory, and an uncapped context can OOM the load. Run it with the base id and the adapter directory:

MODEL=<base id> ADAPTER_DIR=adapters/full-bf16 scripts/export_ollama.sh

Each step skips itself when its output already exists, so the script is re-runnable; use finetune.merge directly if you only want the merged HF model.

The sample corpus

data/ ships a synthetic, self-contained corpus: ~1000 generated CYOA rows plus a small general-instruction slice (databricks-dolly-15k) mixed in as a rehearsal defense against catastrophic forgetting. Rebuild it (deterministic — fixed seeds give identical files) inside the container; build_instruct.py needs datasets + Hugging Face access, which live in the image:

docker compose run --rm finetune python scripts/build_cyoa.py      # -> data/cyoa.jsonl       (~1000 CYOA rows)
docker compose run --rm finetune python scripts/build_instruct.py  # -> data/instruct.jsonl   (~180 rows; needs HF)
docker compose run --rm finetune python scripts/build_dataset.py   # -> data/sample_data.jsonl (CYOA + ~15% instruct)

If build_instruct.py hasn't run (e.g. no Hugging Face access), build_dataset.py writes a CYOA-only file and warns, so the repo stays trainable.
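
The mixing step and its determinism can be illustrated with a seeded shuffle. The sketch below is not scripts/build_dataset.py; mix, the seed value, and the placeholder rows are hypothetical, and the row counts are the approximate figures quoted above. It reads the "~15%" as the instruct share of the mixed file.

```python
import random

def mix(cyoa, instruct, seed=0):
    """Concatenate both slices and shuffle with a private, seeded RNG."""
    rows = list(cyoa) + list(instruct)
    random.Random(seed).shuffle(rows)  # does not touch the global RNG state
    return rows

cyoa = [f"cyoa-{i}" for i in range(1000)]         # placeholder rows
instruct = [f"instruct-{i}" for i in range(180)]  # placeholder rows

a, b = mix(cyoa, instruct), mix(cyoa, instruct)
share = len(instruct) / len(a)
print(a == b, f"{share:.1%}")  # True 15.3%

# With no instruct slice available the result is simply CYOA-only.
print(len(mix(cyoa, [])))  # 1000
```

Using a dedicated random.Random instance rather than the module-level functions keeps the output independent of any other code that draws random numbers in the same process.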

Layout

Dockerfile             # base image + pip pins (see docs/versions.md)
docker-compose.yml     # finetune (training/one-offs) + serve (persistent server)
.env.example           # every knob, documented
finetune/
  config.py            # env vars -> typed Config
  data.py              # load .jsonl -> chat-templated text
  model.py             # tokenizer + model with bf16 or HQQ branch
  lora.py              # LoRA adapter shape
  train.py             # entrypoint -- wires the above into SFTTrainer
  infer.py             # one completion, or batch eval via EVAL_PROMPTS
  serve.py             # OpenAI-compatible server for a saved adapter
  merge.py             # bake an adapter into a standalone model dir
tests/preflight.py     # fast arg-drift check across transformers/peft/trl
data/cyoa.jsonl        # ~1000 CYOA rows (generated by scripts/build_cyoa.py)
data/sample_data.jsonl # training file: cyoa.jsonl + instruct.jsonl, shuffled
data/eval_prompts.jsonl # 5 held-out CYOA prompts for before/after comparison
training_data/         # gitignored drop-in for your own data (only .gitkeep tracked)
scripts/build_cyoa.py  # CYOA generator -> data/cyoa.jsonl
scripts/build_instruct.py # general-instruction slice -> data/instruct.jsonl
scripts/build_dataset.py  # mix cyoa + instruct -> data/sample_data.jsonl
scripts/compare_evals.py  # print N eval JSONL files side by side per prompt
scripts/export_ollama.sh  # adapter -> merge -> GGUF -> quantize -> ollama create
scripts/prove_1p7b.sh     # staged Qwen3-1.7B proof (free check -> train -> eval)
scripts/prove_gemma.sh    # staged Gemma Phase 0 proof (prereq -> train -> eval)
scripts/gemma_check.py    # cheap Gemma gate: token/license, template, target modules
inference/Modelfile    # Ollama prompt-only setup (separate from training)
docs/versions.md       # why each version is pinned the way it is
docs/gemma.md          # staged Gemma bring-up plan (Phase 0 + HQQ nbits knob)

Knobs

All env-driven; defaults in finetune/config.py. Serving knobs (SERVE_*) are documented in .env.example.

Var	Default	Values
`MODE`	`smoke`	`smoke` (10 steps, no mid-run checkpoints) / `full` (epochs)
`QUANTIZE`	`bf16`	`bf16` (no quant) / `hqq` (HQQ, bit-width via `HQQ_NBITS`)
`HQQ_NBITS`	`4`	`4` / `8` — HQQ base bit-width (only used when `QUANTIZE=hqq`)
`MODEL`	per-mode default	any HF causal-LM repo
`DATA`	`data/sample_data.jsonl`	path to JSONL with `messages` rows
`OUTPUT_DIR`	`adapters/<mode>-<quantize>`	adapter save dir
`LEARNING_RATE`	`1e-4`	SFT learning rate
`LORA_R`	`16`	LoRA rank
`LORA_ALPHA`	`16`	LoRA alpha (alpha/r = 1.0)
`LORA_DROPOUT`	`0.05`	LoRA dropout
`MAX_LENGTH`	per-mode (smoke 512 / full 2048)	cap sequence length to trim peak memory
`PER_DEVICE_TRAIN_BATCH_SIZE`	`1`	per-forward micro-batch
`GRADIENT_ACCUMULATION_STEPS`	`8` (full)	effective batch = micro-batch × this
`MAX_STEPS`	—	cap a `full` run at N optimizer steps
`SAVE_STEPS`	per-epoch	checkpoint every N steps instead (long runs)
`RESUME_FROM_CHECKPOINT`	—	`auto` = latest checkpoint in `OUTPUT_DIR`, or a path
`GRADIENT_CHECKPOINTING`	on for `hqq`	`1`/`0` to force on/off (e.g. on for a big bf16)
`HF_TOKEN`	—	required for gated models (Gemma, Llama)
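
To make the derived quantities in the table concrete, here is a minimal sketch of an env-to-typed-config mapping with a few of the knobs. It is not finetune/config.py: the Config fields and load_config are hypothetical, the defaults are copied from the table (the table gives the GRADIENT_ACCUMULATION_STEPS default of 8 for full mode only), and an explicit OUTPUT_DIR override is left out for brevity.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    mode: str
    quantize: str
    hqq_nbits: int
    lora_r: int
    lora_alpha: int
    micro_batch: int
    grad_accum: int

    @property
    def effective_batch(self) -> int:
        # effective batch = micro-batch x gradient accumulation steps
        return self.micro_batch * self.grad_accum

    @property
    def lora_scale(self) -> float:
        # standard LoRA scales its update by alpha / r
        return self.lora_alpha / self.lora_r

    @property
    def output_dir(self) -> str:
        # hqq runs encode the bit-width, e.g. full-hqq4 / full-hqq8
        tag = f"hqq{self.hqq_nbits}" if self.quantize == "hqq" else self.quantize
        return f"adapters/{self.mode}-{tag}"

def load_config(env=os.environ) -> Config:
    """Read knobs from a mapping of env vars, falling back to table defaults."""
    return Config(
        mode=env.get("MODE", "smoke"),
        quantize=env.get("QUANTIZE", "bf16"),
        hqq_nbits=int(env.get("HQQ_NBITS", "4")),
        lora_r=int(env.get("LORA_R", "16")),
        lora_alpha=int(env.get("LORA_ALPHA", "16")),
        micro_batch=int(env.get("PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
        grad_accum=int(env.get("GRADIENT_ACCUMULATION_STEPS", "8")),
    )

cfg = load_config({"MODE": "full", "QUANTIZE": "hqq"})
print(cfg.output_dir, cfg.effective_batch, cfg.lora_scale)  # adapters/full-hqq4 8 1.0
```

Passing a plain dict instead of os.environ keeps the mapping easy to test without touching the real environment.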

Notes

No bitsandbytes. No working sm_87 wheel exists on PyPI; the one pre-built dustynv/bitsandbytes image is too old for current transformers. HQQ fills the quantization role (4- or 8-bit via HQQ_NBITS) -- pure Python kernels, no sm_87 build chain.
attn_implementation="eager". Flash-attn aarch64 wheels are unreliable and SDPA occasionally hits sm_87 kernel gaps. Eager is slower but works.
r36.4 image on an r36.5 host is supported (verified on this hardware). NVIDIA has not published an r36.5 container line yet.
Qwen3 thinking mode is disabled in both training and inference. enable_thinking=False is passed to apply_chat_template in finetune.data and finetune.infer. With pre-written assistant turns the template wouldn't synthesize <think> blocks at training time anyway, but inference is the hot path where Qwen3 otherwise prefixes its output with a reasoning trace the CYOA task has no use for.
