Vortiago/Finetuning

Finetuning — LoRA pipeline for Jetson AGX Orin

A small, hardware-honest LoRA / QLoRA fine-tuning pipeline targeting Jetson AGX Orin 64GB on JetPack 6.2.2 / L4T r36.5.0 (CUDA 12.6, sm_87 / Ampere, aarch64).

Train a LoRA on a current model (Gemma 4, Qwen 3.x, ...), check it actually learned something, and serve the result — all on the Orin itself. See docs/versions.md for why each dependency is pinned the way it is.

What you need

  • Jetson AGX Orin 64GB on JetPack 6.x (verified on 6.2.2 / L4T r36.5.0)
  • Docker with the NVIDIA container runtime (standard on JetPack)
  • Disk for the HF model cache: tens of GB for small models, ~60GB+ for a 30B-class base

Quick start

# One-time: the compose file joins an external docker network (so a chat UI or
# other containers can reach the inference server by name) and binds host dirs
# for the HF cache + saved adapters (default ./.data/*, override in .env).
docker network create ai-network
mkdir -p .data/{hf-cache,adapters}
cp .env.example .env     # optional -- defaults work; set HF_TOKEN for gated models

# Build the image.
docker compose build finetune

# Pre-flight: ~10s, catches arg-name drift across transformers/peft/trl
# without touching the GPU or downloading any large model.
docker compose run --rm finetune python -m tests.preflight

# Smoke train (Qwen3-0.6B, 10 steps, a few minutes).
docker compose run --rm -e MODEL=Qwen/Qwen3-0.6B finetune python -m finetune.train

Successful smoke-test output ends with [done] adapter saved to adapters/smoke-bf16. That proves the chain (image -> CUDA -> torch -> transformers -> peft -> trl -> save) end-to-end on this hardware.

# Inference smoke test: load the saved adapter, generate one completion.
# QUANTIZE must match the training run (bf16 -> bf16, hqq -> hqq).
docker compose run --rm finetune python -m finetune.infer

That proves the save round-trips: adapter loads on top of the base, runs on GPU, and emits text. It does NOT prove the fine-tune learned anything — ten smoke steps won't move the model meaningfully.

Real runs

# bf16 LoRA on a small modern model (~E4B class). Fits on 64GB without quantization.
docker compose run --rm \
  -e MODE=full -e MODEL=google/gemma-4-E4B-it \
  finetune python -m finetune.train

# HQQ 4-bit LoRA for a large model (30B class). Requires HF_TOKEN for gated weights.
# HQQ_NBITS=8 switches the frozen base to near-lossless 8-bit (~2x the memory).
docker compose run --rm \
  -e MODE=full -e QUANTIZE=hqq -e MODEL=google/gemma-4-31B-it \
  finetune python -m finetune.train
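
The choice between bf16 and HQQ above comes down to how much memory the frozen base weights need. A back-of-the-envelope sketch follows. It counts weights only (no activations, optimizer state, LoRA parameters, KV cache, or quantization metadata), the helper name `weight_gib` is hypothetical, and the 31e9 parameter count is an assumption read off the model name, not a measured value.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the base weights in GiB (weights only)."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    n_params = 31e9  # assumption: nominal size taken from the model name
    for label, bits in [("bf16", 16), ("hqq 8-bit", 8), ("hqq 4-bit", 4)]:
        print(f"{label}: ~{weight_gib(n_params, bits):.1f} GiB")
```

Under these assumptions the bf16 weights alone come to roughly 58 GiB, which leaves almost no headroom on a 64GB board, while the 8-bit base is about twice the 4-bit one.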

Adapters land in the adapters volume under <mode>-<quantize>/ (the host directory is set by ADAPTERS_DIR in .env, default ./.data/adapters; hqq runs encode the bit-width, e.g. full-hqq4 / full-hqq8).
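
The naming rule above can be sketched in a few lines. This is an illustration only: `default_output_dir` is a hypothetical helper, and the real logic in finetune/config.py may differ.

```python
def default_output_dir(mode: str, quantize: str, hqq_nbits: int = 4) -> str:
    """Build adapters/<mode>-<quantize>, encoding the bit-width for hqq runs.

    Hypothetical helper; it mirrors the naming described in the README only.
    """
    suffix = f"hqq{hqq_nbits}" if quantize == "hqq" else quantize
    return f"adapters/{mode}-{suffix}"


if __name__ == "__main__":
    print(default_output_dir("smoke", "bf16"))
    print(default_output_dir("full", "hqq", 8))
```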

Long runs are stoppable and resumable: SAVE_STEPS=N checkpoints every N steps and RESUME_FROM_CHECKPOINT=auto continues from the latest checkpoint with optimizer/scheduler state intact (see Knobs).
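
As a rough picture of what "latest checkpoint" means, here is a minimal stdlib sketch. It assumes the Hugging Face Trainer convention of `checkpoint-<step>` subdirectories inside the output dir; `latest_checkpoint` is a hypothetical name, and how finetune.train actually resolves `auto` is not stated here.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are directories named checkpoint-<global step>.
_CKPT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint dir with the highest step, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    found = []
    for child in root.iterdir():
        match = _CKPT.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    if not found:
        return None
    # Compare step numbers numerically so checkpoint-1000 beats checkpoint-200.
    return max(found, key=lambda item: item[0])[1]
```

transformers ships an equivalent helper, `transformers.trainer_utils.get_last_checkpoint`, which is usually the better choice inside a training script.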

Targeting Gemma? The staged plan to get a LoRA onto google/gemma-4-31B-it and serve it in Q8 — plus scripts/prove_gemma.sh, a one-paste validation of the cheapest first step (bf16 on E4B) — is in docs/gemma.md.

Your own data

Training rows are JSONL with a messages list (user/assistant turns). training_data/ is a gitignored drop-in spot for your own files — only its .gitkeep is tracked, so nothing you put there gets committed:

docker compose run --rm \
  -e MODE=full -e MODEL=... \
  -e DATA=training_data/mycorpus.jsonl \
  finetune python -m finetune.train

Both DATA and OUTPUT_DIR are interpreted relative to the in-container repo root (/workspace, the .:/workspace mount in docker-compose.yml).

Without DATA set, training uses the committed sample corpus in data/ (see The sample corpus) — the repo is fully self-contained for development and testing.
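
For a custom corpus, a row in the format described above might be built and checked like this. It is a sketch under assumptions: the `role` / `content` keys follow the common chat-message convention and are not confirmed by this README, and `make_row`, `to_jsonl`, and `validate_row` are hypothetical helpers, not part of the repo.

```python
import json


def make_row(user: str, assistant: str) -> dict:
    """One training row: a single user turn followed by the assistant reply."""
    return {
        "messages": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


def to_jsonl(rows: list) -> str:
    """Serialize rows as JSONL: one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


def validate_row(line: str) -> dict:
    """Parse one JSONL line and check the assumed messages shape."""
    row = json.loads(line)
    messages = row.get("messages") if isinstance(row, dict) else None
    if not isinstance(messages, list) or not messages:
        raise ValueError("row needs a non-empty 'messages' list")
    for message in messages:
        if (
            not isinstance(message, dict)
            or not isinstance(message.get("role"), str)
            or not isinstance(message.get("content"), str)
        ):
            raise ValueError("each message needs string 'role' and 'content'")
    return row
```

Running every line of a new file through a check like `validate_row` before a long training run is cheap compared with discovering a malformed row partway through an epoch.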

Before / after: did the fine-tune actually teach anything?

The sample corpus trains a recognizable output format — choose-your-own-adventure scenes that end with "What does X do?" and three numbered choices — so you can tell at a glance whether training changed the model.

finetune.infer doubles as a side-by-side eval harness. With EVAL_PROMPTS set it iterates a JSONL of prompts and writes completions to OUTPUT_FILE. With BASELINE=1 it skips the adapter so you get the base model's answers on the same prompts with the same seed. Diff the two files.

# 1) Baseline -- before any training. Set MODEL because no adapter exists yet.
docker compose run --rm \
  -e BASELINE=1 -e MODEL=Qwen/Qwen3-0.6B \
  -e EVAL_PROMPTS=data/eval_prompts.jsonl \
  -e OUTPUT_FILE=data/eval_baseline.jsonl \
  finetune python -m finetune.infer

# 2) Train on the corpus.
docker compose run --rm \
  -e MODE=full -e MODEL=Qwen/Qwen3-0.6B \
  finetune python -m finetune.train

# 3) Adapter run -- same prompts, same seed. ADAPTER_DIR carries the base id.
docker compose run --rm \
  -e ADAPTER_DIR=adapters/full-bf16 \
  -e EVAL_PROMPTS=data/eval_prompts.jsonl \
  -e OUTPUT_FILE=data/eval_adapter.jsonl \
  finetune python -m finetune.infer

# 4) Eyeball the diff.
diff -u data/eval_baseline.jsonl data/eval_adapter.jsonl | less

The five prompts in data/eval_prompts.jsonl are intentionally distinct from any training row, so a hit on the CYOA shape is the adapter generalizing, not memorizing. To read more than two eval files at once, scripts/compare_evals.py FILE1.jsonl FILE2.jsonl ... prints them aligned per prompt (missing files are skipped).
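
Eyeballing works for five prompts; for more, a crude shape check can count hits automatically. The sketch below assumes the closing question reads "What does <name> do?" and that the choices are numbered `1.` / `2.` / `3.` (or `1)` style) on their own lines. The exact corpus formatting is not shown in this README, so treat the patterns and the `looks_like_cyoa` name as hypothetical.

```python
import re

# Assumed patterns; adjust them to the real corpus formatting.
_QUESTION = re.compile(r"What does .+? do\?")
_CHOICE = re.compile(r"^\s*([1-3])[.)]\s+\S", re.MULTILINE)


def looks_like_cyoa(text: str) -> bool:
    """True if the text ends with the question followed by choices 1, 2, 3."""
    questions = list(_QUESTION.finditer(text))
    if not questions:
        return False
    # Only look at what follows the last question in the completion.
    tail = text[questions[-1].end():]
    return _CHOICE.findall(tail) == ["1", "2", "3"]
```

Applied to the baseline and adapter completions, the hit rate gives a single number to compare, though it says nothing about the quality of the scene itself.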

Serve the result

finetune.serve is a minimal OpenAI-compatible server (stdlib only) that loads a saved adapter on top of its base. It uses the same loader as training, so anything you trained, you can serve. It streams over SSE when the client asks for it.

# .env: point SERVE_ADAPTER_DIR at your adapter (set it explicitly empty to
# serve the bare base model), then bring up the persistent service:
docker compose up -d serve

curl -s localhost:8000/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'

Point any OpenAI client at http://localhost:8000/v1 — or, from a container on the same ai-network (e.g. open-webui), at http://ft-serve:8000/v1. Sampling params in the request (temperature, top_p/top_k/min_p, repetition/frequency penalty) are passed through to generate.
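
The same request can be made from Python with only the standard library. This is a sketch, not the project's client: `build_chat_request` and `chat` are hypothetical helpers, and reading the reply from `choices[0].message.content` assumes the server follows the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(
    base_url: str, prompt: str, temperature: Optional[float] = None
) -> urllib.request.Request:
    """Build a POST to <base_url>/chat/completions with one user message."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    if temperature is not None:
        payload["temperature"] = temperature  # passed through to generate
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str, temperature: Optional[float] = None) -> str:
    """Send the request and return the assistant text (assumed response shape)."""
    request = build_chat_request(base_url, prompt, temperature)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("http://localhost:8000/v1", "hello", temperature=0.7))
```

The generous timeout is deliberate: generations are served one at a time, so a request may wait behind another before it starts decoding.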

Two things make it livable on a memory-tight box:

  • Memory fit. A near-fitting bf16 model silently offloads a few GB to CPU and decodes 10–25x slower; set SERVE_GPU_MAX_MEMORY (e.g. 56GiB) to force it fully on-GPU.
  • Fault handling. Generations are serialized, because concurrent requests would stack their peak memory. A clean OOM returns 503 and the server keeps going. A fatal device fault makes the process exit so the compose restart policy brings up a fresh one, since a poisoned CUDA context is not recoverable in-process. Optional request bounds (max-token ceiling, prompt/context limits -> 413) and allocator tuning are documented in .env.example.
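
On the client side, the status codes above suggest a simple policy: retry a 503 after a pause, and give up immediately on a 413, since resending the same oversized request cannot succeed. The sketch below is one way to do that with the standard library. `call_with_retry` is a hypothetical helper and the backoff values are arbitrary assumptions.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    send: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(); on HTTP 503 wait and retry, on any other error re-raise."""
    for attempt in range(attempts):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Only 503 (clean OOM) is worth retrying; 413 and others are final.
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ValueError("attempts must be >= 1")
```

A process restart after a fatal device fault would surface as a connection error (`urllib.error.URLError`) rather than an HTTP status, which this sketch deliberately leaves to the caller.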

To take the result to Ollama (or anything llama.cpp-based), scripts/export_ollama.sh runs the whole trip CPU-only: bake the adapter into a standalone model (finetune.merge), convert to GGUF, quantize (default Q4_K_M), and ollama create — with the sharp edges handled and documented in the script header (Ollama can't apply LoRAs at load time for newer architectures, so merging is mandatory; an uncapped context can OOM the load):

MODEL=<base id> ADAPTER_DIR=adapters/full-bf16 scripts/export_ollama.sh

Each step skips itself when its output already exists, so the script is re-runnable; use finetune.merge directly if you only want the merged HF model.

The sample corpus

data/ ships a synthetic, self-contained corpus: ~1000 generated CYOA rows plus a small general-instruction slice (databricks-dolly-15k) mixed in as a rehearsal defense against catastrophic forgetting. Rebuild it (deterministic — fixed seeds give identical files) inside the container; build_instruct.py needs datasets + Hugging Face access, which live in the image:

docker compose run --rm finetune python scripts/build_cyoa.py      # -> data/cyoa.jsonl       (~1000 CYOA rows)
docker compose run --rm finetune python scripts/build_instruct.py  # -> data/instruct.jsonl   (~180 rows; needs HF)
docker compose run --rm finetune python scripts/build_dataset.py   # -> data/sample_data.jsonl (CYOA + ~15% instruct)

If build_instruct.py hasn't run (e.g. no Hugging Face access), build_dataset.py writes a CYOA-only file and warns, so the repo stays trainable.
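
The mixing step can be pictured with a short sketch. It is not the code in scripts/build_dataset.py: `mix_rows` is a hypothetical function, the seed is arbitrary, and it assumes "~15% instruct" means the instruct rows' share of the final file.

```python
import random


def mix_rows(cyoa: list, instruct: list, instruct_fraction: float = 0.15, seed: int = 0) -> list:
    """Return a deterministic shuffled mix of CYOA rows plus an instruct slice."""
    if not 0 <= instruct_fraction < 1:
        raise ValueError("instruct_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed -> identical output on every run
    # Solve n_instruct / (n_cyoa + n_instruct) = f for n_instruct.
    wanted = round(instruct_fraction * len(cyoa) / (1 - instruct_fraction))
    picked = rng.sample(instruct, min(wanted, len(instruct)))
    mixed = list(cyoa) + picked
    rng.shuffle(mixed)
    return mixed
```

With 1000 CYOA rows and a 15% target this asks for 176 instruct rows, in line with the ~180-row slice mentioned above. An empty instruct list yields a CYOA-only result, the same fallback the real script is described as having.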

Layout

Dockerfile             # base image + pip pins (see docs/versions.md)
docker-compose.yml     # finetune (training/one-offs) + serve (persistent server)
.env.example           # every knob, documented
finetune/
  config.py            # env vars -> typed Config
  data.py              # load .jsonl -> chat-templated text
  model.py             # tokenizer + model with bf16 or HQQ branch
  lora.py              # LoRA adapter shape
  train.py             # entrypoint -- wires the above into SFTTrainer
  infer.py             # one completion, or batch eval via EVAL_PROMPTS
  serve.py             # OpenAI-compatible server for a saved adapter
  merge.py             # bake an adapter into a standalone model dir
tests/preflight.py     # fast arg-drift check across transformers/peft/trl
data/cyoa.jsonl        # ~1000 CYOA rows (generated by scripts/build_cyoa.py)
data/sample_data.jsonl # training file: cyoa.jsonl + instruct.jsonl, shuffled
data/eval_prompts.jsonl # 5 held-out CYOA prompts for before/after comparison
training_data/         # gitignored drop-in for your own data (only .gitkeep tracked)
scripts/build_cyoa.py  # CYOA generator -> data/cyoa.jsonl
scripts/build_instruct.py # general-instruction slice -> data/instruct.jsonl
scripts/build_dataset.py  # mix cyoa + instruct -> data/sample_data.jsonl
scripts/compare_evals.py  # print N eval JSONL files side by side per prompt
scripts/export_ollama.sh  # adapter -> merge -> GGUF -> quantize -> ollama create
scripts/prove_1p7b.sh     # staged Qwen3-1.7B proof (free check -> train -> eval)
scripts/prove_gemma.sh    # staged Gemma Phase 0 proof (prereq -> train -> eval)
scripts/gemma_check.py    # cheap Gemma gate: token/license, template, target modules
inference/Modelfile    # Ollama prompt-only setup (separate from training)
docs/versions.md       # why each version is pinned the way it is
docs/gemma.md          # staged Gemma bring-up plan (Phase 0 + HQQ nbits knob)

Knobs

All env-driven; defaults in finetune/config.py. Serving knobs (SERVE_*) are documented in .env.example.

| Var | Default | Values |
| --- | --- | --- |
| MODE | smoke | smoke (10 steps, no save) / full (epochs) |
| QUANTIZE | bf16 | bf16 (no quant) / hqq (HQQ, bit-width via HQQ_NBITS) |
| HQQ_NBITS | 4 | 4 / 8: HQQ base bit-width (only used when QUANTIZE=hqq) |
| MODEL | per-mode default | any HF causal-LM repo |
| DATA | data/sample_data.jsonl | path to JSONL with messages rows |
| OUTPUT_DIR | adapters/<mode>-<quantize> | adapter save dir |
| LEARNING_RATE | 1e-4 | SFT learning rate |
| LORA_R | 16 | LoRA rank |
| LORA_ALPHA | 16 | LoRA alpha (alpha/r = 1.0) |
| LORA_DROPOUT | 0.05 | LoRA dropout |
| MAX_LENGTH | per-mode (smoke 512 / full 2048) | cap sequence length to trim peak memory |
| PER_DEVICE_TRAIN_BATCH_SIZE | 1 | per-forward micro-batch |
| GRADIENT_ACCUMULATION_STEPS | 8 (full) | effective batch = micro-batch × this |
| MAX_STEPS | | cap a full run at N optimizer steps |
| SAVE_STEPS | per-epoch | checkpoint every N steps instead (long runs) |
| RESUME_FROM_CHECKPOINT | | auto = latest checkpoint in OUTPUT_DIR, or a path |
| GRADIENT_CHECKPOINTING | on for hqq | 1/0 to force on/off (e.g. on for a big bf16) |
| HF_TOKEN | | required for gated models (Gemma, Llama) |
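
To make the "env vars -> typed Config" idea and the derived quantities concrete, here is a minimal sketch covering a handful of the knobs. It is not the contents of finetune/config.py: `TrainConfig` and `load_config` are hypothetical names, and applying the full-mode accumulation default of 8 to smoke runs is an assumption, since only the full-mode value is documented.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TrainConfig:
    mode: str
    max_length: int
    micro_batch: int
    grad_accum: int
    lora_r: int
    lora_alpha: int

    @property
    def effective_batch(self) -> int:
        # effective batch = micro-batch x gradient accumulation steps
        return self.micro_batch * self.grad_accum

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling factor alpha / r (1.0 with the defaults).
        return self.lora_alpha / self.lora_r


def load_config(env: Optional[Mapping[str, str]] = None) -> TrainConfig:
    """Read a subset of the knobs from env vars, applying per-mode defaults."""
    env = os.environ if env is None else env
    mode = env.get("MODE", "smoke")
    if mode not in ("smoke", "full"):
        raise ValueError(f"MODE must be smoke or full, got {mode!r}")
    return TrainConfig(
        mode=mode,
        max_length=int(env.get("MAX_LENGTH", 512 if mode == "smoke" else 2048)),
        micro_batch=int(env.get("PER_DEVICE_TRAIN_BATCH_SIZE", 1)),
        # Assumption: reuse the documented full-mode default for smoke runs too.
        grad_accum=int(env.get("GRADIENT_ACCUMULATION_STEPS", 8)),
        lora_r=int(env.get("LORA_R", 16)),
        lora_alpha=int(env.get("LORA_ALPHA", 16)),
    )
```

Passing a plain dict instead of the process environment keeps a loader like this easy to exercise in a pre-flight check without touching real settings.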

Notes

  • No bitsandbytes. No working sm_87 wheel exists on PyPI; the one pre-built dustynv/bitsandbytes image is too old for current transformers. HQQ fills the quantization role (4- or 8-bit via HQQ_NBITS) -- pure Python kernels, no sm_87 build chain.
  • attn_implementation="eager". Flash-attn aarch64 wheels are unreliable and SDPA occasionally hits sm_87 kernel gaps. Eager is slower but works.
  • r36.4 image on an r36.5 host is supported (verified on this hardware). NVIDIA has not published an r36.5 container line yet.
  • Qwen3 thinking mode is disabled in both training and inference. enable_thinking=False is passed to apply_chat_template in finetune.data and finetune.infer. With pre-written assistant turns the template wouldn't synthesize <think> blocks at training time anyway, but inference is the hot path where Qwen3 otherwise prefixes its output with a reasoning trace the CYOA task has no use for.

About

Repo containing my project to learn how to fine-tune a model on the NVIDIA Jetson AGX Orin 64GB.
