05: Ship a Fine-Tune That Earns Its Keep (Week 6)

Mission

LoRA fine-tune a small open model (1–8B) on a free GPU, on a task where a tuned small model plausibly beats prompting: domain-specific extraction or classification, your exact output format, or a style/voice transform. It must beat its own base model on your eval, get honestly compared to a frontier model on cost/quality/latency, and ship to the Hugging Face Hub with a real model card. Then quantize it and run it locally.

Why this rung

Two reasons. First, judgment: "should we fine-tune?" is a question you'll be asked for the rest of your career. People who have actually done it (built the dataset, watched the loss curve, seen the eval move or not move) answer it credibly; people who haven't recite blog posts. Second, the full pass (data → train → eval → quantize → publish → run locally) demystifies the open-model stack in one week: after this, the Hub stops being a website and becomes infrastructure you use.

Pick the task because prompting struggles or overpays, not despite it: you need exact adherence to a quirky format; the task is narrow and called at volume where frontier pricing stings; or the knowledge is in your 500 examples, not the pretraining data.

The mental model

The decision model first, because it's the judgment you're buying this week: prompting rents behavior, RAG supplies knowledge, fine-tuning installs behavior. Fresh or per-user facts want retrieval: a fine-tune bakes knowledge in at training time, and that knowledge starts going stale the day training ends. Broad reasoning wants the biggest model you can rent. Fine-tuning earns its keep in a specific corridor: a narrow, stable behavior (your exact output format, your domain's labeling scheme, a voice) executed at volume, where a small model's economics (cheap, fast, local, private) beat a frontier model's generality. If you can't say which of those pressures applies, you don't need a fine-tune yet; that sentence alone will make you sound senior in most rooms.

LoRA, mechanically, is a steering attachment on frozen weights: instead of updating all of the model's several billion parameters, you train small low-rank matrices alongside them, around 1% of the parameter count, that nudge the model's behavior toward your data. That's why this fits on a free GPU, why the artifact you publish can be megabytes instead of gigabytes, and why the base model's general abilities mostly survive the surgery. QLoRA, which you'll use this week, is the same idea with the frozen base weights stored in 4-bit precision to cut memory further.
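To make the "small low-rank matrices" claim concrete, here is a minimal arithmetic sketch. For one frozen weight matrix of shape d_out x d_in, LoRA trains two factors of shapes d_out x r and r x d_in, so it adds r * (d_out + d_in) trainable parameters. The 4096 x 4096 shape and rank 16 below are illustrative assumptions, not the dimensions of any particular model, and the overall fraction for a real model depends on which layers you attach adapters to.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds beside one d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in); the frozen weight W
    is applied as W + (alpha / rank) * B @ A and is never updated itself.
    """
    return rank * (d_out + d_in)


def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter parameters as a fraction of the frozen matrix they sit beside."""
    return lora_param_count(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical shapes: a 4096 x 4096 projection with rank 16.
    added = lora_param_count(4096, 4096, 16)
    print(added, f"{lora_fraction(4096, 4096, 16):.2%}")
```

Doubling the rank doubles the adapter size, which is the main reason rank is the first knob people reach for and the first one worth leaving at the notebook default.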

The deepest reframe of the week: the dataset is the program. You are programming by example; the model becomes whatever your 500 pairs demonstrate, including their tics and their blind spots. Ten mislabeled examples in 500 is a 2% bug rate you installed. This is why dataset review is most of the week's real work, why diversity of inputs beats raw count, and why the frozen held-out split is sacred — it's the only honest mirror you have. Training metrics won't save you: loss reliably goes down while the capability you actually wanted goes sideways. The loss curve tells you training happened; only your eval tells you whether it worked.

The gotcha — the seductive failure is evaluating on data that leaked from training: split before you look, freeze the test set, and never iterate against it. And when the three-way eval lands, expect the frontier model to still win on raw quality — that's not defeat. The tuned model's case is the economics column: equal-enough quality at a fraction of the cost and latency, running on your hardware. Read the verdict from the whole table, not the quality row alone.
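One way to make leakage structurally hard is to decide each pair's split from a hash of its normalized input, so duplicates and near-duplicates always land on the same side and the assignment never changes between runs. This is a sketch under assumptions: the `input` field name, the normalization rule, and the 15% test share are all hypothetical choices, and a hash split gives only an approximate test size, so confirm you still have at least 50 held-out pairs.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different inputs compare equal."""
    return " ".join(text.lower().split())


def assign_split(input_text: str, test_percent: int = 15) -> str:
    """Deterministically map an input to 'train' or 'test' by hashing it."""
    digest = hashlib.sha256(normalize(input_text).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "test" if bucket < test_percent else "train"


def split_pairs(pairs, test_percent: int = 15):
    """Split a list of {'input': ..., 'output': ...} dicts into (train, test)."""
    train, test = [], []
    for pair in pairs:
        if assign_split(pair["input"], test_percent) == "test":
            test.append(pair)
        else:
            train.append(pair)
    return train, test
```

Because the assignment depends only on the input text, adding new pairs later never moves an existing pair across the boundary.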

The path

Start here (the first hour): open Unsloth's notebook for your chosen base model on Colab or Kaggle and run it unmodified on its demo data. Training must be demystified plumbing by lunch, so the week's real work — your dataset — gets the days it deserves.

Default pick (if you haven't chosen in 30 minutes): your Week-2 task. You already have real inputs, a golden set, and a quality bar, and the fine-tune's pitch writes itself: match the big model's accuracy on this one narrow task at a fraction of the cost.

Build order — each step feeds the next:

[ ] Mon — task + data plan. Write the one-paragraph case for why this task wants a fine-tune (format? volume? latency? privacy?). Hand-write 20 examples to fix the format; freeze the eval prompt.
[ ] Tue — build the dataset. Scale to 300–1,000 pairs: real inputs, outputs drafted by a frontier model, you review every pair — fix or delete, never keep a mediocre one. (Hint: review in batches of 50; standards drift, and batch boundaries reset them.)
[ ] Wed — freeze the split, train. Hold out ≥50 pairs untouched, QLoRA on the rest. Keep the loss curves; don't worship them.
[ ] Thu — the three-way eval. Base vs tuned vs frontier on the frozen split via your eval template. If tuned ≤ base, suspect the dataset before the hyperparameters — reread 50 random pairs first.
[ ] Fri — quantize + go local. Export GGUF, run under Ollama, re-run the eval, record what quantization cost.
[ ] Sat — publish the artifact. Adapter/GGUF to the Hub with a model card that states task, data provenance and license, the numbers, and failure modes.
[ ] Sun — the verdict. The cost/quality/latency table, the when-to-use paragraph, build-log entry.
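Tuesday's review step can be supported by a little tooling. The sketch below drops structurally broken pairs before you spend attention on them, then yields the survivors in batches of 50 for manual review. The `input`/`output` field names and the two checks are assumptions for illustration; it cannot judge whether an output is good, which remains your job.

```python
def validate_pairs(pairs):
    """Return (kept, problems): drop pairs with empty fields or repeated inputs."""
    seen, kept, problems = set(), [], []
    for index, pair in enumerate(pairs):
        inp = pair.get("input", "").strip()
        out = pair.get("output", "").strip()
        if not inp or not out:
            problems.append((index, "empty input or output"))
            continue
        if inp in seen:
            problems.append((index, "duplicate input"))
            continue
        seen.add(inp)
        kept.append({"input": inp, "output": out})
    return kept, problems


def review_batches(pairs, batch_size: int = 50):
    """Yield consecutive slices of at most batch_size pairs for manual review."""
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]
```

Logging the `problems` list alongside the dataset gives the model card's data-provenance section something concrete to cite.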

Spec: must-haves

[ ] The dataset: 300–1,000 instruction→response pairs for your task. Real inputs; outputs drafted by a frontier model where helpful, then reviewed by you — dataset quality is 80% of this week's outcome and most of its actual work. Held-out test split (≥50 pairs) frozen before training.
[ ] A small open base model (e.g. a 1–8B Llama- or Qwen-class instruct model) trained with QLoRA on a free GPU (Colab or Kaggle), using Unsloth or HF PEFT/TRL.
[ ] The three-way eval on the held-out split, via your eval template: base model (prompted) vs your fine-tune vs a frontier model (prompted).
[ ] Quantized artifact (GGUF) running locally under Ollama — same eval run against the quantized version to show what quantization cost.
[ ] Published to the HF Hub: adapter (and/or merged GGUF) + a model card that states the task, the data's provenance and license, the eval numbers, and known failure modes.
[ ] Training run logged (loss curves kept — you'll want them for the write-up).
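The three-way eval in the spec above reduces to running the same frozen examples through three systems and scoring them identically. The sketch below assumes exact-match scoring and treats each system as a plain function from input text to output text; both are assumptions, since your eval template may use a different metric and real systems would wrap a local model or an API call.

```python
from typing import Callable, Dict, List


def exact_match(prediction: str, reference: str) -> bool:
    """Score a prediction as correct only if it equals the reference after trimming."""
    return prediction.strip() == reference.strip()


def evaluate(system: Callable[[str], str], test_set: List[dict]) -> float:
    """Fraction of held-out examples the system answers exactly right."""
    if not test_set:
        raise ValueError("test_set is empty; the frozen split must contain examples")
    hits = sum(exact_match(system(ex["input"]), ex["output"]) for ex in test_set)
    return hits / len(test_set)


def three_way(systems: Dict[str, Callable[[str], str]], test_set: List[dict]) -> Dict[str, float]:
    """Score every named system on the same frozen test set."""
    return {name: evaluate(fn, test_set) for name, fn in systems.items()}
```

Passing all three systems through one `evaluate` function is the point: any difference in the table then comes from the models, not from three slightly different scoring scripts.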

Eval bar

Fine-tune beats base model on the held-out set — this is the hard gate.
The frontier comparison table in the README: quality, cost per 1K calls, local-vs-API latency — and a one-paragraph verdict: when is the tuned model the right pick, when is it not. ("The frontier model still wins on quality" is a fine, honest verdict — the economics column is where small models earn their keep.)
Quantization delta reported (eval before/after GGUF).
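Two of the numbers in the list above are simple arithmetic worth getting right once. The sketch below computes API cost per 1K calls from per-call token counts and per-million-token prices, and the quantization delta as the score after GGUF minus the score before. The parameter names are hypothetical, and you would substitute your own measured token counts and your provider's current prices; the local model's cost column needs a separate estimate (hardware, electricity, your time) that this sketch does not attempt.

```python
def cost_per_1k_calls(input_tokens: float, output_tokens: float,
                      usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """USD for 1,000 calls, given average tokens per call and prices per million tokens."""
    usd_per_million_calls = input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    # Tokens x (USD per million tokens) gives USD per million calls; scale to 1K calls.
    return usd_per_million_calls / 1000


def quantization_delta(score_before: float, score_after: float) -> float:
    """Change in eval score caused by quantization; negative means quality was lost."""
    return score_after - score_before
```

Reporting the delta with its sign, rather than as an absolute difference, keeps the table honest in the rare case where the quantized model scores slightly higher by chance.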

JIT learning: pull when stuck

Unsloth docs — QLoRA fine-tunes that actually fit on free-tier GPUs; start from their notebook for your chosen base model and edit, don't write from scratch.
HF LLM course — fine-tuning chapter — supervised fine-tuning and chat templates, ecosystem-native (~45 min).
HF PEFT docs — what LoRA actually does (the conceptual guide is short and good) and every knob you're tempted to touch.
Karpathy — A recipe for training neural networks — written for training from scratch, but the debugging discipline (overfit one batch first; be suspicious of good news) is exactly what saves you when the loss curve lies (~25 min).
HF Hub — model cards — what a real card contains; yours will be better than most because it will have honest evals in it.

Key ideas

Prompting rents behavior; RAG supplies knowledge; fine-tuning installs behavior.
Fine-tune for narrow, stable, high-volume behavior — not for fresh facts or broad reasoning.
LoRA trains a ~1% steering attachment on frozen weights — hence free-GPU feasible, tiny artifact.
The dataset is the program: every bad example is a bug you install at 100% confidence.
Freeze the held-out split before training; loss going down proves nothing about your task.
The tuned model usually wins on economics, not raw quality — judge the whole table.
Quantization is a quality trade you measure, not a free compression step.

Check yourself

A teammate wants to fine-tune a model "so it knows our product docs." What do you say?
Why is a frozen test split non-negotiable, and what specifically goes wrong without it?
Your fine-tune beats base but trails the frontier model by 4 points at 1/20th the cost. Ship it or not — and what usage fact decides?

Publish

The Hub artifact (adapter/GGUF + card) — this is the headline deliverable.
The repo: dataset-building code (not necessarily the raw data — check its license/PII), training notebook, the three-way eval table, loss curves.
Build-log entry.

Stretch

DPO pass on top of your SFT model using preference pairs mined from its own eval failures (loser = model output, winner = your correction); even 100 pairs teach the mechanics.
Distill: regenerate your dataset's outputs with the frontier model at higher quality, retrain, and measure whether better teachers beat more data.
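For the DPO stretch item, the preference pairs can be assembled mechanically once you have written corrections. This is a minimal sketch: the `model_output` and `correction` field names are hypothetical, and the `prompt`/`chosen`/`rejected` keys follow the convention commonly used by TRL's DPO trainer, which is worth confirming against the version you install.

```python
def build_preference_pairs(failures):
    """Turn reviewed failures into DPO records: winner = correction, loser = model output."""
    pairs = []
    for failure in failures:
        chosen = failure["correction"].strip()
        rejected = failure["model_output"].strip()
        if not chosen or chosen == rejected:
            continue  # no usable preference signal in this record
        pairs.append({"prompt": failure["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs
```

If the failures come from the frozen held-out split, those examples can no longer score the DPO model honestly, so mine them from a separate batch of inputs or set aside a fresh test split first.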

Proof

"I've fine-tuned, quantized, published, and locally deployed an open model that beats its base on a held-out eval — and I can tell you precisely when it's the right call versus prompting a frontier model, with the cost table to back it."