
07 — The Eval Gauntlet (Week 8)

Mission

The consolidation week — and the differentiator. Turn the eval template you've been dragging through six ships into a reusable evaluation toolkit, retrofit tracing and evals across everything you've shipped, run a cross-model comparison over your whole portfolio, and publish the flagship write-up: "What I measured building 6 AI products in 7 weeks."

Why this rung

Evals are the moat. Every other skill in this program is being commoditized by better models and better frameworks; knowing whether an AI system works — building the sets, choosing the metrics, distrusting the judge, catching the regression — compounds instead. It's also the credibility layer of the whole portfolio: six ships that each state their numbers, evaluated the same way, is a story approximately nobody else can tell. This week converts six one-off harnesses into one systematic practice — the difference between having run evals and being someone who evals.

The mental model

An eval is a disagreement, compressed: somewhere a human would say "that output is wrong," and an eval is that judgment captured in a form cheap enough to run ten thousand times. Seen that way, evals are to AI systems what tests are to code — with one deep difference: the system under test is stochastic, so you're measuring distributions, not points. A single run is an anecdote; a pass rate with enough cases to mean something is a measurement. Everything in the toolkit follows from taking that seriously.
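To make "distributions, not points" concrete, here is a minimal sketch that attaches a confidence interval to a pass rate. It uses the Wilson score interval, which is one reasonable choice among several; the function name and the example counts are illustrative, not taken from any ship.

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Same observed pass rate (90%), very different certainty.
for passes, total in [(9, 10), (180, 200)]:
    low, high = wilson_interval(passes, total)
    print(f"{passes}/{total}: {passes / total:.0%} pass rate, 95% CI [{low:.2f}, {high:.2f}]")
```

Ten cases and two hundred cases can report the same headline number while supporting very different claims; the interval width is what tells them apart.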

There are exactly three instruments, and the craft is spending them correctly. Assertions — exact match, schema checks, "contains the ID" — are free, fast, and narrow: use them for everything they can reach. LLM judges scale human-ish judgment to places assertions can't reach (is this summary faithful? is this answer helpful?) — but a judge is itself a model, which means it is itself an unvalidated instrument until you calibrate it. Humans are ground truth and ruinously expensive: spend them where the other two disagree or where the stakes concentrate. This is why judge validation is the week's non-skippable step — hand-label thirty cases, measure agreement, and suddenly every judged number you've published for six weeks has a confidence attached instead of an asterisk. Judges have known pathologies — they prefer longer answers, they prefer the first option shown, they prefer text that sounds like themselves — and none of these are exotic: they're the default behavior until your calibration catches them.
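"Measure agreement" can be sketched in a few lines. The example below reports raw agreement plus Cohen's kappa, which discounts the agreement two raters would reach by chance. Kappa is an assumption about which statistic to use (the text only says "agreement"), and the thirty labels are made up purely for illustration.

```python
import math
from collections import Counter

def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if math.isclose(expected, 1.0):
        return observed, float("nan")  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1 - expected)

# Invented labels for 30 cases: the judge disagrees with the human on 5 of them.
human = ["pass"] * 20 + ["fail"] * 10
judge = ["pass"] * 18 + ["fail"] * 2 + ["fail"] * 7 + ["pass"] * 3

raw, kappa = agreement_and_kappa(human, judge)
print(f"raw agreement: {raw:.2f}, Cohen's kappa: {kappa:.2f}")
```

Raw agreement flatters a judge when one label dominates; reporting both numbers makes it harder to fool yourself.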

The practitioner translation: metrics are for catching regressions; error analysis is for getting better. A score tells you that something moved; only reading the failures tells you why and what to build next. The highest-leverage habit in this entire program is the unglamorous one — look at your data, twenty failures at a time, and let the patterns assign next week's work.
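One lightweight way to let the patterns assign the work: read each failure, give it a short hand-written tag, and count the tags. The sketch below assumes you record failures as small dicts; the case IDs and tag names are hypothetical.

```python
from collections import Counter

# Hypothetical notes from reading failed cases; the tags are assigned by hand.
failures = [
    {"case_id": "rag-03", "tag": "retrieval_miss"},
    {"case_id": "rag-11", "tag": "retrieval_miss"},
    {"case_id": "agent-02", "tag": "hallucinated_id"},
    {"case_id": "rag-19", "tag": "retrieval_miss"},
    {"case_id": "extract-07", "tag": "format_error"},
    {"case_id": "agent-09", "tag": "hallucinated_id"},
]

def rank_failure_modes(failures: list[dict]) -> list[tuple[str, int, float]]:
    """Return (tag, count, share of all failures), most frequent first."""
    counts = Counter(f["tag"] for f in failures)
    total = sum(counts.values())
    return [(tag, n, n / total) for tag, n in counts.most_common()]

for tag, n, share in rank_failure_modes(failures):
    print(f"{tag:<18} {n:>2}  {share:.0%}")
```

The top row is a candidate for next week's work; the score alone would never have named it.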

The gotcha — Goodhart comes for every eval: the moment you optimize against a fixed set, you start fitting its quirks instead of the task, and the score inflates away from reality. Golden sets need refresh — feed them from production traces and fresh failures, and treat a suspiciously perfect score not as victory but as a stale test set asking to be replaced.

The path

Start here (the first hour): new repo, your Week-1 eval template copied in as the seed, and one existing ship's case set running through it unchanged. The toolkit begins as extraction, not invention. (No default pick this week — the project is your own portfolio.)

Build order — each step feeds the next:

  1. [ ] Mon — extract the toolkit. Case loader, runners, scorers, cost/latency capture, one-command report, generalized from six weeks of copies. (Hint: port your two weirdest harnesses first — RAG hit-rate and agent scenarios; an abstraction that survives those is real.)
  2. [ ] Tue — calibrate the judge. Hand-label 30 cases drawn from your ships, measure judge–human agreement, iterate the judge prompt until the number is defensible, and record it — every judged metric you've published now inherits this credibility.
  3. [ ] Wed — retrofit ships 1–3. One command each; traces wired wherever missing.
  4. [ ] Thu — retrofit ships 4–6. Same bar. (These two days exist to find toolkit bugs before the gauntlet does.)
  5. [ ] Fri — run the gauntlet. 3 models × 6 ships, one big table. Write the surprises down the moment they land — they're Sunday's material.
  6. [ ] Sat — regression gates. Eval-in-CI on your two most-used ships; scorecards linked from every ship README.
  7. [ ] Sun — the flagship write-up. The table, three surprises, the judge-validation story — written for a stranger. This is the portfolio's centerpiece.
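As a rough picture of what Monday's extraction might leave you with, here is a minimal skeleton: a case loader, a runner that captures latency, one scorer, and a report. Every name in it (Case, Result, load_cases, run_cases, report, fake_ship) is hypothetical, and cost capture is left out because it depends on the token counts your provider returns.

```python
import json
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    id: str
    input: str
    expected: str

@dataclass
class Result:
    case_id: str
    output: str
    score: float
    latency_s: float

def load_cases(jsonl_text: str) -> list[Case]:
    """Parse one JSON object per non-empty line into a Case."""
    return [Case(**json.loads(line)) for line in jsonl_text.splitlines() if line.strip()]

def exact_match(output: str, expected: str) -> float:
    """Assertion-style scorer: 1.0 on an exact match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_cases(cases: list[Case], system: Callable[[str], str],
              scorer: Callable[[str, str], float]) -> list[Result]:
    """Call the system once per case, timing each call and scoring its output."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output = system(case.input)
        latency = time.perf_counter() - start
        results.append(Result(case.id, output, scorer(output, case.expected), latency))
    return results

def report(results: list[Result]) -> dict:
    """Summarize a run as case count, pass rate, and mean latency."""
    n = len(results)
    return {
        "cases": n,
        "pass_rate": sum(r.score for r in results) / n if n else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / n if n else 0.0,
    }

def fake_ship(prompt: str) -> str:
    """Stand-in for a real ship: it just uppercases its input."""
    return prompt.upper()

CASES = load_cases(
    '{"id": "c1", "input": "ok", "expected": "OK"}\n'
    '{"id": "c2", "input": "no", "expected": "yes"}'
)
print(report(run_cases(CASES, fake_ship, exact_match)))
```

The system and the scorer are both plain callables, so a RAG hit-rate scorer or an agent-loop runner can slot in without touching the loader or the report. Whether that abstraction survives your two weirdest harnesses is the real test.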

Spec — must-haves

  • [ ] The toolkit, extracted into its own repo/package: case-set loader, runners (single-call and agent-loop), scorers (exact, judged, and the custom ones you built for RAG hit-rate and agent completion), cost/latency capture, and a one-command report. Docs good enough that a stranger evals their project with it.
  • [ ] Judge validation, the step almost everyone skips: hand-label 30 cases yourself, measure judge–human agreement, tune the judge prompt until that agreement is defensible, and report the number. Your judged metrics from earlier weeks inherit credibility from it.
  • [ ] Retrofit: every ship (1–6) runs under the toolkit with one command; traces (Langfuse or your JSONL) wired into any ship still missing them.
  • [ ] The cross-model gauntlet: ≥3 models (one frontier, one cheap hosted, one local) run across every ship's eval set. One big table: quality / cost / latency per model per ship.
  • [ ] Regression protection: eval-in-CI on at least your two most-used ships — a PR that tanks quality fails visibly.
  • [ ] The flagship write-up: the gauntlet table, three surprises the numbers gave you, where the cheap model was secretly fine, where it absolutely wasn't, and what judge validation changed. Written for a stranger, not a diary.
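The regression-protection item above can be as small as a script that compares the current pass rate with a stored baseline and returns a nonzero status on a drop, since a nonzero exit code is what makes a CI job fail. This is a sketch under assumptions: the 0.02 tolerance, the function names, and passing the two numbers as arguments are all illustrative choices.

```python
def gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the current score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance

def main(argv: list[str]) -> int:
    """CLI-style entry point: argv holds [baseline, current] as strings."""
    baseline, current = float(argv[0]), float(argv[1])
    if gate(current, baseline):
        print(f"OK: pass rate {current:.2f} vs baseline {baseline:.2f}")
        return 0
    print(f"REGRESSION: pass rate {current:.2f} fell below baseline {baseline:.2f}")
    return 1

# In a CI step you would end the script with something like:
#   raise SystemExit(main(sys.argv[1:]))
print(main(["0.90", "0.91"]), main(["0.90", "0.80"]))
```

The tolerance exists because the system is stochastic: a gate with zero slack fails on noise, and people learn to ignore a gate that cries wolf.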

Eval bar

  • One command, per ship, produces its current scorecard — demonstrated in the toolkit README.
  • Judge–human agreement measured on ≥30 hand-labeled cases and reported.
  • The gauntlet table is complete (no "didn't get to" cells across 3 models × 6 ships), and each ship's README now cites its scorecard.
  • The write-up is published and contains real numbers a reader could act on.
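The "no empty cells" bar is easy to check mechanically. The sketch below assumes gauntlet results are stored in a dict keyed by (model, ship); the model labels, ship names, and every number are placeholders, not measurements.

```python
from itertools import product

def missing_cells(results: dict, models: list[str], ships: list[str]) -> list[tuple[str, str]]:
    """List every (model, ship) pair that has no result yet."""
    return [(m, s) for m, s in product(models, ships) if (m, s) not in results]

def render_table(results: dict, models: list[str], ships: list[str]) -> str:
    """One row per ship, one column per model: quality / cost / latency."""
    lines = ["ship".ljust(8) + "".join(m.ljust(26) for m in models)]
    for s in ships:
        row = s.ljust(8)
        for m in models:
            cell = results.get((m, s))
            if cell is None:
                text = "MISSING"
            else:
                text = f"{cell['quality']:.2f} / ${cell['cost_usd']:.3f} / {cell['latency_s']:.1f}s"
            row += text.ljust(26)
        lines.append(row)
    return "\n".join(lines)

MODELS = ["frontier", "cheap-hosted", "local"]
SHIPS = ["ship1", "ship2"]  # the real gauntlet has six

# Placeholder numbers only; ("local", "ship2") is deliberately absent.
RESULTS = {
    ("frontier", "ship1"): {"quality": 0.94, "cost_usd": 0.210, "latency_s": 3.1},
    ("cheap-hosted", "ship1"): {"quality": 0.91, "cost_usd": 0.012, "latency_s": 1.4},
    ("local", "ship1"): {"quality": 0.78, "cost_usd": 0.000, "latency_s": 6.8},
    ("frontier", "ship2"): {"quality": 0.88, "cost_usd": 0.340, "latency_s": 4.0},
    ("cheap-hosted", "ship2"): {"quality": 0.70, "cost_usd": 0.019, "latency_s": 1.9},
}

print(render_table(RESULTS, MODELS, SHIPS))
print("missing:", missing_cells(RESULTS, MODELS, SHIPS))
```

Running the completeness check before writing the post keeps a half-finished table from quietly becoming the published one.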

JIT learning — pull when stuck

  • Hamel Husain — Your AI product needs evals — the reread; the LLM-as-judge and error-analysis sections are this week's spine (~30 min).
  • Applied LLMs — the evaluation & monitoring passages, from teams who ran this in production; good for calibrating what's worth automating (~20 min).
  • Langfuse docs — datasets + scores if you want the hosted layer under your toolkit rather than files.
  • OpenAI cookbook — evals — search "evals": worked judge-prompt examples worth stealing patterns from, whatever API you use.

Key ideas

  • An eval is a human disagreement compressed into something cheap enough to run at scale.
  • Stochastic systems are measured as distributions; a single run is an anecdote.
  • Three instruments — assertions (free), judges (scalable, biased), humans (ground truth, dear) — spent in that order.
  • A judge is a model: uncalibrated, it's an opinion; validated against your labels, it's an instrument.
  • Judge pathologies are the default (length, position, self-preference), not the exception.
  • Metrics catch regressions; error analysis drives improvement. Look at your data.
  • Goodhart eats fixed golden sets — refresh from production traces; distrust perfect scores.
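Of the judge pathologies listed above, position preference is the cheapest to test for: show the same two answers in both orders and count how often the verdict follows the slot instead of the content. The sketch assumes a pairwise judge exposed as a callable that returns "A" or "B"; both stub judges are toys standing in for a real model call.

```python
from typing import Callable

# A pairwise judge: (question, answer shown as A, answer shown as B) -> "A" or "B".
Judge = Callable[[str, str, str], str]

def position_inconsistency_rate(judge: Judge, pairs: list[tuple[str, str, str]]) -> float:
    """Share of pairs where the judge picks the same slot after the answers are swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    inconsistent = 0
    for question, first, second in pairs:
        original = judge(question, first, second)  # `first` is shown in slot A
        swapped = judge(question, second, first)   # `first` is now shown in slot B
        # A content-driven judge changes its letter when the answers swap slots.
        if original == swapped:
            inconsistent += 1
    return inconsistent / len(pairs)

def always_first(question: str, a: str, b: str) -> str:
    """Toy judge with pure position bias."""
    return "A"

def prefers_longer(question: str, a: str, b: str) -> str:
    """Toy judge with length bias but no position bias."""
    return "A" if len(a) >= len(b) else "B"

PAIRS = [
    ("q1", "short", "a much longer answer"),
    ("q2", "another fairly long answer", "tiny"),
]
print(position_inconsistency_rate(always_first, PAIRS))
print(position_inconsistency_rate(prefers_longer, PAIRS))
```

Note what the second stub shows: a judge can pass the position check and still be badly biased by length, so each pathology needs its own probe.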

Check yourself

  • Your judge gives your RAG app 92% faithfulness. What's the first question to ask before publishing that number?
  • Where do assertions stop and judges become necessary? Give an example of each from your own ships.
  • A model update lifts your golden-set score 5 points. Name two explanations that aren't "it got better," and how you'd rule them out.
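For the last question, one candidate explanation is plain sampling noise, and a paired bootstrap over per-case scores is one way to probe it. This is a sketch, not the only valid test: the percentile method, the resample count, and the invented before/after scores (80 and 85 passes out of 100) are all assumptions.

```python
import random

def paired_bootstrap_ci(before: list[float], after: list[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean per-case change (after - before)."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample cases with replacement
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Invented scores for the same 100 cases: 5 passes became fails, 10 fails became passes.
before = [1] * 80 + [0] * 20
after = [0] * 5 + [1] * 75 + [1] * 10 + [0] * 10

low, high = paired_bootstrap_ci(before, after)
print(f"mean change: {(sum(after) - sum(before)) / len(before):+.2f}, 95% CI [{low:+.2f}, {high:+.2f}]")
```

If the interval straddles zero, the lift is not yet distinguishable from noise at this set size. The other explanations (a judge that favors the new model's style, or test cases leaking into training data) need different checks.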

Publish

  • The toolkit repo: over time, plausibly the most-starred thing you ship in this program.
  • The flagship write-up, linked from every ship README and the build log.
  • Updated scorecards across ships 1–6.

Stretch

  • Publish one of your eval sets (the RAG 50 or the agent scenarios) as a dataset on the HF Hub with a datasheet — eval sets are scarcer and more valuable than model checkpoints.
  • Add drift detection: re-run the gauntlet on a schedule (your Week-4 agent can drive it) and alert when a provider silently changes behavior under you. It happens; catching it live is a great follow-up post.
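A scheduled drift check can start as a comparison between the latest score and the spread of recent runs. The sketch below flags a run that lands more than k standard deviations from the historical mean; the threshold, the minimum history length, the noise floor, and the scores are all illustrative assumptions.

```python
import statistics

def drifted(history: list[float], latest: float, k: float = 3.0,
            min_runs: int = 5, floor: float = 0.01) -> bool:
    """True if `latest` sits more than k spreads away from the mean of past runs."""
    if len(history) < min_runs:
        return False  # not enough history to know what normal looks like
    mean = statistics.mean(history)
    # The floor keeps a perfectly flat history from flagging every tiny wobble.
    spread = max(statistics.stdev(history), floor)
    return abs(latest - mean) > k * spread

# Invented pass rates from six earlier scheduled runs of one ship on one model.
HISTORY = [0.90, 0.91, 0.89, 0.90, 0.92, 0.90]

print(drifted(HISTORY, 0.905))  # within the usual run-to-run wobble
print(drifted(HISTORY, 0.80))   # far outside it
```

The check is two-sided on purpose: a sudden jump in score is as worth investigating as a drop, because either one means the provider changed something under you.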

Proof

"Everything I ship carries evals from a toolkit I built — validated judges, cost and latency alongside quality, regression gates in CI — and I've published a cross-model gauntlet over my whole portfolio."