00 — Arm Yourself (Days 1–7)¶

Mission¶

By Sunday your own workflow is AI-native: an agentic coding tool is your daily driver — driven spec-first and in self-verifying loops, not chat — a personal MCP server extends it with your context, an eval-harness template exists that every later ship will reuse, and one real personal workflow is automated end-to-end. The deliverable is your stack, running — plus the write-up.

Why this rung¶

Everything above this rung is built with this rung. The single biggest gap between the median engineer and the top 1% is not knowledge — it's that the top 1% have restructured how they work so a model is in the loop of everything. You can't develop taste for what models can do from blog posts; you develop it from a hundred reps on your own real work. This week buys the reps machine.

The mental model¶

A frontier model in an agent harness is best understood as a brilliant, amnesiac colleague with a jagged skill profile — superhuman at some tasks, strangely bad at adjacent ones, and holding zero memory of you or your project beyond what's in front of it right now. Every session, everything it knows about your world is context you supplied: the instructions file, the tools it can call, the files it has read, the conversation so far. That reframes the whole game. You are not "using a tool"; you are managing what a very capable amnesiac gets to see — which is why a good per-repo instructions file quietly outperforms any amount of clever prompt phrasing, and why the skill ceiling here is context curation, not wording.

MCP is the practitioner translation of that idea: a standard port for context and capabilities. Before it, every assistant needed a bespoke integration to reach your notes, your database, your APIs; MCP makes a capability something you build once and plug into any agent that speaks the protocol. When you build your server this week, you're not doing an integration exercise — you're converting private context, the thing that makes your agent better than a stranger's, into a reusable part.

With delegation, the unit of work moves up a level: specify → delegate → review, in a loop. The load-bearing skill is the review, and it has a trap in it: model output is optimized to look right, so stylistic confidence carries no information about correctness. The practical rule is verification asymmetry — aggressively delegate tasks whose results are cheap to check (tests pass, the file renders, the number reconciles), and keep on a short leash anything whose failure is silent. And because the colleague is stochastic, "does it work?" becomes a rate, not a boolean — which is why Day 5's eval template measures pass rates over case sets, the same way you'd never assess a flaky test by running it once.

Two techniques turn that loop from a habit into a method, and they're the highest-leverage things in this whole week. The first is spec-driven development: instead of prompting your way to code conversationally, you write the spec — the intent, the interface, the constraints, the acceptance check — first, and drive the agent against it. The spec becomes the source of truth you review the diff against ("does this match what I asked?"), and when something's wrong you fix the spec and regenerate, not patch the output. This is why plan-mode and spec-first tools exist: the spec is where your judgment lives, and it's cheap to change while code is dear. The second is the agentic loop proper — the agent writes, runs the test, reads the failure, and fixes, iterating without you in the seat. The unlock isn't the model; it's giving it a cheap, honest success signal it can check itself against (a test, a type-check, a linter, a script that exits non-zero). A task with a self-verifiable signal can be looped and left; a task without one must be babysat. Half of "make AI build for you" is engineering that signal so the loop can close on its own.

The gotcha — the pattern that keeps the median engineer median is using the agent as a faster Stack Overflow: one-shot question, copy the answer, move on. That captures almost none of the value. The compounding move is spec → delegate → self-verifying loop → review at the boundary. If by Day 7 your sessions still look like Q&A rather than "here's the spec, here's the test, go," the week hasn't done its job yet.

Start here (the first hour)¶

By the end of hour one: your daily driver is installed and has completed one real task in one of your real repos. Pick the tool now — Claude Code if you want the terminal-native default, Cursor if you live in an IDE; don't research this, either is fine. Then: install, open a repo you actually work in, write a 10-line instructions file (what the project is, how to run tests, one convention you care about), and hand it a small real task ("add a --verbose flag", "fix this flaky test"). Watch what it does differently because the instructions file exists. That loop — instruct, delegate, review — is the whole week in miniature.

Default picks (take these if you haven't chosen your own within 30 minutes):

MCP server (days 3–4): a host-recon server over your own machine or homelab — three tools: list_services (running units / listening ports), tail_log (a named journald/syslog source), check_config (read a config file, e.g. sshd_config). You already have the box; you'll reach for these in every later ops session.
Workflow (days 6–7): the advisory watch — the agent pulls a security feed relevant to your stack (a distro security list, or the NVD recent-CVEs JSON), filters to the packages/services you actually run, and writes one triaged digest.

The week¶

Days 1–2 — agentic coding, done the top-1% way. Pick one (Claude Code, Cursor, Codex — whatever you'll actually live in) and go past autocomplete into the two methods: (a) spec-driven — for at least one real task, write the spec/plan first (use plan mode or a SPEC.md), have the agent implement against it, and review the diff against the spec; (b) the self-verifying loop — give a task a test or check it can run itself, then let the agent write→run→read→fix until green while you watch, not type. Add custom instructions (CLAUDE.md/rules) to one real repo. Do all your work this way for two days, including the uncomfortable parts, and notice which tasks loop cleanly and which need babysitting — that boundary is the lesson.
Days 3–4 — your first MCP server. Build a small MCP server exposing something you personally query often — your host's services and open ports, a log source, your homelab/config files, a cloud account's resource inventory, your runbooks. Wire it into your coding agent and use it in anger. Tool design lesson #1 happens here: your first tool descriptions will be bad, the agent will misuse them, and fixing that teaches more than any post on the subject.
Day 5 — the eval template. Build the harness skeleton you'll reuse all program: a cases.jsonl of input → expected pairs, a runner that calls a model, a scorer (exact match + an LLM-as-judge variant), and a one-line report (pass rate, cost, latency). Plain pytest or a bare script — no framework. ~150 lines total.
Days 6–7 — automate one real workflow + write it up. Pick a recurring chore from your actual life/job (triaging something, a weekly report, renaming/filing, monitoring a page) and make the agent do it end-to-end. Then publish write-up #1: "my AI stack, day 7" — what's in it, what surprised you, one concrete number (time saved, reps run).

Spec — must-haves¶

[ ] One agentic coding tool configured as your daily driver, with per-repo instructions.
[ ] One task shipped spec-first (a written spec/plan the agent implemented against) and one task driven by a self-verifying loop (agent iterated to a green check on its own) — both noted in write-up #1.
[ ] A working MCP server (any language; stdio is fine) with ≥3 tools, connected and used in real sessions.
[ ] The eval template repo: runner, two scorers (exact + judge), cost/latency capture.
[ ] One personal workflow fully automated and actually run on real inputs.
[ ] Write-up #1 published (blog, GitHub README, or a public gist — anywhere linkable).

Eval bar¶

MCP server: invoked successfully in ≥5 distinct real sessions; zero manual copy-paste of the context it serves.
Eval template: runs a 10-case toy set end-to-end and prints pass rate, total cost, p50 latency.
Automated workflow: produces correct output on 3 consecutive real runs.

JIT learning — pull when stuck¶

Claude Code docs — skim Quickstart, then live in "Common workflows"; the memory/instructions page is the highest-leverage 10 minutes.
GitHub Spec Kit — a concrete, tool-agnostic take on spec-driven development; read the README's workflow and steal the spec structure even if you don't adopt the tool (~20 min).
Anthropic — Claude Code best practices — the "explore, plan, code, commit" and test-driven-loop sections are exactly Days 1–2's two methods, in practice (~20 min).
MCP — Getting started — the protocol in one page; then follow the server quickstart for your language.
Anthropic — Building effective agents — read once now for vocabulary (workflows vs agents); you'll reread it in Week 4 (~20 min).
Hamel Husain — Your AI product needs evals — the canonical case for Day 5; read the sections on looking at your data and LLM-as-judge (~30 min).

Key ideas¶

The model is a stateless, jagged-skilled colleague: everything it knows per session, you supplied.
The core skill is context management — instructions files and MCP beat prompt wording.
MCP = build a capability once, plug it into every agent that speaks the protocol.
The work loop is specify → delegate → review; review is where quality is made.
Spec-driven development: write the spec first, drive against it, fix the spec not the output.
The agentic loop closes on its own only when the task has a cheap, honest success signal.
Verification asymmetry: delegate what's cheap to check; leash what fails silently.
Stochastic systems are measured in rates, not booleans — hence the eval template.

Check yourself¶

Why does a per-repo instructions file usually beat a cleverly worded prompt?
What property of a task lets an agent loop on it unattended — and what do you build when that property is missing?
Why fix the spec and regenerate rather than hand-patch the code the agent produced?
Your eval passes 8/10 cases twice in a row. What do you know, and what don't you?

Publish¶

The MCP server: its own public repo with a README a stranger could install from.
The eval template: its own public repo (you'll extend it for 8 weeks).
Write-up #1.

Stretch¶

Add a second MCP server that writes somewhere (files a note, opens an issue) — write tools force you to think about permissioning and blast radius.
Run two agent sessions in parallel on independent tasks and keep both on the rails — parallel-agent management is a distinct skill and it starts here.

Proof¶

"My daily workflow runs through an agent I've extended with my own MCP tools, and I have an eval harness I apply to everything I build."