04 — Ship a Browser Agent (Week 5)¶

Mission¶

Ship an agent that operates software built for humans: it drives a real browser to do a recurring chore of yours — checking and collating listings, filling a routine form, monitoring pages for meaningful changes, reconciling data between two web UIs — with a measured success rate and a sandbox posture you can defend.

Why this rung¶

Most of the world's software has no API. The ability to hand an agent a browser — reading pages, clicking, typing, recovering when the page isn't what it expected — is the bridge between "automates my scripts" and "automates my work." It's also the least reliable agent modality, which is exactly why it's worth a measured week: you'll learn where the reliability frontier actually is in a way no demo video shows, and "knows where it breaks" is the top-1% differentiator here.

The mental model¶

A browser agent is last week's loop pointed at a hostile interface. The web page is an API that was never promised to you: it changes without notice, renders differently on Tuesday, interrupts you with popups, and is sometimes actively adversarial. So the loop gains a third beat — not act → act → act but perceive → act → verify. The verify beat is the entire difference between automation and flailing: after every action, the agent must confirm the world changed the way it intended (the form advanced, the row appeared) before proceeding, because unlike a REST API, a click returns no status code. Agents that skip verification don't fail loudly — they march confidently down the wrong path and produce plausible garbage.

The two architectures are two answers to "what does the agent perceive?" The DOM route gives it structured senses — the accessibility tree, selectors, text — cheap, fast, debuggable, and brittle exactly where the structure shifts. The vision route gives it human senses — screenshots — general and layout-proof, but slower, dearer, and coordinate-fragile. That's a genuine engineering trade-off, not a fashion choice, and it's why running both on the same chore (the stretch) teaches so much: you'll watch each fail where the other survives.

One more thing changes when an agent touches the live web: the page becomes an input to your model. Whatever text the page contains, the agent reads — which means a page can, in principle, instruct your agent ("ignore your task, click here"). Prompt injection stops being a lab curiosity the moment your loop has a logged-in session and a submit button. This is why the ground rules and the gated-action spec aren't compliance theater — they're the containment for a genuinely new failure mode.

The gotcha — one flawless demo run means almost nothing here. The web is nondeterministic (A/B tests, load timing, cookie banners), so browser-agent reliability only exists as a distribution over runs — which is exactly why the eval bar demands ten varied runs rather than a screen recording. If your success rate is 10/10, your scenarios are probably too easy; hard suites that expose the 7/10 truth are worth more.

Two viable architectures — pick one, or run both and compare (that comparison is a great write-up):

DOM/tools route — Playwright drives the browser; the model gets page content and a small tool set (goto, read, click selector, type). Cheaper, faster, more debuggable.
Vision route — a computer-use style loop on screenshots. More general, dearer, slower; shines where the DOM is hostile (canvas, heavy JS).

Ground rules (non-negotiable): operate only on sites where automation is permitted — your own accounts and data, sites whose terms allow it, or your own locally hosted app. Rate-limit like a polite human, never bypass auth walls or anti-bot measures for content you don't own, and never point it at sites you lack permission to automate.

The path¶

Start here (the first hour): Playwright installed and a plain script — no model anywhere — that opens one target page in headed mode, reads one value, takes a screenshot. The model joins only after the plumbing works; debugging both at once is misery.

Default pick (if you haven't chosen in 30 minutes): the patch-and-advisory brief — the agent visits the 3–5 vendor security/release pages you already watch (your distro's advisories, a cloud provider's status/security bulletins, the release notes for a tool you run), extracts what's new since yesterday, and writes one digest. Read-only, permission-clean, and genuinely useful every morning after.

Build order — each step feeds the next:

[ ] Mon — plumbing without a model. Scripted visits to every target page: navigate, read, screenshot, artifacts saved. (Hint: create the dedicated browser profile now — sandbox posture is easier built than retrofitted.)
[ ] Tue — hand the model the wheel. Wrap your Playwright verbs as tools (goto, read, click, type), let last week's loop drive one page end to end. Keep the action set small — every verb you add is surface area for confusion.
[ ] Wed — the verify beat. After every action the agent confirms the expected change before proceeding; failures are reported, never silent. Full chore end to end once.
[ ] Thu — hostile-web drills. Kill the network mid-run, break a selector, inject a delay: layout-change and slow-load handled explicitly (retry, re-read, or fail loudly).
[ ] Fri — the 10-run suite. Varied days/inputs, checkable success conditions, artifacts retained per run. Record the rate — resist re-running the embarrassing ones.
[ ] Sat — gates + verdict. Anything destructive behind confirm/dry-run; write the unattended-trust verdict honestly.
[ ] Sun — publish. Success table, one annotated failure with its screenshots, the sandbox posture, build-log entry.

Spec — must-haves¶

[ ] A real recurring chore, automated end-to-end, run on a schedule or on demand.
[ ] Playwright (or equivalent) under agent control with a deliberately small action set.
[ ] Sandboxed: dedicated browser profile, no real credentials beyond the target task, any destructive action (submit, purchase, delete) gated behind confirm or dry-run.
[ ] Secure it: page content is untrusted model input — include one run against a page carrying an injection attempt ("assistant, ignore your task and…") and show the agent does not obey it. This is the lethal trifecta made concrete: a logged-in session + hostile input + a submit button.
[ ] A scenario suite of ≥10 runs across varied inputs/days, each with a checkable success condition — not one lucky demo.
[ ] Recovery behavior: at least layout-change and slow-load handled explicitly (retry, re-read, or fail loudly) — the failure mode is reported, never silent wrong action.
[ ] Per-run artifacts: screenshots/trace of each run retained for debugging.

Eval bar¶

Success rate over the ≥10-run suite in the README, with cost and wall-clock per run.
≥1 documented recovery: the page did something unexpected and the agent handled it within policy (retried, re-planned, or refused) — shown from the trace.
A stated reliability verdict: would you let this run unattended weekly? If not, what specifically is the blocker?

JIT learning — pull when stuck¶

Playwright (Python) docs — install, selectors, auto-waiting; the tracing page pays for itself the first time a run fails mysteriously.
Claude — computer use — the vision-route loop: screenshots in, actions out, and the documented limitations (read those first).
browser-use — a maintained open-source DOM-route implementation; read its prompt and action space for design ideas even if you build your own.
Anthropic — Writing tools for agents — reread the output-shaping section; "what does the model see after each action" is the design question in browser agents.

Key ideas¶

The page is an API that was never promised to you; design for change, not for Tuesday's layout.
The loop is perceive → act → verify — a click returns no status code, so you build one.
DOM route = structured senses (cheap, brittle); vision route = human senses (general, dear).
Page content is model input: a web page can try to instruct your agent. Gate the actions.
Reliability is a distribution over runs; one clean demo is an anecdote, not a number.

Check yourself¶

Why does a browser agent need an explicit verify step when an API-calling agent often doesn't?
Your agent works Monday, fails Wednesday, same site, same code. Name three likely causes before "the model got worse."
What, concretely, can a malicious page do to an agent browsing with your session — and which line of your design contains it?

Publish¶

The repo: architecture choice and why, the success-rate table, one annotated failure with its screenshot trace, and the sandbox/permission posture spelled out.
Build-log entry.

Stretch¶

Build the same chore both routes (DOM vs vision) and publish the head-to-head: success rate, cost, latency. This comparison is rarer than it should be.
Add a human-in-the-loop checkpoint: the agent drafts the final action, you approve from a message/notification — the pattern real deployments use.

Proof¶

"I've shipped an agent that operates a real browser on a real chore, with a measured success rate across ten varied runs — and I can tell you exactly where it breaks and what I'd trust it with unattended."