The 11PM Instruction

At 11:04 PM on a Tuesday I typed one paragraph into a terminal: build a voice note app, web client and Android client, features should actually work, write tests that actually run, stop when the smoke tests pass or when you’ve burned 90 minutes. I hit enter, closed the laptop lid halfway so the fan would still breathe, and went to bed.

I woke up at 6:40 AM to a project folder that had not existed the night before. A web client at /web that booted on npm run dev, recorded audio, saved transcripts, and filtered by date. An Android client at /android with a Gradle wrapper that actually built. Seven planned features. Three compiled. One — exactly one — worked the way I’d asked.

I sat there with coffee for about ten minutes trying to figure out if I was happy or horrified. Both, it turned out. Happy that a machine had written and committed code while I was unconscious. Horrified that three of the “finished” features were quietly broken, and one of them had broken the core record-and-save flow that had worked in the previous run. The orchestrator had marked the run as successful. The tests were green. The app was subtly wrecked.

Here is the thing about autonomous coding agents in 2026: most of them still need babysitting. You start a run, you watch the tool calls scroll past, you intervene every few minutes when the agent proposes something stupid or gets stuck on a typo. The word “autonomous” is doing a lot of work that it does not deserve. What I wanted was something narrower and meaner: start an agent, go to sleep, and wake up to a real project. Not a demo. Not a scaffold. A thing I could open and use.

Getting there took four real runs, about a dozen runs I’d rather forget, and a growing list of rules that each started life as a postmortem. This is how Sleep Mode actually works, what it gets wrong, and what I’ve learned about the gap between “the tests pass” and “the thing works.”

What an Agent Loop Actually Is

Let me demystify this before going deeper, because the marketing has made a mess of the terminology.

“An AI agent is an LLM wrecking its environment in a loop.” — Solomon Hykes (via Simon Willison)

That is, unironically, the cleanest definition I know. Simon Willison’s companion phrasing is “runs tools in a loop to achieve a goal,” which is the same idea with the threat level dialed down. The mechanical version, stripped of vibes, is five steps:

  1. Prepare context. Collect the prompt, any state from previous turns, the list of available tools, and the goal.
  2. Call the model. Ship the context to the LLM. The model returns either text or a tool call.
  3. Handle the response. If it’s a plain message, display it. If it’s a tool call, parse it.
  4. Execute the tool. Run the bash command, read the file, hit the API, spawn the subagent. Capture stdout, stderr, and exit code.
  5. Feed the result back. Append the tool result to context. Go to step 1.

That’s it. That’s an agent loop. Everything clever — planning, reasoning, “thinking,” self-correction — happens inside the model on the next turn because the previous tool result is now part of the context. The loop is dumb on purpose. The intelligence lives in the pattern of tool calls, not in the loop that runs them.
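The five steps above fit in a couple dozen lines. This is a minimal sketch, not Sleep Mode’s actual implementation — `call_model` and `run_tool` are stand-ins you’d wire to a real LLM client and a real sandboxed executor.

```python
import json

def agent_loop(goal, call_model, run_tool, max_turns=50):
    """Minimal agent loop: send context to the model, execute any tool
    call it returns, append the result to context, repeat. call_model and
    run_tool are injected stand-ins for a real model client and executor."""
    context = [{"role": "user", "content": goal}]        # step 1: prepare context
    for _ in range(max_turns):                           # hard termination condition
        response = call_model(context)                   # step 2: call the model
        if response["type"] == "message":                # step 3: plain text ends the run
            return response["text"]
        result = run_tool(response["name"], response["args"])  # step 4: execute the tool
        context.append({"role": "assistant", "content": json.dumps(response)})
        context.append({"role": "tool", "content": json.dumps(result)})  # step 5: feed back
    return "budget exhausted"
```

Note that the loop itself decides nothing beyond “is this a message or a tool call” — exactly the dumb-on-purpose shape described above.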

The phrase “in a loop” is doing the heavy lifting. A one-shot LLM call is a stateless function. Wrapping it in a loop gives you an agentic cycle: perceive (tool result), reason (model call), act (new tool call), observe (next tool result). That perceive-reason-act-observe cycle is where autonomy lives, and it’s also where every failure mode I hit lives.

Most “autonomous” agents fail on one of three things. They have no termination condition, so they spin forever or quit too early. They have no real feedback signal, so the loop is flying blind — the model has no idea whether the last action worked. Or they have no recovery protocol, so a single bad tool call poisons the rest of the run. Sleep Mode is basically an attempt to fix those three things with a harness: the infrastructure around the model that handles state persistence, tool orchestration, and recovery when things go sideways.

The harness is the product. The model is the engine. Getting that distinction right was the first thing I had to learn.

The Architecture of Sleep Mode

Sleep Mode is a hub-and-spoke orchestrator. One top-level agent — the orchestrator — owns the run from start to finish, and it delegates bounded tasks to subagents that return structured results and then die. The orchestrator never touches the model directly for research or deep analysis; it spawns a subagent, gets back a report, and decides what to do next.

“From implementer to manager, from coder to conductor.” — Addy Osmani

That quote feels marketing-ish until you actually build one of these things and realize the orchestrator’s job is almost entirely management: allocating budget, deciding who does what, verifying the output, and keeping the plan alive across context compaction events.

The run moves through eight phases.

  [0 Parse] -> [1 Context & Deps] -> [2 Analyse] -> [3 Research]
                                                       |
                                                       v
  [7 Iterate] <- [6 Test & Verify] <- [5 Execute] <- [4 Plan]

Phase 0 — Parse. The orchestrator reads the raw instruction and extracts the time budget, token budget, target tech stack, and a rough complexity estimate (Simple / Medium / Complex). Nothing intelligent here — it’s mostly regex plus a classification call.
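A sketch of what that regex-plus-heuristic pass might look like. The patterns and the complexity heuristic here are illustrative assumptions, not Sleep Mode’s actual parser, which also makes a classification call to the model.

```python
import re

def parse_instruction(text):
    """Phase 0 sketch: pull a time budget, a rough stack list, and a
    crude complexity estimate out of the raw instruction."""
    minutes = re.search(r"(\d+)\s*minutes", text)
    stacks = re.findall(r"\b(web|android|python|kotlin|fastapi)\b", text, re.I)
    platforms = {s.lower() for s in stacks}
    # Illustrative heuristic: more platforms => more complex.
    complexity = ("Complex" if len(platforms) >= 3
                  else "Medium" if len(platforms) == 2
                  else "Simple")
    return {
        "time_budget_min": int(minutes.group(1)) if minutes else 60,
        "stack": sorted(platforms),
        "complexity": complexity,
    }
```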

Phase 1 — Context & Dependencies. This is the only interaction window in the entire run. The orchestrator collects every missing parameter it will ever need and every dependency it will ever touch, then surfaces them in a single structured question. If all dependencies are green and nothing is missing, it skips the question entirely and goes straight to Phase 2. After Phase 1 ends, the orchestrator is fully autonomous. No more questions. If it has to guess, it has to live with the guess.

Phase 2 — Analyse. A subagent reads the prompt plus any attached style references and returns a structured analysis document: the aim in one sentence, a feature list tagged CORE / ENHANCED / POLISH, pass/fail criteria, confirmed tech stack, and proposed folder structure. CORE features are must-ship. ENHANCED are should-ship. POLISH is only if there’s time left over, which there almost never is.

Phase 3 — Research. Subagents go out in parallel. One fetches current library versions and docs so I don’t get code written against a 2023 API. One looks at architecture patterns for the target stack. On Complex projects, a third does a deep-dive on whatever the trickiest unknown is — Web Speech API quirks, Room migrations, RS485 timing, whatever. They all return summaries, not transcripts.

Phase 4 — Plan. A planning subagent reads the analysis plus the research and emits a numbered execution plan. Each step has a file path, a scope tag, a dependency list, a parallelization hint, and a quality gate. The plan is markdown. I can read it. That’s important — if I wake up mid-run and want to know what it’s doing, I open plan.md.

Phase 5 — Execute. The orchestrator works the plan. Quality gates after each batch of steps. Git commit at every milestone. A time check before each step — if the remaining budget can’t cover the next step plus a test pass plus a buffer, the orchestrator skips ahead to Phase 6 instead of starting something it can’t finish.
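The pre-step time check can be expressed as a one-line guard. The default minutes here are illustrative assumptions, not the tuned values:

```python
def next_action(remaining_min, step_est_min, test_pass_min=5, buffer_min=5):
    """Phase 5 time check, run before every step: only start a step if the
    remaining budget covers the step plus a test pass plus a buffer;
    otherwise skip ahead to Phase 6. Default minutes are illustrative."""
    if remaining_min >= step_est_min + test_pass_min + buffer_min:
        return "execute_step"
    return "skip_to_phase_6"
```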

Phase 6 — Test & Verify. Build verification first (npm run build, ./gradlew assembleDebug, python -c "import app"), then smoke tests, then the full test plan from Phase 2. Results get written to test_report.md.

Phase 7 — Iterate. If something failed, diagnose, research if needed, fix, retest. Iteration cap is proportional to complexity: Simple = 3, Medium = 5, Complex = 8. When the cap is hit, the orchestrator stops and writes an honest status report, even if the project is broken.

State persistence is the spine of the whole thing. A file called state.md at the root of the run holds: run ID, project path, current phase, current step, list of key decisions, CORE/ENHANCED/POLISH scope status, files created so far, and known issues. Every time the orchestrator finishes a step, it rewrites state.md. When context gets compacted mid-run — and on a 90-minute run it absolutely will — the recovery protocol re-reads state.md on the next turn and rebuilds its working memory from that file. Git commits do the same job at a coarser grain: if it’s in git, it survived compaction.
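The write-rewrite-recover cycle around state.md might look something like this. The field names and the write-temp-then-replace pattern are my sketch of the idea, not the file format Sleep Mode actually uses:

```python
from pathlib import Path

STATE_TEMPLATE = """\
# state.md -- rewritten after every step (illustrative format)
run_id: {run_id}
project_path: {project_path}
phase: {phase}
step: {step}
scope_status: {scope_status}
known_issues: {known_issues}
"""

def write_state(path, state):
    """Rewrite state.md via a temp file so a crash mid-write
    never leaves a half-written state file behind."""
    tmp = Path(path).with_suffix(".tmp")
    tmp.write_text(STATE_TEMPLATE.format(**state))
    tmp.replace(path)

def read_state(path):
    """Recovery protocol: rebuild working memory from state.md
    on the first turn after a context compaction event."""
    state = {}
    for line in Path(path).read_text().splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state
```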

Budgets are allocated proportionally. This table lives at the top of the orchestrator’s system prompt:

  Complexity   Analyse   Research   Plan   Execute   Test + Iterate   Buffer
  Simple          5%       10%       10%     50%          20%           5%
  Medium          5%       15%       10%     45%          20%           5%
  Complex         5%       20%       15%     35%          20%           5%

The numbers aren’t sacred. They’re the result of tuning. Early runs put 80% on execute and crashed into the wrong problems. Shifting weight into research and planning was the single biggest quality win.
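Mechanically, the allocation is just a lookup and a multiply. A minimal sketch using the percentages from the table above:

```python
BUDGET_SPLITS = {
    # Phase shares per complexity tier, matching the table above.
    "Simple":  {"analyse": 0.05, "research": 0.10, "plan": 0.10,
                "execute": 0.50, "test_iterate": 0.20, "buffer": 0.05},
    "Medium":  {"analyse": 0.05, "research": 0.15, "plan": 0.10,
                "execute": 0.45, "test_iterate": 0.20, "buffer": 0.05},
    "Complex": {"analyse": 0.05, "research": 0.20, "plan": 0.15,
                "execute": 0.35, "test_iterate": 0.20, "buffer": 0.05},
}

def allocate(total_minutes, complexity):
    """Split a total time budget across phases proportionally."""
    return {phase: round(total_minutes * share, 1)
            for phase, share in BUDGET_SPLITS[complexity].items()}
```

On a 90-minute Complex run, for instance, research gets 18 minutes — enough to actually read the docs that Run 2 skipped.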

The 14 Rules (and Where They Came From)

These rules aren’t a clean set I designed up front. Each one came from a run that taught me something I didn’t want to learn twice. I’m going to cover the five most interesting in detail. Here are the other nine in one-line form, so you can see the full shape:

  1. Always create a fresh project folder, never reuse an existing one — runs that share a directory contaminate each other’s git history.
  2. Destructive actions are only allowed inside the project folder, and every deletion gets logged with a reason.
  3. Be efficient — time and token budgets are upper limits, not targets. Finish early when you can.
  4. Always log every phase to disk, because the log is your memory after context compaction.
  5. Always git commit at every successful build milestone, not just at the end.
  6. Always read state.md first when resuming after compaction. State is the spine.
  7. The first phase is the only interaction window — after Phase 1, no more questions to the user.
  8. Never use the harness’s built-in plan mode — all planning happens in subagents via the Agent tool.
  10. Never stop mid-execution. If something fails, iterate. Stop only when time expires or all tests pass.

The five below are the ones with the most interesting failure stories.

Rule 11: NEVER break existing functionality. This came from Run 2, the voice note features run. The orchestrator added four new features to an app that already worked. Every new feature had its own tests. Every test passed. The build was green. I shipped it. The next morning I tried to record a note and nothing happened. The core record-and-save flow was dead. What happened: three of the new features also wanted access to the microphone via the Web Speech API, and the new MediaRecorder instantiation was competing with the speech recognizer for the same stream. The tests passed because each feature was tested in isolation, against a mocked media stream. The real browser serialized access and the first feature to grab the mic won — and it wasn’t the record button. Rule 11 now forces the orchestrator to re-run the previous run’s smoke tests before marking a new run complete.

Rule 12: “Features” means FUNCTIONAL features, not UI polish. Run 2 again. I’d asked for “more features.” I got: glassmorphism on the card backgrounds, hover animations on every button, a color theme picker with six palettes, and a subtle parallax on the hero section. Zero new functional capabilities. The orchestrator had interpreted “features” the way a visual designer might. Rule 12 now spells it out: filters, integrations, language support, export formats, sync, offline mode — YES. Glassmorphism, hover micro-interactions, theme pickers — NO, unless explicitly requested.

Rule 13: Multi-platform projects: work on ALL platforms or ask which one. Different run, same week. The project had a web client and an Android client. I asked for a filter feature. The orchestrator updated the web side beautifully. The Android side got nothing. Not even a TODO. I only noticed because I happened to open Android Studio that afternoon. Rule 13 now forces the orchestrator, at plan time, to either list changes for every platform or explicitly ask (during Phase 1) which platform is in scope.

Rule 14: Use allocated time proportionally — at least 20% on research and planning. Run 2 spent about 2 minutes of a 90-minute budget on research. The features it shipped were aimed at the wrong target because it never actually read how the Web Speech API interacts with MediaRecorder. Speed is not quality. Rule 14 now hard-enforces a floor: you cannot leave Phase 3 early. If research finishes in 90 seconds on a 15-minute research budget, spin up another subagent for deeper investigation. The research allocation is a floor, not a ceiling — the overall run budget remains a hard ceiling.

Rule 9: NEVER repeat the same failed approach. This one is about the iteration loop in Phase 7. In one early run, a test kept failing on a single file. The orchestrator kept applying the same patch with trivial variations — change a variable name, add a try/except, revert, re-apply. Six iterations, same patch, same failure. Rule 9 now says: after two failures on the same approach, pivot strategy. After three, escalate — which in autonomous mode means write the failure to state.md as a known issue, mark the feature partial, and move on rather than burning the rest of the budget.
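One way to enforce Rule 9 is to fingerprint each failed attempt and count repeats. This is my sketch of the idea, not Sleep Mode’s implementation — and the whitespace-normalized hash only catches trivially varied patches; a real version would need fuzzier matching to catch renamed-variable variants:

```python
import hashlib

class ApproachTracker:
    """Rule 9 guard: fingerprint each failed fix attempt and count repeats.
    Two failures on the same approach => pivot; three => escalate."""

    def __init__(self):
        self.failures = {}

    def record_failure(self, patch_text):
        # Normalize whitespace so trivially varied patches count as repeats.
        key = hashlib.sha1(" ".join(patch_text.split()).encode()).hexdigest()
        self.failures[key] = self.failures.get(key, 0) + 1
        count = self.failures[key]
        if count >= 3:
            return "escalate"   # log to state.md, mark feature partial, move on
        if count >= 2:
            return "pivot"      # try a structurally different strategy
        return "retry"
```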

Every rule is a postmortem compressed into one sentence. The document has 14 now. I expect it to have 20 by summer.

The Hard Parts — Failure Modes I Actually Hit

Honest section. None of this is solved.

Context compaction drift. On a long run, the model’s context window gets compacted, and the original aim can drift. I’ve seen a run that started as “add filters to the voice note app” end up refactoring the auth layer because the mid-run summary lost the scope tag. The fix is state.md plus the recovery protocol: on every turn after a compaction event, the orchestrator is required to re-read state.md and restate the current aim and current step before doing anything else. It mostly works. It is not bulletproof.

Silent failures. The voice note story above. Tests green, user flow broken. This is the failure class I trust least because the feedback signal lies to you — the orchestrator thinks it succeeded. The partial fix is explicit manual smoke test descriptions for anything that touches a browser API, a sensor, audio, video, or real network I/O. The test plan now says things like “open the app in Chrome, click record, speak for 3 seconds, click stop, confirm a new entry appears in the list with non-empty transcript text.” The orchestrator can’t run that test, but it can write a summary of what it would expect, and I can run it in 30 seconds when I wake up. The real fix — some kind of automated browser smoke test — is on my list.

Wrong feature interpretation. The glassmorphism story. This one I actually did solve, via Rule 12 and an explicit FUNCTIONAL definition in the analyse phase. I haven’t seen a repeat since.

Shallow research. The 2-minute research phase that shipped a feature aimed at the wrong target. Rule 14 and the budget floor fixed the common case. I still occasionally get research that’s broad but not deep — the subagent reads five docs and summarizes all of them instead of drilling into the one that matters.

Infinite loops on subjective tasks. This is the class I haven’t solved. If I ask for “better error messages” or “cleaner code,” the orchestrator has no feedback signal. Tests pass either way. The model’s own judgment about whether the new version is “better” isn’t reliable, and worse, it tends to be confidently wrong. I currently just refuse to run Sleep Mode on anything subjective. The open problem: how do you give an agent a feedback signal for work where the ground truth is taste?

Wrong-target errors. Real example: an agent in an early run decided the “main” branch was called develop (because the repo also had a develop branch from a contributor) and committed a feature to the wrong branch. I didn’t notice for a day. The fix, after the same class of mistake surfaced a third time, was an explicit confirmation step in Phase 1 that echoes the target back: “I will write to project X at path Y on branch Z. Proceed?” If the orchestrator cannot answer X, Y, and Z with certainty, it asks. I should have added this rule after the first time, not the third. That’s on me.

Watching Sleep Mode run for a few hours teaches you something the marketing doesn’t. It’s more like a junior engineer with infinite patience and shallow judgment than a senior engineer who can think for themselves. It is very good at executing a well-scoped plan. It is not good at deciding what the plan should be when the requirements are vague. The rules and the gates exist to catch what the judgment misses.

Quality Gates — How It Knows When to Stop

Tests are the non-negotiable feedback signal. Without them, the loop is flying blind and every termination condition becomes guessing. With them, the orchestrator has something real to react to.

The gates, in order:

Build verification. Whatever the stack demands. npm run build for web. ./gradlew assembleDebug for Android. python -c "import app" for Python. If the build fails, execution halts and the loop goes into Phase 7 immediately. No proceeding with a broken build, ever.

Smoke tests. The cheapest possible check that the artifact is alive. CLI tool returns 0 on --help. Web dev server returns HTTP 200 on /. APK exists and is non-empty. Python module imports without error. These run after every milestone. They’re not comprehensive. They’re just “is the thing alive.”
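The “is the thing alive” checks described above are each a few lines. A minimal sketch — the exact checks Sleep Mode runs are stack-specific, and the function names here are mine:

```python
import subprocess
import sys
import urllib.request
from pathlib import Path

def smoke_test_cli(cmd):
    """CLI is alive: the tool exits 0 on --help."""
    return subprocess.run(cmd + ["--help"], capture_output=True).returncode == 0

def smoke_test_web(url):
    """Dev server is alive: answers HTTP 200."""
    try:
        return urllib.request.urlopen(url, timeout=5).status == 200
    except OSError:
        return False

def smoke_test_apk(path):
    """APK exists and is non-empty."""
    p = Path(path)
    return p.exists() and p.stat().st_size > 0

def smoke_test_import(module):
    """Python module imports without error, in a fresh interpreter."""
    return subprocess.run([sys.executable, "-c", f"import {module}"],
                          capture_output=True).returncode == 0
```

None of these prove the feature works — Run 2 passed every check of this shape — but they catch the cheap class of failure before a milestone commit.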

Full test plan. The pass/fail criteria from Phase 2, executed in order. This is where unit tests, integration tests, and anything the planning phase specified actually run.

Manual test descriptions. For the silent-failure class — anything that touches a browser API, sensor, audio, or real device I/O — the orchestrator can’t run the test, but it can write one. These go into test_report.md as a checklist I run when I wake up.

Budget ceilings. Time and token budgets are hard ceilings, not suggestions. The orchestrator checks remaining budget before every major step. If there isn’t enough budget to finish the step plus a quality gate plus buffer, it skips forward to Phase 6 rather than starting something it can’t finish. Half-finished features are worse than no features.

Iteration limits. Simple = 3, Medium = 5, Complex = 8. When the cap is hit, the orchestrator stops and writes an honest status report.

Git commits. The durable proof. Every milestone is a commit with a real message. If the run crashes halfway, I still have every milestone up to that point. The principle is simple: if it’s in git, it survives compaction. The feedback loop is the product, not the agent.

The Numbers

Four real runs. Not cherry-picked.

Run 1 — Voice Note App. 8 minutes / 60-minute budget. PASS. Stack: Python + FastAPI + HTMX + Tailwind. Scope: single-page voice note app with recording, transcription, list view, delete. 0 iterations — the first pass worked end to end. Why it worked: clean target architecture, well-known stack, tight scope, and tests that actually covered the critical paths. This is what Sleep Mode looks like when everything lines up.

Run 2 — Voice Note Features. 12 minutes / 90-minute budget. FAIL. Same stack, building on Run 1. Four new features. Tests green. Build green. User flow broken because the new features were competing for the microphone via the Web Speech API. 1 iteration — and the iteration fixed a cosmetic issue, not the real bug, because the real bug was invisible to the tests. This run produced Rule 11 and Rule 12. It’s the most valuable failure I’ve had, because it taught me that “all tests pass” is not the same as “the thing works.”

Run 3 — Voice Note Features v2. 29 minutes / 90-minute budget. PASS. Same Python/FastAPI stack on the web side, plus a Kotlin/Compose client on Android. 1 iteration — a sort order tiebreaker that was inconsistent between platforms. Notable: this run took more than twice as long as Run 2 and produced a working result, where Run 2 was fast and broken. Doing it right takes longer than doing it fast. The research phase alone took 6 minutes.

Run 4 — Hardware Protocol Labs. 35 minutes / 90-minute budget. PASS. Stack: pure Markdown with embedded Python, Bash, and Cisco IOS code blocks. Scope: 7 guides, 28 labs, 110+ tasks. Classified as Complex. 0 iterations. I want to be honest about why this one was a clean pass: documentation stacks have no build failures. There’s nothing to crash. The quality gates reduced to “does the markdown render” and “do the embedded code blocks parse.” This is Sleep Mode running with the difficulty turned down — and that’s worth knowing as much as the wins on harder runs are. The lesson isn’t “Sleep Mode is great at everything.” It’s “Sleep Mode is great when the failure modes are bounded, and documentation has the most bounded failure modes of anything I’ve thrown at it.” When I run it on real code, I’m closer to Run 2’s failure rate than to Run 4’s clean pass.

The pattern across four runs is clear. Failed runs teach the rules. Successful runs produce the artifacts. Both are necessary. Anyone telling you their agent has a 100% success rate either isn’t running it on hard problems or isn’t looking closely at the output.

When to Use It (and When Not To)

Use it when:

  • Success criteria are clear and measurable. “App builds and smoke tests pass” is measurable. “Code is clean” is not.
  • The task is trial-and-error friendly. You can throw away the result and start over at zero real cost.
  • You have strong tests, or the project is small enough that you can manually verify the whole thing in five minutes.
  • You’re willing to wake up to a known-bad version and either fix it or roll it back. Not every run is a win.

Don’t use it when:

  • The decisions are subjective. Design choices, taste calls, “make it feel better” prompts — the orchestrator has no feedback signal and will confidently do the wrong thing.
  • There’s no clear stop condition. If you can’t write the pass/fail criteria in a sentence, Sleep Mode can’t either.
  • It’s a production system. The blast radius is too big. Sleep Mode is for projects and sandboxes, not live infrastructure.
  • The work requires coordination across multiple systems the orchestrator can’t see. Inter-service contracts, shared infrastructure, things that need a human in the loop.
  • You can’t manually verify the result in five minutes. If verification takes an hour, you’ve moved the bottleneck from building to checking, and checking is the expensive part.

The orchestrator is, to borrow a phrase I keep coming back to, perpetually confused, constantly making mistakes, but never giving up. That’s not a bug — that’s the design. The confusion is bounded by the rules. The mistakes are caught by the gates. The not-giving-up is what lets it finish a project while I’m asleep.

What’s Next

More runs, more rules. I expect the rule list to keep growing for at least another dozen runs before it stabilizes. Every failure that teaches me something new earns a rule. Every rule is a tax on future runs, but the tax is small and the payoff is compounding.

Architecturally, the next real change is cleaner separation between the orchestrator runtime and per-run state. Right now, a lot of the orchestrator’s behavior is hardcoded into the system prompt, which means updating the orchestrator means re-prompting. I want a proper runtime that can load rules from disk, hot-reload them between runs, and version them like any other piece of infrastructure. Related: better resume support after hard crashes. Right now a run that dies mid-Phase-5 can be restarted from the last git commit, but it’s manual. I want sleep-mode resume <run-id> to just work.

The honest open question I keep circling is this: when does the orchestrator itself become more complex than the thing it builds? Run 4 was 7 guides, 28 labs, 110 tasks — a substantial project by any measure. The orchestrator that built it is probably already bigger than the project it built. At some point I’m going to be spending more time maintaining the harness than I’m saving by using it. I don’t know where that line is yet.

If you’re reading this and you’ve solved the silent failure class — the tests-pass-but-thing-broken problem — I would really like to hear about it. That’s the one I trust least, and I suspect the answer isn’t better tests but a different kind of feedback signal entirely. Something closer to “actually drive the app and watch what happens” than “assert on return values.” I don’t know what that looks like yet.

The version of me that started Sleep Mode wanted an AI that would do the work for me. The version of me that runs it now knows the work was never the bottleneck — the feedback was.