Tim Trailor
Essay

The thirteen-hour outage and the doorman

How a Claude responder spawned itself into a kernel panic, and the structural fix: a kill switch outside the failure domain, plus a single shared doorman that gates every Claude-spawning subsystem.

On Friday night the Mac Mini that runs my personal AI setup ran a 1,400-process spawn cascade for thirteen hours, exhausted memory, starved the kernel watchdog, and panicked. The cause was an autonomous Claude responder that I had built to fix problems for me while I slept. It did the opposite. The recovery shipped today, and the structural lesson is worth writing down.

What happened

The conversation server runs a 10-minute health check. When the check failed, it called an alert endpoint, which spawned a Claude subprocess to investigate. Each Claude tool call inside that subprocess fired the project’s hook chain. Two of those hooks themselves spawned Claude subprocesses for tier-3 review and Haiku-driven prompt enrichment. Another path spawned Claude for window-title inference per terminal pane.

None of these spawns was a problem on its own. Each returned within seconds. The problem was that the trigger source kept firing. A flapping warning produced a slightly different fail-set hash on each cycle, defeating the dedupe layer. The investigator timed out under load. Each timed-out investigator left fifty orphaned subprocesses behind. Over thirteen hours the responder fired ten times. Peak: 1,400+ processes, load 500, 15GB of 16GB RAM used, 11GB compressed, kernel_task at 197% sustained.

The kernel watchdog process did not get scheduled often enough to keep a heartbeat. The kernel panicked.

The structural lesson

Software-layer safety gates that live inside the process tree they are trying to protect will fail when the tree goes haywire. The conversation server had a load gate. The responder had an internal timeout. The hooks had concurrency caps. None of them mattered, because the system that was supposed to detect the runaway was the same system the runaway had broken.

The fix is layered defence outside the failure domain.

What shipped today

Three structural caps, plus a kill switch, plus a single shared doorman.

Layer 1, kill-switch watcher. A long-running launchd-managed daemon (KeepAlive=true) counts my user's processes every five seconds, the equivalent of polling ps -u $UID but read from sysctl directly rather than forking, so its detection path does not depend on the spawn-starved machinery it is trying to save. Above 1,000 user processes it touches ~/.claude/kill_switch, sends SIGTERM to every tier-3 reviewer, alert responder, and autonomous Claude subprocess, and waits 30 seconds before SIGKILL on the stragglers. The kill switch is the file itself; even if the watcher dies, every hook checks for the file in its first three lines.
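
A minimal sketch of that watcher, assuming psutil for the fork-free read of the process table; the MARKERS strings that identify reviewer, responder, and autonomous subprocesses are placeholders, not the real identifiers:

    #!/usr/bin/env python3
    # Kill-switch watcher sketch. psutil walks the process table via sysctl,
    # so no fork/exec is needed on the detection path.
    import os, signal, time
    from pathlib import Path

    import psutil  # assumed dependency

    KILL_SWITCH = Path.home() / ".claude" / "kill_switch"
    PROCESS_CAP = 1_000        # trip threshold for this user's process count
    GRACE = 30                 # seconds between SIGTERM and SIGKILL
    MARKERS = ("tier3-review", "alert-responder", "claude-autonomous")  # placeholders

    def my_process_count() -> int:
        uid = os.getuid()
        return sum(1 for p in psutil.process_iter(["uids"])
                   if p.info["uids"] and p.info["uids"].real == uid)

    def claude_spawns() -> list:
        hits = []
        for p in psutil.process_iter(["cmdline"]):
            cmd = " ".join(p.info["cmdline"] or [])
            if any(m in cmd for m in MARKERS):
                hits.append(p)
        return hits

    def trip() -> None:
        KILL_SWITCH.touch()                        # every hook checks this file first
        targets = claude_spawns()
        for p in targets:
            try:
                p.send_signal(signal.SIGTERM)
            except psutil.NoSuchProcess:
                pass
        _, alive = psutil.wait_procs(targets, timeout=GRACE)
        for p in alive:                            # stragglers get SIGKILL
            try:
                p.kill()
            except psutil.NoSuchProcess:
                pass

    if __name__ == "__main__":
        while True:                                # launchd KeepAlive restarts us if we die
            if my_process_count() > PROCESS_CAP:
                trip()
            time.sleep(5)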

The kill switch lives at ~/.claude/kill_switch, not /tmp/. /tmp/ is wiped on reboot, which would have been a fail-open bug after the panic. There is also a one-shot LaunchAgent that runs at boot and auto-trips the kill switch if a new .panic file appeared since the last clean shutdown. The system comes back up in its safest state, not its most vulnerable.
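
The boot-time auto-trip can be as small as the sketch below. It assumes panic reports show up as .panic files under /Library/Logs/DiagnosticReports and that a clean shutdown touches a marker file; both the marker path and that layout are assumptions, not the shipped implementation:

    #!/usr/bin/env python3
    # Boot-time auto-trip sketch, run once by a one-shot LaunchAgent.
    from pathlib import Path

    REPORTS = Path("/Library/Logs/DiagnosticReports")
    MARKER = Path.home() / ".claude" / "last_clean_shutdown"   # hypothetical marker
    KILL_SWITCH = Path.home() / ".claude" / "kill_switch"

    def new_panic_since_clean_shutdown() -> bool:
        cutoff = MARKER.stat().st_mtime if MARKER.exists() else 0.0
        return any(p.stat().st_mtime > cutoff for p in REPORTS.glob("*.panic"))

    if __name__ == "__main__":
        if new_panic_since_clean_shutdown():
            KILL_SWITCH.touch()    # come back up in the safest state, not the most vulnerable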

I had originally planned to lower kern.maxprocperuid from 2,666 to 1,000. On macOS Tahoe, System Integrity Protection blocks launchctl limit maxproc even via sudo. Disabling SIP loses too many other security features, so the cap is now watcher-enforced rather than kernel-enforced. The trade-off is bounded: at the watcher's 5-second poll interval and Pattern 36's max spawn rate (~50 subprocesses per timeout), the worst-case overshoot is roughly 250 extra processes, still well below the 1,400 that crashed me.

Layer 2, alert-trigger rate limit. The /internal/alert-fired endpoint is now capped at 2 per hour and 5 per 24 hours. APNs push to my phone always fires; only the responder spawn is rate-limited. This cuts the cascade off at the source. The trigger is decoupled from the worker.
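
A sliding-window sketch of that cap; the 2-per-hour and 5-per-24-hour numbers are the real ones, while the class name and in-memory storage are illustrative:

    import time
    from collections import deque

    class AlertSpawnLimiter:
        """Sliding-window limiter for the responder spawn only."""

        def __init__(self, per_hour: int = 2, per_day: int = 5):
            self.per_hour, self.per_day = per_hour, per_day
            self.fired = deque()                   # timestamps of allowed spawns

        def allow_spawn(self) -> bool:
            now = time.time()
            while self.fired and now - self.fired[0] > 86_400:
                self.fired.popleft()               # drop entries older than 24 hours
            last_hour = sum(1 for t in self.fired if now - t <= 3_600)
            if last_hour >= self.per_hour or len(self.fired) >= self.per_day:
                return False                       # APNs push still fires; only the spawn is skipped
            self.fired.append(now)
            return True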

Layer 3, reviewer self-throttle. The tier-3 reviewer that judges ambiguous Bash commands has caps of 2 concurrent per session, 10 per 10 minutes per session, and 3 system-wide. Beyond any cap it returns ASK directly, escalating to a tap on my phone. The graceful degradation is human-in-the-loop, not silent failure.
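
A sketch of the throttle's decision path, reading the 3 system-wide as a concurrency cap and assuming every reviewer is launched from one server process; a real system-wide cap needs shared state that this sketch does not model:

    import time
    from collections import defaultdict, deque

    ASK = "ASK"    # beyond cap the reviewer answers ASK and the question escalates to my phone

    class ReviewerThrottle:
        def __init__(self):
            self.per_session = defaultdict(int)    # session_id -> running reviewers
            self.recent = defaultdict(deque)       # session_id -> reviewer start times
            self.system_wide = 0

        def try_acquire(self, session_id: str):
            now = time.time()
            window = self.recent[session_id]
            while window and now - window[0] > 600:
                window.popleft()                   # 10-minute sliding window
            if (self.per_session[session_id] >= 2
                    or len(window) >= 10
                    or self.system_wide >= 3):
                return ASK                         # degrade to human-in-the-loop
            self.per_session[session_id] += 1
            self.system_wide += 1
            window.append(now)
            return None                            # caller may spawn the reviewer

        def release(self, session_id: str) -> None:
            self.per_session[session_id] -= 1
            self.system_wide -= 1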

The doorman. Every Claude-spawning subsystem now goes through one shared gate function. Five checks in order: kill switch, system health (process count, load average, memory pressure), concurrency (5-slot pool), system rate limit, per-parent rate limit. Plus a hard queue cap of 20.
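
As a sketch, the gate reads roughly like this; the pool object and its helper methods stand in for the real server state, and only the ordering and the numbers come from the design above:

    import os
    from pathlib import Path

    KILL_SWITCH = Path.home() / ".claude" / "kill_switch"
    MAX_SLOTS, MAX_QUEUE = 5, 20

    def doorman_check(pool, parent_id: str):
        # Run the gates in order; the first failure wins and names itself.
        if KILL_SWITCH.exists():
            return False, "kill-switch"
        if (pool.user_process_count() > 1_000
                or os.getloadavg()[0] > pool.load_cap
                or pool.memory_pressure_critical()):
            return False, "system-health"
        if pool.running >= MAX_SLOTS:
            return False, "concurrency"
        if not pool.system_rate_ok():
            return False, "system-rate-limit"
        if not pool.parent_rate_ok(parent_id):
            return False, "per-parent-rate-limit"
        if len(pool.queue) >= MAX_QUEUE:
            return False, "queue-full"
        return True, "ok"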

The doorman is the single entry point for both /autonomous (when I tell Claude to keep working while I sleep) and the auto-investigation path (which replaces the old responder with something rate-limited, queued, and kill-switchable). One pool, with two-tier priority in which Tim-initiated work jumps ahead of auto-triggered work, and gates that re-evaluate at both enqueue and dequeue. The latter matters because the system can degrade between when a task arrives and when a free slot opens up.

Per-task wall-clock budget: SIGTERM at 28 minutes, SIGKILL at 30. The slot frees no matter what. Otherwise hung tasks deadlock the pool.
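
A dequeue-side sketch tying the gate and the budget together; doorman_check is the sketch above, and the task object and requeue helper are hypothetical:

    import subprocess

    TERM_AT = 28 * 60      # seconds before SIGTERM
    KILL_AFTER = 2 * 60    # further grace before SIGKILL

    def run_one(pool, task) -> None:
        ok, reason = doorman_check(pool, task.parent_id)   # gates re-run at dequeue
        if not ok:
            pool.requeue_or_drop(task, reason)             # hypothetical helper
            return
        pool.running += 1
        proc = subprocess.Popen(task.cmd, start_new_session=True)
        try:
            proc.wait(timeout=TERM_AT)
        except subprocess.TimeoutExpired:
            proc.terminate()                               # SIGTERM at 28 minutes
            try:
                proc.wait(timeout=KILL_AFTER)
            except subprocess.TimeoutExpired:
                proc.kill()                                # SIGKILL at 30 minutes
                proc.wait()
        finally:
            pool.running -= 1                              # the slot frees no matter what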

Auto-investigation, redesigned

The old responder spawned a 900-second Claude agent on every alert. The new version spawns through the doorman, writes a markdown proposal at ~/code/proposals/<id>.md, and embeds a verification probe (a shell command that exits 0 when the issue is resolved). A 30-minute LaunchAgent re-runs the probe; three consecutive passes auto-resolve the proposal. There is also a SessionEnd hook that scans recent commits for "fixes #alert-XYZ" patterns and closes the linked proposal.
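
The 30-minute re-check might look like this sketch, in which the JSON sidecar holding the probe command and pass counter is an assumption, not the real file layout:

    import json, subprocess
    from pathlib import Path

    PROPOSALS = Path.home() / "code" / "proposals"

    def recheck(proposal_id: str) -> None:
        meta_path = PROPOSALS / f"{proposal_id}.json"      # hypothetical sidecar
        meta = json.loads(meta_path.read_text())
        passed = subprocess.run(meta["probe"], shell=True).returncode == 0
        meta["consecutive_passes"] = meta.get("consecutive_passes", 0) + 1 if passed else 0
        if meta["consecutive_passes"] >= 3:
            meta["status"] = "auto-resolved"               # three green probes in a row
        meta_path.write_text(json.dumps(meta, indent=2))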

Idempotency: same fail-set hash within 24 hours updates the existing proposal rather than creating a new one. The duplicate-investigation race that threw the original responder is closed at the database level, in a SQLite IMMEDIATE transaction so concurrent alert handlers serialise.
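
A sketch of that idempotent write, with illustrative table and column names; BEGIN IMMEDIATE is the part that closes the race, because it takes SQLite's write lock before the existence check runs:

    import sqlite3, time

    def record_alert(db_path: str, failset_hash: str, details: str) -> int:
        now = time.time()
        con = sqlite3.connect(db_path, isolation_level=None)   # manage the transaction by hand
        try:
            con.execute("BEGIN IMMEDIATE")                     # write lock before the read
            row = con.execute(
                "SELECT id FROM proposals WHERE failset_hash = ? AND created_at > ?",
                (failset_hash, now - 86_400),
            ).fetchone()
            if row:
                con.execute(
                    "UPDATE proposals SET details = ?, updated_at = ? WHERE id = ?",
                    (details, now, row[0]))
                proposal_id = row[0]
            else:
                cur = con.execute(
                    "INSERT INTO proposals (failset_hash, details, created_at, updated_at) "
                    "VALUES (?, ?, ?, ?)",
                    (failset_hash, details, now, now))
                proposal_id = cur.lastrowid
            con.execute("COMMIT")
            return proposal_id
        except BaseException:
            con.execute("ROLLBACK")
            raise
        finally:
            con.close()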

What got it past me

I review my own work with a code-reviewer subagent plus Gemini 2.5 Pro plus GPT-5.4 in parallel. The two independent models caught issues my own pass missed:

  1. The first version had a file-descriptor leak in the spawn helper. Each launched subprocess kept a duplicated file descriptor in the parent watcher process. Over hours, the watcher would exhaust its descriptor table and silently stop spawning while looking healthy.
  2. The first version of the reaper used kill(pid, 0) to test whether a process was alive. If the operating system recycled the pid for an unrelated process, the slot leaked permanently. The fix is to verify the CLAUDE_AUTONOMOUS_ID environment variable on the live process, not just that the pid exists; a sketch follows this list.
  3. The first version of the proposal store had a check-then-act race on idempotency. Two concurrent alerts with the same fail-set both saw no existing record and both created new rows.
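
The pid-recycling-safe liveness check from item 2 can be a few lines, assuming psutil (whose environ() works for processes owned by the same user, which is the case here):

    import psutil   # assumed dependency

    def slot_still_owned(pid: int, expected_id: str) -> bool:
        # The pid must exist AND still carry the CLAUDE_AUTONOMOUS_ID we launched
        # it with; a recycled pid belonging to a stranger fails the second test.
        try:
            env = psutil.Process(pid).environ()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return False
        return env.get("CLAUDE_AUTONOMOUS_ID") == expected_id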

Two rounds of review caught twelve issues my own pass had missed. All twelve were absorbed.

The reply rule that fell out

While I was building this, my phone got hit with about fifty push notifications a day from a single warns-only fail-set. The dedupe layer was suppressing investigations but not pushes. I extended the dedupe to include PUSHED_TO_TIM status. First push fires; later flaps within two hours are suppressed. The first time the new code ran, the next health-check produced one log line: failset-dedupe drop. No push.
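
A sketch of the push-side dedupe, with a plain dict standing in for the real status record:

    import time

    SUPPRESS_WINDOW = 2 * 3600      # later flaps of the same fail-set inside 2h stay silent

    def should_push(store: dict, failset_hash: str) -> bool:
        # First push for a fail-set records PUSHED_TO_TIM; re-flaps inside the
        # window are dropped with a single log line.
        now = time.time()
        last = store.get(failset_hash)
        if (last and last["status"] == "PUSHED_TO_TIM"
                and now - last["pushed_at"] < SUPPRESS_WINDOW):
            print("failset-dedupe drop")
            return False
        store[failset_hash] = {"status": "PUSHED_TO_TIM", "pushed_at": now}
        return True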

Three hours of accumulated noise gone in a single deploy.

What this is the template for

Pattern 37 in my lessons file is the structural minimum for any Claude-spawning subsystem:

  1. Kill switch outside the failure domain.
  2. Trigger source rate-limited independently of the worker.
  3. Workers self-throttle with sliding-window awareness and human escalation.
  4. Persistent kill-switch path, not /tmp.
  5. Reaping of existing runaways, not just blocking new spawns.
  6. Reboot-time auto-trip on prior panic.
  7. Single shared doorman with the gates re-evaluated at dequeue, not just at enqueue.
  8. Per-task wall-clock budget enforced from inside the worker.
  9. Control-plane storage separated from telemetry storage.
  10. Memory pressure gate complementing process count.
  11. Synthetic load tests that verify recoverability under cap-hit conditions, not just that the cap fired.
  12. Permissions back to default before restoring Claude-spawning hooks.

Don’t ship a Claude-spawning subsystem missing any of those. They are not nice-to-haves.

What is still on the list

The old monthly dry-run verifier daemon for the teardown scripts. The cost line in the daily digest, broken out by Haiku versus Sonnet versus subscription window utilisation. A weekly trend chart that surfaces the 7-day envelope of process count and APNs delivery rate. None of those are safety; they are visibility.

The doorman has been live on the Mac Mini for two hours. The phone is quiet. The same protections roll out to the personal laptop tonight, and the work laptop gets them the next time I open Claude Code there.