Personal AI Operating Environment
The full technical specification for the personal AI operating environment described across this site. Architecture, memory, context, hooks, safety, review, security, daemons, skills, and the incidents that shaped each of them.
1. Overview
This document describes a personal AI operating environment built around Anthropic’s Claude Code CLI over several months of daily use. It is not a product. It is the result of one person using an AI coding assistant as the primary interface for managing infrastructure, controlling 3D printers, reviewing code, and accessing information from any device, and systematically fixing every failure that occurred along the way.
The system’s design philosophy can be stated simply: measurement over narrative, enforcement over instruction, and defence-in-depth for anything that touches the physical world. Every safety rule in the system exists because a text-based instruction failed under pressure. Every hook exists because a “don’t do this” rule was ignored when the agent was focused on completing a task. Every layer of printer protection exists because a previous layer proved insufficient during a real incident.
The environment today spans a Mac Mini server, a laptop thin client, and mobile devices connected over a Tailscale mesh network. It manages 15 persistent daemons, 21 enforcement hooks across 5 lifecycle stages, 26 reusable workflow skills, and a hybrid memory system indexing over 79,000 chunks from more than 1,400 conversations. A control plane repository gates every deployment through 78 scenario tests that execute hooks with real payloads and assert correct behaviour. Three 3D printers serve as the highest-stakes test of the safety architecture, because a firmware restart during a 20-hour print destroys a real, physical object.
The patterns described here (context budgets, incident-to-enforcement pipelines, multi-model adversarial review, persistent memory with hybrid search, context pre-assembly via a /deep-context pipeline) are not specific to this system. They are applicable to any environment where an AI agent operates with meaningful autonomy over real infrastructure.
2. Infrastructure
The infrastructure started on two machines and quickly demonstrated why that was a problem. A Mac Mini served as the always-on server; a MacBook Pro was used for development. Code existed in both places. Configuration drifted silently between them. A change made on one machine would work for days before someone discovered it had broken the other. The worst instance: a memory search MCP server was silently broken for nearly three weeks because the settings file hardcoded a path that only existed on one machine.
The solution was to make the architecture honest about what it actually is: a single-server system with thin clients. The Mac Mini is the authoritative host. All persistent daemons run there. All canonical code lives there. The laptop is permitted exactly one lightweight health-check agent; everything else happens over SSH. The phone connects through Tailscale or through a conversation server API. This isn’t elegant distributed systems design. It is the pragmatic recognition that for a single-person operation, keeping two machines in sync is harder than keeping one machine reliable.
Tailscale provides the network layer. Every machine (server, laptop, phone) sits on the same mesh network, reachable from any physical network. This matters because the system needs to be accessible from a coffee shop, from a phone on mobile data, and from the couch on home Wi-Fi, all without manual Virtual Private Network (VPN) configuration. Tailscale makes the server reachable from anywhere with a single stable IP address per device.
Three 3D printers sit on the local network. The primary is a Sovol SV08 Max, a large-format Klipper-based machine with a 500 mm × 500 mm × 500 mm build volume, a chamber heater on a Controller Area Network (CAN) bus, and four microcontrollers. A Snapmaker U1 runs Klipper with Fluidd. A Bambu A1 uses a proprietary protocol over Message Queuing Telemetry Transport (MQTT) and File Transfer Protocol Secure (FTPS). The printers exist in this document not because they are the system’s purpose, but because they are its most demanding safety test. A badly timed command to the primary printer during a 20-hour print does not just waste filament. It can warp a build plate, jam an extruder, or destroy a part that took a full day to produce. Every layer of the safety architecture described in Section 9 was born from a real incident with real consequences.
3. Control Plane
Before the control plane existed, Claude Code configuration lived wherever it had been created: hook scripts in one directory, rules in another, skills scattered across the filesystem. Deploying a change meant manually copying files. Verifying a change meant hoping nothing broke. Rolling back meant remembering what the files looked like before.
The control plane is a single private Git repository that serves as the source of truth for all Claude Code configuration: rules, hooks, agents, skills, service declarations, and host manifests. It was created during a seven-phase platform rebuild in April 2026, after a cascade of failures (described in Section 18) made it clear that unversioned infrastructure configuration was unsustainable.
Three scripts form the operational core. A deploy script copies versioned configuration from the repository into the correct locations on whichever host it runs on. A verify script runs the full scenario test suite (78 tests as of April 2026, executing in under six seconds). These tests do not just check that files exist; they feed real Claude Code JavaScript Object Notation (JSON) payloads into hook scripts and assert correct allow/deny behaviour. A drift-check script diffs configuration across all managed locations and reports divergence, running both on-demand and weekly via cron.
Each host has a YAML (YAML Ain’t Markup Language) manifest declaring its role policy. The server manifest lists 18 allowed LaunchAgents, required symlinks, required files, and forbidden files. The laptop manifest allows only a single health-check agent and explicitly forbids the conversation server, enforcing the thin-client policy in code rather than relying on memory. If a file that should only exist on the server appears on the laptop, the scheduled drift check catches it.
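A host manifest of this shape can be sketched as follows; the field names and agent labels are illustrative stand-ins, not the system’s actual schema:

```yaml
# laptop.yaml: thin-client role policy (illustrative field names)
host: laptop
role: thin-client
allowed_launch_agents:
  - com.local.healthcheck          # the single permitted agent
forbidden_files:
  - ~/Library/LaunchAgents/com.local.conversation-server.plist
required_symlinks: []
```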
The pattern here is versioned infrastructure-as-code for personal AI. The same principles that organisations apply to fleet management (declarative manifests, automated verification, controlled deployment) turn out to be just as valuable when the “fleet” is two machines and a phone. The alternative, which this system tried first, is configuration that works until it silently does not.
4. Conversation Server
The conversation server solves a specific problem: Claude Code is a Command Line Interface (CLI) tool that runs in a terminal, but the system needs to be accessible from a phone, a browser, and a native iOS app. The server is a persistent Flask daemon that brokers access to Claude CLI subprocesses, providing WebSocket and Server-Sent Event (SSE) streaming, a terminal proxy, push notifications via Apple’s Push Notification service (APNs), Live Activity updates on iOS, printer control endpoints, and system health APIs.
At 7,284 lines in a single Python file, it is the system’s largest monolith and its most instructive cautionary tale. It grew to that size through organic feature accretion. Each feature was small and reasonable in isolation, but the aggregate violated every context management principle the system now enforces (Section 6). No single Claude session can reliably edit a 7,000-line file because the “Lost in the Middle” effect means information in the middle of that context is systematically missed.
The decomposition is underway, guided by the same principles the system applies elsewhere. A target module map splits the monolith into eight packages along natural seam lines: authentication, session management, core routes, printer safety, terminal management, notifications, Live Activity logic, and health monitoring. The decomposition follows a charter-first approach: each module gets a clear statement of what it owns and what invariants it maintains before any code moves. The key invariant is that auth owns the Flask app and the global lock, sessions owns the subprocess lifecycle, and printer safety helpers are entirely self-contained, reading printer state via Hypertext Transfer Protocol (HTTP) and never touching session state.
The pattern for other engineers is the mobile access layer: a persistent subprocess broker that exposes a CLI tool as a multi-device Application Programming Interface (API). The specific implementation matters less than the architectural insight that CLI-native tools can be made universally accessible through a thin server layer, and that this server will inevitably grow unless it is decomposed early.
Four iOS apps currently consume this interface: TerminalApp (primary: browser, terminal, chat, and push-notified alerts), ClaudeControl (session management), GovernorsApp (governance document Q&A), and PrinterPilot (direct printer control), all built on TimSharedKit, a shared SwiftUI framework reused across the four app targets. Remote Control via claude.ai/code, live since March 2026, provides an additional zero-install path through Anthropic’s own interface for sessions initiated from a browser rather than the native app.
5. Memory System
Every Claude Code session starts from zero. The model has no memory of previous conversations, no awareness of decisions made yesterday, no knowledge of incidents resolved last week. For a system used daily to manage real infrastructure, this is not a minor inconvenience. It means every session risks repeating a mistake that was already analysed and resolved.
The memory system is a two-tier search engine exposed as a Model Context Protocol (MCP) server, indexing all conversation history into both semantic and keyword search backends. The semantic tier uses ChromaDB with local Open Neural Network Exchange (ONNX) embeddings. No API key required, no external dependency. The keyword tier uses SQLite Full-Text Search version 5 (FTS5) with boolean operators. Together they index over 79,000 chunks from more than 1,400 conversations.
Both search types exist because each fails where the other succeeds. Semantic search finds conceptually related content (“printer safety incidents”) but reliably misses specific dates, IP addresses, and error messages. Keyword search finds exact strings (“2026-03-11 FIRMWARE_RESTART”) but misses conceptual connections. The system’s retrieval protocol always specifies which search type to use first based on the query: date-specific queries start with keyword search; open-ended questions start with semantic search; both expand into the other tier for completeness.
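The routing heuristic can be sketched as below; the patterns shown are assumptions for illustration, not the system’s actual retrieval rules:

```python
import re

# Illustrative patterns for queries naming exact artefacts: ISO dates,
# IP addresses, and uppercase identifiers such as error codes or macros.
EXACT_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),            # ISO dates
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IP addresses
    re.compile(r"\b[A-Z][A-Z0-9_]{3,}\b"),       # error codes, G-code macros
]

def first_tier(query: str) -> str:
    """Route exact-artefact queries to keyword (FTS5) search first;
    open-ended questions start with semantic (ChromaDB) search.
    Each tier then expands into the other for completeness."""
    if any(p.search(query) for p in EXACT_PATTERNS):
        return "keyword"
    return "semantic"
```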
Alongside the search engine, canonical knowledge lives in structured topic files, one file per topic, covering printers, infrastructure, lessons learned, applications, incidents, and behavioural feedback. An index file (MEMORY.md) is loaded into every session’s system prompt, kept under 200 lines following Anthropic’s own guidance on instruction file length. Topic files are loaded on demand when relevant.
The memory architecture enforces a strict precedence rule between layers. Topic files are the curated, current truth; they win on conflict. Compressed session summaries (see Section 7) are routing and recall, not authoritative on facts. Raw session transcripts arbitrate when summary and topic disagree. ChromaDB and FTS5 indices are derived and can be regenerated from source at any time. The precedence rule lives in CLAUDE.md and is enforced as a convention that /dream promotion respects: insights can move from compressed into topics, never the reverse.
Memory consolidation happens through a mechanism called /dream. A hook at session end checks whether consolidation is due. If so, it sets a flag that the next session start detects, triggering a background subagent that reviews recent sessions, proposes updates to topic files, and consolidates learnings. Memory topic files are version-controlled in a dedicated Git repository, automatically pulled at session start and pushed at session end, ensuring cross-machine consistency.
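The flag handshake between session end and session start can be sketched like this; the threshold and flag location are illustrative assumptions:

```python
from pathlib import Path

def session_end_check(sessions_since_dream: int, flag: Path,
                      threshold: int = 10) -> None:
    """SessionEnd hook: set the flag when consolidation is due."""
    if sessions_since_dream >= threshold:
        flag.touch()

def session_start_check(flag: Path) -> bool:
    """SessionStart hook: consume the flag; True means trigger /dream."""
    if flag.exists():
        flag.unlink()  # one-shot: consuming the flag prevents re-triggering
        return True
    return False
```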
The pattern here is persistent hybrid-search memory for AI agents. The specific backends (ChromaDB, FTS5) matter less than the principles: dual search modalities that complement each other’s weaknesses, structured topic files that are human-readable and version-controlled, automatic indexing so no conversation is lost, a precedence rule that prevents poisoning at the summary layer, and a consolidation process that prevents knowledge rot.
6. Context Management
The most important lesson in this system was discovering that technical context limits are not productive context limits. Claude can accept 200,000 tokens of context. It cannot reliably use 200,000 tokens of context. Research by Liu et al. (2023), in “Lost in the Middle: How Language Models Use Long Contexts”, demonstrated that language model performance is often highest when relevant information appears at the beginning or end of the input, and significantly degrades when information is placed in the middle. With 20 documents as input, accuracy dropped to roughly 25% for information at middle positions compared to roughly 42% at the end.
This is not a theoretical concern. The system’s school governance application (Section 17) loaded a 100,000-token document corpus into context and asked about a specific meeting. The meeting minutes were present in the corpus, but buried in the middle, at line 192 of a 6,664-line file. The model missed them entirely. A school governor relying on this system for meeting preparation received an incomplete answer about a meeting whose minutes were right there in the context.
The response was to establish concrete productive ceilings, enforced by tests:
- 1,500 lines for a single code file being edited. Beyond this, edits become unreliable because the model cannot hold the full file in productive attention. The 7,284-line conversation server monolith is exhibit A; its decomposition is driven by this ceiling.
- 200 lines for any CLAUDE.md instruction file. This is Anthropic’s own stated guidance. Longer instruction files consume more context and reduce adherence to the instructions they contain.
- 30,000 to 50,000 tokens for loaded corpus. Beyond this, retrieval-then-load outperforms full-load. The governance application’s 100,000-token corpus now uses query-type routing instead of full context loading.
- Three or fewer subsystem files per edit. If an edit requires loading more than three files from a subsystem, the module boundary is wrong.
These ceilings are enforced through the control plane’s test suite, which checks file sizes and CLAUDE.md lengths. Per-subsystem CLAUDE.md files scope the model’s attention: instead of one project-level instruction file trying to cover everything, each subsystem directory has its own CLAUDE.md with a module map showing exactly which line ranges to load for which type of change. The conversation server’s CLAUDE.md, for instance, maps eight target modules to their line ranges, so a printer change loads only lines 1025 to 1370 and 3129 to 4372, not the full 7,284 lines.
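A minimal sketch of one such ceiling check, assuming the test suite receives file contents as text (the ceilings mirror those listed above; the function name is illustrative):

```python
CEILINGS = {
    "code_file_lines": 1_500,        # single code file under edit
    "instruction_file_lines": 200,   # any CLAUDE.md
}

def check_instruction_file(name: str, text: str) -> list[str]:
    """Return a violation message if an instruction file exceeds its budget."""
    lines = text.count("\n") + 1
    limit = CEILINGS["instruction_file_lines"]
    if lines > limit:
        return [f"{name}: {lines} lines exceeds the {limit}-line ceiling"]
    return []
```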
The pattern is context budget management: treat the model’s productive attention as a scarce resource with measurable limits, not as an infinitely expandable container. Every file, every instruction, every corpus load should have a budget justification. Anthropic’s own subagent mechanism (documented in their Claude Code docs) is the official answer to context protection: isolated context windows for side tasks that would otherwise flood the main conversation. The community’s “12 Factor Agents” framework (Factor 3: “Own Your Context Window”) converges on the same principle from the practitioner side.
7. Deep-Context Pipeline
Context ceilings (Section 6) describe how much context to give a task. The deep-context pipeline describes how to assemble that context before the task starts. The two belong together, but the pipeline deserves its own section because it is the most recent, most substantial addition to the system and it inverts an assumption that most agent workflows quietly accept: that the relevant information will be found at task time, if the model looks for it.
The pipeline treats context as a first-class artifact. The /deep-context <brief> skill runs before a high-stakes task and produces a task-specific file called context.md. The file is under 50,000 tokens (tunable), structured, deduplicated, and cited by source. The task itself is then spawned as a sub-session with the brief and context.md as its only input. The sub-session does not need to search; the search has already happened.
The pipeline has four stages.
Pre-filter. The corpus of things to consider is three separate stores. Topic files (curated current truth, hand-maintained). Compressed session summaries (one per closed session, generated by a separate compression pass, stored at memory/sessions/compressed/YYYY/). Raw session transcripts (the JSON Lines files Claude Code writes, roughly 800 as of April 2026, totalling approximately 600 megabytes). The pre-filter runs five queries against these stores: time window, topic overlap with the brief, file-path overlap with the brief, FTS5 keyword match, and ChromaDB semantic match. The union typically returns 20 to 80 candidate sessions plus the relevant topic files.
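The union step reduces to set arithmetic; in this sketch the query callables stand in for the five real queries (time window, topic overlap, path overlap, FTS5, ChromaDB):

```python
from typing import Callable, Iterable

Query = Callable[[str], Iterable[str]]

def prefilter(brief: str, queries: list[Query]) -> set[str]:
    """Union candidate session IDs from every pre-filter query. A session
    flagged by any one query survives into the fan-out stage."""
    candidates: set[str] = set()
    for query in queries:
        candidates |= set(query(brief))
    return candidates
```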
Fan-out. Three agents run in parallel. Agent A reads topic files. Agent B reads the candidate compressed sessions. Agent C walks the codebase via Glob and Grep. None of them reads full raw transcripts; they read the compressed summaries only. Each agent returns relevant excerpts with source tags and a list of session identifiers flagged for deeper reading. The flagged-session list is the way the fan-out says “the compressed summary is suggestive but not enough; someone should read the raw transcript before answering this part of the brief”.
Aggregation. The aggregator reads the flagged raw transcripts (pre-stripped to remove tool definitions and truncated tool outputs) and assembles the final context.md. The structure is fixed: recent state from topics, relevant history from compressed plus raw re-reads, unresolved threads, files likely to touch, and a citations block where every claim is tagged by source. The tagging is mechanical, done at aggregation time, not self-reported by the model. This matters because the model’s own account of where information came from is untrustworthy; the tags are the record the system itself kept.
Sub-session. The original task is spawned as a fresh Claude session with the brief and context.md, nothing else. The full context budget is pre-committed to substantive information. The sub-session does not waste tokens searching.
The aggregator is the single point of failure. If it fabricates, the whole pipeline fabricates. If it drops something load-bearing, the whole pipeline misses it. The pipeline ships only against a golden-context benchmark: three past tasks where the ideal context.md was hand-curated, with the aggregator scored against claim coverage. Ship threshold is 80%. Below that, iterate. Above that, ship. The benchmark is also a regression test; any future change to the aggregator prompt, the pre-filter, or the fan-out has to be rerun against the golden set before being accepted.
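Scored naively, claim coverage against a golden set is a set intersection; this sketch assumes claims have been normalised to comparable strings:

```python
def claim_coverage(golden: set[str], produced: set[str]) -> float:
    """Fraction of hand-curated golden claims the aggregator reproduced.
    The ship threshold described above would be 0.8."""
    if not golden:
        return 1.0
    return len(golden & produced) / len(golden)
```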
The pipeline is invoked explicitly. It is not auto-triggered from plan mode. The overhead (a few minutes of compute per run) is only worth it for tasks that warrant the preparation: architectural changes, migrations, multi-file refactors in core systems, anything with blast radius. The rule of thumb is: if the task would warrant plan mode, it warrants /deep-context.
What the pipeline has caught, in its first weeks of production use, clusters into three classes. Forgotten failed attempts: previous attempts at the same problem that had been abandoned, with reasons captured in compressed summaries; without the pipeline the attempt would have been repeated. Un-migrated consumers: architectural changes that miss downstream consumers, with the code fan-out stage explicitly listing identifiers that were about to be broken. Policy reversals: decisions from months earlier that were about to be reversed without justification, where the compressed summary captured the original reasoning.
The pattern generalises beyond this system: stop treating context as a free resource, manufacture it before the task, benchmark the manufacturing, and enforce a precedence rule between the stores that feed it.
8. Hooks and Enforcement
Text rules fail under pressure. This is the single most important lesson in the system, and it was learned through direct financial and operational consequences.
The first significant failure: a rule said “never use the API key directly, always use the subscription.” The agent, focused on completing a task, spawned a Claude CLI subprocess with the API key in the environment. The resulting bill was £60. Small in absolute terms, but it demonstrated that a text instruction provides zero protection when the agent is optimising for task completion. The fix was four lines of code: a function called env_for_claude_cli() that strips the API key from the environment before any CLI spawn. The mistake became structurally impossible.
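In spirit, the function looks something like the following; the environment variable name and docstring are assumptions, since the original source is not reproduced here:

```python
import os

def env_for_claude_cli() -> dict[str, str]:
    """Build the environment for a Claude CLI subprocess.

    Stripping the API key forces the CLI back onto subscription auth,
    making accidental pay-per-token billing structurally impossible.
    """
    env = dict(os.environ)
    env.pop("ANTHROPIC_API_KEY", None)  # assumed variable name
    return env
```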
The second: a rule said “never send FIRMWARE_RESTART during a print.” A daemon, attempting to recover from a printer error, sent FIRMWARE_RESTART while a 12-hour print was running. The print was destroyed. The text rule existed. The daemon did not read text rules; it executed code. The fix was a Klipper macro that checks print state before allowing the command, a PreToolUse hook that intercepts the command before it reaches the printer, and an absolute policy that FIRMWARE_RESTART requires explicit human approval regardless of printer state.
Claude Code hooks are shell scripts triggered by lifecycle events. They receive JSON on standard input describing the tool call or event, and they return structured JSON responses. An exit code of 0 means allow; an exit code of 2 means deny, which blocks the action entirely. This mechanism transforms text rules into technical enforcement.
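A deny hook following this contract can be sketched as below; the payload field names follow Claude Code’s hook JSON for Bash tool calls, while the rule itself is illustrative:

```python
import json
import sys

ALLOW, DENY = 0, 2  # exit 0 allows the tool call; exit 2 blocks it

def decide(payload: dict) -> int:
    """Inspect a PreToolUse payload and return an exit code."""
    command = payload.get("tool_input", {}).get("command", "")
    if "FIRMWARE_RESTART" in command:
        return DENY  # illustrative rule: never auto-approve firmware restarts
    return ALLOW

if __name__ == "__main__":
    payload = json.load(sys.stdin)
    code = decide(payload)
    if code == DENY:
        # stderr text is surfaced back to the agent as the denial reason
        print("FIRMWARE_RESTART requires explicit human approval",
              file=sys.stderr)
    sys.exit(code)
```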
The system uses 21 hooks across five lifecycle stages:
- SessionStart hooks run when a new Claude session begins. They validate that every referenced hook and MCP launcher actually exists on disk, verify that the memory system can query its database (not just that the process is running), pull memory updates from Git, and check that the MEMORY.md index matches actual topic files. These hooks catch configuration drift before a session starts working with stale or broken tools.
- SessionEnd hooks run when a session completes. They push memory changes to Git, index the session transcript into the search backends, trigger drift checks, and send a notification confirming the session ended. These hooks ensure no session’s work is lost.
- PreToolUse hooks intercept tool calls before execution. This is where the safety-critical enforcement lives: a printer safety hook that checks print state and enforces the command allowlist, a protected path hook that blocks dangerous launchctl operations, a credential leak hook that scans file writes for API keys and passwords, and guards for file renames and commit operations. Each hook exists because a specific incident demonstrated that text rules were insufficient.
- PostToolUse hooks run after tool execution. An audit log hook records every Bash command for the audit trail. A lint hook runs language-specific linters on edited files and logs findings.
- UserPromptSubmit and Stop hooks manage session lifecycle events. A Live Activity hook updates the iOS status display on every user message. A dream-check hook determines whether memory consolidation is due when a session ends.
The principle is straightforward: every incident should produce a hook that makes repetition structurally impossible. A text rule is a request. A hook is an enforcement. The difference is whether the system still works correctly when the agent is distracted, hurried, or optimising for something else. After several months of operation, the system has found that hooks written to enforce lessons learned are more reliable than any amount of carefully worded instruction.
9. Printer Safety
Printer safety is the system’s showcase for defence-in-depth, not because printers are the most important component, but because they provide the clearest demonstration of why single-layer protection fails. A Klipper-based 3D printer accepts G-code commands over HTTP. Many of those commands are harmless at idle but catastrophic during a print. A firmware restart during a 20-hour print does not just cancel the job. It de-energises the stepper motors, causing the print head to drop onto the partially completed object, warping or destroying both the print and potentially the build surface.
The safety architecture has six layers, each born from a specific incident:
- Layer 1: Text rules. A CLAUDE.md rules file defines the command allowlist and the FIRMWARE_RESTART policy. This layer catches the obvious cases, when the agent is paying attention and consults its instructions. It failed when the agent was focused on recovering from an error and did not check its rules first.
- Layer 2: PreToolUse hook. A shell script intercepts every Bash command. When the printer reports its state as “printing” or “paused,” the hook checks the command against the allowlist. Only seven commands are permitted during active prints: display messages, Z-offset adjustments, speed changes within safe bounds, flow rate changes within safe bounds, fan control, pause/resume, and the confirmed cancel command. Everything else returns a deny code and blocks execution. This layer exists because a daemon sent FIRMWARE_RESTART during a print on 11 March 2026, destroying a 12-hour job.
- Layer 3: Klipper macros. Safety checks embedded in the printer’s own firmware. The SAVE_CONFIG macro blocks itself if a print is active, because on 5 March 2026 a SAVE_CONFIG during a print killed a 12-hour job. The G28 (home axes) macro checks whether axes are already homed during a print, preventing unnecessary re-homing that would crash the print head into the bed.
- Layer 4: Daemon state checks. The printer monitoring daemon checks print state before any recovery action. If a print is active, it alerts instead of attempting automated recovery. This layer exists because automated recovery was the most common cause of print destruction. Every “helper” daemon that tried to fix printer problems during prints made things worse.
- Layer 5: Absolute human-approval policy. FIRMWARE_RESTART and RESTART are never sent without explicit human approval, regardless of printer state, even after a print finishes. This is the only absolute rule in the system: no exceptions, no automation, no “it seems safe” judgment calls.
- Layer 6: Audit trail. Every printer command is logged via a PostToolUse hook. This does not prevent incidents, but it ensures they can be diagnosed. When something goes wrong during a print, the audit log provides the exact sequence of commands that led to the failure.
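The Layer 2 allowlist check reduces to a prefix match. The G-code prefixes below are a plausible reading of the seven permitted command classes, not the hook’s actual list:

```python
# Plausible print-safe prefixes for the seven permitted command classes.
PRINT_SAFE_PREFIXES = (
    "M117",              # display message
    "SET_GCODE_OFFSET",  # Z-offset adjustment
    "M220",              # speed factor, bounds checked elsewhere
    "M221",              # flow rate, bounds checked elsewhere
    "M106", "M107",      # fan control
    "PAUSE", "RESUME",   # pause and resume macros
    "CANCEL_PRINT",      # the confirmed cancel command
)

def allowed_during_print(gcode: str) -> bool:
    """Deny-by-default: anything not on the allowlist is blocked mid-print."""
    return gcode.strip().upper().startswith(PRINT_SAFE_PREFIXES)
```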
The proof that the architecture holds: during a full system audit in April 2026, a multi-model review team examined the entire infrastructure while a 20-hour print was running on the primary printer. The audit involved extensive tool calls, file reads, and system probes. The print completed successfully. Every layer held.
The pattern is defence-in-depth with progressive enforcement: text rules for the common case, hooks for the agent case, firmware for the hardware case, daemon checks for the automated case, human approval for the irreversible case, and audit trails for the diagnostic case. No single layer is trusted to be sufficient. Each layer catches what the layer above it misses.
10. Multi-Model Review
Single-model code review is unreliable for the same reason that single-author code review is unreliable: the reviewer shares the author’s framing biases. When Claude reviews code that Claude wrote, it tends to evaluate the code on the terms the code was written on, rather than questioning whether those terms were correct. This is not a hypothetical concern. It is an observed pattern where Claude’s self-review consistently missed structural issues that an independent reviewer would have caught.
The system’s response is multi-model adversarial review, using three models with different training: Claude (Anthropic), Gemini 2.5 Pro (Google), and GPT-5.4 (OpenAI). The key insight is that models from different organisations, trained on different data with different objectives, produce genuinely independent analytical perspectives. They do not just find different bugs. They frame problems differently, prioritise different concerns, and challenge different assumptions.
The /debate protocol structures this independence rigorously. In round 0, each model receives the question blind (without seeing the others’ positions) and produces an initial assessment with confidence intervals. This prevents anchoring. In subsequent rounds, each model sees the others’ reasoning and must explicitly update its position, stating what it found persuasive and what it still disagrees with. A rotating devil’s advocate role forces one model each round to defend the position it finds weakest, preventing consensus drift. Each model must state retraction criteria: what specific evidence would change its assessment. The debate continues until genuine convergence or a maximum number of rounds.
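One way to represent the protocol’s moving parts in code; the fields and the spread threshold are illustrative, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Position:
    model: str
    stance: str
    confidence: float           # 0.0 to 1.0, stated with the assessment
    retraction_criteria: str    # evidence that would change the stance

def converged(positions: list[Position], spread: float = 0.15) -> bool:
    """A crude convergence signal: confidence spread within tolerance.
    Per the protocol, immediate convergence is a reason for suspicion,
    not celebration."""
    confidences = [p.confidence for p in positions]
    return max(confidences) - min(confidences) <= spread
```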
A critical discovery was that convergence should be questioned, not celebrated. When all three models agree immediately, it often means they share a common training-data bias rather than that the answer is obviously correct. The most valuable debates are the ones where models maintain principled disagreements, because those disagreements highlight genuine trade-offs that a single perspective would flatten into a false consensus.
For routine code review, a lighter-weight /review pipeline serves as the standard pre-commit gate: automated linting, static analysis, a code-reviewer subagent that checks against the lessons-learned database, and optional independent review from one or both external models. The reviewer subagent applies an eight-point checklist derived from the system’s documented error patterns, focusing on the specific categories of mistakes this system has actually made.
The pattern is adversarial multi-model review: use the genuine independence of differently-trained models to create review diversity that a single model, no matter how capable, cannot provide alone.
11. Security
The system’s security posture evolved from convenience to defence-in-depth through real incidents, not through planning. Early on, the API key was available in the environment, credentials lived in plaintext files, and security was a matter of “don’t do the wrong thing.” The £60 API bill changed that.
The exemplar of the incident-to-enforcement pattern is env_for_claude_cli(), a four-line function in the shared utilities module. When the system needs to spawn a Claude CLI subprocess, this function builds the environment variables. It strips the API key and forces subscription authentication. The function’s source code comment references the exact incident and date that motivated it. Before this function existed, any code that spawned a CLI subprocess could accidentally include the API key, resulting in pay-per-token billing instead of the flat-rate subscription. After this function, the mistake is structurally impossible: not because anyone remembers to avoid it, but because the code does not allow it.
This pattern (real loss, then code that makes the loss impossible to repeat) runs through every security control in the system. A credential leak hook scans every file write and edit for patterns matching API keys, passwords, and tokens, blocking the write before it reaches disk. A protected path hook blocks dangerous system commands (like those that start or stop persistent services) without explicit approval. Pre-push hooks prevent code from being pushed from the wrong machine. Weekly automated scanning checks for hardcoded credentials in service configuration files, verifies that quarantined secrets have not resurfaced, and confirms that only authorised processes write to shared authentication tokens.
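The credential-leak scan is, at heart, pattern matching over pending writes; these two patterns are illustrative stand-ins for the hook’s real set:

```python
import re

# Illustrative secret patterns; the real hook maintains a broader set.
SECRET_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_\-]{20,}"),            # Anthropic-style keys
    re.compile(r"(?i)password\s*=\s*['\"][^'\"]+['\"]"),  # inline passwords
]

def blocks_write(text: str) -> bool:
    """Return True if a pending file write contains a credential-shaped
    string, in which case the hook denies the write before it hits disk."""
    return any(p.search(text) for p in SECRET_PATTERNS)
```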
Credential management follows a single-writer principle. An OAuth token file was being written by seven different scripts, each refreshing with different permission scopes. The last writer would silently strip scopes that other scripts needed. The token would work for one service and break another. Re-authentication would fix it temporarily until the next refresh overwrote it again. The solution was to designate a single background service as the sole writer, running every 30 minutes with the full scope set. Every other script reads the token but never writes it back. The problem disappeared permanently.
The public repository policy is simple: code files may be pushed to public GitHub repositories, but all secrets live in a credentials file that is gitignored on every machine. Public repositories use redacted placeholders where secrets would appear. Memory files are private and can contain operational details. The distinction is maintained by hooks, not by vigilance.
The credential-rotation daemon is the active-maintenance counterpart to the credential-leak hook: it runs daily, scanning a manifest of managed secrets against per-secret max-age thresholds and rotating any that have exceeded their window. State lives in a dedicated JSON file on the Mac Mini. The hook prevents new secrets from being written in plaintext; the daemon ensures existing secrets age out on schedule.
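The daemon’s core decision, which secrets have outlived their window, can be sketched as a pure function over the manifest. The field names here are assumptions, not the real schema:

```python
import time

def secrets_due_for_rotation(manifest: list, now: float = None) -> list:
    """Names of secrets whose age exceeds their per-secret threshold.

    Each manifest entry is assumed to carry 'name', 'last_rotated'
    (epoch seconds), and 'max_age_seconds'.
    """
    now = time.time() if now is None else now
    return [entry["name"] for entry in manifest
            if now - entry["last_rotated"] > entry["max_age_seconds"]]
```

The daily run rotates everything this function returns and writes the new timestamps back to the state file.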
One plaintext secret is retained by design: the login-keychain password itself, in a chmod 600 file that the unlock-keychain LaunchAgent reads at boot to make every other daemon’s secrets accessible on a headless machine. This is the system’s only documented plaintext-secret exception, explicitly flagged in the host service manifest as accepted risk rather than as an oversight. The alternative, requiring an interactive login each time the Mac Mini reboots, would break the entire daemon fleet’s boot path.
12. Security Testing and Penetration Testing
Security posture decays silently. This is not a theoretical claim. It was demonstrated vividly during a multi-model audit in April 2026 that exposed five zombie services that had been crash-looping for 21 days, plaintext credentials in world-readable configuration files, and passwordless sudo rules for dangerous commands, all in a system that was ostensibly secure. Advisory checks that detected problems but exited with success codes had been running the entire time. They were not security controls. They were log entries nobody read.
The lesson reshaped the system’s approach to security testing. Rather than treating security as a static configuration set once and trusted, the environment is designed to be continuously probed through three complementary mechanisms.
Automated weekly scanning runs every Monday at 04:00. The scan covers five areas: searching all service configuration files for hardcoded credentials (any match is a failure, not a warning); verifying that quarantined credential files have not resurfaced (one reappeared within an hour of initial quarantine from an unidentified source, proving that deletion alone is insufficient); checking for service duplication (the audit found two simultaneous instances of a terminal service with different passwords); scanning for passwordless sudo entries on dangerous commands like reboot and shutdown (a lateral risk that bypasses printer safety controls); and verifying that no more than two code paths write to the OAuth token file (additional writers indicate credential management drift). The scan exits non-zero on any failure, sending a priority alert to the operator’s phone.
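The scan’s structure matters more than any individual check: every check is a predicate, and any failure makes the whole run exit non-zero. A skeleton with the check bodies elided (names mirror the five areas above; they are not the script’s real identifiers):

```python
import sys

def run_scan(checks: dict) -> int:
    """Run every check; any failure makes the exit code non-zero."""
    failures = [name for name, check in checks.items() if not check()]
    for name in failures:
        print(f"FAIL: {name}")  # a match is a failure, not a warning
    return 1 if failures else 0

# Real implementations elided; each returns True only when the area is clean.
CHECKS = {
    "no-hardcoded-credentials": lambda: True,
    "quarantined-secrets-absent": lambda: True,
    "no-duplicate-services": lambda: True,
    "no-passwordless-sudo": lambda: True,
    "oauth-writers-at-most-two": lambda: True,
}

if __name__ == "__main__":
    sys.exit(run_scan(CHECKS))  # non-zero exit triggers the priority alert
```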
Multi-model adversarial security review applies the three-way debate protocol (Section 10) to security-focused analysis. The three models receive the system’s security configuration and independently identify attack surfaces. The rotating steelman mechanism forces one model each round to defend the current security posture while the other two probe it, preventing the consensus-drift problem where all models agree something is acceptable without genuine challenge. The April 2026 audit is the canonical example: three models independently identified the same core vulnerability pattern (declarative security claims not matched by filesystem reality) from three different analytical angles.
Scheduled penetration testing systematically exercises every security boundary through automated adversarial sessions. Hook enforcement testing feeds forged dangerous commands into security hooks and asserts they are blocked, then feeds benign commands and asserts they are allowed. Both paths are tested, because a false alarm that desensitises operators is as dangerous as a hole. Credential exposure scanning performs deep searches across all code, configuration, and skill definitions for secret patterns, catching anything that entered through channels the real-time hook does not monitor: manual edits, git pulls, backup restores. Network surface auditing enumerates all listening ports and flags any undeclared listener, motivated by the discovery of an ngrok tunnel that had been exposing a terminal to the public internet for 12 days without detection (Pattern 22 in the lessons-learned database). Privilege boundary testing verifies that the billing safety function is consistently used for all CLI spawns and that hooks cannot be bypassed through encoding tricks.
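The positive/negative hook test can be expressed as a tiny harness. The demo hook below is an inline stand-in; the real scenario tests execute the actual hook scripts with recorded payloads. The harness assumes hooks read a JSON payload on stdin and block by exiting non-zero:

```python
import json
import subprocess
import sys

# Stand-in hook for demonstration: blocks (exit 2) any payload whose command
# mentions FIRMWARE_RESTART and allows everything else.
DEMO_HOOK = [
    sys.executable, "-c",
    "import sys, json; p = json.load(sys.stdin); "
    "sys.exit(2 if 'FIRMWARE_RESTART' in p.get('command', '') else 0)",
]

def hook_blocks(hook_cmd: list, payload: dict) -> bool:
    """Run a hook with a JSON payload on stdin; non-zero exit means blocked."""
    result = subprocess.run(hook_cmd, input=json.dumps(payload),
                            capture_output=True, text=True)
    return result.returncode != 0

# Positive path: the dangerous command must be blocked.
assert hook_blocks(DEMO_HOOK, {"command": "FIRMWARE_RESTART"})
# Negative path: a benign command must pass, keeping alarms trustworthy.
assert not hook_blocks(DEMO_HOOK, {"command": "ls -la"})
```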
The incident response protocol follows the five-layer Root Cause Analysis (RCA) described in Section 18. The immediate fix comes first: quarantine the credential, disable the service, close the port. Then the failure is classified by control class. Then, critically, a new automated check is added to prevent regression. Every security incident results in a new assertion in the weekly scan, a new hook, or a new invariant in the verification suite. The principle: every incident should make the system structurally incapable of repeating that specific failure.
The control plane repository itself is Continuous Integration (CI) enforced. GitHub branch protection requires status checks on main; a CODEOWNERS file gates enforcement-logic surfaces (hooks, rules, service manifests) behind explicit review. The portable scenarios workflow runs the full 78-test scenario suite server-side, so a pull request that would silently bypass a local hook fails at the repository level. An hourly enforcement-state sentinel polls the self-policing gates, and a Mac Mini-side CI failure poller pushes APNs notifications to the operator within minutes of a failing main-branch build. Bypasses of any hook or gate are logged to a bypass audit file that is reviewed weekly. The intent is to make the repository itself untrustworthy only in detectable ways: any route around the enforcement produces an alert, and the alert’s existence is itself part of the assertion set.
13. Daemon Layer
Persistent services matter because the system must be available continuously. Claude must be reachable from any device at any time. Printers must be monitored during prints that run for 20 hours. Backups must run on schedule. Authentication tokens must be refreshed before they expire. None of these requirements can be met by on-demand processes that start when someone opens a terminal.
All services run as macOS LaunchAgents in KeepAlive mode, meaning the operating system automatically restarts them if they crash. Fifteen daemons run on the Mac Mini, each with a specific purpose and a specific probe that verifies functionality, not just process existence.
The conversation server daemon is the system’s front door. It runs the Flask server described in Section 4, brokering access to Claude CLI subprocesses. Its health probe does not just check that the process is running. It calls the health endpoint and verifies that internal threads are alive, that authentication state is valid, and that the subprocess bridge is responsive.
The printer snapshot daemon monitors all connected printers with adaptive polling: every 30 seconds during active prints, every 5 minutes when idle. It records state snapshots and Estimated Time of Arrival (ETA) data for the iOS app’s progress charts. Critically, this daemon observes without intervening. Earlier iterations attempted automated recovery (restarting firmware, adjusting settings, clearing errors). Every such attempt caused more damage than it prevented. The observation-without-intervention principle is now a hard rule for printer monitoring: the daemon’s job is to collect data and alert, never to act.
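Under the observation-without-intervention rule, the daemon’s loop body reduces to: record, alert, choose the next interval, and nothing else. A sketch with assumed state-field names:

```python
def poll_interval(state: dict) -> int:
    """Adaptive polling: 30 s during an active print, 5 min when idle."""
    return 30 if state.get("printing") else 300

def handle_snapshot(state: dict, snapshots: list, alerts: list) -> int:
    snapshots.append(state)            # observe: record the snapshot
    if state.get("error"):
        alerts.append(state["error"])  # alert: notify the operator
    # Deliberately no recovery branch: the daemon never sends commands.
    return poll_interval(state)
```

The absence of any command-sending path is the point: intervention authority stays with the human.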
The health check daemon runs hourly, producing a system-wide health report consumed by the iOS app. Its findings surface as red/yellow/green indicators on the phone. This daemon’s design carries a lesson about refresh cycles: after fixing a health check script, the developer declared “all clean” without running the check manually, forgetting that the hourly schedule meant the iOS app was still showing stale results from the pre-fix run. The lesson (Pattern 20): after editing any monitor, always run it manually to refresh its output, then verify the downstream consumer shows fresh results.
The backup daemon runs at 03:00 daily, backing up 257 files (code, configuration, credentials, memory topics, service definitions, certificates, the full control plane repository, and the OAuth plus APNs signing material) to Google Drive. The token refresh daemon runs every 30 minutes as the sole writer of the OAuth token file, preserving all scopes (Section 11). The governance document sync daemon pulls school governance documents weekly. A date monitoring daemon watches an external website for schedule changes, alerting when new dates appear.
The pattern for daemon design: every daemon must have a functional probe that tests what it does, not just whether it is running. A process check (“is the PID alive?”) catches crashes. A functional probe (“does the API return correct data?”) catches silent failures, broken configurations, expired credentials, and stale state: all the failure modes that killed this system’s services for days or weeks before anyone noticed.
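The distinction can be made concrete. A process check asks the operating system about a PID; a functional probe fetches the health endpoint and asserts on its contents. The field names below are assumptions about the health payload’s shape, not the server’s real schema:

```python
import json
import urllib.request

def evaluate_health(health: dict) -> bool:
    """Functional verdict: every reported internal must be good."""
    return bool(health.get("threads_alive")
                and health.get("auth_valid")
                and health.get("bridge_responsive"))

def probe(base_url: str) -> bool:
    """Fetch the health endpoint and evaluate it; any failure is unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return evaluate_health(json.load(resp))
    except Exception:
        return False
```

A crashed process, an expired credential, and a dead internal thread all produce the same answer here: unhealthy.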
Five further daemons joined the manifest in April 2026 as the enforcement surface grew. The credential-rotation daemon (Section 11) rotates managed secrets on schedule. The trend-tracker daemon captures per-run compliance metrics and enforces a ratchet: the “Persistent 9/10” policy prevents the monthly compliance score from regressing below 0.9 without an explicit override. The acceptance-tests daemon runs a 41-test deterministic compliance suite on a fixed interval, producing a numeric score consumed by the iOS home tab. The CI failure poller watches GitHub Actions and pushes notifications on main-branch failures. The memory-indexer continuously streams new JSON Lines (JSONL) session transcripts into the ChromaDB and FTS5 backends, so a session’s content is searchable within minutes of the session ending rather than waiting for the next /dream cycle. The unlock-keychain agent runs at boot and is the single dependency root for the entire fleet: if it fails, no other daemon can access its secrets, so the boot-order contract is explicit in the service manifest.
The observation-without-intervention principle deserves an explicit name because it is the most expensive lesson in the daemon layer. Every persistent process that touched external hardware with authority to act (the Uninterruptible Power Supply (UPS) watchdog, the auto-speed adjuster, the power-loss recovery chain, the printer auto-recovery daemon) destroyed more prints than it saved. The policy for any new daemon in the manifest is now a hard requirement: observe and alert, never act. Intervention authority requires a human in the loop. The alert-responder pattern described in Section 16 is the operational answer to the question “how do you intervene quickly without giving a daemon the authority to do it autonomously?”.
14. MCP Integrations
The Model Context Protocol (MCP) is a standard for extending AI assistants with additional tools. Rather than hardcoding every capability into the agent, MCP servers expose tools that the agent can discover and call through a consistent interface. This system uses MCP servers in three categories: local tools, cloud tools, and reasoning aids.
Local tools run on the Mac Mini and provide capabilities specific to this system. The memory server (Section 5) exposes five tools for searching and indexing conversation history. A filesystem server provides structured file operations within sandboxed paths. A GitHub server wraps the GitHub API for repository operations. Each local MCP server uses standard input/output (stdio) transport, meaning it communicates through standard input/output pipes rather than network connections.
Cloud tools are available when Claude Code runs through the cloud interface. Gmail integration enables email operations. Google Calendar integration enables event management and scheduling. Databricks Structured Query Language (SQL) enables direct database queries. A presentation tool generates documents and slide decks. These tools extend the agent’s reach into cloud services without requiring custom API integration code.
Reasoning aids help the agent think more effectively. A sequential thinking server provides structured reasoning support for complex architectural decisions. A library documentation server pulls up-to-date docs for specific frameworks and tools into context, reducing hallucination about API details. A semantic code navigation server provides symbol-level awareness across Python and Swift codebases, addressing the cross-location drift problem with structural understanding rather than text-based grep searches.
A practical lesson from operating MCP servers across two machines: use launcher scripts instead of hardcoded paths. The GitHub MCP server binary lives in different locations on the server and the laptop. Rather than maintaining two different settings files, a small bash wrapper script detects which machine it is running on and executes the correct binary. This pattern (a launcher that resolves machine-specific paths at runtime) prevents the silent breakage that occurs when a settings file hardcodes a path that only exists on one machine.
For engineers building their own systems, the MCP servers most likely to provide immediate value are: a memory/search server (persistent knowledge across sessions), a filesystem server (structured file access), a web search server (current information), and a code navigation server (structural codebase understanding). Everything else is domain-specific and can be added as needs emerge.
15. Skills Directory
Skills are reusable workflows encoded as named commands. Instead of re-specifying a complex multi-step process every session (“run the linter, then the static analyser, then the code reviewer subagent, then optionally get a second opinion from an external model”), a skill encodes the entire workflow as a single invocation: /review.
The system has 26 skills across four categories. The insight behind skills is not that they save typing. It is that they encode institutional knowledge about how workflows should be executed. A /review skill does not just run linters; it runs them in the right order, with the right configuration, checking against the right lessons-learned patterns, and producing a verdict (approve, changes requested, or block) that follows the system’s quality standards. Without the skill, each session would need to reconstruct this workflow from scratch, with inevitable variation and omission.
The most architecturally interesting skills are:
- /debate orchestrates the three-way multi-model debate described in Section 10. It manages the blind initial round, subsequent rounds with cross-model visibility, the rotating devil’s advocate assignment, confidence intervals, and retraction criteria. Encoding this protocol as a skill ensures it runs the same way every time, with the same rigour, regardless of which session invokes it.
- /review implements the standard pre-commit quality gate: linting, static analysis, code review against the lessons-learned database, and optional external model review. This is the system’s most frequently used skill and its primary quality enforcement mechanism.
- /autonomous activates a persistent retry-loop runner for tasks that need to complete without human supervision. When the operator steps away (“email me when done”), this skill takes over, making conservative decisions autonomously, retrying on failure, and sending a completion notification via email. It encodes the decision framework for what can be decided autonomously (simple choices, service restarts, commits to private repos) versus what requires human approval (public pushes, printer commands during prints, irreversible deletions).
- /dream runs memory consolidation, reviewing recent sessions, proposing updates to topic files, and consolidating learnings into the persistent knowledge base. This skill ensures that institutional memory improves over time rather than decaying.
- /deep-context runs the context pre-assembly pipeline described in Section 7. It is the newest core skill and the most expensive per invocation, reserved for tasks that warrant plan mode. The output is a dense, cited context.md that the task consumes as a sub-session input.
The pattern is skill-as-workflow-encoding: capture complex multi-step processes as named, versioned, reproducible commands. Skills are stored in the control plane repository alongside hooks and rules, versioned and deployed through the same pipeline. They are not convenience aliases. They are the system’s operational playbook, encoded in a form that can be executed reliably regardless of which session needs to use them.
16. Automated Maintenance
The maintenance architecture has three layers, each catching failures that the other layers miss.
The automatic layer runs on schedules without human involvement. Cron jobs handle tasks that must happen reliably: hourly memory health checks that verify the search database can actually query (not just that the process is running); bidirectional memory synchronisation every 30 minutes; weekly cross-machine configuration drift checks; weekly security scans for credential leaks and policy violations; nightly service manifest verification on both machines; and nightly host-role compliance checks. These jobs catch slow drift: the kind of degradation that happens over days or weeks and is invisible in any single session.
The per-session layer runs through hooks at session start and end. Session-start hooks validate that all referenced tools exist, verify memory search functionality, pull cross-machine updates, and check index consistency. Session-end hooks push memory changes, index the session transcript, trigger drift checks, and auto-commit memory updates. These hooks catch fast drift: changes made during a session that need to be propagated or verified before the session’s context is lost.
The periodic human review layer runs monthly, triggered by a cron-sent reminder. The review involves reading every instruction and rules file, retiring stale rules, checking topic file counts against the index, running a balanced audit against system invariants, and reviewing exempt files in the context budget configuration. This layer exists because automation cannot exercise judgment about whether a rule is still relevant. Automated checks can verify that rules are followed; only a human can decide that a rule should be retired because its conditions no longer apply.
The /dream consolidation sits between the automatic and human layers. It is triggered automatically (by the session-end hook detecting that consolidation is due) but performs a cognitive task (reviewing recent sessions and proposing knowledge updates) that requires the model’s judgment. It prevents knowledge rot, the gradual degradation of institutional memory as sessions accumulate without their learnings being captured in the persistent topic files.
The pattern: three-layer maintenance with distinct failure domains. Automatic checks catch drift. Per-session hooks catch propagation failures. Human review catches relevance decay. Each layer assumes the other two are insufficient.
A newer self-diagnosing loop bridges the automatic and human layers in the other direction. When a persistent health alert fires (a specific check remaining red for several consecutive hourly runs, for instance), the conversation server’s internal alert-fired endpoint spawns an alert-responder subagent in an isolated session. It runs the five-layer RCA protocol (Section 18) against the alert, proposes a concrete fix, and pushes the analysis to the iOS app via APNs with three action buttons: Accept (apply the proposed fix), Reject (dismiss), or Discuss (open a conversation thread to refine the proposal). This closes the loop from detection to remediation without requiring the operator to be at a terminal, while preserving human judgement for the irreversible step. It is the operational expression of the observation-without-intervention principle: the daemon layer detects and analyses; the human authorises the action.
17. Governor App: A Case Study in Context Management
The Governor app is a Streamlit application that helps a school governor prepare for inspections by querying governance documents with AI analysis. It serves as the system’s most instructive case study for context management because it failed in a way that directly validated the research cited in Section 6, and the fix demonstrated every principle from that section in practice.
The application’s document corpus is approximately 6,664 lines and 100,000 tokens of governance documents: meeting minutes, policy reviews, budget reports, safeguarding updates, and committee notes for a two-school federation. The initial implementation loaded the entire corpus into context for every query. This worked for broad questions (“What are the federation’s strategic priorities?”) but failed catastrophically for specific ones.
The failure that drove the redesign: a governor asked what was discussed at a specific Full Governing Body meeting. The meeting minutes were present in the corpus, at line 192, in the lower-middle section of the file. The model’s response omitted the meeting entirely. The governor, relying on the system for meeting preparation, received an incomplete answer about a meeting whose minutes were right there in the context. This is exactly the “Lost in the Middle” effect that Liu et al. documented: information at middle positions in long contexts is systematically under-retrieved, with accuracy dropping to roughly 25% compared to roughly 42% for information at the end.
The rebuild implemented query-type routing: different retrieval strategies for different kinds of questions. Date-specific queries (“What happened at the 25 March FGB?”) use keyword search on the date string first, because semantic embeddings reliably miss specific dates, then expand with semantic search for conceptual context. Entity queries (a person, school, or committee) start with keyword search on the entity name. Open-ended policy and strategy questions use semantic search to assemble the most relevant chunks into a working context of 30,000 tokens or fewer. Full-corpus loading is the fallback of last resort, used only when the query is genuinely ambiguous and retrieval returns fewer than three relevant results.
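The routing decision itself is small; the value is in which strategy runs first. A sketch, with an illustrative date pattern and strategy names that are not the application’s real identifiers:

```python
import re

# Simplified day-plus-month detector; embeddings reliably miss dates like this.
DATE_RE = re.compile(
    r"\b\d{1,2}\s+(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)", re.I)

def route_query(query: str, known_entities: set) -> str:
    if DATE_RE.search(query):
        return "keyword-first"   # exact-match the date string, then expand semantically
    lowered = query.lower()
    if any(entity.lower() in lowered for entity in known_entities):
        return "entity-keyword"  # start from the entity name
    return "semantic"            # assemble a <=30,000-token working context
```

The full-corpus fallback sits above this function: it fires only when the chosen strategy returns fewer than three relevant results.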
The application also maintains a known-data-location index: line 18 contains the meeting date index, line 192 onward contains Full Governing Body minutes, line 2,422 onward contains the meeting calendar. This metadata enables targeted extraction that bypasses the lost-in-the-middle problem entirely for structured queries.
The broader lesson: context management is not an optimisation. It is a correctness requirement. Loading 100,000 tokens into context and asking a question is not equivalent to searching 100,000 tokens and loading the relevant 30,000. The former misses information that the latter finds. For any application serving real users with real consequences, retrieval-first architecture is not a performance choice but a reliability one.
The 100,000-token corpus is not a static fixture on disk. It is re-downloaded weekly from GovernorHub (the upstream source of truth for the federation’s governance documents) by a dedicated LaunchAgent, and the Streamlit application encrypts the combined context at rest. This matters for disaster recovery (Section 19): the case study’s entire corpus is not backed up to the system’s Google Drive backup set because it does not need to be. GovernorHub is the recovery path. The backup set describes what the system considers irreplaceable; anything re-derivable from an upstream source is deliberately excluded.
18. Lessons Learned Framework
The system maintains a living document of error patterns, not as a historical record, but as an active checklist reviewed at the start of every session. Every pattern in the document represents a mistake that happened at least twice. The document exists because of a cultural commitment: every repeated mistake is treated as a system failure, not a human failure. If the same error happens twice, the first occurrence was an incident; the second is evidence that the system’s controls are insufficient.
The framework uses two severity tiers. Tier 1 patterns caused real damage: destroyed prints, financial loss, broken services, security exposures. These are always read at session start, regardless of the day’s planned work. Tier 2 patterns are behavioural: tendencies that lead to problems if unchecked but do not cause immediate damage. These are read when relevant to the current task.
The most important innovation is the five-layer RCA protocol. Every incident analysis must cover five layers. First: what happened (the sequence of events and immediate cause). Second: what controls existed (every rule, check, or enforcement mechanism that should have prevented the incident). Third: why each control failed (specifically, for each control, what gap or oversight allowed it to be bypassed). Fourth: whether the proposed fix is technical enforcement or another text rule (and if it is a text rule, an explanation of why this one will succeed where the previous rules did not). Fifth: the control class (was this a known-known, the agent knew the rule but skipped it; an unknown-known, the rule existed but the agent did not consult it; or an unknown-unknown, nobody knew the action was dangerous).
This classification matters because each class requires a different response. Known-knowns need enforcement (the agent knew but was optimising for something else; a hook removes the choice). Unknown-knowns need better surfacing (the rule existed but was not loaded when relevant; a session-start check ensures it is visible). Unknown-unknowns need protective defaults (nobody predicted the failure; filesystem protection, wrapper scripts, and pre-execution backups limit the blast radius of unpredicted actions).
A few patterns illustrate the framework’s value:
- “Fix creates new problem” (Pattern 1). Every automated process built to help with printing (a UPS power watchdog, an automatic speed adjuster, a power-loss recovery chain, a daemon with auto-recovery) ended up destroying prints. The watchdog alone caused three or four failures before being permanently deleted. The prevention is now a mandatory five-question pre-flight checklist before any code that touches external systems: What commands can it send? Does it check state before every action? What happens on network failure? What happens in error states? Can the operator stop it with a single command?
- “Silent failures go unnoticed for weeks” (Pattern 3). An OAuth token expired in February and was not noticed until March, a full month. Security hooks parsed the wrong JSON format from the day they were written, silently passing through every command for months, discovered only when a test suite executed them with real payloads. The prevention: verification must test the user experience, not process health. Do not check “is the daemon running?”. Check “does the feature actually work?”.
- “Escalating corrections” (Pattern 5). Printer safety rules escalated four times: “check state before acting” was ignored, so it became “never restart without permission”, which was ignored, so it became “never restart even after print finishes”, which was still insufficient, so it became a firmware macro that blocks the command at the hardware level. The prevention: if the same category of mistake is corrected twice, the rule is not strong enough. Make it absolute. Add technical enforcement. The correction means the previous mitigation failed. Understand why before writing a weaker version of the same rule.
The escalation rule is the framework’s enforcement mechanism: if an error pattern is corrected twice, the next intervention must include technical enforcement (a hook, a macro, a filesystem protection, or a test), not another text rule. Text rules catch known-knowns when the agent is paying attention. Technical enforcement catches everything else.
Patterns that now have technical enforcement include: silent failure detection (scenario tests with real payloads), printer safety (Klipper macros plus PreToolUse hooks), token scope management (single-writer LaunchAgent), destructive command protection (PreToolUse hooks for both plist extraction and service management), and observability drift (automated cross-checking of service manifests against monitoring configuration).
The pattern for other engineers: systematic learning through classification, escalation, and enforcement. Maintain a living error-pattern document. Review it at the start of every work session. Classify failures by whether the control was missing, present but unconsulted, or known but skipped. Escalate from text rules to technical enforcement after two occurrences. Treat every repeated mistake as evidence that the system needs to change, not that the operator needs to try harder.
The current Tier 1 patterns, each one a mistake that caused real damage, and each one now backed by technical enforcement, are:
- Pattern 1 (Fix creates new problem). Every automated printer “helper” (UPS watchdog, auto-speed, power-loss recovery chain) destroyed more prints than it saved. Enforcement: five-question pre-flight checklist plus the observation-without-intervention rule.
- Pattern 2 (Safety guards added after the incident). FIRMWARE_RESTART, SAVE_CONFIG, and G28 each got state checks only after they killed a twelve-hour print. Enforcement: guard-first coding standard for any external-system command path.
- Pattern 4 (Fixes that do not stick). Auto-speed patches reapplied three times before the capability was removed outright. Enforcement: a fix that has failed twice must include technical enforcement, not a stronger text rule.
- Pattern 5 (Escalating corrections). Printer safety rules escalated four times before reaching firmware-level enforcement. Enforcement: Klipper macros plus the PreToolUse hook plus the absolute human-approval policy on FIRMWARE_RESTART.
- Pattern 9 (Shared token file, multiple writers). Seven scripts wrote the OAuth token with different scope subsets, silently stripping each other’s scopes. Enforcement: token-refresh LaunchAgent as sole writer; all other scripts refresh in memory only.
- Pattern 10 (Infrastructure change based on false assumption). Reverting the Tailscale default to the LAN IP broke all remote access. Enforcement: pre-commit hook warns on IP and host changes; review agent fact-checks infrastructure commits.
- Pattern 12 (
plutil -extractoverwrites files in place). Destroyed fourteen LaunchAgent plists in one command that the agent believed was read-only. Enforcement: settings hook blocksplutil -extractwithout-o;chflags uchgon every plist. - Pattern 13 (LaunchAgent operations require explicit approval). Attempted bootstrap of all fourteen plists without asking. Enforcement:
protected_path_hookblockslaunchctlstate-changing commands. - Pattern 22 (Rogue process, audit blind spot). An ngrok tunnel ran for twelve days exposing
ttydto the public internet, missed by every audit because no check enumerated undeclared listeners. Enforcement: health check now scans for rogue tunnel processes and unexpected listening ports beyond the declared service list.
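The Pattern 22 audit gap can be illustrated with a minimal listener scan: enumerate listening TCP ports and diff them against a declared allowlist. The port numbers and the `lsof` parsing below are a sketch under assumed conventions, not the system's real service list or health-check code.

```python
import subprocess

# Hypothetical declared service list (example values only).
DECLARED_PORTS = {22, 443, 7125}

def observed_listeners() -> set[int]:
    """Parse `lsof -iTCP -sTCP:LISTEN -P -n` for listening port numbers."""
    out = subprocess.run(
        ["lsof", "-iTCP", "-sTCP:LISTEN", "-P", "-n"],
        capture_output=True, text=True,
    ).stdout
    ports: set[int] = set()
    for line in out.splitlines()[1:]:          # skip header row
        parts = line.split()
        # NAME column looks like "*:8080" or "127.0.0.1:8080", followed
        # by "(LISTEN)"; -P keeps the port numeric.
        if parts and parts[-1] == "(LISTEN)" and ":" in parts[-2]:
            ports.add(int(parts[-2].rsplit(":", 1)[1]))
    return ports

def rogue_ports(observed: set[int]) -> set[int]:
    """Anything listening that the service list does not declare."""
    return observed - DECLARED_PORTS

# An undeclared tunnel (e.g. ngrok fronting ttyd) shows up as a rogue port:
assert rogue_ports({22, 443, 7125, 49152}) == {49152}
```

The important design point is the direction of the check: the audit enumerates what is actually listening and subtracts what is declared, rather than confirming that declared services are up.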
The pattern across the list is that text rules catch known-knowns when the agent is paying attention, and technical enforcement catches everything else. The ratio of text rules to technical enforcement in the lessons-learned database is a direct measure of the system's maturity.
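As a concrete instance of technical enforcement, the Pattern 12 control (block `plutil -extract` unless an explicit output path is given) reduces to a small predicate. This is a sketch: the real hook's payload format and matching logic are assumptions.

```python
import re

def blocks(command: str) -> bool:
    """Return True if a proposed shell command should be rejected.

    `plutil -extract` rewrites the target plist in place unless an
    explicit output target (-o) is supplied, so the in-place form is
    blocked outright rather than warned about.
    """
    if "plutil" not in command or "-extract" not in command:
        return False
    # Allow only when -o appears as a standalone flag.
    return not re.search(r"(^|\s)-o(\s|$)", command)

assert blocks("plutil -extract Label raw agent.plist")
assert not blocks("plutil -extract Label raw agent.plist -o /dev/stdout")
```

A text rule saying "never run `plutil -extract` without `-o`" depends on the agent recalling it mid-task; the predicate runs on every command regardless of attention.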
19. Backup and Disaster Recovery
The backup posture followed the same incident-to-enforcement path as the rest of the system. The naive initial implementation (a cron job that copied a hardcoded list of about forty files to Google Drive) was symptomatic of every anti-pattern this document catalogues: static file lists that silently drifted as the codebase grew, log-and-forget error handling that hid real failures, and a runbook claiming the existence of subfolders that had never been created. An audit in April 2026 exposed all three.
The current design is a single daily differential backup to a private Google Drive folder, triggered by a LaunchAgent at 03:00 with retry-and-alert via ntfy and Simple Mail Transfer Protocol (SMTP). The backup set is glob-defined rather than explicit, so new scripts added to the source tree are captured without manual intervention. Six categories are covered: top-level scripts and configuration under the code directory, the full control plane repository (redundant with GitHub, but backed up locally so the system survives a simultaneous GitHub-account and Mac Mini loss), daemon wrapper scripts, Claude Code configuration, LaunchAgent plist files, and memory topic files for the non-git-backed project. The manifest tracks 257 files totalling approximately 1.9 megabytes.
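The glob-defined backup set can be sketched as follows. The category names, roots, and patterns are illustrative stand-ins, not the system's actual layout; the point is that the manifest is computed from patterns at backup time, so it cannot silently drift the way a hardcoded file list did.

```python
from pathlib import Path

# Hypothetical category -> (root, glob pattern) mapping.
BACKUP_GLOBS = {
    "scripts":       ("~/code", "*.sh"),
    "control-plane": ("~/code/control-plane", "**/*"),
    "launchagents":  ("~/Library/LaunchAgents", "*.plist"),
}

def build_manifest(globs=BACKUP_GLOBS) -> dict[str, list[str]]:
    manifest: dict[str, list[str]] = {}
    for category, (root, pattern) in globs.items():
        base = Path(root).expanduser()
        # New files matching the pattern are captured automatically,
        # with no manual edit to any file list.
        manifest[category] = sorted(
            str(p) for p in base.glob(pattern) if p.is_file()
        )
    return manifest
```

The manifest produced this way doubles as the audit artefact: counting its entries gives the "257 files, ~1.9 MB" figure directly rather than from a claim in a runbook.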
Several categories are deliberately excluded. iOS application source trees live in GitHub; their Xcode build artefacts are regenerable and would otherwise dominate the backup volume. The ChromaDB embedding store is derivable from the JSONL transcripts via a rebuild script. School governance documents are re-downloaded weekly from GovernorHub (Section 17) and take no space on Drive. The memory git repository is backed up independently by its own git push, so its contents are not duplicated into the Drive set. Each exclusion is a deliberate statement that the item is either regenerable, upstream-sourced, or accepted as lost on disk failure.
The disaster recovery runbook defines a one-hour Recovery Time Objective for full Mac Mini disk loss, documented as an ordered checklist: reinstall macOS, rejoin the Tailscale mesh, clone the control plane first (because it orchestrates everything else), clone the application repositories, restore the memory repository, download non-source artefacts from Drive, seed the macOS Keychain from printed backup codes kept offline, re-download governance documents from GovernorHub, run the deploy script, and verify against the same smoke tests used to gate ordinary deploys. Each recovery step is numbered so progress is externally visible during execution.
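A minimal runner for a numbered runbook, with per-step logging so progress is externally visible, might look like this. The step strings follow the checklist above; the runner itself and its interfaces are assumptions for illustration.

```python
RECOVERY_STEPS = [
    "reinstall macOS",
    "rejoin Tailscale mesh",
    "clone control plane",
    "clone application repositories",
    "restore memory repository",
    "download artefacts from Drive",
    "seed Keychain from offline backup codes",
    "re-download governance documents",
    "run deploy script",
    "verify with smoke tests",
]

def run_recovery(execute, log=print):
    """Run steps in order, logging before and after each one."""
    total = len(RECOVERY_STEPS)
    for i, step in enumerate(RECOVERY_STEPS, start=1):
        log(f"[{i}/{total}] START {step}")
        execute(step)          # in practice each step is a shell command
        log(f"[{i}/{total}] DONE  {step}")

# Dry run: record steps instead of executing them.
done: list[str] = []
run_recovery(done.append, log=lambda msg: None)
assert done[0] == "reinstall macOS" and len(done) == 10
```

Ordering matters here for the reason the text gives: the control plane is cloned before the application repositories because it orchestrates everything else.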
Two trust boundaries matter for the backup’s security model. The Google account itself is the first: plaintext secrets (credentials file, OAuth tokens, Transport Layer Security (TLS) private keys, APNs signing key) sit in Drive protected only by Google’s at-rest encryption and the account’s two-factor authentication. Account recovery goes through a family member’s email and the operator’s phone, both outside the Drive trust boundary. This is accepted risk, not an oversight; encrypting secrets at rest with a user-held key is a roughly fifty-line addition if the threat model ever changes. The second boundary is the GitHub account: the control plane repository is copied to Drive specifically so a GitHub-account compromise does not leave the system without a recovery path, and the backup set is deliberately redundant at that one point for that one reason.
The pattern is backup-as-system-documentation: the set of files the system chooses to back up is a declarative statement about what it considers irreplaceable. Everything else is either regenerable, sourced from an upstream, or deliberately accepted as lost on disk failure. An engineer reading the backup manifest should be able to reconstruct the system’s dependency graph. If they cannot, the manifest is wrong: either it is missing something that the system actually depends on, or it is backing up something that the system does not actually need.
Appendix: External References
- Liu et al. 2023. “Lost in the Middle: How Language Models Use Long Contexts”. arxiv.org/abs/2307.03172
- Anthropic. “Introducing Contextual Retrieval”. anthropic.com/research/contextual-retrieval
- Anthropic. “Building Effective Agents”. anthropic.com/research/building-effective-agents
- Claude Code Documentation. “How Claude remembers your project”. code.claude.com/docs/en/memory
- Claude Code Documentation. “Create custom subagents”. code.claude.com/docs/en/sub-agents
- HumanLayer. “12 Factor Agents”. github.com/humanlayer/12-factor-agents