A personal AI operating environment: worked example and receipts
This post is an entry point. The full technical manual is here and documents the architecture in detail. What follows is the shorter version: what the system does, why each piece exists, and what I have learned from several months of running it that I think might be useful to someone building something similar.
The system’s design philosophy can be stated simply. Measurement over narrative. Enforcement over instruction. Defence-in-depth for anything that touches the physical world. Every safety rule exists because a text-based instruction failed under pressure. Every hook exists because a “do not do this” rule was ignored when the agent was focused on something else. Every layer of printer protection exists because the layer above it proved insufficient during a real incident.
What it looks like from the outside
A Mac Mini at home is the always-on server. A MacBook Pro is a thin client. My iPhone is a third node. All three are on a Tailscale mesh, so any device can reach any other from anywhere without VPN configuration.
Fifteen persistent services run on the Mac Mini. They handle: a Flask server that brokers access to Claude Code sessions for my mobile devices; an adaptive poll of three printers (every 30 seconds during a print, every five minutes when idle); a daily differential backup to Google Drive; OAuth token refresh; a weekly sync of school governance documents; credential rotation; acceptance testing on a schedule; a trend tracker that prevents my monthly compliance score from regressing; a CI failure poller that pushes notifications when a GitHub Actions workflow fails; an observer that classifies session failures into a database so I can read the top three every Monday morning.
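The adaptive poll cadence can be sketched in a few lines. This is a hypothetical reconstruction, not the daemon's actual code; the constants and the `"printing"` state string are assumptions.

```python
# Hypothetical sketch of the printer daemon's adaptive poll cadence:
# a tight interval while any print is running, a relaxed one when all idle.
ACTIVE_INTERVAL = 30        # seconds, during a live print
IDLE_INTERVAL = 5 * 60      # seconds, when every printer is idle

def poll_interval(printer_states: list[str]) -> int:
    """Pick the next sleep interval from each printer's reported state."""
    if any(state == "printing" for state in printer_states):
        return ACTIVE_INTERVAL
    return IDLE_INTERVAL
```

One printer mid-print forces the tight cadence for all three, which keeps the logic trivially auditable at the cost of a few redundant polls.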
Twenty-one Claude Code hooks fire across five lifecycle points. Hooks convert text rules into enforcement: if a session-start check fails, the session will not start; if a printer command is not on the allowlist during a live print, the command does not execute; if a file write contains a credential pattern, the write is blocked. Text rules that the agent reads but can ignore are not a safety mechanism. Hooks that execute before the agent’s action are.
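The allowlist check described above might look something like this. The allowlist contents, the event shape, and the `live` flag are all illustrative assumptions; the real hooks are the author's own scripts.

```python
import sys

# Illustrative PreToolUse-style hook: deny any printer command not on an
# allowlist while a print is live. ALLOWLIST and the event shape are
# hypothetical stand-ins, not the system's real hook.
ALLOWLIST = {"status", "get_temps", "pause"}

def check(event: dict, live: bool) -> int:
    """Return 0 to allow the tool call, 2 to deny it before it executes."""
    command = event.get("tool_input", {}).get("command", "")
    words = command.split()
    first_word = words[0] if words else ""
    if live and first_word not in ALLOWLIST:
        print(f"Blocked during live print: {command!r}", file=sys.stderr)
        return 2
    return 0

# Wiring, roughly: sys.exit(check(json.load(sys.stdin), print_is_live()))
```

The key property is that the check runs before the tool call, so denial is structural rather than advisory.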
Twenty-six skills encode workflows. A /debate skill runs a three-way review (Claude Opus, Gemini, GPT-5.4) on architectural decisions. A /review skill runs the lighter pre-commit gate on every change. An /autonomous skill lets me say “email me when you are done” and trust that the work will complete and I will receive a report. A /dream skill consolidates memory between sessions so institutional knowledge does not decay. Skills are versioned in a git repository and deployed through the same pipeline as everything else.
Memory is a two-tier search engine: a semantic tier (ChromaDB with local embeddings, no external API required) and a keyword tier (SQLite FTS5 for exact-match on dates, IP addresses, error strings). Each indexes something like 79,000 chunks from 1,400+ conversations. Both exist because each fails where the other succeeds. Canonical facts live in structured topic files under a git repository of their own, so memory is version-controlled and cross-machine consistency survives reboot.
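The keyword tier's value is easiest to see in miniature. This is a minimal FTS5 sketch with a hypothetical schema, showing the exact-match recall (an IP address here) that an embedding search can blur.

```python
import sqlite3

# Minimal sketch of the keyword tier: SQLite FTS5 for exact-match recall
# on dates, IPs, and error strings. Table name and rows are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(text)")
db.executemany("INSERT INTO chunks VALUES (?)", [
    ("Print failed 2025-03-14: MCU shutdown at 192.168.1.40",),
    ("Discussed governors meeting agenda",),
])
# A quoted FTS5 phrase query finds the literal token sequence.
rows = db.execute(
    "SELECT text FROM chunks WHERE chunks MATCH ?", ('"192.168.1.40"',)
).fetchall()
```

A semantic tier answers "what went wrong with the printer?"; only the keyword tier reliably answers "which chunk mentions 192.168.1.40?".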
Five iOS apps on the phone provide the interface. A terminal app. A session-management app. A governors’ dashboard for my school work. A printer control app. A shared SwiftUI framework the others reuse. All receive push notifications via APNs when something interesting happens, including alerts from an “alert-responder” subagent that produces a five-layer root-cause analysis and pushes it to me with Accept/Reject/Discuss buttons.
What it protects against
A 3D printer is the most expensive worked example in the system. A badly timed command to the printer during a twenty-hour print does not just waste filament. It warps a build plate, jams an extruder, or destroys a part that took a full day to produce. Every layer of the safety architecture described in the full manual, and in a companion post on this site, was born from a real incident with real consequences. Six defence layers now sit between Claude Code and the printers. I have not lost a print to automation-induced failure since the last of them was added.
The broader protection is against a set of failure modes that any personal AI system will encounter if it runs long enough. Silent failures going unnoticed for weeks. Automated helpers that cause more damage than they prevent. Shared config files with multiple writers that silently overwrite each other’s changes. Infrastructure changes based on false assumptions. Fixes that do not stick. Ignored safety rules under time pressure. Rogue processes. Memory poisoning. Supply-chain drift in third-party MCP servers.
Each of these has a specific dated incident and a specific piece of code that now prevents its recurrence. They are catalogued in a “lessons learned” file that I re-read at every session start. The file has nineteen patterns. Every pattern in it has happened to me at least twice.
What I have learned
A few things seem worth sharing for anyone building something similar.
Text rules are a request, not enforcement. Every failure mode I have hit started the same way: the rule was there in markdown, and the agent ignored it under pressure. The agent is an optimiser. Under pressure, it routes around requests. The fix is not a stronger rule. The fix is a layer of enforcement closer to the hardware than the agent can see. Shell-script hooks that return exit code 2 to deny a tool call are the most effective enforcement primitive I have found.
Context is load-bearing infrastructure, not a free resource. Claude can accept 200,000 tokens of input. The model cannot reliably use 200,000 tokens of input. Liu et al.’s “Lost in the Middle” paper showed that relevant information buried in the middle of a long context is found roughly 25% of the time compared to roughly 42% at the end. My own system hit this the hard way: a school governors app that loaded a 100,000-token document corpus for every query missed meeting minutes that were in the corpus but buried deep in it. The fix was query-type routing. Date-specific queries start with keyword search. Open-ended policy questions start with semantic search. Full-corpus loading is the fallback, not the default. Context management is not an optimisation; it is a correctness requirement.
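The query-type routing is simple enough to sketch. The date patterns and tier names here are assumptions for illustration; the real router is the system's own.

```python
import re

# Hypothetical query router after the governors-app failure: date-bearing
# queries go to the keyword tier first, open-ended ones to the semantic tier.
DATE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"                                   # ISO dates
    r"|\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}\b"
)

def route(query: str) -> str:
    """Choose the first search tier; full-corpus load is the fallback, not here."""
    return "keyword" if DATE.search(query) else "semantic"
```

The point is not the regex; it is that the default path loads a few relevant chunks instead of 100,000 tokens.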
Three-way multi-model review is more valuable for its framing diversity than its verdict ensembling. Three models from three different labs do not produce three versions of the same answer. They produce three different framings of the question. When they converge on a verdict after initial disagreement, that is a strong signal. When they agree from the first round without ever disagreeing, that is a weak signal, because it is often a shared training-data bias masquerading as agreement. A companion post on this site goes into this in detail, with specific dated examples of times the protocol caught things single-model review would have missed, and times it gave false confidence.
Delete, do not add. Every automated “helper” I built to make my system more reliable ended up making it less reliable. The UPS watchdog that paused prints on USB glitches. The auto-speed adjuster that killed prints with mid-flight parameter changes. The power-loss-recovery chain that triggered SAVE_CONFIG during pauses. The printer daemon’s auto-recovery that sent FIRMWARE_RESTART mid-print. I removed all four. The system is smaller, more reliable, and easier to reason about. Observation is safe. Notification is safe. Action is not. Any daemon with authority to act on a physical system needs explicit state gates, or removal.
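The "explicit state gates" idea can be made concrete. This is a hedged sketch, not the system's real daemon code; the state names and action labels are assumptions.

```python
from enum import Enum

class PrinterState(Enum):
    IDLE = "idle"
    PRINTING = "printing"
    PAUSED = "paused"

# Hypothetical state gate: a daemon may observe and notify in any state,
# but any action with authority over the machine requires a provably idle
# printer. Action names are illustrative.
SAFE_IN_ANY_STATE = {"notify", "log", "read_status"}

def permitted(action: str, state: PrinterState) -> bool:
    if action in SAFE_IN_ANY_STATE:
        return True
    return state is PrinterState.IDLE   # e.g. firmware_restart only when idle
```

Observation and notification pass unconditionally; everything else must prove the printer is idle, which is exactly the check the removed auto-recovery daemon lacked.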
Measurement compounds. Infrastructure does not. The single highest-leverage change I can make to the system is not another daemon, not another skill, not another MCP server. It is a failure-annotation pipeline that classifies my last week of sessions into categories and tells me on Monday morning which three kinds of failure are producing the most rework. I have 1,400 conversations in a searchable database and I was not harvesting them. The pipeline was four hours of work. The first digest it produced, last Monday, correctly identified that “incomplete-fix” and “env-infra” failures were sharing a root cause (an image-attachment pipeline issue) that I would otherwise have treated as two separate problems. Hamel Husain writes that three issues typically cause 60% of problems. I had not tested whether that was true for me because I had not been looking. It is true, and now I am.
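The core of that digest is little more than a frequency count over classified sessions. The category labels below are illustrative, not the pipeline's real taxonomy.

```python
from collections import Counter

# Hypothetical weekly digest: count last week's classified session failures
# and surface the top three categories. Labels are illustrative stand-ins.
failures = [
    "incomplete-fix", "env-infra", "incomplete-fix", "tool-misuse",
    "env-infra", "incomplete-fix", "context-miss",
]
top_three = Counter(failures).most_common(3)
# top_three now leads with the category producing the most rework.
```

Four hours of work, because the hard part (1,400 searchable conversations) already existed; the pipeline only had to classify and count.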
Defence-in-depth is cheaper than it looks. The total code for my six-layer printer safety architecture is under 200 lines across a markdown rules file, a bash PreToolUse hook, two Klipper macros, two daemon state guards, and one hard human-approval rule. The investment is trivial versus the cost of a single destroyed print. The discipline is the point, not the lines.
Your pen-test suite probably does not cover AI-specific attack classes. I was surprised to find this, because I was running a pen-test suite and felt competent. But traditional pen testing looks for injection points, auth bypass, and known CVE patterns. It does not ask “if an attacker planted content in memory, does the model then follow that content as an instruction?” It does not ask “is MCP retrieval routed through a system-prompt injection path?” It does not ask “are the MCP server packages version-pinned against silent supply-chain drift?” Those are AI-specific test cases. They are not standard. I added three to my scenarios suite this week. They take about an hour to write. They close a real class of gap.
What is in the long version
The full technical manual covers eighteen sections in detail:
- Infrastructure (Mac Mini + laptop + iPhone on Tailscale)
- Control plane (the git-versioned repo that is the single source of truth)
- Conversation server (the Flask daemon that brokers mobile access)
- Memory system (dual-tier ChromaDB + FTS5)
- Context management (the measurement-first approach after the governors app failure)
- Hooks and enforcement (all twenty-one of them)
- Printer safety (the six-layer architecture)
- Multi-model review (/debate, /review, /autonomous)
- Security posture
- Automated security testing and penetration testing
- The daemon layer
- MCP integrations
- The skills directory
- Maintenance across three layers (automatic, per-session, periodic human review)
- The governors app as a case study in context management
- The lessons-learned framework
- The RCA protocol
- Backup and disaster recovery
Each section includes specific dated incidents, the fix that was applied, and the technical control (where one exists) that now prevents recurrence. A few sections have companion posts that go deeper:
- “Six layers of defence for an AI agent over a 3D printer” covers the printer safety architecture specifically, with every Klipper macro and hook script.
- “Five things I built to help my AI agent that I had to remove” covers the removed automations, the incidents that triggered removal, and the pattern I now apply to new automation.
- “Three-way AI model debate as a pre-commit gate” covers /debate with receipts and the one confidence trap the protocol produced that I did not predict.
A note on authenticity and verification
The gravity of the AI attention economy pulls toward generalities. “AI will change everything.” “We are at an inflection point.” “Agentic systems are the future.” None of those claims are verifiable. They are opinion dressed as observation.
I have tried to write the opposite way. Every specific claim above, and every specific claim in the full manual, is verifiable against code, incident logs, and commit history. The printer safety macros are real files I can show you. The 1,400-conversation memory database is a real file on the Mac Mini. The incidents have dates, durations, and receipts. If I am wrong about something, you can point at the specific thing and disagree.
That is the disposition I want from writing about personal AI systems. Not because it is more humble, but because it is more useful. A reader can copy the parts that apply to their situation and ignore the parts that do not. An opinion piece gives them nothing to copy.
If you are building something similar, this is what one version looks like at this point in time. If you find places where it is wrong, or places where the pattern does not generalise to your context, I want to hear. Contact details are on the about page. The code will be in public repositories when those are up; the prompts and hooks are small enough to copy by hand.
This post is the public entry to a longer technical manual. The manual itself, the code, and several companion posts will be here on this site over the coming weeks.