"Email me when done": a persistent task runner with a delivery guarantee

The sentence that forced this build was “email me when you are done”. I had left the laptop running overnight on a long task, closed the lid, and assumed the agent would email me the result when it finished. In the morning there was no email. The task had completed, or had probably completed, but the session had died somewhere in the process and the completion message was sent to a dead terminal. I had no record of what the agent had produced, how far it had got, or what had gone wrong.

The lesson was specific: in-session “I will email you” is a promise the session cannot keep. The email has to be sent by something that outlives the session. Anything else is an unreliable deliverable.

The result is a skill called /autonomous and a small Python script called autonomous_runner.py. This post is what they do, why they are shaped the way they are, and the failure modes they were built against.

The trigger

The skill is invoked when I say something that implies I am not going to watch the session to completion. The natural-language triggers are literal: “email me when done”, “ping me”, “let me know”, “I am stepping away”, “going to bed”, “back later”, “logging off”. The skill is also explicitly invocable as /autonomous.

When the skill triggers, the current Claude session does three things. First, it reformulates my request into a self-contained prompt that does not depend on any in-session context (memory, previous messages, or file state that has not been committed). Second, it launches autonomous_runner.py as a detached process via nohup. Third, it tells me that the runner has launched and which address the result will be emailed to. The current session is then free to die, be cancelled, or be closed; the runner continues independently.

The point of the reformulated prompt is that the runner’s Claude invocation is a fresh session with no memory of whatever the current session was doing. Everything the task needs has to be in the prompt itself. This is a constraint that occasionally requires care (a long-running task based on state only in the current session has to serialise that state first), but in exchange it produces a task description that is portable, testable, and re-runnable.

The retry loop

The runner executes claude -p "prompt" in a non-interactive subprocess. Each attempt is a complete, fresh Claude session: same tools, same CLAUDE.md, same memory system, no shared state with previous attempts.
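A single attempt could be sketched roughly as follows. The function name run_attempt, the error_markers parameter, and the injectable command tuple are illustrative assumptions rather than the real runner's code; only the claude -p invocation comes from the description above.

```python
import subprocess


def run_attempt(prompt, timeout, command=("claude", "-p"), error_markers=()):
    """Run one fresh, non-interactive attempt and return (ok, output).

    `command` is injectable so the function can be exercised without Claude.
    `error_markers` is a hypothetical tuple of strings that mark a failed run.
    """
    try:
        proc = subprocess.run(
            [*command, prompt],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # Partial output may be None, bytes, or str depending on the platform.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, f"TIMEOUT after {timeout}s\n{partial}"

    output = proc.stdout + proc.stderr
    failed_marker = any(marker in output for marker in error_markers)
    return proc.returncode == 0 and not failed_marker, output
```

Because every call starts a new subprocess, nothing carries over between attempts except what is in the prompt and on disk.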

If the attempt fails (non-zero exit, a timeout, or an error marker in the output), the runner retries. The backoff schedule is 30 seconds, 60 seconds, 120 seconds, 240 seconds, 300 seconds, with a default of five retries. Between retries the runner logs the attempt output so that a later post-mortem can see what happened.
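The loop can be sketched as below. The schedule is the one quoted above; the names backoff_delay and run_with_retries are hypothetical, and the sketch assumes that "five retries" means one initial attempt plus up to five more, which the post does not state explicitly.

```python
import time

BACKOFF_SECONDS = (30, 60, 120, 240, 300)  # schedule quoted in the post


def backoff_delay(retry_index):
    """Delay before retry number `retry_index` (0-based), capped at the last step."""
    return BACKOFF_SECONDS[min(retry_index, len(BACKOFF_SECONDS) - 1)]


def run_with_retries(attempt, max_retries=5, sleep=time.sleep, log=print):
    """Call `attempt()` until it succeeds or the retries run out.

    `attempt` returns (ok, output). The return value is (ok, outputs), where
    `outputs` holds every attempt's output for a later post-mortem.
    """
    outputs = []
    for attempt_number in range(max_retries + 1):  # assumption: first try + retries
        ok, output = attempt()
        outputs.append(output)
        log(f"attempt {attempt_number + 1}: {'ok' if ok else 'failed'}")
        if ok:
            return True, outputs
        if attempt_number < max_retries:
            sleep(backoff_delay(attempt_number))
    return False, outputs
```

Passing sleep and log as parameters keeps the loop testable without waiting out the real delays.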

Each retry is a fresh session by design. A retry is not “continue where you left off”; it is “try again from scratch”. This is deliberate. Many of the failure modes the runner was built for (model transient errors, tool failures, network glitches) are the kinds of errors that clear on retry. Errors that are genuinely deterministic (a broken prompt, a missing dependency, a credential issue) will fail identically every time, and the runner will exhaust its retries and move to the failure path.

The fresh-session-per-retry model is also a security property. No retry can inherit compromised conversational state from a previous one. If the first attempt was tricked by prompt injection from a piece of fetched content, the second attempt starts from a clean context, although anything the first attempt already wrote to disk is still there.

Success and failure paths

On success, the runner extracts the final output of the Claude invocation and emails it to me. The email body contains the actual result, not a status message. This is the specific guarantee. “Task complete” is not a deliverable. The summary of what was done, with the specific outputs, is.

The email is sent via Simple Mail Transfer Protocol (SMTP), authenticated with a Gmail app password that lives in the Mac Mini’s credentials file. The subject line encodes the task name and outcome. The body is the Claude output.
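A minimal sketch of building and sending such a message with the Python standard library follows. The subject format, the function names, and the choice of Gmail's SSL endpoint on port 465 are assumptions made for illustration; the post does not say how the real runner formats the subject or connects.

```python
import smtplib
from email.message import EmailMessage


def build_result_email(task_name, succeeded, body, sender, recipient):
    """Return a message whose subject encodes the task name and outcome."""
    outcome = "SUCCESS" if succeeded else "FAILED"
    msg = EmailMessage()
    msg["Subject"] = f"[autonomous] {task_name}: {outcome}"  # hypothetical format
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)  # the body is the actual result, not a status line
    return msg


def send_via_gmail(msg, username, app_password):
    """Send `msg` through Gmail over SSL, logging in with an app password."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465, timeout=30) as smtp:
        smtp.login(username, app_password)
        smtp.send_message(msg)
```

Keeping message construction separate from sending means the content can be checked without touching the network.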

If all retries are exhausted, the runner emails a failure report. The report contains the prompt that was submitted, every attempt’s output (truncated if excessive), and the runner’s own log. The failure email is as informative as the success email; the principle is that I should be able to read one email and know what happened.
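One way to assemble such a report is sketched here. The section labels, the 2000-character limit, and the function names are all assumptions; the post only says that excessive output is truncated.

```python
def truncate(text, limit=2000):
    """Keep the first `limit` characters and note how much was dropped."""
    if len(text) <= limit:
        return text
    omitted = len(text) - limit
    return f"{text[:limit]}\n[... {omitted} characters truncated ...]"


def build_failure_report(prompt, attempt_outputs, runner_log, limit=2000):
    """Combine the prompt, every attempt's output, and the runner log."""
    sections = [f"PROMPT\n{prompt}"]
    for number, output in enumerate(attempt_outputs, start=1):
        sections.append(f"ATTEMPT {number}\n{truncate(output, limit)}")
    sections.append(f"RUNNER LOG\n{truncate(runner_log, limit)}")
    return "\n\n".join(sections)
```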

If the SMTP send itself fails (wrong password, network down, Google rate-limiting), the runner retries the email delivery up to five times with its own backoff. If the email continues to fail after that, the runner writes the result to /tmp/autonomous_result.txt so that at minimum the output is recoverable by reading the file later. This is the last-ditch fallback. It has never fired in production. It exists because the cost of “we thought the email went and it did not” is higher than the cost of writing a file that is never read.
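The delivery step with its file fallback might look like the sketch below. The delay values are an assumption (the post says only that email delivery has "its own backoff"), and deliver is a hypothetical name.

```python
import smtplib
import time
from pathlib import Path


def deliver(send, body, fallback_path="/tmp/autonomous_result.txt",
            attempts=5, delays=(30, 60, 120, 240, 300), sleep=time.sleep):
    """Try `send(body)` up to `attempts` times; write a file if all fail.

    Returns "email" if a send succeeded and "file" if the fallback was used.
    """
    last_error = None
    for index in range(attempts):
        try:
            send(body)
            return "email"
        except (smtplib.SMTPException, OSError) as exc:
            last_error = exc
            if index < attempts - 1:
                sleep(delays[min(index, len(delays) - 1)])
    # Last-ditch fallback: keep the result recoverable on disk.
    Path(fallback_path).write_text(
        f"{body}\n\n[email delivery failed: {last_error!r}]\n"
    )
    return "file"
```

The return value makes it possible to log which channel the result actually went out on.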

If the runner itself crashes (a rare case: out of memory, disk full, bug in the runner code), it attempts a crash notification email. The crash email is minimal: a few lines of traceback plus a pointer to the log. The principle is the same: whatever happens, something tells me what happened.
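A crash guard of this kind can be sketched as a wrapper around the runner's entry point. The names run_guarded, main, and notify are hypothetical, and the 15-line traceback tail is an arbitrary choice.

```python
import traceback


def run_guarded(main, notify, log_path="/tmp/autonomous_runner.log", max_lines=15):
    """Run `main()`; on an unexpected exception, send a short crash notice."""
    try:
        return main()
    except Exception:
        tail = traceback.format_exc().splitlines()[-max_lines:]
        body = "Runner crashed.\n\n" + "\n".join(tail) + f"\n\nFull log: {log_path}"
        try:
            notify(body)
        except Exception:
            pass  # best effort: nothing more can be done if the notice fails
        raise  # re-raise so the process still exits with an error
```

A handler like this only covers failures that surface as Python exceptions; a process killed outright by the operating system gets no chance to run it.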

Why a detached process

I considered building this inside the Claude Code session itself, using some kind of persistent in-session daemon. I rejected the idea after about an hour of thinking about it. The reason is that a session running the runner is still a session. A session can be killed by the operating system, by running out of resources, by being cancelled, by the terminal being closed. Anything the runner depends on being in-session is vulnerable to those failure modes.

A detached nohup process is not a session. It has its own process identifier, its own lifecycle, its own logging. Because nohup makes it ignore the hangup signal sent when the terminal closes, it inherits none of the session’s fragility. The operating system will keep it running until it exits, is killed, or the machine itself goes down. This is the specific property I wanted.

The cost is that the runner does not have the ergonomics of an interactive session. It cannot ask me questions mid-task. It cannot adapt based on intermediate feedback. The runner is a fire-and-forget abstraction, which is precisely what the trigger (“I am not going to watch this finish”) requires.

The rules when running autonomously

The Claude invocation the runner spawns is told, via its system prompt, that it is running in autonomous mode. The rules it follows are encoded in the memory system’s operational rules:

Never block on a non-critical question. If a decision has to be made, make the safer, simpler, more reversible choice and document why.
Block only on truly irreversible actions: pushing to public repositories, sending messages to third parties, deleting data, sending commands to a printer during a live print.
If the task hits a dead end, try a different approach. If the different approach also fails, include that fact in the email. Do not give up silently.
Always send the email, even on failure. The email is the deliverable.
Document every autonomous decision in the email or in a memory update.

These rules exist because the alternative (blocking on me for confirmation when I cannot respond) is exactly the failure mode that made the runner necessary in the first place. Asking me a question I cannot answer is not a safe action; it is an incomplete action that produces no deliverable.

The rules also encode a specific principle about reversibility. The runner is allowed to be bold about things that can be undone (a file write, a local command, a commit to a private repository). It is not allowed to be bold about things that cannot (a public repository push, a message to another person, a destructive printer command). The asymmetry matters: a reversible mistake is a cheap learning opportunity; an irreversible mistake is a real cost.

Where the runner lives and how it is triggered

The runner code is at ~/.claude/skills/autonomous/autonomous_runner.py. The skill definition is at ~/.claude/skills/autonomous/SKILL.md. The log is at /tmp/autonomous_runner.log. An active-task marker file at /tmp/autonomous_task_active contains the process identifier and start time so that I can inspect what is running.
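The marker file could be handled along these lines. The JSON layout and the function names are assumptions; the post says only that the file holds the process identifier and start time.

```python
import json
import os
from datetime import datetime
from pathlib import Path

MARKER = "/tmp/autonomous_task_active"


def write_marker(path=MARKER):
    """Record this process's PID and start time (hypothetical JSON layout)."""
    info = {"pid": os.getpid(), "started": datetime.now().isoformat(timespec="seconds")}
    Path(path).write_text(json.dumps(info) + "\n")
    return info


def clear_marker(path=MARKER):
    """Remove the marker when the task finishes."""
    Path(path).unlink(missing_ok=True)


def marker_is_live(path=MARKER):
    """Return True if the marker exists and its PID is still running."""
    try:
        pid = json.loads(Path(path).read_text())["pid"]
        os.kill(pid, 0)  # signal 0 checks for existence without killing
    except (FileNotFoundError, ProcessLookupError):
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Checking the PID as well as the file guards against a stale marker left behind by a runner that died without cleaning up.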

The typical invocation is a one-line bash call inside a Claude Code session:

nohup python3 ~/.claude/skills/autonomous/autonomous_runner.py \
  --prompt "SELF-CONTAINED PROMPT" \
  --email "[email protected]" \
  --max-retries 5 \
  --timeout 600 \
  > /tmp/autonomous_runner_stdout.log 2>&1 &
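On the Python side, the flags in that command could be parsed with argparse, as in this sketch. The flag names come from the command above and the default of five retries from the retry description; the 600-second default timeout and the help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the runner's command-line flags."""
    parser = argparse.ArgumentParser(
        description="Run a self-contained prompt detached and email the result."
    )
    parser.add_argument("--prompt", required=True, help="self-contained task prompt")
    parser.add_argument("--email", required=True, help="address to deliver the result to")
    parser.add_argument("--max-retries", type=int, default=5, help="retries after a failed attempt")
    parser.add_argument("--timeout", type=int, default=600, help="seconds allowed per attempt")
    return parser.parse_args(argv)
```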

A few design decisions worth naming:

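--timeout 600 is per-attempt, not total. The total wall time is bounded by attempts * timeout + backoff-total. With a 600-second timeout and the full 750 seconds of backoff, that is 62.5 minutes if max-retries counts five attempts in total and 72.5 minutes if it counts five retries after an initial attempt, so a ten-minute task with five retries could theoretically take up to about 75 minutes; in practice it completes on the first or second retry or fails cleanly.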
The runner strips ANTHROPIC_API_KEY from the environment before spawning Claude, so the invocation uses the subscription authentication rather than paid Application Programming Interface (API) credits. This is a hard rule in my setup; the runner enforces it at the process boundary.
The SMTP credentials are loaded from the Mac Mini’s credentials.py at runtime. They are not hard-coded, not passed as arguments, not logged.
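Stripping the key at the process boundary can be sketched as building a filtered copy of the environment and handing it to the subprocess. The function name child_environment is hypothetical; only the variable name ANTHROPIC_API_KEY comes from the post.

```python
import os


def child_environment(base=None, blocked=("ANTHROPIC_API_KEY",)):
    """Return a copy of the environment without the blocked variables."""
    source = os.environ if base is None else base
    return {key: value for key, value in source.items() if key not in blocked}
```

The result would be passed as the env argument of subprocess.run, so the parent's own environment is left untouched and only the spawned Claude process loses the key.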

What it has actually caught

The runner has been live for a few weeks. Its real value has surfaced in three situations.

First, the overnight case that motivated it. I now routinely leave long-running tasks to complete while I sleep. The tasks do complete; the emails do arrive. The morning routine is “check the overnight email, read what happened, act on the output”. This is what I wanted and it now works.

Second, a case of mid-task network drops. A task was pulling a large data set from an external API that timed out twice in its first invocation. The runner retried, the second attempt succeeded on the third backoff step, and the email arrived with the completed data set. Without the retry loop, the first timeout would have killed the task.

Third, a case where the task was genuinely impossible. I had asked the runner to do something that required credentials I had not provisioned. The runner hit the authentication failure on every attempt, exhausted its retries, and sent me a clear failure email with the error message and the attempt outputs. I read the email, fixed the credentials, and re-ran. The cost of the failure was one email and two minutes of my time.

The email that never arrives is the category of failure I most wanted to eliminate. The runner has eliminated it. That is the specific property I set out to build and it is the specific property the runner now has.

What it does not do

A few things the runner is explicitly not for.

It is not for interactive tasks. Anything that requires me to answer questions, provide mid-flight context, or approve intermediate steps is the wrong fit. Use a session for those.

It is not for long-horizon planning. The runner runs a single Claude invocation per attempt. Anything requiring multi-step orchestration across hours, with intermediate state and conditional branches, is better expressed as a proper pipeline with its own scheduling. The runner is a fire-and-forget tool for a single self-contained task.

It is not a replacement for daemons. Recurring work (health checks, backups, monitoring) runs as LaunchAgents or cron jobs, not through the runner. The runner is ad-hoc execution; daemons are scheduled execution. They are different problems and they have different solutions.

The principle the runner encodes

Completion is what you can verify, not what you hope. A task is complete when its result has arrived in a channel where I can read it. If I cannot read it, the task is not complete, even if the work has been done.

This is a surprisingly demanding definition. It forces the runner to treat the email as the deliverable rather than the computation. It forces a retry loop on the email itself, not just on the task. It forces a fallback file write when email fails. It forces a crash notification when the runner itself dies. Each of these layers exists because, at some point, completion-without-delivery happened, and I decided the next instance of that failure mode was one too many.

“Email me when done” is a contract. The runner is what makes the contract enforceable.

The runner is used perhaps five times a week. Its failure rate (defined as tasks that completed but where no email arrived) is zero over the period it has been deployed. The fallback file has never been created. I consider both of those facts load-bearing for the contract the runner encodes.