Ofsted is meant to be a surprise. We have known they are coming for two weeks.
You are only meant to find out about an inspection on the Monday morning it lands. For Victoria C of E Infant & Nursery School in Berkhamsted, that call comes next week.
Why? Because yesterday someone downloaded all of our school policies from our public school website, from an IP address that looks an awful lot like Ofsted. We were already pretty sure it was coming a week ago, when the same source started browsing the website more carefully, the school calendar in particular. The senior leadership team will be working through this weekend to make sure every i is dotted and every t is crossed. We were already very well prepared. Now we are using every tool in our arsenal to be exceptionally well prepared.
This post is about one of those tools: why I started building it in February, what it does, and what I rebuilt over the last fortnight to be ready for next week.
Where this started
The Castle Federation is two primary schools in Berkhamsted. One of them was rated Requires Improvement at its Ofsted inspection in October 2023. A re-inspection can land with about a day’s notice and the governing body has to show inspectors that the concerns have been dealt with and the school is operating at the required standard.
The evidence base for that conversation is enormous. When I went looking, the federation’s document archive came in at 1,650 files and 918MB: policies, minutes, strategy documents, assessment data, safeguarding records, local authority correspondence, subject leader reports, action plans. None of us had read all of it. Most of us had read a fraction. I am one of those governors. I am not a teacher, not an inspector, not a full-time administrator. The practical question we faced was simple. How does a governing body answer an inspector with cited evidence on the day, across a corpus none of us has fully read?
I started on this at an internal hackathon at work in February 2026. The brief was pick your own project and build something with AI that would be useful to the business. I built two things that day. The merchant-risk pipeline I have written about elsewhere, and the first version of the school tool, which started as a terminal script on my laptop reading governance documents and answering Ofsted-style questions. By the end of the day it ran on my machine. Useful, not useable.
Over the following two months came seven iterations, each made obsolete the moment the failure mode of the previous one became clear.
What it does
A web app that has read every document in the federation’s evidence base. A governor asks an Ofsted-style question, the tool returns an answer with citations linking back to the source documents. You can type or speak. Governors logged in at the same time see the same conversation and can build on each other’s questions. Sign-in is by email to a federation address.
The tool does not replace governors’ judgement. It lets us answer “where did the school document its response to this concern?” in thirty seconds rather than thirty minutes. The human stays accountable for the answer; the machine is accountable for finding the paragraph.
Live at governors.timtrailor.com, restricted to Castle Federation governors.
The seven original choices
The seven versions were not planned. Each was a reaction to the way the previous one failed in a real governor’s hands.
- Version 1: a terminal tool on my laptop. It loaded a chunk of the corpus into a language model’s context window and let me ask questions in a shell. Worked, but only when my laptop was open. Not going to survive a real inspection.
- Version 2: a web app with a school selector. Castle Federation has two schools and the evidence base is different for each. The first thing the app asks now is which school the question is about, and the retrieval is scoped accordingly. This sounds obvious in retrospect; it was the single biggest accuracy lift early on, because answers from one school’s policies were no longer drowning the other’s.
- Version 3: voice input via Whisper Flow. Inspection sessions are live conversations, not pauses to type a question and wait for an answer. I used Whisper running on the laptop so audio never left the machine, which mattered for confidentiality, and Whisper Flow handled the streaming so the transcript appeared as I spoke. Voice input changed the tool from “research aid” to “thing you can use in the meeting”.
- Version 4: source citations. Earlier versions produced fluent answers governors had no way to verify, and a fluent answer with no source is worse than no answer because it looks trustworthy. Citations with clickable links fixed this. It took six iterations because matching a model’s paraphrase back to the exact paragraph in the source is harder than it sounds.
- Version 5: a rewritten prompt for tone and format. Early answers were rambly. Inspectors want concise, evidence-based answers; a paragraph-long monologue loses credibility quickly. The prompt now asks the model to answer the way a governor would in an interview: one or two sentences, then a citation, then silence. That single prompt change moved the tool from interesting demo to usable in a real conversation.
- Version 6: hosted on Streamlit Community Cloud, with magic-link sign-in restricted to the federation’s email domain. Other governors could log in from anywhere. The corpus lived in the repository as a Fernet-encrypted blob (a standard symmetric encryption format that scrambles the contents until a key unlocks them); decryption happened at runtime with a key in hosted-environment secrets, so the repository was public but the decrypted contents were not.
- Version 7: real-time collaborative chat. Governors logged in at the same time see the same thread; a question asked by one surfaces for all; a presence indicator shows who is online. An inspection is a team event and the tool being a team tool changed its character.
Roughly seven sessions, none more than a few hours, no engineers involved.
That was where things stood a fortnight ago. Headline numbers: 1,650 files, 918MB, seven iterations, no engineers. The user-facing description was a sentence. The plumbing underneath worked.
Then the inspection signal arrived
A working system is the worst kind of cover. Ours had been answering questions, citing sources, and looking competent for a fortnight. The reason I re-opened it: a careful eye on what was actually inside the answers and how long they took to produce.
A governor asked something like “What are the school’s current development areas?” and the tool gave a clear answer citing documents from October 2024. October 2024 is twenty months ago. The school is not the same school it was twenty months ago. The answer was retrieved correctly (the School Improvement Plan from October 2024 is genuinely about development areas) and answered fluently (the model summarised it well). It was the wrong answer because the question was about right now. The retrieval pipeline had no concept of “right now”.
That single observation triggered most of what follows. If retrieval cannot tell new from old, every analytical question over a multi-year corpus risks being subtly wrong. In an Ofsted inspection, that does not survive contact with the inspector.
Here are the actual numbers as of today, after the rebuild:
- 1,973 documents in the indexed corpus, up from 1,650.
- 10,240 chunks in the retrieval index after section-aware chunking with an 80-character minimum (this skipped 639 stub chunks that would otherwise have polluted the top-K).
- 1,536-dimensional dense embeddings from OpenAI text-embedding-3-small, replacing 768-dim Gemini embeddings.
- One end-to-end question in about 8 to 15 seconds for a factoid, up to about 60 seconds for a deeply analytical one. This is the headline number that took the whole week to deliver.
- Public URL is now https://governors.timtrailor.com, not castle-ofsted-agent.streamlit.app. The Cloud version is retired.
Most of those numbers are the outputs of decisions that ate the week. The rest of this post is about the decisions.
The corpus, and the file-type war
Castle Federation’s evidence base lives in GovernorHub, the standard school-governance document platform in the UK. It is a tidy, well-organised tool. It is also a closed shop: there is no public document API, the export options are limited, and most documents arrive as office formats (.docx, .doc, .xlsx, .xlsm, .pptx, .rtf, .odt), with PDFs sprinkled through.
I had built an extractor that walked the file tree and pulled text out. The first version handled .pdf and .docx and assumed those were the bulk. They were not.
When I went looking for the actual count, the desktop folder had 1,778 files. The indexer had ingested 640. More than 1,100 documents were silently being skipped because their extension was outside the allow-list.
That number is what made me rebuild the extractor properly:
- .docx via python-docx (already there).
- .doc (legacy Word) via antiword and textract. About 50 documents in our corpus were .doc, mostly from before 2018.
- .xlsx and .xlsm via openpyxl. Spreadsheets matter for an Ofsted inspection because pupil data, attendance, phonics screening checks, and assessment trackers all live in them.
- .pptx via python-pptx. Strategy decks, governor away-day outputs.
- .rtf via striprtf.
- .odt via odfpy.
- .pdf via pdfplumber, with pypdf as a fallback. Both produce slightly different text on the same PDF; we keep both and merge.
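For the curious, the shape of the extractor is a plain extension switch. This is a minimal sketch rather than the production code: the .doc/antiword, .odt and pypdf-fallback branches are omitted, error handling is stripped, and the helper name is illustrative.

```python
from pathlib import Path

def extract_text(path: Path) -> str:
    """Sketch of per-extension text extraction (not the production extractor)."""
    suffix = path.suffix.lower()
    if suffix == ".docx":
        from docx import Document                      # python-docx
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix in (".xlsx", ".xlsm"):
        from openpyxl import load_workbook
        wb = load_workbook(path, read_only=True, data_only=True)
        return "\n".join(
            " ".join(str(cell) for cell in row if cell is not None)
            for ws in wb.worksheets
            for row in ws.iter_rows(values_only=True)
        )
    if suffix == ".pptx":
        from pptx import Presentation                  # python-pptx
        return "\n".join(
            shape.text
            for slide in Presentation(str(path)).slides
            for shape in slide.shapes if shape.has_text_frame
        )
    if suffix == ".rtf":
        from striprtf.striprtf import rtf_to_text
        return rtf_to_text(path.read_text(errors="ignore"))
    if suffix == ".pdf":
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    # .doc, .odt and the pypdf fallback live in the real extractor
    raise ValueError(f"no extractor for {suffix}")
```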
After the rewrite the corpus came in at 1,973 documents indexed end-to-end (some files are duplicates or generated by GovernorHub itself and were skipped). About 300 of those documents have been added since the original 1,650-file post, both because the extractors recovered older formats and because the senior leadership team has been actively producing new evidence.
There is a separate, less glamorous problem the file-type war flushed out. Some of those documents reference each other (“see attached: the policy of 14 May”) and the references are URLs into GovernorHub or Drive. Earlier in the week I had broken the document-link resolution while moving the bundle around: when a governor clicked a citation, they got a 404. Two days of rework on the link resolver gave us Google Drive view URLs as the standard link target, served back through a tiny custom file server I run alongside Streamlit, with proper MIME types and Content-Disposition: attachment; filename="..." headers so a .docx file downloads as a .docx and not a .docx.html. (Streamlit’s static file server caps non-allow-listed extensions to text/plain, which is how the .html suffix gets bolted on by the browser.)
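The fix in the file server is mostly two headers. Here is a minimal sketch of the idea using the standard library, not the actual docs_server.py; the documents folder and handler names are placeholders.

```python
# Sketch of a per-extension download server: the essential move is the
# explicit Content-Type and Content-Disposition headers so a .docx comes
# back as a .docx rather than text/plain with .html bolted on.
import http.server
import mimetypes
import urllib.parse
from pathlib import Path

DOC_ROOT = Path("documents")   # hypothetical folder of source documents
mimetypes.add_type(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx"
)

class DocHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        name = urllib.parse.unquote(self.path.lstrip("/"))
        target = (DOC_ROOT / name).resolve()
        if DOC_ROOT.resolve() not in target.parents or not target.is_file():
            self.send_error(404)
            return
        ctype = mimetypes.guess_type(target.name)[0] or "application/octet-stream"
        data = target.read_bytes()
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Disposition", f'attachment; filename="{target.name}"')
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    http.server.HTTPServer(("127.0.0.1", 8502), DocHandler).serve_forever()
```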
What “good retrieval” actually means
The hardest decision of the week was on retrieval architecture. Here is the abridged journey, with what each step is, what it gave, and what it took away.
Approach 1: flat context dump (where we started)
Stuff every relevant document into the model’s context window, ask the question, hope for the best. This is the default Stack Overflow answer to “I want my model to read my documents”. It works for small corpora.
For us it broke on two axes:
- Memory. Streamlit Cloud’s free tier has a 1 GB RAM ceiling. The compressed-and-encrypted bundle of source text grew to 100MB during the OpenAI re-embedding (was 22MB on Gemini, the new model has bigger embeddings and there are more documents). Decompressing and parsing that JSON in memory peaked around 1.5GB. The container was being SIGKILL’d before it had even rendered the home page. More on that below.
- Lost-in-the-middle. Long-context models do not weight evidence evenly across the prompt. Liu et al. (2023), Lost in the Middle: How Language Models Use Long Contexts, showed that retrieval accuracy on a long-context QA task forms a U-shape: information at the start and end of the prompt is well-recalled, information in the middle is essentially forgotten. With a 100K-token prompt of school documents, anything in the middle 80% had a fighting chance of being ignored. Stuffing the whole corpus into the prompt is worse than properly retrieving a tenth of it.
So flat context had to go.
Approach 2: BM25 only
BM25 is the classical keyword-relevance scoring function. It is a probabilistic ranking function with decades of information retrieval research behind it; the reference text is Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond. It is not a neural model. It does not understand semantics. It is brutally good at “find me documents that contain these specific words”, and for a governance corpus where vocabulary is highly specific (Ofsted, EYFS, GLD, phonics, KS1), it is surprisingly hard to beat as a baseline.
BM25 alone is the keyword half of what eventually became the live hybrid. I did not run it as a standalone end-to-end mode against the formal eval set (the eval was hand-graded against the live hybrid baseline, not the early ablations). The failure modes I want to flag from a fortnight of using it: it whiffs on synonyms (“teacher” vs “educator”), paraphrases (“how often do governors visit” vs “frequency of governor visits”), and questions where the answer document does not contain the question’s literal vocabulary at all.
Approach 3: dense embedding retrieval only
Replace BM25 with a vector similarity score over OpenAI embeddings. The model that built the index is text-embedding-3-small, 1,536 dimensions, ~62MB on disk for the 10,240-chunk index. Dense retrieval handles paraphrase and synonym beautifully. The reference paper for this style of system is Karpukhin et al. (2020), Dense Passage Retrieval for Open-Domain Question Answering.
Like BM25, dense alone was an early ablation rather than a formally-eval’d standalone mode. Where it fails: rare specific terms (“Mrs Hellings”, a former governor), exact policy titles, and any question whose answer turns on a precise number rather than a paraphrasable concept.
The two approaches fail on different questions. That is the textbook setup for a hybrid.
Approach 4: hybrid (BM25 + dense + reciprocal-rank fusion)
The reason hybrid retrieval is the industry norm now: Cormack, Clarke and Büttcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, showed that combining retrieval systems by reciprocal-rank rather than score normalisation is robust across very different rankers, and tends to outperform either input alone.
The implementation runs BM25 and dense retrieval in parallel, takes the top-100 from each, and combines via 1 / (k + rank) with k = 60. Top-K from the merged ranking goes to the next stage. On the 50-question eval set this gives 68% overall fact recall, 61% retrieval@10, with the easy tier at 83%, medium at 73%, hard at 47%. The latency cost of the parallel BM25 + embedding lookup is about 0.6 seconds, well under budget.
There is a subtlety hiding here that bit me mid-week. When two chunks tied on score, the order they came back in was non-deterministic (it depended on the dictionary insertion order, which depended on which parallel future returned first). Two governors asking the same question could get different sources cited. I fixed this with np.lexsort((indices, -scores)) for a stable top-K with a deterministic tiebreaker. This sounds pedantic until you imagine an inspector cross-referencing two governors who quoted different documents on the same fact.
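A sketch of the fusion plus the deterministic tiebreak, assuming each retriever has already produced a ranked list of integer chunk ids; the function name is illustrative, not the production code.

```python
import numpy as np

def rrf_merge(bm25_ranked, dense_ranked, k=60, top_k=10):
    """Reciprocal-rank fusion of two ranked id lists, with a stable tiebreak."""
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    ids = np.array(sorted(scores))                 # fixed candidate ordering
    vals = np.array([scores[i] for i in ids])
    # Primary key: score descending; ties broken by chunk id, so two governors
    # asking the same question always see the same sources cited.
    order = np.lexsort((ids, -vals))
    return [int(ids[i]) for i in order[:top_k]]
```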
Approach 5: hybrid + Flash rerank
Hybrid still leaves about 16% of questions with the right answer outside the top 5. The standard fix is a cross-encoder rerank: a small smart model looks at the question and each candidate together and re-orders. We use Gemini 2.5 Flash for this step because it is cheap, fast and good at ranking. (The original idea generalises: see Reimers and Gurevych (2019), Sentence-BERT, and the more recent Cohere Rerank docs for the production pattern.)
Rerank turned out to trade surface accuracy for depth. Easy fell from 83% to 78%, medium from 73% to 67%, but hard rose from 47% to 53%. Overall fact recall is 66%, slightly below hybrid’s 68%, because the aggressive reordering hurts the easy and medium questions more than it helps the hard ones in absolute terms. It also added latency: TTFC (time-to-first-character) went from 16.5s to 20.7s, total from 30.7s to 35.9s. For an analytical question that is fine. For a factoid like “when is the next Full Governing Body meeting?” it is wasted time.
Approach 6: skip rerank for factoids
I added a factoid heuristic that detects questions which look like one-fact lookups (when is, who is, what date, “list of”) and skips the rerank entirely on those. Overall fact recall holds at 67%; the win is speed. TTFC drops from 16.5s (hybrid) and 20.7s (with rerank) to 13.3s, and total time from 30.7s and 35.9s to 27.0s. Hard questions, which are almost always non-factoid, still go through the rerank path.
This is the mode the live system runs in today.
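The heuristic itself is nothing more than a phrase check. A sketch follows; the trigger phrases beyond the ones quoted above are guesses, not the production list.

```python
import re

# Phrases quoted in the text plus a couple of illustrative extras.
FACTOID_PATTERNS = re.compile(
    r"\b(when is|when was|who is|who are|what date|how many|list of)\b",
    re.IGNORECASE,
)

def is_factoid(question: str) -> bool:
    """True if the question looks like a one-fact lookup, so rerank can be skipped."""
    return bool(FACTOID_PATTERNS.search(question))

# is_factoid("When is the next Full Governing Body meeting?")              -> True
# is_factoid("How has the school responded to the 2023 inspection areas?") -> False
```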
Approach 7: LLM-judge filter (analytical path only)
For the analytical path I additionally pass the reranked top-15 through a Sonnet-based “judge” call that decides which chunks are actually responsive to the question. This was inspired by Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, which showed that capable models can be reasonable evaluators of relevance. The judge is just a focused prompt that returns “keep” / “drop” for each chunk. It lifts overall fact recall to 70%, the best of the modes tested, with the largest gain on the hard tier (47% with hybrid only, 53% with rerank, 58% with rerank + judge). The quality-of-citation lift is much more visible to humans than the recall number alone suggests: the judge filters out chunks that are adjacent to the topic but do not actually answer the question.
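A sketch of what such a keep/drop call can look like with the Anthropic SDK; the prompt wording, helper name and model id are illustrative rather than the production code.

```python
import anthropic

client = anthropic.Anthropic()

def judge_chunks(question: str, chunks: list[str]) -> list[str]:
    """Ask a stronger model which retrieved chunks actually answer the question."""
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    response = client.messages.create(
        model="claude-sonnet-4-5",     # indicative id for "Sonnet 4.5"
        max_tokens=300,
        system=(
            "For each numbered chunk, reply KEEP if it directly helps answer the "
            "question, DROP otherwise. Reply one line per chunk, like '0: KEEP'."
        ),
        messages=[{"role": "user", "content": f"Question: {question}\n\n{numbered}"}],
    )
    verdicts = response.content[0].text
    keep = set()
    for line in verdicts.splitlines():
        idx, _, verdict = line.partition(":")
        if idx.strip().isdigit() and "KEEP" in verdict.upper():
            keep.add(int(idx))
    return [chunk for i, chunk in enumerate(chunks) if i in keep]
```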
Performance trajectory
The eval set is 50 questions, hand-graded across three difficulty tiers (easy 15, medium 20, hard 15), each scored against the source document containing the truth. Two recall numbers per mode: fact recall (did the answer model produce the right factual answer end-to-end), and retrieval@10 (was the truth-document in the top 10 retrieved chunks). Latencies are mean seconds across all 50 questions; TTFC is time-to-first-character, total is end-to-end.
| # | Approach | Easy fact (n=15) | Medium fact (n=20) | Hard fact (n=15) | All fact (n=50) | Retrieval@10 | TTFC mean | Total mean |
|---|---|---|---|---|---|---|---|---|
| 1 | Flat context dump | OOM | OOM | OOM | OOM | n/a | n/a | n/a |
| 4 | Hybrid: BM25 + dense + RRF (baseline) | 83% | 73% | 47% | 68% | 61% | 16.5s | 30.7s |
| 5 | + Flash cross-encoder rerank | 78% | 67% | 53% | 66% | 58% | 20.7s | 35.9s |
| 6 | + factoid auto-skip-rerank | 74% | 73% | 52% | 67% | 61% | 13.3s | 27.0s |
| 7 | + LLM-judge filter (best overall) | 84% | 68% | 58% | 70% | 63% | 18.3s | 30.6s |
Rows 2 and 3 (BM25-only and dense-only) are not in the table because they were exploratory ablations rather than formally-eval’d modes. The eval set was hand-graded once the hybrid baseline was up.
A few things the table makes obvious that the prose does not:
- Rerank does not always help overall recall. Mode 5 (with rerank) is 66% overall, two points below mode 4 (hybrid only) at 68%. The reason is that rerank’s reordering tends to help the hard questions (47% to 53%) at the cost of the easy and medium ones (5 to 6 points each). On a 50-question eval where easy + medium are 35 of the questions, the easy-tier loss outweighs the hard-tier gain in the average.
- Factoid auto-skip is a speed win, not a recall win. Mode 6 (factoid path skips the rerank, analytical keeps it) holds overall recall at 67%, very close to mode 4’s 68%, but cuts mean total time from 30.7s to 27.0s and TTFC from 16.5s to 13.3s.
- LLM-judge is the only mode that improves both overall fact recall and the hard tier. Mode 7 is 70% overall (the best), with hard at 58% (up from 47% with hybrid only).
A note on answer accuracy versus retrieval recall: retrieval finding the document does not mean the answer model produces the right factual answer. The “fact” columns above are end-to-end (the model wrote an answer that matched ground truth); the retrieval@10 column is retrieval-only. The gap between the two (about 7 points on average) is the answer-model’s contribution.
The recency problem
Recall the October-2024 development-areas answer. The retriever was finding the relevant document; it just had no concept that the document was old. I added a recency boost in two layers:
- Default decay applied to every query, with weight 0.025 for a current-month document and the boost halving every four months. This is small enough that a clearly-on-topic old document still beats an off-topic new one, but large enough that a tied score goes to the recent doc.
- Explicit boost triggered when the query contains words like latest, current, this term, this year. Weight goes to 0.10, three to four times the default. This is the case where the user has told us they want recent.
The boost is layered on top of the answer model prompt, which now says (paraphrased): Default to the current academic year. Older sources are not the answer unless the question is explicit about a past period (for example “response to the last Ofsted”) or no recent source exists. When an answer must rely on a document older than 12 months, flag it at the top of the answer.
This is the change that fixed the development-areas question. The same retrieval pipeline that previously surfaced the October-2024 SIP now surfaces the September-2025 SIP, and it does so without us having to manually tune the query.
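In code, the two layers reduce to one small additive function. A sketch using the numbers above; the extra trigger words are illustrative.

```python
import re
from datetime import date

RECENCY_WORDS = re.compile(r"\b(latest|current|this term|this year)\b", re.IGNORECASE)

def recency_boost(query: str, doc_date: date, today: date | None = None) -> float:
    """Additive boost: 0.025 by default, 0.10 when the query asks for 'latest'/'current',
    halving for every four months of document age."""
    today = today or date.today()
    months_old = max(0, (today.year - doc_date.year) * 12 + (today.month - doc_date.month))
    base = 0.10 if RECENCY_WORDS.search(query) else 0.025
    return base * 0.5 ** (months_old / 4)

# A current-month SIP gets the full boost; the October-2024 SIP, at roughly
# 20 months old, keeps about 3% of it, so a tied score now goes to the recent doc.
```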
Off Streamlit Cloud, onto a Mac Mini
The original article said “hosted on Streamlit Community Cloud, restricted by federation email”. By midweek that hosting was untenable, for a reason that is interesting in its own right.
When I re-embedded the corpus with OpenAI’s 1,536-dim model (twice the dimensionality of Gemini’s 768-dim), the on-disk size of the embeddings doubled. The compressed-and-Fernet-encrypted bundle that the Streamlit Cloud surface decrypts at startup grew from 22MB to 100MB. Loading it into memory peaks high enough to OOM the 1GB-tier container during the JSON parse: Streamlit was being SIGKILL’d silently with no Python traceback, which is the most fun kind of debugging.
Three fixes applied in sequence:
- Streaming-style decode. Decompress the gzip into bytes, parse, and immediately del the intermediate buffers, with gc.collect() calls after each large drop. Cut peak memory by about 600MB.
- Float16 embeddings. np.load(...).astype(np.float16, copy=False) halves the in-memory size of the embeddings (60MB to 30MB) at no measurable accuracy cost on dot-product similarity.
- OFSTED_DISABLE_BM25=1. A flag that skips loading the BM25 pickle entirely, falling back to dense-only retrieval. About 200MB saved at peak.
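Put together, the load path after the three fixes looks roughly like this. File names and the bundle layout are assumptions; the three moves are the ones in the list above.

```python
import gc
import gzip
import json
import os
import pickle
import numpy as np

def load_bundle(path: str = "corpus_bundle.json.gz") -> dict:
    with gzip.open(path, "rb") as f:
        raw = f.read()
    corpus = json.loads(raw)
    del raw
    gc.collect()              # drop the decompressed bytes before loading embeddings
    return corpus

def load_embeddings(path: str = "embeddings.npy") -> np.ndarray:
    # float16 halves the resident size; dot-product rankings are unchanged in practice
    return np.load(path).astype(np.float16, copy=False)

def load_bm25(path: str = "bm25.pkl"):
    if os.environ.get("OFSTED_DISABLE_BM25") == "1":
        return None           # dense-only fallback, roughly 200MB saved at peak
    with open(path, "rb") as f:
        return pickle.load(f)
```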
These three got the system back inside 1GB. But the more painful discovery was upstream of the engineering: Streamlit Cloud no longer offers a paid tier. Snowflake acquired Streamlit, and the hosted option for a Streamlit app is now either Snowflake’s own managed offering (priced for an enterprise data team, not a school governor’s side project) or self-host. The free tier’s 1GB limit is a hard ceiling I cannot pay my way past on the Streamlit Cloud surface.
The migration was straightforward and inevitable:
- Mac Mini at home becomes the host. Same hardware that already runs other personal projects, several gigs of RAM headroom, no memory ceiling.
- Cloudflare Tunnel for public reachability. The tunnel daemon (cloudflared) connects out from the Mac Mini to Cloudflare’s edge over an authenticated channel; Cloudflare terminates HTTPS at the edge and forwards plain HTTP to the local Streamlit on 127.0.0.1:8501. No port forwarding on the home router, no IP exposure, no DDNS gymnastics.
- governors.timtrailor.com as the public hostname. Cloudflare’s free Universal SSL covers *.timtrailor.com (a single-label wildcard) so the existing certificate works with no extra paperwork.
- launchd LaunchAgent with KeepAlive for crash recovery. The service auto-restarts on any non-zero exit, so an OOM (now hypothetical) or any other crash does not leave the public URL down.
Public URL was up within a few hours of starting the migration. There are some caveats: the www.governors.timtrailor.com variant does not work because two-label subdomain wildcards need Cloudflare’s paid Advanced Certificate Manager, but the bare hostname is what governors use anyway.
Authentication, and the iOS Safari problem
The original Streamlit Cloud version used a magic-link sign-in restricted to federation emails. That uses Streamlit’s hosted auth, which the self-hosted version does not have.
I rebuilt the magic-code flow myself: governor enters email, server generates an HMAC-signed time-bucketed 6-digit code, sends it via SMTP, governor enters code, server verifies, session begins. The code is HMAC-SHA-256 over email:bucket where bucket = floor(time_now / 600), so codes are valid for at most 20 minutes (current bucket plus the previous one for sliding-window tolerance).
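A sketch of that code scheme, with AUTH_SECRET standing in for the server-side secret:

```python
import hashlib
import hmac
import time

AUTH_SECRET = b"replace-with-server-secret"   # placeholder for the real secret

def _code_for(email: str, bucket: int) -> str:
    """6-digit code derived from HMAC-SHA-256 over 'email:bucket'."""
    digest = hmac.new(AUTH_SECRET, f"{email}:{bucket}".encode(), hashlib.sha256).digest()
    return f"{int.from_bytes(digest[:4], 'big') % 1_000_000:06d}"

def issue_code(email: str) -> str:
    bucket = int(time.time() // 600)           # 10-minute buckets
    return _code_for(email, bucket)

def verify_code(email: str, code: str) -> bool:
    bucket = int(time.time() // 600)
    # Accept the current and previous bucket: at most ~20 minutes of validity.
    return any(hmac.compare_digest(code, _code_for(email, b))
               for b in (bucket, bucket - 1))
```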
Persistence is where it got interesting. Streamlit’s session state is per-WebSocket connection; mobile networks drop WebSockets routinely; iOS Safari’s Intelligent Tracking Prevention purges JavaScript-set first-party cookies after seven days at the most aggressive setting, and in some configurations as soon as the next browser launch. So a JavaScript cookie alone is not reliable.
The persistence layer ended up as belt-and-braces:
- HMAC-signed token in the URL. After the user verifies, we redirect to ?auth=<token> where the token encodes email:expiry:signature, base64-encoded. Reloads, pull-to-refreshes, browser restarts, all carry the URL with them. Bookmarking the URL is equivalent to bookmarking a session.
- JavaScript cookie as a secondary path, set to expire after seven days, in case the user pastes the URL without the ?auth= query.
- Server-side validation of either path on every request. The signature includes the user’s email, so an attacker cannot fabricate a token without the auth secret.
It is the URL-token mechanism that is doing the actual work. The cookie is there for completeness.
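For completeness, a sketch of the token itself, matching the email:expiry:signature layout described above; the names are illustrative.

```python
import base64
import hashlib
import hmac
import time

AUTH_SECRET = b"replace-with-server-secret"   # placeholder for the real secret

def make_auth_token(email: str, ttl_seconds: int = 7 * 24 * 3600) -> str:
    expiry = int(time.time()) + ttl_seconds
    payload = f"{email}:{expiry}"
    sig = hmac.new(AUTH_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(f"{payload}:{sig}".encode()).decode()

def verify_auth_token(token: str) -> str | None:
    """Return the authenticated email, or None if the token is forged or expired."""
    try:
        email, expiry, sig = base64.urlsafe_b64decode(token).decode().rsplit(":", 2)
        expected = hmac.new(AUTH_SECRET, f"{email}:{expiry}".encode(),
                            hashlib.sha256).hexdigest()
        if hmac.compare_digest(sig, expected) and time.time() < int(expiry):
            return email
    except Exception:
        pass
    return None
```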
The model choices, by stage
Five different models live in different parts of the pipeline. Each was chosen for a specific job.
| Stage | Model | Why |
|---|---|---|
| Query decomposition | Gemini 2.5 Flash | Fast, cheap, decent at parsing intent. About 4-6 seconds. |
| Dense embeddings (corpus) | OpenAI text-embedding-3-small | Replaced Gemini embeddings after Gemini’s free quota started returning 503s during a re-embed. 1,536-dim quality is a step up at marginal cost. |
| Embeddings (query) | OpenAI text-embedding-3-small | Has to match the corpus model. The retrieval manifest pins the embedding model so I cannot accidentally mix providers. |
| Cross-encoder rerank | Gemini 2.5 Flash | Same model as decomp, different prompt. About 3-10 seconds depending on candidate count. |
| LLM-judge filter | Sonnet 4.5 | Stronger than Flash, used sparingly (only on the analytical path). |
| Answer-writing (default) | Sonnet 4.5 via API, streaming | First token in about 1.4 seconds, total time about 5 to 15 seconds for a typical answer. Prompt caching with cache_control: ephemeral on the system prompt. |
| Answer-writing (deepest) | Opus 4.7 via subscription | Reserved for analytical questions that need the larger model. Slower (30-90 seconds) but deeper. Currently being routed via API on this build because the subscription bridge has been intermittent this week. |
The decision to default the public surface to Sonnet API streaming rather than Opus subscription was made yesterday. Streaming matters for perceived responsiveness: the first token reaches the governor’s screen in well under two seconds, which is enough to feel like the system is working rather than blocked, even if the full answer takes another ten seconds. Anthropic’s own research on perceived latency says the first 200ms of any response is the most psychologically expensive; the budget gets spent on that first token, and whatever token rate the model gives afterwards is accepted.
Opus subscription was the more attractive option early in the week because it is “free” once you have the Max plan. It became the less attractive option after the local Claude CLI started failing intermittently with no informative error, which on a public-facing tool is unacceptable. The fix exists (proper subprocess stderr capture, environment-variable propagation under launchd) but it was not the right battle to fight in the run-up to an inspection. API spend is bounded; an unreachable model is not.
Rate limits, and the “spend money to remove the bottleneck” decision
By Thursday afternoon I was tripping Anthropic’s per-minute input-token rate limit. Per-question prompts run about 100K input tokens (system prompt plus retrieved documents plus question), the API tier I was on caps at 450K input tokens per minute, so even six rapid-fire questions in a sixty-second window would 429. The retry-with-backoff code helped but did not fix the underlying physics.
The rational response, on a one-week timeline before an Ofsted inspection that has cost the senior leadership team months of preparation, is to spend the money. I am upgrading the API tier (Tier 4, ~$400 deposit, 2M input tokens/min, four times the headroom) over the weekend. That deposit is a rounding error against the cost of a poorly-handled inspection, and the cap goes back down once we are no longer being asked questions live.
There is a cleaner architectural fix on top: restructuring the prompt to put the static system rules in a separately-cached block and the dynamic documents-and-question in the unannotated user message. Anthropic’s prompt caching gives a 90% read-discount on the cached portion, which means later questions in a session are effectively half the rate-limit cost. That ships this weekend.
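The shape of that split, sketched with the Anthropic SDK’s cache_control marker; the model id, system text and variable names are placeholders, not the production prompt.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_RULES = "You answer as a governor would in an Ofsted interview..."  # static block

def answer(question: str, retrieved_docs: str):
    """Stream an answer: static rules cached, dynamic docs + question uncached."""
    with client.messages.stream(
        model="claude-sonnet-4-5",                    # indicative id for "Sonnet 4.5"
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_RULES,
            "cache_control": {"type": "ephemeral"},   # 90% read discount on later calls
        }],
        messages=[{
            "role": "user",
            "content": f"{retrieved_docs}\n\nQuestion: {question}",
        }],
    ) as stream:
        for text in stream.text_stream:               # first tokens reach the UI quickly
            yield text
```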
What the architecture looks like today
```
[Governor's iPhone Safari]
        |
        v HTTPS
[Cloudflare edge, governors.timtrailor.com]
        |
        v authenticated tunnel (cloudflared)
[Mac Mini @ home, port 8501]
        |
        +-- Streamlit (LaunchAgent, auto-restart)
        |     |
        |     +-- Stage 1: query decomp (Gemini Flash)
        |     |     -> hybrid retrieval (BM25 + OpenAI embeddings + RRF)
        |     |     -> rerank (Gemini Flash, skipped on factoids)
        |     |     -> LLM-judge filter (Sonnet, analytical only)
        |     |
        |     +-- Stage 2: synthesis (Sonnet 4.5 API, streaming)
        |     |     with prompt caching, 1 retry on transient errors
        |     |
        |     +-- Auth: URL token + cookie + server-side HMAC verify
        |
        +-- docs_server.py (port 8502, custom MIME types)
              -> serves source documents back to citations on click
```
It all fits on a Mac Mini, answers a typical question in about 15 seconds, and is one rollback away from the version I wrote about a fortnight ago.
What I am doing this weekend
Three things, in order:
- API tier upgrade. Console click-through, $400 deposit. Removes the rate-limit bottleneck before any plausible inspection-day load.
- Pre-flight dry run. A 10-question script of representative Ofsted questions, end-to-end against the production stack, with answer-quality and latency captured for each. Anything that surprises me gets fixed before governors see it.
- Lock the surface. Pin the synth backend to Sonnet API. Hide the experimental controls (model picker, retrieval-mode selector, time-frame filter) from non-developer governors so nothing surprising can happen during an inspection. Friendly fallback message (“system busy, try again in a moment”) for any error path, instead of a stack trace.
Fourth thing if there is time: write the missing ?auth= token into the magic-code emails directly so the governor never has to type a code, just clicks the email link. The infrastructure is there; it is one extra line in the email template.
What I have learned
The original article ended with two takeaways: that “too much documentation, not enough time, high-stakes conversation” is a category, and that citations are what made the tool trusted. Both of those still hold. Two more from this fortnight:
A working system will hide its assumptions. The retrieval pipeline had an implicit assumption that all documents were equally current. It produced fluent, citation-backed answers about a school as it was twenty months ago. The error mode was not visibly broken, it was quietly out of date. The only way I caught it was by reading the answers carefully against current ground truth. A user-acceptance test suite without a temporal axis cannot catch this class of failure.
Spending money on the last mile is a feature, not a defeat. I spent the first half of the week trying to make the system fit inside Streamlit Cloud’s 1GB ceiling and Anthropic’s Tier 2 rate limit. The decisions that finally made the system bulletproof for an inspection were “move to a beefier host” and “buy more rate limit”. Both are unglamorous compared to a clever caching trick, but they are vastly more reliable. With the inspection due next week, paid-tier headroom is the right answer.
Sources, if you want to dig further
Papers referenced in the post:
- Robertson and Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
- Karpukhin, Oguz, Min, Lewis, Wu, Edunov, Chen, Yih (2020), Dense Passage Retrieval for Open-Domain Question Answering.
- Cormack, Clarke, Büttcher (2009), Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.
- Reimers and Gurevych (2019), Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.
- Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang (2023), Lost in the Middle: How Language Models Use Long Contexts.
- Zheng, Chiang, Sheng, Zhuang, Wu, Zhuang, Lin, Li, Xing, Zhang, Gonzalez, Stoica (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Industry references:
- Anthropic on prompt caching.
- Cohere Rerank, the production-deployment pattern that hybrid + rerank is converging on across the industry.
- Streamlit on self-hosting after Snowflake.
The code repository is still public at timtrailor-hash/castle-ofsted-agent. The documents themselves are not. If you want to fork it for your own federation, the documents slot is yours to fill.
The next time I update this post, the inspection will have happened. I will say what worked, what did not, and what I wish I had built. If everything goes to plan, the most interesting thing I will have to report will be that nothing remarkable happened: the tool sat in the meeting and the governors used it without thinking about it.
That is the bar. That is what next week is for.
Repository: timtrailor-hash/castle-ofsted-agent. Live tool: governors.timtrailor.com. Access restricted to Castle Federation governors.