Tim Trailor

918MB, an Ofsted inspection, and a governor who is not a developer

My kids' school was rated Requires Improvement and was facing re-inspection. The evidence base was 1,650 files and 918 megabytes. No governor was going to read all of it. So we built a tool that could.

One of the two schools in my kids’ federation was rated “Requires Improvement” at its previous Office for Standards in Education (Ofsted) inspection in October 2023. The re-inspection could happen at any time with roughly a day’s notice. The governing body’s job, when the call comes, is to give evidence to the inspectors that the school has addressed the concerns raised last time and is now operating at the standard required.

The evidence base was 1,650 files. 918 megabytes. Policies, minutes, strategy documents, assessment data, safeguarding records, correspondence with the local authority, subject leader reports, action plans. A fraction of it is the real evidence; most of it is the scaffolding. Even reading the fraction takes a team of governors longer than they have.

I am one of those governors. I am not a teacher, an inspector, or a full-time administrator. I have a day job. So do my fellow governors. The practical question we were staring at was: how do we answer inspector questions with cited evidence, on the day, across a corpus none of us have fully read?

What we built

A web application that has read every document in the evidence base. You ask it an Ofsted-style question, it answers with cited references, and the citations link back to the source document. You can use it by typing or by speaking (the school halls are noisy during an inspection; hands-free mattered). Multiple governors logged in at the same time see the same conversation and can build on each other's questions. Authentication is by a link emailed to a federation email address.

The point of the tool is not to replace the governors’ judgement. It is to let the governors answer “where did the school document its response to this concern?” in thirty seconds instead of thirty minutes. The human is still accountable for the answer given to the inspector; the machine is accountable for finding the paragraph.

It is live at castle-ofsted-agent.streamlit.app. Access is restricted to federation governors.

What it went through to get there

Seven versions. Not as a straight line; as a staircase, with each version making the previous one obsolete once the failure mode was understood.

Version 1 was a terminal tool. I ran it on my laptop. It loaded a subset of the corpus into the context of a language model and let me ask questions. It worked. It could only be used by me, at my laptop, when my laptop was open. That obviously was not going to survive contact with a real inspection.

Version 2 was a web application with a model selector and a school focus (the federation is two schools, and answers are better when scoped to one of them). This ran on my laptop and was reachable by other governors only if they were on my home network, which they were not.

Version 3 added voice input. Inspections are noisy and governors are frequently doing something with their hands. I used Whisper running locally on my laptop so that audio never left the machine. That mattered for document confidentiality.

Version 4 added source citations. The early versions produced fluent answers that felt right but had no way for a governor to verify. A fluent answer with no source is worse than no answer, because it looks trustworthy. Citations with clickable links to the source document fixed this. This took six iterations on its own, because matching a model’s paraphrase back to the exact paragraph in the source is harder than it sounds.
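One workable approach to the paraphrase-to-paragraph problem is fuzzy similarity scoring against each paragraph of the source. This is a minimal sketch using Python's standard library; the tool's actual matching logic is not shown here, and `best_matching_paragraph` is an illustrative name, not the real function:

```python
from difflib import SequenceMatcher

def best_matching_paragraph(snippet: str, document: str) -> tuple[int, float]:
    """Return (paragraph index, similarity ratio) for the paragraph
    most similar to the model's quoted or paraphrased snippet."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    scored = [
        (i, SequenceMatcher(None, snippet.lower(), p.lower()).ratio())
        for i, p in enumerate(paragraphs)
    ]
    return max(scored, key=lambda pair: pair[1])

doc = ("Attendance improved to 96%.\n\n"
       "The safeguarding audit was completed in May.")
idx, score = best_matching_paragraph("safeguarding audit completed in May", doc)
# idx == 1: the snippet maps to the second paragraph
```

The hard part, as the six iterations suggest, is that a model's paraphrase can diverge far enough from the source wording that simple ratios mis-rank paragraphs; a real implementation needs tuning and fallbacks.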

Version 5 rewrote the prompt to adjust tone and format. The early model answers were discursive and long. Inspectors want concise, evidence-based answers. A governor giving an inspector a paragraph-long monologue loses credibility quickly. The prompt now instructs the model to answer as a governor would in an interview: one or two sentences, followed by a citation, followed by silence.
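In that spirit, a system prompt for this kind of answer discipline might look something like the following. This wording is illustrative only; the tool's actual prompt is not reproduced here:

```python
# Hypothetical system prompt capturing the "answer like a governor in an
# interview" instruction: short answer, evidence, citation, then stop.
SYSTEM_PROMPT = (
    "You are helping a school governor answer an Ofsted inspector. "
    "Answer in one or two sentences, grounded only in the evidence base. "
    "Follow the answer with a citation to the source document. "
    "Do not elaborate beyond the citation. If the evidence base does not "
    "contain an answer, say so."
)
```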

Version 6 moved the whole thing to a hosted environment (Streamlit Community Cloud), protected by email-based magic-link authentication restricted to the federation’s own email domain. Other governors could now log in from anywhere. The document corpus lives in the repository as a Fernet-encrypted blob, decrypted at runtime with a key held in the hosted environment’s secrets. The repository itself is public; the decrypted contents never are.
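The encrypt-at-rest arrangement can be sketched as follows, assuming the `cryptography` package's Fernet implementation. Function names and the payload are illustrative, not the repository's actual layout:

```python
from cryptography.fernet import Fernet

def encrypt_corpus(key: bytes, plaintext: bytes) -> bytes:
    # Run once, locally, before committing the encrypted blob to the
    # public repository. The key itself is never committed.
    return Fernet(key).encrypt(plaintext)

def decrypt_corpus(key: bytes, blob: bytes) -> bytes:
    # Run at app startup in the hosted environment; the key is read
    # from the host's secrets store (e.g. Streamlit's secrets).
    return Fernet(key).decrypt(blob)

key = Fernet.generate_key()  # held only in the hosted environment's secrets
blob = encrypt_corpus(key, b"minutes, policies, subject leader reports")
assert decrypt_corpus(key, blob) == b"minutes, policies, subject leader reports"
```

The design choice worth noting is that the repository stays public (so other governors can reuse the code) while the corpus stays private, because only ciphertext is ever committed.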

Version 7 added real-time collaborative chat. Multiple governors logged in at the same time see the same conversation thread. A question asked by one surfaces for all. A presence indicator shows who is online. This is a small feature with a disproportionate effect: an inspection is a team event, and the tool being a team tool rather than an individual tool changes its texture.

That is the whole staircase. Roughly seven sessions. None of them more than a few hours. No engineers involved.

What it is not

It does not replace governor judgement. The tool is explicitly framed in its responses as a first-pass evidence retriever, not an inspector-facing authority. The final word on what the governing body believes and what it can evidence stays with the humans in the room.

It does not expose the documents outside the governing body. Decryption happens server-side behind authentication. The public repository contains the encrypted corpus and the code, nothing else.

It is not trying to be generalisable school-inspection software. Every school’s documents are different, every federation’s priorities are different, and the value comes from the tool knowing this specific school’s evidence base. If another governor wants to build the same thing for their school, the code is in the public repository and the documents slot is theirs to fill.

What it taught me

Two things.

First, the category of problem. "Too much documentation, not enough time to read it, and a high-stakes conversation where specific references are required" is a template. A governing body is one example. A board reading a data room before an acquisition is another. A senior policy-maker preparing for a select committee is another. The pattern of the tool (encrypted corpus, authentication to a defined group, retrieval with citations, collaborative chat) generalises. I would build it again for any situation of that shape.

Second, the politics of using it. I was careful, early on, to frame the tool for the governing body as a research aid, not a decision-maker, and to explicitly say that the citations were the thing that made it useful. If a governor could not click through and verify the evidence, the tool would not have been trusted. The citations made it trustworthy in a way that a fluent answer alone did not. This is a specific case of a general rule: in any context where you are answering to a third party, the evidence is more valuable than the fluency.

What happens next

At the time of writing the school has not yet been re-inspected. When the call comes, the tool will be in the meeting. I will update this post afterwards.

The code is public. The documents are not. The tool exists because a non-developer on a governing body had one evening a week for a month and a reason to want an answer that was better than “I could not read all 918 megabytes”.


Repository: timtrailor-hash/castle-ofsted-agent. Live tool: castle-ofsted-agent.streamlit.app. Access is restricted to Castle Federation governors.