Tim Trailor

Three-way AI model debate as a pre-commit gate: receipts from several months of use

The harness is about 150 lines of Python. I have run it on architectural reviews, plan approvals, and more recently on every significant write to my personal AI infrastructure.

This post is about what I have learned from running it. Most of what I assumed before I built it turned out to be wrong. The value is not where I expected. The failure modes are not what I worried about. And the protocol caught two classes of error single-model review would have missed, both with specific dated receipts.

What the protocol does

The orchestrator (me, or a Claude Opus session acting as scheduler) writes the decision to disk as context. The context is the full design document, the change plan, or the pull request diff with enough surrounding code to make it standalone.

Round 0. All three models see the context. None of them see any other’s response. Claude writes its position to a file. The Python harness calls Gemini and GPT-5.4 in parallel and writes theirs. All three files are revealed simultaneously. Each position must end with a verdict line: APPROVE, CHANGES REQUESTED, or BLOCK.
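The verdict line can be extracted mechanically. A minimal sketch of how that detection could look (the regex and function name here are mine, not necessarily what the published harness uses):

```python
import re

# A position file must end with one of three verdict lines.
VERDICT_RE = re.compile(r"^(APPROVE|CHANGES REQUESTED|BLOCK)\s*$", re.MULTILINE)

def parse_verdict(position_text: str) -> str:
    """Return the last verdict line in a position, or raise if absent."""
    matches = VERDICT_RE.findall(position_text)
    if not matches:
        raise ValueError("position has no verdict line")
    return matches[-1]
```

Taking the last match rather than the first matters: models sometimes quote a verdict mid-argument before committing to their own at the end.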

If all three return the same verdict, the debate is over. Unanimous. That result carries meaning because it happened before any model anchored on any other.

If two of three agree, the dissenter’s reasoning gets flagged. The orchestrator can either accept the majority with a note, or proceed to round 1 on the grounds that the dissent is worth exploring.

If all three differ, round 1 is mandatory.

Round 1+. Each model sees all three round-zero positions and must engage with them explicitly. “Where do you agree? Where do you disagree? If either reviewer raised a valid point you missed, update your verdict honestly. If your position is still correct, defend it with new evidence, not the same argument.” The rotating blast-radius seat is assigned on round 1: for that round, that model’s job is specifically to walk the dependency graph and identify what consumers of the change have not been checked. The other two continue design critique. I rotate the seat across debates so no single model’s blind spots dominate.

Rounds continue until unanimous, majority with acknowledged dissent, or max rounds hit (default 4). If max rounds hit without convergence, all three final positions are surfaced and I decide.
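The stopping rule reduces to a small classification over the three verdicts. A sketch, with my own names for the outcomes (the orchestrator still decides whether a majority is accepted or debated further):

```python
from collections import Counter

def classify_round(verdicts: dict) -> str:
    """Map {model: verdict} to 'unanimous:V', 'majority:V', or 'split'."""
    counts = Counter(verdicts.values())
    verdict, n = counts.most_common(1)[0]
    if n == len(verdicts):
        return f"unanimous:{verdict}"
    if n > len(verdicts) // 2:
        return f"majority:{verdict}"
    return "split"
```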

The harness is at [URL when published]. The core of the Python runner:

from concurrent.futures import ThreadPoolExecutor

def run_round(round_n, context, prior_positions):
    if round_n == 0:
        # Round 0 is blind: each external model sees only the context.
        gemini_prompt = BLIND_PROMPT.format(context=context)
        gpt_prompt = BLIND_PROMPT.format(context=context)
    else:
        # Later rounds: each model sees its own prior position plus the other
        # two. Generic peer slots let one INFORMED_PROMPT serve both models.
        gemini_prompt = INFORMED_PROMPT.format(
            context=context,
            own_prior=prior_positions["gemini"],
            peer_a_prior=prior_positions["claude"],
            peer_b_prior=prior_positions["gpt"],
        )
        gpt_prompt = INFORMED_PROMPT.format(
            context=context,
            own_prior=prior_positions["gpt"],
            peer_a_prior=prior_positions["claude"],
            peer_b_prior=prior_positions["gemini"],
        )
    # Call Gemini and GPT in parallel; Claude writes its position directly to file
    with ThreadPoolExecutor(max_workers=2) as pool:
        gem_future = pool.submit(call_gemini, gemini_prompt)
        gpt_future = pool.submit(call_gpt, gpt_prompt)
        return {"gemini": gem_future.result(), "gpt": gpt_future.result()}

Cost per debate at full tier (Gemini Pro + GPT-5.4 full): roughly $0.30 to $0.60 per decision for four rounds, with Claude on the orchestrating session costing nothing extra because the Opus subscription already covers it. Mini tier (Gemini Flash + GPT-5.4-mini) is about a third of that.
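For budgeting, the tiers prorate simply. A back-of-envelope sketch (the figures are my ballpark per-decision observations, not published pricing, and the one-third mini ratio is the rough estimate above):

```python
def debate_cost_bounds(tier: str, rounds: int = 4) -> tuple:
    """Rough USD cost bounds per debate for a given tier and round count."""
    full = (0.30, 0.60)  # observed bounds for a four-round full-tier debate
    scale = 1.0 if tier == "full" else 1.0 / 3.0  # mini assumed ~one third
    low, high = (x * scale * rounds / 4 for x in full)
    return (low, high)
```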

What I expected before running it

I expected debate value to come from ensembling. Three models, three imperfect judgments, majority verdict more reliable than any single one. The “wisdom of crowds” framing.

I expected the cost to be round count. Four rounds, three models, context-heavy prompts, not cheap.

I expected the failure mode to be collusion. If all three came from similar training data, they would anchor on the same mistakes, unanimous verdicts would be cheap signal, and the protocol would give false confidence.

I expected the protocol to be mostly useful for safety-critical decisions (new daemons, security changes, anything touching the printer) and less useful for small design choices.

Most of these expectations were wrong.

What I actually found

Value is in framing diversity, not verdict ensembling. Three models trained by three different labs do not produce three versions of the same answer. They produce three different framings. Claude defaults to systems-level risk analysis. Gemini defaults to “what is missing” and “who has not been consulted”. GPT-5.4 defaults to first-principles re-derivation of the problem. When I see those three framings on the same question, I learn more from the three-way comparison than I do from any single one, even when they converge.

Convergence is dangerous. When all three models agree immediately on round 0, it is frequently because they share a training-data bias. The first time I ran the protocol on a “should I decompose this 7,000-line monolith?” decision, all three returned APPROVE with almost identical reasoning. I shipped the decomposition plan. Two weeks in, I realised all three had missed the same question: did any downstream consumer depend on the monolith’s internal call graph in a way that module splits would break? The answer was yes. Unanimous round-zero agreements are now flagged as suspicious and I force a round 1 specifically looking for what the agreement masked.

Principled disagreement is the signal. The most valuable debates are the ones where two models converge and one holds out with a specific technical objection. That objection is almost always worth a third look. Not always right. But worth another round of thought. In my data, round-one dissent after round-zero agreement has changed my decision about a quarter of the time, a far higher signal rate than round-zero unanimity has ever delivered.

The blast-radius seat is load-bearing. This was not my idea. It came from a failed review in April 2026: I ran the protocol on the plan to rebuild my controlplane infrastructure, got unanimous APPROVE on round 0, shipped the plan, and within hours of the first real check four red flags appeared on my iOS app. All three models had reviewed the design in isolation from its operational blast radius. They had not asked “what consumes this? what is the staleness horizon of every consumer?” because they had not been asked to. When I debriefed the failure, all three models converged on the same fix: one seat per round must be explicitly responsible for operational dependency analysis, not design critique. I added that constraint to the protocol the next day. Every debate I have run since has surfaced at least one unchecked consumer. Without the blast-radius seat, those would have become post-ship surprises.

Specific incidents where the protocol caught things

I am going to name three, because concrete matters more here than general claims.

The research-plan debate (20 April 2026). I asked the protocol to review my plan for a global research review paper on Claude Code best practices. Round 0, blind. All three returned CHANGES REQUESTED. Not unanimous on the specific changes, but unanimous on the verdict. Claude flagged “the plan is biased toward addition; add a simplification vector”. Gemini flagged “the plan is too broad for the time budget; sharpen the thesis”. GPT-5.4 flagged “the plan underweights adversarial review of prompt-injection pathways”. Three different framings of the same underlying issue: the plan was asking the wrong question. The correct question was not “what tools and patterns exist?” but “what is under-measured in this setup that should be measured first?”. I rewrote the plan with a measurement-first anchor. The eventual paper was, to the extent I can judge my own output, materially better than the plan’s original shape would have produced.

The controlplane rebuild debate (11 April 2026). This is the failed one, which makes it more instructive. I asked the protocol to review a seven-phase plan to rebuild my infrastructure under a unified control plane repository. Round 0, unanimous APPROVE with substantive commentary. I shipped. Within hours of the first acceptance-test run, four red flags appeared on my iOS app, all pointing at consumers of the old layout that the new layout had broken without migration. In debrief, all three models converged on the same root cause: they had reviewed the design in isolation from downstream operational state. None of the three had asked whether health_check.py’s hardcoded list of LaunchAgents matched the new services.yaml. None had asked whether the iOS app’s expected file paths were in the new tree. The design was correct; the rollout was broken. The lesson was not “the debate was wrong”. The lesson was “the debate did not cover operational blast radius, and it needed to”. That is where the mandatory blast-radius seat came from.

The recommendations-paper debate (this week). The more recent one. I asked the protocol to review a draft recommendations paper I was about to publish to myself. Round 0 unanimous CHANGES REQUESTED, with the three models converging on one missing recommendation from slightly different angles. Claude said “prompt/system-message management strategy is missing”. Gemini said “cost observability for API spend attribution is missing”. GPT-5.4 said “versioning and pinning of prompts, skills, MCP server configs, and model IDs as first-class release artifacts with rollback by exact digest is missing”. All three were circling the same gap, which is that the configuration surface I am managing is not versioned, measured, or pinned as a coherent artifact. I consolidated the three observations into one new high-confidence recommendation and added it to the paper. Single-model review (by any of the three alone) would have surfaced one of the three angles. Three-way debate surfaced the underlying common concern precisely because three partial views of the same thing are more informative than one full view of a fragment.

Where the protocol has not helped

Small decisions. For edits under 50 lines or choices between two equivalent implementations, three-way debate is noise. I use /review (which runs a lighter check: ruff lint, semgrep static analysis, a code-reviewer subagent, and optionally one external model) for everything below a complexity threshold. /debate is reserved for architecture, plans, anything irreversible, and anything touching the printer: maybe one in twenty pull-request-equivalent decisions.
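The routing between the two gates is a crude rule rather than a score. A hypothetical sketch of that rule (the flags and function name are mine):

```python
def choose_gate(is_architecture=False, irreversible=False, touches_printer=False):
    # /debate is reserved for the big stuff; /review (lint, static analysis,
    # a reviewer subagent) handles everything below the complexity threshold.
    if is_architecture or irreversible or touches_printer:
        return "/debate"
    return "/review"
```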

Time-critical decisions. The protocol takes three to eight minutes per round. For decisions under time pressure (a live incident, a print heading toward failure) it is the wrong tool. I use it before the change, not during.

Decisions the three models are all confidently wrong about. This is the failure mode that cannot be eliminated by protocol. If all three models share a training-data gap, the protocol’s unanimous verdict will be confidently wrong. I have hit this twice that I know about. In both cases the gap was empirical: a specific piece of reality that models trained on data from before the reality changed were systematically wrong about. Round-zero unanimous is now a flag, not a conclusion, partly for this reason.

Aesthetic or taste decisions. “Should the tone of this post be more formal?” is a reasonable human question and a bad debate question. Three models’ aesthetic judgments average to mush. The framing diversity that makes the protocol valuable on technical questions works against it on taste questions.

The one confidence trap I did not predict

Over the first few months of running the protocol, I caught myself doing something I did not like: treating unanimous round-zero APPROVE as licence to ship faster than I would have without the protocol. The debate was doing the work of my own due diligence. That is exactly what I built it for, but it cuts the wrong way: when the debate converges early, it reduces my own scrutiny of the decision, which is precisely the moment the collusion failure mode bites hardest.

The fix was to add a rule I now follow: unanimous round-zero APPROVE is not a green light. It is a flag to look again specifically at what the agreement might be masking. For plans involving real-world state, I force a round 1 specifically assigned to “what did all three miss?”. The rule has not been costly. Round 1 costs a few minutes and often surfaces something interesting. The cost of skipping it, on the one occasion I did so (the controlplane rebuild), was not trivial.
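Expressed as a gate, the rule inverts the naive reading of unanimity. A sketch (the function and flag names are mine, not the harness's):

```python
def force_round_one(round_n, verdicts, touches_real_state):
    """Unanimous round-zero APPROVE is a flag, not a green light: for changes
    touching real-world state, force a round 1 asking what the agreement
    might be masking."""
    unanimous_approve = set(verdicts.values()) == {"APPROVE"}
    return round_n == 0 and unanimous_approve and touches_real_state
```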

Would I recommend this

If you are making a lot of agent-system or infrastructure decisions, yes. The protocol is cheap relative to the cost of a single wrong architectural decision.

If you are making one decision a month, probably not. At that volume the setup overhead of the harness outweighs the benefit.

If you already have two humans who will disagree with you in principled ways, you do not need the protocol. Humans are strictly better at this than models are. The protocol is a substitute for not having those humans, which is my situation and probably yours if you are running solo.

If you are making aesthetic or taste decisions, no. The protocol is built for the kinds of questions where there is a right answer (or a set of wrong answers) that three models can independently converge on. Taste is not that kind of question.

The full harness is at [URL when published]. The prompts are there, the verdict-detection regex is there, the rotating-seat logic is there. Copy what is useful. The parts that matter most are: (1) round 0 is blind, so models cannot anchor; (2) every round requires engaging with the previous positions, not just asserting; (3) one seat per round is explicitly operational rather than architectural. Without those three, it is not really debate. It is voting with extra steps.


The debate harness, prompts, and several transcripts of dated debates (redacted for any project-specific detail) will be published alongside this post. The controlplane rebuild debrief is a worked example of how the protocol updates itself in response to its own failures.