Tim Trailor

From model to agent: what changed when I stopped predicting and started investigating

Why the regression models that came out of the hackathon got replaced within weeks by three agentic tools. The short version: probability scores without narrative are not what analysts need.

A regression model tells you which merchants are risky. It does not tell you why. For a fraud analyst, the “why” is the whole job. That is the reason every model I built in the weeks after the hackathon got replaced by agents within a month. Not because the models were wrong; they were fine. Because the outputs were the wrong shape.

What the models did well

I built six of them. Each targeted a specific fraud vector: shell sites, catalogue mismatches, compliance gaps, suspicious geography, transaction-pattern outliers, dispute-rate anomalies. Each was a binary classifier trained on labelled cases, validated on holdout, and reasonably calibrated. Put a merchant in, get a probability out.

The models were good enough. I was running one of them across a live portfolio within three weeks of the hackathon. The Gini coefficient on the best of them was in the mid-nineties on validation. The precision at high thresholds was acceptable for triage. That was real progress.

What the models did badly

They did not help an analyst write a case memo.

Every flagged merchant becomes a file. The file contains the model’s output, the reasons the analyst investigated, the evidence supporting or rebutting the suspicion, and the decision. A compliance auditor later reads the file. The auditor does not accept “the model scored this merchant 0.87” as a reason. They want the narrative: what was suspicious, what was checked, what was found, what the conclusion was grounded in.

The model produced the score. An analyst then had to produce everything else, by hand, for every flagged case. The bottleneck moved one step back. The models had compressed the triage stage and expanded the investigation stage. The system was faster but the analysts were not.

I spent about a week trying to fix this by making the models more explainable (feature importance, counterfactuals, local explanations). It helped slightly. It did not close the gap. The output of a feature-importance explainer is not a case memo. It is a technical debugging tool in the wrong register for the reader.

What the agents did differently

At this point I stopped thinking about models and started thinking about what the analyst actually did with their day. Three workflows emerged that together covered most of the work. Each one became its own tool.

Quick Scan. Thirty seconds, single URL in, structured risk summary out. This is the top of the triage funnel. An analyst is working a queue of flagged merchants and needs a first read before deciding whether to open a full investigation. Quick Scan fetches the website, extracts the analyst-visible signals (the same ones from the hackathon pipeline, now refined), pulls the merchant’s transactional profile, and returns a four-paragraph summary in the register the analyst would write in themselves. It replaced the model probability on the analyst’s screen.
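To make that shape concrete, here is a minimal sketch of the loop in Python. The call_agent wrapper, the extract_signals heuristics, and the QuickScanResult fields are illustrative stand-ins I have invented for this post, not the production toolchain.

```python
from dataclasses import dataclass

import requests


@dataclass
class QuickScanResult:
    url: str
    signals: dict   # analyst-visible website signals
    summary: str    # four-paragraph narrative in the analyst's register


def call_agent(prompt: str) -> str:
    """Placeholder for whatever LLM client the toolchain wraps."""
    raise NotImplementedError


def extract_signals(html: str) -> dict:
    # Illustrative stand-ins for the hackathon-era signals: presence of
    # contact details, policy pages, and so on.
    return {
        "has_contact_page": "contact" in html.lower(),
        "has_terms_page": "terms" in html.lower(),
        "page_length": len(html),
    }


def quick_scan(url: str, transactional_profile: dict) -> QuickScanResult:
    html = requests.get(url, timeout=10).text
    signals = extract_signals(html)
    prompt = (
        "You are drafting a first-read risk summary for a fraud analyst.\n"
        f"Website signals: {signals}\n"
        f"Transactional profile: {transactional_profile}\n"
        "Write four short paragraphs: what the merchant appears to sell, "
        "what looks consistent, what looks suspicious, and whether to open "
        "a full investigation."
    )
    return QuickScanResult(url=url, signals=signals, summary=call_agent(prompt))
```

The point of the structure is that the analyst sees a paragraph they could have written, not a number they have to interpret.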

Deep Dive. Eight phases, for cases that pass triage. Website crawl. Catalogue analysis. Compliance-page review. Contact-information verification. Transactional pattern check. Dispute analysis. Cross-reference against external risk databases. Narrative write-up. Each phase is an agent call with its own prompt, its own tools, its own output schema, feeding into the next. The whole thing takes two to three minutes of wall time, during which it produces something close to what an analyst produces in forty minutes. The output is a draft case memo that an analyst edits rather than writes. Editing is faster than composing, and the analyst is still the author of record. The tool is a scribe, not a decision-maker.
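A sketch of how that chaining could look, with the same caveat: the phase list follows the post, but the Phase structure, the prompt templates, and call_agent are hypothetical placeholders for whatever the real orchestration uses.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    prompt_template: str  # in the real tool each phase also carries its own
                          # tools and an output schema the agent must fill


PHASES = [
    Phase("website_crawl", "Crawl {merchant} and report what the site claims to sell."),
    Phase("catalogue_analysis", "Given {context}, assess whether the catalogue matches the claimed business."),
    Phase("compliance_review", "Given {context}, review the compliance and policy pages."),
    Phase("contact_verification", "Given {context}, verify the contact information."),
    Phase("transaction_patterns", "Given {context}, check the transactional pattern for anomalies."),
    Phase("dispute_analysis", "Given {context}, analyse the dispute history."),
    Phase("external_cross_reference", "Given {context}, cross-reference external risk databases."),
    Phase("narrative_write_up", "Given {context}, draft the case memo in the analyst's register."),
]


def call_agent(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the LLM client the toolchain wraps


def deep_dive(merchant: str) -> str:
    context: dict[str, str] = {}
    for phase in PHASES:
        prompt = phase.prompt_template.format(merchant=merchant, context=context)
        context[phase.name] = call_agent(prompt)  # each phase's output feeds the next
    return context["narrative_write_up"]          # the draft memo the analyst edits
```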

Report Writing. The third tool was not obvious. It emerged from watching analysts prepare for the weekly team meeting, in which the lead would summarise the patterns the team had seen across their casework. Sector trends, geography trends, payment-method trends. That summary was being written by hand every Monday morning. Report Writing takes the week’s closed cases, clusters them, and produces a draft of that summary. Same principle as Deep Dive: the machine drafts, the human edits, and the human is the author.
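One plausible shape for the clustering step, sketched here with scikit-learn over the closed-case memos; the real tool's clustering method and drafting prompt are not published, so treat this as an illustration rather than the implementation.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def call_agent(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the LLM client the toolchain wraps


def weekly_report(closed_case_memos: list[str], n_clusters: int = 5) -> str:
    # Cluster the week's closed cases so the draft talks about patterns
    # (sector, geography, payment method) rather than individual merchants.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(closed_case_memos)
    labels = KMeans(n_clusters=n_clusters).fit_predict(vectors)

    clusters: dict[int, list[str]] = {}
    for memo, label in zip(closed_case_memos, labels):
        clusters.setdefault(int(label), []).append(memo)

    prompt = "Draft the weekly fraud-pattern summary from these case clusters:\n\n" + "\n\n".join(
        f"Cluster {label} ({len(memos)} cases):\n" + "\n".join(memos[:3])
        for label, memos in clusters.items()
    )
    return call_agent(prompt)  # the draft the team lead edits on Monday morning
```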

Why the agents worked where the models did not

One observation and one lesson.

The observation: regulated-industry work is as much about narrative as about decision. The model’s answer is upstream of the work; the narrative is the work. Agents write narratives. Models do not.

The lesson: the bottleneck in an analyst’s day is rarely the decision. It is the artefacts around the decision. The memo. The summary. The escalation. The weekly report. Anything that takes the analyst’s judgement and turns it into a document that other humans can read and act on. Agents are good at drafting the document, provided the judgement is constrained by an explicit schema and the analyst has final sign-off. Remove the drafting time and you multiply the analyst’s throughput without asking them to decide faster.
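A concrete reading of “constrained by an explicit schema”: the drafting agent fills a fixed set of fields, and nothing is final until the analyst signs. The field names below are illustrative, not the production schema.

```python
from dataclasses import dataclass, field


@dataclass
class DraftCaseMemo:
    merchant_id: str
    what_was_suspicious: str                                    # the trigger, in the analyst's register
    what_was_checked: list[str] = field(default_factory=list)   # evidence gathered per phase
    what_was_found: str = ""
    recommendation: str = ""                                    # a draft only; never auto-actioned
    analyst_sign_off: str | None = None                         # the memo is not final until this is set


def finalise(memo: DraftCaseMemo, analyst: str) -> DraftCaseMemo:
    # The analyst edits the draft fields, then signs; the agent never sets this field.
    memo.analyst_sign_off = analyst
    return memo
```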

This is not a claim about agents versus models in general. There are cases where a model is exactly right and an agent is overkill. Real-time transaction scoring, for instance, where you need an answer in milliseconds and a narrative is irrelevant. But for the investigative parts of the work, the ones where the output is a document rather than a number, the shift from model to agent was decisive.

What this became

Internally the toolchain has a name. I am calling it SuperAgent here, for the same reason I anonymised AutoAgent in the previous post: this is live risk infrastructure and the specifics are not for publication.

Analyst adoption was the test. By the time I handed the toolchain over, three things were true. Analysts were using Quick Scan in preference to the model score for triage. Deep Dive was producing draft memos that were being edited and submitted rather than rewritten. Report Writing was shaving two hours off the weekly summary. None of those were asked for explicitly; all three emerged from watching what the tools were actually used for after they shipped.

The bigger insight for me, beyond the fraud work, was that the unit of automation had changed. In the old world I would have scoped a project, written a specification, handed it to engineers, and waited for a build. In the new world the scope, the build, the first users, and the iteration were all one loop, and the loop ran in days. The toolchain I handed over existed because I had been able to sit with an analyst, watch them use the first version, and rewrite it that afternoon. That is not something models alone let you do. It is something agents let you do, because the output the analyst is reacting to looks like the artefact they would otherwise be writing themselves.

That shift, from scoring to drafting, is the smallest change I can name that actually mattered.


SuperAgent is anonymised; the three tools, the adoption pattern, and the shift from regression to agentic workflow are not. If you are building similar tooling for a regulated team, the one piece of advice I would give is to watch what analysts actually do with their Mondays. The Monday-morning artefacts are where the biggest time savings live.