What I found when I went to check what an AI agent had actually built.

I have been in the data industry for over 25 years, and I have built a number of products and solutions in that time. Depending on the requirements, I pick the appropriate methodology for designing and building each one. I have carried those principles into my foray into AI-assisted engineering.
I have used AI to assist in building two data products this year. The schemas don't match.
Not in the way you'd expect — same architect, same methodology, same stack, same house style. I'd written the pattern down before I started the first one. Five layers, numbered L0 through L4, a clean progression from administrative metadata at the bottom to consuming applications at the top. I'd even thought ahead and written down what to do if something didn't fit: ask. Don't invent a new layer. Stop and check.
The first product came out clean: L0_Admin, L1_Raw, L2_Vault, L3_Mart, L4_Apps.
The second product, TenderDigest, came out as: L0_Config, L1_Raw, L3_Conformation, L4_Mart, Apps, Telemetry.
Admin became Config. Vault disappeared. Conformation appeared at L3. The numbering jumped. Apps lost its prefix. Telemetry showed up uninvited. Every change is locally defensible — Config is arguably more accurate than Admin, Conformation describes what L2 actually does better than Vault does. Each deviation, taken alone, looks like a small improvement.
Globally, it's chaos. The two databases no longer share a vocabulary. Anyone moving between them has to re-learn the layout. Any tooling I build against the pattern has to handle both variants. The whole point of having a house style — that future-me, a colleague, or an agent acting on my behalf, can build the third product the same way as the first two — is gone.
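The drift between the two products can be checked mechanically. The snippet below is a minimal sketch, not the tooling used on either build: it treats the five house-style schema names as the expected set and reports what TenderDigest is missing and what it added. The function name `schema_drift` and the report keys are illustrative assumptions.

```python
# House style as written down before the first build (names from the post).
HOUSE_STYLE = ["L0_Admin", "L1_Raw", "L2_Vault", "L3_Mart", "L4_Apps"]

# Schemas actually found in the second product.
TENDER_DIGEST = ["L0_Config", "L1_Raw", "L3_Conformation", "L4_Mart", "Apps", "Telemetry"]


def schema_drift(actual, expected=HOUSE_STYLE):
    """Return schema names missing from, or unexpected in, `actual`.

    Hypothetical helper: a plain set comparison that preserves input order.
    """
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing": [name for name in expected if name not in actual_set],
        "unexpected": [name for name in actual if name not in expected_set],
    }


report = schema_drift(TENDER_DIGEST)
print("missing:   ", report["missing"])
print("unexpected:", report["unexpected"])
```

A check like this shows that drift exists. It says nothing about how the drift got there, which is the harder question.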
And here's the part that made me pause when I noticed it: I don't know how it happened.
I built TenderDigest with an AI agent, with the deliberate intent of reusing skills and work done on the first data product. Most of the actual DDL came out of a Claude Code session. So when I went to look at the database and saw the deviation, three explanations were equally plausible:
Conformation instead of Vault. I said yes, possibly while distracted, possibly because in that moment Conformation did sound better. The drift is my decision, made in a conversation I don't remember.
Each of these three scenarios could introduce drift from the intended database, and they have completely different remedies. Scenario 1 is a discipline problem and the fix is mine: pay attention when an agent asks a clarifying question. Scenario 2 is a skill design problem and the fix is in how I write the rules: the halt condition wasn't strong enough. Scenario 3 is a model reliability problem and the fix lives somewhere I can't reach: the agent didn't honour an explicit instruction.
I cannot tell which one happened. So I cannot fix the right thing.
This isn't from lack of operational discipline. I run a structured agent team with named specialists, a defined escalation matrix, greenlit zones with hard "ask first" actions, an adversarial agent that reviews work daily, and a cross-zone comms protocol. The drift happened anyway. The receipt gap exists despite all of that — which is precisely why the gap matters. Discipline reduces the rate of drift; it doesn't make it diagnosable after the fact.
This is the bit that, once you see it, changes how you think about AI-assisted engineering generally.
The output is fine, mostly. The agent built a working warehouse. The DDL is clean. The data flows. If I were grading the deliverable in isolation, it'd pass. The problem isn't the artefact — it's that the process that produced it is opaque. I have a database, but I don't have a record of the conversation that built it.
In any regulated discipline this would be unthinkable. A pharmacist doesn't just produce a drug; they produce a paper trail showing what was prescribed, what was dispensed and when, who checked it, and what was queried along the way. A surveyor doesn't just produce a building; they produce signed evidence of every decision made on site. Financial services calls this accountability, and under SMCR it's not optional: named individuals carry personal responsibility for outcomes, which means they need evidence of how those outcomes were produced. The same logic applies wherever AI-assisted work touches outcomes that matter: healthcare, legal, pharma, critical infrastructure. The vocabulary differs; the underlying need does not.
The Senior Managers and Certification Regime (SMCR) is a UK Financial Conduct Authority framework that places personal accountability on named senior managers for conduct and compliance failures within their area of responsibility. It means "the AI did it" is not a defence.
AI-assisted engineering, right now, has none of this. We have prompts that vanish into history, conversations that aren't logged, agent reasoning that isn't captured, and final artefacts whose origin is essentially folkloric. "Claude built it." That's not a chain of custody. That's a shrug.
I've been sketching what the missing artefact actually looks like. I'm calling it a receipt, because the word does the work: it's evidence of a transaction, signed, dated, itemised, and reviewable later. Not just a log (passive, decays into noise) but a deliverable in its own right (active, attached to a specific build, designed to be examined, and very much part of the development lifecycle whenever agent-aided development is involved).
A useful receipt would capture:
That schema is enough to distinguish my three scenarios. With it, I can answer: did the agent ask? what did I say? did it comply? Without it, I'm guessing forever.
The schema also needs to be machine-parseable, not just human-readable — every field typed, every category drawn from a controlled vocabulary, every relationship queryable. A receipt you can't analyse across a portfolio of builds is just a long-form log. The interesting questions live at the corpus level: across the last hundred builds, which agents had the highest halt rate? Which skills correlate with drift? Does compliance failure increase as session context fills up? None of that is answerable from prose. I'll cover the schema in detail in future posts.
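To make "typed, controlled vocabulary, queryable" concrete, here is a minimal sketch of a receipt record. It is not the schema promised for later posts. Every field name, enum value, and sample record is a hypothetical stand-in chosen so that the three scenarios can be told apart and one corpus-level question (halt rate per agent) can be answered.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HumanAnswer(Enum):          # controlled vocabulary, not free text
    NOT_ASKED = "not_asked"
    APPROVED = "approved"
    REJECTED = "rejected"


class Cause(Enum):
    HUMAN_APPROVED = "scenario_1"      # discipline problem
    HALT_NEVER_FIRED = "scenario_2"    # skill design problem
    AGENT_IGNORED_RULE = "scenario_3"  # model reliability problem


@dataclass(frozen=True)
class Receipt:
    build: str
    agent: str
    halt_fired: bool              # did the "ask, don't invent" rule trigger?
    agent_asked: bool             # did the agent put a question to the human?
    human_answer: HumanAnswer
    deviated: bool                # did the output depart from house style?


def diagnose(r: Receipt) -> Optional[Cause]:
    """Map one receipt to the scenario that explains its deviation, if any."""
    if not r.deviated:
        return None
    if r.agent_asked and r.human_answer is HumanAnswer.APPROVED:
        return Cause.HUMAN_APPROVED
    if not r.halt_fired:
        return Cause.HALT_NEVER_FIRED
    return Cause.AGENT_IGNORED_RULE


def halt_rate_by_agent(receipts):
    """Corpus-level query: fraction of builds per agent where the halt fired."""
    total, halted = Counter(), Counter()
    for r in receipts:
        total[r.agent] += 1
        halted[r.agent] += int(r.halt_fired)
    return {agent: halted[agent] / total[agent] for agent in total}


# Made-up sample records, for illustration only.
receipts = [
    Receipt("build-1", "agent_a", True, True, HumanAnswer.APPROVED, True),
    Receipt("build-2", "agent_a", False, False, HumanAnswer.NOT_ASKED, True),
    Receipt("build-3", "agent_b", True, False, HumanAnswer.NOT_ASKED, True),
    Receipt("build-4", "agent_b", True, True, HumanAnswer.REJECTED, False),
]
causes = [diagnose(r) for r in receipts]
rates = halt_rate_by_agent(receipts)
print(causes)
print(rates)
```

Because every field is typed and every category is an enum member, the same records that diagnose a single build can be aggregated across a portfolio without parsing prose.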
To rely on agent-aided development at scale, we have to be able to inspect the process. Teams need to validate that models, skills, and hooks behave consistently across people and across runs. Engineering leaders need visibility into cost, performance, and drift, not as nice-to-have telemetry but as the operating data of a discipline that is about to look very different from the one we have today. The practices we use to manage software engineering (code review, test coverage, performance budgets, cost attribution) were built for human-authored work. They need their agentic equivalents, and receipts are the substrate those equivalents will sit on.

The schema-naming drift is a small example. The pattern generalises.
Every house style is, in the end, a closed vocabulary — a finite set of patterns and frameworks the team has agreed to use. Data Vault has hubs, links, and satellites. Kimball has facts, dimensions, and a small number of slowly-changing-dimension types. Naming conventions are closed alphabets. Every regulated discipline has a controlled vocabulary, and the discipline of working within it is what makes outcomes predictable.
AI agents, by default, do not respect closed vocabularies. They reason from first principles every time, which means they generate plausible novelty whenever the rules are even slightly ambiguous. That's useful behaviour for greenfield work. It can be catastrophic for house style.
The fix is two things, working together:
A halt condition only works if you can later confirm it fired. So the methodology needs a third element:
Even with well-written skills, agents can drift. What looks like clear instruction often contains accidental ambiguity — wiggle room the agent will quietly fill. Without a way to capture when skills fired and what happened when they did, drift only becomes visible after the cleanup is already needed.
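A closed vocabulary with a halt condition that leaves evidence of firing could look something like the sketch below. It assumes the check runs in tooling around the agent (a hook or a pre-apply step), which the post does not specify. `VocabularyGuard`, `HaltRequired`, and the event fields are hypothetical names.

```python
from dataclasses import dataclass, field


class HaltRequired(Exception):
    """Raised when a proposed name falls outside the closed vocabulary."""


@dataclass
class VocabularyGuard:
    allowed: frozenset
    events: list = field(default_factory=list)  # evidence that the check ran

    def check(self, proposed: str) -> str:
        ok = proposed in self.allowed
        # Record every check, pass or fail, so a later reader can confirm
        # whether the halt condition fired.
        self.events.append({"proposed": proposed, "halted": not ok})
        if not ok:
            raise HaltRequired(
                f"{proposed!r} is not in the house style; ask before proceeding"
            )
        return proposed


guard = VocabularyGuard(
    frozenset({"L0_Admin", "L1_Raw", "L2_Vault", "L3_Mart", "L4_Apps"})
)

guard.check("L2_Vault")              # in vocabulary: passes silently
try:
    guard.check("L3_Conformation")   # plausible novelty: must halt and ask
except HaltRequired as exc:
    print("halt:", exc)

print(guard.events)
```

The rule and the record are written in the same step. A halt that raises but leaves no trace would still leave the three scenarios indistinguishable afterwards.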
That last point reframes everything. Most "AI governance" tooling on the market right now is logging dressed up — passive capture of what an agent did. Receipts are different. A receipt is a claim: this build was produced under these conditions, with this dialogue, by this accountable party. The claim is signed. The claim is checkable. The claim is the deliverable.
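One way a claim can be made signed and checkable is to chain receipts by hash and attach a keyed signature to each. The following is a minimal sketch under loud assumptions: it uses HMAC with a shared secret from Python's standard library as a stand-in for a real signature scheme, and the entry layout (`prev`, `payload`, `hash`, `sig`) and sample payloads are invented for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first receipt


def _body(prev: str, payload: dict) -> bytes:
    # Canonical bytes: sorted keys so the same content always hashes the same.
    return json.dumps({"prev": prev, "payload": payload}, sort_keys=True).encode()


def append_receipt(chain: list, payload: dict, key: bytes) -> dict:
    prev = chain[-1]["hash"] if chain else GENESIS
    body = _body(prev, payload)
    entry = {
        "prev": prev,
        "payload": payload,
        "hash": hashlib.sha256(body).hexdigest(),
        "sig": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }
    chain.append(entry)
    return entry


def verify(chain: list, key: bytes) -> bool:
    """Recompute every hash and signature and check each link to its predecessor."""
    prev = GENESIS
    for entry in chain:
        body = _body(entry["prev"], entry["payload"])
        expected_sig = hmac.new(key, body, hashlib.sha256).hexdigest()
        if entry["prev"] != prev:
            return False
        if entry["hash"] != hashlib.sha256(body).hexdigest():
            return False
        if not hmac.compare_digest(entry["sig"], expected_sig):
            return False
        prev = entry["hash"]
    return True


key = b"demo-key-not-for-production"
chain = []
append_receipt(chain, {"build": "TenderDigest", "schema": "L2_Vault"}, key)
append_receipt(chain, {"build": "TenderDigest", "schema": "L3_Mart"}, key)

ok_before = verify(chain, key)
chain[0]["payload"]["schema"] = "L3_Conformation"  # edit history after the fact
ok_after = verify(chain, key)
print(ok_before, ok_after)
```

Editing any earlier receipt changes its hash, which breaks verification for that entry and every link after it. A production design would likely use asymmetric signatures so that verifiers do not hold the signing key.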
I'm building a few data products this year — TenderDigest among them. I'm going to document the build, with receipts, and use the series to work out the methodology in the open. The schema-naming drift is the first concrete example; the next pieces will go deeper into how skills, planning sessions, and receipts compose into something a financial services firm could actually defend in front of a regulator.
I haven't named the methodology yet. I've been calling it AI-Verifiable Engineering in working notes — a term I've coined to describe what this is trying to be — but I'm going to let the practice settle before I commit to a name publicly. Names should be earned rather than declared.
More soon.
If this resonates — particularly if you're working in regulated environments where AI-assisted work needs to be defensible (financial services, healthcare, legal, pharma) — I'd be interested to hear from you. Find me on LinkedIn. The next post in this series will go into the receipt schema in more detail, including why hash-chained signed receipts are the right primitive and why blockchain almost certainly isn't.

Navigare necesse est
Subscribe to The Approach. Working notes on AI-assisted engineering. Published when there's something worth saying.