Why every AI-built data warehouse needs receipts

I have been in the data industry for over 25 years and have built a number of products and solutions in that time. For each one, I pick the design and build methodology that fits the requirements. I have carried those principles with me into my foray into AI-assisted engineering.

I have used AI to assist in building two data products this year. The schemas don't match.

Not in the way you'd expect — same architect, same methodology, same stack, same house style. I'd written the pattern down before I started the first one. Five layers, numbered L0 through L4, a clean progression from administrative metadata at the bottom to consuming applications at the top. I'd even thought ahead and written down what to do if something didn't fit: ask. Don't invent a new layer. Stop and check.

The first product came out clean: L0_Admin, L1_Raw, L2_Vault, L3_Mart, L4_Apps.

The second product, TenderDigest, came out as: L0_Config, L1_Raw, L3_Conformation, L4_Mart, Apps, Telemetry.

Look at that for a second. Admin became Config. Vault disappeared. Conformation appeared at L3. The numbering jumped. Apps lost its prefix. Telemetry showed up uninvited. Every change is locally defensible — Config is arguably more accurate than Admin, Conformation describes what L2 actually does better than Vault does. Each deviation, taken alone, looks like a small improvement.

Globally, it's chaos. The two databases no longer share a vocabulary. Anyone moving between them has to re-learn the layout. Any tooling I build against the pattern has to handle both variants. The whole point of having a house style is that future-me, a colleague, or an agent acting on my behalf can build the third product the same way as the first two. That is gone.
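The drift can at least be made mechanical to detect. What follows is a minimal sketch, not tooling from either product: it compares a database's schema names against the house-style list and reports what is missing and what is unexpected. The function name `schema_drift` and the idea of passing schema names in as plain lists are assumptions made for illustration.

```python
# House-style layer names, taken from the first product described above.
HOUSE_STYLE = ["L0_Admin", "L1_Raw", "L2_Vault", "L3_Mart", "L4_Apps"]


def schema_drift(expected, actual):
    """Return the schemas missing from, and unexpected in, the actual database."""
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return {"missing": missing, "unexpected": unexpected}


# Schema names as they came out in the second product.
tender_digest = ["L0_Config", "L1_Raw", "L3_Conformation", "L4_Mart", "Apps", "Telemetry"]

report = schema_drift(HOUSE_STYLE, tender_digest)
print(report["missing"])     # ['L0_Admin', 'L2_Vault', 'L3_Mart', 'L4_Apps']
print(report["unexpected"])  # ['L0_Config', 'L3_Conformation', 'L4_Mart', 'Apps', 'Telemetry']
```

A check like this shows that drift happened. It says nothing about how, which is the harder question.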

And here's the part that made me pause when I noticed it: I don't know how it happened.

Three scenarios

I built TenderDigest with an AI agent, with the deliberate intent of reusing skills and work done on the first data product. Most of the actual DDL came out of a Claude Code session. So when I went to look at the database and saw the deviation, three explanations were equally plausible:

1. The agent asked me whether L3 should be Conformation instead of Vault. I said yes, possibly while distracted, possibly because in that moment Conformation did sound better. The drift is my decision, made in a conversation I don't remember.
2. The agent never asked. It reasoned its way to the new naming silently: Conformation is technically a more accurate description, so it picked that and moved on. The drift is the agent's decision, made unilaterally, with no record.
3. The agent asked, I said "stick to the standard," and it did something different anyway. The drift is a compliance failure: the agent acknowledged the rule and then ignored it.

All three scenarios produce the same drift from the intended database, but they have completely different remedies. Scenario 1 is a discipline problem and the fix is mine: pay attention when an agent asks a clarifying question. Scenario 2 is a skill design problem and the fix is in how I write the rules: the halt condition wasn't strong enough. Scenario 3 is a model reliability problem and the fix lives somewhere I can't reach: the agent didn't honour an explicit instruction.

I cannot tell which one happened. So I cannot fix the right thing.

This isn't from lack of operational discipline. I run a structured agent team with named specialists, a defined escalation matrix, greenlit zones with hard "ask first" actions, a critic agent that reviews work daily, and a cross-zone comms protocol. The drift happened anyway. The receipt gap exists despite all of that — which is precisely why the gap matters. Discipline reduces the rate of drift; it doesn't make it diagnosable after the fact.

The thing that's missing

This is the bit that, once you see it, changes how you think about AI-assisted engineering generally.

The output is fine, mostly. The agent built a working warehouse. The DDL is clean. The data flows. If I were grading the deliverable in isolation, it'd pass. The problem isn't the artefact — it's that the process that produced it is opaque. I have a database, but I don't have a record of the conversation that built it.

In any other regulated discipline this would be unthinkable. A pharmacist doesn't just produce a drug; they produce a paper trail showing what was prescribed, what was dispensed, who checked it, and what was queried along the way. A surveyor doesn't just produce a building; they produce signed evidence of every decision made on site. Financial services calls this accountability, and under SMCR it's not optional — named individuals carry personal responsibility for outcomes, which means they need evidence of how those outcomes were produced. The same logic applies in any regulated discipline where AI-assisted work touches outcomes that matter — healthcare, legal, pharma, critical infrastructure. The vocabulary differs; the underlying need does not.

On SMCR

The Senior Managers and Certification Regime (SMCR, often written SM&CR) is a UK regulatory framework, operated by the Financial Conduct Authority and the Prudential Regulation Authority, that places personal accountability on named senior managers for conduct and compliance failures within their areas of responsibility. In practice, "the AI did it" is not a defence.

AI-assisted engineering, right now, has none of this. We have prompts that vanish into history, conversations that aren't logged, agent reasoning that isn't captured, and final artefacts whose origin is essentially folkloric. "Claude built it." That's not a chain of custody. That's a shrug.

What a receipt would contain

I've been sketching what the missing artefact actually looks like. I'm calling it a receipt, because the word does the work — it's evidence of a transaction, signed, dated, itemised, and reviewable later. Not a log (passive, decays into noise) but a deliverable in its own right (active, attached to a specific build, designed to be examined).

A useful receipt would capture:

Who invoked the work, and on whose authority. Whether the deployment is solo or a team, this is a delegation chain.
What ran, including agent name and version, model family and version, sampling parameters, and which skills were loaded into context.
What was asked, including the spec or planning document the build was meant to satisfy.
The dialogue, including every clarifying question the agent raised and how it was answered. This is the load-bearing piece.
What was produced, with hashes of the actual artefacts.
The diff between spec and output, so deviation is visible rather than buried.
Sequence and causality, so a later reader can replay what depended on what.
A signature, anchoring the receipt to a specific accountable human.
A chain link, hashing the previous receipt so tampering is detectable.
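To make the list concrete, here is a minimal sketch of a receipt as a typed record with a hash chain link. It is not the schema the next post will describe: every field name is hypothetical, several listed items (the spec diff, sequence and causality) are left out for brevity, and `signed_by` is a plain string standing in for a real cryptographic signature.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class Receipt:
    invoked_by: str        # who invoked the work, and on whose authority
    agent: str             # agent name/version, model, loaded skills (flattened here)
    spec: str              # what was asked
    dialogue: list         # clarifying questions and how they were answered
    artefact_hashes: list  # hashes of what was produced
    signed_by: str         # accountable human; placeholder, not a real signature
    prev_hash: str = ""    # chain link: digest of the previous receipt

    def digest(self) -> str:
        """SHA-256 over a canonical JSON form of every field."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def append_receipt(chain, receipt):
    """Link the receipt to the current head of the chain, then append it."""
    receipt.prev_hash = chain[-1].digest() if chain else ""
    chain.append(receipt)


def verify_chain(chain):
    """Check that every receipt still points at the digest of its predecessor."""
    for i, receipt in enumerate(chain):
        expected = chain[i - 1].digest() if i > 0 else ""
        if receipt.prev_hash != expected:
            return False
    return True


chain = []
append_receipt(chain, Receipt("tony", "builder v1", "spec-001", [], ["abc123"], "tony"))
append_receipt(chain, Receipt("tony", "builder v1", "spec-002", [], ["def456"], "tony"))
print(verify_chain(chain))  # True

chain[0].spec = "edited after the fact"
print(verify_chain(chain))  # False
```

One limitation of this sketch: the chain only protects receipts that have a successor. The digest of the most recent receipt would need to be recorded somewhere else for an edit to it to be detectable.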

That schema is enough to distinguish my three scenarios. With it, I can answer: did the agent ask? what did I say? did it comply? Without it, I'm guessing forever.
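As a rough illustration of how the dialogue record separates the three scenarios, the sketch below classifies a known deviation from two facts a receipt would hold: whether the agent asked, and whether the human approved the change. The `Deviation` type and its field names are hypothetical, and the function assumes the output has already been shown to differ from the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deviation:
    name: str                  # what drifted, e.g. a schema name
    asked: bool                # did the agent raise a clarifying question?
    approved: Optional[bool]   # did the human approve the change? None if never asked


def classify(deviation: Deviation) -> str:
    """Map a recorded deviation onto one of the three scenarios."""
    if not deviation.asked:
        return "scenario 2: agent decided silently (skill design problem)"
    if deviation.approved:
        return "scenario 1: human approved it (discipline problem)"
    return "scenario 3: agent asked, was told no, and deviated anyway (model reliability problem)"


print(classify(Deviation("L3_Conformation", asked=True, approved=True)))
print(classify(Deviation("L3_Conformation", asked=False, approved=None)))
print(classify(Deviation("L3_Conformation", asked=True, approved=False)))
```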

The schema also needs to be machine-parseable, not just human-readable: every field typed, every category drawn from a controlled vocabulary, every relationship queryable. A receipt you can't analyse across a portfolio of builds is just a long-form log. The interesting questions live at the corpus level: across the last hundred builds, which agents had the highest halt rate? Which skills correlate with drift? Does compliance failure increase as session context fills up? None of that is answerable from prose. I'll cover the schema in detail in the next post.
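Two of those corpus-level questions can be sketched with nothing more than dictionaries, assuming each receipt exposes an agent name, the skills loaded, a count of halts, and a drift flag. The field names and sample data below are invented for illustration, and "halt rate" is taken here to mean the share of builds with at least one halt, which is only one possible definition.

```python
from collections import defaultdict


def halt_rate_by_agent(receipts):
    """Share of each agent's builds that contained at least one halt."""
    builds = defaultdict(int)
    halted = defaultdict(int)
    for receipt in receipts:
        builds[receipt["agent"]] += 1
        if receipt["halts"] > 0:
            halted[receipt["agent"]] += 1
    return {agent: halted[agent] / builds[agent] for agent in builds}


def drift_rate_by_skill(receipts):
    """Share of builds that drifted, for each skill that was loaded."""
    builds = defaultdict(int)
    drifted = defaultdict(int)
    for receipt in receipts:
        for skill in receipt["skills"]:
            builds[skill] += 1
            if receipt["drifted"]:
                drifted[skill] += 1
    return {skill: drifted[skill] / builds[skill] for skill in builds}


# Invented sample corpus, not real build data.
receipts = [
    {"agent": "modeller", "skills": ["naming"], "halts": 2, "drifted": False},
    {"agent": "modeller", "skills": ["naming", "vault"], "halts": 0, "drifted": True},
    {"agent": "loader", "skills": ["vault"], "halts": 0, "drifted": False},
]

print(halt_rate_by_agent(receipts))   # {'modeller': 0.5, 'loader': 0.0}
print(drift_rate_by_skill(receipts))  # {'naming': 0.5, 'vault': 0.5}
```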

The deeper claim

The schema-naming drift is a small example. The pattern generalises.

Every house style is, in the end, a closed vocabulary — a finite set of patterns the team has agreed to use. Data Vault has hubs, links, and satellites. Kimball has facts, dimensions, and a small number of SCD types. Naming conventions are closed alphabets. Every regulated discipline has a controlled vocabulary, and the discipline of working within it is what makes outcomes predictable.

AI agents, by default, do not respect closed vocabularies. They reason from first principles every time, which means they generate plausible novelty whenever the rules are even slightly ambiguous. That's useful behaviour for greenfield work. It's catastrophic for house style.

The fix is two things, working together:

Skills that encode the vocabulary explicitly, including not just the rules but the rationale, so an agent can't quietly improve them.
A halt condition for when something doesn't fit — the agent must stop and ask, not invent.
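A halt condition can also be enforced outside the agent, in the tooling that applies its output. The sketch below is one possible shape, assuming the closed vocabulary is the five layer names from the first product. `HaltAndAsk` and `resolve_layer` are hypothetical names, and a guard like this does not make the agent itself ask; it only stops an invented name from being applied quietly.

```python
class HaltAndAsk(Exception):
    """Raised when a proposed name falls outside the closed vocabulary."""


# The closed vocabulary: layer names from the first product.
ALLOWED_LAYERS = {"L0_Admin", "L1_Raw", "L2_Vault", "L3_Mart", "L4_Apps"}


def resolve_layer(proposed: str) -> str:
    """Return the name if it is in the vocabulary; otherwise stop and ask."""
    if proposed in ALLOWED_LAYERS:
        return proposed
    raise HaltAndAsk(f"'{proposed}' is not in the house style; ask before creating it")


for name in ["L2_Vault", "L3_Conformation"]:
    try:
        print("ok:", resolve_layer(name))
    except HaltAndAsk as halt:
        # In a real build this is where the question would go to a human
        # and the exchange would be recorded.
        print("halt:", halt)
```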

And then, because halt conditions are only meaningful if you can verify they fired:

Receipts that capture whether the halt happened. A build that produced complex output with zero clarifying questions is suspicious. A build with three logged halts and clean output is trustworthy. The receipt makes that visible.

That last point reframes everything. Most "AI governance" tooling on the market right now is logging dressed up — passive capture of what an agent did. Receipts are different. A receipt is a claim: this build was produced under these conditions, with this dialogue, by this accountable party. The claim is signed. The claim is checkable. The claim is the deliverable.

What I'm doing about it

I'm building a few data products this year, TenderDigest among them. I'm going to document the build, with receipts, and use the series to work out the methodology in the open. The schema-naming drift is the first concrete example; the next pieces will go deeper into how skills, planning sessions, and receipts compose into something a financial services firm could actually defend in front of a regulator.

I don't have a clean name for the methodology yet. I've tried a few; none of them fit perfectly. I'm going to let the practice settle before I name it, because names should be earned rather than declared.

But the tool that produces the receipts has a name, because tools need names to be referred to, and I've been calling it Forge.

More soon.

If this resonates — particularly if you're working in regulated environments where AI-assisted work needs to be defensible (financial services, healthcare, legal, pharma) — I'd be interested to hear from you. Find me on LinkedIn. The next post in this series will go into the receipt schema in more detail, including why hash-chained signed receipts are the right primitive and why blockchain almost certainly isn't.

Tony Purkins · Principal · Data Argo · 26 April 2026

Navigare necesse est
