Defining the shape of what we need to know and why.

Showing the contract and where it fits within working processes.
In Report 001 I argued that AI-assisted engineering produces process opacity, and that the missing artefact is a receipt — signed, dated, hash-chained, queryable. I sketched, in a list of bullets, what such a receipt might capture.
This report goes a layer deeper. It shows the actual schema — what a receipt has to contain to do the job, and why each piece is in there. It is not about implementation or data capture; it establishes the attributes necessary to answer questions and withstand scrutiny of Agentic AI Delivery and Co-Development.
That sequence — define first, implementation second, measurement third — is deliberate. A clear contract makes the implementation testable. An unclear contract makes the implementation a moving target. So this report is about the contract.
The diagram above illustrates, at a high level, the conceptual steps involved in delivering a solution. Depending on the user, their requirements, and their capability, AI may be used to varying degrees across the full breadth of this flow.
Start with the schema-naming drift from Report 001. Two products, same architect, same house style, divergent schemas. Three plausible explanations for the drift, indistinguishable from the outside.
Now imagine a receipt for the build that produced the second product. What would it have to contain to tell you which of the three scenarios actually occurred?
You'd need a record of the dialogue — every clarifying question the agent raised and how it was answered. Without that, you cannot tell whether the agent asked. That gives you steps[], an ordered, typed sequence of every action and exchange.
You'd need each step typed by intent, not just by mechanism. A tool call is a mechanism. "Agent asked a clarifying question; human gave a response; agent acknowledged the rule; agent did something different anyway" — that is a sequence of typed events: clarifying_question, response, confirmation, override. Without typing, you can reconstruct what happened but not what kind of conversation it was.
You'd need the halt events called out as first-class. The moment an agent encounters something not in its closed vocabulary and refuses to invent — "this pattern doesn't match what I'm allowed to do; I'm asking rather than guessing" — is the methodology working. That refusal needs to be visible in the receipt, not buried inside step content. So steps[].halt_triggered is its own boolean, and halt_reason is its own typed enum.
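The typed-step idea can be sketched concretely. This is a minimal illustration, not the normative schema; the enum members beyond those named in the text, and the placeholder hashes, are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class StepType(Enum):
    # Steps are typed by intent, not by mechanism.
    CLARIFYING_QUESTION = "clarifying_question"
    RESPONSE = "response"
    CONFIRMATION = "confirmation"
    OVERRIDE = "override"
    HALT = "halt"

class HaltReason(Enum):
    # Closed vocabulary: a halt must name why it fired.
    UNKNOWN_PATTERN = "unknown_pattern"
    AMBIGUOUS_SPEC = "ambiguous_spec"
    SCOPE_EXCEEDED = "scope_exceeded"

@dataclass
class Step:
    step_type: StepType
    content_hash: str                    # hash of the step content, never raw text
    halt_triggered: bool = False         # first-class, not buried in content
    halt_reason: Optional[HaltReason] = None

# The drift dialogue from the narrative, as a typed sequence
# (content hashes are invented placeholders):
steps = [
    Step(StepType.CLARIFYING_QUESTION, "sha256:1a2b"),
    Step(StepType.RESPONSE, "sha256:3c4d"),
    Step(StepType.CONFIRMATION, "sha256:5e6f"),
    Step(StepType.OVERRIDE, "sha256:7a8b"),
]
```

The point of the enums is the closed vocabulary: a step cannot carry an intent, and a halt cannot carry a reason, that the schema does not already name.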
That gets you most of the way to diagnosing the schema-naming drift. But not all the way.
A few more fields are needed before the three scenarios can be distinguished cleanly.
You need to know which agent ran, on whose authority. A receipt without a principal (the accountable human) and an actor (who actually executed) cannot answer the "on whose authority" question that regulated environments will eventually ask. And in any team that uses agents to invoke other agents — Marlow telling Iris to do something, in studio terms — you need a delegation chain, not just a flat actor field. So delegation_path[] captures the chain of authority transfers, each with its own scope and granted-at timestamp.
This field came from a specific incident. The studio's agent operating manual records, in its 2026-04-11 changelog, a violation where one agent directed another to read directly from a database layer it shouldn't have touched. The fact of the violation could be reconstructed afterwards from session logs. But the chain of authority — who told whom they could do what, and on what basis — could not. A flat actor field would not have helped. A delegation chain would have.
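What a delegation chain buys over a flat actor field can be sketched in a few lines. The names, scopes, and helper below are illustrative assumptions, not the studio's real configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Delegation:
    # One authority transfer: who granted what to whom, and when.
    grantor: str
    grantee: str
    scope: str
    granted_at: datetime

# A principal delegates a scoped task to one agent, which delegates
# a narrower scope to another (names invented for the example).
delegation_path = [
    Delegation("principal:j.doe", "agent:marlow",
               "build:schema", datetime(2026, 4, 11, tzinfo=timezone.utc)),
    Delegation("agent:marlow", "agent:iris",
               "read:api_layer", datetime(2026, 4, 11, tzinfo=timezone.utc)),
]

def effective_scopes(path):
    """Walk the chain to see what each link was actually granted."""
    return [d.scope for d in path]
```

A flat actor field would record only the final grantee; the chain records every authority transfer, each with its own scope and granted-at timestamp, which is exactly what the incident's session logs could not reconstruct.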
You need to know what the agent was running on. Model family, model version, sampling parameters, the system prompt the model was operating under. Drift between two builds may be explained by drift between two models — a Sonnet that halts reliably and an Opus that waves things through, or the reverse. Without runtime.model.{family, version} and runtime.system_prompt_hash, this hypothesis is untestable.
You need to know how full the context was when each step happened. There is a hypothesis the methodology cares about: agents drift more when context is loaded. The skills that produce halt-and-ask behaviour live in context; if the model is summarising or compressing context, those skills may be the first to degrade. If the hypothesis is true, halt rate should drop as context_state.tokens_used_pct rises. If it's false, no harm done — the field is cheap to capture. Either way, the only way to find out is to capture it at every step. So context_state appears at receipt level and at step level.
You need inputs classified by sensitivity. PII, PHI, PCI, MNPI — the controlled vocabulary differs by regulatory regime, but every regulated environment cares whether the receipt-producing process touched data of a given class. And you need to record this by hash, not by content; the receipt must not become a secondary data exposure.
You need a spec reference — the planning document, ADR, or specification the build was meant to satisfy — and a diff against spec with severity typed as a closed enum (none, cosmetic, material, breaking). Without this, divergence is invisible. With it, divergence becomes queryable: "show me all builds where the diff was material and the human didn't acknowledge it."
You need a signature — the receipt signed by the principal, not just timestamped — and a chain link hashing the previous receipt. The first makes the principal's accountability cryptographically attached to the artefact. The second makes the chain tamper-evident: alteration of any earlier receipt invalidates the chain forward.
That is the receipt, in narrative form. Most of the rest of the schema — output hashes, signoff state, retention policy references — is plumbing. The fields above are what carry the methodology.

Every field in the receipt has an origin — one of reasoned, evidenced, regulatory, or operational. The origin is itself a typed value, part of the schema definition.
Reasoned fields came from first-principles thinking about what makes an artefact auditable. Evidenced fields came from a specific incident in the studio's own work. Regulatory fields are required by a named regime. Operational fields came from observing the studio's own agent operations and noticing a pattern worth capturing.
The reason this matters: tagging origin makes explicit what we want to track and why, and prioritises collection of the data facets that are genuinely necessary. A reader can ask "how do you know this field is necessary?" and the schema answers — "we reasoned it," or "we observed it," or "the FCA requires it," or "this incident showed us we needed it."
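Origin as a typed value can itself be part of the schema definition. A sketch under assumptions: the FieldDef shape and the rationale strings are illustrative, not the schema's actual metadata format.

```python
from dataclasses import dataclass
from enum import Enum

class Origin(Enum):
    REASONED = "reasoned"
    EVIDENCED = "evidenced"
    REGULATORY = "regulatory"
    OPERATIONAL = "operational"

@dataclass(frozen=True)
class FieldDef:
    name: str
    origin: Origin
    rationale: str  # the answer to "how do you know this field is necessary?"

FIELDS = [
    FieldDef("delegation_path[]", Origin.EVIDENCED,
             "2026-04-11 incident: the chain of authority was not reconstructible"),
    FieldDef("inputs[].classification", Origin.REGULATORY,
             "regulated regimes require sensitivity classes on ingested data"),
    FieldDef("context_state.tokens_used_pct", Origin.OPERATIONAL,
             "observed pattern worth testing: drift may track context load"),
]
```

A field "graduating from reasoned to evidenced" is then a one-word change in the definition, visible in the schema's own history.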
The honest disposition is that v0.1 is more reasoned than evidenced. The first hundred real receipts will tell us where the reasoning held and where it needs revising. Some fields will graduate from reasoned to evidenced. Some will turn out to be unnecessary and get retired. The methodology's evolution is not hidden; it is part of the schema's structure.
Agreeing and understanding what needs to be captured, and why, is as much a part of the methodology as the actual capturing and analysis.
| Field | Description | Origin |
|---|---|---|
| receipt_id | UUID for the receipt | reasoned |
| schema_version | Which version this receipt conforms to | reasoned |
| chain.previous_receipt_hash | Hash of the previous receipt | reasoned |
| chain.sequence_number | Position in the chain | reasoned |
| principal | The accountable human (id, role) | regulatory |
| actor | Who actually executed (id, type) | reasoned |
| delegation_path[] | Chain from principal to actor, with scope | evidenced |
| runtime.agent_name | Agent identity | reasoned |
| runtime.model.{family, version, provider} | Which model ran | reasoned |
| runtime.sampling | Temperature, top_p, max_tokens | reasoned |
| runtime.system_prompt_hash | Hash of the system prompt at session start | reasoned |
| runtime.skills_loaded[] | Skills in context, with version | operational |
| runtime.tools_available[] | Tools accessible at runtime | reasoned |
| context_state.tokens_used_pct | Context utilisation at receipt start | operational |
| context_state.position_in_session | Step N of total | reasoned |
| context_state.compaction_events | Times context was summarised | operational |
| inputs[].source | Where data crossed into the model context | reasoned |
| inputs[].classification | Sensitivity (public, pii, phi, pci, mnpi, etc.) | regulatory |
| inputs[].content_hash | Hash of the input — never the raw data | regulatory |
| inputs[].redacted | Whether redaction was applied before ingestion | regulatory |
| spec_reference | Hash or URI of the spec being built against | reasoned |
| steps[] | Ordered, typed sequence of dialogue and actions | reasoned |
| steps[].step_type | Typed by intent (clarifying_question, response, override, halt, etc.) | reasoned |
| steps[].halt_triggered | Whether this step was a halt-and-ask event | evidenced |
| steps[].halt_reason | Why the halt fired | evidenced |
| steps[].context_state | Per-step snapshot of context state | operational |
| outputs[].artefact_type | What was produced (ddl, code, doc, etc.) | reasoned |
| outputs[].artefact_hash | Hash of the produced artefact | reasoned |
| diff_against_spec.severity | none, cosmetic, material, breaking | reasoned |
| diff_against_spec.human_acknowledged | Did the principal review the diff | regulatory |
| signoff.state | signed, auto_approved, flagged, rejected, pending | regulatory |
| signoff.signature_method | cryptographic, attested, none | reasoned |
| signature | Cryptographic signature of the receipt body | reasoned |
| source_control.system | Version control system (git, mercurial, none, etc.) | reasoned |
| source_control.repository | Canonical URN or URL of the repository | regulatory |
| source_control.commit_hash | Commit at session start | regulatory |
| source_control.branch | Branch the work was done on | reasoned |
| source_control.is_dirty | Were there uncommitted changes at session start | evidenced |
| source_control.commits_during_session[] | Commits made during the session | reasoned |
Most rows speak for themselves. The narrative above covered the ones that warrant explanation.
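To make the table concrete, here is an illustrative receipt instance, trimmed to the fields the narrative discussed. All values are invented; only the field names follow the table.

```python
# A trimmed, illustrative receipt instance. Not normative; every value invented.
receipt = {
    "receipt_id": "7d9e2c1a-5b3f-4e8a-9c0d-1f2a3b4c5d6e",
    "schema_version": "0.1",
    "chain": {"previous_receipt_hash": None, "sequence_number": 0},
    "principal": {"id": "j.doe", "role": "lead_engineer"},
    "actor": {"id": "agent:iris", "type": "agent"},
    "delegation_path": [
        {"grantor": "principal:j.doe", "grantee": "agent:iris",
         "scope": "build:schema", "granted_at": "2026-04-11T09:00:00Z"},
    ],
    "runtime": {
        "agent_name": "iris",
        "model": {"family": "example-family", "version": "example-version",
                  "provider": "example-provider"},
        "system_prompt_hash": "sha256:9f0e",
    },
    "context_state": {"tokens_used_pct": 12,
                      "position_in_session": "1 of 9",
                      "compaction_events": 0},
    "steps": [
        {"step_type": "clarifying_question", "halt_triggered": True,
         "halt_reason": "unknown_pattern"},
        {"step_type": "response", "halt_triggered": False},
    ],
    "diff_against_spec": {"severity": "none", "human_acknowledged": True},
    "signoff": {"state": "signed", "signature_method": "cryptographic"},
}
```

Even trimmed this far, the instance can answer the narrative's questions: who ran, on whose authority, whether a halt fired, and whether the diff was seen.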

One question this schema invites: if tamper-evidence matters, why not blockchain?
Three properties are worth distinguishing: (1) tamper-evidence, meaning any alteration of a recorded receipt is detectable; (2) accountable signature, meaning each receipt is signed by an identifiable party; and (3) decentralised consensus, meaning no single party controls the record.
The receipt schema commits to (1) and (2). It does not commit to (3). The reason is operational: a regulator does not need decentralised consensus to trust evidence; they need the evidence to be signed by an accountable party and demonstrably unaltered. Hash-chained, signed receipts in append-only storage achieve that. Periodic anchoring to a public timestamp source provides external verifiability without putting receipt content on-chain.
A blockchain would provide all three properties at the cost of latency, transaction fees, public exposure of receipt metadata, and operational overhead. The minimum mechanism that achieves the audit purpose is the right one. Where stronger properties are required — multi-party attestation, supply-chain provenance — receipts can compose with existing standards (in-toto, SCITT) designed for exactly those needs, rather than reinventing them.
This is the working principle. Don't reinvent. Compose. Cite. Adapt.
This methodology has been working under a placeholder for a while. The name I keep coming back to is AVE — AI-Verifiable Engineering. Engineering work that can be verified, after the fact, to have followed an agreed practice — with evidence captured at the time of the work rather than reconstructed afterwards.
The name is internally precise. It is not a claim that AVE verifies AI. It is a claim that AI-assisted engineering can be made verifiable — through closed vocabularies, halt conditions, and signed receipts of the dialogue that produced the work.
AVE is a practice. It is Data Argo's methodology, demonstrated through this journal series, supported by implementation for capture and collection, composing with industry-standard infrastructure (OpenTelemetry, in-toto, SCITT) rather than competing with it.
The schema above is AVE v0.1. It is a draft until it has survived contact with at least three real builds across two products. Some fields will graduate from reasoned to evidenced as further testing and utilisation occur. Some will turn out to be unnecessary. The version itself will be revised when the evidence requires it. That is how the methodology is meant to work — slowly, with evidence, with its own basis visible.
The purpose of AVE is to provide a means to verify what was asked of the AI, how it delivered on the ask, and what factors, direct or indirect, could affect the quality of the output. AVE receipts should be collected wherever AI lends assistance. At the definition and deployment stages the degree of AI involvement varies with how human and AI collaborate; from planning through build and validation, AI assistance is near-certain, making those phases the highest-volume sources of receipts.
Report 003 will introduce the capture layer — where and how we will capture the various attributes that make up the receipts. Report 004 will start to look at the output and testing of the receipts, discovering where the schema held, where it didn't, and what changed when AVE entered the workflow.
The contract is now defined. The work continues.

Navigare necesse est
Subscribe to The Approach. Working notes on AI-assisted engineering. Published when there's something worth saying.
If this resonates — particularly if you're working in regulated environments where AI-assisted engineering work needs to be defensible — I'd be interested to hear from you. Detailed schema documentation, including frontmatter not surfaced here, lives in the AVE workspace and will move to its own published location when the contract is stable.
One message, no commitment. We reply personally within two working days.