A human researcher with weak provenance discipline produces poor research. An agentic LLM system with weak provenance discipline produces confident hallucinations at scale.

The asymmetry matters more than it sounds. A poor human researcher's output is bounded by their throughput — they can only manufacture so many unsubstantiated claims per day. An agent has no such throttle. A pipeline that produces 200 dossiers per night will produce 200 dossiers per night regardless of whether the underlying provenance discipline is intact. If it is not, the result is not 200 dossiers with a few errors. It is 200 dossiers in which an unknown subset of claims are systematically untraceable, distributed evenly enough to defeat spot-checking.

We have been delivering provenance-first research deliverables to clients for years. The framework — A through E source tiering, confidence ratings, the open-questions register — was designed for human researchers. When we started building agent systems, the first instinct was to relax it: "agents work differently, the framework is overkill." We tried this. It is not overkill. It is the missing structural constraint that prevents the most expensive failure mode of LLM systems.

What "provenance for agents" means in practice

It does not mean asking the model to "cite its sources." That instruction is well-known to produce fabricated citations more often than absent ones — the model has learned that citations are part of the output format, and so it generates them whether or not they correspond to retrieved evidence.

It means restructuring the agent's output unit. The agent should not produce a finding. It should produce an evidence row: claim, source identifier, source tier, confidence rating, snapshot timestamp. The dossier is then assembled from accepted evidence rows by a deterministic step. An evidence row whose source identifier does not resolve to a verifiable artifact is rejected at the orchestration layer before it can enter the dossier. A claim cannot exist in the system without a source — because the data structure does not permit it.

This is a structural constraint, not a prompt instruction. The difference is the difference between asking an honest person not to lie and designing a system where lying is not a representable state.

The eval implication

Standard LLM evaluations measure whether the model's output is correct. A provenance-aware eval measures three orthogonal things, all of them required:

  • Factuality — does the claim match what the cited source actually says?
  • Attribution accuracy — does the cited source identifier resolve to a real, retrievable document?
  • Confidence calibration — when the system rates a claim "high confidence," does it hold up against tier-A re-verification?

The third metric is the one most teams skip, because it is the most expensive to maintain — it requires a continuously refreshed ground-truth set. It is also the one that catches the failure mode that matters most: a system that produces confident-looking output that does not survive scrutiny.

Why the human framework transferred

The reason the source-tier framework worked for human researchers is the same reason it works for agents: it externalises a discipline that is otherwise easy to forget under pressure. Researchers under deadline pressure cut corners on attribution. Agents under throughput pressure invent sources. The framework is a forcing function in both cases.

What changes is the cost of failure. A human cutting corners produces a deliverable a reviewer can interrogate. An agent cutting corners produces a deliverable that looks reviewable, with citations that look valid, in a volume that defeats human review. The discipline does not become optional for agents. It becomes structurally non-negotiable.