Methodology context — this case study uses Zhianrui's provenance-first methodology (workstream decomposition, source-tier evidence, confidence rubric, open-questions register). If you're new to the framework, start there.

LLM agent flow with provenance gate

The brief

Sector: Regulatory / cross-cutting (industrial cybersecurity). Need: an agentic LLM system that produces structured due-diligence dossiers on supplier-side cybersecurity posture, applied across hundreds of suppliers in a constrained timeframe. The client's analysts could produce the same dossier manually for one supplier in ~12 hours; doing it across the supplier base by hand was infeasible. They wanted parity of output structure with their internal manual process — including the source-tier evidence framework and the open-questions register.

Workstream decomposition

  • WS-1: Methodology encoding. Translate Zhianrui's provenance-first delivery framework — workstream plan, A–E source tiers, confidence rubric, open-questions register — into a structured agent specification with explicit roles and hand-off contracts.
  • WS-2: Agent design. Build five specialised agents — Sourcing, Tier classifier, Claim extractor, Contradiction detector, Confidence scorer — connected through an orchestration layer that enforces evidence row completeness before any claim enters the dossier.
  • WS-3: Eval-first delivery. Build the evaluation harness before the agent system. 240 hand-graded due-diligence dossiers as ground truth; multi-judge scoring across factuality, source attribution accuracy, and confidence calibration. Deploy gate blocks any release that regresses on the calibration metric.
  • WS-4: Human-in-the-loop layer. A reviewer interface where flagged contradictions and unverified claims escalate to a Zhianrui analyst before the dossier is finalised.
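The WS-3 deploy gate can be illustrated with a minimal sketch. It assumes each judge emits one score per metric in the range 0 to 1 and that judges are combined by taking the median; the function names, the metric keys, and the `tolerance` parameter are hypothetical and do not come from the delivered system.

```python
from statistics import median

# Hypothetical metric keys mirroring the three axes named in WS-3.
METRICS = ("factuality", "attribution", "calibration")


def aggregate_judges(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    """Combine per-judge scores into one score per metric (assumed: median)."""
    return {m: median(js[m] for js in judge_scores) for m in METRICS}


def release_allowed(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.0) -> bool:
    """Block any release whose calibration score regresses below the baseline.

    `tolerance` is an assumed allowance for judge noise; 0.0 means no
    regression at all is accepted.
    """
    return candidate["calibration"] >= baseline["calibration"] - tolerance


if __name__ == "__main__":
    judges = [
        {"factuality": 0.93, "attribution": 0.96, "calibration": 0.90},
        {"factuality": 0.91, "attribution": 0.95, "calibration": 0.92},
        {"factuality": 0.94, "attribution": 0.97, "calibration": 0.88},
    ]
    candidate = aggregate_judges(judges)
    baseline = {"factuality": 0.92, "attribution": 0.95, "calibration": 0.91}
    print(candidate, release_allowed(baseline, candidate))
```

Only the calibration metric gates the release in this sketch, matching the WS-3 description; the other two metrics are aggregated so they can be reported alongside it.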

Method highlight

The decisive design move: agents do not produce findings directly. They produce evidence rows — claim plus source plus tier plus confidence plus snapshot timestamp. The dossier is assembled from accepted evidence rows by a deterministic step, not by a generation step. This collapses the most common LLM failure mode (confident hallucination) into a structural constraint: an evidence row without a verifiable source identifier is rejected at the orchestration layer before it can enter the dossier.
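The evidence-row gate and the deterministic assembly step might look roughly like the sketch below. The field names follow the five parts of an evidence row listed above; everything else is an assumption made for illustration: the `EvidenceRow` class, the `gate` and `assemble_dossier` functions, the confidence labels, and the idea that "verifiable" means the source identifier appears in a registry of already-fetched documents.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional

VALID_TIERS = {"A", "B", "C", "D", "E"}  # the A–E source tiers


@dataclass(frozen=True)
class EvidenceRow:
    claim: str
    source_id: Optional[str]   # None when the agent could not cite a source
    tier: str
    confidence: str            # hypothetical labels, e.g. "high" / "medium" / "low"
    snapshot_at: datetime


def gate(rows: Iterable[EvidenceRow], known_sources: set[str]):
    """Split rows into accepted and rejected.

    Assumption: a source identifier counts as verifiable when it is present
    in `known_sources`, a registry of documents the system actually fetched.
    """
    accepted, rejected = [], []
    for row in rows:
        ok = bool(
            row.claim.strip()
            and row.source_id is not None
            and row.source_id in known_sources
            and row.tier in VALID_TIERS
        )
        (accepted if ok else rejected).append(row)
    return accepted, rejected


def assemble_dossier(accepted: Iterable[EvidenceRow]) -> str:
    """Deterministic assembly: sort and format rows, with no generation step."""
    ordered = sorted(accepted, key=lambda r: (r.tier, r.claim))
    return "\n".join(
        f"[{r.tier}/{r.confidence}] {r.claim} "
        f"(source: {r.source_id}, snapshot: {r.snapshot_at.date().isoformat()})"
        for r in ordered
    )


if __name__ == "__main__":
    ts = datetime(2024, 1, 1, tzinfo=timezone.utc)
    rows = [
        EvidenceRow("Supplier holds an ISO 27001 certificate", "doc-001", "A", "high", ts),
        EvidenceRow("Supplier patches monthly", None, "C", "medium", ts),
    ]
    accepted, rejected = gate(rows, known_sources={"doc-001"})
    print(assemble_dossier(accepted))
    print(f"{len(rejected)} row(s) rejected")
```

Because the dossier text is produced only by `assemble_dossier`, a claim without a resolvable source has no path into the output; in this sketch it lands in `rejected`, where it could be routed to a reviewer.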

The eval harness then measures the system on the same three axes used to assess Zhianrui's manual deliverables: factuality (does the claim match the source?), attribution accuracy (does the source identifier resolve to a real document?), and confidence calibration (when the system rates a claim "high confidence," does it hold up against the ground-truth tier-A evidence?).
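As a rough illustration of how the three axes could be computed from graded claims, the sketch below treats each as a simple fraction. The `GradedClaim` record and its boolean fields are hypothetical stand-ins for whatever the graders actually record; the real harness may weight or aggregate differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GradedClaim:
    matches_source: bool    # factuality: the claim agrees with the cited source
    source_resolves: bool   # attribution: the source identifier resolves to a real document
    rated_high: bool        # the system labelled the claim "high confidence"
    holds_vs_tier_a: bool   # the claim held up against ground-truth tier-A evidence


def _fraction(hits: int, total: int) -> Optional[float]:
    # None signals "not measurable" instead of dividing by zero.
    return hits / total if total else None


def score(claims: Sequence[GradedClaim]) -> dict[str, Optional[float]]:
    high = [c for c in claims if c.rated_high]
    return {
        "factuality": _fraction(sum(c.matches_source for c in claims), len(claims)),
        "attribution": _fraction(sum(c.source_resolves for c in claims), len(claims)),
        # Calibration is measured only over claims rated "high confidence".
        "calibration": _fraction(sum(c.holds_vs_tier_a for c in high), len(high)),
    }


if __name__ == "__main__":
    graded = [
        GradedClaim(True, True, True, True),
        GradedClaim(True, True, True, False),
        GradedClaim(False, True, False, False),
        GradedClaim(True, False, False, True),
    ]
    print(score(graded))
```

Note that calibration uses a different denominator from the other two metrics: only claims the system rated "high confidence" count toward it.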

Deliverable shape

  • Multi-agent orchestration system with five specialised agents, written to be model-agnostic (provider-pluggable).
  • Evaluation harness with 240 ground-truth dossiers and a continuous regression suite.
  • Reviewer interface for human-in-the-loop validation of flagged claims.
  • Research dossier on the methodology encoding decisions: 47 source rows.
  • Operational handover: model upgrade playbook, prompt-revision change-control, eval threshold maintenance.

Outcomes

The system produced supplier dossiers at ~6% of the marginal-time cost of the manual baseline, with structural parity to the manual deliverable. The calibration metric (the fraction of "high confidence" claims that held up against tier-A re-verification) ran at 91% in production over the first three months, 4 points below the manual baseline of 95%. The 4-point gap is documented in the open-questions register, with proposed mitigations.

The client adopted the same evaluation harness as a deploy gate for subsequent agent revisions. The methodology encoding (WS-1) was published as an internal standard.