Provenance-aware RAG for legacy engineering documentation — Case study

Methodology context — this case study uses Zhianrui's provenance-first methodology, applied to a RAG system. The chunk-strategy benchmark referenced here is documented in our lab notes.

The brief

Sector: Industrial engineering, regulated.
Need: the client had ~38,000 PDF documents (drawings, specifications, change-orders, test reports, regulatory submissions) accumulated over two decades. Engineers spent a disproportionate share of their time hunting through them. The client wanted a RAG system that would answer engineering questions in natural language, with full source attribution back to the originating document and page. The attribution requirement was non-negotiable: in this domain, an unattributed answer was worse than no answer.

Workstream decomposition

WS-1: Chunk-strategy benchmark. Test seven chunk strategies on a held-out evaluation set of 200 representative engineering questions. Strategies ranged from naive fixed-size to layout-aware (splitting on document section boundaries) to semantic (splitting on paragraph-level coherence with model-judged boundaries).
WS-2: Re-ranking with domain priors. Bi-encoder retrieval alone is not enough for technical documents: the collection contains too many superficially similar chunks (the same component appears in dozens of drawings). Build a domain-specific re-ranker that promotes chunks based on document type, recency, and explicit revision-supersedence relationships.
WS-3: Source-attribution machinery. Every generated answer carries inline citations resolving to (document ID, page number, revision number). Citations are not appended at the end — they are embedded in the response stream as the model writes them. A post-processing step verifies every citation resolves to an actual chunk in the retrieved set; unverifiable citations cause the answer to be rejected and re-generated with a tighter constraint.
WS-4: Hallucination-specific eval. Standard RAG eval measures answer relevance. We additionally built a citation-hallucination eval: take the model's response, extract every cited (doc-id, page) pair, and verify that it (a) exists and (b) actually contains the substance the citation is supporting. Failure on (b) is an invisible hallucination: the citation looks valid but is misleading. This is the failure mode standard evals miss.
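
The re-ranking idea in WS-2 can be sketched as follows. This is a minimal illustration, not the delivered re-ranker: the Chunk fields, the prior weights, and the additive scoring rule are all assumptions made for the example. Only the three signals (document type, recency, revision supersedence) come from the description above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk record; field names are illustrative only.
    doc_id: str
    page: int
    revision: int
    doc_type: str
    issued: date
    retrieval_score: float  # bi-encoder similarity, higher is better
    superseded: bool = False  # True if a later revision replaces this one


# Assumed weights, chosen only to make the example readable.
DOC_TYPE_PRIOR = {"specification": 0.15, "change-order": 0.10, "test-report": 0.05}
RECENCY_WEIGHT = 0.1
SUPERSEDED_PENALTY = 0.5


def rerank_score(chunk: Chunk, today: date) -> float:
    """Combine the retrieval score with document-type, recency and supersedence priors."""
    age_years = (today - chunk.issued).days / 365.25
    recency_bonus = RECENCY_WEIGHT / (1.0 + age_years)
    score = chunk.retrieval_score + DOC_TYPE_PRIOR.get(chunk.doc_type, 0.0) + recency_bonus
    if chunk.superseded:
        score -= SUPERSEDED_PENALTY
    return score


def rerank(chunks: list[Chunk], today: date) -> list[Chunk]:
    """Return chunks ordered from most to least preferred."""
    return sorted(chunks, key=lambda c: rerank_score(c, today), reverse=True)


# Two revisions of the same (made-up) drawing: the older one is textually
# closer to the query but has been superseded.
old = Chunk("DRW-0417", 2, 1, "drawing", date(2005, 1, 1), 0.80, superseded=True)
new = Chunk("DRW-0417", 2, 4, "drawing", date(2020, 1, 1), 0.75)
ranked = rerank([old, new], today=date(2024, 1, 1))
```

With these assumed weights the current revision is ranked ahead of the superseded one even though its raw retrieval score is lower, which is the behaviour the domain priors are meant to produce.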

Method highlight

The architectural insight: RAG is not a search problem with a generation skin. It is a generation problem with a retrieval substrate. Treating it as search produces systems that surface the right documents but generate confident, unverifiable summaries of them. The provenance-aware design inverts this: the generation step is constrained to produce only claims that resolve to specific retrieved chunks, with the citation as a structural part of the output, not an afterthought.
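
A minimal sketch of the verify-then-regenerate step follows. The inline citation syntax ([DOC-ID p.N rev.R]), the function names, and the generate callable are assumptions for illustration; the source specifies only that citations resolve to (document ID, page number, revision number) and that unverifiable citations trigger regeneration.

```python
import re
from typing import Callable, NamedTuple, Optional


class Citation(NamedTuple):
    doc_id: str
    page: int
    revision: int


# Assumed inline format, e.g. "[SPEC-114 p.12 rev.3]".
CITATION_RE = re.compile(r"\[([A-Za-z0-9_-]+) p\.(\d+) rev\.(\d+)\]")


def extract_citations(answer: str) -> list[Citation]:
    """Pull every inline citation out of a generated answer."""
    return [Citation(d, int(p), int(r)) for d, p, r in CITATION_RE.findall(answer)]


def unverifiable_citations(answer: str, retrieved: set[Citation]) -> list[Citation]:
    """Citations that do not resolve to any chunk in the retrieved set."""
    return [c for c in extract_citations(answer) if c not in retrieved]


def accept(answer: str, retrieved: set[Citation]) -> bool:
    """Accept only answers that cite something and cite nothing unverifiable."""
    citations = extract_citations(answer)
    return bool(citations) and all(c in retrieved for c in citations)


def generate_verified(
    generate: Callable[[int], str],  # hypothetical: attempt number -> answer text
    retrieved: set[Citation],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until an answer passes verification, or give up and return None."""
    for attempt in range(max_attempts):
        answer = generate(attempt)
        if accept(answer, retrieved):
            return answer
    return None


# Made-up example data.
retrieved = {Citation("SPEC-114", 12, 3)}
good = "The torque limit is 45 Nm [SPEC-114 p.12 rev.3]."
bad = "The torque limit is 45 Nm [SPEC-114 p.99 rev.3]."
result = generate_verified(lambda attempt: bad if attempt == 0 else good, retrieved)
```

Here the first attempt cites a page that was never retrieved and is rejected; the second attempt passes. Returning None after the attempt budget is exhausted reflects the principle that no answer is preferable to an unattributed one.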

The chunk-strategy benchmark in WS-1 produced a counter-intuitive result: layout-aware chunking outperformed semantic chunking by a significant margin on this corpus. We hypothesise this is because the document set was engineered to be navigated — section boundaries were deliberate authorial decisions, and respecting them preserved coherence better than letting an LLM judge paragraph boundaries.
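
To make "layout-aware" concrete, here is a minimal sketch that splits extracted text on numbered section headings. It assumes headings look like "3.2 Test procedure" and that text extraction has already happened; the benchmarked implementation is not described in this case study and may differ substantially.

```python
import re

# Assumed heading shape: a section number such as "2" or "3.2.1", then a title.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\s+\S")


def layout_aware_chunks(text: str) -> list[dict]:
    """Split text into one chunk per section, keeping the heading as metadata."""
    sections = []
    heading, lines = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if HEADING_RE.match(line):
            # A new section starts: flush the previous one if it has content.
            if heading is not None or lines:
                sections.append({"heading": heading, "text": " ".join(lines)})
            heading, lines = line, []
        elif line:
            lines.append(line)
    if heading is not None or lines:
        sections.append({"heading": heading, "text": " ".join(lines)})
    return sections


# Made-up sample document.
sample = (
    "1 Scope\n"
    "This document covers valve V-7.\n"
    "\n"
    "2 Test procedure\n"
    "Apply pressure.\n"
    "Hold for 10 minutes.\n"
)
chunks = layout_aware_chunks(sample)
```

One limitation of this simple rule: a body line that happens to begin with a number (for example "12 bolts are required") would be misread as a heading, so a real pipeline would need stronger layout signals such as font size or PDF outline entries.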

Deliverable shape

RAG system with provenance-aware generation, citation verification, and revision-supersedence handling.
Chunk-strategy benchmark report: methodology + results across seven strategies + recommendation rationale.
Citation-hallucination eval suite as a CI-runnable test artifact.
Operations guide: how to onboard a new document corpus, how to audit a poorly-attributed answer, when to retire a corpus revision.
Research dossier: 52 source rows including the chunk-strategy benchmark methodology.

Outcomes

Engineers using the system found documentation 4–6× faster than browsing manually, with the bottleneck shifting from "finding the document" to "verifying the answer." The citation-hallucination rate was measured at 1.4% on the 200-question evaluation set after deployment, versus 11–18% for the baseline RAG implementations the client had previously evaluated. The client adopted the chunk-strategy benchmark methodology for the two additional document corpora subsequently brought into the system.
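
A rate like the one reported above could be computed along these lines. The sketch assumes the rate is the share of cited (doc-id, page) pairs that fail either WS-4 check; the case study does not state whether the figure is per citation or per answer. The supports judge is a hypothetical callable (a human label or a model-based check in practice), stubbed here with a substring match, and all data is made up.

```python
from typing import Callable, Iterable


def citation_hallucination_rate(
    cited: Iterable[tuple[str, int, str]],  # (doc_id, page, claim being supported)
    corpus: dict[tuple[str, int], str],     # (doc_id, page) -> page text
    supports: Callable[[str, str], bool],   # hypothetical judge: (page_text, claim) -> bool
) -> float:
    """Fraction of citations that fail check (a) existence or (b) support."""
    total = 0
    failures = 0
    for doc_id, page, claim in cited:
        total += 1
        page_text = corpus.get((doc_id, page))
        if page_text is None:
            failures += 1  # check (a): the cited page does not exist
        elif not supports(page_text, claim):
            failures += 1  # check (b): the page exists but does not back the claim
    return failures / total if total else 0.0


corpus = {("SPEC-114", 12): "maximum torque 45 Nm"}
cited = [
    ("SPEC-114", 12, "45 Nm"),           # exists and supports the claim
    ("SPEC-114", 12, "60 Nm"),           # exists, but the page says otherwise
    ("SPEC-999", 1, "45 Nm"),            # cited page does not exist
    ("SPEC-114", 12, "maximum torque"),  # exists and supports the claim
]
rate = citation_hallucination_rate(cited, corpus, lambda text, claim: claim in text)
```

Because the function is pure and deterministic given a fixed judge, it can run inside an ordinary test suite with a threshold assertion, which is one way an eval of this kind becomes CI-runnable.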
