Retrieval-augmented generation looks, from a distance, like search-with-an-extra-step. Find relevant documents, paste them into the prompt, get an answer. Most failed RAG implementations were built on that mental model. The architectural implications of getting it wrong show up at every layer.

RAG is not search-with-generation. It is generation-with-a-retrieval-substrate. The generation step is the load-bearing one; retrieval is the substrate that constrains what the generation step is allowed to invent. Treating it as search produces systems that surface the right documents but generate confident, unverifiable summaries of them — the failure mode that costs the most and is the easiest to overlook.

A few specific consequences follow.

The chunk strategy shapes the model output, not just the search index

Chunk strategy is usually treated as a search-engine concern: how do we slice documents so that bi-encoder retrieval finds the right pieces? But the chunks are also what gets pasted into the model context. A chunk strategy optimised for retrieval recall — say, overlapping fixed-size windows — may also be the strategy that gives the model contradictory snippets and lets it pick whichever one supports its first guess.

A useful chunk strategy is one that serves both retrieval and generation. We have seen layout-aware chunking (splitting on document section boundaries) outperform semantic chunking on technical document corpora because the section boundaries were authorial decisions worth respecting. Generic strategies do not transfer. The chunk strategy is part of the system, so it should be benchmarked on the specific corpus.
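
As a rough illustration of layout-aware chunking, the sketch below splits a plain-text document on section headings. It assumes headings are numbered lines such as "1 Scope" or "2.3 Torque limits"; that pattern, the `Chunk` type and the `chunk_by_section` name are hypothetical, and a real corpus will need its own boundary detection.

```python
import re
from dataclasses import dataclass

# Assumption: a heading is a line like "1 Scope" or "2.3 Torque limits".
# Body lines that happen to start with a number would be misread as headings.
HEADING = re.compile(r"^\d+(?:\.\d+)*\s+\S.*$")


@dataclass
class Chunk:
    heading: str
    text: str


def chunk_by_section(document: str) -> list[Chunk]:
    """Split a document into one chunk per section, keeping the heading."""
    chunks: list[Chunk] = []
    heading = "(preamble)"  # text that appears before the first heading
    body: list[str] = []

    def flush() -> None:
        # Sections with no body text are dropped rather than emitted empty.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(heading, text))

    for line in document.splitlines():
        if HEADING.match(line):
            flush()
            heading, body = line.strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading alongside the text means the generation step sees which section a passage came from, which is one plausible reason section-aligned chunks behave better in the prompt than arbitrary windows.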

Citations have to be part of the generation, not appended

The standard pattern — generate the answer, then list the retrieved documents at the end — is the architectural error that makes RAG-as-search systems hallucinate. The model has not been forced to ground each claim in a specific chunk; it has been given a pile of context and asked to summarise. The summary is constrained by nothing.

The provenance-aware design forces citations into the response stream: the model attaches a citation to each claim as it writes the claim. A post-processing step verifies that every citation resolves to an actual chunk in the retrieved set. Unverifiable citations cause the answer to be rejected and regenerated under tighter constraints. This is more expensive at inference time. It is also the only design we have seen that produces a citation-hallucination rate in the low single digits in production.
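
The verify-and-regenerate loop might look something like the sketch below. Everything named here is an assumption made for illustration: the inline marker format `[chunk:ID]`, the `generate` callable (a stand-in for whatever model call the system uses) and the wording of the feedback string. It only checks that citations resolve to retrieved chunks, not that the cited text supports the claim.

```python
import re
from typing import Callable, Optional

# Assumption: the model is prompted to cite inline as [chunk:<id>].
CITATION = re.compile(r"\[chunk:([A-Za-z0-9_\-]+)\]")


def unresolved_citations(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Return cited chunk ids that are not in the retrieved set."""
    return [c for c in CITATION.findall(answer) if c not in retrieved_ids]


def answer_with_verified_citations(
    generate: Callable[[str, dict[str, str], str], str],  # hypothetical model call
    question: str,
    retrieved: dict[str, str],  # chunk id -> chunk text
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until every citation resolves, or give up and return None."""
    ids = set(retrieved)
    feedback = ""
    for _ in range(max_attempts):
        answer = generate(question, retrieved, feedback)
        cited = CITATION.findall(answer)
        bad = [c for c in cited if c not in ids]
        # Accept only answers that cite something and cite nothing unknown.
        if cited and not bad:
            return answer
        # Tighten the constraint for the next attempt.
        feedback = (
            f"Cite only these chunk ids: {sorted(ids)}. "
            f"Rejected ids from the last attempt: {bad}."
        )
    return None
```

Returning `None` after the attempt budget is spent leaves the caller to decide whether to refuse, escalate or fall back, rather than silently shipping an answer that failed verification.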

Re-ranking is not optional for technical corpora

Bi-encoder retrieval is fast and approximately correct. It is not adequate alone for any corpus where superficially similar chunks are common — engineering specifications, regulatory text, medical documentation. The same component appears in twenty drawings; the same regulatory clause appears in five documents at different revision levels. Bi-encoder retrieval surfaces them all at similar scores; the model then has to disambiguate, which is precisely what it is bad at.

A domain-specific re-ranker is not an optimisation. It is a structural prerequisite. Promote chunks based on document type, recency, explicit revision-supersedence relationships, and any other domain priors you have. Without it, the system's most-confident answers will be the ones drawn from the most-common chunks, regardless of whether those chunks are the most relevant.
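
A minimal sketch of such a re-ranker is shown below, assuming each retrieved chunk carries document metadata. The field names, the document-type priors and the penalty weights are all invented for illustration and would need tuning against the actual corpus.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Candidate:
    chunk_id: str
    doc_id: str
    doc_type: str
    issued: date
    retrieval_score: float  # bi-encoder similarity
    superseded_by: Optional[str] = None  # doc_id of the newer revision, if any


# Hypothetical priors: how much to trust each document type.
DOC_TYPE_PRIOR = {"specification": 0.3, "drawing": 0.1, "email": -0.2}


def rerank(
    candidates: list[Candidate],
    today: date,
    recency_weight: float = 0.1,      # penalty per year of age (assumed)
    superseded_penalty: float = 1.0,  # large enough to sink old revisions (assumed)
) -> list[Candidate]:
    """Order candidates by retrieval score adjusted with domain priors."""

    def score(c: Candidate) -> float:
        age_years = (today - c.issued).days / 365.25
        s = c.retrieval_score
        s += DOC_TYPE_PRIOR.get(c.doc_type, 0.0)
        s -= recency_weight * age_years
        if c.superseded_by is not None:
            s -= superseded_penalty
        return s

    return sorted(candidates, key=score, reverse=True)
```

With two revisions of the same clause that score almost identically at retrieval time, the superseded revision drops below the current one even when its raw similarity is slightly higher, so the model is not left to disambiguate them.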

Hallucinated citations are a separate failure mode

Standard RAG evaluation measures answer relevance. There is a second, harder failure mode that standard evals miss: the model produces an answer that looks correct, with citations that look valid, but the cited chunks do not actually contain the substance of the claims they are attached to. The citation exists; the claim and the cited content disagree. A reviewer who trusts the citation does not catch the error.

The eval for this failure has to extract every (document, page) citation from the response, retrieve the chunk, and verify that the cited content actually supports the claim. It is more expensive than relevance evaluation. It is also the eval that distinguishes a RAG system you can ship from one you cannot.
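
One way to structure that eval is sketched below. It assumes citations appear inline as `[DOC-ID:page]`, that a claim is the sentence containing the citation, and that a `supports` judge is supplied by the caller (for example an entailment model or a human label). The marker format, the naive sentence splitter and all names are assumptions, not a prescribed implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumption: citations look like [SPEC-7:4], meaning document SPEC-7, page 4.
CITE = re.compile(r"\[([A-Za-z0-9\-]+):(\d+)\]")


@dataclass
class CitationCheck:
    claim: str
    document: str
    page: int
    supported: bool


def check_citations(
    response: str,
    pages: dict[tuple[str, int], str],     # (document, page) -> page text
    supports: Callable[[str, str], bool],  # judge(claim, cited_text), supplied by caller
) -> list[CitationCheck]:
    """Check every (document, page) citation against the content it points to."""
    results: list[CitationCheck] = []
    # Naive sentence split; good enough for a sketch, not for production text.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        cites = CITE.findall(sentence)
        if not cites:
            continue
        claim = CITE.sub("", sentence).strip()
        for doc, page in cites:
            text = pages.get((doc, int(page)))
            # A citation to a page that cannot be retrieved counts as unsupported.
            ok = text is not None and supports(claim, text)
            results.append(CitationCheck(claim, doc, int(page), ok))
    return results


def support_rate(results: list[CitationCheck]) -> float:
    """Fraction of citations whose cited content supports the claim."""
    return sum(r.supported for r in results) / len(results) if results else 0.0
```

The quality of this eval is bounded by the `supports` judge, which is where most of the cost sits; the surrounding plumbing mainly guarantees that no citation escapes being checked.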

Why this matters more in regulated domains

In a consumer chat application, an unverifiable RAG answer is a quality issue. In a regulated domain (medical, legal, financial, engineering) it is a liability issue. The system is producing an answer that looks substantiated, that a reviewer might rely on, and that cannot be defended if challenged. The cost of getting RAG architecture wrong is not paid at evaluation time; it is paid the first time someone makes a decision based on a confidently cited answer that the citation does not actually support.

Treat RAG as generation-with-a-retrieval-substrate, design the citation as a structural element of the output rather than a post-script, and benchmark the chunk strategy on the specific corpus. The first two of those decisions account for most of the gap between RAG systems that ship and ones that do not.