A repeatable chunk-strategy benchmark for technical-document RAG

We ran a repeatable chunk-strategy benchmark on a reference engineering corpus (anonymised, ~12,000 PDFs of technical specifications and drawings). We evaluated seven chunk strategies on retrieval F1 against 200 hand-graded engineering questions with known ground-truth chunks.
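For readers who want to reproduce the metric, here is a minimal sketch of retrieval F1 for one question and its average over a question set. It assumes chunks are identified by string IDs and that the score is a macro-average of per-question F1; this note does not state the aggregation or how ground-truth chunks are mapped across strategies, so treat both as assumptions. The names `retrieval_f1` and `mean_f1` are illustrative.

```python
from typing import Iterable


def retrieval_f1(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """F1 between retrieved chunk IDs and ground-truth chunk IDs for one question."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    if hits == 0:
        # No overlap (or an empty set on either side) scores zero.
        return 0.0
    precision = hits / len(retrieved_set)
    recall = hits / len(relevant_set)
    return 2 * precision * recall / (precision + recall)


def mean_f1(runs: list[tuple[list[str], list[str]]]) -> float:
    """Mean of per-question F1 (assumed aggregation; not confirmed by the note)."""
    if not runs:
        raise ValueError("need at least one question")
    return sum(retrieval_f1(got, gold) for got, gold in runs) / len(runs)
```

Each element of `runs` is a pair of (retrieved IDs, ground-truth IDs) for one question, so a 200-question benchmark would pass a list of 200 pairs.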

Strategies tested

Fixed 512-token windows, no overlap.
Fixed 512-token windows, 64-token overlap.
Fixed 1024-token windows, 128-token overlap.
Sentence-bounded, target 512 tokens.
Paragraph-bounded, target 512 tokens.
Layout-aware (split on document section boundaries detected from PDF structure).
Semantic (LLM-judged paragraph coherence boundaries, target 512 tokens).
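To make the first few strategies concrete, the sketch below shows fixed-size windows with optional overlap and a greedy paragraph-bounded packer. It assumes the document has already been tokenised into a list of tokens (the tokeniser used in the benchmark is not stated here), and the function names are illustrative, not the benchmark's own code.

```python
def fixed_windows(tokens: list[str], size: int = 512, overlap: int = 0) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            # This window already reaches the end; stop to avoid a redundant tail chunk.
            break
    return chunks


def pack_paragraphs(paragraphs: list[list[str]], target: int = 512) -> list[list[str]]:
    """Greedily pack whole paragraphs into chunks of at most `target` tokens.

    A single paragraph longer than `target` is kept whole (a simplifying assumption).
    """
    chunks, current = [], []
    for para in paragraphs:
        if current and len(current) + len(para) > target:
            chunks.append(current)
            current = []
        current = current + para
    if current:
        chunks.append(current)
    return chunks
```

For example, `fixed_windows(tokens, size=512, overlap=64)` corresponds to the second strategy in the list, and `fixed_windows(tokens, size=1024, overlap=128)` to the third. The sentence-bounded strategy could reuse `pack_paragraphs` with sentences as the units.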

Results

Figure: horizontal bar chart of retrieval F1 across the seven chunk strategies; layout-aware scores highest at 0.78.

Strategy	Retrieval F1	Mean chunks per doc	Notes
1. Fixed, no overlap	0.61	38	Baseline
2. Fixed, 64-token overlap	0.66	41	Overlap helps
3. Fixed 1024 / 128 overlap	0.64	22	Bigger chunks worse here
4. Sentence-bounded	0.67	47	Marginal vs. fixed
5. Paragraph-bounded	0.71	31	Notable improvement
6. Layout-aware	0.78	19	Strongest
7. Semantic (LLM boundaries)	0.67	28	Underperformed

Why layout-aware won (our hypothesis)

Engineering documents are structured documents. Section boundaries are deliberate authorial decisions: a section break in a specification document signals a topic shift the author considered semantically significant. Splitting on those boundaries preserves coherence in a way an LLM-judged semantic boundary cannot, because the LLM is approximating what the author already explicitly decided.

This is corpus-specific. We would not expect the same result on a corpus of unstructured prose (legal opinions, prose research articles), where authorial structure is weaker and semantic chunking has more to work with.

What to copy

The benchmark methodology, not our specific F1 numbers — those are corpus-specific.
The chunks-per-document column: it matters operationally. Layout-aware produced roughly half the chunks of the 512-token fixed-window strategies (19 per document against 38 and 41), which has consequences for retrieval cost and re-ranking computation.
The decision to test multiple strategies on the actual corpus rather than picking one from a generic guide. Our second-best strategy was paragraph-bounded; on a different corpus it might be the best.
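The operational point about chunk counts can be checked with simple arithmetic on the results table. The sketch below uses only the mean chunks-per-document figures reported above; the helper name `relative_chunk_load` is illustrative, and index size is treated as proportional to chunk count, which is a simplification.

```python
# Mean chunks per document, copied from the results table.
MEAN_CHUNKS_PER_DOC = {
    "Fixed, no overlap": 38,
    "Fixed, 64-token overlap": 41,
    "Fixed 1024 / 128 overlap": 22,
    "Sentence-bounded": 47,
    "Paragraph-bounded": 31,
    "Layout-aware": 19,
    "Semantic (LLM boundaries)": 28,
}


def relative_chunk_load(table: dict[str, int], reference: str) -> dict[str, float]:
    """Chunk count of each strategy as a multiple of the reference strategy's count."""
    base = table[reference]
    return {name: count / base for name, count in table.items()}


if __name__ == "__main__":
    ratios = relative_chunk_load(MEAN_CHUNKS_PER_DOC, "Layout-aware")
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ratio:.2f}x the layout-aware chunk count")
```

On these numbers the no-overlap baseline indexes exactly twice as many chunks per document as layout-aware, while the 1024-token windows come much closer, so the "half the chunks" comparison applies to the 512-token fixed-window strategies specifically.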

Caveats

N = 200 questions. We have ~85% confidence that F1 differences greater than 0.04 are real, and lower confidence in smaller differences.
Retrieval F1 measures whether the right chunks were retrieved, not whether the downstream generation produced a correct answer. In our experience, generation-quality differences across strategies are smaller than retrieval differences, but we did not measure them here.
The layout-aware extractor used PDF structural metadata that some PDFs do not carry. Documents without that metadata fell back to paragraph-bounded chunking; the layout-aware F1 of 0.78 reported above is the blended performance.
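One common way to put a number on how trustworthy an F1 gap is with a question set of this size is a paired bootstrap over per-question scores. The sketch below is not necessarily how the ~85% figure above was derived (the note does not say); it assumes you have per-question F1 for two strategies on the same questions, and the function name `bootstrap_win_rate` is illustrative.

```python
import random


def bootstrap_win_rate(
    f1_a: list[float],
    f1_b: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which strategy A's mean F1 exceeds B's.

    Questions are resampled with replacement, keeping each question's A and B
    scores paired so that question difficulty cancels out.
    """
    if len(f1_a) != len(f1_b) or not f1_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(f1_a, f1_b)]
    wins = 0
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=len(diffs))
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples
```

A value near 1.0 suggests the gap is stable under resampling of the questions, while a value near 0.5 suggests the two strategies are not distinguishable on this question set.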

Method link

The benchmark methodology, the question construction protocol, and the question set characterisation are documented under our standard source-tier framework — see methodology for the framework, and the provenance-aware RAG case study for an applied use of these chunk-strategy results.
