Lab note · RAG · 2026-04-19
A repeatable chunk-strategy benchmark for technical-document RAG
We benchmarked seven chunking strategies on a reference engineering corpus (anonymised, ~12,000 PDFs of technical specifications and drawings). Each strategy was scored on retrieval F1 against 200 hand-graded engineering questions with known ground-truth chunks.
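For readers reproducing the metric, here is a minimal sketch of per-question retrieval F1 over chunk IDs, averaged across questions. The note does not say how scores were aggregated or how many chunks were retrieved per question, so the macro-average and the function names `retrieval_f1` and `mean_retrieval_f1` are assumptions for illustration only.

```python
from typing import Iterable, List, Set, Tuple


def retrieval_f1(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """F1 between retrieved chunk IDs and ground-truth chunk IDs for one question."""
    retrieved_set: Set[str] = set(retrieved)
    relevant_set: Set[str] = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # No overlap (or an empty side) scores zero and avoids division by zero.
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved_set)
    recall = hits / len(relevant_set)
    return 2 * precision * recall / (precision + recall)


def mean_retrieval_f1(results: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Assumed aggregation: unweighted mean of per-question F1."""
    scores = [retrieval_f1(retrieved, relevant) for retrieved, relevant in results]
    return sum(scores) / len(scores) if scores else 0.0
```

With this definition, a question whose two retrieved chunks include one of two ground-truth chunks has precision 0.5 and recall 0.5, so its F1 is 0.5.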
Strategies tested
- Fixed 512-token windows, no overlap.
- Fixed 512-token windows, 64-token overlap.
- Fixed 1024-token windows, 128-token overlap.
- Sentence-bounded, target 512 tokens.
- Paragraph-bounded, target 512 tokens.
- Layout-aware (split on document section boundaries detected from PDF structure).
- Semantic (LLM-judged paragraph coherence boundaries, target 512 tokens).
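As a concrete reference for the three fixed-window variants, the sketch below splits a token list into windows of a given size with an optional overlap. It assumes the document is already tokenised; the tokenizer used in the benchmark is not stated in this note, and `fixed_windows` is a hypothetical helper name.

```python
from typing import List, Sequence


def fixed_windows(tokens: Sequence[str], size: int = 512, overlap: int = 0) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once a window reaches the end, so no trailing chunk is pure overlap.
        if start + size >= len(tokens):
            break
    return chunks
```

Calling `fixed_windows(tokens, 512, 64)` corresponds to the second strategy in the list and `fixed_windows(tokens, 1024, 128)` to the third. Overlap shortens the step between windows, which is why the overlapped variant yields slightly more chunks per document than the non-overlapped one.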
Results
| Strategy | Retrieval F1 | Mean chunks per doc | Notes |
|---|---|---|---|
| 1. Fixed 512, no overlap | 0.61 | 38 | Baseline |
| 2. Fixed 512, 64-token overlap | 0.66 | 41 | Overlap helps |
| 3. Fixed 1024, 128-token overlap | 0.64 | 22 | Bigger chunks worse here |
| 4. Sentence-bounded | 0.67 | 47 | Marginal vs. fixed |
| 5. Paragraph-bounded | 0.71 | 31 | Notable improvement |
| 6. Layout-aware | 0.78 | 19 | Strongest |
| 7. Semantic (LLM boundaries) | 0.67 | 28 | Underperformed |
Why layout-aware won (our hypothesis)
Engineering documents are structured documents. Section boundaries are deliberate authorial decisions: a section break in a specification document signals a topic shift the author considered semantically significant. Splitting on those boundaries preserves coherence in a way an LLM-judged semantic boundary cannot, because the LLM is approximating what the author already explicitly decided.
This is corpus-specific. We would not expect the same result on a corpus of unstructured prose (legal opinions, prose research articles), where authorial structure is weaker and semantic chunking has more to work with.
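To make the layout-aware idea concrete, here is a minimal sketch of splitting on section boundaries, with the paragraph-bounded fallback that the Caveats section describes for documents without structural metadata. The `Section` type, the function names, and the use of whitespace-separated words as a stand-in for tokens are all assumptions; the note does not describe the extractor's actual interface or whether long sections were further divided.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:
    """Hypothetical container for one section recovered from PDF structure."""
    heading: str
    body: str


def paragraph_chunks(text: str, target_tokens: int = 512) -> List[str]:
    """Pack whole paragraphs into chunks of roughly target_tokens (whitespace tokens)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk before adding a paragraph that would exceed the target.
        if current and count + n > target_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_document(sections: Optional[List[Section]], raw_text: str,
                   target_tokens: int = 512) -> List[str]:
    """One chunk per detected section; paragraph-bounded fallback when none are available."""
    usable = [s for s in (sections or []) if s.body.strip()]
    if usable:
        return [f"{s.heading}\n{s.body}".strip() for s in usable]
    return paragraph_chunks(raw_text, target_tokens)
```

The point of the sketch is that the section path does no boundary inference at all: it reuses the boundaries the author already placed.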
What to copy
- The benchmark methodology, not our specific F1 numbers, which are corpus-specific.
- The chunks-per-document column, because it matters operationally. Layout-aware produced about half as many chunks as the 512-token fixed-window strategies (19 versus 38 and 41 per document; the 1024-token windows averaged 22), which has consequences for retrieval cost and re-ranking computation.
- The decision to test multiple strategies on the actual corpus rather than picking one from a generic guide. Our second-best strategy was paragraph-bounded; in a different corpus that might be the best.
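A back-of-the-envelope sketch of what the chunks-per-document figures imply for index size, using the mean values from the results table and the approximate corpus size of 12,000 documents. The totals are estimates derived from those rounded figures, not measured index sizes, and the dictionary keys are labels chosen for this example.

```python
CORPUS_DOCS = 12_000  # approximate corpus size stated in the note

# Mean chunks per document, copied from the results table.
MEAN_CHUNKS_PER_DOC = {
    "fixed_512_no_overlap": 38,
    "fixed_512_overlap_64": 41,
    "fixed_1024_overlap_128": 22,
    "sentence_bounded": 47,
    "paragraph_bounded": 31,
    "layout_aware": 19,
    "semantic_llm": 28,
}


def estimated_index_size(mean_chunks: int, n_docs: int = CORPUS_DOCS) -> int:
    """Rough total number of chunks to embed, store, and search."""
    return mean_chunks * n_docs


sizes = {name: estimated_index_size(n) for name, n in MEAN_CHUNKS_PER_DOC.items()}

# Print strategies from smallest to largest estimated index.
for name, total in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} {total:>8,d} chunks")
```

On these figures, layout-aware yields roughly 228,000 chunks against roughly 456,000 for fixed 512-token windows without overlap.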
Caveats
- N = 200 questions; we have ~85% confidence in F1 differences greater than 0.04 and lower confidence in smaller ones. Gaps below that size in the table (for example 0.66 versus 0.67, or 0.64 versus 0.66) should not be read as a ranking.
- Retrieval F1 measures whether the right chunks were retrieved, not whether the downstream generation produced a correct answer. In our experience, generation-quality differences across strategies are smaller than retrieval differences, but we did not measure them here.
- The layout-aware extractor used PDF structural metadata that some PDFs do not carry. Documents without metadata fell back to paragraph-bounded chunking; the 0.78 F1 reported above is the blended performance.
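This note does not state how the confidence figure in the first caveat was derived. One common way to estimate confidence in an F1 gap between two strategies scored on the same questions is a paired bootstrap over questions. The sketch below illustrates that technique under that assumption, not the procedure actually used here; `paired_bootstrap_win_rate` and its parameters are hypothetical names.

```python
import random
from typing import List


def paired_bootstrap_win_rate(scores_a: List[float], scores_b: List[float],
                              n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resamples in which strategy A's mean F1 exceeds strategy B's.

    scores_a[i] and scores_b[i] must be per-question F1 for the same question i.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each A/B pair together.
        sample = rng.choices(diffs, k=n)
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples
```

A win rate near 1.0 suggests the gap is stable under resampling of the question set; a rate near 0.5 suggests the two strategies are not distinguishable at this sample size.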
Method link
The benchmark methodology, the question construction protocol, and the question set characterisation are documented under our standard source-tier framework — see methodology for the framework, and the provenance-aware RAG case study for an applied use of these chunk-strategy results.