We ran a repeatable chunk-strategy benchmark on a reference engineering corpus (anonymised, ~12,000 PDFs of technical specifications and drawings). We evaluated seven chunk strategies on retrieval F1 against 200 hand-graded engineering questions with known ground-truth chunks.
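
For readers who want to reproduce the metric, here is a minimal sketch of one common way to compute retrieval F1: set-based precision and recall per question, macro-averaged over the question set. The write-up does not state which F1 variant or retrieval depth was used, so the function names and the averaging choice below are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def retrieval_f1(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Set-based F1 for one question: retrieved chunk IDs vs. ground-truth chunk IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    if hits == 0:
        return 0.0  # also covers empty retrieved or empty relevant sets
    precision = hits / len(retrieved_set)
    recall = hits / len(relevant_set)
    return 2 * precision * recall / (precision + recall)


def mean_retrieval_f1(runs: List[Tuple[Iterable[str], Iterable[str]]]) -> float:
    """Macro-average over questions (assumed; micro-averaging is another option)."""
    scores = [retrieval_f1(retrieved, relevant) for retrieved, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the per-question scores, not just the mean, is useful later when comparing two strategies on the same questions.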

Strategies tested

  1. Fixed 512-token windows, no overlap.
  2. Fixed 512-token windows, 64-token overlap.
  3. Fixed 1024-token windows, 128-token overlap.
  4. Sentence-bounded, target 512 tokens.
  5. Paragraph-bounded, target 512 tokens.
  6. Layout-aware (split on document section boundaries detected from PDF structure).
  7. Semantic (LLM-judged paragraph coherence boundaries, target 512 tokens).
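
As a rough illustration of strategies 1 to 3 and 5, the sketch below shows fixed windows with optional overlap and greedy paragraph packing toward a token target. It is not the benchmark's implementation: the function names are hypothetical, and whitespace splitting stands in for whatever tokenizer was actually used.

```python
from typing import List, Sequence


def fixed_windows(tokens: Sequence[str], size: int = 512, overlap: int = 0) -> List[List[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def paragraph_bounded(paragraphs: Sequence[str], target: int = 512) -> List[str]:
    """Greedily pack whole paragraphs until adding one would exceed `target` tokens."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())  # assumption: whitespace tokens, not a model tokenizer
        if current and count + n > target:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

With these definitions, `fixed_windows(tokens, 512, 64)` corresponds to strategy 2 and `fixed_windows(tokens, 1024, 128)` to strategy 3. A paragraph longer than the target is kept whole here; a real pipeline would need a policy for that case.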

Results

Figure: horizontal bar chart of retrieval F1 across the seven chunk strategies; layout-aware scores highest at 0.78.

| Strategy | Retrieval F1 | Mean chunks per doc | Notes |
|---|---|---|---|
| 1. Fixed, no overlap | 0.61 | 38 | Baseline |
| 2. Fixed, 64-token overlap | 0.66 | 41 | Overlap helps |
| 3. Fixed 1024 / 128 overlap | 0.64 | 22 | Bigger chunks worse here |
| 4. Sentence-bounded | 0.67 | 47 | Marginal vs. fixed |
| 5. Paragraph-bounded | 0.71 | 31 | Notable improvement |
| 6. Layout-aware | 0.78 | 19 | Strongest |
| 7. Semantic (LLM boundaries) | 0.67 | 28 | Underperformed |

Why layout-aware won (our hypothesis)

Engineering documents are structured documents. Section boundaries are deliberate authorial decisions: a section break in a specification document signals a topic shift the author considered semantically significant. Splitting on those boundaries preserves coherence in a way an LLM-judged semantic boundary cannot, because the LLM is approximating what the author already explicitly decided.

This is corpus-specific. We would not expect the same result on a corpus of unstructured prose (legal opinions, prose research articles), where authorial structure is weaker and semantic chunking has more to work with.
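
To make the idea concrete, here is a minimal sketch of splitting on author-defined sections, with a paragraph fallback for documents that carry no structural metadata. The `Section` shape and the function name are hypothetical; the sketch assumes some upstream extractor has already turned PDF structure into (heading, body) pairs, and the fallback is simplified to one chunk per paragraph instead of packing toward a 512-token target.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical extractor output: one (heading, body text) pair per detected section.
Section = Tuple[str, str]


def layout_aware_chunks(sections: Optional[Sequence[Section]], raw_text: str) -> List[str]:
    """One chunk per detected section; fall back to paragraphs when none are detected."""
    if sections:
        # Keep the heading with its body so the chunk carries the author's own label.
        return [f"{heading}\n{body}".strip() for heading, body in sections if body.strip()]
    # Fallback (simplified): split on blank lines when no structure is available.
    return [para.strip() for para in raw_text.split("\n\n") if para.strip()]
```

Keeping the heading inside each chunk is a design choice of this sketch, not something the benchmark description specifies.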

What to copy

  • The benchmark methodology, not our specific F1 numbers — those are corpus-specific.
  • The chunks-per-document column, which matters operationally. Layout-aware produced half as many chunks as the 512-token fixed-window baseline (19 vs. 38 per document), which has consequences for retrieval cost and re-ranking computation.
  • The decision to test multiple strategies on the actual corpus, not pick a strategy from a generic guide. Our second-best strategy was paragraph-bounded; in a different corpus that might be the best.
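
The operational point about chunk counts can be made concrete with back-of-envelope arithmetic. The sketch below multiplies the reported mean chunks per document by the approximate corpus size of 12,000 PDFs; the dictionary keys are hypothetical labels, and the totals are estimates, not measured index sizes.

```python
DOCS = 12_000  # approximate corpus size stated in the write-up

# Mean chunks per document, as reported in the results table.
MEAN_CHUNKS_PER_DOC = {
    "fixed_512_no_overlap": 38,
    "fixed_512_overlap_64": 41,
    "fixed_1024_overlap_128": 22,
    "sentence_bounded": 47,
    "paragraph_bounded": 31,
    "layout_aware": 19,
    "semantic_llm": 28,
}

# Estimated number of chunks to embed, store, and search per strategy.
estimated_total_chunks = {
    name: DOCS * per_doc for name, per_doc in MEAN_CHUNKS_PER_DOC.items()
}

# Index size of each strategy relative to the fixed, no-overlap baseline.
baseline = estimated_total_chunks["fixed_512_no_overlap"]
relative_to_baseline = {
    name: total / baseline for name, total in estimated_total_chunks.items()
}
```

Under these assumptions, layout-aware indexes roughly 228,000 chunks against roughly 456,000 for the fixed 512-token baseline.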

Caveats

  • N = 200 questions; we have ~85% confidence in F1 differences greater than 0.04, lower confidence in smaller differences.
  • Retrieval F1 measures whether the right chunks were retrieved, not whether the downstream generation produced a correct answer. In our experience, generation-quality differences across strategies are smaller than retrieval differences, but we did not measure them here.
  • The layout-aware extractor used PDF structural metadata that some PDFs do not carry. Documents without metadata fell back to paragraph-bounded; the F1 number above is the blended performance.
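
With only 200 questions, small F1 gaps deserve an uncertainty estimate. The write-up does not say how its confidence figure was derived, so the following is not the authors' procedure; it is a sketch of one standard option, a paired bootstrap over per-question F1 scores for two strategies evaluated on the same questions. The function name and defaults are assumptions.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, 2.5th percentile, 97.5th percentile)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores_a)
    # Pairing by question removes variance due to question difficulty.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the resulting interval for, say, layout-aware minus paragraph-bounded excludes zero, the gap is unlikely to be an artefact of which 200 questions were sampled.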

Method link

The benchmark methodology, the question construction protocol, and the question set characterisation are documented under our standard source-tier framework — see methodology for the framework, and the provenance-aware RAG case study for an applied use of these chunk-strategy results.