Lab notes — Zhianrui

2026-04-28 · model-eval

Comparative eval of three open-weight 70B models on regulatory text comprehension

Aggregate scores cluster within 4 points; per-subset performance diverges by up to 28 points. Pick by failure mode, not by leaderboard.

2026-04-19 · rag

Layout-aware chunking beat semantic chunking by 11 points on retrieval F1 for our reference engineering corpus. Why it won, and what to copy.

2026-04-12 · model-eval

When the output must conform to a schema, the constraint-enforcement library matters more than the model. Three backends are reliable; one isn’t.