Internal research surface
Lab notes
Short experiment write-ups and model evaluations. Lower polish bar than insights — more current, more "what we tried and what happened." Same source-tier and confidence framework.
Comparative eval of three open-weight 70B models on regulatory text comprehension
Aggregate scores cluster within 4 points; per-subset performance diverges by up to 28 points. Pick by failure mode, not by leaderboard.
A repeatable chunk-strategy benchmark for technical-document RAG
Layout-aware chunking beat semantic chunking by 11 points on retrieval F1 for our reference engineering corpus. Why, and what to copy.
Schema-enforcement comparison across four constrained-generation backends
When the output must conform to a schema, the constraint-enforcement library matters more than the model. Three backends are reliable; one isn’t.