Lab note · Model eval · 2026-04-28
Comparative eval of three open-weight 70B models on regulatory text comprehension
We ran three open-weight 70B-class models against a regulatory-text comprehension eval set we built for an unrelated client engagement. The set has 612 questions across NIS2, IEC 62443, and EU AI Act source documents, with three subsets: factual recall ("what does Article 21(2)(d) require?"), interpretation ("does this clause apply to a hosted SaaS arrangement?"), and contradiction detection ("do these two clauses impose conflicting obligations?").
Setup
- 612 questions, evenly split across the three subsets (204 each).
- Three open-weight models, all 70B parameter class, evaluated under identical decoding settings (temperature 0, top-p 1.0).
- Judging: two LLM judges per response (from different model families than each other and than the systems under evaluation), plus human-in-the-loop review of a 10% sample with disagreement-triggered escalation.
- Aggregate and per-subset scores reported. Citation accuracy not in scope for this run — the prompts did not request citations.
Results
| Model | Aggregate | Factual recall | Interpretation | Contradiction |
|---|---|---|---|---|
| Model A | 71.4% | 78.2% | 64.9% | 71.0% |
| Model B | 73.8% | 76.5% | 71.3% | 73.6% |
| Model C | 70.9% | 84.7% | 56.8% | 71.2% |
Aggregate spread: 2.9 points. Per-subset spread: 8.2 / 14.5 / 2.6 points (factual recall / interpretation / contradiction).
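The spread figures can be recomputed directly from the table. The following is a minimal sketch, not part of the eval harness: the dictionary layout and function names are our own for illustration, and the numbers are transcribed from the results table. It also checks that each aggregate matches the unweighted mean of the three subsets, which should hold if the subsets are evenly sized.

```python
# Scores transcribed from the results table (percent correct).
# The dict layout and key names are illustrative, not from the eval harness.
SCORES = {
    "Model A": {"aggregate": 71.4, "factual_recall": 78.2,
                "interpretation": 64.9, "contradiction": 71.0},
    "Model B": {"aggregate": 73.8, "factual_recall": 76.5,
                "interpretation": 71.3, "contradiction": 73.6},
    "Model C": {"aggregate": 70.9, "factual_recall": 84.7,
                "interpretation": 56.8, "contradiction": 71.2},
}
SUBSETS = ["factual_recall", "interpretation", "contradiction"]


def spread(scores, column):
    """Best score minus worst score across models for one column."""
    values = [row[column] for row in scores.values()]
    return round(max(values) - min(values), 1)


def mean_of_subsets(row):
    """Unweighted mean of the subset scores (valid for evenly sized subsets)."""
    return round(sum(row[name] for name in SUBSETS) / len(SUBSETS), 1)


spreads = {col: spread(SCORES, col) for col in ["aggregate"] + SUBSETS}
aggregates_consistent = all(
    mean_of_subsets(row) == row["aggregate"] for row in SCORES.values()
)

print(spreads)
print("aggregates match subset means:", aggregates_consistent)
```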
What we read into it
If the use case is regulatory factual lookup ("retrieve the right clause and quote it back"), Model C leads: 84.7% against 78.2% for Model A and 76.5% for Model B. If the use case is regulatory interpretation ("does this clause apply here?"), Model C loses by a wide margin: it is the worst of the three on that subset, 14.5 points behind Model B and 8.1 points behind Model A. Picking Model C for an interpretation-heavy workflow on the basis of a top aggregate score on a different benchmark would be a mistake.
This is the failure mode that aggregate scores hide. Per-subset reporting is what makes the eval useful for the actual deployment decision.
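One way to make the use-case dependence concrete is to re-weight the subset scores by the expected workload mix. This is a sketch only: the two weight profiles are hypothetical, chosen for illustration and not measured from any real workload, and the function names are ours.

```python
# Subset scores transcribed from the results table (percent correct).
SUBSET_SCORES = {
    "Model A": {"factual_recall": 78.2, "interpretation": 64.9, "contradiction": 71.0},
    "Model B": {"factual_recall": 76.5, "interpretation": 71.3, "contradiction": 73.6},
    "Model C": {"factual_recall": 84.7, "interpretation": 56.8, "contradiction": 71.2},
}

# Hypothetical workload mixes; replace with the real task distribution.
LOOKUP_HEAVY = {"factual_recall": 0.8, "interpretation": 0.1, "contradiction": 0.1}
INTERPRETATION_HEAVY = {"factual_recall": 0.1, "interpretation": 0.8, "contradiction": 0.1}


def weighted_score(subset_scores, weights):
    """Weighted mean of subset scores, normalised by the total weight."""
    total = sum(weights.values())
    return sum(subset_scores[name] * w for name, w in weights.items()) / total


def rank_models(scores, weights):
    """Model names ordered from best to worst under the given weights."""
    return sorted(scores, key=lambda m: weighted_score(scores[m], weights), reverse=True)


lookup_ranking = rank_models(SUBSET_SCORES, LOOKUP_HEAVY)
interpretation_ranking = rank_models(SUBSET_SCORES, INTERPRETATION_HEAVY)

print("lookup-heavy:", lookup_ranking)
print("interpretation-heavy:", interpretation_ranking)
```

Under these assumed weights the ordering flips between the two profiles, which is the point of reporting subsets separately: the equal-weight aggregate is just one workload mix among many.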
Caveats
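- N = 612 is enough to distinguish aggregate differences of roughly 5 points or more with high confidence, so the 2.9-point aggregate spread is within noise. Each subset has only 204 questions, which raises the detectable difference, but the 14.5-point interpretation spread is still clearly outside noise.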
- The eval is constructed against EU regulatory text in English. Performance against the same regulations in their native-language source documents was not measured in this run.
- The contradiction-detection subset is the hardest to construct ground truth for; some of our held-out items have legitimate ambiguity where reasonable lawyers would disagree. Flagged in the open-questions register.
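The noise claim in the first caveat can be sanity-checked with an unpaired two-proportion z-test. This is a rough sketch under stated assumptions, not the analysis used for this note: it assumes each score is a simple fraction of questions judged correct, takes the per-subset N as 204 (from the even split), and treats the models' answers as independent samples. Because all models answered the same questions, a paired test such as McNemar's on per-question outcomes would be the better choice if those outcomes are available.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """Unpaired two-proportion z statistic using the pooled standard error."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Best vs worst aggregate (Model B vs Model C), N = 612 each.
z_aggregate = two_proportion_z(0.738, 0.709, 612, 612)

# Best vs worst on interpretation (Model B vs Model C), assumed N = 204 each.
z_interpretation = two_proportion_z(0.713, 0.568, 204, 204)

print(f"aggregate:      z = {z_aggregate:.2f}, p = {two_sided_p(z_aggregate):.3f}")
print(f"interpretation: z = {z_interpretation:.2f}, p = {two_sided_p(z_interpretation):.4f}")
```

With these inputs the aggregate gap falls below the conventional 1.96 threshold and the interpretation gap falls well above it, consistent with the caveat.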
Method link
Same source-tier and confidence framework as the rest of our work — see methodology. The eval set itself is not publicly released; the methodology for constructing it is.