Comparative eval of three open-weight 70B models on regulatory text comprehension

We ran three open-weight 70B-class models against a regulatory-text comprehension eval set we built for an unrelated client engagement. The set has 612 questions across NIS2, IEC 62443, and EU AI Act source documents, with three subsets: factual recall ("what does Article 21(2)(d) require?"), interpretation ("does this clause apply to a hosted SaaS arrangement?"), and contradiction detection ("do these two clauses impose conflicting obligations?").

Setup

612 questions, evenly split across the three subsets.
Three open-weight models, all 70B parameter class, evaluated under identical decoding settings (temperature 0, top-p 1.0).
Three judging channels: two LLM judges on every response (from different model families than each other and than the systems under evaluation), plus human-in-the-loop review on a 10% sample, with disagreement-triggered escalation.
Aggregate and per-subset scores reported. Citation accuracy not in scope for this run — the prompts did not request citations.
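One way the judging route described above could be sketched in Python. This is an illustration under stated assumptions, not the harness used for this run: the function names, the hash-based sampling, and the reading of "disagreement-triggered escalation" as "the two LLM judges disagree" are all hypothetical.

```python
import hashlib

HUMAN_SAMPLE_RATE = 0.10  # the 10% human-reviewed sample from the setup


def in_human_sample(question_id: str, rate: float = HUMAN_SAMPLE_RATE) -> bool:
    """Deterministically place roughly `rate` of items in the human sample.

    Hash-based bucketing is an assumption; it keeps the sample stable
    across reruns without storing a separate sample list.
    """
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def route(judge_a_correct: bool, judge_b_correct: bool, sampled: bool) -> str:
    """Decide what happens to one judged response.

    Assumed policy: LLM-judge disagreement always escalates to a human;
    agreeing verdicts are spot-checked only when the item is in the sample.
    """
    if judge_a_correct != judge_b_correct:
        return "escalate"
    if sampled:
        return "human_sample"
    return "accept"


if __name__ == "__main__":
    qid = "nis2-interp-0042"  # hypothetical question ID
    print(route(True, False, in_human_sample(qid)))
```

Keeping the sampling decision separate from the routing decision means the 10% rate can be changed without touching the escalation rule.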

Results

Figure: per-subset performance for the three open-weight 70B models (bar chart).

Model	Aggregate	Factual recall	Interpretation	Contradiction
Model A	71.4%	78.2%	64.9%	71.0%
Model B	73.8%	76.5%	71.3%	73.6%
Model C	70.9%	84.7%	56.8%	71.2%

Aggregate spread: 2.9 points. Per-subset spread: 8.2 / 14.5 / 2.6 points (factual recall / interpretation / contradiction).
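The spreads can be recomputed directly from the table. The sketch below copies the reported scores and assumes they are per-question accuracies, so that with evenly split subsets each aggregate should equal the unweighted mean of its three subset scores. The function names are illustrative.

```python
# Scores copied from the results table (percent correct).
SCORES = {
    "Model A": {"Aggregate": 71.4, "Factual recall": 78.2,
                "Interpretation": 64.9, "Contradiction": 71.0},
    "Model B": {"Aggregate": 73.8, "Factual recall": 76.5,
                "Interpretation": 71.3, "Contradiction": 73.6},
    "Model C": {"Aggregate": 70.9, "Factual recall": 84.7,
                "Interpretation": 56.8, "Contradiction": 71.2},
}


def spread(column: str) -> float:
    """Best-minus-worst score in one column, in percentage points."""
    values = [row[column] for row in SCORES.values()]
    return round(max(values) - min(values), 1)


def subset_mean(model: str) -> float:
    """Unweighted mean of a model's three subset scores."""
    subsets = [v for k, v in SCORES[model].items() if k != "Aggregate"]
    return round(sum(subsets) / len(subsets), 1)


if __name__ == "__main__":
    for column in ("Aggregate", "Factual recall", "Interpretation", "Contradiction"):
        print(f"{column}: {spread(column)} points")
    for model, row in SCORES.items():
        print(f"{model}: reported {row['Aggregate']}, subset mean {subset_mean(model)}")
```

The subset-mean check is a cheap consistency test on the table itself: if an aggregate and its subset mean disagree, either the subsets are not evenly weighted or a number was mistyped.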

What we read into it

If the use case is regulatory factual lookup ("retrieve the right clause and quote it back"), Model C wins clearly: 84.7% on factual recall, 6.5 points ahead of Model A and 8.2 ahead of Model B. If the use case is regulatory interpretation ("does this clause apply here?"), Model C loses by a wider margin: it is the worst of the three on that subset, 8.1 points behind Model A and 14.5 behind Model B. Picking Model C for an interpretation-heavy workflow on the basis of a top aggregate score on a different benchmark would be a mistake.

This is the failure mode that aggregate scores hide: here, a 2.9-point aggregate spread sits on top of a 14.5-point spread in interpretation. Per-subset reporting is what makes the eval useful for the actual deployment decision.

Caveats

N = 612 is enough to distinguish 5+ point differences in the aggregate with high confidence. Each subset has only 204 questions, so per-subset comparisons are noisier. The 2.9-point aggregate spread is within noise; the 14.5-point interpretation spread is not.
The eval is constructed against EU regulatory text in English. Performance against the same regulations in their native-language source documents was not measured in this run.
The contradiction-detection subset is the hardest to construct ground truth for; some of our held-out items have legitimate ambiguity where reasonable lawyers would disagree. Flagged in the open-questions register.
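The noise claim in the first caveat can be sanity-checked with a two-proportion z-test. The sketch below is a rough check under assumptions that the text does not confirm: each question is scored as simply correct or incorrect, and the two models are treated as independent samples. Since all models answered the same questions, a paired test such as McNemar's would be the tighter choice if per-question results are available.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the gap between two independent proportions (unpooled SE)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) / se


def min_detectable_gap(p: float, n: int, z_crit: float = 1.96) -> float:
    """Smallest gap that clears z_crit for two models near accuracy p, n items each."""
    return z_crit * math.sqrt(2 * p * (1 - p) / n)


if __name__ == "__main__":
    # Aggregate: Model B vs Model C, all 612 questions.
    print("aggregate z:", round(two_proportion_z(0.738, 0.709, 612, 612), 2))
    # Interpretation: Model B vs Model C, 204 questions (612 / 3).
    print("interpretation z:", round(two_proportion_z(0.713, 0.568, 204, 204), 2))
    # Gap needed at the 95% level with 612 items near 72% accuracy.
    print("min gap at N=612:", round(100 * min_detectable_gap(0.72, 612), 1), "points")
```

Under these assumptions the aggregate gap falls below the conventional 1.96 threshold and the interpretation gap clears it, and the detectable gap at N = 612 comes out at roughly 5 points.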

Method link

Same source-tier and confidence framework as the rest of our work — see methodology. The eval set itself is not publicly released; the methodology for constructing it is.
