We ran three open-weight 70B-class models against a regulatory-text comprehension eval set we built for an unrelated client engagement. The set has 612 questions across NIS2, IEC 62443, and EU AI Act source documents, with three subsets: factual recall ("what does Article 21(2)(d) require?"), interpretation ("does this clause apply to a hosted SaaS arrangement?"), and contradiction detection ("do these two clauses impose conflicting obligations?").

Setup

  • 612 questions, evenly split across the three subsets.
  • Three open-weight models, all 70B parameter class, evaluated under identical decoding settings (temperature 0, top-p 1.0).
  • Judging: two LLM judges per response (from different model families than each other and than the systems under evaluation), plus human-in-the-loop review of a 10% sample with disagreement-triggered escalation.
  • Aggregate and per-subset scores reported. Citation accuracy not in scope for this run — the prompts did not request citations.
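The judging arrangement could be wired together roughly as follows. This is a sketch under assumptions, not the pipeline we ran: the setup above does not specify how the 10% sample is drawn or how escalated items are resolved, so the hash-based sampling, the rule that a human verdict overrides the LLM judges, and all function names here are hypothetical.

```python
import hashlib

HUMAN_SAMPLE_RATE = 0.10  # 10% rate from the setup; the sampling method is assumed


def in_human_sample(question_id: str, rate: float = HUMAN_SAMPLE_RATE) -> bool:
    """Deterministically place roughly `rate` of the items in the human-review sample."""
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def needs_human(question_id: str, judge_a_correct: bool, judge_b_correct: bool) -> bool:
    """Route to a human when the two LLM judges disagree or the item is sampled."""
    return judge_a_correct != judge_b_correct or in_human_sample(question_id)


def final_verdict(judge_a_correct: bool, judge_b_correct: bool, human_correct=None) -> bool:
    """Assumed rule: a human verdict wins when present; otherwise the judges must agree."""
    if human_correct is not None:
        return human_correct
    if judge_a_correct != judge_b_correct:
        raise ValueError("judges disagree: escalate to human review first")
    return judge_a_correct
```

Hashing the question ID rather than drawing random numbers keeps the human sample stable across reruns, so the same items are reviewed for every model.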

Results

[Figure: bar chart of per-subset performance for three open-weight 70B models]

Model     Aggregate   Factual recall   Interpretation   Contradiction
Model A   71.4%       78.2%            64.9%            71.0%
Model B   73.8%       76.5%            71.3%            73.6%
Model C   70.9%       84.7%            56.8%            71.2%

Aggregate spread: 2.9 points. Per-subset spread (factual recall / interpretation / contradiction): 8.2 / 14.5 / 2.6 points.
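The aggregate and spread figures can be recomputed from the table. A minimal sketch, assuming the aggregate is the unweighted mean of the three subset scores (which follows from the even split) and that "spread" means best score minus worst score across models; the names are illustrative.

```python
# Per-subset accuracies (percent) copied from the results table.
SCORES = {
    "Model A": {"factual_recall": 78.2, "interpretation": 64.9, "contradiction": 71.0},
    "Model B": {"factual_recall": 76.5, "interpretation": 71.3, "contradiction": 73.6},
    "Model C": {"factual_recall": 84.7, "interpretation": 56.8, "contradiction": 71.2},
}
SUBSETS = ("factual_recall", "interpretation", "contradiction")


def aggregate(model_scores):
    """Unweighted mean over subsets; appropriate because the subsets are equal-sized."""
    return sum(model_scores[s] for s in SUBSETS) / len(SUBSETS)


def spread(values):
    """Best-minus-worst gap in percentage points."""
    values = list(values)
    return max(values) - min(values)


aggregates = {m: round(aggregate(s), 1) for m, s in SCORES.items()}
aggregate_spread = round(spread(aggregates.values()), 1)
subset_spreads = {s: round(spread(SCORES[m][s] for m in SCORES), 1) for s in SUBSETS}

print(aggregates)
print(aggregate_spread, subset_spreads)
```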

What we read into it

If the use case is regulatory factual lookup ("retrieve the right clause and quote it back"), Model C leads, 6.5 points ahead of the next-best model (Model A). If the use case is regulatory interpretation ("does this clause apply here?"), Model C loses by a wide margin: it is the worst of the three, 14.5 points behind Model B and 8.1 points behind Model A. Picking Model C for an interpretation-heavy workflow on the basis of a top aggregate score on a different benchmark would be a mistake.

This is the failure mode that aggregate scores hide: averaged over the subsets, the three models sit within 2.9 points of each other. The per-subset reporting is what makes the eval useful for the actual deploy decision.
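One way to make the deploy decision explicit is to weight the per-subset scores by the expected workload mix. The sketch below does that; the two weight profiles are hypothetical illustrations, not measurements of any real workload, and the function names are ours.

```python
# Per-subset accuracies (percent) copied from the results table.
SCORES = {
    "Model A": {"factual_recall": 78.2, "interpretation": 64.9, "contradiction": 71.0},
    "Model B": {"factual_recall": 76.5, "interpretation": 71.3, "contradiction": 73.6},
    "Model C": {"factual_recall": 84.7, "interpretation": 56.8, "contradiction": 71.2},
}

# Hypothetical workload mixes (share of queries per subset).
LOOKUP_HEAVY = {"factual_recall": 0.8, "interpretation": 0.1, "contradiction": 0.1}
INTERPRETATION_HEAVY = {"factual_recall": 0.1, "interpretation": 0.8, "contradiction": 0.1}


def weighted_score(model_scores, weights):
    """Workload-weighted accuracy; weights are normalised so they need not sum to 1."""
    total = sum(weights.values())
    return sum(model_scores[s] * w for s, w in weights.items()) / total


def rank(weights):
    """Models ordered best-first for the given workload mix."""
    return sorted(SCORES, key=lambda m: weighted_score(SCORES[m], weights), reverse=True)


print("lookup-heavy:", rank(LOOKUP_HEAVY))
print("interpretation-heavy:", rank(INTERPRETATION_HEAVY))
```

Under these assumed mixes the ranking flips between the two profiles, which is the point of reporting per subset: the same three models, the same data, and a different recommendation depending on what the deployment actually does.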

Caveats

  • N = 612 is enough to distinguish 5+ point aggregate differences with high confidence; per-subset comparisons rest on 204 questions each and need a larger gap. The 2.9-point aggregate spread is within noise; the 14.5-point interpretation spread is not.
  • The eval is constructed against EU regulatory text in English. Performance against the same regulations in their native-language source documents was not measured in this run.
  • The contradiction-detection subset is the hardest to construct ground truth for; some of our held-out items have legitimate ambiguity where reasonable lawyers would disagree. Flagged in the open-questions register.
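The sample-size caveat can be sanity-checked with a back-of-the-envelope calculation. The sketch below uses a normal-approximation 95% margin for the difference between two accuracies and treats the two models' results as independent samples. That is an assumption made for simplicity: all models answered the same questions, so a paired test on per-item results (for example McNemar's test) would generally be tighter, but it needs item-level data not shown here.

```python
import math


def diff_margin_95(p1: float, p2: float, n: int) -> float:
    """Approximate 95% margin, in percentage points, for the difference between
    two accuracies each measured on n items (independent-samples approximation)."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return 1.96 * se * 100


# Aggregate: best vs worst model over all 612 questions.
agg_margin = diff_margin_95(0.738, 0.709, 612)

# Interpretation subset: best vs worst model over 204 questions (612 / 3).
interp_margin = diff_margin_95(0.713, 0.568, 204)

print(f"aggregate: spread 2.9 vs margin {agg_margin:.1f}")
print(f"interpretation: spread 14.5 vs margin {interp_margin:.1f}")
```

Under this approximation the margin is about 5 points for the full set and about 9 points for a 204-question subset, so the 2.9-point aggregate spread falls inside the margin and the 14.5-point interpretation spread falls well outside it.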

Method link

Same source-tier and confidence framework as the rest of our work — see methodology. The eval set itself is not publicly released; the methodology for constructing it is.