Schema-enforcement comparison across four constrained-generation backends

A short note from a recent comparison. We needed to produce structured outputs (JSON conforming to a non-trivial schema with nested objects and conditional required fields) from an LLM. We benchmarked four constrained-generation approaches on the same model and the same input set.

The schema in question (excerpt):

{
  "type": "object",
  "required": ["finding", "evidence", "confidence"],
  "properties": {
    "finding":    { "type": "string", "minLength": 20 },
    "evidence":   {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["source_id", "tier"],
        "properties": {
          "source_id": { "type": "string", "pattern": "^SRC\\.[0-9]+\\.[0-9]+$" },
          "tier":      { "enum": ["A", "B", "C", "D", "E"] },
          "page":      { "type": "integer", "minimum": 1 }
        }
      },
      "minItems": 1
    },
    "confidence": { "enum": ["HIGH", "MED", "LOW", "UNVERIFIED"] },
    "open_questions": { "type": "array", "items": { "type": "string" } }
  },
  "if":   { "properties": { "confidence": { "const": "UNVERIFIED" } } },
  "then": { "required": ["open_questions"] }
}
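
To make "conforms to the schema" concrete, here is a minimal hand-written sketch of the checks the excerpt implies, using only the Python standard library. It is not the validator used in the comparison and not a general JSON Schema implementation. The names `excerpt_errors`, `evidence_item_errors`, and `is_json_integer` are hypothetical, and only the excerpted rules are covered.

```python
import re

# Hand-written checks mirroring the schema excerpt above (illustrative only).
# fullmatch stands in for the ^...$ anchors of the schema's pattern.
SOURCE_ID = re.compile(r"SRC\.[0-9]+\.[0-9]+")
TIERS = ("A", "B", "C", "D", "E")
CONFIDENCE = ("HIGH", "MED", "LOW", "UNVERIFIED")


def is_json_integer(value):
    # bool is a subclass of int in Python, but JSON true/false are not integers.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) or (isinstance(value, float) and value.is_integer())


def evidence_item_errors(index, item):
    where = f"evidence[{index}]"
    if not isinstance(item, dict):
        return [f"{where} must be an object"]
    errors = [f"{where} missing required field: {key}"
              for key in ("source_id", "tier") if key not in item]
    if "source_id" in item:
        sid = item["source_id"]
        if not (isinstance(sid, str) and SOURCE_ID.fullmatch(sid)):
            errors.append(f"{where}.source_id must look like SRC.<n>.<n>")
    if "tier" in item and item["tier"] not in TIERS:
        errors.append(f"{where}.tier must be one of {TIERS}")
    if "page" in item and not (is_json_integer(item["page"]) and item["page"] >= 1):
        errors.append(f"{where}.page must be an integer >= 1")
    return errors


def excerpt_errors(doc):
    """Return the violations of the excerpt; an empty list means conformant."""
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    errors = [f"missing required field: {key}"
              for key in ("finding", "evidence", "confidence") if key not in doc]
    if "finding" in doc:
        finding = doc["finding"]
        if not isinstance(finding, str) or len(finding) < 20:
            errors.append("finding must be a string of at least 20 characters")
    if "evidence" in doc:
        evidence = doc["evidence"]
        if not isinstance(evidence, list) or len(evidence) < 1:
            errors.append("evidence must be an array with at least one item")
        else:
            for i, item in enumerate(evidence):
                errors.extend(evidence_item_errors(i, item))
    if "confidence" in doc and doc["confidence"] not in CONFIDENCE:
        errors.append(f"confidence must be one of {CONFIDENCE}")
    if "open_questions" in doc:
        questions = doc["open_questions"]
        if not (isinstance(questions, list) and all(isinstance(q, str) for q in questions)):
            errors.append("open_questions must be an array of strings")
    # if/then rule: UNVERIFIED findings must carry open_questions. A missing
    # confidence already fails the required check, so the verdict is unchanged.
    if doc.get("confidence") == "UNVERIFIED" and "open_questions" not in doc:
        errors.append("open_questions is required when confidence is UNVERIFIED")
    return errors
```

An output counts as conformant when `excerpt_errors` returns an empty list. In a real pipeline a full JSON Schema validator would be the safer choice, since it covers the rules this sketch leaves out.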

Setup

1,000 input prompts, each requiring an output matching the full schema: 14 fields, 3 nested objects, and 4 conditional-requirement rules (the excerpt above shows only part of it).
Same base model across all four backends.
Measured: schema-conformance rate (does the output validate against the schema?), semantic correctness rate (is the output the right answer, conditional on conforming?), and tokens-to-completion (efficiency).
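
The three measurements can be sketched as follows. This is an illustration of the definitions above, not our actual harness: the `Run` record and the `summarize` function are hypothetical names, and reading tokens-to-completion as mean tokens per output relative to a baseline backend is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Run:
    conforms: bool  # output validated against the schema
    correct: bool   # judged the right answer (only meaningful if conforms)
    tokens: int     # tokens generated for this output


def summarize(runs, baseline_mean_tokens):
    """Compute the three reported metrics for one backend."""
    if not runs:
        raise ValueError("need at least one run")
    conforming = [r for r in runs if r.conforms]
    conformance = len(conforming) / len(runs)
    # Semantic correctness is conditional on conformance, so the
    # denominator is the conforming outputs only, not all outputs.
    if conforming:
        semantic = sum(r.correct for r in conforming) / len(conforming)
    else:
        semantic = float("nan")
    mean_tokens = sum(r.tokens for r in runs) / len(runs)
    return {
        "conformance": conformance,
        "semantic_correctness": semantic,
        # Assumed definition: relative change in mean tokens vs. the baseline.
        "token_delta": mean_tokens / baseline_mean_tokens - 1.0,
    }
```

Because semantic correctness is computed over conforming outputs only, a backend with low conformance is scored on a smaller, possibly easier subset of the inputs.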

Results

Backend	Conformance	Semantic correctness	Tokens to completion (vs. backend 1)
1. Native function-calling (provider feature)	99.6%	91.2%	baseline
2. Grammar-constrained decoding (open-source)	100.0%	90.4%	+18%
3. Constrained sampling via JSON schema	99.8%	91.0%	+6%
4. Prompt-only ("respond in JSON conforming to this schema")	86.4%	89.7%	-2%

What we read into it

The first three are effectively interchangeable on this schema: conformance spans 99.6% to 100.0% and semantic correctness 90.4% to 91.2%. Provider-native function-calling is the cheapest of the three in tokens (it is the baseline), which we attribute to it needing no extra decoder pass. Grammar-constrained decoding is the only backend that reached 100.0% conformance, but it used 18% more tokens. Schema-constrained sampling sits in the middle at +6%.
Prompt-only is not viable for this use case at this scale. A 13.6% non-conformance rate, even with otherwise-good semantic quality, means roughly one in seven outputs fails validation and needs to be retried, which more than cancels its 2% token saving over native function-calling.
Semantic correctness is largely independent of the constraint backend: it ranges from 89.7% to 91.2% across all four. The model is doing the same job regardless of how the schema is enforced.
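
The retry argument against prompt-only can be made concrete with a little arithmetic. The sketch below assumes a retry-until-valid policy with independent attempts that each cost the same number of tokens. We did not measure retries, so this is a back-of-the-envelope model. The names `BACKENDS`, `expected_cost`, and `COSTS` are illustrative.

```python
# Figures from the results table: (conformance rate, tokens relative to backend 1).
BACKENDS = {
    "native function-calling": (0.996, 1.00),
    "grammar-constrained": (1.000, 1.18),
    "schema-constrained sampling": (0.998, 1.06),
    "prompt-only": (0.864, 0.98),
}


def expected_cost(conformance, relative_tokens):
    """Expected tokens per conforming output, in units of one backend-1 attempt.

    Assumes retry-until-valid with independent attempts, so the number of
    attempts is geometric with mean 1 / conformance.
    """
    if not 0.0 < conformance <= 1.0:
        raise ValueError("conformance must be in (0, 1]")
    return relative_tokens / conformance


COSTS = {name: expected_cost(p, t) for name, (p, t) in BACKENDS.items()}

for name, cost in COSTS.items():
    print(f"{name:28s} {cost:.3f}")
```

Under these assumptions prompt-only comes out at roughly 1.13 baseline attempts per valid output, against about 1.00 for native function-calling, so its 2% saving turns into a cost of about 13%. On tokens alone it still lands below the 1.18 of grammar-constrained decoding, but retries run one after another, so the wall-clock penalty is likely larger than the token figure suggests.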

What this changes for our work

For client engagements where the deliverable is a structured generation pipeline, we now recommend provider-native function-calling by default where it is available, with grammar-constrained decoding as the open-source fallback. We do not recommend prompt-only schema enforcement for any use case where downstream consumers depend on conformance, which is most of them.

Caveats

One model, one schema. We have not tested whether the conformance ranking holds across model families or across more complex schemas. Anecdotal results from earlier work suggest grammar-constrained decoding pulls further ahead on conformance as schemas become more deeply nested, but we have not benchmarked this rigorously.
Tokens-to-completion is a rough efficiency proxy; the actual latency picture depends on provider implementation choices outside our measurement.
