Schema-Valid LLM Output Still Gets 20% of Values Wrong

97% JSON Pass, 20% Wrong Values

The Structured Output Benchmark (SOB), published in April 2026, evaluated 21 frontier LLMs on schema-constrained extraction tasks. The result that should stop you from shipping: every model clears 97%+ on JSON Pass Rate, but Value Accuracy, which measures whether the extracted values are actually correct, sits 15 to 30 percentage points lower. GPT-5.4, the top performer, scores 99.3% on JSON Pass and 79.8% on Value Accuracy (Interfaze SOB Benchmark). That 19.5-point gap is the space where your production pipeline silently corrupts data while your monitoring dashboards show green, and most teams lack the observability to catch these failures.

If your reliability strategy for LLM output ends at schema validation, you are checking that the answer is well-formatted while ignoring whether it is correct. This is not a theoretical concern. A financial services team documented by Rotascale achieved 100% schema compliance over six months of pipeline hardening — then discovered their credit risk agent was approving loans that should have been declined, because the model filled required fields with plausible but fabricated values (Rotascale).

The Compliance Trap

Structured output in LLMs has evolved rapidly through four phases. In 2023, teams parsed raw text with regex and accepted 15-30% failure rates. JSON mode in early 2024 pushed format failures near zero. Constrained decoding — where a logit processor restricts token generation at each step to only schema-valid tokens — brought format reliability to 99.9%+ by late 2024. By 2025, most teams declared the problem solved and moved on (Rotascale).

The progression looks like this: raw text → valid JSON → schema compliant → type safe → semantically correct. Constrained decoding takes you as far as type safe. The final step, semantic correctness, remains untouched. Every major provider (OpenAI, Anthropic, Google Gemini, Mistral) now supports native structured output via constrained decoding. All guarantee the same thing: syntactic validity and schema compliance. None guarantee that the values inside that schema are factually grounded (Collin Wilkins).

An empirical study of 288 structured output calls across every major LLM provider catalogued eight distinct failure modes in JSON generation: markdown fence wrapping, trailing commas, wrong boolean/null literals (Python’s True vs JSON’s true), inline comments, unescaped quotes, truncated objects, ellipsis placeholders, and encoding issues. JSON mode and constrained decoding eliminate most of these. But as the study’s author notes: “JSON mode moves you from ‘output is sometimes syntactically broken’ to ‘output is always syntactically valid but sometimes structurally wrong or incomplete,’ which is real progress but not a complete solution” (The Crosswalk).
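For pipelines that still receive free-form text, the most common of those failure modes (markdown fence wrapping) can be unwrapped before parsing. The following is a minimal sketch, not the study's method: the function name `parse_model_json` and the choice to report rather than repair other errors are assumptions made for illustration.

```python
import json
import re

# Build the fence marker programmatically so this example stays readable
# when it is itself shown inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
    re.DOTALL,
)

def parse_model_json(raw: str):
    """Parse model output as JSON, unwrapping a Markdown fence if present.

    Returns (value, None) on success or (None, reason) on failure.
    """
    match = FENCE_RE.match(raw)
    text = match.group(1) if match else raw.strip()
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        # Trailing commas, Python literals, truncation and ellipsis
        # placeholders all land here. Report them instead of guessing.
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"
```

Returning a reason string instead of attempting a repair keeps truncated or placeholder-filled objects from being silently patched into something that parses.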

Three Silent Failure Modes

The dangerous failures are not the ones that throw parse errors. They are the ones that pass every validation check you have built.

Confident Hallucination in Required Fields

When a JSON schema marks a field as required, the model must produce a value. If it lacks sufficient information in context, it does not return null or flag uncertainty — it generates a plausible wrong value. OpenAI’s own community forums document cases where strict: true with function calling produces schema-compliant output that semantically diverges from what was requested (OpenAI Community). The confidence score in the output? Not correlated with accuracy. A risk_score of 72 with confidence 0.94 can be entirely fabricated.
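One way to make fabrication visible is to let required fields carry an explicit null plus a verbatim evidence quote, then check the quote against the source. This is a sketch under assumptions: the `{"value": ..., "evidence": ...}` field shape and the function name `ungrounded_fields` are hypothetical, and a substring match is a crude proxy for grounding.

```python
def ungrounded_fields(extraction: dict, source_text: str) -> list[str]:
    """Return names of fields whose evidence quote is missing from the source.

    Each field is assumed to look like
    {"value": ..., "evidence": "<verbatim quote>"}, with value None
    when the model found nothing in context.
    """
    # Collapse whitespace and case so line wrapping does not cause misses.
    normalized_source = " ".join(source_text.split()).lower()
    flagged = []
    for name, field in extraction.items():
        if field["value"] is None:
            continue  # an explicit "not found" is acceptable
        quote = " ".join((field.get("evidence") or "").split()).lower()
        if not quote or quote not in normalized_source:
            flagged.append(name)
    return flagged
```

A flagged field is not proof of hallucination, only a signal that the value needs review before it triggers anything downstream.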

Schema-Shaped Drift

Model updates shift how schemas are interpreted without changing the schema itself. For example, after an update the model may assign risk_category: "moderate" to cases it previously classified as "high". The enum values have not changed. The distribution of values has. Your monitoring checks output validity — whether “moderate” is a valid enum — not whether “moderate” still means what it meant last month. This is semantic drift wearing a syntactically valid disguise (Rotascale).

Adversarial Schema Compliance

Prompt injection attacks do not need to break your schema. They just need to influence values within it, a pattern that compounds the prompt injection risks already present in LLM deployments. An attacker who understands your schema can craft inputs that steer toward specific schema-compliant outputs: the JSON passes every validation check while serving the attacker’s intent, not yours. This is especially dangerous in financial services, insurance claims, and any pipeline where schema-compliant output triggers automated actions (Rotascale).

Provider Implementation Gaps

Each provider’s structured output implementation has specific gotchas that affect reliability in production.

Provider | Mechanism | Key Limitation
OpenAI | response_format: { type: "json_schema" } with constrained decoding | Refusals return a refusal object, not schema-compliant JSON; retry loops hang if this is not handled, and retry amplification can cascade into broader reliability problems. Optional fields need a union with null in strict mode. Assistants API sunsets August 2026.
Anthropic Claude | output_config.format with native JSON schema | No recursive schemas, no min/max constraints, no string length limits. Limit of 20 strict tools per request. First request per schema pays 10-30s compile latency. Required properties are output first, so put reasoning fields before answer fields or the model commits early.
Google Gemini | response_mime_type + response_json_schema | Unsupported JSON Schema keywords are silently ignored: no error, no warning. The model can return JSON that does not match your schema, and you will not know unless you test the actual schema end-to-end.
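The OpenAI refusal case can be handled by checking for the refusal before parsing and by keeping retries bounded. The sketch below works on a plain dict with "refusal" and "content" keys; that shape, the `send` callable, and the names `ModelRefusal` and `call_with_retries` are assumptions for illustration, so adapt them to whatever your client library actually returns.

```python
import json

class ModelRefusal(Exception):
    """Raised when the model declines instead of returning structured output."""

def extract_structured(message: dict) -> dict:
    """Parse a message dict, surfacing a refusal as its own error type."""
    refusal = message.get("refusal")
    if refusal:
        # Retrying an identical request is unlikely to help, so fail fast.
        raise ModelRefusal(refusal)
    return json.loads(message["content"])

def call_with_retries(send, max_attempts: int = 3) -> dict:
    """Call send() up to max_attempts times, retrying only on malformed JSON."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return extract_structured(send())
        except json.JSONDecodeError as exc:
            last_error = exc  # malformed output: worth a bounded retry
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Because `ModelRefusal` is not caught inside the loop, a refusal exits after one call instead of consuming the whole retry budget.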

Anthropic’s latency trap deserves particular attention. The first request with a new schema incurs a 10-30 second compile delay while the grammar is built. There is a 24-hour grammar cache, but any schema edit invalidates it. In a CI/CD pipeline that deploys schema changes, this means the first request after every deploy is an outlier. If your SLA targets assume sub-second latency, you need a warm-up step (Collin Wilkins).
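A warm-up step can be limited to deploys where the schema actually changed by fingerprinting the schema. This is a sketch only: `send_request` stands in for your own client call, and where the previous fingerprint is stored between deploys is left to your pipeline.

```python
import hashlib
import json
import time

def schema_fingerprint(schema: dict) -> str:
    """Stable hash of a schema, independent of key order."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def warm_up_if_changed(schema: dict, last_fingerprint, send_request):
    """Send one throwaway request when the schema differs from the last deploy.

    Returns (fingerprint, seconds_spent); seconds_spent is None if skipped.
    """
    fingerprint = schema_fingerprint(schema)
    if fingerprint == last_fingerprint:
        return fingerprint, None
    started = time.monotonic()
    send_request(schema)  # the response is discarded; only the compile matters
    return fingerprint, time.monotonic() - started
```

Logging the returned duration gives you a record of the compile delay that is separate from your user-facing latency metrics.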

Google’s silent keyword ignoring is arguably the most dangerous behavior. If you specify minimum/maximum constraints on numeric fields and those keywords are not supported for your model version, Gemini accepts the schema without error and generates output that violates those bounds. You discover this in production, not in testing — unless you test with your actual production schema, not a simplified version (Collin Wilkins).
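Because a provider may accept a schema while ignoring some of its keywords, it helps to re-check those keywords yourself on every response. The sketch below covers only numeric minimum and maximum on nested objects and arrays; the function name is hypothetical, and a full JSON Schema validator would cover far more.

```python
def bound_violations(value, schema: dict, path: str = "$") -> list[str]:
    """Recursively check numeric minimum/maximum keywords against a value."""
    problems = []
    if isinstance(value, dict):
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems += bound_violations(value[key], sub_schema, f"{path}.{key}")
    elif isinstance(value, list):
        # Assumes "items" is a single schema object, not a tuple form.
        item_schema = schema.get("items", {})
        for i, item in enumerate(value):
            problems += bound_violations(item, item_schema, f"{path}[{i}]")
    elif isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in schema and value < schema["minimum"]:
            problems.append(f"{path}: {value} < minimum {schema['minimum']}")
        if "maximum" in schema and value > schema["maximum"]:
            problems.append(f"{path}: {value} > maximum {schema['maximum']}")
    return problems
```

Running this against responses produced with your real production schema is what turns a silently ignored keyword into a visible test failure.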

Tighter Schemas Won’t Fix This

The common reflex is to add more constraints: tighter enums, conditional required fields, co-occurrence rules, cross-field validation, custom validators. Teams build schemas with 200+ constraints. These become maintenance nightmares while the fundamental problem — the model generating factually incorrect values inside valid structures — remains.

Schema complexity grows linearly. The semantic space you are trying to constrain grows combinatorially. Adding more schema constraints for semantic reliability is like adding spell-check rules to catch factual errors. You can make the spell-checker arbitrarily sophisticated — it will still never tell you that a correctly spelled sentence is factually wrong (Rotascale).

The SOB benchmark quantifies this precisely. In its leaderboard, structural metrics (JSON Pass, Path Recall, Structure Coverage, Type Safety) cluster near ceiling for every model. Value Accuracy and Perfect Response are the metrics that separate them. Even GPT-5.4, the overall leader, achieves only 46.9% Perfect Response — meaning that in more than half of extractions, at least one leaf value is wrong. The gap between structural compliance and value correctness is the space where production failures live (Interfaze SOB Benchmark).
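To see why Perfect Response falls so far below Value Accuracy, it helps to compute both on your own labeled data. The sketch below uses exact match on leaf values; it is an approximation written for illustration, not the benchmark's scoring code, and the function names are hypothetical.

```python
def leaves(obj, path=""):
    """Yield (path, value) for every leaf in a nested JSON-like object."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from leaves(val, f"{path}.{key}" if path else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from leaves(val, f"{path}[{i}]")
    else:
        yield path, obj

def score(predictions, references):
    """Return (value_accuracy, perfect_response_rate) over paired records.

    Assumes at least one reference record with at least one leaf.
    """
    correct = total = perfect = 0
    for pred, ref in zip(predictions, references):
        pred_leaves = dict(leaves(pred))
        ref_leaves = dict(leaves(ref))
        hits = sum(
            1 for p, v in ref_leaves.items()
            if p in pred_leaves and pred_leaves[p] == v
        )
        correct += hits
        total += len(ref_leaves)
        perfect += hits == len(ref_leaves)  # every leaf must match
    return correct / total, perfect / len(references)
```

A single wrong leaf costs one point of Value Accuracy but the whole record's Perfect Response, which is why the second number degrades quickly as schemas grow.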

What Semantic Reliability Requires

Production reliability for LLM structured output needs four capabilities above the schema layer.

Reasoning capture. Know why the model produced each value, not just what it produced. Log the chain of evidence the model used. If a downstream consumer disputes a value, you need to trace whether the model had sufficient context or fabricated the value.

Semantic evaluation suites. Before any model touches production, evaluate it against test cases that check value correctness, not just schema compliance. Pull 1,000 recent schema-valid outputs and manually evaluate semantic accuracy. If your test suite only asserts that risk_score is a number between 0 and 100, you are testing schema compliance. You need tests that assert the score is within a reasonable range of what a human reviewer would assign given the same input.
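A semantic test of this kind compares the model's value with a human reviewer's value for the same input. The sketch below assumes hypothetical test cases shaped like {"id", "input", "reviewer_score"}, an `extract` callable wrapping your pipeline, and a tolerance of 10 points chosen purely for illustration.

```python
def within_tolerance(model_score: float, reviewer_score: float,
                     tolerance: float = 10.0) -> bool:
    """True when the model's score is close enough to the reviewer's."""
    return abs(model_score - reviewer_score) <= tolerance

def semantic_failures(cases, extract, tolerance: float = 10.0) -> list:
    """Return ids of cases where risk_score strays from the reviewer's score."""
    failures = []
    for case in cases:
        output = extract(case["input"])
        if not within_tolerance(output["risk_score"], case["reviewer_score"], tolerance):
            failures.append(case["id"])
    return failures
```

Note that a score of 72 passes the schema check in both a passing and a failing case here; only the comparison against the reviewer distinguishes them.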

Runtime monitoring on distributions. Track the distribution of output values over time, not just individual validity. If risk_category: "moderate" jumps from 30% of outputs to 55% after a model update, that is a signal — even though every individual output is schema-valid. Set up alerts on distribution shifts, not just on parse errors.
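A distribution alert for an enum field can be as simple as comparing label frequencies between a baseline window and the current window. The sketch below uses total variation distance; the 0.1 threshold and the function names are assumptions, and the right threshold depends on your traffic volume and natural variance.

```python
from collections import Counter

def distribution(values) -> dict:
    """Map each label to its share of the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def total_variation(baseline: dict, current: dict) -> float:
    """Half the sum of absolute differences between two distributions."""
    labels = set(baseline) | set(current)
    return 0.5 * sum(
        abs(baseline.get(label, 0.0) - current.get(label, 0.0)) for label in labels
    )

def drift_alert(baseline_values, current_values, threshold: float = 0.1):
    """Return (alert, distance) for two windows of enum outputs."""
    distance = total_variation(
        distribution(baseline_values), distribution(current_values)
    )
    return distance > threshold, distance
```

With "moderate" moving from 30% to 55% of outputs, the distance is at least 0.25, well above the illustrative threshold, even though every individual output remains schema-valid.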

Policy enforcement. Business rules that cannot be expressed in JSON Schema — “decline if applicant has more than 2 defaults in the last 12 months” — must be enforced in code after schema validation, not delegated to the model. The model’s job is extraction. Your code’s job is policy.
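The example rule above can be written as a few lines of deterministic code that run after schema validation. This is a sketch under assumptions: the extracted record is assumed to contain a hypothetical "default_dates" list of ISO dates, and 12 months is approximated as 365 days.

```python
from datetime import date, timedelta

MAX_RECENT_DEFAULTS = 2
LOOKBACK = timedelta(days=365)  # approximation of "the last 12 months"

def decide(extracted: dict, today: date) -> str:
    """Apply the decline rule to extracted data.

    The model only extracts the dates; it never sees or applies this rule.
    """
    cutoff = today - LOOKBACK
    recent = [
        d for d in extracted["default_dates"]
        if date.fromisoformat(d) >= cutoff
    ]
    if len(recent) > MAX_RECENT_DEFAULTS:
        return "decline"
    return "continue_review"
```

Keeping the threshold in a named constant means a policy change is a reviewed code change, not a prompt edit.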

The Model-Size Trap

One of the more surprising findings from the SOB benchmark: model size is not a predictor of structured output accuracy. Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy. On image-based extraction, Gemma-4-31B (ranked 11th on text) takes first place, while GPT-5.4 (ranked 1st overall) drops to 9th. No single model wins across all three modalities (text, image, audio) (Interfaze SOB Benchmark).

The engineering implication: if you are routing structured extraction tasks to a single model regardless of input modality, you are leaving accuracy on the table. The model that produces the most valid JSON is not necessarily the model that produces the most correct values. And the model that performs best on text extraction may be mediocre on document images.
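Routing by modality can be derived directly from your own evaluation results rather than hard-coded. In the sketch below the model names and scores are placeholders, not benchmark figures, and `build_routes` and `pick_model` are hypothetical helpers.

```python
def build_routes(scores: dict) -> dict:
    """Pick the best model per modality.

    scores maps modality -> {model_name: value_accuracy} from your own evals.
    """
    return {
        modality: max(by_model, key=by_model.get)
        for modality, by_model in scores.items()
    }

def pick_model(routes: dict, modality: str, default: str) -> str:
    """Return the routed model, falling back for modalities never evaluated."""
    return routes.get(modality, default)

# Placeholder numbers for illustration only.
example_scores = {
    "text": {"model-a": 0.80, "model-b": 0.70},
    "image": {"model-a": 0.60, "model-b": 0.75},
}
example_routes = build_routes(example_scores)
```

Rebuilding the route table whenever the eval suite runs keeps routing aligned with measured Value Accuracy instead of with leaderboard reputation.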

The JSONSchemaBench evaluation framework, which pairs 10,000 real-world JSON schemas with the official JSON Schema Test Suite, confirms that constrained decoding frameworks vary significantly in their coverage of diverse constraint types. XGrammar, Guidance, Outlines, and provider-native implementations each have different failure profiles — efficiency, coverage, and output quality trade off differently (JSONSchemaBench, OpenReview).

Engineering Checklist

Based on the evidence from these benchmarks and production post-mortems, here is what a production structured output pipeline needs beyond schema validation:

  1. Separate schema validation from semantic validation. Pydantic or Zod for schema. Custom validators for business logic. Never conflate the two.
  2. Log reasoning traces. If your provider supports it, capture the model’s chain-of-thought alongside the structured output. This is your audit trail.
  3. Monitor value distributions, not just validity rates. Set alerts on statistical shifts in output distributions. A sudden change in enum distribution is a stronger signal than a parse error.
  4. Run semantic eval suites in CI. Before deploying model or prompt changes, run against a held-out set of inputs with ground-truth values. Fail the deploy if Value Accuracy drops.
  5. Test your actual schema. Especially with Gemini — unsupported keywords are silently ignored. Do not discover this in production.
  6. Handle provider-specific edge cases. OpenAI refusal objects, Anthropic’s first-request latency, Gemini’s keyword silencing — each provider has failure modes that only manifest at scale.
  7. Warm up new schemas. Send a throwaway request after schema changes to prime Anthropic’s grammar cache. Include this in your deploy pipeline.
  8. Enforce policy in code, not in prompts. Business rules that protect against real harm — credit decisions, medical triage, legal classifications — must survive model failures. Code is deterministic. Prompts are not.
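Item 4 of the checklist can be wired into CI as a small gate that compares the current eval run with a stored baseline. This is a sketch: the function name `gate`, the 0.01 allowed drop, and how the two accuracy numbers are produced and stored are all assumptions.

```python
def gate(current_accuracy: float, baseline_accuracy: float,
         max_drop: float = 0.01) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    drop = baseline_accuracy - current_accuracy
    if drop > max_drop:
        print(
            f"Value Accuracy fell by {drop:.3f} "
            f"(limit {max_drop:.3f}); blocking deploy"
        )
        return 1
    return 0
```

In a CI script, passing the result to `sys.exit` is enough for most runners to mark the job as failed.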

References