Structured outputs create false confidence

Posted by gmays 12/21/2025

Structured outputs create false confidence (boundaryml.com)

155 points | 66 comments

simonw 12/21/2025|

I'm not 100% convinced by this post. I'd like to see a more extensive formal eval that demonstrates that structured outputs from different providers reduces the quality of data extraction results.

Assuming this holds up, I wonder if a good workaround for this problem - the problem that turning on structured outputs makes errors more likely - would be to do this:

1. Prompt the LLM "extract numbers from this receipt, return data in this JSON format: ..." - without using the structured output mechanism.

2. If the returned JSON does indeed fit the schema then great, you're finished! But if it doesn't...

3. Round-trip the response from the previous call through the LLM again, this time with structured outputs configured. This should give you back the higher quality extracted data in the exact format you want.
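A minimal sketch of the three steps above, not a tested recipe. The two callables `call_freeform` and `call_structured` are hypothetical stand-ins for whatever client you use (one plain call, one with structured outputs enabled), and `REQUIRED_FIELDS` is an invented, simplified schema check.

```python
import json
from typing import Any, Callable

# Hypothetical required keys and types for the receipt example; not from the post.
REQUIRED_FIELDS = {"establishment_name": str, "total": (int, float), "items": list}


def fits_schema(text: str) -> bool:
    """Return True if text is a JSON object with the required keys and types."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected)
        for key, expected in REQUIRED_FIELDS.items()
    )


def extract(
    receipt_text: str,
    call_freeform: Callable[[str], str],
    call_structured: Callable[[str], str],
) -> dict[str, Any]:
    # Step 1: plain prompt, no structured-output mechanism.
    raw = call_freeform(f"Extract the receipt as JSON: {receipt_text}")
    # Step 2: accept the answer if it already fits the schema.
    if fits_schema(raw):
        return json.loads(raw)
    # Step 3: round-trip the free-form answer through a constrained call.
    return json.loads(call_structured(f"Reformat as schema-valid JSON: {raw}"))
```

The constrained call only runs on the fallback path, so the extra cost is paid only when the free-form answer fails validation.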

hellovai 12/21/2025||

(one of the creators of BAML here) yep! exactly!

that workaround we've found works quite well, but the problem is that it's not sufficient to just retry in the case of failed schema matches (it's both inefficient and also imo incorrect).

Take these two scenarios for example:

Scenario 1. My system is designed to output receipts, but the user does something malicious and gives me an invoice. During step 2, it fails to fit the schema, but then you try step 3, and now you have a receipt! It's close, but your business logic is not expecting that. When schema alignment fails, it's usually because the schema was ambiguous or the input was not valid.

Scenario 2. I ask the LLM to produce this schema:

    class Person {
      name string
      past_jobs string[]
    }

However, the person has only ever worked at one job, so the LLM outputs: { "name": "Vaibhav", "past_jobs": "Google" }. Since you know you expect an array, you could just transform the string -> string[].

That's the algorithm we created: schema-aligned parsing. More here if you're interested: https://boundaryml.com/blog/schema-aligned-parsing
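To make Scenario 2 concrete, here is a small sketch of that single string -> string[] coercion. It is not BAML's schema-aligned parsing algorithm, only an illustration of the one transformation described above; `PERSON_SCHEMA` and `align_to_schema` are made-up names.

```python
import json
from typing import Any

# Hypothetical schema description: field name -> "string" or "string[]".
PERSON_SCHEMA = {"name": "string", "past_jobs": "string[]"}


def align_to_schema(raw_json: str, schema: dict[str, str]) -> dict[str, Any]:
    """Coerce a lone string into a one-element list where the schema expects string[]."""
    data = json.loads(raw_json)
    aligned: dict[str, Any] = {}
    for field, kind in schema.items():
        value = data.get(field)
        if kind == "string[]" and isinstance(value, str):
            value = [value]  # "Google" -> ["Google"]
        aligned[field] = value
    return aligned


person = align_to_schema('{"name": "Vaibhav", "past_jobs": "Google"}', PERSON_SCHEMA)
```

A coercion like this is cheap and deterministic, which is the argument for trying it before spending another model call on a retry.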

Benchmark wise, when we tested last, it seems to help on top of every model (especially the smaller ones) https://www.reddit.com/r/LocalLLaMA/comments/1esd9xc/beating...

Hope this helps with some of the ambiguities in the post :)

joatmon-snoo 12/21/2025||

(author here) To be more specific, here's a benchmark that we ran last year, where we compared schema-aligned parsing against constrained decoding (then called "Function Calling (Strict)", the orange ƒ): https://boundaryml.com/blog/sota-function-calling

skybrian 12/21/2025||

I wonder what it would look like if you redid the benchmarks, testing against models that have reasoning effort set to various values. Maybe structured output is only worse if the model isn't allowed to do reasoning first?

ramraj07 12/21/2025|||

Isn't it better to put it in an agent loop, with the structured output json just specified as a tool? The function call can then just return a summary of the parsed input. We can add in the system prompt a validation step to ask the llm to verify it has provided inputs correctly. This will allow the llm itself to self reflect and correct if needed.
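A rough sketch of the loop described above, assuming a hypothetical `model` callable that takes the transcript so far and returns proposed tool arguments. The tool name `submit_receipt` and its checks are invented for illustration and do not correspond to any particular framework's API.

```python
from typing import Any, Callable, Optional


def submit_receipt(args: dict[str, Any]) -> str:
    """Hypothetical tool: validate the arguments and return a summary or an error."""
    if not isinstance(args.get("total"), (int, float)):
        return "error: 'total' must be a number"
    if not isinstance(args.get("items"), list):
        return "error: 'items' must be a list"
    return f"ok: {len(args['items'])} items, total {args['total']}"


def run_agent(
    model: Callable[[list[str]], dict[str, Any]],
    max_turns: int = 3,
) -> Optional[dict[str, Any]]:
    """Let the model call the tool until the tool accepts its arguments."""
    transcript: list[str] = ["Call submit_receipt, then verify the returned summary."]
    for _ in range(max_turns):
        args = model(transcript)       # model proposes tool arguments
        result = submit_receipt(args)  # tool result is fed back to the model
        transcript.append(result)
        if result.startswith("ok"):
            return args
    return None  # give up after max_turns failed attempts
```

The summary string is what gives the model a chance to notice a mistake; whether it actually does so is something to measure in evals, not assume.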

kemiller 12/21/2025||

That is more or less what BAML does

refulgentis 12/21/2025||

I understand this but A) then they should have done it here B) the idea that you can't get CoT x JSON without sacrificing JSON formatting is flat out wrong with ~any 2025 model. (i.e. reasoning models and their APIs specifically enable this)

supermdguy 12/21/2025||

If your output schema doesn’t capture all correct outputs, that’s a problem with your schema, not the LLM. A human using a data entry tool would run into the same issue. Letting the LLM output whatever it wants just makes it so you have to deal with ambiguities manually, instead of teaching the LLM what to do.

I usually start by adding an error type that will be overused by the LLM, and use that to gain visibility into the types of ambiguities that come up in real-world data. Then over time you can build a more correct schema and better prompts that help the LLM deal with ambiguities the way you want it to.

Also, a lot of the chain of thought issues are solved by using a reasoning model (which allows chain of thought that isn’t included in the output) or by using an agentic loop with a tool call to return output.
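The error-type idea above could be sketched like this, using stdlib dataclasses instead of Pydantic to stay dependency-free. The `ExtractionError` type, the `"error"` payload key, and the reason strings are all hypothetical conventions, not something from the article.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Receipt:
    establishment_name: str
    total: float


@dataclass
class ExtractionError:
    # Free-text reason so recurring ambiguities become visible in logs.
    reason: str


ExtractionResult = Union[Receipt, ExtractionError]


def parse_result(payload: dict) -> ExtractionResult:
    """Map a model payload onto the union; the 'error' key is a hypothetical convention."""
    if "error" in payload:
        return ExtractionError(reason=str(payload["error"]))
    return Receipt(
        establishment_name=payload["establishment_name"],
        total=float(payload["total"]),
    )


def error_reasons(results: list[ExtractionResult]) -> dict[str, int]:
    """Count error reasons to see which ambiguities the schema should learn to handle."""
    counts: dict[str, int] = {}
    for result in results:
        if isinstance(result, ExtractionError):
            counts[result.reason] = counts.get(result.reason, 0) + 1
    return counts
```

Reviewing the counts over real traffic is what tells you which cases deserve their own field or enum value in the next schema revision.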

dhruvbird 12/21/2025|

This ^^^^

While the provided schema has a "quantity" field, it doesn't mention the units.

    from pydantic import BaseModel, Field

    class Item(BaseModel):
        name: str
        price: float = Field(description="per-unit item price")
        quantity: float = Field(default=1, description="If not specified, assume 1")

    class Receipt(BaseModel):
        establishment_name: str
        date: str = Field(description="YYYY-MM-DD")
        total: float = Field(description="The total amount of the receipt")
        currency: str = Field(description="The currency used for everything on the receipt")
        items: list[Item] = Field(description="The items on the receipt")

There needs to be a better evaluation and a better provided schema that captures the full details of what is expected to be captured.

> What kind of error should it return if there's no total listed on the receipt? Should it even return an error or is it OK for it to return total = null?

Additionally, the schema allows optional fields, so the LLM is free to skip missing fields if they are specified as such.
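One way the two points above (units and optional fields) might be expressed, sketched with stdlib dataclasses rather than the article's Pydantic models. The `quantity_unit` field and the sample values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    name: str
    price: float                 # per-unit item price
    quantity: float = 1.0        # if not specified, assume 1
    quantity_unit: str = "each"  # hypothetical field, e.g. "each", "lb", "kg"


@dataclass
class Receipt:
    establishment_name: str
    currency: str
    items: list[Item] = field(default_factory=list)
    total: Optional[float] = None  # None when no total is printed on the receipt


# Illustrative values only; the price is not taken from the article.
bananas = Item(name="Bananas", price=1.0, quantity=0.4, quantity_unit="lb")
receipt = Receipt(establishment_name="Shop", currency="USD", items=[bananas])
```

With an explicit unit, a quantity of 0.4 is no longer ambiguous between "0.4 bananas" and "0.4 lb of bananas".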

throw-qqqqq 12/21/2025||

Interesting! .TXT has the opposite conclusion, that structured output improves performance:

https://blog.dottxt.ai/say-what-you-mean.html

https://blog.dottxt.ai/prompt-efficiency.html

This also matches my own experiences.

Der_Einzige 12/21/2025||

Yup. I instantly linked these because the multiple papers that claim structured outputs harm quality are not just wrong, but fatally damaging to the whole AI ecosystem, especially AI agents.

There are places where structured outputs harm creativity, but usually that's a decoding-time problem which is similarly solved with better sampling, like they talk about in this paper: https://arxiv.org/abs/2410.01103

Claims of harmed reasoning performance are really evidence that 1. Your structured generation backend is bad or 2. Some shenanigans/interactions with temperature/samplers (this is the most common by far) or 3. You are bad at benchmarking.

flagos10 12/21/2025||

Same for me. Using structured output was much better than without.

kgeist 12/21/2025||

Just a week ago, I rewrote our RAG pipeline to use structured outputs, and the tests showed no significant difference in quality after a few tweaks (under vLLM). What helped was that we have a pipeline where another LLM automatically scores 'question-expected answer' pairs, so what we did was: tweak the schema/prompt => evaluate => tweak again, until we got good results in most cases, just like with free-form prompts.

Several issues were found:

1. A model may sometimes get stuck generating whitespace at the end forever (the JSON schema allows it), which can lock up the entire vLLM instance. The solution was to use xgrammar, because it has a handy feature that disallows whitespace outside of strings.

2. In some cases I had to fiddle with metainformation like minItems/maxItems for arrays, or the model would either hallucinate or refuse to generate anything.

3. Inference engines may reorder the fields during generation, which can impact the quality due to the autoregressive nature of LLMs (like, the "calculation" field must come before the "result" field). Make sure the fields are not reordered.

4. Field names must be as descriptive as possible, to guide the model to generate expected data in the expected form. For example, "durationInMilliseconds" instead of just "duration".

Basically, you can't expect a model to give you good results out of the box with structured outputs if the schema is poorly designed or underspecified.
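Points 2 to 4 above might look like this in a JSON Schema built from Python. The schema itself is a hypothetical example, and whether a given inference engine preserves property order is something to verify against that engine; the helper only checks the order in the JSON you send.

```python
import json

# Hypothetical schema; property order is deliberate: "calculation" before "result".
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "calculation": {"type": "string"},
        "result": {"type": "number"},
        "durationInMilliseconds": {"type": "integer"},  # descriptive, unit in the name
        "sources": {
            "type": "array",
            "items": {"type": "string"},
            "minItems": 1,
            "maxItems": 5,
        },
    },
    "required": ["calculation", "result", "durationInMilliseconds", "sources"],
    "additionalProperties": False,
}

# Python dicts keep insertion order and json.dumps preserves it unless sort_keys=True.
serialized = json.dumps(ANSWER_SCHEMA)


def assert_field_order(schema_json: str, first: str, second: str) -> None:
    """Raise ValueError if `first` does not precede `second` in the schema's properties."""
    props = list(json.loads(schema_json)["properties"])
    if props.index(first) > props.index(second):
        raise ValueError(f"{first!r} must come before {second!r}")


assert_field_order(serialized, "calculation", "result")
```

Running a check like this in tests catches accidental reordering on your side, for example a serializer configured with sorted keys.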

Der_Einzige 12/22/2025|

Ding ding ding ding, we have another person who actually understands how to use this feature.

The fact that most people don't know any of these things that you are mentioning is one of the myriad reasons why the most killer feature of LLMs continues to languish in obscurity.

pizzathyme 12/21/2025||

The very first example, which is held up as an error, is actually arguably correct. If you asked a human (me) how many bananas were purchased, they clearly purchased one banana.

Yes, the banana weighs 0.4 pounds. But the question was not to return the weight; the question was to return the quantity.

It seems like more instructions are needed in the prompt that the author is not even aware of.

banandys 12/21/2025||

A very common peeled banana weight is 100g (“metric banana”). This is convenient for calorie counting. A single 0.4 lb banana probably has a peeled weight of around 125g.

https://www.reddit.com/r/dataisbeautiful/comments/bs741l/oc_...
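The arithmetic behind that estimate, as a quick check. This assumes the 0.4 lb on the receipt is the weight with peel; the 125 g figure is the comment's own estimate, not a measured value.

```python
GRAMS_PER_POUND = 453.59237  # exact by definition of the avoirdupois pound

receipt_weight_g = 0.4 * GRAMS_PER_POUND
peeled_estimate_g = 125  # the comment's estimate, not a measured value
implied_edible_fraction = peeled_estimate_g / receipt_weight_g

print(f"{receipt_weight_g:.0f} g with peel, edible fraction {implied_edible_fraction:.2f}")
# prints "181 g with peel, edible fraction 0.69"
```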

esafak 12/21/2025||

Or one batch of bananas, weighing 0.4 pounds. The number of bananas is not specified in the receipt, and I would not expect the model to estimate it.

AmbroseBierce 12/22/2025|||

The prompt literally tells it "if not specified assume 1"

banandys 12/22/2025|||

It is very unlikely for 0.4 lbs of bananas to be more than one banana.

https://fdc.nal.usda.gov/food-details/1105314/measures

Sugar bananas or apple bananas would weigh less, but would cost more and probably not just be listed as bananas.

esafak 12/22/2025||

That would require the model to know or look up the average weight of a banana, and do arithmetic.

dcastm 12/21/2025||

While I agree that you must be careful when using structured outputs, the article doesn't provide good arguments:

1. In the examples provided, the author compares freeform CoT + JSON output vs. non-CoT structured output. This is unfair and biases the results towards what they wanted to show. These days, you don't need to include a "reasoning" field in the schema as mentioned in the article; you can just use thinking tokens (e.g., reasoning_effort for OpenAI models). You get the best of both worlds: freeform reasoning and structured output. I tested this, and the results were very similar for both.

2. "Let Me Speak Freely?" had several methodological issues. I address some of them (and .txt's rebuttal) here: https://dylancastillo.co/posts/say-what-you-mean-sometimes.h...

3. There's no silver bullet. Structured outputs might improve or worsen your results depending on the use case. What you really need to do is run your evals and make a decision based on the data.

Der_Einzige 12/21/2025|

BTW, the structured outputs debate is significantly more complicated than even your own post implies.

You aren't testing structured outputs+model alone, you are testing

1. The structured outputs backend used. There are at least 4 major free ones: outlines, xgrammar, lm-format-enforcer and guidance. OpenAI, Anthropic, Google, and Grok will all have different ones. They all do things SIGNIFICANTLY differently. That's at least 8 different backends to compare.

2. The settings used for each structured output backend. Oh, you didn't know that there's often 5+ settings related to how they handle subtle stuff like whitespaces? Better learn to figure out what these settings do and how to tweak them!

3. The model's underlying sampling settings, i.e. any default temperature, top_p/top_k, etc. going on. Remember that the ORDER of application of samplers matters here! Huggingface transformers and vLLM have opposite defaults on whether temperature happens before sampling or after!

4. The model, and don't forget about differences around quants/variants of the model!

Almost no one who does this kind of analysis even talks about these additional factors, including academics.

Sometimes it feels like I'm the only one in this world who actually uses this feature at the extremes of its capabilities.
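A toy illustration of why sampler order (point 3 above) matters. It uses made-up logits and a plain-Python top-p filter, and makes no claim about what any particular library does by default. Applying temperature before top-p flattens the distribution, so more tokens survive the cutoff.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_keep(probs: list[float], top_p: float) -> list[int]:
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: list[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, 0.0]  # made-up values for illustration
temperature, top_p = 2.0, 0.7

# Order A: temperature first, then top-p on the flattened distribution.
kept_temp_first = top_p_keep(softmax([x / temperature for x in logits]), top_p)
# Order B: top-p on the raw distribution; temperature would be applied afterwards.
kept_top_p_first = top_p_keep(softmax(logits), top_p)

print(len(kept_temp_first), len(kept_top_p_first))  # prints "3 2"
```

Same logits, same settings, different candidate sets: two benchmarks that differ only in sampler order are not measuring the same thing.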

A_SIGINT 12/21/2025||

> Chain-of-thought is crippled by structured outputs

I don't know if this is true. Libraries such as Pydantic AI, and I would assume the model provider SDKs, stream different events. If CoT is needed then a <think> section would be emitted, and the structured response would follow when the model begins its final response.

Structured outputs can be quite reliable if used correctly. For example, I designed an AST structure that allows me to reliably generate SQL. The model has tools to inspect data-points, view their value distributions (quartiles, medians, etc). Then once I get the AST structure back I can perform semantic validation easily (just walk the tree like a compiler). Once semantic validation passes (or forces a re-prompt with the error), I can just walk the tree again to generate SQL. This helps me reliably generate SQL where I know it won't fail during execution, and have a lot of control over what data-points are used together, and ensuring valid values are used for them.

I think the trick is just generating the right schema to model your problem, and understanding the depth of an answer that might come back.
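A heavily simplified sketch of the validate-then-render flow described above. The AST shape, the `CATALOG` of tables and allowed values, and the error messages are all hypothetical; a real system would cover far more of SQL and would use parameterized queries.

```python
from dataclasses import dataclass

# Hypothetical catalog of known columns and their allowed values.
CATALOG = {"orders": {"status": {"open", "shipped"}, "region": {"EU", "US"}}}


@dataclass
class Filter:
    column: str
    value: str


@dataclass
class Query:
    table: str
    columns: list[str]
    filters: list[Filter]


def validate(query: Query) -> list[str]:
    """Walk the tree and collect semantic errors to send back in a re-prompt."""
    if query.table not in CATALOG:
        return [f"unknown table {query.table!r}"]
    known = CATALOG[query.table]
    errors = [f"unknown column {c!r}" for c in query.columns if c not in known]
    for flt in query.filters:
        if flt.column not in known:
            errors.append(f"unknown column {flt.column!r}")
        elif flt.value not in known[flt.column]:
            errors.append(f"invalid value {flt.value!r} for {flt.column!r}")
    return errors


def to_sql(query: Query) -> str:
    """Render SQL only after validation; every value is known to be in the catalog."""
    if validate(query):
        raise ValueError("query failed semantic validation")
    where = " AND ".join(f"{flt.column} = '{flt.value}'" for flt in query.filters)
    sql = f"SELECT {', '.join(query.columns)} FROM {query.table}"
    return f"{sql} WHERE {where}" if where else sql
```

Because rendering refuses anything that fails validation, the error list doubles as the re-prompt message for the model.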

Aurornis 12/21/2025||

Does anyone have more benchmarks or evals with data on this topic? The claimed 20% accuracy reduction is significant.

Structured output was one of the lesser known topics that AI consultants and course writers got a lot of mileage out of because it felt like magic. A lot of management people would use ChatGPT but didn’t know how to bridge the text output into a familiar API format, so using a trick to turn it into JSON felt like the missing link. Now that I think about it, I don’t recall seeing any content actually evaluating the impact of constrained output on quality though.

This blog post blurs the lines between output quality reduction and incorrect error handling, though. I’d like to see some more thorough benchmarking that doesn’t try to include obvious schema issues in the quality reduction measurements.

crystal_revenge 12/21/2025|

(repeating an earlier comment). The team behind Outlines has repeatedly provided evaluations that show constrained decoding improves the outputs:

- https://blog.dottxt.ai/performance-gsm8k.html

- https://blog.dottxt.ai/oss-v-gpt4.html

- https://blog.dottxt.ai/say-what-you-mean.html

rybosome 12/21/2025||

I have heard this argument before, but never actually seen concrete evals.

The argument goes that because we are intentionally constraining the model - I believe OAI’s method is a softmax (I think, rusty on my ML math) to get tokens sorted by probability, then taking the first that aligns with the current state machine - we get less creativity.

Maybe, but a one-off vibes example is hardly proof. I still use structured output regularly.

Oh, and tool calling is almost certainly implemented atop structured output. After all, it’s forcing the model to respond with a JSON schema representing the tool arguments. I struggle to believe that this is adequate for tool calling but inadequate for general purpose use.

crystal_revenge 12/21/2025|

> but never actually seen concrete evals.

The team behind the Outlines library has produced several sets of evals and repeatedly shown the opposite: that constrained decoding improves model performance (including examples of "CoT" which the post claims isn't possible). [0,1]

There was a paper that claimed constrained decoding hurt performance, but it had some fundamental errors which they also wrote about [2].

People get weirdly superstitious when it comes to constrained decoding, as though it's somehow "limiting the model", when it's just as simple as applying a conditional probability distribution to the logits. I also suspect this post is largely to justify the fact that BAML parses the results (since the post is written by them).

0. https://blog.dottxt.ai/performance-gsm8k.html

1. https://blog.dottxt.ai/oss-v-gpt4.html

2. https://blog.dottxt.ai/say-what-you-mean.html
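The "conditional probability distribution" description above can be shown in a few lines. This is a toy sketch with made-up logits and a hypothetical set of grammar-allowed token ids, not how Outlines or any provider implements it: disallowed tokens are masked to negative infinity, and the softmax over what remains equals the original distribution conditioned on the allowed set.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def constrain(logits: list[float], allowed: set[int]) -> list[float]:
    """Mask tokens the grammar forbids, then renormalize with softmax."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return softmax(masked)


logits = [1.5, 0.2, -0.3, 2.0]  # made-up next-token logits
allowed = {0, 2}                # tokens the hypothetical grammar permits here

unconstrained = softmax(logits)
constrained = constrain(logits, allowed)
allowed_mass = sum(unconstrained[i] for i in allowed)
# For every allowed token i: constrained[i] == unconstrained[i] / allowed_mass
```

Note that in this example the model's top choice (token 3) is forbidden, so the mask changes which token is most likely; that per-step effect is what the quality debate in this thread is about.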

Der_Einzige 12/21/2025||

To be fair, there is "real harm" from constraining LLM outputs related to, for example, forcing lipograms of the letter "E" and a model responding with misspellings of words (deleted E) rather than words that don't actually have the letter "E" at all. This is why some authors propose special decoders to fix that diversity problem. See this paper and most of what it cites around it for examples of this: https://arxiv.org/abs/2410.01103

This is independent from a "quality" or "reasoning" problem which simply does not exist/happen when using structured generation.

Edit (to respond):

I am claiming that there is no harm to reasoning, not claiming that CoT reasoning before structured generation isn't happening.

crystal_revenge 12/21/2025||

> "reasoning" problem which simply does not exist/happen when using structured generation

The first article demonstrates exactly how to implement structured generation with CoT. Do you mean “reasoning” other than traditional CoT (like DeepSeek)? I’ll have to look for a reference but I recall the Outlines team also handling this latter case.

armcat 12/21/2025|

I really like BAML but this post seems a little too much like a BAML funnel. Here are three methods that worked for me consistently since constrained sampling first came out:

1. Add a validation step (using a mini model) right at the beginning - sub-second response times; the validation will either emit True/False or emit a function call

2. Use a sequence of (1) large model without structured outputs for reasoning/parsing, chained to (2) small model for constrained sampling/structured output

3. Keep your Pydantic models/schemas as flat as possible (not too nested and not too many enumerations) and "help" the model in the system prompt as much as you can
