Posted by gmays 2 days ago
https://openrouter.ai/announcements/response-healing-reduce-...
Might be useful (I am not the author).
I'm a huge fan of structured outputs, but I also recently started splitting the two steps, and I think it has a bunch of upsides that normally aren't discussed:
1. It separates concerns: schema validation errors don't invalidate the whole LLM response. If the only failure is in generating schema-compliant tokens (something I've seen frequently), retries are much cheaper.
2. Having the original response as free text AND the structured output has value.
3. In line with point 1, it allows using a more expensive (reasoning) model for free-text generation, then a smaller model like gemini-2.5-flash to convert the output to the structured format.
I used Python's Instructor[1], a package that forces the model output to match a predefined Pydantic model. It's used as in the example below, and the output is guaranteed to fit the model.
import instructor
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

client = instructor.from_provider("openai/gpt-5-nano")

person = client.create(
    response_model=Person,
    messages=[{"role": "user", "content": "Extract: John is a 30-year-old"}]
)
print(person)
I defined a response model for a chain-of-thought prompt, with the answer and its reasoning, then asked questions:

class MathAnswer(BaseModel):
    value: int
    reasoning: str

answer = client.create(
    response_model=MathAnswer,
    messages=[{"role": "user", "content": "What's the answer to 17*4+1? Think step by step"}]
)
print(f"answer={answer.value}, {answer.reasoning}")
This worked in most cases, but once in a while it produced very strange results, like: answer=67, First I calculated 17*4=68, then I added 1, so the answer is 69
The actual implementation was much more complicated, with many complex properties, a lot of inserted context, and a long, engineered prompt, and it happened only a few times, so it took me hours to figure out whether it was caused by a programming bug or just the LLM's randomness. It turned out that because I defined MathAnswer in that order, the model output followed the same order and put `reasoning` after `value`, so the thinking process didn't influence the answer: `{"value": 67, "reasoning": "..."}` instead of `{"reasoning": "...", "value": 69}`. I just changed the order of the model's properties and the problem was gone.
class MathAnswer(BaseModel):
    reasoning: str
    value: int
[1] https://python.useinstructor.com/#what-is-instructor

ETA: Codex and Claude Code only said how shit my prompt and RAG system were, then suggested how to improve them, but that only made the problem worse. They really don't know how they work.
Step one: ask the LLM to classify something from the prompt “creatively”. For example, ask it to classify the color or category of a product in an e-commerce catalog or user request. Give examples of what valid instances of these entities look like and ask for output that looks like those examples (encourage the LLM to engage in creative hallucination). It often helps to have the LLM pick more than one label and to choose several different, realistic, diverse labels.
Step two: with the hallucinated entities, look up the most similar “real” entities via embedding similarity, then return those.
It can save you a lot of tokens (you don’t have to enumerate every legal value). And you can get by with a tiny model.
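A minimal sketch of that hallucinate-then-resolve lookup, assuming the openai Python client and numpy; the CATALOG_COLORS list and the model names are invented for illustration, not from the comment above:

import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical catalog of "real" labels we are allowed to return.
CATALOG_COLORS = ["crimson", "navy blue", "forest green", "charcoal", "ivory"]

def embed(texts):
    # One embedding vector per input string.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Step one: let a small model "creatively" guess labels (hallucination is fine here).
guess = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Give 3 plausible color labels for: 'lightly used maroon office chair'. "
                   "One label per line, in the style of: crimson, navy blue, forest green."
    }],
)
hallucinated = [line.strip() for line in guess.choices[0].message.content.splitlines() if line.strip()]

# Step two: map each hallucinated label onto the nearest real catalog label by cosine similarity.
catalog_vecs = embed(CATALOG_COLORS)
guess_vecs = embed(hallucinated)

def nearest(vec, matrix, labels):
    sims = matrix @ vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))
    return labels[int(np.argmax(sims))]

resolved = {g: nearest(v, catalog_vecs, CATALOG_COLORS) for g, v in zip(hallucinated, guess_vecs)}
print(resolved)  # e.g. {"maroon": "crimson", ...}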
I would love some more detailed and reproducible examples, because the claims don't hold for all the use cases I've had.
The second pass is configured for structured output via guided decoding, and is asked to just put the field values from the analyzer's response into JSON fitting a specified schema.
I have processed several hundred receipts this way with very high accuracy; 99.7% of extracted fields are correct. Unfortunately it still needs human review because I can't seem to get a VLM to see the errors in the very few examples that have errors. But this setup does save a lot of time.
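Not the poster's guided-decoding setup, but a hedged sketch of that second pass using the Instructor pattern from earlier in the thread: the first model's free-text analysis is handed to a cheap structured-output call whose only job is to copy field values into the schema. The ReceiptFields model, model name, and prompt are invented for illustration.

import instructor
from pydantic import BaseModel

class ReceiptFields(BaseModel):
    merchant: str
    date: str
    total: float

# Cheap model whose only job is faithful transcription into the schema.
structurer = instructor.from_provider("openai/gpt-4o-mini")

def to_structured(analysis_text: str) -> ReceiptFields:
    # Pass two: no new reasoning, just copy the values the analyzer already stated.
    return structurer.create(
        response_model=ReceiptFields,
        messages=[{
            "role": "user",
            "content": "Copy the field values from this analysis into the schema, "
                       "without changing or inferring anything:\n\n" + analysis_text,
        }],
    )

analysis = "Merchant: Blue Bottle Coffee. Date: 2024-03-02. Total paid: 11.75 USD."
print(to_structured(analysis))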
However, I would say two things:
1. I doubt this quality drop couldn't be mitigated by first letting the model answer in its regular language and then doing a second constrained step to convert that into structured output.
2. For the smaller models, I have seen instances where the constrained sampling of structured outputs actually HELPS with output quality. If you can sufficiently encode information in the structure of the output, it can help the model. It effectively lets you encode simple branching mechanisms that execute at sample time.
You surely aren't implying that the model is sentient or has any "desire" to give an answer, right?
And how is that different from prompting in general? Isn't using English already a constraint? And isn't that what it is designed for: to work with prompts that provide limits within which to determine the output text? It's not like there is a "real" answer that you suppress by changing your prompt.
So I don't think it's a plausible explanation to say this happens because we are "making" the model return its answer in a "constrained language" at all.
The model is a probabilistic machine that was trained to generate completions and then fine-tuned to generate chat-style interactions. There is an output, given the prompt and weights, that is most likely under the model. That’s what one could call the model’s “desired” answer, if you want to anthropomorphize. When you constrain which tokens can be sampled at a given timestep, you by definition diverge from that.
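A toy illustration of that divergence (the logits here are made up, not from any real model): at a single decoding step, masking which tokens are allowed can force the greedy pick away from what the unconstrained model would have emitted.

logits = {'"': 2.1, "Sure": 3.4, "First": 2.9, "{": 1.2}  # invented next-token scores

def greedy(allowed=None):
    # Greedy decoding over either the full vocabulary or a constrained subset of it.
    candidates = logits if allowed is None else {t: logits[t] for t in allowed}
    return max(candidates, key=candidates.get)

print(greedy())               # unconstrained: "Sure" (the model's most likely continuation)
print(greedy(allowed={"{"}))  # a JSON-object grammar only permits "{" here, so that's what you get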
Also, meta gripe: this article felt like a total bait-and-switch in that it only became clear that it was promoting a product right at the end.
Even Amazon’s cheapest and fastest model does that well - Nova Lite.
But even without using his framework, he gave me an obvious-in-hindsight method of handling image understanding.
I should have used a more advanced model to describe the image as free text and then used a cheap model to convert the text to JSON.
I also had the problem that my process hallucinated that it understood the “image” contained in a Mac .DS_Store file.
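A hedged sketch of that two-step image flow (a stronger vision model for the free-text description, a cheap model for the JSON conversion), reusing the Instructor pattern from above; the model names, ImageFacts schema, and image URL are placeholders, not what the poster actually ran.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ImageFacts(BaseModel):
    subject: str
    colors: list[str]

vision = OpenAI()
structurer = instructor.from_provider("openai/gpt-4o-mini")  # cheap model for the JSON step

# Step one: a stronger vision model describes the image as free text.
described = vision.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in plain prose."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
).choices[0].message.content

# Step two: a cheap model converts the prose into the schema.
facts = structurer.create(
    response_model=ImageFacts,
    messages=[{"role": "user", "content": "Extract the fields from this description:\n\n" + described}],
)
print(facts)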
And about structured outputs messing with chain-of-thought... Is CoT really used with normal models nowadays? I think that if you need CoT you might as well use a reasoning model, and that solves the problem.