Posted by gmays 2 days ago
https://openrouter.ai/announcements/response-healing-reduce-...
Might be useful (I am not the author).
I'm a huge fan of structured outputs, but I also recently started splitting the two steps, and I think it has a bunch of upsides that normally aren't discussed:
1. It separates concerns: schema validation errors don't invalidate the whole LLM response. If the only failure is in generating schema-compliant tokens (something I've seen frequently), retries are much cheaper.
2. Having the original response as free text AND the structured output has value.
3. In line with point 1, it allows using a more expensive (reasoning) model for free-text generation, then a smaller model like gemini-2.5-flash to convert the output to the structured format.
I used Python's Instructor[1], a package that forces the model output to match a predefined Pydantic model. It's used as in the example below, and the output is guaranteed to fit the model.
import instructor
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

client = instructor.from_provider("openai/gpt-5-nano")

person = client.create(
    response_model=Person,
    messages=[{"role": "user", "content": "Extract: John is a 30-year-old"}]
)
print(person)
I defined a response model for a chain-of-thought prompt, with the answer and its reasoning, then asked questions:

class MathAnswer(BaseModel):
    value: int
    reasoning: str

answer = client.create(
    response_model=MathAnswer,
    messages=[{"role": "user", "content": "What's the answer to 17*4+1? Think step by step"}]
)
print(f"answer={answer.value}, {answer.reasoning}")
This worked in most cases, but once in a while it produced very strange results, like: answer=67, First I calculated 17*4=68, then I added 1, so the answer is 69
The actual implementation was much more complicated, with many complex properties, a lot of inserted context, and a long, engineered prompt, and it happened only a few times, so it took me hours to figure out whether it was caused by a programming bug or just the LLM's randomness. It turned out that because I defined MathAnswer in that order, the model output followed the same order and put `reasoning` after `value`, so the thinking process didn't influence the answer: `{"value": 67, "reasoning": "..."}` instead of `{"reasoning": "...", "value": 69}`. I just changed the order of the model's properties and the problem was gone.
class MathAnswer(BaseModel):
    reasoning: str
    value: int
[1] https://python.useinstructor.com/#what-is-instructor

ETA: Codex and Claude Code only said how shit my prompt and RAG system were, then suggested how to improve them, but that only made the problem worse. They really don't know how they work.
Step one: ask the LLM to classify something from the prompt “creatively”. For example, ask it to classify the color or category of a product in an e-commerce catalog or user request. Give examples of what valid instances of these entities look like and ask for output that looks like those examples (encourage the LLM to engage in creative hallucination). It often helps to have the LLM pick more than one label and to choose several different, realistic, diverse labels.
Step two: with the hallucinated entities, look up the most similar “real” entities via embedding similarity, then return those.
It can save you a lot of tokens (you don’t have to enumerate every legal value). And you can get by with a tiny model.
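A minimal sketch of that hallucinate-then-resolve lookup, assuming the openai Python client and numpy; the CATALOG_COLORS list and the model names are invented for illustration, not from the comment above:

import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical catalog of "real" labels we are allowed to return.
CATALOG_COLORS = ["crimson", "navy blue", "forest green", "charcoal", "ivory"]

def embed(texts):
    # One embedding vector per input string.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Step one: let a small model "creatively" guess labels (hallucination is fine here).
guess = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Give 3 plausible color labels for: 'lightly used maroon office chair'. "
                   "One label per line, in the style of: crimson, navy blue, forest green."
    }],
)
hallucinated = [line.strip() for line in guess.choices[0].message.content.splitlines() if line.strip()]

# Step two: map each hallucinated label onto the nearest real catalog label by cosine similarity.
catalog_vecs = embed(CATALOG_COLORS)
guess_vecs = embed(hallucinated)

def nearest(vec, matrix, labels):
    sims = matrix @ vec / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))
    return labels[int(np.argmax(sims))]

resolved = {g: nearest(v, catalog_vecs, CATALOG_COLORS) for g, v in zip(hallucinated, guess_vecs)}
print(resolved)  # e.g. {"maroon": "crimson", ...}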
I would love some more detailed and reproducible examples, because the claims don't hold for all the use cases I've had.
The second pass is configured for structured output via guided decoding, and is asked to just put the field values from the analyzer's response into JSON fitting a specified schema.
I have processed several hundred receipts this way with very high accuracy; 99.7% of extracted fields are correct. Unfortunately it still needs human review because I can't seem to get a VLM to see the errors in the very few examples that have errors. But this setup does save a lot of time.
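Not the poster's guided-decoding setup, but a hedged sketch of that second pass using the Instructor pattern from earlier in the thread: the first model's free-text analysis is handed to a cheap structured-output call whose only job is to copy field values into the schema. The ReceiptFields model, model name, and prompt are invented for illustration.

import instructor
from pydantic import BaseModel

class ReceiptFields(BaseModel):
    merchant: str
    date: str
    total: float

# Cheap model whose only job is faithful transcription into the schema.
structurer = instructor.from_provider("openai/gpt-4o-mini")

def to_structured(analysis_text: str) -> ReceiptFields:
    # Pass two: no new reasoning, just copy the values the analyzer already stated.
    return structurer.create(
        response_model=ReceiptFields,
        messages=[{
            "role": "user",
            "content": "Copy the field values from this analysis into the schema, "
                       "without changing or inferring anything:\n\n" + analysis_text,
        }],
    )

analysis = "Merchant: Blue Bottle Coffee. Date: 2024-03-02. Total paid: 11.75 USD."
print(to_structured(analysis))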
However, I would say two things:
1. I doubt this quality drop couldn't be mitigated by first letting the model answer in its regular language and then doing a second constrained step to convert that into structured output.
2. For the smaller models, I have seen instances where the constrained sampling of structured outputs actually HELPS with output quality. If you can sufficiently encode information in the structure of the output, it can help the model. It effectively lets you encode simple branching mechanisms that execute at sample time.
You surely aren't implying that the model is sentient or has any "desire" to give an answer, right?
And how is that different from prompting in general? Isn't using English already a constraint? And isn't that what it is designed for: to work with prompts that provide limits within which to determine the output text? It's not like there is a "real" answer that you suppress by changing your prompt.
So I don't think it's a plausible explanation to say this happens because we are "making" the model return its answer in a "constrained language" at all.
The model is a probabilistic machine that was trained to generate completions and then fine-tuned to generate chat-style interactions. There is an output, given the prompt and weights, that is most likely under the model. That’s what one could call the model’s “desired” answer, if you want to anthropomorphize. When you constrain which tokens can be sampled at a given timestep, you by definition diverge from that.
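A toy illustration of that divergence (the logits here are made up, not from any real model): at a single decoding step, masking which tokens are allowed can force the greedy pick away from what the unconstrained model would have emitted.

logits = {'"': 2.1, "Sure": 3.4, "First": 2.9, "{": 1.2}  # invented next-token scores

def greedy(allowed=None):
    # Greedy decoding over either the full vocabulary or a constrained subset of it.
    candidates = logits if allowed is None else {t: logits[t] for t in allowed}
    return max(candidates, key=candidates.get)

print(greedy())               # unconstrained: "Sure" (the model's most likely continuation)
print(greedy(allowed={"{"}))  # a JSON-object grammar only permits "{" here, so that's what you get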
Also, meta gripe: this article felt like a total bait-and-switch in that it only became clear that it was promoting a product right at the end.
Even Amazon’s cheapest and fastest model does that well - Nova Lite.
But even without using his framework, he gave me an obvious-in-hindsight method of handling image understanding.
I should have used a more advanced model to describe the image as free text and then used a cheap model to convert the text to JSON.
I also had the problem that my process hallucinated that it understood the “image” contained in a Mac .DS_Store file.
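A hedged sketch of that two-step image flow (a stronger vision model for the free-text description, a cheap model for the JSON conversion), reusing the Instructor pattern from above; the model names, ImageFacts schema, and image URL are placeholders, not what the poster actually ran.

import instructor
from openai import OpenAI
from pydantic import BaseModel

class ImageFacts(BaseModel):
    subject: str
    colors: list[str]

vision = OpenAI()
structurer = instructor.from_provider("openai/gpt-4o-mini")  # cheap model for the JSON step

# Step one: a stronger vision model describes the image as free text.
described = vision.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in plain prose."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
).choices[0].message.content

# Step two: a cheap model converts the prose into the schema.
facts = structurer.create(
    response_model=ImageFacts,
    messages=[{"role": "user", "content": "Extract the fields from this description:\n\n" + described}],
)
print(facts)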
And about structured outputs messing with chain-of-thought... Is CoT really used with normal models nowadays? I think that if you need CoT you might as well use a reasoning model, and that solves the problem.