VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Posted by timhigins 17 hours ago

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO (arxiv.org)

353 points | 183 comments | page 2

sorenjan 6 hours ago|

How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, LiveCodeBench, and it seems to consist of problem descriptions with sample input and output, each followed by a solution that is a single function or class.

Seems like a really good model to use in an IDE when you still want control over the code structure then.

aswegs8 6 hours ago||

Not sure if it's suited for that. If you read the article it's stated that it is basically a research project to see how far they can push it with small models.

aero2146 16 hours ago||

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

fwipsy 16 hours ago||

I think this is predicted? Part of the story is how they were able to preserve core reasoning ability while cutting knowledge like "pelicans have wings."

> these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios.

sheepscreek 4 hours ago|||

So I think the takeaway here is, this is a super fast companion model to larger models, that reasons quickly. Perhaps this technique can be used to train a highly optimized reasoning "expert" in MoEs.

pylotlight 15 hours ago|||

The only really essential item here is tool calling capability, is it not? So I assume they tested for strong read/write/edit tool consistency?

nsingh2 15 hours ago|||

This model doesn't support tool calling; it was not part of its training. It's focused on Python (and I think C++) competitive programming and mathematics tasks, i.e. tasks with verifiable rewards. So if you have a task that fits that description, the size-to-capability ratio is good.

These kinds of models might be more useful as tools to be used by larger orchestrator models, than being the orchestrators themselves.
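To make "tasks with verifiable rewards" concrete, here is a minimal sketch of how such a reward could be scored for a benchmark-style coding problem: run a candidate function on sample inputs and count how many outputs match. This is an illustration only, not the paper's training code; `verifiable_reward`, `two_sum_candidate`, and `CASES` are hypothetical names made up for the example.

```python
from typing import Any, Callable, Sequence, Tuple

# Each case is (positional arguments, expected return value).
Case = Tuple[tuple, Any]


def verifiable_reward(candidate: Callable[..., Any], cases: Sequence[Case]) -> float:
    """Return the fraction of cases the candidate passes.

    A crash on a case counts as a failure, so the score is always in [0, 1].
    """
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # Runtime errors are treated the same as wrong answers.
            pass
    return passed / len(cases)


def two_sum_candidate(nums, target):
    """Hypothetical model output for a two-sum style problem."""
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []


# Hypothetical sample input/output pairs, in the spirit of a problem statement.
CASES = [
    (([2, 7, 11, 15], 9), [0, 1]),
    (([3, 2, 4], 6), [1, 2]),
]

score = verifiable_reward(two_sum_candidate, CASES)
```

The point is that the score comes from executing the answer rather than from a judgment call, which is what separates this kind of task from something like drawing a pelican.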

btown 15 hours ago|||

I'm not seeing any mention of tools in the paper, much less a bias towards "curiosity" to use those tools when it encounters gaps in its knowledge. So perhaps this is a good proof-of-concept that single-pass code generation is viable with this small a model - but we're still a long way from a viable solution.

kristopolous 13 hours ago|||

Try it again, but give a careful explanation of what a bicycle and a pelican are and how the pelican would sit atop the bicycle. Then give it a reference to the SVG tags you want it to use, with documentation.

Here's what I got

https://9ol.es/tmp/pelican.png

with https://9ol.es/tmp/prompt_pelican.txt

using prithivMLmods/VibeThinker-3B-GGUF:Q4_K_M

physPop 16 hours ago|||

It's for reasoning, not generating art?

websap 16 hours ago||

Can you explain this a bit more?

tyre 16 hours ago|||

Imagine you want to make a smaller model that is really good at one thing, say, driving a car. You could remove the parameters that lead it to correctly answer, "What is the powerhouse of the cell?" or, "Who was the first president of the United States?"

It would look really dumb if someone asked it that, but that's fine. You're trying to make a model that is optimized for efficiency for a specific task. As much as possible, you should prune uncorrelated things.

pylotlight 15 hours ago|||

SVG generation is a useless test, what's there more to know?

steve_adams_86 15 hours ago||

What if you're reasoning about how to generate SVG correctly?

Mtinie 15 hours ago||

In this case, I’d expect it to make a web search tool call to find the Python library best suited for SVG generation and manipulation, and then use what it learns there to execute the task you’ve asked it to do (either asking if you’d like to incorporate the library as a dependency or rolling its own implementation of a subset of the features, if that was your preference).

Assuming tool calling hasn’t been entirely stripped out of this model.

(Edit) No tool calling, per this comment: https://news.ycombinator.com/item?id=48640189

realitysballs 16 hours ago||

That’s all I needed to hear

pylotlight 15 hours ago||

As in, you learnt that a useless test that no one should be using was tested here, that's what you meant right?

fransje26 5 hours ago||

right?

nolist_policy 4 hours ago||

Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.

Qwen2.5 is ancient by LLM standards.

achrono 4 hours ago||

Beats Opus 4.5 on reasoning you say?

Prompt: If A goes to B who then goes to C, can A send something to C?

Response:

We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.

Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.

[Lots of other unnecessary commentary and "scenarios" that make even less sense]

postalrat 30 minutes ago||

If A goes to B who then goes to C does C know A?

erdevs 4 hours ago|||

I am a human and I don't know how to interpret this prompt.

rapatel0 4 hours ago|||

Ran the same query and there is a ton of stuff, but it looks like it's reasoning through the ambiguity of the sentence. It still gets the right answer. Moreover, if we consider the FLOPs expended to get to the answer, and compare that to opus, I think it's still a net win.

My hunch is that Opus-scale models probably have shortcuts encoded into the model that handle these ambiguous cases, whereas this model has learned a program to reason through the edge case (crystallized vs. fluid intelligence): remembering that probability (frontier) vs. calculating it on the fly (VibeThinker).

nolist_policy 4 hours ago||

> Multi-level Quality Control.

> [...]

> LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.

andai 2 hours ago||

I tried actually talking to it. It reminded me of GPT-2.

virajk_31 7 hours ago||

An SLM trained for a single use case often beats an LLM. That's both the advantage and the limitation.

androiddrew 8 hours ago||

I have been thinking about how to use this. Since it doesn’t support tool calling, I have been considering a dual-model deployment, where a small tool-calling LLM drives the majority of the user experience and VibeThinker is tapped for reasoning by the other LLM.

So who has suggestions on small models with excellent tool calling capabilities?
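A rough sketch of what that dual-model setup could look like, assuming each model is wrapped as a plain "prompt in, text out" function. Everything here is hypothetical: `DualModelRouter`, the keyword check in `looks_like_reasoning_task`, and the `fake_*` stubs are placeholders, and no real inference API is called.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: each model is exposed as a function from prompt text to reply text.
ModelFn = Callable[[str], str]


@dataclass
class DualModelRouter:
    driver: ModelFn                         # small tool-calling model (hypothetical wrapper)
    reasoner: ModelFn                       # reasoning-only model (hypothetical wrapper)
    needs_reasoning: Callable[[str], bool]  # decides when to delegate

    def handle(self, user_message: str) -> str:
        # Ordinary requests go straight to the driver model.
        if not self.needs_reasoning(user_message):
            return self.driver(user_message)
        # Otherwise ask the reasoner first, then let the driver write the reply.
        analysis = self.reasoner(user_message)
        prompt = (
            f"User request:\n{user_message}\n\n"
            f"Reasoning from helper model:\n{analysis}\n\n"
            "Write the final reply."
        )
        return self.driver(prompt)


def looks_like_reasoning_task(message: str) -> bool:
    """Placeholder heuristic; a real system might let the driver decide instead."""
    keywords = ("prove", "algorithm", "complexity", "solve")
    lowered = message.lower()
    return any(keyword in lowered for keyword in keywords)


# Stub models so the sketch runs without any model server.
def fake_driver(prompt: str) -> str:
    return f"[driver] {prompt}"


def fake_reasoner(prompt: str) -> str:
    return f"[reasoner] {prompt}"


router = DualModelRouter(fake_driver, fake_reasoner, looks_like_reasoning_task)
plain_reply = router.handle("hello")
delegated_reply = router.handle("Solve x + 1 = 2")
```

Keeping the driver in charge of the final reply means the reasoner never has to handle tools or chat formatting, which matches what people in this thread report it being bad at.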

smallerize 7 hours ago||

Gemma 4 E4B and Qwen 3 4B are pretty good, but fine-tuning makes them really good. There are tradeoffs at this size, so you'll have to find (or make) a finetune that does what you need.

scotty79 58 minutes ago|||

Qwen3.6-35B-A3B is pretty amazing. I'm using it with 96k context on 24GB VRAM through ollama.

j-bos 8 hours ago|||

Maybe bonsai 8b would make the duo, if you do try it, pls post here as I'm a bit curious too.

reddec 8 hours ago||

granite 4

iamgopal 6 hours ago||

Two models: one optimised for system, reasoning, etc., and a second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Would that work?

jpcompartir 5 hours ago||

The absolute worst name for a model I've seen

makethembroke 2 hours ago|

I don't get how this beats Opus. It just hardcoded the tasks for the benchmarks; it doesn't even respond normally.

There's a lot of randomness in it.

Please don't hype
