VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

Posted by timhigins 19 hours ago

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO (arxiv.org)

353 points | 183 comments | page 3

SwellJoe 16 hours ago

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).

https://swelljoe.com/post/will-it-mythos/

nsingh2 16 hours ago

The lack of tool use will hinder it a lot, I think, since bug hunting requires collecting context across a code base and stitching it together. It might be good in a narrower sense, i.e., "is there a bug in this block of code?", without considering how that block interacts with the rest of the code base.

That's also more aligned with its LeetCode-style training data, where the code under test is fully in the context window. It might be interesting to have a bigger tool-use model do the work of collecting the context and feed it into this kind of model for analysis only. It becomes more of a thinking tool than the orchestrator.
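A minimal sketch of that hand-off, assuming the bigger model has already gathered `(path, code)` snippets. The function name `build_analysis_prompt` and the character budget are hypothetical, chosen only for illustration; a real setup would count tokens rather than characters.

```python
def build_analysis_prompt(snippets, question, max_chars=8000):
    """Pack collected snippets into one prompt, stopping before the budget is exceeded."""
    parts = [question]
    used = len(question)
    for path, code in snippets:
        block = f"\n\n# File: {path}\n{code}"
        if used + len(block) > max_chars:
            break  # drop remaining snippets rather than truncate one mid-file
        parts.append(block)
        used += len(block)
    return "".join(parts)


# Hypothetical snippets that a tool-using model might have collected.
snippets = [
    ("auth.py", "def check(token):\n    return token == SECRET"),
    ("server.py", "def handle(req):\n    return check(req.token)"),
]
prompt = build_analysis_prompt(snippets, "Is there a bug in how tokens are checked?")
```

The small model then sees everything in a single context window and only has to reason about it, which matches the "analysis only" role described above.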

brainless 12 hours ago

I recently came across this model and I would love to try it with my coding agent soon.

I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.

I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).

https://github.com/brainless/nocodo

SubiculumCode 6 hours ago

Maybe no tool calling, but it seems it could be really good at deciding which tool to use and when?

brainless 4 hours ago

That is a good point. I do think these models would be good at decision making. The large models are trained for tool calling. Perhaps the small models can generate text that expresses their decision but cannot reliably generate syntactically correct JSON to reply with. I do not know, but this is my hunch.
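A rough sketch of how that split could look, assuming the model replies in plain text and the agent builds the JSON itself. The tool names and the functions `parse_decision` and `to_tool_call` are hypothetical, not taken from any real agent.

```python
import json
import re

# Hypothetical tool names that the agent knows how to run.
TOOLS = {"search_code", "read_file", "run_tests"}


def parse_decision(text, tools=TOOLS):
    """Return the first known tool name mentioned in the model's reply, or None."""
    for word in re.findall(r"[a-z_]+", text.lower()):
        if word in tools:
            return word
    return None


def to_tool_call(text):
    """Build the JSON on the agent side, so the model never has to emit it."""
    tool = parse_decision(text)
    if tool is None:
        return None
    return json.dumps({"tool": tool, "arguments": {}})


reply = "I think we should use read_file to look at main.py first."
call = to_tool_call(reply)
```

Since the agent serializes the call, a syntax error in the model's output cannot break the JSON; the worst case is that no tool is recognized and `to_tool_call` returns `None`.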

cold_harbor 7 hours ago

GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
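To make the "relative to each other within a group" part concrete, here is a minimal sketch of the group-relative advantage, assuming binary verifier rewards. Using the population standard deviation and a small `eps` is an assumption; implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see the note above
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifier (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

The group mean plays the role of the baseline that PPO would get from a learned value network, so no second model has to be trained or held in memory. If every answer in a group gets the same reward, all advantages are zero and that group contributes no learning signal.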

uberex 9 hours ago

What is the idiot's guide to running this one locally now?

yousif_123123 9 hours ago

Use LM Studio.

uberex 21 minutes ago

How do I get these weights in particular?

Landing7610 8 hours ago

omlx makes it quite easy

unfirehose 9 hours ago

This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.

diimdeep 5 hours ago

BF16 with no QAT quants == half-baked bread

scotty79 13 hours ago

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

anonyfox 14 hours ago

Wake me up when it does OCaml fine.

4gotunameagain 7 hours ago

What are the implications of local SOTA inference, given the insane datacenter "investing"?

It surely cannot be justified by training alone at this scale, since models nowadays are improved more and more by fine-tuning than by re-training from scratch.

Will a viable local model crash the US economy?

More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to delay the inevitable? Probably not, but I wouldn't be surprised if that is the case.
