Posted by HenryNdubuaku 18 hours ago

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model (github.com)
Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We'd always been frustrated by how little effort goes into agentic models that run on budget phones, so we dug into the problem and landed on an observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
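
To make the framing concrete, here's a deliberately dumb Python sketch of retrieval-and-assembly (not Needle's code; keyword matching stands in for learned cross-attention, and the tool/argument names are only illustrative):

    import json, re

    # Toy tool registry: tool name -> trigger keywords.
    tools = {
        "set_timer":    ["timer", "countdown"],
        "send_message": ["message", "text"],
    }

    def call_tool(query):
        # 1. Retrieval: match the query against tool keywords.
        name = next(t for t, kws in tools.items() if any(k in query for k in kws))
        # 2. Assembly: pull argument values out of the query text.
        m = re.search(r"for (.+)$", query)
        args = {"time_human": m.group(1)} if m else {}
        # 3. Emit JSON.
        return json.dumps([{"name": name, "arguments": args}])

    print(call_tool("set a timer for 1 hour"))
    # [{"name": "set_timer", "arguments": {"time_human": "1 hour"}}]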

The result is what we call Simple Attention Networks: the entire model is just attention and gating, with no MLPs anywhere. Needle is an experimental run of this architecture for single-shot function calling on consumer devices (phones, watches, glasses...).
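
A minimal single-head numpy sketch of the idea (my simplification, not the actual layer; gate placement and other details differ, see the writeup linked below for the real architecture):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def san_block(x, Wq, Wk, Wv, Wg, Wo):
        # One attention-plus-gating block: no FFN/MLP sublayer anywhere.
        # x: (seq_len, d_model) token states; causal masking omitted for brevity.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (seq, seq) mixing weights
        mixed = attn @ v                                # convex mix of value rows
        gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))          # sigmoid gate per channel
        return x + (gate * mixed) @ Wo                  # residual connection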

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)

- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)

- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval in general). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, those models have broader scope and capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed.

Weights: https://huggingface.co/Cactus-Compute/needle

GitHub: https://github.com/cactus-compute/needle

496 points | 154 comments
syntaxing 13 hours ago|
This would be amazing for home assistant.
synesthesiam 13 hours ago|
On my list to check out tomorrow :D
syntaxing 10 hours ago|||
Wow can’t believe the voice engineer lead for Nabu Casa is here! Super excited to see if this works for HA!
HenryNdubuaku 13 hours ago|||
Thanks, keep me posted!
logdahl 15 hours ago||
I find this stuff super fascinating and have been thinking about it myself. Maybe one could bootstrap tiny models on a rather 'pure' procedural data set. Neglecting [0] of course...

[0]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

HenryNdubuaku 15 hours ago|
Sounds interesting, would love to see it too!
zamalek 15 hours ago||
Is the idea here to add function calling to models that don't have it, or even improve function calling (qwen quirks)?
HenryNdubuaku 15 hours ago|
So it’s a tiny model capable of function calling that could run locally on cheap devices.
efskap 11 hours ago||
No FFN is blowing my mind. This is pretty much "Attention Is ACTUALLY All You Need". Reminds me of BERT Q&A which would return indices into the input context, but even that had a FFN. Really exciting work.
krackers 10 hours ago|
I guess this had always been bugging me. I get why you need activation/non-linearities, but do you really need the FFN in Transformers? People say that without it you can't do "knowledge/fact" lookups, but you still have the Value part of the attention, and if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that. Deleting the FFN is probably way worse in terms of scaling laws or storing information, but is it an actual architectural dead end (in the way that deleting the activation layers clearly would be, since it'd collapse everything to a linear function)?
Majromax 9 hours ago||
> if your question is "what is the capital of france" the LLM could presumably extract out "paris" from the value vector during attention computation instead of needing the FFN for that.

But how do you get 'Paris' into the value vector in that case? The value vector is just the result of a matrix multiplication, and without a nonlinearity it can't perform a data-dependent transformation. Attention still acts as a nonlinear mixer of previous values, but your new output is still limited to a convex combination of the previous values.
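
A quick numpy sanity check of the convexity point (toy numbers, nothing Needle-specific):

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.normal(size=(5, 8))                 # five value vectors
    scores = rng.normal(size=5)                 # attention logits for one query
    w = np.exp(scores) / np.exp(scores).sum()   # softmax: w_i >= 0, sum(w) == 1
    out = w @ V                                 # the attention output

    # Each coordinate of the output is bounded by that coordinate's range
    # across the value vectors: you can't leave the convex hull of V.
    assert np.all(out >= V.min(axis=0) - 1e-9)
    assert np.all(out <= V.max(axis=0) + 1e-9)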

krackers 7 hours ago||
> But how do you get 'Paris' into the value vector in that case?

Ok wait, I think I see what you mean. Maybe it's not getting paris _into_ the value vector that's hard, but isolating the residual stream to _only_ that, instead of things like other capitals.

So as a naive example, maybe at the very first layer consuming your tokens, Q{France} would have a high inner product with K{capital}, and so our residual would now mostly contain V{capital}, which maybe contains embeddings of all the capitals of all countries. You need some way to filter out all the other stuff, but you can't do that without an FFN + activation.

Just throwing in a relu by itself won't help, since that would still act on all the elements uniformly; you need some way to put weight on "paris" while suppressing the others, i.e. mixing within the residual stream itself.

Although maybe if you really stretch it, somewhere in a deeper layer you could have 1-hot encoded values with a "gain" coefficient, so that when you do the residual addition it's something like {<paris>, <tokyo>, <dc>} + 10000*{<1>, <0>, <0>}, and then if you softmax that you get something with most of its mass on "paris". But it seems like this would not be practical, or it's just shifting the issue to how the right 1-hot vector is chosen.
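
A toy numpy version of that stretch (made-up 3-coordinate "capital" directions, just to see the softmax collapse):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    resid = np.array([1.0, 1.0, 1.0])      # equal mass on {<paris>, <tokyo>, <dc>}
    gain  = np.array([1.0, 0.0, 0.0])      # the 1-hot "selector"

    print(softmax(resid + 10000 * gain))   # ~[1, 0, 0]: nearly all mass on <paris>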

isaisabella 6 hours ago||
Nice catch. Using a big agentic model for simple tasks is inefficient and wasteful; Needle really addresses this. Looking forward to future upgrades!
quadrature 15 hours ago||
Does the model have capacity for in-context learning? If we give it examples of patterns, can it follow them?
HenryNdubuaku 15 hours ago|
Not yet, but it's in the works!
dangoodmanUT 14 hours ago||
Why pick Gemini? It's probably the worst tool calling model of the major labs.
HenryNdubuaku 14 hours ago|
Cheaper APIs
sroussey 12 hours ago||
Can this be converted to onnx or otherwise be used in a browser?
casey2 7 hours ago|
Query: set a timer for 1 hour

Result: [{"name":"set_timer","arguments":{"time_human":"1 hour"}}]

Query: in 1 hour set a timer for 1 hour

Result: [{"name":"set_timer","arguments":{"time_human":"1 hour"}}]

I'd expect either a chain load or just a 2-hour timer. Further attempts humorously give two separate 1-hour timers.
