Posted by HenryNdubuaku 17 hours ago

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model (github.com)
Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by how little effort goes into agentic models that run on budget phones, so we investigated and landed on an observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.

The architecture is Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run at single-shot function calling for consumer devices (phones, watches, glasses...).
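
In spirit, each block is just attention plus a learned gate on the residual path. Here's a rough sketch (illustrative only, with made-up dimensions and gating details, not the actual Needle code, which is in the repo):

    # Illustrative sketch only: made-up dimensions, not the real Needle block.
    import torch
    import torch.nn as nn

    class AttentionGateBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(d_model, d_model)  # elementwise gate stands in for the FFN

        def forward(self, x, tool_ctx):
            # cross-attend from the query tokens to the tool/context tokens
            h, _ = self.attn(self.norm(x), tool_ctx, tool_ctx)
            return x + torch.sigmoid(self.gate(x)) * h  # gated residual, no MLP anywhere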

Training:
- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed.
Weights: https://huggingface.co/Cactus-Compute/needle
GitHub: https://github.com/cactus-compute/needle

475 points | 154 comments
Liam_Simpkin 2 hours ago|
How could you use this for composability? I.e. chaining together multiple tools. For example web_search → summarize_url → send_email
Liam_Simpkin 2 hours ago|
Looks possible. E.g.

Query: get the weather for san francisco and email the result to test@test.com

Result: [{"name":"get_weather","arguments":{"location":"san francisco"}},{"name":"send_email","arguments":{"to":"test@test.com","subject":"San Francisco","body":"Please find the weather attached."}}]
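
Chaining the calls would then be on the caller side, something like this (hypothetical sketch: needle_generate and the tool registry are placeholders, not part of the Needle API):

    # Hypothetical dispatch loop: needle_generate() and TOOLS are placeholders.
    import json

    TOOLS = {
        "get_weather": lambda location: f"Sunny in {location}",
        "send_email": lambda to, subject, body: f"Sent '{subject}' to {to}",
    }

    def run(query, needle_generate):
        calls = json.loads(needle_generate(query, tools=list(TOOLS)))
        results = []
        for call in calls:  # execute in the order the model emitted them
            results.append(TOOLS[call["name"]](**call["arguments"]))
        return results

Though if one call's output needs to feed into the next call's arguments (like summarize_url taking the web_search result), you'd presumably need a second pass through the model rather than a single shot.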

binyang_qiu 6 hours ago||
A lot of agent workflows really are just tool selection + argument extraction + structured output. How does this behave once workflows become multi-step and state starts accumulating across calls?
exabrial 12 hours ago||
Dumb questions, from someone not in the field...

What is a distilled model?

Why doesn't Google do this (to make their models smaller)?

Seems like you could make a competitor to Gemini?

HenryNdubuaku 12 hours ago||
No question is stupid!

1. Distilled means taking the intelligence of a big model and compacting it into a tiny model.

2. Google already does so with FunctionGemma, but Needle argues that better performance can be achieved with a 10x smaller model using our techniques.

jmalicki 5 hours ago|||
There are two answers already and neither is entirely adequate.

In normal LLM training, you take a set of documents and have the model learn to predict the next token, then have some private RLHF/RLVR etc. data that it learns to produce good chat outputs from.

In distillation, you take a set of prompts you are interested in, and record the big LLM's outputs, then train your small model to produce the same output as the big LLM.

This has a few advantages - you can get performance much more quickly on your documents/prompts of interest, with a much cheaper training budget, and you don't have to worry about acquiring very expensive RLHF/RLVR training data.
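
Concretely, the data-collection half is just something like this (toy sketch; teacher() stands in for a frontier-model API call):

    # Toy sketch: record the big model's outputs on prompts you care about,
    # then fine-tune the small model on the resulting (prompt, completion) pairs.
    import json

    def teacher(prompt):
        return "stub completion for: " + prompt  # in practice, a frontier-model API call

    prompts = ["set a timer for 10 minutes", "text mum that I'm running late"]
    with open("distill_sft.jsonl", "w") as f:
        for p in prompts:
            f.write(json.dumps({"prompt": p, "completion": teacher(p)}) + "\n")
    # the small model then gets ordinary supervised fine-tuning on distill_sft.jsonl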

A lot of the very good Chinese LLMs got very good very quickly through distillation from frontier models, which is why Anthropic/Google/OpenAI are blocking it so aggressively.

NitpickLawyer 5 hours ago||
For completeness' sake I'll add a bit more.

The concept of distillation is not new in ML, and there are nuances to it. Traditionally you would have access to the bigger model, and for LLMs specifically you can train the small model on the entire distribution of output logits at the same time. So this would train the small model to output scores for each token in a similar fashion to the large model. There's "more to learn" from the entire distribution, rather than just from the chosen token.

But since you don't have access to this from the API providers, the next best thing is to use the outputs themselves and train on those. That's more like a "poor man's distillation". It's still good, and as you mentioned it worked fairly well for models catching up. But a lab that develops both the big model and the small model could do it properly (or you could choose to distill from an existing open model).
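
In torch terms, the "full" version is roughly a KL term between the two token distributions (toy illustration with random tensors; shapes and temperature are arbitrary):

    # Toy illustration of logit-level distillation: match the student's token
    # distribution to the teacher's with a KL loss, rather than training only
    # on the teacher's sampled tokens.
    import torch
    import torch.nn.functional as F

    vocab, T = 32000, 2.0
    teacher_logits = torch.randn(2, 8, vocab)  # (batch, seq, vocab) from the big model
    student_logits = torch.randn(2, 8, vocab, requires_grad=True)

    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    loss.backward()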

tintor 12 hours ago||
Model distillation is lossy compression of a big model to produce a smaller model.

Smaller model requires less space on disk, less video memory, and less compute (cheaper hardware).

The downside is that the distilled model performs worse on the same benchmarks than the original model.

simonw 16 hours ago||
Looks like you need to open up access to https://huggingface.co/Cactus-Compute/datasets/needle-tokeni... - I get this error when trying to run the steps in your README:

> Repository Not Found for url: https://huggingface.co/api/datasets/Cactus-Compute/needle-tokenizer/revision/main.

HenryNdubuaku 16 hours ago|
Fixed now, apologies!
simonw 15 hours ago||
Thanks, works now: https://gisthost.github.io/?4ff455792651fe755265b467800f47f3
Havoc 15 hours ago||
Sounds interesting.

Got a bunch of errors trying to run it on CPU though. Very likely connected to me running this in a container (unpriv LXC), but I figured that for 26M params, CPU would suffice.

https://pastebin.com/PYZJKTNk

dakolli 15 hours ago|
It had better, considering its purpose is to run on devices with no GPU.
bityard 14 hours ago||
This is pretty much exactly what I want for Home Assistant. I yell out, "Computer! Lights!" and it toggles the lamp in the room on or off. (I mean I can do that now, I think, but probably with a much larger model.)

I haven't played with it yet, but does it ever return anything other than a tool call? What are the failure modes? What if it doesn't understand the request? Does it ever say it can't find a tool? Does it get confused if there are two similar (but different) tools? Can it chain tools together (e.g. one tool to look up an address and another to get directions to the address)?

I mean, I plan on downloading the model later tonight and finding out for myself, but since I'm stuck at work right now, I figured I'd ask anyway...

0cf8612b2e1e 12 hours ago||
How many lights are there?
kennywinker 11 hours ago||
… four. There are four lights.
xrd 8 hours ago|||
Hmm, I wonder if I can run this on my MyCroft II (now NeonOS) open source AI device...
HenryNdubuaku 13 hours ago||
Let me know what you think!
rsolva 14 hours ago||
Can it summarize text it fetches?

Come to think of it, this could be a nice model to have as the first pass in a more complex agent system, where Needle hands off the results of a tool call to a larger model.

I will defiantly play around with this!

NordStreamYacht 11 hours ago||
> I will defiantly play around with this!

Are you Calvin or Hobbes?

rsolva 3 hours ago||
Haha, not what I meant to write, but this works too!
HenryNdubuaku 14 hours ago||
The codebase is fully open, feel free to play around!
alex7o 13 hours ago||
Of all the models that do tool calls, the only thing I'm confused about is why you picked the worst one? Or maybe it's only bad at agentic work but fine for one-shot tool calls?
HenryNdubuaku 13 hours ago|
Gemini is pretty solid for 1-shot tool calls and affordable as well.
pylotlight 7 hours ago|||
My general understanding of the consensus these days is that people consider Google models to be some of the worst at tool calling, so certainly an interesting choice. Did you do any evals on this?
BuyG1n 11 hours ago|||
Hi, would love to know where you got that impression about 1-shot tool calling. Was there a concrete evaluation carried out? I'm pretty new to this and was a bit lost when trying to compare models on different capabilities.
murkt 15 hours ago||
Can this be a Siri-like core? Set me a timer, tell me what’s the weather, etc. Here is the transcribed text and the list of available tools for the model to call, then voice the output.
HenryNdubuaku 15 hours ago|
That was the goal!
z3ugma 13 hours ago|
I don't really understand what this is for... there is a lot of ML-researcher talk on the GH page about the model architecture, but how should I use it?

Is it a replacement for Kimi 2.7, Claude Haiku, or Gemini Flash 3.1 lite, i.e. a conversational LLM, for situations that are mostly tool-calling, like coding and conversational AI?

HenryNdubuaku 13 hours ago|
It is for building agentic capabilities into very small devices like phones, glasses, watches and more. Does that make sense?
jcgrillo 12 hours ago||
[flagged]
hosh 8 hours ago|||
A local model that can do better than Siri or Alexa as a personal or home assistant is, in my eyes, very useful. Being able to run on a phone or watch or glasses translates, to me, to low-powered AI, not necessarily that I want my phone, or watch, or glasses to run things for me.

My Siri use has narrowed down to just setting timers. And even then, it still has my phone calling people in the middle of the night. Siri is pretty dumb and does not do what I want it to. I’d rather be able to customize an assistant for myself.

I am also thinking of automation in my day to day workflow for work.

jcgrillo 7 hours ago||
OK.. but what would you have all this "automation" actually do? What is Siri failing to do that you want it to do? How would customizing an assistant (for whatever definition) help?
jasonjmcghee 10 hours ago||||
Throwing a few things out - HN has changed over the years, but people make stuff to make stuff. There don't need to be product use cases. The tone of the comment goes against the spirit of HN - likely the reason for downvotes.

That aside, a very small model that takes text and outputs structured JSON according to a spec is nice. It lets you turn natural language into a user action. For example, command palettes could benefit from this.

If you can do a tiny bit of planning (todo) and chain actions, it seems reasonable that you could traverse a rich state space to achieve some goal on behalf of a user.

Games could use something like it for free-form dialog while still enforcing predefined narrative graphs, etc.

I'm sure you could come up with more. It's a fuzzy function.

jcgrillo 9 hours ago||
> people make stuff to make stuff. There don't need to be product use cases.

OK. Great! So it doesn't need to be a commercial product. But does it do something (anything?) interesting? I'm interested in your games example, I'd love to see it done in real life. IIUC, game AIs are actually much more constrained and predictable for playability reasons. If you let it go all free-form, a plurality of players have a "WTF??!?" experience, which is super Not Good.

digdugdirk 9 hours ago||
It doesn't have to do anything interesting - it's completely fascinating all on its own. If you understand anything about the math and science behind LLMs, you'll understand that this is an achievement worthy of sharing to a community like HN.

That being said, small models like these have plenty of use cases. They allow for extra "slack" to be introduced into a programmatic workflow in a compute constrained environment. Something like this could help enable the "ever present" phone assistant, without scraping all your personal data and sending it off to Google/OpenAI/etc. Imagine if keywords in a chat would then trigger searches on your local data to bring up relevant notes/emails/documents into a cache, and then this cache directly powers your autocomplete (or just a sidebar that pops up with the most relevant information). Having flexible function calling in that loop is key for fault tolerance and adaptability to new content and contexts.

It's cool. Enjoy it.

jcgrillo 9 hours ago||
> Something like this could help enable the "ever present" phone assistant, without scraping all your personal data and sending it off to Google/OpenAI/etc

OK so show me what that's for. Show me something useful you can do with that ability.

> Imagine if keywords in a chat would then trigger searches on your local data to bring up relevant notes/emails/documents into a cache, and then this cache directly powers your autocomplete (or just a sidebar that pops up with the most relevant information).

I'm really trying but.. idgi? I truly cannot imagine how this would improve my life in any way...

> It's cool. Enjoy it.

No. It sounds like a useless complication on my watch. I don't fucking care if it can tell me the phase of the moon. I can look up at the sky and see the moon and know what phase it is.

EDIT: You say:

> If you understand anything about the math and science behind LLMs, you'll understand that this is an achievement worthy of sharing to a community like HN.

OK. So educate me. Tell me what I'm missing.

HenryNdubuaku 12 hours ago|||
You can think of “phone use”, for instance: what Siri is supposed to be.
jcgrillo 12 hours ago||
I mean.. Siri basically works? When I'm driving I say "Hey Siri, find me a gas station along my route", and it does. Or I say "Hey Siri, call Joe Bob mobile" and it does. Or I say "Hey Siri, play me a podcast". This is kind of a solved problem already? When I'm driving this is literally as complicated of a distraction as I want--I'm not going to be dictating emails or texts. When I'm not driving, the touchscreen keyboard (as shitty an interface as that is) is 100x better than voiced natural language commands.
ilaksh 12 hours ago||
It does just barely work now after they spent billions, and they may still fall back to cloud LLMs for a significant number of things. This is a way that everyone can get that on the actual Apple Watch or local phone for any application they build.
jcgrillo 11 hours ago||
I get that, but I still can't imagine what it might be for. TBH I don't have a smart watch, because I can't think of anything I'd want one for--my mechanical watch keeps time to within a few seconds per month and the lume lasts all night. I don't know what making it "smarter" would do for me, it does an A+ job of being a watch. What are the things that "everyone" can build with this that actually matter? Like, what is the differentiator?

EDIT: To be clear, the monoculture of phone operating systems sucks. If this somehow enables more entrants into that space then I'm all for it. However, I don't see this in particular being the deciding factor... For example, the reason I don't run a 3rd party operating system on my phone isn't because it's lacking Siri or "OK Google" (if these things went away tomorrow I'd barely notice), it's because it would be a pain in the ass to make it be a phone.
