
Posted by HenryNdubuaku 19 hours ago

Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model (github.com)
Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.

We were always frustrated by how little effort goes into building agentic models that run on budget phones, so we investigated and arrived at an observation: agentic experiences are built on tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match the query to a tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.
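To make the "retrieval-and-assembly" framing concrete, here is a toy sketch of the task the model learns, with a crude word-overlap matcher standing in for the model's attention-based retrieval. The tool names and parameters are illustrative, not Needle's actual registry:

```python
import json

# Hypothetical tool schemas in a generic function-calling format
TOOLS = [
    {"name": "set_timer", "parameters": ["duration_minutes"]},
    {"name": "send_message", "parameters": ["recipient", "body"]},
]

def match_tool(query, tools):
    # Retrieval step: pick the tool whose name shares the most
    # words with the query (a stand-in for learned cross-attention)
    def overlap(tool):
        return len(set(query.lower().split()) & set(tool["name"].split("_")))
    return max(tools, key=overlap)

def assemble_call(tool, args):
    # Assembly step: emit the chosen tool and arguments as JSON
    return json.dumps({"name": tool["name"], "arguments": args})

tool = match_tool("set a timer for ten minutes", TOOLS)
call = assemble_call(tool, {"duration_minutes": 10})
```

The point is that no world knowledge is required: every fact the output needs (tool names, argument slots) is present in the input, which is why retrieval primitives suffice.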

We call the architecture Simple Attention Networks: the entire model is just attention and gating, with no MLPs anywhere. Needle is an experimental run at single-shot function calling for consumer devices (phones, watches, glasses...).
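A minimal pure-Python sketch of what an "attention and gating, no MLP" block could look like. The sigmoid-gated residual form here is our guess at the shape of such a block, not Needle's actual architecture; the linked writeup has the real details:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def cross_attention(query, keys, values):
    # Scaled dot-product attention of one query vector over a context
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

def gated_attention_block(x, context, gate_weights):
    # One block: cross-attend from token x into the tool context,
    # then mix via a sigmoid gate computed from x itself.
    attended = cross_attention(x, context, context)
    gate = [1.0 / (1.0 + math.exp(-(g * xi)))
            for g, xi in zip(gate_weights, x)]
    # Residual connection; no FFN follows
    return [xi + gi * ai for xi, gi, ai in zip(x, gate, attended)]

x = [0.5, -0.2, 0.1, 0.7]
context = [[0.1, 0.0, 0.3, -0.1], [0.4, 0.2, -0.2, 0.5]]
out = gated_attention_block(x, context, [1.0, 1.0, 1.0, 1.0])
```

Dropping the FFN removes the bulk of a transformer block's parameters, which is where most of the 26M-scale savings would come from.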

Training:

- Pretrained on 200B tokens across 16 TPU v6e (27 hours)
- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)
- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)

You can test it right now and finetune on your Mac/PC: https://github.com/cactus-compute/needle

The full writeup on the architecture is here: https://github.com/cactus-compute/needle/blob/main/docs/simp...

We found that the "no FFN" finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval in general). The model doesn't need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to be published.

While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope/capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.

This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle GitHub: https://github.com/cactus-compute/needle

512 points | 154 comments
roggenbuck 14 hours ago|
This is some excellent work Henry! Very excited to try it out.
HenryNdubuaku 14 hours ago|
Thanks, let me know how it goes!
cmrdporcupine 17 hours ago||
This is very cool. I'm going to try to carve out some time to try building this into my MOO system ( https://codeberg.org/timbran/moor / https://timbran.org/moor.html ) as an alternative command-parser front end.
Balinares 16 hours ago||
Man, I love that there are still people writing new MOO servers in 2026. Any game out there already running on mooR?
cmrdporcupine 15 hours ago||
Many people tease that they will, and start... but then kinda stop. But mostly just been building my own bespoke thing on my own bespoke platform, and kinda running out of steam because I need to make $$ instead.
Balinares 6 hours ago||
Ah, sad, but not surprising. The hard part of getting a game going is assembling and sustaining a community.
cmrdporcupine 3 hours ago||
My own interest / project isn't really in use for games, tbh. Historical background on MOO wasn't really on the gaming side, more social interaction. But similar constraints around community magnetism apply.
HenryNdubuaku 17 hours ago||
Thanks, let us know how it goes!
deepsquirrelnet 16 hours ago||
This is really cool. Any plans to release the dataset?
HenryNdubuaku 16 hours ago|
So far we include the dataset generation pipeline in the codebase; we might release the dataset itself.
theykk 13 hours ago||
hey nice work, is it possible to release the datasets?
HenryNdubuaku 13 hours ago|
We have so far released the dataset generation code
halyconWays 9 hours ago||
I assume this would only be useful as the second stage after a model like Whisper, as it can't understand speech where you'd want it, like on a phone or small device?
varispeed 15 hours ago||
What is the use case for this?
masafej536 9 hours ago||
Something like this together with MCP can replace APIs for 3rd party integrations. You just give it instructions to "post a message in slack" and provide it slack MCP tools and it figures out the rest on its own. No need to read up on slack API docs or worry about breaking changes.
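A sketch of what that flow might look like: an MCP-style tool definition the host app hands to the model, and the call the model emits back. The tool name and schema here are hypothetical; a real Slack MCP server's schema will differ:

```python
import json

# Hypothetical MCP-style tool definition for the Slack example
slack_tool = {
    "name": "slack_post_message",
    "description": "Post a message to a Slack channel",
    "inputSchema": {
        "type": "object",
        "properties": {
            "channel": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["channel", "text"],
    },
}

# Given "post 'build is green' in #ci", a function-calling model
# only has to emit something like:
call = {
    "name": "slack_post_message",
    "arguments": {"channel": "#ci", "text": "build is green"},
}

# The host app validates the call against the schema and forwards it
# to the MCP server; no Slack API knowledge lives in the app itself.
required = slack_tool["inputSchema"]["required"]
assert all(k in call["arguments"] for k in required)
print(json.dumps(call))
```

That separation is what makes breaking API changes the MCP server's problem rather than every integration's.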
HenryNdubuaku 14 hours ago||
Deploying AI on tiny devices like watches, earphones, glasses etc.
varispeed 13 hours ago||
Ok, but why? What is the use case?
chris_money202 13 hours ago||
I don't think the limit is just tiny devices. It can also be used in apps on generic computers, because it's so small that anything can run it reasonably quickly.

For example, I am thinking this could be helpful if you have complicated build and test infrastructure: fine-tune this model on that infrastructure, and then people can say more generic things like "build and run this library's tests", rather than issuing the exact commands or going to Claude, GHCP, etc.

BoredPositron 15 hours ago||
I source old, defective high-end radios with timeless designs from brands like Grundig or Braun, and replace the original hardware with a Raspberry Pi while using the original audio parts to build custom smart speakers. Reliable hotword detection and voice command recognition have been a persistent challenge over the years, but whisper and other small models have helped enormously. At the moment I have ollama running on my server with qwen 9b which works fine but a 26M that could be deployed on the pi itself would be amazing.
HenryNdubuaku 14 hours ago|
Sounds cool, play with it and let us know what you think!