Needle: Gemini Tool Calling Distilled Into a 26M Parameter Model — Tiny AI That Actually Calls Functions


A 26-million-parameter model just humiliated models ten times its size at the one thing that matters most for AI agents: calling the right tool at the right time.

Cactus Compute’s Needle, which rocketed to 423 points on Hacker News and 736 GitHub stars in under 24 hours, makes a strong case that function-calling intelligence doesn’t need billions of parameters. By distilling Gemini 3.1’s tool-calling capability into a radically minimalist 26M-parameter Simple Attention Network (SAN), Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function-call benchmarks. And it is small enough to run on a smartwatch.

This isn’t a toy. It changes the economics of on-device AI agents.

What Needle Is — and Isn’t

Let’s be precise about what Needle does and doesn’t do.

What it does: Given a user query and a set of tool definitions (JSON schema), Needle outputs the correct tool call with arguments. It’s a single-turn function-calling engine — you get: [{"name": "get_weather", "arguments": {"location": "San Francisco"}}].

What it doesn’t do: Conversational chat, multi-turn reasoning, general knowledge QA, or anything that requires broad language understanding. That’s not the point. Needle is a specialized router — the part of an agent pipeline that decides which function to invoke and with what parameters — a key concept in our AI agent glossary of 55 essential terms.
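To make the router role concrete, here is a minimal sketch of what the surrounding application might do with Needle's output: parse the JSON list of calls and hand each one to a matching Python function. The `HANDLERS` registry, the stub `get_weather` handler, and the `dispatch` helper are hypothetical names for illustration and are not part of the Needle package.

```python
import json


def get_weather(location: str) -> str:
    # Hypothetical handler; a real one would query a weather service.
    return f"Weather for {location}: (stub)"


# Hypothetical registry mapping tool names to Python callables.
HANDLERS = {"get_weather": get_weather}


def dispatch(model_output: str, handlers: dict) -> list:
    """Parse a JSON list of tool calls and invoke the matching handlers."""
    results = []
    for call in json.loads(model_output):
        name = call["name"]
        if name not in handlers:
            # The model picked a tool the application does not expose.
            raise KeyError(f"model selected unknown tool: {name}")
        results.append(handlers[name](**call.get("arguments", {})))
    return results


output = '[{"name": "get_weather", "arguments": {"location": "San Francisco"}}]'
print(dispatch(output, HANDLERS))
# ['Weather for San Francisco: (stub)']
```

Keeping the registry explicit means the model can only ever trigger functions the application has chosen to expose.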

This specialization is exactly what makes it so powerful for agentic workloads, fitting into the broader agent frameworks landscape.

The Architecture: Simple Attention Networks

Needle is built on a Simple Attention Network (SAN) — an architecture designed by the Cactus team specifically for tiny, efficient models. The full stack:

Component        Spec
Parameters       26 million
Embedding dim    512
Heads            8 heads / 4 KV heads (GQA)
Vocab size       8,192 (BPE)
Encoder layers   12 (no FFN)
Decoder layers   8
Pretraining      200B tokens on 16 TPU v6e (27 hours)
Post-training    2B tokens of single-shot function calls (45 minutes)
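As a rough sanity check on these figures, the sketch below counts the weights in one grouped-query attention block and in the embedding table. It rests on assumptions the spec does not state: the head dimension is 512 / 8 = 64, projections carry no biases, there is a single embedding table, and norms and gating parameters are small enough to ignore. The decoder's internal layout is not given, so the script only reports what is left over for it.

```python
D_MODEL = 512
N_HEADS = 8
N_KV_HEADS = 4
VOCAB = 8192
ENC_LAYERS = 12
TOTAL_PARAMS = 26_000_000  # headline figure, rounded

# Assumption: head_dim = d_model / n_heads, as in most GQA implementations.
head_dim = D_MODEL // N_HEADS    # 64
kv_dim = N_KV_HEADS * head_dim   # 256: K and V are projected to fewer heads


def attention_params(d_model: int, kv_dim: int) -> int:
    """Weights in one bias-free GQA block: Q, K, V and output projections."""
    q = d_model * d_model
    k = d_model * kv_dim
    v = d_model * kv_dim
    o = d_model * d_model
    return q + k + v + o


per_block = attention_params(D_MODEL, kv_dim)
embedding = VOCAB * D_MODEL
encoder = ENC_LAYERS * per_block  # attention-only encoder layers
remaining = TOTAL_PARAMS - embedding - encoder

print(f"per attention block: {per_block:,}")   # 786,432
print(f"embedding table:     {embedding:,}")   # 4,194,304
print(f"encoder (12 layers): {encoder:,}")     # 9,437,184
print(f"left for decoder:    {remaining:,}")   # 12,368,512
```

Under these assumptions the attention-only encoder costs about 9.4M parameters, which leaves roughly 12.4M for the eight decoder layers and everything else.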

The encoder-decoder architecture is notable for what it leaves out: there are no feed-forward networks in the encoder. The team found that attention-only encoding with gated residual connections was sufficient for the function-calling task, dramatically reducing parameter count.

“Needle is an experimental run for Simple Attention Networks, geared at redefining tiny AI for consumer devices — phones, watches, glasses,” the team writes in the README.

Benchmarks: What the Numbers Say

The comparison against established small models is striking:

Model                 Size    Single-shot function-call accuracy
Needle (SAN)          26M     Best
FunctionGemma-270M    270M    Beaten
Qwen-0.6B             600M    Beaten
Granite-350M          350M    Beaten
LFM2.5-350M           350M    Beaten

Needle has roughly 10x to 23x fewer parameters than these models, yet it wins on the narrow task it was designed for. This is the power of extreme distillation combined with architectural minimalism.

On production hardware (Cactus’s own inference stack), Needle achieves:

  • 6,000 tokens/sec prefill
  • 1,200 tokens/sec decode

These numbers make it viable for real-time agentic applications where every millisecond counts.
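To translate the throughput figures into per-request latency, a back-of-envelope estimate divides prompt tokens by prefill speed and output tokens by decode speed. The token counts below (300 for the query plus tool schemas, 24 for the emitted call) are illustrative assumptions rather than measurements, and the estimate ignores tokenization and model-loading overhead.

```python
PREFILL_TOKENS_PER_SEC = 6000
DECODE_TOKENS_PER_SEC = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time, in milliseconds."""
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_SEC
    decode_s = output_tokens / DECODE_TOKENS_PER_SEC
    return (prefill_s + decode_s) * 1000


# Hypothetical request: 300 prompt tokens in, 24 tokens of tool call out.
print(round(estimate_latency_ms(300, 24), 1))
# 70.0
```

At these rates a typical single-shot call lands in the tens of milliseconds, and prompt length (mostly the tool schemas) drives the cost as much as the output does.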

Why This Matters for AI Agents

Tool calling is the fundamental primitive of the agentic stack. Every time an agent needs to query a database, send an email, edit a file, or call an API, it goes through a function-calling layer. Getting this right — and getting it fast — has been a bottleneck for on-device agents.

The on-device implication: Until now, reliable function calling required either a cloud round-trip (latency, privacy, cost) or a model large enough to be impractical on consumer hardware. Needle changes that. A 26M model:

  • Needs ~104MB for weights in FP32, ~52MB in FP16, or ~13MB with 4-bit quantization
  • Runs on a phone CPU without a GPU
  • Can be fine-tuned on a laptop in minutes
  • Works on wearables — watches, glasses, earbuds
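Weight storage is simply parameter count times bytes per parameter, so the footprint at each precision is easy to check. The sketch below covers weights only; activations, KV cache, and runtime overhead come on top, and the 26M figure is the rounded headline number.

```python
PARAMS = 26_000_000

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_megabytes(n_params: int, bytes_per_param: float) -> float:
    """Weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return n_params * bytes_per_param / 1_000_000


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_megabytes(PARAMS, size):.0f} MB")
# fp32: 104 MB
# fp16: 52 MB
# int8: 26 MB
# int4: 13 MB
```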

The Distillation Pipeline

The team used Gemini 3.1 as the teacher model, generating 2 billion tokens of single-shot function-call training data. The dataset covers a wide range of tool schemas — simple single-parameter functions through complex nested-object schemas with optional fields, enums, and arrays.
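To show what a schema at the complex end of that range looks like, here is a hypothetical tool definition with a nested object, an optional enum field, and an array, together with a small checker for generated arguments. The schema, the `validate` helper, and the supported keyword subset (`type`, `required`, `properties`, `enum`, `items`) are illustrative assumptions. They are not Needle's data format or a full JSON Schema implementation.

```python
# Hypothetical parameters schema for a "create_meeting" tool.
SCHEMA = {
    "type": "object",
    "required": ["title", "attendees"],
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "normal", "high"]},  # optional
        "attendees": {"type": "array", "items": {"type": "string"}},
        "location": {  # optional nested object
            "type": "object",
            "required": ["room"],
            "properties": {"room": {"type": "string"}},
        },
    },
}

TYPES = {"object": dict, "array": list, "string": str,
         "number": (int, float), "boolean": bool}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    kind = schema["type"]
    # bool is a subclass of int in Python, so reject it explicitly for numbers.
    if not isinstance(value, TYPES[kind]) or (kind == "number" and isinstance(value, bool)):
        return [f"{path}: expected {kind}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in {schema['enum']}")
    if kind == "object":
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, item in value.items():
            if key in props:
                errors.extend(validate(item, props[key], f"{path}.{key}"))
            else:
                errors.append(f"{path}.{key}: unexpected field")
    if kind == "array" and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors


good = {"title": "Sync", "attendees": ["ana", "bo"], "location": {"room": "4B"}}
bad = {"title": "Sync", "priority": "urgent", "attendees": ["ana", 3]}
print(validate(good, SCHEMA))  # []
print(len(validate(bad, SCHEMA)))  # 2
```

A check like this is useful on both sides of the pipeline: filtering teacher-generated samples before training and guarding the small model's output before a call is executed.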

The post-training step took only 45 minutes on TPU hardware, which makes the distillation pipeline cheap enough that others could plausibly rerun it on their own tool domains.

Comparison with Other Tiny Models

Needle isn’t the only small model targeting function calling. The landscape includes:

  • FunctionGemma-270M: Google’s fine-tuned Gemma variant for function calling
  • Qwen-0.6B: Alibaba’s tiny general-purpose model with tool-calling fine-tunes
  • Granite-350M: IBM’s enterprise-focused small model
  • LFM2.5-350M: Liquid AI’s small Liquid Foundation Model

Most of these use conventional transformer architectures. Needle’s SAN approach is architecturally distinct, and the results suggest that for highly specialized tasks, simpler architectures with better training data can outperform bigger models with generic designs.

How to Use It

Getting started is straightforward:

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a web UI at http://127.0.0.1:7860 where you can test and fine-tune on your own tools. Weights are auto-downloaded.

For programmatic use:

from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]

The weights are fully open on Hugging Face, and the training data generation pipeline is also open-sourced.

What’s Next

The Needle team — Henry Ndubuaku, Jakub Mroz, Karen Mosoyan, Roman Shemet, Parkirat Sandhu, Satyajit Kumar, Noah Cylich, and Justin H. Lee — positions this as an experimental run for Simple Attention Networks. But the results are already production-viable for single-shot function calling.

The implications for the agent ecosystem are clear:

  • Privacy-preserving agents that do tool routing entirely on-device
  • Ultra-low-latency agent loops where tool selection happens in milliseconds
  • Wearable AI — the first generation of agentic smart glasses and watches that don’t need cloud connectivity for basic function calls
  • Specialized agent chips — if a 26M model can handle tool routing, it can be embedded in silicon

Needle proves something the AI agent community has suspected for months: the reasoning doesn’t have to be big. It just has to be focused.


Sources: Needle on GitHub, Hacker News Discussion (423 pts), Hugging Face Weights, Cactus Compute