Posted by zambelli 16 hours ago
I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.
What it does:
- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware
- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it
- Ships with an eval harness and interactive dashboard so you can reproduce every number
I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.
Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)
The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:
- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.
- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.
- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.
I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).
The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.
One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.
Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.
Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.
How to try it:
- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.
- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.
- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.
Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.
Repo: https://github.com/antoinezambelli/forge
Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...
Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...
I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.
The proxy mode should integrate seamlessly, and the middleware guardrail mode could be lifted into pi.
As for little coder, I love it! I wanted forge to be more generic than just agentic coding as there's many more agentic workflows worth optimizing with small models.
Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright
The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.
Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.
Forge is just trying to make sure that when the model decides to do something, thee execution is reliable.
As for software integration, let me know if you run into any issues and I'll be happy to take a look or try to patch something!
Harnesses as first class infra all the way. I'll take a look at your work and see if I spot any obvious tensions.
In a nutshell, it applies guardrails around LLM calls to make them more reliable - specifically small models but works on all: "on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).".
It'll try to parse malformed tool calls, it'll automatically compact if needed, it'll enforce any workflow requirements you define (ie, read before edit) - and it does so with domain-agnostic guardrails. It catches and feeds errors back to the model in a structured way so the model self-corrects (hopefully).
Each guardrail can be removed as desired by a consumer. It can be used as a building block library (WorkflowRunner approach), it can be integrated into existing source (middleware), or it can be a drop-in addition to an exiting workflow (proxy mode).
Name was just a portmanteau of Calcifer's forge, because Howl’s moving castle seemed like a good metaphor for what I was trying to do… I had synthetic models as apiece there but I realized a) it was out of place and b) it was my favorite feature there
Big frontier models need this less than small models.
So basically the kind of thing I'd usually be doing manually with small models, over and over again, you just automate that nudging and off they go.
Sometimes LLMs have seemed to me like "computer programs with inertia" and in that frame what your tool does is identify and reduce friction at key points so the wheels can keep spinning.
Without forge, I'd guess a small model used for Hermes would have to retry entire workflows when an uncaught exception triggerd when it tried to reply with text when "calling a tool" ("Here is the tool call: [json blob]"). The issue there becomes partial successes can lead to state changes that need to be addressed (it booked the flight already, home it doesn't double-book).
Forge won't help with model reasoning quality though. If it the model thinks the right thing to do is to book 3 buses for your trip, forge doesn't care, it'll just make sure those api calls land.
The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.
Plus it's cool to see a little 8B model writing code :)
> python -m forge.proxy --backend-url http://localhost:8080 --port 8081
This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.
In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)
/v1/chat/completions is the entry point.
In proxy mode, here's what forge applies on each request (handler.py builds these):
Response validation: ResponseValidator(tool_names) checks each tool call against the declared tools array. If the model emits a call to a name not in tools[], or a malformed call shape, it's caught before the response goes back.
Rescue parsing: When the model emits tool calls in the wrong format — JSON in a code fence, [TOOL_CALLS]name{args} (Mistral), <tool_call>...</tool_call> (Qwen XML) — rescue parsers extract the structured call and re-emit it in the canonical OpenAI tool_calls schema. This is the biggest practical lift, especially on Mistral-family models that ignore native FC and emit their own bracket syntax.
Retry loop with error tracking: ErrorTracker(max_retries=N) — if validation fails, forge retries inference up to N times with a corrective tool-result message on the canonical channel, rather than returning a malformed response to your caller. From your perspective the proxy looks like a single request that just took a few extra ms.
What proxy mode does NOT do (because it's single-shot, not multi-turn): prerequisite/step enforcement (those need a workflow definition spanning turns), context compaction, session memory. For that surface you wrap the WorkflowRunner class in Python — proxy mode trades that depth for "use forge with your existing setup, no Python rewrite."
So yes — the proxy is fortifying the response shape and retry behavior of /v1/chat/completions. The full agentic guardrails are at the Python class level above it.
For greenfield projects, I've been building on forge native using WorkflowRunner so I get all guardrails. But obviously as a drop-in replacement in existing systems then proxy is the way to go.
I'm definitely still iterating on forge, but so far sending the model a friendly and gracefully handled error message works wonders (instead of barfing a stack trace or something).
And if you didn't mean that then please elaborate :)
Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?
How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.
Latency is dependent on the guardrails firing, effectively. If nothing fires, it's a passthrough, for all intents and purposes, very little overhead. But if a retry nudge fires then that's another LLM call.
As a consumer for a home assistant, a retry nudge firing is something I'd catch, and have my voice model output a pre-baked "one sec, trying again" sort of filler message or something.