Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Posted by zambelli 16 hours ago

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks(github.com)

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

330 points | 123 commentspage 2

tempoponet 5 hours ago|

Why this entire tool chain instead of building within something like pi code?

I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.

zambelli 5 hours ago|

Mainly because I have plenty of use cases and not all of them need or want pi. Forge isn't an orchestration framework and is not coding specific, it lives one level lower - if I understand pi correctly.

The proxy mode should integrate seamlessly, and the middleware guardrail mode could be lifted into pi.

As for little coder, I love it! I wanted forge to be more generic than just agentic coding as there's many more agentic workflows worth optimizing with small models.

blurbleblurble 4 minutes ago||

Wouldn't the nice place to integrate something like this be at the context protocol layer? For example in MCP / A2A?

bglusman 6 hours ago||

Funny timing. I’ve been building something adjacent, though from a different angle: not primarily local-model reliability, but a control layer around agent execution, tools, routing, and operator intent. I was calling these "synthetic models", but decided yesterday "LLM middleware" is a clearer description.

Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright

The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.

Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.

zambelli 6 hours ago||

Conceptually I think definitely! Forge has no opinion on what the agent should be trying to do, that's the "middleware"'s job, so to speak.

Forge is just trying to make sure that when the model decides to do something, thee execution is reliable.

As for software integration, let me know if you run into any issues and I'll be happy to take a look or try to patch something!

Harnesses as first class infra all the way. I'll take a look at your work and see if I spot any obvious tensions.

esperent 3 hours ago|||

I've just read through your readme and I have zero clue what this does. Something about proxying model calls and applying "policies" to them? But what kind of things does it actually do, what benefits are there? That should be at the top of the readme.

zambelli 2 hours ago|||

I'm sorry to hear that! I'll take a fresh look at docs in my upcoming release.

In a nutshell, it applies guardrails around LLM calls to make them more reliable - specifically small models but works on all: "on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).".

It'll try to parse malformed tool calls, it'll automatically compact if needed, it'll enforce any workflow requirements you define (ie, read before edit) - and it does so with domain-agnostic guardrails. It catches and feeds errors back to the model in a structured way so the model self-corrects (hopefully).

Each guardrail can be removed as desired by a consumer. It can be used as a building block library (WorkflowRunner approach), it can be integrated into existing source (middleware), or it can be a drop-in addition to an exiting workflow (proxy mode).

bglusman 1 hour ago||

I think that comment was aimed at my Wardwright link, not Forge, given mention of policies and proxying model calls! I think your docs are in much better shape ;-)

zambelli 1 hour ago||

lol - my bad! but thanks!

bglusman 1 hour ago|||

[flagged]

bglusman 5 hours ago||

Ironically, the project this idea emerged out of for me is also called Forge, actually Calciforge… https://calciforge.org / https://github.com/bglusman/calciforge

Name was just a portmanteau of Calcifer's forge, because Howl’s moving castle seemed like a good metaphor for what I was trying to do… I had synthetic models as apiece there but I realized a) it was out of place and b) it was my favorite feature there

tommica 9 hours ago||

What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?

zambelli 9 hours ago|

It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [json blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.

Big frontier models need this less than small models.

blurbleblurble 6 minutes ago||

Nice explanation, thank you.

So basically the kind of thing I'd usually be doing manually with small models, over and over again, you just automate that nudging and off they go.

Sometimes LLMs have seemed to me like "computer programs with inertia" and in that frame what your tool does is identify and reduce friction at key points so the wheels can keep spinning.

roger_ 2 hours ago||

Would putting this between a small model and an agent like Hermes improve performance?

zambelli 2 hours ago|

I haven't specifically tested this with Hermes, but I would expect so. Hermes is orchestrating things - it decides it needs to...whatever you want, book a trip for you. Forge will help make sure that the API calls to hotel booking sites parse correctly or gracefully retry.

Without forge, I'd guess a small model used for Hermes would have to retry entire workflows when an uncaught exception triggerd when it tried to reply with text when "calling a tool" ("Here is the tool call: [json blob]"). The issue there becomes partial successes can lead to state changes that need to be addressed (it booked the flight already, home it doesn't double-book).

Forge won't help with model reasoning quality though. If it the model thinks the right thing to do is to book 3 buses for your trip, forge doesn't care, it'll just make sure those api calls land.

k__ 9 hours ago||

So, this basically ensures that models call the right tools with the correct format?

zambelli 9 hours ago|

In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.

k__ 9 hours ago||

Ah, so it a "smart" retry mechanism?

zambelli 8 hours ago||

I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).

The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.

jimmySixDOF 7 hours ago||

Maybe similar to Instructor [1] which was a cool tool for json and structured output enforcement combining pydandic with ai retry loops very handy for when models don't have that covered

[1] https://github.com/567-labs/instructor

zambelli 7 hours ago||

Interesting! I'll look into that. Would mean another dep/integration but might be more robust.

ElenaDaibunny 1 hour ago||

guardrails this well-designed matter way more than just throwing bigger models at agent tasks tbh

zambelli 1 hour ago|

Thank you! I completely agree - especially for always-on systems like agents crawling databases or doing audits and the like. The sheer volume of calls will be enormous and being able to run it on simple hardware with a small model that fits instantly changes the economics of it.

Plus it's cool to see a little 8B model writing code :)

nzeid 7 hours ago||

> # External mode — you manage llama-server, forge proxies it

> python -m forge.proxy --backend-url http://localhost:8080 --port 8081

This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.

In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)

zambelli 7 hours ago|

Yes that's correct !

/v1/chat/completions is the entry point.

In proxy mode, here's what forge applies on each request (handler.py builds these):

Response validation: ResponseValidator(tool_names) checks each tool call against the declared tools array. If the model emits a call to a name not in tools[], or a malformed call shape, it's caught before the response goes back.

Rescue parsing: When the model emits tool calls in the wrong format — JSON in a code fence, [TOOL_CALLS]name{args} (Mistral), <tool_call>...</tool_call> (Qwen XML) — rescue parsers extract the structured call and re-emit it in the canonical OpenAI tool_calls schema. This is the biggest practical lift, especially on Mistral-family models that ignore native FC and emit their own bracket syntax.

Retry loop with error tracking: ErrorTracker(max_retries=N) — if validation fails, forge retries inference up to N times with a corrective tool-result message on the canonical channel, rather than returning a malformed response to your caller. From your perspective the proxy looks like a single request that just took a few extra ms.

What proxy mode does NOT do (because it's single-shot, not multi-turn): prerequisite/step enforcement (those need a workflow definition spanning turns), context compaction, session memory. For that surface you wrap the WorkflowRunner class in Python — proxy mode trades that depth for "use forge with your existing setup, no Python rewrite."

So yes — the proxy is fortifying the response shape and retry behavior of /v1/chat/completions. The full agentic guardrails are at the Python class level above it.

For greenfield projects, I've been building on forge native using WorkflowRunner so I get all guardrails. But obviously as a drop-in replacement in existing systems then proxy is the way to go.

cyanydeez 7 hours ago||

the funniest thing I see in opencode with tool calling is the model calls 10.0 and opencode says it's an error because the spec is an integer, even though it's obvious to anyone that if a float can be coerced properly to a integer, then that should be a success.

zambelli 6 hours ago||

Yeah it's a delicate balance between precise and silly, and too permissive.

I'm definitely still iterating on forge, but so far sending the model a friendly and gracefully handled error message works wonders (instead of barfing a stack trace or something).

MWil 2 hours ago||

have you considered implementing the addition of a leading canary sentinel that fires at the earliest/cheapest possible point instead of only on lag of some actual load-bearing constraint violation?

zambelli 2 hours ago|

Do you mean catching errors as tokens stream back versus waiting for the full message? If so, then no I hadn't looked into that. This was mostly geared towards local models so token cost isn't really a big deal, though latency might be.

And if you didn't mean that then please elaborate :)

jamesponddotco 6 hours ago||

This seems pretty awesome; being able to use an 8B model for tool calling would be perfect.

Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?

How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.

zambelli 6 hours ago|

I have an open GitHub issue for macOS hardware detection. I don't have a Mac myself to do dev on but happy to accept a fork! I did assign a buddy to that issue but she's been slacking - call her out :p.

Latency is dependent on the guardrails firing, effectively. If nothing fires, it's a passthrough, for all intents and purposes, very little overhead. But if a retry nudge fires then that's another LLM call.

As a consumer for a home assistant, a retry nudge firing is something I'd catch, and have my voice model output a pre-baked "one sec, trying again" sort of filler message or something.

Topology1 2 hours ago|

The dashboard github link appears to be broken

zambelli 2 hours ago|

Yeah I'm sorry about that - I thought that link would work. Here is the fixed one (dashboard inside): https://github.com/antoinezambelli/forge/tree/main/docs/resu...

More comments...