Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Posted by zambelli 17 hours ago

Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks(github.com)

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

347 points | 128 commentspage 3

lucrbvi 8 hours ago|

How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?

[0]: https://github.com/dottxt-ai/outlines

zambelli 7 hours ago|

I only just skimmed it, but will try to dive deeper in a bit.

I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.

I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.

Once I dig into it more I'll try to highlight other deltas.

yieldcrv 1 hour ago||

impressive, we can get high tokens/s with 8B param models and doubling it with MTP

zambelli 11 minutes ago|

Yeah, throughput on small models can get really fun :). As for MTP, should work fine since forge just sits between model and consumer. As long as MTP didn't change the model endpoint contract (ie, you call llama.cpp the same way you would normally) then it should work out of the box. But I haven't tested MTP myself yet (or that commit of llama.cpp).

__mharrison__ 7 hours ago||

Curious if this would help larger local models? Qwen 3.6 varieties of deepseek4?

zambelli 7 hours ago|

Yes it does! I haven't published those evals yet, but I'm actually running 24-35B class models on a custom coding harness built on forge (even 120B class recently).

I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.

But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged doe better than bare for them as well :)

kgeist 4 hours ago|||

>But the results are the same. Reforged models do better than bare, even at those sizes

>I haven't published those evals yet

Don't forget to post the complete settings for those evals, please, because local LLMs' failure modes are often caused by incorrect setups (bad quants, bad chat templates, non-recommended temperatures, ridiculously small context, not enabling "preserve thinking" etc.). In my setup I've never seen Qwen3.6-27b get truly stuck so far. What it usually gets wrong are poor architectural decisions or forgetting to update something.

zambelli 3 hours ago||

Good call! The latest forge version has per-model-parameter configs sourced from official sources (can be overridden), that's what I'll use for evals and each eval set will be paired with a commit hash. But I'll make sure to call out the location of the params and maybe highlight some for the popular models.

For the paper - more academic in nature - I wanted to isolate the model performance variable from guardrail lift. The delta is what mattered more than final score. For the paper, everyone got temp=0.7 - that was intentional.

As for Qwen3.6, it's really solid. It'll do really well on forge I can call that now. When I pushed it into agentic coding specifically and the eval suite I use there (separate from forge), even it needed help on long-running tasks - but it's definitely a top model right now.

However, entirely possible there are better settings than the "official recommendations" I found - which would be a neat finding in itself.

happycube 7 hours ago||||

If it's worth it to you, you could try running it on Deepseek v4 flash which is very cheap right now...

trollbridge 7 hours ago|||

Exactly what I was thinking - even on frontier or near-frontier models I still see my agents get stuck in these pointless loops where it's very obvious to me what they need to do to get "unstuck".

zambelli 5 hours ago||

Yeah, it's a useful framework even with frontier. And it definitely lifts "cheap" frontier models like Haiku into more solid territory. I haven't done a ton of forge integrations into frontier (like pointing claude code into proxy mode) yet, but if you run into any issues let me know!

trollbridge 3 hours ago|||

I'm attempting to make a replica of your Anthropic method that will do the same for DeepSeek. I'll let you know how it goes.

For our local Qwen, your setup works great out of the box!

trollbridge 1 hour ago|||

And we're off! It's working great with DeepSeek V4, although DeepSeek V4 Pro tends not to really run into problems anyway being near-frontier, but I definitely see improvement with Flash.

zambelli 1 hour ago||

That was fast! It's great to hear it's working well :)

Did you notice any particular guardrails firing? Always curious about things I haven't tested on - especially if it has a different shape.

mholubowski 9 hours ago||

Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).

zambelli 9 hours ago|

Neat! Historically I've been most active on LinkedIn but the AI community seems very X-leaning so I'll make sure to pay closer attention there. Good luck with the harness, happy to connect!

zambelli 10 hours ago||

Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.

schaefer 8 hours ago||

super interesting work. It will take me a few days to dig in and really understand it. But I'm looking forward to it.

I run small models at home, so I'm very curious.

zambelli 8 hours ago||

That's awesome! Let me know if quick start is causing issues or anything else you'd like to dig into.

Out of curiosity, what models are you running?

fabian_shipamax 9 hours ago||

dashboard link is dead

zambelli 9 hours ago||

Does this work? https://github.com/antoinezambelli/forge/tree/main/docs/resu...

schaefer 9 hours ago||

yes, that link works for me.

dpweb 9 hours ago||

Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?

zambelli 9 hours ago|

Very cool! I would look at the tokens returned by each of the calls. You can map those to API costs per input/output tokens. Forge should be capturing those (or can, as passthrough from llama.cpp).

At least, if I understand your economic benefit angle correctly.

For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".

Topology1 3 hours ago||

The dashboard github link appears to be broken

zambelli 3 hours ago|

Yeah I'm sorry about that - I thought that link would work. Here is the fixed one (dashboard inside): https://github.com/antoinezambelli/forge/tree/main/docs/resu...

tim-projects 6 hours ago||

I've been working on the same thing and even nearly called it forge. Instead I called it hammer.

I'll be keen to look through the code on this!

zambelli 6 hours ago|

Oh no! I have code-hammer coming out soon :D. Everyone is building stuff these days :p.

Always happy to see folks looking into small local models!

pianopatrick 4 hours ago||

Do you think a similar approach would work with smaller models, like 1.5B models?

zambelli 4 hours ago|

I would expect so! I'm currently running Gemma 4 E4B evals and it's behaving the same. Better with guardrails. There might be a floor where any error nudge confuses the model more than helps, but I haven't found it across many 8B families and now Gemma 4 E4B.

xiaod 9 hours ago|

I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?

zambelli 9 hours ago|

Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.

We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

More comments...