Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks

Posted by zambelli 18 hours ago

Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.
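A quick sketch of that compounding math, under the simplifying assumption that steps fail independently (the function name is just for illustration):

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for steps in (1, 3, 5, 10):
        ok = workflow_success_rate(0.90, steps)
        print(f"{steps:>2} steps: {ok:.1%} success, {1 - ok:.1%} failure")
```

At 90% per step, five steps multiply out to 0.9^5 ≈ 0.59, which is where the roughly 40% failure rate comes from.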

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.
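For anyone unfamiliar with McNemar's test, here is a minimal sketch of the exact (binomial) variant on paired pass/fail outcomes. It assumes each run is paired across the "layer on" and "layer off" conditions; the pairing scheme and function name are my assumptions for illustration, not necessarily how the paper's ablation is implemented.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(layer_on: Sequence[bool], layer_off: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    if len(layer_on) != len(layer_off):
        raise ValueError("outcomes must be paired one-to-one")
    # Only discordant pairs (pass in one condition, fail in the other) matter.
    b = sum(on and not off for on, off in zip(layer_on, layer_off))
    c = sum(off and not on for on, off in zip(layer_on, layer_off))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # Under the null, discordant pairs split 50/50: binomial tail, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 10 discordant pairs that all favor the guardrail, this gives p = 2/1024, or about 0.002.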

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.
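A rough sketch of the idea, not Forge's actual code: the `ToolResolutionError` class below is a local stand-in defined for this example, and `lookup_customer`, `run_tool_step`, and the `resolved` flag are hypothetical names.

```python
class ToolResolutionError(Exception):
    """The tool ran without crashing but could not resolve the request."""


CUSTOMERS = {"c-001": {"name": "Ada"}}  # toy data for the example


def lookup_customer(customer_id: str) -> dict:
    record = CUSTOMERS.get(customer_id)
    if record is None:
        # The "404" case: raise instead of returning an empty value.
        raise ToolResolutionError(f"no customer with id {customer_id!r}")
    return record


def run_tool_step(tool, **kwargs) -> dict:
    """Run one tool call and report whether the step may be marked complete."""
    try:
        return {"resolved": True, "content": tool(**kwargs)}
    except ToolResolutionError as exc:
        # Surface the miss to the model so it can retry with other arguments.
        return {"resolved": False, "content": f"[ToolResolutionError] {exc}"}
```

The point of the split is that an empty lookup no longer looks like a successful step to the orchestrator.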

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.
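To illustrate how a token budget could be derived from VRAM, here is a sketch under my own assumptions. The budget formula, the 512 MiB headroom, and all function names are hypothetical and not taken from Forge; the `nvidia-smi` query flags are the standard ones, and the KV-cache size per token is the usual 2 × layers × KV heads × head dim × bytes per element.

```python
import subprocess


def parse_free_mib(smi_output: str) -> int:
    """Parse the first GPU's free memory (MiB) from nvidia-smi CSV output."""
    return int(smi_output.strip().splitlines()[0])


def query_free_vram_mib() -> int:
    """Ask nvidia-smi for free memory; call this before loading the model."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mib(out)


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def token_budget(free_mib: int, weights_mib: int, per_token_bytes: int,
                 headroom_mib: int = 512) -> int:
    """Tokens of KV cache that fit after weights and a safety margin."""
    usable_bytes = (free_mib - weights_mib - headroom_mib) * 1024 * 1024
    return max(0, usable_bytes // per_token_bytes)
```

For an illustrative model shape (32 layers, 8 KV heads, head dim 128, fp16 cache) with 12288 MiB free and 5120 MiB of weights, this works out to 53248 tokens. Capping the context at that number is what keeps the backend from spilling to CPU.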

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest mode and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers are based on pre-v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

371 points | 138 comments

pianopatrick 5 hours ago|

Do you think a similar approach would work with smaller models, like 1.5B models?

zambelli 5 hours ago|

I would expect so! I'm currently running Gemma 4 E4B evals and it's behaving the same. Better with guardrails. There might be a floor where any error nudge confuses the model more than helps, but I haven't found it across many 8B families and now Gemma 4 E4B.

xiaod 10 hours ago||

I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?

zambelli 10 hours ago|

Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.
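A simplified sketch of what such a pre-execution check could look like. The `PREREQS` table, `validate_call`, the `report_sales` tool, and the `[UnknownToolError]` label are hypothetical names for illustration; only the `[PrereqError]` message format mirrors the example above.

```python
from typing import Optional, Set

# Hypothetical workflow: each tool lists the tools that must run before it.
PREREQS = {
    "fetch_sales_data": [],
    "analyze_sales": ["fetch_sales_data"],
    "report_sales": ["analyze_sales"],
}


def validate_call(tool_name: str, completed: Set[str]) -> Optional[str]:
    """Return an error string for the tool-result channel, or None if OK."""
    if tool_name not in PREREQS:
        return (f"[UnknownToolError] {tool_name} is not a known tool. "
                f"Available tools: {', '.join(PREREQS)}")
    missing = [p for p in PREREQS[tool_name] if p not in completed]
    if missing:
        return (f"[PrereqError] {tool_name} requires {missing[0]} "
                f"to be called first. Available next steps: "
                f"{', '.join(missing)}")
    return None  # safe to run the real tool function
```

Because the check runs before the tool function, the bad call never touches the real implementation; the model just sees an ordinary tool result and corrects itself on the next turn.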

We also have rescue parsing for known templates (JSON in the OpenAI style, XML-like as in Granite, etc.) where we try to parse tool calls that might be malformed.
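As a rough illustration of rescue parsing (not Forge's parser), the sketch below tries a few candidate spans in order. The `<tool_call>` tag and the `name`/`arguments` shape are assumptions made for the example, not any particular model's template.

```python
import json
import re
from typing import Optional

# Hypothetical XML-like wrapper; real templates differ per model family.
TAGGED_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def rescue_tool_call(text: str) -> Optional[dict]:
    """Try to recover a {"name": ..., "arguments": ...} call from raw text."""
    candidates = [text]
    candidates += TAGGED_CALL.findall(text)
    # Fall back to the outermost {...} span to drop chatty text around it.
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            call = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(call, dict) and "name" in call:
            return call
    return None  # nothing recoverable; treat as a bare text response
```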

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

simonw 5 hours ago||

This is a neat project, but the description made me realize that I don't actually know what the term "guardrails" means.

... which led me to realize that it's one of those terms with multiple meanings - like "agent" or even "AI" itself - but where people who use it may not be aware of how many different definitions are floating around.

In this project it refers to validating tool calls - fixing invalid tool responses, making sure certain required tool calls have been made, maintaining an error budget after which the task is abandoned with an error.

Other projects might use "guardrails" to mean protecting against unsafe content (Llama Guard), refusing off-topic queries (NVIDIA NeMo Guardrails "topical rails"), filtering PII, detecting jailbreaks, or human-in-the-loop checks of specific actions.

I've even seen people talk about running a coding agent in a sandbox (Docker, Firecracker etc) as a form of guardrail.

zambelli 5 hours ago|

That's a fair point, and frankly something that might not age well in my docs one day. I genuinely don't know what the industry will standardize on when it comes to the use of the term "guardrails". I've seen the sec definitions as well.

You're 100% right about how I meant it and what it means within Forge though, but it's something that might lead to doc changes as things evolve.

trollbridge 2 hours ago||

I'm thinking of it like a guardrail that keeps your car from driving off the edge of a road, but in this case, it keeps your tool calls from driving off a cliff.

rebekkamikkoa 9 hours ago||

Hi Antoine!

Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?

zambelli 9 hours ago|

Hi! Yes, I definitely think so. I've seen variance across all model families I looked at. The magnitude changes, but the presence of variance is a constant.

GrinningFool 6 hours ago||

That's a huge gap for llama.cpp server - any idea why?

zambelli 5 hours ago|

Best guess is it's native mode. The function calling template is just broken for Nemo.

I did go with an extreme (but true) example in the post. Other deltas are smaller but still statistically significant: a 30-point swing between llama-server in prompt mode and Ollama, and a 4-5 point swing between Llamafile and llama-server in prompt mode.

yieldcrv 2 hours ago||

impressive, we can get high tokens/s with 8B param models and double it with MTP

zambelli 1 hour ago|

Yeah, throughput on small models can get really fun :). As for MTP, it should work fine since forge just sits between model and consumer. As long as MTP doesn't change the model endpoint contract (i.e., you call llama.cpp the same way you would normally) then it should work out of the box. But I haven't tested MTP myself yet (or that commit of llama.cpp).

jedisct1 6 hours ago||

Interesting!

The https://swival.dev harness already has retry nudges, step enforcement, error recovery, context awareness, etc. to try to support small models as much as possible.

Curious to see how it compares with forge, and if both could be combined.

zambelli 5 hours ago|

Oh interesting - I hadn't come across that!

I'd assume they could be combined. A coding harness would own the agentic workflow by nature, forge guardrails would help tool calling.

I haven't given it a thorough read yet but I think their guardrails might be more focused on the workflow level. They are doing error capture at tool level with warnings to the model, but I'd need to dig deeper. On the surface definitely the same design philosophy! Maybe Forge makes error nudges more of a first-class citizen?

Our compaction strategies might be the most similar of all the pieces. Cool find!

choonway 4 hours ago||

no different from how the mcdonalds system can turn any random person on the street to a smiling cog in the machine.

Bret_McKinney 4 hours ago||

[flagged]

snovv_crash 9 hours ago|

I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?

zambelli 9 hours ago||

I definitely use LLMs to help write things - but this is my draft!

Maybe I've been spending too much time reading the evals and I now sound like an LLM...

Either way, here I am - happy to answer any questions!

snovv_crash 9 hours ago||

I guess it's that, and yes, much as they learned speech patterns from us, now we start to learn from them.

I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...

Best of luck with Forge!

throwaway20222 9 hours ago|||

If you are so outright against using AI, why would he care if you read his article about AI?

snovv_crash 9 hours ago||

AI usage is great. The problem is the asymmetry in effort between generating text automatically, and then further amplifying this via posting it, while then expecting human eyeballs to spend the time reading it. It is antisocial.

If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Bryan Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576

Karuma 2 hours ago||

Thank you for mentioning it. Too bad you got downvoted to hell as usual when anybody dares to do it.

The original post and every comment by OP are so full of AI slop ("the biggest surprise!", "one thing I didn't expect!", "the biggest challenge!", etc. etc.) that it is absolutely painful to read. I still can't believe most people (especially here on HN, I thought we were a bit better than this) can't notice all this stuff.

What's much worse is that all these people posting this useless slop are so dishonest ("I definitely use LLMs to help write things - but this is my draft!") that it makes me really nauseous... This is the worst time to be an internet user if you have more than 2 points of IQ.

zambelli 1 hour ago||

I'm sorry you feel that way about my posts - hopefully you still find the work valuable. Still human here btw, and still 100% honest.