The goal is that you would assign roles to models based on tasks, capabilities, and observed performance. The router would then take care of model selection in the background.
It's tricky, though. I probably have another two weeks before I can release the runtime.
I have a preview up at https://role-model.dev/
You can follow me on Twitter if you want updates (see profile)
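To make the role idea concrete, here is a minimal sketch of what role-based routing could look like. All names here (`Role`, `route`, the role and model strings) are illustrative assumptions, not the actual role-model.dev API:

```python
# Hypothetical sketch of role-based routing; role names, model names,
# and the fallback policy are assumptions, not role-model.dev's design.
from dataclasses import dataclass

@dataclass
class Role:
    name: str
    model: str          # model assigned to this role
    max_context: int    # rough capability constraint for the role

ROLES = {
    "summarize": Role("summarize", "qwen2.5-3b-instruct", 8192),
    "classify":  Role("classify",  "phi-3-mini",          4096),
    "general":   Role("general",   "gpt-4o",            128000),
}

def route(task: str, prompt_tokens: int) -> str:
    """Pick a model for a task, falling back to the 'general' role
    when the task is unknown or the prompt exceeds the assigned
    model's context window."""
    role = ROLES.get(task)
    if role is None or prompt_tokens > role.max_context:
        return ROLES["general"].model
    return role.model
```

The point of the indirection is that callers only name a role; swapping a model in or out (say, after observing poor classification accuracy) is a one-line change to the table.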
> “But Local Models Aren’t As Smart”
> Correct.
> But also so what?
> Most app features don’t need a model that can write Shakespeare, explain quantum mechanics, and pass the bar exam. They need a model that can do one of these reliably: summarize, classify, extract, rewrite, or normalize.
> And for those tasks, local models can be truly excellent.
I have tried quite a bunch of local models, and the reality is that it's not just a matter of "it's a small model that should be easy to host". It's also a matter of what prefill TTFT (time to first token) and decode t/s you find acceptable.
All the local models I've used on a _consumer grade_ server (32GB DDR5, AMD Ryzen) have been mostly unusable interactively (decent use as a coding agent isn't possible), and even for things like classification, context size immediately becomes an issue.
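A back-of-envelope calculation shows why context size bites so quickly on CPU. The throughput numbers below are assumed placeholders for illustration, not benchmarks of any particular machine:

```python
# Back-of-envelope CPU inference latency; the rates used in the
# example comment are assumptions, not measured numbers.
def ttft_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to first token: the whole prompt must be prefilled first."""
    return prompt_tokens / prefill_tps

def total_seconds(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Prefill wait plus decode time for the full response."""
    return ttft_seconds(prompt_tokens, prefill_tps) + output_tokens / decode_tps

# e.g. an 8k-token article at a hypothetical 50 t/s prefill means
# a 160-second wait before the first output token appears.
```

Because TTFT grows linearly with prompt length, a model that feels fine on one-sentence prompts can become unusable the moment you feed it a full article.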
I say that with six months of experience running various local models to classify and summarize my RSS feeds. Just summarizing and tagging the HN articles published on the front page, offline, barely keeps the queue from growing continuously.
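The "queue growing continuously" condition is just a rate comparison. The numbers in the example are hypothetical, not measurements from my setup:

```python
# Sketch of the queue-growth condition for a feed-processing backlog.
# Arrival and service rates in the example are assumed, not measured.
def queue_is_sustainable(articles_per_hour: float,
                         seconds_per_article: float) -> bool:
    """The backlog only stops growing if you can process articles
    at least as fast as they arrive."""
    service_rate = 3600.0 / seconds_per_article  # articles per hour
    return service_rate >= articles_per_hour

# e.g. with ~10 new front-page articles/hour, 6 minutes per article
# (summarize + tag) is exactly break-even; anything slower and the
# queue grows without bound.
```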
2) It's probably not the time/place to trouble-shoot your "consumer grade server" LLM experience, but if you're running on CPU (you don't mention a GPU) then yeah, your inference speed will be slow.
3) Counterpoint: my consumer-grade Macbook Pro (M1 Max, 64GB) runs Qwen3-30B-A3B fast enough to be very usable for regular interactive coding support. (And it would fly with smaller models performing simpler tasks.)
Used to take me maybe 10-20 minutes per sheet.
Then I got Codex to whip up a script that sends each sheet to a fairly low-parameter LLM running locally, and I have the YAML in a couple of seconds.
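A script like that can be sketched in a few lines. The endpoint and model name below assume an Ollama-style local server and are purely illustrative; the payload builder is separated from the network call so the prompt logic is testable offline:

```python
# Hypothetical version of such a sheet-to-YAML script; the endpoint,
# model name, and prompt wording are assumptions, not the original code.
def build_request(sheet_text: str, model: str = "llama3.2:3b") -> dict:
    """Build a chat payload asking the local model to emit YAML only."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "Convert the sheet below to YAML. Output only YAML."},
            {"role": "user", "content": sheet_text},
        ],
    }

# Sending is a separate step, e.g. with an Ollama-style server:
# requests.post("http://localhost:11434/api/chat", json=build_request(text))
```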
My dream is to bootstrap myself to local productivity with providers… I know I’ll never get there because hedonic treadmill etc, but I do feel there’s lots more juice to squeeze. I just need to invest more time into AI engineering…
I tried Cline and couldn't get it working well; part of the problem was that, at the time, it expected OpenAI's output format.
Small models are still in their infancy, and there's still much to sort out about (and around) them as well.
Information theory isn't magic: you'll never be able to compress the "knowledge" of a 1.5 TB model into a small model with equivalent fidelity.
By now it runs on 8GB of VRAM, so a Legion 5 at about $1500 could be a good workhorse.