Openrouter Fusion API

Posted by tdchaitanya 6 days ago

217 points | 85 commentspage 2

ElFitz 5 days ago|

I’ve been experimenting with two things on this:

- multi-model consensus, with multiple cross-review rounds. Obviously, the number of inference tasks explodes with the number of models. Led to some interesting results [^0].

- giving an agent "stray thoughts" produced by the same model, or another, giving the second model a selection of the agent’s context, with different triggers (random, loop detection,…)[^1]. So far has proven very helpful and much cheaper than the first.

[0]: https://github.com/lightless-labs/refinery

[1]: https://github.com/Lightless-Labs/skunkworks/tree/main/flux

kgeist 5 days ago||

I wonder if regenerating the same prompt with the same model multiple times at a higher temperature would be equivalent to running different models. I suspect the perceived variance among different frontier models may be largely due to randomness associated with non-zero temperatures.

Models seem to be trained to return nice, round numbers of items, like 5, 10, or 15 (because of interference from training on marketing materials?) Plus, recall is far from 100% on large contexts. So if your code has 27 bugs, each run may find a different set of 10 issues out of the 27, whether you use several models or call the same one repeatedly.

bsenftner 5 days ago||

I'm sure many have made something like this, I've done a few. I've found simply submitting one's prompt to multiple models to be kind of pointless. You're just going to get statistical noise from the variances in their training methods, as they are all training on pretty much the same data.

I get significantly better results by pre-prompting each LLM (they can be the same LLM too, just another instance), I pre-prompt them to approach from a different perspective. Basically, I create expert personas that each believe they are someone of a different career, different intellectual perspectives, and then that generates a real debate between experts.

andai 5 days ago||

I was reminded of "model alloys", where they randomly select a LLM for every agentic turn. This significantly boosted performance on security work.

(10 points on the benchmark, or a relative increase of over 20%)

https://news.ycombinator.com/item?id=44630724

TFA on the other hand tests two things at once: mixing models, and "fuse a model with itself",! the latter being just test time compute. e.g. Opus was able to match Fable on TFA, at the cost of costing twice as much money (and presumably time).

These two dimensions are orthogonal but can be combined for further gains.

It's not clear that every task benefits from it though. The only benched deep research, and their results are a bit weird. (e.g. they have DeepSeek outranking frontier models.)

More research needed!

Oras 5 days ago||

Agree, and I see opus and Gemini pro as “quality” on openrouter fusion, this would be super pricy if the prompts are dynamic and not optimised for caching.

I would love to hear why they have created it, what was the business case, what this is going to serve? As you said, this is pretty easy to replicate

bsenftner 5 days ago||

[dead]

genxy 5 days ago||

It should be called something else, maybe Ensemble? It doesn't fuse anything.

andai 5 days ago||

Yeah the Rio thing is a better candidate for that word, where they averaged the weights for two models:

https://news.ycombinator.com/item?id=48528371

all2 5 days ago||

Ensemble or 'mixture of experts' (before MoE model architecture was a thing).

genxy 4 days ago||

The good names all get applied to the wrong things. It would be awesome if we could stop every couple years and define in-domain terms we can all agree on. Like how hallucination has won out over confabulation. The precise and accurate definition almost never wins out. :(

monkeydust 5 days ago||

I have been experimenting with multi-agent llms for last month, as I put in the writeup for my repo and in the video the biggest value I have found is when you run a bunch of different agentic strategies in parallel then have a judge review the variance of them. So far that has uncovered interesting insights. The rest of it is so-so. Been fun but also expensive!

Repo with video: https://github.com/monkeydust/rightmind

eknkc 5 days ago||

I opened the page and prompted it `Which 3d printer is the best`. I mean this is a stupid question but I was looking at some 3d printers so it popped into my mind.

Seeing this log is interesting: https://link.ekin.dev/6RzYGGX7

It came up with a decent response but I guess Opus or GPT 5.5 would do fine anyway. Gotta try it on different stuff. But this feels like it would work great on some situations.

rektlessness 5 days ago||

I tried OpenRouter Fusion with the budget model option but swapped out DeepSeek v3.2 for DeepSeek V4 Pro. The results weren't that bad. An interesting take on quorums for sure. However I did notice a tool call to Claude Opus 4.8 for 1168 - 237 tokens, and $0.0118 cost, which I cannot account for because Opus was not in my selection and only revealed in logs. Strange.

maccam912 5 days ago|

Same for me! I bet they use opus to synthesize the final answer somehow? Regardless, it was unexpected.

SteveMorin 5 days ago||

Yes believe opus is the default judge

rektlessness 5 days ago||

Perhaps, but shouldn't be at my own cost if not disclosed before hand.

Pranavsingh431 3 days ago||

Everyone is focused on model ensembles, but the bigger bottleneck for me is state.

A panel of models can debate for 5 minutes. The harder problem is making an agent remember why a decision was made 5 weeks ago.

bushido 5 days ago||

Interestingly I've had a similar experience with agent teams/swarms, albeit they can get much more expensive depending on the workflow.

I found that Fable didn't have as much of an impact when put in a team.

But it was/is a very pleasant model to work with 1:1. And was the first time I didn't use my primary team based workhorse in months, across 10s of sessions last week.

chrisss395 5 days ago|

Is there any formal research in this space? I too have tried flavors of this approach, but I can't confidently say my results were better.

I worry its kind of like asking 2-3 different consultants what the optimal strategy is for your business...and I'm not sure merging the answers produces anything material better.

More comments...