- multi-model consensus, with multiple cross-review rounds. Obviously, the number of inference tasks explodes with the number of models. Led to some interesting results [^0].
- giving an agent "stray thoughts" produced by the same model or another one: the second model gets a selection of the agent’s context and is fired by different triggers (random, loop detection,…) [^1]. So far this has proven very helpful, and much cheaper than the first approach.
[^0]: https://github.com/lightless-labs/refinery
[^1]: https://github.com/Lightless-Labs/skunkworks/tree/main/flux
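To put a rough number on "explodes": a minimal sketch, assuming a hypothetical all-pairs scheme where each model answers once and then, in every review round, each model reviews every other model's answer. This is only an illustration; the linked repo may well schedule its rounds differently.

```python
def consensus_calls(n_models: int, review_rounds: int) -> int:
    """Count inference calls for a hypothetical all-pairs cross-review scheme.

    Assumed protocol (illustrative, not taken from any specific tool):
      - each model produces one initial answer: n calls
      - per round, every model reviews every other model's answer: n * (n - 1) calls
    """
    initial = n_models
    per_round = n_models * (n_models - 1)
    return initial + review_rounds * per_round


if __name__ == "__main__":
    for n in (2, 3, 5, 8):
        # Quadratic growth in the number of models, linear in the number of rounds.
        print(n, consensus_calls(n, review_rounds=2))
```

Under that assumption, going from 3 to 8 models with two review rounds takes you from 15 to 120 calls, which is why a single cheap "stray thought" call looks attractive by comparison.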
Models seem to be trained to return nice, round numbers of items, like 5, 10, or 15 (because of interference from training on marketing materials?). Plus, recall is far from 100% on large contexts. So if your code has 27 bugs, each run may find a different set of 10 issues out of the 27, whether you use several models or call the same one repeatedly.
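A back-of-the-envelope sketch of what that implies for repeated runs, assuming (a strong simplification) that every run reports 10 of the 27 bugs chosen uniformly at random and independently of other runs. Real models are correlated and tend to find the same "easy" bugs, so treat this as an optimistic upper bound on coverage.

```python
import random


def expected_found(total: int, per_run: int, runs: int) -> float:
    """Expected number of distinct bugs found after `runs` independent runs.

    A given bug is missed by one run with probability 1 - per_run / total,
    so it is missed by all runs with that probability raised to `runs`.
    """
    miss_all = (1 - per_run / total) ** runs
    return total * (1 - miss_all)


def simulate(total: int, per_run: int, runs: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    found_sum = 0
    for _ in range(trials):
        found = set()
        for _ in range(runs):
            # One run: a uniformly random subset of `per_run` bugs.
            found.update(rng.sample(range(total), per_run))
        found_sum += len(found)
    return found_sum / trials


if __name__ == "__main__":
    for runs in (1, 2, 3, 5, 8):
        print(runs, round(expected_found(27, 10, runs), 1), round(simulate(27, 10, runs), 1))
```

Even under this idealized model, three runs are expected to cover only about 20 of the 27 bugs, so taking the union of several runs helps but converges slowly.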
I get significantly better results by pre-prompting each LLM (they can be the same LLM too, just another instance) to approach the problem from a different perspective. Basically, I create expert personas that each believe they have a different career and a different intellectual perspective, and that generates a real debate between experts.
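A minimal sketch of that persona setup, with everything here being illustrative: the persona texts, `build_prompts`, `run_panel`, and the `call_model` callable are hypothetical names, and `call_model` stands in for whatever client you actually use to send a prompt and get text back.

```python
from typing import Callable, Dict

# Hypothetical personas; the wording is only an example.
PERSONAS: Dict[str, str] = {
    "security engineer": "You care most about attack surface and unsafe input handling.",
    "performance engineer": "You care most about latency, memory use, and scaling behavior.",
    "long-time maintainer": "You care most about readability and how this code ages.",
}


def build_prompts(task: str, personas: Dict[str, str]) -> Dict[str, str]:
    """Build one pre-prompted request per persona for the same task."""
    return {
        role: f"You are a {role}. {stance}\n\nTask: {task}"
        for role, stance in personas.items()
    }


def run_panel(task: str, personas: Dict[str, str],
              call_model: Callable[[str], str]) -> Dict[str, str]:
    """First round: every persona answers independently."""
    return {role: call_model(p) for role, p in build_prompts(task, personas).items()}


def debate_round(task: str, personas: Dict[str, str], answers: Dict[str, str],
                 call_model: Callable[[str], str]) -> Dict[str, str]:
    """Second round: each persona sees the others' answers and responds."""
    replies = {}
    for role, prompt in build_prompts(task, personas).items():
        others = "\n\n".join(f"[{r}] {a}" for r, a in answers.items() if r != role)
        replies[role] = call_model(
            f"{prompt}\n\nOther experts said:\n{others}\n\nWhere do you disagree, and why?"
        )
    return replies


if __name__ == "__main__":
    # Stub model so the sketch runs without any API: it echoes the persona line.
    def fake_model(prompt: str) -> str:
        return "reply to: " + prompt.splitlines()[0]

    first = run_panel("Review this function for bugs.", PERSONAS, fake_model)
    second = debate_round("Review this function for bugs.", PERSONAS, first, fake_model)
    for role, text in second.items():
        print(role, "->", text)
```

The same `call_model` can point at one model or at several; the only thing that differs between panel members is the persona text prepended to the task.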
(10 points on the benchmark, or a relative increase of over 20%)
https://news.ycombinator.com/item?id=44630724
TFA, on the other hand, tests two things at once: mixing models, and "fuse a model with itself", the latter being just test-time compute. E.g. Opus was able to match Fable on TFA, but it cost twice as much money (and presumably time).
These two dimensions are orthogonal but can be combined for further gains.
It's not clear that every task benefits from it, though. They only benched deep research, and their results are a bit weird (e.g. they have DeepSeek outranking frontier models).
More research needed!
I would love to hear why they created it: what was the business case, and what is this going to serve? As you said, this is pretty easy to replicate.
Repo with video: https://github.com/monkeydust/rightmind
Seeing this log is interesting: https://link.ekin.dev/6RzYGGX7
It came up with a decent response, but I guess Opus or GPT 5.5 would do fine anyway. Gotta try it on different stuff. But this feels like it would work great in some situations.
A panel of models can debate for 5 minutes. The harder problem is making an agent remember why a decision was made 5 weeks ago.
I found that Fable didn't have as much of an impact when put in a team.
But it was/is a very pleasant model to work with 1:1. And it was the first time in months that I didn't use my primary team-based workhorse, across tens of sessions last week.
I worry it's kind of like asking 2-3 different consultants what the optimal strategy is for your business... and I'm not sure merging the answers produces anything materially better.