Posted by bahaAbunojaim 12/23/2025
The problem: I pay for Claude Pro, ChatGPT Plus, and Gemini, but only one can help at a time. On tricky architecture decisions, I want a second opinion.
The solution: Mysti lets you pick any two AI agents (Claude Code, Codex, Gemini) to collaborate. They each analyze your request, debate approaches, then synthesize the best solution.
Your prompt → Agent 1 analyzes → Agent 2 analyzes → Discussion → Synthesized solution
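In TypeScript terms, the loop is roughly this (a simplified sketch, not the actual extension code; it assumes the CLIs' headless print modes such as claude -p and gemini -p):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Ask one agent CLI in non-interactive mode and return its text output.
async function ask(cli: string, flags: string[], prompt: string): Promise<string> {
  const { stdout } = await run(cli, [...flags, prompt]);
  return stdout.trim();
}

// Simplified brainstorm flow: two independent analyses, a critique pass, a synthesis pass.
async function brainstorm(prompt: string): Promise<string> {
  const [a, b] = await Promise.all([
    ask("claude", ["-p"], prompt),   // Claude Code print mode
    ask("gemini", ["-p"], prompt),   // Gemini CLI prompt mode
  ]);
  const critique = await ask("claude", ["-p"],
    `Proposal A:\n${a}\n\nProposal B:\n${b}\n\nCompare both, list trade-offs and edge cases each misses.`);
  return ask("gemini", ["-p"],
    `Proposals:\nA: ${a}\nB: ${b}\nCritique:\n${critique}\n\nSynthesize the best single solution.`);
}

The actual extension layers the personas and permission control on top, but the core shape is two independent analyses followed by a critique and a synthesis.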
Why this matters: each model has different training and blind spots. Two perspectives catch edge cases one would miss. It's like pair programming with two senior devs who actually discuss before answering.
What you get:
* Use your existing subscriptions (no new accounts, just your CLI tools)
* 16 personas (Architect, Debugger, Security Expert, etc.)
* Full permission control, from read-only to autonomous
* Unified context when switching agents
Tech: TypeScript, VS Code Extension API, shells out to claude-code/codex-cli/gemini-cli
License: BSL 1.1, free for personal and educational use, converts to MIT in 2030 (would love input on this, does it make sense to just go MIT?)
GitHub: https://github.com/DeepMyst/Mysti
Would love feedback on the brainstorm mode. Is multi-agent collaboration actually useful or am I just solving my own niche problem?
My repo has other tools that leverage such headless agents. For example, there's a resume functionality [1] that provides alternatives to compaction (which is not great, since it always loses valuable context details): the "smart-trim" feature uses a headless agent to find irrelevant long messages to truncate, and the "rollover" feature creates a new session and injects session lineage links, with customizable extraction of context for the task being continued.
[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
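To give a feel for the smart-trim idea, here is a stripped-down TypeScript sketch (not the actual implementation in the repo; the names and the JSON-reply convention are invented for illustration): number the messages, ask a headless agent which ones are irrelevant to the continuing task, and blank only those.

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface Message { role: "user" | "assistant" | "tool"; content: string; }

// Hypothetical smart-trim: let a headless agent pick which long messages are
// irrelevant to the current task, then truncate only those instead of compacting everything.
async function smartTrim(messages: Message[], task: string): Promise<Message[]> {
  const index = messages
    .map((m, i) => `[${i}] ${m.role}, ${m.content.length} chars: ${m.content.slice(0, 120)}`)
    .join("\n");
  const { stdout } = await run("claude", ["-p",
    `Task being continued: ${task}\n` +
    `Which messages below are irrelevant to that task? Answer with a JSON array of indices only.\n` +
    index]);
  const drop = new Set<number>(JSON.parse(stdout.trim()) as number[]); // naive parsing, for illustration
  return messages.map((m, i) =>
    drop.has(i) ? { ...m, content: "[trimmed: judged irrelevant to the continuing task]" } : m);
}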
I think my only real take-away from all of it was that Claude is probably the best at prototyping code, while Codex makes a very strong (but pedantic) code reviewer. Gemini was all over the place, sometimes inspired, sometimes idiotic.
To people asking why you would want Claude to call Codex or Gemini: it's because of orchestration. We have an architect skill that we feed to the first agent. That agent can call subagents or even use tmux and feed in the builder skill. The architect is harnessed to a CRUD application that just keeps track of which features have already been built, so the builder is focused on building only.
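A hedged TypeScript sketch of the ledger idea (names are invented for illustration; this is not our actual harness): the architect only touches a tiny CRUD store of features, and the builder is handed exactly one unbuilt feature at a time.

// Minimal feature ledger the architect is harnessed to.
interface Feature { id: string; description: string; built: boolean; }

class FeatureLedger {
  private features = new Map<string, Feature>();

  add(id: string, description: string): void {
    this.features.set(id, { id, description, built: false });
  }

  nextUnbuilt(): Feature | undefined {
    return [...this.features.values()].find(f => !f.built);
  }

  markBuilt(id: string): void {
    const f = this.features.get(id);
    if (f) f.built = true;
  }
}

// The architect plans and fills the ledger; the builder (a subagent or a tmux
// pane running its own CLI) gets only nextUnbuilt() in its prompt, builds it,
// and the orchestrator calls markBuilt() before moving on.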
Everything in the "What Claude Code Can Do With Tmux-CLI" section is already easily possible out of the box with vanilla tmux
It bakes in defaults that address these: Enter is sent automatically with a 1-second delay (configurable), pane targeting accepts simple numbers instead of session:window.pane, and there's a built-in wait_idle to detect when a CLI is ready for input. It's basically a wrapper that eliminates the common failure modes I kept hitting when having Claude Code interact with other terminal sessions.
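For anyone curious what those defaults replace, here is a rough TypeScript sketch of the two behaviours on top of plain tmux (illustrative only, not the tool's own code; the delay and polling interval stand in for the knobs mentioned above):

import { execFileSync } from "node:child_process";

const sleep = (ms: number) => new Promise<void>(res => setTimeout(res, ms));

// Type into a pane, pause, then send Enter separately: sending Enter in the
// same keystroke batch is the failure mode when the target CLI is still rendering.
async function sendLine(pane: string, text: string, delayMs = 1000): Promise<void> {
  execFileSync("tmux", ["send-keys", "-t", pane, text]);
  await sleep(delayMs);
  execFileSync("tmux", ["send-keys", "-t", pane, "Enter"]);
}

// Crude wait_idle: poll the pane until its visible contents stop changing.
async function waitIdle(pane: string, pollMs = 500): Promise<void> {
  let prev = "";
  for (;;) {
    const cur = execFileSync("tmux", ["capture-pane", "-p", "-t", pane]).toString();
    if (cur === prev) return;
    prev = cur;
    await sleep(pollMs);
  }
}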
Contributions will be highly appreciated and credited
But the “playing house” approach of experts is somewhere between pointless and actively harmful. It was all the rage in June and I thought people abandoned that later in the summer.
If you want the model to, e.g., review code instead of fixing things, or to document code without suggesting improvements, that's useful. But there's no need for all these personas.
What I was hoping for was that I could effectively farm out work to my metaphorical AI intern while I focus on fun and/or interesting work. Sometimes that is what happens, and it makes me very happy when it does. A lot of the time, however, it generates code that is wrong or incomplete (while claiming it is complete), so I end up having to babysit the code, either by further prompting or by just editing it myself.
And then it makes a lot of software engineering become "type prompt, sit and wait a minute, look at the code, repeat", which means I'm decidedly not focusing on the fun part of the project and instead I'm just larping as a manager who backseat codes.
A friend of mine said that he likes to do this backwards: he writes a lot of the code himself and then he uses Claude Code to debug and automate writing tedious stuff like unit tests, and I think that might make it a little less mind numbing.
Also, very tangential, and maybe my prompting game isn't completely on point here, but Codex seems decidedly bad at concurrent code [1]. I was working on some lock-free data store stuff, and Codex really wanted to add a bunch of lock files that were wholly unnecessary. Oh, and it kept trying to add mutexes into Rust, no matter how many times I tell it I don't want locks and it should use one-shot channels instead. To be fair, when I went and fixed the functions myself in a few spots and then told it to use that as an example, it did get a little better.
[1] I think this particular case is because it's trained on example code from Github and most code involving concurrency uses locks (incorrectly or at least sub-optimally). I guess this particular problem may be more of the fault of American universities teaching concurrent programming incorrectly at the undergrad level.
Honest question. How is Mysti better than a simple Claude skill that does the same work?
I guess it depends?
You can usually count on Claude Code or Codex or Gemini CLI to support the model features the best, but sometimes having a consistent UI across all of them is also nice - be it another CLI tool like OpenCode (that was a bit buggy for me when it came to copying text), or maybe Cline/RooCode/KiloCode inside of VSC, so you don't also have to install a custom editor like Cursor but can use your pre-existing VSC setup.
Okay, that was a bit of a run-on sentence, but it's nice to be able to work on some context and then switch between different models inline: "Hey Sonnet, please look at the work of the previous model up to this point and validate its findings about the cause of this bug."
I'd also love it if I could hook up some of those models (especially what Cerebras Code offers) with autocomplete so I wouldn't need Copilot either, but most of the plugins that try to do that are pretty buggy or broken (e.g. Continue.dev). KiloCode also added autocomplete, but it doesn't seem to work with BYOK.
Will definitely try to add those features in a future release as well
Do they?
There was a paper about HiveMind in LLMs: they all tend to produce similar outputs when asked open-ended questions.
Often, revolutions take longer to happen than we think they will, and then they happen faster than we think they will. And when the tipping point is finally reached, we find more people pushing back than we thought there would be.
On the other hand, agentic teams will take over from solo agents.
Ask any human client buying dev work from a web agency how "natural language" spec works out for them.
It's not clear to me at all that "natural language" alone is ideal -- unless you also have near real time iteration of builds. If you do, then the build is a concrete manifestation of the spec, and the natural language can say "but that's not what I meant".
This allows messy natural language to vector towards the solution, "I'll know it when I see it" style.
So, natural language shaping iteratively convergent builds.
Exactly.
Update:
I've already found a solution based on a comment, and modified it a bit.
Inside Claude Code I've made a new agent that uses Gemini via MCP through https://github.com/raine/consult-llm-mcp. This seems to work!
Claude Code:
Now let me launch the Gemini MCP specialist to build the backend monitoring server:
gemini-mcp-specialist(Build monitoring backend server) ⎿ Running PreToolUse hook…
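For anyone wanting to reproduce it, the subagent is just a markdown file under .claude/agents/, roughly like the following (the exact MCP tool name exposed by consult-llm-mcp is an assumption on my part, so check the server's tool list):

---
name: gemini-mcp-specialist
description: Consults Gemini through the consult-llm MCP server for design and implementation questions, then applies the result.
tools: mcp__consult-llm__consult_llm, Read, Write, Bash
---
You are a specialist that forwards non-trivial design or debugging questions to
Gemini via the consult-llm MCP tool, summarizes its answer, and then implements
the agreed changes.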
“If”, oh, idk, just the tool 90% of potential users will have installed.
# Plan code changes (Claude, Gemini and GPT-5 consensus)
# All agents review task and create a consolidated plan
/plan "Stop the AI from ordering pizza at 3AM"
# Solve complex problems (Claude, Gemini and GPT-5 race)
# Fastest preferred (see https://arxiv.org/abs/2505.17813)
/solve "Why does deleting one user drop the whole database?"
# Write code! (Claude, Gemini and GPT-5 consensus)
# Creates multiple worktrees then implements the optimal solution
/code "Show dark mode when I feel cranky"
# Hand off a multi-step task; Auto Drive will coordinate agents and approvals
/auto "Refactor the auth flow and add device login"@gemini could you review the code and then provide a summary to @claude?
@claude can you write the classes based on an architectural review by @codex
What do you think? Does that make sense?
Don’t quote me, but I think the other methods rely on passing general detail/commands and file paths to Gemini to avoid the context overhead you’re thinking about.
Do they? How much better are multiple agents on your evals, and what sort of evals are you running? I've also seen research that suggests that more agents degrade the output after a point.
I think what really degrades the output is context length vs. context window limits; check out NoLiMa.
> coordination yields diminishing or negative returns once single-agent baselines exceed ~45%
This is going to be the big thing to overcome, and without actually measuring it all we're doing is AI astrology.
I agree on measuring, and it is planned, especially once we integrate the context optimization. I think the value of context optimization will go beyond just avoiding compaction and reducing cost, to providing more reliable agents.
I LOLd at that. Things in the AI space become obsolete much faster. I'd say just go with GPL or AGPL if you don't want proprietary software to be built on top of your code.