Posted by mraniki 5 days ago
Compare and contrast https://aider.chat/docs/leaderboards/, https://web.lmarena.ai/leaderboard, https://livebench.ai/#/.
Does any LLM do this yet? I want to throw it at a project that’s in package and microservice hell and get a useful response. Some weeks I spend almost all my time cutting tickets to other teams, writing documents, and playing politics when the other teams don’t want me to touch their stuff. I know my organization is broken, but this is the world I live in.
I'd like to see more complicated tests for AI, things like refactoring an existing codebase, writing a program to auto-play God of War for you, improving the response time of a keyboard driver, and so on.
It really is miles ahead of anything else so far, but it's also really pricey, so it makes sense that some people try to find something close to it at a much lower cost.
I've seen this occasionally with older Claude models, but Gemini did this to me very recently. Pretty annoying.
This is on a Gemini 2.5 Pro free trial. Also - god damn is it slow.
For context, this is on a 15k LOC project built about 75% using Claude.
Would love to see a similar article that uses LLMs to add a feature to GIMP or Blender.