Posted by mraniki 3/31/2025

Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison (composio.dev)
483 points | 328 comments
skerit 3/31/2025|
I've been using Gemini 2.5 Pro with Roo-Code a lot these past few days, and it has really helped me. I managed to get it to implement entire features (with some manual cleanup at the end).

The fact that it's free for now (I know they use it for training, that's OK) is a big plus, because I've had to restart a task from scratch quite a few times. If I calculate what this would have cost me using Claude, it would have been 200-300 euros.

I've noticed that as soon as it makes a mistake (messing up the diff format is a classic), the current task is basically a total loss. For some reason, most coding tools just inform the model it made a mistake and should try again... but at that point, its broken response is part of the history, and it's basically multi-shotting itself into making more mistakes. They should really just filter these out.
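Roughly what I mean, as a sketch (a generic message-list shape, not any particular tool's actual internals):

    # Hypothetical sketch: drop the model's broken response from the
    # conversation before retrying, instead of appending "you made a
    # mistake, try again" on top of it.
    def retry_without_failure(history, failed_response, error_note):
        # Keep everything except the malformed diff, so the model never
        # sees its own broken output as precedent for the next attempt.
        pruned = [msg for msg in history if msg is not failed_response]
        # Restate the request with a format hint rather than quoting the
        # broken output back at the model.
        pruned.append({
            "role": "user",
            "content": "Please produce the edit again as a unified diff. " + error_note,
        })
        return pruned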

hrudolph 4/1/2025|
Try this and watch it supercharge! https://docs.roocode.com/features/boomerang-tasks/
lherron 3/31/2025||
These one-shot prompts aren't at all how most engineers use these models for coding. In my experience so far, Gemini 2.5 Pro is great at generating code but not so great at instruction following or tool usage, which are key for any iterative coding tasks. Claude is still king for that reason.
jgalt212 3/31/2025|
Agreed. I've never successfully one-shotted anything non-trivial or non-pedagogical.
dysoco 3/31/2025||
Useful article, but I would rather see comparisons where the model takes a codebase and tries to modify it given a series of instructions, rather than attempting to zero-shot implementations of games or solutions to problems. I feel like that better fits the real use cases of these tools.
dsign 3/31/2025||
I guess it depends on the task? I have very low expectations for Gemini, but I gave it a run with an easy signal-processing problem and it did well. It took 30 seconds to reason through a problem that would have taken me 5 to 10 minutes. Gemini's reasoning was sound (though it took me a couple of minutes to decide that), and it also wrote the functions with the changes (which took me an extra minute to verify). It's not a definitive win in time, but at least there was an extra pair of "eyes"--or whatever that's called with a system like this one.

All in all, I think we humans are well on our way to become legal flesh[].

[] The part of the system to whip or throw in jail when a human+LLM commit a mistake.

vonneumannstan 3/31/2025|
>I guess it depends on the task? I have very low expectations for Gemini, but I gave it a run with an easy signal-processing problem and it did well. It took 30 seconds to reason through a problem that would have taken me 5 to 10 minutes. Gemini's reasoning was sound (though it took me a couple of minutes to decide that), and it also wrote the functions with the changes (which took me an extra minute to verify). It's not a definitive win in time, but at least there was an extra pair of "eyes"--or whatever that's called with a system like this one.

I wonder if you treat code from a junior engineer the same way? It seems impossible to scale a team that way. You shouldn't need to verify every line, but rather have test harnesses that ensure adherence to the spec.
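For example, something like this (a made-up sketch; moving_average and signal_utils are just stand-ins for whatever the model produced):

    # Hypothetical spec-adherence checks for a model-written helper,
    # instead of reading every generated line by hand.
    import pytest
    from signal_utils import moving_average  # hypothetical module under test

    def test_window_of_one_is_identity():
        assert moving_average([1.0, 2.0, 3.0], window=1) == [1.0, 2.0, 3.0]

    def test_output_length_matches_input():
        data = [0.5] * 100
        assert len(moving_average(data, window=5)) == len(data)

    def test_rejects_invalid_window():
        with pytest.raises(ValueError):
            moving_average([1.0, 2.0], window=0)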

paradite 3/31/2025||
This is not a good comparison for real world coding tasks.

Based on my own experience and anecdotes, Gemini 2.5 Pro is worse than Claude 3.5 and 3.7 Sonnet for actual coding tasks on existing projects. It is very difficult to control the model's behavior.

I will probably write a blog post on real-world usage.

phforms 3/31/2025||
I like using LLMs more as coding assistants than having them write the actual code. When I am thinking through problems of code organization, API design, naming things, performance optimization, etc., I've found that Claude 3.7 often gives me great suggestions, points me in the right direction, and helps me weigh up the pros and cons of different approaches.

Sometimes I have it write very boilerplate functions to save time, but I mostly like to use it as a tool for thinking through problems, alongside other tools like writing in a notebook or drawing diagrams. I enjoy programming too much to want an AI to do it all for me (it also helps that I don't do it as a job, though).

superkuh 3/31/2025||
What is most apparent to me (putting in existing code and asking for changes) is Gemini 2.5 Pro's tendency to refuse to actually type out subroutines, routinely replacing them with either stubs or comments that say, "put the subroutines back here". Even if Gemini's results are good, they're still broken and require lots of manual work/thinking to get the subroutines back into the code and hooked up properly.

With a 1 million token context you'd think they'd let the LLM actually use it but all the tricks to save token count just make it... not useful.

Extropy_ 3/31/2025||
Why is Grok not in their benchmarks? I don't see comparisons to Grok in any recent announcements about models. In fact, I see practically no discussion of Grok on HN or anywhere except Twitter in general.
nathanasmith 3/31/2025|
Is there an API for Grok yet? If not that could be the issue.
mvkel 4/1/2025||
I really wish people would stop evaluating a model's coding capability with one-shots.

The vast majority of coding energy is what comes next.

Even today, Sonnet 3.5 is still the best "existing codebase" model, which is gratifying (to Anthropic) and/or alarming to everyone else.

evantbyrne 3/31/2025|
The common issue I run into with all LLMs is that they can't complete the same coding tasks for which googling around also fails to turn up working solutions. In particular, they seem to struggle with libraries/APIs that are less mainstream.