Posted by mraniki 3/31/2025

Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison (composio.dev)
483 points | 328 comments | page 2
sfjailbird 3/31/2025|
Every test task, including the coding test, is a greenfield project. Everything I would consider using LLMs for is not. Like, I would always need it to do some change or fix on a (large) existing project. Hell, even the examples that were generated would likely need subsequent alterations (ten times more effort goes into maintaining a line of code than writing it).

So these tests are meaningless to me, as a measure of how useful these models are. Great for comparison with each other, but would be interesting to include some tests with more realistic work.

maxnevermind 4/1/2025|
Indeed, I'm surprised to see that this has been in the top 10 on HN today. I thought everyone had already realized that examples like "create a flappy bird game" are not realistic and do not reflect the actual usefulness of the model; very few professionals in the industry endlessly create flappy bird games for a living.
anonzzzies 3/31/2025||
For Gemini: play around with the temperature; the default is terrible. We had much better results with (much) lower values.
CjHuber 3/31/2025||
From my experience a temperature close to 0 creates the best code (meaning functioning without modifications). When vibe coding I now use a very high temperature for brainstorming and writing specifications, and then have the code written at a very low one.
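A minimal sketch of where that knob lives, assuming the google-generativeai Python client (the model id and temperature values below are illustrative, not a recommendation):

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")  # illustrative model id

    # Low temperature for code generation, per the comments above.
    code = model.generate_content(
        "Write a Python function that parses ISO 8601 timestamps.",
        generation_config=genai.GenerationConfig(temperature=0.1),
    )
    print(code.text)

    # Higher temperature for brainstorming / writing the spec.
    spec = model.generate_content(
        "Brainstorm the feature set for a todo app.",
        generation_config=genai.GenerationConfig(temperature=1.2),
    )
    print(spec.text)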
SubiculumCode 3/31/2025||
What improved, specifically?
anonzzzies 3/31/2025||
Much better code.
MrScruff 3/31/2025||
The evidence given really doesn't justify the conclusion. Maybe it suggests 2.5 Pro might be better if you're asking it to build Javascript apps from scratch, but that hardly equates to "It's better at coding". Feels like a lot of LLM articles follow this pattern, someone running their own toy benchmarks and confidently extrapolating broad conclusions from a handful of data points. The SWE-Bench result carries a bit more weight but even that should be taken with a pinch of salt.
throwaway0123_5 3/31/2025||
> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't (or at least didn't with my prompts).

namaria 3/31/2025||
There are three things this hype cycle excels at: getting money from investors for foundational model creators and startup.ai; spinning layoffs as a good sign for big corps; and helping people looking for clout online come across as clever tech bloggers.
amazingamazing 3/31/2025||
In before people post contradictory anecdotes.

It would be more helpful if people posted the prompt, and the entire context, or better yet the conversation, so we can all judge for ourselves.

pcwelder 3/31/2025||
Gemini 2.5 pro hasn't been as good as Sonnet for me.

The prompt I have tried repeatedly is creating a react-vite-todo app.

It doesn't figure out Tailwind-related issues. Real chats:

Gemini: https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...

Sonnet 3.7: https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...

Exact same settings, using MCP server for tool calling, using OpenAI api interface.

PS: the formatting is off, but '#%%' starts a new block; view it in raw.

amazingamazing 3/31/2025||
your links don't work
pcwelder 3/31/2025||
The repo was private, updated. Thanks!!
genewitch 3/31/2025|||
you have to dump a csv from the microsoft website. i linked the relevant parts below. I spent ~8 hours with copilot making a react "app" to someone else's spec, and most of it was moving things around and editing CSS back and forth, because copilot has an idea of how things ought to be that didn't comport with what I was seeing on my screen.

However the MVP went live and everyone was happy. Code is on my github, "EMD" - conversation isn't. https://github.com/genewitch/emd

i'd link the site but i think it's still in "dev" mode and i don't really feel like restoring from a snapshot today.

note: i don't know javascript. At all. It looks like boilerplate and line noise to me. I know enough about programming to be able to fix things like "the icons were moving the wrong way", but i had to napkin it out (twice!) and then consult with someone else to make sure that i understood the "math", but i implemented the math correctly and copilot did not. Probably because i prompted it in a way that made its decision make more sense. see lines 2163-2185 in the link below for how i "prompt" in general.

note 2: http://projectftm.com/#I7bSTOGXsuW_5WZ8ZoLSPw is the conversation, as best i can tell. It's in reverse chronological order (#2944 - 2025-12-14 was the actual first message about this project, the last on 2025-12-15)

note 3: if you do visit the live site, and there's an error, red on black, just hit escape. I imagine the entire system has been tampered with by this point, since it is a public server running port 443 wide open.

Workaccount2 3/31/2025|||
This is also compounded by the fact that LLMs are not deterministic; every response is different for the same prompt. And people tend to judge on one-off experiences.
otabdeveloper4 3/31/2025||
> LLMs are not deterministic

They can be. The cloud-hosted LLMs add a gratuitous randomization step to make the output seem more human. (In line with the moronic idea of selling LLMs as sci-fi human-like assistants.)

But you don't have to add those randomizations. Nothing much is lost if you don't. (Output from my self-hosted LLMs is deterministic.)
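For reference, "no randomization" with a self-hosted model just means greedy decoding; a minimal sketch with Hugging Face transformers (the model name is an arbitrary stand-in):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"  # stand-in; any local causal LM works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("Write a function that reverses a string.", return_tensors="pt")

    # do_sample=False disables sampling, so the same prompt yields the same tokens every run
    # (on the same hardware/software stack; as noted below, kernel nondeterminism can still leak in).
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=200)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))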

CharlesW 3/31/2025||
Even at temperature = 0, LLM output is not guaranteed to be deterministic. https://www.vincentschmalbach.com/does-temperature-0-guarant...
deeth_starr_v 3/31/2025||
This is the issue with these kinds of discussions on HN. “It worked great for me” or “it sucked for me” without enough context. You just need to try it yourself to see if it’ll work for your use case.
HarHarVeryFunny 3/31/2025||
I'd like to see an honest attempt by someone to use one of these SOTA models to code an entire non-trivial app. Not a "vibe coding" flappy bird clone or minimal iOS app (call API to count calories in photo), but something real - say 10K LOC type of complexity, using best practices to give the AI all the context and guidance necessary. I'm not expecting the AI to replace the programmer - just to be a useful productivity tool when we move past demos and function writing to tackling real world projects.

It seems to me that where we are today, AI is only useful for coding for very localized tasks, and even there mostly where it's something commonplace and where the user knows enough to guide the AI when it's failing. I'm not at all convinced it's going to get much better until we have models that can actually learn (vs pre-trained) and are motivated to do so.

redox99 3/31/2025||
I use cursor agent mode with claude on my NextJS frontend and Typescript GraphQL backend. It's a real, reasonably sized, production app that's a few years old (pre-ChatGPT).

I vibe code the vast majority of features nowadays. I generally don't need to write a single line of code. It often makes some mistakes, but the agent figures out that the tests fail or it doesn't build, fixes it, and basically "one shots" it after doing its thing.

Only occasionally do I need to write a few lines of code or give it a hint when it gets stuck. But 99% of the code is written by cursor.

orange_puff 3/31/2025|||
When you say "vibe code" do you mean the true definition of that term, which is to blindly accept any code generated by the AI, see if it works (maybe agent mode does this) and move on to the next feature? Or do you mean prompt driven development, where although you are basically writing none of the code, you are still reading every line and maintain high involvement in the code base?
redox99 3/31/2025||
Kind of in between. I accept a lot of code without ever seeing it, but I check the critical stuff that could cause trouble. Or stuff that I know the AI is likely to mess up.

Specifically for the front end I mostly vibe code, and for the backend I review a lot of the code.

I will often follow up with prompts asking it to extract something to a function, or to not hardcode something.

HarHarVeryFunny 3/31/2025|||
That's pretty impressive - a genuine real-world use case where the AI is doing the vast majority of the work.
kaiokendev 3/31/2025|||
I made this NES emulator with Claude last week [0]. I'd say it was a pretty non-trivial task. It involved throwing a lot of NESDev docs, Disch mapper docs, and test rom output + assembly source code to the model to figure out.

[0]: https://kaiokendev.github.io/nes/

nowittyusername 3/31/2025|||
I am considering training a custom LoRA on Atari ROMs and seeing if I could get a working game out of it with the LoRA's use. The thinking here is that Atari, NES, SNES, etc. ROMs are a lot smaller in size than a program that runs natively on whatever OS. Fewer lines of code for the LLM to write means less chance of a screw up. Take the ROM, convert it to assembly, perform very detailed captions on the ROM, and train... If this works, it would enable anyone to create games with one prompt which are a lot higher quality than the stuff being made now, and with less complexity. If you made an emulator with the use of an LLM, that means it understands assembly well enough, so I think there might be hope for this idea.
kaiokendev 4/1/2025||
Well the assembly I put into it was written by humans writing assembly intended to be well-understood by anyone reading it. On the contrary, many NES games abuse quirks specific to the NES that you can't translate to any system outside of the NES. Understanding what that assembly code is doing also requires a complete understanding of those quirks, which LLMs don't seem to have yet (My Mapper 4 implementation still has some bugs because my IRQ handling isn't perfect, and many games rely on precise IRQ timing).
HarHarVeryFunny 3/31/2025|||
How would you characterize the overall structural complexity of the project, and degree of novelty compared to other NES emulators Claude may have seen during training ?

I'd be a bit suspect of an LLM getting an emulator right, when all it has to go on is docs and no ability to test (since pass criteria is "behaves same as something you don't have access to")... Did you check to see the degree to which it may have been copying other NES emulators ?

kaiokendev 3/31/2025||
> How would you characterize the overall structural complexity of the project, and degree of novelty compared to other NES emulators Claude may have seen during training ?

Highly complex, fairly novel.

Emulators themselves, for any chipset or system, have a very learnable structure: there are some modules, each having their own registers and ways of moving data between those registers, and perhaps ways to send interrupts between those modules. That's oversimplifying a bit, but if you've built an emulator once, you generally won't be blindsided when it comes to building another one. The bulk of the work lies in dissecting the hardware, which has already been done for the NES, and more open architectures typically have their entire pinouts and processes available online. All that to say - I don't think Claude would have difficulty implementing most emulators - it's good enough at programming and parsing assembly that as long as the underlying microprocessor architecture is known, it can implement it.

As far as other NES emulators goes, this project does many things in non-standard ways, for instance I use per-pixel rendering whereas many emulators use scanline rendering. I use an AudioWorklet with various mixing effects for audio, whereas other emulators use something much simpler or don't even bother fully implementing the APU. I can comfortably say there's no NES emulator out there written the way this one is written.

> I'd be a bit suspect of an LLM getting an emulator right, when all it has to go on is docs and no ability to test (since pass criteria is "behaves same as something you don't have access to")... Did you check to see the degree to which it may have been copying other NES emulators ?

Purely javascript-based NES emulators are few in number, and those that implement all aspects of the system even fewer, so I can comfortably say it doesn't copy any of the ones I've seen. I would be surprised if it did, since I came up with most of the abstractions myself and guided Claude heavily. While Claude can't get docs on its own, I can. I put all the relevant documentation in the context window myself, along with the test rom output and source code. I'm still commanding the LLM myself; it's not like I told Claude to build an emulator and left it alone for 3 days.

HarHarVeryFunny 3/31/2025||
Interesting - thanks!

Even with your own expert guidance, it does seem impressive that Claude was able to complete a project like this without getting bogged down in the complexity.

axkdev 3/31/2025|||
I dunno what you would consider non-trivial. I am building a diffing plugin for Neovim. The experience is... mixed. The fast progression at the start was impressive, but now, as the code base has grown, the issues show up. The code is a mess. Adding one feature breaks another and so on. I have no problem using the agent on code that I know very well, because I can steer it in the exact direction I want. But vibe coding something I don't fully understand is a pain.
Pannoniae 3/31/2025|||
I've been using Claude 3.7 for various things, including helping with game development tasks. The generated code usually requires editing, and it can't autonomously do more than a few functions at once, but it's a fairly useful tool in terms of productivity. And the logic part is also quite good: it can design out various ideas/algorithms and suggest some optimisations.

Tech stack is nothing fancy/rare but not the usual ReactJS slop either - it's C# with OpenGL.

I can't comment about the best practices though because my codebase follows none of them.

Yes, the user has to know enough to guide the AI when it's failing. So it can't exactly replace the programmer as it is now.

It really can't do niche stuff however - like SIMD. Maybe it would be better if I compiled a cheatsheet of .NET SIMD snippets and howtos because this stuff isn't really on the internet in a coherent form at all. So it's highly unlikely that it was trained on that.

HarHarVeryFunny 3/31/2025||
Interesting - thanks! This isn't the type of tech stack where I'd have expected it to do very well, so the fact that you're at least finding it to be productive is encouraging, although the (only) "function level competency" is similar to what I've experienced - enough to not have been encouraged to try anything more complex.
gedy 3/31/2025|||
I know they are capable of more, but I also tire of people being so enamored with "bootstrap a brand new app" type AI coding - like is that even a big part of our job? In 25 years of dev work, I've needed to do that for a commercial production app like... twice? 3 times? Help me deal with existing apps and codebases please.
lordswork 3/31/2025||
I'm at 3k LOC on a current Rust project I'm mostly vibe coding with my very limited free time. Will share when I hit 10k :)
HarHarVeryFunny 3/31/2025||
Would you mind sharing what the project is, and which AI you are using? No sign so far of AI's usefulness slowing down as the complexity increases?
lordswork 4/1/2025|||
>Would you mind sharing what the project is

rust + wasm simulation of organisms in an ecosystem, with evolving neural networks and genes. super fun to build and watch.

>which AI you are using?

using chatgpt/claude/gemini with a custom tool i built similar to aider / claude code, except it's very interactive, like chatting with the AI as it suggests changes that I approve/decline.

>No sign so far of AI's usefulness slowing down as the complexity increases?

The AI is not perfect; there are some cases where it is unable to solve a challenging issue and I must help it. This usually happens for big sweeping changes that touch all over the codebase. It introduces bugs, but it can also debug them easily, especially with the increased compile-time checking in Rust. Runtime bugs are harder, because I have to tell the AI the behavior I observe. Iterating on UI design is clumsy, and it's often faster for me to just iterate by making changes myself instead.

HarHarVeryFunny 4/1/2025||
Thanks - sounds like a fun project!

Given that you've built your own coding tool, I assume this is as much about testing what AI can do as it is about the project itself? Is it a clear win as far as productivity goes?

lordswork 4/2/2025||
I'm most interested in building cool projects, and I have found AI to be a major multiplier to that effort. One of those cool projects was a custom coding tool, which I now use with all my projects, and continue to polish as I use it.

As far as productivity, it's hard for me to quantify, but most of these projects would not be feasible for me to pursue with my limited free time without the force multiplier of AI.

genewitch 3/31/2025|||
No one links their AI code, have you noticed?
SweetSoftPillow 3/31/2025||
Aider is written with AI, you're welcome.
raffkede 3/31/2025||
I had huge success letting Gemini 2.5 one-shot whole codebases in a single text file format and then splitting them up with a script. It puts in work for like 5 minutes and spits out a working codebase. I also asked it to show off a little bit, and it almost one-shotted a Java cloud service to generate PDF invoices from API calls (made some minor mistakes, but after feeding them back it fixed them).

I basically use two scripts: one to flatten the whole codebase into one text file and one to split it. Give it a shot, it's amazing...
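The flattening side of this is only a few lines of Python. A rough sketch of what such a script could look like (not the poster's actual script), assuming the ===FILE=== format described further down in the thread:

    import os
    import sys

    def flatten(root_dir: str, out_path: str) -> None:
        """Concatenate every file under root_dir into one ===FILE===-delimited text file."""
        index = 1
        with open(out_path, "w", encoding="utf-8") as out:
            for dirpath, _, filenames in os.walk(root_dir):
                for name in sorted(filenames):
                    full = os.path.join(dirpath, name)
                    rel = os.path.relpath(full, root_dir)
                    with open(full, "r", encoding="utf-8", errors="replace") as f:
                        content = f.read()
                    out.write("===FILE===\n")
                    out.write(f"Index: {index}\n")
                    out.write(f"Path: {rel}\n")
                    out.write(f"Length: {len(content)}\n")
                    out.write("Content:\n")
                    out.write(content if content.endswith("\n") else content + "\n")
                    out.write("===ENDFILE===\n")
                    index += 1

    if __name__ == "__main__":
        # e.g. python flatten.py ./my-project codebase.txt
        flatten(sys.argv[1], sys.argv[2])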

mvdtnz 3/31/2025||
Anything that can fit in a single LLM output is not a "codebase" it's just a start. Far too many people with no experience in real software projects think their little 1800 line apps are representative of real software development.
archeantus 3/31/2025||
Can you please expound on this? You’re using this approach to turn an existing codebase into a single file and then asking Gemini to make changes/enhancements? Does it also handle breaking the files back out? Would love more info!
ZeroTalent 3/31/2025|||
There is a better way that I'm using:

1. Cursor Pro with Sonnet to implement things the Cursor way.

2. Install the Gemini Code extension in Cursor.

3. Install the Gemini Coder Connector Chrome extension: https://chromewebstore.google.com/detail/gemini-coder-connec...

4. Get the free aistudio.google.com Gemini API and connect the extensions.

5. Feed your codebase or select files via the Cursor extension and get the implementation from aistudio.google.com.

I prefer having Sonnet implement it via Cursor rather than Gemini because it can automatically go through all the linting/testing loops without my extra input, run the server, and check if there are no errors.

raffkede 3/31/2025|||
I created a script that merges all files in a directory into this format, and a counterpart that splits it again. Below is just a small sample I asked it to create to show the format, but I did it with almost 80 files including lots of documentation etc.

When provided with the flat format, it was able to replicate it without much instruction. For a blank prompt, I had success with the prompt below.

    ===FILE===
    Index: 1
    Path: src/main/java/com/example/myapp/Greeter.java
    Length: 151
    Content:
    package com.example.myapp;

    public class Greeter {
        public String getGreeting() {
            return "Hello from the Greeter class!";
        }
    }
    ===ENDFILE===
    ===FILE===
    Index: 2
    Path: src/main/java/com/example/myapp/Main.java
    Length: 222
    Content:
    package com.example.myapp;

    public class Main {
        public static void main(String[] args) {
            Greeter greeter = new Greeter();
            String message = greeter.getGreeting();
            System.out.println("Main app says: " + message);
        }
    }
    ===ENDFILE===
    ===FILE===
    Index: 3
    Path: pom.xml
    Length: 659
    Content:
    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>com.example</groupId>
        <artifactId>my-simple-app</artifactId>
        <version>1.0-SNAPSHOT</version>

        <properties>
            <maven.compiler.source>17</maven.compiler.source>
            <maven.compiler.target>17</maven.compiler.target>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>
    </project>
    ===ENDFILE===

Prompt to request the format if starting from scratch: Present the entire codebase using the following multi-file format:

The codebase should be presented as a single, monolithic text output. Inside this output, represent each file of the project individually using the following structure:

Start Marker: Each file must begin with the exact line: ===FILE===

Metadata Block: Immediately following the start marker, include these four specific metadata lines, each on its own line:

Index: <N> (where <N> is a sequential integer index for the file, starting from 1).

Path: <path/to/file/filename.ext> (The full relative path of the file from the project's root directory, e.g., index.html, css/style.css, js/script.js, jobs.html, etc.).

Length: <L> (where <L> is the exact character count of the file's content that follows).

Content: (This literal line acts as a separator).

File Content: Immediately after the Content: line, include the entire raw content of the file. Preserve all original line breaks, indentation, and formatting exactly as it should appear in the actual file.

End Marker: Each file's section must end with the exact line: ===ENDFILE===

Ensure all necessary files for the project (HTML, CSS, JS) are included sequentially within the single output block according to this structure.

Crucially, enclose the entire multi-file output, starting from the very first ===FILE=== line down to the very last ===ENDFILE=== line, within a single Markdown fenced code block using exactly five backticks (`````) on the lines immediately before the first ===FILE=== and immediately after the last `===ENDFILE===`. This ensures that any triple backticks (```) within the generated file content are displayed correctly.
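A counterpart splitter for that format could be sketched like this (an illustration, not the poster's actual script):

    import os
    import re
    import sys

    # Split a ===FILE===-delimited dump (as specified above) back into individual files.
    BLOCK = re.compile(
        r"===FILE===\n"
        r"Index: \d+\n"
        r"Path: (?P<path>[^\n]+)\n"
        r"Length: \d+\n"
        r"Content:\n"
        r"(?P<content>.*?)"
        r"===ENDFILE===\n?",
        re.DOTALL,
    )

    def split(dump_path: str, out_dir: str) -> None:
        with open(dump_path, "r", encoding="utf-8") as f:
            text = f.read()
        for match in BLOCK.finditer(text):
            dest = os.path.join(out_dir, match.group("path").strip())
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            with open(dest, "w", encoding="utf-8") as out:
                out.write(match.group("content"))

    if __name__ == "__main__":
        # e.g. python split.py codebase.txt ./restored-project
        split(sys.argv[1], sys.argv[2])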

iammrpayments 3/31/2025||
Theo video detected = opinion rejected

Also, I generally dislike thinking models for coding and prefer faster models, so if you have something easy, Gemini 2.0 is good.

bn-l 3/31/2025||
Absolute golden age of YouTube brain rot. I had to disable the YouTube sidebar with a custom style, because just seeing these thumbnails and knowing some stupid schmuck is clicking on them like an ape in a touchscreen experiment really lowers my mood.
Workaccount2 3/31/2025||
If you find YouTubers talking about it, they all fully agree that making these thumbnails is soul-draining, and they are totally aware of how stupid they are. But they are also aware that click-through rates fall off a cliff when you don't use them. Humans are mostly dumb, it's up to you if you want to use it to your advantage or to your detriment.
bn-l 3/31/2025||
> Humans are mostly dumb, it's up to you if you want to use it to your advantage or to your detriment.

Is that true? I like to think it’s mostly kids. Honestly the world is a dark place if it’s adults doing the clicking.

SweetSoftPillow 3/31/2025||
You definitely underestimate kids and overestimate adults.
Kiro 3/31/2025|||
What's wrong with Theo?
hu3 3/31/2025|||
People say his technical opinions can be (or are) bought for the right price or clicks.
greenchair 3/31/2025||||
vercel shill
iammrpayments 3/31/2025|||
He not only actively promotes React, which is forgivable, but also every framework or unnecessary piece of npm software that pays him enough.

His videos also have 0 substance and now are mostly article reading, which is also forgivable if you add valuable input but that’s never the case with him.

bilekas 3/31/2025|||
Theo has some strange takes for my liking, but to flat out reject the opinion isn't the way to go. Thinking models are okay for larger codebases though, where some more context is important; this makes the results a bit more relevant than, say, Copilot, which seems to be really quick at generating some well-known algorithms etc.

They're just different tools for different jobs really.

arccy 3/31/2025||
rejecting an opinion doesn't mean you have to hold the opposite stance, just that their opinion should hold 0 weight.
mvdtnz 3/31/2025||
What's Theo?
Sol- 3/31/2025||
Maybe I don't feel the AI FOMO strongly enough, and obviously these performance comparisons can be interesting in their own right to keep track of AI progress, but ultimately it feels like as long as you have a pro subscription with one of the leading providers (OpenAI, Anthropic or Google), you're fine.

Sure, your provider of choice might fall behind for a few months, but they'll just release a new version eventually and might come out on top again. Intelligence seems commodified enough already that I don't care as much whether I have the best or second best.

jascha_eng 3/31/2025||
This is an incredibly bad test for real-world use. Everything the author tested was a clean-slate project; any LLM is going to excel on those.
veselin 3/31/2025|
I noticed a similar trend in selling on X. Put out a claim, peg it on some product A with good sales - Cursor, Claude, Gemini, etc. Then say the best way to use A is with our best product or guide, that being an MCP or something else.

For some of these I see something like 15k followers on X, but then no LinkedIn page, for example. The website is always a company you cannot contact, and they do everything.

jpadkins 3/31/2025|
No LinkedIn page is a green flag for me.