Posted by mraniki 4 days ago
So these tests are meaningless to me as a measure of how useful these models are. They're great for comparing models against each other, but it would be interesting to include some tests with more realistic work.
Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't, or at least didn't with my prompts.
It would be more helpful if people posted the prompt, and the entire context, or better yet the conversation, so we can all judge for ourselves.
The prompt I have tried repeatedly is creating a React + Vite todo app.
It doesn't figure out the Tailwind-related issues. Real chats:
Gemini: https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
Sonnet 3.7: https://github.com/rusiaaman/chat.md/blob/main/samples/vite-...
Exact same settings, using an MCP server for tool calling, via the OpenAI API interface.
PS: the formatting is off, but '#%%' starts a new block, view it in raw.
However, the MVP went live and everyone was happy. The code is on my GitHub, "EMD"; the conversation isn't. https://github.com/genewitch/emd
I'd link the site, but I think it's still in "dev" mode and I don't really feel like restoring from a snapshot today.
Note: I don't know JavaScript. At all. It looks like boilerplate and line noise to me. I know enough about programming to fix things like "the icons were moving the wrong way", but I had to napkin it out (twice!) and then consult with someone else to make sure I understood the "math". Still, I implemented the math correctly and Copilot did not, probably because I prompted it in a way that made its decision make more sense. See lines 2163-2185 in the link below for how I "prompt" in general.
Note 2: http://projectftm.com/#I7bSTOGXsuW_5WZ8ZoLSPw is the conversation, as best I can tell. It's in reverse chronological order (#2944, from 2025-12-14, was the actual first message about this project; the last is from 2025-12-15).
Note 3: if you do visit the live site and there's an error, red on black, just hit Escape. I imagine the entire system has been tampered with by this point, since it's a public server running with port 443 wide open.
They can be. Cloud-hosted LLMs add a gratuitous randomization step to make the output seem more human (in the vein of the moronic idea of selling LLMs as sci-fi human-like assistants).
But you don't have to add that randomization. Nothing much is lost if you don't. (Output from my self-hosted LLMs is deterministic.)
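For illustration: the "randomization step" is the sampling temperature, and most APIs let you turn it down. A minimal sketch, assuming an OpenAI-compatible chat completions endpoint; the URL, model name, and env var here are placeholders, not anything from this thread:

```typescript
// Minimal sketch: asking an OpenAI-compatible endpoint for deterministic
// output. The URL, model name, and env var are placeholders.
async function deterministicCompletion(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.API_KEY ?? ""}`,
    },
    body: JSON.stringify({
      model: "my-local-model", // placeholder
      messages: [{ role: "user", content: prompt }],
      temperature: 0, // greedy decoding: no sampling randomness
      seed: 42,       // honored by some providers, ignored by others
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

One caveat: even at temperature 0, some hosted models are not bit-exact across calls because of batching and floating-point nondeterminism, which is why the seed parameter is best-effort where it's supported at all.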
It seems to me that where we are today, AI is only useful for coding for very localized tasks, and even there mostly where it's something commonplace and where the user knows enough to guide the AI when it's failing. I'm not at all convinced it's going to get much better until we have models that can actually learn (vs pre-trained) and are motivated to do so.
I vibe code the vast majority of features nowadays. I generally don't need to write a single line of code. The agent often makes some mistakes, but it figures out that the tests fail or the build breaks, fixes it, and basically "one-shots" it after doing its thing.
Only occasionally do I need to write a few lines of code or give it a hint when it gets stuck. But 99% of the code is written by Cursor.
Specifically for the front end I mostly vibe code, and for the backend I review a lot of the code.
I will often follow up with prompts asking it to extract something to a function, or to not hardcode something.
I'd be a bit suspicious of an LLM getting an emulator right when all it has to go on is docs and no ability to test (since the pass criterion is "behaves the same as something you don't have access to")... Did you check the degree to which it may have been copying other NES emulators?
Highly complex, fairly novel.
Emulators themselves, for any chipset or system, have a very learnable structure: there are some modules, each having their own registers and ways of moving data between those registers, and perhaps ways to send interrupts between those modules. That's oversimplifying a bit, but if you've built an emulator once, you generally won't be blindsided when it comes to building another one. The bulk of the work lies in dissecting the hardware, which has already been done for the NES, and more open architectures typically have their entire pinouts and processes available online. All that to say - I don't think Claude would have difficulty implementing most emulators - it's good enough at programming and parsing assembly that as long as the underlying microprocessor architecture is known, it can implement it.
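As an illustration (mine, not the author's code), the generic module shape described above might look like this in TypeScript: a register file, a clocked step function, and interrupt wiring between modules.

```typescript
// Illustrative sketch (not the author's code): the generic shape of an
// emulator module -- a register file, a clocked step, and interrupt wiring.
class Bus {
  private ram = new Uint8Array(0x0800); // 2 KiB of work RAM, mirrored
  read(addr: number): number { return this.ram[addr & 0x07ff]; }
  write(addr: number, value: number): void { this.ram[addr & 0x07ff] = value & 0xff; }
}

interface Module {
  step(cycles: number): void; // advance this module's clock
  irq?(): void;               // optional interrupt line into the module
}

class Cpu implements Module {
  // 6502-style register file
  a = 0; x = 0; y = 0; sp = 0xfd; pc = 0x8000; status = 0x24;

  constructor(private bus: Bus) {}

  step(_cycles: number): void {
    // fetch/decode/execute would live here; a real bus would also route
    // reads to the cartridge, PPU registers, etc.
    const opcode = this.bus.read(this.pc);
    this.pc = (this.pc + 1) & 0xffff;
    // ...decode `opcode` and mutate registers/memory via the bus...
  }

  irq(): void {
    // push PC and status, then jump to the IRQ vector
  }
}

const bus = new Bus();
const cpu = new Cpu(bus);
cpu.step(1);
```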
As far as other NES emulators go, this project does many things in non-standard ways; for instance, I use per-pixel rendering whereas many emulators use scanline rendering. I use an AudioWorklet with various mixing effects for audio, whereas other emulators use something much simpler or don't even bother fully implementing the APU. I can comfortably say there's no NES emulator out there written the way this one is written.
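For readers unfamiliar with AudioWorklets: the emulator core typically generates APU samples on the main thread and ships them to a processor running on the audio thread. A generic sketch of that shape (assumed, not this emulator's actual mixing code):

```typescript
// apu-worklet.ts -- runs in the AudioWorklet global scope, where
// AudioWorkletProcessor and registerProcessor are ambient. This is a
// generic shape for emulator audio, not the author's actual mixing code.
class ApuProcessor extends AudioWorkletProcessor {
  private queue: number[] = []; // samples pushed over from the emulator core

  constructor() {
    super();
    // The main thread posts batches of APU samples through the port.
    this.port.onmessage = (e: MessageEvent<number[]>) => {
      this.queue.push(...e.data);
    };
  }

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0]; // first output, mono channel assumed
    for (let i = 0; i < out.length; i++) {
      // On underrun, emit silence rather than stalling the audio thread.
      out[i] = this.queue.length > 0 ? this.queue.shift()! : 0;
    }
    return true; // keep the processor alive
  }
}

registerProcessor("apu-processor", ApuProcessor);
```

On the main thread, `audioContext.audioWorklet.addModule("apu-worklet.js")` loads the processor and `new AudioWorkletNode(audioContext, "apu-processor")` connects it into the graph; both calls are standard Web Audio API.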
> I'd be a bit suspicious of an LLM getting an emulator right when all it has to go on is docs and no ability to test (since the pass criterion is "behaves the same as something you don't have access to")... Did you check the degree to which it may have been copying other NES emulators?
Purely JavaScript-based NES emulators are few in number, and those that implement all aspects of the system are even fewer, so I can comfortably say it doesn't copy any of the ones I've seen. I would be surprised if it did, since I came up with most of the abstractions myself and guided Claude heavily. While Claude can't get docs on its own, I can. I put all the relevant documentation in the context window myself, along with the test ROM output and source code. I'm still commanding the LLM myself; it's not like I told Claude to build an emulator and left it alone for 3 days.
Even with your own expert guidance, it does seem impressive that Claude was able to complete a project like this without getting bogged down in the complexity.
The tech stack is nothing fancy/rare, but it's not the usual ReactJS slop either - it's C# with OpenGL.
I can't comment about the best practices though because my codebase follows none of them.
Yes, the user has to know enough to guide the AI when it's failing. So it can't exactly replace the programmer as it is now.
It really can't do niche stuff, however - like SIMD. Maybe it would do better if I compiled a cheat sheet of .NET SIMD snippets and how-tos, because this stuff isn't really on the internet in a coherent form at all. So it's highly unlikely that it was trained on that.
A Rust + WASM simulation of organisms in an ecosystem, with evolving neural networks and genes. Super fun to build and watch.
> Which AI are you using?
I'm using ChatGPT/Claude/Gemini with a custom tool I built, similar to aider or Claude Code, except it's very interactive: like chatting with the AI as it suggests changes that I approve or decline.
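The comment doesn't show the tool's internals, but the approve/decline loop it describes might look something like this hypothetical Node.js sketch (`SuggestedChange` and the patching step are made up for illustration):

```typescript
// Hypothetical sketch of an approve/decline review loop like the one the
// comment describes; the real tool's internals aren't shown in the thread.
import * as readline from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

interface SuggestedChange {
  file: string;
  diff: string; // unified diff proposed by the model
}

async function reviewChanges(changes: SuggestedChange[]): Promise<void> {
  const rl = readline.createInterface({ input, output });
  for (const change of changes) {
    console.log(`\n--- ${change.file} ---\n${change.diff}`);
    const answer = await rl.question("Apply this change? [y/n] ");
    if (answer.trim().toLowerCase() === "y") {
      // Placeholder: the real tool would apply the diff to the file here.
      console.log(`(would apply diff to ${change.file})`);
    } else {
      console.log("(skipped)");
    }
  }
  rl.close();
}
```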
>No sign so far of AI's usefulness slowing down as the complexity increases?
The AI is not perfect; there are some cases where it is unable to solve a challenging issue and I must help it. This usually happens with big sweeping changes that touch code all over the codebase. It introduces bugs, but it can also debug them easily, especially with the increased compile-time checking in Rust. Runtime bugs are harder, because I have to tell the AI the behavior I observe. Iterating on UI design is clumsy, and it's often faster for me to just make the changes myself.
Given that you've built your own coding tool, I assume this is as much about testing what AI can do as it is about the project itself? Is it a clear win as far as productivity goes?
As far as productivity goes, it's hard for me to quantify, but most of these projects would not be feasible for me to pursue in my limited free time without the force multiplier of AI.
I basically use two scripts: one to flatten the whole codebase into one text file, and one to split it back out. Give it a shot; it's amazing...
1. Cursor Pro with Sonnet to implement things the Cursor way.
2. Install the Gemini Code extension in Cursor.
3. Install the Gemini Coder Connector Chrome extension: https://chromewebstore.google.com/detail/gemini-coder-connec...
4. Get the free aistudio.google.com Gemini API and connect the extensions.
5. Feed your codebase or select files via the Cursor extension and get the implementation from aistudio.google.com.
I prefer having Sonnet implement it via Cursor rather than Gemini because it can automatically go through all the linting/testing loops without extra input from me, run the server, and check that there are no errors.
When provided the flat format, it was able to replicate it without much instruction. For a blank prompt (starting from scratch), I had success with the prompt further below. An example of the format:
===FILE===
Index: 1
Path: src/main/java/com/example/myapp/Greeter.java
Length: 151
Content:
package com.example.myapp;

public class Greeter {
    public String getGreeting() {
        return "Hello from the Greeter class!";
    }
}
===ENDFILE===
===FILE===
Index: 2
Path: src/main/java/com/example/myapp/Main.java
Length: 222
Content:
package com.example.myapp;

public class Main {
    public static void main(String[] args) {
        Greeter greeter = new Greeter();
        String message = greeter.getGreeting();
        System.out.println("Main app says: " + message);
    }
}
===ENDFILE===
===FILE===
Index: 3
Path: pom.xml
Length: 659
Content:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>my-simple-app</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
</project>
===ENDFILE===

Prompt to request the format if starting from scratch: Present the entire codebase using the following multi-file format:
The codebase should be presented as a single, monolithic text output. Inside this output, represent each file of the project individually using the following structure:
Start Marker: Each file must begin with the exact line: ===FILE===
Metadata Block: Immediately following the start marker, include these four specific metadata lines, each on its own line:
Index: <N> (where <N> is a sequential integer index for the file, starting from 1).
Path: <path/to/file/filename.ext> (The full relative path of the file from the project's root directory, e.g., index.html, css/style.css, js/script.js, jobs.html, etc.).
Length: <L> (where <L> is the exact character count of the file's content that follows).
Content: (This literal line acts as a separator).
File Content: Immediately after the Content: line, include the entire raw content of the file. Preserve all original line breaks, indentation, and formatting exactly as it should appear in the actual file.
End Marker: Each file's section must end with the exact line: ===ENDFILE===
Ensure all necessary files for the project (HTML, CSS, JS) are included sequentially within the single output block according to this structure.
Crucially, enclose the entire multi-file output, starting from the very first ===FILE=== line down to the very last ===ENDFILE=== line, within a single Markdown fenced code block using exactly five backticks (`````) on the lines immediately before the first ===FILE=== and immediately after the last `===ENDFILE===`. This ensures that any triple backticks (```) within the generated file content are displayed correctly.
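Since the flatten and split scripts themselves aren't posted, here is a minimal sketch of what they could look like, assuming Node.js and the format just described; the skip list, output filename, and naive marker parsing are my assumptions:

```typescript
// Minimal sketch of the two scripts, assuming Node.js. The skip list,
// output filename, and naive regex parsing are assumptions.
import { mkdirSync, readdirSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { dirname, join, relative } from "node:path";

// Recursively list files, skipping directories that would bloat the output.
function collectFiles(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    if (name === "node_modules" || name === ".git") return [];
    const full = join(dir, name);
    return statSync(full).isDirectory() ? collectFiles(full) : [full];
  });
}

// Flatten: emit every file in the ===FILE=== format described above.
function flatten(root: string): string {
  return collectFiles(root)
    .map((file, i) => {
      const content = readFileSync(file, "utf8");
      return [
        "===FILE===",
        `Index: ${i + 1}`,
        `Path: ${relative(root, file)}`,
        `Length: ${content.length}`, // character count, per the spec
        "Content:",
        content,
        "===ENDFILE===",
      ].join("\n");
    })
    .join("\n");
}

// Split: the inverse -- parse the markers and write each file back out.
// Naive: assumes file content never contains the end marker itself.
function split(flat: string): void {
  const re =
    /===FILE===\nIndex: \d+\nPath: (.+)\nLength: \d+\nContent:\n([\s\S]*?)\n===ENDFILE===/g;
  for (const [, path, content] of flat.matchAll(re)) {
    mkdirSync(dirname(path), { recursive: true });
    writeFileSync(path, content);
  }
}

writeFileSync("codebase.txt", flatten(process.cwd()));
```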
Also, I generally dislike thinking models for coding and prefer faster models, so if you have something easy, Gemini 2.0 is good.
Is that true? I like to think it’s mostly kids. Honestly the world is a dark place if it’s adults doing the clicking.
His videos also have zero substance and are now mostly article reading, which would be forgivable if he added valuable input, but that's never the case with him.
They're just different tools for different jobs really.
Sure, your provider of choice might fall behind for a few months, but they'll just release a new version eventually and might come out on top again. Intelligence seems commodified enough already that I don't care as much whether I have the best or second best.
For some of these I see something like 15k followers on X, but then no LinkedIn page, for example. The website is always a company you cannot contact, and they do everything.