Posted by mfiguiere 3 days ago
Claude Code did great and wrote pretty decent docs.
Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.
I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.
I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.
I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.
These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.
It seems about 4x costlier than my Aider + OpenRouter setup. Since I'm less about vibes or huge refactoring, my (first and only) bill is under $5 with Gemini. These models will halve that.
It's very much a "Claude Code" in the sense that you have a "q chat" command line command that can do everything from changing files, running shell commands, reading and researching, etc. So I can say "q chat" and then tell it "read this repo and create a README" or whatever else Claude Code can do. It does everything by itself in an agentic way. (I didn't want to say like 'Aider' because the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change)
(It's calling itself Q but from my testing it's pretty clear that it's a variant of Claude hosted through AWS which makes sense considering how much money Amazon pumped into Anthropic)
How is this appealing? I think I must be getting old, because the idea of letting a language model run wild and run commands on my system -- that's unsanitized input! -- horrifies me! What do you mean, just let it change random files??
I'm going to have to learn a new trade, IDK
It only has access to files within the directory it's run from, even if it calls tools that could theoretically access files anywhere on your system. It also has networking blocked, again in a sandboxed fashion, so things like curl don't work either.
I wasn't particularly impressed with my short test of Codex yesterday. Just the fact that it managed to make any decent changes at all was good, but when it messed up the code, it took a long time and a lot of tokens to sort things out.
I think we need fine tuned models that are good at different tasks. A specific fine tune for fixing syntax errors in Java would be a good start.
In general it also needs to be more proactive in writing and running tests.
4k loc per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.
Edit: No, I don't think I'm misunderstanding; if you want to go over that, they direct you to a pay-per-request plan and you're no longer capped at $20.
I've been running this almost daily for the past few months without any issues or extra cost. Still just paying $20.
When I try Gemini 2.5 Pro Exp with Cline it does very well, but it often fails to use the tools provided by Cline, which makes it way less expensive while failing random basic tasks that Sonnet does in its sleep. I pay the extra to save the time.
Do not get me wrong. Maybe I am totally outdated with my opinion. It is hard to keep up these days.
It’s too expensive for what it does though. And it starts failing rapidly when it exhausts the context window.
Be aware of the "cache".
Tell it to read specific files, and never use /compact (that'll bust the cache; if you feel you need to, you're going back and forth too much or using too many files at once).
Never edit files manually during a session (that'll bust the cache). THIS INCLUDES LINT.
Have a clear goal in mind and keep sessions to as few messages as possible.
Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).
If I need to feed it a ton of docs etc. for some task, it'll be more in the few-dollar range rather than under $1. But I really only do this to try some prototype with a library Claude doesn't know about (or has outdated knowledge of).
For hobby stuff, it adds up - totally.
For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).
Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.
Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.
Like others have said - if you're exhausting the context window, the problem is you, not the tool.
Example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather - I tell it specifically what I want to do within those models, give it specific method names and tell it not to read the whole file, rather search for and read the area around the method definition.
If you _do_ need it to work with very large files - they probably shouldn't be that large, and you're likely better off refactoring those files (with Claude, of course) to abstract out what you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task, so that when it reads them it doesn't have to pull all of that into context (i.e., copy the file to a backup location, delete a bunch of unrelated stuff in the working file, do your work with Claude, then merge the changes into the backup file and copy it back).
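Roughly, that dance looks like this (the paths are made up):

cp app/models/report.rb /tmp/report.rb.full
# hand-delete the code in app/models/report.rb that isn't relevant to the task
# ... do the task with Claude Code against the slimmed-down file ...
diff /tmp/report.rb.full app/models/report.rb
# fold those changes into /tmp/report.rb.full by hand, then restore it:
cp /tmp/report.rb.full app/models/report.rb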
If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.
I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)
I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)
100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.
It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.
It's a skill problem. Not the tool.
It's almost always the user's fault when it comes to tools. If you're using one and it's not doing its 'job' well, it's more likely that you're using it wrong than that it's a bad tool. Almost universally.
Right tool for the job, etc etc. Also important that you're using it right, for the right job.
Claude Code isn't meant to refactor entire projects. If you're trying to load up 100k token "whole projects" into it - you're using it wrong. Just a fact. That's not what this tool is designed to do. Sure.. maybe it "works" or gets close enough to make people think that is what it's designed for, but it's not.
Detailed, specific work... it excels so wildly that it's astonishing to me that these takes exist.
In saying all of that, there _are_ times I dump huge amounts of context into it (Claude, projects, not Claude Code - cause that's not what it's designed for) and I don't have "conversations" with it in that manner. I load it up with a bunch of context, ask my question/give it a task and that first response is all you need. If it doesn't solve your concern, it should shine enough light that you now know how you want to address it in a more granular fashion.
Is it a tool problem or a skill problem when a surgeon doesn't know how to use a robotic surgery assistant/robot?
I'm a paying customer and I know my time is sufficiently valuable that this kind of technology pays for itself.
As an analogy, I liken it to a scribe (author's assistant).
Your comment has lots of useful hints -- thanks for taking the time to write them up!
I mean, it was. Right up until it exhausted the context window. Then it suddenly required hand holding.
If I wanted to do that I might as well use Cursor.
Sometimes I see an area of AI/LLMs where I thought that even with a 10x efficiency improvement and 10x the hardware resources (100x in aggregate), it would still be nowhere near good enough.
The truth is probably somewhere in the middle, which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment, and it will continue for another 10 years before, hopefully, another breakthrough.
recommended read - https://transluce.org/investigating-o3-truthfulness
I wonder if this is what's causing it to do badly in these cases
this is a direct answer to claude code which has been shipping furiously: https://x.com/_catwu/status/1903130881205977320
and it is not open source; there are unverified reports that they have DMCA'ed decompilations: https://x.com/vikhyatk/status/1899997417736724858?s=46
by total coincidence we're releasing our claude code interview later this week that touches on a lot of these points + why code agent CLIs are an actually underrated point in the SWE design space
(TLDR you can use it like a linux utility - similar to @simonw's `llm` - to sprinkle intelligence in all sorts of things like CI/PR review without the overhead of buying a Devin or a Copilot SaaS)
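(for instance, a PR review step can be as simple as piping a diff into the CLI - a rough sketch, exact flags vary by tool:)

git diff origin/main...HEAD | llm "Review this diff for bugs and risky changes"
# or with Claude Code's non-interactive print mode
git diff origin/main...HEAD | claude -p "Review this diff for bugs and risky changes"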
if you are a Claude Code (and now OAI Codex) power user we want to hear use cases - CFP closing soon, apply here https://sessionize.com/ai-engineer-worlds-fair-2025
I have tried aider/copilot/continue/etc. But they lack in one way or another.
In aider everything is loaded in memory: I can add/drop files in the terminal, discuss in the terminal, switch models, every change is a commit, and I can run terminal commands with ! at the start.
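A typical session looks roughly like this (file names, model, and prompt are just examples): start aider with the relevant files, pull in or drop files as you go, switch models, run a shell command, then ask for a change.

aider src/main.py tests/test_main.py
/add src/utils.py
/drop tests/test_main.py
/model gpt-4o
!pytest -q
refactor parse_args in src/main.py to use argparse

Each edit it makes lands as its own git commit, so it's easy to review or revert.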
Full codebase is more expensive and slower than just the relevant files. I understand if you don't worry about the cost, but at a reasonable size, pasting the full codebase can't really be a thing.
- an embedded project for esp32 (100k tokens)
- visual inertial odometry algorithm (200k+ tokens)
- a web app (60k tokens)
- the tool itself mentioned above (~30k tokens)
it has worked well enough for me. Other methods have not.
Copilot used to be useless, but over the last few months, since edit mode was added, it has become quite excellent.
Claude Projects, chatgpt projects, Sourcegraph Cody context building, MCP file systems, all of these are black boxes of what I can only describe as lossy compression of context.
Each is incentivized to deliver ~”pretty good” results at the highest token compression possible.
The best way around this I've found is to just own it in the web clients by including structured, concatenated files directly in chat contexts.
Self plug but super relevant: I built FileKitty specifically to aid this; it made the HN front page and I've continued to improve it:
https://news.ycombinator.com/item?id=40226976
If you can quickly prepare your file-system context yourself using any workflow, and pair it with appropriate additional context such as run output, a problem description, etc., you can get excellent results, and you can pound away at an OpenAI or Anthropic subscription while refining the prompt or updating the file context.
I have been finding myself spending more time putting together complex prompts for big, difficult problems that would not make sense to solve in the IDE.
Same. I used to run a bash script that concatenates files I'm interested in and annotates their path/name to the top in a comment. I haven't needed that recently as I think the # of attachments for Claude has increased (or I haven't needed as many small disparate files at once)
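Something like this minimal sketch (the annotation format is just one option):

#!/usr/bin/env bash
# Concatenate the given files for pasting into a chat, prefixing each with its path.
for f in "$@"; do
  echo "# ==== $f ===="
  cat "$f"
  echo
done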
I have run into this issue of reincorporating LLM code recommendations back into a project, so I'm interested in exploring your take.
I told a colleague that I thought excellent use of copy paste and markdown were some of the chief skills of working with gen AI for code right now.
This and context management are as important as prompting.
It makes the details of the UI choices for copying web chat conversations or their segments so strangely important.
I agree, though, that a lot of those agents are black boxes and it's hard to even learn how best to combine .rules, llms.txt, PRDs, MCP, web search, function calls, and memory. Most IDEs don't provide output where you can inspect the final prompts to see how those are executed - maybe you have to use something like mitmproxy to inspect the requests, but a proper tool would be useful for learning best practices.
I will be trying more Roo Code and Cline, since they're open source and you can at least see the system prompts, etc.
You can choose files to include and they don't appear to be truncated in any way. Though to be fair, I haven't checked the network traffic, but it appears to operate in this fashion from day to day use.
The code completion is chef's kiss, though.
Just checked to see how it works. It seems that it does all that you are describing. The difference is in the way that it provides the files - it doesn't use xml format.
If you wish you could /add * to add all your files.
Also, deducing from this mode, it seems that any file you add to the aider chat with /add has its full contents added to the chat context.
But hey I might be wrong. Did a limited test with 3 files in project.
I also understand having built your own tool to fit your own workflow, and being able to easily mold it to what you need.
I’m actually legitimately surprised how good it is, since other coding agents I’ve used before have mostly been a letdown, which made me only use Claude in direct change prompting with Zed (“implement xyz here”, “rewrite this function with abc”, etc), so very hands-on.
So I went into trying out Claude Code rather pessimistically, and now I'm using it all the time! Sure, it ends up costing a bunch, but it's easy to justify $15 for a prompting session if the end result is a mostly complete PR, done much faster.
All that is to say - competition is good, fingers crossed for codex!
There is a fork named Anon Kode https://github.com/dnakov/anon-kode which can use more models, including non-Anthropic ones. But its license is unclear.
It's interesting to see Codex under the Apache License. Maybe somebody will extend it to be usable with competing models.
Now, whether or not Anthropic cares enough to enforce its license is a separate issue, but it seems unwise to make much of an investment in it.
But it has one downside: it's not so good on unknown, big, complex codebases where you don't know how things are structured. I wish they (or somebody else) would add an AI or some automation to add files dynamically, or in a smart way, when you don't know the codebase structure (at the expense of burning more tokens).
I'm thinking Codex (haven't checked it yet), Claude Code, Anon Kode, and all the AI editors/plugins do a better job there (and potentially burn more tokens).
But that's the only downside I can think of about aider.
Hope more competition can bring price down.
With Claude Code I can stay in Goland, and have Claude Code in the terminal.
Moreover, there’s no way to bring your own key, with the highest subscription tier being $20 per month flat it seems, which is the cost of just 1-3 sessions with Claude Code. Thus, without evidence to the contrary, I’m not holding my breath for now.
It's also much easier to control execution in a structured and reliable way in the terminal. Here's an automated debugging use case, for example: https://www.youtube.com/watch?v=g-_76U_nK0Y
Given how much time these models can save me, I'd rather optimize for capability and just accept whatever the price is as a cost of doing business. (Within reason I guess—I probably wouldn't go beyond $2-3k per month at this point, unless there was very clear ROI on that spend.)
Also, it's not only about saving time. More powerful AI tools allow me to build things it would otherwise be impossible to build... that's just as important as the time/cost equation.
I mean, if you want to pour money down the drain because you think it's helping, have at it :P
You're not actually getting all the files you add into the context window; you're getting a RAG'd version of them, which is generally much worse if the un-RAG'd code would still fit within the effective context limit.
1. Claude Code 2. Cursor 3. Cline 4. Windsurf
I'll stick with Windsurf, especially given their upcoming announcement.
> More powerful AI tools allow me to build things it would otherwise be impossible to build...
You're being paid to type? I want your job.
# Point Claude Code at AWS Bedrock instead of the Anthropic API:
export CLAUDE_CODE_USE_BEDROCK=1
export ANTHROPIC_MODEL=us.anthropic.claude-3-7-sonnet-20250219-v1:0
export ANTHROPIC_API_TYPE=bedrock
This is all ignoring the controversies that pop up around e.g. Cursor seemingly every week. As an IDE, they're both getting there -- but I have objectively better results in Claude Code.
seriously though, anything that makes me smarter and more productive has a threshold in the thousands-of-dollars range, not hundreds
Or they are burning VC money.
1. The default model doesn't work and you get an error: system OpenAI rejected the request (request ID: req_06727eaf1c5d1e3f900760d10ca565a7). Please verify your settings and try again.
2. You have to switch to model o4-mini-2025-04-16 or some other model using /model. And if you exit codex, you're back on the default model and have to switch again every time.
3. It crashed the first time with a NodeJS error.
But after the initial hiccups it seems to work, and I'm still checking how good/bad it is compared to Claude Code (which I love, except for the context size limits).
It's exceptionally good at coding. Amazing software, really; I'm sure the cost hurdles will be resolved. Even now it's often worth the spend.
This.. isn't true.
I also spent a couple hours picking apart Codex with the goal of adding Sonnet 3.7 support (almost there). The actual agent loop they're using is very simple. Not to say that's a bad thing, but they're offloading all planning and workflow execution to the agent itself. That's probably the right end state to shoot for long-term, but given the current state of these models I've had much better success offloading task tracking to some other thing - even if that thing is just a markdown checklist. (I wrote about my experience [1] building AI Agents last year.)
Deepseek is about 1/20th of the price and only slightly behind Claude.
Both have a tendency to over engineer. It's like a junior engineer who treats LOC as a KPI.
export OPENAI_API_KEY="your-api-key-here"
Note: This command sets the key only for your current terminal session. To make it permanent, add the export line to your shell's configuration file (e.g., ~/.zshrc).
Can't any 3rd party utility running in the same shell session phone home with the API key? I'd ideally want only codex to be able to access this var
OPENAI_API_KEY="your-api-key-here" codex
You could, however, wrap the export and the codex command in a script and just call that. That way the variable would only be part of that script's environment.
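A minimal sketch of such a wrapper (the name and location are up to you; keep the file out of version control and readable only by you):

#!/usr/bin/env bash
# codex-wrapped: keep OPENAI_API_KEY out of the interactive shell's environment
export OPENAI_API_KEY="your-api-key-here"
exec codex "$@"

Run ./codex-wrapped instead of codex, and your interactive shell never has the key in its environment.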
Now I know I should be careful examining code not formatted in a code block.
People downvoting legitimate questions on HN should be ashamed of themselves.