Posted by hansonw 8 hours ago
One huge difference I notice between Codex and Claude Code is that, while Claude basically disregards your instructions (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly persistent in following every last character of them - to the point that I've seen it work for 30 minutes to convolute some solution that was only convoluted because of some sentence I threw in the instructions and had completely forgotten about.
I imagine Codex as the "literal genie" - it'll give you exactly what you asked for. EXACTLY. If you ask Claude to fix a test that accidentally says assert(1 + 1 === 3), it'll say "this is clearly a typo" and just rewrite the test. Codex will rewrite the entire V8 engine to break arithmetic.
Both these tools have their uses, and I don't think one approach is universally better. Because Claude just hacks its way to a solution, it is really fast, so I like using it for iterative web work, where I need to tweak some styles and I need a fast iterative loop. Codex is much worse at that because it takes like 5 minutes to validate everything is correct. Codex is much better for longer, harder tasks that have to be correct -- I can just write some script to verify that what it did works, and let it spin for 30-40 minutes.
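A minimal sketch of the "write some script to verify" idea, assuming the check can be expressed as a command that exits with status 0 on success. The `verify` helper and the 600-second timeout are hypothetical, not part of Codex or Claude Code; the two sample commands just reuse the arithmetic example from above.

```python
import subprocess
import sys


def verify(cmd: list[str], timeout_s: int = 600) -> bool:
    """Run a check command; return True only if it exits with status 0."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A check that never finishes counts as a failure.
        return False
    return result.returncode == 0


# Stand-in checks: a real setup would run the project's test suite instead.
passing = verify([sys.executable, "-c", "assert 1 + 1 == 2"])
failing = verify([sys.executable, "-c", "assert 1 + 1 == 3"])
print(passing, failing)
```

The point of keeping the check outside the agent is that the agent cannot "pass" by rewriting the assertion; it only passes when the command itself succeeds.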
I was unconvinced it had actually, fully ripped out the floating origin logic, so I had it write up a summary and then used that as a high level guide to pick through the code and it had, as you said, followed the instructions to the letter. Hugely impressive. In March of 2023 OpenAI's products struggled to draw a floating wireframe cube.
A friend of mine tells Claude to always address him as “Mr Tinkleberry”. He says he can tell Claude is not paying attention to the instructions in CLAUDE.md when it stops calling him “Mr Tinkleberry” consistently.
This guy has a good write up on the topic
I'd be wary of using any canary material that wouldn't be at home in the sort of work you're doing.
I don't know the merit of what parent is saying, but it does make some intuitive sense if you think about it. As the context fills up, the LLM pays less attention to material further and further back in the context, which is why the LLM seems dumber and dumber as a conversation goes on. If you put 5 instructions in the system prompt or initial message, where one acts as a canary, then it becomes easier to see exactly when it stops following the instructions.
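A minimal sketch of the canary check, assuming you have the assistant's replies as plain strings in conversation order. The `first_drift` helper is hypothetical and the canary phrase is borrowed from the anecdote above; a real canary may need a looser match than a substring test.

```python
from typing import Optional

CANARY = "Mr Tinkleberry"  # illustrative canary phrase from the anecdote


def first_drift(responses: list[str], canary: str = CANARY) -> Optional[int]:
    """Return the index of the first reply missing the canary, or None."""
    for i, text in enumerate(responses):
        if canary.lower() not in text.lower():
            return i
    return None


responses = [
    "Sure, Mr Tinkleberry, I updated the tests.",
    "Done, Mr Tinkleberry.",
    "I refactored the module as requested.",
]
drift_index = first_drift(responses)
print(drift_index)  # index of the first turn without the canary
```

If the index keeps landing at roughly the same depth in the conversation, that is a hint about where the instructions stop being followed.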
Personally, I always go for a one-shot answer, and if it gets it wrong or misunderstands, I restart from the beginning. If it doesn't get it right, I need to adjust the prompt and retry. Seems to me all current models get a lot worse quickly once there is some back and forth.
This isn't an exaggeration either. Codex acts as if it is the last programmer on Earth and must accomplish its task at all costs. This is great for anyone content to treat it like a black box, but I am not content to do that. I want a collaborator with common sense, even if it means making mistakes or bad assumptions now and then.
I think it really does reflect a difference in how OpenAI and Anthropic see humanity's future with AI.
This feels very strange to me. I use Claude a lot and it follows the instructions very well. What's in your CLAUDE.md file? It's supposed to be fairly concise/brief and not use up too much context.
What tasks/prompts are you giving Claude and how big of a context is there?
EDIT: Also which model are you using?
Does anyone know of a way to fix this? Claude constantly disregards my CLAUDE.md. I put a decent amount of time into it and it's pretty much worthless without explicitly telling it to reference it before each prompt.
(search for effective context problem for more info. e.g. https://arxiv.org/abs/2509.21361)
To solve it, you just don't allow your current context to use more than 50% of the total window size
To do that in Claude code, you have to use subagents and design small enough agents
Then you can use skills so that it remembers the little details or the steps every time
More effectively, you use skills to tell the main thread when to use which agent.
If you don't understand anything I said, try to restate the important things to the model periodically, and keep your tasks small.
Use plan mode and make the model store and track its progress in a markdown file, and when the context is polluted, call /compact and then make it re-read the context from the files it created
You can prompt it as simply as:
First, understand the login feature on the repo using subagents and create a document on docs/ for future reference. Then, understand the task at hand and create an implementation plan. <task> blah blah </task>
Also, wrapping things in XML tags seems to make it easier for the model's attention to hold on to them
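The two mechanical parts of the advice above (stay under half the window, wrap the task in tags) can be sketched roughly like this. Both helpers are hypothetical, the 200,000-token window is an assumed figure for illustration, and how you obtain the used-token count depends on your tool.

```python
def within_budget(used_tokens: int, window_tokens: int, limit: float = 0.5) -> bool:
    """True while the context uses no more than `limit` of the window."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens <= limit * window_tokens


def build_prompt(steps: list[str], task: str) -> str:
    """Join plain-language steps and wrap the task itself in <task> tags."""
    return " ".join(steps) + f" <task> {task} </task>"


WINDOW = 200_000  # assumed window size, for illustration only

print(within_budget(90_000, WINDOW))   # under half the window
print(within_budget(120_000, WINDOW))  # over half: time to compact or hand off

prompt = build_prompt(
    [
        "First, understand the login feature using subagents.",
        "Then create an implementation plan.",
    ],
    "blah blah",
)
print(prompt)
```

When `within_budget` turns false, the thread's suggestion is to compact and re-read the saved markdown notes rather than keep going in the same context.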
Which is why I made the feature request for hooks (Claude Code implemented them, as did Cursor; hopefully Codex will too)
And will soon release https://github.com/eqtylab/cupcake
In my experience, for some reason adherence is not even close to 100%. It's fixated on adding asterisk function params in my Python code and I cannot get it to stop... Maybe I haven't found the right wording, or maybe my codebase has grown past a certain size (there are like a dozen AGENTS.md files dancing around).
I'm still very happy with the tool, though.
I'll give Gemini direction, it'll research... start trying to solve it as I've told it to... and then exclaim, "Oh! It turns out that <X> isn't what <user> thought!" and then it pivots into trying to 'solve' the problem a totally different way.
The issue however... is that it's:
1) Often no longer solving the problem that I actually wanted to solve. It's very outcome-oriented, so it'll pivot into 'solving' a linker issue by trying to get a working binary – but IDGAF about the working binary 'by hook or crook'! I'm trying to fix the damn linker issue!
2) Just... wrong. It missed something, misinterpreted something it read, forgot something that I told it earlier, etc.
So... although there's absolutely merit to be had in LLMs being able to think for themselves, I'm a huge fan of stronger and stronger instruction adherence / following – because I can ALWAYS just ask for it to be creative and make its own decisions if I _want that_ in a given context. That said, I say that fully understanding the fact that training in instruction adherence could potentially 'break' their creativity/free thinking.
Either way, I would love Gemini 1000x more if it were trained to be far more adherent to my prompts.
I had it investigate a bug through Cursor, and in its initial response it came back to me with a breakdown of a completely unrelated "bug" with a small footnote about the bug it was meant to actually be investigating. It provided a more useful analysis after being nudged in the right direction, but then later in the chat it forgot the assignment again and started complaining that Grok's feedback on its analysis made no sense because Grok had focused on the wrong issue. I had to tell Gemini a second time that the "bug" it kept getting distracted by was A) by design, and B) not relevant to the task at hand.
Ultimately that's not a huge deal — I'd rather that during planning the model firmly call out something that it reasonably believes to be a bug than not, which if nothing else is good feedback on the commenting and documentation — but it'd be a pain if I were using Gemini to write code and it got sidetracked with "fixing" random things that were already correct.
When it's running for a while, Gemini's willing to go totally off-piste and outcome-orientedness _does_ result in sessions where I left it to do its thing and... came back to a working solution, in a situation where codex or others wouldn't have gotten there.
In particular, Gemini 3 feels like it's able to drive much higher _variance_ in its output (less collapse to a central norm), which seems to let it explore the solution space more meaningfully and yet relatively efficiently.
- New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on SWE-Lancer, and 58.1% on TerminalBench 2.0
- Natively trained to work across many hours across multiple context windows via compaction
- 30% more token-efficient at the same reasoning level across many tasks
Let us know what you think!
how much more token efficient is this compared to 5.0
had to use 5.0 because 5.1 was eating tokens like crazy and seemed like a slight incremental improvement barely noticeable
I really like the "subagent" feature in Claude Code — it's super useful to manage context in complex codebases. Here are some examples of agents that can be useful: https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
Would it make sense to have a similar feature in Codex CLI? I often do "spec-driven development", which is basically a loop of:
research -> implementation plan -> actual implementation (based on research + plan) -> validation
I have multiple subagents that I use for each phase that (based on subjective judgement) improve the output quality (vs keeping everything, every tool use etc. in the "main" context window). Codex CLI is great and I use it often but I'd like to have more of these convenient features for managing context from CC. I'm super happy that compaction is now available, hopefully we'll get more features for managing context.
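A rough sketch of that loop, assuming each phase is a function that receives only the short artifacts from earlier phases rather than their full tool output. `run_phases` and the stub phases are hypothetical stand-ins for real subagent calls, not a Claude Code or Codex CLI API.

```python
from typing import Callable

Artifacts = dict[str, str]
Phase = Callable[[Artifacts], str]


def run_phases(task: str, phases: list[tuple[str, Phase]]) -> Artifacts:
    """Run phases in order; each one sees a copy of the prior artifacts."""
    artifacts: Artifacts = {"task": task}
    for name, phase in phases:
        # Only compact summaries are carried forward, which keeps the
        # "main" context small.
        artifacts[name] = phase(dict(artifacts))
    return artifacts


# Stub phases standing in for subagents (illustrative only).
phases: list[tuple[str, Phase]] = [
    ("research", lambda a: f"notes on: {a['task']}"),
    ("plan", lambda a: f"plan from: {a['research']}"),
    ("implementation", lambda a: f"code from: {a['research']} + {a['plan']}"),
    ("validation", lambda a: "ok" if a["implementation"] else "failed"),
]

result = run_phases("add login rate limiting", phases)
for name, value in result.items():
    print(f"{name}: {value}")
```

The implementation phase reads both the research and the plan, matching the "based on research + plan" step in the loop above.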
What does it even mean?
Is this saying that said summarization now happens at the model level? Or are there other differences?
- As a general observation, Gemini is less easy to work with as a collaborator. If I ask the same question to both models, Codex will answer the question. Gemini will read some intention behind the question, write code to implement the intention, and only then answer the question. In one case, it took me five rounds of repeatedly rewriting my prompt in various ways before I could get it to not code but just answer the question.
- Subjectively, it seemed to me that the code that Gemini wrote was more similar to code that I, as a senior-level developer, would have written than what I have been used to from recent iterations of GPT-5.1. The code seemed more readable-by-default and not merely technically correct. I was happy to see this.
- Gemini seems to have a tendency to put its "internal dialogue" into comments. For example, "// Here we will do X because of reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.". Very annoying.
I did two concrete head-to-head comparisons where both models had the same code and the same prompt.
First, both models were told to take a high-level overview of some new functionality that we needed and were told to create a detailed plan for implementing it. Both models' plans were then reviewed by me and also by both models (in fresh conversations). All three of us agreed that Codex's plan was better. In particular, Codex was better at being more comprehensive and at understanding how to integrate the new functionality more naturally into the existing code.
Then (in fresh conversations), both models were told to implement that plan. Afterwards, again, all three of us compared the resulting solutions. And, again, all three of us agreed that Codex's implementation was better.
Notably, Gemini (1) hallucinated database column names, (2) ignored parts of the functionality that the plan called for, and (3) did not produce code that was integrated as well with the existing codebase. In its favor, it did produce a better version of a particular finance-related calculation function than Codex did.
Overall, Codex was the clear winner today. Hallucinations and ignored requirements are big problems that are very annoying to deal with when they happen. Additionally, Gemini's tendencies to include odd comments and to jump past the discussion phase of projects both make it more frustrating to work with, at this stage.
"For Gemini 3, we strongly recommend keeping the temperature parameter at its default value of 1.0. While previous models often benefited from tuning temperature to control creativity versus determinism, Gemini 3's reasoning capabilities are optimized for the default setting. Changing the temperature (setting it below 1.0) may lead to unexpected behavior, such as looping or degraded performance, particularly in complex mathematical or reasoning tasks."
https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high
They were probably sitting on this for a while. That makes me think this is a fairly incremental update for Codex.
Thinking level xhigh: https://tools.simonwillison.net/svg-render#%20%20%3Csvg%20xm...
Also, thanks for the posts— it’s hugely helpful to have a continuity of insightful perspective throughout.
> a new step towards becoming a reliable coding partner
> GPT‑5.1-Codex-Max is built for long-running, detailed work
Does this not sound contradictory? It’s been the shorter form work that has built what little confidence I have in these as a coding partner - a model that goes off and does work without supervision is not a partner to me.
This is definitely one of the biggest issues with coding agents at the moment.
That said, from my experience, Codex so often does things that are so useful and save me so much time that the occasional "oh god what the hell did it just go off and do" are an acceptable cost for me.
I regularly get great results with open-ended prompts and agents that spend 15+ minutes working on the task. I'm sure they'll eventually get better at common sense understanding of what kind of work is wasteful/absurd.
Codex feels like a tool designed to run after all the humans are gone.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model to be good at both.
Claude: they barely have a signin system at all. Multiple account support doesn’t exist. The minimum seat count for business is nonsense. The data retention policies are weak.
OpenAI: Make ZDR a thing you can use or buy without talking to sales, already. And for those using containers or a remote system or really anything other than local development with the codex CLI, you really really need to fix this bug. I bet Codex could do at least the client part for you!
https://github.com/openai/codex/issues/2798
(Hint: Claude Code gets this right by default, despite the fact that everything else about Claude sign-in is a joke.)
Google: get all your B2B AI product managers in one room and tell them that they need to make one single product menu on one single webpage with all the pricing on that page and that the Google Cloud people are not permitted to make anything that isn’t actually logically Google Cloud depend on Google Cloud Billing. Your product cannot compete with OpenAI or Anthropic if people need to ask an LLM to figure out what your product is and if your own fancy LLMs can’t give a straight answer. My company pays for a non-Google product primarily because it’s too complicated to pay for the Google product! Right now, trying to use Google’s AI is like trying to ride Bay Area public transit before the Clipper Card.
I just won’t even waste my time with the google stuff cuz I can’t figure out how to pay with it.
And that’s a problem everywhere at google. Our google play account is suspended cuz I can’t verify the company. It won’t let me cuz it says I’m not the owner. I’ve always been the owner of my company. For 18 years. There is no one else.
Once some error said make sure the owner email matches your profile in google payments and I was like, what is google payments and where do I even begin with that? I’ve never paid for google play so what does payments have to do with anything?
It’s totally random stuff. Get your shit together, google. Make your products and payment systems coherent, rather than it obviously looking like it was designed by a fiefdom full of territorial managers.
Also, re "Google Payments", I tried to transfer an app from my personal/solo Google Play account to a new business one I set up for my LLC and it was like pulling teeth. They wanted me to find some payment id from the original $20 purchase I made to get access to Google Play, something I did right around when they first launched and while I still have/use the same email, Google came out with approximately 1 googol different "payment solutions" in the interim and their engineers don't care about data migrations. Finally, after many support emails, they just transferred it without me giving that code which just shows how silly the whole thing was from the start.
Utterly ridiculous.
YES I had this and eventually fixed it. I really don't know what I did but lots of clicking on random links and signing into things in different orders and then one day it somehow worked.
So frustrating.
What's harder than herding cats? Herding cats with MBAs and OKRs.
Sad part is Google does offer a ChatML/OpenAI-compliant endpoint for LLM calls, and I believe that, as an experiment, they also reduced the friction in getting an API key so you can start making calls right away, but discoverability remains a challenge with Google services.
This part is very easy now: you sign into https://aistudio.google.com/ and then click "Get API key" in the lower left corner.
The problem is that features and docs are still scattered all over. Some things can only be done via Vertex, for example.
Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra personal subscription? I already pay for OpenAI and Anthropic’s pro/max plans and would happily pay Google too. But the only obvious option is a $250/month tier, and its documentation indicates Google can train on your code unless you find and enable the correct opt-out. If that opt-out exists in all the products, it’s not obvious where it lives or what products it applies to.
Workspace complicates it further. Google advertises that with business workspace accounts your data isn’t used for training. So, I was going to try Antigravity on our codebase. At this point I know I can't trust Google, so I read the ToS carefully. They train on your prompts and source code, and there doesn't appear to be a way to pay them and opt out right now. Be careful: paying for Google Workspace does not protect you. Always read the ToS.
Be careful with AI-studio and your Google Workspace accounts. They train on your prompts unless you switch it to API mode.
The result is a lot of uncertainty. I genuinely have no idea how to pay Google for Gemini without risking my code being used for training. And if I do pay, I can’t tell whether they’ll train on my prompts anyway.
The marketing for their coding products does not clearly state when they do or do not train on your prompts and code.
I had to run deep research to understand the risks with using Gemini 3 for agentic work, and I still don't feel confident that I understand the risks. I might have said some incorrect things above, but I am just so confused. I feel like I have a <75% grasp on the situation.
I don't have a lot of trust. And honestly, this feels confusing and deceptive. One could easily mistake it for a deliberate strategy to gather training data through ambiguity and dark patterns; it certainly looks like this could be Google's strategy to win the AI race. I assume this is just how it looks, and that they aren't being evil on purpose.
OpenAI in particular has my trust. They get it. They are carefully building the customer experience, they are product and customer driven from the top.
I wouldn't trust Sam Altman. Or any of the big players really.
Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
https://github.com/google-gemini/gemini-cli/issues/12121
It is far too easy to accidentally end up under the wrong privacy agreement, to the point where some workplaces are banning use of the Gemini CLI!
Please give me an option for a password (or passkey) or literally anything else that doesn't require either linking with google or going through an email flow for every login
I'd love to see the Gemini models being available by other providers :) or if they just build a simple prepaid wallet like OpenAI and Anthropic.
Now you CAN NOT get the Google One stuff if your account is part of a workspace. I thought: how awful. I want to pay, but I simply can't?
Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license via the Google Workspace Admin area, great!
Turns out: you fucking can't. That's _Google AI Ultra FOR BUSINESS_ and that IS NOT supported.
So I had to get the Google One subscription on my personal account after all.
Combine that with the _pathetic_ usage limits: somehow not token-based, but a number of requests per 24-hour window (which is 500 for Gemini 3) and Gemini 3's incredible chattiness (it uses A LOT more requests to get something done compared to Claude) and you hit the usage limits in just 2 hours.
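For scale, here is the arithmetic behind those two numbers, taking the 500-requests-per-24-hours quota and the 2-hour burn time from the comment at face value. The variable names are just for illustration.

```python
DAILY_REQUESTS = 500   # quota quoted above
WINDOW_HOURS = 24      # length of the quota window
HOURS_TO_EXHAUST = 2   # observed time to hit the limit

# Rate at which the agent actually consumed requests.
burn_per_hour = DAILY_REQUESTS / HOURS_TO_EXHAUST
burn_per_minute = burn_per_hour / 60

# Rate that would spread the quota evenly over the whole window.
sustainable_per_hour = DAILY_REQUESTS / WINDOW_HOURS

# How many times faster than sustainable the agent was running.
overshoot = WINDOW_HOURS / HOURS_TO_EXHAUST

print(f"burn rate: {burn_per_hour:.0f} requests/hour ({burn_per_minute:.2f}/minute)")
print(f"sustainable rate: {sustainable_per_hour:.1f} requests/hour")
print(f"overshoot: {overshoot:.0f}x")
```

In other words, an agent making about four requests a minute uses the quota twelve times faster than the window allows, leaving the other 22 hours locked out.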
Peering into my crystal ball: once all "workers" have been replaced, all humans will spend all of their working hours on nothing but office politics.
Currently, I either need a fast agent that does what I want faster than I can type it (CRUD, forms, etc) or I need an agent to discuss a plan, ups and downs.
Whenever I try to give it a bigger task it takes a lot of time, and the result is often not what I expected, which might be totally my fault or context specific. But as soon as I’m able to define the task properly, I would prefer a faster model, as it will be good enough, but faster. I really don’t have problems anymore that I can’t reasonably solve fast enough with this approach.
I’ve run multiple gpt-5 codex concurrent sessions in the cloud, but I didn’t accept one thing they did.
Eventually thinking through it, reading hack boom is faster than outsourcing the work for 30 minutes + 30 minutes to digest + 30 minutes to change.
Treat it as a developer that just joined the project and isn't aware of the conventions.
Provide hints for the desired API design, mention relevant code locations that should be read to gain context on the problem, or that do similar things.
An AGENTS.md that explains the project and provides some general guidelines also helps a lot.
Codex can be incredibly strong when prompted the right way.
Then I made the mistake of saying "run npm run build and fix all issues" (something I've run probably 50 times across codex and cc in the past 2 months). CC does it pretty much 100% of the time. I walked away from Codex, and when I came back, it had installed 2 new node packages, and gone down some crazy rabbit hole with eslint and something else. (this was for 2 minor typescript errors)
After I reverted all its changes, I had CC do it and it fixed it in about 30-60 seconds.
I'll try a few more times. Let's see.