
Posted by hansonw 11/19/2025

Building more with GPT-5.1-Codex-Max (openai.com)
483 points | 319 comments
freediver 11/19/2025|
First time there's been a worthy alternative to Claude Code. Codex Max solved a problem that Claude Code failed at multiple times. Gemini CLI was never a contender (between login/activation/rate limits - wth), though I will say Gemini CLI has the nicest terminal UI.
simianwords 11/19/2025||
> Compaction enables GPT‑5.1-Codex-Max to complete tasks that would have previously failed due to context-window limits, such as complex refactors and long-running agent loops by pruning its history while preserving the most important context over long horizons. In Codex applications, GPT‑5.1-Codex-Max automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.

Wouldn't the model automatically do that using attention techniques? Why do you need to do it at the token layer and not leave it to the model to automatically decide which tokens are worth paying attention to?

adastra22 11/19/2025||
Attention is quadratic, so you have to pick a cutoff for context window size. In addition, the error/noise in state space increases with longer contexts, resulting in poorer performance. So even if you're willing to take the O(n^2) slowdown of a larger context window, it still won't work.
fancy_pantser 11/19/2025||
> Attention is quadratic

Exactly. Standard Multi-Head Attention materializes a score matrix that grows to roughly 4 billion entries per head (64K x 64K) for a 64K sequence, as a starting place. FlashAttention v2 helps slightly, but as you grow to 128K context length, you still need over 1TB/s of memory bandwidth to stay compute-bound in practice even with this optimization.
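
To make the quadratic blow-up concrete, here's a rough back-of-envelope in Python (a sketch only; the head count and dtype below are illustrative assumptions, not any particular model's config):

```python
# Rough cost of materializing full attention scores (naive MHA, fp16).
# Illustrative numbers only; real models vary, and FlashAttention avoids
# materializing this matrix at all (the FLOPs stay quadratic, though).

def attn_score_entries(seq_len: int) -> int:
    # One (seq_len x seq_len) score matrix per head per layer.
    return seq_len * seq_len

def score_matrix_gib(seq_len: int, n_heads: int = 32, bytes_per_elem: int = 2) -> float:
    # Memory to hold the score matrices for a single layer, in GiB.
    return attn_score_entries(seq_len) * n_heads * bytes_per_elem / 2**30

for n in (16_384, 65_536, 131_072):
    print(f"{n:>7} tokens: {attn_score_entries(n)/1e9:6.1f}B entries/head, "
          f"{score_matrix_gib(n):8.1f} GiB/layer (fp16, 32 heads)")
```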

So there has been a lot of research in this area and model architectures released this year are showing some promising improvements. Sliding windows lose context fidelity and if you go fully linear, you sacrifice math, logic, and long multi-turn (agentic) capabilities, so everyone is searching for a good alternative compromise.

MiniMax-M1 had lightning attention to scale up to 1M context lengths. It's "I/O aware" via tiling and calculates attention two ways block-wise (intra-block traditional attention and inter-block linear attention), thereby avoiding the speed-inhibiting cumulative summation.

DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is sub-quadratic because it only computes the "interesting" pairs. For example, at 128K context length this requires only 10-20% of attention pairs to be materialized.
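
For intuition, here's a generic top-k sparse-attention toy (a sketch of the general idea only, not DeepSeek's actual DSA, which picks the pairs with a learned indexer rather than the exact scores computed here):

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy single-head sparse attention: each query attends only to its
    top_k highest-scoring keys instead of all of them."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # (n_q, n_k) -- a real implementation
                                    # would avoid forming this in full
    # keep only the top_k keys per query, mask out everything else
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)
    masked = scores + mask
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 1024, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_sparse_attention(q, k, v, top_k=64)   # ~6% of pairs kept
print(out.shape)  # (1024, 64)
```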

Both Qwen3-Next and Kimi Linear adopt Gated DeltaNet, whose gating is borrowed from Mamba2. In Qwen3-Next it alternates three Gated DeltaNet (linear attention) layers for every one gated [full] attention layer. The speedup comes from the delta rule, which, hand-wavily, amounts to maintaining a fixed-size state that gets corrected in place instead of re-attending over every past token.
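
For a feel of the delta rule, here's a toy, unbatched version of the recurrence DeltaNet-style layers use (a sketch under simplifying assumptions; real Gated DeltaNet adds the Mamba2-style decay gate and chunked parallel training, which this omits):

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Toy single-head delta-rule recurrence (the core of DeltaNet).
    Keeps a fixed-size state matrix S instead of a growing KV cache: at
    each step the state is corrected by the prediction error (v_t - S k_t),
    scaled by a per-token rate beta_t."""
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = np.zeros((d_v, d_k))          # fixed-size associative memory
    outputs = []
    for q_t, k_t, v_t, b_t in zip(q, k, v, beta):
        S = S + b_t * np.outer(v_t - S @ k_t, k_t)   # delta update
        outputs.append(S @ q_t)                      # read out
    return np.stack(outputs)

n, d = 8, 4
rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
beta = np.full(n, 0.5)
print(delta_rule_attention(q, k, v, beta).shape)  # (8, 4)
```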

There's no universally-adopted solution yet, as these are all pretty heavy-duty compromises, but the search is going strong right now for linear or better attention mechanisms that still perform well.

qsort 11/19/2025||
> due to context-window limits
simianwords 11/19/2025||
context window is not some physical barrier but rather the attention just getting saturated. what did i get wrong here?
qsort 11/19/2025|||
> what did i get wrong here?

You don't know how an LLM works and you are operating on flawed anthropomorphic metaphors.

Ask a frontier LLM what a context window is, it will tell you.

Palmik 11/19/2025|||
It's a fair question, even if it might be coming from a place of misunderstanding.

For example, DeepSeek 3.2, which employs sparse attention [1], is not only faster with long context than normal 3.1, but also seems to be better (perhaps thanks to reducing the noise?).

[1] It still uses a quadratic router, but it's small, so it scales well in practice. https://api-docs.deepseek.com/news/news250929

ed 11/19/2025|||
Parent is likely thinking of sparse attention which allows a significantly longer context to fit in memory
qsort 11/19/2025||
My comment was harsher than it needed to be and I'm sorry, I think I should have gotten my point across in a better way.

With that out of the way, parent was wondering why compaction is necessary, arguing that "context window is not some physical barrier but rather the attention just getting saturated". We're trying to explain that 3+2=2+3, and you people are sitting in the back going "well, actually, not all groups are abelian".

paradite 11/19/2025||||
In theory, auto-regressive models should not have a limit on context: they should be able to generate the next token conditioned on all previous tokens.

In practice, when training a model, people pick a context window so that during inference you know how much GPU memory to allocate for a prompt, and can reject prompts that exceed that limit.

Of course there's also degrading performance as context gets longer, but I suspect the memory limit is the primary reason we have context window limits.
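
A rough back-of-envelope for the memory side of this (a sketch; the model dimensions below are illustrative, not any specific model's):

```python
# Rough KV-cache size for a dense transformer: for every token you keep
# one key and one value vector per layer, so memory grows linearly with
# the context window and has to be reserved up front at serving time.

def kv_cache_gib(context_len, n_layers=80, n_kv_heads=8,
                 head_dim=128, bytes_per_elem=2):
    # 2 = one K plus one V vector per token per layer (GQA-style KV heads)
    elems = 2 * context_len * n_layers * n_kv_heads * head_dim
    return elems * bytes_per_elem / 2**30

for ctx in (32_768, 131_072, 1_000_000):
    print(f"{ctx:>9} tokens -> ~{kv_cache_gib(ctx):6.1f} GiB per sequence")
```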

kenjackson 11/19/2025|||
I think attention literally doesn't see anything beyond the context window. Even within the context window you may start to see attentional issues, but that's a different problem.
tunesmith 11/19/2025||
I've been dealing with Codex CLI for a while and I love it, but I'm wondering if my thinking is just limited. While I do start discussions and create plan docs, I've never been able to ask it to do anything that takes it longer than 25 minutes or so. Usually far less. I'm having trouble imagining what I could ask it to do that would make it take hours - like, wouldn't that require putting together an absolutely massive planning doc that would take hours to write anyway? I'd rather just move incrementally.
GenerWork 11/19/2025||
Perhaps they're combining an incredibly complex product that has a lot of interactive features, a big codebase, test creation, and maybe throwing some MCP stuff in there such as creating a ticket in Jira if a test fails?
CuriouslyC 11/19/2025|||
Easy way to get an agent to run a long time is just to get it to babysit CI/CD, tell it to iterate on it until it passes. I got Sonnet 4 to run for >6 hours that way.
aerhardt 11/19/2025||
The idea of giving it a task that may take six hours and reviewing it also gives me shivers.

I'm a very happy Codex customer, but everything turns to disgusting slop if I don't provide:

(1) Up-to-date AGENTS.md and an excellent prompt

(2) A full file-level API with function signatures, return types and function-level guidance if it's a complex one

(3) Multiple rounds of feedback until the result is finely sculpted

Overall it's very small units of work - one file or two, tops.

I've been letting the above standards go for the last couple of weeks due to crunch, and looking at some of the hotspots of slop now lying around has me going all Homelander-face [1] at the sight of them.

Those hotspots are a few hundred lines in the worst cases; I'm definitely not ready to deal with the fallout of any unit of work that takes even more than 20min.

[1] https://i.kym-cdn.com/entries/icons/original/000/050/702/ab7...

jillesvangurp 11/19/2025||
I've been doing a few fairly big refactorings on our code base in the last few days. It does a decent job and I generally don't put a lot of effort in my prompts.

It seems to pick a lot up from my code base. I do have an Agents.md with some basics on how to run stuff and what to do; that seems to keep it from going off on a wild goose chase trying to figure out how to run things by doing the wrong things.

I think from first using Codex around July to now it has been quite a journey, and it has improved a lot. It actually seems to do well in larger code bases where it has a lot of existing structure and examples of how things are done in that code base. A lot of things it just does without me asking for them, simply because there's a lot of other code that does it that way.

After recent experiences, I have some confidence this might work out well.

spmartin823 11/19/2025||
I still want something no one has, which is the ability to launch agents in different git worktrees simultaneously and check the results out on my main branch for testing when they are finished.
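
For what it's worth, a minimal version of that workflow can be stitched together with plain `git worktree` plus any agent CLI that runs non-interactively. The sketch below is illustrative only; `run_agent` is a hypothetical stand-in for whatever agent command you actually use:

```python
import pathlib
import subprocess

REPO = pathlib.Path(".").resolve()
TASKS = {
    "fix-flaky-tests": "Make the integration tests deterministic.",
    "refactor-config": "Extract config loading into its own module.",
}

procs = {}
for branch, prompt in TASKS.items():
    wt = REPO.parent / f"wt-{branch}"
    # one throwaway worktree + branch per task
    subprocess.run(["git", "worktree", "add", "-b", branch, str(wt)],
                   cwd=REPO, check=True)
    # launch the agent non-interactively in that worktree
    # ("run_agent" is hypothetical; substitute your actual agent CLI)
    procs[branch] = subprocess.Popen(["run_agent", prompt], cwd=wt)

for branch, p in procs.items():
    p.wait()
    # worktrees share the same .git, so the branch is already visible here;
    # merge it into the current branch to test the combined result
    subprocess.run(["git", "merge", "--no-edit", branch], cwd=REPO, check=True)
```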
agentifysh 11/19/2025||
There are lots of tools that do this, and I ended up going down this rabbit hole building something that could just plug into Codex instead of requiring a fork:

http://github.com/agentify-sh/10x

It adds minimal overhead for agent orchestration (it's just bash/TypeScript). Its main focus was adding enhancements to Codex: double-redundant checkpoints via git and jj (lessons learned from Codex being git reset --hard happy), something like Claude skills (just a bunch of .md files that steer it towards a specific activity like think, plan, execute), timeout wrappers (to get you unstuck if Codex hangs for a long time), and blacklisted commands during yolo mode (rm -rf and git reset banned even if there's a small chance it would run them). MIT licensed.

You can work sequentially (subagents launch one after the other) or in parallel (worktrees), but tbh sequential is better because you understand what is going on; parallel might be best for dealing with tests and UI.

poly2it 11/19/2025||
Your link is a 404.
lysecret 11/19/2025|||
Cursor has this too
cube2222 11/19/2025|||
I think I’ve described how I achieve kinda your desired workflow in a comment yesterday [0].

[0]: https://news.ycombinator.com/item?id=45970668

agentifysh 11/19/2025||
Ha! Very interesting how slept-on jj is.

It's been essential to my workflow as well.

I use both jj and git, and jj is great for just creating a snapshot that I can revert to in case it fails.

I'm still exploring it to see what else I can do with it for agentic use.

rane 11/19/2025|||
tmux users might find this useful: https://github.com/raine/workmux
bradly 11/19/2025|||
Would this be similar to how Charlie and Jules work?
ygouzerh 11/20/2025||
I am curious: why would you like to have that? (Genuine question; I am personally so scared about the AI going crazy and putting slop everywhere that I often ask it to focus on a single well-defined area first.)
epolanski 11/19/2025||
Small off-topic question about the GPT CLI tool.

I gave it a shot last month but did not enjoy it due to the lack of a proper planning mode and of a way to accept each edit independently. Has it improved?

theshrike79 11/21/2025|
No. Claude is still the only CLI agent tool with a planning mode.

Crush, Gemini, Codex and Copilot don't have it for some reason. Can't be that difficult

NickFORGE 11/20/2025||
We’ve been experimenting with a similar idea but in a browser-native environment — running real containers + a WebSocket terminal + multi-agent workflows. GPT-5.1 (Codex Max especially) seems to handle multi-step refactors a lot more cleanly, and chaining it through CLI agents has been surprisingly reliable.

Curious if anyone else is trying agent orchestration beyond the editor itself?

tptacek 11/19/2025||
Is "compaction" a trained-in feature of the model, or just tooling around the model calls? Agents already do compaction.
highfrequency 11/20/2025||
Is GPT-5.1-Codex better or worse than GPT-5.1 (Thinking) for straight-up mathematical reasoning (i.e., given that it is optimized for making code edits)? Said another way: what is the set of tasks where you expect GPT-5.1 to be better suited than GPT-5.1-Codex? Is it non-coding problems or non-technical problems?
rolisz 11/19/2025||
I got prompted to try it out on the web. It gave me this after 5 minutes:

"I wasn’t able to finish creating the new base homepage module template and updating every module to inherit from it within the available time. I did not make any changes or commits."

Told it to get back to work. Let's see how that goes.

hereme888 11/19/2025|
It's getting so cut-throat for who has the current SOTA model. Seems to be the big income driver.