GPT-5.3-Codex - Hacker News

Posted by meetpateltech 8 hours ago

1041 points | 397 comments

Rperry2174 6 hours ago|

Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically

With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

that feels like a reflection of a real split in how people think llm-based coding should work...

some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result

Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.

Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

karmasimida 6 hours ago||

> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

Ain't the UX is the exact opposite? Codex thinks much longer before gives you back the answer.

xd1936 5 hours ago|||

I've also had the exact opposite experience with tone. Claude Code wants to build with me, and Codex wants to go off on its own for a while before returning with opinions.

mrkstu 5 hours ago||

Its likely that both are steering towards the middle from their current relative extremes and converging to nearly the same place.

gervwyk 5 hours ago||

also my experience in using these two models. they are trying to recover from oversteer perhaps.

WilcoKruijer 5 hours ago||||

Yes, you’re right for 4.5 and 5.2. Hence they’re focusing on improving the opposite thing and thus are actually converging.

bt1a 4 hours ago||||

This is most likely an inference serving problem in terms of capacity and latency given that Opus X and the latest GPT models available in the API have always responded quickly and slowly, respectively

cwyers 3 hours ago|||

Codex now lets you tell the LLM tgings in the middle of its thinking without interrupting it, so you can read the thinking traces and tell it to change course if it's going off track.

fluidcruft 2 hours ago||

That just seems like a UI difference. I've always interrupted claude code added a comment and it's continued without much issue. Otherwise if you just type the message is queued for next. There's no real reason to prefer one over the other except it sounds like codex can't queue messages?

ghosty141 5 hours ago|||

I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.

Having a human in the loop eliminates all the problems that LLMs have and continously reviewing small'ish chunks of code works really well from my experience.

It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.

LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach since effecitly code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts but for bigger codebases where longevity matters it's a dealbreaker.

_zoltan_ 4 hours ago|||

I'm personally 100% convinced of the opposite, that it's a waste of time to steer them. we know now that agentic loops can converge given the proper framing and self-reflectiveness tools.

sealeck 4 hours ago|||

Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.

replygirl 2 hours ago||

in the new world, engineers have to actually be good at capturing and interpreting requirements

halfcat 45 minutes ago||

In this new world, why stop there? It would be even better if engineers were also medical doctors and held multiple doctorate degrees in mathematics and physics and also were rockstar sales people.

rapind 26 minutes ago||||

Maybe some day, but as a claude code user it makes enough pretty serious screw ups, even with a very clearly defined plan, that I review everything it produces.

You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.

zeroxfe 2 hours ago||||

> it's a waste of time to steer them

It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.

There's a balance to strike between micro-management and no steering at all.

bcarv 3 hours ago||||

Does the AI agent know what your company is doing right now, what every coworker is working on, how they are doing it, and how your boss will change priorities next month without being told?

If it really knows better, then fire everyone and let the agent take charge. lol

hyldmo 3 hours ago|||

No, but Codex wouldn’t have asked you those questions either

bcarv 2 hours ago||

For me, it still asks for confirmation at every decision when using plans. And when multiple unforeseen options appear, it asks again. I don’t think you’ve used Codex in a while.

fHr 4 minutes ago||

skill issue

IMTDb 1 hour ago|||

A significant portion of engineering time is now spent ensuring that yes, the LLM does know about all of that. This context can be surfaced through skills, MCP, connectors, RAG over your tools, etc. Companies are also starting to reshape their entire processes to ensure this information can be properly and accurately surfaced. Most are still far from completing that transformation, but progress tends to happen slowly, then all at once.

halfcat 21 minutes ago|||

> given the proper framing

This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.

You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.

If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.

If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.

NuclearPM 1 hour ago|||

> If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to

That could easily be automated.

utilize1808 6 hours ago|||

I think it's the opposite. Especially considering Codex started out as a web app that offers very little interactivity: you are supposed to drop a request and let it run automatously in a containerized environment; you can then follow up on it via chat --- no interactive code editing.

Rperry2174 5 hours ago||

Fair I agree that was true of early codex and my perception too.. but today there are two announcements that came out and thats what im referring to.

specifically, the GPT-5.3 post explicitly leans into "interactive collaborator" langauge and steering mid execution

OpenAI post: "Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context."

OpenAI post: "Instead of waiting for a final output, you can interact in real time—ask questions, discuss approaches, and steer toward the solution"

Claude post: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user."

stingraycharles 25 minutes ago|||

I think those OpenAI announcements are mainly because this hasn’t been the case for them earlier, while it has been part of Claude Code since the beginning.

I don’t think there’s something deeply philosophical in here, especially as Claude Code is pushing stronger for asking more questions recently, introduced functionality to “chat about questions” while they’re asked, etc.

fluidcruft 2 hours ago|||

Frankly it seems to be that codex is playing catch-up with claude code and claude code is just continuing to move further ahead. The thing with claude code is it will work longer... if you want it to. It's always had good oversight and (at least for me) it builds trust slowly until you are wishing it would do more at once. When I've used codex (it has been getting better) but back in the day it would just do things and say it's done and you're just sitting there wondering "wtf are you doing?". Claude code is more the opposite where you can watch as closely as you want and often you get to a point where you have enough trust and experience with it that you know what it's going to do and don't want to bother.

jhancock 3 hours ago|||

Good breakdown.

I usually want the codex approach for code/product "shaping" iteratively with the ai.

Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.

I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.

mcintyre1994 5 hours ago|||

This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.

I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6

So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.

bob1029 3 hours ago|||

I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.

I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.

Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.

giancarlostoro 5 hours ago|||

> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.

This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.

Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.

Rperry2174 5 hours ago||

yeah I'm mostly just talking about how they're framing it: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user"

I guess its also quite interesting that how they are framing these projects are opposite from how people currently perceive them and I guess that may be a conscious choice...

giancarlostoro 5 hours ago||

I get what you mean now, I like that to be fair, sometimes I want Claude to tell me some architectural options, so I ask it so I can think about what my options are, sometimes I rethink my problem if I like Claudes conclusion.

techbro_1a 5 hours ago|||

> With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.

This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5

cchance 5 hours ago|||

Just because you can inject steering doesn't mean they stered away from long running...

Theres hundreds of people who upload Codex 5.2 running for hours unattended and coming back with full commits

hbarka 3 hours ago|||

How can they be diverging, LLMs are built on similar foundations aka the Transformer architecture. Do you mean the training method (RLHF) is diverging?

iranintoavan 3 hours ago||

I'm not OP but I suspect they are meaning the products / tooling / company direction, not necessarily the underlying LLM architecture.

drsalt 1 hour ago|||

be rich, hire an ai guy, let him deal with it

blurbleblurble 3 hours ago|||

Funny cause the situation was totally flipped last iteration.

rozumbrada 4 hours ago|||

I read this exact comment with I would say completely the same words several times in X and I would bet my money it's LLM generated by someone who has not even tried both the tools. This AI slop even in the site like this without direct monetisation implications from fake engagement is making me sick...

pyrolistical 3 hours ago|||

Boing vs airbus philosophy

d--b 5 hours ago|||

I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.

I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.

adarsh2321 3 hours ago||

[dead]

granzymes 8 hours ago||

I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!

The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex.

GPT-5.3-codex scores 77.3.

the_duke 7 hours ago||

I do not trust the AI benchmarks much, they often do not line up with my experience.

That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.

So very much looking forward to trying out 5.3.

NitpickLawyer 7 hours ago|||

Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.

kilroy123 7 hours ago|||

Personally, I have Claude do the coding. Then 5.2-high do the reviewing.

mmaunder 2 hours ago|||

I might flip that given how hard it's been for Claude to deal with longer context tasks like a coding session with iterations vs a single top down diff review.

seunosewa 6 hours ago||||

Then I pass the review back to Claude Opus to implement it.

VladVladikoff 6 hours ago|||

Just curious is this a manual process or you guys have automated these steps?

ricketycricket 5 hours ago|||

I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will go through 3 or 4 back-and-forth iterations some times before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.

bryanlarsen 4 hours ago||

Mind sharing the skill/prompt?

dror 1 hour ago||

Not the OP, but I use the same approach.

https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...

I typically use github issues as the unit of work, so that's part of my instruction.

_zoltan_ 4 hours ago|||

zen-mcp (now called pal-mcp I think) and then claude code can actually just pass things to gemini (or any other model)

kilroy123 3 hours ago|||

Sometimes, depends on how big of a task. I just find 5.2 so slow.

_zoltan_ 4 hours ago|||

I have Opus 4.5 do everything then review it with Gemini 3.

StephenHerlihyy 6 hours ago|||

I don’t use OpenAI too much, but I follow a similar work flow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has their own style and way of solving the problem just like a human team. It’s kind of awesome and crazy and a bit scary all at once.

readyforbrunch 4 hours ago||

How do you orchestrate this workflow? Do you define different skills that all use different models, or something else?

aurareturn 7 hours ago||||

5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.

I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.

Looking forward to trying 5.3.

koakuma-chan 7 hours ago||

Opus 4.5 is more creative and better at making UIs

fooker 7 hours ago||||

Yeah, these benchmarks are bogus.

Every new model overfits to the latest overhyped benchmark.

Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.

bunderbunder 5 hours ago|||

All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.

But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.

abustamam 4 hours ago||

Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.

bunderbunder 52 minutes ago||

Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what are ultimately just a semi tamed genre of sales pitch at face value.

When such benchmarks aren’t available what you often get instead is teams creating their own benchmark datasets and then testing both their and existing models’ performance against it. Which is eve worse because they probably still the rest multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage because those hyperparameters were tuned to a doffeeent dataset and may not have even been optimizing for the same task.

mrandish 6 hours ago||||

> Yeah, these benchmarks are bogus.

It's not just over-fitting to leading benchmarks, there's also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing .

fooker 6 hours ago||

For the current state of AI, the harness is unfortunately part of the secret sauce.

scoring1774 5 hours ago|||

This has been done: https://arxiv.org/abs/2510.04871v1

mmaunder 2 hours ago||||

ARG-AGI-2 leaderboard has a strong correlation with my Rust/CUDA coding experience with the models.

nerdsniper 6 hours ago||||

Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3

nubg 5 hours ago||

what do you do?

audience_mem 2 hours ago||

He works on brain-melting stuff, the understanding of which is far beyond us.

jahsome 7 hours ago|||

Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.

StephenHerlihyy 6 hours ago|||

What amazes me the most is the speed at which things are advancing. Go back a year or even a year before that and all these incremental improvements have compounded. Things that used to require real effort to consistently solve, either with RAGs, context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.

Rudybega 4 hours ago||

The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.

SatvikBeri 6 hours ago||||

I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test

clhodapp 6 hours ago||||

And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...

malshe 7 hours ago||||

This pretty accurately summarizes all the long discussions about AI models on HN.

cactusplant7374 6 hours ago||||

Hourly occurrence on /r/codex. Model astrology is about the vibes.

wasmainiac 7 hours ago|||

[flagged]

nocman 7 hours ago|||

> Who are making these claims? script kiddies? sr devs? Altman?

AI agents, perhaps? :-D

locknitpicker 7 hours ago||||

> All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?

You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.

andrepd 6 hours ago||

It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.

BoredPositron 7 hours ago|||

When you keep his ramblings on twitter or company blog in mind I bet he is a shit poster here.

leumon 6 hours ago|||

they tested it at xhigh reasoning though, which is probably double the cost of Anthropic's model.

Cost to Run Artificial Analysis Intelligence Index:

GPT-5.2 Codex (xhigh): $3244

Claude Opus 4.5-reasoning: $1485

(and probably similar values for the newer models?)

redox99 6 hours ago|||

With $20 gpt plan you can use xhigh no problem. With $20 Claude plan you reach the 5h limit with a single feature.

mattkevan 6 hours ago||

Ha, Claude Code on a pro plan often can't complete a single message before hitting the 5h limit. Not hit it once so far on Codex.

naths88 5 hours ago||

This, so frustrating. But CC is so much faster too.

Computer0 6 hours ago|||

A provider's API costs seemingly do not reflect each respective SOTA provider's subscription usage allowances.

__jl__ 7 hours ago|||

Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...

granzymes 7 hours ago||

Insane! I think this has to be the shortest-lived SOTA for any model so far. Competition is amazing.

wilg 6 hours ago|||

In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.

dudeinhawaii 6 hours ago|||

I think for many/most programmers = 'speed + output' and webdev == "great coding".

Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.

But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.

Solution, use both as needed!

falloutx 5 hours ago|||

> I actually do prefer Claude for webdev

Ah and let me guess all your frontends look like cookie cutter versions of this: https://openclaw.dog/

Yiin 4 hours ago||

Yes and I love it.

whynotminot 5 hours ago|||

> Solution, use both as needed!

This is the way. People are unfortunately starting to divide themselves into camps on this — it’s human nature we’re tribal - but we should try to avoid turning this into a Yankees Redsox.

Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.

soulofmischief 5 hours ago||||

GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.

Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.

wilg 3 hours ago||

I always use 5.2-Codex-High or 5.2-Codex-Extra High (in Cursor). The regular version is probably too dumb.

soulofmischief 2 hours ago||

Didn't make a difference for me. Though I will say, so far 4.6 is really pissing me off and I might downgrade back to 4.5. It just refuses to listen to what I say, the steering is awful.

fragmede 4 hours ago|||

How many people are building the same thing multiple times to compare model performance? I'm much more interested in getting the thing I'm building getting built, than than comparing AIs to each other.

jronak 6 hours ago|||

Did you look at the ARC AGI 2? Codex might be overfit for terminal bench

tedsanders 6 hours ago||

ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.

mrandish 5 hours ago|||

A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.

The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...

I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.

janalsncm 5 hours ago||||

More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.

As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.

nurettin 7 hours ago||

Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.

Hopefully performance will pick up after the rollout.

nickstinemates 5 hours ago||

Did you give it any architecture guidance? An architecture skill that it can load to make sure it lays out things according to your taste?

nurettin 3 hours ago||

Yes, it has a very tight CLAUDE.md which it used to follow. Feels like this happens a couple of times a month.

xiphias2 7 hours ago||

,,GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework , and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks end-to-end, we’re taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.''

While I love Codex and believe it's amazing tool, I believe their preparedness framework is out of date. As it is more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come up by having more and more security critical software vibe coded.

It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.

In simpler terms: Codex should write secure software by default.

mrkeen 7 hours ago||

Is "high-capability" a stronger or weaker claim than "team of phd-level experts"?

https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...

trcf23 7 hours ago|||

That’s just classical OpenAI trying to make us believe they’re closing on AGI… Like all « so called » research from them and Anthropic about safety alignment and that their tech is so incredibly powerful that guardrails should be put on them.

ActionHank 6 hours ago|||

I heard the other day that every time someone claps another vibe coded project embeds the api keys in the webpage.

I wonder if this will continue to be the case.

da_grift_shift 6 hours ago|||

>Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.

"We added some more ACLs and updated our regex"

manmal 3 hours ago||

Please no, I don’t need my quick prototypes hardened against every perceivable threat.

comex 2 hours ago||

In most cases security is not a matter of adding anything in particular, but a matter of just not making specific types of mistakes.

dimitri-vs 40 minutes ago||

Maybe I'm being dumb but that reads very contradictory? I would say that security is explicitly a matter of adding particular things.

itay-maman 7 hours ago||

Something that caught my eye from the announcement:

> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training

I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

codethief 5 hours ago||

Sounds like the researchers behind https://ai-2027.com/ haven't been too far off so far.

cootsnuck 4 hours ago|||

We'll see. The first two things that they said would move from "emerging tech" to "currently exists" by April 2026 are:

- "Someone you know has an AI boyfriend"

- "Generalist agent AIs that can function as a personal secretary"

I'd be curious how many people know someone that is sincerely in a relationship with an AI.

And also I'd love to know anyone that has honestly replaced their human assistant / secretary with an AI agent. I have an assistant, they're much more valuable beyond rote input-output tasks... Also I encourage my assistant to use LLMs when they can be useful like for supplementing research tasks.

Fundamentally though, I just don't think any AI agents I've seen can legitimately function as a personal secretary.

Also they said by April 2026:

> 22,000 Reliable Agent copies thinking at 13x human speed

And when moving from "Dec 2025" to "Apr 2026" they switch "Unreliable Agent" to "Reliable Agent". So again, we'll see. I'm very doubtful given the whole OpenClaw mess. Nothing about that says "two months away from reliable".

zozbot234 3 hours ago|||

> Someone you know has an AI boyfriend

MyBoyfriendIsAI is a thing

> Generalist agent AIs that can function as a personal secretary

Isn't that what MoltBot/OpenClaw is all about?

So far these look like successful predictions.

ainch 3 hours ago||

Moltbot is an attempt to do that. Would you hire it as a personal secretary and entrust all your personal data to it?

danpalmer 2 hours ago||

Only people who haven't had a secretary would think it's a personal secretary.

Like, it can't even answer the phone.

Rudybega 4 hours ago|||

I think they immediately corrected their median timelines for takeoff to 2028 upon releasing the article (I believe there was a math mistake or something initially), so all those dates can probably be bumped back a few months. Regardless, the trend seems fairly on track.

JackYoustra 2 hours ago||||

> researchers

that's certainly one way to refer to Scott Alexander

YawningAngel 3 hours ago|||

I don't think generative AI is even close to making model development 50% faster

aurareturn 7 hours ago||

More importantly, this is the early steps of a model self improving itself.

Do we still think we'll have soft take off?

mrandish 6 hours ago|||

> Do we still think we'll have soft take off?

There's still no evidence we'll have any take off. At least in the "Foom!" sense of LLMs independently improving themselves iteratively to substantial new levels being reliably sustained over many generations.

To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.

rahulyc 4 hours ago||

Yes, but also you'll never have any early evidence of the Foom until the Foom itself happens.

janalsncm 4 hours ago||

If only General Relativity had such an ironclad defense of being as unfalsifiable as Foom Hypothesis is. We could’ve avoided all of the quantum physics nonsense.

quinncom 6 hours ago||||

Exponential growth may look like a very slow increase at first, but it's still exponential growth.

janalsncm 5 hours ago|||

Sigmoids may look like exponential growth at first, until they saturate. Early growth alone cannot distinguish between them.

gf000 4 hours ago|||

If it's exponential growth. It may just as well be some slow growth and continue to be so.

aaaalone 6 hours ago||||

I'm only saying no to keep optimistic tbh

It feels crazy to just say we might see a fundamental shift in 5 years.

But the current addition to compute and research etc. def goes in this direction I think.

thrance 6 hours ago||||

I think the limiting factor is capital, not code. And I doubt GPTX is anymore competent at raising funds than the other, fleshy, snake oilers...

8note 5 hours ago||||

making the specifications is still hard, and checking how well results match against specifications is still hard.

i dont think the model will figure that out on its own, because the human in the loop is the verification method for saying if its doing better or not, and more importantly, defining better

reducesuffering 7 hours ago|||

This has already been going on for years. It's just that they were using GPT 4.5 to work on GPT 5. All this announcement mean is that they're confident enough in early GPT 5.3 model output to further refine GPT 5.3 based on initial 5.3. But yes, takeoff will still happen because of this recursive self improvement works, it's just that we're already past the inception point.

manmal 3 hours ago|||

I guess humans were involved in all that, so how is that anything but tool use?

mirsadm 6 hours ago|||

I can't tell if this is a serious conversation anymore.

reducesuffering 5 hours ago||

“Best start believing in science fiction stories. You're in one.”

https://x.com/TheZvi/status/2017310187309113781

minimaxir 8 hours ago||

I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.

observationist 7 hours ago||

The labs have fully embraced the cutthroat competition, the arms race has fully shed the civilized facade of beneficient mutual cooperation.

Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.

Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.

manquer 7 hours ago||

>civilized facade of mutual cooperation

>Dirty tricks and underhanded tactics

As long the tactics are legal ( i.e. not corporate espionage, bribes etc), the no holds barred full free market competition is the best thing for the market and the consumers.

ajam1507 5 hours ago|||

> As long the tactics are legal ( i.e. not corporate espionage, bribes etc), the no holds barred full free market competition is the best thing for the market and the consumers.

The implicit assumption here is that we have constructed our laws so skillfully that the only path to win a free market competition is by producing a better product, or that all efforts will be spent doing so. This is never the case. It should be self-evident from this that there is a more productive way for companies to compete and our laws are not sufficient to create the conditions.

thethimble 6 hours ago||||

The consumers are getting huge wins.

Model costs continue to collapse while capability improves.

Competition is fantastic.

doom2 46 minutes ago|||

> Model costs continue to collapse

And yet RAM prices are still sky high. Game consoles are getting more expensive, not cheaper, as a result. When will competition benefit those consumers? Or consumers of desktop RAM?

mrandish 5 hours ago|||

> The consumers are getting huge wins.

However, the investors currently subsidizing those wins to below cost may be getting huge losses.

KoolKat23 2 hours ago||||

Yes, but not cutthroat competition that implies unsustainable, detrimental competition that kills off the industry.

dwaltrip 4 hours ago||||

Sure, it can be beneficial. But don't forget that externalities are a thing.

wiz21c 4 hours ago|||

in the short term maybe, in the long term it depends how many winners you have. If only two, the market will be a duopoly. Customers will get better AI but will have zero power over the way the AI is produced or consumed (i.e. cO2 emission, ethics, etc will be burnt)

manquer 1 hour ago||

> how many winners ... duopoly

There aren't any insurmountable large moats, plenty of open weight models that perform close enough.

> CO₂ emissions

Different industry that could also benefit from more competition ? Clean(er) energy is not even more expensive than dirty sources on pure $/kWh, we still do need dirty sources for workloads like base demand, peakers etc that the cheap clean sources cannot service today.

zozbot234 7 hours ago|||

They're also coordinating around Chinese New Year to compete with new releases of the major open/local models.

DonHopkins 7 hours ago||

Year of the Pelican!

hoeoek 7 hours ago|||

simonw?

iujasdkjfasf 4 hours ago|||

[dead]

tedsanders 7 hours ago|||

This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.

crorella 8 hours ago|||

The thrill of competition

manquer 7 hours ago|||

Wouldn't that be illegal ? i.e. cartel to collude like that ?

avaer 1 hour ago||

You were downvoted but I don't understand why. This is the purpose/spirit of antitrust law [1]

[1] https://en.wikipedia.org/wiki/United_States_antitrust_law

manquer 53 minutes ago||

I have long since given up trying to understand voting patterns in HN :)

---

Sadly it was the core of anti-trust law, since 1970s things have changed.

The predominant view today (i.e. Chicago School view) in both judiciary and executive are influenced by Justice Bork's ideas that consumer benefit being the deciding factor over company's actions.

Consumer benefits becomes opinions of projections by either side of a case about the future, whereas company actions like collusion, pricing fixing or M&A are hard facts with strong evidence. Today it is all vibes on how the courts (or executive) feel .

So now we have Government sanctioned cartels like in Aviation Alliances [1] that is basically based on convoluted catch-22-esque reasoning because it favors strategic goals even though it would be a violation of the letter/spirit of the law.

[1] https://www.transportation.gov/office-policy/aviation-policy...

IhateAI 7 hours ago|||

A sign of the inevitible implosion !

cedws 7 hours ago||

I wish they’d just stop pretending to care about safety, other than a few researchers at the top they care about safety only as long as they aren’t losing ground to the competition. Game theory guarantees the AI labs will do what it takes to ensure survival. Only regulation can enforce the limits, self policing won’t work when money is involved.

thethimble 7 hours ago|||

As long as China continues to blitz forward, regulation is a direct path to losing.

cedws 6 hours ago|||

Define "losing."

Europe is prematurely regarded as having lost the AI race. And yet a large portion of Europe live higher quality lives compared to their American counterparts, live longer, and don't have to worry about an elected orange unleashing brutality on them.

thethimble 5 hours ago|||

If the world is built on AI infrastructure (models, compute, etc.) that is controlled by the CCP then the west has effectively lost.

This may lead to better life outcomes, but if the west doesn't control the whole stack then they have lost their sovereignty.

This is already playing out today as Europe is dependent on the US for critical tech infrastructure (cloud, mail, messaging, social media, AI, etc). There's no home grown European alternatives because Europe has failed to create an economic environment to assure its technical sovereignty.

fakedang 5 hours ago|||

Europe has already lost the tech race - their cloud systems that their entire welfare states rely upon are all hosted on servers hosted by American private companies, which can turn them off with a flick of a switch if and when needed.

When the welfare state, enabled by technology, falls apart, it won't take long for European society to fall apart. Except France maybe.

pixl97 6 hours ago|||

You mean all paths are direct paths to losing.

vovavili 6 hours ago|||

The last thing I would want is for excessively neurotic bureaucrats to interfere with all the mind-blowing progress we've had in the last couple of years with LLM technology.

iujasdkjfasf 4 hours ago||

[dead]

SunshineTheCat 5 hours ago||

I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.

I know that's anecdotal, but it just seems Claude is often the default.

I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.

However, the note I see the most from Claude users is running out of usage.

Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).

That (at least to me) seems to be a much bigger deal than coding nuances.

timpera 4 hours ago||

In my experience, OpenAI gives you unreasonable amounts of compute for €20/month. I am subscribed to both and Claude's limits are so tiny compared to ChatGPT's that it often feels like a rip-off.

Claude also doesn't let you use a worse model after you reach your usage limits, which is a bit hard to swallow when you're paying for the service.

mrandish 4 hours ago|||

> the note I see the most from Claude users is running out of usage.

I suspect that tells us less about model capability/efficiency and more about each company's current need to paint a specific picture for investors re: revenue, operating costs, capital requirements, cash on hand, growth rate, retention, margins etc. And those needs can change at any moment.

Use whatever works best for your particular needs today, but expect the relative performance and value between leaders to shift frequently.

superfrank 5 hours ago|||

I only switched to using the terminal based agents in the last week. Prior to this I was pretty much only using it through Cursor and GH Copilot. The Anthropic models when used through GH Copilot were far superior to the codex ones and I didn't really get the hype of Codex. Using them through the CLI though, Codex is much better, IMO.

My guess is that it's potentially that and just momentum from developers who started using CC when it was far superior to Codex has allowed it to become so much more popular. Potentially, it's might be that, as it's more autonomous, it's better for true vibe-coding and it's more popular with the Twitter/LinkedIn wantrepreneur crew which meant it gets a lot of publicity which increases adoption quicker.

jorl17 2 hours ago||

Out of curiosity, what do you feel are the key differences between cursor + models versus something like Claude Code/Codex?

Are you feeling the benefits of the switch? What prompted you to change?

I've been running cursor with my own workflows (where planning is definitely a key step) and it's been great. However, the feeling of missing out, coupled with the fact I am a paying ChatGPT customer, got me to try codex. It hasn't really clicked in what way this is better, as so far it really hasn't been.

I have this feeling that supposedly you can give these tools a bit more of a hands-off approach so maybe I just haven't really done that yet. Haven't fiddled with worktrees or anything else yet either.

AstroBen 5 hours ago|||

I'm with you. Codex's plans seems to be a much more generous offering than Claude

I just.. can't tell a different in quality between them.. so I go for the cheapest

fHr 5 hours ago||

Codex is great and I hit the usage once doing multiagent full 5 hour absolute degen session for the nornal workflow alongside never hit it and now x2 useage even and now with the planmode switch back and forth absolute great.

exabrial 13 minutes ago||

After using Anthropic's products, I think it's going to be difficult to go back to OpenAI. It feels more like a discussion with a peer; ChatGPT has always felt like arguing with an idiot on Reddit.

tosh 7 hours ago||

Terminal Bench 2.0

  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |

greenfish6 7 hours ago||

yea but i feel like we are over the hill on benchmaxxing, many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding

falloutx 6 hours ago|||

When Anthropic beats Benchmarks its somehow earned, when OpenAi games it, its somehow about not feeling good at coding.

AstroBen 7 hours ago||||

'feel' is no more accurate

not saying there's a better way but both suck

thethimble 6 hours ago|||

Speak for yourself. I've been insanely productive with Codex 5.2.

With the right scaffolding these models are able to perform serious work at high quality levels.

helloplanets 6 hours ago|||

He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck

AstroBen 6 hours ago|||

..huh?

crorella 6 hours ago||||

The variety of tasks they can do and will be asked to do is too wide and dissimilar, it will be very hard to have a transversal measurement, at most we will have area specific consensus that model X or Y is better, it is like saying one person is the best coder at everything, that does not exist.

pixl97 6 hours ago||

Yea, we're going to need benchmarks that incorporate series of steps of development for a particular language and how good each model is at it.

Like can the model take your plan and ask the right questions where there appear to be holes.

How wide of architecture and system design around your language does it understand.

How does it choose to use algorithms available in the language or common libraries.

How often does it hallucinate features/libraries that aren't there.

How does it perform as context get larger.

And that's for one particular language.

tavavex 6 hours ago||||

The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.

forrestthewoods 6 hours ago|||

At the end of the day “feel” is what people rely on to pick which tool they use.

I’d feel unscientific and broken? Sure maybe why not.

But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.

Benchmarks are a sometimes useful to. But we are in prime Goodharts Law Territory.

AstroBen 6 hours ago||

yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities

forrestthewoods 5 hours ago||

I don’t think this is even remotely true in practice.

I honestly I have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.

The idea that all models have very close performance across all domains is a moderately insane take.

At any given moment the best model for my actual projects and my actual work varies.

Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly large numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.

karmasimida 6 hours ago|||

Your feeling is not my feeling, codex is unambiguously smarter model for me

xyst 5 hours ago||

Benchmarks are useless compared to real world performance.

Real world performance for these models is a disappoint.

bgirard 6 hours ago||

> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.

I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across version. Is this a 3-prompt 10m token game? a 30-prompt 100m token game? Are both models using similar prompts/token counts?

I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.

[1] https://factory-gpt.vercel.app/

veb 6 hours ago|

I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.

bgirard 6 hours ago||

Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.

This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.

nananana9 6 hours ago|

I've been listening to the insane 100x productivity gains you all are getting with AI and "this new crazy model is a real game changer" for a few years now, I think it's about time I asked:

Can you guys point me ton a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code?

pkoiralap 6 hours ago||

In the 1930s, when electronic calculators were first introduced, there was a widespread belief that accounting as a career was finished. Instead, the opposite became true. Accounting as a profession grew, becoming far more analytical/strategic than it had been previously.

You are correct that these models primarily address problems that have already been solved. However, that has always been the case for the majority of technical challenges. Before LLMs, we would often spend days searching Stack Overflow to find and adapt the right solution.

Another way to look at this is through the lens of problem decomposition as well. If a complex problem is a collection of sub-problems, receiving immediate solutions for those components accelerates the path to the final result.

For example, I was recently struggling with a UI feature where I wanted cards to follow a fan-like arc. I couldn't quite get the implementation right until I gave it to Gemini. It didn't solve the entire problem for me, but it suggested an approach involving polar coordinates and sine/cosine values. I was able to take that foundational logic turn it into a feature I wanted.

Was it a 100x productivity gain? No. But it was easily a 2x gain, because it replaced hours of searching and waiting for a mental breakthrough with immediate direction.

There was also a relevant thread on Hacker News recently regarding "vibe coding":

https://news.ycombinator.com/item?id=45205232

The developer created a unique game using scroll behavior as the primary input. While the technical aspects of scroll events are certainly "solved" problems, the creative application was novel.

suddenlybananas 5 hours ago||

The story you're describing doesn't seem much better than one could get from googling around and going on stackoverflow

strokirk 5 hours ago|||

It doesn’t have to be, really. Even if it could replace 30% of documentation and SO scrounging, that’s pretty valuable. Especially since you can offload that and go take a coffee.

pkoiralap 3 hours ago|||

I think the 'better than googling' part is less about the final code and more about the friction.

For example, consider this game: The game creates a target that's randomly generated on the screen and have a player at the middle of the screen that needs to hit the target. When a key is pressed, the player swings a rope attached to a metal ball in circles above it's head, at a certain rotational velocity. Upon key release, the player has to let go of the rope and the ball travels tangentially from the point of release. Each time you hit the target you score.

Now, I’m trying to calculate the tangential velocity of a projectile from a circular path, I could find the trig formulas on Stack Overflow. But with an LLM, I can describe the 'vibe' of the game mechanic and get the math scaffolded in seconds.

It's that shift from searching for syntax to architecting the logic that feels like the real win.

comex 1 hour ago||

The downside is that you miss the chance to brush up on your math skills, skills that could help you understand and express more complicated requirements.

...This may still be worth it. In any case it will stop being a problem once the human is completely out of the loop.

edit: but personally I hate missing out on the chance to learn something.

pkoiralap 1 hour ago||

That would indeed be the case if one has never learned the stuff. And I am all in for not using AI/LLM for homework/assignments. I don't know about others, but when I was in school, they didn't let us use calculators in exams.

Today, I know very well how to multiply 98123948 and 109823593 by hand. That doesn't mean I will do it by hand if I have a calculator handy.

Also, ancient scholars, most notably Socrates via Plato, opposed writing because they believed it would weaken human memory, create false wisdom, and stifle interactive dialogue. But hey, turns out you learn better if you write and practice.

xandrius 6 hours ago|||

Why even come to this site if you're so anti-innovation?

Today with LLMs you can literally spend 5 minutes defining what you want to get, press send, go grab a coffee and come back to a working POC of something, in literally any programming language.

This is literally stuff of wonders and magic that redefines how we interface with computers and code. And the only thing you can think of is to ask if it can do something completely novel (that it's so hard to even quantity for humans that we don't have software patents mainly for that reason).

And the same model can also answer you if you ask it about maths, making you an itinerary or a recipe for lasagnas. C'mon now.

legulere 5 hours ago|||

I don't think that the user you are responding to is anti-innovation, but rather points out that the usefulness of AI is oversold.

I'm using Copilot for Visual Studio at work. It is useful for me to speed some typing up using the auto-complete. On the other hand in agentic mode it fails to follow simple basic orders, and needs hand-holding to run. This might not be the most bleeding-edge setup, but the discrepancy between how it's sold and how much it actually helps for me is very real.

ifwinterco 3 hours ago||

I think copilot is widely considered to be fairly rubbish, your description of agentic coding was also my experience prior to ~Q3 2025, but things have shifted meaningfully since then

pcloadlett3r 2 hours ago||

Copilot has access to the latest models like Opus 4.6 in agentic mode as well. It's got certain quirks and I prefer a TUI myself but it isn't radically different.

svantana 5 hours ago|||

There are different kinds of innovation.

I want AI that cures cancer and solves climate change. Instead we got AI that lets you plagiarize GPL code, does your homework for you, and roleplay your antisocial horny waifu fantasies.

Rover222 5 minutes ago|||

Not 100x but absolutely 4x to 5x increase in productivity for everyone on team on a large enterprise codebase that serves the military a lot of serious clients.

To deny at least that level of productivity at this point, you have to have your head in the sand.

rohit89 5 hours ago|||

> that hasn't been solved before a bunch of times in publicly available code?

And this matters because? Most devs are not working on novel never before seen problems.

kevstev 5 hours ago||

Heh, I agree. There is a vast ocean of dev work that is just "upgrade criticalLib to v2.0" or adding support for a new field from the FE through to the BE.

I can name a few times where I worked on something that you could consider groundbreaking (for some values of groundbreaking), and even that was usually more the combination of small pieces of work or existing ideas.

As maybe a more poignant example- I used to do a lot of on-campus recruiting when I worked in HFT, and I think I disappointed a lot of people when I told them my day to day was pretty mundane and consisted of banging out Jiras, usually to support new exchanges, and/or securities we hadn't traded previously. 3% excitement, 97% unit tests and covering corner cases.

turblety 3 hours ago|||

I'm not sure if you'd call it a productivity gain, but I have to host our infrastructure on a system that runs processes entirely in Linux userland.

To bridge the containers in userland only, without root, I had to build: https://github.com/puzed/wrapguard

I'm sure it's not perfect, and I'm sure there are lots of performance/productivity gains that can be made, but it's allowed us to connect our CDN based containers (which don't have root) across multiple regions, talking to each other on the same Wireguard network.

No product existed that I could find to do this (at least none I could find), and I could never build this (within the timeframe) without the help of AI.

revahage 5 hours ago|||

Well, it took opus 4.5 five messages to solve a trivial git problem for me. It hallucinated nonexistent flags three times. Hallucinating nonexistent flags is certainly a novel solution to my git ineptness.

Not to be outdone, chatgpt 5.2 thinking high only needed about 8 iterations to get a mostly-working ffmpeg conversion script for bash. It took another 5 messages to translate it to run in windows, on powershell (models escaping newlines on windows properly will be pretty nuch AGI, as far as I’m concerned).

AstroBen 3 hours ago|||

the 100x gains, even 10x, are obviously ridiculous but that doesn't mean AI is useless

Def_Os 5 hours ago|||

Yeah, I would LOVE to see attempts at significant video games that are then open-sourced for communities to work on. E.g. OpenGTA or OpenFIFA/OpenNHL.

llmslave 4 hours ago|||

baffled that people are still suspicious of ai coding models

logicprog 59 minutes ago|||

> single useful ... preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code

I see this originality criteria appended a lot, and

1) I don't think it's representative of the actual requirements for something to be extremely useful and productivity-enhancing, even revolutionary, for programming. IDE features, testing, code generation, compilers — all of these things did not really directly help you produce more original solutions to original problems, and yet they were huge advances in program or productivity.

I mean like. How many such programs are there in general?

The vast vast majority of programs that are written are slight modifications, reorganizations, or extensions, of one or more programs that are already publicly available a bunch of times over.

Even the ones that aren't could fairly easily be considered just recombinations of different pieces of programs that have been written and are publicly available dozens or more times over, just different parts of them combined in a different order.

Hell, most code is a reorganization or recombination of the exact same types of patterns just in a different way corresponding to different business logic or algorithms, if you want to push it that far.

And yet plenty of deeply unoriginal programs are very useful and fill a useful niche, so they get written anyway.

2) Nor is it a particularly satisfiable goal. If there aren't, as a percentage, very many reliable, useful, and original programs that have been written in the decades since open source became a thing, why would we expect a five-year-old technology to have done so, especially when, obviously, the more reliable original and broadly useful programs have already written, the narrower the scope for new ones to satisfy the originality criteria?

3) Nor is it actually something that we would expect even under the hypothesis that agents make people significantly more productive at programs. Even if agents give 100x productivity gains to writing a useful tool or service or program or improving existing ones with new features. We still wouldn't expect them to give necessarily very many much productivity gains at all to writing original programs, precisely because of their current technology is a product of deep thinking, understanding a specific domain, seeing a niche, inspiration, science, talent and luck much more than the ability to even do productive engineering.

jason_oster 1 hour ago|||

Personally, I’ve only been using a coding agent for a few months infrequently, so I have nothing to show for it. (It is not 100x productivity, that’s absurd.)

But I have plenty of examples of really atrocious human written code to show you! TheDailyWtf has been documenting the phenomenon for decades.

beernet 6 hours ago|||

Can you point me to a human written program an LLM cannot write? And no, just answering with a massively large codebase does not count because this issue is temporary.

Some people just hate progress.

HAL3000 4 hours ago|||

> Can you point me to a human written program an LLM cannot write?

Sure:

"The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)"[1]

1. https://www.anthropic.com/engineering/building-c-compiler

svantana 5 hours ago||||

Pretty much any software that people pay for? If LLMs could clone an app, why would anyone still pay good money for the original?

falloutx 5 hours ago||||

Even a normal website like landonorris.com. Try copying all those effects with AI.

Another example: Red Dead Redemption 2

Another one: Roller coaster tycoon

Another one: ShaderToy

avaer 4 hours ago|||

I wish I could agree with you, but as a game dev, shader author, and occasional asm hacker, I still think AIs have demonstrated being perfectly capable of copying "those effects". It's been trained on them, of course.

You're not gonna one-shot RD2, but neither will a human. You can one-shot particles and shader passes though.

falloutx 4 hours ago||

I didnt say one shot it, coding agents have been out for more than couple years and yet we cant point to single Good piece of software built by it.

satvikpendem 5 hours ago|||

Why do you believe an LLM can't write these, just because they're 3D? If the assets are given (just as with a human game programmer, who has artists provide them the assets), then an LLM can write the code just the same.

falloutx 5 hours ago||

What? People can easily get assets, thats not a even a problem in 2026. Roller coaster tycoon's assets were done by the programmer himself. If its so easy why haven't we seen actually complex pieces of software done in couple of weeks by LLM users?

Also try building any complex effects by prompting LLMs, you wont get any far, this is why all of the LLM coded websites look stupidly bland.

satvikpendem 4 hours ago||

Not sure what you're confused about, I never said assets were hard to get, I just said that the LLM needs to be provided a folder of the assets for it to make use of them, it's not going to create them from scratch (at least not without great difficulty, because LLMs are capable of using and coding Three.js for example). I don't know the answer to your first question because I don't hang around in the 3D or game dev fields, I'm sure there are examples of vibe coded games however.

As to your second question, it is about prompting them correctly, for example [0]. Now I don't know about you but some of those sites especially after using the frontend skill look pretty good to me. If those look bland to you then I'm not really sure what you're expecting, keeping in mind that the example you showed with the graphics are not regular sites but more design oriented, and even still nothing stops LLMs from producing such sites.

[0] https://youtu.be/f2FnYRP5kC4

falloutx 4 hours ago||

you have shown me 0 examples, I showed actual examples to the given question. Your answers have just been "AI can also do this" but gave no actual proof.

satvikpendem 4 hours ago||

The examples are in the video I linked, as I said, if you don't bother to watch it then I'm not sure what to tell you. As I said for games I don't know and won't presume to search up some random vibe coded game if I don't have personal experience with how LLMs handle games, but for web development, the sites I've made and seen made look pretty good.

Edit: I found examples [0] of games too with generated assets as well. These are all one shot so I imagine with more prompting you can get a decent game all without coding anything yourself.

[0] https://www.youtube.com/watch?v=8brENzmq1pE

suddenlybananas 5 hours ago|||

And some people clearly hate humans.

eviks 6 hours ago||

Great question, here is the link from the future:

More comments...