Codex logging bug may write TBs to local SSDs

Posted by vantareed 13 hours ago

397 points | 217 comments

bob1029 12 hours ago|

I'm struggling with how this much logging information could be generated at any level of verbosity. Is codex writing log entries while it's sitting idle? Why would someone want to look at these logs?

taosu_la 11 hours ago||

Can someone tell me whether codex's sub-agent feature is usable now? It used to always get stuck spinning.

indiv0 13 hours ago||

This thread will become a typical "haha slop company made slop" but I've been bitten by a bug exactly like this before in a (pre-AI, artisan) OSS project. The maintainer there didn't properly account for DST when calculating last backup time, so the app started and never stopped writing/re-writing backups continuously.

Perhaps the framing shouldn't be "haha slop" but rather why doesn't the AI write better quality software than we do? To which the answer is obvious IMO -- even emergent properties can't elevate AI intelligence too far above the training dataset. So how do we get to superintelligent (or at least "not-wreck-your-NVMe-endurance-telligent") AI, if we, as a whole, are not smart enough ourselves?

Judge not the slop-bot, lest ye be judged yourself, engineer.
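For illustration, here is a minimal sketch of one way to keep a "time since last backup" check from being thrown off by DST: do the subtraction in UTC instead of local wall-clock time. The function name `backup_due` and the 24-hour interval are hypothetical; this is not the code from the project mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical interval; the real project's setting is not known here.
BACKUP_INTERVAL = timedelta(hours=24)


def _to_utc(ts: datetime) -> datetime:
    """Reject naive timestamps and normalize aware ones to UTC."""
    if ts.tzinfo is None or ts.utcoffset() is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc)


def backup_due(last_backup: datetime, now: datetime,
               interval: timedelta = BACKUP_INTERVAL) -> bool:
    """Return True once at least `interval` of real time has elapsed.

    Both values are converted to UTC first, so a DST shift in the local
    zone does not change the elapsed time. A negative elapsed time
    (clock moved backwards) is treated as "not due".
    """
    elapsed = _to_utc(now) - _to_utc(last_backup)
    return elapsed >= interval
```

Storing the last-backup timestamp in UTC as well keeps the comparison independent of whatever the local zone does.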

sleples 12 hours ago||

We've gone from "you're holding it wrong" to "the training data was bad because humans suck too". Difference is, humans learn from their mistakes.

klibertp 7 hours ago|||

A singular human does (or tends to). Humans as a group, where members join and leave a group with time, also do learn, but at a much slower pace - over the years to decades timeframe. "X things programmers should know about Y" is a template for quite a few very influential blog posts, yet for most of them, you find many programmers, even decades later, who don't actually know what they "should".

My experience was always that 90% of code is ugly and clunky. I'm not at all surprised, while reviewing AI-generated code, to see much of the same ugliness we regularly commit. The quality of the output code is now consistently average, which means it's basically shit in 90% of cases, but it tends to mostly work (in the general case). The same kind of shit I've seen people push to production thousands of times in my career.

We don't fully know how to write good code. We don't really understand what good code should objectively look like. Spending more time on code doesn't automatically lead to better code (but costs a lot more). Above all, we don't need good code - the business side is perfectly fine with "good enough right now" rather than "maybe a lot better half a year from now". And that's what the models are trained on. They would, indeed, need quite a lot of "emergent properties" to go from that to consistently good code. ASI-level properties, I suspect.

SilverSlash 12 hours ago|||

> Difference is, humans learn from their mistakes.

Great! So next time the human will prompt the agent to watch out for and avoid this bug.

ponector 12 hours ago|||

You are a senior developer. Please do no mistakes!

sdesol 8 hours ago||||

> Great! So next time the human will prompt the agent to watch out for and avoid this bug.

I actually created a system for something like this. The basic idea is, once you have identified what the issue was and fixed it, you can create lessons that live inside the repository. Lessons are designed to be mapped to one or more files, so if the LLM changes those files again, it can see what the issue was.

The main challenge is being able to summarize and create proper tags so the AI after any code change can easily find the lesson.
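A rough sketch of what such a lesson-to-file mapping could look like, purely as an illustration. The `Lesson` class, its fields, and `lessons_for` are hypothetical names, not the actual system described above; paths are matched with glob patterns.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Lesson:
    """A hypothetical lesson record stored in the repository."""
    summary: str                  # short description of the past issue
    paths: tuple[str, ...]        # glob patterns for the files it covers
    tags: tuple[str, ...] = ()    # tags used to find the lesson later


def lessons_for(changed_files, lessons):
    """Return the lessons whose path patterns match any changed file."""
    return [
        lesson
        for lesson in lessons
        if any(
            fnmatchcase(path, pattern)
            for path in changed_files
            for pattern in lesson.paths
        )
    ]
```

After a code change, the list of touched files could be passed to `lessons_for` and the matching summaries added to the agent's context.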

fg137 2 hours ago|||

Given the amount of training data out there, LLMs should have been perfect by now.

Zenul_Abidin 4 hours ago|||

I've been bitten by this bug for several days, to the point where I had to write a script to delete the WAL so that my server would stop getting locked up from a lack of disk space from codex logging.

You can find it here: https://github.com/openai/codex/issues/28224#issuecomment-47...

I have been making noise about this bug for a week, so I'm glad to see this is blowing up on HN.

xpct 11 hours ago|||

Lack of accountability is the cause here. People don't think before hitting the 'Publish' button. Their managers let them off the hook because the culture still allows making egregious mistakes, as long as there's an LLM to blame.

applfanboysbgon 12 hours ago|||

1. I bet that developer only made that mistake one time in their life. Humans learn from their mistakes, LLMs don't. If you rely on LLMs to generate all of your code, you can expect to run into the same issues again and again.

2. "One developer somewhere in the world made a bad mistake one time, so this represents the quality of all software devs everywhere". Maybe they were just a bad developer? Bad developers exist. I have never written a bug that has destroyed my users' hardware, and I think that writing such a bug is completely inexcusable in an enterprise environment with software that will be shipped to millions of users, as Codex is.

matharmin 12 hours ago|||

LLMs do learn from mistakes. Not as directly from individual mistakes as humans do, but in aggregate the models have improved much more in the last year than most humans I know learn in the same time.

xpct 11 hours ago|||

I don't like the reframing of 'learning from mistakes' from a human-like, near instantaneous feedback loop, to a year-long process of retraining on many traces collected from user data. They're different concepts and we should refer to them using different phrasing.

Y-bar 10 hours ago|||

How many more times do I have to add variations of "do not run any commands for the application without first entering the running container at `docker compose …`" to my AGENTS.md before it learns that node and phpunit are not available outside these containers?

lifthrasiir 12 hours ago|||

> I have never written a bug that has destroyed my users' hardware, ...

Probably whoever (human or agent) originally decided to put TRACE logs into SQLite also thought---or reasoned---so. Maybe the decision was right at that time but the volume of TRACE logs has increased enormously. You will never know.

applfanboysbgon 12 hours ago||

I love that we've moved the goalposts from "LLMs are better than artisanal software engineers" to "actually, shipping hardware-destroying bugs in production is literally unavoidable, nobody could possibly avoid doing it".

lifthrasiir 12 hours ago||

I only meant what I said. After all the OP's thesis was that LLMs aren't better than artisanal software engineers, are they? There was no goalpost to move at least in this particular thread. And the solution might be another agent monitoring those oft-ignored signals.

da_grift_shift 12 hours ago||

What are your thoughts on the SNR of the linked GitHub issue threads? Consider the volume of comments posted and the substance of each comment.

indiv0 12 minutes ago|||

My gut reaction is "I wish they'd just get to the point". Tbf some people would probably react the same way to my issue thread on the bug that I hit [0].

[0]: https://github.com/mnemosyne-proj/mnemosyne/issues/99

fn-mote 11 hours ago|||

I read the first page and they were excellent. Each was clearly written by an experienced dev who knows how to substantiate their claims and propose an acceptable fix that could just be merged.

Your comment, on the other hand, would be improved by including your own opinion on the matter.

gruez 8 hours ago|||

> Each was clearly written by an experienced dev

/s?

They're clearly AI generated

rvz 13 hours ago||

The first of many bugs that are beyond the complexity of its authors, thanks to comprehension debt.

Even with tests, the more complex the code base is, the more risky it is to vibe-code on it without introducing more bugs [0] and increasing the debt. Does not matter if the CI is green or if all the tests pass.

It gets even worse if you can't explain the change / pull request or what the implications are after applying that "suggested" fix.

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

HPsquared 12 hours ago|

There are going to be sooooo many consulting opportunities after this wave.

hun3 12 hours ago||

The operating system has historically trusted the applications not to do dumb things too much.

Only now we're witnessing the consequences much more frequently thanks to accelerated slop.

skydhash 9 hours ago|

> The operating system has historically trusted the applications not to do dumb things too much.

The OS is a thin layer providing an abstract and consistent interface regardless of the hardware configuration. Policing applications is mostly related to security and resource utilization, not moronic errors.

hun3 6 hours ago||

> The OS is a thin layer providing an abstract and consistent interface regardless of the hardware configuration.

This is called a hardware abstraction layer, not OS.

https://en.wikipedia.org/wiki/Hardware_abstraction

abihordun 11 hours ago||

SQLite + unbounded TRACE logs = firehose in a bathtub. No rotation, no cap, no surprise. The RAISE(IGNORE) fix patches a design flaw. OpenAI's silence is worse than the bug.
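As a sketch of what "rotation and a cap" could mean for a SQLite-backed log, assuming a made-up `logs` table: one trigger drops TRACE rows with `RAISE(IGNORE)` before they are written, and another keeps only the newest `max_rows` rows. The schema, trigger names, and cap are hypothetical and are not taken from Codex or from the actual fix.

```python
import sqlite3

MAX_ROWS = 1000  # hypothetical cap on retained log rows


def open_log_db(path: str = ":memory:", max_rows: int = MAX_ROWS) -> sqlite3.Connection:
    """Open a log database whose table is bounded by triggers."""
    conn = sqlite3.connect(path)
    conn.executescript(f"""
        CREATE TABLE IF NOT EXISTS logs (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            level   TEXT NOT NULL,
            message TEXT NOT NULL
        );

        -- Silently skip TRACE rows instead of storing them.
        CREATE TRIGGER IF NOT EXISTS drop_trace
        BEFORE INSERT ON logs
        WHEN NEW.level = 'TRACE'
        BEGIN
            SELECT RAISE(IGNORE);
        END;

        -- Keep only the newest max_rows rows.
        CREATE TRIGGER IF NOT EXISTS cap_rows
        AFTER INSERT ON logs
        BEGIN
            DELETE FROM logs WHERE id <= NEW.id - {int(max_rows)};
        END;
    """)
    return conn


def write_log(conn: sqlite3.Connection, level: str, message: str) -> None:
    """Insert one log row; the triggers enforce the policy."""
    conn.execute("INSERT INTO logs (level, message) VALUES (?, ?)", (level, message))
    conn.commit()
```

Note that a row cap only bounds how much is stored; every retained insert is still a disk write, so filtering noisy levels before they reach the database is what actually reduces write volume.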

consp 13 hours ago||

Why didn't the review process spot this obvious error? Oh wait ... @codex review this

cedws 12 hours ago||

Moreover why isn't the bug fixed already? I thought programmers were obsolete now. Surely one of the leading AI labs has figured out full automation of software development end-to-end by now if that's so.

charcircuit 13 hours ago|||

Because it's not an error. The software is working as the creators intended. The diagnostic data (trace logs) are intentionally being saved for debug purposes.

anematode 1 hour ago||

What?

Imustaskforhelp 13 hours ago||

I don't understand how Codex can blunder so badly. I imagine that even if they are using vibe-coding, surely they must have some good engineers. So why are there such severe bugs?

One can argue that these products are the flagship products of their respective AI companies aside from the AI models themselves of course.

I imagine that this story will be picked up by the news left and right, some stories just feel this way and this one is like that (given 12 upvotes on HN in 7 minutes)

The only logical conclusions (from this incident) that I can draw are:

1. A (vibe-coded?) product is hard to maintain even for some of the best engineers and is bound to have severe bugs.

2. Proper testing and taking issues seriously are key if you still wish to do this, and there isn't much of that here. This is a week-old issue which I can only classify as severe.

I wish to keep a nuanced opinion about it but oh this is bad for OpenAI (not as bad as them accepting autonomous AI within drones and mass surveillance though)

My point is: AI has both uphills and downward valleys and cliffs. It might as well just accelerate you, which could be toward your downfall as well. It's recommended to keep an eye on the road while driving and not drive too fast.

AI companies might be like car companies which don't offer a brake pedal.

dathinab 13 hours ago||

> I don't understand how Codex can blunder so badly.

because they trust the AI too much (and seem to be fine with acting knowingly negligent)

the problem is

- AI tends to produce very convincing looking code, even if fully wrong

- AI makes mistakes of kinds no human would make, at least no human who is also able to write convincing looking code

- code reviews are hard; a lot of devs, including senior devs, put a lot of implicit trust in the co-worker behaving "sane and non malicious". But AIs sometimes behave not so sanely, in particular wrt. trying to be convincing. In the worst case they behave in ways which, if it were a human, you might consider an attempt to maliciously sabotage the product

Like a "dumb" example from work:

- AI randomly removes a HTML element id while doing other changes in jsx/react

- the PR has a lot of changes, the id removal line looks innocent, like some on the fly cleanup

- human reviewers have the bad tendency to often not look too much at deleted lines, only if they need reference to how a new line was before (but it's only a deleted line and no new line)

- you don't expect humans to randomly without reason delete important properties of components when changing other things

- you maybe would still have found it, but it's an emergency fix for a production issue

- the change happens not to be covered by integration tests, but it still matters a lot for one specific, important flow that, for complicated reasons, is not properly tested (similarly, people tend not to test logging much: at best the presence of needed info, but hardly ever the absence of noise)
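On the "absence of noise" point, a test along these lines is possible in principle. This is a minimal sketch using Python's standard `logging` module; `CountingHandler` and `measure_log_volume` are hypothetical helper names, and a real test would set a budget that fits the application.

```python
import logging


class CountingHandler(logging.Handler):
    """Counts records and message bytes instead of writing them anywhere."""

    def __init__(self):
        super().__init__(level=logging.NOTSET)
        self.count = 0
        self.size = 0

    def emit(self, record):
        self.count += 1
        self.size += len(record.getMessage().encode("utf-8"))


def measure_log_volume(logger, action):
    """Run `action` and return (record_count, message_bytes) it logged."""
    handler = CountingHandler()
    old_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)  # capture even the most verbose level
    try:
        action()
    finally:
        logger.removeHandler(handler)
        logger.setLevel(old_level)
    return handler.count, handler.size
```

A test could then assert that an idle tick of the application logs nothing, or stays under a fixed byte budget, for example `assert measure_log_volume(logger, idle_tick)[1] <= 1024`, where `idle_tick` is whatever the application does while waiting.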

espdev 6 hours ago|||

> I don't understand how Codex can blunder so badly. I imagine that even if they are using vibe-coding, surely they must have some good engineers. So why are there such severe bugs?

I'd say this is also partly a problem of working under intense pressure and the demand to work faster and faster - even faster now with "AI". All these companies are competing with each other very aggressively and are driving their employees like horses in order to win the "AI" race.

PunchyHamster 13 hours ago|||

> I don't understand how Codex can blunder so badly. I imagine that even if they are using vibe-coding, surely they must have some good engineers. So why are there such severe bugs?

Because it was deemed not a Hard Enough task for a real engineer to look at, so AI was sent to do it with no supervision, just checking the effects.

Also, excessive logging is probably useful to them in chasing some of the edge cases; the cost to users doesn't matter in the slightest to them

bakugo 6 hours ago|||

"Vibe coding" implies minimal to no human involvement. It doesn't matter how good of an engineer the person who typed the prompt was, they were not involved in writing or reviewing the code, so the end result will not reflect their skill. The whole point of vibe coding is making software engineers irrelevant.

People like to go on about how "good engineers review their AI code" but that's just not what's happening in reality. Not only is reviewing large amounts of AI generated code unpleasant and mentally taxing, it also negates most of the perceived productivity boost, so people are simply not doing it.

> Proper testing

There is no formal testing that would be expected to catch an issue like this. It can barely be classified as a bug, the logging is working as intended, just with negative side effects that weren't accounted for.

The only real way to proactively prevent an issue like this is for a human programmer to stop and think about this code as they're writing it and go "hmm, we're logging large amounts of data to disk at a fast pace here, this may be a bad idea". Without human involvement, this is just going to keep happening. All vibe coded software is bloated and unstable, I have yet to see a single counter-example.

supriyo-biswas 13 hours ago||

The truth of the matter is that any time that has been saved in writing the code must be spent on ensuring proper system design, reviewing the code, and most importantly of all, QA, which is an uncomfortable discussion for AI techbros who are peddling complete automation of the software profession.

whalesalad 8 hours ago|

Yikes. I have a habit of leaving sessions open for a long time. I just ran `sudo iotop` to watch live disk activity and sure enough all my idle codex sessions were spinning away writing god knows what constantly to disk.
