GPT‑5.3‑Codex‑Spark

Posted by meetpateltech 4 hours ago

428 points | 190 comments

beklein 4 hours ago|

I love this! I use coding agents to generate web-based slide decks where “master slides” are just components, and we already have rules + assets to enforce corporate identity. With content + prompts, it’s straightforward to generate a clean, predefined presentation. What I’d really want on top is an “improv mode”: during the talk, I can branch off based on audience questions or small wording changes, and the system proposes (say) 3 candidate next slides in real time. I pick one, present it, then smoothly merge back into the main deck. Example: if I mention a recent news article / study / paper, it automatically generates a slide that includes a screenshot + a QR code link to the source, then routes me back to the original storyline. With realtime voice + realtime code generation, this could turn the boring old presenter view into something genuinely useful.

sva_ 3 hours ago||

I love the probabilistic nature of this. Presentations could be anywhere from extremely impressive to hilariously embarrassing.

clickety_clack 2 hours ago||

It would be so cool if it generated live in the presentation and adjusted live as you spoke, so you’d have to react to whatever popped on screen!

crystal_revenge 1 hour ago|||

There was a pre-LLM version of this called "battledecks" or "PowerPoint Karaoke"[0] where a presenter is given a deck of slides they've never seen and have to present on it. With a group of good public speakers it can be loads of fun (and really impressive the degree that some people can pull it off!)

0. https://en.wikipedia.org/wiki/PowerPoint_karaoke

bsharper 1 hour ago||

There is a Jackbox game called "Talking Points" that's like this: the players come up with random ideas for presentations, your "assistant" (one of the other players) picks what's on each slide while you present: https://www.youtube.com/watch?v=gKnprQpQONw

nikcub 3 minutes ago||||

and with neuralink it would generate slides of the audience naked

Etheryte 2 hours ago||||

Some consulting firms do this, one guy is giving the presentation live while others are in the next meeting room still banging out the slides.

onionisafruit 2 hours ago||||

Every presentation becomes improv

deepGem 2 hours ago|||

Isn't that such a great outcome. No more robotic presentations. The best part is that you can now practice Improv at the comfort of your home.

mbreese 1 hour ago||

And this product will work great for any industry... can I get a suggestion for an industry from the crowd?

Audience: Transportation... Education... Insurance...

Speaker: Great! I heard "Healthcare".

Right... as we can see from this slide, this product fits the "Healthcare" industry great because of ...

lelandfe 39 minutes ago||

Caro’s first LBJ biography tells of how the future president became a congressman in Texas in his 20s, by carting around a “claque” of his friends to various stump speeches and having them ask him softball questions and applauding loudly after

Well, hey, who needs friends?

DonHopkins 2 hours ago|||

I had a butterfly take over my live DreamScape slide show demo at the 1995 WWDC.

https://youtu.be/5NytloOy7WM?t=321

m_mueller 39 minutes ago|||

You're describing almost verbatim what we're building at Octigen [1]! Happy to provide a demo and/or give you free access to our alpha version already online.

[1] https://octigen.com

deepGem 1 hour ago|||

I built something similar at a hackathon, a dynamic teleprompter that adjusts the speed of tele-prompting based on speaker tonality and spoken wpm. I can see extending the same to an improv mode. This is a super cool idea.

jorgenveisdal 2 hours ago|||

As an associate professor who spends a ridiculous amount of time preparing for lectures, I would love to try this in one of my courses

esafak 3 hours ago|||

Can you show one?

beklein 1 hour ago||

The end result would be a normal PPT presentation, check https://sli.dev as an easy start, ask Codex/Claude/... to generate the slides using that framework with data from something.md. The interesting part here is generating these otherwise boring slide decks not with PowerPoint itself but with AI coding agents and a master slides, AGENTS.md context. I’ll be showing this to a small group (normally members only) at IPAI in Heilbronn, Germany on 03/03. If you’re in the area and would like to join, feel free to send me a message I will squeeze you in.

orochimaaru 3 hours ago|||

How do you handle the diagrams?

beklein 3 hours ago||

In my AGENTS.md file i have a _rule_ that tells the model to use Apache ECharts, the data comes from the prompt and normally .csv/.json files. Prompt would be like: "After slide 3 add a new content slide that shows a bar chart with data from @data/somefile.csv" ... works great and these charts can be even interactive.

orochimaaru 2 hours ago||

What about other ad hoc diagrams like systems architecture, roadmaps, mind maps, etc.

These are the bane of any staff engineers life - lol. Because people above need to know a plan in art form.

So seriously interested on how I can make it easier

beklein 1 hour ago|||

Not my normal use-case, but you can always fall back and ask the AI coding agent to generate the diagram as SVG, for blocky but more complex content like your examples it will work well and still is 100% text based, so the AI coding agents or you manually can fix/adjust any issues. An image generation skill is a valid fallback, but in my opinion it's hard to change details (json style image creation prompts are possible but hard to do right) and you won't see changes nicely in the git history. In your use case you can ask the AI coding agent to run a script.js to get the newest dates for the project from a page/API, then it should only update the dates in the roadmap.svg file on slide x with the new data. This way you will automagically have the newest numbers and can track everything within git in one prompt. Save this as a rule in AGENTS.md and run this every month to update your slides with one prompt.

mcamac 2 hours ago||||

You could try something like mermaid (or ASCII) -> nano banana. You can also go the other way and turn images into embedded diagrams (which can be interactive depending on how you're sharing the presentation)

sleazebreeze 1 hour ago|||

Claude code can output Excalidraw format files which can be imported directly into the webapp. You can MCP it too if you want.

turnsout 3 hours ago||

I love the idea of a living slide deck. This feels like a product that needs to exist!

postalcoder 2 hours ago||

First thoughts using gpt-5.3-codex-spark in Codex CLI:

Blazing fast but it definitely has a small model feel.

It's tearing up bluey bench (my personal agent speed benchmark), which is a file system benchmark where I have the agent generate transcripts for untitled episodes of a season of bluey, perform a web search to find the episode descriptions, and then match the transcripts against the descriptions to generate file names and metadata for each episode.

Downsides:

- It has to be prompted to do actions in my media library AGENTS.md that the larger models adhere to without additional prompting.

- It's less careful with how it handles context which means that its actions are less context efficient. Combine that with the smaller context window and I'm seeing frequent compactions.

  Bluey Bench* (minus transcription time):

  Codex CLI
  gpt-5.3-codex-spark low        20s
  gpt-5.3-codex-spark medium     41s
  gpt-5.3-codex-spark xhigh   1m 09s (1 compaction)

  gpt-5.3-codex low           1m 04s
  gpt-5.3-codex medium        1m 50s

  gpt-5.2 low                 3m 04s
  gpt-5.2 medium              5m 20s

  Claude Code
  opus-4.6 (no thinking)      1m 04s

  Antigravity
  gemini-3-flash              1m 40s
  gemini-3-pro low            3m 39s

  *Season 2, 52 episodes

alexdobrenko 1 hour ago||

can we plese make the bluey bench the gold standard for all models always

mnicky 2 hours ago|||

Can you compare it to Opus 4.6 with thinking disabled? It seems to have very impressive benchmark scores. Could also be pretty fast.

postalcoder 1 hour ago||

Added a thinking-disabled Opus 4.6 timing. It took 1m 4s – coincidentally the same as 5.3-codex-low.

Squarex 2 hours ago||

I wonder why they named it so similiarly to the normal codex model while it much worse, while cool of course.

pjs_ 3 hours ago||

Continue to believe that Cerebras is one of the most underrated companies of our time. It's a dinner-plate sized chip. It actually works. It's actually much faster than anything else for real workloads. Amazing

onlyrealcuzzo 2 hours ago||

Nvidia seems cooked.

Google is crushing them on inference. By TPUv9, they could be 4x more energy efficient and cheaper overall (even if Nvidia cuts their margins from 75% to 40%).

Cerebras will be substantially better for agentic workflows in terms of speed.

And if you don't care as much about speed and only cost and energy, Google will still crush Nvidia.

And Nvidia won't be cheaper for training new models either. The vast majority of chips will be used for inference by 2028 instead of training anyway.

Nvidia has no manufacturing reliability story. Anyone can buy TSMC's output.

Power is the bottleneck in the US (and everywhere besides China). By TPUv9 - Google is projected to be 4x more energy efficient. It's a no-brainer who you're going with starting with TPUv8 when Google lets you run on-prem.

These are GW scale data centers. You can't just build 4 large-scale nuclear power plants in a year in the US (or anywhere, even China). You can't just build 4 GW solar farms in a year in the US to power your less efficient data center. Maybe you could in China (if the economics were on your side, but they aren't). You sure as hell can't do it anywhere else (maybe India).

What am I missing? I don't understand how Nvidia could've been so far ahead and just let every part of the market slip away.

sailingparrot 2 hours ago|||

> let every part of the market slip away.

Which part of the market has slept away, exactly ? Everything you wrote is supposition and extrapolation. Nvidia has a chokehold on the entire market. All other players still exist in the small pockets that Nvidia doesn’t have enough production capacity to serve. And their dev ecosystem is still so far ahead of anyone else. Which providers gets chosen to equip a 100k chips data center goes so far beyond the raw chip power.

onlyrealcuzzo 2 hours ago||

> Nvidia has a chokehold on the entire market.

You're obviously not looking at expected forward orders for 2026 and 2027.

louiereederson 33 minutes ago||

I think most estimates have Nvidia at more or less stable share of CoWoS capacity (around 60%), which is ~doubling in '26.

mnicky 2 hours ago||||

> What am I missing?

Largest production capacity maybe?

Also, market demand will be so high that every player's chips will be sold out.

onlyrealcuzzo 2 hours ago||

> Largest production capacity maybe?

Anyone can buy TSMC's output...

Keyframe 1 hour ago||

Can anyone buy TSMC though?

louiereederson 28 minutes ago||

No. TSMC will not take the risk on allocating capacity to just anyone given the opportunity cost.

wing-_-nuts 2 hours ago||||

Man I hope someone drinks Nvidia's milk shake. They need to get humbled back to the point where they're desperate to sell gpus to consumers again.

Only major road block is cuda...

whism 2 hours ago||||

I believe they licensed smth from groq

Handy-Man 2 hours ago|||

Well they `acquired` groq for a reason.

zozbot234 3 hours ago|||

It's "dinner-plate sized" because it's just a full silicon wafer. It's nice to see that wafer-scale integration is now being used for real work but it's been researched for decades.

arcanemachiner 3 hours ago|||

Just wish they weren't so insanely expensive...

azinman2 3 hours ago||

The bigger the chip, the worse the yield.

speedgoose 2 hours ago|||

I suggest to read their website, they explain pretty well how they manage good yield. Though I’m not an expert in this field. I does make sense and I would be surprised if they were caught lying.

moralestapia 2 hours ago|||

This comment doesn't make sense.

Sohcahtoa82 2 hours ago|||

One wafer will turn into multiple chips.

Defects are best measured on a per-wafer basis, not per-chip. So if if your chips are huge and you can only put 4 chips on a wafer, 1 defect can cut your yield by 25%. If they're smaller and you fit 100 chips on a wafer, then 1 defect on the wafer is only cutting yield by 1%. Of course, there's more to this when you start reading about "binning", fusing off cores, etc.

There's plenty of information out there about how CPU manufacturing works, why defects happen, and how they're handled. Suffice to say, the comment makes perfect sense.

snovv_crash 1 hour ago||

That's why you typically fuse off defective sub-units and just have a slightly slower chip. GPU and CPU manufacturers have done this for at least 15 years now, that I'm aware of.

azinman2 2 hours ago||||

Sure it does. If it’s many small dies on a wafer, then imperfections don’t ruin the entire batch; you just bin those components. If the entire wafer is a single die, you have much less tolerance for errors.

dekhn 2 hours ago|||

Although, IIUC, Cerebras expects some amount of imperfection and can adjust the hardware (or maybe the software) to avoid those components after they're detected. https://www.cerebras.ai/blog/100x-defect-tolerance-how-cereb...

pertymcpert 2 hours ago|||

You can just do dynamic binning.

louiereederson 27 minutes ago||||

You say this with such confidence and then ask if smaller chips require smaller wafers.

DocJade 2 hours ago|||

Bigger chip = more surface area = higher chance for somewhere in the chip to have a manufacturing defect

Yields on silicon are great, but not perfect

moralestapia 2 hours ago||

Does that mean smaller chips are made from smaller wafers?

Sohcahtoa82 2 minutes ago||

Nope. They use the same size wafers and then just put more chips on a wafer.

dalemhurley 2 hours ago|||

Yet investors keep backing NVIDIA.

vimda 1 hour ago||

At this point Tech investment and analysis is so divorced from any kind of reality that it's more akin to lemmings on the cliff than careful analysis of fundamentals

latchkey 3 hours ago|||

Not for what they are using it for. It is $1m+/chip and they can fit 1 of them in a rack. Rack space in DC's is a premium asset. The density isn't there. AI models need tons of memory (this product annoucement is case in point) and they don't have it, nor do they have a way to get it since they are last in line at the fabs.

Their only chance is an aquihire, but nvidia just spent $20b on groq instead. Dead man walking.

p1esk 3 hours ago|||

The real question is what’s their perf/dollar vs nvidia?

zozbot234 3 hours ago|||

I guess it depends what you mean by "perf". If you optimize everything for the absolutely lowest latency given your power budget, your throughput is going to suck - and vice versa. Throughput is ultimately what matters when everything about AI is so clearly power-constrained, latency is a distraction. So TPU-like custom chips are likely the better choice.

p1esk 2 hours ago|||

By perf I mean how much does it cost to serve 1T model to 1M users at 50 tokens/sec.

zozbot234 2 hours ago||

All 1T models are not equal. E.g. how many active parameters? what's the native quantization? how long is the max context? Also, it's quite likely that some smaller models in common use are even sub-1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much and you can enjoy the lightning-fast speed.

p1esk 1 hour ago|||

Just pick some reasonable values. Also, keep in mind that this hardware must still be useful 3 years from now. What’s going to happen to cerebras in 3 years? What about nvidia? Which one is a safer bet?

On the other hand, competition is good - nvidia can’t have the whole pie forever.

zozbot234 1 hour ago||

> Just pick some reasonable values.

And that's the point - what's "reasonable" depends on the hardware and is far from fixed. Some users here are saying that this model is "blazing fast" but a bit weaker than expected, and one might've guessed as much.

> On the other hand, competition is good - nvidia can’t have the whole pie forever.

Sure, but arguably the closest thing to competition for nVidia is TPUs and future custom ASICs that will likely save a lot on energy used per model inference, while not focusing all that much on being super fast.

latchkey 1 hour ago||

AMD

wiredpancake 39 minutes ago|||

[dead]

fragmede 2 hours ago|||

> Throughput is ultimately what matters

I disagree. Yes it does matter, but because the popular interface is via chat, streaming the results of inference feels better to the squishy messy gross human operating the chat, even if it ends up taking longer. You can give all the benchmark results you want, humans aren't robots. They aren't data driven, they have feelings, and they're going to go with what feels better. That isn't true for all uses, but time to first byte is ridiculously important for human-computer interaction.

zozbot234 2 hours ago||

You just have to change the "popular interface" to something else. Chat is OK for trivia or genuinely time-sensitive questions, everything else goes through via email or some sort of webmail-like interface where requests are submitted and replies come back asynchronously. (This is already how batch APIs work, but they only offer a 50% discount compared to interactive, which is not enough to really make a good case for them - especially not for agentic workloads.)

xnx 3 hours ago||||

Or Google TPUs.

latchkey 3 hours ago||

TPUs don't have enough memory either, but they have really great interconnects, so they can build a nice high density cluster.

Compare the photos of a Cerebras deployment to a TPU deployment.

https://www.nextplatform.com/wp-content/uploads/2023/07/cere...

https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iOLs2FEQxQv...

The difference is striking.

p1esk 2 hours ago||

Oh wow the cabling in the first link is really sloppy!

latchkey 3 hours ago|||

Exactly. They won't ever tell you. It is never published.

Let's not forget that the CEO is an SEC felon who got caught trying to pull a fast one.

spwa4 3 hours ago|||

Oh don't worry. Ever since the power issue started developing rack space is no longer at a premium. Or at least, it's no longer the limiting factor. Power is.

latchkey 3 hours ago||

The dirty secret is that there is plenty of power. But, it isn't all in one place and it is often stranded in DC's that can't do the density needed for AI compute.

Training models needs everything in one DC, inference doesn't.

femiagbabiaka 3 hours ago|||

yep

xnx 3 hours ago||

Cerebras is a bit of a stunt like "datacenters in spaaaaace".

Terrible yield: one defect can ruin a whole wafer instead of just a chip region. Poor perf./cost (see above). Difficult to program. Little space for RAM.

the_duke 3 hours ago|||

They claim the opposite, though, saying the chip is designed to tolerate many defects and work around them.

simonw 1 hour ago||

My stupid pelican benchmark proves to be genuinely quite useful here, you get a visual representation of the quality difference between GPT-5.3-Codex-Spark and full GPT-5.3-Codex: https://simonwillison.net/2026/Feb/12/codex-spark/

lacoolj 1 hour ago|

These are the ones I look for every time a new model is released. Incorporates so many things into one single benchmark.

Also your blog is tops. Keep it up, love the work.

perdomon 1 hour ago||

This has been the industry standard for the last 20 minutes. I can't believe people are still using GPT-5.3-Codex.

sam_goody 46 minutes ago|

I read this headline and was like, "A look, an announcement by GPT!! That means that Google or Anthropic must have had a release today!"

And, yup, there is Gemini in item 3!

jryio 4 hours ago||

This is interesting for offloading "tiered" workloads / priority queue with coding agents.

If 60% of the work is "edit this file with this content", or "refactor according to this abstraction" then low latency - high token inference seems like a needed improvement.

Recently someone made a Claude plugin to offload low-priority work to the Anthropic Batch API [1].

Also I expect both Nvidia and Google to deploy custom silicon for inference [2]

1: https://github.com/s2-streamstore/claude-batch-toolkit/blob/...

2: https://www.tomshardware.com/tech-industry/semiconductors/nv...

zozbot234 4 hours ago||

Note that Batch APIs are significantly higher latency than normal AI agent use. They're mostly intended for bulk work where time constraints are not essential. Also, GPT "Codex" models (and most of the "Pro" models also) are currently not available under OpenAI's own batch API. So you would have to use non-agentic models for these tasks and it's not clear how well they would cope.

(Overall, batches do have quite a bit of potential for agentic work as-is but you have to cope with them taking potentially up to 24h for just a single roundtrip with your local agent harness.)

dehugger 4 hours ago||

I built something similar using an MCP that allows claude to "outsource" development to GLM 4.7 on Cerebras (or a different model, but GLM is what I use). The tool allows Claude to set the system prompt, instructions, specify the output file to write to and crucially allows it to list which additional files (or subsections of files) should be included as context for the prompt.

Ive had great success with it, and it rapidly speeds up development time at fairly minimal cost.

cheema33 4 hours ago||

Why use MCP instead of an agent skill for something like this when MCP is typically context inefficient?

pertymcpert 2 hours ago|||

MCP is fine if your tool definition is small. If it's something like a sub-agent harness which is used very often, then in fact it's probably more context efficient because the tools are already loaded in context and the model doesn't have to spend a few turns deciding to load the skill, thinking about it and then invoking another tool/script to invoke the subagent.

wahnfrieden 3 hours ago|||

Models haven't been trained enough on using skills yet, so they typically ignore them

andai 3 hours ago||

Is that true? I had tool use working with GPT-4 in 2023, before function calling or structured outputs were even a thing. My tool instructions were only half a page though. Maybe the long prompts are causing problems?

pertymcpert 2 hours ago||

They're talking about "skills" which are not the same thing as tools. Most models haven't been trained on the open SKILL spec, and therefore aren't tuned to invoke them reliable when the need occurs.

nikkwong 4 hours ago||

> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.

I have yet to see this (produce anything actually useful).

simonw 4 hours ago||

How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

aeyes 3 hours ago|||

What do you mean? The generated script just downloads the sources and runs pyodide: https://github.com/simonw/research/blob/main/cysqlite-wasm-w...

There is maybe 5 relevant lines in the script and nothing complex at all that would require to run for days.

andai 3 hours ago|||

Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)

I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)

I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.

(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)

simonw 3 hours ago|||

No, not for days - but it churned away on that one for about ten minutes.

I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/

basilgohar 3 hours ago|||

Can you share any examples of these one-shot prompts? I've not gotten to the point where I can get those kind of results yet.

simonw 3 hours ago||

If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.

gamegoblin 4 hours ago|||

I routinely leave codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase

nikkwong 3 hours ago|||

I have a hard time understanding how that would work — for me, I typically interface with coding agents through cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again; etc. until we're at a happy place with the code. How do you get it to stop from going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in the process, is verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like with your usecase—you can say "here's a failing test—do whatever you can to fix it and don't stop until you do". I could see that limited case working.

woah 3 hours ago|||

For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want

gamegoblin 35 minutes ago||||

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

zem 1 hour ago||||

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.

p1esk 3 hours ago||||

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code that the test passes but everything else is now broken.

SatvikBeri 22 minutes ago||

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

vel0city 2 hours ago|||

You do things like ralph loops.

https://github.com/snarktank/ralph

Its constantly restarting itself, looking at the current state of things, re-reading what was the request, what it did and failed at in the past (at a higher level), and trying again and again.

tsss 3 hours ago||||

How can you afford that?

wahnfrieden 3 hours ago||

It costs $200 for a month

addaon 3 hours ago|||

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

arcanemachiner 3 hours ago||

Or, they have freed up time for more useful endeavours, that may otherwise have spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

XCSme 4 hours ago|||

Their ability to burn through tokens non-stop for hours, days or weeks without intervention.

raw_anon_1111 3 hours ago||

You’re mixing up Open AI for Anthropic.

Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. Open AI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.

I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.

TheMuenster 1 hour ago|||

Can I just say how funny this metric is?

"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.

johnfn 3 hours ago|||

The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked in this for over 3 hours without intervention (I went to sleep). This is now in production.

seunosewa 2 hours ago||

How did you verify it?

girvo 1 hour ago||

Just send it bro

(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)

wahnfrieden 3 hours ago|||

It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.

nikkwong 3 hours ago||

I'm definitely bullish on LLM's for coding. It sounds to me as though getting it to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best—but I haven't seen many examples in the wild yet

foobar10000 2 hours ago|||

It needs a closed loop.

Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.

Claude teams and a Ralph wiggum loop can do it - or really any reasonable agent. But usually it all falls apart on either brittle Verify or Benchmark steps. What is important is to learn positive lessons into a store that survives git resets, machine blowups, etc… Any telegram bot channel will do :)

The entire setup is usually a pain to set up - docker for verification, docker for benchmark, etc… Ability to run the thing quickly, ability for the loop itself to add things , ability to do this in worktree simultaneously for faster exploration - and got help you if you need hardware to do this - for example, such a loop is used to tune and custom-fuse CUDA kernels - which means a model evaluator, big box, etc….

wahnfrieden 1 hour ago||

I do it easily just by asking Codex

rcarmo 2 hours ago|||

well, you can start with https://github.com/rcarmo/go-textile, https://github.com/rcarmo/go-rdp, https://github.com/rcarmo/go-ooxml, https://github.com/rcarmo/go-busybox (still WIP). All of these are essentially SPEC and test-driven and they are all working for me (save a couple of bugs in go-rdp I need to fix myself, and some gaps in the ECMA specs for go-ooxml that require me to provide actual manually created documents for further testing).

I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.

bitwize 3 hours ago||

PEBKAC

raahelb 2 hours ago||

Interesting to note that the reduced latency is not just due to the improved model speed, but also because of improvements made to the harness itself:

> "As we trained Codex-Spark, it became apparent that model speed was just part of the equation for real-time collaboration—we also needed to reduce latency across the full request-response pipeline. We implemented end-to-end latency improvements in our harness that will benefit all models [...] Through the introduction of a persistent WebSocket connection and targeted optimizations inside of Responses API, we reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. The WebSocket path is enabled for Codex-Spark by default and will become the default for all models soon."

I wonder if all other harnesses (Claude Code, OpenCode, Cursor etc.,) can make similar improvements to reduce latency. I've been vibe coding (or doing agentic engineering) with Claude Code a lot for the last few days and I've had some tasks take as long as 30 minutes.

2001zhaozhao 1 hour ago|

This might actually be hard for open source agents (e.g. Opencode) to replicate, barring a standardized WebSocket LLM API being widely adopted.

kachapopopow 4 hours ago||

Is this the first time one of the big 3 using Cerebras? I've been waiting for this day...

arisAlexis 4 hours ago|

They were afraid for the untested tech but it looks like a leap in speed now

rvz 4 hours ago||

This is nonsense what do you mean? Mistral uses Cerebras for their LLMs as well. [0]

It's certainly not "untested".

[0] https://www.cerebras.ai/blog/mistral-le-chat

lemming 3 hours ago||

Tested at Mistral’s scale is a very different thing to tested at OpenAI’s scale.

rvz 3 hours ago||

The scale of being "tested" clearly convinced Meta (beyond OpenAI's scale) [0] HuggingFace [1], Perplexity [2] and unsuprisingly many others in the AI industry [3] that require more compute than GPUs can deliver.

So labelling it "untested" even at Meta's scale as a customer (which exceeds OpenAI's scale) is quiet nonsensical and frankly an uninformed take.

[0] https://www.cerebras.ai/customer-spotlights/meta

[1] https://www.cerebras.ai/news/hugging-face-partners-with-cere...

[2] https://www.cerebras.ai/press-release/cerebras-powers-perple...

[3] https://www.cerebras.ai/customer-spotlights

mudkipdev 4 hours ago|

Off topic but how is it always this HN user sharing model releases within a couple of minutes of their announcement?

casefields 3 hours ago||

The account isn’t a normal user. They literally only post stuff like this. Their comments are just official links back to said announcements.

sho_hn 4 hours ago|||

Maybe they set up an agent for it.

Squarex 3 hours ago|||

or a simple cron :)

lacoolj 1 hour ago|||

Google Alerts

More comments...