The last six months in LLMs in five minutes

Posted by yakkomajuri 6 hours ago

The last six months in LLMs in five minutes (simonwillison.net)

350 points | 218 comments

hollowturtle 32 minutes ago|

> The coding agents got really good

Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

All I observe is that they got better at tool calls and at answering questions about big codebases, especially if the question has a vague pattern to search for, and they're super useful for that! For generating production code, even with a lot of steering and babysitting?

Absolutely not. Not quite there, not even close in my experience.

But we should stop talking in 1s and 0s, especially with marketing hype trains. There exists a gradient of capabilities that agents have, which really depends on the intricacies of the codebase you're working on. I think everyone has yet to discover how to better apply these tools in their day-to-day work.

But that totally collides with the current narrative, which flattens our work into something that is always the same and can be automated easily in every case. It's not!

That's why the debate is so polarized imo: there isn't a shared experience.

kstenerud 17 minutes ago||

The polarization comes from the very disparate coding experiences and output quality that different people find when using these tools.

For example, I've had the opposite experience of yours, generating very high quality work using Claude (such as https://github.com/kstenerud/yoloai). Just in dealing with all the bugs and idiosyncrasies in the technologies I'm using, the agent has been a godsend in discovering and cataloguing them so that the implementation phase doesn't keep tripping over them: https://github.com/kstenerud/yoloai/blob/main/docs/dev/backe...

And the agents keep getting better all the time. Even in the past month I've noticed a considerable jump in its ability to anticipate issues and correctly infer implications as we build out research, design, architecture and planning docs. By the time it comes to coding, it's mostly a mechanical process that can be passed off to sonnet with a negligible defect rate.

hollowturtle 2 minutes ago||

I don't want to be harsh, but I'd like to read about experiences with novel ideas that solve people's real problems in the real world; your project is just about selling new shovels.

As I commented on another thread

> If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

Razengan 20 minutes ago|||

> Ever since November 2025, the so-called "inflection point", I've been wondering for whom coding agents became "really good".

You can dig up my past comments semi-arguing with simonw that AI just isn't good enough yet, but lately I've been using Codex mostly just to review existing Godot/GDScript code: https://github.com/InvadingOctopus/comedot

and now I'd say that in this day and age one would have to be dumb to not use AI in SOME way :)

It's helped me catch a lot of bugs that would have taken me a long time to even notice on my own. I guess it helps that my project is modular enough where each file can be considered standalone, with just 1-2 dependencies and well-commented already, so the AI can look at each file on its own one at a time. You can see the AGENTS.md I use on that repo.

Most of my productivity in the last 3 or so months has been thanks to AI, though none of the code there is AI generated. I even bought a MacBook Neo just to use as an "AI thin client" while on travel! even though I already had a beefy MacBook Pro M2 Max that I just keep at home/hotel as a desktop. Codex's recent remote control features have made it more useful for the moments when I get a cool idea while out at a cafe or on a walk.

I don't just copy-paste the AI's output, because it's almost always inefficient anyway, but I use its findings to manually clean up my shit. Maybe they're not that good with GDScript yet which is a bit of a jank language anyway.

So my main framework is wholly made by meat, but I do have fun now and then telling Codex to make experimental games using only the library of modular components I have written so far, to test my framework and also the AI's abilities. This kind of work seems like a surprisingly good match for AI sometimes: It just has to put existing blocks together, that already have well-defined interfaces and contracts etc.

I've been on the $20 ChatGPT plan for about a year now, and only started using Codex since like maybe 4 months ago, almost always on the latest model with "Extended Thinking" or "Extra High", because I want my shared code to be as correct as possible because everything else I do depends on it, and I only hit limits like 2 times in the last 3 months.

Claude on the other hand, terrible: https://i.imgur.com/jYawPDY.png

Grok is OK for general stuff, never tried it for coding.

Gemini's UI/UX and lack of privacy and the AI itself is so terrible I tried it just maybe 2 times ever...and it refused to work, on Google's own Flights website and reverse image search! (it told me to do it myself)

hollowturtle 12 minutes ago|||

Thanks for sharing your experience! I totally agree that if you "own your code", as in you're invested in it, coding it and documenting it, these tools can be really valuable for review, bug fixing and maintenance. It pushes you to do better, maybe one piece at a time like you said, with a well-modularized codebase. I think more devs should share experiences like that; we should push back on the marketing and on people's "I don't code anymore since X" narratives.

jaccola 6 minutes ago|||

I set up a hook that reviews every commit, highlights potential bugs (async), and writes a report to a dir.

Then I have a script that summarises those reports, which I usually run before pushing or at the end of the day.

Works quite well for improving both my code and the code the AI wrote.
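
A minimal sketch of what the summarising half of such a setup could look like. Everything specific here is an assumption for illustration, not the commenter's actual script: the reports are taken to be plain `.txt` files in one directory, and the reviewer is taken to prefix findings with a hypothetical `POTENTIAL BUG` marker.

```python
from pathlib import Path


def summarise_reports(report_dir: str, marker: str = "POTENTIAL BUG") -> str:
    """Collect flagged lines from every report file into one short digest.

    Assumes (hypothetically) that each commit review was written as a .txt
    file and that findings are lines containing `marker`.
    """
    digest = []
    for report in sorted(Path(report_dir).glob("*.txt")):
        text = report.read_text(encoding="utf-8")
        hits = [line.strip() for line in text.splitlines() if marker in line]
        if hits:
            digest.append(f"{report.name}: {len(hits)} finding(s)")
            digest.extend(f"  - {hit}" for hit in hits)
    return "\n".join(digest) if digest else "No findings."
```

Running `print(summarise_reports("reports"))` before a push would then print one block per report that contains findings, and skip clean reports entirely.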

treme 15 minutes ago||

You are experiencing a reverse Dunning–Kruger effect.

For someone who had only dabbled in coding before, it used to be that the AI built 80% and you struggled through the remaining 20% when trying to build an app/website.

Now it's like 97%, and you struggle with the last 3%. Yes, it'll look rough around the edges when evaluated by a senior dev, but being able to build MVP level things to completion with ease helps you stay engaged and motivated to continue and learn.

hollowturtle 7 minutes ago||

Please do not cite the Dunning–Kruger effect at random.

Who needs to generate a dumb demo of a 97% done CRUD app? We had code generators for those. Every time I read claims like that and ask for further explanation, I discover it's people who were not productive before who are generating the so-called "MVP level things to completion with ease".

If you're trying to solve a HARD problem people REALLY have, it's a novelty that agents can't help with, otherwise if it gets 97% there MAYBE it's just a signal that your idea isn't that novel!

LLMs can effectively validate your business idea

jaccola 4 minutes ago||

The obvious pushback to all of the slop is: coding was never hard. Learning resources were abundant and free.

If these people had a burning desire to build things prior to LLMs and couldn’t put in the effort to learn to build them (which is also fun!) then why would they ever put the effort into anything to understand it and make it good??

jimbobthemighty 1 hour ago||

I asked Gemini for a video of 'pelican riding a unicycle in hyde park' - I was blown away by the output:

https://gemini.google.com/share/55e250c99693

sfdlkj3jk342a 36 minutes ago||

I'm surprised by Grok as well:

https://grok.com/imagine/post/8d1eab88-737f-4d46-ba92-9b6502...

Interesting that it does better at making the pelican pedal in the video generation than in image generation.

grey-area 1 hour ago|||

That’s really impressive, and slightly worrying for creatives involved in film, animation or modelling.

notachatbot123 1 hour ago|||

Even more worrying are the implications for fake news, propaganda, fraud, deception and mental health.

sevenzero 42 minutes ago|||

This is really my biggest worry when it gets to consumer AI. People already have a hard time informing themselves properly. Now we have technology that just boosts the already existing confirmation bias people have. It's sickening.

dzhiurgis 24 minutes ago|||

Maybe short term yes. But longer term people will finally put their guard up against deception that’s been around for decades.

layer8 10 minutes ago|||

People will still believe what they want to believe.

drdaeman 53 minutes ago|||

It’s the opposite: non-creatives (if such roles even exist in those industries) should be worried. All those models offset technical skills, letting you get from idea to implementation through a different route (which can be easier or harder depending on the idea and the model; good luck tweaking that pelican’s exact pose and movements to match your imagination precisely). Nothing touches creativity, not even in the slightest.

But there’s a lot of panicking, fear-mongering and all sorts of nonsense around this whole subject.

Retric 29 minutes ago|||

My mother has started watching 100% AI generated stories on YouTube. They are good enough to be entertaining even if they include random errors like messing up the main character’s name.

The thing is the creative economy is all about people’s attention and pocketbooks, it doesn’t need to be great just good enough.

colinb 40 minutes ago|||

The truly excellent weavers will be fine?

ionwake 17 minutes ago|||

only SVG counts tho, dont know why

wewewedxfgdf 1 hour ago||

Does this guy have a "publish to front page of HN" button on his blog editor?

xnorswap 30 minutes ago||

HN has a mechanism that causes popular blogs to stay popular.

It's a winner-takes-all karma prize for being first to post the article.

This causes a rush of people to post.

HN has a mechanism by which duplicate submissions count as upvotes toward the first submission.

This is a positive feedback for the desire to be first, which increases duplicate submissions and in turn the karma reward.

This effect means that good blogs stay well upvoted. This isn't altogether a bad thing, but it does mean some blogs require a string of poorly received posts before that effect wears off and people no longer rush to be first.

One way to fix this would be to attribute all karma to user simonw himself ( and do similar where attribution to an HN user is known. )

nickvec 1 hour ago|||

He’s pretty well known in the HN community. https://en.wikipedia.org/wiki/Simon_Willison

koolala 1 hour ago||

thats a cool wiki picture

schnitzelstoat 43 minutes ago|||

I liked the article, so if he has such a button I hope he keeps clicking it.

specproc 50 minutes ago|||

He's one of the main developers behind Django.

victorbjorklund 38 minutes ago|||

He usually has good posts, so people usually upvote.

koolala 59 minutes ago||

It's better than the ex-Google CEO spam I see astroturfed everywhere else.

Insanity 4 hours ago||

I wonder how much the 'inflection point' is a thing vs marketing. I'm sure the models got somewhat better, but even now when I'm trying to 'vibe code' a game with the latest models (combination of Codex w/ gpt5.5 and gpt5.3-codex), they really do struggle.

They definitely get something barebones up and running, but it's far from a fully fledged application.

kvakkefly 3 hours ago||

I remember this very clearly myself. Before opus 4.5, I was doing a lot of hand holding and was coding a lot myself, but I have not written code since that day more or less.

I did write some stuff myself just to learn how the Enigma encryption machine worked. But professionally, I stopped coding in November.

krzyk 2 hours ago|||

It is sad. I like programming; if I couldn't do it and had to write text instead (which I hate, I'm not a writer), it would make for quite a sad world.

bloppe 1 hour ago|||

A pattern I've settled into is to write code but leave a TODO for every narrow thing I want the LLM to do for me. Then just tell the agent to fix the todos. It's often faster and easier to give "instructions" this way
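
As a rough illustration of this workflow, the sketch below gathers the TODO comments from a source tree and turns them into a single instruction list that could be pasted into an agent. The function names, the `# TODO` / `// TODO` comment convention and the prompt wording are all assumptions made for the example, not something the commenter described.

```python
import re
from pathlib import Path

# Matches "# TODO: ..." or "// TODO ..." and captures the instruction text.
TODO_PATTERN = re.compile(r"(?:#|//)\s*TODO\b[:\s]*(.*)")


def collect_todos(root, suffixes=(".py",)):
    """Return (path, line number, text) for every TODO comment under root."""
    todos = []
    for path in sorted(Path(root).rglob("*")):
        if not (path.is_file() and path.suffix in suffixes):
            continue
        lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            match = TODO_PATTERN.search(line)
            if match:
                todos.append((str(path), lineno, match.group(1).strip()))
    return todos


def build_prompt(todos):
    """Format the collected TODOs as one narrow instruction list (hypothetical wording)."""
    header = "Resolve each TODO below and change nothing else:\n"
    return header + "\n".join(
        f"- {path}:{lineno}: {text}" for path, lineno, text in todos
    )
```

The appeal of the pattern is that each TODO already sits at the exact spot in the code where the change belongs, so the instruction needs very little extra context.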

yen223 39 minutes ago||||

Nothing stopping you from doing that in a post-LLM world

satvikpendem 2 hours ago|||

Of course you can always program by hand, no one is stopping you.

junga 1 hour ago|||

Not sure this is true for all of us. I bet many/some (unsure here) are told to use ai for their daily programming tasks.

LtWorf 1 hour ago|||

Plenty of companies are forcing the use of AI on people.

viccis 3 hours ago||||

How do you justify your salary given that you're just using a tool that any of us could use for $20 an hour in your role?

peepee1982 54 minutes ago|||

I don't feel the need to justify my salary, since I'm simply lucky in that regard. But I'm pretty sure you couldn't do my job just because you had access to a coding agent. Most of my time at the office is spent discussing high-level architecture and strategy, ideas, customer requests, backward compatibility, safety, security, quality assurance, etc.

Writing the actual code is a significant part of that, but the codebase is so complex that even Opus 4.7 and GPT-5.5 struggle with it without being fed a *lot* of context and constraints. And even then, they need a *lot* of steering due to making bad decisions that only someone with an intimate knowledge of the theory behind our software is able to catch.

I can only assume that people who think coding agents can completely replace an actual developer mostly deal with trivial software regarding both scope and the type of customers they serve (individuals instead of big companies in industry).

rafaelmn 3 hours ago||||

How do you justify your salary given that you're just using OSS compiler/editor any of us could use for free in your role ?

AI just changed how I edit code - I still see coworkers (senior developers) failing with Claude/Codex and get stuck when there are trivial solutions if you understand the full problem space. Right now AI is just a productivity tool.

manmal 1 hour ago||

Can you share how you use it to edit code? I‘ve seen a couple approaches, curious what you are doing:

1. Spec -> plan -> code (all agent driven, maybe with grill-me or ultraplan)

2. Handwritten spec -> agent driven plan -> agent driven code

3. Agent driven spec -> vibed code -> Fix by handholding until ok-ish

4. Vibed throwaway prototypes -> extract useful patterns -> rewrite with handholding

5. Generate file structure with handholding -> manual TODO comments -> Fill in blanks with handholding

rafaelmn 36 minutes ago||

Usually I describe the problem, explore a bit with LLM iteratively. Then I switch to creating a plan when I have enough insight (and the LLM has it in context/same session as exploration), specifying all the things I'm trying to accomplish.

Then I just iterate with LLM - I let it start writing stuff in YOLO mode and check on what it's doing in the code steering it in the direction I want.

Usually the code LLM generates will work but is kind of garbage - but I can easily steer it towards better implementations.

Sometimes using an LLM is theoretically slower than hand-rolling - if I just sat down and focused I could outperform the iteration and the waiting, especially considering how stupid agents are at running expensive builds/test suites (with a bunch of explicit instructions in skills/claude/agents.md). But the practical improvement of going with an LLM is that you have a bunch of thinking traces saved as a part of your iteration process - it's really easy to get back into flow. This is a huge productivity win for me given how many interruptions I have in my work day. Like so many people like to point out - writing code ends up being less and less of your time as you level up in your career.

skor 14 minutes ago||||

This is _the_ question we must all be able to answer, so here goes my attempt: we have always had access to the same tools. Before Stack Overflow it was forums and books/manuals, so it's always been about “getting there, showing up, figuring it out”. Your hypothetical boss has other things to do than kick an LLM around at that price.

musebox35 3 hours ago||||

Please see Ben Evans’ podcast for a good take on this. Coding is just one of the tasks you do in your job; it is not the job, or at least it probably is not. You do not get paid to code, you get paid to make a set of decisions that create value for the company. If those are automated, then yes, sadly your salary is not justified.

Timwi 2 hours ago|||

> Coding is just one of the task[s] you do in your job

But it's by far the most fun part and the only reason to take such a job...

peepee1982 49 minutes ago|||

I agree, but the reality is that most people work to make a living, not to have fun. If you enjoy your job because you mostly get to write code in a tight feedback loop instead of doing the "hard" work of planning, writing and reviewing specs, balancing customer requirements, and the lot, you have a very privileged life. And those jobs are probably going to get fewer now.

It's kind of sad. But on the other hand, I am glad I don't have to write every little line of code myself *on top* of having to do all the other stuff.

OakNinja 1 hour ago|||

To me, LLM's free up time for me so that I can spend time on the fun parts of coding. Less boilerplate, more focus on the interesting problems. This is no different from using high level languages. The problem domain is less around memory management and garbage collection and closer to the problem you're actually trying to solve.

dawnerd 1 hour ago|||

But we’ve had tools to automate out the boilerplate for years. We don’t need ai for that. It’s seriously like we all forgot we could run one command and scaffold a project. AI isn’t even that great at it. Last I tried a month ago it used a really out of date version of nextjs and picked all sorts of random deps that weren’t in the plan.

I could have just used the next project scaffold tool and been on my way before the ai even started returning output.

stepbeek 1 hour ago|||

I agree with this. I feel like there’s a false dichotomy right now in a lot of these discussions where one can only vibe code or only code by hand. It is possible to do both…

BOOSTERHIDROGEN 2 hours ago|||

Which episode ?

aspenmartin 3 hours ago||||

Someone competent using them is a requirement today, and for a while that will make the marginal utility of skilled workers greater than that of unskilled ones. The justification is that they are much more productive than they were before.

MikeNotThePope 2 hours ago||||

You can build things quickly with AI, but you can’t delegate your responsibilities to AI. Once the AI starts struggling, you’ll need to take over and figure it out.

altmanaltman 2 hours ago||||

They're using a tool that anyone can use for $20 an hour, sure. But that's not what they're "just" doing. This is what is so insane about non-technical people talking about code - writing the actual syntax is not really the hard part.

What you're saying is like "how do you justify your salary as a NASA engineer when anyone can use Simulink and generate the code?"

It is extremely ignorant.

pastel8739 1 hour ago||||

How do you justify your salary given that you sit in a chair all day, likely making the world worse, and make 5x as much as someone saving lives, building houses, or teaching kids how to read?

IshKebab 1 hour ago||

Supply and demand. Not many people are good at programming and it's highly in demand.

The question is how many people will be good at vibe coding? If the answer is "lots" then we can definitely expect programming salaries to return to "normal" levels. His question is very relevant; you can't dismiss it as easily as that.

apsurd 1 hour ago||

It can be easily dismissed because "anyone can use the tool that costs $20" makes no meaningful sense.

This was always true; in fact, $20 is more than the $0 that Notepad++ costs.

It's a flippant statement. Go down the line of any tool; its cost has basically nothing to do with the skill needed to operate it. See basically everything. There's levels.

IshKebab 1 hour ago||

I have no idea what you're trying to say. If anyone really can vibe code then programming salaries are pretty much guaranteed to come down. The critical question is whether it really is true that anyone can do it, or if it still requires rare skill.

apsurd 55 minutes ago||

are you a programmer? it 100% requires skill. AI or not.

i'm trying to say there's levels to this. if you don't agree then you don't agree. but i can buy commodity tools for any skill and that doesn't make me professional grade at that skill.

piva00 1 hour ago||||

I don't think you understand how programming as a job works, writing code is the final output of the process but it's not the job in itself.

komali2 1 hour ago||||

There is no good justification for anyone's salary really, except perhaps doctors and underwater welders.

yieldcrv 3 hours ago||||

no engineers on staff and stakeholders think the company is incompetent

Coinbase is paying the price for that for every UX glitch, after the CEO was gleeful about HR personnel shipping production code

wilg 2 hours ago||||

They don't need to justify it!

bsder 3 hours ago|||

Because the tool will happily give you a "solution" that kinda works for a few inputs. It will happily correct itself when you give it more incorrect tests.

It will almost never converge on the general solution that will pass tests you haven't given it yet.

This is why AI is sooo good at Javascript and related slop. A solution that "kinda works" is good enough 9 times out of 10 and if some tests fail well ... YOLO and the web page will probably render anyway.

Contrast that to using Scheme or Lisp where AI will have trouble simply keeping the parentheses balanced.

FeepingCreature 2 hours ago||

To be fair, take away a human's paren highlighting and see how well they do.

dkersten 58 minutes ago|||

While I certainly like parentheses highlighting and rainbow parentheses, I've programmed Clojure without syntax highlighting and while it’s not as nice as it would be with, it’s fine.

I’ve also written C++ and Java in Notepad long ago. Not ideal, but hardly a problem.

hansmayer 2 hours ago||||

Not everyone is a "coder" you know, some of us are engineers.

sampullman 1 hour ago|||

You adjust pretty quickly. Taking away compiler error messages would be fun though.

bluegatty 4 hours ago|||

Paradox - you can get multiple inflection points even as systems start to have diminishing marginal returns in core capability. I think this is due to 'threshold crossing', where something 'becomes good enough for a specific purpose' - it just unlocks capabilities.

'Nail Guns' used to be heavy, required heavy power cords, they were extremely expensive. When they got lighter, cheaper, battery pack ... at some point, they blend seamlessly into the roofers process, and multiply dramatically the work that can be done. Marginal improvements beyond that may not yield the same 'unlocks' because the threshold has been crossed.

smackeyacky 1 hour ago|||

This is a great analogy. Jan/Feb this year was when the models crossed from useful to essential.

asdff 1 hour ago||||

Nitpick but commercial roofers prefer pneumatic over battery.

magicalhippo 2 hours ago|||

I've "vibed" some non-trivial stuff lately using a combination of Codex with 5.5 and Claude Code with Opus 4.7.

Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases. I go back and forth between them on this document until we're all happy.

For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered. This becomes input to next phase.

I do check the documents, and what they're doing. I also check the tests, some more thorough. And some spot checks on the code to see if I like the structure.

I have mainly used Claude for coding and Codex for design and code review after phases. I ask both to check test coverage after phases.

Managed to implement some tools and libraries without writing a single line of code this way, which have been very beneficial to us.

Since it's so async I can work on other stuff while they plod along.

I don't think it's universal, though. But for stuff that can be tested easily, where you have a firm grasp of what you want to achieve but not necessarily exactly how, I've been impressed.

nopurpose 1 hour ago|||

Do you use anything to orchestrate multiple agents pitted against each other (coder, reviewer, tester, etc)?

manmal 1 hour ago||||

That’s not vibing, but waterfall development.

whatshisface 1 hour ago||

Waterfall was famous for wasting developer time and extending delivery dates in exchange for simplifying management. If Claude time is comparatively inexpensive, but human oversight remains necessary, we will switch back to waterfall because the relative importance of the two resources will invert.

WesolyKubeczek 1 hour ago||||

> Key has been to spend a fair amount of time on initial overall design document, which is split into tangible and limited phases.

> For each phase an implementation plan is made. At the end, a summary document of what was delivered and what was discovered.

> I do check the documents, and what they're doing. I also check the tests, some more thorough.

Sounds like programming, but with extra steps.

dawnerd 1 hour ago|||

Also the least fun part of development. Maybe I’m the weird one but I like to just jump right in, planning every last detail before writing code is boring.

mrcsharp 1 minute ago||

> planning every last detail before writing code is boring

Not only that but you can't really plan everything. It is impossible. Without LLMs, with every line of code you are making a decision or discovering something new that must be dealt with or realizing how the current thing might impact something else and so on.

There is no way for a programmer to consider all of these little things ahead of time and if an attempt is made, it will take as long as actually writing that code.

nothinkjustai 1 hour ago|||

None of it is non-trivial tho. You might think so, but it’s not.

ryanjshaw 1 hour ago|||

I find it gets you past the starting line but when you dig into the code it’s a mess of duplicated code, muddled responsibilities, poor architecture, 10k line files that eat your tokens, etc.

I’m building something using LLMs to scrape websites/socials for unstructured event data from combined text/images and the only way I’ve managed to get 100% consistent results for a reasonable cost is to break the task down into very small pieces that reduce the scope of mistakes significantly.

At present, for reasonably complex tasks, Codex/Claude will happily code you into an expensive corner.

ben_w 47 minutes ago||

Indeed. To add to this, the obvious solution (ask the AI to break the tasks down to whatever METR says it would be capable of 80% of the time) is of limited utility, as the AIs are only so-so at estimating task complexity.

(Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs").
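
To make that distinction concrete, here is a small sketch contrasting the two styles of check. The `slugify` function, its deliberately buggy variant and both check functions are made up for illustration. The source-matching check passes for both versions, while the behavioural check catches the bug.

```python
import re

GOOD_SOURCE = '''
def slugify(title):
    return "-".join(title.lower().split())
'''

# Buggy on purpose: the result of lower() is discarded, so case is never changed.
BUGGY_SOURCE = '''
def slugify(title):
    title.lower()
    return "-".join(title.split())
'''


def source_pattern_check(source):
    """'Regex the source code': only confirms that certain text appears."""
    return re.search(r"\.lower\(\)", source) is not None and '"-".join' in source


def behavioural_check(source):
    """'Execute functions and check outputs': runs the code and inspects the result."""
    namespace = {}
    exec(source, namespace)  # fine for a toy example; never exec untrusted code
    return namespace["slugify"]("Hello Big World") == "hello-big-world"
```

A generated test suite made mostly of checks like `source_pattern_check` can report green while the behaviour is wrong, which is why it is worth opening the test files and not only reading the pass count.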

minimaxir 4 hours ago|||

Opus 4.5 in November 2025 was legitimately, unironically an inflection point and is the sole reason for the current hysteria.

GPT 5.5 is a significant improvement over GPT 5.4 but I wouldn't call it an inflection.

baq 2 hours ago||

5.2 and the first codex model were step function changes in capability

orrito 1 hour ago|||

While some people got it to work better, for me vibe coding games still hasn't reached the level of regular sites/web apps. Physics, creativity, assets and UI/UX still need a lot of handholding with the models. Games that are more interface-based, like point-and-click or something like Reigns, are easier though.

adgjlsfhk1 4 hours ago|||

It's very real. Just in the past 2 months or so, IMO, there's been a pretty big improvement in Claude for local dev (although I think a lot of that is less model strength and more harness capability). 1M context is a huge difference: ~30 min vs 2.5 hr between compactions significantly increases the scope of what I can get the AI to do before it goes stupid. The other big difference I've noticed is a better balance between actually doing the work and pushing back on bad ideas. I want the AI to tell me if it thinks what I'm asking for is wrong or a bad idea, but if I confirm, I want it to do it anyway. A couple of months ago, Claude was a lot more likely to say "This is too much work, I'm not going to do all of it", tell me the idea was genius (and then pretend to do it), or do something equally useless.

DeathArrow 3 hours ago||

>1m context is a huge difference (~30 min vs 2.5hr between compact significantly increases the scope of what I get the AI to do before it goes stupid)

I think the smart zone stays within the first 100k tokens, no matter if the context window is 240k or 1 million.

I divide the work to fit within that 100k and use subagents for the tasks.
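
One possible way to do that kind of splitting is sketched below, under loud assumptions: token counts are approximated as roughly 4 characters per token (a common rule of thumb, not a real tokenizer), and work items are simply packed greedily in order. The function names are hypothetical, and how each batch is actually handed to a subagent is left out.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; assumes ~4 characters per token."""
    return max(1, len(text) // chars_per_token)


def pack_into_batches(items, budget=100_000):
    """Greedily group (name, text) items so each batch stays within budget.

    An item that is larger than the budget on its own still gets a batch
    to itself; splitting such an item further is out of scope here.
    """
    batches, current, used = [], [], 0
    for name, text in items:
        cost = estimate_tokens(text)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each returned batch is a list of item names intended for one fresh subagent context, so no single context has to hold more than the chosen budget.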

danielbln 2 hours ago||

In my experience it's more like 400-500k tokens.

halflife 4 hours ago|||

I feel the change. It went from an autocomplete tool, to an agent running 5 tasks in parallel while I just supervise. The improvement is enormous.

xbmcuser 4 hours ago|||

It's real for me as a non-coder. Previously, uploading a Python script and asking it to add this or that function used to break it; now it usually just works, at least with the Claude and ChatGPT models. Google Gemini still breaks stuff, but rumors are that their new Flash model, to be announced soon, is very good. I usually work with data in CSV files and generate spreadsheets, PDFs, etc., and the results for that have improved dramatically.

LAC-Tech 4 minutes ago|||

"flash" or "fast" AI models are worse than useless at coding for me. they make my codebase much worse. It's a maintenance burden.

Gemini Pro on the other hand can be quite a pleasant experience.

Scoundreller 2 hours ago|||

That’s me. Built a scraper to dump a list of images to a CSV for further OCR and OpenCV processing. Now I have a convenient list of hits once I run the batch, which used to be a loooot of manual sifting.

Once I work out the kinks, I’ll be able to further automate it.

Would have taken 10-100x as long for me to build it without AI and the AI version is probably better.

But yeah, I have enough knowledge to know what prompts are needed, to figure out those “oh, I think it’s running slow or failing because of xyz” moments, and to prompt further to improve it based on what I think it should do instead.

And I know where to make slight changes without burning my allotments.

iLoveOncall 58 minutes ago|||

It is all marketing. The easiest way to tell is that a year ago the same people said the inflection point was X or Y model.

When people claim LLMs just don't work for them, the first question is whether they're using the latest model or not, and if not, dismissing the poster.

The thing is that that same question was being asked a year ago, and even a year before that, but with the models that lead to a dismissal today.

Just make the experiment yourself, wait 6 months, say LLMs just aren't working for the software engineering that you do, and people will dismiss you if you say that you use Opus 4.5 and not the latest model Claude MegaMind 8.8 pro max gigathinking. Despite this model being touted as the inflection point in this article.

harshitaneja 18 minutes ago||

I think it's because both sides are talking about different things. If you go in expecting it to be good enough to make developers obsolete today (a reasonable impression to get from the way a lot of people hype it), you would be disappointed, and after the first couple of tries every few months you would probably not try it much with the next generations. Reasonable if it's considered a dichotomy.

But a lot of people excited about new generations (including me, now) are not seeing it as a dichotomy but rather as a spectrum where models are getting better, and indeed once a year, or even every 6 months at times, there comes a sudden jump which feels like an inflection point from what came before. Practically, it's a tool like any other: you evaluate it based on whether it's worth the effort and cost for the benefit you get from it, and if it is and it has a good DX, you use it. If the calculation doesn't work for you, it doesn't.

For me, it has gone from a novelty, to "good for some kind of quick manual search", to "I guess it can debug some kinds of errors at times in very specific conditions", to "hey, I think I am getting a bit addicted to the autocomplete in the IDE, even if I don't use these models for anything intelligent; it's becoming indispensable now, but only this part", to "it's good for areas I lack expertise in", to "agentic sucks, I will stick with discussing algorithms and architecture with it on greenfield projects", to "holy shit, it can do agentic decently well now, but I am skeptical of giving it access in more than limited cases", to "now I am getting close to letting it run free on my device in the not so distant future, I guess".

Some of these were big jumps, and at each point I was skeptical of further growth. Every time I thought the growth would now slow down, from the days of 2k context windows to millions now. From basic chat completion to working on complex adaptive systems, game theoretic modelling, heuristics and constraint modelling and other things I throw at it. I am still needed in the loop; it can be so smart at times and then will do something so stupid, but the frequency of stupidity is rapidly decreasing. I am still needed; I don't think it could have accomplished alone all that it has done for me. But I do at times remain awake at night reflecting on my self-worth for the potential day when I don't add that value. When I have a harder time keeping up.

Also, had someone told me even in 2019 that in 2026 we could have NLP models do what they do today, I would have dismissed it all as sci-fi, and here I am waking up in awe of the world we live in and how quickly we adapt.

DeathArrow 3 hours ago||

Pure vibe coding won't work. You need to define an excellent architecture, have great specs, a solid plan, divide the plan into small phases that fit well in a context window, use TDD and automated code reviews for implementing each phase, do QA and some code review.

At any point you need to have agents review, verify and test the other agents' output and iterate until the output is perfect.

And also, have good e2e tests.

IMO, if you don't spend at least a few tens of millions of tokens per day, you aren't doing it properly.

fluder_tw 1 hour ago||

Sounds very self-confident to claim such a thing. Something like "If you don't do it how I'm doing it, then you are doing it wrong".

LZ_Khan 3 hours ago||

I'm curious how the 6 months have looked from a non-programmer's perspective. What kind of co-working tools and similar optimizations have people from other fields experienced?

generationP 6 minutes ago||

In pure maths:

- pre GPT-5.4: very limited use; some smart people got some mileage out of the models, but it always required serious work and a very suitable problem. Of course the models could solve homework problems, but that felt more like a downside to us who teach.

- since GPT-5.4 (Mar 2026): the "wow" release; suddenly answering MathOverflow-level problems that have previously been stumping experts. Still prone to hallucinations, but smart enough to use the built-in Python skill to verify its claims on small examples when possible. Probably a lot better at formula-heavy math than at the abstract "philosophical" kind.

- GPT-5.5: gave me a fascinating, significantly nontrivial and highly instructive "proof from the book" on an MO-hard problem that I'm in the process of writing up. Might have been luck and good prompting, though. Didn't really feel like a qualitative leap from 5.4, but I take quantitative any time. Still requires suitable problems, but it's much harder to rule out suitability from the get-go.

Claude and Gemini have been also-rans the whole time and still are. I use Claude for secretary-like tasks; occasionally it finds an easy proof too, but usually because I've missed something obvious.

Oh, and GPT and to a lesser extent Claude, are great at hunting errors in maths. Probably 90% of my prompts so far have been for proofreading my writings.

opto 2 hours ago|||

I am an instructor who helps deliver an apprenticeship. My new boss has been in our industry for about 20 years and is one of the most respected people in our company. They've just joined us to teach and are off doing a two week course. On the first day she was told to let AI write all of her lesson plans, and then feed the lesson plans to AI to make her slides...

Hopefully she rejects all this out of hand, but if she doesn't it'll mean that none of our trainees get the benefit of her experience, who she is as a person, and what she has to pass onto them.

We have six-monthly reviews as instructors where we are told the same thing. "How could you use AI for your teaching?"

They don't even feel the need to justify why this would be desirable, or is needed at all. It's just pure bandwagoning. Unbelievably, most of my coworkers are extremely positive about AI, although none of them have told me they use it for anything besides preparing their lessons for them: they just use it instead of having to think, or spend time preparing... the only important thing they do at work.

It makes no sense to me.

tkgally 1 hour ago|||

I’m teaching a class at a university in Japan (on AI-related issues, as it happens). I’ve been teaching for more than 40 years, but at 106 registered students this is by far the largest class I have ever taught. AI tools are very helpful for class management, such as keeping track of attendance and homework submissions.

I have to consciously avoid using AI for more cognitive tasks, though. It would be very tempting to have Claude, ChatGPT, or Gemini summarize, classify, and grade the students’ assignments, write individual feedback, prepare my lesson plans, etc. However, I know that my engagement with the material and with the students would suffer. I also want to show the students that they are learning together with me and with each other, not with bots.

I am semiretired and have a light teaching load that gives me plenty of time to prepare for class. I can see that full-time teachers might find it hard to resist the lure of offloading their thinking to AI.

bradley13 58 minutes ago|||

I've been a teacher (most of the time a college professor) for...a long time. Nowadays, when preparing a new course, I definitely work with AI: "Here's what I want, and who my audience is - give me a course outline".

That gives me a starting point. Of course, I modify it. Maybe I bounce back and forth to the AI for further refinements and suggestions, but ultimately I have to be happy with the result.

When prepping the individual lessons, the biggest time saver is coming up with examples to illustrate particular points. I could do this alone, but sometimes that involves staring at a blank screen for a while. It is faster to ask the AI for suggestions, pick the one I like, and refine it further myself.

AI is a tool. Use it appropriately.

vanuatu 2 hours ago|||

I work at a company that deploys AI to enterprises

The average office worker is amazed at Copilot (not in the IDE - but the app bundled with Windows), and they mostly copy paste material into their enterprise provided ChatGPT / Gemini, and get tips from Facebook / Instagram on their top 5 best prompts for work productivity

Showing them agents that automate work at scale is a very magical experience

dawnerd 55 minutes ago||

And then everyone that has to deal with their copy pasted output is too nice to say how bad it is and how much work it just offloads to the next person that’ll probably get frustrated and have an agent handle it.

TrackerFF 15 minutes ago|||

Purely anecdotal, but in my team of 20 data analysts, we've seen a bunch of them become quite productive in producing tools and apps. These are analysts with mostly domain knowledge, and not so much programming knowledge - meaning that they knew the basics to write scripts, and wrangle data programmatically, but not enough to actually engage in software engineering.

Some of these are now contributors.

I also have a friend (beware, N=1 study) with zero prior programming knowledge that has released his first app.

conception 2 hours ago|||

Claude in Office was a tipping point for nontechnical folks around me. Everyone’s slide decks are immaculate now. Finance isn’t needing nearly as much BI help. It’s pretty impressive.

grey-area 1 hour ago|||

I find it really troubling finance are relying on LLMs (word generators!) for financial analysis - I mean I guess it means there will never be any annoying gaps in the data.

aidos 37 minutes ago||

Depends on how it’s done.

I use it a lot now for knocking up Grafana charts etc. It’s not so much that the LLM is feeding the numbers through. You can still use real tools to analyse and summarise the numbers; it’s just much quicker driving them.

As ever with data analysis, two things will continue to be true. Real insights come from spotting something that looks off and digging into it deeper. Secondly, it’s really easy to connect data in a misleading way.

I’ve had a Claude analysis handed to me this morning including a summary list of actions we’re going to take next which falls into this very trap.

The insights you’ll get from your data will only be as deep as the curiosity of the person at the helm.

Gigachad 1 hour ago||||

Can I get Claude to view the slide decks for me so I don't waste my time?

RobinL 2 hours ago|||

Interesting. I don't have to use PowerPoint much, but I hate it when I do. I don't want the llm to write the words but I do want it to make things look nice. So does this work well now?

angled 2 hours ago|||

My pipeline for this is vscode + prompts + markdown templates + GitHub copilot -> markdown docs -> pandoc to produce .docx -> copilot in word for “nice” formatting -> copilot in ppt for nice decks. LLMs all the way down.

I find it’s easier to version control and diff the .md artefacts, those remain my authoritative source.
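
A minimal sketch of just the "markdown docs -> pandoc -> .docx" step, assuming pandoc is installed and on the PATH. The helper names (`pandoc_command`, `convert`) and the file names are made up for illustration; the only pandoc options used are `-o` and `--reference-doc`, and the Copilot steps are not covered here.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(markdown_path, reference_doc=None):
    """Build the pandoc argument list for converting one .md file to .docx."""
    source = Path(markdown_path)
    target = source.with_suffix(".docx")
    command = ["pandoc", str(source), "-o", str(target)]
    if reference_doc is not None:
        # --reference-doc makes pandoc take its styles from an existing .docx.
        command.append(f"--reference-doc={reference_doc}")
    return command


def convert(markdown_path, reference_doc=None):
    """Run pandoc if it is available and return the path of the .docx output."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    command = pandoc_command(markdown_path, reference_doc)
    subprocess.run(command, check=True)
    return Path(command[3])


# Only prints the command; call convert("report.md") to actually run pandoc.
print(pandoc_command("report.md", reference_doc="template.docx"))
```

Keeping the command construction separate from the run step means the .md files stay the source of truth and the .docx can be regenerated at any time.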

asdff 1 hour ago||

Wow. Seems like a headache compared to how I make slides the old fashioned way: copy and paste my figures into blank powerpoint.

jillesvangurp 1 hour ago||||

With a little bit of work, it works very well. You can generate powerpoint directly with Codex or Claude Cowork. There is also Canva support for these tools and it has its own AI integration. Another useful tool in this space is the Gemini integration in Google slides.

If you are a bit technical, reveal.js is actually really nice for this. I one-shotted a PDF export for it that uses a headless browser. I've used that a few times now.

What works well for me is to take an existing presentation and then some raw input and generate a new presentation in the same style as the old one from the raw input. After that, I can go in and tweak individual slides.

Another thing I did recently was take somebody's existing pitch deck and fix it with a one line prompt: "this deck is a bit meh, pimp it!" that worked unreasonably well. I like using shitty prompts like that. Codex often manages to do the right thing if you don't overthink your prompts.

It was the classic deck of somebody who used way too much text and only bullets. It did a great job on that, presenting the content in a simpler and better-structured way: pulling out key facts and highlighting those, simplifying text, etc. Doing that manually would have taken hours.

ta8903 1 hour ago|||

If you don't want an LLM to write the words, surely you also want to decide on the data and graphs to show by yourself? Isn't that 90% of a presentation? The "looking nice" part doesn't matter as much, it could be black text on a white background and it would be fine.

The important part is the presentation matching your presenting cadence, which is something LLM generated presentations never get right. I don't have a problem with people generating presentations, but most of the time they just end up reading whatever is on the screen when presenting.

angled 2 hours ago|||

In business: using coworking tools to review and propose filing of emails; manage my files and folders; on a daily basis scour the intranet for interesting and relevant content.

Personal: my wife tutors in her native language to non-native primary and high school kids. They are all using these tools now to generate fresh content for practice based on school lesson plans. The kids are improving much more quickly now than they were just a few months ago.

beng-nl 2 hours ago||

As someone who works somewhere where the intranet is a bit of a jungle: what tool do you use to scour the intranet?

Thanks!

angled 2 hours ago||

Copilot Cowork in the M365 ecosystem. It inherits all the permissions from my account, has access to exchange to send me emails, and OneDrive to save each day’s summary for posterity and future refinement.

beng-nl 42 minutes ago||

Thank you, I will try to find it. Thanks!

Quothling 1 hour ago|||

I think Claude Cowork through the Microsoft thing, which was Copilot but is now named M365 (or something?), is likely creating every PowerPoint presentation within our organisation at this point.

We have whatever AI is in Teams transcribe every meeting, and it's scarily good at it. It's also extremely good at summarising or finding things from previous meetings when tasked. One disadvantage of this is that I can see how stupid I sound in writing. I'll go "yeah, hmm, yeah, that's, yeah", but it really is pretty good.

I assume we're going to see a massive increase in AI with this Cowork inside the Microsoft client. We actually have a better tool available through LibreChat, where you can create and configure your own agents with the same filesystem access to your OneDrive, and a lot more tools and models than just Claude. Almost nobody has been capable of figuring out how to use it though, so they've been using the regular Office 365 Copilot, and it sucks so bad that a lot of people stopped believing in AI.

It's ironic that Microsoft fumbling the ball on AI, but being very good at enterprise customers (especially non-IT), means that they'll likely be the company which is going to sell us AI tools that people will actually use. I have no idea why it's so hard for people to pick up the LibreChat tool we're given access to through our equity fund. It's quite literally a copy of ChatGPT where you can point-and-click configure an agent, but we're seeing that even employees who use a lot of ChatGPT privately don't use this tool professionally. Meanwhile everyone has been capable of using the Microsoft thing (which I personally think is less user friendly, since you need to add your configuration files to every prompt).

piokoch 1 hour ago||

"I have no idea why it's so hard for people to pick up the Librechat tool we're given access to through our equity fund"

That's because M365 is integrated with the whole Office/Exchange environment, especially in terms of security policies, etc. MS also guarantees that the data are private; this is very important for many companies, both from the IP protection perspective and because of the liability of exposing user/customer data (think of GDPR regulations in Europe).

I don't know who is behind LibreChat, probably some good and friendly folks, but when it comes to privacy/security Microsoft has much more to lose, and if shit happens it is easier to sue them than some random VC-financed company from the USA.

Antibabelic 2 hours ago|||

My day job is not in the tech industry. I am an editor. Literally nothing has changed for me in the last four years.

alexwwang 2 hours ago|||

As a former data scientist, I started to use a code agent 3 months ago. Before that, I used chat completion on the web. Now I do nearly everything that outputs documents with a code agent.

BOOSTERHIDROGEN 2 hours ago||

Can you give a sanitized example or a hypothetical scenario of what you mean by “output documents with code agents”? Thanks.

schnitzelstoat 1 hour ago||

I’m not him, but I’ve started using them to do the analysis (SQL, Python etc.) and then output the report as Quarto HTML which can be hosted on GitHub Pages. It works well for this analysis style work.

Once I was going to send some figures to leadership so I checked the queries myself and not only had it done it correctly, but it had also included a lot of sanity checks with other places in the database which as a human I doubt I’d have had the time or inclination to do.

Even for modelling work it can be good to check your ETL queries, or write one itself and then check it etc.
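
A rough sketch of the kind of sanity check described above, using Python's built-in sqlite3 with an in-memory database. The tables (`orders`, `payments`), the figures, and the function name are all invented for illustration; the idea is only that the headline number is recomputed from a second, independent place in the database before it goes into the report.

```python
import sqlite3


def revenue_with_sanity_check(conn, tolerance=0.01):
    """Return the headline figure, the cross-check figure, and whether they agree."""
    reported = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders"
    ).fetchone()[0]
    cross_check = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM payments"
    ).fetchone()[0]
    agrees = abs(reported - cross_check) <= tolerance
    return reported, cross_check, agrees


# Hypothetical data: one order has no matching payment, so the check should fail.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE payments (order_id INTEGER, amount REAL);
    INSERT INTO orders (amount) VALUES (100.0), (250.5), (49.5);
    INSERT INTO payments (order_id, amount) VALUES (1, 100.0), (2, 250.5);
""")

reported, cross_check, agrees = revenue_with_sanity_check(conn)
print(f"orders={reported} payments={cross_check} agree={agrees}")
```

A mismatch like this one does not say which table is wrong, only that the figure needs a human look before it is sent on.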

zarzavat 5 hours ago||

Somewhere right now some human artist is being tasked with drawing illustrations of pelicans riding bicycles to be used as training data at a big AI lab.

energy123 3 hours ago||

The quality of the Gemini pelican was such a step change in one iteration, while the other benchmarks remained quite flat, that I think you are right. Although whether they targeted Pelicans in particular or just svg, I can't say.

minimaxir 4 hours ago||

Every modern image-generation model can generate a pelican on a bicycle trivially. The point of the test is to generate SVG text that represents an image, which is more complicated.

Yes, there are ways to convert raster images to SVG for use in training data but it's not a good use of anyone's time.

Antibabelic 1 hour ago|||

I don't understand this response. Human artists can and do make SVGs.

jofzar 4 hours ago||||

I wouldn't wish creating a svg pelican on a bicycle on my worst enemy

Mashimo 3 hours ago|||

> Every modern image-generation model can generate a pelican on a bicycle trivially.

Mistral seems to be the exception. Their new model from a few weeks ago is worse than self-hosted Gemma.

shepherdjerred 4 hours ago||

> and there’s zero chance any AI lab would train a model for such a ridiculous task.

I'm not sure that's true anymore considering how popular Simon's blog is

_puk 4 hours ago||

> So maybe the AI labs have been paying attention after all!

> I think this mainly demonstrates that the pelican on the bicycle has firmly exceeded its limits as a useful benchmark.

As acknowledged in the article.

kzrdude 3 hours ago||

Gemini 3.1 basically takes it home on that benchmark, anyway, it's done.

nickvec 4 hours ago|||

Simon mentions further along in his article that, given Jeff Dean’s post referencing the pelican-riding-a-bike task (and how good current models are at doing it), it’s no longer a great benchmark to use. Enter the opossum riding an e-scooter!

aaronbrethorst 2 hours ago||

Banana man on the Segway

simonw 3 hours ago||

That bit probably works better in the talk, it was a setup for a joke later on.

pineapple_opus 1 hour ago||

All I see is mention of how various models generate an image of "pelican riding bicycle(s)"

emil-lp 1 hour ago||

Yes, the "pelican riding a bicycle" is the ultimate test of not understanding how LLMs work.

Well, a combination of that and believing that replication of test data is a good measure of progress.

ClikeX 1 hour ago||

We all know the true test of AI is Will Smith eating spaghetti.

tptacek 3 hours ago||

If you're a vulnerability researcher or a security person generally, there's a big inflection point from Spring of this year.

gnyman 2 hours ago||

Whether it turns out to be a good change or not remains to be seen.

The half-full view is that the models are so good at finding vulns that if you plug them into your build-pipeline then the amount of new vulns introduced will go down towards zero.

The half-empty view is that we're now producing more junior-level code with less review, so everything will have more vulns; it's also cheaper and easier to find them, so prepare for chaos.

Short term there is sure to be chaos either way as the models are clearly good enough to find all the old bugs, and not everyone has the resources or will to try to stay ahead of the curve like Mozilla is trying to do with their Mythos access https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...

muvlon 1 hour ago||

There's a major caveat to the half-full view: You'll only stop adding new vulns that your model can find.

A threat actor with access to a better model or more money to burn on tokens may yet find more. Some of them have deep pockets, and not nearly every project will get the Glasswing treatment of free Mythos tokens.

jxmesth 2 hours ago|||

I'm a security person and would love to hear other people's input here as I don't have that much experience with this

thierrydamiba 3 hours ago|||

Can you be more specific?

tetha 2 hours ago|||

Three deterministic Linux LPEs in a week, an LPE in BSD in execve (of all things...), nginx vulnerabilities, one or two new gnarly supply chain attacks. Linus noting that the linux-security mailing list is getting flooded with duplicated, AI-driven reports of varying quality. There are pretty crazy keycloak vulnerabilities getting discovered.

We're most likely entering a year or two of rapid vulnerability discovery, patching, as well as reducing and minimizing system footprints just to survive the onslaught of strange vulnerabilities from e.g. ancient and widely unused kernel modules.

simonw 2 hours ago||||

The Claude Mythos / Project Glasswing thing is real: https://www.anthropic.com/glasswing

I met a few people at PyCon this week who have been part of Glasswing (they're just starting to be allowed to talk about it) and it really does drive down the cost of finding vulnerabilities.

I've been collecting notes on that here: https://simonwillison.net/tags/ai-security-research/

krzyk 2 hours ago|||

People in my company sounded underwhelmed by it. It was usually finding issues by not understanding the deployment (or not being fed that info).

halflife 1 hour ago||

A friend of mine had hands-on experience; it’s not the intelligence of it, it’s the speed.

You used to have a couple of days to close a breach, now it's 2 hours.

Gigachad 1 hour ago|||

Wouldn't it drive up the cost of finding vulnerabilities when all the low hanging fruit has already been scanned and patched? Like the new baseline for finding a vulnerability will be something an LLM couldn't find.

tptacek 2 hours ago||||

Broadly, I'm talking about the shift from building elaborate vulnerability research harnesses towards using the frontier models and their RL-optimized harnesses to build simpler vulnerability discovery pipelines. And then: the ensuing carnage.

baq 2 hours ago|||

Not op but just look at HN posts in the last couple weeks: supply chain worms, zero-day LPEs for all OSes seemingly every other day, researchers on X and here openly saying they’ve got more valid findings than they know what to do with

nickvec 2 hours ago||

Are you referring to Claude Mythos?

throwaway2027 4 hours ago|

December 2025 was the breakthrough for me. January Claude was euphoric, ChatGPT was up there. February Gemini cooked for a second there. March amazing. April the big bad nerf. May GPT 5.5 is just pure bliss, although with 2x limits temporarily. Not sure about Claude: it's sort of okay, still not as good as it felt before, slowly increasing limits with more compute and rebuilding goodwill.

dmpk2k 3 hours ago||

I find your emotional language truly quite fascinating. I've heard people talk like that about drugs.

wilg 2 hours ago||

Similarly, I've heard people talk like that about things that are not drugs.

sph 1 hour ago||

You can get a dopamine rush from anything, from drugs to using LLMs.

_puk 3 hours ago|||

I think Opus 4.6 at its peak was the "how can anyone not get that this is good" for me.

Then the nerf, and the massive uplift in tokens for 4.7, a model which I find lazy and prone to hallucinate.

It's probably time to try GPT5.5. Like many I'm pretty heavily invested in the anthropic ecosystem at this point, which I suppose gives another strong reason to make the switch.

Arn_Thor 53 minutes ago|||

I was a dedicated Claude user but in March/April I started using GPT5.5 on a new project that Claude had tried and failed to execute successfully. GPT knocked it out of the park, and was able to do it within my subscription allocation of tokens. I'd recommend giving it a go at least. Something like OpenClaude can let you use the Claude tools you're used to

ant6n 35 minutes ago|||

I only used Claude for the first time in April, previously only ChatGPT and Gemini. And I struggle to see what the hype is all about - yes it seems a tiny bit smarter than the pack, but on the $20 subscription it runs out of tokens in 5-20 minutes, and then you need to wait 3-4h.

ChatGPT 5.5 seems capable, although a bit stingy with “thinking” compared to earlier models, and I never run into session limits.

_puk 18 minutes ago||

I couldn't imagine using CC on the basic tier!

Even operations and GTM are all at "professional" level (which I think is vaguely equivalent to 5x).

nothinkjustai 1 hour ago||

[flagged]
