A few random notes from Claude coding quite a bit last few weeks

Posted by bigwheels 1/26/2026

A few random notes from Claude coding quite a bit last few weeks(twitter.com)

https://xcancel.com/karpathy/status/2015883857489522876

912 points | 847 commentspage 2

kshri24 1/28/2026|

Agree with Karpathy's take. Finally a down to Earth analysis from a respected source in the AI space. I guess I'll be using slopocalypse a lot more now :)

> I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media

It has arrived. Github will be most affected thanks to git-terrorists at Apna College refusing to take down that stupid tutorial. IYKYK.

ActorNightly 1/28/2026|

The respect is unwarranted.

He ran Teslas ML division, but still doesnt know what a simple kalman filter is (in the sense where he claimed that lidar would be hard to integrate with cameras).

akoboldfrying 1/28/2026||

The Kalman filter examples I've seen always involve estimating a very simple quantity, like the location of a single 3D point, from noisy sensors. It's clear how multiple estimates can be combined into a new estimate.

I'd guess that cameras on a self-driving car are trying to estimate something much more complex, something like 3D surfaces labeled with categories ("person", "traffic light", etc.). It's not obvious to me how estimates of such things from multiple sensors and predictions can be sensibly and efficiently combined to produce a better estimate. For example, what if there is a near red object in front of a distant red background, so that the camera estimates just a single object, but the lidar sees two?

ActorNightly 1/28/2026||

https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...

Kalman filters basic concept is essentially this.

1. make prediction on the next state change of some measurable n dimentional quantity, and estimate the covariance matrix across those n dimentions, which describe essentially a probability that the i-th dimention is going to increase (or decrease) with j-th dimention, where i and j are between 0 and n (indices of the vector)

2. Gather sensor data (that can be noisy), and reconcile the predicted measurement with the measured to get the best guess. The covariance matrix acts as a kind of weight for each of the elements

3. Update the covariance matrix based on the measurements in previous step.

You can do this for any vector of numbers. For example, instead of tracking individual objects, you can have a grid where each element represents a physical object that the car should not drive into, with a value representing certainty of that object being there. Then when you combine sensor reading, you still can use your vision model but that model would be enhanced by what lidar detects, both in terms of seeing things that camera doesn't pick up and rejecting things that aren't there.

And the concept is generic enough to where you can set up a system to be able to plug in any additional sensor with its own noise, and it all works out in the end. This is used all the You can even extend the concept past Gaussian noise and linearity, there are a number of other filters that deal with that, broadly under the umbrella of sensor fusion.

The problem is that Karpathy is more of a computer scientist, so he is on his Code 2.0 train of having ML models do everything. I dunno if he is like that himself or Musks "im smarter than everyone else that came before me" rubbed off.

And of course when you think like that, its going to be difficult to integrate lidar into the model. But the problem with that thinking is that forward inference LLM is not AI, and it will never ever be able to drive a car well compared to a true "reasoning" AI with feedback loops.

einrealist 1/27/2026||

> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.

Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.

cyode 1/27/2026||

This quote stuck out to me as well, for a slightly different reason.

The “tenacity” referenced here has been, in my opinion, the key ingredient in the secret sauce of a successful career in tech, at least in these past 20 years. Every industry job has its intricacies, but for every engineer who earned their pay with novel work on a new protocol, framework, or paradigm, there were 10 or more providing value by putting the myriad pieces together, muddling through the ever-waxing complexity, and crucially never saying die.

We all saw others weeded out along the way for lacking the tenacity. Think the boot camp dropouts or undergrads who changed majors when first grappling with recursion (or emacs). The sole trait of stubbornness to “keep going” outweighs analytical ability, leetcode prowess, soft skills like corporate political tact, and everything else.

I can’t tell what this means for the job market. Tenacity may not be enough on its own. But it’s the most valuable quality in an employee in my mind, and Claude has it.

noosphr 1/28/2026|||

There is an old saying back home: an idiot never tires, only sweats.

Claude isn't tenacious. It is an idiot that never stops digging because it lacks the meta cognition to ask 'hey, is there a better way to do this?'. Chain of thought's whole raison d'etre was so the model could get out of the local minima it pushed itself in. The issue is that after a year it still falls into slightly deeper local minima.

This is fine when a human is in the loop. It isn't what you want when you have a thousand idiots each doing a depth first search on what the limit of your credit card is.

Havoc 1/28/2026|||

> it lacks the meta cognition to ask 'hey, is there a better way to do this?'.

Recently had an AI tell me this code (that it wrote) is a mess and suggested wiping it and starting from scratch with a more structure plan. That seems to hint at some meta cognition outlines

zzrrt 1/28/2026|||

Haha, it has the human developer traits of thinking all old code is garbage, failing to identify oneself as the dummy who wrote this particular code, and wanting to start from scratch.

dpkirchner 1/28/2026||

It's like NIH syndrome but instead "not invented here today". Also a very human thing.

globular-toast 1/28/2026||

More like NIITS: Not Invented in this Session.

rurp 1/28/2026||||

Perhaps. I've had LLMs tell me some code is deeply flawed garbage that should be rewritten about code that exact same LLM wrote minutes before. It could be a sign of deep meta cognition, or it might be due to some cognitive gaps where it has no idea why it did something a minute ago and suddenly has a different idea.

Applejinx 1/28/2026||

This is not a fair criticism. There is _nobody_ there, so you can't be saying 'code the exact same LLM wrote minutes before'. There is no 'exact same LLM' and no ideas for it to have, you're trying to make sense of sparkles off the surface of a pond. There's no 'it' to have an idea and then a different idea, much less deep meta cognition.

rurp 1/29/2026|||

I'm not sure we disagree. I was pushing back against the idea that suggesting a rewrite of some code implies meta cognition abilities on the part of the LLM. That seems like weak evidence to me.

duncangh 1/29/2026|||

They should’ve named him tom instead of Claude in homage to Ten second Tom from fifty first dates

teaearlgraycold 1/28/2026||||

I asked Claude to analyze something and report back. It thought for a while said “Wow this analysis is great!” and then went back to thinking before delivering the report. They’re auto-sycophantic now!

lbrito 1/28/2026||||

Someone will say "you just need to instruct Claude.md to be more meta and do a wiggum loop on it"

hyperadvanced 1/28/2026||||

Metacognition As A Service, you say?

guy4261 1/28/2026|||

Running on the Meta Cognition Protocol server near you.

baxtr 1/28/2026||

You’ll get sued by Meta for this!

r-w 1/28/2026|||

I think that’s called “consulting”.

karlgkk 1/28/2026|||

lol no it doesn’t. It hints at convincing language models

samusiam 1/28/2026||||

I mean, not always. I've seen Claude step back and reconsider things after hitting a dead end, and go down a different path. There are also workflows, loops that can increase the likelihood of this occurring.

cocacolacowboy 1/28/2026|||

[dead]

BeetleB 1/27/2026||||

This is a major concern for junior programmers. For many senior ones, after 20 (or even 10) years of tenacious work, they realize that such work will always be there, and they long ago stopped growing on that front (i.e. they had already peaked). For those folks, LLMs are a life saver.

At a company I worked for, lots of senior engineers become managers because they no longer want to obsess over whether their algorithm has an off by one error. I think fewer will go the management route.

(There was always the senior tech lead path, but there are far more roles for management than tech lead).

codyb 1/28/2026|||

I feel like if you're really spending a ton of time on off by one errors after twenty years in the field you haven't actually grown much and have probably just spent a ton of time in a single space.

Otherwise you'd be senior staff to principle range and doing architecture, mentorship, coordinating cross team work, interviewing, evaluating technical decisions, etc.

I got to code this week a bit and it's been a tremendous joy! I see many peers at similar and lower levels (and higher) who have more years and less technical experience and still write lots of code and I suspect that is more what you're talking about. In that case, it's not so much that you've peaked, it's that there's not much to learn and you're doing a bunch of the same shit over and over and that's of course tiring.

I think it also means that everything you interact with outside your space does feel much harder because of the infrequency with which you have interacted with it.

If you've spent your whole career working the whole stack from interfaces to infrastructure then there's really not going to be much that hits you as unfamiliar after a point. Most frameworks recycle the same concepts and abstractions, same thing with programming languages, algorithms, data management etc.

But if you've spent most of your career in one space cranking tickets, those unknown corners are going to be as numerous as the day you started and be much more taxing.

rishabhaiover 1/27/2026||||

That's just sad. Right when I found love in what I do, my work has no value anymore.

jasonfarnon 1/28/2026||

Aren't you still better off than the rest of us who found what they love + invested decades in it before it lost its value. Isn't it better to lose your love when you still have time to find a new one?

josephg 1/28/2026|||

I don't think so. Those of us who found what we love and invested decades into it got to spend decades getting paid well to do what we love.

pesus 1/28/2026||||

Depends on if their new love provides as much money as their old one, which is probably not likely. I'd rather have had those decades to stash and invest.

jasonfarnon 1/28/2026||

A lot of pre-faang engineers dont have the stash you're thinking about. What you meant was "right when I found a lucrative job that I love". What was going on in tech these last 15 years, unfortunately, probably was once in a lifetime.

WarmWash 1/28/2026||

It's crazy to think back in the 80's programmers had "mild" salaries despite programming back then being worlds more punishing. No libraries, no stack exchange, no forums, no endless memory and infinite compute. If you had a challenging bug you better also be proficient in reading schematics and probing circuits.

lurking_swe 1/28/2026||

on the bright side software evolved much more slowly in the 80s. You could go very far by being an expert in 1 thing.

People had real offices with actual quiet focus time.

User expectations were also much lower.

pros and cons i guess?

sponaugle 1/28/2026||||

"it lost its value".

It has not lost its value yet, but the future will shift that value. All of the past experience you have is an asset for you to move with that shift. The problem will not be you losing value, it will be you not following where the value goes.

It might be a bit more difficult to love where the shift goes, but that is no different than loving being a artist which often shares a bed with loving being poor. What will make you happier?

nfredericks 1/28/2026|||

This is genuinely such a good take

dugidugout 1/28/2026||

Especially on the topic of value! We are all intuitively aware that value is highly contextual, but get in a knot trying to rationalize value long past genuine engagement!

test6554 1/27/2026|||

Imagine a senior dev who just approves PRs, approves production releases, and prioritizes bug reports and feature requests. LLM watches for errors ceaslessly, reports an issue. Senior dev reviews the issue and assigns a severity to it. Another LLM has a backlog of features and errors to go solve, it makes a fix and submits a PR after running tests and verifying things work on its end.

techgnosis 1/28/2026||||

Why are we pretending like the need for tenacity will go away? Certain problems are easier now. We can tackle larger problems now that also require tenacity.

samusiam 1/28/2026||

Even right at this very moment where we have a high-tenacity AI, I'd argue that working with the AI -- that is to say, doing AI coding itself and dealing with the novel challenges that brings requires a lot of stubborn persistence.

mykowebhn 1/28/2026|||

Fittingly, George Hinton toiled away for years in relative obscurity before finally being recognized for his work. I was always quite impressed by his "tenacity".

So although I don't think he should have won the Nobel Prize because not really physics, I felt his perseverance and hard work should merit something.

direwolf20 1/28/2026||

... The person who embezzled from the SDC in 2018? https://eu.jsonline.com/story/news/investigations/2024/04/19...

mykowebhn 1/29/2026||

Haha, my bad. Yes, that "George" Hinton!

daxfohl 1/27/2026|||

I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broke an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing.

Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"

gtowey 1/27/2026|||

The value extortion plan writes itself. How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? Would you even know?

password4321 1/27/2026|||

First time I've seen this idea, I have a tingling feeling it might become reality sooner rather than later.

sailfast 1/27/2026||||

That’s far-fetched. It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.

d0mine 1/27/2026|||

> far-fetched

Remember Google?

Once it was far-fetched that they would make the search worse just to show you more ads. Now, it is a reality.

With tokens, it is even more direct. The more tokens users spend, the more money for providers.

retsibsi 1/28/2026|||

> Now, it is a reality.

What are the details of this? I'm not playing dumb, and of course I've noticed the decline, but I thought it was a combination of losing the battle with SEO shite and leaning further and further into a 'give the user what you think they want, rather than what they actually asked for' philosophy.

supriyo-biswas 1/28/2026|||

https://www.wheresyoured.at/the-men-who-killed-google/

SetTheorist 1/28/2026|||

As recently as 15 years ago, Google _explicitly_ stated in their employee handbook that they would NOT, as a matter of principle, include ads in the search results. (Source: worked there at that time.)

Now, they do their best to deprioritize and hide non-ad results...

throwthrowuknow 1/28/2026|||

Only if you are paying per token on the API. If you are paying a fixed monthly fee then they lose money when you need to burn more tokens and they lose customers when you can’t solve your problems within that month and max out your session limits and end up with idle time which you use to check if the other providers have caught up or surpassed your current favourite.

layla5alive 1/28/2026||

Indeed, unlimited plan seems like the only way that makes sense to not have it be guaranteed to be abused by the provider

xienze 1/27/2026||||

> It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise.

Unless you’re paying by the token.

lelanthran 1/28/2026|||

> It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall.

It's only in the interests of the model builders to do that IFF the user can actually tell that the model is giving them the best value for a single dollar.

Right now you can't tell.

fragmede 1/28/2026||

Why not? Seems like you'd just build the same app on each of the models you want to test and judge how they did.

lelanthran 1/28/2026||

> Why not? Seems like you'd just build the same app on each of the models you want to test and judge how they did.

I tried that on a few problems; even on the same model the results have too much variation.

When comparing different models, repeating the experiment gives you different results.

hnuser40690 6 days ago||

Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.

lelanthran 4 days ago||

> Interesting point about model variation. It would be useful to run multiple trials and look at the statistical distribution of results rather than single runs. This could help identify which models are more consistent in their outputs.

That doesn't help in practical usage - all you'd know is their consistency at the point in time of testing. After all, 5m after your test is done, your request to an API might lead to a different model being used in the background because the limits of the current one were reached.

fragmede 1/27/2026||||

The free market proposition is that competition (especially with Chinese labs and grok) means that Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAi such that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one shot an issue with Qwen or Deepseek or Kimi thinking, Anthropic loses not just my monthly subscription, but everyone else's I show that too. So no, I think that's some grade A conspiracy theory nonsense you've got there.

coffeefirst 1/27/2026|||

It’s not that crazy. It could even happen by accident in pursuit of another unrelated goal. And if it did, a decent chunk of the tech industry would call it “revealed preference” because usage went up.

hnuser123456 1/27/2026||

LLMs became sycophantic and effusive because those responses were rated higher during RLHF, until it became newsworthy how obviously eager-to-please they got, so yes, being highly factually correct and "intelligent" was already not the only priority.

jrflowers 1/27/2026||||

This is a good point. For example if you have access to a bunch of slot machines, one of them is guaranteed to hit the jackpot. Since switching from one slot machine to another is easy, it is trivial to go from machine to machine until you hit the big bucks. That is why casinos have such large selections of them (for our benefit).

krupan 1/27/2026|||

"for our benefit" lol! This is the best description of how we are all interacting with LLMs now. It's not working? Fire up more "agents" ala gas town or whatever

direwolf20 1/28/2026||

gas is the transaction fees in Ethereum. It's a fitting name.

robotmaxtron 1/28/2026|||

last time I was at a casino I checked to see what company built the machines, imagine my surprise that it was (by my observation) a single vendor.

bandrami 1/28/2026||||

> But switching costs are pretty low

Switching costs are currently low. Once you're committed to the workflow the providers will switch to prepaying for a year's worth of tokens.

daxfohl 1/27/2026||||

To be clear I don't think that's what they're doing intentionally. Especially on a subscription basis, they'd rather me maximize my value per token, or just not use them. Lulling users into using tokens unproductively is the worst possible option.

The way agents work right now though just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself".

thunderfork 1/27/2026||||

As a rational consumer, how would you distinguish between some intentional "keep pulling the slot machine" failure rate and the intrinsic failure rate?

I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot.

In a market for lemons, good luck finding a lime.

krupan 1/27/2026||

FWIW, kagi is better than Google

direwolf20 1/28/2026||

yes, that was their point. Everyone uses Google anyway.

zelphirkalt 1/29/2026|||

And we all know the market always gives us the best quality product ...

Fnoord 1/28/2026||||

I was thinking more of deliberate backdoor in code. RCE is an obvious example, but another one could be bias. "I'm sorry ma'am, computer says you are ineligable for a bank account." These ideas aren't new. They were there in 90s already when we still thought about privacy and accountability regarding technology, and dystopian novels already described them long, long ago.

chanux 1/28/2026|||

Is this from a page of dating apps playbook?

direwolf20 1/28/2026||

yes

wvenable 1/27/2026||||

> These can be subtle because you're not looking for them

After any agent run, I'm always looking the git comparison between the new version and the previous one. This helps catch things that you might otherwise not notice.

teaearlgraycold 1/28/2026||

And after manually coding I often have an LLM review the diff. 90% of the problems it finds can be discounted, but it’s still a net positive.

einrealist 1/28/2026||||

And there is this paradox where it becomes harder to detect the problems as the models 'improve'.

charcircuit 1/27/2026|||

You are using it wrong, or are using a weak model if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction.

testaccount28 1/27/2026||

you are using it on easy questions. some of us are not.

meowface 1/28/2026|||

A lot of people are getting good results using it on hard things. Obviously not perfect, but > 50% success.

That said, more and more people seem to be arriving at the conclusion that if you want a fairly large-sized, complex task in a large existing codebase done right, you'll have better odds with Codex GPT-5.2-Codex-XHigh than with Claude Code Opus 4.5. It's far slower than Opus 4.5 but more likely to get things correct, and complete, in its first turn.

testaccount28 1/29/2026||

yes, i also get good results. that's why i use it on the hard things.

mikkupikku 1/27/2026||||

I think a lot of it comes down to how well the user understands the problem, because that determines the quality of instructions and feedback given to the LLM.

For instance, I know some people have had success with getting claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get claude to do the work for me. Unsuccessful. It works for people who understand the problem domain, but not for those who don't. That's my theory.

samrus 1/27/2026||

It works for hard problems when the person already solves it and just needs the grunt work done

It also works for problems that have been solved a thousand times before, which impresses people and makes them think it is actually solving those problems

daxfohl 1/27/2026|||

Which matches what they are. They're first and foremost pattern recognition engines extraordinaire. If they can identify some pattern that's out of whack in your code compared to something in the training data, or a bug that is similar to others that have been fixed in their training set, they can usually thwack those patterns over to your latent space and clean up the residuals. If comparing pattern matching alone, they are superhuman, significantly.

"Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape. Their ability to pattern match makes reasoning seem more powerful than it actually is. If your bug is within some reasonable distance of a pattern it has seen in training, reasoning can get it over the final hump. But if your problem is too far removed from what it has seen in its latent space, it's not likely to figure it out by reasoning alone.

samrus 1/29/2026|||

Exactly. I go back to a recent ancestor of LLMs, seq2seq. Its purpose was to translate things. Thats all. That needed representation learning and an attention mechanism, and it lead to some really freaky emergent capabilities, but its trained to trainslate language.

And thats exactly what its good for. It works great if you already solve a tough problem and provide it the solution in natural language, because the program is already there, it just needs to translate it to python.

Anything more than that that might emerge from this is going to be unreliable sleight of next-token-prediction at best.

We need a new architectural leap to have these things reason, maybe something that involves reinforcement learning at the token represention level, idk. But scaling the context window and training data arent going to cut it

charcircuit 1/27/2026|||

>"Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape.

What do you mean by this? Especially for tasks like coding where there is a deterministic correct or incorrect signal it should be possible to train.

direwolf20 1/28/2026||

it's meant in the literal sense but with metaphorical hacksaws and duct tape.

Early on, some advanced LLM users noticed they could get better results by forcing insertion of a word like "Wait," or "Hang on," or "Actually," and then running the model for a few more paragraphs. This would increase the chance of a model noticing a mistake it made.

Reasoning is basically this.

charcircuit 1/28/2026||

It's not just force inserting a word. Reasoning is integrated into the training process of the model.

samrus 1/29/2026||

Not the core foundation model. The foundation model still only predicts the next token in a static way. The reasoning is tacked onto the instructGPT style finetuning step and its done through prompt engineering. Which is the shittiest way a model like this could have been done, and it shows

thunky 1/28/2026|||

> It also works for problems that have been solved a thousand times before

So you mean it works on almost all problems?

samrus 1/29/2026||

I mean problems not worth solving, because theyve already been solved. If you need to just do the grunt work of retrieving the solution to a trite and worn out problem from the models training data, then they work great

But if you want to do interesting things, like all the shills keep trying to claim they do. Then this wont do it for you. You have to do it for it

thunky 1/29/2026||

> But if you want to do interesting things, like all the shills keep trying to claim they do

I don't know where this is coming from. I've seen some over-enthusiastic hype for sure, but most of the day-to-day conversations I see aren't people saying they're curing cancer with Claude, they're people saying they're automating their bread and butter tasks with great success.

baq 1/27/2026|||

Don’t use it for hard questions like this then; you wouldn’t use a hammer to cut a plank, you’d try to make a saw instead

fooker 1/27/2026|||

> It might become cheaper or it might not

If it does not, this is going to be first technology in the history of mankind that has not become cheaper.

(But anyway, it already costs half compared to last year)

ctoth 1/27/2026|||

> But anyway, it already costs half compared to last year

You could not have bought Claude Opus 4.5 at any price one year ago I'm quite certain. The things that were available cost half of what they did then, and there are new things available. These are both true.

I'm agreeing with you, to be clear.

There are two pieces I expect to continue: inference for existing models will continue to get cheaper. Models will continue to get better.

Three things, actually.

The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].

[0]: https://blog.irvingwb.com/blog/2018/09/a-critical-appraisal-...

teaearlgraycold 1/28/2026|||

As a user of LLMs since GPT-3 there was noticeable stagnation in LLM utility after the release of GPT-4. But it seems the RLHF, tool calling, and UI have all come together in the last 12 months. I used to wonder what fools could be finding them so useful to claim a 10x multiplier - even as a user myself. These days I’m feeling more and more efficiency gains with Claude Code.

HNisCIS 1/28/2026||

That's the thing people are missing, the models plateaued a while ago, still making minor gains to this day, but not huge ones. The difference is now we've had time to figure out the tooling. I think there's still a ton of ground to cover there and maybe the models will improve given that the extra time, but I think it's foolish to consider people who predicted that completely wrong. There are also a lot of mathematical concerns that will cause problems in the near and distant future. Infinite progress is far from a given, we're already way behind where all the boosters thought we'd be my now.

teaearlgraycold 1/28/2026||

I believe Sam Altman, perhaps the greatest grifter in today’s Silicon Valley, claimed that software engineering would be obsolete by the end of last year.

bsder 1/27/2026||||

> The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].

Everybody who bet against Moore's Law was wrong ... until they weren't.

And AI is the reaction to Moore's Law having broken. Nobody gave one iota of damn about trying to make programming easier until the chips couldn't double in speed anymore.

twoodfin 1/27/2026||

This is exactly backwards: Dennard scaling stopped. Moore’s Law has continued and it’s what made training and running inference on these models practical at interactive timescales.

bsder 1/28/2026||

You are technically correct. The best kind of correct.

However, most people don't know the difference between the proper Moore's Law scaling (the cost of a transistor halves every 2 years) which is still continuing (sort of) and the colloquial version (the speed of a transistor doubles every 2 years) which got broken when Dennard scaling ran out. To them, Moore's Law just broke.

Nevertheless, you are reinforcing my point. Nobody gave a damn about improving the "programming" side of things until the hardware side stopped speeding up.

And rather than try to apply some human brainpower to fix the "programming" side, they threw a hideous number of those free (except for the electricity--but we don't mention that--LOL) transistors at the wall to create a broken, buggy, unpredictable machine simulacrum of a "programmer".

(Side note: And to be fair, it looks like even the strong form of Moore's Law is finally slowing down, too)

twoodfin 1/28/2026||

If you can turn a few dollars of electricity per hour into a junior-level programmer who never gets bored, tired, or needs breaks, that fundamentally changes the economics of information technology.

And in fact, the agentic looped LLMs are executing much better than that today. They could stop advancing right now and still be revolutionary.

simianwords 1/27/2026|||

interesting post. i wonder if these people go back and introspect on how incorrect they have been? do they feel the need to address it?

fooker 1/27/2026|||

No, people do not do that.

This is harmless when it comes to tech opinions but causes real damage in politics and activism.

People get really attached to ideals and ideas, and keep sticking to those after they fail to work again and again.

simianwords 1/27/2026||

i don't think it is harmless or we are incentivising people to just say whatever they want without any care for truth. people's reputations should be attached to their predictions.

cogogo 1/27/2026||||

Some people definitely do but how do they go and address it? A fresh example in that it addresses pure misinformation. I just screwed up and told some neighbors garbage collection was delayed for a day because of almost 2ft of snow. Turns out it was just food waste and I was distracted checking the app and read the notification poorly.

I went back to tell them (do not know them at all just everyone is chattier digging out of a storm) and they were not there. Feel terrible and no real viable remedy. Hope they check themselves and realize I am an idiot. Even harder on the internet.

maest 1/28/2026|||

Do _you_ do that?

simianwords 1/28/2026||

i try to yes

peaseagee 1/27/2026||||

That's not true. Many technologies get more expensive over time, as labor gets more expensive or as certain skills fall by the wayside, not everything is mass market. Have you tried getting a grandfather clock repaired lately?

willio58 1/27/2026|||

Repairing grandfather clocks isn't more expensive now because it's gotten any harder; it's because the popularity of grandfather clocks is basically nonexistent compared to anything else to tell time.

direwolf20 1/28/2026||

Doesn't need to be any particular reason to disprove the notion that technology only gets cheaper.

simianwords 1/27/2026||||

"repairing a unique clock" getting costlier doesn't mean technology hasn't gotten cheaper.

check out whether clocks have gotten cheaper in general. the answer is that it has.

there is no economy of scale here in repairing a single clock. its not relevant to bring it up here.

ipaddr 1/27/2026||

Clocks prices have gone up since 2020. Unless a cheaper better way to make clocks has emerged inflation causes prices to grow.

fooker 1/27/2026|||

Luxury watches have gone up, 'clocks' as a technology is cheaper than ever.

You can buy one for 90 cents on temu.

ipaddr 1/27/2026||

The landing cost for that 90 cent watch has gone way up. Shipping and to some degree taxes has pushed the price higher.

pas 1/28/2026||

that's not the technology

of course it's silly to talk about manufacturing methods and yield and cost efficiency without having an economy to embed all of this into, but ... technology got cheaper means that we have practical knowledge of how to make cheap clocks (given certain supply chains, given certain volume, and so and so)

we can make very cheap very accurate clocks that can be embedded into whatever devices, but it requires the availability of fabs capable of doing MEMS components, supply materials, etc.

simianwords 1/27/2026|||

not true, clocks have gone down after accounting for inflation. verified using ChatGPT.

ipaddr 1/27/2026|||

You can't account for inflation because the price increase is inflation.

pas 1/28/2026|||

you can look at a basket of goods that doesn't have your specific product and compare directly

but inflation is the general price level increase, this can be used as a deflator to get the price of whatever product in past/future money amount to see how the price of the product changed in "real" terms (ie. relative to the general price level change)

simianwords 1/27/2026|||

this is not true

peaseagee 1/28/2026|||

You cannot verify anything using a tool that cannot validate truth.

esafak 1/27/2026||||

Instead of advancing tenuous examples you could suggest a realistic mechanism by which costs could rise, such as a Chinese advance on Taiwan, effecting TSMC, etc.

groby_b 1/27/2026||||

No. You don't get to make "technology gets more expensive over time" statements for deprecated technologies.

Getting a bespoke flintstone axe is also pretty expensive, and has also absolutely no relevance to modern life.

These discussions must, if they are to be useful, center in a population experience, not in unique personal moments.

ipaddr 1/27/2026|||

I purchased a 5T drive in 2019 and the price is higher now despite newer better drives going on the market since.

Not much has down in price over the last few years.

groby_b 1/27/2026||

Price volatility exists.

Meanwhile the overall price of storage has been going down consistently: https://ourworldindata.org/grapher/historical-cost-of-comput...

solomonb 1/27/2026||||

okay how about the Francis Scott Key Bridge?

https://marylandmatters.org/2025/11/17/key-bridge-replacemen...

groby_b 1/28/2026||

You will get a different bridge. With very different technology. Same as "I can't repair my grandfather clock cheaply".

In general, there are several things that are true for bridges that aren't true for most technology:

* Technology has massively improved, but most people are not realizing that. (E.g. the Bay Bridge cost significantly more than the previous version, but that's because we'd like to not fall down again in the next earthquake) * We still have little idea how to reason about the cost of bridges in general. (Seriously. It's an active research topic) * It's a tiny market, with the major vendors forming an oligopoly * It's infrastructure, not a standard good * The buy side is almost exclusively governments.

All of these mean expensive goods that are completely non-repeatable. You can't build the same bridge again. And on top of that, in a distorted market.

But sure, the cost of "one bridge, please" has gone up over time.

solomonb 1/28/2026|||

This seems largely the same as any other technology. The prices of new technologies go down initially as we scale up and optimize it's production, but as soon as demand fades, due to newer technology or whatever, the cost of that technology goes up again.

fooker 1/28/2026|||

> But sure, the cost of "one bridge, please" has gone up over time.

Even if you adjust for inflation?

groby_b 1/28/2026||

Depends, do we care about TCO? Also, can I pick the set of bridges I compare?

OK, kidding aside: If you deeply care, you can probably mine the Federal Highway Administration's bridge construction database: https://fhwaapps.fhwa.dot.gov/upacsp/tm?transName=MenuSystem...

I don't think the question is answerable in a meaningful way. Bridges are one-off projects with long life spans, comparing cost over time requires a lot of squinting just so.

arthurbrown 1/27/2026|||

Bought any RAM lately? Phone? GPU in the last decade?

ipaddr 1/27/2026||

The latest iphone has gone down in price? It's double. I guess the marketing is working.

xnyan 1/28/2026||

"Pens are not cheaper, look at this Montblanc" is not a good faith response.

'84 Motorola DynaTAC - ~$12k AfI (adjusted for inflation)

'89 MicroTAC ~$8k AfI

'96 StarTAC ~$2k AfI

`07 iPhone ~$673 AfI

The current average smartphone sells for around $280. Phones are getting cheaper.

direwolf20 1/28/2026||

Phones, or smartphones?

emtel 1/27/2026||||

Time-keeping is vastly cheaper. People don't want grandfather clocks. They want to tell time. And they can, more accurately, more easily, and much cheaper than their ancestors.

epidemiology 1/28/2026|||

Or riding in an uber?

InsideOutSanta 1/27/2026||||

Sure, running an LLM is cheaper, but the way we use LLMs now requires way more tokens than last year.

fooker 1/27/2026|||

10x more tokens today cost less than than half of X tokens from ~mid 2024.

simianwords 1/27/2026|||

ok but the capabilities are also rising. what point are you trying to make?

oytis 1/27/2026||

That it's not getting cheaper?

jstummbillig 1/27/2026|||

But it is, capability adjusted, which is the only way it makes sense. You can definitely produce last years capability at a huge discount.

simianwords 1/27/2026|||

you are wrong. https://epoch.ai/data-insights/llm-inference-price-trends

this is accounting for the fact that more tokens are used.

techpression 1/27/2026||

The chart shows that they’re right though. Newer models cost more than older models. Sure they’re better but that’s moot if older models are not available or can’t solve the problem they’re tasked with.

simianwords 1/27/2026|||

this is incorrect. the cost to achieve the same task by old models is way higher than by new models.

> Newer models cost more than older models

where did you see this?

techpression 1/27/2026||

On the link you shared, 4o vs 3.5 turbo price per 1m tokens.

There’s no such thing as ”same task by old model”, you might get comparable results or you might not (and this is why the comparison fail, it’s not a comparison), the reason you pick the newer models is to increase chances of getting a good result.

simianwords 1/27/2026||

> The dataset for this insight combines data on large language model (LLM) API prices and benchmark scores from Artificial Analysis and Epoch AI. We used this dataset to identify the lowest-priced LLMs that match or exceed a given score on a benchmark. We then fit a log-linear regression model to the prices of these LLMs over time, to measure the rate of decrease in price. We applied the same method to several benchmarks (e.g. MMLU, HumanEval) and performance thresholds (e.g. GPT-3.5 level, GPT-4o level) to determine the variation across performance metrics

This should answer. In your case, GPT-3.5 definitely is cheaper per token than 4o but much much less capable. So they used a model that is cheaper than GPT-3.5 that achieved better performance for the analysis.

fooker 1/27/2026|||

OpenAI has always priced newer models lower than older ones.

simianwords 1/27/2026|||

not true! 4o was costlier than 3.5 turbo

techpression 1/27/2026|||

https://platform.openai.com/docs/pricing

Not according to their pricing table. Then again I’m not sure what OpenAI model versions even mean anymore, but I would assume 5.2 is in the same family as 5 and 5.2-pro as 5-pro

fooker 1/27/2026||

Check GPT 5.2 vs it's predecessor the 'o' series of reasoning models.

root_axis 1/27/2026||||

Not true. Bitcoin has continued to rise in cost since its introduction (as in the aggregate cost incurred to run the network).

LLMs will face their own challenges with respect to reducing costs, since self-attention grows quadratically. These are still early days, so there remains a lot of low hanging fruit in terms of optimizations, but all of that becomes negligible in the face of quadratic attention.

twoodfin 1/27/2026|||

For Bitcoin that’s by design!

namcheapisdumb 1/28/2026|||

> bitcoin

so close! that is a commodity

fulafel 1/28/2026||||

I don't think computation is going to become more expensive, but there are techs that have become so: Nuclear power plants. Mobile phones. Oil extraction.

(Oil rampdown is a survival imperative due to the climate catastrophe so there it's a very positive thing of course, though not sufficient...)

krupan 1/27/2026||||

There are plenty of technologies that have not become cheaper, or at least not cheap enough, to go big and change the world. You probably haven't heard of them because obviously they didn't succeed.

asadotzler 1/27/2026||||

cheaper doesnt mean cheap enough to be viable after the bills come due

runarberg 1/28/2026||||

Supersonic jet engines, rockets to the moon, nuclear power plants, etc. etc. all have become more expensive. Superconductors were discovered in 1911, and we have been making them for as long as we have been making transistors in the 1950s, yet superconductors show no sign of becoming cheaper any time soon.

There have been plenty of technologies in history which do not in fact become cheaper. LLMs are very likely to become such, as I suspect their usefulness will be superseded by cheaper (much cheaper in fact) specialized models.

ak_111 1/27/2026|||

Concorde?

YetAnotherNick 1/27/2026|||

With optimizations and new hardware, power is almost a negligible cost. You can get 5.5M tokens/s/MW[1] for kimi k2(=20M/KWH=181M tokens/$) which is 400x cheaper than current pricing. It's just Nvidia/TSMC/other manufacturers eating up the profit now because they can. My bet is that China will match current Nvidia within 5 years.

[1]: https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...

storystarling 1/27/2026||

Electricity is negligible but the dominant cost is the hardware depreciation itself. Also inference is typically memory bandwidth bound so you are limited by how fast you can move weights rather than raw compute efficiency.

YetAnotherNick 1/28/2026||

Yes, because the margin is like 80% for Nvidia, and 80% again for the manufacturers like Samsung and TSMC. Once the fixed cost like R and D is amortized the same node technology and hardware capacity could be just few single digit percent of current.

redox99 1/27/2026|||

> And you most likely do not pay the actual costs.

This is one of the weakest anti AI postures. "It's a bubble and when free VC money stops you'll be left with nothing". Like it's some kind of mystery how expensive these models are to run.

You have open weight models right now like Kimi K2.5 and GLM 4.7. These are very strong models, only months behind the top labs. And they are not very expensive to run at scale. You can do the math. In fact there are third parties serving these models for profit.

The money pit is training these models (and not that much if you are efficient like chinese models). Once they are trained, they are served with large profit margins compared to the inference cost.

OpenAI and Anthropic are without a doubt selling their API for a lot more than the cost of running the model.

bob1029 1/28/2026|||

Humans run hot too. Once you factor in the supply chain that keeps us alive, things become surprisingly equivalent.

Eating burgers and driving cars around costs a lot more than whatever # of watts the human brain consumes.

bbor 1/28/2026||

I mean, “equivalent” is an understatement! There’s a reason Claude Code costs less than hiring a full time software engineer…

direwolf20 1/28/2026||

(it's VC money burn)

crazygringo 1/27/2026|||

> Somewhere, there are GPUs/NPUs running hot.

Running at their designed temperature.

> You send all the necessary data, including information that you would never otherwise share.

I've never sent the type of data that isn't already either stored by GitHub or a cloud provider, so no difference there.

> And you most likely do not pay the actual costs.

So? Even if costs double once investor subsidies stop, that doesn't change much of anything. And the entire history of computing is that things tend to get cheaper.

> You and your business become dependent on this major gatekeeper.

Not really. Switching between Claude and Gemini or whatever new competition shows up is pretty easy. I'm no more dependent on it than I am on any of another hundred business services or providers that similarly mostly also have competitors.

hahahahhaah 1/27/2026|||

It is also amazing seeing Linux kernel work, scheduling threads, proving interrupts and API calls all without breaking a sweat or injuring its ACL.

mikeocool 1/27/2026|||

To me this tenacity is often like watching someone trying to get a screw into board using a hammer.

There’s often a better faster way to do it, and while it might get to the short term goal eventually, it’s often created some long term problems along the way.

chasebank 1/28/2026|||

I don’t understand this pov. Unfortunately, id pay 10k mo for my cc sub. I wish I could invest in anthropic, they’re going to be the most profitable company on earth

moooo99 1/28/2026|||

My agent struggled for 45 minutes because it tried to do `go run` on a _test.go file, which the compiler repeatedly exited after posting an error message that files named like this cannot be executed using the run command.

So yeah, that wasted a lot of GPU cycles for a very unimpressive result, but with a renewed superficial feeling of competence

squidbeak 1/28/2026|||

> you most likely do not pay the actual costs. It might become cheaper or it might not

Why would this be the first technology that doesn't become cheaper at scale over time?

karlgkk 1/28/2026|||

> And you most likely do not pay the actual costs

Oh my lord you absolutely do not. The costs to oai per token inference ALONE are at least 7x. AT LEAST and from what I’ve heard, much higher.

tgrowazay 1/28/2026||

We can observe how much generic inference providers like deepinfra or together-ai charge for large SOTA models. Since they are not subsidized and they don’t charge 7x of OpenAI, that means OAI also doesn’t have outrageously high per-token costs.

karlgkk 1 day ago||

Actually, that doesn’t mean anything.

OAI is running boundary pushing large models. I don’t think those “second tier” applications can even get the GPUs with the HBM required at any reasonable scale for customer use.

Not to mention training costs of foundation models

utopiah 1/28/2026||

AI genius discover brute forcing... what a time to be alive. /s

Like... bro that's THE foundation of CS. That's the principle of The bomb in Turing's time. One can still marvel at it but it's been with us since the beginning.

bob1029 1/28/2026||

I would agree that OAIs GPT-5 family of models is a phase change over GPT-4.

In the ChatGPT product this is not immediately obvious and many people would strongly argue their preference for 4. However, once you introduce several complex tools and make tool calling mandatory, the difference becomes stark.

I've got an agent loop that will fail nearly every time on GPT-4. It works sometimes, but definitely not enough to go to production. GPT-5 with reasoning set to minimal works 100% of the time. $200 worth of tokens and it still hasn't failed to select the proper sequence of tools. It sometimes gets the arguments to the tools incorrect, but it's always holding the right ones now.

I was very skeptical based upon prior experience but flipping between the models makes it clear there has been recent stepwise progress.

I'll probably be $500 deep in tokens before the end of the month. I could barely go $20 before I called bullshit on this stuff last time.

alansaber 1/28/2026||

Pretty sure there wasn't extensive training on tooling beforehand. I mean, god, during GPT-3 even getting a reliable json output was a battle and there were dedicated packages for json inference.

theshrike79 1/28/2026||

Now imagine local models with 95%+ reliable tool calling, you can do insane things when that's the reality.

oxag3n 1/27/2026||

> Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually... > Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.

Until you struggle to review it as well. Simple exercise to prove it - ask LLM to write a function in familiar programming language, but in the area you didn't invest learning and coding yourself. Try reviewing some code involving embedding/SIMD/FPGA without learning it first.

sleazebreeze 1/27/2026|

People would struggle to review code in a completely unfamiliar domain or part of the stack even before LLMs.

piskov 1/28/2026|||

That’s why you need to write code to learn it.

No-one has ever learned skill just by reading/observing

sponaugle 1/28/2026||

"No-one has ever learned skill just by reading/observing" - Except of course all of those people in Cosmology who, you know, observe.

direwolf20 1/28/2026||

what skill do they have? making stars? no they are skilled at observing, which is what they do.

sponaugle 1/28/2026||

I think understanding stellar processes and then using that understanding to theorize about other observations is a skill. My point was that observing can be a fantastic way to build a skill.. not all skills, but certainly some skills. Learning itself is as much an observation as a practice.

AstroBen 1/28/2026||||

How would you find yourself in that situation before AI?

chrisjj 1/28/2026|||

No, because they wouldn't be so foolish as to try it.

philipwhiuk 1/27/2026||

The bits left unsaid:

1. Burning tokens, which we charge you for

2. My CPU does this when I tell it to do bogosort on a million 32-bit integers, it doesn't mean it's a good thing

vinhnx 1/28/2026||

Boris Cherny (Claude Code creator) replies to Andrej Karpathy

https://xcancel.com/bcherny/status/2015979257038831967

porise 1/27/2026||

I wish the people who wrote this let us know what king of codebases they are working on. They seem mostly useless in a sufficiently large codebase especially when they are messy and interactions aren't always obvious. I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

CameronBanga 1/27/2026||

This is an antidotal example, but I released this last week after 3 months of work on it as a "nights and weekdends" project: https://apps.apple.com/us/app/skyscraper-for-bluesky/id67541...

I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kinda a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.

That code base is ~98% Claude code.

bee_rider 1/27/2026||

I don’t know if “antidotal example” is a pun or a typo but I quite like it.

CameronBanga 1/27/2026|||

Lol typing on my phone during lunch and meant anecdotal. But let's leave it anyways. :)

oasisbob 1/27/2026|||

That is fun.

Not sure if it's an American pronunciation thing, but I had to stare at that long and hard to see the problem and even after seeing it couldn't think of how you could possibly spell the correct word otherwise.

bsder 1/27/2026||

> Not sure if it's an American pronunciation thing

It's a bad American pronunciation thing like "Febuwary" and "nuculer".

If you pronounce the syllables correctly, "an-ec-dote", "Feb-ru-ar-y", "nu-cle-ar" the spellings follow.

English has it's fair share of spelling stupidities, but if people don't even pronounce the words correctly there is no hope.

lynguist 1/28/2026||

https://en.wiktionary.org/wiki/February

The pronunciation of the first r with a y sound has always been one of two possible standards, in fact "February" is a re-Latinizing spelling but English doesn’t like the br-r sound so it naturally dissimilates to by-r.

TaupeRanger 1/27/2026|||

Claude and Codex are CLI tools you use to give the LLM context about the project on your local machine or dev environment. The fact that you're using the name "ChatGPT" instead of Codex leads me to believe you're talking about using the web-based ChatGPT interface to work on a large codebase, which is completely beside the point of the entire discussion. That's not the tool anyone is talking about here.

danielvaughn 1/27/2026|||

It's important to understand that he's talking about a specific set of models that were release around november/december, and that we've hit a kind of inflection point in model capabilities. Specifically Anthropic's Opus 4.5 model.

I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference, it's more like it just finally hit that quantitative edge that allows me to lean much more heavily on it for routine work.

I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.

suddenlybananas 1/28/2026||

People have said this about every single model release.

danielvaughn 1/28/2026||

I had the same reaction. So when people were talking about this model back in December, I brushed it off. It wasn't until a couple weeks ago that I decided to try it out, and I immediately saw the difference.

My opinion isn't based on what other people are saying, it's my own experience as a fairly AI-skeptical person. Again, I highly suggest you give it an honest try and decide for yourself.

CSMastermind 1/28/2026|||

Are you using Codex?

I'm not sure how big your repos are but I've been effective working with repos that have thousands of files and tens of thousands of lines of code.

If you're just prototyping it will hit wall when things get unwieldy but that's normally a sign that you need to refactor a bit.

Super strict compiler settings, static analysis, comprehensive tests, and documentation help a lot. As does basic technical design. After a big feature is shipped I do a refactor cycle with the LLM where we do a comprehensive code review and patch things up. This does require human oversight because the LLMs are still lacking judgement on what makes for good code design.

The places where I've seen them be useless is working across repositories or interfacing with things like infrastructure.

It's also very model-dependent. Opus is a good daily driver but Codex is much better are writing tests for some reason. I'll often also switch to it for hard problems that Claude can't solve. Gemini is nice for 'I need a prototype in the next 10 minutes', especially for making quick and dirty bespoke front-ends where you don't care about the design just the functionality.

madhadron 1/28/2026||

> tens of thousands of lines of code

Perhaps this is part of it? Tens of thousands of lines of code seems like a very small repo to me.

keerthiko 1/27/2026|||

Almost always, notes like these are going to be about greenfield projects.

Trying to incorporate it in existing codebases (esp when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.

That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.

reubenmorais 1/27/2026|||

I'm using it on a large set of existing codebases full of extremely ugly legacy code, weird build systems, tons of business logic and shipping directly to prod at neckbreaking growth over the last two years, and it's delivering the same type of value that Karpathy writes about.

jjfoooo4 1/27/2026||||

That was true for me, but is no longer.

It's been especially helpful in explaining and understanding arcane bits of legacy code behavior my users ask about. I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

chrisjj 1/28/2026||

> I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

And how do you verify its output isn't total fabrication?

jjfoooo4 1/28/2026|||

I read through it, scanning sections that seem uncontroversial and reading more closely sections that talk about things I'm less sure about. The output cites key lines of code, which are faster to track down and look at than trying to remember where in a large codebase to look.

Inconsistencies also pop up in backtesting, for example if there's a point that the llm answers different ways in multiple iterations, that's a good candidate to improve docs on.

Similar to a coworker's work, there's a certain amount of trust in the competency involved.

_dark_matter_ 1/28/2026|||

Your docs are a contact. You can verify that contract using integration tests

chrisjj 1/28/2026||

Contract? These docs are information answering user queries. So if you use a chatbot to generate them, I'd like to be reasonably sure they aren't laden with the fabricated misinformation for which these chatbots are famous.

jjfoooo4 1/28/2026||

It's a very reasonable concern. My solution is to have the bot classify what the message is talking about as a first pass, and have a relatively strict filtering about what it responds to.

For example, I have it ignore messages about code freezes, because that's a policy question that probably changes over time, and I have it ignore urgent oncall messages, because the asker there probably wants a quick response from a human.

But there's a lot of questions in the vein of "How do I write a query for {results my service emits}", how does this feature work, where automation can handle a lot (and provide more complete answers than a human can off the top of their head)

chrisjj 1/28/2026||

OK, but little of that applies to this use case, to "then tell it to update the documentation accordingly."

1123581321 1/27/2026|||

These models do well changing brownfield applications that have tests because the constraints on a successful implementation are tight. Their solutions can be automatically augmented by research and documentation.

mh2266 1/28/2026||

I don't exactly disagree with this but I have seen models simply deleting the tests, or updating the tests to pass and declaring the failures were "unrelated to my changes", so it helpfully fixed them

1123581321 1/28/2026|||

I’ve had to deal with this a handful of times. You just have to make it restore the test, or keep trying to pass a suite of explicit red-green method tests it wrote earlier.

hnben 1/28/2026|||

Yes. You have to treat the model like an eager yet incompetent worker, i.e. don't go full yolo mode and review everything they do.

gwd 1/27/2026|||

For me, in just the golang server instance and the core functional package, `cloc` reports over 40k lines of code, not counting other supporting packages. I spent the last week having Claude rip out the external auth system and replace it with a home-grown one (and having GPT-codex review its changes). If anything, Claude makes it easier on me as a solo founder with a large codebase. Rather than having to re-familiarize myself with code I wrote a year ago, I describe it at a high level, point Claude to a couple of key files, and then tell it to figure out what it needs to do. It can use grep, language server, and other tools to poke around and see what's going on. I then have it write an "epic" in markdown containing all the key files, so that future sessions already know the key files to read.

I really enjoyed the process. As TFA says, you have to keep a close eye on it. But the whole process was a lot less effort, and I ended up doing mor than I would otherwise have done.

ph4te 1/27/2026|||

I don't know how big sufficiently large codebase is, but we have a 1mil loc Java application, that is ~10years old, and runs POS systems, and Claude Code has no issues with it. We have done full analyses with output details each module, and also used it to pinpoint specific issues when described. Vibe coding is not used here, just analysis.

fy20 1/28/2026|||

At my dayjob my team uses it on our main dashboard, which is a pretty large CRUD application. The frontend (Vue) is a horrible mess, as it was originally built by people who know just enough to be dangerous. Over time people have introduced new standards without cleaning up the old code - for example, we have three or four different state management techologies.

For this the LLM struggles a bit, but so does a human. The main issues are it messes up some state that it didnt realise was used elsewhere, and out test coverage is not great. We've seen humans make exactly the same kind of mistakes. We use MCP for Figma so most of the time it can get a UI 95% done, just a few tweaks needed by the operator.

On the backend (Typescript + Node, good test coverage) it can pretty much one-shot - from a plan - whatever feature you give it.

We use opus-4.5 mostly, and sometimes gpt-5.2-codex, through Cursor. You aren't going to get ChatGPT (the web interface) to do anything useful, switch to Cursor, Codex or Claude Code. And right now it is worth paying for the subscription, you don't get the same quality from cheaper or free models (although they are starting to catch up, I've had promising results from GLM-4.7).

yasoob 1/28/2026|||

Another personal example. I spent around a month last year in January on this application: https://apps.apple.com/us/app/salam-prayer-qibla-quran/id674...

I had never used Swift before that and was able to use AI to whip up a fairly full-featured and complex application with a decent amount of code. I had to make some cross-cutting changes along the way as well that impacted quite a few files and things mostly worked fine with me guiding the AI. Mind you this was a year ago so I can only imagine how much better I would fare now with even better AI models. That whole month was spent not only on coding but on learning Swift enough to fix problems when AI started running into circles and then learning about Xcode profiler to optimize the application for speed and improving perf.

BeetleB 1/27/2026|||

> They seem mostly useless in a sufficiently large codebase especially when they are messy and interactions aren't always obvious.

What type of documents do you have explaining the codebase and its messy interactions, and have you provided that to the LLM?

Also, have you tried giving someone brand new to the team the exact same task and information you gave to the LLM, and how effective were they compared to the LLM?

> I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

As others have pointed out, from your comment, it doesn't sound like you've used a tool dedicated for AI coding.

(But even if you had, it would still fail if you expect LLMs to do stuff without sufficient context).

smusamashah 1/27/2026|||

The code base I work on at $dayjob$ is legacy, has few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect dots in the code base. Dont think LLMs able to navigate and understand code bases of that size yet. But have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.

jumploops 1/27/2026||

I’ve found that LLMs seem to work better on LLM-generated codebases.

Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.

We call this tech debt.

Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

YMMV here, as it’s hard to do large refactors without tests for correctness.

(Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)

mh2266 1/28/2026|||

are LLM codebases not messy?

Claude by default, unless I tell it not to, will write stuff like:

    // we need something to be true
    somethingPasses = something()
    if (!somethingPasses) {
        return false
    }

    // we need somethingElse to be true
    somethingElsePasses = somethingElse()
    if (!somethingElsePasses) {
        return false
    }

    return true

instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.

generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.

to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.

jumploops 1/28/2026||

Yeah to be clear it will have the same issues as a flyby contributor if prompted to.

Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.

I’ve found the most success in having it reason about the current architecture (explicitly), and then to propose a set of changes to accomplish the task (2-5 ways), review, and then implement the changes that best suit the scope of the larger system.

dexdal 1/28/2026||

The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.

jumploops 1/28/2026||

Yes, exactly.

The LLM is onboarding to your codebase with each context window, all it knows is what it’s seen already.

dexdal 1/28/2026||

Right. Each context window is a partial view, so it cannot “know the codebase” unless you supply stable artefacts. Treat project state as inputs: invariants, interfaces, constraints, and a small set of must-keep facts. Then force changes through a plan and a diff, and gate with tests and checks. That turns context limits into a controlled boundary instead of a surprise.

olig15 1/27/2026|||

Surely because LLM generated code is part of the training data for the model, so code/patterns it can work with is closer to its training data.

tunesmith 1/27/2026|||

If you have a ChatGPT account, there's nothing stopping you from installing codex cli and using your chatgpt account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% codex.

Okkef 1/27/2026|||

Try Claude code. It’s different.

After you tried it, come back.

Imustaskforhelp 1/27/2026||

I think its not Claude code per se itself but rather the (Opus 4.5 model?) or something in an agentic workflow.

I tried a website which offered the Opus model in their agentic workflow & I felt something different too I guess.

Currently trying out Kimi code (using their recent kimi 2.5) for the first time buying any AI product because got it for like 1.49$ per month. It does feel a bit less powerful than claude code but I feel like monetarily its worth it.

Y'know you have to like bargain with an AI model to reduce its pricing which I just felt really curious about. The psychology behind it feels fascinating because I think even as a frugal person, I already felt invested enough in the model and that became my sunk cost fallacy

Shame for me personally because they use it as a hook to get people using their tool and then charge next month 19$ (I mean really Cheaper than claude code for the most part but still comparative to 1.49$)

epolanski 1/27/2026|||

1. Write good documentation, architecture, how things work, code styling, etc.

2. Put your important dependencies source code in the same directory. E.g. put a `_vendor` directory in the project, in it put the codebase at the same tag you're using or whatever: postgres, redis, vue, whatever.

3. Write good plans and requirements. Acceptance criteria, context, user stories, etc. Save them in markdown files. Review those multiple times with LLMs trying to find weaknesses. Then move to implementation files: make it write a detailed plan of what it's gonna change and why, and what it will produce.

4. Write very good prompts. LLMs follow instructions well if they are clear "you should proactively do X", is a weak instruction if you mean "you must do X".

5. LLMs are far from perfect, and full of limits. Karpathy sums their cons very well in his long list. If you don't know their limits you'll mismanage the expectations and not use them when they are a huge boost and waste time on things they don't cope well with. On top of that: all LLMs are different in their "personality", how they adhere to instruction, how creative they are, etc.

datsci_est_2015 1/28/2026|||

Also I never see anyone talking about code reviews, which is one of the primary ways that software engineering departments manage liability. We fired someone recently because they couldn’t explain any of the slop they were trying to get merged. Why tf would I accept the liability of managing code that someone else can’t even explain?

I guess this is fine when you don’t have customers or stakeholders that give a shit lol.

bluGill 1/27/2026|||

I've been trying Claude on my large code base today. When I give it the requirements I'd give an engineer and so "do it" it just writes garbage that doesn't make sense and doesn't seem to even meet the requirements (if it does I can't follow how - though I'll admit to giving up before I understood what it did, and I didn't try it on a real system). When I forced it to step back and do tiny steps - in TDD write one test of the full feature - it did much better - but then I spent the next 5 hours adjusting the code it wrote to meet our coding standards. At least I understand the code, but I'm not sure it is any faster (but it is a lot easier to see things wrong than come up with green field code).

Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.

simonw 1/27/2026|||

Have you tried showing it a copy of your coding standards?

I also find pointing it to an existing folder full of code that conforms to certain standards can work really well.

bflesch 1/27/2026|||

Yeah let's share all your IP for the vague promise that it will somehow work ;)

simonw 1/27/2026|||

You just gave me a revelation as to why some people report being unable to get decent results out of coding agents!

CamperBob2 1/28/2026|||

(Shrug) If you're not willing to make that tradeoff, you'll be outcompeted by people who are. Your call.

bluGill 1/28/2026|||

At least some of them that it violated it has seen.

rob 1/27/2026||||

I've been playing around with the "Superpowers" [0] plugin in Claude Code on a new small project and really like it. Simple enough to understand quickly by reading the GitHub repo and seems to improve the output quality of my projects.

There's basically a "brainstorm" /slash command that you go back and forth with, and it places what you came up with in docs/plans/YYYY-MM-DD-<topic>-design.md.

Then you can run a "write-plan" /slash command on the docs/plans/YYYY-MM-DD-<topic>-design.md file, and it'll give you a docs/plans/YYYY-MM-DD-<topic>-implementation.md file that you can then feed to the "execute-plan" /slash command, where it breaks everything down into batches, tasks, etc, and actually implements everything (so three /slash commands total.)

There's also "GET SHIT DONE" (GSD) [1] that I want to look at, but at first glance it seems to be a bit more involved than Superpowers with more commands. Maybe it'd be better for larger projects.

[0] https://github.com/obra/superpowers

[1] https://github.com/glittercowboy/get-shit-done

gverrilla 1/28/2026|||

it's all about the context. observe what files it opened, etc. good luck

jwr 1/28/2026|||

I successfully use Claude Code in a large complex codebase. It's Clojure, perhaps that helps (Clojure is very concise, expressive and hence token-dense).

culi 1/28/2026||

Perhaps it's harder to "do Closure wrong" than it is to do JavaScript or Python or whatever other extremely flexible multi-paradigm high-level language

wcedmisten 1/28/2026||

Having spent 3 years of my career working with Clojure, I think it actually gives you even more rope to shoot yourself with than Python/JS.

E.g. macros exist in Clojure but not Python/JS, and I've definitely been plenty stumped by seeing them in the codebase. They tend to be used in very "clever" patterns.

On the other hand, I'm a bit surprised Claude can tackle a complex Clojure codebase. It's been a while since I attempted using an LLM for Clojure, but at the time it failed completely (I think because there is relatively little training data compared to other mainstream languages). I'll have to check that out myself

jwr 7 days ago||

This changed a lot over the last year, with an absolute seismic shift with Opus 4.5 (I've been trying regularly).

languid-photic 1/27/2026|||

They build Claude Code fully with Claude Code.

Macha 1/27/2026|||

Which is equal parts praise and damnation. Claude Code does do a lot of nice things that people just kind of don't bother for time cost / reward when writing TUIs that they've probably only done because they're using AI heavily, but equally it has a lot of underbaked edges (like accidentally shadowing the user's shell configuration when it tries to install terminal bindings for shift-enter even though the terminal it's configuring already sends a distinct shift-enter result), and bugs (have you ever noticed it just stop, unfinished?).

simianwords 1/27/2026||

i haven't used Claude Code but come on.. it is a production level quality application used seriously by millions.

gsk22 1/27/2026|||

If you haven't used it, how can you judge its quality level?

xyzsparetimexyz 1/27/2026|||

Look up the flickering issue. The program was created by dunces.

vindex10 1/27/2026|||

Ah, now I understand why @autocomplete suddenly got broken between versions and still not fixed )

redox99 1/27/2026|||

What do you even mean by "ChatGPT"? Copy pasting code into chatgpt.com?

AI assisted coding has never been like that, which would be atrocious. The typical workflow was using Cursor with some model of your choice (almost always an Anthropic model like sonnet before opus 4.5 released). Nowadays (in addition to IDEs) it's often a CLI tool like Claude Code with Opus or Codex CLI with GPT Codex 5.2 high/xhigh.

maxdo 1/27/2026|||

chatGPT is not made to write code. Get out of stone age :)

spaceman_2020 1/27/2026||

I'm afraid that we're entering a time when the performance difference between the really cutting edge and even the three-month-old tools is vast

If you're using plain vanilla chatgpt, you're woefully, woefully out of touch. Heck, even plain claude code is now outdated

shj2105 1/27/2026||

Why is plain Claude code outdated? I thought that’s what most people are using right now that are AI forward. Is it Ralph loops now that’s the new thing?

spaceman_2020 1/27/2026||

Plain Claude Code doesn’t have enough scaffolding to handle large projects

At a base level, people are “upgrading” their Claude Code with custom skills and subagents - all text files saved in .claude/agents|skills.

You can also use their new tasks primitive to basically run a Ralph-like loop

But at the edges, people are using multiple instances, each handling different aspects in parallel - stuff like Gas Town

Tbf you can still get a lot of mileage out of vanilla Claude Code. But I’ve found that even adding a simple frontend design skill improves the output substantially

duckmysick 1/28/2026||

Is there anywhere where we can learn more about creating your own agents/skills? Maybe some decent public repos that you could recommend.

spaceman_2020 1/28/2026||

You can just ask Claude to create them. They’re just markdown files

Anthropic’s own repo is as good place as any

https://github.com/anthropics/skills

gloosx 1/28/2026||

So what is he even coding there all the time?

Does anybody have any info on what he is actually working on besides all the vibe-coding tweets?

There seems to be zero output from they guy for the past 2 years (except tweets)

ayewo 1/28/2026||

> There seems to be zero output from they guy for the past 2 years (except tweets)

Well, he made Nanochat public recently and has been improving it regularly [1]. This doesn't preclude that he might be working on other projects that aren't public yet (as part of his work at Eureka Labs).

1: https://github.com/karpathy/nanochat

gloosx 1/28/2026||

So, it's generative pre-trained transformers again?

beng-nl 1/28/2026|||

He's building Eureka Labs[1], an AI-first education company (can't wait to use it). He's both a strong researcher[2] and an unusually gifted technical communicator. His recent videos[3] are excellent educational material.

More broadly though: someone with his track record sharing firsthand observations about agentic coding shouldn't need to justify it by listing current projects. The observations either hold up or they don't.

[1] https://x.com/EurekaLabsAI

[2] PhD in DL, early OpenAI, founding head of AI at Tesla

[3] https://www.youtube.com/@AndrejKarpathy/videos

direwolf20 1/28/2026||

If LLM coding is a 10x productivity enhancer, why aren't we seeing 10x more software of the same quality level, or 100x as much shitty software?

originalvichy 1/28/2026|||

Helper scripts for APIs for applications and tools I know well. LLMs have made my work bearable. Many software providers expose great apis, but expert use cases require data output/input that relies on 50-500 line scripts. Thanks to the models post gpt4.5 most requirements are solvable in 15 minutes when they could have taken multiple workdays to write and check by hand. The only major gap is safe ad-hoc environments to run these in. I provide these helper functions for clients that would love to keep the runtime in the same data environment as the tool, but not all popular software support FaaS style environments that provide something like a simple python env.

ruszki 1/28/2026|||

I don’t know, but it’s interesting that he and many others come up with this “we should act like LLMs are junior devs”. There is a reason why most junior devs work on fairly separate parts of products, most of the time parts which can be removed or replaced easily, and not an integral part of products: because their code is usually quite bad. Like every few lines contains issues, suboptimal solutions, and full with architectural problems. You basically never trust junior devs with core product features. Yet, we should pretend that an “LLM junior dev” is somehow different. These just signal to me that these people don’t work on serious code.

augment_me 1/28/2026||

This is the first question I ask, and every time I get the answer of some monolith that supposedly solves something. Imo, this is completely fine for any personal thing, I am happy when someone says they made an API to compare weekly shopping prices from the stores around them, or some recipe, this makes sense.

However more often than not, someone is just building a monolithic construction that will never be looked at again. For example, someone found that HuggingFace dataloader was slow for some type of file size in combination with some disk. What does this warrant? A 300000+ line non-reviewed repo to fix this issue. Not a 200-line PR to HuggingFace, no you need to generate 20% of the existing repo and then slap your thing on there.

For me this is puzzling, because what is this for? Who is this for? Usually people built these things for practice, but now its generated, so its not for practice because you made very little effort on it. The only thing I can see that its some type of competence signaling, but here again, if the engineer/manager looking knows that this is generated, it does not have the type of value that would come with such signaling. Either I am naive and people still look at these repos and go "whoa this is amazing", or it's some kind of induced egotrip/delusion where the LLM has convinced you that you are the best builder.

Macha 1/27/2026||

> - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?

Starcraft and Factorio are exactly what it is not. Starcraft has a loooot of micro involved at any level beyond mid level play, despite all the "pro macros and beats gold league with mass queens" meme videos. I guess it could be like Factorio if you're playing it by plugging together blueprint books from other people but I don't think that's how most people play.

At that level of abstraction, it's more like grand strategy if you're to compare it to any video game? You're controlling high level pushes and then the units "do stuff" and then you react to the results.

kridsdale3 1/28/2026||

I think the StarCraft analogy is fine, you have to compare it not to macro and micro RTS play, but to INDIVIDUAL UNITS. For your whole career until now, you have been a single Zergling or Probe. Now you are the Commander.

TheRoque 1/28/2026||

Except that pro starcraft player still micro-manage every single Zergling or probe when necessary, while vibe coders just right click on the ennemy base and hope it'll go well

zetazzed 1/27/2026|||

It's like the Victoria 3 combat system. You just send an army and a general to a given front and let them get to work with no micro. Easy! But of course some percentage of the time they do something crazy like deciding to redeploy from your existential Franco-Prussian war front to a minor colonial uprising...

onetimeusename 1/27/2026|

> the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*

I have a professor who has researched auto generated code for decades and about six months ago he told me he didn't think AI would make humans obsolete but that it was like other incremental tools over the years and it would just make good coders even better than other coders. He also said it would probably come with its share of disappointments and never be fully autonomous. Some of what he said was a critique of AI and some of it was just pointing out that it's very difficult to have perfect code/specs.

slfreference 1/27/2026|

I can sense two classes of coders emerging.

Billionaire coder: a person who has "written" billion lines.

Ordinary coders : people with only couple of thousands to their git blame.

More comments...