
Posted by bigwheels 1/26/2026

A few random notes from Claude coding quite a bit the last few weeks (twitter.com)
https://xcancel.com/karpathy/status/2015883857489522876
911 points | 847 comments
kshri24 1/28/2026|
Agree with Karpathy's take. Finally, a down-to-earth analysis from a respected source in the AI space. I guess I'll be using slopocalypse a lot more now :)

> I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media

It has arrived. GitHub will be most affected, thanks to the git-terrorists at Apna College refusing to take down that stupid tutorial. IYKYK.

ActorNightly 1/28/2026|
The respect is unwarranted.

He ran Tesla's ML division, but still doesn't know what a simple Kalman filter is (in the sense that he claimed lidar would be hard to integrate with cameras).

akoboldfrying 1/28/2026||
The Kalman filter examples I've seen always involve estimating a very simple quantity, like the location of a single 3D point, from noisy sensors. It's clear how multiple estimates can be combined into a new estimate.

I'd guess that cameras on a self-driving car are trying to estimate something much more complex, something like 3D surfaces labeled with categories ("person", "traffic light", etc.). It's not obvious to me how estimates of such things from multiple sensors and predictions can be sensibly and efficiently combined to produce a better estimate. For example, what if there is a near red object in front of a distant red background, so that the camera estimates just a single object, but the lidar sees two?

ActorNightly 1/28/2026||
https://www.bzarg.com/p/how-a-kalman-filter-works-in-picture...

The Kalman filter's basic concept is essentially this:

1. Make a prediction of the next state of some measurable n-dimensional quantity, and estimate the covariance matrix across those n dimensions, which essentially describes how likely the i-th dimension is to increase (or decrease) together with the j-th dimension, where i and j are indices into the vector (between 0 and n).

2. Gather sensor data (which can be noisy), and reconcile the predicted measurement with the measured one to get the best guess. The covariance matrix acts as a kind of weight for each of the elements.

3. Update the covariance matrix based on the measurements in the previous step.
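
To make those three steps concrete, here is a minimal toy sketch (my own illustration, not anything from Karpathy or Tesla): a 1D constant-velocity filter estimating position and velocity from noisy position readings.

    import numpy as np

    # State x = [position, velocity]; we only measure position, and noisily.
    F = np.array([[1.0, 1.0],    # transition: position += velocity (dt = 1)
                  [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])   # measurement model: we observe position only
    Q = np.eye(2) * 0.01         # process noise (how much we trust the prediction)
    R = np.array([[0.5]])        # measurement noise (how much we trust the sensor)

    x = np.array([0.0, 1.0])     # initial state estimate
    P = np.eye(2)                # initial covariance

    def kalman_step(x, P, z):
        # 1. Predict the next state and covariance
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # 2. Reconcile prediction with measurement z; the gain K weights each source
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_new = x_pred + K @ (z - H @ x_pred)
        # 3. Update the covariance for the next iteration
        P_new = (np.eye(2) - K @ H) @ P_pred
        return x_new, P_new

    for z in [1.1, 1.9, 3.2, 3.8, 5.1]:  # noisy position readings
        x, P = kalman_step(x, P, np.array([z]))

The same machinery works when x is a much larger vector; only F, H, and the covariances grow.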

You can do this for any vector of numbers. For example, instead of tracking individual objects, you can have a grid where each element represents a physical object that the car should not drive into, with a value representing the certainty of that object being there. Then when you combine sensor readings, you can still use your vision model, but that model would be enhanced by what lidar detects, both in terms of seeing things the camera doesn't pick up and rejecting things that aren't there.
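
One common way to implement that grid idea is a log-odds occupancy update, where independent evidence from each sensor simply adds; a rough sketch (again my own, not a claim about how any particular self-driving stack does it):

    import numpy as np

    grid = np.zeros((100, 100))          # log-odds of "occupied"; 0 means 50/50

    def fuse(grid, p_detect, eps=1e-6):
        # Bayesian update in log-odds space: independent evidence just adds
        p = np.clip(p_detect, eps, 1 - eps)
        return grid + np.log(p / (1 - p))

    camera_p = np.full((100, 100), 0.5)  # camera: no opinion anywhere...
    camera_p[40:45, 40:45] = 0.9         # ...except a likely obstacle here
    lidar_p = np.full((100, 100), 0.5)
    lidar_p[40:45, 40:45] = 0.8          # lidar agrees, with its own noise profile

    grid = fuse(grid, camera_p)
    grid = fuse(grid, lidar_p)
    occupancy = 1 / (1 + np.exp(-grid))  # back to probability
    print(occupancy[42, 42])             # ~0.97: two uncertain yeses make a strong yes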

And the concept is generic enough that you can set up a system to plug in any additional sensor with its own noise, and it all works out in the end. This is used all the time. You can even extend the concept past Gaussian noise and linearity; there are a number of other filters that deal with that, broadly under the umbrella of sensor fusion.

The problem is that Karpathy is more of a computer scientist, so he is on his Software 2.0 train of having ML models do everything. I don't know if he is like that himself or whether Musk's "I'm smarter than everyone who came before me" attitude rubbed off.

And of course when you think like that, it's going to be difficult to integrate lidar into the model. But the problem with that thinking is that a forward-inference LLM is not AI, and it will never be able to drive a car well compared to a true "reasoning" AI with feedback loops.

einrealist 1/27/2026||
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.

Somewhere, there are GPUs/NPUs running hot. You send all the necessary data, including information that you would never otherwise share. And you most likely do not pay the actual costs. It might become cheaper or it might not, because reasoning is a sticking plaster on the accuracy problem. You and your business become dependent on this major gatekeeper. It may seem like a good trade-off today. However, the personal, professional, political and societal issues will become increasingly difficult to overlook.

daxfohl 1/27/2026||
I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broke an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing.

Then even if you do catch it, AI: "ah, now I see exactly the problem. just insert a few more coins and I'll fix it for real this time, I promise!"

gtowey 1/27/2026|||
The value extortion plan writes itself. How long before someone pitches the idea that the models should explicitly almost solve your problem, over and over, to keep you spending? Would you even know?
fragmede 1/27/2026||
The free market proposition is that competition (especially with Chinese labs and Grok) means that Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAI such that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one-shot an issue with Qwen or DeepSeek or Kimi Thinking, Anthropic loses not just my monthly subscription, but everyone else's I show that to. So no, I think that's some grade-A conspiracy theory nonsense you've got there.
charcircuit 1/27/2026|||
You are using it wrong, or are using a weak model, if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction.
testaccount28 1/27/2026||
you are using it on easy questions. some of us are not.
fooker 1/27/2026|||
> It might become cheaper or it might not

If it does not, this is going to be the first technology in the history of mankind that has not become cheaper.

(But anyway, it already costs half of what it did last year.)

peaseagee 1/27/2026|||
That's not true. Many technologies get more expensive over time, as labor gets more expensive or as certain skills fall by the wayside; not everything is mass market. Have you tried getting a grandfather clock repaired lately?
willio58 1/27/2026|||
Repairing grandfather clocks isn't more expensive now because it's gotten any harder; it's because the popularity of grandfather clocks is basically nonexistent compared to anything else that tells time.
esafak 1/27/2026||||
Instead of advancing tenuous examples you could suggest a realistic mechanism by which costs could rise, such as a Chinese advance on Taiwan affecting TSMC, etc.
simianwords 1/27/2026||||
"repairing a unique clock" getting costlier doesn't mean technology hasn't gotten cheaper.

check out whether clocks have gotten cheaper in general. the answer is that it has.

there is no economy of scale here in repairing a single clock. its not relevant to bring it up here.

groby_b 1/27/2026|||
No. You don't get to make "technology gets more expensive over time" statements about deprecated technologies.

Getting a bespoke flintstone axe is also pretty expensive, and has absolutely no relevance to modern life.

These discussions must, if they are to be useful, center on the population-level experience, not on unique personal moments.

ctoth 1/27/2026||||
> But anyway, it already costs half compared to last year

You could not have bought Claude Opus 4.5 at any price one year ago, I'm quite certain. The things that were available then cost half of what they did, and there are new things available. These are both true.

I'm agreeing with you, to be clear.

There are two pieces I expect to continue: inference for existing models will continue to get cheaper. Models will continue to get better.

Three things, actually.

The "hitting a wall" / "plateau" people will continue to be loud and wrong. Just as they have been since 2018[0].

[0]: https://blog.irvingwb.com/blog/2018/09/a-critical-appraisal-...

simianwords 1/27/2026||
interesting post. i wonder if these people go back and introspect on how incorrect they have been? do they feel the need to address it?
fooker 1/27/2026||
No, people do not do that.

This is harmless when it comes to tech opinions but causes real damage in politics and activism.

People get really attached to ideals and ideas, and keep sticking to those after they fail to work again and again.

simianwords 1/27/2026||
i don't think it is harmless; otherwise we are incentivising people to just say whatever they want without any care for truth. people's reputations should be attached to their predictions.
InsideOutSanta 1/27/2026|||
Sure, running an LLM is cheaper, but the way we use LLMs now requires way more tokens than last year.
fooker 1/27/2026|||
10x more tokens today cost less than half of X tokens from ~mid 2024.
simianwords 1/27/2026|||
ok but the capabilities are also rising. what point are you trying to make?
oytis 1/27/2026||
That it's not getting cheaper?
jstummbillig 1/27/2026|||
But it is, capability adjusted, which is the only way it makes sense. You can definitely produce last years capability at a huge discount.
simianwords 1/27/2026|||
you are wrong. https://epoch.ai/data-insights/llm-inference-price-trends

this is accounting for the fact that more tokens are used.

YetAnotherNick 1/27/2026||
With optimizations and new hardware, power is almost a negligible cost. You can get 5.5M tokens/s/MW[1] for Kimi K2 (≈20M tokens/kWh ≈ 181M tokens/$), which is 400x cheaper than current pricing. It's just Nvidia/TSMC/other manufacturers eating up the profit now because they can. My bet is that China will match current Nvidia within 5 years.

[1]: https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...
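
For what it's worth, the conversion works out if you assume electricity at roughly $0.11/kWh (my assumption; the figure isn't stated above):

    tokens_per_s_per_mw = 5.5e6
    tokens_per_kwh = tokens_per_s_per_mw * 3600 / 1000   # ~19.8M tokens per kWh
    price_per_kwh = 0.11                                  # assumed electricity price in $/kWh
    tokens_per_dollar = tokens_per_kwh / price_per_kwh    # ~180M tokens per dollar
    print(f"{tokens_per_kwh:.2e} tokens/kWh, {tokens_per_dollar:.2e} tokens/$")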

bob1029 1/28/2026||
I would agree that OAI's GPT-5 family of models is a phase change over GPT-4.

In the ChatGPT product this is not immediately obvious and many people would strongly argue their preference for 4. However, once you introduce several complex tools and make tool calling mandatory, the difference becomes stark.

I've got an agent loop that will fail nearly every time on GPT-4. It works sometimes, but definitely not enough to go to production. GPT-5 with reasoning set to minimal works 100% of the time. $200 worth of tokens and it still hasn't failed to select the proper sequence of tools. It sometimes gets the arguments to the tools incorrect, but it's always holding the right ones now.
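
To make "tool calling mandatory" concrete, the loop looks roughly like this sketch (a simplified illustration using the OpenAI Python SDK; the tool, prompt, and model name are placeholders, not the actual production setup):

    import json
    from openai import OpenAI

    client = OpenAI()

    tools = [{
        "type": "function",
        "function": {
            "name": "lookup_account",   # placeholder tool
            "description": "Fetch an account record by id.",
            "parameters": {
                "type": "object",
                "properties": {"account_id": {"type": "string"}},
                "required": ["account_id"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Close out account 42 and summarize it."}]

    resp = client.chat.completions.create(
        model="gpt-5",             # or whichever model you're comparing
        messages=messages,
        tools=tools,
        tool_choice="required",    # force the model to pick a tool every turn
    )
    call = resp.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)   # argument mistakes show up here
    # ...run the tool, append a {"role": "tool", ...} message, and loop until done
    # (a real loop needs a "finish" tool or a relaxed tool_choice to terminate)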

I was very skeptical based on prior experience, but flipping between the models makes it clear there has been recent stepwise progress.

I'll probably be $500 deep in tokens before the end of the month. I could barely go $20 before I called bullshit on this stuff last time.

alansaber 1/28/2026||
Pretty sure there wasn't extensive training on tooling beforehand. I mean, god, during GPT-3 even getting reliable JSON output was a battle, and there were dedicated packages for JSON inference.
theshrike79 1/28/2026||
Now imagine local models with 95%+ reliable tool calling; you can do insane things when that's the reality.
oxag3n 1/27/2026||
> Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually...

> Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it.

Until you struggle to review it as well. A simple exercise to prove it: ask an LLM to write a function in a familiar programming language, but in an area you haven't invested in learning and coding yourself. Try reviewing some code involving embeddings/SIMD/FPGAs without learning it first.

sleazebreeze 1/27/2026|
People would struggle to review code in a completely unfamiliar domain or part of the stack even before LLMs.
piskov 1/28/2026|||
That's why you need to write code to learn it.

No one has ever learned a skill just by reading/observing.

sponaugle 1/28/2026||
"No-one has ever learned skill just by reading/observing" - Except of course all of those people in Cosmology who, you know, observe.
direwolf20 1/28/2026||
what skill do they have? making stars? no, they are skilled at observing, which is what they do.
sponaugle 1/28/2026||
I think understanding stellar processes and then using that understanding to theorize about other observations is a skill. My point was that observing can be a fantastic way to build a skill.. not all skills, but certainly some skills. Learning itself is as much an observation as a practice.
AstroBen 1/28/2026||||
How would you find yourself in that situation before AI?
chrisjj 1/28/2026|||
No, because they wouldn't be so foolish as to try it.
philipwhiuk 1/27/2026||
> It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later.

The bits left unsaid:

1. Burning tokens, which we charge you for

2. My CPU does this when I tell it to run bogosort on a million 32-bit integers; it doesn't mean it's a good thing

vinhnx 1/28/2026||
Boris Cherny (Claude Code creator) replies to Andrej Karpathy

https://xcancel.com/bcherny/status/2015979257038831967

porise 1/27/2026||
I wish the people who wrote this would let us know what kind of codebases they are working on. They seem mostly useless in a sufficiently large codebase, especially when it's messy and interactions aren't always obvious. I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.
CameronBanga 1/27/2026||
This is an antidotal example, but I released this last week after 3 months of work on it as a "nights and weekends" project: https://apps.apple.com/us/app/skyscraper-for-bluesky/id67541...

I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kinda a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.

That code base is ~98% Claude code.

bee_rider 1/27/2026||
I don’t know if “antidotal example” is a pun or a typo but I quite like it.
CameronBanga 1/27/2026|||
Lol typing on my phone during lunch and meant anecdotal. But let's leave it anyways. :)
oasisbob 1/27/2026|||
That is fun.

Not sure if it's an American pronunciation thing, but I had to stare at that long and hard to see the problem and even after seeing it couldn't think of how you could possibly spell the correct word otherwise.

keerthiko 1/27/2026|||
Almost always, notes like these are going to be about greenfield projects.

Trying to incorporate it into existing codebases (esp. when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.

That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.

reubenmorais 1/27/2026|||
I'm using it on a large set of existing codebases full of extremely ugly legacy code, weird build systems, and tons of business logic, shipping directly to prod at breakneck growth over the last two years, and it's delivering the same type of value Karpathy writes about.
jjfoooo4 1/27/2026||||
That was true for me, but is no longer.

It's been especially helpful in explaining and understanding arcane bits of legacy code behavior my users ask about. I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

1123581321 1/27/2026|||
These models do well changing brownfield applications that have tests because the constraints on a successful implementation are tight. Their solutions can be automatically augmented by research and documentation.
danielvaughn 1/27/2026|||
It's important to understand that he's talking about a specific set of models that were released around November/December, and that we've hit a kind of inflection point in model capabilities. Specifically, Anthropic's Opus 4.5 model.

I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference, it's more like it just finally hit that quantitative edge that allows me to lean much more heavily on it for routine work.

I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.

bluGill 1/27/2026|||
I've been trying Claude on my large code base today. When I give it the requirements I'd give an engineer and say "do it", it just writes garbage that doesn't make sense and doesn't seem to even meet the requirements (if it does, I can't follow how, though I'll admit to giving up before I understood what it did, and I didn't try it on a real system). When I forced it to step back and take tiny steps - in TDD, write one test of the full feature - it did much better, but then I spent the next 5 hours adjusting the code it wrote to meet our coding standards. At least I understand the code, but I'm not sure it is any faster (though it is a lot easier to see what's wrong than to come up with greenfield code).

Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.

simonw 1/27/2026||
Have you tried showing it a copy of your coding standards?

I also find pointing it to an existing folder full of code that conforms to certain standards can work really well.

ph4te 1/27/2026|||
I don't know how big "sufficiently large" is, but we have a 1M LOC Java application that is ~10 years old and runs POS systems, and Claude Code has no issues with it. We have done full analyses with output detailing each module, and also used it to pinpoint specific issues when described. Vibe coding is not used here, just analysis.
TaupeRanger 1/27/2026|||
Claude Code and Codex are CLI tools you use to give the LLM context about the project on your local machine or dev environment. The fact that you're using the name "ChatGPT" instead of Codex leads me to believe you're talking about using the web-based ChatGPT interface to work on a large codebase, which is completely beside the point of the entire discussion. That's not the tool anyone is talking about here.
tunesmith 1/27/2026|||
If you have a ChatGPT account, there's nothing stopping you from installing the Codex CLI and using your ChatGPT account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with Codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% Codex.
spaceman_2020 1/27/2026|||
I'm afraid that we're entering a time when the performance difference between the really cutting-edge tools and even three-month-old tools is vast.

If you're using plain vanilla ChatGPT, you're woefully, woefully out of touch. Heck, even plain Claude Code is now outdated.

shj2105 1/27/2026||
Why is plain Claude Code outdated? I thought that's what most people who are AI-forward are using right now. Is it Ralph loops that's the new thing now?
Okkef 1/27/2026|||
Try Claude Code. It's different.

After you've tried it, come back.

Imustaskforhelp 1/27/2026||
I think it's not Claude Code per se but rather the model (Opus 4.5?) or something about the agentic workflow.

I tried a website which offered the Opus model in their agentic workflow and I felt something different too, I guess.

Currently trying out Kimi Code (using their recent Kimi 2.5), the first time I've bought any AI product, because I got it for like $1.49 per month. It does feel a bit less powerful than Claude Code, but I feel like monetarily it's worth it.

Y'know, you have to sort of bargain with an AI model to reduce its pricing, which I just felt really curious about. The psychology behind it feels fascinating, because even as a frugal person, I already felt invested enough in the model that it became my sunk cost fallacy.

Shame for me personally, because they use it as a hook to get people using their tool and then charge $19 the next month (I mean, still cheaper than Claude Code for the most part, but steep compared to $1.49).

languid-photic 1/27/2026|||
They build Claude Code fully with Claude Code.
Macha 1/27/2026||
Which is equal parts praise and damnation. Claude Code does a lot of nice things that people writing TUIs usually don't bother with given the time cost vs. reward, things they've probably only done because they're using AI heavily. But equally it has a lot of underbaked edges (like accidentally shadowing the user's shell configuration when it tries to install terminal bindings for shift-enter, even though the terminal it's configuring already sends a distinct shift-enter) and bugs (have you ever noticed it just stop, unfinished?).
simianwords 1/27/2026||
i haven't used Claude Code but come on.. it is a production-quality application used seriously by millions.
maxdo 1/27/2026||
ChatGPT is not made to write code. Get out of the stone age :)
gloosx 1/28/2026||
So what is he even coding there all the time?

Does anybody have any info on what he is actually working on besides all the vibe-coding tweets?

There seems to be zero output from the guy for the past 2 years (except tweets)

ayewo 1/28/2026||
> There seems to be zero output from the guy for the past 2 years (except tweets)

Well, he made Nanochat public recently and has been improving it regularly [1]. This doesn't preclude that he might be working on other projects that aren't public yet (as part of his work at Eureka Labs).

1: https://github.com/karpathy/nanochat

gloosx 1/28/2026||
So, it's generative pre-trained transformers again?
beng-nl 1/28/2026|||
He's building Eureka Labs[1], an AI-first education company (can't wait to use it). He's both a strong researcher[2] and an unusually gifted technical communicator. His recent videos[3] are excellent educational material.

More broadly though: someone with his track record sharing firsthand observations about agentic coding shouldn't need to justify it by listing current projects. The observations either hold up or they don't.

[1] https://x.com/EurekaLabsAI

[2] PhD in DL, early OpenAI, founding head of AI at Tesla

[3] https://www.youtube.com/@AndrejKarpathy/videos

direwolf20 1/28/2026||
If LLM coding is a 10x productivity enhancer, why aren't we seeing 10x more software of the same quality level, or 100x as much shitty software?
originalvichy 1/28/2026|||
Helper scripts for APIs of applications and tools I know well. LLMs have made my work bearable. Many software providers expose great APIs, but expert use cases require data input/output that relies on 50-500 line scripts. Thanks to the models post GPT-4.5, most requirements are solvable in 15 minutes when they could have taken multiple workdays to write and check by hand. The only major gap is safe ad-hoc environments to run these in. I provide these helper functions for clients that would love to keep the runtime in the same data environment as the tool, but not all popular software supports FaaS-style environments that provide something like a simple Python env.
ruszki 1/28/2026|||
I don't know, but it's interesting that he and many others come up with this "we should act like LLMs are junior devs" framing. There is a reason why most junior devs work on fairly separate parts of products, most of the time parts which can be removed or replaced easily, and not on an integral part of the product: their code is usually quite bad. Like, every few lines contain issues and suboptimal solutions, and it's full of architectural problems. You basically never trust junior devs with core product features. Yet we should pretend that an "LLM junior dev" is somehow different. This just signals to me that these people don't work on serious code.
augment_me 1/28/2026||
This is the first question I ask, and every time I get the answer of some monolith that supposedly solves something. Imo, this is completely fine for any personal thing; I am happy when someone says they made an API to compare weekly shopping prices from the stores around them, or some recipe app. That makes sense.

However, more often than not, someone is just building a monolithic construction that will never be looked at again. For example, someone found that the HuggingFace dataloader was slow for some combination of file size and disk. What does this warrant? A 300,000+ line non-reviewed repo to fix the issue. Not a 200-line PR to HuggingFace; no, you need to generate 20% of the existing repo and then slap your thing on there.

For me this is puzzling, because what is this for? Who is this for? Usually people built these things for practice, but now it's generated, so it's not for practice, because you made very little effort on it. The only thing I can see is that it's some type of competence signaling, but here again, if the engineer/manager looking knows that it's generated, it does not have the value that would come with such signaling. Either I am naive and people still look at these repos and go "whoa, this is amazing", or it's some kind of induced ego trip/delusion where the LLM has convinced you that you are the best builder.

Macha 1/27/2026||
> - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music?

Starcraft and Factorio are exactly what it is not. Starcraft has a loooot of micro involved at any level beyond mid-level play, despite all the "pro macros and beats gold league with mass queens" meme videos. I guess it could be like Factorio if you're playing it by plugging together blueprint books from other people, but I don't think that's how most people play.

At that level of abstraction, it's more like grand strategy if you're to compare it to any video game? You're controlling high level pushes and then the units "do stuff" and then you react to the results.

hombre_fatal 1/27/2026|
Interpret it more abstractly.

In Starcraft, you decide if an SCV should mine minerals or build a bunker or repair a depot.

Imagine if instead it were a UI over agents that you could assign to different tasks or use to build new pipelines.

I don't think stutter-stepping marines to min/max combat was the metaphor he was going for.

onetimeusename 1/27/2026|
> the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*

I have a professor who has researched auto-generated code for decades, and about six months ago he told me he didn't think AI would make humans obsolete, but that it was like other incremental tools over the years: it would just make good coders even better than other coders. He also said it would probably come with its share of disappointments and never be fully autonomous. Some of what he said was a critique of AI, and some of it was just pointing out that it's very difficult to have perfect code/specs.

slfreference 1/27/2026|
I can sense two classes of coders emerging.

Billionaire coder: a person who has "written" a billion lines.

Ordinary coders: people with only a couple of thousand lines to their git blame.
