
Posted by __rito__ 12/10/2025

Auto-grading decade-old Hacker News discussions with hindsight(karpathy.bearblog.dev)
Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632
686 points | 270 comments | page 2
MBCook 12/10/2025|
#272, I got a B+! Neat.

It would be very interesting to see this applied year after year, to track whether the accuracy of people's judgments improves or declines over time.

It would also be interesting to correlate accuracy with scores, but I kind of doubt that can be done. Between comments that just express popular sentiment, and early posters getting more votes than later ones for the same comment, it probably wouldn't be very useful data.

pjc50 12/10/2025|
#250, but then I wasn't trying to make predictions for a future AI. Or anyone else, really. Got a high score mostly for status quo bias, e.g. visual languages going nowhere and FPGAs remaining niche.
embedding-shape 12/10/2025||
Yeah, it'd be much more interesting to see the people who made (at the time) outrageous claims that came to be true, rather than a list of people who stated that the status quo would most likely stay as it is.
nixpulvis 12/11/2025||
Quick give everyone colors to indicate their rank here and ban anyone with a grade less than C-.

Seriously, while I find this cool and interesting, I also fear how these sorts of things will work out for us all.

Sophira 12/11/2025||
It somehow feels right to see what GPT-5 thinks of the article titled "Machine learning works spectacularly well, but mathematicians aren’t sure why" and its discussion: https://karpathy.ai/hncapsule/2015-12-04/index.html#article-...
DonHopkins 12/11/2025||
I'd love to see an "Annie Hall" analysis of hn posts, for incidents where somebody says something about some piece of software or whatever, and the person who created it replies, like Marshall McLuhan stepping out from behind a sign in Annie Hall.

https://www.youtube.com/watch?v=vTSmbMm7MDg

moultano 12/10/2025||
Notable how this is only possible because the website is a good "web citizen." It has urls that maintain their state over a decade. They contain a whole conversation. You don't have to log in to see anything. The value of old proper websites increases with our ability to process them.
chrisweekly 12/10/2025||
Yes! See "Cool URIs Don't Change"^1 by Sir TBL himself.

1. https://www.w3.org/Provider/Style/URI

dietr1ch 12/10/2025|||
> because the website is a good "web citizen." It has urls that maintain their state over a decade.

It's a shame that maintaining the web is so hard that only a few websites are "good citizens". I wish the web was a -bit- way more like git. It should be easier to crawl the web and serve it.

Say, you browse and get things cached and shared, but only your "local bookmarks" persist. I guess it's like pinning in IPFS.

moultano 12/10/2025|||
Yes, I wish we could serve static content more like BitTorrent, where your URI has an associated hash, and any intermediate router or cache could be an equivalent source of truth, with the final server only needing to play a role if nothing else has it.

It is not possible right now to make hosting democratized/distributed/robust because there's no way for people to donate their own resources in a seamless way to keeping things published. In an ideal world, the internet archive seamlessly drops in to serve any content that goes down in a fashion transparent to the user.
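The content-addressed idea described above can be sketched in a few lines. This is a toy illustration, not how IPFS or BitTorrent actually encode things (real systems use multihash CIDs, chunking, and DHTs); plain SHA-256 stands in for the address here:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Address = a hash of the bytes themselves, so the address
    pins the content rather than a particular server."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, address: str) -> bool:
    # A client can accept these bytes from ANY source -- a router
    # cache, a peer, or the origin server -- and check them locally.
    return content_address(data) == address

page = b"<html><body>Cool URIs don't change</body></html>"
addr = content_address(page)
```

Any mirror (the Internet Archive included) that can produce bytes matching `addr` is as authoritative as the origin, which is the property the comment is asking for.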

oncallthrow 12/10/2025||
This is IPFS
shpx 12/11/2025|||
In my experience from the couple of times I clicked an IPFS link years ago, it loaded for a long time and never actually loaded anything, failing the first "I wish we could serve static content" part.

If you make it possible for people to donate bandwidth you might just discover no one wants to.

dietr1ch 12/11/2025||
I think many people could toss an almost-always-online Raspberry Pi in their homes, and that's probably enough to sustain a decently good distributed CAS (content-addressed storage) network for sharing small text files.

The wanting-to is, in my mind, the harder part. How do you convince people that having the network is valuable enough? It's easy to compare it with the web backed by a few fiefdoms that offer, for the most part, really good performance, availability, and somewhat good discovery.

drdec 12/10/2025||||
> It's a shame that maintaining the web is so hard that only a few websites are "good citizens"

It's not hard actually. There is a lack of will and forethought on the part of most maintainers. I suspect that monetization also plays a role.

DANmode 12/10/2025|||
Let Reddit and friends continue to out themselves for who they are.

Keeps the spotlight on carefully protected communities like this one.

jeffbee 12/10/2025||
There are things that you have to log in to see, and the mods sometimes move conversations from one place to another, and also, for some reason, whole conversations get reset to a single timestamp.
embedding-shape 12/10/2025|||
> and the mods sometimes move conversations from one place to another

This only manipulates the children references though, never the item ID itself. So if you have the item ID of an item (submission, comment, poll, pollItem), it'll be available there as long as moderators don't remove it, which happens very seldom.
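For what it's worth, the official HN Firebase API exposes exactly these stable IDs; moving a thread rewrites the surrounding `kids`/`parent` references, but the item's own `id` and URL never change. A minimal sketch (the `fetch_item` wrapper is just illustrative; the endpoint itself is the documented one):

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def item_api_url(item_id: int) -> str:
    # One ID space for every item type: story, comment, poll, pollopt.
    return f"{API}/item/{item_id}.json"

def fetch_item(item_id: int) -> dict:
    """Fetch an item by its stable numeric ID. Returns fields like
    id, type, by, time, parent, and kids."""
    with urllib.request.urlopen(item_api_url(item_id)) as resp:
        return json.load(resp)
```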

latexr 12/10/2025|||
> for some reason, whole conversations get reset to a single timestamp.

What do you mean?

embedding-shape 12/10/2025|||
Submissions put in the second-chance pool briefly appear (sometimes "again") on the frontpage, and the conversation timestamps are reset so it appears like they were written after the second-chance submission, not before.
Y_Y 12/10/2025||
I never noticed that. What a weird lie!

I suppose they want to make the comments seem "fresh" but it's a deliberate misrepresentation. You could probably even contrive a situation where it could be damaging, e.g. somebody says something before some relevant incident, but the website claims they said it afterwards.

embedding-shape 12/10/2025||
I think the reason is much simpler than that. Resetting the timestamp lets them easily resurface things on the frontpage: the current-time minus posting-time delta becomes much smaller, so the item ranks higher again. It also avoids adding a special case, letting the rest of the codebase work exactly as before; you basically just add a "set submission time to now" function and get the rest for free.

But, I'm just guessing here based on my own refactoring experience through the years, may be a completely different reason, or even by mistake? Who knows? :)
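The commonly cited approximation of HN's ranking formula shows why a timestamp reset alone is enough. The gravity exponent and the omission of penalties are assumptions here, not HN's actual code:

```python
def rank_score(points: int, age_hours: float, gravity: float = 1.8) -> float:
    """Widely cited approximation of HN's front-page ranking:
    (points - 1) / (age_hours + 2) ** gravity.
    The real algorithm adds various penalties; this is just the shape."""
    return (points - 1) / (age_hours + 2) ** gravity

# A day-old story has decayed far down the ranking...
old = rank_score(points=100, age_hours=24)

# ...but resetting its timestamp to "now" restores a front-page score,
# with no other change needed anywhere in the ranking code.
resurfaced = rank_score(points=100, age_hours=0)
```

Under this formula the reset alone boosts the score by roughly two orders of magnitude, consistent with the "you get the rest for free" guess above.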

jeffbee 12/10/2025|||
There is some action that moderators can take that throws one of yesterday's articles back on the front page and when that happens all the comments have the same timestamp.
consumer451 12/10/2025||
I believe that this is called "the second chance pool." It is a bit strange when it unexpectedly happens to one's own post.
Tossrock 12/11/2025||
So where do I collect my prize for this 2015 comment? https://news.ycombinator.com/item?id=9882217
johncolanduoni 12/11/2025|
Never call a man happy until he is dead. Also I don’t think your argument generalizes well - there are plenty of private research investment bubbles that have popped and not reached their original peaks (e.g. VR).
Tossrock 12/11/2025||
It wasn't a generalized argument, though, it was a specific one, about AI.
johncolanduoni 12/11/2025|||
Okay, but the only part that’s specific to AI (that the companies investing the money are capturing more value than they’re putting into it) is now false. Even the hyperscalers are not capturing nearly the value they’re investing, though they’re not using debt to finance it. OpenAI and Anthropic are of course blowing through cash like it’s going out of style, and if investor interest drops drastically they’ll likely need to look to get acquired.
xpe 12/11/2025|||
Here is one sentence from the referenced prediction:

> I don't think there will be any more AI winters.

This isn't enough to qualify as a testable prediction, in the eyes of people who care about such things, because there is no good way to formulate resolution criteria for a claim that extends indefinitely into the future. See [1] for a great introduction.

[1]: https://www.astralcodexten.com/p/prediction-market-faq

scosman 12/10/2025||
Anyone have a branch that I can run to target my own comments? I'd love to see where I was right and where I was off base. Seems like a genuinely great way to learn about my own biases.
xpe 12/11/2025|
I appreciate your intent, but this tool needs a lot of work -- maybe an entire redesign -- before it would be suitable for the purpose you seek. See discussion at [1].

Besides, in my experience, only a tiny fraction of HN comments can be interpreted as falsifiable predictions.

Instead I would recommend learning about calibration [2] and ways to improve one's calibration, which will likely lead you into literature reviews of cognitive biases and what we can do about them. Also, jumping into some prediction markets (as long as they don't become too much of a distraction) is good practice.

[1]: https://news.ycombinator.com/item?id=46223959

[2]: https://www.lesswrong.com/w/calibration

xpe 12/11/2025||
Many people are impressed by this, and I can see why. Still, this much isn't surprising: the Karpathy + LLM combo can deliver quickly. But there are downsides of blazing speed.

If you dig in, there are substantial flaws in the project's analysis and framing: how a prediction is defined, how comments are assessed, overall data quality, and more. Go spelunking through the comments here and notice people asking about methodology and checking the results.

Social science research isn't easy; it requires training, effort, and patience. I would be very happy if Karpathy added a Big Flashing Red Sign to this effect. It would raise awareness and focus community attention on what I think are the hardest and most important aspects of this kind of project: methodology, rigor, criticism, feedback, and correction.

GaggiX 12/10/2025||
I think the most fun thing is to go to: https://karpathy.ai/hncapsule/hall-of-fame.html

And scroll down to the bottom.

MBCook 12/10/2025||
It’s interesting, if you go down near the bottom you see some people with both A’s and D’s.

According to the ratings for example, one person both had extremely racist ideas but also made a couple of accurate points about how some tech concepts would evolve.

brian_spiering 12/10/2025||
That is interesting because of the halo effect: the cognitive bias where, if a person is right in one area, we assume they will be right in another, unrelated area.

I try to temper my tendency to believe the Halo effect with Warren Buffett's notion of the Circle of Competence; there is often a very narrow domain where any person can be significantly knowledgeable.

xpe 12/14/2025|||
> A circle of competence is the subject area which matches a person's skills or expertise. The concept was developed by Warren Buffett and Charlie Munger as what they call a mental model, a codified form of business acumen, concerning the investment strategy of limiting one's financial investments in areas where an individual may have limited understanding or experience, while concentrating in areas where one has the greatest familiarity. -Wikipedia

> I try to temper my tendency to believe the Halo effect with Warren Buffett's notion of the Circle of Competence; there is often a very narrow domain where any person can be significantly knowledgeable. (commenter above)

Putting aside Buffett in particular, I'm wary of claims like "there is often a very narrow domain where any person can be significantly knowledgeable". How often? How narrow of a domain? Doesn't it depend on arbitrary definitions of what qualifies as a category? Is this a testable theory? Is it a predictive theory? What does empirical research and careful analysis show?

Putting that aside, there are useful mathematical ways to get an idea of some of the backing concepts without making assumptions about people, culture, education, etc. I'll cook one up now...

Start with 70K balls split evenly across seven colors: red, orange, yellow, green, blue, indigo, and violet. 1,000 people show up demanding balls. So we mix them up and randomly distribute 10 balls to every person. What does the distribution tend to look like? What particulars would you tune and/or definitions would you choose to make this problem "sort of" map to something like assessing the diversity of human competence across different areas?
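A quick simulation of that setup (counts as stated; the seed and the "distinct colors per hand" summary are just illustrative choices):

```python
import random
from collections import Counter

random.seed(0)

# 70,000 balls: 10,000 of each of seven colors.
colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
urn = [c for c in colors for _ in range(10_000)]
random.shuffle(urn)

# 1,000 people each draw 10 balls (without replacement, from the shared urn).
hands = [urn[i * 10:(i + 1) * 10] for i in range(1_000)]

# How many *distinct* colors does a typical 10-ball hand contain?
distinct = Counter(len(set(hand)) for hand in hands)
```

With seven equally likely colors and ten draws, the expected number of distinct colors per hand is about 7 * (1 - (6/7)^10) ≈ 5.5, so most "people" end up moderately but not fully diverse, which is one crude way to read the competence analogy.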

Note the colored balls example assumes independence between colors (subjects or skills or something). But in real life, there are often causally significant links between skills. For example, general reasoning ability improves performance in a lot of other subjects.

Then a goat exploded, because I don't know how to end this comment gracefully.

bgwalter 12/10/2025|
"If LLMs are watching, humans will be on their best behavior". Karpathy, paraphrasing Larry Ellison.

The EU may give LLM surveillance an F at some point.
