Top
Best
New

Posted by __rito__ 12/10/2025

Auto-grading decade-old Hacker News discussions with hindsight(karpathy.bearblog.dev)
Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632
686 points | 270 comments
jasonthorsness 12/10/2025|
It's fun to read some of these historic comments! A while back I wrote a replay system to better capture how discussions evolved at the time of these historic threads. Here's Karpathy's list from his graded articles, in the replay visualizer:

Swift is Open Source https://hn.unlurker.com/replay?item=10669891

Launch of Figma, a collaborative interface design tool https://hn.unlurker.com/replay?item=10685407

Introducing OpenAI https://hn.unlurker.com/replay?item=10720176

The first person to hack the iPhone is building a self-driving car https://hn.unlurker.com/replay?item=10744206

SpaceX launch webcast: Orbcomm-2 Mission [video] https://hn.unlurker.com/replay?item=10774865

At Theranos, Many Strategies and Snags https://hn.unlurker.com/replay?item=10799261

SauntSolaire 12/10/2025||
I'd love to see sentiment analysis done based on time of day. I'm sure it's largely time zone differences, but I see a large variance in the types of opinions posted to hn in the morning versus the evening and I'd be curious to see it quantified.
embedding-shape 12/10/2025|||
Yeah, I see this constantly any time Europe is mentioned in a submission. Early European morning/day, regular discussions, but as the European afternoon/evening comes around, you start noticing a lot anti-union sentiment, discussions start to shift into over-regulation, and the typical boring anti-Europe/EU talking points.
nostrebored 12/11/2025||
“Regular” to who? Pro EU sentiment almost only comes from the EU, which is what you’re observing. Pro-US sentiment is relatively mixed (as is anti-US sentiment) in distribution.
gilrain 12/11/2025||
> Pro EU sentiment almost only comes from the EU

Says who? But also, it doesn’t suggest what you imply. I could as easily conclude: “Oh wow, the people who actually experience the system like it that much? Awesome!”

TimedToasts 12/12/2025||
Or one could conclude that the bots were posting at a time of day intending you as the reading target. As long as they post things that you are inclined to agree with, you'll feel positive reinforcement about an issue regardless of the actual popularity or even viability.
red-iron-pine 12/11/2025|||
e.g. how many are cali tech bros vs nyc fintec vs 10am moscow shillbot time
arowthway 12/11/2025|||
Comment dates on hn frontend are sometimes altered when submissions are merged, do you handle this case properly?
jasonthorsness 12/11/2025||
It is handled on the Unlurker front page (you will see a little note that says “time adjusted for second chance”). The replay doesn’t do any adjustment for it, but I think that makes it reflect the reality of when the comments came in since the adjustments are like a temporary bump
matsemann 12/11/2025|||
I like the "past" functionality here, maybe wished there was one for week/month I could scroll back as well.

Miss it for reddit as well. Top day/week/month/alltime makes it hard to find top a month in 2018.

HanClinto 12/10/2025||
Okay, your site is a ton of fun. Thank you! :)
modeless 12/10/2025||
This is a cool idea. I would install a Chrome extension that shows a score by every username on this site grading how well their expressed opinions match what subsequently happened in reality, or the accuracy of any specific predictions they've made. Some people's opinions are closer to reality than others and it's not always correlated with upvotes.

An extension of this would be to grade people on the accuracy of the comments they upvote, and use that to weight their upvotes more in ranking. I would love to read a version of HN where the only upvotes that matter are from people who agree with opinions that turn out to be correct. Of course, only HN could implement this since upvotes are private.

cootsnuck 12/10/2025||
The RES (Reddit Enhancement Suite) browser extension indirectly does this for me since it tracks the lifetime number of upvotes I give other users. So when I stumble upon a thread with a user with like +40 I know "This is someone whom I've repeatedly found to have good takes" (depending on the context).

It's subjective of course but at least it's transparently so.

I just think it's neat that it's kinda sorta a loose proxy for what you're talking about but done in arguably the simplest way possible.

nickff 12/10/2025|||
I am not a Redditor, but RES sounds like it would increase the ‘echo-chamber’ effect, rather than improving one’s understanding of contributors’ calibration.
baq 12/11/2025|||
Echo chamber of rational, thoughtful and truthful speakers is what I’m looking for in Internet forums.
jrmg 12/11/2025|||
That’s what everyone living in an echo chamber (and especially one of their own creation) thinks they’re in.
XorNot 12/11/2025|||
"you're in an echo chamber" is one of the most frightfully overused opinions.
ssl-3 12/11/2025||
The expression is an echo chamber in and of itself; it is self-fulfilling prophecy.
baq 12/11/2025|||
I don't think I'm in any is my problem (HN is better than most, doesn't mean it's good in absolute terms...)
red-iron-pine 12/11/2025|||
flat earth creationists would describe their colleagues the same way.

a group of them certainly is an echo chamber; why isn't your view?

ahf8Aithaex7Nai 12/11/2025|||
He doesn't deny that his point of view forms an echo chamber.
lukan 12/12/2025||||
"flat earth creationists would describe their colleagues the same way."

Actually they mostly don't. Lots of infighting over the real true answer .. (infinite flat earth, finite but with impassable ice walls, ..)

xmprt 12/12/2025|||
An echo chamber is a product of your own creation. If you're willing to upvote people who disagree with your and actively seek out opposite takes to be genuinely curious about, then you're probably not in an echo chamber.

The tools for controlling your feed are reducing on social media like Instagram, TikTok, Youtube, etc., but simply saying that you follow and respect the opinions of a select group doesn't necessarily mean you're forming an echo chamber.

This is different from something like flat earth/other conspiracy theories where when confronted with opposite evidence, they aren't likely to engage with it in good faith.

mistercheph 12/10/2025||||
it depends on if you vote based on the quality of contribution to the discussion or based on how much you agree/disagree.
miki123211 12/11/2025||
I don't think you can change user behavior like this.

You can give them a "venting sink" though. Instead of having a downvote button that just downvotes, have it pop up a little menu asking for a downvote reason, with "spam" and "disagree" as options. You could then weigh downvotes by which option was selected, along with an algorithm to discover "user honesty" based on whether their downvotes correlate with others or just with the people on their end of the political spectrum, a la Birdwatch.

morshu9001 12/12/2025||
You can't change it for other users, only for yourself, which is what the original comment about the extension said.
intended 12/11/2025||||
Echo chambers will always result on social media. I don't think you can come up with a format that will not result in consolidated blocs.
modeless 12/10/2025||||
Reddit's current structure very much produces an echo chamber with only one main prevailing view. If everyone used an extension like this I would expect it to increase overall diversity of opinion on the site, as things that conflict with the main echo chamber view could still thrive in their own communities rather than getting downvoted with the actual spam.
XorNot 12/11/2025||
Hacker News structure is identical though. Topics invite different demographics and downvotes suppress unpopular opinions. The front page shows most up voted stories. It's the same system.
modeless 12/11/2025|||
HN's moderation and ranking is better. But there's definitely an echo chamber effect here too.
morshu9001 12/12/2025|||
HN has some built-in ways to reduce this, like not allowing everyone to downvote everything.
PunchyHamster 12/10/2025|||
More than having exact same system but with any random reader voting ? I'd say as long as you don't do "I disagree therefore I downvote" it would probably be more accurate than having essentially same voting system driven by randoms like reddit/HN already does
janalsncm 12/10/2025|||
That assumes your upvotes in the past were a good proxy for being correct today. You could have both been wrong.
potato3732842 12/11/2025|||
>This is a cool idea. I would install a Chrome extension that shows a score by every username on this site grading how well their expressed opinions match what subsequently happened in reality, or the accuracy of any specific predictions they've made.

Why stop there?

If you can do that you can score them on all sorts of things. You could make a "this person has no moral convictions and says whatever makes the number go up" score. Or some other kind of score.

Stuff like this makes the community "smaller" in a way. Like back in the old days on forums and IRC you knew who the jerks were.

leobg 12/10/2025|||
That’s what Elon’s vision was before he ended up buying Twitter. Keep a digital track record for journalists. He wanted to call it Pravda.

(And we do have that in real life. Just as, among friends, we do keep track of who is in whose debt, we also keep a mental map of whose voice we listen to. Old school journalism still had that, where people would be reading someone’s column over the course of decades. On the internet, we don’t have that, or we have it rarely.)

TrainedMonkey 12/10/2025|||
I long had a similar idea for stocks. Analyze posts of people giving stock tips on WSB, Twitter, etc and rank by accuracy. I would be very surprised if this had not been done a thousand times by various trading firms and enterprising individuals.

Of course in the above example of stocks there are clear predictions (HNWS will go up) and an oracle who resolves it (stock market). This seems to be a way harder problem for generic free form comments. Who resolves what prediction a particular comment has made and whether it actually happened?

miki123211 12/11/2025|||
> Analyze posts of people giving stock tips on WSB, Twitter, etc and rank by accuracy.

Didn't somebody make an ETF once that went against the prediction of some famous CNBC stock picker, showing that it would have given you alpha in the past.

> seems to be a way harder problem for generic free form comments.

That's what prediction markets are for. People for whom truth and accuracy matters (often concentrated around the rationalist community) will often very explicitly make annual lists of concrete and quantifiable predictions, and then self-grade on them later.

Natsu 12/11/2025|||
You probably mean Inverse Cramer:

https://finbold.com/inverse-cramer-leaves-sp-nasdaq-and-dow-...

red-iron-pine 12/11/2025|||
Cramer is the stock picker guy. There is a well known "Cramer Effect" or "Cramer Bounce" where the stock peaks then drops hard.

Makes for great pump n dump if you're day trading and willing to ride

https://www.investopedia.com/terms/c/cramerbounce.asp

long-term his choices don't do well, so the Inverse Cramer basically says "do the opposite of this goober" and has solid returns (sorta; depends a lot on methodology, and the sole hedgefund playing that strategy shutdown)

Karrot_Kream 12/10/2025||||
I ran across Sybil [1] the other day which tries to offer a reputation score based on correct predictions in prediction markets.

[1]: https://sybilpredicttrust.info/

mvkel 12/11/2025|||
Out of curiosity, I built this. I extended karpathy's code and widened the date range to see what stocks these users would pick given their sentiments.

What came back were the usual suspects: GLP-1 companies and AI.

Back to the "boring but right" thesis. Not much alpha to be found

emaro 12/11/2025|||
I like the idea and certainly would try it. Although I feel in a way this would be an anti-thesis to HN. HN tries to foster curiosity, but if you're (only) ranked by the accuracy of your predictions, this could give the incentive to always fall back to a save and boring position.
handoflixue 12/14/2025||
I think the most interesting predictions are the ones that sound bold and even a little bit insane at the time. I think a lot more of the people who were willing to say saying "ASI will kill us all" 20+ years ago, because they were taking a risk (and routinely ridiculed for it).

Even today, "ASI will kill us all" can be a pretty divisive declaration - hardly safe and boring.

From the couple of threads I clicked, it seemed like this LLM-driven analysis was picking up on that, too: the top comments were usually bold, and some of the worst-rated comments was the "safe and boring" declaration that nothing interesting ever really happens.

8organicbits 12/10/2025|||
The problem seems underspecified; what does it mean for a comment to be accurate? It would seem that comments like "the sun will rise tomorrow" would rank highest, but they aren't surprising.
smeeger 12/11/2025||
just because an idea is qualitative doesn't mean its invalid
prawn 12/11/2025||
Didn't Slashdot have something like the second point with their meta-moderation, many many years ago?
ssl-3 12/11/2025||
Sorta.

IIRC, when comment moderation and scoring came to Slashdot, only a random (and changing) selection of users were able to moderate.

Meta-moderation came a bit later. It allowed people to review prior moderation actions and evaluate the worth of those actions.

Those users who made good moderations were more likely to become a mod again in the future than those who made bad moderations.

The meta-mods had no idea whose actions they were evaluating, and previous/potential mods had no idea what their score was. That anonymity helped keep it honest and harder to game.

handoflixue 12/14/2025||
It's still that way today: if you're active, you'll be randomly given 5 moderation points occasionally, and they expire after a few days. So you have to decide which threads and comments are worth spending a "moderation point" on
ssl-3 12/14/2025||
How does meta-moderation work these days? I remember it being called Chips and Dip instead of /., but it's been many years since I've hung out there.
tptacek 12/10/2025||
'pcwalton, I'm coming for you. You're going down.

Kidding aside, the comments it picks out for us are a little random. For instance, this was an A+ predictive thread (it appears to be rating threads and not individual comments):

https://news.ycombinator.com/item?id=10703512

But there's just 11 comments, only 1 for me, and it's like a 1-sentence comment.

I do love that my unaccredited-access-to-startup-shares take is on that leaderboard, though.

kbenson 12/11/2025||
I noticed from reviewing my own entry (which honestly I'm surprised exists) that the idea of what it thinks constitutes a "prediction" is fairly open to interpretation, or at least that adding some nuance to a small aspect in a thread to someone else prediction counts quite heavily. I don't really view how I've participated here over the years in any way as making predictions. I actually thought I had done a fairly good job at not making predictions, by design.
n4r9 12/11/2025|||
Yeah, I'm having to pinch myself a little here. Another slightly odd example it picked out from your history: https://news.ycombinator.com/item?id=10735398

It's a good comment, but "prescient" isn't a word I'd apply to it. This is more like a list of solid takes. To be fair there probably aren't even that many explicit, correct predictions in one month of comments in 2015.

mvkel 12/11/2025||
Hilariously, it seems you anticipated this happening and copyrighted your comments. Is karpathy's tool in violation of your copyright?!
tptacek 12/11/2025||
Karpathy, I'm coming for you next.
btbuildem 12/10/2025||
I've spent a weekend making something similar for my gmail account (which google keeps nagging me about being 90% full). It's fascinating to be able to classify 65k+ of emails (surprise: more than half are garbage), as well as summarize and trace the nature of communication between specific senders/recipients. It took about 50 hours on a dual RTX 3090 running Qwen 3.

My original goal was to prune the account deleting all the useless things and keeping just the unique, personal, valuable communications -- but the other day, an insight has me convinced that the safer / smarter thing to do in the current landscape is the opposite: remove any personal, valuable, memorable items, and leave google (and whomever else is scraping these repositories) with useless flotsam of newsletters, updates, subscription receipts, etc.

subscriptzero 12/11/2025||
I would love to do something like this, and weirdly I even have a dual 3090 home setup.

Any chance you can outline the steps/prompts/tools you used to run this?

I've been building a 2nd brain type project, that plugs into all my work places and a custom classifier has been on that list that would enhance that.

red-iron-pine 12/11/2025||
so then what do you do with the useful stuff?
btbuildem 12/12/2025||
Local archive + client for search
Rperry2174 12/10/2025||
One thing this really highlights to me is how often the "boring" takes end up being the most accurate. The provocative, high-energy threads are usually the ones that age the worst.

If an LLM were acting as a kind of historian revisiting today’s debates with future context, I’d bet it would see the same pattern again and again: the sober, incremental claims quietly hold up, while the hyperconfident ones collapse.

Something like "Lithium-ion battery pack prices fall to $108/kWh" is classic cost-curve progress. Boring, steady, and historically extremely reliable over long horizons. Probably one of the most likely headlines today to age correctly, even if it gets little attention.

On the flip side, stuff like "New benchmark shows top LLMs struggle in real mental health care" feels like high-risk framing. Benchmarks rotate constantly, and “struggle” headlines almost always age badly as models jump whole generations.

I bet theres many "boring but right" takes we overlook today and I wondr if there's a practical way to surface them before hindsight does

yunwal 12/10/2025||
"Boring but right" generally means that this prediction is already priced in to our current understanding of the world though. Anyone can reliably predict "the sun will rise tomorrow", but I'm not giving them high marks for that.
onraglanroad 12/10/2025|||
I'm giving them higher marks than the people who say it won't.

LLMs have seen huge improvements over the last 3 years. Are you going to make the bet that they will continue to make similarly huge improvements, taking them well past human ability, or do you think they'll plateau?

The former is the boring, linear prediction.

bryanrasmussen 12/10/2025|||
>The former is the boring, linear prediction.

right, because if there is one thing that history shows us again and again is that things that have a period of huge improvements never plateau but instead continue improving to infinity.

Improvement to infinity, that is the sober and wise bet!

p-e-w 12/11/2025|||
The prediction that a new technology that is being heavily researched plateaus after just 5 years of development is certainly a daring one. I can’t think of an example from history where that happened.
gitremote 12/11/2025|||
Neural network research and development existed since the 1980s at least, so at least 40 years. One of the bottlenecks before was not enough compute.
OccamsMirror 12/11/2025|||
Perhaps the fact that you think this field is only 5 years old means you're probably not enough of an authority to comment confidently on it?
p-e-w 12/11/2025||
Claiming that AI in anything resembling its current form is older than 5 years is like claiming the history of the combustion engine started when an ape picked up a burning stick.
OccamsMirror 12/11/2025||
Your analogy fails because picking up a burning stick isn’t a combustion engine, whereas decades of neural-net and sequence-model work directly enabled modern LLMs. LLMs aren’t “five years old”; the scaling-transformer regime is. The components are old, the emergent-capability configuration is new.

Treating the age of the lineage as evidence of future growth is equivocation across paradigms. Technologies plateau when their governing paradigm saturates, not when the calendar says they should continue. Supersonic flight stalled immediately, fusion has stalled for seventy years, and neither cared about “time invested.”

Early exponential curves routinely flatten: solar cells, battery density, CPU clocks, hard-disk areal density. The only question that matters is whether this paradigm shows signs of saturation, not how long it has existed.

bryanrasmussen 12/11/2025||
I think this is the first time I have ever posted one of these but thank you for making the argument so well.
pixl97 12/11/2025|||
Tiger: humans will never beat tigers because tigers are purpose built killing machines and they are just generalist --40,000BC
OccamsMirror 12/11/2025||
You don't think humans hunted tigers in 40,000BC?
pxc 12/13/2025||
I don't think it would make much sense to hunt large predators prior to the invention of agriculture, even though early humans were probably plenty smart enough to build traps capable of holding animals like tigers. But after that (less than 40k years ago, more than 10k years ago), I'd bet it was a common-ish thing for humans to try to hunt predators that preyed upon their livestock.

Tigers are terrifying, though. I think it takes extreme or perverse circumstances to make hunting a tiger make any sense at all. And even then, traps and poisons make more sense than stalking a tiger to kill it!

bigiain 12/10/2025||||
LaunchHN: Announcing Twoday, our new YC backed startup coming out of stealth mode.

We’re launching a breakthrough platform that leverages frontier scale artificial intelligence to model, predict, and dynamically orchestrate solar luminance cycles, unlocking the world’s first synthetic second sunrise by Q2 2026. By combining physics informed multimodal models with real time atmospheric optimisation, we’re redefining what’s possible in climate scale AI and opening a new era of programmable daylight.

rznicolet 12/11/2025||
You joke, but, alas, there is a _real_ company kinda trying to do this. Reflect Orbital[1] wants to set up space mirrors, so you can have daytime at night for your solar panels! (Various issues, like around light pollution and the fact that looking up at the proposed satellites with binoculars could cause eye damage... don't seem to be on their roadmap.) This is one idea that's going to age badly whether or not they actually launch anything, I suspect.

Battery tech is too boring, but seems more likely to manage long-term effectiveness.

[1] https://www.reflectorbital.com

mananaysiempre 12/11/2025||
Reflecting sunlight from orbit is an idea that had been talked about for a couple of decades even before Znamya-2[1] launched in 1992. The materials science needed to unfurl large surfaces in space seems to be very difficult, whether mirrors or sails.

[1] https://en.wikipedia.org/wiki/Znamya_(satellite)

yunwal 12/10/2025||||
> Are you going to make the bet that they will continue to make similarly huge improvements

Sure yeah why not

> taking them well past human ability,

At what? They're already better than me at reciting historical facts. You'd need some actual prediction here for me to give you "prescience".

janalsncm 12/10/2025|||
“At what?” is really the key question here.

A lot of the press likes to paint “AI” as a uniform field that continues to improve together. But really it’s a bunch of related subfields. Once in a blue moon a technique from one subfield crosses over into another.

“AI” can play chess at superhuman skill. “AI” can also drive a car. That doesn’t mean Waymo gets safer when we increase Stockfish’s elo by 10 points.

Terr_ 12/11/2025||||
I imagine "better" in this case depends on how one scores "I don't know" or confident-sounding falsehoods.

Failures aren't just a ratio, they're a multi-dimensional shape.

onraglanroad 12/10/2025||||
At every intellectual task.

They're already better than you at reciting historical facts. I'd guess they're probably better at composing poems (they're not great but far better than the average person).

Or you agree with me? I'm not looking for prescience marks, I'm just less convinced that people really make the more boring and obvious predictions.

yunwal 12/10/2025|||
What is an intellectual task? Once again, there's tons of stuff LLMs won't be trained on in the next 3 years. So it would be trivial to just find one of those things and say voila! LLMs aren't better than me at that.

I'll make one prediction that I think will hold up. No LLM-based system will be able to take a generic ask like "hack the nytimes website and retrieve emails and password hashes of all user accounts" and do better than the best hackers and penetration testers in the world, despite having plenty of training data to go off of. It requires out-of-band thinking that they just don't possess.

hathawsh 12/10/2025||
I'll take a stab at this: LLMs currently seem to be rather good at details, but they seem to struggle greatly with the overall picture, in every subject.

- If I want Claude Code to write some specific code, it often handles the task admirably, but if I'm not sure what should be written, consulting Claude takes a lot of time and doesn't yield much insight, where as 2 minutes with a human is 100x more valuable.

- I asked ChatGPT about some political event. It mirrored the mainstream press. After I reminded it of some obvious facts that revealed a mainstream bias, it agreed with me that its initial answer was wrong.

These experiences and others serve to remind me that current LLMs are mostly just advanced search engines. They work especially well on code because there is a lot of reasonably good code (and tutorials) out there to train on. LLMs are a lot less effective on intellectual tasks that humans haven't already written and published about.

medler 12/11/2025||
> it agreed with me that its initial answer was wrong.

Most likely that was just its sycophancy programming taking over and telling you what you wanted to hear

blibble 12/11/2025||||
> They're already better than you at reciting historical facts.

so is a textbook, but no-one argues that's intelligent

janalsncm 12/10/2025||||
To be clear, you are suggesting “huge improvements” in “every intellectual task”?

This is unlikely for the trivial reason that some tasks are roughly saturated. Modest improvements in chess playing ability are likely. Huge improvements probably not. Even more so for arithmetic. We pretty much have that handled.

But the more substantive issue is that intellectual tasks are not all interconnected. Getting significantly better at drawing hands doesn’t usually translate to executive planning or information retrieval.

yunwal 12/10/2025||
There’s plenty of room to grow for LLMs in terms of chess playing ability considering chess engines have them beat by around 1500 ELO
janalsncm 12/11/2025||
Sorry, I now realize this thread is about whether LLMs can improve on tasks and not whether AI can. Agreed there’s a lot of headroom for LLMs, less so for AI as a whole.
autoexec 12/11/2025|||
> They're already better than you at reciting historical facts.

They're better at regurgitating historical facts than me because they were trained on historical facts written by many humans other than me who knew a lot more historical facts. None of those facts came from an LLM. Every historical fact that isn't entirely LLM generated nonsense came from a human. It's the humans that were intelligent, not the fancy autocomplete.

Now that LLMs have consumed the bulk of humanity's written knowledge on history what's left for it to suck up will be mainly its own slop. Exactly because LLMs are not even a little bit intelligent they will regurgitate that slop with exactly as much ignorance as to what any of it means as when it was human generated facts, and they'll still spew it back out with all the confidence they've been programed to emulate. I predict that the resulting output will increasingly shatter the illusion of intelligence you've so thoroughly fallen for so far.

irishcoffee 12/10/2025|||
> At what? They're already better than me at reciting historical facts.

I wonder what happens if you ask deepseek about Tiananmen Square…

Edit: my “subtle” point was, we already know LLMs censor history. Trusting them to honestly recite historical facts is how history dies. “The victor writes history” has never been more true. Terrifying.

Dylan16807 12/11/2025||
> Edit: my “subtle” point was, we already know LLMs censor history. Trusting them to honestly recite historical facts is how history dies.

I mean, that's true but not very relevant. You can't trust a human to honestly recite historical facts either. Or a book.

> “The victor writes history” has never been more true.

I don't see how.

Dylan16807 12/10/2025||||
LLMs aren't getting better that fast. I think a linear prediction says they'd need quite a while to maybe get "well past human ability", and if you incorporate the increases in training difficulty the timescale stretches wide.
OrderlyTiamat 12/12/2025|||
> The former is the boring, linear prediction.

Surely you meant the latter? The boring option follows previous experience. No technology has ever not reached a plateau, except for evolution itself I suppose, till we nuke the planet.

SubiculumCode 12/10/2025||||
Perhaps a new category, 'highest risk guess but right the most often'. Those is the high impact predictions.
arjie 12/10/2025||
Prediction markets have pretty much obviated the need for these things. Rather than rely on "was that really a hot take?" you have a market system that rewards those with accurate hot takes. The massive fees and lock-up period discourage low-return bets.
Karrot_Kream 12/10/2025|||
FWIW Polymarket (which is one of the big markets) has no lock-up period and, for now while they're burning VC coins, no fees. Otherwise agree with your point though.
gammarator 12/10/2025|||
Can’t wait for the brave new world of individuals “match fixing” outcomes on Polymarket.
Karrot_Kream 12/11/2025||
As opposed to the current world of brigading social media threads to make consensus look like it goes your way and then getting journalists scraping by on covering clickbait to cover your brigading as fact?
Gravityloss 12/10/2025|||
something like correctness^2 x novel information content rank?
Gravityloss 12/11/2025||
Actually now thinking about it, incorrect information has negative value so the metric should probably reflect that.
jimbokun 12/11/2025|||
The one about LLMs and mental health is not a prediction but a current news report, the way you phrased it.

Also, the boring consistent progress case for AI plays out in the end of humans as viable economic agents requiring a complete reordering of our economic and political systems in the near future. So the “boring but right” prediction today is completely terrifying.

p-e-w 12/11/2025|||
“Boring” predictions usually state that things will continue to work the way they do right now. Which is trivially correct, except in cases where it catastrophically isn’t.

So the correctness of boring predictions is unsurprising, but also quite useless, because predicting the future is precisely about predicting those events which don’t follow that pattern.

adam1996TL 12/11/2025|||
[dead]
simianparrot 12/10/2025|||
Instead of "LLM's will put developers out of jobs" the boring reality is going to be "LLM's are a useful tool with limited use".
jimbokun 12/11/2025||
That is at odds with predicting based on recent rates of progress.
johnfn 12/10/2025|||
This suggests that the best way to grade predictions is some sort of weighting by how unlikely they were at the time. Like, if you were to open a prediction market for statement X: a grade based on the delta between your confidence in the event and the “expected” value, summed over all your predictions.
jacquesm 12/10/2025||
Exactly, that's the element that is missing. If there are 50 comments against and one pro, and that one pro turns out right in the longer term, then that is worth noticing; not when there are 50 comments pro and you were one of the 'pros'.

Going against the grain and turning out right is far more valuable than being right consistently when the crowd is with you already.
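One way to formalize "going against the grain and turning out right", assuming a hypothetical crowd probability for each claim: log-score outcomes against the crowd, so a correct contrarian call earns far more than a correct consensus call.

```python
import math

def contrarian_score(crowd_prob: float, outcome: bool) -> float:
    """Log scoring against the crowd's implied probability.
    The reward for a call is -log(probability the crowd assigned
    to what actually happened): the less the crowd believed it,
    the bigger the payoff for being right."""
    p = crowd_prob if outcome else 1.0 - crowd_prob
    return -math.log(p)

# Being right when the crowd gave the claim only 2% (one pro vs
# fifty con) pays far more than being right alongside a 98% crowd.
lone_voice = contrarian_score(0.02, True)
with_crowd = contrarian_score(0.98, True)
```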

mcmoor 12/11/2025||
Yeah, a simple ratio of total points on pro comments to total points on con comments may be simple and exact enough to simulate a prediction market. I don't know if it can be included in the prompt or whether it's better vibecoded in directly.
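A rough sketch of the points-as-market idea, treating upvote totals on each side as wagers (names and smoothing choice are hypothetical):

```python
def implied_probability(pro_points: int, con_points: int) -> float:
    """Treat comment points as stakes in a toy prediction market:
    the fraction of points on the 'pro' side is the crowd's implied
    probability the claim is true. Laplace smoothing (+1 each side)
    keeps thin threads away from the 0 and 1 extremes."""
    return (pro_points + 1) / (pro_points + con_points + 2)

# 5 points for, 50 against: the crowd implies roughly a 10% chance.
p = implied_probability(5, 50)
```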
schoen 12/11/2025|||
I predict that, in 2035, 1+1=2. I also predict that, in 2045, 2+2=4. I also predict that, in 2055, 3+3=6.

By 2065, we should be in possession of a proof that 0+0=0. Hopefully by the following year we will also be able to confirm that 0*0=0.

(All arithmetic here is over the natural numbers.)

0manrho 12/11/2025|||
It's because algorithmic feeds based on "user engagement" reward antagonism. If your goal is to get eyes on content, being boring, predictable, and nuanced is a sure way to get lost in the ever-increasing noise.
xpe 12/11/2025|||
> One thing this really highlights to me is how often the "boring" takes end up being the most accurate.

Would the commenter above mind sharing the method behind their generalization? Many people would spot-check maybe five items -- which is enough for our brains to start guessing at potential patterns -- and stop there.

On HN, when I see a generalization, one of my mental checklist items is to ask "what is this generalization based on?" and "If I were to look at the problem with fresh eyes, what would I conclude?".

copperx 12/10/2025||
Is this why depressed people often end up making the best predictions?

In personal situations there's clearly a self fulfilling prophecy going on, but when it comes to the external world, the predictions come out pretty accurate.

mistercheph 12/10/2025||
A majority don't seem to be predictions about the future, and the grader seems mostly to reward comments that give extended airing to what was then and is now the consensus viewpoint, e.g. the top comment from pcwalton, the highest-scored user: https://news.ycombinator.com/item?id=10657401

> (Copying my comment here from Reddit /r/rust:) Just to repeat, because this was somewhat buried in the article: Servo is now a multiprocess browser, using the gaol crate for sandboxing. This adds (a) an extra layer of defense against remote code execution vulnerabilities beyond that which the Rust safety features provide; (b) a safety net in case Servo code is tricked into performing insecure actions. There are still plenty of bugs to shake out, but this is a major milestone in the project.

hackthemack 12/10/2025||
I noticed the Hall of Fame grading of predictive comments has a quirk. It grades some comments on whether they came true or not, but consider the grading of this comment on the article

https://news.ycombinator.com/item?id=10654216

The Cannons on the B-29 Bomber "accurate account of LeMay stripping turrets and shifting to incendiary area bombing; matches mainstream history"

It gave a good grade to user cstross, but to my reading of the comment, cstross just recounted a bit of old history. Did the evaluation reward cstross simply for giving a history lesson, or no?

karpathy 12/10/2025|
Yes, I noticed a few of these around. The LLM is a little too willing to give out grades for comments that were good/bad in a more general sense, even if they weren't making strong predictions specifically. Another thing I noticed is that the LLM has a very impressive recognition of the various usernames and who they belong to, and I think it shows a little bit of a bias in its evaluations based on the identity of the person. I tuned the prompt a little bit based on some low-hanging-fruit mistakes, but I think one can most likely iterate it quite a bit further.
patcon 12/11/2025||
I think you were getting at this, but in case others didn't know: cstross is a famous sci-fi author and futurist :)
pierrec 12/11/2025||
"the distributed “trillions of Tamagotchi” vision never materialized"

I begrudgingly accept my poor grade.

LeroyRaz 12/10/2025||
I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.

Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.

The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."

This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data and to be selected for in the LLM's RL tuning toward pleasing the average user).

LeroyRaz 12/10/2025||
Examples: tptacek gets an 'A' for his comment on DF, with the LLM claiming that the user "captured DF's unforgiving nature, where 'can't do x or it crashes' is just another feature to learn, which remained true until it was fixed on ..."

Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....

So the LLM is praising the comment for describing DF as unforgiving (a characterization of the then-present, not a statement about the future). And worse, tptacek may in fact have been implying the opposite about the future (i.e., that x would continue to crash, when it was eventually fixed).

Here is the original comment: " tptacek on Dec 2, 2015 | root | parent | next [–]

If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."

Note: I am paraphrasing the LLM review, as the website is also poorly designed, making it impossible to select the text of the LLM review!

N.b., this choice of comment review is not overly cherry-picked. I just scanned the "best commenters"; tptacek was number two, with this particular egregiously unrelated-to-prediction LLM summary given as justification for his #2 rating.

hathawsh 12/10/2025|||
Are you sure? The third section of each review lists the “Most prescient” and “Most wrong” comments. That sounds exactly like what you're looking for. For example, on the "Kickstarter is Debt" article, here is the LLM's analysis of the most prescient comment. The analysis seems accurate and helpful to me.

https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...

  phire

  > “Oculus might end up being the most successful product/company to be kickstarted…
  > Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product.
  > Long term, I don't think they will be more successful than Oculus.”

  With hindsight:

  Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
  Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.

  That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.
xpe 12/11/2025|||
Until someone publishes a systematic quality assessment, we're grasping at anecdotes.

It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.

voidhorse 12/11/2025||
Yes. And the article is a perfect example of the dangerous sort of automation bias that people will increasingly slide into when it comes to LLMs. I realize Karpathy is somewhat incentivized toward this bias given his career, but he doesn't spend a single sentence so much as suggesting that the results would need further inspection, or that they might be inaccurate.

The LLM is consulted like a perfect oracle, flawless in its ability to perform a task, and it's left at that. Its results are presented totally uncritically.

For this project, of course, the stakes are nil. But how long until this unfounded trust in LLMs works its way into high-stakes problems? The reign of deterministic machines over the past few centuries has ingrained in us a trust in the reliability of machines that should be suspended when dealing with an inherently stochastic device like an LLM.

karmickoala 12/11/2025|||
I get what you're saying, but looking at some examples, they look kind of right, yet there are a lot of misleading facts sprinkled in, making the grading wrong. It is useful, but I'd suggest being careful about using this to make decisions.

Some of the issues could be resolved with better prompting (it was biased to interpret every comment through the lens of predictions) and LLM-as-a-judge, but still. For example, Anthropic's Deep Research prompts sub-agents to pass along original quotes instead of paraphrasing, because paraphrasing can degrade the original message.

Some examples:

  Swift is Open Source (2015)
  ===========================
sebastiank123 got a C-, and was quoted by the LLM as saying:

  > “It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.”
Now, let's read his full comment:

  > Great news! Coding in Swift is fantastic and I would love to see it coming to more platforms, maybe even on servers. It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.
I don't interpret it as a prediction but as a desire. The user is praising Swift. If it went the server way, perhaps it could replace JS, per the user's wishes. To make it even clearer: if someone had asked the commenter right after, "Is that a prediction? Are you saying Swift is going to become a serious Javascript competitor?", I don't think the answer would have been 'yes' in this context.

  How to be like Steve Ballmer (2015)
  ===================================
  
  Most wrong
  ----------
  
  >     corford (grade: D) (defending Ballmer’s iPhone prediction):
  >         Cited an IDC snapshot (Android 79%, iOS 14%) and suggested Ballmer was “kind of right” that the iPhone wouldn’t gain significant share.
  >         In 2025, iOS is one half of a global duopoly, dominates profits and premium segments, and is often majority share in key markets. Any reasonable definition of “significant” is satisfied, so Ballmer’s original claim—and this defense of it—did not age well.

Full quote:

  > And in a funny sort of way he was kind of right :) http://www.forbes.com/sites/dougolenick/2015/05/27/apple-ios...
  > Android: 79% versus iOS: 14%
"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We see it clearly as a duopoly in North America. It's not wrong per se, but I'd say misleading. I know we could take this argument and look at other slices of the data (premium phones worldwide, for instance); I'm just saying it's not as clear-cut as the LLM made it out to be.

  > volandovengo (grade: C+) (ill-equipped to deal with Apple/Google):
  >  
  >     Wrote that Ballmer’s fast-follower strategy “worked great” when competitors were weak but left Microsoft ill-equipped for “good ones like Apple and Google.”
  >     This is half-true: in smartphones, yes. But in cloud, office suites, collaboration, and enterprise SaaS, Microsoft became a primary, often leading competitor to both Apple and Google. The blanket claim underestimates Microsoft’s ability to adapt outside of mobile OS.
That's not what the user was saying:

  > Despite his public perception, he's incredibly intelligent. He has an IQ of 150.
  > 
  > His strategy of being a fast follower worked great for Microsoft when it had crappy competitors - it was ill equipped to deal with good ones like Apple and Google.
He was praising him, and Microsoft did miss opportunities at first. The OC did not make predictions about its later days.

  [Let's Encrypt] Entering Public Beta (2015)
  ===========================================

  - niutech: F "(endorsed StartSSL and WoSign as free options; both were later distrusted and effectively removed from the trusted ecosystem)"

Full quote:

  > There are also StartSSL and WoSign, which provide the A+ certificates for free (see example WoSign domain audit: https://www.ssllabs.com/ssltest/analyze.html?d=checkmyping.c...)
  > 
  > pjbrunet: F (dismissed HTTPS-by-default arguments as paranoid, incorrectly asserted ISPs had stopped injection, and underestimated exactly the use cases that later moved to HTTPS)
Full quote:

  > "We want to see HTTPS become the default."
  > 
  > Sounds fine for shopping, online banking, user authorizations. But for every website? If I'm a blogger/publisher or have a brochure type of website, I don't see point of the extra overhead.
  > 
  > Update: Thanks to those who answered my question. You pointed out some things I hadn't considered. Blocking the injection of invisible trackers and javascripts and ads, if that's what this is about for websites without user logins, then it would help to explicitly spell that out in marketing communications to promote adoption of this technology. The free speech angle argument is not as compelling to me though, but that's just my opinion.
I thought the debate was useful and so did pjbrunet, per his update.

I mean, we could go on, there are many others like these.

andy99 12/10/2025||
I haven’t looked at the output yet, but came here to say: LLM grading is crap. They miss things, they ignore instructions, they bring in their own views, they have no calibration, and in general they are extremely poorly suited to this task. “Good” LLM-as-a-judge products (and none are great) use LLMs to make binary decisions - “do these atomic facts match, yes/no” type stuff - and aggregate them to get a score.

I understand this is just a fun exercise, so it’s basically what LLMs are good at - generating plausible-sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.
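The "binary decisions, then aggregate" pattern can be sketched like this (the judge is a placeholder callable, not a real API; names are hypothetical):

```python
from typing import Callable

def judge_comment(
    atomic_claims: list[str],
    binary_judge: Callable[[str], bool],
) -> float:
    """Narrow LLM-as-a-judge: ask only yes/no questions ('does this
    atomic claim match the later facts?') per extracted claim, then
    aggregate the booleans into a score, rather than asking the
    model for a free-form letter grade in one shot."""
    if not atomic_claims:
        return 0.0
    verdicts = [binary_judge(claim) for claim in atomic_claims]
    return sum(verdicts) / len(verdicts)

# Stub judge for illustration; a real one would call an LLM per claim.
stub = lambda claim: "will" in claim
score = judge_comment(["Swift will go server-side", "Pebble beats Oculus"], stub)
```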

jacquesm 12/10/2025|
Predictions are only valuable when they're actually made ahead of the knowledge becoming available. "A man will walk on Mars by 2030" is falsifiable; "a man will walk on Mars" is not. A lot of these entries have little to no predictive value, or were already known at the time and merely restated. It would be nice if future 'judges' put in more work to ensure quality judgments.

I would grade this article B-, but then again, nobody wrote it... ;)

More comments...