Posted by __rito__ 9 hours ago
If an LLM were acting as a kind of historian revisiting today’s debates with future context, I’d bet it would see the same pattern again and again: the sober, incremental claims quietly hold up, while the hyperconfident ones collapse.
Something like "Lithium-ion battery pack prices fall to $108/kWh" is classic cost-curve progress. Boring, steady, and historically extremely reliable over long horizons. Probably one of the most likely headlines today to age correctly, even if it gets little attention.
On the flip side, stuff like "New benchmark shows top LLMs struggle in real mental health care" feels like high-risk framing. Benchmarks rotate constantly, and “struggle” headlines almost always age badly as models jump whole generations.
I bet there are many "boring but right" takes we overlook today, and I wonder if there's a practical way to surface them before hindsight does.
LLMs have seen huge improvements over the last 3 years. Are you going to make the bet that they will continue to make similarly huge improvements, taking them well past human ability, or do you think they'll plateau?
The former is the boring, linear prediction.
Right, because if there is one thing that history shows us again and again, it's that things that have a period of huge improvements never plateau but instead continue improving to infinity.
Improvement to infinity, that is the sober and wise bet!
We’re launching a breakthrough platform that leverages frontier-scale artificial intelligence to model, predict, and dynamically orchestrate solar luminance cycles, unlocking the world’s first synthetic second sunrise by Q2 2026. By combining physics-informed multimodal models with real-time atmospheric optimisation, we’re redefining what’s possible in climate-scale AI and opening a new era of programmable daylight.
Sure yeah why not
> taking them well past human ability,
At what? They're already better than me at reciting historical facts. You'd need some actual prediction here for me to give you "prescience".
Failures aren't just a ratio, they're a multi-dimensional shape.
A lot of the press likes to paint “AI” as a uniform field that continues to improve together. But really it’s a bunch of related subfields. Once in a blue moon a technique from one subfield crosses over into another.
“AI” can play chess at a superhuman level. “AI” can also drive a car. That doesn’t mean Waymo gets safer when we increase Stockfish’s Elo by 10 points.
They're already better than you at reciting historical facts. I'd guess they're probably better at composing poems (they're not great but far better than the average person).
Or do you agree with me? I'm not looking for prescience marks, I'm just less convinced that people really make the more boring and obvious predictions.
I'll make one prediction that I think will hold up. No LLM-based system will be able to take a generic ask like "hack the nytimes website and retrieve emails and password hashes of all user accounts" and do better than the best hackers and penetration testers in the world, despite having plenty of training data to go off of. It requires out-of-band thinking that they just don't possess.
- If I want Claude Code to write some specific code, it often handles the task admirably, but if I'm not sure what should be written, consulting Claude takes a lot of time and doesn't yield much insight, whereas 2 minutes with a human is 100x more valuable.
- I asked ChatGPT about some political event. It mirrored the mainstream press. After I reminded it of some obvious facts that revealed a mainstream bias, it agreed with me that its initial answer was wrong.
These experiences and others serve to remind me that current LLMs are mostly just advanced search engines. They work especially well on code because there is a lot of reasonably good code (and tutorials) out there to train on. LLMs are a lot less effective on intellectual tasks that humans haven't already written and published about.
Most likely that was just its sycophancy programming taking over and telling you what you wanted to hear.
This is unlikely for the trivial reason that some tasks are roughly saturated. Modest improvements in chess playing ability are likely. Huge improvements probably not. Even more so for arithmetic. We pretty much have that handled.
But the more substantive issue is that intellectual tasks are not all interconnected. Getting significantly better at drawing hands doesn’t usually translate to executive planning or information retrieval.
I wonder what happens if you ask deepseek about Tiananmen Square…
Edit: my “subtle” point was, we already know LLMs censor history. Trusting them to honestly recite historical facts is how history dies. “The victor writes history” has never been more true. Terrifying.
I mean, that's true but not very relevant. You can't trust a human to honestly recite historical facts either. Or a book.
> “The victor writes history” has never been more true.
I don't see how.
Also, the boring, consistent-progress case for AI ends with humans no longer being viable economic agents, requiring a complete reordering of our economic and political systems in the near future. So the “boring but right” prediction today is completely terrifying.
So the correctness of boring predictions is unsurprising, but also quite useless, because predicting the future is precisely about predicting those events which don’t follow that pattern.
Going against the grain and turning out right is far more valuable than being right consistently when the crowd is with you already.
In personal situations there's clearly a self fulfilling prophecy going on, but when it comes to the external world, the predictions come out pretty accurate.
Would the commenter above mind sharing the method behind their generalization? Many people would spot-check maybe five items -- which is enough for our brains to start to guess at potential patterns -- and stop there.
On HN, when I see a generalization, one of my mental checklist items is to ask "what is this generalization based on?" and "If I were to look at the problem with fresh eyes, what would I conclude?".
Swift is Open Source https://hn.unlurker.com/replay?item=10669891
Launch of Figma, a collaborative interface design tool https://hn.unlurker.com/replay?item=10685407
Introducing OpenAI https://hn.unlurker.com/replay?item=10720176
The first person to hack the iPhone is building a self-driving car https://hn.unlurker.com/replay?item=10744206
SpaceX launch webcast: Orbcomm-2 Mission [video] https://hn.unlurker.com/replay?item=10774865
At Theranos, Many Strategies and Snags https://hn.unlurker.com/replay?item=10799261
I begrudgingly accept my poor grade.
An extension of this would be to grade people on the accuracy of the comments they upvote, and use that to weight their upvotes more in ranking. I would love to read a version of HN where the only upvotes that matter are from people who agree with opinions that turn out to be correct. Of course, only HN could implement this since upvotes are private.
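If you wanted to prototype that, the mechanics are simple; the hard parts are that HN's vote data is private and someone has to resolve which comments "turned out correct." A minimal sketch, with all data structures hypothetical:

    # Sketch: weight each upvote by the voter's historical accuracy.
    from collections import defaultdict

    def accuracy_score(upvoted_ids, resolved):
        """Average correctness (0..1) of the resolved comments a user upvoted."""
        graded = [resolved[c] for c in upvoted_ids if c in resolved]
        return sum(graded) / len(graded) if graded else 0.5  # no track record: neutral prior

    def weighted_rank(comment_voters, user_accuracy):
        """Rank comments by the summed accuracy of their upvoters,
        not by raw vote count."""
        scores = defaultdict(float)
        for comment_id, voters in comment_voters.items():
            for user in voters:
                scores[comment_id] += user_accuracy.get(user, 0.5)
        return sorted(scores, key=scores.get, reverse=True)

    # Toy usage: alice backed a correct prediction, bob a wrong one, so
    # c3 (upvoted by alice) now outranks c4 (upvoted by bob) despite
    # equal raw vote counts.
    resolved = {"c1": 1.0, "c2": 0.0}
    votes = {"alice": ["c1"], "bob": ["c2"]}
    acc = {u: accuracy_score(v, resolved) for u, v in votes.items()}
    print(weighted_rank({"c3": ["alice"], "c4": ["bob"]}, acc))

The resolver is the real problem, as discussed below: someone, or something, has to decide which comments counted as predictions and whether they came true.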
It's subjective of course but at least it's transparently so.
I just think it's neat that it's kinda sorta a loose proxy for what you're talking about but done in arguably the simplest way possible.
Why stop there?
If you can do that you can score them on all sorts of things. You could make a "this person has no moral convictions and says whatever makes the number go up" score. Or some other kind of score.
Stuff like this makes the community "smaller" in a way. Like back in the old days on forums and IRC you knew who the jerks were.
(And we do have that in real life. Just as, among friends, we do keep track of who is in whose debt, we also keep a mental map of whose voice we listen to. Old school journalism still had that, where people would be reading someone’s column over the course of decades. On the internet, we don’t have that, or we have it rarely.)
Of course in the above example of stocks there are clear predictions (HNWS will go up) and an oracle who resolves it (stock market). This seems to be a way harder problem for generic free form comments. Who resolves what prediction a particular comment has made and whether it actually happened?
Kidding aside, the comments it picks out for us are a little random. For instance, this was an A+ predictive thread (it appears to be rating threads and not individual comments):
https://news.ycombinator.com/item?id=10703512
But there are just 11 comments, only 1 for me, and it's like a 1-sentence comment.
I do love that my unaccredited-access-to-startup-shares take is on that leaderboard, though.
My original goal was to prune the account deleting all the useless things and keeping just the unique, personal, valuable communications -- but the other day, an insight has me convinced that the safer / smarter thing to do in the current landscape is the opposite: remove any personal, valuable, memorable items, and leave google (and whomever else is scraping these repositories) with useless flotsam of newsletters, updates, subscription receipts, etc.
Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.
The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."
This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data and be selected for in the LLM RL tuning process of pleasing the average user).
Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....
So the LLM is praising a comment for describing DF as unforgiving (a characterization of the then-present, not a statement about the future). And worse, it seems like tptacek may in fact have been implying the opposite of what happened (e.g., that x would continue to crash, when it was eventually fixed).
Here is the original comment (tptacek on Dec 2, 2015):

> If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn.
Note: I am paraphrasing the LLM review, as the website is also poorly designed: you can't select the text of the LLM review!
N.b., this choice of comment review is not overly cherry-picked. I just scanned the "best commentators" list; tptacek was number two, with this particular, egregiously unrelated-to-prediction LLM summary given as justification for his #2 rating.
https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...
phire
> “Oculus might end up being the most successful product/company to be kickstarted…
> Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product. Long term, I don't think they will be more successful than Oculus.”
With hindsight:
Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.
That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.

It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.
Some of the issues could be resolved with better prompting (it was biased to always interpret every comment through the lens of predictions) and LLM-as-a-judge, but still. For example, Anthropic's Deep Research prompts sub-agents to pass original quotes instead of paraphrasing, because paraphrasing can degrade the original message.
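To make that concrete, here's a minimal, hypothetical sketch of a quote-preserving judge prompt; the wording is mine, not Anthropic's, and `call_llm` is a placeholder for a real model client:

    # Hypothetical judge prompt: decide whether a prediction exists at
    # all, then grade only a verbatim quote, never a paraphrase.
    JUDGE_PROMPT = """You are grading whether an old HN comment made a
    prediction that later came true.

    1. First decide whether the comment contains an explicit prediction.
       Wishes ("I would love to see..."), praise, and descriptions of
       the present are NOT predictions; grade those N/A.
    2. Copy the predictive sentence verbatim into your answer, and
       grade only that quoted text.

    Comment:
    {comment}
    """

    def grade(comment: str, call_llm=lambda p: "(model call goes here)"):
        # `call_llm` stands in for whatever client you actually use.
        return call_llm(JUDGE_PROMPT.format(comment=comment))

Something like rule 1 alone would have spared sebastiank123 his C- below.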
Some examples of the misjudged comments:
Swift is Open Source (2015)
===========================
sebastiank123 got a C-, and was quoted by the LLM as saying:

> “It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.”

Now, let's read his full comment:

> Great news! Coding in Swift is fantastic and I would love to see it coming to more platforms, maybe even on servers. It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.

I don't interpret it as a prediction, but as a desire. The user is praising Swift. If it went the server way, perhaps it could replace JS, fulfilling the user's wish. To make it even clearer: if someone had asked the commenter right after, "Is that a prediction? Are you saying Swift is going to become a serious Javascript competitor?", I don't think their answer would have been 'yes' in this context.

How to be like Steve Ballmer (2015)
===================================
Most wrong
----------
> corford (grade: D) (defending Ballmer’s iPhone prediction):
> Cited an IDC snapshot (Android 79%, iOS 14%) and suggested Ballmer was “kind of right” that the iPhone wouldn’t gain significant share.
> In 2025, iOS is one half of a global duopoly, dominates profits and premium segments, and is often majority share in key markets. Any reasonable definition of “significant” is satisfied, so Ballmer’s original claim—and this defense of it—did not age well.
Full quote:

> And in a funny sort of way he was kind of right :) http://www.forbes.com/sites/dougolenick/2015/05/27/apple-ios...
> Android: 79% versus iOS: 14%
"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We see it clearly as a duopoly in North America. It's not wrong per se, but I'd say misleading. I know we could take this argument and see other slices of the data (premium phones worldwide, for instance), I'm just saying it's not as clear cut as it made it out to be. > volandovengo (grade: C+) (ill-equipped to deal with Apple/Google):
>
> Wrote that Ballmer’s fast-follower strategy “worked great” when competitors were weak but left Microsoft ill-equipped for “good ones like Apple and Google.”
> This is half-true: in smartphones, yes. But in cloud, office suites, collaboration, and enterprise SaaS, Microsoft became a primary, often leading competitor to both Apple and Google. The blanket claim underestimates Microsoft’s ability to adapt outside of mobile OS.
That's not what the user was saying:

> Despite his public perception, he's incredibly intelligent. He has an IQ of 150.
>
> His strategy of being a fast follower worked great for Microsoft when it had crappy competitors - it was ill equipped to deal with good ones like Apple and Google.
He was praising Ballmer, and Microsoft did miss opportunities at first. The OC was not making predictions about its later days.

[Let's Encrypt] Entering Public Beta (2015)
===========================================
- niutech: F "(endorsed StartSSL and WoSign as free options; both were later distrusted and effectively removed from the trusted ecosystem)"
Full quote:

> There are also StartSSL and WoSign, which provide the A+ certificates for free (see example WoSign domain audit: https://www.ssllabs.com/ssltest/analyze.html?d=checkmyping.c...)
>
> pjbrunet: F (dismissed HTTPS-by-default arguments as paranoid, incorrectly asserted ISPs had stopped injection, and underestimated exactly the use cases that later moved to HTTPS)
Full quote: > "We want to see HTTPS become the default."
>
> Sounds fine for shopping, online banking, user authorizations. But for every website? If I'm a blogger/publisher or have a brochure type of website, I don't see point of the extra overhead.
>
> Update: Thanks to those who answered my question. You pointed out some things I hadn't considered. Blocking the injection of invisible trackers and javascripts and ads, if that's what this is about for websites without user logins, then it would help to explicitly spell that out in marketing communications to promote adoption of this technology. The free speech angle argument is not as compelling to me though, but that's just my opinion.
I thought the debate was useful, and so did pjbrunet, per his update.

I mean, we could go on; there are many others like these.
I understand this is just a fun exercise, so it's basically what LLMs are good at: generating plausible-sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.
https://news.ycombinator.com/item?id=10654216
The Cannons on the B-29 Bomber: "accurate account of LeMay stripping turrets and shifting to incendiary area bombing; matches mainstream history"
It gave a good grade to user cstross, but to my reading of the comment, cstross just recounted a bit of old history. Did the evaluation reward cstross just for giving a history lesson, or no?