Auto-grading decade-old Hacker News discussions with hindsight

Posted by __rito__ 12/10/2025

Auto-grading decade-old Hacker News discussions with hindsight (karpathy.bearblog.dev)

Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632

686 points | 270 comments | page 6

bediger4000 12/10/2025|

LLMs are watching (or humans using them might be). Best to be good.

Shades of Roko's Basilisk!

ambicapter 12/10/2025|

More like a Panopticon. As the parenthesis notes, this is just as bad when humans are the final link in the eyeball chain.

Bjartr 12/10/2025||

Neat, I got a shout-out. Always happy to share the random stuff I remember exists!

apparent 12/11/2025||

> And then when you navigate over to the Hall of Fame, you can find the top commenters of Hacker News in December 2015, sorted by imdb-style score of their grade point average.

Now let's make a Chrome extension that subtly highlights these users' comments when browsing HN.
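For context on what an "imdb-style score" of a grade point average could look like: the sketch below assumes it means the commonly cited IMDb weighted rating, (v/(v+m))*R + (m/(v+m))*C, which pulls users with few graded comments toward the global mean. The grade-point mapping, the `min_count` value, and all function names are assumptions for illustration, not the site's actual code.

```python
# Letter grades mapped to grade points on a common 4.0 scale.
# ASSUMPTION: the site's real mapping is not given in the post.
GRADE_POINTS = {
    "A+": 4.0, "A": 4.0, "A-": 3.7,
    "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D+": 1.3, "D": 1.0, "D-": 0.7,
    "F": 0.0,
}


def gpa(grades):
    """Plain grade point average of a non-empty list of letter grades."""
    points = [GRADE_POINTS[g] for g in grades]
    return sum(points) / len(points)


def weighted_score(grades, global_mean, min_count=5):
    """IMDb-style weighted rating: (v/(v+m))*R + (m/(v+m))*C.

    v = number of graded comments, R = the user's GPA,
    m = min_count (hypothetical tuning knob), C = global mean GPA.
    """
    v = len(grades)
    r = gpa(grades)
    return (v / (v + min_count)) * r + (min_count / (v + min_count)) * global_mean


def rank_users(grades_by_user, min_count=5):
    """Return usernames sorted from highest to lowest weighted score."""
    all_grades = [g for grades in grades_by_user.values() for g in grades]
    global_mean = gpa(all_grades)
    scores = {
        user: weighted_score(grades, global_mean, min_count)
        for user, grades in grades_by_user.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With this weighting, a user with a single "A" does not automatically outrank a user with twenty "A-" grades, which is the usual reason to prefer it over a raw average.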

bbcisking 12/11/2025||

Why not rank ESP for each HN user, with evidence?

exasperaited 12/10/2025||

> Everything we do today might be scrutinized in great detail in the future because it will be "free".

s/"free"/stolen/

The bit about college courses for future prediction was just silly, I'm afraid: reminds me of how Conan Doyle has Sherlock not knowing Earth revolves around the Sun. Almost all serious study concerns itself with predicting, modelling and influence over the future behaviour of some system; the problem is only that people don't fucking listen to the predictions of experts. They aren't going to value refined, academic general-purpose futurology any more than they have in the past; it's not even a new area of study.

pnt12 12/11/2025||

On the site itself:

It's great that this was produced in 1h for $60. This is amazing for creating small utilities, exploring your curiosity, etc.

But the site is also quite confusing and messy. OK for a vibe coded experiment, sure, but wouldn't be for a final product. But I fear we're gonna see more and more of this. Big companies downsizing their tech departments and embracing vibe coding. By analogy with inflation, shrinkflation and skimpflation/enshittification, will we soon adopt some word for this? AIflation? LLMflation?

And how will this comment score in a couple of years? :)

slg 12/10/2025||

This is a perfect example of the power and problems with LLMs.

I took the narcissistic approach of searching for myself. Here's a grade of one of my comments[1]:

>slg: B- (accurate characterization of PH’s “networking & facade” feel, but implicitly underestimates how long that model can persist)

And here's the actual comment I made[2]:

>And maybe it is the cynical contrarian in me, but I think the "real world" aspect of Product Hunt is what turned me off of the site before these issues even came to the forefront. It always seemed like an echo chamber where everyone was putting up a facade. Users seemed more concerned with the people behind products and networking with them than actually offering opinions of what was posted.

>I find the more internet-like communities more natural. Sure, the top comment on a Show HN is often a critique. However I find that more interesting than the usual "Wow, another great product from John Developer. Signing up now." or the "Wow, great product. Here is why you should use the competing product that I work on." that you usually see on Product Hunt.

I did not say nor imply anything about "how long that model can persist", I just said I personally don't like using the site. It's a total hallucination to claim I was implying doom for "that model" and you would only know that if you actually took the time to dig into the details of what was actually said, but the summary seems plausible enough that most people never would.

The LLM processed and analyzed a huge amount of data in a way that no human could, but the single in-depth look I took at that analysis was somewhere between misleading and flat out wrong. As I said, a perfect example of what LLMs do.

And yes, I do recognize the funny coincidence that I'm now doing the exact thing I described as the typical HN comment a decade ago. I guess there is a reason old me said "I find that more interesting".

[1] - https://karpathy.ai/hncapsule/2015-12-18/index.html#article-...

[2] - https://news.ycombinator.com/item?id=10761980

npunt 12/11/2025|

I'm not so sure; that may not have been what you meant, but that doesn't mean it's not what others read into it. The broader context is HN is a startup forum and one of the most common discussion patterns is 'I don't like it' that is often a stand-in for 'I don't think it's viable as-is'. Startups are default dead, after all.

With that context, if someone were to read your comment and be asked 'does this person think the product's model is viable in the long run' I think a lot of people would respond 'no'.

slg 12/12/2025||

And this is a perfect example of how some people respond to LLMs, bending over backwards to justify the output like we are some kids around a Ouija board.

"The LLM isn't misinterpreting the text, it's just representing people who misinterpreted the text" isn't the defense you seem to think it is.

npunt 12/13/2025||

And your response here is a perfect example of confidently jumping to conclusions on what someone's intent is... which is exactly what you're saying the LLM did to you.

I scoped my comment specifically around what a reasonable human answer would be if one were asked the particular question it was asked with the available information it had. That's all.

Btw I agree with your comment that it hallucinated/assumed your intent! Sorry I did not specify that. This was a bit of a 'play stupid games win stupid prizes' prompt by the OP. If one asks an imprecise question one should not expect a precise answer. The negative externality here is that readers' takeaways are based on false precision. So is it the fault of the question asker, the readers, the tool, or some mix? The tool is the easiest to change, so probably deserves the most blame.

I think we'd both agree LLMs are notoriously overly-helpful and provide low confidence responses to things they should just not comment on. That to me is the underlying issue - at the very least they should respond like humans do not only in content but in confidence. It should have said it wasn't confident about its response to your post, and OP should have thus thrown its response out.

Rarely do we have perfect info; in regular communications we're always making assumptions which affect our confidence in our answers. The question is what's the confidence threshold we should use? This is the question to ask before the question of 'is it actually right?', which is also an important question to ask, but one I think they're a lot better at than the former.

Fwiw you can tell most LLMs to update their memory to always give you a confidence score 0.0-1.0. This helps tremendously, it's pretty darn accurate, it's something you can program thresholds around, and I think it should be built in to every LLM response.
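A minimal sketch of what "program thresholds around" could mean in practice. It assumes the model has been asked to reply with JSON shaped like {"answer": ..., "confidence": ...}; that shape and the function names are hypothetical, and nothing here checks whether the self-reported confidence is actually well calibrated.

```python
import json


def parse_graded_response(raw):
    """Parse a reply ASSUMED to look like {"answer": "...", "confidence": 0.0-1.0}."""
    data = json.loads(raw)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return data["answer"], confidence


def accept_or_discard(raw, threshold):
    """Return the answer if its self-reported confidence meets the threshold, else None."""
    answer, confidence = parse_graded_response(raw)
    return answer if confidence >= threshold else None
```

For example, a grade reported with confidence 0.35 would be thrown out under a 0.7 threshold rather than published alongside high-confidence grades.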

The way I see it, LLMs have lots and lots of negative externalities that we shouldn't bring into this world (I'm particularly sensitive to the effects on creative industries), and I detest how they're being used so haphazardly, but they do have some uses we also shouldn't discount and figure out how to improve on. The question is where are we today in that process?

The framework I use to think about how LLMs are evolving is that of transitioning mediums. Like movies started as a copy/paste of stage plays before they settled into the medium and understood how to work along the grain of its strengths & weaknesses to create new conventions. Speech & text are now transitioning into LLMs. What is the grain we need to go along?

My best answer is the convention LLMs need to settle into is explicit confidence, and each question asked of them should first be a question of what the acceptable confidence threshold is for such a question. I think every question and domain will have different answers for that, and we should debate and discuss that alongside any particular answer.

0xWTF 12/10/2025||

Now: compared to what? Is there a better source than HN? How's it compare to Reddit or lobsters?

Compared to what happens next? Does tptacek's commentary become market signal equivalent to the Fed Chair or the BLS labor and inflation reports?

tptacek 12/10/2025|

What makes you think it already isn't?

jacquesm 12/10/2025||

You've made me billions by now! Thank you...

npunt 12/11/2025||

One of the few use cases for LLMs that I have high hopes for and feel is still underappreciated is grading qualitative things. LLMs are the first tech (afaik) that can do top-down analysis of phenomena in a manner similar to humans, which means a lot of important human use cases that are judgement-oriented can become more standardized, faster, and more readily available.

For instance, one of the unfortunate aspects of social media that has become so unsustainable and destructive to modern society is how it exposes us to so many more people and hot takes than we have ability to adequately judge. We're overwhelmed. This has led to conversation being dominated by really shitty takes and really shitty people, who rarely if ever suffer reputational consequence.

If we build our mediums of discourse with more reputational awareness using approaches like this, we can better explore the frontier of sustainable positive-sum conversation at scale.

Implementation-wise, the key question is how do we grade the grader and ensure it is predictable and accurate?
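One possible starting point, sketched under assumptions: have humans grade a sample of the same comments and measure chance-corrected agreement with the LLM's grades using Cohen's kappa. This is a standard agreement statistic swapped in for illustration, not something the original post does, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels.

    1.0 means perfect agreement, 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Comparing LLM grades against human grades speaks to accuracy; running the same function on two separate LLM runs over the same comments speaks to predictability.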

Arodex 12/11/2025|

This is wrong, just look at this comment here:

https://news.ycombinator.com/item?id=46222523

LLMs can't reliably grade human text. They don't understand it.

tgtweak 12/11/2025|

Cool - now make it analyze all of those and come up with the 10 commandments of commenting factually and insightfully on HN posts...
