Posted by kcorbitt 10/28/2024

Using reinforcement learning and $4.80 of GPU time to find the best HN post(openpipe.ai)
217 points | 95 comments
jerjerjer 10/28/2024|
> In this case, I included the post title, author, date, and content. All of those factors could be relevant to the chance a story gets voted up.

> Even if the model gets extremely good at predicting final_score_if_it_hits_front_page, there’s still the inherent randomness of probability_of_hitting_front_page that is fundamentally unpredictable.

In addition to date, you might want to include three fields:

- day of week (categorical)

- is weekend/holiday (boolean)

- hour or time of the day (categorical, you can have 24 of them or morning/afternoon/etc.).

The probability of a post hitting the front page is usually affected by these things so it can really help the model.
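All three are easy to derive from the raw timestamp. A minimal Python sketch (field names are my own; real holiday detection would need a calendar lookup):

```python
from datetime import datetime, timezone

def time_features(ts: datetime) -> dict:
    """Derive the three suggested fields from a post's UTC timestamp."""
    return {
        "day_of_week": ts.strftime("%A"),  # categorical, 7 values
        "is_weekend": ts.weekday() >= 5,   # boolean (Sat/Sun; holidays need a lookup)
        "hour_of_day": ts.hour,            # categorical, 24 values (or bucket further)
    }

feats = time_features(datetime(2024, 10, 28, 14, 30, tzinfo=timezone.utc))
```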

sitkack 10/28/2024||
I find that the best stories get posted by folks in EU time zones, as well as on weekends (more of a hacker ethos). The flame-bait startup drama is M-F Pacific.
jedberg 10/28/2024|||
I haven't run the data, but anecdotally I can tell you that those things probably don't affect hitting the front page. They do affect the total score, but that is not what is being optimized here.

It's counterintuitive, but if you post at a really popular time, you're competing with a lot of other submissions. If you post at a really slow time, you'll get fewer votes, but it will take fewer to reach the front page and you'll have less competition.

In the end, it kinda evens out. The number of votes it takes to get to the front page and the number of competing submissions are both correlated to your fields above.

floobertoober 10/28/2024|||
I think this assumes a uniform distribution of "interestingness" in the competing posts across all of those dimensions, and I wouldn't be surprised if that isn't the case.
jedberg 10/28/2024||
It may not be even, but I don't think interestingness is correlated with time of day. But I could be wrong!
sadeshmukh 10/29/2024||
Interestingness is subjective, and I would imagine different timezone people have different preferences. Interesting thing to ponder for a bit
4m1rk 10/29/2024|||
Popular times for voting vs. posting are not the same.
josefx 10/29/2024|||
> is weekend/holiday

Somehow this reminded me of someone data-mining spiegel.de (German news site) and using the timestamps of the posted articles to extrapolate the writers' religion (holidays) and relationships (shared vacations), among dozens of other data points, from several years of publicly available data. I think no AI was involved back then.

EffrafaxOfWug 10/29/2024||
For anyone interested, it was this CCC talk by David Kriesel (sadly German only).

https://media.ccc.de/v/33c3-7912-spiegelmining_reverse_engin...

drilbo 10/30/2024||
There is an English-translated audio track, actually. (Sound quality is not fantastic, though.)
maaaaattttt 10/28/2024|||
I wonder if hour of day would benefit from being combined with HN's visitors location data to be truly relevant? I think the location is embedded in the time somehow if the visitors' origins are stable over time. If 9am PT is a popular time and most of the visitors are on the PT timezone then even if this 9am PT is encoded as UTC the model will pick it up (I think). Now, if over time visitors get more diverse and a big chunk is now coming from Europe, this original 9am will make less sense to the model. Adding visitors origin stats at time of the post would probably even help surface region trends. But I guess this historical data isn't public.
kcorbitt 10/28/2024|||
Yep that makes sense. Would be interesting to do a follow-up that explicitly includes these variables and see if it meaningfully improves the results.
fennecbutt 11/1/2024|||
The data is massively interconnected too: if Apple releases a new M chip, people flood here to see if there's a thread on it, and while browsing they may be more or less likely to see other threads because of that first visit.
rajnathani 10/30/2024|||
I would replace author with a boolean of if the author's account is new or not (the green marker that HN has for new users' posts and comments).
aaron695 10/29/2024||
> might want to include three fields:

This has been studied multiple times on HN posts, most seem to have link-rotted. Web Archive them if looking for insights - https://hn.algolia.com/?q=best+time+to+post

kelnos 10/28/2024||
I don't get the conclusion the author is trying to draw. If you look at the data presented, it seems that the model was actually pretty bad at guessing the real-world behavior of the posts listed. Out of the top ten it picked:

* 1 had a score that was reasonably close (8.4%) to what the model predicted

* 4 had scores wildly lower than the model predicted

* 2 had scores wildly higher than the model predicted

* the remaining 3 were not wildly off, but weren't really that close either (25%-42% off)

Then there's a list of 10 submissions that the model predicted would have scores ranging from 33 to 135, but they all only received a score of 1 in reality.

The graph shown paints a bit of a better picture, I guess, but it's still not all that compelling to me.

kcorbitt 10/28/2024||
This is a fair point. The reason why I think "correlation" is a better metric than "predicts the exact correct score" is because of how I'll be using this model in the next post.

Broadly, the main use case for this model (in the RL context) will be to take two different versions of the same post, and predict which of the two is more likely to be upvoted. So what matters isn't that it gets the exact number of upvotes correctly, but that it correctly predicts the relative difference in likely upvote count between two variants.

Now it still doesn't do a great job at that (the correlation is only 0.53, after all), but it does a good enough job to provide some useful signal.

espadrine 10/29/2024||
That makes me wonder though what the best loss function was. I assume you used MSE on the logscore. I wonder if a sigmoid on which of two articles has the higher score would yield better results for the downstream RLHF task.
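Concretely, the pairwise idea would look something like this (toy sketch, Bradley-Terry style; not what the post actually trained):

```python
import math

def pairwise_logistic_loss(score_a: float, score_b: float, a_wins: bool) -> float:
    """Loss for predicting which of two posts scores higher.

    score_a/score_b are the model's raw predictions; the sigmoid of their
    difference is the predicted probability that post A beats post B.
    """
    p_a_wins = 1.0 / (1.0 + math.exp(score_b - score_a))
    p = p_a_wins if a_wins else 1.0 - p_a_wins
    return -math.log(p)

# The loss is small when the model's ordering matches the observed ordering.
confident_right = pairwise_logistic_loss(3.0, 1.0, a_wins=True)
confident_wrong = pairwise_logistic_loss(3.0, 1.0, a_wins=False)
```

Only the relative ordering of the two predictions matters here, which is exactly the property the downstream RL task needs.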
manx 10/29/2024|||
Scores are not a good metric to be compared. I did some data analysis and wrote about it here: https://felx.me/2021/08/29/improving-the-hacker-news-ranking...
nl 10/29/2024||
The score divergence is likely because if a story makes the front page then it almost certainly gets comments and each comment adds one to the score.

But the number of comments depends on the time posted more than the story itself and that information isn't in the model.

youoy 10/28/2024||
Thanks for sharing! Very interesting.

> The correlation is actually not bad (0.53), but our model is very consistently over-estimating the score at the low end, and underestimating it at the high end. This is surprising; some variation on any given data point is expected, but such a consistent mis-estimation trend isn’t what we’d expect.

This is a consequence of the model objective. If the model doesn't know what will really happen, regressing toward the middle is a good way to reduce the overall error. If it instead tried to exactly predict the very highs and very lows, it would take very large errors on those points, resulting in a bigger overall error.
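To see this concretely, here's a toy simulation (my own construction, not from the post): when the model's signal for a post is noisy, shrinking predictions toward the global mean beats trusting the signal under MSE, and it reproduces exactly that over-at-the-low-end, under-at-the-high-end pattern.

```python
import random

random.seed(0)

# Each post has a true score s; the model only sees a noisy signal x of it.
N, MEAN, STD = 20_000, 5.0, 2.0
true_scores = [random.gauss(MEAN, STD) for _ in range(N)]
signals = [s + random.gauss(0, STD) for s in true_scores]  # noise var == score var

bold = signals                                       # trust the signal fully
shrunk = [MEAN + 0.5 * (x - MEAN) for x in signals]  # pull halfway to the mean

def mse(preds):
    return sum((p - s) ** 2 for p, s in zip(preds, true_scores)) / N

# Bias of the shrunken predictor on genuinely low-scoring posts:
low_bias = [p - s for p, s in zip(shrunk, true_scores) if s < MEAN - STD]
```

Under MSE the shrunken predictor wins, even though it systematically over-estimates the low scorers and under-estimates the high ones.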

Apart from that, I want to comment on AI alignment here. For me, the objective of "most upvotes" is not fully correlated with where I get the most value on HN. Most of the time, the most upvoted stories are ones I would have found anyway on other platforms. It's the middle range that I really like. So be careful implementing this algorithm at scale; it could turn the website into another platform with shitty AI recommendations.

kcorbitt 10/28/2024|
> For me the objective of "most up votes" is not fully correlated with where I get the most value on HN. Most of the time, the most up voted I would have found them anyway on other platforms.

Yes, this is a fantastic point. I'm curious if there's some other measurable proxy metric for "things I get the most value out of on HN"? Upvotes seems like the most natural but optimizing for it too strongly would definitely take HN down a dark path.

losteric 10/28/2024|||
Perhaps selecting for posts with the highest quality reply engagement? If many different people were drawn to lengthy discussions, that suggests the content sparks thoughts that others then feel compelled to engage with. Or select for the emotional content of replies, awe/empathy/anger, depending on what one wants from HN?
hatthew 10/28/2024|||
lots of platforms optimize for engagement, but all that does is encourage ragebait
kcorbitt 10/28/2024|||
Ohh, I really like that as a potential proxy metric!
coolcoder613 10/28/2024|||
Perhaps number of comments, or number of non-flamewar comments, or proportion of flamewar comments together with number of comments?
oli5679 10/28/2024||
If you withhold a small amount of data, or even retrain on a sample of your training data, then isotonic regression is a good way to solve many calibration problems.

https://scikit-learn.org/dev/modules/generated/sklearn.isoto...
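For reference, the algorithm behind it (pool adjacent violators) is simple enough to sketch in plain Python:

```python
def pav_calibrate(targets):
    """Pool Adjacent Violators, the algorithm behind isotonic regression.

    `targets` are the actual outcomes, sorted by ascending model prediction;
    the result is the best monotone (non-decreasing) fit to them under MSE.
    """
    blocks = []  # each block is [sum_of_targets, count]
    for t in targets:
        blocks.append([t, 1])
        # Merge backwards while a block's mean exceeds its successor's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

# The dip at the third point gets pooled with its neighbor.
calibrated = pav_calibrate([1.0, 3.0, 2.0, 5.0])
```

In practice you'd just use sklearn's IsotonicRegression (linked above), which wraps this plus interpolation for unseen prediction values.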

I also agree with your intuition that if your output is censored at 0, with a large mass there, it's good to create two models, one for likelihood of zero karma, and another expected karma, conditional on it being non-zero.

kcorbitt 10/28/2024||
I hadn't heard of isotonic regression before, but I like it!

> it's good to create two models, one for likelihood of zero karma, and another expected karma, conditional on it being non-zero.

Another way to do this is to keep a single model but have it predict two outputs: (1) likelihood of zero karma, and (2) expected karma if non-zero. This would require writing a custom loss function which sounds intimidating but actually isn't too bad.
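Roughly, the combined loss would look something like this (sketch only; head names and the score-of-1 cutoff are my own choices):

```python
import math

def hurdle_loss(zero_logit: float, pred_log_score: float,
                actual_score: float) -> float:
    """Combined loss for a two-headed model.

    Head 1 (zero_logit) predicts P(post flops, i.e. stays at the default
    1 point). Head 2 (pred_log_score) predicts log-score given traction.
    """
    p_flop = 1.0 / (1.0 + math.exp(-zero_logit))
    flopped = actual_score <= 1
    # Binary cross-entropy on the "flopped" indicator...
    bce = -math.log(p_flop if flopped else 1.0 - p_flop)
    # ...plus squared error on log-score, only for posts that got votes.
    sq = 0.0 if flopped else (pred_log_score - math.log(actual_score)) ** 2
    return bce + sq
```

The gradient from the regression term never flows for flopped posts, so the score head only has to model the conditional distribution.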

If I were actually putting a model like this into production at HN I'd likely try modeling the problem in that way.

Y_Y 10/28/2024||
Did you dictate this? It looks like you typo'd/brain'd "centered" into "censored", but even allowing for phonetic mistakes (of which I make many) and predictive-text flubs, I still can't understand how this happened.
oli5679 10/28/2024|||
I was thinking of censoring, maybe I should have said another word like floored.

The reason I think of this as censoring is that there are some classical statistical models that model a distribution with a large mass at a minimum threshold, e.g. "tobit" censored regression.

https://en.wikipedia.org/wiki/Censoring_(statistics)

Y_Y 10/28/2024||
Thanks for the explanation. I never paid much attention in my stats lectures so I deserve to have missed out on that term-of-art. I think the physics lingo would be to call it "capped" or "bounded" or "constrained".
oli5679 10/28/2024||
thanks, it's very understandable that you thought i was mistyping 'centred'.
CaptainFever 10/28/2024||||
I'm not the parent commenter, but whisper based dictation is getting pretty awesome nowadays. It's almost as good as sci-fi.

(Fully dictated, no edits except for this)

1024core 10/28/2024|||
I also thought that the commenter spoke "centered" and the speech recognition model output "censored".
swyx 10/28/2024||
> This query took 17 seconds to load the dataset into RAM and then aggregating by type was almost instant. It is absolutely incredible to me that I can load every HN post and comment ever into RAM in a few seconds on my (admittedly beefy) dev laptop, and analyze them at will. What an age of abundance!

https://motherduck.com/blog/big-data-is-dead/

Arctic_fly 10/28/2024||
> But in 2015 there is a stark discontinuity, where the number of stories (with text) shoots up by >10x, and the average score drops by 5x! Is this some kind of eternal September?

Based on the later analysis in the post (which I agree with), the total score of a story is disproportionately tied to whether it hits the front page, and of course how long it stays there. Regardless of the quality of the average post starting in 2015, the sheer quantity would make it impossible for all but a few to stay on the front page for very long. Hacker News got more popular, so each story got less prime time.

kcorbitt 10/28/2024||
Hey all, this project was a labor of love I worked on in my spare time over the last couple of weeks. Happy to answer any questions!
Eisenstein 10/29/2024|
I think it is interesting, but I can't help but feel that things like this result in the homogenizing and blandifying of content. It is like training a model to predict which movies will be successful at the box office -- the result will be the same kinds of movies over and over. No one knows what the breakthrough success is until it shows up, and no model can predict those. Essentially this is teaching people how to make HN full of nothing but complaints and indie success stories.

What is your take on this?

sdflhasjd 10/28/2024||
It's interesting that service complaints are so popular on HN. I always feel a bit bad that my most popular HN contribution was me complaining about a popular service
kelnos 10/28/2024||
I flag most complaint posts, unless the complaint actually brings to light or discusses something surprising or unique that can be generalized and discussed.

I generally find these posts pretty boring, and most comments on them are people recounting their own stories about how that (or a similar) service screwed them over. I suppose they can be a decent way to warn people off of a particular product (scammy, terrible customer support, whatever), but that's not what I come to HN for.

Karrot_Kream 10/28/2024|||
A popular theory on techie parts of the web is that engagement-optimizing sites create this negativity loop, but I disagree. I think negativity is naturally something that people seek no matter what the algorithm is. In an upvote-based site, outrage ranks to the top. I also think text-based platforms suffer from negative engagement much more so than multimedia platforms.

Model correlation is decent here but there's certainly more to do to use its outputs predictively.

johnfn 10/29/2024|||
I don't really agree with this. I go and hang out with my friends, and we don't all end up getting outraged about stuff. I go for a walk in the park and no one is shouting at me; I go to a restaurant and people are sitting around normally discussing whatever. If you start quoting outrage bait that you read online, people might look at you strangely.

My point is I don't think people seek out outrage. Social media's algorithms may not explicitly reward it as transparently as `if (post.outrage > 100) post.boost()`, but outrage isn't some default rule of interaction.

miki123211 10/28/2024||||
As a mastodon user, I can definitely confirm this.

Give people the way to repost / retweet / boost, and your feed suddenly turns into mostly negativity, even if your algorithm is "show posts from my followers only, newest to oldest"

Karrot_Kream 10/28/2024||
Yeah my Bluesky followers are carefully curated to stop from swelling into negativity. I've been playing around with a labeller that filters followed posts into those that I find emotionally pleasant which I've been training based on my own labeling of followers' posts. The goal is to follow more people and have the labeller (or feed generator depending on how I go) hide the posts I don't care for.
Vampiero 10/28/2024||||
If that theory were true, then what about every website on the internet pre-2010? What about 4chan?

See also https://en.wikipedia.org/wiki/Negativity_bias

We're just built like that.

Regarding text platforms suffering more than non-text platforms, I think it's because of the lack of social cues that are otherwise there. You can infer a lot from the way someone talks, or from their body language. You can't infer much from text, which is partly why Poe's law exists -- sarcasm doesn't translate well.

Karrot_Kream 10/28/2024||
> what about every website on the internet pre-2010

It was definitely there. Plenty of forums had "rant threads" that were efforts to quarantine shitty reactionary behavior like this. Also a lot of the healthier forums were smaller forums. I was on plenty of forums that had 10-20 folks on them that today would just be a Telegram group chat or a small Discord "server". These small spaces tend to be a lot lower on toxicity than larger fora. I was part of a few large fora like Gaia Online and they were just as toxic as today's large platforms. Managing large communities with chronological posting is really difficult and upvote based social networks were the first real networks to be able to scale to larger userbases without having hundreds of moderators (like Gaia or the large MUDs.)

> What about 4chan?

4chan is immune because the default emotional register there is indignant dismissal. Because of this it's just a matter of choosing what else to layer on top of the indignant dismissal, like sarcasm or anger or whatnot.

> Regarding text platforms suffering more than non-text platforms, I think it's because of the lack of social cues that are otherwise there. You can infer a lot from the way someone talks, or from their body language. You can't infer much from text, which is partly why Poe's law exists.

That's an interesting theory actually. My theory was that in the age of multimedia platforms, text platforms tend to attract folks who specifically want to use text over multimedia. Generally text forums will select for folks with social or self-esteem issues. These folks are the least likely to healthily deal with their emotions or disengage positively. This leads to higher toxicity on text based platforms.

Eisenstein 10/29/2024|||
> My theory was that in the age of multimedia platforms, text platforms tend to attract folks who specifically want to use text over multimedia. Generally text forums will select for folks with social or self-esteem issues. These folks are the least likely to healthily deal with their emotions or disengage positively. This leads to higher toxicity on text based platforms.

Some people like to take time to compose thoughts in written form because that is generally the best way to communicate thoughtfully. You can say what you will about a lack of body language, but plenty of people get into verbal fights in person and it doesn't help that they end up talking over each other.

I think that your assertion that people who communicate via text have social issues is without evidence and is reductive.

You could say that people who enjoy looking at themselves and hearing themselves enough to edit their footage and post it online have ego issues and are less likely to listen to what others have to say.

Karrot_Kream 10/29/2024||
My reading of your response is that you identify as a person who prefers written-form communication because you feel it is the best way to communicate thoughtfully, and that you felt personally attacked by my response. I think that's reductive and not really relevant to this train of thought; your response reads like a defense of your identity. I personally prefer communicating in text as well, because I like to take my time to compose my thoughts, but I know that presents a weakness for me: I'm much less able to articulate my thoughts in fast-moving situations such as work meetings or community emergency planning. I am, indeed, less capable in social situations than others, and it's a deficiency I've tried to grow past my entire life.

The direction of my implication comes from observation: text communities tend to all descend into toxicity (observation) -> why does this happen in text communities moreso than non-text communities? (question) -> higher proportion of socially maladapted people (theory). You might well be correct that people who enjoy looking and hearing themselves and have ego issues are the ones that prefer (compose a higher proportion thereof) multimedia social networks. I don't disagree with you, either. That's beside the point. The point is that most text communities tend to descend into toxicity.

Humans aren't perfect and if I'm in a positive community of high egos, I'd much prefer that than a toxic community with "normal" egos.

So I want to zoom in on this:

> Some people like to take time to compose thoughts in written form because that is generally the best way to communicate thoughtfully. You can say what you will about a lack of body language, but plenty of people get into verbal fights in person and it doesn't help that they end up talking over each other.

We're talking about social networks here, not real life, because social networks deal with a fundamentally different problem. In a social network (yes, this includes IRC) you are interacting with a number of people with whom you do not share any real-world context, with whom you do not share any physical space, and who generally have a much lower stake in their relationships because of that lack of shared context.

In my experience all textual social networks that grow beyond a certain number of users descend into toxicity: Usenet, IRC (old Freenode and Rizon), Slashdot, Digg, Reddit, HN, Youtube Comments, Nextdoor, Local News Comments, Twitter/X, etc. I think "algorithms" (including counting upvotes) have reduced the moderation burden and allowed social sites to scale much higher than they could before algorithms.

Text communities all eventually collapse into ranting, bullying, hot takes, moral outrage, zealotry, and negativity. I'm open to any and all theories about why this is but I find this specific to text-based communities: Twitch, Instagram and TikTok have so much less of it for example. I think the idea that text leads to thoughtful communication was a hypothesis advanced first during the Usenet era and later during the blogging era but ended up being disproven. I think there's a nostalgia of the pre-media web that pervades these discussions that prevent text-fans from realizing at a macro level that the toxicity that was on comp.lang.lisp is the same toxicity in HN comments and is toxicity that just isn't there on most of Instagram, for better or for worse.

I actually think this identity around being a "text person" is part of the problem. The moment you wrap your identity around something you become both proud and protective of it. For some things this is fine, but if your preferred media itself becomes part of your identity, then you're going to have a blind spot around what makes your preferred social media different from the others.

Eisenstein 10/29/2024|||
Excuse me but you are the one who made the 'text identity' distinction and called people who don't prefer posting videos of themselves 'toxic'.

What exactly is a 'multimedia community' anyway? You haven't defined it. Is it just TikTok?

Karrot_Kream 10/29/2024||
I don't think you're really engaging with my comment. I feel that you're offended at me calling text-only users of the internet toxic, and that you're responding in defense. If that's the case then there's no value in our discussion. You're just going to reply with charged comments until I recant.

If you want another perspective on my point, take a look at https://www.reddit.com/r/slatestarcodex/comments/9rvroo/most...

Have a nice day.

Eisenstein 10/29/2024||
I think that you are not only incredibly patronizing, but you use faux psychology 'active listening' tactics to pretend to engage when you are really just shoving your point through while making yourself think that you are listening to people.

The fact that you cannot even engage to answer what a multimedia community is without claiming that I am acting in bad faith in order to jump out of an escape hatch is telling.

Your lack of self-awareness is astonishing.

drilbo 10/30/2024|||
>Twitch, Instagram and TikTok have so much less of it for example.

I'd be interested in any sort of evidence that supports this

Vampiero 10/28/2024|||
> My theory was that in the age of multimedia platforms, text platforms tend to attract folks who specifically want to use text over multimedia. Generally text forums will select for folks with social or self-esteem issues. These folks are the least likely to healthily deal with their emotions or disengage positively. This leads to higher toxicity on text based platforms.

Yeah that's very plausible indeed

int_19h 10/28/2024||||
This video will make you angry: https://www.youtube.com/watch?v=rE3j_RHkqJc
jerjerjer 10/28/2024|||
Humans love having something to be righteously indignant about.
Rick76 10/28/2024|||
I don't like it, but the internet always seems to react more to inherently negative posts. That's common across the entire internet, and I think it's why the internet doesn't seem as fun as it did 10 years ago.

I'm sure it's just human psyche but I'm trying to overcome it and make my life more positive again

andrewmcwatters 10/28/2024|||
I suspect a large percentage of Dan's work moderating HN is downweighting posts that incite engagement from frustration. On at least one occasion I've had the top comment in a thread, ahead by over 100 upvotes, that purely captured the sentiment of several readers but did not contribute to the curated voice of the community.
pclmulqdq 10/28/2024||
There is a timing factor that you need to consider, too. Anecdotally, Sunday morning is the best time to get onto the front page, while Tuesday or Wednesday morning gets you the most views.
kcorbitt 10/28/2024|
Yep, that's why I included the post date in the information available to the model; in theory (if it's smart enough) it should be able to take that into account. That said I didn't include time-of-day; it would be interesting to see whether adding that information would be able to make the model more accurate!

If the reward model is indeed smart enough to be able to take that into account you could actually use it to plan the optimal time of day to post a specific story! You could just use the reward model to compute a predicted score for 8 different versions of your content, holding the post title/text constant across them all and just changing the date. Based on the differences in scores, you can determine which posting time the RM thinks is most likely to make your post successful!
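The scan itself is trivial once you have a reward model; something like this (with a toy stand-in for the actual model):

```python
from datetime import datetime, timedelta

# Hypothetical stand-in for the trained reward model; in reality this would
# run inference on the post title/body plus the candidate timestamp.
def predicted_score(title: str, posted_at: datetime) -> float:
    return 10.0 + (5.0 if 6 <= posted_at.hour < 10 else 0.0)  # toy: mornings win

def best_posting_time(title: str, start: datetime, candidates: int = 8) -> datetime:
    """Score identical copies of a post at several hypothetical times,
    holding title/text constant, and keep the highest-scoring time."""
    times = [start + timedelta(hours=3 * i) for i in range(candidates)]
    return max(times, key=lambda t: predicted_score(title, t))

best = best_posting_time("Show HN: my project", datetime(2024, 10, 28))
```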

pixl97 10/28/2024||
>you could actually use it to plan the optimal time of day to post a specific story!

You see this on Reddit pretty commonly.

Someone posts original content at an off time and gets a small/moderate number of upvotes. Then some time later (could be hours, days, or weeks) a bot/karma account will repost the content at an optimal time to farm upvotes.

manx 10/29/2024|
Very interesting! Identifying great new content is a big unsolved problem for HN IMHO. Unfortunately, scores are not a good metric to predict, because they are not comparable (see https://felx.me/2021/08/29/improving-the-hacker-news-ranking...). A better metric might be "upvoterate", defined as how much more or less likely users are to upvote a story compared to the average story. More about that here: https://github.com/social-protocols/quality-news?tab=readme-...
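As a rough sketch of the idea (the linked posts define it more carefully, e.g. with attention-weighted impressions):

```python
def upvoterate(upvotes: int, impressions: int,
               site_upvotes: int, site_impressions: int) -> float:
    """How much more (or less) often a story is upvoted per view than the
    sitewide average story. 1.0 == average; 2.0 == twice as upvote-worthy
    per impression."""
    story_rate = upvotes / impressions
    site_rate = site_upvotes / site_impressions
    return story_rate / site_rate

# A story upvoted 3x as often per view as the site average:
rate = upvoterate(upvotes=30, impressions=1000,
                  site_upvotes=50_000, site_impressions=5_000_000)
```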