Posted by kcorbitt 10/28/2024

Using reinforcement learning and $4.80 of GPU time to find the best HN post(openpipe.ai)
217 points | 95 comments | page 2
Nevermark 10/29/2024|
> It’s super important that your training inputs includes all the information your model will need to make predictions. In this case, I included the post title, author, date, and content. All of those factors could be relevant to the chance a story gets voted up.

You would do better to leave out dates and authors.

Do you really want the model to home in on dates & authors? If you trained only on those, would it produce anything useful?

It can’t for dates, since it never sees examples from future dates to prepare it for them. I suppose you could argue that month & day matter. But surely that would be a much lower-quality discriminator than forcing the model to stay focused on title & content.

Similarly with authors: you can find out which authors produce the most-upvoted content with a simple calculation.

But again, is that the discriminator you want the model to use? Or the title & content? Because it will use the easiest discriminator it can.

gavin_gee 10/28/2024||
Take note HN, this is what great content marketing looks like.
6gvONxR4sf7o 10/28/2024||
Why use RL for this instead of plain old supervised learning?
dinobones 10/28/2024||
I am trying to understand this too.

In supervised learning you train on pairs (x, y), where x is your input (title/post text/metadata) and y is the output score.

Naively, it's a linear regression model, Y = b0 + b1x1 + b2x2 + b3x3, where b0 is your bias ("a floor for score points"), and b1, b2, and b3 are the coefficients on the post's actual features. You can solve this in closed form and find the b0/b1/b2/b3 that minimize the error of fitting to Y.
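
To make that concrete, here's a toy sketch of the closed-form (ordinary least squares) fit; the features and numbers are made up purely for illustration:

    import numpy as np

    # Toy design matrix: one row per post, with made-up numeric features
    # (e.g. title length, body length, poster account age in years).
    X = np.array([
        [12.0, 300.0, 2.0],
        [ 8.0, 150.0, 5.0],
        [20.0, 800.0, 1.0],
        [15.0, 400.0, 3.0],
    ])
    y = np.array([10.0, 3.0, 55.0, 20.0])  # observed upvote counts

    # Prepend a column of ones so the intercept b0 is learned too.
    X1 = np.column_stack([np.ones(len(X)), X])

    # Closed-form least squares: b = (X^T X)^+ X^T y
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    print(b)  # [b0, b1, b2, b3] minimizing squared error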

How do these equations change with RL? I always assumed RL was a multi-step process where actions are taken to get to a reward. If there is only one step/decision to produce a score, it feels much like supervised learning.

jampekka 10/28/2024||
The post is not doing RL. It's just regression as you thought.
billmalarky 10/28/2024||
This post is using regression to build a reward model. The reward model will then be used (in a future post) to build the overall RL system.

Here's the relevant text from the article:

>In this post we’ll discuss how to build a reward model that can predict the upvote count that a specific HN story will get. And in follow-up posts in this series, we’ll use that reward model along with reinforcement learning to create a model that can write high-value HN stories!

jampekka 10/30/2024||
The title is misleading. The $4.80 is spent on supervised learning to find the best post.

The post is interesting and I'll be sure to check out the next parts too. It's just that people, as evidenced by this thread, clearly misunderstood or were misled about what was done.

jampekka 10/28/2024||
It is just plain old supervised learning. A regression from the post features to vote count. The RL discussion in TFA is a bit confusing.

Such a model can be used as the "reward model" for the "reinforcement learning from human feedback" (RLHF) method.

Havoc 10/28/2024||
Nice write up.

Did you ever figure out what happened in 2016?

kcorbitt 10/28/2024|
Nope. I was actually planning on asking dang if he has any insights there. If he sees this thread hopefully he can chime in!
n2d4 10/28/2024|||
Given that Google Trends doesn't show that bump, I'd assume it has to do with how the data was collected. Maybe all stories with < X votes/comments older than 2015 are not included, or deleted from whatever index you used?
kelnos 10/28/2024||||
In case he doesn't, you might as well email him about it. He's a very responsive guy and might find it interesting.
twoodfin 10/28/2024|||
I think text vs. link used to be XOR, but isn’t any longer.

It’s still outside the hn mainstream to use both in the same submission, so that might be biasing the model in strange ways.

jerjerjer 10/28/2024||
From the post:

> But to simplify, instead I’ll just limit to stories that have only text bodies, instead of links.

This line implies that both pre- and post-2016 stories in the dataset are text-only, so that change should not affect the data much.

1024core 10/28/2024||
Is my understanding correct that the reward model is also similar to an LLM (with the difference being that it predicts a score instead of the next token)?
kcorbitt 10/28/2024|
Yes! The architecture is almost identical. The only difference is in the final layer. In an LLM used for text generation, the final layer has a separate output for every potential token the model could produce, and we decide which token to generate by choosing the one with the highest likelihood at each generation step (at least that's what the simplest sampling methods do). In an LLM used as a reward model, we only have one output in the final layer, and we interpret its value as the predicted reward.

Everything else in the model before that final layer is exactly identical, architecture-wise.
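
A minimal PyTorch-style sketch of just that last-layer difference (the sizes and names here are illustrative, not the actual model's):

    import torch
    import torch.nn as nn

    hidden_size, vocab_size = 768, 50257  # illustrative sizes

    # Text-generation head: one logit per token in the vocabulary.
    lm_head = nn.Linear(hidden_size, vocab_size)

    # Reward-model head: a single output, interpreted as the predicted reward.
    reward_head = nn.Linear(hidden_size, 1)

    # Stand-in for the transformer's final hidden state at the last position
    # (batch_size x hidden_size); everything that produces it is identical.
    hidden = torch.randn(2, hidden_size)

    next_token_logits = lm_head(hidden)     # shape (2, vocab_size)
    predicted_reward = reward_head(hidden)  # shape (2, 1)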

1024core 10/28/2024||
But a typical LLM has a feedback loop: it looks at the last token it generated and then decides, given the N tokens before that, which token to output next.

In the case of a reward model, are you streaming in the list of tokens; if so, what is the output after each token? Or are you feeding in all of the tokens in one shot, with the predicted reward as the output?

maleldil 10/28/2024||
There are multiple ways to model reward. You can have it be fine-grained, such that every token gets its own reward, but by far the most common is to feed in the whole sequence and generate a single reward at the end.
1024core 10/28/2024||
I guess I'm not sure how the "feed in the whole sequence" works, if there's a single reward at the end.
maleldil 10/31/2024||
It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that was pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head, which would depend on your problem. You could then train this classification head on your alignment dataset like a classification problem.
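
As a rough sketch of that pattern using the Hugging Face transformers library (the model name and the untrained head are just for illustration):

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    reward_head = nn.Linear(encoder.config.hidden_size, 1)  # regression head

    inputs = tokenizer("Show HN: a toy reward model", return_tensors="pt")

    with torch.no_grad():
        outputs = encoder(**inputs)

    # The [CLS] token sits at position 0; its final-layer embedding is used
    # as a summary of the whole sequence.
    cls_embedding = outputs.last_hidden_state[:, 0]

    reward = reward_head(cls_embedding)  # one scalar per input sequence

You would then train the head (and usually the encoder) on your alignment dataset as a regression or classification problem.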

You can check the examples from the TRL library for more information.

1024core 11/6/2024||
> You can check the examples from the TRL library for more information.

What library is that? Thanks!

hnburnsy 10/29/2024||
A suggestion would be to try to correlate posting time with how likely a story is to get noticed on HN. A good post won't catch fire if it doesn't overcome the initial low visibility. I've posted items that were later posted by others and gained traction.

Maybe the reputation of the poster is also a factor?

metalman 10/30/2024||
Now do it again, and this time see where your post on ranking posts ranks. Personally, I find lauding the dead, and the dead past, to be somehow objectionable. Though I suppose that is the business of our so-called AI: mining the dead past, hoping to come up with something better than Frankenstein's zombie corpse. It is an insurmountable limitation, and dangerous I think as well; the past is that ultimately perfect thing, in its absolute immutability and totality, as it is all there; to pick and choose from such a thing is brazen indeed. I can't help but imagine a picture of your $4.80 actually being consumed in a bed of fluidised coal, which in fact it was.
eugenekolo 10/28/2024||
What does the model say about this post?
kcorbitt 10/28/2024|
Haha great question. Since it's only trained on on-platform HN content and not external links, this post is a little bit out of distribution for it unfortunately. I'm thinking about scraping a corpus of external links and running the same analysis though, in which case I'd definitely run it on this story because I'm also curious about that. :)
Rick76 10/28/2024||
I would be very interested in the results of that as well
hn_throwaway_99 10/28/2024|
> And in follow-up posts in this series, we’ll use that reward model along with reinforcement learning to create a model that can write high-value HN stories!

Well, thanks HN, you were good while it lasted...
