Posted by kcorbitt 10/28/2024
You would do better to leave out dates and authors.
Do you really want the model to home in on dates & authors? If you just trained on those, would it create anything useful?
It can’t for dates, since it never sees examples from future dates to prepare for them. I suppose you could argue that month & day matter. But surely that would be a much lower-quality discriminator than forcing the model to stay focused on title & content.
Similarly with author. You can find out which authors produce content with the most upvotes with a simple calculation.
But again, is that the discriminator you want the model to use? Or the title & content? Because it will use the easiest discriminator it can.
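For the author point above, the "simple calculation" could be as little as a group-by over a dump of stories. A toy sketch with made-up data (the column names are assumptions, not the article's schema):

```python
import pandas as pd

# Made-up stories table; "author" and "score" are illustrative column names.
stories = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol", "bob"],
    "score":  [120, 4, 85, 30, 2],
})

# Average and total upvotes per author, highest average first.
by_author = (
    stories.groupby("author")["score"]
    .agg(["mean", "sum", "count"])
    .sort_values("mean", ascending=False)
)
print(by_author)
```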
In supervised learning you train on pairs (x, y), where x is your input (title/post text/metadata) and y is the output score.
Naively, it's a linear regression model, Y = b0 + b1x1 + b2x2 + b3x3, where b0 is your bias term (a "floor" for score points) and b1, b2, and b3 are weights on the post's actual features. You can solve this in closed form and find the b0/b1/b2/b3 that minimize the squared error of the fit to Y.
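For concreteness, here's a minimal sketch of that closed-form fit; the features and scores are made up purely for illustration:

```python
import numpy as np

# Toy design matrix: one row per post, with made-up features
# (say, title length, body length, hour posted).
X = np.array([
    [12, 300,  9],
    [ 8, 120, 14],
    [15, 800, 20],
    [10, 450, 11],
], dtype=float)
y = np.array([40.0, 5.0, 120.0, 30.0])  # observed scores

# Prepend a column of ones so the intercept b0 is fit alongside the weights.
X1 = np.hstack([np.ones((len(X), 1)), X])

# Ordinary least squares: finds the b that minimizes ||X1 b - y||^2
# (equivalently, solves the normal equations (X1^T X1) b = X1^T y).
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
b0, b1, b2, b3 = b
print("intercept:", b0, "weights:", b1, b2, b3)
```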
How do these equations change with RL? I always assumed RL was a multi-step process where actions are taken to reach a reward. If there is only one step/decision to produce a score, it feels much like supervised learning.
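To illustrate the distinction the question is pointing at (not the article's actual training setup), here's a toy contrast between a supervised squared-error loss and a one-step, bandit-style policy-gradient loss:

```python
import torch

# Supervised regression: we know the true score y and minimize squared error.
pred = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(5.0)
supervised_loss = (pred - y) ** 2

# One-step RL (a contextual bandit): the policy samples an output, a reward
# model scores it, and we maximize expected reward with a REINFORCE-style
# objective: loss = -log_prob(sampled output) * reward. There is no target y,
# only a scalar reward for whatever the policy happened to produce.
log_prob = torch.tensor(-2.3, requires_grad=True)  # log-prob of the sample
reward = torch.tensor(0.7)                         # score from the reward model
rl_loss = -log_prob * reward

supervised_loss.backward()
rl_loss.backward()
print(pred.grad, log_prob.grad)
```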
Here's the relevant text from the article:
>In this post we’ll discuss how to build a reward model that can predict the upvote count that a specific HN story will get. And in follow-up posts in this series, we’ll use that reward model along with reinforcement learning to create a model that can write high-value HN stories!
The post is interesting and I'll be sure to check out the next parts too. It's just that people, as evidenced by this thread, clearly misunderstood or were confused about what was done.
Such a model can be used as the "reward model" for the "reinforcement learning from human feedback" (RLHF) method.
Did you ever figure out what happened in 2016?
It’s still outside the HN mainstream to use both in the same submission, so that might be biasing the model in strange ways.
> But to simplify, instead I’ll just limit to stories that have only text bodies, instead of links.
This line implies that both pre- and post-2016 stories are text-only, so this change should not affect the data that much.
Everything else in the model before that final layer is exactly identical, architecture-wise.
In the case of a reward model, are you streaming in the list of tokens; if so, what is the output after each token? Or are you feeding in all of the tokens in one shot, with the predicted reward as the output?
You can check the examples from the TRL library for more information.
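For what it's worth, here's a minimal sketch of the second option (all tokens in one shot, one scalar out), which is how sequence-classification-style reward models are usually set up; the base model here is just a placeholder, not what the article used:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# num_labels=1 replaces the vocab-sized next-token head with a single-output
# head; the transformer body underneath is unchanged.
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer(
    "Show HN: I built a tool that predicts HN upvotes",
    return_tensors="pt",
)
with torch.no_grad():
    score = model(**inputs).logits  # shape (1, 1): one scalar per sequence
print(score.item())
```

In this setup the scalar is taken from the hidden state of the final token, so the whole story is read before a single predicted reward is emitted.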
What library is that? Thanks!
Maybe the reputation of the poster is also a factor?
Well, thanks HN, you were good while it lasted...