Posted by tamnd 4 days ago

Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m (huggingface.co)
164 points | 69 comments
kshacker 3 hours ago|
Good for demo but every 5 minutes? Why?
Imustaskforhelp 2 hours ago|
I can think of some good use cases for it. Personally, I really appreciate the 5-minute updates.
alstonite 3 hours ago||
What happened between 2023 and 2024 to cause the usage dropoff?
ghgr 3 hours ago||
I'd say it's less a usage dropoff and more a reversion to the mean after Covid
tehjoker 3 hours ago||
That's a possible hypothesis, but there was also a rising trend beforehand; it wasn't stable.
imhoguy 2 hours ago||
Return to office
lyu07282 3 hours ago||
Please upload to https://academictorrents.com/ as well if possible
palmotea 4 hours ago||
> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.

Wouldn't that lose deleted/moderated comments?

BoredPositron 3 hours ago|
I guess that's the point.
Imustaskforhelp 2 hours ago||
Couldn't someone write a script that automatically copies the files, say, 5 minutes before midnight UTC?
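
A minimal sketch of that idea, assuming the 5-minute blocks have already been synced to a local `today/` directory. The sync step, the paths, and both function names are hypothetical and not part of the dataset's own tooling; the code only works out how long to wait and then copies the folder.

```python
import shutil
from datetime import datetime, timedelta, timezone
from pathlib import Path


def seconds_until_snapshot(now: datetime, lead_minutes: int = 5) -> float:
    """Seconds from `now` until `lead_minutes` before the next midnight UTC."""
    now = now.astimezone(timezone.utc)
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    target = next_midnight - timedelta(minutes=lead_minutes)
    if target <= now:
        # Already inside the last few minutes of the day: aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


def snapshot_today(src: Path, dest_root: Path, now: datetime) -> Path:
    """Copy a local `today/` directory into a folder named after the UTC date."""
    dest = dest_root / now.astimezone(timezone.utc).strftime("%Y-%m-%d")
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

A cron job or a `time.sleep(seconds_until_snapshot(datetime.now(timezone.utc)))` loop could drive it. Note that this only preserves whatever was in the blocks at copy time.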
0cf8612b2e1e 4 hours ago||
Under the Known Limitations section:

> deleted and dead are integers. They are stored as 0/1 rather than booleans.

Is there a technical reason to do this? You have the type right there.
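
Whatever the reason, the conversion on the reader's side is cheap. A small sketch, assuming each item has been loaded as a plain dict; the `deleted` and `dead` names come from the README quote above, while the helper name and the sample row in the usage note are made up for illustration.

```python
def flags_to_bool(row: dict) -> dict:
    """Return a copy of `row` with the 0/1 `deleted` and `dead` fields as booleans."""
    out = dict(row)
    for key in ("deleted", "dead"):
        # Leave missing or null flags alone rather than guessing a value.
        if key in out and out[key] is not None:
            out[key] = bool(out[key])
    return out
```

For example, `flags_to_bool({"id": 1, "deleted": 0, "dead": 1})` gives a row where `deleted` is `False` and `dead` is `True`.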
Imustaskforhelp 2 hours ago||
As someone who built a project analysing Hacker News with ClickHouse, I really feel like this project was made for me (especially the 5-minute updates, which could have helped my project back then too!)

Your project also helps me a ton with one of the Hacker News project ideas that I had put on the back burner.

I had thought of making a ping website where people can just write @Username, with a service that detects the mention and sends mail to that user if they have signed up (similar to the service run by someone from the HN community that mails you every time someone replies to you directly, but triggered by a ping instead).

[The idea came when I tried to ping someone to show them something relevant and thought, wait a minute, a ping that sends mail might be interesting. I then checked whether I could hook things up with Algolia or some other service, but sadly few, if any, made much sense back then, so I kept the idea in the back of my mind. This dataset more or less solves that by being updated every 5 minutes.]

Your 5-minute updates really make it possible. I will see what I can do with them in a few days, but I am seeing a discrepancy: the last update mentioned in the README seems to be 16 March. I would love to know whether it really is being updated every 5 minutes, because it feels phenomenal if true, and it's exciting to think of the new possibilities it unlocks.
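
The detection half of the @Username ping idea can be sketched in a few lines. This is only an illustration: the username pattern (2 to 15 letters, digits, dashes, or underscores) is an assumption about HN's rules, `find_mentions` and the `subscribers` set are hypothetical names, and the mailing step is left out entirely.

```python
import re

# Assumption: usernames are 2-15 characters of letters, digits, "-" or "_".
# The lookbehind skips e-mail addresses such as "a@b.com".
MENTION_RE = re.compile(r"(?<![\w@])@([A-Za-z0-9_-]{2,15})(?![A-Za-z0-9_-])")


def find_mentions(text: str, subscribers: set) -> list:
    """Return signed-up usernames mentioned as @name in `text`, in order, once each."""
    seen = []
    for name in MENTION_RE.findall(text):
        if name in subscribers and name not in seen:
            seen.append(name)
    return seen
```

Running this over the comment text of each new 5-minute block would give the list of people to notify.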

tonymet 3 hours ago||
what's the license for HN content?
BowBun 1 hour ago||
We have LLMs and links to the TOS; this is easily answerable by _anyone_ on the internet at this point.

Comments+posts are defined as user generated content, you have no right to its privacy/control in any capacity once you post it - https://www.ycombinator.com/legal/

YC in theory has the right to go after unauthorized 3rd parties scraping this data. YC funds startups and is deeply vested in the AI space. Why on Earth would they do that?

echelon 3 hours ago||
At this point, you can train on anything without repercussion.

Copyright doesn't seem to matter unless you're an IP cartel or mega cap.

marginalia_nu 2 hours ago||
Laughs nervously in jurisdiction without fair use doctrine
Onavo 4 hours ago||
Is it possible to download only a subset, e.g. Show HNs or HN Whoishiring? Those are very useful for classroom data science, i.e. a very useful set of data for students to learn the basics of data cleaning and engineering.
nelsondev 4 hours ago|
It’s date-partitioned, so you could download just a date range. It’s also Parquet, so you can download just specific columns with the right client.
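
A rough sketch of the date-range part, assuming the archive is split into one file per month. The `YYYY-MM` key format, the `title` column, and both helper names are assumptions for illustration; the real paths and schema should be taken from the dataset README.

```python
from datetime import date


def months_in_range(start: date, end: date) -> list:
    """List 'YYYY-MM' keys for every month touched by the range [start, end]."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def is_show_hn(title) -> bool:
    """True for story titles that start with the 'Show HN:' prefix."""
    return bool(title) and title.startswith("Show HN:")
```

Each key would then be mapped to the matching monthly file, and a Parquet reader that supports column selection could load only the columns needed before applying a filter like `is_show_hn`.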
lokimoon 3 hours ago||
You are the product
waynesonfire 2 hours ago|
Your reward is the endorphin hit from writing this comment.
bstsb 4 hours ago|
what’s the license? “do whatever the fuck you want with the data as long as you don’t get caught”? or does that only work for massive corporations
BoredPositron 3 hours ago|
The universal license.