Posted by tamnd 4 days ago
The bigger concern is how large the git history is going to get on the repository.
So it's not really one big file getting replaced all the time, though a less extreme version of that does happen day to day.
There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space versus one wide table that covers everything, which will probably break up compressible runs of data.
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That timestamp is stale by a day or two, not just 5 minutes. No big deal, but it's a little depressing that this is still how we do things in 2026.
So to get the complete data you need to grab the archive plus all the 5-minute update files.
archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...
update files are here (I know that it's called "today" but it actually includes all the update files, which span multiple days at this point) https://huggingface.co/datasets/open-index/hacker-news/tree/...
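Combining the two could look something like this, a minimal sketch assuming each file is parsed into a mapping of item id to record (the file layout and `apply_updates` helper are assumptions, not the dataset's actual API):

```python
# Sketch: rebuild the current view by starting from the archive and
# replaying the 5-minute update batches in chronological order.
# Assumes each batch maps item id -> item record; later records win.

def apply_updates(archive, update_batches):
    """Return a merged dict: archive first, then each update batch
    applied in order so the newest version of an item wins."""
    merged = dict(archive)  # copy so the archive itself is untouched
    for batch in update_batches:
        merged.update(batch)
    return merged

# Toy illustration with made-up records:
archive = {1: {"title": "old title"}, 2: {"title": "b"}}
updates = [{1: {"title": "new title"}}, {3: {"title": "c"}}]
merged = apply_updates(archive, updates)
```

The key detail is replay order: an item edited twice should end up with its latest record, so the update files have to be applied oldest-first.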
probably uncalled for
They are suggesting that the Hugging Face description should automatically update the date & item count when the data gets updated.