Posted by tamnd 4 days ago

Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m (huggingface.co)
164 points | 69 comments
vovavili 31 minutes ago|
Replacing an 11.6GB Parquet file every 5 minutes strikes me as a bit wasteful. I would probably use Apache Iceberg here.
ai-inquisitor 21 minutes ago||
It's not doing that. If you look at the repository, it's adding a new commit with tiny parquet files every 5 minutes. The most recent one was only a 20.9 KB parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it were a median of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/...

The bigger concern is how large the repository's git history is going to get.

vovavili 12 minutes ago||
This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.
tomrod 11 minutes ago||
Are they paying for the repo space, I wonder?
zerocrates 20 minutes ago|||
"The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for today's activity. Every 5 minutes, new items are fetched from the source and committed directly as a single Parquet block. At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory."

So it's not really one big file getting replaced all the time. Though a less extreme variation of that is happening day to day.

tomrod 10 minutes ago||
Parquet is a very efficient storage approach. Data interfaces tend to treat paths as partitions when the layout is logical.
fabmilo 28 minutes ago||
Was thinking the same thing. Probably once a day would be more than enough. If you really want minute-by-minute updates, a delta file from the previous day should suffice.
xnx 2 hours ago||
The best source for this data used to be Clickhouse (https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...), but it hasn't updated since 2025-12-26.
robotswantdata 40 minutes ago||
Where’s the opt out ?
john_strinlai 38 minutes ago||
Hacker News is very upfront that they do not really honor deletion requests or anything of that sort, so the opt-out is to not use Hacker News.
ratg13 23 minutes ago|||
Create a new account every so often, don’t leave any identifying information, occasionally switch up the way you spell words (British/US English), and alternate using different slang words and shorthand.
fdghrtbrt 16 minutes ago||
And do what I do - paste everything into ChatGPT and have it rephrase it. Not because I need help writing, but because I’d rather not have my writing style used against me.
socksy 5 minutes ago||
I can't stand this and will actively discriminate against comments I notice in that voice. Even this one has "Not because [..], but because [..]"
tantalor 30 minutes ago||
The back button
gkbrk 2 hours ago||
My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
0cf8612b2e1e 2 hours ago||
Sorting, compression algorithm +level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That’s one bit vs 1-4 bytes.

There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space versus a wide table that covers everything, which will probably break up compressible runs of data.

xnx 2 hours ago||
Parquet has a few compression options. Not sure which one they are using.
hirako2000 2 hours ago||
Plus, Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, which, as you say, has a few main options for Parquet.
epogrebnyak 28 minutes ago||
Wonder why the median vote count is 0; it seems every post gets at least a few votes. Maybe this was not the case in the past.
epogrebnyak 27 minutes ago|
Ahhh, I got it the moment I asked: there are usually no votes on comments.
maxloh 20 minutes ago||
Could you also release the source code behind the automatic update system?
imhoguy 28 minutes ago||
Yay! So much knowledge in just 11GB. Adding to my end of the World hoarding stash!
politician 6 minutes ago||
This is great. I've soured on this site over the past few years due to the heavy partisanship that wasn't as present in the early days (eternal September), but there are still quite a few people whose opinions remain thought-provoking and insightful. I'm going to use this corpus to make a local self-hosted version of HN with the ability to a) show inline article summaries and b) follow those folks.
brtkwr 32 minutes ago||
This comment should make it into the download in a few mins.
tantalor 30 minutes ago|
As should this reply
mlhpdx 2 hours ago|
Static web content and dynamic data?

> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.

That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.

voxic11 47 minutes ago||
That is just the archive part; if you just would finish reading the paragraph, you would know that updates since 2026-03-16 23:55 UTC "are fetched every 5 minutes and committed directly as individual Parquet files through an automated live pipeline, so the dataset stays current with the site itself."

So to get all the data you need to grab the archive and all the 5 minute update files.

archive data is here https://huggingface.co/datasets/open-index/hacker-news/tree/...

update files are here (I know that it's called "today" but it actually includes all the update files, which span multiple days at this point): https://huggingface.co/datasets/open-index/hacker-news/tree/...
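
One wrinkle when stitching the two together: an item can appear both in the bulk archive and in a later update file, so the merge probably wants a dedup step that keeps the most recently fetched row. A hedged pandas sketch with toy data (the `id` and `score` columns mirror the Hacker News item fields, but the exact schema here is an assumption):

```python
import pandas as pd

# Toy stand-ins for the bulk archive and the 5-minute update files.
# Item 2 was refetched in an update, so its row should win.
archive = pd.DataFrame({"id": [1, 2], "score": [10, 5]})
updates = pd.DataFrame({"id": [2, 3], "score": [7, 1]})

# Concatenate archive first, updates second; keep="last" then
# prefers the later (more recently fetched) copy of each item.
combined = pd.concat([archive, updates]).drop_duplicates("id", keep="last")
print(combined["score"].tolist())  # [10, 7, 1]
```

With the real dataset you would read the monthly files and the today/ blocks in that order and apply the same keep-last rule.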

john_strinlai 43 minutes ago||
>if you just would finish reading the paragraph

probably uncalled for

xandrius 59 minutes ago||
I don't get what you meant with this comment.
john_strinlai 50 minutes ago||
the data updates every 5 minutes, but the description on huggingface says the last update was 2 days ago.

They are suggesting that the Hugging Face description should automatically update the date & item count when the data gets updated.

voxic11 46 minutes ago||
No that is the date at which the bulk archive ends and the 5 minute update files begin, so it should not be updated.