Posted by dvrp 2 days ago

Inside The Internet Archive's Infrastructure (hackernoon.com)
https://github.com/internetarchive/heritrix3
385 points | 94 comments | page 2
ghm2199 17 hours ago|
Does anyone know how the size of this compares to archive.today?
textfiles 16 hours ago|
We absolutely lap them with many, many more petabytes of material. But archive.today is also not doing speculative or multiple scheduled captures of as many sites as archive.org is.
vladiim 17 hours ago||
How long will it take for them to send the PetaBox to space?
textfiles 16 hours ago|
That project gets discussed every once in a while.
brcmthrowaway 21 hours ago||
Does IA do deduplication?
textfiles 21 hours ago||
Not in the way I think you're talking about. The archive has always tried to maintain a situation where the racks could be pushed out of the door or picked up after being somewhere and the individual drives will contain complete versions of the items. We have definitely reached out to people who seem to be doing redundant work and asked them to stop or for permission to remove the redundant item. But that's a pretty curatorial process.
HumanOstrich 21 hours ago||
[flagged]
zxcvasd 20 hours ago|||
here's the second paragraph in full:

"Here, amidst the repurposed neoclassical columns and wooden pews of a building constructed to worship a different kind of permanence, lies the physical manifestation of the "virtual" world. We tend to think of the internet as an ethereal cloud, a place without geography or mass. But in this building, the internet has weight. It has heat. It requires electricity, maintenance, and a constant battle against the second law of thermodynamics. As of late 2025, this machine—collectively known as the Wayback Machine—has archived over one trillion web pages.1 It holds 99 petabytes of unique data, a number that expands to over 212 petabytes when accounting for backups and redundancy.3"

can you help my small brain by pointing out where in this paragraph they talk about deduplication?

sltkr 20 hours ago|||
I don't think the article mentions anything about deduplication. Can you be less snarky and actually quote the relevant sentence?
jarboot 10 hours ago||
Hate to be the guy in the comments complaining about the CSS, but the sides of the text of this article are cut off. It looks like I'm zoomed in, and there's no way I can see the first few columns of the text without going to Reader view. I'm on a modern iPhone using Safari, with the accessibility font size set larger than usual.
nandomrumber 6 hours ago||
Same for me: Safari on iOS 18.7.1, no accessibility font size set, no browser font size set.
shmeeed 6 hours ago||
FWIW, it's the same for me on FF Android.
segalord 10 hours ago||
this is every data hoarder's dream setup haha
schmuckonwheels 20 hours ago||
Disappointed with the lack of pictures.
parttimelarry 20 hours ago||
Probably because this looks more like a Deep Research agent "delving" into the infrastructure -- with a giant list of sources at the end. The Archive is not just a library; it is a service provider.
schmuckonwheels 17 hours ago||
I wasn't expecting to read a podcast when clicking.
textfiles 16 hours ago||
What do you want some pictures of?
schmuckonwheels 14 hours ago||
From an article about "infrastructure" that opens with a dramatic description of a datacenter stuffed into an old church, I would expect more than just the generic clipart you'd see in the back half of Wired magazine.
textfiles 14 hours ago||
Here are some photos I took a long time ago.

https://www.flickr.com/photos/textfiles/albums/7215763372220...

darkwater 6 hours ago|||
That's super cool! Can the IA building be accessed by random people like myself? Next time I'm in SF (who knows when that will be though) I'd very much like to visit it!
Tempest1981 11 hours ago|||
Thanks! The church attendees (employees?) have a Severance Kier vibe... although I'm guessing the TV show came much later.
brcmthrowaway 21 hours ago||
[flagged]
krunck 19 hours ago||
[flagged]
mjmas 19 hours ago|
Was this reply meant for this story instead? https://news.ycombinator.com/item?id=46637127
lysace 19 hours ago||
The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.
mixologic 16 hours ago||
They can offer a perk that literally no other tech job can offer: Someday have a statue of your likeness preserved in ceramic: https://www.atlasobscura.com/places/internet-archive-headqua...

"Inside the church's main room, with its still-intact pews, there are more than 120 ceramic sculptures of the Internet Archive's current and former employees, created by artist Nuala Creed and inspired by the statues of the Xian warriors in China."

textfiles 16 hours ago|||
We've hired a few dozen people over the past couple of years. We think they're pretty talented.
lysace 15 hours ago||
Is retrieval from the wayback machine intentionally made slow?
textfiles 14 hours ago||
Show me the faster wayback machine we are competing against.
brokensegue 12 hours ago|||
i'm a big fan of IA and wayback machine. i donate. but i do wish it were faster. i understand that would cost a lot more though.

i wonder if maybe donors above a certain level could get priority on archiving pages or something.

lysace 8 hours ago|||
Do you really think that is a good argument against the perception of technical stagnation?
pizza 6 hours ago||
That sounds really entitled.
cowhax 21 hours ago|
>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence