Posted by ninjagoo 4 hours ago

News publishers limit Internet Archive access due to AI scraping concerns (www.niemanlab.org)
300 points | 181 comments
zmmmmm 9 minutes ago|
Dear news publications - if you aren't willing to accept an independent record of what you published, I can't accept your news. It's a critical piece of the framework that keeps you honest. I don't care if you allow AI scraping either way, but you have to facilitate archival of your content - independently, not under your own control.
germandiago 7 minutes ago|
The first thing that came to my mind followed the same line of reasoning.
kevincloudsec 4 hours ago||
There's a compliance angle to this that nobody's talking about. Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention. A lot of that evidence lives at URLs. When a vendor's security documentation, a published incident response, or a compliance attestation disappears from the web and can't be archived, you've got a gap in your audit trail that no auditor is going to be happy about.

I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.

iririririr 1 hour ago||
This is new to me, so I did a quick search for a few examples of such documents.

The very first result was a 404

https://aws.amazon.com/compliance/reports/

The jokes write themselves.

staticassertion 1 hour ago||
But how is this related to the internet being archivable? This sort of proves the point that URLs were always a terrible idea to reference in your compliance docs; the answer was always to get the actual docs.
paulryanrogers 1 hour ago|||
IME compliance tools will take a doc and/or a link. What's acceptable is up to the auditor. IMO both a link and a doc are best.

Links alone can be tempting, as you have to reference the same docs or policies over and over for various controls.

aussieguy1234 47 minutes ago|||
Wayback machine URLs are much more likely to be stable.

Even if the content is taken down, changed or moved, a copy is likely to still be available in the Wayback Machine.

staticassertion 22 minutes ago||
I would never rely on this vs just downloading the SOC2 reports, which almost always aren't public anyways and need to be requested explicitly. I suspect that that compliance page would have just linked to a bunch of PDF downloads or possibly even a "request a zip file from us after you sign an NDA" anyways.
alexpotato 3 hours ago|||
> Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention

Sidebar:

Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".

- The job that calculates the profit and loss for the firm, definitely critical

- The job that cleans up the logs for the job above, is that critical?

- The job that monitors the cleaning up of the logs, is that critical too?

These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.

Ucalegon 2 hours ago|||
That's when you reach out to your insurer and ask them their requirements under the policy, and/or whether there are any contractual obligations associated with the requirements that might touch indemnity/SLAs. If there are, then it is critical; if not, then it's the classic conversation of cost vs. risk mitigation/tolerance.
a13n 3 hours ago||||
Depends: if you don't clean up the logs and monitor that cleanup, will it eventually hit the P&L? E.g., if you fail compliance audits and lose customers over it? Then yes. It still eventually comes back to the P&L.
hsbauauvhabzb 2 hours ago|||
And in the big scheme of things, none of those things are even important, your family, your health and your happiness are :-)
ninjagoo 4 hours ago|||
At some point Insurance is going to require companies to obtain paper copies of any documentation/policies, precisely to avoid this kind of situation. It may take a while to get there though. It'll probably take a couple of big insurance losses before that happens.
kevincloudsec 4 hours ago|||
Insurance is already moving that direction for cyber policies. Some underwriters now require screenshots or PDF exports of third-party vendor security attestations as part of the application process, not just URLs. The carriers learned the hard way that 'we linked to their SOC 2 landing page' doesn't hold up when that page disappears after an acquisition or rebrand.
pwg 2 hours ago||
> when that page disappears after an acquisition or rebrand.

Sadly, it does not even have to be an acquisition or rebrand. For most companies, a simple "website redo", even if the brand remains unchanged, will change all the URLs such that any prior recorded ones return "not found". Granted, if the identical attestation is simply at a new URL, someone could potentially find that new URL and update the "policy" -- but that's also an extra effort that the insurance company can avoid by requiring screenshots or PDF exports.

hsbauauvhabzb 12 minutes ago||
It sounds like you work at Microsoft, they do that ALL the time.
dahcryn 1 hour ago||||
We already require all relevant and referenced documents to be uploaded in a contract lifecycle management system.

Yes, we have hundreds of identical Microsoft and AWS policies, but it's the only way. Checksum the full zip and sign it as part of the contract; that's literally how we do it.
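
The checksum step is nothing more than this (a minimal Python sketch; the file name is made up, and the resulting hex digest is what gets embedded in the signed contract):

    import hashlib

    def sha256_of_file(path, chunk_size=1 << 20):
        # Stream the file so a large evidence zip doesn't have to fit in memory.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The hex digest is what gets written into, and signed with, the contract.
    print(sha256_of_file("vendor-evidence.zip"))  # hypothetical file name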

seanmcdirmid 4 hours ago||||
Digital copies will also work. I don't understand why they don't just save both the URL and the content at the URL when last checked.
ninjagoo 3 hours ago|||
I think maybe because the contents of the URL archived locally aren't legally certifiable as genuine - the URL is the canonical source.

That's actually a potentially good business idea: a legally certifiable archiving service that captures the content at a URL and signs it digitally at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
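
The capture-and-sign step itself is easy; the hard part is trust. A minimal sketch in Python (using the cryptography package; a real service would also need an independent timestamp authority or transparency log, since a self-issued timestamp proves little about when the capture actually happened):

    import hashlib, json, urllib.request
    from datetime import datetime, timezone
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def capture_and_sign(url, key):
        # Fetch the page body exactly as served at capture time.
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        }
        # Sign the canonical JSON of the capture record.
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = key.sign(payload).hex()
        return record

    key = Ed25519PrivateKey.generate()
    print(capture_and_sign("https://example.com/security-policy", key))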

leni536 1 hour ago|||
Apparently perma.cc is officially used by some courts in the US. I did use it in addition to the Wayback Machine when I collected a paper trail for a minor retail dispute, but I did not have to use it.

I don't know how exactly it achieves being "legally certifiable", at least to the point that courts are trusting it. Signing and timestamping with independent transparency logs would be reasonable.

https://perma.cc/sign-up/courts

ninjagoo 1 hour ago||
This is an interesting service, but at $10 for 10 links per month, or $100 for 500 links per month, it might be a tad bit too expensive for individuals.
staticassertion 1 hour ago||||
The first thing you do when you're getting this information is get PDFs from these vendors like their SOC2 attestation etc. You wouldn't just screenshot the page, that would be nuts.

Any vendor who you work with should make it trivial to access these docs, even little baby startups usually make it quite accessible - although often under NDA or contract, but once that's over with you just download a zip and everything is there.

thayne 23 minutes ago||
> You wouldn't just screenshot the page, that would be nuts.

That's what I thought the first time I was involved in a SOC2 audit. But a lot of the "evidence" I sent was just screenshots. Granted, the stuff I did wasn't legal documents, it was things like the output of commands, pages from cloud consoles, etc.

staticassertion 19 minutes ago||
To be clear, lots of evidence will be screenshots. I sent screenshots to auditors constantly. For example, "I ran this splunk search, here's a screenshot". No biggie.

What I would not do is take a screenshot of a vendor website and say "look, they have a SOC2". At every company, even tiny little startup land, vendors go through a vendor assessment that involves collecting the documents from them. Most vendors don't even publicly share docs like that on a site so there'd be nothing to screenshot / link to.

inetknght 1 hour ago|||
Is it digitally certifiable if it's not accessible by everyone?

That is: if it's not accessible by a human who was blocked?

macintux 1 hour ago||
Or if it potentially gives different (but still positive) results to different parties?
trollbridge 3 hours ago|||
What if the TOS expressly prohibits archiving it, and it's also copyrighted?
pixl97 3 hours ago|||
Then said writers of TOS should be dragged in front of a judge to be berated, then tarred and feathered, and run out of the courtroom on a rail.

Having your cake and eating it too should never be valid law.

croes 2 hours ago||
Maybe we should start with those who made such copyright claims a possibility in the first place
wizzwizz4 2 hours ago||
They're long, long dead.
seanmcdirmid 54 minutes ago|||
I don’t think contracts and agreements that both parties can’t keep copies of are valid in any US jurisdiction.
layer8 3 hours ago||||
More likely, there will be trustee services taking care of document preservation, themselves insured in case of data loss.
ninjagoo 3 hours ago||
Isn't the Internet Archive such a trustee service?

Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain, only paper that is legally required to be preserved.

A society that doesn't preserve its history is a society that loses its culture over time.

layer8 3 hours ago||
The context was regulatory requirements for companies. I mean that as a business you pay someone to take care of your legal document preservation duties, and in case data gets lost, they will be liable for the financial damage this incurs to you. Outsourcing of risk against money.
ninjagoo 3 hours ago||
Whether or not the Internet Archive counts as a legally acceptable trustee service is being litigated in the court systems [1]. The link is a bit dated so unsure what the current situation is. There's also this discussion [2].

[1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...

[2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...

mycall 3 hours ago|||
Also, getting insurance to pay out for cybercrime is hard, and sometimes the coverage doesn't justify its cost.
sebmellen 1 hour ago|||
I hate to say this, but this account seems like it’s run by an AI tool of some kind (maybe OpenClaw)? Every comment has the same repeatable pattern, relatively recent account history, most comments are hard or soft sell ads for https://www.awsight.com/. Kind of ironic given what’s being commented on here.

I hope I’m wrong, but my bot paranoia is at all time highs and I see these patterns all throughout HN these days.

linehedonist 45 minutes ago||
Agreed. "isn't just... It's becoming" feels very LLM-y to me.
sebmellen 40 minutes ago||
Now the top comment on the GP comment is from a green account, and suspiciously the most upvoted. Also directly in-line with the AWS-related tool promotion… https://news.ycombinator.com/item?id=47018665

@dang do you have any thoughts about how you’re performing AI moderation on HN? I’m very worried about the platform being flooded with these Submarine comments (as PG might call them).

rob 6 minutes ago||
[delayed]
riddlemethat 3 hours ago|||
https://www.page-vault.com/ These guys exist to solve that problem.
mycall 3 hours ago|||
Perhaps those companies should have performed verified backups of third-party vendors' published security policies into a secure enclave, with keys paired with the auditor, to keep a chain of custody.
staticassertion 3 hours ago|||
> I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited.

Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.

Also, this has never been particularly hard to solve for me (obviously biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs; don't reference URLs. That's what I've typically seen: you get a copy of their SOC2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that, tbh, and if a company does that it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway?

Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.

This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.

cj 2 hours ago|||
Your comment matches my experience closer than the OP.

A link disappearing isn’t a major issue. Not something I’d worry about (but yea might show up as a finding on the SOC 2 report, although I wouldn’t be surprised if many auditors wouldn’t notice - it’s not like they’re checking every link)

I’m also confused why the OP is saying they’re linking to public documents on the public internet. Across the board, security orgs don’t like to randomly publish their internal docs publicly. Those typically stay in your intranet (or Google Drive, etc).

staticassertion 1 hour ago||
> although I wouldn’t be surprised if many auditors wouldn’t notice

lol seriously, this is like... at least 50% of the time how it would play out, and I think the other 49% it would be "ah sorry, I'll grab that and email it over" and maybe 1% of the time it's a finding.

It just doesn't match anything. And if it were FEDRAMP, well holy shit, a URL was never acceptable anyways.

yorwba 1 hour ago|||
> I feel like I'm missing something

You're missing the existence of technology that allows anyone to create superficially plausible but ultimately made-up anecdotes for posting to public forums, all just to create cover for a few posts here and there mixing in advertising for a vaguely-related product or service. (Or even just to build karma for a voting ring.)

Currently, you can still sometimes sniff out such content based on the writing style, but in the future you'd have to be an expert on the exact thing they claim expertise in, and even then you could be left wondering whether they're just an expert in a slightly different area instead of making it all up.

EDIT: Also on the front page currently: "You can't trust the internet anymore" https://news.ycombinator.com/item?id=47017727

staticassertion 1 hour ago||
I don't really see what you're getting at, it seems unrelated to the issue of referencing URLs in compliance documentation.
trevwilson 1 hour ago|||
They're suggesting that the original comment is LLM generated, and after looking at the account's comment history I strongly suspect they're correct
staticassertion 14 minutes ago||
Oh, I sort of wondered if that was the case but I was really unsure based on the wording. Yeah, I have no idea.
stavros 1 hour ago|||
I think they meant that, now that LLMs are invented, people have suddenly started to lie on the Internet.

Every comment section here can be summed up as "LLM bad" these days.

yorwba 53 minutes ago||
No, now that LLMs are invented, a lot more people lying on the Internet have started to do so convincingly, so they also do it more often. Previously, when somebody was using all the right lingo to signal expert status, they might've been a lying expert or an honest expert, but they probably weren't some lying rando, because then they wouldn't even have thought of using those words in that context. But now LLMs can paper over that deficit, so all the lying randos who previously couldn't pretend to be an expert are now doing so somewhat successfully, and there are a lot of lying randos.

It's not "LLM bad" — it's "LLM good, some people bad, bad people use LLM to get better at bad things."

tempaccount5050 1 hour ago|||
Your experience isn't normal and I seriously question it, unless there was some sort of criminal activity being investigated or there was known negligence. I worked for a decent-sized MSP and have been through cryptolocker scenarios.

Insurance pays as long as you aren't knowingly grossly negligent. You can even say "yes, these systems don't meet x standard and we are working on it" and be ok because you acknowledged that you were working on it.

Your boss and your boss's boss tell you "we have to do this so we don't get fucked by insurance if so-and-so happens", but they are either ignorant, lying, or just using that to get you to do something.

I've seen wildly out of date and unpatched systems get paid out because it was a "necessary tradeoff" between security and a hardship to the business to secure it.

I've actually never seen a claim denied and I've seen some pretty fuckin messy, outdated, unpatched legacy shit.

Bringing a system to compliance can reasonably take years. Insurance would be worthless without the "best effort" clause.

lukeschlather 2 hours ago|||
It's interesting to think about this in terms of something like Ars Technica's recent publication of an article with fake (presumably LLM slop) quotes that they then took down. The big news sites are increasingly opaque; how would you even know if they were rewriting or taking articles down after the fact?
int0x29 1 hour ago||
This is typically solved by publishing reactions/corrections or, in the case of news programs, starting the next one with a retraction/correction. This happens in some academic journals and some news outlets. I've seen the PBS Newshour and the New York Times do this. I've also seen Ars Technica do this with some science articles (not sure what the difference in this case is, or if it will just take some more time).
oxguy3 1 hour ago||
On their forum, an Ars Technica staff member said[1] that they took the article down until they could investigate what happened, which probably wouldn't be until after the weekend.

[1]: https://arstechnica.com/civis/threads/journalistic-standards...

lofaszvanitt 1 hour ago|||
And for this we need cheapo and fast WORM, 100 TB/whatever archiving solutions.
kryogen1c 1 hour ago||
If your SOC 2 or HIPAA evidence references the Internet Archive, you probably deserve to fail.
f33d5173 3 hours ago||
So instead of scraping IA once, the AI companies will use residential proxies and each scrape the site themselves, costing the news sites even more money. The only real loser is the common man who doesn't have the resources to scrape the entire web himself.

I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it getting hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
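
The core mechanism is easy to sketch (simplified; IPFS actually uses multihash-based CIDs rather than raw SHA-256, but the idea is the same):

    import hashlib, urllib.request

    def content_id(data):
        # Naive content address: the hex SHA-256 of the bytes themselves.
        return hashlib.sha256(data).hexdigest()

    def fetch_verified(mirror_url, expected_id):
        # Any third-party rehost is fine, as long as the bytes hash to the address.
        with urllib.request.urlopen(mirror_url) as resp:
            data = resp.read()
        if content_id(data) != expected_id:
            raise ValueError("mirror served bytes that don't match the content address")
        return data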

CqtGLRGcukpy 3 hours ago||
The AI companies won't just scrape IA once; they keep coming back to the same pages and scraping them over and over, even if nothing has changed.

This is from my experience having a personal website. AI companies keep coming back even if everything is the same.

zmmmmm 7 minutes ago|||
yeah, they should really have a think about how their behavior is harming their future prospects here.

Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.

We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.

giancarlostoro 2 hours ago||||
Weird. Considering IA has most of its content in a form you could rehost, I don't know why nobody's just hosting an IA carbon copy that AI companies can hit endlessly, cutting IA a nice little check in the process. But I guess some of the wealthiest AI startups are very frugal about training data?

This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.

mlnj 2 hours ago|||
Unless regulated, there is no incentive for the giants to fund anything.
Nathan2055 1 hour ago|||
That already exists, it's called Common Crawl[1], and it's a huge reason why none of this happened prior to LLMs coming on the scene, back when people were crawling data for specialized search engines or academic research purposes.

The problem is that AI companies have decided that they want instant access to all data on Earth the moment that it becomes available somewhere, and have the infrastructure behind them to actually try and make that happen. So they're ignoring signals like robots.txt or even checking whether the data is actually useful to them (they're not getting anything helpful out of recrawling the same search results pagination in every possible permutation, but that won't stop them from trying, and knocking everyone's web servers offline in the process) like even the most aggressive search engine crawlers did, and are just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.

This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.

It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course they started freaking out, their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!

The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things aren't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.

Their approach to crawling is just a microcosm of the whole industry right now.

[1]: https://en.wikipedia.org/wiki/Common_Crawl

[2]: https://fxgn.dev/blog/anubis/ and related HN discussion https://news.ycombinator.com/item?id=45787775

iririririr 1 hour ago|||
> The AI companies won't just scrape IA once, they're keeping come back to the same pages and scraping them over and over. Even if nothing has changed.

Maybe they vibecoded the crawlers. I wish I were joking.

fartfeatures 3 hours ago|||
IPFS was an attempt at this: https://en.wikipedia.org/wiki/InterPlanetary_File_System
lukeasch21 3 hours ago|||
Coincidentally, most of the funding towards IPFS development dried up because the VC money moved on to the very technology enabling these problems...
Seattle3503 2 hours ago|||
Is there a good post-mortem of IPFS out there?
iririririr 1 hour ago||
What do you mean? It is alive and "well". Just extremely slow now that interest waned.
Operyl 3 hours ago|||
They already are. I've been dealing with Vietnamese and Korean residential proxies destroying my systems for weeks, and I'm growing tired. I cannot survive 3,500 RPS 24/7.
pigggg 58 minutes ago|||
AI companies are _already_ funding and using residential proxies. Guess how many of those proxies are acquired by compromising devices or tricking people into installing apps?
demetris 2 hours ago|||
I don’t believe resips will be with us for long, at least not to the extent they are now. There is pressure and there are strong commercial interests against the whole thing. I think the problem will solve itself in some part.

Also, I always wonder about Common Crawl:

Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, so that they need to crawl our sites over and over again for the exact same stuff, each on their own?

raincole 3 hours ago|||
Even if the site is archived on IA, AI companies will still do the same.
toomuchtodo 2 hours ago||
AI browsers will be the scrapers, shipping content back to the mothership for processing and storage as users co-browse with the agentic browser.
daniel31x13 2 hours ago||
I maintain an open-source project called Linkwarden, and this exact discussion is one of the reasons it exists: teams needed a way to preserve referenced URLs reliably without having to depend on external services.

It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.

There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.

[1]: https://linkwarden.app

[2]: https://github.com/linkwarden/linkwarden

iririririr 1 hour ago|
Neat. How does the archive.org integration work?

Does it just POST the URL to them for them to fetch? Or is there any integration/trust to store what you already fetched on the client directly in their archives?

jruohonen 4 hours ago||
It affects science too (and there you'd want solid archiving as much as possible). Increasingly, metadata is full of errors, and general-purpose search engines for science are breaking down, even things like Google Scholar. I suppose some big science publishers are blocking AI bots too.
shevy-java 4 hours ago||
Google ruined its own search engine on top of that as well though.

We are increasingly becoming blind. To me it looks as if this is done on purpose actually.

salawat 3 hours ago||
It was. Advertising is incompatible with accurate data retrieval/routing. We've also implemented "obligation to deindex". So providing an unbiased index of the web as she is is essentially (in the U.S.) verboten.
ninjagoo 4 hours ago|||
> I suppose some big science publishers are blocking AI bots too.

That's a travesty, considering that a huge chunk of science is public-funded; the public is being denied the benefits of what they're paying for, essentially.

galleywest200 4 hours ago||
The public can still access the sites themselves.
ninjagoo 4 hours ago||
> The public can still access the sites themselves.

Indefinitely? Probably not.

What about when a regime wants to make the science disappear?

thwarted 3 hours ago|||
So the solution is to allow the AI scraping and hide the content, with significantly reduced fidelity and accuracy and not in the original representation, in some language model?
mlnj 2 hours ago||
Don't forget the onslaught of ads that will distort the actual publications even more going forward.
pa7ch 3 hours ago|||
What has that got to do with blocking AI crawlers?
ninjagoo 3 hours ago||
If it's publicly funded, why shouldn't AI crawlers have access to that data? Presumably those creating the AI crawlers paid taxes that paid for the science.
JumpCrisscross 2 hours ago||
> If it's publicly funded, why shouldn't AI crawlers have access to that data?

Because it costs money to serve them the content.

wyre 1 hour ago||
If I build a business based off of consumption of publicly funded data, and that’s okay, why isn’t it okay for AI?

Is the answer regulate AI? Yes.

JumpCrisscross 58 minutes ago||
> If I build a business based off of consumption of publicly funded data, and that’s okay, why isn’t it okay for AI?

Because when you build it you aren't, presumably, polling their servers every fifteen minutes for the entire corpus. AI scrapers are currently incredibly impolite.

asdff 1 hour ago||
Thank god for pubmed and deterministic search operators.
xannabxlle 1 hour ago||
My first impression is that news companies don't want their content scraped for copyright reasons, and are roundaboutly scapegoating AI.
spiderfarmer 1 hour ago|
As a website owner I hate the fact that more than 90% of my traffic is now bots, fake bots, bots masquerading as real visitors, and real visitors who try to use my platform to spam others.

Now that AI companies are using residential proxies to get around the obvious countermeasures, I have resorted to blocking all countries that are not my target audience.

It really sucks. The internet is terminally ill.

ninjagoo 4 hours ago||
Publishers like The Guardian and NYT are blocking the IA/Wayback Machine. 20% of news websites are blocking both IA and Common Crawl. As an example, https://www.realtor.com/news/celebrity-real-estate/james-van... is unarchivable, with IA being 429ed while the site is accessible otherwise.
trollbridge 3 hours ago||
And whilst the IA will honour requests not to archive/index, more aggressive scrapers won't, and will disguise their traffic as normal human browser traffic.

So we've basically decided we only want bad actors to be able to scrape, archive, and index.

JumpCrisscross 2 hours ago||
> we're basically decided we only want bad actors to be able to scrape, archive, and index

AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.

fc417fc802 3 hours ago||
Presumably someone has already built this and I'm just unaware of it, but I've long thought some sort of crowd sourced archival effort via browser extension should exist. I'm not sure how such an extension would avoid archiving privileged data though.
ajb 2 hours ago||
That exists for court documents (RECAP) but I think they didn't have to solve the issue of privilege as PACER publishes unprivileged docs.
Brian_K_White 3 hours ago||
Time for a crowd-sourced plugin that relays copies of what individuals view right from the browser.

Users control what sites they want to allow it to record so no privacy worries, especially assuming the plugin is open source.

No automated crawling. The plugin does not drive the users browser to fetch things. Just whatever a user happens to actually view on their own, some percentage of those views from the activated domains gets submitted up to some archive.

Not every view, just like maybe 100 people each submit 1% of views, and maybe it's a random selection or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey if the user views this particular url, I still don't have that one yet so definitely send that one if you see it rather than just applying the normal random chance"

Not sure how to protect the archive itself or its operators.
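
The client-side decision could be as simple as this (a sketch in Python rather than extension JavaScript; the archive endpoint and sampling rate are made up):

    import random
    import requests  # third-party HTTP client; the endpoints below are made up

    ARCHIVE = "https://archive.example.org"   # hypothetical archive API
    SAMPLE_RATE = 0.01                        # each user submits ~1% of views

    def maybe_submit(url, page_html, allowed_domains):
        # Only domains the user explicitly opted in are ever considered.
        if not any(url.startswith("https://" + d) for d in allowed_domains):
            return
        # Submit if the archive says it's still missing this URL, else at random.
        wanted = requests.get(ARCHIVE + "/wanted", params={"url": url}).json()["wanted"]
        if wanted or random.random() < SAMPLE_RATE:
            requests.post(ARCHIVE + "/submit", json={"url": url, "html": page_html})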

digiown 3 hours ago||
SingleFile does the archiving fairly well.

> no privacy worries

This is harder than you might expect. Publishing these files is always risky because sites can serve you fingerprinting data, like some hidden HTML tag containing your IP and other identifiers.

Brian_K_White 2 hours ago||
oof good point
nerdsniper 2 hours ago||
For a historical archive, the issue with this is that it could be difficult to ensure that the data being sent from users' devices wasn't modified in some way, leading to an inaccurate archival copy.
armchairhacker 1 hour ago||
Cross-reference. When a site is archived by one client (who visited it directly), request a couple other clients to archive it (who didn’t visit it directly, instead chosen at random, to ensure the same user isn’t controlling all clients).
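
Something like this, perhaps (a sketch; assumes each client sends back the raw bytes it captured):

    import hashlib
    from collections import Counter

    def consensus_hash(snapshots):
        # snapshots: page bodies captured by independently chosen clients.
        hashes = [hashlib.sha256(body).hexdigest() for body in snapshots]
        digest, votes = Counter(hashes).most_common(1)[0]
        # Only trust the copy if a strict majority of clients agree on the bytes.
        if votes <= len(hashes) // 2:
            raise ValueError("no majority agreement between clients")
        return digest
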
derefr 4 hours ago||
I wonder if these publishers would be more amenable to a private archiver that only serves registered academic / journalistic research projects (the way most physical private archives do), with a specific provision to never provide data to companies that would resell it or use it for training of generative models.
eternauta3k 3 hours ago||
They already have archives with online and printed articles which they license to libraries, because the libraries take care of rate limiting and limiting abuse.
coffeefirst 2 hours ago|||
Yes. Most publishers already do syndication deals. This is a fine idea.

The problem with the LLMs is they capture the value chain and give back nothing. It didn’t have to be this way. It still doesn’t.

ninjagoo 4 hours ago||
They probably have internal archives if they're smart; but that isn't accessible to the public. I think the issue isn't whether the data is archived, but whether that information is available to the public for the foreseeable future.
g-b-r 4 hours ago||
They sure have archives of the newspapers; they're much less likely to have archives of what they publish online.

And a local archive is one fire, business decision, or poor technical choice away from getting permanently lost.

upboundspiral 3 hours ago|
I feel like a government funded search engine would resolve a lot of the issues with the monetized web.

The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.

However, we left such a crucially important public utility in the hands of private companies, which changed their algorithms many times in order to maximize their profits and not the public good.

I think there needs to be real competition, and I am increasingly certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in this clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third, independent options to be developed.

The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society would be increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turning from innovating to extracting.

digiown 2 hours ago||
We can start by forcing sites to treat crawlers equally. Google's main moat is less physical infrastructure or the algorithms, and more that sites allow only Google to scrape and index them.

They can charge money for access or disallow all scrapers, but it should not be allowed to selectively allow only Google.

charcircuit 2 hours ago||
It's not like only allowing Google actually means that only Google is allowed forever. Crawlers are free to make agreements with sites to let themselves crawl more easily, or to pretend they are a regular user to bypass whatever block is in place.
LPisGood 3 hours ago|||
The government having the power to curate access to information seems bad. You could try to separate it as an independent agency, but as the current US administration is showing, that’s not really a thing.
upboundspiral 18 minutes ago||
The idea is that the government is biased towards hiding certain information and private companies are biased towards hiding a different set.

While unlikely, the ideal would be for the government to provide a foundational open search infrastructure that would allow people to build on it and expand it to fit their needs, in a way that is hard to do when a private company eschews competition and hides its techniques.

Perhaps it would be better for there to be a sanctioned crawler funded by the government, which then sells the unfiltered information to third parties like Google. This would ensure IP rights are protected while ensuring open access to information.

underlipton 2 hours ago||
I'm feeling it. Addressing the other reply: zero moderation or curation, and zero shielding from the crawler, if what you've posted is on a public network. Yes, users will be able to access anything they can think of. And the government will know. I think you don't have to worry about them censoring content; they'll be perfectly happy to know who's searching for CSAM or bomb-making materials. And if people have an issue with what the government does with this information (for example, charging people who search for things the Tangerine-in-Chief doesn't want you to see), you stop it at the point of prosecution, not data access. (This does only work in a society with a functioning democracy... but free information access is also what enables that. As Americans, with our red-hot American blood, do we dare?)