Posted by ninjagoo 8 hours ago

News publishers limit Internet Archive access due to AI scraping concerns (www.niemanlab.org)
376 points | 226 comments
jackfranklyn 6 hours ago|
There's a mundane version of this that hits small businesses every day. Platform terms of service pages, API documentation, pricing policies, even the terms you agreed to when you signed up for a SaaS product - these all live at URLs that change or vanish.

I've been building tools that integrate with accounting platforms and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.

For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changes, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
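
For what it's worth, this kind of verification is scriptable: the Wayback Machine exposes a public availability API that returns the archived snapshot closest to a given date. A minimal sketch in Python (the docs URL and date are made-up placeholders, not any real platform's):

    # Find the archived snapshot of a page closest to a given date
    # via the Wayback Machine availability API (archive.org/wayback/available).
    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url, timestamp):
        """Return the closest snapshot URL, or None if nothing is archived."""
        query = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
        with urllib.request.urlopen(
            "https://archive.org/wayback/available?" + query
        ) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    # e.g. what did the docs say in June 2023? (hypothetical URL)
    print(closest_snapshot("example.com/api/docs", "20230601"))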

The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.

bmiekre 6 hours ago||
Explain it to me like I'm 5: why is AI scraping the Wayback Machine bad?
notepad0x90 6 hours ago||
The internet isn't so simple anymore. I think it's important to separate commercial websites from non-commercial ones. Commercial sites shouldn't be expected to be archivable to begin with, unless it's part of their business model. A lot of sites (like Reddit) started off as ad-supported, but now they're commercial (not just post-IPO; they accept payments and sell things to/from consumers). Even among ad-supported sites, there is a difference between an ad-supported non-profit and a site that exists to generate ad revenue, i.e. one whose primary purpose is the revenue, with the content just a means to that end.

I've said it before, and I'll say it again: the main issue is not design patterns but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.

mellosouls 6 hours ago||
Editorialised. Original title (previously submitted correctly a few times by others):

News publishers limit Internet Archive access due to AI scraping concerns

tl2do 4 hours ago||
The issue of digital decay and publishers blocking archiving efforts is indeed concerning. It's especially striking given that news publishers, perhaps more than any other entity, have profoundly benefited from the vast accumulation of human language and cultural heritage throughout history. Their very existence and influence are built upon this foundation. To then, in an age where information preservation is more critical than ever (and their content is frequently used for AI training), actively resist archiving or demand compensation for their contributions to the collective digital record feels disingenuous, if not outright shameless. This stance ultimately harms the public good and undermines the long-term accessibility of our shared knowledge and historical narrative.
JumpCrisscross 7 hours ago||
Let's be honest: one of the most common uses of these archive sites has been paywall circumvention. An academics-only archive might make sense, or one that is mutually owned and charges a fee for lookup. But a public archive for content that costs money to make obviously doesn't work.
lurking_swe 7 hours ago|
If that's the real motive, why don't they allow scraping after some period, when the news is not as relevant? For example, after 6 months.
JumpCrisscross 6 hours ago|||
> why don't they allow scraping after some period, when the news is not as relevant? For example, after 6 months

I believe many publications used to do this. The novel threat is AI training. It doesn't make sense to make your back catalog de facto public for free like that. There used to be an element of goodwill in permitting your content to be archived. But if the main uses are circumventing compensation and circumventing licensing requirements, that goodwill isn't worth much.

otterley 6 hours ago|||
Enabling research is a business model for many publications. Libraries pay for access to publishers' historical archives. Publishers don't want to cannibalize any more revenue streams; many are barely operating as it is.
lurking_swe 5 hours ago||
I see, I hadn't considered this angle. Thanks for pointing that out.
zachlatta 7 hours ago||
The death of trust in the cloud.
g-b-r 7 hours ago||
This is awful; they need to at the very least allow private archiving.

Maybe the Internet Archive would be OK with keeping some things private until a set amount of time passes, or they could require an account to access them.

gosub100 4 hours ago||
But wait, I thought AI was so great for all industries? Publishers can have AI-generated articles, instantly fix grammar problems, translate seamlessly to every language, and even use AI-generated images where appropriate to enrich the article. It was going to make us all so productive? What happened? Why would you want to _block_ AI from ingesting the material?
zeagle 8 hours ago|
I mean, why wouldn't they? All their IP was scraped for AI training, at their own hosting cost. It further pulls away from their business models as people ask AI models questions instead of reading primary sources. Plus, it doesn't seem likely they'll ever be compensated for that loss, given the economy is all-in on AI. At least search engines would link back.
szmarczak 7 hours ago||
Those countermeasures don't really have an effect on scraping. Anyone skilled can overcome any protection within a week or two. By officially blocking IA, publishers ensure IA can't archive those websites legally, while all the major AI companies use copyrighted content without permission anyway.
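
For what it's worth, the "official block" is typically just a robots.txt exclusion. A rough sketch (ia_archiver is the user agent IA has historically honored; exact bot names vary, and IA said back in 2017 it may ignore robots.txt for some sites, hence the move to harder blocks):

    User-agent: ia_archiver
    Disallow: /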
zeagle 7 hours ago||
For sure. There are many billions of dollars and brilliant engineers propping up AI, so they will win any cat-and-mouse blocking game. It would be ideal if sites gave their data to IA and IA protected it from exactly what you describe. But as someone who intentionally uses AI tools almost daily (mainly OpenEvidence), IMO blame the abuser, not the victim, for it having come to this.
szmarczak 7 hours ago||
I'm not blaming the victim, but don't play the 'look what you made me do' game. Making content accessible to anyone (even behind a paywall) is a risk they need to take regardless. It's impossible to know upfront whether the content will be used for consumption or to create derived products (e.g. writing an article in NYT style). If this were a newspaper, it would be equivalent to scanning the paper and then training AI on the scans. You can't prevent scanning, as the process is based on exactly the same phenomenon that makes your eyes see, in other words information being sent and received. The game was lost before it even started.
ninjagoo 7 hours ago||
That is a good question. However, copyright exists (for a limited time) to allow them to be compensated, and AI doesn't change that. It feels like blocking AI use is a ploy to extract additional revenue. If their content is regurgitated within copyright terms, then yes, they should be compensated.
fc417fc802 7 hours ago||
The problem is that producing a mix of personalized content that doesn't appear (at least on its face) to violate copyright still completely destroys their business model. So either copyright law needs to be updated or their business model does.

Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.

ninjagoo 7 hours ago||
> Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.

Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...
