
Posted by ninjagoo 6 hours ago

News publishers limit Internet Archive access due to AI scraping concerns (www.niemanlab.org)
300 points | 181 comments | page 2
nananana9 5 hours ago|
The silver lining is that it's increasingly not worth being archived as well.
idiotsecant 5 hours ago||
We really lucked out existing at a time when the internet was a place for weirdos and enthusiasts. I think those days are well and done.
JuniperMesos 11 minutes ago||
The internet can't simultaneously be a place for weirdos and enthusiasts, and a vital part of the economy that everyone uses for a huge number of disparate things in daily life. Parts of the internet can be places for weirdos and enthusiasts, but spaces that cater to weirdos and enthusiasts are by necessity not popular or viral spaces.
Flavius 5 hours ago||
Agreed. It’s mostly just disposable clickbait masquerading as journalism at this point. Outside of feeding people's FOMO, there's little content worth preserving for history.
sunaookami 2 hours ago||
Yeah sure, "AI scraping concerns". No, they don't want to get caught secretly editing and deleting articles.
IshKebab 2 hours ago|
It's obviously not that, or they would have done this years ago. It very clearly is AI scraping concerns. Their content has new value because it's high quality text that AI scrapers want, and they don't want to give it away for free via the internet archive.

They will announce official paid AI access plans soon. Mark my words.

RajT88 5 hours ago||
Proposed solution:

Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".

It will fund IA, be cheaper than building and maintaining so many scrapers, and may relieve the pressure on these news sites.

atrus 5 hours ago||
Even sites with that option already (like Wikipedia) still report being hammered by scrapers. It's the well-funded allied with the incompetent at work here.
digiown 5 hours ago||
IA has always been in legal jeopardy without offering paid access. For that to work we need to get rid of copyright first.
RajT88 1 hour ago||
Or offer it in countries with lax copyright. The industry will find ways to work around it.

But - as another poster pointed out - Wikipedia offers this, and still gets hammered by scrapers. Why buy when free, I guess?

yellowapple 4 hours ago||
Framing this as some anti-AI thing is wild. The simpler, more obvious, and more evidenced reason for this is that these sites want to make money with ads and paywalls that an archived copy tends to omit by design. Scapegoating AI lets them pretend that they're not the greedy bad guys here — just like how the agricultural sector is hell-bent on scapegoating AI (and lawns, and golf courses, and long showers, and free water at restaurants) for excess water consumption when even the worst-offending datacenters consume infinitesimally-tiny fractions of the water farms in their areas consume.
JuniperMesos 5 minutes ago|
Yeah I assume what the news publishers actually care about is the thing where, when someone posts a paywalled news article on hacker news, one of the first comments is invariably a link to an archive site that bypasses the paywall so people can read it without paying for it.

> just like how the agricultural sector is hell-bent on scapegoating AI (and lawns, and golf courses, and long showers, and free water at restaurants) for excess water consumption when even the worst-offending datacenters consume infinitesimally-tiny fractions of the water farms in their areas consume.

When I learned about how much water agriculture and industry uses in the state of California where I live, I basically entirely stopped caring about household water conservation in my daily life (I might not go this far if I had a yard or garden that I watered, but I don't where I currently live). If water is so scarce in an urban area that an individual human taking a long shower or running the dishwasher a lot is at all meaningful, then either the municipal water supply has been badly mismanaged, or that area is too dry to support human settlement; and in either case it would be wise to live somewhere else.

shevy-java 5 hours ago||
> The Financial Times, for example, blocks any bot that tries to scrape its paywalled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive

But then it was not really open content anyway.

> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”

Well - we need something like wikipedia for news content. Perhaps not 100% wikipedia; instead, wikipedia to store the hard facts, with tons of verification, plus a news editorial arm that focuses on free content but in a newspaper style, e.g. with professional (or at least good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that paywall information would automatically become less relevant. That way we win long-term, as paywalled content isn't really part of the open web anyway.

ninjagoo 5 hours ago||
Wikipedia relies on the institutional structure of journalism, with its newsroom independence, journalistic standards, the educational system, and probably a ton of other dependencies.

Journalism as an institution is under attack because the traditional source of funding - reader subscriptions to papers - no longer works.

Replicating the Wikipedia model would require replicating the structure of journalism for it to be reliable. Where would the funding for that come from? It's a tough situation.

zozbot234 1 hour ago|||
> Well - we need something like wikipedia for news content.

The Wikipedia folks had their own Wikinews project which is essentially on hold today because maintenance in a wiki format is just too hard for that kind of uber-ephemeral content. Instead, major news with true long-term relevance just get Wikipedia articles, and the ephemera are ignored.

riquito 5 hours ago|||
> we need something like wikipedia for news content

Interesting idea. It could be something that archives first and releases at a later date, when the news isn't as new.

JumpCrisscross 5 hours ago|||
> it was not really open content anyway

Practically no quality journalism is.

> we need something like wikipedia for news

Wikipedia editors aren’t flying into war zones.

fc417fc802 5 hours ago|||
Statistically, at least a few of them live in war zones. And I'm sure some of them would fly in to collect data if you paid them for it.
JumpCrisscross 4 hours ago||
> at least a few of them live in war zones

Which is a valuable perspective. But it's not a substitute for a seasoned war journalist who can draw on global experience. (And relating that perspective to a particular home market.)

> I'm sure some of them would fly in to collect data if you paid them for it

Sure. That isn't "a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers."

One part of the population imagines journalists as writers. They're fine on free, ad-supported content. The other part understands that investigation is not only resource intensive, but also requires rare talent and courage. That part generally pays for its news.

Between the two, a Wikipedia-style journalistic resource is not entertaining enough for the former and not informative enough for the latter. (Importantly, compiling an encyclopedia is principally the work of research and writing. You can be a fine Wikipedia–or scientific journal or newspaper–editor without leaving your room.)

zmgsabst 45 minutes ago||
Those roles seem to be diverging:

- crowdsourced data, eg, photos of airplane crashes

- people who live in an area start vlogs

- independent correspondents travel there to interview, eg Ukraine or Israel

We see that our best war reporting comes from analyst groups who ingest that data from the “firehose” of social media. Sometimes at a few levels, eg, in Ukraine the best coverage is people who compare the work of multiple groups mapping social media reports of combat. You have on top of that punditry about what various movements mean for the war.

So we no longer have a single "journalist" role:

- we have raw data (eg, photos)

- we have first hand accounts, self-reported

- we have interviewers (of a few kinds)

- we have analysts who compile the above into meaningful intelligence

- we have anchors and pundits who report on the above to tell us narratives

The fundamental change is that what used to be several roles within a news agency are now independent contractors online. But that was always the case in secret — eg, many interviewers were contracted talent. We’re just seeing the pieces explicitly and without centralized editorial control.

So I tend not to catastrophize as much, because this to me is what the internet always does:

- route information flows around censorship

- disintermediate consumers from producers when the middle layer provides a net negative

As always in business, evolve or die. And traditional media has the same problem you outline:

- not entertaining enough for the celebrity gossip crowd

- too slow and compromised by institutional biases for the analyst crowd, eg, compare WillyOAM coverage of Ukraine to NYT coverage

https://www.youtube.com/@willyOAM

ghaff 5 hours ago|||
Well, and it would be considered "original research" anyway, which some admin would revert.
aspenmayer 2 hours ago||
Original reporting is allowed and encouraged by the Wikimedia Foundation sister org Wikinews, which may be cited by Wikipedia.

https://en.wikinews.org/wiki/Wikinews:Original_reporting

zozbot234 1 hour ago||
Wikinews is on hold nowadays. Original research that is of real long-term relevance can go onto Wikijournal, which does peer review.
fc417fc802 5 hours ago||
> a news editorial that focuses on free content but in a newspaper-style

Isn't that what state funded news outlets are?

cdrnsf 5 hours ago||
This is a natural response to AI companies plundering the web to enrich themselves and provide no benefit to the sites being scraped.
CivBase 2 hours ago|
Seems more like an easy excuse to shut down a means for people to bypass their paywalls. It would be trivial for AI companies to continue getting this data without using the Internet Archive.
tl2do 2 hours ago||
The issue of digital decay and publishers blocking archiving efforts is indeed concerning. It's especially striking given that news publishers, perhaps more than any other entity, have profoundly benefited from the vast accumulation of human language and cultural heritage throughout history. Their very existence and influence are built upon this foundation. To then, in an age where information preservation is more critical than ever (and their content is frequently used for AI training), actively resist archiving or demand compensation for their contributions to the collective digital record feels disingenuous, if not outright shameless. This stance ultimately harms the public good and undermines the long-term accessibility of our shared knowledge and historical narrative.
Havoc 6 hours ago||
Yup. Recently built something that needs to do low-volume scraping. About a 40% success rate; the rest hit bot detection even on the first try.
ninjagoo 6 hours ago|
Did you have rate limits built in? Ultimately scraping tools will need to mimic humans. Ironic.

I wonder if bots/ai will need to build their own specialized internet for faster sharing of data, with human centered interfaces to human spaces.
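As a rough illustration of what "rate limits built in" could mean in practice, here is a minimal sketch of a client-side rate limiter that enforces a minimum delay between requests. The class name, interval, and the injectable clock/sleep hooks are all illustrative choices, not anything from the original comments:

```python
import time


class RateLimiter:
    """Enforce a minimum delay between successive requests.

    clock and sleep are injectable so the limiter can be tested
    without real waiting; by default it uses wall-clock time.
    """

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # timestamp of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            elapsed = now - self._last
            if elapsed < self.min_interval_s:
                self._sleep(self.min_interval_s - elapsed)
        self._last = self._clock()


# Hypothetical usage before each HTTP fetch:
#   limiter = RateLimiter(min_interval_s=2.0)
#   for url in urls:
#       limiter.wait()
#       fetch(url)  # fetch() stands in for whatever HTTP client is used
```

Spacing requests out like this (and honoring robots.txt) is the polite baseline, though as the parent comment notes, many sites apply bot detection that goes well beyond request frequency.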

fc417fc802 5 hours ago||
IPFS and IPNS already exist.
WesBrownSQL 5 hours ago||
As someone who has been dealing with SOC 2, HIPAA, ISO 9001, etc., for years, I have always maintained copies of the third-party agreements for all of our downstream providers for compliance purposes. This documentation is collected at the time of certification, and our policies always include a provision for retrieving it on a schedule. The problem is that after you certify that their policy said X and that they were in compliance, they quietly change it without sending proper notification downstream to us; then captain lawsuit comes by, and we have to be able to prove that they did claim they were in compliance at the time we certified. We don't want to rely on their ability to produce that documentation: we can't prove that it wasn't tampered with, or that there is a chain of custody for their documentation and policies. If a vendor wouldn't provide that information, I didn't use them. Welcome to the world of highly regulated industries.
bmiekre 4 hours ago|
Explain it to me like I’m 5: why is AI scraping the Wayback Machine bad?