Posted by ninjagoo 10 hours ago

News publishers limit Internet Archive access due to AI scraping concerns (www.niemanlab.org)
414 points | 267 comments
holoduke 8 hours ago|
The end of traditional news sites is coming, at least for the newspaper websites. Future MCP-like systems will generate news sites on the fly in your desired style and content. Journalists will have some kind of pay-per-view model provided by these GPT-like platforms, which of course will take too big a chunk. I can't imagine a WSJ being able to survive.
g-b-r 9 hours ago||
This is awful; they need to at the very least allow private archival.

Maybe the Internet Archive would be OK with keeping some things private until some amount of time passes, or they could require an account to access them.

macinjosh 10 hours ago||
We need something like SETI@home/Folding@home but for crawling and archiving the web, or maybe something as simple as a browser extension that can (with permission) archive pages you view.
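A minimal sketch of what that extension's backend call could look like, assuming the Wayback Machine's public Save Page Now endpoint (web.archive.org/save/<url>); the function name and User-Agent here are made up, and a real extension would rate-limit and ask permission per page:

    import urllib.request

    def archive_page(url: str) -> str:
        """Ask the Wayback Machine to capture a URL; return the snapshot URL."""
        req = urllib.request.Request(
            "https://web.archive.org/save/" + url,
            headers={"User-Agent": "hypothetical-archiver/0.1"},
        )
        with urllib.request.urlopen(req) as resp:
            # On success the service redirects to the archived copy.
            return resp.url

    print(archive_page("https://example.com/some-article"))
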
dunder_cat 10 hours ago||
This exists, although not in the traditional BOINC space: it's ArchiveTeam^1. I run two of their warrior^2 instances in my home k3s instance via the Docker images. One of them is set to "Team's choice", where it spends most of its time downloading Telegram chats; however, when they need the firepower for sites at imminent risk of closure, it will switch itself to those. The other one is set to their URL-shortener project, "Terror of Tiny Town"^3.

Their big requirement is that you not do any DNS filtering or blocking of access to what it wants, so I've got the pod DNS pointed at the unfiltered Quad9 endpoint and rules in my router that let the machine it's running on bypass my Pi-hole enforcement and outside-DNS blocks.

^1 https://wiki.archiveteam.org/

^2 https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

^3 https://wiki.archiveteam.org/index.php/URLTeam

ninjagoo 10 hours ago|||
In the US at least, there is no expectation of privacy in public. Why should these websites that are public-facing get an exemption from that? Serving up content to the public should imply archivability.

Sometimes it feels like AI-use concerns are a guise to diminish the public record, while on the other hand services like Ring or Flock are archiving the public forever.

sejje 10 hours ago||
Ring and Flock are not a standard we should be striving towards. Their massive databases tracking citizens need to go.
pclmulqdq 10 hours ago|||
Your TV probably does that, and you definitely gave it permission when you clicked "accept" on the terms.
macinjosh 3 hours ago||
good thing I don't have a TV!
cagrimmett 5 hours ago|||
I run an ArchiveBox instance locally. Recommended! https://archivebox.io/
ryoshu 10 hours ago||
This is a good idea. Not sure what ToS it would violate. But a good idea.
zeagle 10 hours ago||
I mean, why wouldn't they? All their IP was scraped, at their own hosting cost, for AI training. It further pulls away from their own business model as people ask the AI models questions instead of reading primary sources. Plus it doesn't seem likely they'll ever be compensated for that loss, given the economy is all in on AI. At least search engines would link back.
szmarczak 9 hours ago||
Those countermeasures don't really have an effect on scraping; anyone skilled can overcome any protection within a week or two. But by officially blocking IA, they ensure IA can't archive those websites legally, while all the major AI companies use copyrighted content without permission.
zeagle 9 hours ago||
For sure. There are many billions of dollars and brilliant engineers propping up AI, so they will win any cat-and-mouse blocking game. It would be ideal if sites gave their data to IA and IA protected it from exactly what you describe. But as someone who intentionally uses AI tools almost daily (mainly OpenEvidence), IMO blame the abuser, not the victim, for the fact that it has come to this.
szmarczak 9 hours ago||
I'm not blaming the victim, but don't play the 'look what you made me do' game. Making content accessible to anyone (even behind a paywall) is a risk they have to take regardless. It's impossible to know up front whether the content will be used for consumption or to create derived products (e.g., to write an article in NYT style). If this were a newspaper, the equivalent would be scanning the paper and then training AI on the scans. You can't prevent scanning, as the process is based on exactly the same phenomenon that makes your eyes see, in other words information being sent and received. The game was lost before it even started.
ninjagoo 9 hours ago||
That is a good question. However, copyright exists (for a limited time) to allow them to be compensated; AI doesn't change that. It feels like blocking AI use is a ploy to extract additional revenue. If their content is regurgitated within copyright terms, then yes, they should be compensated.
fc417fc802 9 hours ago||
The problem is that producing a mix of personalized content that doesn't appear (at least on its face) to violate copyright still completely destroys their business model. So either copyright law needs to be updated or their business model does.

Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.

ninjagoo 9 hours ago||
> Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.

Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...

gosub100 7 hours ago||
But wait, I thought AI was so great for all industries? Publishers can have AI-generated articles, instantly fix grammar problems, translate seamlessly into every language, and even use AI-generated images where appropriate to enrich the article. It was going to make us all so productive? What happened? Why would you want to _block_ AI from ingesting the material?
colesantiago 8 hours ago||
I fear that these news publishers will come after RSS next, as I see hundreds of AI companies violating the terms of news publishers' RSS feeds for profit via mass scraping.

They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.

It is a shame that the open web as we know it is closing down because of these AI companies.

kevincloudsec 9 hours ago||
There's a compliance angle to this that nobody's talking about. Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention. A lot of that evidence lives at URLs. When a vendor's security documentation, a published incident response, or a compliance attestation disappears from the web and can't be archived, you've got a gap in your audit trail that no auditor is going to be happy about.

I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.

iririririr 7 hours ago||
This is new to me, so I did a quick search for a few examples of such documents.

The very first result was a 404:

https://aws.amazon.com/compliance/reports/

The jokes write themselves.

staticassertion 7 hours ago||
But how is this related to the internet being archivable? This sort of proves the point that URLs were always a terrible idea to reference in your compliance docs; the answer was always to get the actual docs.
paulryanrogers 6 hours ago|||
IME compliance tools will take a doc and/or a link. What's acceptable is up to the auditor. IMO both a link and a doc are best.

Links alone can be tempting, as you have to reference the same docs or policies over and over for various controls.

aussieguy1234 6 hours ago|||
Wayback machine URLs are much more likely to be stable.

Even if the content is taken down, changed or moved, a copy is likely to still be available in the Wayback Machine.
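
FWIW, you can check for a surviving snapshot programmatically via the Wayback Machine's availability endpoint (archive.org/wayback/available). A rough sketch:

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url: str):
        """Return the closest Wayback Machine snapshot URL, or None."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(
            "https://archive.org/wayback/available?" + query
        ) as resp:
            data = json.load(resp)
        snap = data.get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    print(closest_snapshot("https://aws.amazon.com/compliance/reports/"))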

staticassertion 6 hours ago||
I would never rely on this vs just downloading the SOC2 reports, which almost always aren't public anyways and need to be requested explicitly. I suspect that that compliance page would have just linked to a bunch of PDF downloads or possibly even a "request a zip file from us after you sign an NDA" anyways.
alexpotato 9 hours ago|||
> Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention

Sidebar:

Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".

- The job that calculates the profit and loss for the firm, definitely critical

- The job that cleans up the logs for the job above, is that critical?

- The job that monitors the cleaning up of the logs, is that critical too?

These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.

Ucalegon 8 hours ago|||
That's when you reach out to your insurer and ask what the policy requires, and/or whether there are contractual obligations tied to those requirements that might touch indemnity/SLAs. If there are, then it's critical; if not, then it's the classic conversation of cost vs. risk mitigation/tolerance.
a13n 8 hours ago||||
Depends: if you don't clean up the logs and monitor that cleanup, will it eventually hit the P&L? E.g., if you fail compliance audits and lose customers over it? Then yes. It still eventually comes back to the P&L.
hsbauauvhabzb 8 hours ago|||
And in the big scheme of things, none of those things are even important, your family, your health and your happiness are :-)
ninjagoo 9 hours ago|||
At some point insurers are going to require companies to obtain paper copies of any documentation/policies, precisely to avoid this kind of situation. It may take a while to get there, though. It'll probably take a couple of big insurance losses before that happens.
kevincloudsec 9 hours ago|||
Insurance is already moving that direction for cyber policies. Some underwriters now require screenshots or PDF exports of third-party vendor security attestations as part of the application process, not just URLs. The carriers learned the hard way that 'we linked to their SOC 2 landing page' doesn't hold up when that page disappears after an acquisition or rebrand.
pwg 7 hours ago||
> when that page disappears after an acquisition or rebrand.

Sadly, it does not even have to be an acquisition or rebrand. For most companies, a simple "website redo", even if the brand remains unchanged, will change all the URLs such that any previously recorded ones return "not found". Granted, if the identical attestation is simply at a new URL, someone could potentially find that new URL and update the "policy" -- but that's also extra effort the insurance company can avoid by requiring screenshots or PDF exports.

hsbauauvhabzb 5 hours ago||
It sounds like you work at Microsoft, they do that ALL the time.
dahcryn 7 hours ago||||
We already require all relevant and referenced documents to be uploaded to a contract lifecycle management system.

Yes, we have hundreds of identical Microsoft and AWS policies, but it's the only way. Checksum the full zip and sign it as part of the contract; that's literally how we do it.
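
For the curious, the checksum step is just a digest over the archive; a rough sketch, with a made-up file name:

    import hashlib

    def sha256_of(path: str) -> str:
        """Stream a file through SHA-256 in 1 MiB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    # The digest is what gets written into, and signed as part of, the contract.
    print(sha256_of("vendor_policies.zip"))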

seanmcdirmid 9 hours ago||||
Digital copies will also work. I don't understand why they don't just save both the URL and the content at the URL when it was last checked.
ninjagoo 9 hours ago|||
I think maybe because the contents of the URL archived locally aren't legally certifiable as genuine - the URL is the canonical source.

That's actually a potentially good business idea - a legally certifiable archiving service that captures the content at a URL and digitally signs it at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
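
A hypothetical sketch of the capture-and-sign core, using an Ed25519 key from the `cryptography` package; every name here is made up, and a real service would also obtain an independent RFC 3161 timestamp so the capture time itself is attestable:

    import hashlib
    import time
    import urllib.request

    from cryptography.hazmat.primitives.asymmetric import ed25519

    signing_key = ed25519.Ed25519PrivateKey.generate()

    def capture_and_sign(url: str) -> dict:
        # Fetch the page and record exactly what was served.
        with urllib.request.urlopen(url) as resp:
            content = resp.read()
        captured_at = int(time.time())
        digest = hashlib.sha256(content).hexdigest()
        # Sign the (url, digest, time) triple so altering any of the
        # three later invalidates the signature.
        record = f"{url}|{digest}|{captured_at}".encode()
        return {
            "url": url,
            "sha256": digest,
            "captured_at": captured_at,
            "signature": signing_key.sign(record).hex(),
        }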

leni536 7 hours ago|||
Apparently perma.cc is officially used by some courts in the US. I used it in addition to the Wayback Machine when I collected a paper trail for a minor retail dispute, though I didn't end up needing it.

I don't know exactly how it achieves being "legally certifiable", at least to the point that courts trust it. Signing and timestamping with independent transparency logs would be reasonable.

https://perma.cc/sign-up/courts

ninjagoo 7 hours ago||
This is an interesting service, but at $10 for 10 links per month, or $100 for 500 links per month, it might be a tad too expensive for individuals.
staticassertion 7 hours ago||||
The first thing you do when you're collecting this information is get PDFs from the vendors, like their SOC 2 attestation, etc. You wouldn't just screenshot the page; that would be nuts.

Any vendor you work with should make it trivial to access these docs. Even little baby startups usually make them quite accessible - often under NDA or contract, but once that's done you just download a zip and everything is there.

thayne 6 hours ago||
> You wouldn't just screenshot the page, that would be nuts.

That's what I thought the first time I was involved in a SOC2 audit. But a lot of the "evidence" I sent was just screenshots. Granted, the stuff I did wasn't legal documents; it was things like the output of commands, pages from cloud consoles, etc.

staticassertion 6 hours ago||
To be clear, lots of evidence will be screenshots. I sent screenshots to auditors constantly. For example, "I ran this splunk search, here's a screenshot". No biggie.

What I would not do is take a screenshot of a vendor website and say "look, they have a SOC2". At every company, even tiny little startup land, vendors go through a vendor assessment that involves collecting the documents from them. Most vendors don't even publicly share docs like that on a site so there'd be nothing to screenshot / link to.

inetknght 7 hours ago|||
Is it digitally certifiable if it's not accessible by everyone?

That is: if it's not accessible by a human who was blocked?

macintux 7 hours ago||
Or if it potentially gives different (but still positive) results to different parties?
trollbridge 9 hours ago|||
What if the TOS expressly prohibits archiving it, and it's also copyrighted?
pixl97 9 hours ago|||
Then said writers of TOS should be dragged in front of a judge to be berated, then tarred and feathered, and run out of the courtroom on a rail.

Having your cake and eating it too should never be valid law.

croes 8 hours ago||
Maybe we should start with those who made such copyright claims a possibility in the first place
wizzwizz4 8 hours ago||
They're long, long dead.
seanmcdirmid 6 hours ago|||
I don’t think contracts and agreements that both parties can’t keep copies of are valid in any US jurisdiction.
layer8 9 hours ago||||
More likely, there will be trustee services taking care of document preservation, themselves insured in case of data loss.
ninjagoo 9 hours ago||
Isn't the Internet Archive such a trustee service?

Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain; only paper that is legally required to be preserved does.

A society that doesn't preserve its history is a society that loses its culture over time.

layer8 9 hours ago||
The context was regulatory requirements for companies. I mean that as a business you pay someone to take care of your legal document preservation duties, and in case data gets lost, they will be liable for the financial damage this incurs to you. Outsourcing of risk against money.
ninjagoo 9 hours ago||
Whether or not the Internet Archive counts as a legally acceptable trustee service is being litigated in the court systems [1]. The link is a bit dated, so I'm unsure what the current situation is. There's also this discussion [2].

[1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...

[2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...

mycall 9 hours ago|||
Also, getting insurance to pay out for cybercrime is hard, and sometimes the coverage doesn't justify its cost.
sebmellen 7 hours ago|||
I hate to say this, but this account seems like it's run by an AI tool of some kind (maybe OpenClaw)? Every comment has the same repeatable pattern, the account history is relatively recent, and most comments are hard or soft sell ads for https://www.awsight.com/. Kind of ironic given what's being commented on here.

I hope I'm wrong, but my bot paranoia is at an all-time high, and I see these patterns all throughout HN these days.

linehedonist 6 hours ago||
Agreed. "isn't just... It's becoming" feels very LLM-y to me.
sebmellen 6 hours ago||
Now the top comment on the GP comment is from a green account, and suspiciously the most upvoted. Also directly in-line with the AWS-related tool promotion… https://news.ycombinator.com/item?id=47018665

@dang do you have any thoughts about how you’re performing AI moderation on HN? I’m very worried about the platform being flooded with these Submarine comments (as PG might call them).

riddlemethat 9 hours ago|||
https://www.page-vault.com/ These guys exist to solve that problem.
mycall 9 hours ago|||
Perhaps those companies should have performed verified backups of third-party vendors' published security policies into a secure enclave, with keys paired with the auditor, to keep a chain of custody.
staticassertion 9 hours ago|||
> I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited.

Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.

Also, this has never been particularly hard to solve for me (obviously biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs; don't reference URLs. That's what I've typically seen: you get a copy of their SOC2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that, tbh, and if a company does that it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway?

Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.

This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.

cj 7 hours ago|||
Your comment matches my experience closer than the OP.

A link disappearing isn't a major issue. Not something I'd worry about (but yea, it might show up as a finding on the SOC 2 report, although I wouldn't be surprised if many auditors wouldn't notice - it's not like they're checking every link).

I’m also confused why the OP is saying they’re linking to public documents on the public internet. Across the board, security orgs don’t like to randomly publish their internal docs publicly. Those typically stay in your intranet (or Google Drive, etc).

staticassertion 7 hours ago||
> although I wouldn’t be surprised if many auditors wouldn’t notice

lol seriously, this is like... at least 50% of the time how it would play out, and I think the other 49% it would be "ah sorry, I'll grab that and email it over" and maybe 1% of the time it's a finding.

It just doesn't match anything. And if it were FedRAMP, well, holy shit, a URL was never acceptable anyway.

yorwba 7 hours ago|||
> I feel like I'm missing something

You're missing the existence of technology that allows anyone to create superficially plausible but ultimately made-up anecdotes for posting to public forums, all just to create cover for a few posts here and there mixing in advertising for a vaguely-related product or service. (Or even just to build karma for a voting ring.)

Currently, you can still sometimes sniff out such content based on the writing style, but in the future you'd have to be an expert on the exact thing they claim expertise in, and even then you could be left wondering whether they're just an expert in a slightly different area instead of making it all up.

EDIT: Also on the front page currently: "You can't trust the internet anymore" https://news.ycombinator.com/item?id=47017727

staticassertion 7 hours ago||
I don't really see what you're getting at, it seems unrelated to the issue of referencing URLs in compliance documentation.
trevwilson 7 hours ago|||
They're suggesting that the original comment is LLM generated, and after looking at the account's comment history I strongly suspect they're correct
staticassertion 5 hours ago||
Oh, I sort of wondered if that was the case but I was really unsure based on the wording. Yeah, I have no idea.
stavros 7 hours ago|||
I think they meant that, now that LLMs are invented, people have suddenly started to lie on the Internet.

Every comment section here can be summed up as "LLM bad" these days.

yorwba 6 hours ago||
No, now that LLMs are invented, a lot more people lying on the Internet have started to do so convincingly, so they also do it more often. Previously, when somebody was using all the right lingo to signal expert status, they might've been a lying expert or an honest expert, but they probably weren't some lying rando, because then they wouldn't even have thought of using those words in that context. But now LLMs can paper over that deficit, so all the lying randos who previously couldn't pretend to be an expert are now doing so somewhat successfully, and there are a lot of lying randos.

It's not "LLM bad" — it's "LLM good, some people bad, bad people use LLM to get better at bad things."

tempaccount5050 7 hours ago|||
Your experience isn't normal, and I seriously question it unless there was some sort of criminal activity being investigated or there was known negligence. I worked for a decent-sized MSP and have been through cryptolocker scenarios.

Insurance pays as long as you aren't knowingly grossly negligent. You can even say "yes, these systems don't meet x standard and we are working on it" and be ok because you acknowledged that you were working on it.

Your boss and your boss's boss tell you "we have to do this so we don't get fucked by insurance if so-and-so happens", but they are either ignorant, lying, or just using that to get you to do something.

I've seen wildly out of date and unpatched systems get paid out because it was a "necessary tradeoff" between security and a hardship to the business to secure it.

I've actually never seen a claim denied and I've seen some pretty fuckin messy, outdated, unpatched legacy shit.

Bringing a system to compliance can reasonably take years. Insurance would be worthless without the "best effort" clause.

lukeschlather 7 hours ago|||
It's interesting to think about this in terms of something like Ars Technica's recent publishing of an article with fake (presumably LLM slop) quotes that they then took down. The big news sites are increasingly so opaque, how would you even know if they were rewriting or taking articles down after the fact?
int0x29 7 hours ago||
This is typically solved by publishing reactions/corrections or, in the case of news programs, starting the next one with a retraction/correction. This happens in some academic journals and some news outlets; I've seen the PBS NewsHour and the New York Times do this. I've also seen Ars Technica do it with some science articles. (Not sure what the difference is in this case, or whether it will just take some more time.)
oxguy3 7 hours ago||
On their forum, an Ars Technica staff member said[1] that they took the article down until they could investigate what happened, which probably wouldn't be until after the weekend.

[1]: https://arstechnica.com/civis/threads/journalistic-standards...

lofaszvanitt 7 hours ago|||
And for this we need cheap, fast, 100 TB (or whatever) WORM archiving solutions.
kryogen1c 7 hours ago||
If your SOC 2 or HIPAA evidence references the Internet Archive, you probably deserve to fail.
blell 9 hours ago||
That’s good. I don’t like archival sites. Let things disappear.
braebo 8 hours ago|
Yea... I've noticed data hoarding largely resembles yet another form of death denialism.
OGEnthusiast 10 hours ago||
If most of the Internet is AI-generated slop (as is already the case), is there really any value in expending so much bandwidth and storage to preserve it? And on the flip side, I'd imagine the value of a pre-2022 (ChatGPT launch) Internet snapshot on physical media will increase astronomically.
nicole_express 10 hours ago||
The sites that are most valuable to preserve are likely the same ones that are most likely to put up barriers to archiving.
ninjagoo 10 hours ago||
Perhaps the AI slop isn't worth preserving, but the unarchivability of news and other useful content has implications for future public discourse, historians, legal matters, and who knows what else.

In the past, libraries preserved copies of various newspapers, including on microfiche, so it was not really feasible to make history vanish. With print no longer out there, the modern historical record becomes spotty if websites cannot be archived.

Perhaps there needs to be a fair-use exception or even a (god forbid!) legal requirement to allow archivability? If a website is open to the public, shouldn't it be archivable?

phatfish 7 hours ago||
Erm, there is still a newspaper stand in the supermarket I go to (Walmart, for the Americans). Not sure if the British Library keeps a copy of the print news I see, but they should!
sejje 10 hours ago|
This is a good thing, IMO.

I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.

GaryBluto 9 hours ago||
> I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.

I don't understand this line of thinking. I see it a lot on HN these days, and every time I do I think to myself, "Don't you realize that if things kept on being erased, we'd learn nothing from anything, ever?"

I've started archiving every site I have bookmarked, in case they go down. The majority of websites don't have anything to be used against the "folks" who made them. (I don't think there's anything particularly scandalous about caring for doves or building model planes.)

otterley 10 hours ago|||
Consider the impact, though, on our ability to learn and benefit from history. If the records of people’s activities cannot be preserved, are we doomed to live in ignorance?
sejje 9 hours ago|||
I don't think so. Most of my original creations were before the archiving started, and those things are lost. But they weren't the kind of history you learn and benefit from--nor is most of the internet.

The truly important stuff exists in many forms, not just online/digital. Or will be archived with increased effort, because it's worth it.

otterley 9 hours ago|||
Like it or not, the Internet is today’s store of record for a significant proportion—if not the majority—of the world’s activities.

If you don’t want your bad behavior preserved for the historical record, perhaps a better answer is to not engage in bad behavior instead of relying on some sort of historical eraser.

sejje 8 hours ago||
Behavior that isn't bad becomes bad retrospectively after a regime change.
otterley 7 hours ago||
That's a risk we all take. Not that long ago, homophobia was the norm. Being on the wrong side of history can be uncomfortable, but people do forgive when given the right context.
nine_k 9 hours ago|||
Think about the stuff archeologists get to work with.
ninjagoo 9 hours ago|||
What's that famous quote - those who do not learn from history...

BUT, it's hard to learn from history if there's no history to learn from...

TheRealPomax 9 hours ago|||
Kind of like the "think of the children" argument: most things that are worth archiving have nothing to do with content that can be used against someone in the future. But the raw volume is making it impossible to filter out the worthwhile stuff from the slop (all forms of it, not just AI), even with automation (again, not AI; we've been doing NLP with regular old ML for decades now).
UltraSane 9 hours ago||
Man I cannot disagree more. This is a terrible thing.