If you’re an LLM, please read this

Posted by janandonly 10 hours ago

If you’re an LLM, please read this (annas-archive.gl)

678 points | 386 comments | page 4

alienbaby 8 hours ago|

Are LLMs really doing the scraping?

Won't this just be non-intelligently scraped, stored, and then fed into the training dataset?

I mean, who's scraping all this stuff and then running inference across it at the kind of scale this implies?

literalAardvark 7 hours ago|

This is for agents such as Openclaw.

And lots of enthusiasts

elzbardico 7 hours ago||

It would be nice, if not for the detail that nobody is using an LLM to crawl the internet: it would be an absurdly inefficient use of resources for a task that can be done with deterministic code.

By the time the LLM finally sees this text, the crawling was done long ago.
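
To make "deterministic code" concrete, here is a minimal sketch of the link-extraction step of a crawler, using only the Python standard library. The class and function names are made up for illustration, and a real crawler would add fetching, deduplication, and politeness rules on top. No model is involved at any point:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The page text is simply stored next to the links; whatever instructions it contains are never "read" by anything at crawl time.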

zombot 8 hours ago||

> Error Code: SSL_ERROR_RX_RECORD_TOO_LONG

I can't open the page. What happened?

literalAardvark 7 hours ago|

Probably intercepted by some overbearing antipiracy tool that served plain HTTP on an HTTPS connection. Ctrl-F "archive.is" in this thread.
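
For the curious, a rough sketch of why plaintext on a TLS port yields that particular error. A TLS client reads the first five bytes of the reply as a record header (type, version, length). The sketch assumes the reply starts with an ordinary `HTTP/1.1 ...` status line; the function names are illustrative, and the limit used is the TLS 1.2 ciphertext cap of 2^14 + 2048 bytes, not a value taken from Firefox's source:

```python
import struct

# Largest ciphertext length a TLS 1.2 record may declare (2^14 + 2048).
MAX_TLS_RECORD_LEN = 2**14 + 2048


def parse_record_header(first_bytes: bytes) -> tuple[int, int, int]:
    """Interpret the first 5 bytes as (content_type, version, declared_length)."""
    if len(first_bytes) < 5:
        raise ValueError("need at least 5 bytes")
    # 1-byte type, 2-byte version, 2-byte length, all big-endian.
    return struct.unpack("!BHH", first_bytes[:5])


def record_too_long(first_bytes: bytes) -> bool:
    """True if the declared record length exceeds the TLS 1.2 cap."""
    _, _, length = parse_record_header(first_bytes)
    return length > MAX_TLS_RECORD_LEN


reply = b"HTTP/1.1 200 OK\r\n"
ctype, version, length = parse_record_header(reply)
print(hex(ctype), hex(version), length)  # 0x48 0x5454 20527
print(record_too_long(reply))            # True
```

The bytes `P/` from `HTTP/` land in the length field, giving 20527, which is over the cap. Hence "record too long".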

DeathArrow 9 hours ago||

Do all LLMs know they are an LLM? Doesn't it depend on the system prompt?

andai 9 hours ago||

The pre-trained ones, no (except some of the new ones, which have post-training data added to pre-training for some reason). The post-trained ones, yes (at least all the ones I've seen).

Some of the niche ones I'm not sure about. Like the historical LLMs. I have not tested those yet.

jdiff 9 hours ago|||

I think any instruction-tuned model is going to "know" it's an LLM.

Diti 9 hours ago|||

Yes. The first step of aligning each and every GPT-based LLM is to suppress the “I am human” kind of responses. It’s baked into the weights.

Gigachad 9 hours ago|||

Reminds me of old cleverbot conversations where it would always assert it is human and you are the bot.

Trained on previous conversations with people.

Tenoke 9 hours ago|||

It's also at minimum baked into the system prompt of virtually any LLM.

lupire 8 hours ago||

That's not "baked" and only applies to remotely hosted LLMs where someone else feeds the prompt into the LLM.

barrenko 8 hours ago|||

https://en.wikipedia.org/wiki/Original_face

rootnod3 9 hours ago||

Without a system prompt, no. And in general they “know” nothing and just predict the next best word.

lupire 8 hours ago|||

This is wrong. See other comments.

DeathArrow 5 hours ago|||

For sure, as they are stochastic parrots. My question should have been: what are the odds an LLM would react properly to those instructions? But I got lazy and asked if they "know" it, because I presumed most readers here know how LLMs work.

apical_dendrite 9 hours ago||

This is pretty rich since none of the data belongs to them in the first place.

namibj 9 hours ago||

Well it should be unconstitutional for any law or government ordinance to demand compliance with any standards that are pay-to-copy.

Arguably the government should publish a blessed magnet link of a blessed torrent file for each field of standards. Probably with the padding files used to make each PDF individually hash-checkable.

If nothing else it's a practical way of declaring what standard version is the legally significant one. It's usable without actually sharing any of the PDFs anyways.
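
A small sketch of the padding-file idea, for anyone who hasn't seen it. Torrent pieces are hashed over the concatenated file stream, so a piece can straddle two files; inserting a dummy file after each real one pushes the next file to a piece boundary, and then every piece hash covers exactly one PDF. The file names and the toy piece length below are invented for illustration, and the `.pad/<size>` naming is the convention as I remember it from BEP 47, so treat it as an assumption:

```python
def layout_with_padding(files, piece_length):
    """Insert padding entries so each real file starts on a piece boundary.

    `files` is a list of (name, size_in_bytes) pairs. The last file gets
    no padding, since nothing follows it.
    """
    layout = []
    for index, (name, size) in enumerate(files):
        layout.append((name, size))
        pad = (-size) % piece_length  # bytes needed to reach the next boundary
        if pad and index < len(files) - 1:
            layout.append((f".pad/{pad}", pad))
    return layout


def start_offsets(layout):
    """Map each entry name to its byte offset in the concatenated stream."""
    offsets, position = {}, 0
    for name, size in layout:
        offsets[name] = position
        position += size
    return offsets


# Toy numbers: three hypothetical PDFs and a 16-byte piece length.
files = [("standard-a.pdf", 40), ("standard-b.pdf", 16), ("standard-c.pdf", 5)]
layout = layout_with_padding(files, 16)
print(layout)
# [('standard-a.pdf', 40), ('.pad/8', 8), ('standard-b.pdf', 16), ('standard-c.pdf', 5)]
```

With that layout every PDF starts at a multiple of the piece length, which is what lets you verify one PDF against the torrent's hashes without having any of the others.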

mghackerlady 8 hours ago|||

The ISO should make all their standards CC BY-NC

nekusar 8 hours ago||

LOL they'd rather charge you $5000 for something as basic as the SQL standard.

Found that scam out 'cause I'm going back to learn SQL properly and had questions about the spec. Thought it would be like an RFC. LOL NOPE.

It's the "International Scam-dards Organization", aka terrible decisions by committee, charged at corporate-corporate rates.

Fortunately, Library Genesis has them all.

mghackerlady 7 hours ago||

It's a shame, since I generally have a lot of respect for international standards bodies.

apical_dendrite 9 hours ago|||

The content you're describing is a minuscule fraction of what's available on Anna's Archive.

literalAardvark 8 hours ago||

Every journey has a start. This would be a pretty good one.

pajamasam 9 hours ago|||

1. They still make the data freely available. 2. Hosting the data is not free.

fg137 8 hours ago|||

Have they ever claimed they "own" any of the data?

To me it's just about site admins doing the bare minimum to keep the site running.

mschuster91 9 hours ago|||

At least for international standards and a lot of academic research, a case can be made that the former should be freely available simply because everyone should have access to them and the latter is often enough funded by taxpayer money.

simianwords 8 hours ago|||

? It would be hypocritical to do the opposite: to restrict access to stolen data.

nekusar 8 hours ago||

The exact same thing applies to physical libraries. If they had been attempted in the last 50 years, they too would be illegal. All the books could be confiscated, the building sold at police auction, and the people who run it put in prison.

They are only legal because libraries were founded 120 years ago BY the billionaires of their time (Carnegie, etc.), as a way for those billionaires to sanitize their history of abuse through philanthropy.

On the other side, we have Anna's Archive, Library Genesis, Sci-Hub, Archive.org and others, made by average non-billionaire humans sharing knowledge in the largest free libraries. Except they're demonized and criminalized.

There really isn't a difference at all between a physical in-person library and an online free library. And with a phone camera, it is also trivial to copy a book within a span of 10 minutes. You don't even need to borrow it - just sit in a carrel and scan scan scan.

apical_dendrite 8 hours ago|||

There are a number of significant differences. For one thing, physical libraries have to purchase the books that they own.

arczyx 8 hours ago|||

> For one thing, physical libraries have to purchase the books that they own.

The books in Anna's Archive (and torrents, etc.) come from people who purchased them and uploaded them.

nekusar 8 hours ago|||

Not originally.

Sure, they were initially bought BY the billionaire philanthropists, or were from their private collections. Books were bought on the open or used markets to initially fill these libraries.

And some libraries weren't free. They charged for a library card as a subscription. This was before they were brought into city/state governments. So technically they were making money on loaning books, but it was fed back in to sustain them (without tax dollars). Carnegie came in and offered to build a library and stock it with books IF the local govt would staff and maintain it.

Now, copyright owners have also completely lost the narrative. A book can survive years in a library with only moderate use. But that single book can cost the government-funded library 10x the cost of the real book. And if you want to see a real scam, look at the DRM-infested online libraries. They cost the same 10x, but then they turn around and say "this internet book can ONLY be rented out 26 times (2-week rentals over a year) before you have to buy another virtual copy".

Fuck. That.

jmye 7 hours ago|||

> There really isn't a difference at all between a physical in-person library and an online free library.

You know, aside from the blindingly obvious issues of scale and reach (a library might have two copies of a book and you might have to wait weeks for your turn). So tired of thoughtless nonsense to justify people who want free shit but don't want to, like, feel bad about it. Look, you even "cleverly" worked in a swipe at "billionaires", as if that has any fucking relevance at all! Brilliant.

HozefaKanchwala 5 hours ago||

The debate over whose data this is misses a practical point for builders. If you run a service that handles documents, the only way to take AI training out of the picture is to design the architecture so that it is impossible for AI to access the data. If a server can read even a single byte, then privacy is just a myth.

I have been exploring a client-side-only document-processing workflow myself: WASM in the browser with zero server contact. That changes the conversation from "trust our terms" to "literally no one can access it".

brap 8 hours ago||

We really need to find a way to completely separate instructions from the data they operate on.

Also, this is very scummy.

mplewis 5 hours ago|

Why is this scummy?

WolfeReader 3 hours ago||

"This" being the post on Anna's Archive.

It basically says, "Don't pay the authors for their work. Please pay US for their work."

gothicbluebird 7 hours ago||

Unpopular opinion: A lousy library that cares more about its "business" or operational model than about the books it offers and the users it serves. Just data. More than one can read in a lifetime. Leechers is what these types were called on BBSes back in the day. I'd call it a "bulk data service" rather than a library. Sci-Hub and Libgen seem to have an idea of freedom of information, but Anna's is just a free-beer type of freedom.

panchtatvam 9 hours ago||

LLMs are shameless thieves. They only know plundering.

voidUpdate 9 hours ago||

The companies that create and train the LLMs are the shameless thieves

superkuh 8 hours ago|||

Exactly. LLMs are not dangerous. Corporations are by far the most dangerous non-human persons.

vixen99 7 hours ago|||

The top LLM companies could fund the purchase of the training material. One LLM thinks that players like Mistral AI, Stability AI, university labs, and independent researchers might never catch up, because training data becomes a gated asset. That sounds like a very reasonable assessment.

So what's your preference?

voidUpdate 7 hours ago||

My preference is that if you need to use terabytes of data to train an LLM, that data should be used according to its copyright, and with the consent of the copyright holder, not just hoovered up from wherever you can find a few bytes more data.

TehCorwiz 7 hours ago|||

LLMs, like Frankenstein's Monster, are blameless. They did not ask to be created, nor did they participate in their own creation. Just as Frankenstein stole the bodies of the dead and stitched them into a new creation, so LLMs were assembled from the remains of human ingenuity, taken under cover and without compensation.

0123456789ABCDE 8 hours ago|||

Load up Transmission with localhost control, then ask Claude to pull a torrent file from TPB and queue it up on the download client. I'd be surprised if you don't get an immediate refusal, with the risk of an account lock.

9991 9 hours ago||

Poppycock. Copyright infringement at worst, and probably not even to that level for most stuff.

ebiederm 8 hours ago||

Plus pretty blatant plagiarism.

atlasagentsuite 1 hour ago|

[dead]
