Posted by janandonly 10 hours ago
Won't this just be non-intelligently scraped, stored, and then fed into the training dataset?
I mean, who's scraping all this stuff and then running inference across it at the kind of scales this implies?
And lots of enthusiasts
By the time the LLM finally sees this text, the crawling was done a long time ago.
I can't open the page. What happened?
Some of the niche ones I'm not sure about. Like the historical LLMs. I have not tested those yet.
Trained on previous conversations with people.
Arguably the government should publish a blessed magnet link of a blessed torrent file for each field of standards. Probably with the padding files used to make each PDF individually hash-checkable.
If nothing else it's a practical way of declaring what standard version is the legally significant one. It's usable without actually sharing any of the PDFs anyways.
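For what it's worth, here is a rough sketch of why padding files make each PDF checkable on its own. It assumes BitTorrent v1-style fixed-size pieces with zero-filled padding between files; the 16 KiB piece size and all function names are my own assumptions for illustration, not taken from any actual published torrent.

```python
import hashlib

PIECE_LENGTH = 16 * 1024  # assumed piece size; real torrents choose their own


def padding_needed(file_size: int, piece_length: int = PIECE_LENGTH) -> int:
    """Bytes of zero padding so the next file starts on a piece boundary."""
    return (-file_size) % piece_length


def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """SHA-1 of each piece of one file, zero-padded to a whole number of pieces."""
    padded = data + b"\x00" * padding_needed(len(data), piece_length)
    return [
        hashlib.sha1(padded[i:i + piece_length]).digest()
        for i in range(0, len(padded), piece_length)
    ]


def torrent_piece_hashes(files: list[bytes], piece_length: int = PIECE_LENGTH) -> list[bytes]:
    """Piece hashes over the padded concatenation of all files (hypothetical layout)."""
    blob = b"".join(
        f + b"\x00" * padding_needed(len(f), piece_length) for f in files
    )
    return [
        hashlib.sha1(blob[i:i + piece_length]).digest()
        for i in range(0, len(blob), piece_length)
    ]
```

Because every file starts on a piece boundary, the torrent's piece list is just the per-file lists stuck together, so you can verify one PDF against its own slice of hashes without having any of the others. This sketch also pads the last file, which is a simplification.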
Found that scam out because I'm going back to learn SQL properly, and had questions about the spec. Thought it would be like an RFC. LOL NOPE.
It's the "International Scam-dards Organization", aka terrible decisions by committee, charged at corporate-corporate rates.
Fortunately, Library Genesis has them all.
To me it's just about site admins doing the bare minimum to keep the site running.
It was only because libraries were made 120 years ago BY the billionaires of their time (Carnegie, etc.), and they were a way for those billionaires to sanitize their history of abuse through philanthropy.
On the reverse, we have Anna's Archive, Library Genesis, Sci-Hub, Archive.org and others. Made by average non-billionaire humans sharing knowledge in the largest free libraries. Except they're demonized and criminalized.
There really isn't a difference at all between a physical in-person library and an online free library. And with a phone camera, it's also trivial to copy a book within a span of 10 minutes. You don't even need to borrow it - just sit in a carrel and scan scan scan.
The books in Anna's Archive (and torrents etc.) are from people who purchased them and uploaded them.
Sure, they were initially bought BY the billionaire philanthropists, or were from their private collections. Books were bought on the open or used markets to initially fill these libraries.
And some libraries weren't free. They charged for a library card as a subscription. This was before they were brought into city/state governments. So technically they were making money on loaning books, but it was fed back in to sustain them (without tax dollars). Carnegie came in and offered to build a library and populate it with books IF the local govt would staff and maintain it.
Now, copyright owners have also completely lost the narrative. A book can survive years in a library with only moderate use. But that single book can cost the government-funded library 10x the cost of the real book. And if you want to see a real scam, look at the DRM-infested online libraries. They cost the same 10x, but then they turn around and say "this internet book can ONLY be rented out 26 times (2-week rentals over a year) before you have to buy another virtual copy".
Fuck. That.
You know, aside from the blindingly obvious issues of scale and reach (a library might have two copies of a book and you might have to wait weeks for your turn). So tired of thoughtless nonsense to justify people who want free shit but don't want to, like, feel bad about it. Look, you even "cleverly" worked in a swipe at "billionaires", as if that has any fucking relevance at all! Brilliant.
Even I have been exploring a client-side-only document processing workflow: WASM in the browser with zero server contact. That changes the conversation from "trust our terms" to "literally no one can access it".
Also, this is very scummy.
It basically says, "Don't pay the authors for their work. Please pay US for their work."
So what's your preference?