Posted by lispybanana 10/22/2024

The Tragedy of Google Books (2017)(www.theatlantic.com)
503 points | 179 comments | page 2
boramalper 10/23/2024|
Of course someone needs to scan/digitise those books, but for those that already have been, there is Anna’s Archive.

https://en.wikipedia.org/wiki/Anna%27s_Archive

fx1994 10/23/2024|
It's a shame you have to pirate your way to a book that is practically unavailable, but I support pirating old, unavailable stuff.
theendisney4 10/23/2024||
Programmers, not lawmakers, really control what goes online and what doesn't.

BitTorrent, IPFS, etc. are nice, but things would be better if there were a large static archive with desktop clients exchanging chunks in a complex modular way.

Say I have pages 1-15 of file 123456, and you have page 16 but are looking for page 1 of doc 2345. If I can obtain that page, a fast exchange is possible. If not, a different module can issue an IOU that means I owe something, you are owed something, or both. Other modules could create groups that aim to store part of the archive without duplication among members. Spam-driven modules could also be interesting.

The archive can be organized by how dubious the copyright is so that one can limit participation to 50 or 100+ year old publications and/or living or dead authors.

It's not unlike living on a faraway island with the British Empire seeking to control every aspect of your life without sufficient means of force.
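
The page-for-page swap with an IOU fallback described above could be sketched roughly as follows. This is only an illustration under assumptions: the `Peer` class, the `exchange` function, and the integer ledger are hypothetical names and simplifications, not part of BitTorrent, IPFS, or any existing client.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    name: str
    have: set = field(default_factory=set)      # (file_id, page) pairs this peer holds
    want: set = field(default_factory=set)      # (file_id, page) pairs this peer is looking for
    ledger: dict = field(default_factory=dict)  # other peer's name -> pages they owe us (negative: we owe them)


def exchange(a: Peer, b: Peer):
    """Hand over every page the other side wants, then record any imbalance as an IOU."""
    # Decide both directions before mutating anything.
    a_gives = sorted(a.have & b.want)
    b_gives = sorted(b.have & a.want)

    for chunk in a_gives:
        b.have.add(chunk)
        b.want.discard(chunk)
    for chunk in b_gives:
        a.have.add(chunk)
        a.want.discard(chunk)

    # Positive balance: b received more pages than it gave, so b owes a.
    balance = len(a_gives) - len(b_gives)
    a.ledger[b.name] = a.ledger.get(b.name, 0) + balance
    b.ledger[a.name] = b.ledger.get(a.name, 0) - balance
    return a_gives, b_gives


# Example mirroring the comment: one side holds pages 1-15 of file 123456,
# the other holds page 16 and is looking for page 1 of doc 2345.
alice = Peer("alice",
             have={(123456, p) for p in range(1, 16)} | {(2345, 1)},
             want={(123456, 16)})
bob = Peer("bob",
           have={(123456, 16)},
           want={(2345, 1), (123456, 1)})
exchange(alice, bob)
```

In this toy run Alice hands over two pages and receives one, so the ledgers record that Bob owes Alice one page. A real system would also need signed IOUs and some defence against peers that never repay, which this sketch ignores.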

carlosjobim 10/22/2024||
For Kagi users, I recommend putting books.google.com as a pinned domain. This way, you'll often be presented with some of the best sources for any search query. Then it's a matter of finding the ePub file of that book. To read on macOS, FBReader is a high-quality app.
emmelaich 10/23/2024|
Thanks. Looks like it's available for Windows/Linux too, at least as of FBReader 2.1.2 (30th September 2024).
Animats 10/22/2024||
We need a Copyright Term Reduction Act.

It's time. 50 years, renewal is possible but expensive.

mjevans 10/22/2024||
Just my opinion but as a starting point for the argument...

  * 20 years from date of first publication (renewable up to a CAP? 50 years)
  * Must remain available every year
  * 10 year renewal blocks with massive registration fee increases
  * Compulsory maximum license fee cap (can offer for less) in the laws
Note this is not TRADE MARK; trade marks are _consumer protection_ related to 'brand ownership'.
js8 10/23/2024|||
Google Books is a tragedy of the commons problem, created by copyright, which is supposedly a solution to the tragedy of the commons problem.
ASalazarMX 10/22/2024|||
Even 50 is a lot, because it starts at the death of the author. Popular culture shouldn't remain locked out for generations. 50 maximum would be ideal, two generations from the one who experienced it in the original cultural context.
Animats 10/22/2024||
50 years from first publication. That's all the TRIPS agreement requires.[1]

[1] https://en.wikipedia.org/wiki/TRIPS_Agreement

ASalazarMX 10/23/2024||
And that's still a lot, since fifty years from publication is the minimum needed to comply with TRIPS. We're used to much worse, so it doesn't sound as bad now. It could be shorter: things move a lot faster nowadays, and a single generation of monopoly means more today than it did a hundred years ago.
gosub100 10/23/2024||
And rein in the damages for infringement to some amount closer to what was actually lost. For instance, if someone has a million books on a drive, they haven't deprived the publisher of a million sales, for chrissakes.
rekabis 10/24/2024||
Let’s rewrite copyright law:

1. The author gets to say, “I produced this”, and to control if it gets published.

2. Exclusive copyright for 15 year terms.

3. Renewal possible if author still alive. Non-human rights holders (corporations, etc.) limited to 30 years total (one renewal) from date of first publication, regardless of item ownership. Failure to renew automatically opens up the product.

4. Existing copyright can be overridden if demand isn’t being adequately serviced (sliding scale, challenger must capture minimum % of existing market demand to prove). Pricing of overriding attempts must be reasonable, only cost of production can be directly paid for, everything else goes into an escrow account until the attempt is concluded. This is where anti-abuse rules for both sides are most extensive.

Information and knowledge must be free. Our civilization depends vitally upon that freedom.

senkora 10/22/2024||
I’m sure the lawyers will eventually figure out a way to train an LLM on them.
datadrivenangel 10/22/2024|
They probably already have! It seems like an amazing training dataset even if you can't share source data.
amelius 10/22/2024||
How do you train an LLM such that it is guaranteed to never regurgitate its training data?
ASalazarMX 10/22/2024||
You punish it if parts of the answer can be found in its training data, and reward it otherwise.
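
One crude way to read "parts of the answer can be found in its training data" is as word n-gram overlap. The sketch below is an assumption about what such a signal could look like, not how any real lab trains models; `ngrams`, `overlap_fraction`, and `reward` are hypothetical helpers, and a real corpus would need an index rather than a Python set.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(answer, corpus_docs, n=5):
    """Fraction of the answer's word n-grams that also appear verbatim in the corpus."""
    answer_grams = ngrams(answer.split(), n)
    if not answer_grams:
        return 0.0  # answer shorter than n words: nothing to compare
    seen = set()
    for doc in corpus_docs:
        seen |= ngrams(doc.split(), n)
    return len(answer_grams & seen) / len(answer_grams)


def reward(answer, corpus_docs, n=5):
    """Toy signal: +1 for fully novel text, -1 for fully copied text."""
    return 1.0 - 2.0 * overlap_fraction(answer, corpus_docs, n)
```

A signal like this only catches verbatim copying. It says nothing about close paraphrase, and it gives no guarantee of the kind the parent comment asks about.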
amelius 10/22/2024||
But the whole point of the training is that you reward it if it correctly reproduces the next token.
zeroxfe 10/23/2024||
That's not the whole point of the training. It's just (very loosely) a measure of loss used during pre-training. There are many post-training and alignment stages in a typical model that are designed to reward high-quality responses.

Technically, yes, it's impossible to guarantee that it won't just regurgitate source material (which is mostly around the tails of the data distribution), but the whole point of training is to build generalized intelligence.
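
For readers unfamiliar with the "next token" loss being argued about here, a minimal sketch of it is the cross-entropy of the probability the model gave to the token that actually came next. The function names and the toy probability tables are illustrative assumptions, not any framework's API.

```python
import math


def next_token_loss(probs, target):
    """Cross-entropy for one position: negative log of the probability given to the true next token."""
    return -math.log(probs[target])


def sequence_loss(step_probs, targets):
    """Average next-token loss over a sequence of positions."""
    losses = [next_token_loss(p, t) for p, t in zip(step_probs, targets)]
    return sum(losses) / len(losses)


# Toy example: predicted distributions at two positions and the tokens that actually followed.
step_probs = [{"the": 0.5, "a": 0.5}, {"book": 0.25, "scan": 0.75}]
targets = ["the", "scan"]
loss = sequence_loss(step_probs, targets)
```

The loss is zero only when the model puts probability 1 on every true next token, which is why the thread treats pre-training as rewarding exact reproduction. Later training stages optimize different objectives.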

amelius 10/23/2024||
I guess I used the wrong wording but it doesn't change the argument. Yes, the whole point of training is to build generalized intelligence (or at least that's what we __hope__ for). But as far as I understand, we do it __mainly__ by training for the next word in the sequence.

PS: you speak of "pre-training" and "post-training", so I'm curious what you think is the main part of the training (?)

einpoklum 10/23/2024||
Written from a capitalist perspective, extolling "market forces" and legitimizing corporate and government limitations on copying.

"between 1923 and 1963 ... copyrights back then had to be renewed, and often the rightsholder wouldn’t bother filing the paperwork" - oh no, how terrible. How lucky we are that in these modern times one doesn't even have to file paperwork in order to prevent you from copying information.

And they go on to suck up to Google and decry how it didn't get to legitimize its control over a large swath of human knowledge and cultural heritage.

"It certainly seems unlikely that someone is going to spend political capital—especially today—trying to change the licensing regime for books, let alone old ones." <- copyright regime, licensing regime: all of this stuff is illegitimate a priori. Poetry, literature, music, software, papers and books: we cannot and must not tolerate restrictions on their dissemination.

Whatever arrangements the commercial and governmental entities come to, our "arrangement" should be that everything gets disseminated widely and without restriction, so that curtailment, censorship, commercial control, etc. just fail.

shadytrees 10/23/2024||
James Somers writes beautifully; https://www.newyorker.com/contributors/james-somers has some of his other writing
mcepl 10/23/2024||
> Copyright terms have been radically extended in this country largely to keep pace with Europe, where the standard has long been that copyrights last for the life of the author plus 50 years. But the European idea, “It’s based on natural law as opposed to positive law,” Lateef Mtima, a copyright scholar at Howard University Law School, said. “Their whole thought process is coming out of France and Hugo and those guys that like, you know, ‘My work is my enfant,’” he said, “and the state has absolutely no right to do anything with it—kind of a Lockean point of view.” As the world has flattened, copyright laws have converged, lest one country be at a disadvantage by freeing its intellectual products for exploitation by the others. And so the American idea of using copyright primarily as a vehicle, per the constitution, “to promote the Progress of Science and useful Arts,” not to protect authors, has eroded to the point where today we’ve locked up nearly every book published after 1923.

This is disingenuous: the article doesn’t mention that the biggest proponents of extending copyright terms were Americans (e.g., Walt Disney Corp and Jack Valenti; see “Mickey Mouse Protection Act” for more), not Europeans.

2OEH8eoCRo0 10/22/2024|
The tragedy is that Google is tasked with this at all. It would be cool if public libraries could work together on a massive public digital library. This shouldn't be Google's responsibility.
Jtsummers 10/22/2024||
Google wasn't tasked (by a third party) with this, they chose to do it.
ants_everywhere 10/22/2024||
Arguably Google was invented to fund this project.

The books project predates the search engine and the search engine grew out of the project of creating a universal digital library. The PageRank algorithm is one of a class of algorithms used to score citations in books and papers.
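
To make the citation-scoring idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank over a tiny citation graph. It assumes the commonly quoted damping factor of 0.85 and a simple rule for works that cite nothing; the function name and graph are illustrative, and this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes of a citation graph; `links` maps each work to the works it cites."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node keeps a small baseline score regardless of citations.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's score evenly among the works it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A work that cites nothing spreads its score over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Papers A and B both cite C; C cites nothing.
scores = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

In this example C ends up with the highest score because it is the only work that gets cited, and the scores always sum to 1.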

dredmorbius 10/23/2024|||
HathiTrust was ... nearly this.

Until it too was emasculated.

<https://en.wikipedia.org/wiki/HathiTrust>

Otherwise, we have Project Gutenberg (public domain), OpenLibrary (Internet Archive, both PD and copyrighted works), ZLibrary, Library Genesis, and Anna's Archive.

NoMoreNicksLeft 10/22/2024||
All humans everywhere have a responsibility to preserve culture and knowledge to the best of their ability. I think what you meant to say is that none of us can trust Google with this important task.
renewiltord 10/24/2024||
One of the great tragedies of civilization is that we leave things in the hands of those who do them rather than in the hands of those who tell us about our responsibility to do them.
NoMoreNicksLeft 10/24/2024||
I personally do what I can. I've been trying to find old phone books and catalogs at garage sales, scanning them when I can get them. I teach my children that this is a responsibility of theirs.

But if you're offering me even a fraction of Google's budget, I think I might manage to scale things up.

renewiltord 10/24/2024||
Perhaps a second tragedy is that we give money to those who provide us with something. A better world might be where we give Google’s money to people so that they can teach children to buy phone books at garage sales. In this way, civilization may prosper.
Apocryphon 10/24/2024||
A simple search for "phone books" on Google Books yields no actual phone books, so the poster is objectively doing a better job than Google on that front.