Posted by lispybanana 4 days ago
If you are a student or researcher at one of the participating HathiTrust institutions, you can also get access to scans of books that are still in copyright.
The one advantage Google Books still has is that its search tools are much faster and sometimes better, so it can be useful to search for phrases or topics on Google Books and then jump over to HathiTrust to read specific books surfaced by the search.
As in, the business of running a nuclear energy plant?
Participation is limited to tertiary academic institutions, and possibly only four-year (rather than two-year) ones. This excludes local (city/county) libraries, as well as primary/secondary (grammar / middle / high school in the US) libraries.
Even public-domain records cannot be downloaded as a whole; they can only be saved one page at a time as PDFs. I'm pretty sure that those interested in more useful archiving have created, or will create, automated tools to do so, but HathiTrust remains the most notable point of access for such works, and the additional generation of conversion and republication further degrades the quality of original-publication formats. (It's less of a problem for works regenerated from OCR'd or manually-converted documents, but those of course lose all the characteristics of the original publication.)
And of course, many materials still under copyright are not accessible to the general public at all, no matter how obscure. I'd run into a case of this some months back trying to get a date attribution of an Alan Watts lecture which had been posted to HN:
<https://news.ycombinator.com/item?id=41231047> (thread).
And my request still stands. Anyone with an academic affiliation who can check <https://catalog.hathitrust.org/Record/000678503> and see how it relates to this post (<https://news.ycombinator.com/item?id=41230841>) would have my gratitude.
But I think this journal does not contain the date.
Searching for "his religion" (with quotation marks) in volume 6 via HathiTrust shows a single match on page 11. Searching for the same text via the Google Books link from your other post shows the following entry among a list of what I assume are lectures:
> 919 Jesus: His Religion, Or The Religion About Him ... 10.00 7.00
The first number is some kind of index or serial number. The second number is the cassette cost and the third is the reel cost. You can see the column headings by searching for the number 900.
Searching for "Watts" in the same book via Google Books shows the title of page 11, "New Alan Watts Lectures".
Searching for the year numbers, the matches on that page seem to be for some text about the indexing of works in MMRI-1970, 1971 and 1972, rather than a publication year.
And for confirming that HT is even more useless than I'd thought.
Again: Fuck copyright.
Watts, Alan. Myth and Religion : the Edited Transcripts. First edition. Boston: Charles E. Tuttle Co., 1996.
It contains "Jesus - His Religion, Or the Religion About Him", which appears to be a very slightly different title from the work that you are searching for.
The text includes the transcripts, but doesn't include the original date(s) of delivery / publication. And it's published a quarter century after the initial records of the lecture.
As noted, I'd emailed the Alan Watts institute but have received no reply.
I was at Google in 2009 on a team adjacent to Dan Clancy when he was most excited about the Authors’ Guild negotiations to publish orphan works and create a portal to pay copyright holders who signed up. I recall that one opponent he was frustrated with was Brewster Kahle of the Internet Archive, who filed a jealous amicus brief (https://docs.justia.com/cases/federal/district-courts/new-yo...) complaining that the Authors’ Guild settlement would not grant him access to publishing orphan works too. In my opinion Kahle was wrong; the existence of one orphan-works clearinghouse would have encouraged Congress to grant more libraries access, instead of doing nothing, which is what actually happened in the 15 years since then. Instead of one company selling out-of-print but in-copyright books, or multiple organizations doing so, no one is allowed to sell them today.
Since then, of course, Brewster Kahle launched an e-library of copyrighted books without legal authorization anyway which will probably be the death of the current organization that runs the Internet Archive. Tragic all around.
For me, I became concerned when they fibbed about why the Internet Archive Credit Union was liquidated. IA alleged it was shut down due to onerous regulations, but the government said IA actually never lived up to their goal of allowing local, low-income folk to sign up for their service. https://ncua.gov/newsroom/press-release/2016/internet-archiv...
For instance, they can spend a lot of effort digitizing an archive they got from a business active from 1890 to 1970 - and then put it all in a single collection, which the public won't get access to until 2070. There's no reason to think the business handled sensitive personal information, but it's too much work to check, so they assume it did. They could classify individual documents according to whether they were actually from before 1920, but that's too much work too.
All of the actual librarians and archivists I know hate this situation - it’s not a job you take if you don’t want people to access things – but that tends to translate into requests for copyright exemptions.
A really big one is orphan works: they have things like digitized music which can’t even be linked to a known copyright holder, because it’s unclear who owns it after decades of contract shuffles and acquisitions. You could potentially solve that by changing copyright law to require periodic payments to maintain protected status, so someone at, say, Sony would have to cut a check every year to say that they still want to protect some obscure old blues track from 1952 which they don’t even offer for sale any more. I especially liked the proposals linking that to availability for mainstream sale: say there’s no charge for anything which is normally available on iTunes, Play, Amazon, etc., but you need to pay a fee for works which aren’t available.
> a jealous amicus brief that the Authors’ Guild settlement would not grant him access to publishing orphan works too
That's not a fair overview of the amicus brief; it makes good points about the process of notifying orphan-works rights holders and about the risk of a monopolistic position. I do agree with you on this part, though:
> the existence of one orphan works clearinghouse would have encouraged Congress to grant more libraries access instead of doing nothing
Edit: I also agree with you that the way the IA subsequently created its e-library was not ideal.
What I meant by “jealous” is that the Internet Archive’s interest was not to improve author notification or to protect foreign authors; it was to provide a competing service under similar or better terms than Google was able to negotiate without spending the time and money that Google did litigating. Kahle wanted what was in Google’s settlement.
And what I meant by “Kahle was wrong” is not that every argument that his lawyers thought up was false; I think the agreement was later amended to fix some issues. My point is that Kahle’s theory of change was wrong. He thought that when the settlement was rejected, then Google would push Congress to create an orphan works law which the Internet Archive could use to publish old books too. As he wrote in his op-ed, “We need to focus on legislation to address works that are caught in copyright limbo. … We are very close to having universal access to all knowledge. Let's not stumble now.” https://www.washingtonpost.com/wp-dyn/content/article/2009/0... As it turns out, the rejection of the class action settlement did not cause Congress to create an orphan works law. In retrospect, we would have been more likely to get an orphan works law if Google had been allowed to set up a proof of the concept, making the monopoly on orphan works temporary.
https://www.pewresearch.org/internet/2016/09/09/libraries-20... https://www.ala.org/news/2019/12/new-ala-report-gen-z-millen...
Sounds legit
Maybe. I think that is a pretty optimistic view of Congress and our political process. I would argue that having a powerful, rich company with a monopoly to lose would have made passing such a law less likely, not more.
I do think we would have been better off with a Google monopoly on unpublished unclaimed books than with the lack of access we have today.
The article says:
> You’d get in a lot of trouble, they said, but all you’d have to do, more or less, is write a single database query. You’d flip some access control bits from off to on. It might take a few minutes for the command to propagate.
If it's so easy, I'm surprised nobody has done it and accepted the consequences. It seems like one of the largest single positive impacts any person could make on the world. Once it's released, it'll never go back in the box. A modern Pandora.
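Taken literally, the change the article describes is tiny. Here's a minimal sketch of what that "single database query" might look like, with an entirely invented schema — the article names no actual tables or columns, so everything below is hypothetical:

```python
import sqlite3

# Hypothetical schema: table and column names are invented for
# illustration; the article gives no real details.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, public_access INTEGER)"
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Some orphan work", 0), (2, "Another scanned book", 0)],
)

# The "single database query": flip the access-control bit on everything.
conn.execute("UPDATE books SET public_access = 1")

(n_public,) = conn.execute(
    "SELECT COUNT(*) FROM books WHERE public_access = 1"
).fetchone()
print(n_public)  # 2
```

The point of the article's framing, presumably, is that the barrier is entirely legal and institutional, not technical.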
And I agree, someone should do that, because keeping books locked up that the rightsholders can’t be bothered to even attempt to make money from is asinine. Too bad our government doesn’t agree with me, so we’ll just have to wait for it to accidentally walk out the door.
The class as I understand it was the copyright holders of every book in a library, and if approved by the court they can all be legally said to have agreed to the settlement’s terms if they didn’t opt out. Now as with anything legal the whole thing depends on someone’s interpretation of whether it’s ok, but this was a plausible reading of Rule 23 of the Federal Rules of Civil Procedure.
One could file a class action suit on behalf of everyone who ordered a strawberry shake at a McDonald’s in 2023, without ever knowing who or exactly how many they are, and if a judge certified it as a class action and McDonalds cut a deal with those representing the class, the terms would bind them all (except those who explicitly excluded themselves).
It will have consequences far beyond the immediate lawsuit too.
The very concept has basically been iced for a generation, and the net is only getting more locked down, not less.
Is that legal? Technically yes, in my country. Is it ethical? Debatable, depending on who you’re asking. But for me personally, I have found it to be getting substantially easier to find high quality copies of copyrighted anything in the past 3-5 years compared to 10-15 years ago, so I don’t necessarily agree with the blanket statement that “the net is only getting more locked down.”
[1] I like to use the library as much as possible, if for nothing else than to increase usage numbers and marginally decrease the likelihood of funding cuts.
This is basically LibGen / Anna’s Archive. A bit clunky around the download process (maybe things get better if you get a paid subscription though!), but overall it works pretty well.
Not at all. You visit LibGen, search for your book, find it (it's usually available), click one of the two available links for it, click to download, and have it download. Done.
It couldn't be easier.
Guess where the first backup copy of the Internet Archive is located.
That ranges from personal book collections, numbering from one to many thousands of volumes in private hands, to private institutional libraries (the Mechanics' Institute in San Francisco is one that comes to mind; many private universities and grammar schools have their own, as do numerous corporations, some of which are catalogued by WorldCat).
Preservation of Western culture, notably the Greek and Roman canons, as well as much literature and knowledge of the Jewish, Byzantine, and Islamic worlds, occurred through religious institutions. Though in some regards those were the governments of the time. Indian, Chinese, and other further East Asian collections were preserved through multiple means.
Book digitisation at US academic institutions (the University of Michigan being a major contributor to both Google Books and HathiTrust) has had its own extremely combative relationship with commercial publishers, as has the US Library of Congress, which registers US copyrights in the first place.
Avoid slurs, it's an HN guideline: <https://news.ycombinator.com/newsguidelines.html>.
I've used it to track down when wording on a site (for something relevant to my job) changed, for example.
In our collection were Thomas Edison's first motion pictures, wire spool recordings from reporters at D-Day, and LPs of some of the greatest musicians of all time. And that was just our Division. Others - like American Heritage - had photos from the US Civil War and more.
Anyway, while the Rights information is one big, ugly, tangled web, the other side is the hardware to read the formats. Much of the media is fragile and/or dangerous to use, so you have to be exceptionally careful. Then you have to document all the settings you used, because imagine that three months from now you learn some filter you used was wrong or the hardware was misconfigured: you need to go back and understand what was affected, and how.
Cool space. I wish I'd worked there longer.
If you have an LP or wire spool recording, the audio is the key, obvious work. But then you have the album cover, the spool case, and the physical condition of the media. Being able to see an album cover or read a reporter's notes/labeling is almost as important as the audio.
If they don't have that prerogative, they probably should, and Congress should legislate that to be the case.
Larry Page had some cool ideas… can’t imagine Books will ever be resurrected, unfortunately.
He also had a plan (with George Church) to build enormous warehouses holding large-scale biology research infrastructure right next to google data centers. Because most biology research is done at locations that have reached their limit on computational/storage capacity.
Larry had many good ideas but he struggled to get the majority of them off the ground. For example, when Trump was president and invited all the major tech leaders, Larry came with a plan to upgrade the US electrical system with long-range DC.
I feel like some crucial detail is missing here. They already use HVDC for long-distance transmission lines, inside and outside of the US. Texas could benefit from it I suppose, but the US in general already uses it where appropriate AFAIK.
I fail to see how that would be a good idea.
To me, it looks like a magnate who is megalomaniacal enough to believe that they can enter a completely unrelated industry and explain to the experts how things ought to get done.
Man, Nissan just gets no love at all. They did the EV thing before Elon ran X into the ground (the bank, not the website formerly known as Twitter): https://www.motortrend.com/features/nissan-leaf-ev-history-p...
If it's well known but no one has done it, maybe there are reasons for that? Building long-distance lines is a very capital-intensive decision, and if the cost-benefit analysis were better, projects would get done. Looking at DOE's communication [1], we can see that the cost-benefit analysis doesn't look good right now for a big network, but also that HVDC projects that actually make economic sense have been built for decades.
Sure, sometimes innovative people make a big change, but what's the innovation there?
[1]: https://www.energy.gov/oe/articles/connecting-country-hvdc
Not sure where you see me claiming the opposite?
> And capital intensive, revolutionary projects are ripe for disruption from people who have already pulled off similar feats
If Larry Page is looking to waste a few billions in cables, he's absolutely welcome to do it. One HVDC connection between France and the UK is private, for instance [1]. If he believes there's money to be made similarly in the US, he should go for it.
That's not exactly the same offer as "Larry Page has suggested the government should invest in expensive infrastructure for an industry he's never worked in".
Hmm, maybe the part where you said "but no one has done it"?
What is the point of this level of degeneracy in conversation?
You can do something similar to this already by mapping which books are cited in Wikipedia articles. That is, if you know how to do such a thing, because I don't.
https://aarontay.medium.com/3-new-tools-to-try-for-literatur...
Specific to Wikipedia:
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia [2024]
https://arxiv.org/abs/2406.19291v1
https://doi.org/10.48550/arXiv.2406.19291
> Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in the cloud-based settings. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
Prior work referenced in above abstract with some team overlap:
Wikipedia citations: A comprehensive data set of citations with identifiers extracted from English Wikipedia [2021]
https://direct.mit.edu/qss/article/2/1/1/97565/Wikipedia-cit...
https://doi.org/10.1162/qss_a_00105
Datasets:
A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2024)
https://zenodo.org/records/10782978
https://doi.org/10.5281/zenodo.10782978
A Comprehensive Dataset of Classified Citations with Identifiers from Multilingual Wikipedia (2024)
https://zenodo.org/records/11210434
https://doi.org/10.5281/zenodo.11210434
Code (MIT License):
https://github.com/albatros13/wikicite
https://github.com/albatros13/wikicite/tree/multilang
Bonus links:
https://www.mediawiki.org/wiki/Alternative_parsers
https://scholarlykitchen.sspnet.org/2022/11/01/guest-post-wi...
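For anyone curious what the extraction step looks like in miniature: here's a rough, stdlib-only sketch that pulls `{{cite book}}` templates out of raw wikitext and collects their ISBNs. The real pipelines linked above use proper template parsers and handle many more citation styles and languages; the function and field handling here are just for illustration.

```python
import re


def extract_book_citations(wikitext):
    """Toy extractor: find {{cite book}} templates in raw wikitext and
    return the ISBNs they declare (hyphens stripped). A production
    pipeline would use a real template parser instead of regexes."""
    cites = re.findall(r"\{\{cite book(.*?)\}\}", wikitext,
                       re.IGNORECASE | re.DOTALL)
    isbns = []
    for fields in cites:
        # Look for an "isbn = ..." parameter inside the template body.
        m = re.search(r"\|\s*isbn\s*=\s*([0-9Xx\-]+)", fields)
        if m:
            isbns.append(m.group(1).replace("-", ""))
    return isbns


sample = "Text {{cite book |title=The Odyssey |isbn=978-0-14-026886-7}} more text"
print(extract_book_citations(sample))  # ['9780140268867']
```

From there, joining the ISBNs against a catalog like OpenLibrary would give you the "which books does Wikipedia cite" mapping the parent comment describes.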
They call it Founder's Copyright. They also use Creative Commons. The goal is to make out-of-print books available at no cost.
Exciting!
Follows link
Link no longer exists, gets O'Reilly front page instead
"Introducing the AI Academy, Help your entire org put GenAI to work"
Thanks O'Reilly.
The new dream of the internet: Some information, that aligns with the values of our advertisers, delivered via an LLM that sometimes makes shit up.
If anything, the internet today is more loaded than ever with cool information and useful stuff, especially as ever-larger bodies of formerly analog content get digitized, often with full open access. If one can get over the myopic navel-gazing and the cultivated fetishism about everything having gone to shit, it's not even hard to find most of that useful information.
The internet, like any complex thing with multiple interests involved in its existence and operation, is just whatever works best for different people in different contexts: commercially, personally, technically, and so forth. It's neither an ideal that one should obsess over nor something to be neatly pigeonholed into a box of how it "should be". Adapt, and use its tools to make whatever parts of it you can fit your personal ideal, instead of endlessly blaming advertisers or people just trying to make a living from one more commercial landscape.
I find it personally difficult to look at the entirety of the internet in 2024 and say that it’s definitely better for society than it was in 2004. I guess now at least we can mostly book appointments on our phones without having to speak with someone in real-time as they read dates and times off of a calendar interface that we can now just use ourselves directly.
Personal blogs, creative efforts and wonderful resources still abound on the internet and can still usually be found quite easily if you put a bit of effort into looking.
Yes, I know: ad blockers, Pi-hole, etc.
https://www.google.com/search?q=site%3Aoreilly.com+inurl%3Ao...
So it seems like it mainly lost the overview page?
https://web.archive.org/web/20240607220047/http://www.oreill...
Definitely perplexing; I can't find a reason to kill what appears to be a simple HTML page, unless they've killed the project entirely.
A third party page still has links to some (possibly all) of the books: https://zapier.com/blog/free-oreilly-press-books/
A very cynical and dark view is that new things/people need that oblivion in order to feel great, so they don't have to be compared with older, greater ones. Rewriting history to suit the current powers-that-be is easier this way.
Or maybe it's just collective stupidity? Or societal immaturity?
(I'm coming from a completely different killed project on a different continent, but the idea is the same.)
I think it's neatly summarized in two words: shareholder growth.
Of course, it's best to preserve past knowledge, but I think the idea that this is part of some kind of conspiracy to keep people buying new stuff is pretty silly. People are always going to want new stuff, as society grows and changes.
While you're most likely right about the cure for cancer, I did want to note that this is kind of how the malaria treatment artemisinin was found. Tu Youyou, who won the Nobel Prize, systematically investigated Traditional Chinese Medicine remedies until she came across one that was effective. That particular remedy was described in a 1,600-year-old text, The Handbook of Prescriptions for Emergency Treatments, written in 340 by Ge Hong.
Note: before anyone thinks that a remedy being described in traditional Chinese texts means that TCM is reliable and a viable alternative across the board: she screened over 2,000 traditional Chinese recipes and made 380 herbal extracts from some 200 herbs, which were tested on mice. So, yes, one of the remedies was successful, but the success rate of TCM was not particularly high :)
Lots of old knowledge is readily available to the public. The complete works of Shakespeare are a good example here, as are Homer's epics The Odyssey and The Iliad. (I don't know if that TCM stuff is or not.) These kinds of things are considered "classics" and are frequently reproduced, now available online in countless places, etc. Obviously, lots of people besides historians think they're important, and so they're copied frequently and made available at large. As the famous rule goes, "99% of everything is crap", so probably all the best stuff from the past is well-preserved, and the rest, not so much. I seriously doubt that something as good as Shakespeare is locked away in someone's private library and virtually unknown to almost anyone.
Of course, there's always exceptions and you never know when some overlooked tidbit of info from the distant past might be really useful, as you showed here, so I do think it's important to preserve and enable easy access to as many old works as possible.
Interesting take on what "knowledge" means and what makes knowledge valuable.
If I understand "knowledge" as "information directly relevant to a technical problem", then:
- the knowledge which remains relevant to that problem will stay available to practitioners (i.e. the properties of a Gaussian distribution, from Gauss, 1809)
- the knowledge which is no longer relevant to that problem will probably be lost (how to compute the integral of a Gaussian using a slide rule. Slide rules first developed circa 1620, last used circa 1970)
In other words, yes, your point is profoundly true. Knowledge relevant to a specific task stays available; knowledge that is no longer relevant gets pruned quickly.
My question would be whether we want to use that definition of relevance and that understanding of what drives value. I.e., I'm not asking if you are correct; I've just shown that you are correct. My question is whether the assumptions/values which make this correct are assumptions/values we are comfortable with. In other words, is it wise?
I personally have actually tried to contribute to libgen a particular difficult-to-find-online book by buying it, scanning it, and uploading it. There need to be more people doing this.
(The reality is that publishers would put lazy photocopies up for sale at ten zillion dollars a piece.)
Anyone know an academic specializing in Old English who would like to oversee this reprinting? I have a typeset PDF which only wants proofreading and updating of the index.
Rather than the old model of printing a reasonable print run, selling books as demand allowed, and keeping unsold books in warehouses only paying tax as they were sold, the new laws required paying tax on inventory each year --- so any books not sold in the first year were not as profitable, hence book remaindering, and the current mess.
And here is the rub. You’ll end up with three or four super authors with the rest being ripped off.
Much better for it to revert to the author in that situation IMO.
Regarding unprofitable books, they'll fall out of print anyways because they're unprofitable. Those authors won't be getting ripped off because they won't be making money either way beyond initial commissions and what few sales they get.
> Much better for it to revert to the author in that situation IMO.
The publisher doesn't hold the copyright, the author does, so copyright (the particular right under discussion) can't revert to the author as it never left the author. What the publisher holds is publishing rights per a contract with the author. That could revert back to the author (or be voided or however it's structured), and that would be reasonable but we don't need any laws for it, that would fall under normal contract terms. Whether it's a common thing now or feasible for a particular author (with no clout? maybe not, with billions in sales from prior books? probably) is another matter.
Although, if I were writing the law I would require selling DRM free ebooks for ebooks to count for maintaining the copyright.
Well, it is a use case for this challenge https://www.kaggle.com/competitions/gemini-long-context