Posted by pmaze 22 hours ago
I built a system for Claude Code to browse 100 non-fiction books and find interesting connections between them.
I started out with a pipeline in stages, chaining together LLM calls to build up a context of the library. I was mainly getting back the insight that I was baking into the prompts, and the results weren't particularly surprising.
On a whim, I gave CC access to my debug CLI tools and found that it wiped the floor with that approach. It gave actually interesting results and required very little orchestration in comparison.
One of my favourite trails of excerpts goes from Jobs’ reality distortion field to Theranos’ fake demos, to Thiel on startup cults, to Hoffer on mass movement charlatans (https://trails.pieterma.es/trail/useful-lies/). A fun tendency is that Claude kept getting distracted by topics of secrecy, conspiracy, and hidden systems - as if the task itself summoned a Foucault’s Pendulum mindset.
Details:
* The books are picked from HN’s favourites (which I collected before: https://hnbooks.pieterma.es/).
* Chunks are indexed by topic using Gemini Flash Lite. Indexing the whole library cost about £10.
* Topics are organised into a tree structure using recursive Leiden partitioning and LLM labels. This gives a high-level sense of the themes.
* There are several ways to browse. The most useful are embedding similarity, topic tree siblings, and topics co-occurring within a chunk window (sketched below).
* Everything is stored in SQLite and manipulated using a set of CLI tools.
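To give a flavour of the co-occurrence browsing, here is a minimal sketch with made-up table and column names - an illustration of the idea, not the actual schema or CLI:

    # Illustrative only: hypothetical schema for "topics co-occurring
    # within a chunk window"; names are assumptions, not the real ones.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, book TEXT, position INTEGER, text TEXT);
    CREATE TABLE chunk_topics (chunk_id INTEGER, topic TEXT);
    """)

    def cooccurring_topics(topic, window=2):
        """Topics tagged on chunks within `window` positions of chunks tagged `topic`."""
        return con.execute("""
            SELECT other.topic, COUNT(*) AS n
            FROM chunk_topics AS seed
            JOIN chunks AS c1 ON c1.id = seed.chunk_id
            JOIN chunks AS c2 ON c2.book = c1.book
                             AND ABS(c2.position - c1.position) <= ?
            JOIN chunk_topics AS other ON other.chunk_id = c2.id
            WHERE seed.topic = ? AND other.topic != ?
            GROUP BY other.topic
            ORDER BY n DESC
        """, (window, topic, topic)).fetchall()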
I wrote more about the process here: https://pieterma.es/syntopic-reading-claude/
I’m curious if this way of reading resonates with anyone else - LLM-mediated or not.
The book was really big and it got stuck in "indexing". (Possibly broke the indexer?) But thanks to the CLI integration, Claude was able to just iteratively grep all the info it needed out of the book. I found this very amusing.
Anthropic's article on retrieval emphasizes the importance of keyword search, since it often outperforms embeddings depending on the query. Their own approach is a hybrid of the two.
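As a toy sketch of what a hybrid can look like (not Anthropic's exact recipe), one common trick is reciprocal rank fusion over a keyword ranking and an embedding ranking; the two hit lists below are stand-ins:

    # Combine a keyword ranking and an embedding ranking with reciprocal
    # rank fusion (RRF). Both hit lists here are made up.
    from collections import defaultdict

    def rrf(rankings, k=60):
        """rankings: lists of doc ids, best first; returns the fused order."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    keyword_hits = ["doc7", "doc2", "doc9"]    # e.g. BM25 / SQLite FTS5
    embedding_hits = ["doc2", "doc4", "doc7"]  # e.g. cosine similarity
    print(rrf([keyword_hits, embedding_hits])) # doc2 and doc7 rise to the top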
Edit/update: if you are looking for the phantom thread between texts, believe me that an LLM cannot achieve it. I have interrogated the most advanced models for hours, and they cannot bring the task to the kind of satisfactory end that even a smoked-out, half-asleep college freshman could. The models don't have sufficient capacity...yet.
¹ Oh, that's just LLMs in general? Cool!
Take, for example, the OODA loop. How are the connections made here of any use? It seems like the words are semantically related but the concepts are not. And even if they are, so what?
I am missing the so what.
Now imagine a human had read all these books. They would have come up with something new, I’m pretty sure about that.
… realize that it’s nonsense and the LLM is not smart enough to figure out much without a reranker and a ton of technology that tells it what to do with the data.
You can run any vector query against a RAG and you are guaranteed a response, even when the chunks it returns are completely unrelated.
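A toy illustration of that failure mode, with made-up vectors - top-k retrieval returns its k "nearest" chunks even when nothing is actually near:

    # Cosine top-k always hands back k results, relevant or not.
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    chunks = {
        "recipe for soup":       [0.9, 0.1, 0.0],
        "bluetooth stack patch": [0.1, 0.9, 0.1],
        "tax form instructions": [0.0, 0.2, 0.9],
    }
    query = [0.0, 0.0, 1.0]  # pretend this encodes "mass movement charlatans"

    ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
    print(ranked[:2])  # two "answers" come back no matter how weak the match is

Without a similarity threshold or a reranker, the response alone doesn't tell you the difference.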
I'm seeing "Thanos committing fraud" in a section about "useful lies". Given that the founder is currently in prison, it seems odd to consider the lie useful instead of harmful. It kinda seems like the AI found a bunch of loosely related things and mislabeled the group.
If you've read these books, I'm not seeing what value this adds.
Another model can be post-rationalization. People just do stuff instinctively, then rationalize why they did it after the fact. "She lied without thinking about it, then constructed a justification for why the lie was rational to begin with".
At the extremes, some people will never lie, even to their detriment. Usually they seem to attribute this to virtue. Others will always lie. They seem to feel not lying is surrendering control. Most people are somewhere in between.
Theranos is the fraud mentioned in the piece.
What the LLM eats doesn't make you shit.
I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to get the code cleaned up and open-source it at some point.
Have you read the Syntopicon by Mortimer J Adler?
It's right up your alley on this one. It's essentially this, but in 1965, by hand, with Isaac Asimov and William F Buckley Jr, among others.
Where did you get the books from? I've been trying to do something like this myself, but haven't been able to get good access to books under copyright.
Yeah, thinking a bit more here, you've created a Syntopicon. I've always wanted to make a modern one too! You can recreate the old-school late-night Wikipedia reading session with that trails idea of yours. Brilliant!
Really though, how can I help you make this bigger?
https://habr.com/en/articles/456476/
https://android-review.googlesource.com/c/platform/system/bt...
This is the part that always stuck with me:
I have often felt that programming is an art form, whose real value can only be appreciated by another versed in the same arcane art; there are lovely gems and brilliant coups hidden from human view and admiration, sometimes forever, by the very nature of the process. You can learn a lot about an individual just by reading through his code, even in hexadecimal. Mel was, I think, an unsung genius.
If it wasn't expression, everyone would get the same result. But no one else at Royal McBee did things the way Mel Kaye did things.
Kaye had a strong artistic vision for how things should be done; he didn't want to use the ergonomic features of the RPC-4000 because they didn't align with his vision. I think he found the idea of rigging the blackjack program offensive in part for the same reason.
Speaking for myself, I have always found the story and "pessimal" instructions beautiful. It's my favorite piece of folklore of all time. Kaye and Nather are both artists to me.
Tangentially, Kaye is standing on the far right in this photo.
https://zappa.brainiac.com/MelKaye.png
And here is Nather.
https://en.wikipedia.org/wiki/Ed_Nather#/media/File:Ednather...
The strongest connection I found that the LLM didn’t was between Jobs and The Elephant in the Brain.
The Elephant in the Brain: The less we know of our own ugly motives, the easier it is to hide them from others. Self-deception is therefore strategic, a ploy our brains use to look good while behaving badly.
Jobs: “He can deceive himself,” said Bill Atkinson. “It allowed him to con people into believing his vision, because he has personally embraced and internalized it.”
In "Father wound" the words "abandoned at birth" are connected to "did not". Which makes it look like those visual connections are just a stylistic choice and don't carry any meaning at all.
Anyway, it introduced me to the idea of using computational methods in the humanities, including literature. I found it really interesting at the time!
One of the terms it introduced me to is "distant reading", whose name mirrors that of a technique you may have studied in your gen eds if you went to university ("close reading"). The idea is that rather than zooming in on some tiny piece of text to examine very subtle or nuanced meanings, you zoom out to hundreds or thousands of texts, using computers to search them for insights that only emerge from large bodies of work as wholes. The book argued that there are likely some questions that it is only feasible to ask this way.
An old friend of mine used techniques like this for her dissertation in rhetoric, learning enough Python along the way to write the code needed for the analyses she wanted to do. I thought it was pretty cool!
I imagine LLMs are probably positioned now to push distant reading forward in a number of ways: enabling new techniques, allowing old techniques to be used without writing code, and helping novices get started with writing some code. (A lot of the maintainability issues that come with LLM code generation happily don't apply to research projects like this.)
Anyway, if you're interested in other computational techniques you can use to enrich this kind of reading, you might enjoy looking into "distant reading": https://en.wikipedia.org/wiki/Distant_reading
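To make the idea concrete with a toy example (mine, not from the book): the distant-reading move is to ask a question of a whole shelf at once, for instance counting a term across every plain-text file in a directory instead of closely reading any single one.

    # Toy distant-reading sketch: count how often a term appears in each
    # text in a corpus directory. The "books/" path and term are placeholders.
    from pathlib import Path
    from collections import Counter

    def term_counts(corpus_dir, term):
        counts = Counter()
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text(errors="ignore").lower()
            counts[path.stem] = text.count(term.lower())
        return counts

    # e.g. term_counts("books/", "secrecy").most_common(10)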
LLMs are great at finding media by vague descriptions. ;)
The book is almost certainly by *Franco Moretti*, who coined the term "distant reading." Given the timeframe ("maybe a decade ago") and the description, it's most likely one of these two:
1. *"Distant Reading"* (2013) — A collection of Moretti's essays that directly takes the concept as its title. This would fit well with "about a decade ago."
2. *"Graphs, Maps, Trees: Abstract Models for Literary History"* (2005) — His earlier and very influential work that laid out the quantitative, computational approach to literary analysis, even if it didn't use "distant reading" as prominently in the title.
Moretti, who founded the Stanford Literary Lab, was the major proponent of the idea that we should analyze literature not just through careful reading of individual canonical texts, but through large-scale computational analysis of hundreds or thousands of works—looking at trends in genre evolution, plot structures, title lengths, and other patterns that only emerge at scale.
Given that the commenter specifically remembers learning the term "distant reading" from the book, my best guess is *"Distant Reading" (2013)*, though "Graphs, Maps, Trees" is also a strong possibility if their memory of "a decade" is approximate.