Posted by pmaze 22 hours ago

Show HN: I used Claude Code to discover connections between 100 books (trails.pieterma.es)
I think LLMs are overused to summarise and underused to help us read deeper.

I built a system for Claude Code to browse 100 non-fiction books and find interesting connections between them.

I started out with a pipeline in stages, chaining together LLM calls to build up a context of the library. I was mainly getting back the insight that I was baking into the prompts, and the results weren't particularly surprising.

On a whim, I gave CC access to my debug CLI tools and found that it wiped the floor with that approach. It gave actually interesting results and required very little orchestration in comparison.

One of my favourite trails of excerpts goes from Jobs’ reality distortion field to Theranos’ fake demos, to Thiel on startup cults, to Hoffer on mass movement charlatans (https://trails.pieterma.es/trail/useful-lies/). A fun tendency is that Claude kept getting distracted by topics of secrecy, conspiracy, and hidden systems - as if the task itself summoned a Foucault’s Pendulum mindset.

Details:

* The books are picked from HN’s favourites (which I collected before: https://hnbooks.pieterma.es/).

* Chunks are indexed by topic using Gemini Flash Lite. The whole library cost about £10.

* Topics are organised into a tree structure using recursive Leiden partitioning and LLM labels. This gives a high-level sense of the themes (a sketch of the recursive step follows this list).

* There are several ways to browse. The most useful are embedding similarity, topic tree siblings, and topics co-occurring within a chunk window.

* Everything is stored in SQLite and manipulated using a set of CLI tools.
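
For the curious, the recursive Leiden step looks roughly like this. It's a simplified sketch using leidenalg over a topic co-occurrence graph with named vertices; the function name and thresholds are illustrative, not the production code:

    import leidenalg  # operates on igraph.Graph instances

    def build_topic_tree(graph, min_size=20, depth=0, max_depth=3):
        """Partition a topic graph with Leiden, recursing into large communities."""
        partition = leidenalg.find_partition(
            graph, leidenalg.ModularityVertexPartition)
        tree = []
        for community in partition:
            if len(community) > min_size and depth < max_depth:
                # Still too broad: partition the induced subgraph again.
                tree.append(build_topic_tree(
                    graph.subgraph(community), min_size, depth + 1, max_depth))
            else:
                # Leaf: a cluster of related topic names, ready for an LLM label.
                tree.append([graph.vs[v]["name"] for v in community])
        return tree

Each leaf then gets an LLM-generated label, which is where the high-level themes come from.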

I wrote more about the process here: https://pieterma.es/syntopic-reading-claude/

I’m curious if this way of reading resonates for anyone else - LLM-mediated or not.

418 points | 130 comments
andai 7 minutes ago|
I tried using Claude Web to help me understand a textbook recently.

The book was really big and it got stuck in "indexing". (Possibly broke the indexer?) But thanks to the CLI integration, it was able to just iteratively grep all the info it needed out of it. I found this very amusing.

Anthropic's article on retrieval emphasizes the importance of keyword search, since it often outperforms embeddings depending on the query. Their own approach is a hybrid:

https://www.anthropic.com/engineering/contextual-retrieval
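
A minimal sketch of the hybrid idea, assuming rank_bm25 and sentence-transformers (illustrative corpus and model; Anthropic's actual pipeline also prepends LLM-generated context to each chunk before indexing):

    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = ["contextual retrieval prepends context to chunks",
            "embeddings capture semantic similarity",
            "BM25 rewards exact keyword matches"]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    bm25 = BM25Okapi([d.lower().split() for d in docs])

    def hybrid_rank(query, k=60):
        """Fuse BM25 and embedding rankings via reciprocal rank fusion."""
        sparse = np.argsort(-bm25.get_scores(query.lower().split()))
        q_vec = model.encode([query], normalize_embeddings=True)[0]
        dense = np.argsort(-(doc_vecs @ q_vec))
        scores = {}
        for ranking in (sparse, dense):
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)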

drakeballew 15 hours ago||
This is a beautiful piece of work. The actual data or outputs seem to be more or less...trash? Maybe too strong a word. But perhaps you are outsourcing too much critical thought to a statistical model. We are all guilty of it. But some of these are egregious, obviously referential LLM dog. The world has more going on than whatever these models seem to believe.

Edit/update: if you are looking for the phantom thread between texts, believe me that an LLM cannot achieve it. I have interrogated the most advanced models for hours, and they cannot do the task to any sort of satisfactory end that a smoked-out half-asleep college freshman could. The models don't have sufficient capacity...yet.

liqilin1567 12 hours ago||
When I saw that the trail goes through just one word, like "Us/Them" or "fictions", I thought it might be more useful if the trail went through concepts.
tmountain 5 hours ago||
The links drawn between the books are “weaker than weak” (to quote Little Richard). This is akin to just thumbing through a book and saying, “oh, look, they used the word fracture and this other book used the word crumble, let’s assign a theme.” It’s a cool idea, but fails in the execution.
usefulposter 3 hours ago||
Yes. It's flavor-of-the-month Anthropic marketing drivel: tenuous word associations edition¹.

¹ Oh, that's just LLMs in general? Cool!

rtgfhyuj 8 hours ago|||
give it a more thorough look maybe?

https://trails.pieterma.es/trail/collective-brain/ is great

eloisius 8 hours ago|||
It’s an interesting thread for sure, but while reading through this I couldn’t help but think that the point of these ideas is for a person to read and consider deeply. What is the point of having a machine do this “thinking” for us? The thinking is the point.
DrewADesign 4 hours ago||
And that’s the problem with a lot of chatbot usage in the wild: it’s saving you from having to think about things where thinking about them is the point. E.g. hobby writing, homework, and personal correspondence. That’s obviously not the only usage, but it’s certainly the basis for some of the more common use cases, and I find that depressing as hell.
znnajdla 3 hours ago|||
This is a software engineering forum. Most of the engineer types here lack the critical education needed to appreciate this sort of thing. I have a literary education and I’m actually shocked at how good most of these threads are.
PinkMilkshake 1 hour ago|||
I think most engineer types avoid that kind of analysis on purpose.
znnajdla 2 minutes ago||
Programmers tend to lean two ways: math-oriented or literature-oriented. The math types tend to become FAANG engineers. The literature oriented ones tend to start startups and become product managers and indie game devs and Laravel artisans.
only-one1701 1 hour ago|||
That doesn’t speak well towards your literary education, candidly.
znnajdla 7 minutes ago||
We should try posting this on a literary discussion forum and see the responses there. I expect a lot of AI FUD and envy, but that’ll be evidence in this tool’s favor.
baxtr 1 hour ago|||
I checked 2-3 trails and have to agree.

Take for example the OODA loop. How are the connections made here of any use? Seems like the words are semantically related but the concepts are not. And even if they are, so what?

I am missing the so what.

Now imagine a human had read all these books. They would have come up with something new, I’m pretty sure about that.

https://trails.pieterma.es/trail/tempo-gradient/

what-the-grump 13 hours ago||
Build a RAG with a significant amount of text, extract it by keyword: topic, place, date, name, etc.

… realize that it’s nonsense and the LLM is not smart enough to figure out much without a reranker and a ton of technology that tells it what to do with the data.

You can run any vector query against a RAG and you are guaranteed a response, even with chunks that are unrelated in any way.
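
To make that concrete: plain top-k search has no notion of "no good match", so you have to bolt one on yourself. A minimal sketch, assuming unit-normalized embeddings (the cutoff value is arbitrary and needs tuning per model):

    import numpy as np

    def retrieve(query_vec, doc_vecs, k=5, min_sim=0.35):
        """Top-k retrieval that can abstain instead of returning unrelated chunks."""
        sims = doc_vecs @ query_vec            # cosine similarity for unit vectors
        top = np.argsort(-sims)[:k]
        return [(int(i), float(sims[i])) for i in top
                if sims[i] >= min_sim]         # may be empty: "no good match"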

electroglyph 9 hours ago||
unrelated in any way? that's not normal. have you tested the model to make sure you have sane output? unless you're using sentence-transformers (which is pretty foolproof) you have to be careful about how you pool the raw output vectors
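
for reference, the usual careful recipe outside sentence-transformers is masked mean pooling over the last hidden state. a sketch with hugging face transformers (the model name is just an example):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    enc = tokenizer(["some text to embed"], padding=True,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state            # (batch, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1).float()     # zero out padding
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pool
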
8organicbits 16 hours ago||
Can someone break this down for me?

I'm seeing "Thanos committing fraud" in a section about "useful lies". Given that the founder is currently in prison, it seems odd to consider the lie useful instead of harmful. It kinda seems like the AI found a bunch of loosely related things and mislabeled the group.

If you've read these books I'm not seeing what value this adds.

Closi 16 hours ago||
I guess the lies were useful until she got caught?
irishcoffee 15 hours ago|||
Why lie if it isn’t useful? Lying is generally bad, why do a generally bad thing if there isn’t at least a justification, a “use” if you will.
PeterStuer 4 hours ago||
Be careful with the 'utility' model of explaining behavior. It is fairly easy to slide into 'if behavior X is manifested, X must somehow be useful'. You can use this model to explain behavior, but be aware of the circularity trap in the model: "She lied, thus the lie must have had use; even if it is not obvious, we will discover the utility if we dig down enough."

Another model can be post-rationalization. People just do stuff instinctively, then rationalize why they did them after the fact. "She lied without thinking about it, then constructed a reasoning why the lie was rational to begin with".

At the extremes, some people will never lie, even to their detriment. Usually they seem to attribute this to virtue. Others will always lie. They seem to feel not lying is surrendering control. Most people are somewhere in between.

Terretta 13 hours ago||
Thanos is the comic book villain snapping his fingers.

Theranos is the fraud mentioned in the piece.

jennyholzer6 1 hour ago||
What I'm taking from this post and the responses to it is that LLMs are used most enthusiastically by functionally illiterate people.

What the LLM eats doesn't make you shit.

johnwatson11218 16 hours ago||
I did something similar whereby I used pdfplumber to extract text from my pdf book collection. I dumped it into PostgreSQL, then chunked the text into 100 char chunks w/ a 10 char overlap. These chunks were directly embedded into a 384D space using python sentence_transformers. Then I simply averaged all chunks for a doc and wrote that single vector back to PostgreSQL.

Then I used UMAP + HDBSCAN to perform dimensionality reduction and clustering. I ended up with a 2D data set that I can plot with plotly to see my clusters. It is very cool to play with this. It takes hours to import 100 pdf files, but I can take one folder that contains a mix of programming titles, self-help, math, science fiction etc. After the fully automated analysis you can clearly see the different topic clusters.

I just spent time getting it all running on docker compose and moved my web ui from express js to flask. I want to get the code cleaned up and open source it at some point.
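
Roughly, the pipeline looks like this (a condensed sketch of the steps above, not the real code; the model is a common 384D sentence-transformers choice):

    import glob
    import numpy as np
    import pdfplumber, umap, hdbscan
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

    def doc_vector(path, size=100, overlap=10):
        """Extract text, chunk with overlap, embed, and average into one vector."""
        with pdfplumber.open(path) as pdf:
            text = "".join(page.extract_text() or "" for page in pdf.pages)
        chunks = [text[i:i + size] for i in range(0, len(text), size - overlap)]
        return model.encode(chunks).mean(axis=0)

    vectors = np.array([doc_vector(p) for p in glob.glob("books/*.pdf")])
    coords = umap.UMAP(n_components=2).fit_transform(vectors)  # needs dozens of docs
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(coords)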

fittingopposite 35 minutes ago||
Yes. Please publish. Sounds very interesting.
ct0 15 hours ago|||
This sounds amazing, totally interested in seeing the approach and repo.
hellisad 13 hours ago||
Sounds a lot like BERTopic. Great library to use.
Balgair 1 hour ago||
Wow! Amazing!

Have you read the Syntopicon by Mortimer J Adler?

It's right up your alley on this one. It's essentially this, but in 1965, by hand, with Isaac Asimov and William F Buckley Jr, among others.

Where did you get the books from? I've been trying to do something like this myself, but haven't been able to get good access to books under copyright.

Yeah, thinking a bit more here, you've created a Syntopicon. I've always wanted to make a modern one too! You can do the old school late night Wikipedia reading session with the trails idea of yours. Brilliant!

Really though, how can I help you make this bigger?

theturtletalks 18 hours ago||
In a similar vein, I've been using Claude Code to "read" GitHub projects I have no business understanding. I found this one trending on GitHub with everything in Russian and went down the rabbit hole of deep packet inspection[0].

0. https://github.com/ValdikSS/GoodbyeDPI

noname120 15 hours ago||
ValdikSS is the guy behind the SBC XQ patches for Android (that alas were never merged by G). I didn’t expect to see him here with another project!

https://habr.com/en/articles/456476/

https://android-review.googlesource.com/c/platform/system/bt...

dinkleberg 17 hours ago||
That's a cool idea. There are so many interesting projects on GitHub that are incomprehensible without a ton of domain context.
theturtletalks 17 hours ago||
I got the idea from an old post on here called The Story of Mel[0] where the OP talks about the beauty of Mel's intricate machine code on an RPC-4000.

This is the part that always stuck with me:

I have often felt that programming is an art form, whose real value can only be appreciated by another versed in the same arcane art; there are lovely gems and brilliant coups hidden from human view and admiration, sometimes forever, by the very nature of the process. You can learn a lot about an individual just by reading through his code, even in hexadecimal. Mel was, I think, an unsung genius.

0. http://catb.org/esr/jargon/html/story-of-mel.html

coolewurst 5 hours ago||
Thank you for sharing that story. Mel seems virtuosic, but is that really art? Optimizing pattern positioning on a drum for maximum efficiency. Is that expression?
maxbond 3 hours ago|||
> Is that expression?

If it wasn't expression, everyone would get the same result. But no one else at Royal McBee did things the way Mel Kaye did things.

Kaye had a strong artistic vision for how things should be done; he didn't want to use the ergonomic features of the RPC-4000 because they didn't align with his vision. I think he found the idea of rigging the blackjack program offensive in part for the same reason.

Speaking for myself, I have always found the story and "pessimal" instructions beautiful. It's my favorite piece of folklore of all time. Kaye and Nather are both artists to me.

Tangentially, Kaye is standing on the far right in this photo.

https://zappa.brainiac.com/MelKaye.png

And here is Nather.

https://en.wikipedia.org/wiki/Ed_Nather#/media/File:Ednather...

Abstract_Typist 4 hours ago|||
If you consider engineering the art of the possible. (Yes, I know it's a politician's phrase, that's because politics is the art of the plausible ... )
chrisgd 8 hours ago||
Really great work but have to agree with others that I don’t see the threads.

The one I found most connected, which the LLM didn't make, was between Jobs and The Elephant in the Brain:

The Elephant in the Brain: The less we know of our own ugly motives, the easier it is to hide them from others. Self-deception is therefore strategic, a ploy our brains use to look good while behaving badly.

Jobs: “He can deceive himself,” said Bill Atkinson. “It allowed him to con people into believing his vision, because he has personally embraced and internalized it.”

smusamashah 18 hours ago||
I don't understand the lines connecting two pieces of text. In most cases, the connected words have absolutely zero connection with each other.

In "Father wound" the words "abandoned at birth" are connected to "did not". Which makes it look like those visual connections are just a stylistic choice and don't carry any meaning at all.

Oras 17 hours ago||
I had the exact same impression.
hecanjog 10 hours ago||
Yes, they look really good but they're being connected by an LLM.
pxc 17 hours ago|
I read a book maybe a decade ago on the "digital humanities". I wish now I could remember the title and author. :(

Anyway, it introduced me to the idea of using computational methods in the humanities, including literature. I found it really interesting at the time!

One of the terms it introduced me to is "distant reading", whose name mirrors that of a technique you may have studied in your gen eds if you went to university ("close reading"). The idea is that rather than zooming in on some tiny piece of text to examine very subtle or nuanced meanings, you zoom out to hundreds or thousands of texts, using computers to search them for insights that only emerge from large bodies of work as wholes. The book argued that there are likely some questions that it is only feasible to ask this way.

An old friend of mine used techniques like this for her dissertation in rhetoric, learning enough Python along the way to write the code needed for the analyses she wanted to do. I thought it was pretty cool!

I imagine LLMs are probably positioned now to push distant reading forward in a number of ways: enabling new techniques, allowing old techniques to be used without writing code, and helping novices get started with writing some code. (A lot of the maintainability issues that come with LLM code generation happily don't apply to research projects like this.)

Anyway, if you're interested in other computational techniques you can use to enrich this kind of reading, you might enjoy looking into "distant reading": https://en.wikipedia.org/wiki/Distant_reading

plutokras 17 hours ago|
> I wish now I could remember the title and author.

LLMs are great at finding media by vague descriptions. ;)

ako 16 hours ago|||
According to Claude (easy guess from the wikipedia link?):

The book is almost certainly by *Franco Moretti*, who coined the term "distant reading." Given the timeframe ("maybe a decade ago") and the description, it's most likely one of these two:

1. *"Distant Reading"* (2013) — A collection of Moretti's essays that directly takes the concept as its title. This would fit well with "about a decade ago."

2. *"Graphs, Maps, Trees: Abstract Models for Literary History"* (2005) — His earlier and very influential work that laid out the quantitative, computational approach to literary analysis, even if it didn't use "distant reading" as prominently in the title.

Moretti, who founded the Stanford Literary Lab, was the major proponent of the idea that we should analyze literature not just through careful reading of individual canonical texts, but through large-scale computational analysis of hundreds or thousands of works—looking at trends in genre evolution, plot structures, title lengths, and other patterns that only emerge at scale.

Given that the commenter specifically remembers learning the term "distant reading" from the book, my best guess is *"Distant Reading" (2013)*, though "Graphs, Maps, Trees" is also a strong possibility if their memory of "a decade" is approximate.

pxc 13 hours ago|||
After some digging, I think it was likely this one: https://direct.mit.edu/books/book/5346/Digital-Humanities