Posted by pmaze 22 hours ago
I built a system for Claude Code to browse 100 non-fiction books and find interesting connections between them.
I started out with a pipeline in stages, chaining together LLM calls to build up a context of the library. I was mainly getting back the insight that I was baking into the prompts, and the results weren't particularly surprising.
On a whim, I gave CC access to my debug CLI tools and found that it wiped the floor with that approach. It gave actually interesting results and required very little orchestration in comparison.
One of my favourite trails of excerpts goes from Jobs’ reality distortion field to Theranos’ fake demos, to Thiel on startup cults, to Hoffer on mass movement charlatans (https://trails.pieterma.es/trail/useful-lies/). A fun tendency is that Claude kept getting distracted by topics of secrecy, conspiracy, and hidden systems - as if the task itself summoned a Foucault’s Pendulum mindset.
Details:
* The books are picked from HN’s favourites (which I collected before: https://hnbooks.pieterma.es/).
* Chunks are indexed by topic using Gemini Flash Lite. Indexing the whole library cost about £10.
* Topics are organised into a tree structure using recursive Leiden partitioning and LLM labels. This gives a high-level sense of the themes.
* There are several ways to browse. The most useful are embedding similarity, topic tree siblings, and topics co-occurring within a chunk window (sketched below).
* Everything is stored in SQLite and manipulated using a set of CLI tools.
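To give a flavour of the co-occurrence browsing, here is a minimal sketch with made-up table and column names - an illustration of the idea, not the actual schema or CLI:

    # Illustrative only: hypothetical schema for "topics co-occurring
    # within a chunk window"; names are assumptions, not the real ones.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE chunks (id INTEGER PRIMARY KEY, book TEXT, position INTEGER, text TEXT);
    CREATE TABLE chunk_topics (chunk_id INTEGER, topic TEXT);
    """)

    def cooccurring_topics(topic, window=2):
        """Topics tagged on chunks within `window` positions of chunks tagged `topic`."""
        return con.execute("""
            SELECT other.topic, COUNT(*) AS n
            FROM chunk_topics AS seed
            JOIN chunks AS c1 ON c1.id = seed.chunk_id
            JOIN chunks AS c2 ON c2.book = c1.book
                             AND ABS(c2.position - c1.position) <= ?
            JOIN chunk_topics AS other ON other.chunk_id = c2.id
            WHERE seed.topic = ? AND other.topic != ?
            GROUP BY other.topic
            ORDER BY n DESC
        """, (window, topic, topic)).fetchall()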
I wrote more about the process here: https://pieterma.es/syntopic-reading-claude/
I’m curious if this way of reading resonates with anyone else - LLM-mediated or not.
The book was really big and it got stuck in "indexing". (Possibly broke the indexer?) But thanks to the CLI integration, Claude was able to just iteratively grep all the info it needed out of the book. I found this very amusing.
Anthropic's article on retrieval emphasizes the importance of keyword search, since it often outperforms embeddings depending on the query. Their own approach is a hybrid of the two.
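As a toy sketch of what a hybrid can look like (not Anthropic's exact recipe), one common trick is reciprocal rank fusion over a keyword ranking and an embedding ranking; the two hit lists below are stand-ins:

    # Combine a keyword ranking and an embedding ranking with reciprocal
    # rank fusion (RRF). Both hit lists here are made up.
    from collections import defaultdict

    def rrf(rankings, k=60):
        """rankings: lists of doc ids, best first; returns the fused order."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    keyword_hits = ["doc7", "doc2", "doc9"]    # e.g. BM25 / SQLite FTS5
    embedding_hits = ["doc2", "doc4", "doc7"]  # e.g. cosine similarity
    print(rrf([keyword_hits, embedding_hits])) # doc2 and doc7 rise to the top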
Edit/update: if you are looking for the phantom thread between texts, believe me that an LLM cannot achieve it. I have interrogated the most advanced models for hours, and they cannot bring the task to the kind of satisfactory end that even a smoked-out, half-asleep college freshman could. The models don't have sufficient capacity...yet.
¹ Oh, that's just LLMs in general? Cool!
Take, for example, the OODA loop. How are the connections made here of any use? It seems like the words are semantically related but the concepts are not. And even if they are, so what?
I am missing the so what.
Now imagine a human had read all these books. They would have come up with something new, I’m pretty sure about that.
… realize that it’s nonsense and the LLM is not smart enough to figure out much without a reranker and a ton of technology that tells it what to do with the data.
You can run any vector query against a RAG and you are guaranteed a response, even when the chunks it returns are completely unrelated.
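A toy illustration of that failure mode, with made-up vectors - top-k retrieval returns its k "nearest" chunks even when nothing is actually near:

    # Cosine top-k always hands back k results, relevant or not.
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    chunks = {
        "recipe for soup":       [0.9, 0.1, 0.0],
        "bluetooth stack patch": [0.1, 0.9, 0.1],
        "tax form instructions": [0.0, 0.2, 0.9],
    }
    query = [0.0, 0.0, 1.0]  # pretend this encodes "mass movement charlatans"

    ranked = sorted(chunks, key=lambda c: cosine(query, chunks[c]), reverse=True)
    print(ranked[:2])  # two "answers" come back no matter how weak the match is

Without a similarity threshold or a reranker, the response alone doesn't tell you the difference.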
I'm seeing "Thanos committing fraud" in a section about "useful lies". Given that the founder is currently in prison, it seems odd to consider the lie useful instead of harmful. It kinda seems like the AI found a bunch of loosely related things and mislabeled the group.
If you've read these books, I'm not seeing what value this adds.
Another model can be post-rationalization. People just do stuff instinctively, then rationalize why they did it after the fact. "She lied without thinking about it, then constructed a justification for why the lie was rational to begin with".
At the extremes, some people will never lie, even to their detriment. Usually they seem to attribute this to virtue. Others will always lie. They seem to feel not lying is surrendering control. Most people are somewhere in between.
Theranos is the fraud mentioned in the piece.
What the LLM eats doesn't make you shit.
I just spent time getting it all running on Docker Compose and moved my web UI from Express.js to Flask. I want to get the code cleaned up and open-source it at some point.
Have you read the Syntopicon by Mortimer J Adler?
It's right up your alley on this one. It's essentially this, but in 1965, by hand, with Isaac Asimov and William F Buckley Jr, among others.
Where did you get the books from? I've been trying to do something like this myself, but haven't been able to get good access to books under copyright.
Yeah, thinking a bit more here, you've created a Syntopicon. I've always wanted to make a modern one too! You can recreate the old-school late-night Wikipedia reading session with that trails idea of yours. Brilliant!
Really though, how can I help you make this bigger?
https://habr.com/en/articles/456476/
https://android-review.googlesource.com/c/platform/system/bt...
This is the part that always stuck with me:
I have often felt that programming is an art form, whose real value can only be appreciated by another versed in the same arcane art; there are lovely gems and brilliant coups hidden from human view and admiration, sometimes forever, by the very nature of the process. You can learn a lot about an individual just by reading through his code, even in hexadecimal. Mel was, I think, an unsung genius.
If it wasn't expression, everyone would get the same result. But no one else at Royal McBee did things the way Mel Kaye did things.
Kaye had a strong artistic vision for how things should be done; he didn't want to use the ergonomic features of the RPC-4000 because they didn't align with his vision. I think he found the idea of rigging the blackjack program offensive in part for the same reason.
Speaking for myself, I have always found the story and "pessimal" instructions beautiful. It's my favorite piece of folklore of all time. Kaye and Nather are both artists to me.
Tangentially, Kaye is standing on the far right in this photo.
https://zappa.brainiac.com/MelKaye.png
And here is Nather.
https://en.wikipedia.org/wiki/Ed_Nather#/media/File:Ednather...
The strongest connection I found that the LLM didn’t was between Jobs and The Elephant in the Brain.
The Elephant in the Brain: The less we know of our own ugly motives, the easier it is to hide them from others. Self-deception is therefore strategic, a ploy our brains use to look good while behaving badly.
Jobs: “He can deceive himself,” said Bill Atkinson. “It allowed him to con people into believing his vision, because he has personally embraced and internalized it.”
In "Father wound" the words "abandoned at birth" are connected to "did not". Which makes it look like those visual connections are just a stylistic choice and don't carry any meaning at all.
Anyway, it introduced me to the idea of using computational methods in the humanities, including literature. I found it really interesting at the time!
One of the terms it introduced me to is "distant reading", whose name mirrors that of a technique you may have studied in your gen eds if you went to university ("close reading"). The idea is that rather than zooming in on some tiny piece of text to examine very subtle or nuanced meanings, you zoom out to hundreds or thousands of texts, using computers to search them for insights that only emerge from large bodies of work as wholes. The book argued that there are likely some questions that it is only feasible to ask this way.
An old friend of mine used techniques like this for her dissertation in rhetoric, learning enough Python along the way to write the code needed for the analyses she wanted to do. I thought it was pretty cool!
I imagine LLMs are probably positioned now to push distant reading forward in a number of ways: enabling new techniques, allowing old techniques to be used without writing code, and helping novices get started with writing some code. (A lot of the maintainability issues that come with LLM code generation happily don't apply to research projects like this.)
Anyway, if you're interested in other computational techniques you can use to enrich this kind of reading, you might enjoy looking into "distant reading": https://en.wikipedia.org/wiki/Distant_reading
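To make the idea concrete with a toy example (mine, not from the book): the distant-reading move is to ask a question of a whole shelf at once, for instance counting a term across every plain-text file in a directory instead of closely reading any single one.

    # Toy distant-reading sketch: count how often a term appears in each
    # text in a corpus directory. The "books/" path and term are placeholders.
    from pathlib import Path
    from collections import Counter

    def term_counts(corpus_dir, term):
        counts = Counter()
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text(errors="ignore").lower()
            counts[path.stem] = text.count(term.lower())
        return counts

    # e.g. term_counts("books/", "secrecy").most_common(10)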
LLMs are great at finding media by vague descriptions. ;)
The book is almost certainly by *Franco Moretti*, who coined the term "distant reading." Given the timeframe ("maybe a decade ago") and the description, it's most likely one of these two:
1. *"Distant Reading"* (2013) — A collection of Moretti's essays that directly takes the concept as its title. This would fit well with "about a decade ago."
2. *"Graphs, Maps, Trees: Abstract Models for Literary History"* (2005) — His earlier and very influential work that laid out the quantitative, computational approach to literary analysis, even if it didn't use "distant reading" as prominently in the title.
Moretti, who founded the Stanford Literary Lab, was the major proponent of the idea that we should analyze literature not just through careful reading of individual canonical texts, but through large-scale computational analysis of hundreds or thousands of works—looking at trends in genre evolution, plot structures, title lengths, and other patterns that only emerge at scale.
Given that the commenter specifically remembers learning the term "distant reading" from the book, my best guess is *"Distant Reading" (2013)*, though "Graphs, Maps, Trees" is also a strong possibility if their memory of "a decade" is approximate.