Show HN: Chonky – a neural approach for text semantic chunking

Posted by hessdalenlight 4/11/2025


TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base distilbert model and fine-tuned it on the BookCorpus dataset to split concatenated text paragraphs back into the original paragraphs. Basically, it’s a token classification task. Fine-tuning took a day and a half on 2x 1080 Ti GPUs.
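
A minimal sketch of how the training labels for such a task could be built, and how predicted labels map back to chunks. This is an illustration only, not the actual Chonky training code: it assumes plain whitespace tokenization instead of the distilbert WordPiece tokenizer, and the function names `build_example` and `split_by_labels` are hypothetical.

```python
def build_example(paragraphs):
    """Concatenate paragraphs into one token stream and label each token.

    Label 1 marks the last token of a paragraph (a split point), 0 otherwise.
    Assumption: whitespace tokenization stands in for a real subword tokenizer.
    """
    tokens, labels = [], []
    for paragraph in paragraphs:
        words = paragraph.split()
        if not words:
            continue  # skip empty paragraphs
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def split_by_labels(tokens, labels):
    """Cut the token stream after every token labeled 1."""
    chunks, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a closing boundary
        chunks.append(" ".join(current))
    return chunks


paragraphs = ["the cat sat on the mat.", "it was a sunny day."]
tokens, labels = build_example(paragraphs)
recovered = split_by_labels(tokens, labels)
```

At inference time the labels would come from the model's per-token predictions rather than from known paragraph boundaries; the splitting step stays the same.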

The library could be used as a text splitter module in a RAG system or, for example, for splitting transcripts.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
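
The markup-stripping step could look roughly like the sketch below. It uses only Python's standard `html.parser` and assumes the input is HTML; the names `TextExtractor` and `strip_markup` are hypothetical and not part of the Chonky library. The resulting plain text is what would then be passed to the splitter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        # Note: the contents of <script> and <style> are also reported here,
        # so a real pipeline would likely want to filter those out.
        self.parts.append(data)


def strip_markup(html):
    """Return the text content of `html` with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


plain = strip_markup("<h1>Title</h1><p>First <b>bold</b> sentence.</p>")
```

Collapsing whitespace here is a design choice for the sketch: it discards the original line breaks, which is in the spirit of letting the model find the chunk boundaries on its own.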

The problem is that, although in theory this should improve overall RAG pipeline performance, I didn’t manage to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

169 points | 35 comments

sushidev 4/13/2025|

So I could use this to index, e.g., a fiction book in a vector DB, right? And the semantic chunking will possibly provide better results at query time for RAG. Did I understand that correctly?

hessdalenlight 4/13/2025|

Yes and yes, you are correct!

rybosome 4/13/2025||

Interesting idea - is the chunking deterministic? It would have to be to be useful, but I’m wondering how that interacts with the neural net.

fareesh 4/13/2025||

The non-English space in these fields is so far behind in terms of accuracy and reliability, it's crazy.

acstorage 4/13/2025||

You mention that the fine tuning time took half a day, have you ever thought to reduce that time?

hessdalenlight 4/13/2025|

Actually, a day and a half :). I'm all for it, but unfortunately I have pretty old hardware.

cmenge 4/13/2025||

> I took the base distilbert model

I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD

Guess I better take a break and go for a walk now...

jaggirs 4/13/2025||

Did you evaluate it on a RAG benchmark?

hessdalenlight 4/13/2025|

No, I haven't yet. I would be grateful if you could recommend such a benchmark.

jaggirs 4/13/2025||

Not sure, I haven't done so myself, but I think you could maybe use MTEB. Or otherwise an LLM benchmark on large inputs (and compare your chunking with naive chunking).

rekovacs 4/13/2025||

Really amazing and impressive work!

olavfosse 4/13/2025|

Does it work on other languages?

hessdalenlight 4/13/2025|

[dead]