
Posted by hessdalenlight 4/11/2025

Show HN: Chonky – a neural approach for text semantic chunking (github.com)
TLDR: I’ve made a transformer model and a wrapper library that segments text into meaningful semantic chunks.

Current text splitting approaches rely on heuristics (although one can use a neural embedder to group semantically related sentences).

I propose a fully neural approach to semantic chunking.

I took the base DistilBERT model and trained it on BookCorpus to split concatenated text paragraphs back into the original paragraphs. Basically it’s a token classification task. Model fine-tuning took a day and a half on 2x 1080 Ti.
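To make the token classification framing concrete, here is a minimal sketch of how training examples of this kind could be built. The labeling scheme (marking the last token of each paragraph as a boundary), the whitespace tokenization, and the helper names `make_example` and `split_by_labels` are assumptions for illustration, not necessarily what Chonky does.

```python
def make_example(paragraphs):
    """Concatenate paragraphs into one token stream with boundary labels.

    Assumption: whitespace splitting stands in for a real subword tokenizer,
    and the last token of every paragraph gets label 1 (boundary), all
    other tokens get label 0.
    """
    tokens, labels = [], []
    for para in paragraphs:
        words = para.lower().split()  # lowercased, like an uncased model's input
        if not words:
            continue  # skip empty paragraphs
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def split_by_labels(tokens, labels):
    """Cut the token stream after every token labeled 1."""
    chunks, current = [], []
    for token, label in zip(tokens, labels):
        current.append(token)
        if label == 1:
            chunks.append(" ".join(current))
            current = []
    if current:  # trailing tokens without a closing boundary
        chunks.append(" ".join(current))
    return chunks


paragraphs = ["The cat sat on the mat.", "It was a sunny day outside."]
tokens, labels = make_example(paragraphs)
print(list(zip(tokens, labels)))
print(split_by_labels(tokens, labels))
```

At inference time the labels would come from the model's per-token predictions instead of from known paragraph breaks, and `split_by_labels` shows how those predictions turn back into chunks.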

The library could be used as a text splitter module in a RAG system or for splitting transcripts, for example.

The usage pattern that I see is the following: strip all the markup tags to produce pure text and feed this text into the model.
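A rough sketch of that pattern using only the Python standard library for the markup-stripping step. The `splitter` argument is a hypothetical stand-in for any callable that maps plain text to chunks; it is not the library's actual API, so check the repository README for the real interface.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_markup(html):
    """Return the plain text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def chunk_document(html, splitter):
    """Strip markup, then pass the plain text to `splitter`.

    `splitter` is a hypothetical callable: plain text in, iterable of chunks out.
    """
    return list(splitter(strip_markup(html)))


html = (
    "<html><style>p {color: red}</style><body>"
    "<h1>Title</h1><p>First <b>bold</b> text.</p></body></html>"
)
print(strip_markup(html))
```

Note that collapsing whitespace discards the original paragraph breaks, which is the point here: the model is supposed to recover the boundaries from the text itself.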

The problem is that, although in theory this should improve overall RAG pipeline performance, I didn’t manage to measure it properly. Other limitations: the model only supports English for now, and the output text is lowercased.

Please give it a try. I'd appreciate any feedback.

The Python library: https://github.com/mirth/chonky

The transformer model: https://huggingface.co/mirth/chonky_distilbert_base_uncased_...

169 points | 35 comments | page 2
sushidev 4/13/2025|
So I could use this to index e.g. a fiction book in a vector DB, right? And the semantic chunking will possibly provide better results at query time for RAG, did I understand that correctly?
hessdalenlight 4/13/2025|
Yes and yes, you are correct!
rybosome 4/13/2025||
Interesting idea - is the chunking deterministic? It would have to be to be useful, but I’m wondering how that interacts with the neural net.
fareesh 4/13/2025||
The non-English space in these fields is so far behind in terms of accuracy and reliability, it's crazy.
acstorage 4/13/2025||
You mention that the fine-tuning took half a day. Have you ever thought about reducing that time?
hessdalenlight 4/13/2025|
Actually a day and a half :). I'm all for it, but unfortunately I have pretty old hardware.
cmenge 4/13/2025||
> I took the base distilbert model

I read "the base Dilbert model", all sorts of weird ideas going through my head, concluded I should re-read and made the same mistake again XD

Guess I better take a break and go for a walk now...

jaggirs 4/13/2025||
Did you evaluate it on a RAG benchmark?
hessdalenlight 4/13/2025|
No, I haven't yet. I would be grateful if you could recommend such a benchmark.
jaggirs 4/13/2025||
Not sure, I haven't done so myself, but I think you could maybe use MTEB. Or otherwise an LLM benchmark on large inputs (and compare your chunking with naive chunking).
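For the comparison, a naive baseline could be as simple as fixed-size word windows with some overlap. This is only a sketch: the function name `naive_chunks` and the default sizes are made up for illustration, and the benchmark itself is not shown.

```python
def naive_chunks(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap` words
    with the previous window. Ignores sentence and paragraph boundaries."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(12))
for chunk in naive_chunks(text, size=5, overlap=2):
    print(chunk)
```

Running the same retrieval or question-answering evaluation once with these chunks and once with the model's chunks would give a like-for-like comparison.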
rekovacs 4/13/2025||
Really amazing and impressive work!
olavfosse 4/13/2025|
Does it work on other languages?
hessdalenlight 4/13/2025|
[dead]