Posted by tmaly 1/14/2026
Ask HN: How are you doing RAG locally?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
It all moves so fast that I wouldn't be surprised if everything I built is now hopelessly outdated, and it was probably only two months ago.
To answer the question more directly, I've spent the last couple of years with a few different quantized models, mostly running on llama.cpp or ollama depending on the model. The results are way slower than the paid per-token APIs, but they are completely free of external influence and cost.
However, the models I've tested generally turn out to be pretty dumb at the quantization level I have to run to keep them reasonably fast, and their code generation is a mess I'd rather not deal with.
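On the RAG side of the question: a minimal local setup can be nothing more than brute-force cosine similarity over embeddings from the local Ollama server, no vector database at all. Here's a rough sketch of that idea; the model names and sample documents are placeholders, and it assumes the stock /api/embeddings and /api/generate endpoints on Ollama's default port.

    # Minimal local RAG sketch against an Ollama server on the default port.
    # Model names below are placeholders -- swap in whatever you have pulled locally.
    import requests
    import numpy as np

    OLLAMA = "http://localhost:11434"
    EMBED_MODEL = "nomic-embed-text"   # placeholder embedding model
    GEN_MODEL = "llama3"               # placeholder generation model

    def embed(text: str) -> np.ndarray:
        r = requests.post(f"{OLLAMA}/api/embeddings",
                          json={"model": EMBED_MODEL, "prompt": text})
        r.raise_for_status()
        return np.array(r.json()["embedding"])

    def generate(prompt: str) -> str:
        r = requests.post(f"{OLLAMA}/api/generate",
                          json={"model": GEN_MODEL, "prompt": prompt, "stream": False})
        r.raise_for_status()
        return r.json()["response"]

    # The "index" is just a list of (chunk, embedding) pairs kept in memory.
    docs = ["llama.cpp runs GGUF quantized models on local hardware.",
            "Ollama wraps llama.cpp and exposes a local HTTP API."]
    index = [(d, embed(d)) for d in docs]

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Rank chunks by cosine similarity to the query embedding.
        q = embed(query)
        scored = sorted(index,
                        key=lambda pair: float(np.dot(q, pair[1]) /
                                               (np.linalg.norm(q) * np.linalg.norm(pair[1]))),
                        reverse=True)
        return [chunk for chunk, _ in scored[:k]]

    question = "How does Ollama relate to llama.cpp?"
    context = "\n".join(retrieve(question))
    print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))

Swapping the in-memory list for a real vector store (or adding a knowledge graph on top) only changes the retrieve() step; the embed-retrieve-generate loop stays the same.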
Works well, but I haven't tested it at a larger scale.