
Posted by tmaly 1/14/2026

Ask HN: How are you doing RAG locally?

I'm curious: how are people doing RAG locally, with minimal dependencies, for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

413 points | 157 comments
rahimnathwani 1/14/2026|
If your data aren't too large, you can use faiss-cpu and pickle

https://pypi.org/project/faiss-cpu/
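
Not from the parent, but a minimal sketch of that setup, assuming sentence-transformers for the embedding side (any encoder works):

    # pip install faiss-cpu sentence-transformers
    import pickle

    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["first document", "second document"]

    # Flat (exact) inner-product index; cosine similarity on normalized vectors.
    emb = model.encode(docs, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)

    # Persist the index and the doc texts side by side.
    faiss.write_index(index, "docs.faiss")
    with open("docs.pkl", "wb") as f:
        pickle.dump(docs, f)

    # Query: embed, search, map row ids back to texts.
    q = model.encode(["a query"], normalize_embeddings=True)
    scores, ids = index.search(q, 2)
    print([docs[i] for i in ids[0]])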

notyourwork 1/15/2026||
For the uneducated, how large is too large? Curious.
itake 1/15/2026||
FAISS runs in RAM. If your dataset can't fit into RAM, FAISS is not the right tool.
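
A back-of-envelope way to answer "how large": a flat float32 index costs vectors × dims × 4 bytes, so you can size it in a line:

    # Back-of-envelope RAM for a flat float32 index.
    n_vectors, dims = 1_000_000, 768
    print(n_vectors * dims * 4 / 1024**3)  # ~2.9 GiB; fp16 or PQ compression shrinks it
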
hahahahhaah 1/15/2026||
Should it be:

If the total size of your data isn't too large...?

Data being a plural gets me.

You might have small datums but a lot of kilobytes!

pousada 1/15/2026||
Data is technically plural, but nobody uses the singular, and it's often treated as a singular mass noun, which I think is completely fine; nobody speaks Latin anyway.
DonHopkins 1/15/2026||
The opposite of Data is Lore.
init0 1/15/2026||
I built a lib for myself https://pypi.org/project/piragi/
stingraycharles 1/15/2026|
That looks great! Is there a way to store / cache the embeddings?
oliveiracwb 1/15/2026||
We handle ~300k customer interactions per day, so latency and precision really matter. We built an internal RAG-based portal on top of our knowledge base (basically a much better FAQ).

On the retrieval side, I built a custom search/indexing layer (Node) specifically for service traceability and discovery. It uses a hybrid approach — embeddings + full-text search + IVF-HNSW — to index and cross-reference our APIs, services, proxies and orchestration repos. The RAG pipelines sit on top of this layer, which gives us reasonable recall and predictable latency.
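
Not their code, but the fusion step of a hybrid retriever like this is small; a reciprocal rank fusion sketch over two ranked id lists (the example ids are made up, k=60 is the usual smoothing constant):

    # Reciprocal rank fusion: merge keyword and vector result lists by rank.
    def rrf(result_lists, k=60):
        scores = {}
        for results in result_lists:
            for rank, doc_id in enumerate(results):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_ids = ["svc-a", "svc-b", "svc-c"]    # from full-text search
    vector_ids = ["svc-b", "svc-d", "svc-a"]  # from the ANN index
    print(rrf([bm25_ids, vector_ids]))        # ids on both lists bubble up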

Compliance and observability are still a problem. Every year new vendors show up promising audits, data lineage and observability, but none of them really handle the informational sprawl of ~600 distributed systems. The entropy keeps increasing.

Lately I’ve been experimenting with a more semantic/logical KAG approach on top of knowledge graphs to map business rules scattered across those systems. The goal is to answer higher-level questions about how things actually work — Palantir-like outcomes, but with explicit logic instead of magic.

Curious if others are moving beyond “pure RAG” toward graph-based or hybrid reasoning setups.

bzGoRust 1/15/2026||
At my company we built an internal chatbot based on RAG, using LangChain + Milvus + an LLM. Since the documents are well formatted, overlapping chunking is easy; the chunks are then inserted into Milvus as the vector DB. Hybrid search (combining dense and sparse search) is natively supported in Milvus and helps us retrieve better, which gives better-quality answers.
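
For reference, the overlapping-chunking step is a few lines; a sketch with made-up size/overlap values (the Milvus insert and hybrid search happen downstream):

    # Fixed-size chunks with overlap so context isn't cut at boundaries.
    def chunk(text, size=800, overlap=200):
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    pieces = chunk(open("handbook.txt").read())
    # Each piece gets a dense embedding and a sparse representation before insert.
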
cluckindan 1/15/2026|
Hybrid search usually refers to traditional keyword search (BM25, TF-IDF) combined with a vector similarity search.
folli 1/15/2026||
I was just working on a RAG implementation for >500k news articles, completely local, using postgres as a vector database: https://github.com/r-follador/TeletextSignals

I'm positively surprised by how well it works, especially if you also connect it to an LLM.
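
pgvector is the usual way to do this in Postgres; a minimal sketch assuming that extension is installed, with psycopg for access:

    # pip install psycopg; assumes CREATE EXTENSION vector has been run.
    import psycopg

    vec = "[" + ",".join(["0.1"] * 384) + "]"  # pgvector's text format
    with psycopg.connect("dbname=news") as conn:
        conn.execute("""CREATE TABLE IF NOT EXISTS articles
                        (id bigserial PRIMARY KEY, body text, embedding vector(384))""")
        conn.execute("INSERT INTO articles (body, embedding) VALUES (%s, %s::vector)",
                     ("some article text", vec))
        # <=> is pgvector's cosine-distance operator; smaller means closer.
        rows = conn.execute("SELECT body FROM articles ORDER BY embedding <=> %s::vector LIMIT 5",
                            (vec,)).fetchall()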

tschellenbach 1/15/2026||
Vector & BM25 on Turbopuffer. (see https://github.com/GetStream/Vision-Agents/blob/main/plugins...)
philip1209 1/15/2026||
I run a Mac Mini home datacenter [1]. I've been using Chroma, Qwen 0.6B embeddings, and gpt-oss-20b to build a search agent over my blog.

[1]: https://www.contraption.co/a-mini-data-center/
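The Chroma side of a setup like that is compact; a sketch using chromadb's persistent client and, for brevity, Chroma's built-in default embedder rather than Qwen:

    # pip install chromadb
    import chromadb

    client = chromadb.PersistentClient(path="./blog-index")
    posts = client.get_or_create_collection("posts")

    # Default embedder here; a Qwen model would be wired in via an
    # embedding_function passed to get_or_create_collection.
    posts.add(ids=["post-1"], documents=["full text of the post"])

    hits = posts.query(query_texts=["home datacenter"], n_results=3)
    print(hits["documents"])
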

podgietaru 1/15/2026||
I made a small RAG database just using Postgres. I outlined it in the blog post below. I use it for RSS feed organisation and search; the articles are small blobs of text. I do the labeling with a pseudo-KNN algorithm.

https://aws.amazon.com/blogs/machine-learning/use-language-e...

The code for it is here: https://github.com/aws-samples/rss-aggregator-using-cohere-e...

The example link no longer works, as I no longer work at AWS.
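
The post is gone, but a nearest-labeled-neighbors majority vote is one plausible reading of "pseudo-KNN"; a numpy sketch under that assumption, with made-up data:

    import numpy as np

    # Labeled example embeddings (rows) and their labels; all unit-normalized.
    labeled = np.random.randn(10, 384).astype(np.float32)
    labeled /= np.linalg.norm(labeled, axis=1, keepdims=True)
    labels = np.array(["tech", "sports"] * 5)

    def knn_label(vec, k=3):
        sims = labeled @ vec                    # cosine similarity via dot product
        top = np.argsort(sims)[-k:]             # k nearest labeled neighbors
        vals, counts = np.unique(labels[top], return_counts=True)
        return vals[np.argmax(counts)]          # majority vote

    item = np.random.randn(384).astype(np.float32)
    print(knn_label(item / np.linalg.norm(item)))
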

g0wda 1/15/2026||
Store fp16 vector blobs in sqlite. Load the vectors after filter queries into memory and do a matvec multiplication for similarity scores (this part will be fast if the library (e.g. numpy/torch) uses multithreading/blas/GPU). I will migrate this to the very based https://github.com/sqliteai/sqlite-vector when it starts to become a bottleneck. In my case the filters by other features (e.g. date, location) just subset a lot. All this is behind some interface that will allow me to switch out the backend.
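
A minimal sketch of that pattern, with made-up table and filter columns:

    import sqlite3

    import numpy as np

    db = sqlite3.connect("vectors.db")
    db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, day TEXT, vec BLOB)")
    v = np.random.randn(384).astype(np.float16)
    db.execute("INSERT INTO items (day, vec) VALUES (?, ?)", ("2026-01-15", v.tobytes()))

    # Filter first, then load only the surviving vectors into RAM.
    rows = db.execute("SELECT id, vec FROM items WHERE day >= ?", ("2026-01-01",)).fetchall()
    ids = [r[0] for r in rows]
    mat = np.stack([np.frombuffer(r[1], dtype=np.float16) for r in rows]).astype(np.float32)

    query = np.random.randn(384).astype(np.float32)
    scores = mat @ query                   # one matvec; BLAS makes this fast
    print(ids[int(np.argmax(scores))])     # best-scoring row id
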
cbcoutinho 1/15/2026|
The Nextcloud MCP Server [0] supports Qdrant as a vector DB to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory or in a local directory (similar to SQLite); for larger deployments it runs as a standalone service/sidecar and can be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server
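
For reference, Qdrant's local modes look roughly like this with qdrant-client (a sketch; check your client version):

    # pip install qdrant-client
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    client = QdrantClient(path="./qdrant-data")  # or ":memory:", or url= for a server
    client.create_collection("notes",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE))
    client.upsert("notes", points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"file": "todo.md"})])
    # Recent clients use query_points; older ones have client.search instead.
    hits = client.query_points("notes", query=[0.1] * 384, limit=3)
    print(hits.points)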
