Posted by tmaly 1 day ago

Ask HN: How are you doing RAG locally?

I am curious how people are doing RAG locally, with minimal dependencies, for internal code or complex documents.

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

316 points | 126 comments
yakkomajuri 7 hours ago|
I've written about this (the post was even here on HN), mostly from the perspective of an organization running RAG on its own infra, but I cover the general components and the alternatives to cloud services.

Not sure how useful it is for what you need specifically: https://blog.yakkomajuri.com/blog/local-rag

init0 14 hours ago||
I built a lib for myself https://pypi.org/project/piragi/
stingraycharles 13 hours ago|
That looks great! Is there a way to store / cache the embeddings?
pj4533 7 hours ago||
Well, this isn’t code, but I’ve been working on a memory system for Claude Code. This portion provides semantic search over the session files in .claude/projects. It uses OpenAI for embeddings, so it’s not completely local (that would be easy to modify), with storage in ChromaDB.

https://github.com/pj4533/seance
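
Roughly the shape that takes, as a sketch rather than the project's actual code: OpenAI embeddings over the session files, with a persistent Chroma collection. The paths, model name, and collection name here are placeholders:

    # Sketch: embed Claude Code session files and index them in ChromaDB.
    # Paths, model, and collection name are illustrative, not taken from the project.
    from pathlib import Path

    import chromadb
    from openai import OpenAI

    openai_client = OpenAI()                          # reads OPENAI_API_KEY from the env
    chroma = chromadb.PersistentClient(path="./chroma_db")
    collection = chroma.get_or_create_collection("claude_sessions")

    def embed(texts):
        resp = openai_client.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    # Index each session file as one document (a real system would chunk them).
    files = sorted(Path.home().glob(".claude/projects/**/*.jsonl"))
    docs = [f.read_text(errors="ignore")[:8000] for f in files]
    if docs:
        collection.add(ids=[str(f) for f in files], documents=docs, embeddings=embed(docs))

    # Semantic search over past sessions.
    hits = collection.query(query_embeddings=embed(["how did I fix the auth bug?"]), n_results=3)
    print(hits["ids"][0])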

rahimnathwani 1 day ago||
If your data aren't too large, you can use faiss-cpu and pickle

https://pypi.org/project/faiss-cpu/
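
A minimal sketch of that approach: an exact (brute-force) FAISS index pickled to disk next to the chunk texts. The embedding function is a random-vector stand-in; swap in whatever model you actually use:

    # Sketch: small-corpus RAG with faiss-cpu + pickle, no server required.
    import pickle

    import faiss
    import numpy as np

    def get_embeddings(texts):
        # Stand-in: replace with a real embedding model (e.g. sentence-transformers).
        rng = np.random.default_rng(0)
        return rng.standard_normal((len(texts), 384)).astype("float32")

    chunks = ["first document chunk", "second document chunk"]
    vecs = get_embeddings(chunks)
    faiss.normalize_L2(vecs)                      # cosine similarity via inner product
    index = faiss.IndexFlatIP(vecs.shape[1])      # exact brute-force search
    index.add(vecs)

    # Persist the chunk texts and the serialized index in one pickle.
    with open("rag_index.pkl", "wb") as f:
        pickle.dump({"chunks": chunks, "index": faiss.serialize_index(index)}, f)

    # Load and query later.
    with open("rag_index.pkl", "rb") as f:
        state = pickle.load(f)
    index = faiss.deserialize_index(state["index"])
    query = get_embeddings(["my question"])
    faiss.normalize_L2(query)
    scores, ids = index.search(query, 3)
    print([state["chunks"][i] for i in ids[0] if i != -1])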

notyourwork 15 hours ago||
For the uneducated, how large is too large? Curious.
itake 13 hours ago||
FAISS runs in RAM. If your dataset can't fit into RAM, FAISS is not the right tool.
hahahahhaah 12 hours ago||
Should it be:

If the total size of your data isn't too large...?

Data being a plural gets me.

You might have small datums but a lot of kilobytes!

pousada 8 hours ago||
Data is technically plural, but nobody uses the singular, and it’s often used as a singular mass noun - which is completely fine I think; nobody speaks Latin anyway
DonHopkins 5 hours ago||
The opposite of Data is Lore.
reactordev 7 hours ago||
I have three tools dedicated to this.

save_memory, recall_memory, search

Save memory vectorizes a session, summarizes it, and stores it in SQLite. Recall memory takes a vector or a previous tool run id and loads the full text output. Search takes a vector array or string array and searches through the graph using fuzzy matching and vector dot products.

It’s not fancy, but it works really well with gpt-oss.
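
Not the same code, but the core of that pattern (summaries plus embeddings stored as blobs in SQLite, scored with a brute-force dot product) fits in a few lines. The schema, function names, and stand-in embed() below are all illustrative:

    # Sketch: save_memory / search over SQLite using vector dot products.
    import sqlite3

    import numpy as np

    db = sqlite3.connect("memories.db")
    db.execute("CREATE TABLE IF NOT EXISTS memories "
               "(id INTEGER PRIMARY KEY, summary TEXT, embedding BLOB)")

    def embed(text: str) -> np.ndarray:
        # Stand-in: replace with a real local embedding model.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.standard_normal(384).astype("float32")
        return v / np.linalg.norm(v)

    def save_memory(summary: str) -> None:
        db.execute("INSERT INTO memories (summary, embedding) VALUES (?, ?)",
                   (summary, embed(summary).tobytes()))
        db.commit()

    def search(query: str, k: int = 5):
        q = embed(query)
        rows = db.execute("SELECT summary, embedding FROM memories").fetchall()
        scored = [(float(np.dot(q, np.frombuffer(blob, dtype="float32"))), s)
                  for s, blob in rows]
        return sorted(scored, reverse=True)[:k]

    save_memory("Refactored the session loader to stream JSONL files.")
    print(search("how do we load session files?"))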

andoando 3 hours ago||
Anyone have suggestions for doing semantic caching?
claylyons 6 hours ago||
Has anyone tried this? https://aws.amazon.com/s3/features/vectors/
cbcoutinho 13 hours ago||
The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. claude code) into a RAG system that you can use to chat with your files.

For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.

[0] https://github.com/cbcoutinho/nextcloud-mcp-server
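
For the local modes mentioned above, a small sketch with qdrant-client; the collection name, vector size, and toy vectors are just illustrative:

    # Sketch: local Qdrant with qdrant-client (in-memory or on-disk, no server needed).
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    # client = QdrantClient(":memory:")         # ephemeral, in-process
    client = QdrantClient(path="./qdrant_data")  # persisted to a local directory

    client.recreate_collection(
        collection_name="docs",
        vectors_config=VectorParams(size=4, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="docs",
        points=[
            PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"text": "first chunk"}),
            PointStruct(id=2, vector=[0.4, 0.3, 0.2, 0.1], payload={"text": "second chunk"}),
        ],
    )
    hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
    print(hits[0].payload["text"])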

codebolt 8 hours ago||
Giving the LLM tools with an OData query interface has worked well for me. In C# it's pretty trivial to set up an MCP server with OData querying for an arbitrary data model. At work we have an Excel sheet with 40k rows which the LLM was able to quickly and reliably analyse using this method.
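
The comment describes a C# MCP server with real OData support; the toy Python sketch below only shows the general shape of handing the model a structured query tool over tabular data. The sample rows and single-operator filter are made up, not OData:

    # Toy sketch: expose tabular data as a structured query tool an LLM can call.
    # The rows below stand in for the real Excel export; the filter syntax is a
    # hand-rolled "field eq value" placeholder, not real OData.
    ROWS = [
        {"Region": "EMEA", "Product": "Widget", "Revenue": "1200"},
        {"Region": "APAC", "Product": "Gadget", "Revenue": "800"},
    ]

    def query_rows(filter_expr: str = "", top: int = 50) -> list[dict]:
        """Return rows matching a single 'field eq value' clause, capped at top."""
        rows = ROWS
        if filter_expr:
            field, op, value = filter_expr.split(maxsplit=2)
            if op != "eq":
                raise ValueError("toy filter only supports 'eq'")
            rows = [r for r in rows if r.get(field) == value.strip("'")]
        return rows[:top]

    # Registered as an MCP tool, the model can iterate: tighten the filter, inspect results.
    print(query_rows("Region eq 'EMEA'", top=5))
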
robotswantdata 9 hours ago|
You don’t need a vector database or graph; it really depends on your existing infrastructure, file types, and needs.

The newer “agent” search approach can just query a file system or API. It’s slightly slower, but easier to set up and maintain since there’s no extra infrastructure.
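
As a sketch of that agent-style approach, the "retrieval" can be nothing more than a search function the model calls in a loop; the tool name, file extensions, and limits below are arbitrary choices:

    # Sketch: a grep-style "search the repo" tool an agent calls instead of a vector store.
    from pathlib import Path

    def search_files(query: str, root: str = ".", max_hits: int = 20) -> list[dict]:
        """Return files/lines containing the query string (case-insensitive)."""
        hits = []
        for path in Path(root).rglob("*"):
            if not path.is_file() or path.suffix not in {".md", ".py", ".txt", ".rst"}:
                continue
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if query.lower() in line.lower():
                    hits.append({"file": str(path), "line": lineno, "text": line.strip()[:200]})
                    if len(hits) >= max_hits:
                        return hits
        return hits

    # An agent loop exposes this as a tool and lets the model refine its own queries.
    print(search_files("embedding"))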
