Posted by tmaly 1/14/2026
Ask HN: How are you doing RAG locally?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
So I use a hosted one to prevent this. My business uses a vector DB, so I created a new DB to vectorize and host my knowledge base.
1. My knowledge base is all markdown files, so I split them by header tags.
2. Each split chunk is hashed and the hash value is stored in SQLite.
3. The chunk is vectorized and pushed to the cloud DB.
4. Whenever I make changes, I run a script which splits and checks the hashes; if a hash changed, I upsert the document. If not, I don't do anything.
This helps me keep the store up to date.
For search I have a CLI query which searches and fetches from the vector store.
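A minimal sketch of that split/hash/upsert loop, assuming markdown headers delimit the chunks; the `upsert` callback stands in for the embed-and-push step, and keying chunks by position is a simplification:

```python
import hashlib
import re
import sqlite3

def split_by_headers(markdown: str) -> list[str]:
    """Split a markdown file into chunks at header lines (#, ##, ...)."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def sync(markdown: str, db: sqlite3.Connection, upsert):
    """Upsert only chunks whose content hash changed since the last run."""
    db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, hash TEXT)")
    for i, chunk in enumerate(split_by_headers(markdown)):
        h = hashlib.sha256(chunk.encode()).hexdigest()
        row = db.execute("SELECT hash FROM chunks WHERE id = ?", (i,)).fetchone()
        if row is None or row[0] != h:
            upsert(i, chunk)  # embed + push to the cloud vector DB
            db.execute("INSERT OR REPLACE INTO chunks (id, hash) VALUES (?, ?)", (i, h))
    db.commit()
```

Running `sync` twice on unchanged files performs zero upserts the second time, which is the whole point of storing the hashes.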
1 - agent memory (it's an AI coach, so this holds the unique training methods that allow instant adoption of new skills and distilling of best-fit skills for the context)
2 - user memory (the AI coach's memory of a user)
3 - session memory (for long conversations, instead of compaction or truncation)
Then separately I have coding agents which I give semantic search over the same FAISS system:
- on command, they create new memories from lessons (consumes tokens)
- they vector search FAISS when needing more context (roughly 2x greater agent alignment/outcomes this way)
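For illustration, here is a NumPy stand-in for the FAISS piece — the same inner-product search that `faiss.IndexFlatIP` performs, minus the indexing optimizations. The embedding step is assumed to happen elsewhere; this is not the author's code:

```python
import numpy as np

class MemoryStore:
    """Toy inner-product vector store, mirroring faiss.IndexFlatIP semantics."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.texts: list[str] = []

    def add(self, text: str, embedding: np.ndarray):
        """Store one memory alongside its (precomputed) embedding."""
        self.vectors = np.vstack([self.vectors, embedding.astype(np.float32)])
        self.texts.append(text)

    def search(self, query: np.ndarray, k: int = 3) -> list[str]:
        """Return the k memories with the highest inner-product score."""
        scores = self.vectors @ query
        top = np.argsort(-scores)[:k]
        return [self.texts[i] for i in top]
```

With normalized embeddings the inner product is cosine similarity, which is the usual choice for this kind of memory lookup.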
And finally, I forked OpenAI's Codex terminal agent code to add built-in vector search and injection.
So I say "Find any uncovered TDD opportunity matching intent to actuality for auth on these 3 repos, write TDD coverage, and bring failures to my attention."
The agents then:
- set my message to {$query}
- vector search on {$query}
- embed the results in their context window programmatically, so no token consumption (what a freaking dream)
That's open source if it's helpful; it's here:
https://github.com/Next-AI-Labs-Inc/codex/tree/nextailabs
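The search-and-inject step described above might look roughly like this — function names are hypothetical, and in the fork it would be wired into Codex's prompt assembly rather than a standalone helper:

```python
def inject_context(user_message: str, vector_search, top_k: int = 5) -> str:
    """Run the user's message through vector search and prepend the
    retrieved chunks to the prompt programmatically, instead of making
    the model call a retrieval tool (no tool-call round trips)."""
    query = user_message                 # the message becomes {$query}
    hits = vector_search(query, top_k)   # e.g. a FAISS lookup
    context = "\n\n".join(f"[retrieved] {h}" for h in hits)
    return f"{context}\n\n[user] {user_message}"

# Usage with a stub search function standing in for the vector store:
prompt = inject_context("Find uncovered TDD opportunities",
                        lambda q, k: ["auth.py has no tests"])
```

The retrieved chunks still occupy context-window space, but the injection happens in code rather than via model-driven tool calls.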
I'm trying to determine where something like this fits in:
https://huggingface.co/MongoDB/mdbr-leaf-ir
My gaps right now are ...
I am not training the agents yet, i.e. fine-tuning the underlying models.
I'd love the simplest approach to test this, because at least with the Codex clone I could easily swap in local models, but I somehow doubt they will be able to match the performance of the hosted models.
Especially because Claude Code just pulled ahead of Codex in quality in the last week or so, and it's closed source. I'm seeing clear swarm agentic coding internally, which is a dream for context-window efficiency (in Claude Code as of today).
The real challenge wasn't model quality - it was the chunking strategy. Financial data is weirdly structured and breaking it into sensible chunks that preserve context took more iteration than expected. Eventually settled on treating each complete record as a chunk rather than doing sliding windows over raw text. The "obvious" approaches from tutorials didn't work well at all for structured tabular-ish data.
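A toy version of the record-as-chunk idea, using CSV as a stand-in for the financial records (field names are made up): each complete row becomes one self-describing chunk, instead of sliding a fixed-size window over raw text and splitting records mid-row.

```python
import csv
import io

def records_to_chunks(csv_text: str) -> list[str]:
    """One chunk per complete record, with column names repeated in
    every chunk so each embedded chunk stays self-describing."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return ["; ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]

data = "ticker,close,volume\nAAPL,231.5,4.2M\nMSFT,415.1,2.8M\n"
chunks = records_to_chunks(data)
# each chunk is then embedded independently
```

Because every chunk carries its column names, a retrieved chunk makes sense on its own — which is what sliding windows over raw text tend to destroy.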
ragtune explain "your query" --collection prod
Shows scores, sources, and diagnostics. Helps catch when your chunking or embeddings are silently failing, or when you need numeric estimates to base your judgements on. Open source: https://github.com/metawake/ragtune
This is specifically a “remembrance agent”, so it surfaces atoms related to what you’re writing rather than doing anything generative.
Extension: https://github.com/mmargenot/tezcat
Also available in community plugins.
kb = Ragi(["./docs", "s3://bucket/data/*/*.pdf", "https://api.example.com/docs"])
answer = kb.ask("How do I deploy this?")
That's it, with https://pypi.org/project/piragi/
Not sure how useful it is for what you need specifically: https://blog.yakkomajuri.com/blog/local-rag
save_memory, recall_memory, search
save_memory vectorizes a session, summarizes it, and stores it in SQLite. recall_memory takes a vector or a previous tool-run id and loads the full text output. search takes a vector array or string array and searches through the graph using fuzzy matching and vector dot products.
It’s not fancy, but it works really well. (Running on gpt-oss.)
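A rough sketch of how those three tools could be wired — the table layout, the summarizer callback, and the use of stdlib `difflib` for fuzzy matching are all assumptions here, not the author's implementation:

```python
import sqlite3
from difflib import SequenceMatcher

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, summary TEXT, full_text TEXT)")

def save_memory(session_text: str, summarize) -> int:
    """Summarize a session, store summary + full text, return the run id."""
    cur = db.execute("INSERT INTO memories (summary, full_text) VALUES (?, ?)",
                     (summarize(session_text), session_text))
    db.commit()
    return cur.lastrowid

def recall_memory(run_id: int) -> str:
    """Load the full text output for a previous tool-run id."""
    return db.execute("SELECT full_text FROM memories WHERE id = ?",
                      (run_id,)).fetchone()[0]

def search(query: str, threshold: float = 0.3) -> list[str]:
    """Fuzzy-match the query against stored summaries."""
    rows = db.execute("SELECT summary FROM memories").fetchall()
    return [s for (s,) in rows
            if SequenceMatcher(None, query.lower(), s.lower()).ratio() > threshold]
```

The real version also searches by vector dot product, which would slot in next to the fuzzy match as a second scoring path.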