Posted by tmaly 1/14/2026
Ask HN: How are you doing RAG locally?
Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?
It all moves so fast that I wouldn't be surprised if everything I built is now hopelessly outdated, and it was probably only two months ago.
To answer the question more directly, I've spent the last couple of years with a few different quantized models, mostly running on llama.cpp or ollama depending on the model. The results are way slower than the paid per-token APIs, but they are completely free of external influence and cost.
However, the models I've tested generally turn out to be pretty dumb at the quantization level I have to run to keep them reasonably fast, and their code generation is a mess I'd rather not deal with.
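On the RAG side of the question: a minimal local setup can be nothing more than brute-force cosine similarity over embeddings from the local Ollama server, no vector database at all. Here's a rough sketch of that idea; the model names and sample documents are placeholders, and it assumes the stock /api/embeddings and /api/generate endpoints on Ollama's default port.

    # Minimal local RAG sketch against an Ollama server on the default port.
    # Model names below are placeholders -- swap in whatever you have pulled locally.
    import requests
    import numpy as np

    OLLAMA = "http://localhost:11434"
    EMBED_MODEL = "nomic-embed-text"   # placeholder embedding model
    GEN_MODEL = "llama3"               # placeholder generation model

    def embed(text: str) -> np.ndarray:
        r = requests.post(f"{OLLAMA}/api/embeddings",
                          json={"model": EMBED_MODEL, "prompt": text})
        r.raise_for_status()
        return np.array(r.json()["embedding"])

    def generate(prompt: str) -> str:
        r = requests.post(f"{OLLAMA}/api/generate",
                          json={"model": GEN_MODEL, "prompt": prompt, "stream": False})
        r.raise_for_status()
        return r.json()["response"]

    # The "index" is just a list of (chunk, embedding) pairs kept in memory.
    docs = ["llama.cpp runs GGUF quantized models on local hardware.",
            "Ollama wraps llama.cpp and exposes a local HTTP API."]
    index = [(d, embed(d)) for d in docs]

    def retrieve(query: str, k: int = 2) -> list[str]:
        # Rank chunks by cosine similarity to the query embedding.
        q = embed(query)
        scored = sorted(index,
                        key=lambda pair: float(np.dot(q, pair[1]) /
                                               (np.linalg.norm(q) * np.linalg.norm(pair[1]))),
                        reverse=True)
        return [chunk for chunk, _ in scored[:k]]

    question = "How does Ollama relate to llama.cpp?"
    context = "\n".join(retrieve(question))
    print(generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))

Swapping the in-memory list for a real vector store (or adding a knowledge graph on top) only changes the retrieve() step; the embed-retrieve-generate loop stays the same.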
Works well, but I haven't tested it at a larger scale.