Ask HN: How are you doing RAG locally?

Posted by tmaly 1/14/2026

I am curious how people are doing RAG locally with minimal dependencies for internal code or complex documents?

Are you using a vector database, some type of semantic search, a knowledge graph, a hypergraph?

413 points | 157 comments

navar 1/15/2026|

For the retrieval stage, we have developed a highly efficient, CPU-only-friendly text embedding model:

https://huggingface.co/MongoDB/mdbr-leaf-ir

It ranks #1 on a bunch of leaderboards for models of its size. It can be used interchangeably with the model it has been distilled from (https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1...).

You can see an example comparing semantic (i.e., embeddings-based) search vs bm25 vs hybrid here: http://search-sensei.s3-website-us-east-1.amazonaws.com (warning! It will download ~50MB of data for the model weights and onnx runtime on first load, but should otherwise run smoothly even on a phone)

This mini app illustrates the advantage of semantic vs bm25 search. For instance, embedding models "know" that j lo refers to jennifer lopez.

We have also published the recipe for training this type of model, in case you are interested in doing so; we show that it can be done on relatively modest hardware and that training data is very easy to obtain: https://arxiv.org/abs/2509.12539

HanClinto 1/17/2026||

Thank you for publishing this! I absolutely love small embedding models, and have used them on a number of projects (both commercial and hobbyist). I look forward to checking this one out!

I don't know if this is too much to ask, but something that would really help me adopt your model is to include a fine-tuning setup. The BGE series of embeddings-models has been my go-to for a couple of years now -- not because it's the best-performing in the leaderboards, but because they make it so incredibly easy to fine-tune the model [0]. Give it a JSONL file of a bunch of training triplets, and you can fine-tune the base models on your own dataset. I appreciate you linking to the paper on the recipe for training this type of model -- how close to turnkey is your model to helping me do transfer learning with my own dataset? I looked around for a fine-tuning example of this model, and didn't happen to see anything, but I would be very interested in trying this one out.

Does support for fine-tuning already exist? If so, then I would be able to switch to this model away from BGE immediately.

* [0] - https://github.com/FlagOpen/FlagEmbedding/tree/master/exampl...

navar 1/17/2026||

As far as I can tell it should be possible to reuse this fine tuning code entirely and just replace `--embedder_name_or_path BAAI/bge-base-en-v1.5` with `--embedder_name_or_path MongoDB/mdbr-leaf-ir`

Note that bge-base-en-v1.5 is a 110M-param model; ours is 23M.

* BEIR performance: bge=53.23 vs ours=53.55
* RTEB performance: bge=43.75 vs ours=44.82

Overall they should perform very similarly, except ours is about 5x smaller and hence that much faster.

rcarmo 1/15/2026|||

Hmmm. I recently created https://github.com/rcarmo/asterisk-embedding-model, need to look at this since I had very limited training resources.

jasonjmcghee 1/15/2026|||

How does performance (embedding speed and recall) compare to minish / model2vec static word embeddings?

navar 1/15/2026||

I interacted with the authors of these models quite a bit!

These are very interesting models.

The tradeoff here is that you get even faster inference, but lose on retrieval accuracy [0].

Specifically, inference will be faster because essentially you are only doing tokenization + a lookup table + an average. So despite the fact that their largest model is 32M params, you can expect inference speeds to be higher than ours, which has 23M params but is transformer-based.
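
A minimal sketch of the "tokenization + lookup table + average" idea, to make the speed argument concrete. The vocabulary, the 3-dimensional vectors, and the function names are made up for illustration; real static embedding models ship a learned table with a proper tokenizer, so treat this as a toy rather than how any particular library works.

```python
import math

# Hypothetical toy table: token -> fixed vector. Real static models use a
# learned table with many thousands of rows and far more dimensions.
TABLE = {
    "jennifer": [0.9, 0.1, 0.0],
    "lopez": [0.8, 0.2, 0.1],
    "search": [0.0, 0.9, 0.3],
    "engine": [0.1, 0.8, 0.4],
}
DIM = 3


def embed(text):
    """Tokenize on whitespace, look up each known token, average the vectors."""
    rows = [TABLE[tok] for tok in text.lower().split() if tok in TABLE]
    if not rows:
        return [0.0] * DIM  # nothing in the table matched
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

There is no attention and no matrix multiplication per token here, only dictionary lookups and one mean, which is why this family of models can be faster than a transformer of similar parameter count.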

I am not sure about typical inference speeds on a CPU for their models, but with ours you can expect to do ~22 docs per second, and ~120 queries per second on a standard 2vCPU server.

As far as retrieval accuracy goes, on BEIR we score 53.55, all-MiniLM-L12-v2 (a widely adopted compact text embedding model) scores 42.69, while potion-8M scores 30.43.

I can't find their larger models but you can generally get an idea of the power level of different embedding models here: https://huggingface.co/spaces/mteb/leaderboard

If you want to run them on a CPU it may make sense to filter for smaller models (e.g., <100M params). On the other side our models achieve higher retrieval scores.

[0] "accuracy" in layman's terms, not in the accuracy-vs-recall sense. The correct word here would be "effectiveness".

3abiton 1/16/2026||

And honestly, bm25 has been the best approach in many of the projects we deployed.

__jf__ 1/15/2026||

For vector generation I started using Meta-Llama-3-8B in April 2024 with Python and Transformers for each text chunk on an RTX-A6000. Wow that thing was fast but noisy and also burns 500W. So a year ago I switched to an M1 Ultra and only had to replace Transformers with Apple's MLX python library. Approximately the same speed but less heat and noise. The Llama model has 4k dimensions so at fp16 that's 8 kilobytes per chunk, which I store in a BLOB column in SQLite via numpy.save(). Between running on the RTX and M1 there is a very small difference in vector output but not enough for me to change retrieval results, regenerate the vectors or change to another LLM.

For retrieval I load all the vectors from the SQLite database into a numpy.array and hand it to FAISS. Faiss-gpu was impressively fast on the RTX A6000 and faiss-cpu is slower on the M1 Ultra but still fast enough for my purposes (I'm firing a few queries per day, not per minute). For 5 million chunks memory usage is around 40 GB, which fits into the A6000 and easily fits into the 128GB of the M1 Ultra. It works, I'm happy.
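
A rough sketch of the storage side and the size arithmetic, using only the standard library. This is not the setup described above: it swaps `struct` in for numpy.save() and an exhaustive cosine scan in for FAISS, and the table and function names are invented for the example.

```python
import math
import sqlite3
import struct

DIM = 4096  # "4k dimensions", as stated in the comment above


def pack_fp16(vec):
    """Serialize floats as little-endian half precision (2 bytes each)."""
    return struct.pack(f"<{len(vec)}e", *vec)


def unpack_fp16(blob):
    """Inverse of pack_fp16."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per text chunk, vector stored as a BLOB.
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, vec BLOB)")


def add_chunk(text, vec):
    db.execute("INSERT INTO chunks (text, vec) VALUES (?, ?)", (text, pack_fp16(vec)))


def search(query_vec, k=3):
    """Exhaustive scan over all rows; an ANN/FAISS index replaces this at scale."""
    scored = [
        (cosine(query_vec, unpack_fp16(blob)), text)
        for text, blob in db.execute("SELECT text, vec FROM chunks")
    ]
    return sorted(scored, reverse=True)[:k]


bytes_per_chunk = DIM * 2                      # 8192 bytes at fp16
total_gb = 5_000_000 * bytes_per_chunk / 1e9   # about 41 GB for 5 million chunks
```

The last two lines reproduce the numbers in the comment: 4096 dimensions at 2 bytes each is 8192 bytes per chunk, and 5 million such chunks come to roughly 41 GB of raw vectors.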

beklein 1/15/2026||

Most of my complex documents are, luckily, Markdown files.

I can recommend https://github.com/tobi/qmd/ . It’s a simple CLI tool for searching in these kinds of files. My previous workflow was based on fzf, but this tool gives better results and enables even more fuzzy queries. I don’t use it for code, though.

Aachen 1/15/2026|

Given that preface, I was really expecting that link to be a grepping tool rewritten in golang or something, or perhaps customised for markdown to weigh matches in "# heading title"s heavier for example

whacked_new 1/15/2026||

Here's a rust one: https://github.com/BeaconBay/ck

I haven't used it extensively, but semantic grep alone was kind of worth it.

Aachen 1/15/2026||

Right, I should have said Rust. Golang is so 2017!

CuriouslyC 1/15/2026||

Don't use a vector database for code, embeddings are slow and bad for code. Code likes bm25+trigram, that gets better results while keeping search responses snappy.
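
One possible reading of "bm25+trigram", sketched in plain Python: score documents with BM25, but over character trigrams instead of words, so partial identifiers still match. The class name, the `k1`/`b` defaults, and the sample strings are illustrative assumptions, not taken from any specific tool.

```python
import math
from collections import Counter


def trigrams(text):
    """Lowercased, overlapping 3-character windows."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


class TrigramBM25:
    """BM25 over character trigrams. Assumes a non-empty list of docs,
    at least one of which is 3+ characters long."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tfs = [Counter(trigrams(d)) for d in docs]   # per-doc trigram counts
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lens) / len(docs)
        # Document frequency: in how many docs each trigram appears.
        self.df = Counter(g for tf in self.tfs for g in tf)

    def idf(self, gram):
        n, df = len(self.docs), self.df[gram]
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        tf, norm = self.tfs[i], self.lens[i] / self.avg_len
        total = 0.0
        for g in set(trigrams(query)):
            f = tf[g]
            denom = f + self.k1 * (1 - self.b + self.b * norm)
            total += self.idf(g) * f * (self.k1 + 1) / denom
        return total

    def search(self, query, k=3):
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return [self.docs[i] for i in order[:k]]


index = TrigramBM25([
    "def parse_config(path):",
    "def render_template(name):",
    "class ConfigParser:",
])
best = index.search("parse_conf", k=1)
```

Because the unit is a trigram, a truncated query like `parse_conf` still overlaps heavily with `parse_config`, which word-level BM25 would miss entirely.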

jankovicsandras 1/15/2026||

You can do hybrid search in Postgres.

Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 BM25 search implemented in PL/pgSQL ( Unlicense / Public domain )

The repo also includes plpgsql_bm25rrf.sql: a PL/pgSQL function for hybrid search ( plpgsql_bm25 + pgvector ) with Reciprocal Rank Fusion; and Jupyter notebook examples.
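
For readers who have not met Reciprocal Rank Fusion, here is a minimal Python sketch of the general technique. It is not the repo's PL/pgSQL implementation; the function name, the sample ID lists, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25_hits = ["a", "b", "c"]     # hypothetical keyword ranking
vector_hits = ["c", "a", "d"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only looks at rank positions, so BM25 scores and vector distances never need to be put on a common scale.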

canadiantim 1/15/2026||

Wow very impressive library great work!

postalcoder 1/15/2026|||

I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and if you're not using a re-ranker.

For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool calling god compared to everything else I've tried, and fast.
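
A bare-bones sketch of what "a model in a while loop with access to ripgrep" can look like. The `model` callable and its dict-based protocol are hypothetical stand-ins for whatever tool-calling interface your local runtime exposes; only the `rg` invocation is a real command, and it assumes ripgrep is on PATH.

```python
import subprocess


def ripgrep(pattern, path="."):
    """Run ripgrep and return matching lines, capped at 20 matches per file."""
    result = subprocess.run(
        ["rg", "--line-number", "--max-count", "20", "-e", pattern, path],
        capture_output=True,
        text=True,
    )
    return result.stdout or "(no matches)"


def agent_loop(question, model, search=ripgrep, max_steps=8):
    """Let the model search until it answers or the step budget runs out.

    `model` is a hypothetical callable taking the transcript and returning
    either {"search": "<pattern>"} or {"answer": "<text>"}.
    """
    transcript = [("user", question)]
    steps = 0
    while steps < max_steps:
        action = model(transcript)
        if "answer" in action:
            return action["answer"]
        # Feed the search output back so the next turn can refine the query.
        transcript.append(("tool", search(action["search"])))
        steps += 1
    return "(gave up: step budget exhausted)"
```

The step budget matters in practice: without it, a model that keeps issuing searches never returns.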

threecheese 1/16/2026||

Say more!

rao-v 1/15/2026|||

Anybody know of a good service / docker that will do BM25 + vector lookup without spinning up half a dozen microservices?

cipherself 1/15/2026|||

Here's a Dockerfile that will spin up postgres with pgvector and paradedb https://gist.github.com/cipherself/5260fea1e2631e9630081fb7d...

You can use pgvector for the vector lookup and paradedb for bm25.

porridgeraisin 1/15/2026||||

For BM25 + trigram, SQLite FTS5 works well.

donkeyboy 1/15/2026||||

Elasticsearch / Opensearch is the industry standard for this

abujazar 1/15/2026||

Used to be, but they're very complicated to operate compared to more modern alternatives and have just gotten more and more bloated over the years. They also require a bunch of different applications for different parts of the stack in order to do the same basic stuff as e.g. Meilisearch, Manticore or Typesense.

cluckindan 1/15/2026||

>very complicated to operate compared to more modern alternatives

Can you elaborate? What makes the modern alternatives easier to operate? What makes Elasticsearch complicated?

Asking because in my experience, Elasticsearch is pretty simple to operate unless you have a huge cluster with nodes operating in different modes.

abujazar 1/17/2026||

Sure, I've managed both clusters and single node deployments in production until 2025 when I changed jobs. Elastic definitely does have its strengths, but they're increasingly enterprise-oriented and appear not to care a lot about open source deployments. At one point Elastic itself had a severe regression in an irreversible patch update (!?) which took weeks to fix, forcing us to recover from backup and recreate the index. The documentation is or has been ambiguous and self-contradicting on a lot of points. The Debian Elastic Enterprise Search package upgrade script was incomplete, so there's a significant manual process for updating the index even for patch updates. The interfaces between the different components of the ELK stack are incoherent and there's literally a thousand ways to configure them. Default setups have changed a lot over the years, leading to incoherent documentation. You really need to be an expert at Elastic in order to run it well, or pay handsomely for the service. It's simply too complicated and costly for what it is, compared to more recent alternatives.

abujazar 1/15/2026|||

Meilisearch

Der_Einzige 1/15/2026|||

This is true in general with LLMs, not just for code. LLMs can be told that their RAG tool is using BM25+N-grams, and will search accordingly. Keyword search is superior to embeddings-based search. The moment Google switched to BERT-based embeddings for search, everyone agreed it was going downhill. Most forms of early enshittification were simply switching off BM25 in favor of embeddings-based search.

BM25/tf-idf and N-grams have always been extremely difficult-to-beat baselines in information retrieval. This is why embeddings still have not led to a "ChatGPT" moment in information retrieval.

lee1012 1/15/2026|||

I'm finding static embedding models quite fast: lee101/gobed (https://github.com/lee101/gobed) is 1ms on GPU :) It would need to be trained for code, though. The bigger code LLM embeddings can be high quality too, so it's really about where the ideal point on the Pareto frontier is. Often, though, you're right: it tends to be bm25 or rg even for code. More complex solutions are possible too if it's really important that the search is high quality.

ehsanu1 1/15/2026|||

I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.
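
As an illustration of what "file paths + signatures" could mean as the text to embed, here is a small sketch that extracts such records from Python source with the standard `ast` module (Python 3.9+ for `ast.unparse`). The record format and the function name are assumptions; the comment does not say how the records were actually built.

```python
import ast


def signature_records(path, source):
    """Return 'path :: signature' strings for every function and class."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.unparse renders the argument list back to source text.
            records.append(f"{path} :: def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            records.append(f"{path} :: class {node.name}")
    return records


example = (
    "class Repo:\n"
    "    def get(self, key):\n"
    "        return None\n"
    "\n"
    "def load(path, strict=False):\n"
    "    return Repo()\n"
)
records = signature_records("src/repo.py", example)
```

Each record is short and dense with identifiers, so the same strings can be fed both to an embedding model and to a BM25 index before fusing the two rankings.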

CuriouslyC 1/15/2026||

I like embeddings for natural language documents where your query terms are unlikely to be unique, and overall document direction is a good disambiguator.

itake 1/15/2026||

With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?

CuriouslyC 1/15/2026||

IME most documentation is coming from the web via web search. I like agentic RAG for this case, which you can achieve easily with a Claude Code subagent.

esperent 1/15/2026||

I'm lucky enough to have 95% of my docs in small markdown files so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index so I just wired it up to my mastra agents. Each file has a short description field, so if a keyword search surfaces the doc they check the description and if it matches, load the whole doc.

This took about one hour to set up and works very well.

(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
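
A minimal sketch of the flow described above (keyword search, check the description, then load the whole doc), using Python's built-in `sqlite3`. It assumes the SQLite build that Python links against has FTS5 enabled; the schema, file names, and helper names are invented for the example, and the agent wiring is left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: one row per Markdown file. `path` is stored but
# not tokenized; `description` and `body` are full-text indexed.
db.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, description, body)")
db.executemany(
    "INSERT INTO docs (path, description, body) VALUES (?, ?, ?)",
    [
        ("deploy.md", "How to deploy the API to staging",
         "Run the deploy script, then check the health endpoint."),
        ("billing.md", "Invoice and refund rules",
         "Refunds are issued within 14 days of the invoice date."),
    ],
)


def candidates(query, limit=5):
    """Keyword search returning (path, description) pairs, best match first.

    Note: `query` is passed straight to FTS5, so punctuation in it may be
    parsed as query syntax.
    """
    return db.execute(
        "SELECT path, description FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


def load(path):
    """Fetch the full document once its description has been accepted."""
    row = db.execute("SELECT body FROM docs WHERE path = ?", (path,)).fetchone()
    return row[0] if row else None
```

The agent only sees short descriptions at first and pulls a full body into context when one looks relevant, which keeps the prompt small.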

dmos62 1/15/2026|

Retrieval-augmented generation. What you described is a perfect example of a RAG. An embedding-based search might be more common, but that's a detail.

esperent 1/15/2026||

Well, that is what the acronym stands for. But every source I've ever seen quickly follows by noting it's retrieval backed by a vectordb. So we'd probably find an even split of people who would call this RAG or not.

xpe 1/16/2026||

What are your sources?

The backing method doesn’t matter as long as it works. This is clear from good RAG survey papers, Wikipedia, and (broadly) understanding the ethos of machine learning engineers and researchers: specific implementation details are usually means to an end, not definitional boundaries.

This may be of interest:

https://github.com/ibm-self-serve-assets/Blended-RAG

> So we'd probably find an even split of people who would call this RAG or not.

Maybe but not likely. This is sometimes called the 50-50 fallacy or the false balance of probability or the equiprobability bias.

https://pmc.ncbi.nlm.nih.gov/articles/PMC4310748/

“The equiprobability bias (EB) is a tendency to believe that every process in which randomness is involved corresponds to a fair distribution, with equal probabilities for any possible outcome. The EB is known to affect both children and adults, and to increase with probability education. Because it results in probability errors resistant to pedagogical interventions, it has been described as a deep misconception about randomness: the erroneous belief that randomness implies uniformity. In the present paper, we show that the EB is actually not the result of a conceptual error about the definition of randomness.”

You can also find an ELI5 Reddit thread on this topic where one comment summarizes it as follows:

“People are conflating the number of distinguishable outcomes with the distribution of probability directly.”

https://www.reddit.com/r/explainlikeimfive/comments/1bpor68/...

eb0la 1/15/2026||

We started with PGVector just because we already knew Postgres and it was easy to hand over to the operations people.

After some time we noticed a semi-structured field in the prompt had a 100% match with the content needed to process the prompt.

Turns out operators started putting tags both in the input and the documents that needed to match on every use case (not much, about 50 docs).

Now we look for the field first and put the corresponding file in the prompt, then we look for matches in the database using the embedding.

85% of the time we don't need the vectordb.
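
A sketch of the "look for the field first, fall back to embeddings" routing. The tag syntax, the tag-to-file map, and the `vector_search` callable are all hypothetical; the comment does not say what the real field looks like, and this version treats the vector lookup purely as a fallback.

```python
import re

# Hypothetical tag format, e.g. "[tag:refund-policy]" somewhere in the prompt.
TAG_PATTERN = re.compile(r"\[tag:([\w-]+)\]")

# Hypothetical map from tag to the document that must accompany it.
TAGGED_DOCS = {"refund-policy": "docs/refund-policy.md"}


def retrieve(prompt, vector_search):
    """Return (source, documents).

    `vector_search` is a caller-supplied function taking the prompt and
    returning a list of documents from the embedding index.
    """
    match = TAG_PATTERN.search(prompt)
    if match and match.group(1) in TAGGED_DOCS:
        # Exact tag hit: no embedding lookup needed.
        return "tag", [TAGGED_DOCS[match.group(1)]]
    return "vector", vector_search(prompt)
```

Returning the source alongside the documents makes it easy to log how often the cheap path is enough.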

alansaber 1/15/2026|

Most vectordb is a hammer looking for a nail

folli 1/15/2026||

I think it can be more efficient for two-step RAG so you can reuse the natural language query directly, but for agentic RAG it might indeed be overkill.

alansaber 1/16/2026||

Exactly this, agree completely

theahura 1/15/2026||

SQLite works shockingly well. The agents know how to write good queries, know how to chain queries, and can generally manipulate the DB however they need. At nori (https://usenori.ai/watchtower) we use SQLite + vec0 + fts5 for semantic and word search

scosman 1/15/2026||

Kiln wraps up all the parts in on app. Just drag and drop in files. You can easily compare different configs on your dataset: extraction methods, embedding model, search method (BM25, hybrid, vector), etc.

It uses LanceDB and has dozens of different extraction/embedding models to choose from. It even has evals for checking retrieval accuracy, including automatically generating the eval dataset.

You can use its UI, or call the RAG via MCP.

https://github.com/kiln-ai/kiln

https://docs.kiln.tech/docs/documents-and-search-rag

tebeka 1/15/2026||

https://duckdb.org/2024/05/03/vector-similarity-search-vss

jlarks32 1/15/2026||

+1 on this one, I've been pleasantly surprised by this for a small (<3GB) local project

m00dy 1/15/2026||

does duckdb scale well over large datasets for vector search ?

lgrebe 1/15/2026||

What order of magnitude would you define as „large“ in this case?

m00dy 1/15/2026||

like over 1tb.

cess11 1/15/2026||

Some people are using DuckDB for large datasets, https://duckdb.org/docs/stable/guides/performance/working_wi... , but you'd probably do some testing under the specific conditions of your rig to figure out if it is a good match or not.

riku_iki 1/15/2026||

It's clear that many DuckDB SQL queries can handle terabytes of data, but the question here was about vector search.

acutesoftware 1/15/2026|

I am using LangChain with a SQLite database - it works pretty well on a 16G GPU, but I started running it on a crappy NUC, which also worked, though with worse results.

The real lightbulb moment is when you realise the ONLY thing a RAG passes to the LLM is a short string of search results with small chunks of text. This changes it from 'magic' to 'ahh, ok - I need better search results'. With small models you cannot pass a lot of search results ( TOP_K=5 is probably the limit ), otherwise the small models 'forget context'.
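
To make the "short string of search results" point concrete, here is a minimal sketch of assembling the prompt from ranked chunks. The prompt wording and function name are made up, and this is not LangChain's API; the default of 5 simply mirrors the TOP_K figure mentioned above.

```python
TOP_K = 5  # the limit suggested above for small models


def build_prompt(question, ranked_chunks, top_k=TOP_K):
    """Keep only the best `top_k` chunks and paste them above the question."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(ranked_chunks[:top_k], start=1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

Everything the model knows about your documents is in that one string, so retrieval quality and chunk selection decide the answer quality.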

It is fun trying to get decent results, and it is a rabbit hole; the next step I am looking into is pre-summarising files and folders.

I open sourced the code I was using - https://github.com/acutesoftware/lifepim-ai-core

IXCoach 1/19/2026||

You can modify this; there are settings for how much context to pass and for chunk size.

We had to do this: the 3 best matches at about 1000 characters each were far more effective than the default I ran into of 15-20 snippets of 4 sentences each.

We also found a setting for "when do you cut off and/or start" the chunk, and set it to double new lines

Then just structured our agentic memory into meaningful chunks with 2 new lines between each, and it gelled perfectly.
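
A small sketch of chunking on double newlines, with a helper to spot chunks that overshoot a size budget. The function names and the 1000-character default are illustrative and do not correspond to any particular framework's settings.

```python
def split_on_blank_lines(text):
    """One chunk per block of text separated by a blank line."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def oversized(chunks, max_chars=1000):
    """Chunks longer than the budget, so they can be rewritten or split."""
    return [c for c in chunks if len(c) > max_chars]


memory = "User prefers short answers.\n\nProject uses Postgres 16.\n\n\n\nDeploys happen on Fridays."
chunks = split_on_blank_lines(memory)
```

If the source text is already written as self-contained blocks, each retrieved chunk carries one complete thought instead of a fragment cut mid-sentence.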

( hope this helps )

reactordev 1/15/2026||

You can expand your context window to something like 100,000 to prevent memory loss.
