It's crazy how people add bloat and complexity to their stack just because they want to do medium-scale RAG with ca. 2 million embeddings.
Here comes the punchline: you do not need a fancy vector database in this case. I stumbled upon https://github.com/sqliteai/sqlite-vector, a SQLite extension, and I wonder why no one did this before, because it simply implements a highly optimized brute-force search over the vectors, so you get sub-100ms queries over millions of vectors with perfect recall. It uses dynamic runtime dispatch to exploit whatever SIMD instructions your CPU has. Turns out this might be all you need. No need for a memory-hungry search index (like HNSW) or writing a huge index to disk (like DiskANN).
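To see why brute force holds up at this scale: a query is just one big matrix-vector product plus a partial sort, and SIMD hardware chews through that. Here is a rough NumPy sketch of the principle (not sqlite-vector's actual API; NumPy's BLAS backend provides the SIMD dispatch):

    import numpy as np

    rng = np.random.default_rng(0)
    n, dim = 2_000_000, 384                 # ~3 GB of float32 at this shape
    db = rng.standard_normal((n, dim), dtype=np.float32)
    db /= np.linalg.norm(db, axis=1, keepdims=True)   # normalize once for cosine

    def topk_exact(query, k=10):
        # One matrix-vector product (SIMD via BLAS) + O(n) partial selection.
        scores = db @ query
        idx = np.argpartition(scores, -k)[-k:]
        return idx[np.argsort(scores[idx])[::-1]]     # exact top-k, perfect recall

    q = rng.standard_normal(dim, dtype=np.float32)
    top = topk_exact(q / np.linalg.norm(q))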
> For production or managed service use, please contact SQLite Cloud, Inc for a commercial license.
Currently, every new solution is either baked into an existing database (Elastic, pgvector, Mongo, etc.) or an entirely separate system (Milvus, now Vectroid, etc.).
There is a clear argument in favor of the pgvector approach, since it simply brings new capabilities to 30 years of battle-tested database tech. That’s more compelling than something like Milvus that has to re-invent “the rest of the database.” And Milvus is also a second system that needs to be kept in sync with the source database.
But pgvector is still _just for Postgres_. It’s nice that it’s an extension, but in the same way Milvus has to reinvent the database, pgvector needs to reinvent the vector engine. I can’t load pgvector into DuckDB as an extension.
Is there any effort to make a pure, Unix-style, batteries-not-included "vector engine"? A library with best-in-class index building, retrieval, storage… that can be glued into a Postgres extension just as easily as it can be glued into a DuckDB extension?
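To make the ask concrete, here is a purely hypothetical sketch of that kind of interface (every name below is made up), small enough that a Postgres or DuckDB extension could wrap it:

    # Hypothetical 'vector engine' surface; none of these names exist today.
    from typing import Protocol, Sequence
    import numpy as np

    class VectorEngine(Protocol):
        def build(self, vectors: np.ndarray, metric: str = "cosine") -> None:
            """Build the index over a row-major float32 matrix."""

        def add(self, ids: Sequence[int], vectors: np.ndarray) -> None:
            """Incrementally insert vectors with caller-supplied ids."""

        def search(self, query: np.ndarray, k: int) -> list[tuple[int, float]]:
            """Return the k best (id, score) pairs, exact or approximate."""

        def serialize(self, path: str) -> None:
            """Write the index in a stable on-disk format a host DB can mmap."""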
DataFusion nailed this balance between an embedded query engine and a standalone database system. It brings just enough batteries that it isn't a super-generic thing that does nothing useful out of the box, but not so many that it has to compete with full database systems.
I believe the maintainers refer to it as “the IR of databases” and I’ve always liked that analogy. That’s what I’d like to see for vector engines.
Maybe what we need as a prerequisite is the equivalent of the Arrow/Parquet ecosystem for vectors. DataFusion really leverages those standards for interoperability and performance. This also goes a long way toward the architectural decisions you reference: Arrow and Parquet are a solid, "good enough" choice for in-memory and storage formats that are efficient, flexible, and well-supported. Is there something similar for vector storage?
Used in ClickHouse and a few other DBMS.
Disclaimer: I wrote duckdb-vss
Open source at https://github.com/spiceai/spiceai
Maybe not impossible using shared/lossy storage if they were sparsely scattered over a large space?
But anyways - minutes. Thanks.
Edit: Gemini suggested that this sort of (lossy) storage size could be achieved using "Product Quantization" (sub-vectors, clustering, cluster indices), giving an example of 256-dimensional vectors being stored at an average of 6 bits per vector, with ANN being one application that might use this.
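For reference, here is what textbook Product Quantization looks like mechanically. A common configuration is 8 sub-vectors x 256 centroids = 64 bits per vector, so the 6-bits-per-vector figure above would be far more aggressive than this minimal sketch (scikit-learn's KMeans, illustrative only):

    # Minimal Product Quantization: split each vector into m sub-vectors,
    # run k-means in each sub-space, store only a centroid index per sub-vector.
    import numpy as np
    from sklearn.cluster import KMeans

    def pq_train(X, m=8, k=256):
        sub = X.shape[1] // m               # requires d % m == 0
        return [KMeans(n_clusters=k, n_init=4).fit(X[:, i*sub:(i+1)*sub])
                for i in range(m)]

    def pq_encode(X, codebooks):
        sub = X.shape[1] // len(codebooks)  # m * log2(k) = 64 bits per vector here
        return np.stack([cb.predict(X[:, i*sub:(i+1)*sub])
                         for i, cb in enumerate(codebooks)], axis=1).astype(np.uint8)

    def pq_decode(codes, codebooks):
        # Lossy reconstruction: concatenate the chosen centroids per sub-space.
        return np.hstack([cb.cluster_centers_[codes[:, i]]
                          for i, cb in enumerate(codebooks)])

    X = np.random.default_rng(0).standard_normal((10_000, 256)).astype(np.float32)
    books = pq_train(X)
    codes = pq_encode(X, books)             # 8 bytes per vector instead of 1 KB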
Nitpick: could be wrong but I don’t think minutes is an SI derived unit.
Today, the differences are going to be performance, price, accuracy, flexibility, and some intangible UI elegance.
Performance: We actually INITIALLY built Vectroid for the use case of billions of vectors and near-single-digit-millisecond latency. During the process of building and talking to users, we found that there just aren't that many use cases (yet!) at that scale that require that latency. We still believe the market will get there, but it's not there today. So we re-focused on building a general-purpose vector search platform, but we stayed close to our high-performance roots, and we're seeing better query performance than the other serverless, object-storage-backed vector DBs. We think we can get way faster too.
Price: We optimized the heck out of this thing with object storage, preemptible virtual machines, etc. We've driven our cost down, and we're passing this on to the user, starting with a free tier of 100GB. Actual pricing beyond that coming soon.
Accuracy: In our initial testing, we see recall greater than or equal to the competitors out there, all while being faster.
Flexibility: We are going to have a self-managed version for users who want to run on their own infra, but admittedly, we don't have that today. Still working on it.
Other Product Elegance: My co-founder, Talip, made Hazelcast, and I've always been impressed by how easy it is to use and how elegant the end-to-end experience is. As we continue to develop Vectroid, that same level of polish and focus on the UX will be there. As an example, one neat thing we rolled out is direct import of data from Hugging Face. We have lots of other cool ideas.
Apologies for the long-winded answer. Feel free to ping us with any additional questions.
For 1024 dimensions, even with 8-bit quantization, you are looking at a terabyte of data for a billion vectors. Let's make it binary vectors; it is still 128GB of VRAM.
WAT?
That doesn't fit in anyone's video RAM.
Each MI325X has 256 GB of HBM, so you would need ~32 of them at 2 bytes per scalar (assuming 4096 dims).
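Spelling out the sizing arithmetic this subthread keeps doing (1B vectors assumed throughout):

    # Back-of-envelope memory sizing for 1 billion vectors.
    N = 1_000_000_000

    def size_gb(dims, bits_per_scalar):
        return N * dims * bits_per_scalar / 8 / 1e9

    print(size_gb(1024, 8))    # 1024-dim int8   -> 1024 GB, the ~1 TB above
    print(size_gb(1024, 1))    # 1024-dim binary ->  128 GB
    print(size_gb(4096, 16))   # 4096-dim fp16   -> 8192 GB, ~32 MI325X at 256 GB each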
B200 spec:
* 8 TB/s HBM bandwidth
* 10 PetaOPS assuming int8
* 186 GB of VRAM
If we work with 512-dimensional int8 embeddings, we need 512 GB of VRAM to hold them, so assuming we have an 8xB200 node (~$500k++), we can easily hold them (125M vectors per GPU).
It takes about 1000 OPs to do the dot product between two vectors, so we need to do 1000 * 1B = 1 TeraOP total; spread over 8 GPUs, that's 125 GigaOPs per GPU, a fraction of a ms.
Now the bottleneck will be data movement from HBM to the chips: since we have 125M vectors per GPU, i.e. 64 GB, we can move them in ~8 ms.
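For anyone who wants to check the arithmetic, here it is as a script (all figures are the assumptions above, not measurements):

    # Redoing the back-of-envelope numbers for 1B x 512-dim int8 on 8 GPUs.
    N, DIM, GPUS = 1_000_000_000, 512, 8
    HBM_BW = 8e12                            # 8 TB/s per GPU, per the spec above

    per_gpu_bytes = N * DIM / GPUS           # int8 -> 64 GB of vectors per GPU
    ops_per_gpu = N * 2 * DIM / GPUS         # ~2*DIM ops per dot product -> 128 GOPs
    scan_ms = per_gpu_bytes / HBM_BW * 1e3   # one full pass over HBM per query
    print(per_gpu_bytes / 1e9, ops_per_gpu / 1e9, scan_ms)   # 64.0, 128.0, 8.0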
Here you go, the most expensive vector search in history, giving you the same performance as a regular CPU-based vectorDB for only 1000x the price.
They show that with 4096-dimensional vectors, accuracy starts to fail at 250 million documents (fundamental limits of embedding models). For 512-dim, it's just 500k.
Is 1 billion vectors practical?
If you mostly just want to find a particular single vector if possible and don't care so much what the second-best result is, you can get away with much smaller embeddings.
And if you do want to cover all possible pairs, 6500 dimensions or so should be enough. (Their empirical results roughly fit a cubic polynomial.)
I run a lot of search-related benchmarks (https://github.com/ashvardanian) and I'm curious whether you've compared against other engines on the same hardware setup, tracking recall, NDCG, indexing, and query speeds.
1. Has a technical system they think could be worth a fortune to large enterprises, containing at least a few insights that are novel to the industry.
2. Knows that competitors and open source alternatives could copy/implement these in a year or so if the product starts off open source.
3. Has to put food on the table and doesn’t want to give massive corporations extremely valuable software for free.
Open source has its place, but it is, IMO, one of the ways we hand monopolies massive value for free. There are plenty of open source alternatives around for vector DBs. Do we (developers) need to give everything away to the rich?
(not)
Secondly, as far as I know, the blocker with approximate nearest-neighbor search is often not insertion but search. And if this search were worth a fortune to me, I'd simply embarrassingly parallelize it on CPUs or GPUs.
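"Embarrassingly parallel" really is trivial here: shard the matrix, scan each shard on its own core, merge the per-shard top-k. A rough sketch (illustrative, not tuned):

    # Parallel exact search: shard, scan, merge. Threads are fine because
    # NumPy releases the GIL inside the matmul; assumes each shard has > k rows.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def scan_shard(shard, offset, query, k):
        scores = shard @ query
        idx = np.argpartition(scores, -k)[-k:]
        return [(int(i) + offset, float(scores[i])) for i in idx]

    def parallel_topk(db, query, k=10, workers=8):
        shards = np.array_split(db, workers)
        offsets = np.cumsum([0] + [len(s) for s in shards[:-1]])
        with ThreadPoolExecutor(workers) as pool:
            futures = [pool.submit(scan_shard, s, int(o), query, k)
                       for s, o in zip(shards, offsets)]
            hits = [h for f in futures for h in f.result()]
        return sorted(hits, key=lambda t: -t[1])[:k]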
Vectroid co-founder here. We're huge fans of open source. My co-founder, Talip, made Hazelcast, which is open source.
It might make sense to open source all or part of Vectroid at some point in the future, but at the moment, we feel that would slow us down.
I hate vendor lock-in just as much as the next person. I believe data portability is the ACTUAL counter to vendor lock-in. If we have clean APIs to get your data in, get your data out, and the ability to bulk export your data (which we need to implement soon!), then there's less of a concern, in my opinion.
I also totally understand and respect that some people only want open source software. I'm certainly like that w/ my homelab setup! Except for Plex... Love Plex... Usually.