Posted by mfiguiere 2 days ago
The good-ol' copy-on-write memmapped B-Tree is still widely used, even on newer key-value stores like redb (I am more familiar with the Rust ecosystem), and it claims to outperform rocksdb (the go-to LSM-tree kvs) on most metrics other than batch writes [1] (although probably biased).
LMDB is still widely used (one of the main classic B-tree kvs), and Postgres, SQLite and MongoDB (WiredTiger), among others, are still backed by B-Tree key-value stores. The key-value storage backends tend to be relatively easy to swap, and I don't know of major efforts to migrate popular databases to a LSM tree backend.
[1] https://www.redb.org/post/2023/06/16/1-0-stable-release/
Lately at work I've done C++ optimization tricks like inplace_map, inplace_string, placement new to inline map-like iterators inside a view adapter's iterators and putting that byte buffer as the first member of the class to not incur std::max_align_t padding with the other members. At a higher architecture level, I wrote a data model binding library that can serialize JSON, YAML and CBOR documents to an output iterator one byte at a time without incurring heap allocation in most cases.
This is because I work on an embedded system with 640 KiB of SRAM and given the sheer amount of run-time data it will have to handle and produce, I'm wary not only about heap usage, but also heap fragmentation.
AI will readily identify such tricks, it can even help implement them, but unless constrained otherwise AI will pick the most expedient solution that answers the question (note that I didn't say answers the problem).
With SOTA models it all depends on how you drive them.
The market is telling us that through increased hardware prices.
LLMs being very powerful means that we need to start being smarter about allocating resources. Should chat apps really eat up gigabytes of RAM and be entitled to cores, when we could use that for inference?
LLMs aren't even close to the level of knowledge distillation capacity a human has yet.
Skip trees/graphs sound interesting, but I can't think of any use case for them off the top of my head.
My first instinctive idea would be that there is an optimal distance, maybe based on absolute distance or by function of list size or frequency of access or whatever. Leaving the promotion to randomness is counter intuitive to me.
If you knew all the items ahead of time and didn't require adding/removing them incrementally you could build optimal skiplist without randomness. But in such situations you likely don't need a skip list or BST at all. Binary search on sorted list will do.
If you're saving multiple ancestors in ancestors_between why not save all of them?
And if the goal is to access the info for all of the ancestors, what makes it more likely than average that your ancestors aren't on the same layer as yourself (i.e. that e.g. half of them aren in ancestors_between)?
Because, if a fixed ratio (50 or even 10%) of an arbitrary node's ancestors is at the same layer, isn't complexity still the same (only reduced by a constant factor)?