What are skiplists good for?

Posted by mfiguiere 2 days ago

What are skiplists good for?(antithesis.com)

261 points | 63 commentspage 2

mrjn 21 hours ago|

skiplists form the basis of in-memory tables used by LSM trees, which are themselves the basis of most modern DBs (written post 2005).

oersted 5 hours ago|

Not sure if it's true that LSM trees are so dominant, it was a bit of a fad a few years ago during the NoSQL hype. They can be a good default for concurrent write-heavy workloads like analytics or logging, but they can be tricky to tune.

The good-ol' copy-on-write memmapped B-Tree is still widely used, even on newer key-value stores like redb (I am more familiar with the Rust ecosystem), and it claims to outperform rocksdb (the go-to LSM-tree kvs) on most metrics other than batch writes [1] (although probably biased).

LMDB is still widely used (one of the main classic B-tree kvs), and Postgres, SQLite and MongoDB (WiredTiger), among others, are still backed by B-Tree key-value stores. The key-value storage backends tend to be relatively easy to swap, and I don't know of major efforts to migrate popular databases to a LSM tree backend.

[1] https://www.redb.org/post/2023/06/16/1-0-stable-release/

teiferer 19 hours ago||

In the age of agentic programming and the ever increasing pressure to ship faster, I'm afraid this kind of knowledge will become more and more fringe, even moreso than it is today. Who has the time to think through the intricacies of parallel data structures? Clearly we'll just throw more hardware at problems, write yet another service/api/http endpoint and move on to the next hype. The LLM figures out the algorithms and we soon lose the skills to develop new ones. And tell each other the scifi BS myth that "AI" will invent new data structures in the future so we don't even beed humans in the loop.

boricj 18 hours ago||

AI is like a genie: be careful what you wish for or you'll get what you asked for.

Lately at work I've done C++ optimization tricks like inplace_map, inplace_string, placement new to inline map-like iterators inside a view adapter's iterators and putting that byte buffer as the first member of the class to not incur std::max_align_t padding with the other members. At a higher architecture level, I wrote a data model binding library that can serialize JSON, YAML and CBOR documents to an output iterator one byte at a time without incurring heap allocation in most cases.

This is because I work on an embedded system with 640 KiB of SRAM and given the sheer amount of run-time data it will have to handle and produce, I'm wary not only about heap usage, but also heap fragmentation.

AI will readily identify such tricks, it can even help implement them, but unless constrained otherwise AI will pick the most expedient solution that answers the question (note that I didn't say answers the problem).

hakanderyal 18 hours ago||

This is also the reason why we have two polar opposite views on AI. “Slop generator” vs “Next best thing since sliced bread”.

With SOTA models it all depends on how you drive them.

timeinput 11 hours ago||

All my old software before AI was self documenting and didn't need comments -- it just was obvious. Today my prompts never make slop. I'm a really good driver.

dgb23 19 hours ago|||

I think the opposite is the case. We increasingly need to care more about performance and know how to leverage hardware better.

The market is telling us that through increased hardware prices.

LLMs being very powerful means that we need to start being smarter about allocating resources. Should chat apps really eat up gigabytes of RAM and be entitled to cores, when we could use that for inference?

ozgrakkurt 15 hours ago|||

People underestimate the effect of knowledge accumulation that happens when learning from high quality sources imo.

LLMs aren't even close to the level of knowledge distillation capacity a human has yet.

Andys 7 hours ago||

Or..? A golden era for people who want to think of new things and test out their ideas quickly by having AI code it up.

tooltower 20 hours ago||

In my personal projects, I've used it to insert/delete transactions in a ledger. I wanted to be able to update/query the account balance fast. Like the article says, "fold operations".

siddsukh 5 hours ago||

The skiptree is a great example of constraint-driven invention - you didn't set out to build a new data structure, you had a specific constraint (BigQuery's poor point lookups) and the solution naturally emerged from combining two existing ideas. The part about writing a JavaScript compiler to generate kilobyte-scale SQL queries is underappreciated, too. Most engineers would have switched databases, but building a compiler when output is too complex to write by hand is almost always the right call.

medbar 16 hours ago||

Skiplist operations are local for the most part, which makes it easier to write thread-safe code for than b-trees in practice. Anecdotally, they were a nice implementation problem for my Java class in uni. But I liked working with b-lists more.

Skip trees/graphs sound interesting, but I can't think of any use case for them off the top of my head.

shawn_w 18 hours ago||

Random access with similar performance to a balanced binary tree, and ordered iteration as simple as a linked list. It's a nice combination. (Of course, so is a binary search of a sorted array, which I lean more towards these days unless doing a lot of random insertions and deletions throughout the life of the mapping).

aaa_aaa 16 hours ago||

Almost nothing. My friend and I used it once (in a rather obscure problem). Then used simple lists with some tricks with better performance because of the locality etc.

random3 10 hours ago|

You and your friend seem to make a dream team.

torben-friis 17 hours ago||

Could someone provide intuitive understanding for why the "express lanes" in a skip list are created probabilistically?

My first instinctive idea would be that there is an optimal distance, maybe based on absolute distance or by function of list size or frequency of access or whatever. Leaving the promotion to randomness is counter intuitive to me.

Karliss 14 hours ago||

The same reason naive BST tree (non self balancing one) doesn't work. You need to be able to add and remove elements without any knowledge of future operations and without malicious input being able to force a degenerate structure which is equivalent to a flat linked list. It's a bit similar how treap achieves somewhat balanced tree using randomness instead of deterministic rebalancing algorithms.

If you knew all the items ahead of time and didn't require adding/removing them incrementally you could build optimal skiplist without randomness. But in such situations you likely don't need a skip list or BST at all. Binary search on sorted list will do.

robotresearcher 5 hours ago|||

There likely is an optimal stride, as a function of the platform and the data you are storing. There is also a worst-case stride. You probably don't know what the good and bad strides are, and because they are data-dependent, they can change over time. Randomness gives you a very high probability of avoiding very bad case performance. And (to rephrase what another comment says) avoiding a constant stride avoids unlucky 'resonances' where the interval interacts with some other fixed property of your system or data.

ozgrakkurt 9 hours ago||

If it has a constant stride then that stride can align with something upstream and result in horrible performance

ncruces 6 hours ago||

I guess I'm missing the point of this, so I'm probably looking at it wrong.

If you're saving multiple ancestors in ancestors_between why not save all of them?

And if the goal is to access the info for all of the ancestors, what makes it more likely than average that your ancestors aren't on the same layer as yourself (i.e. that e.g. half of them aren in ancestors_between)?

Because, if a fixed ratio (50 or even 10%) of an arbitrary node's ancestors is at the same layer, isn't complexity still the same (only reduced by a constant factor)?

fnordpiglet 17 hours ago|

A major global bank operated all trading, especially the complex stuff, off of a globally replicated skip list.

techolic 11 hours ago|

Would you mind sharing more?

More comments...