What Does a Database for SSDs Look Like?

Posted by charleshn 12/20/2025

What Does a Database for SSDs Look Like? (brooker.co.za)

148 points | 121 comments | page 3

dist1ll 12/20/2025|

Is there more detail on the design of the distributed multi-AZ journal? That feels like the meat of the architecture.

raggi 12/20/2025||

It may not matter for clouds with massive margins but there are substantial opportunities for optimizing wear.

joek1301 12/20/2025||

I would think hyperscalers stand to benefit the most from optimizing wear!

loeg 12/20/2025||

We care about wear to the extent we can get the expected 5 years out of SSDs as a capital asset, but below that threshold it doesn't really matter to us.

adsharma 12/20/2025||

Re: keeping the relational model

This made sense for product catalogs, employee/department records, and e-commerce-type use cases.

But it's an extremely poor fit for storing a world model that LLMs are building in an opaque and probabilistic way.

Prediction: a new data model will take over in the next 5 years. It might use some principles from many decades of relational DBs, but will also be different in fundamental ways.

ghqqwwee 12/20/2025||

I’m a bit disappointed the article doesn’t mention Aerospike. It’s not an RDBMS but a key-value database commonly used in adtech, and extremely performant for that use case. It’s actually designed for SSDs, which makes it possible to persist all writes even when the NIC is saturated with write operations. Of course, the aggregate bandwidth of the attached SSDs needs to exceed the throughput of the NIC, but not by much; there’s very little overhead in the software.
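
The bandwidth condition in the comment above comes down to simple arithmetic. A minimal sketch follows; the function name, the `overhead_fraction` knob, and every figure in the usage note are hypothetical and do not come from Aerospike or the article.

```python
def can_persist_at_line_rate(nic_gbit_per_s, ssd_write_gbyte_per_s, overhead_fraction=0.0):
    """Return True if the attached SSDs can absorb writes arriving at full NIC speed.

    nic_gbit_per_s: NIC line rate in gigabits per second.
    ssd_write_gbyte_per_s: list of sustained write bandwidths, one per SSD, in GB/s.
    overhead_fraction: assumed extra bytes written per payload byte (hypothetical).
    """
    nic_gbyte_per_s = nic_gbit_per_s / 8  # convert bits to bytes
    required = nic_gbyte_per_s * (1 + overhead_fraction)
    return sum(ssd_write_gbyte_per_s) >= required
```

For example, a 100 Gbit/s NIC delivers at most 12.5 GB/s, so five drives sustaining 3 GB/s each (15 GB/s in total) would keep up under these assumed numbers, while four such drives (12 GB/s) would not.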

CraigJPerry 12/20/2025|

How does that work? Is that an open source solution like the ZCRX stuff with io_uring, or does it require proprietary hardware setups? I'm hopeful that the open source solutions today are competitive.

I was familiar with Solarflare and Mellanox zero-copy setups in a previous fintech role, but at that time it all relied on black boxes (specifically out-of-tree kernel modules, delivered as blobs without DKMS or equivalent support, a real headache to live with) that didn't always work perfectly. It was pretty frustrating overall because the customer paying the bill (rightfully) had less than zero tolerance for performance fluctuations. And fluctuations were annoyingly common, despite my best efforts (dedicating a core to IRQ handling, bringing up the kernel masked to another core, then pinning the user-space workloads to specific cores, and stuff like that). It was quite an extreme setup: a GPS-disciplined oscillator with millimetre-perfect antenna wiring for the NTP setup, etc. We built two identical setups, one in Hong Kong and one in New York. Very good fun overall, but frustrating because of stack immaturity at that time.
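
The "pinning the user space workloads to specific cores" step can be sketched with Python's standard library. This is only an illustration of the pinning part, assuming Linux (where `os.sched_setaffinity` is available); the helper name `pin_current_process` is hypothetical, and IRQ steering and kernel core masking are separate system-level settings not shown here.

```python
import os

def pin_current_process(cores):
    """Restrict the calling process to the given CPU cores (Linux only).

    Returns the affinity set the kernel reports after the change.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not available on this platform")
    os.sched_setaffinity(0, set(cores))  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Calling `pin_current_process([2])` would confine the process to core 2, provided that core is in the set the process is currently allowed to use.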

toolslive 12/20/2025||

but... but... SSDs/NVMe devices are not really block devices. Not wrangling them into a block device interface, but using their full set of features, can already yield major improvements. Two examples: metadata and indexes need smaller granularities than data, and an NVMe device can do this quite naturally. Also, data can be sent directly from the device to the network without the CPU being involved.

sreekanth850 12/20/2025||

Unpopular opinion: databases were designed for 1980s-90s mechanics, and the one thing that never innovates is the DB. It still uses B-trees/LSM trees that were optimized for spinning disks. The inefficiency is masked by hardware innovation and speed (Moore's Law).

cmrdporcupine 12/20/2025||

There's plenty of innovation in DB storage tech, but the hardware interface itself is still page-based.

It turns out that btrees are still efficient for this work. At least until the hardware vendors deign to give us an interface to SSD that looks more like RAM.

Reading over https://www.cs.cit.tum.de/dis/research/leanstore/ and associated papers and follow up work is recommended.

In the meantime, with RAM prices skyrocketing, work and research in buffer & page management for greater-than-main-memory-sized DBs is set to be Hot Stuff again.

I like working in this area.

sreekanth850 12/20/2025||

B-trees are not optimal for SSDs, and the only reason we still use them is the legacy constraints of page-oriented storage and POSIX block interfaces. We pay a lot of unnecessary write amplification, metadata churn, and small random writes because we’re still force-fitting tree structures into a block device abstraction.
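
The write amplification mentioned above can be estimated with a rough calculation: when a small record changes, a page-oriented store rewrites at least the whole page that contains it. The sketch below uses hypothetical names and default sizes, and it deliberately ignores write-ahead logging and the SSD's own internal amplification.

```python
def page_write_amplification(record_bytes, page_bytes=4096, pages_touched=1):
    """Bytes physically written per logical byte changed when whole pages are rewritten.

    record_bytes: size of the logical change.
    page_bytes: page size of the storage engine (assumed default of 4 KiB).
    pages_touched: pages rewritten per update, e.g. 2 if a parent page also changes.
    """
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    return (page_bytes * pages_touched) / record_bytes
```

Under these assumptions, updating a 128-byte record in a 4 KiB page writes 32 times more data than actually changed, and 128 times more with 16 KiB pages.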

cmrdporcupine 12/20/2025||

I don't think we're disagreeing. But the issue is at the boundary between software and hardware, which the hardware device manufacturers have dictated, not further up.

nly 12/20/2025||

Optimising hardware to run existing software is how you sell your hardware.

The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding.

High performance networking is another area like this. High performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.

gethly 12/20/2025||

> The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding

I think this was one of the main points, and I want to emphasise this, behind the Odin programming language.

javaunsafe2019 12/20/2025||

AI slop for sure
