Reduce bandwidth costs with dm-cache: fast local SSD caching for network storage

Posted by tlar 3 days ago

Reduce bandwidth costs with dm-cache: fast local SSD caching for network storage(devcenter.upsun.com)

57 points | 17 comments

loeg 7 hours ago|

Historically, I believe bcache offered a better design than dm-cache. I wonder if that has changed at all?

That said, for this use, I would be very concerned about coherency issues putting any cache in front of the actual distributed filesystem. (Unless this is the only node doing writes, I guess?)

rbranson 10 hours ago||

> For e-commerce workloads, the performance benefit of write-back mode isn’t worth the data integrity risk. Our customers depend on transactional consistency, and write-through mode ensures every write operation is safely committed to our replicated Ceph storage before the application considers it complete.

Unless the writer is always overwriting entire files at once blindly (doesn't read-then-write), consistency requires consistency reads AND writes. Even then, potential ordering issues creep in. It would be really interesting to hear how they deal with it.

twotwotwo 6 hours ago|

They mention it as a block device, and the diagram makes it look like there's one reader. If so, this seems like it has the same function as the page cache in RAM, just saving reads, and looks a lot like https://discord.com/blog/how-discord-supercharges-network-di... (which mentions dm-cache too).

If so, safe enough, though if they're going to do that, why stop at 512MB? The big win of Flash would be that you could go much bigger.

mrkurt 9 hours ago||

dm-cache writeback mode is both amazing and terrifying. It reorders writes, so not only do you lose data if the cache fails, you probably just corrupted the entire backing disk.

0xbadcafebee 10 hours ago||

This is good timing; I was just looking at a use-case where we need more iops and the only immediate solutions involve allocating way more high-performance disks or network storage. The problem with a cache is having a large dataset with random access, so repeated cache hits might not be frequent. But I had a theory that you could still make an impact on performance and lower your storage performance requirements. I may try this out, but it is block-level, so it's a bit intrusive.

Another option I haven't tried is tmpfs with an overlay. Initial access is RAM, falls back to underlying slower storage. Since I'm mostly doing reads, should be fine, writes can go to the slower disk mount. No block storage changes needed.

mgerdts 6 hours ago||

I remember seeing another strategy where a remote block device was (lazily?) mirrored to a local SSD. The mirror was configured such that reads from the local device were preferred and writes would go to both devices. I think this was done by someone on GCP.

Does this ring any bells? I’ve searched for this a time or two and can’t find it again.

twotwotwo 6 hours ago||

Discord: https://discord.com/blog/how-discord-supercharges-network-di...

(Somehow the name "SuperDisks" was burned into my brain for this. Although Discord's post does use 'Super-Disks' in a section header, if you search the Internet for SuperDisks you'll everything's about the LS-120 floppies that went by that name.)

zipmapfoldright 1 hour ago|||

Google's L4 cache? https://cloud.google.com/blog/products/storage-data-transfer...

Conch5606 2 hours ago|||

This is not quite the same, it's for migrating from one device to another while keeping the file system writable, but it's quite neat: dm-clone[1]

I've used it before for a low downtime migration of VMs between two machines - it was a personal project and I could have just kept the VM offline for the migration, but it was fun to play around with it.

You give it a read-only backing device and a writable device that's at least as big. It will slowly copy the data from the read-only device to the writable device. If a read is issued to the dm-clone target it's either gotten from the writable device if it's already cloned or forwarded to the read-only device. Writes are always going to the writable device and afterwards the read-only device is ignored for that block.

It's not the fastest, but it's relatively easy to set up, even though using device mapper directly is a bit clunky. It's also not super efficient, IIRC if a read goes to a chunk that hasn't been copied yet, that's used to give the data to the reading program, but it's not stored on the writable device, so it has to be fetched again. If the file system being copied isn't full, it's a good idea to run trimming after creating the dm-clone target as discarded blocks are marked as not needing to be fetched.

[1] https://docs.kernel.org/admin-guide/device-mapper/dm-clone.h...

cperciva 6 hours ago||

I've done this on EC2 -- in particular back in the days when EBS billed per I/O (as opposed to using a "reserved IOPs" model where you say up front how much I/O performance you need). I haven't bothered recently since EBS performance is good enough for most purposes and there's no automatic cost savings.

kayson 10 hours ago||

I was looking into SSD caching recently and decided to go with Open-CAS instead, which should be more performant (didn't test it personally): https://github.com/Open-CAS/open-cas-linux/issues/1221

It's maintained by Intel and Huawei and the devs were very responsive.

mgerdts 8 hours ago|

Is Intel still working on it? Open-CAS bdev support was nearly removed from SPDK at a time when Intel still employed a SPDK development and QA team. Huawei stepped in to offer support to keep it alive, preventing its removal.

I’ve been under the impression that Intel got rid of pretty much all of their storage software employees.

quickslowdown 7 hours ago||

I mean to ask a genuine, good faith question here, because I don't know much about Huawei's development team.

My head goes to the xz attack when I hear that Intel decided to stop supporting an open source tool, and a Chinese company known to sell backdoored equipment "steps in" to continue development, and it makes me suspicious & concerned.

This is to say nothing of the quality of the software they write or its functionality, they may be "good stewards" of it, but does it seem paranoid to be unsure of that arrangement?

AtlasBarfed 10 hours ago||

"When deploying infrastructure across multiple AWS availability zones (AZs), bandwidth costs can become a significant operational expense"

An expense in the age of 100gbit networking that is entirely because AWS can get away with charging the suckers, um, customers for it

0xbadcafebee 10 hours ago||

AZs are whole datacenters, so I imagine their backbone bandwidth between AZs is a fraction of total bandwidth inside the DC. If they didn't charge it'd probably get saturated and then there's not much point in using them for reliability.

The internet egress price is where they're bastards.

martinald 10 hours ago||

Definitely not. Azure doesn't charge for intra region costs FWIW.

Getting terabits and terabits of 'private' interconnect is unbelievably cheap at amazon scale. AWS even own some of their own cables and have plans to build more.

There is _so_ much capacity available on fiber links. For example one newish (Anjana) cable between the US and Europe has 480Tbit/sec capacity. That's just one cable. And that could probably be upgraded to 10-20x that already with newer modulation techniques.

random3 10 hours ago||

reduce network bandwidth from the network attaches SSD volumes, yes?