If you need to know it’s been persisted to non-volatile storage then you need to own the full stack of every piece of software between the OS and the actual physical memory.
Every managed flash drive is going to have layers and layers of complexity and caching and things you simply can’t easily control or really understand. Don’t trust it unless you know exactly how it works all the way down.
At my last company we needed to disable the disk write cache on every reboot, and we also heard plenty of industry stories about the underlying firmware implementations from the Oxide Computer podcast [1]. Yes, to provide a truly reliable service, you need to evaluate the underlying hardware settings case by case.
Famously not, as the man page says.
It is also said later in the article:
> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.
So I'm not sure why the dirent sync is claimed earlier.
Is there something else weird that can happen if the file is not new and not unlinked, but changes are made that would alter its directory entry in some way?
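For reference, a minimal sketch of the pattern the article (and POSIX) is describing, with hypothetical paths and a hypothetical `create_durably` helper; error handling is abbreviated:

```c
#include <fcntl.h>
#include <unistd.h>

/* Create a file and make both its contents and its existence durable:
   fsync the file itself, then fsync the parent directory so the new
   directory entry is persisted too. */
int create_durably(const char *dir_path, const char *file_path,
                   const void *buf, size_t len) {
    int fd = open(file_path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    int dirfd = open(dir_path, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);   /* persists the dirent for the newly created file */
    close(dirfd);
    return rc;
}
```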
Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.
And I wouldn't assume they meant that number to be per record in the first place.
The way I would go is to say that you multiply the number of objects by the AFR, and that's close to the actual losses in most years. You can then exclude WW3 and the late Holocene extinction event from your consideration. Or simple bankruptcy, for that matter. If your employer is gone, you don't care about its data any more.
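As a toy illustration of that multiplication (all numbers are made up):

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical numbers: 1e9 stored objects, per-object annual
       failure rate of 1e-11 (i.e. "eleven nines" of durability). */
    double objects = 1e9;
    double afr     = 1e-11;
    printf("expected object losses per year: %.4f\n", objects * afr);
    return 0;
}
```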
Durability is a knob. If you have enough data, or turn the knob too far in the direction of durability, you will simply bankrupt yourself or maybe drown your service in latency. It makes sense that you would have storage services that provide different levels of durability.
If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.
That was very helpful when choosing durability levels.
AFRs and discussions about different failure scenarios are the bare minimum. The bare minimum for scenarios is disk loss, total machine loss, and data center loss. This is just my take on things. I don’t care if something is on disk or not. I do care what happens when a sector on disk goes bad, when a faulty power supply destroys all the disks in a machine, or when a data center floods.
That forces you to think about things like whether you want to turn on synchronous replication.
My beef is with database systems that use the argument you made further up the thread to skip fsync and juice their performance numbers. Data is not “durable” if turning off the machines storing it means it’s lost; that’s a category difference, not a pure probability difference as you are claiming.
It is of course totally fine not to store data on durable media and to say the risk of devops doing a coordinated reboot is as low as the risk of RAID disk data loss, but then don’t use the word “durable”.
I don't see how a virtualised NVMe disk is different from a physical one.
Especially if you don't have control over the underlying hardware (so you don't know whether it has power-loss-protection (PLP) SSDs), you should send the FUA.
> O_DATA_SYNC
You mean `O_DSYNC`?
Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?
Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?
My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you believe that avoiding `fdatasync()` avoids the "can trigger an order of magnitude more I/O" problem, you would just re-introduce it with `O_DSYNC`.
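For concreteness, a minimal sketch of the two paths being compared (hypothetical file name, error handling omitted):

```c
#include <fcntl.h>
#include <unistd.h>

/* Path A: plain write followed by an explicit fdatasync(). */
void write_then_fdatasync(const void *buf, size_t len, off_t off) {
    int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
    pwrite(fd, buf, len, off);
    fdatasync(fd);   /* flush the data (and any metadata needed to read it back) */
    close(fd);
}

/* Path B: O_DSYNC -- per open(2), each write completes as if it were
   followed by fdatasync(). */
void write_with_odsync(const void *buf, size_t len, off_t off) {
    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    pwrite(fd, buf, len, off);   /* returns only once the data is as durable as after fdatasync() */
    close(fd);
}
```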
However, I suspect that that whole consideration is pointless:
The only thing that makes your O_DIRECT+preallocated-only-overwrites writes safe are enterprise SSDs with Power Loss Protection (PLP), usually capacitors.
On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.
[1] https://news.ycombinator.com/item?id=46532675
[2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being 6 us.

Let me know if I got anything wrong.
The only remaining question is: Why do you then see any difference in your benchmark?
    Configuration              Throughput (obj/s)
    ---------------------------------------------
    ext4 + O_DIRECT + fsync    116,041
    Our engine                 190,985
That is what I'd find very valuable to investigate.

The first suspicion I have is: Shouldn't you be measuring `+ fdatasync` instead?
So I'd be interested in:
ext4 + O_DIRECT + fdatasync
ext4 + O_DIRECT + O_DSYNC
Our engine + O_DSYNC (which you're suggesting above)
Also I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.

For the benchmark results: the gains were mainly due to metadata management. We have implemented our own KV store, see the internals here [1], which is more efficient than ext4 namespace management, even after very aggressive fs tuning for that [2] (plus 65,536-way sharding for each leveled dir).
[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...
[2] https://github.com/fractalbits-labs/fractalbits/commit/12109...
That is where the disparity lies. Reading back the data after the device reports that it has been written offers little in the way of additional assurance that it was successfully written. But if you report successful writes without syncing, it is near certain that you'll lose data on every power loss.
My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.
If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).
If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.
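For anyone who hasn't used it, a rough sketch of the usual two-phase pattern (hypothetical helper names; note that per the man page, sync_file_range only controls writeback of data pages and does not flush the device write cache or file metadata, so it is not a durability barrier on its own):

```c
#define _GNU_SOURCE
#include <fcntl.h>

/* Phase 1: kick off writeback of a byte range without blocking. */
void start_writeback(int fd, off_t off, off_t len) {
    sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Phase 2 (later): wait for that range to actually be written back,
   including any writeback started since phase 1. */
void wait_writeback(int fd, off_t off, off_t len) {
    sync_file_range(fd, off, len,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
}
```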
This is not a new trick. It has been used in many storage engine designs to effect durability without an fsync.
If we give ourselves two definitions of persisted - logical (WAL or write) and physical (index or read) - it seems like we can maintain the invariant that P < L by (1) keeping an in-memory view of P-L that we consult on every read to assert the delta, and (2) an expensive but asynchronous flush path for updating P, driven from reads verifying that L has landed. Have we then patched all the holes(?)
edit: of course one of the root problems here is the drive lying, so how can we know that some log block has actually committed, so that we can update P?
For example, if you map folders like /foo and /foo/bar to numeric IDs, then each file can simply refer to its parent folder. Renaming a folder, or moving a folder to a new parent, does not need to update any files.
You can take this a step further and have a three-level split: Tree, file-tree join table, and files. The tree describes the hierarchical structure of folders (which changes more rarely than files do), while the file-tree join table is essentially [folder_id, file_id]. When a file is moved, only the join table (which is much smaller than the files and super sortable and compressible) must be updated.
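A rough sketch of what that three-level split might look like (names and fields are purely illustrative):

```c
#include <stdint.h>

/* 1. Tree: the folder hierarchy, which changes rarely. */
struct Folder {
    uint64_t id;
    uint64_t parent_id;   /* 0 for the root */
    char     name[256];   /* "foo", "bar", ... */
};

/* 2. File-tree join table: one small, sortable, compressible row per file placement. */
struct FolderEntry {
    uint64_t folder_id;
    uint64_t file_id;
};

/* 3. Files: the bulky per-file metadata, untouched by renames and moves. */
struct File {
    uint64_t id;
    uint64_t size;
    /* checksums, timestamps, data pointers, ... */
};
```

Renaming /foo then touches a single Folder row, and moving a file touches a single FolderEntry row.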
I take the point that updating multiple discrete pieces of information puts more demand on the transactional layer, which has to ensure atomicity and consistency. But I'm surprised it wasn't even mentioned as an alternative that was evaluated and rejected. The article starts out with the premise that a flat key/value approach is the only choice on the table.
Bookmarked your whole blog for later consumption, interesting stuff!
[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...
EDIT: sketchy from the perspective of answering "what exactly are the guarantees?"
Some storage devices guarantee durability of non-persisted writes; that is explicitly part of their model. Consequently, the entire durable write path is the storage device completing a DMA read of the write buffer.
The underlying assumptions will not hold true for every environment. However, they will hold true for many, and you can check most (all?) of them at runtime.
I'm not saying it's impossible, but typically people who want to lean on hardware guarantees for extra performance control more of the stack.
It will also make system initialization faster, since right now we need to write all zeros to make ext4/xfs actually initialize the extents as "allocated".
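A minimal sketch of that zero-fill initialization (hypothetical file name and sizes), assuming the motivation is that fallocate() alone leaves ext4/xfs extents in the "unwritten" state, so later O_DIRECT overwrites would still trigger metadata updates:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t file_size = 1UL << 30;   /* 1 GiB segment */
    const size_t chunk     = 1UL << 20;   /* 1 MiB writes  */

    int fd = open("segment.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    void *zeros;
    if (posix_memalign(&zeros, 4096, chunk) != 0) return 1;  /* aligned, so a later O_DIRECT reopen works too */
    memset(zeros, 0, chunk);

    /* Write real zeros through the whole file so every extent is "written". */
    for (size_t off = 0; off < file_size; off += chunk)
        if (pwrite(fd, zeros, chunk, (off_t)off) != (ssize_t)chunk) return 1;

    fsync(fd);   /* file size and extents are now durable; overwrites need no metadata I/O */
    close(fd);
    free(zeros);
    return 0;
}
```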
I also do have an optional WAL. Maybe I should add an additional mode that disables fsync only for the WAL. I don't think it would be a good idea, though. My WAL does use checksums and sequence numbers etc. to prevent committing wrong data.