This design ACKs writes that aren't yet durably persisted (to the journal or data areas). That might be ok, but it might not. It's certainly unusual not to at least persist the journal update.
zzsheng 1 day ago|
nope. we will not ack any write that is not in the data or journal. please check the PUT details in the blog.
loeg 1 day ago||
You initiate a write to the journal, but do not sync it before ACKing to the client.
zzsheng 1 day ago||
the journal file is pre-allocated and we use direct I/O for journal writes, so there is no need to call fsync.
loeg 1 day ago|||
Again, it is not durably persisted before acking to the client. Like I said earlier, that might be fine for your durability model, but it is unusual.
thomas_fa 1 day ago||
We wait for the Bss data and journal direct I/O to complete, and only then ack (send the response back to api_server) in the callback function. What you are implying is what S3 actually does, as you can see from their paper[1], and we are stronger than that.
So basically, you are writing data without any guarantee that it's actually written? "YOLO mode", but for data written to a device?
Would you be so kind as to explain what happens in a power-loss scenario?
ovaistariq 1 day ago||
There is no way to reliably prove that bytes have made their way to the disk without issuing fsync. Thus, without it you cannot guarantee that writes ACKed to the client will survive any subsequent failure.
dboreham 2 days ago||
Almost full-circle back to when Oracle took over the entire volume and implemented its own filesystem.
dale_glass 2 days ago|
I wonder why this is not more common. LVM is easy to set up, and it's already common to allocate volumes for things like disk images for VMs, so why not databases?
jandrewrogers 2 days ago|||
Some Linux filesystems, notably ext4 and XFS, provide the necessary features to get 90% of the benefit simply by using O_DIRECT correctly. The last 10% is achieved by doing direct I/O to raw block devices, with the obvious caveat that this is not as easy to manage.
Both of these are commonly done in database storage engines.
tptacek 2 days ago||||
If you preallocate and O_DIRECT, haven't you basically soaked up most of the benefit of skipping the filesystem?
pizza234 2 days ago|||
Because the speed increase on modern, properly tuned filesystems is surprisingly small, due to how RDBMSs manage their buffer pools: by working on large container files, they avoid most of the filesystem overhead.
jnwatson 1 day ago||
If you're bypassing the page cache, what invalidates the page cache so that the next read (from the filesystem) isn't stale?
zzsheng 1 day ago|
we also use direct-io for reads.
bawolff 2 days ago||
Am i understanding correctly that you are just targeting consistency and not durability?
zzsheng 1 day ago|
actually both crash consistency and durability. after we ack, we make sure data will *not* be lost due to crash, restart or power loss.
7e 2 days ago||
This is really great work. Kudos to the team for such an elegant solution.
S3 was never designed for performance. Trying to stay compatible while pursuing very hardware-dependent, low-level optimization seems like the wrong direction to begin with.
zzsheng 1 day ago|
check out S3 Express One Zone.
up2isomorphism 1 day ago||
Still slow as hell; if you think 10 ms is fast, that's a different story.
up2isomorphism 1 day ago||
The repo seems to contain some API gateway code, but none of the actual storage engine is open-sourced. I checked so you don't have to waste your time finding out.