https://github.com/orbitinghail/graft/blob/main/docs/design....
> Graft clients commit locally and then asynchronously attempt to commit remotely. Because Graft enforces Strict Serializability globally, when two clients concurrently commit based on the same snapshot, one commit will succeed and the other will fail.
OK, but, the API provides only a single commit operation:
> commit(VolumeId, ClientId, Snapshot LSN, page_count, segments) Commit changes to a Volume if it is safe to do so. The provided Snapshot LSN is the snapshot the commit was based on. Returns the newly committed Snapshot on success.
So if a client commits something, and it succeeds (presumably locally), then how should that client discover that the "async" propagation of that commit has failed, and therefore that everything it's done on top of that successful local commit needs to be rolled back?
This model is kind of conflating multiple, very different notions of "commit" with each other. Usually "commit" means the committed transaction/state/whatever is guaranteed to be valid. But here it seems like a "local commit" can be invalidated at some arbitrary point in the future, making it something totally different from an "async-validated commit"?
The key idea is that if your system supports offline writes, then by definition the client making those writes can't have general-purpose strict serializability. It has to operate under the assumption that when its transactions eventually sync, they may no longer be valid. Graft attempts to provide a strong foundation (server-side commits are strictly serialized) but lets the client choose how to handle local writes.
A client may choose any of these options:
1. If offline, reject all local writes - wait until we are online to commit
2. Rebase local writes on the latest snapshot when you come back online, resulting in the client experiencing "optimistic snapshot isolation"
3. Merge local changes with remote changes - this probably depends heavily on the data structure you are storing in Graft. For example, storing a Conflict-free Replicated Data Type (CRDT) would work nicely
4. Fork the Volume entirely and let the user figure out how to manually merge the branches back together
5. Throw away all local changes (probably not what you want, but it works!)
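To make the options concrete, here is a toy sketch in Python. It is not Graft's API: plain dicts stand in for snapshots, `(key, value)` pairs stand in for writes, and the `resolve`/`Policy` names are invented for illustration. It shows options 2 and 5 from the list above, with everything else escalated to the user:

```python
from enum import Enum, auto

class Policy(Enum):
    REBASE = auto()    # option 2: replay local writes on the latest snapshot
    DISCARD = auto()   # option 5: throw away local changes
    MANUAL = auto()    # options 1/4: surface the conflict to the user

def resolve(local_writes, base_snapshot, remote_snapshot, policy):
    """Resolve local writes after a remote commit was rejected because
    remote_snapshot moved past base_snapshot (the snapshot the local
    commit was based on)."""
    if remote_snapshot == base_snapshot:
        # No concurrent remote commit: local writes apply cleanly.
        merged = dict(base_snapshot)
        merged.update(local_writes)
        return merged
    if policy is Policy.REBASE:
        # "Optimistic snapshot isolation": replay on the newer snapshot.
        merged = dict(remote_snapshot)
        merged.update(local_writes)
        return merged
    if policy is Policy.DISCARD:
        return dict(remote_snapshot)
    raise RuntimeError("conflict: fork the volume or wait for the user")
```

The point of the sketch is just that the *client* picks the policy; the server's strictly serialized commit stream is the same in every case.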
My goal is to build a building block on top of which edge-native systems can be built. But I'm not too opinionated about what local write semantics your application needs. :)
The guarantees of Graft's "commit" operation are properties of the Graft system itself. If commit is e.g. strict-serializable when clients satisfy one set of requirements, and isn't strict-serializable if clients don't satisfy those requirements, then "commit" is not strict-serializable.
Assuming you agree with that, what would be a clearer way to explain the tradeoffs and resulting consistency models when a client wants to commit asynchronously?
I think I see the issue, but I'd love to hear your take on how to update the docs.
I think the key point of your design is flexibility, rather than any individual consistency properties. It might be better to emphasise this and try to explain, at the top-level, the concrete ways an application can interact with the storage and the different tradeoffs.
So you might have strong write consistency with forced global serialisation or weaker properties with less enforced sync policies. From the perspective of the application, the internal details shouldn’t matter, but the external properties and how to achieve them (eg. CRDT style merging etc as a way to get strong consistency with less syncing, for certain domains).
Specifically, I don't really see how a client can "commit locally" and "commit globally" as separate things. I understand a client to be something that interacts with your metastore API, which provides a single "commit" operation, that AFAICT will return success/failure based on local commit state, not global state.
Is that not correct?
Later on in the design.md you say
> The Client will be a Rust library ...
which might be the missing link in this discussion. Is the system model here assuming that clients are always gonna be Rust code in the same compilation unit as the Graft crate/library/etc.?
They're building an edge data store that has both local and global characteristics, with certain options meant for offline mode. It's reasonable to assume that the answer is more complicated when talking about the strict serializability of such a system.
I'm going to force myself to sign off for the evening. Will be around first thing to answer any other questions that come up! I just arrived in Washington, DC to attend Antithesis BugBash[1] and if I don't get ahead of the jet lag I'm going to regret it.
If anyone happens to be around Washington this week (perhaps at the conference) and wants to meet up, please let me know! You can email me at hello [at] orbitinghail [dotdev].
I'll be at Antithesis BugBash [1] for the next few days talking about Deterministic Simulation Testing (DST) with fellow DST nerds. If you're around please reach out so we can meet in person!
Either way, have an excellent day! :)
[1]: https://discord.gg/etFk2N9nzC
The problem here is that there is no way to cleanly detect a conflict. The documentation talks about pages which have changed, but a page changing isn't a good indicator of conflict. A conflict can also happen due to a read conflict. E.g.
Update Customer Id: "UPDATE Customers SET id='bar' WHERE id='foo'; UPDATE Orders SET customerId='bar' WHERE customerId='foo'"
Add Customer Purchase: "SELECT id FROM Customers WHERE email='blah'; INSERT INTO Orders(customerId, ...) VALUES('foo', ...);"
If the update task gets committed first and the pages for the Orders table are full (i.e. inserting causes a new page to be allocated), these two operations don't have any page conflicts, but the result is incorrect.
In order to fix this, you would need to track the pages read during the transaction in which the write occurred, but that could easily end up being the whole table if the updated column isn't part of an index (and thus requires a table scan).
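A hedged sketch of the point (toy page numbers invented for illustration; this is not Graft's actual conflict detection): a write/write check on pages sees no overlap in the scenario above, while a read-set check catches the conflict:

```python
def has_conflict(local_pages, remote_written_pages):
    """A commit based on an old snapshot conflicts if any page the local
    transaction touched was rewritten concurrently. Pages are plain ints."""
    return bool(set(local_pages) & set(remote_written_pages))

# Hypothetical layout: the concurrent UPDATE rewrote Customers page 1
# and Orders page 7.
remote_writes = [1, 7]

# Our INSERT *read* Customers page 1 (the SELECT), but *wrote* only a
# freshly allocated Orders page 9.
insert_read_set = [1]
insert_write_set = [9]
```

Checking only write sets reports "no conflict" even though the insert used a stale customer id; including the read set catches it. As the comment notes, the read set of an unindexed scan can be the whole table, which is what makes this expensive.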
Some simple examples: https://www.caktusgroup.com/blog/2018/03/19/when-clean-merge...
If strict serializability is not possible, because your changes are based on a snapshot that is already invalid, you can either replay (your local transactions are not durable, but system-wide you regain serializability) or merge (degrading to snapshot isolation).
As long as local unsynchronized transactions retain the page read set, and look for conflicts there, this should be sound.
Dropping them all is technically consistent but it may be unsafe depending on the circumstances. E.g. a doctor records an urgent referral, but then the transaction fails because admin staff have concurrently updated the patient's phone number or whatever. Automatically replaying is unsafe because consistency cannot be guaranteed.
Manual merging may be the only safe option in many cases. But how can the app reconstitute the context of those failed transactions so that users can review and revise? At the very least it would need access to a transaction ID that can be linked back to a user level entity, task or workflow. I don't think SQLite surfaces transaction IDs. So this would have to be provided by the Graft API I guess.
Yeah, exactly right. This is why CRDTs are popular: they give you well-defined semantics for automatic conflict resolution, and save you from having to implement all that stuff from scratch yourself.
The author writes that CRDTs "don’t generalize to arbitrary data." This is true, and sometimes it may be easier to write your own custom app-specific conflict resolution logic than to massage your data to fit within preexisting CRDTs, but doing that is extremely tricky to get right.
It seems like the implied tradeoff being made by Graft is "you can just keep using the same data formats you're already using, and everything just works!" But the real tradeoff is that you're going to have to write a lot of tricky, error-prone conflict resolution logic. There's no such thing as a free lunch, unfortunately.
E.g, how do you make sure that a hotel room cannot be booked by more than one person at a time or at least flag this situation as a constraint violation that needs manual intervention?
It's really hard to get anywhere close to the universal usefulness and simplicity of centralised transactions.
But this isn't a problem specific to CRDTs, it's a limitation with any database that favors availability over consistency. And there are use cases that don't require these kinds of constraints where these limitations are more manageable.
I think CRDTs would be applicable to a wider range of applications if it was possible to specify soft constraints.
So after merging your changes you can query the CRDT for a list of constraint violations that need to be resolved.
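As a sketch of that idea (a toy grow-only set with invented names, not any existing CRDT library): the merge itself never conflicts, and a separate query surfaces the double-booked slots for manual resolution instead of silently picking a winner:

```python
from collections import Counter

def merge(*replicas):
    """Grow-only set merge: the union of all bookings observed anywhere.
    Union is commutative and idempotent, so replicas converge."""
    merged = set()
    for replica in replicas:
        merged |= replica
    return merged

def violations(bookings):
    """Soft constraint: at most one guest per (room, night). Returns the
    over-booked slots so the app can flag them for manual intervention."""
    slots = Counter((room, night) for _guest, room, night in bookings)
    return {slot for slot, count in slots.items() if count > 1}

# Two replicas that each accepted a booking for room 101 while offline.
a = {("alice", 101, "2024-05-01")}
b = {("bob", 101, "2024-05-01"), ("bob", 102, "2024-05-02")}
```

The merge keeps every booking (no data loss), and the hotel-room constraint from the earlier comment becomes a queryable violation rather than an invariant the CRDT silently enforces.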
This manifests itself to the user as just data loss, though. You do something, it looks like it worked, but then it goes away later.
"Consistency" is really easy, as it turns out, if you allow yourself to simply drop any inconvenient transactions at some arbitrary point in the future.
This only works cleanly when state materializes from an unordered set of commutative actions, e.g.:
- Each action increments or decrements a counter
- You have a log of timestamps of actions stored as a set
- etc.
If you can't model your changes to the data store as an unordered set of actions and have that materialize into state, you will have data loss.
Consider a scenario with three clients which each dispatch an action. If action 1 sets value X to true, action 2 sets it to true, and action 3 sets it to false, you have no way to know whether X should be true or false. Even with timestamps, unless you have a centralized writer you can't possibly know whether some/none/all of the timestamps that the clients used are accurate.
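A tiny illustration of the difference (toy Python, not any particular CRDT library): increments commute, so any arrival order of the same actions converges, while set-true/set-false depends on an order no replica can establish:

```python
def materialize_counter(actions):
    """Increment/decrement actions commute: the same unordered multiset
    of actions yields the same state regardless of arrival order."""
    return sum(actions)

def materialize_flag(actions):
    """Last write wins on a boolean does NOT commute: the result is
    whatever action happened to arrive last at this replica."""
    value = False
    for v in actions:
        value = v
    return value
```

With three clients dispatching `True`, `True`, `False`, two replicas that receive the actions in different orders disagree, which is exactly the scenario described above.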
Truly a hard problem!
Curious how this compares to Cloud-Backed SQLite’s manifest: https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki
It’s similar to your design (sending changed pages), but doesn’t need any compute on the server, which I think is a huge win.
CBS uses manifests and blocks as you point out. This allows readers to pull a manifest and know which blocks can be reused and which need to be pulled. So from that perspective it's very similar.
The write layer is pretty different, mainly because CBS writes blocks directly from the client, while Graft leverages an intermediate PageStore to handle persistence.
The first benefit of using a middleman is that the PageStore is able to collate changes from many Volumes into larger segments in S3, and soon will compact and optimize those segments over time to improve query performance and eliminate tombstones.
The second benefit is fairly unique to Graft: written pages are "floating" until they are pinned to an LSN by committing to the MetaStore. This matters when write concurrency increases. If a client's commit is rejected (it wasn't based on the latest snapshot), it may attempt to rebase its local changes on the latest snapshot. When it does so, Graft's model allows it to reuse any subset of its previously attempted commit in the new commit, in the best case completely eliminating any additional page uploads. I'm excited to experiment with using this to dramatically improve concurrency for non-overlapping workloads.
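A rough sketch of that reuse idea (keying pages by a hypothetical content hash; Graft's actual mechanism may differ): on rebase, only pages whose content changed between the two commit attempts need to be uploaded again:

```python
def pages_to_upload(new_commit_pages, previously_uploaded):
    """Floating pages from a rejected commit can be pinned by a later
    commit. Both arguments map page_index -> content_hash; only pages
    whose content differs from what was already uploaded go over the wire."""
    return {idx: h for idx, h in new_commit_pages.items()
            if previously_uploaded.get(idx) != h}

# First (rejected) attempt uploaded pages 3 and 9.
first_attempt = {3: "h3", 9: "h9"}
# After rebasing, page 9's content changed but page 3 is byte-identical.
rebased = {3: "h3", 9: "h9b"}
```

In the best case (a non-overlapping rebase) the dict comes back empty and the retry is metadata-only.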
The third benefit is permissions. When you roll out Graft, you are able to enforce granular write permissions in the PageStore and MetaStore. In comparison, CBS requires clients to have direct access to blob storage. This might work in a server side deployment, but isn't suited to edge and device use cases where you'd like to embed replicas in the application.
On the manifest side of the equation, while in CBS it's true that a client can simply pull the latest manifest, when you scale up to many clients and a high rate of changes, Graft's compressed bitset approach dramatically reduces how much data clients need to pull. You can think of this as pulling a log vs a snapshot, except for metadata.
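The log-vs-snapshot point can be sketched like this (plain Python sets standing in for Graft's compressed bitsets; the `changesets` map and LSN handling are invented for illustration):

```python
def pages_to_pull(changesets, have_lsn, latest_lsn):
    """Union the changed-page sets for the LSNs this client is missing,
    instead of re-fetching a full manifest describing every page.
    `changesets` maps LSN -> set of page indices changed at that LSN."""
    missing = set()
    for lsn in range(have_lsn + 1, latest_lsn + 1):
        missing |= changesets[lsn]
    return missing

# Three commits; a client that is caught up through LSN 1 only needs
# the pages touched by LSNs 2 and 3.
changesets = {1: {0, 1}, 2: {1, 5}, 3: {5}}
```

The client's download scales with what actually changed since its last sync, not with the total size of the volume's metadata.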
Hope that helps clarify the differences!
Oh, and one more petty detail: I really like Rust. :)
Graft introduces a lot of new concepts, and while they might be necessary to achieve the more ambitious goals, it feels like a lot. Running an HTTP API with Protobuf is a lot of API surface area to maintain and evolve going forward.
Write concurrency for CBS is "one writer per bucket", which is usable in a "one bucket per user" configuration. You can mediate a client's blob storage access with signed URLs. It's not great, though, and you have to roll your own conflict resolution.
The most interesting choice to me here is to handle conflict resolution (rebasing/forking/resetting) at the storage engine (page) level. For non-overlapping writes, I can see how rebasing is quite effective.
I was recently looking for a way to do low scale serverless db in gcloud, this might be better than any of their actual offerings.
Cloud firestore seems like the obvious choice, but I couldn't figure out a way to make it work with existing gcloud credentials that are ubiquitous in our dev and CI environments. Maybe a skill issue.
I like the idea behind Graft, but it immediately runs into some complications. Like the fact that, as an edge device with unreliable networking, its client may not have connectivity to fetch the pages it needs when it needs them. If anything, what the client really needs is to fetch all the data whenever it can, so that when it does need to access it, it isn't waiting.
But if you have a lot of data, that could take forever! That's why the application needs to be more intelligent about what data it's requesting, and do things like create priorities and classes for its data so it can fetch what it needs the most first.
The fact that this is built for SQLite kind of reinforces the point. Distributed relational databases are usually a bad idea, because usually people want them to be ACID-like, but you can't enforce that in a distributed way. That's why things like CRDTs exist; you treat every "copy" as an independent thing, breaking the false narrative that you can really distribute the same thing in multiple places reliably. (And that shows you why relational databases just aren't good for replicating; to replicate an entire relational database with CRDTs, you need one CRDT per cell, which multiplies your data)
https://collabs.readthedocs.io/en/latest/
https://rxdb.info/offline-first.html
https://github.com/siriusastrebe/jsynchronous
I hope one day to try them all :-) Or read a summary from someone who does.
Personally, I already have 4 of those links marked as visited: pouchdb, automerge, loro, and sqlsync of course. I was trying to fit such a tool into the existing architectures I deal with at work, but nothing really makes sense.
My guess is those solutions are at totally the wrong abstraction layer; their creators think they'd be the best thing since sliced bread, but in reality a REST API plus some persistence on the client is like 99% good enough. With service workers in the browser, and with mobile apps, having local data stores is no problem.
Rather than sending out specific partial updates, just reloading the full state from the server is easy to explain to users and easy to implement. Last-write-wins with some audit log is also good enough for something like 99.9% of applications and is super easy to explain to people; what's not easy to explain or implement is merging conflicts in database data. It's also not easy to make audit logs tamper-proof server-side if you just sync the full database instead of doing REST requests.
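For what it's worth, the last-write-wins-plus-audit-log approach described in the comment can be sketched in a few lines (toy code with invented names; timestamps and client ids are assumed inputs, and a real system would need to worry about clock skew):

```python
def lww_merge(local, remote):
    """Last-write-wins per key, keeping losers in an audit log instead of
    silently dropping them. Each value is (payload, timestamp, client_id);
    client_id breaks timestamp ties deterministically."""
    merged, audit = {}, []
    for key in set(local) | set(remote):
        candidates = [v for v in (local.get(key), remote.get(key))
                      if v is not None]
        winner = max(candidates, key=lambda v: (v[1], v[2]))
        merged[key] = winner
        audit += [(key, v) for v in candidates if v is not winner]
    return merged, audit

local = {"title": ("Draft A", 10, "c1")}
remote = {"title": ("Draft B", 12, "c2"), "body": ("hello", 5, "c2")}
```

The audit log is what makes this explainable to users: the overwritten value is recoverable, so "last write wins" degrades gracefully instead of being silent data loss.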
This approach with "sync databases" feels to me exactly like proposing LaTeX because it is great for people who need to write 10-page essays.
The reason people descend into this madness is that visible replication code is tricky, and the general feeling is that it'll infect parts that shouldn't be infected (or at least not without a general framework).
So at a somewhat trivial level you have:
A: A bare log replication system (where the application needs awareness for most object types, multiplying object complexity).
B: An object replication system where the framework handles all object types coherently and the application "only" needs to be aware of how to fetch objects - think a KV store that stores fairly basic objects with some addressing support.
C: Since recent crowd "wisdom" dictates that most KV stores will likely re-implement SQL functionality badly, people go straight to the SQL case (maybe scratching a curiosity about SQL databases they already had).
I've recently built A (basic replication and LWW), and building the application I'm definitely feeling an itch to start over or adjust to support B (a simple replicated KV store) to separate the concerns more. I can see how I would also feel the SQL itch of C (but having done SQL-like systems before, it's not as bad for me).
For this first application A will suffice (since the offline needs are relatively trivial) but having a more complicated application in mind I'm strongly considering B for that project (along with designs or third party libs to make it happen).
I think a big gap in the space is that most tools seem focused on "documents" (especially the CRDT-based ones), i.e. a document being the atomic piece that is synchronized. Imo that leaves a big hole: all the regular application management tools like SQL query tools become useless, since essentially you only have a bunch of "blobs" or worse. And if you want the regular enterprise SQL backend, these tools don't seem to focus on synchronizing to those regular backend storage systems.
You get local document-level atomicity for sync, multi-document transaction support on the server side, KV access, SQL inside JSON docs or even across multiple documents, full-text search, and fine-grained RBAC for document-level synchronization - but the cost is as much lock-in as it is financial. You can't mix and match storage, query, or sync without pretty big tradeoffs.
Multiplayer documents are synchronized in real time, and since they are documents, that's totally not the use case for DB synchronization.
All these tools are for offline-to-online data synchronization. A different use case than documents.
What do people really need? Who defines and polices that?
> just reloading full state from the server is just easy to explain to the users and easy to implement
Is the green light on? If not, press the "power" button. Never underestimate the difficulty of explaining the simple to the uninterested.
The audience in this case is geeks like us, so it's probably ok to have wonky process until someone forks the project and fixes those problems.
It's easy to focus on libgraft's SQLite integration (comparing to turso, etc), but I appreciate that the author approached this as a more general and lower-level distributed storage problem. If it proves robust in practice, I could see this being used for a lot more than just sqlite.
At the same time, I think "low level general solutions" are often unhinged when they're not guided by concrete experience. The author's experience with sqlsync, and applying graft to sqlite on day one, feels like it gives them standing to take a stab at a general solution. I like the approach they came up with, particularly shifting responsibility for reconciliation to the application/client layer. Because reconciliation lives heavily in tradeoff space, it feels right to require the application to think closely about how they want to do it.
A lot of the questions here are requesting comparisons to existing SQLite replication systems; the article actually has a great section on this topic at the bottom: https://sqlsync.dev/posts/stop-syncing-everything/#compariso...
And yea, I fell pretty far down the "Leap" rabbit hole. It's a fun one :)
This means that the app developer could support private syncing without any effort, and I wouldn't have to pay a subscription just for sync, I could just run a sync server on my NAS and all apps would know to sync their data with it. Of course, app developers don't want this because subscriptions are a reliable income stream, but hopefully users would push for it.
Disclaimer: I have worked on a sync backend for a company in this space for the last 8 years. You can probably find out where if you look, but this comment won't be a sales pitch.
Competition like this has incredible value for communities with poor internet access but reasonable development capabilities. Think about travelling doctors in less-developed countries and areas of Africa, for instance. Quite often entire villages get medical checkups done, with data being recorded on a tablet. This can be synced once the doctor gets to a town with internet access for follow-up. Of course, projects like the above do not have big budgets. Unfortunately they are priced out of using a lot of tech to solve these problems (my company included in this statement).
On the more enterprise-y side, which is where I mostly sit, a lot of airlines, cruise ships, retail and field-based industry use these technologies, since they are prone to periods of being completely offline or in a P2P mesh only. Cloud databases or even regular replicated databases running in-situ are a non-starter, since there won't be anybody around to administer it. Replication is a difficult problem at the best of times, let alone in periods of comms downtime.
And I made Graft open source to enable exactly that! Deploy it anywhere! Just let me know so I can better support your use case :)
We developed our own sync engine for an offline-first IDE for notes/tasks [1] we're building, where the data structure is a tree (or graph actually) to support outlining operations. Conflict resolution is always the challenge, and especially with trees multiple offline players can optimistically commit local changes which would result in an invalid tree state once globally merged.
The way we handle this is by rolling back tree inserts when a client comes online and receives other tree insert deltas. From what I understand from the description of SQLSync, the best way to handle this would be to pull in a latest snapshot and then replay. Pulling in a full snapshot sounds rather expensive though. We have some sort of heuristic where we can do this if the number of incoming deltas would be very large, but for most offline syncing we never need it. Just curious how SQLSync defines these snapshots? Sticking with the note-taking example, in our case we can't really have a snapshot of a single "note" because of graph features like transclusions. Does SQLSync have some clever way to avoid having to send all state in order to "reset and replay"?
In SQLSync, the major issue was precisely what you describe: pulling in the full snapshot to replay. This is the main driver behind the "partial" aspect of the Graft design. It means that clients only need to pull the portion of the incoming snapshot that they don't already have and that overlaps with the read/write set of their transactions. So yes, to answer your question: once SQLSync is powered by Graft, it will frequently be able to avoid downloading all state in order to reset and replay.
Note that if a client is ok with relaxing to snapshot isolation (in particular this means clients may experience Write Skew[1]), and the r/w set of their local transaction does not intersect the snapshot changeset, Graft is able to perform an automatic merge.
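A minimal sketch of that merge rule (toy page sets, not Graft's actual implementation): the local transaction merges automatically iff its read/write page set is disjoint from the pages changed by the incoming snapshot. Disjoint write sets are never checked against each other's read predicates, which is exactly the write-skew anomaly snapshot isolation permits:

```python
def can_auto_merge(txn_read_set, txn_write_set, snapshot_changeset):
    """Under snapshot isolation, a local transaction can be rebased onto a
    newer snapshot automatically when the pages it read or wrote don't
    intersect the pages the snapshot changed."""
    return not ((txn_read_set | txn_write_set) & snapshot_changeset)
```

Here the sets contain page indices; a real implementation would use the compressed bitsets described elsewhere in the thread.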