This is an unfortunate limitation to be aware of when evaluating
https://www.orioledb.com/docs/usage/getting-started#current-...
1) Overhead. SSI implies a heavyweight lock on any involved index page or heap tuple (even for reads). The overhead of SSI was initially measured at ~10%, but Postgres scalability has since come much further, and that many heavyweight locks can slow a typical workload down several times over on a multicore machine.
2) SSI requires the user to be able to retry any transaction after a serialization failure. Even a read-only transaction needs to be DEFERRABLE, or it might be aborted with a serialization failure (it might have "seen impossible things" and need a retry).
In contrast, it's typically not hard to resolve the concurrency problems of writing transactions using explicit row-level and advisory locks, while REPEATABLE READ is enough for reporting. Frankly speaking, in my whole career I haven't seen a single case where the SERIALIZABLE isolation level was justified.
Any time you have to check constraints manually, you can't just do it before the write, or after the write, because two REPEATABLE READ write transactions will not see each other's INSERT.
You need something like a lock, a two-phase commit, or SERIALIZABLE isolation for writes. Advisory locks have sharp edges, and 2PC is not so simple either, there is a lot that can go wrong.
In the case of SERIALIZABLE you do need to retry in case of conflict, but usually the serialization anomalies can be limited to a reasonably fine level. And an explicit retry feels safer than risking a livelock situation when there is contention.
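Something like this is what I mean by an explicit retry. A minimal sketch assuming psycopg2; the DSN and whatever checks and writes you run inside the transaction are placeholders:

    import time
    import psycopg2
    from psycopg2 import errors
    from psycopg2.extensions import ISOLATION_LEVEL_SERIALIZABLE

    def with_serializable_retry(dsn, work, attempts=5):
        conn = psycopg2.connect(dsn)
        conn.set_isolation_level(ISOLATION_LEVEL_SERIALIZABLE)
        try:
            for attempt in range(attempts):
                try:
                    with conn.cursor() as cur:
                        result = work(cur)  # constraint checks + writes, all in one transaction
                    conn.commit()
                    return result
                except errors.SerializationFailure:  # SQLSTATE 40001
                    conn.rollback()
                    if attempt == attempts - 1:
                        raise
                    time.sleep(0.01 * 2 ** attempt)  # brief backoff before retrying
        finally:
            conn.close()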
https://www.postgresql.org/docs/current/ddl-constraints.html...
An exclusion constraint needs concrete values to compare, but here we can't pre-compute and index every future value (there are infinitely many)
We solve a diophantine equation for this check (if there is a solution to Ax - By = 0, then formulas A and B can conflict at some point)
Heck, https://www.sciencedirect.com/science/article/pii/S147466701... and https://www.amazon.com/Declarative-Models-Concurrent-Cyclic-... seem to indicate this is still an area of active research. And, to your point, almost certainly too complex to try to encode into a Postgres index - that would be a paper-worthy project unto itself!
(I've had to implement the naive approach here, which nested-loops over dates and rules and finds ones that might conflict. Definitely not meant to scale, and would be a nightmare if not at a date-level granularity!)
No pressure to share, but would love to learn more!
I'm not entirely familiar with the literature, so we did some simplifications to keep it manageable, and it's very possible there might be a better approach!
We model the events generated by our recurring rules as intervals [SA+n*A, SA+n*A+DA), where A is a constant integer number of days (e.g. A=3*7 for an event on Tuesday every 3 weeks), SA is the start date, and DA is the duration of an event (... which can be >24h to make things more interesting). It might continue recurring forever, or it might stop by some end date.
Now if you have two of these recurring events with periods A and B, you can think of each of them as a periodic function, and the combination of them will have a period equal to LCM(A, B). So we could check everything modulo LCM(A, B) once, and that would hold infinitely far forward.
But actually, we can do even better by taking the difference Δ = (SA - SB) mod GCD(A, B). You can sort of think of it as the closest approach between these two recurring events. If that's less than DA or greater than GCD(A, B) - DB, these events are eventually going to overlap. There are some extra details to check whether the overlap happens before either end date (if any), but that's the idea in a nutshell.
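A rough sketch of that check in Python, under the simplifying assumptions so far (whole days, both events recur forever, no end dates, no DST):

    from math import gcd

    def can_overlap(start_a, period_a, dur_a, start_b, period_b, dur_b):
        # start_* are day numbers, period_* the recurrence in days, dur_* durations in days
        g = gcd(period_a, period_b)
        # Every achievable difference between an A start and a B start is congruent
        # to start_a - start_b modulo gcd(period_a, period_b).
        delta = (start_a - start_b) % g
        # [x, x+dur_a) and [y, y+dur_b) intersect iff -dur_b < x - y < dur_a,
        # so some achievable difference lands in that window iff:
        return delta < dur_a or delta > g - dur_b

    # Tuesday every 3 weeks (1-day events) vs. another event every 4 weeks starting day 10:
    print(can_overlap(0, 21, 1, 10, 28, 1))  # gcd=7, delta=4 -> False, they never overlap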
---
Where it gets really interesting is when you introduce timezones and daylight savings time (DST). Now a day is not always 24h, so there isn't a nice [SA+n*A, SA+n*A+DA) at regular intervals, and we can't naively work modulo the LCM or GCD anymore.
But we can notice that the days of the year on which a DST transition falls generally follow simple rules, and would repeat every 28 years (actually every 400 years due to leap years). In practice for a given timezone, for example CEST, there are only 14 unique days of the year on which the DST transitions can fall, and it repeats after a full period.
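That's easy to sanity-check by brute force with the stdlib zoneinfo module (Python 3.9+ with tzdata available; Europe/Paris standing in for CET/CEST here):

    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo

    tz = ZoneInfo("Europe/Paris")
    transition_days = set()
    t = datetime(2000, 1, 1, tzinfo=timezone.utc)
    prev_offset = t.astimezone(tz).utcoffset()
    while t.year < 2030:  # ~30 years is enough to see every possible transition date
        t += timedelta(hours=1)
        offset = t.astimezone(tz).utcoffset()
        if offset != prev_offset:
            local = t.astimezone(tz)
            transition_days.add((local.month, local.day))
            prev_offset = offset
    print(len(transition_days), sorted(transition_days))  # 14 days: last Sundays of March and October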
So if we want to check for overlap, we can treat those DST transition days as special cases, and all the time intervals in between DST transition days look locally "UTC-like" (time doesn't jump forward or backwards, if we ignore leap seconds), so the previous formula continues to work on all of these.
But at some point things become a little messy, so we did some simplifications =)
My napkin math suggests the size of the supercycle when you take two recurring events A and B with periods < 365 days and the 400 year cycle for DST transition can be about a hundred thousand years in the worst case.
But considering that timezones and daylight savings time are fundamentally political decisions, and that the IANA tzdata file is updated regularly, there is not much sense in applying the current rules thousands of years into the future! So we check at most 400 years ahead, and we consider that when timezones are involved, the overlap check is best effort and we prepare for the fact that it could miss a DST transition – no matter what math we do, a country could change the rules at any time, and a day that used to be 24h could now be 25, 22:30, or even skipped entirely, invalidating all past assumptions.
The idea that someone could book a room for 400 years of consistency guarantees is somewhat hilarious, yet absolutely the kind of thing that would check a box in a regulated environment like healthcare!
It does speak to a larger issue I see quite often: capturing structured intent as of the time an event is created, along with the context that led to that decision, is incredibly hard to model. No matter what, something about the system will change in an unpredictable way (daylight savings time, assumptions around resource/room availability), such that the data used to confidently establish a lack of conflict at insertion time is no longer sufficient to automatically resolve conflicts.
There's no single way to solve this problem - in a way, "reinterpreting intent" is why programming is often equal parts archaeology and anthropology. Certainly gives us very interesting problems to solve!
To circle it back to the original conversation: the better our databases are at handling large volumes of written data, the more context we can capture around insertion-time intent in an easy-to-query way, and the better we'll be at adapting to unforeseen circumstances!
I use MySQL, not Postgres, for this application (for better or for worse), and I can absolutely generate a bad state if I drop MySQL to a level below SERIALIZABLE — I’ve tried it. (Yes, I could probably avoid this with SELECT FOR UPDATE, but I don’t trust MySQL enough and I get adequate performance with SERIALIZABLE.)
To make SERIALIZABLE work well, I wrap all the transactions in retry loops, and I deal with MySQL’s obnoxious reporting of which errors are worthy of retries.
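Roughly the shape of it. A sketch assuming mysql-connector-python; which error codes are worth retrying is my own judgment call, so adjust for your driver:

    import random
    import time

    import mysql.connector
    from mysql.connector import errorcode

    RETRYABLE = {errorcode.ER_LOCK_DEADLOCK, errorcode.ER_LOCK_WAIT_TIMEOUT}  # 1213, 1205

    def run_serializable(conn, work, attempts=5):
        for attempt in range(attempts):
            try:
                conn.start_transaction(isolation_level="SERIALIZABLE")
                result = work(conn)
                conn.commit()
                return result
            except mysql.connector.Error as exc:
                conn.rollback()
                if exc.errno not in RETRYABLE or attempt == attempts - 1:
                    raise
                time.sleep(random.uniform(0, 0.05 * 2 ** attempt))  # jittered backoff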
(Aside from bad committed states, I’ve also seen MySQL return results from a single straightforward query that cannot correspond to any state of the table. It was something like selecting MIN, MAX and COUNT(*) from a table and getting min and max differing and count = 1. It’s been a while — I could be remembering wrong.)
"IF ANY LITIGATION IS INSTITUTED AGAINST SUPABASE, INC. BY A LICENSEE OF THIS SOFTWARE, THEN THE LICENSE GRANTED TO SAID LICENSEE SHALL TERMINATE AS OF THE DATE SUCH LITIGATION IS FILED."
This is a poison pill. At best the licensing is naive and blocks even customers of Supabase from using OrioleDB; at worst it's an attempt by Supabase to grant themselves unlimited indemnity under the guise of a community project. It means the moment you sue Supabase for anything (contract dispute, IP, employment, an unrelated tort), you lose the license. They could lose your data, and if you try to do anything about it they can immediately countersue for a license violation. Using the brand of the Postgres license to slip this in is pretty wild.
OrioleDB looks like a promising project and is undoubtedly a great contribution from Supabase, but it's not open source or really usable by anyone with this license.
It is now Apache 2.0 which grants patent rights and can be re-licensed to PostgreSQL when the code is upstreamed. I'll amend the blog to make that clearer.
I think that "under review" claim is doing some very heavy lifting, especially when it relates to their changes to index tuple lifecycle management. The patches that have been submitted are unlikely to get committed in full anytime soon, even after substantial changes to the patches' designs.
PostgreSQL just has not been designed for what OrioleDB is doing, and forcing OrioleDB's designs into PostgreSQL upstream would introduce a lot of (very) sharp edges that the community can't properly test without at least a baseline implementation - which, critically, hasn't been submitted upstream. Examples of these sharp edges are variable-sized TIDs, MVCC-owning indexes, and table-AM-signalled index inserts.
There are certainly ideas in OrioleDB's designs that PostgreSQL can benefit from (retail index tuple deletion! self-clustering tables!), but these will need careful consideration of how they can be brought into the project without duplicating implementations at nearly every level. A wholesale graft of a downstream fork, followed by hoping it works out well enough, is just not how the PostgreSQL project works.
1. With the PostgreSQL heap, you need to access the heap page itself, and that is not free. Every access goes through the buffer manager and other related components.
2. In OrioleDB, we have a lightweight protocol to read from pages. In-memory pages are connected using direct links (https://www.orioledb.com/docs/architecture/overview#dual-poi...), and pages are read lock-less (https://www.orioledb.com/docs/architecture/overview#page-str...). Additionally, tree navigation for simple data types skips both copying and tuple deforming (https://www.orioledb.com/blog/orioledb-fastpath-search).
Given all of the above, I believe OrioleDB still wins in the case of secondary key lookups. I think this is indirectly confirmed by the results of the TPC-C benchmark, which involves quite a lot of secondary key lookups. However, this subject is worth dedicated benchmarking in the future.
Of course, the flip side of the coin is that if you do an UPDATE of a row in the presence of a secondary index, and the UPDATE doesn't touch the key, then you don't need to update the index(es) at all. So it really depends on how much you update rows versus how often you index-scan them IME.
[1] TPC-H doesn't have difficult enough queries to really stress the planner, so it mattered comparatively less there than in other OLAP work.
This is why Postgres b-tree indexes offer CREATE INDEX (indexCol1, indexCol2, ...) INCLUDE (includeCol1, includeCol2, ...). With INCLUDE, the index will directly store the listed additional columns, so if your query does `SELECT includeCol1 WHERE indexCol1 = X AND indexCol2 > Y`, you avoid needing to look up the entire row in the heap, because includeCol1 is stored in the index already. This is called a "covering index" because the index itself covers all the data necessary to answer the query, and you get an "index only scan" in your query plan.
The downside to creating covering indexes is that it's more work for Postgres to go update all the INCLUDE values in all your covering indexes at write time, so you are trading write speed for increased read speed.
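A hedged end-to-end sketch, assuming psycopg2 and a scratch database (the table and column names are made up). With enough rows and fresh statistics, EXPLAIN should show an Index Only Scan:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id bigserial PRIMARY KEY,
            room_id int,
            starts_at timestamptz,
            title text
        )
    """)
    # room_id/starts_at are the search keys; title rides along in the leaf pages via INCLUDE
    cur.execute("""
        CREATE INDEX IF NOT EXISTS events_room_time_covering
            ON events (room_id, starts_at) INCLUDE (title)
    """)
    conn.commit()
    cur.execute("""
        EXPLAIN SELECT title FROM events
        WHERE room_id = 42 AND starts_at > now() - interval '7 days'
    """)
    for (line,) in cur.fetchall():
        print(line)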
I think it's quite typical to see this in SQL databases. SQLite behaves the same way for indexes; the exception is that if you create a WITHOUT ROWID table, then the table itself is sorted by primary key instead of by ROWID, so you get at most 1 index that maps directly to the row value. (sqlite docs: https://sqlite.org/withoutrowid.html)
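For contrast, a runnable sketch with the stdlib sqlite3 module (table names made up): the first table is an ordinary rowid table whose primary key is a separate index, the second stores its rows directly in the primary-key b-tree:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE bookings_rowid (
            room TEXT, day TEXT, who TEXT,
            PRIMARY KEY (room, day)
        )
    """)
    conn.execute("""
        CREATE TABLE bookings_clustered (
            room TEXT, day TEXT, who TEXT,
            PRIMARY KEY (room, day)
        ) WITHOUT ROWID
    """)
    for table in ("bookings_rowid", "bookings_clustered"):
        plan = conn.execute(
            f"EXPLAIN QUERY PLAN SELECT who FROM {table} WHERE room = 'A' AND day = '2024-06-01'"
        ).fetchall()
        print(table, plan)  # rowid table goes via an autoindex, WITHOUT ROWID uses the PK directly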
> This means that in an ordinary index scan, each row retrieval requires fetching data from both the index and the heap.
Note that it says _index and the heap_. Not _index and the primary index and the heap_. (For a B-tree-organized table, the leaf heap nodes are essentially the bottom of the primary index, so it means that to find anything, you need to follow the primary index from the top, which may or may not entail extra disk accesses. For a normal Postgres heap, this does not happen, you can just go directly to the right block.)
Index-only scans (and by extension, INCLUDE) are to avoid reaching into the heap at all.
> The downside to creating covering indexes is that it's more work for Postgres to go update all the INCLUDE values in all your covering indexes at write time, so you are trading write speed for increased read speed.
For updates, even those that don't touch INCLUDE values, Postgres generally needs to go update the index anyway (this is the main weakness of such a scheme). HOT is an exception: you can avoid that index update if there's room in the same heap block, and index scans will follow the marker(s) you left saying “here's the new row!” instead of fetching it directly.
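One way to watch HOT in action (a sketch assuming psycopg2 and a scratch database): update a non-indexed column and check n_tup_hot_upd in pg_stat_user_tables. The statistics are reported asynchronously, so a short pause helps:

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder DSN
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS hot_demo")
    # leave some free space in each block so HOT has room to work
    cur.execute("CREATE TABLE hot_demo (id int PRIMARY KEY, payload text) WITH (fillfactor = 70)")
    cur.execute("INSERT INTO hot_demo SELECT g, 'x' FROM generate_series(1, 1000) g")
    cur.execute("UPDATE hot_demo SET payload = 'y' WHERE id <= 500")  # doesn't touch the indexed column
    time.sleep(1)
    cur.execute("SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 'hot_demo'")
    print(cur.fetchone())  # most of these updates should show up as HOT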
Based on the limited description of OrioleDB I understand it works like SQLite WITHOUT ROWID, actually storing the row tuple in the primary key b-tree, but I didn’t go read the code
Notably, you can have a Postgres table without a primary key at all, not even an implicit one.
> Based on the limited description of OrioleDB I understand it works like SQLite WITHOUT ROWID, actually storing the row tuple in the primary key b-tree, but I didn’t go read the code
This is my understanding of OrioleDB as well.
We should start counting the number of times permissively licensed software needs this kind of protection, and then ask what the difference is between all this effort and just going GPL.
You don't need to hang onto hopes and public shaming until someone else does the hard part.
Still, kudos for the grant.
How does it compare with Neon DB?
Wait until you need to manage a bunch of servers yourself. Unfortunately, the available solutions are complex, and there's nothing where you can simply point something at a server or VPS and have a quick, fully controlled, kernel-level solution. K8s, sure, but even that is not on the same level. And when you then need to manage DBs, you're often placing those outside K8s. Most scalable solutions like CRDB (10m pay, yearly "free" license approval required), YugabyteDB (half broken), TiDB ... have their own issues and, again, do not tie into a complete managed cloud experience.
https://github.com/orioledb/orioledb/commit/44bab2aa9879feb7...