Posted by 0x54MUR41 10 hours ago

What is a database transaction?(planetscale.com)
189 points | 47 comments
MHordecki 9 hours ago|
I’ve found this article lacking. Like some other articles in this space, it introduces isolation levels through the lens of the phenomena described in the SQL standard, but I find that there’s a different, more intuitive approach.

I think it’s more tractable to define this problem space starting from the concept of (strict) serializability, which is really a generalization of the concept of thread safety. Every software engineer has an intuitive understanding of it. Lack of serializability can lead to execution-dependent behavior, which usually results in hard-to-diagnose bugs. Thus, all systems should strive towards serializability, and the database can be a tool in achieving it.

Various non-serializable levels of database transaction isolation are relaxations of the serializability guarantee, where the database no longer enforces the guarantee and it’s up to the database user to ensure it through other means.

The isolation phenomena are a useful tool for visualizing various corner cases of non-serializability, but they are not inherently tied to it. It's possible to achieve serializability while observing all of the SQL phenomena. For example, a Kubernetes cluster with carefully-written controllers can be serializable.
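The thread-safety analogy can be made concrete with a deterministic sketch (hypothetical, not from the article) of the classic lost-update anomaly that a non-serializable interleaving produces:

```python
# Two "transactions" each deposit 50 into the same account. The interleaving
# below (both read before either writes) is non-serializable: no serial order
# of T1 and T2 produces this result. Hypothetical example, not from the article.

balance = {"acct": 100}

t1_read = balance["acct"]        # T1 reads 100
t2_read = balance["acct"]        # T2 reads 100, before T1 writes

balance["acct"] = t1_read + 50   # T1 writes 150
balance["acct"] = t2_read + 50   # T2 overwrites with 150: T1's deposit is lost

final = balance["acct"]
print(final)  # 150, not the 200 that any serial execution would give
```

The same read-modify-write race shows up whether the shared state is a Python dict touched by two threads or a database row touched by two transactions, which is the sense in which serializability generalizes thread safety.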

bddicken 8 hours ago||
Author here. This is good feedback.

The combination of transactions, isolation levels, and MVCC is such a huge undertaking to cover all at once, especially when comparing how it's done across multiple DBs, which I attempted here. It's always a balance between technical depth, accessibility for people with less experience, and not letting it turn into an hour-long read.

libraryofbabel 5 hours ago|||
I actually like this article a lot. I do a bit of teaching, and I imagined the ideal audience for this as a smart junior engineer who knows SQL and has encountered transactions but maybe doesn’t really understand them yet. I think introducing things via examples of isolation anomalies (which most engineers will have seen examples of in bugs, even if they didn’t fully understand them) gives the explanation a lot more concreteness than starting with serializability as a theoretical concept as GP is proposing. Sure, strict serializability is a powerful idea that ties all this together and is more satisfying for an expert who already knows this stuff. But for someone who is just learning, you have to motivate it first.

If anything, I’d say it might be better to start with the lower isolation levels first, highlight the concurrency problems that can arise with them, and gradually introduce higher isolation levels until you get to serializability. That feels a bit more intuitive rather than downward progression from serializability to read uncommitted as presented here.

It also might be nice to see a quick discussion of why people choose particular isolation levels in practice, e.g. why you might make a tradeoff under high concurrency and give up serializability to avoid waits and deadlocks.

But excellent article overall, and great visualizations.
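One way to make the "uncommitted writes are invisible" behavior concrete is a two-connection sketch. SQLite is used here purely as a stand-in (its default rollback-journal mode serves readers the last committed state; under READ COMMITTED in Postgres or MySQL the observable shape is the same):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file; isolation_level=None gives the
# Python driver manual transaction control.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT)")
writer.execute("INSERT INTO accounts VALUES (1, 100)")

writer.execute("BEGIN")
writer.execute("UPDATE accounts SET balance = 200 WHERE id = 1")

# The update is still uncommitted, so the reader sees the old committed value.
before = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchall()[0][0]

writer.execute("COMMIT")
after = reader.execute("SELECT balance FROM accounts WHERE id = 1").fetchall()[0][0]

print(before, after)  # 100 200
```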

jaxr 7 hours ago|||
I love the work planetscale does on keeping this type of content accurate yet accessible. Keep it up!
Rapzid 8 hours ago|||
https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster

More notation, more citations, more better.

bddicken 8 hours ago|||
Notation is useful. Citations are nice for further reading. But I don't agree more of this makes for a better article!
peterclary 8 hours ago|||
Looks like the author is geoblocking in protest of the UK Online Safety Act (and fair enough).
lateforwork 7 hours ago|||
Most RDBMSs offer serializable isolation if you need it. Often you don't need it. The downside of using serializable isolation unnecessarily is reduced concurrency and throughput due to increased coordination between transactions.
bddicken 1 hour ago||
Yep. It's a wonderful capability to have for some situations, but for 90% of applications SERIALIZABLE isolation is overkill.
ignoramous 3 hours ago|||
> concept of (strict) serializability [("S")], which is really a generalization of the concept of thread safety

Unsure why "strict" (L + S) is in parentheses: Linearizability ("L") is what resembles safety in SMP systems the most?

mika6996 8 hours ago||
Then recommend a better explanation?
zadikian 5 hours ago||
A frequent surprise seems to be that Postgres and MySQL don't default to serializable (so not fully the I in ACID); they default to read-committed. I didn't see this article mention that, but maybe I missed it. The article says read-committed provides "slightly" better performance, but it's been way faster in my experience. I forget where, but I think they said they chose this default for that reason.

Using read-committed ofc means having to keep locking details in mind. Like, UNIQUE doesn't just guard against bad data entry, it can also be necessary for avoiding race conditions. But now that I know, I'd rather do that than take the serializable performance hit, and also have to retry xacts and deal with the other caveats at the bottom of https://www.postgresql.org/docs/current/transaction-iso.html
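The UNIQUE-as-concurrency-guard point can be shown with a tiny sketch (hypothetical schema, SQLite as a stand-in). Two racing "check then insert" registrations can both pass their own check under read committed; the constraint makes the database arbitrate, turning a silent duplicate into a catchable error:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

conn.execute("INSERT INTO users (name) VALUES ('alice')")
try:
    # A racing second registration that already passed its own "name free?"
    # check: the constraint turns the race into an error the app can handle.
    conn.execute("INSERT INTO users (name) VALUES ('alice')")
    outcome = "duplicate inserted"
except sqlite3.IntegrityError:
    outcome = "name taken"

print(outcome)  # name taken
```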

benoitg 1 hour ago||
Recent versions of MySQL and MariaDB default to repeatable-read for InnoDB tables, not read-committed:

https://dev.mysql.com/doc/refman/8.4/en/set-transaction.html...

https://mariadb.com/docs/server/reference/sql-statements/adm...

I don't know about MyISAM though (who uses it anyway ;-) ).

layer8 4 hours ago|||
The issue with SERIALIZABLE, aside from performance, is that transactions can fail due to conflicts/deadlocks/timeouts, so application code must be prepared to recognize those cases and have a strategy to retry the transactions.
zadikian 3 hours ago||
Right. So my code had a helper to run some inner func in a serializable xact, in rw or ro mode, which would retry with backoff. Like the TransactionRunner in Spanner. But even with no retries occurring, it was very slow.
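A minimal sketch of that kind of helper, under stated assumptions: the names are hypothetical, and the conflict check is driver-specific (Postgres, for example, reports serialization failures as SQLSTATE 40001). The demo runs against in-memory SQLite as a stand-in for a real serializable RDBMS:

```python
import random
import sqlite3
import time

def is_serialization_conflict(exc):
    # Placeholder: a real helper would check the driver-specific error code
    # (e.g. SQLSTATE 40001 in Postgres).
    return "serial" in str(exc).lower()

def run_in_txn(conn, fn, max_attempts=5):
    """Run fn inside a transaction, retrying with backoff on conflicts."""
    for attempt in range(max_attempts):
        try:
            conn.execute("BEGIN")  # a real driver would request SERIALIZABLE here
            result = fn(conn)
            conn.execute("COMMIT")
            return result
        except Exception as exc:
            conn.execute("ROLLBACK")
            if attempt == max_attempts - 1 or not is_serialization_conflict(exc):
                raise
            # Exponential backoff with jitter before retrying.
            time.sleep((2 ** attempt) * 0.01 * random.random())

# Demo: the happy path, no conflict, one attempt.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (x INT)")

def insert_and_count(c):
    c.execute("INSERT INTO t VALUES (1)")
    return c.execute("SELECT COUNT(*) FROM t").fetchone()[0]

n = run_in_txn(conn, insert_and_count)
print(n)  # 1
```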
twic 22 minutes ago||
VoltDB took this to an extreme - the way you interact with it is by sending it some code which does a mix of queries and logic, and it automatically retries the code as many times as necessary if there's a conflict. Because it all happens inside the DBMS, it's transparent and fast. I thought that was really clever.

I'm using the past tense here, but VoltDB is still going. I don't think it's as well-known as it deserves to be.

lateforwork 3 hours ago||
> Postgres and MySQL don't default to serializable

Oracle and SQL Server also default to read committed, not serializable. Serializable looks good in text books but is rarely used in practice.

zadikian 2 hours ago||
Yeah, the only examples I know of it being default are Spanner and Cockroach, which are for a different use case.
interlocutor 8 hours ago||
A lot of database tools these days prioritize instant sharing of updates over transactions and ACID properties. Example: Airtable. As soon as you update a field the update shows up on your coworkers screen who also has the same table open. The downside of this is that Airtable doesn't do transactions. And the downside of not doing transactions is potentially dangerous data inconsistencies. More about that here: https://visualdb.com/blog/concurrencycontrol/
rishabhaiover 9 hours ago||
It's an absolute pleasure reading planetscale blogs. I'm curious about what tool is used to make these visualizations?
bddicken 8 hours ago|
Author here. Thank you! These visuals are built with js + gsap (https://gsap.com)
rishabhaiover 6 hours ago|||
Thank you for sharing it, kind sir. Your explanation on b+trees (https://planetscale.com/blog/btrees-and-database-indexes) is probably the best one I've ever seen on the internet.
airstrike 5 hours ago|||
Thought it was going to be a blog post about Jeopardy for a sec
lasgawe 4 hours ago||
We built an entire system for a client-side project with millions of SQL rows and thousands of users without adding a single transaction. :/
layer8 4 hours ago||
If you have no explicit transactions, every insert/update is its own transaction (aka auto-commit). Depending on what you do, you might not need more. It’s still important to know that these execute as a transaction.
zadikian 3 hours ago||
Yep, there have been times I get through a whole project without any explicit transactions. In fact it can be a sign of not fully normalized schema design if you rely on those a lot (which can ofc be fine if you deliberately wanted that).
bddicken 1 hour ago|||
These are still transactions! It's not uncommon for a large % of transactions in an OLTP workload to be a single query without an explicit BEGIN / COMMIT. This is called an autocommit transaction or implicit transaction.
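The autocommit-vs-explicit distinction is easy to demo (SQLite stands in for any RDBMS; isolation_level=None puts the Python driver in autocommit mode, where each statement runs as its own implicit transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 100)])

# Grouped in one explicit transaction, a failure before COMMIT undoes both steps.
conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
conn.execute("ROLLBACK")  # simulate a failure before COMMIT

balances_after_rollback = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances_after_rollback)  # {1: 100, 2: 100} -- both updates undone together

# As a lone autocommit statement, the debit commits immediately; a crash before
# the matching credit would leave the books unbalanced.
conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
balances_after_debit = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances_after_debit)  # {1: 50, 2: 100}
```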
libraryofbabel 3 hours ago|||
How nice for you. But since you totally neglected to say anything about your use-case or schema or query patterns, it’s impossible to know what this even means. Some use cases can trivially be done without any explicit transactions and you’re not giving anything up. For others (usually, something where you need to enforce invariants under high concurrency writes or writes+reads on the same data across multiple tables), transactions are pretty critical. So, it depends.
ninjaoxygen 3 hours ago||
Did you use SELECT FOR UPDATE at all, or did you just never have to update dependent data? If the complex operations are implemented using stored functions / procedures, then the transaction is implicit.

If the data is fairly straightforward, like just one-to-many CRUD with no circular references, you would be able to do it without transactions; table relationships alone would be enough to ensure consistency.

cryptonector 1 hour ago||
In the section about serializable read TFA gets the `accounts` `balance` wrong.
bone_tag 1 hour ago||
Yes, I wasn't sure if I misunderstood the concept or if there was an error in the article.
bddicken 1 hour ago||
Thanks, fixed!
jascha_eng 6 hours ago||
This actually used to be one of my favorite interview questions for backend engineers. Everyone has used transactions, but depending on your seniority you'd understand them to different degrees.

And no, I'd never expect people to know the isolation levels by heart, but if you know there are different ones and that they behave differently, that's pretty good and tells me you are curious about how things work under the hood.

layer8 3 hours ago|
The nominally same isolation levels can also behave differently on different database systems, so in general you have to investigate the details on a case-by-case basis anyway.
shalabhc 4 hours ago||
For all interested in this topic, I highly recommend the book Designing Data Intensive Applications https://www.goodreads.com/book/show/23463279-designing-data-....

It goes into not only different isolation levels, but also some ambiguity in the traditional ACID definition.

I believe a 2nd edition is imminent.

zabzonk 3 hours ago||
I thought this was pretty good, not least because it attempts to explain isolation levels, something I always found pretty tricky when teaching SQL. Mind you, I was only teaching SQL, and isolation along with it, as part of C and C++ courses so that our clients could do useful stuff, but explaining which levels to use was always tough.
unkulunkulu 5 hours ago|
> At this stage, it has nothing to do with xmin and xmax, but rather because other transactions cannot see uncommitted data

Am I missing something, or is this statement incomplete? Also, I find the visualization of commit weird: it "points to" the header of the table, but then xmax gets updated "behind the scenes"? Isn't xmin/xmax the mechanism by which the database knows what is committed or not? Also, there could be subtransactions, which make this statement even more contradictory.

I enjoyed the visualizations and explanations otherwise, thanks!

CWIZO 2 hours ago|
I also think the article glossed/skipped over the xmax/xmin concepts. And they are fundamental to understand how different isolation levels actually work. It's quite jarring to the point I'm wondering if a whole section got accidentally dropped from the article.