Posted by hollylawly 3 days ago
- We wrote a cronjob to periodically DELETE for a retention policy on a table we'd just created. Most senior person on the team reviewed it, looked fine.
- Unusually for us, we prioritize QA'ing a different feature for release, delaying the release of this cronjob and a bunch of other code.
- During that delay, the new table accumulated many times more rows to be deleted than we'd expected during review.
- Release happens. All looks well since the initial delete wasn't a migration and cronjob hasn't run yet; engineer doing the release signs off.
- Cronjob runs, deleting hundreds of millions of rows quickly.
- Next day, replica lag's high and MySQL's transaction history is very high. MySQL keeps transaction history around until purge threads have visited all the affected pages on disk.
- The bad cluster conditions last for days and lead to other problems.
This omits detail and the 'noise' of everything else we were watching. But it gets across how the code and MySQL behaved.
Like most exciting events, it led to multiple changes to avoid a repeat. For retention policies, our new approach was one at the end of PlanetScale's post, to partition and drop old partitions. Transitioning to this from a huge unpartitioned table can be fun!
If a table is append-only and already huge, with lots of rows already past the retention threshold, you might only copy the rows to be kept to the new partitioned table: copy what you can, lock tables, do a last catch-up copy and swap tables. (Roughly the blog's 'performant one-off delete'.)
If the table's merely kind of big, gh-ost or such could allow you to ALTER without causing lag, locking, etc.
At a scale below that, you could run a slow incremental 'nibble' delete while watching server stats, and a step below that, plain ALTERs or DELETEs are fine.
Using partitioning has fun bits, too. In MySQL, the partition key has to be part of any unique index, understandably. But you have to keep that in mind when you're using INSERT..ON DUPLICATE KEY UPDATE and relying on uniqueness to trigger the update. Things stay interesting!
I hear Vitess shops like PlanetScale usually don't run multi-terabyte myqsld instances in the first place: even when physical nodes are big, they run many smaller mysqlds on them. That wouldn't make all this fully irrelevant--huge deletes would still sometimes be worse than copy-swap-drop--but it does seem real handy for taming issues that tend to worsen with mysqld size, like replication lag. All to say, little bit jelly of their setup over there!
been exploring clickhouse and while it is definitely not a general purpose DB, for time-series shaped data that can survive some insert latency, the automatic partition-based TTL is very nice and, at least so far, requires zero attention to maintain
which I guess is solved by `pg_partman` at the bottom of the post