Posted by samuel246 5 days ago
(Although a materialised view is more like an index than a cache. The view won't expire and require you to rebuild it.)
In RDBMS contexts, an index really is a caching mechanism (a cache) managed by the database system (the query planner has to decide when it's best to use one index or another).
But as you note yourself, even in these cases where cache management is bundled with the database, having too many indexes can slow down (even deadlock) writes as the database tries to keep these redundant copies of the data consistent.
In some sense though. If it ain't L1 it's storage :)
Even if you use "cache" in the name (e.g. memcached), that's still not a cache, even if it's a KV store designed for caching.
If everything is caching, why even introduce the term? Language should help us describe ideas; it should not be superfluous.
Caching is storing a copy of data in a place or form where it is faster to retrieve than it otherwise would be. Caching is not an abstraction; it is a computer science technique for improving performance.
Caching does not make software simpler. In fact, it always, by necessity, makes software more complex. For example, there are:
- Routines to look up data in a fast storage medium
- Routines to retrieve data from a slow storage medium and store them in a fast storage medium
- Routines to remove the cache if an expiration is reached
- Routines to remove cache entries if we run out of cache storage
- Routines to remove the oldest unused cache entry
- Routines to remove the newest cache entry
- Routines to store the age of each cache entry access
- Routines to remove cache entries which have been used the least
- Routines to remove specific cache entries regardless of age
- Routines to store data in the cache at the same time as slow storage
- Routines to store data in cache and only write to slow storage occasionally
- Routines to clear out the data and get it again on-demand/as necessary
- Routines to inform other systems about the state of your cache
- ...and many, many more
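Several of the routines above (fast-path lookup, read-through fetch from slow storage, TTL expiration, LRU eviction when capacity runs out, and targeted invalidation) can be sketched together in a few dozen lines. This is a minimal illustration, not a production cache; the class name, parameters, and the `fetch` callback are all hypothetical.

```python
import time
from collections import OrderedDict

class LRUCache:
    """Read-through cache with per-entry TTL and LRU eviction (a sketch)."""

    def __init__(self, capacity, ttl_seconds, fetch):
        self.capacity = capacity      # max entries before we evict
        self.ttl = ttl_seconds        # expiration window per entry
        self.fetch = fetch            # callable that hits slow storage
        self.entries = OrderedDict()  # key -> (value, stored_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.monotonic() - stored_at < self.ttl:
                # Hit: mark as most recently used and return fast.
                self.entries.move_to_end(key)
                return value
            # Expired: drop the stale copy and fall through to refetch.
            del self.entries[key]
        # Miss: go to slow storage, then populate the cache.
        value = self.fetch(key)
        self.entries[key] = (value, time.monotonic())
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return value

    def invalidate(self, key):
        """Remove a specific entry regardless of age."""
        self.entries.pop(key, None)
```

Even this toy version already carries four distinct policies (lookup, fill, expire, evict), which is the point: each item in the list above is real code you now own and have to reason about.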
Each routine involves a calculation that determines whether the cache will be beneficial. A hit or miss can lead to operations which may add or remove latency, may or may not run into consistency problems, and may or may not require remediation. The cache may need to be warmed up, or it may be fine starting cold. Clearing the cache (e.g. on restart) may cause such a drastic cascading failure that the system cannot be started again. And there is often a large amount of statistics and analysis needed to optimize a caching strategy.

These are just a few of the considerations. Cache invalidation is famously one of the hardest problems in computer science. How caching is implemented, and what it affects, can be very complex and needs to be considered carefully. If you try to abstract it away, it usually leads to problems; if you don't try to abstract it away, it also leads to problems. Because of all that, abstracting caching away into a "general storage engine" is simply impossible in many cases.
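The "will the cache be beneficial" calculation can be made concrete with a back-of-the-envelope model. The latency numbers below are purely illustrative assumptions (say, an in-process cache in front of a remote database); the point is that a cache lookup is paid on misses too, so a low hit rate can make things worse, not better.

```python
# Hypothetical latencies, in milliseconds.
CACHE_LATENCY_MS = 1.0     # cost of a cache lookup (paid on hit AND miss)
STORAGE_LATENCY_MS = 20.0  # cost of going to slow storage

def effective_latency(hit_rate):
    """Average per-read latency with the cache in front of storage."""
    hit_cost = hit_rate * CACHE_LATENCY_MS
    miss_cost = (1 - hit_rate) * (CACHE_LATENCY_MS + STORAGE_LATENCY_MS)
    return hit_cost + miss_cost

def cache_is_beneficial(hit_rate):
    """The cache pays off only if it beats going straight to storage."""
    return effective_latency(hit_rate) < STORAGE_LATENCY_MS
```

With these numbers, a 90% hit rate averages 3.0 ms per read, but a 0% hit rate averages 21 ms — worse than no cache at all, since every miss pays for the lookup and then the storage round trip anyway.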
Caching also isn't just having data in fast storage. Caching is cheating. You want to serve your data faster than your normal data storage (or transfer mechanism, etc.) can actually deliver it. So you cheat, by copying it somewhere faster. And you cheat again, by trying to figure out how to look it up fast. And you cheat again, by trying to figure out how to deal with its state being ultimately separate from the state of the "real" data in storage.
Basically, caching is us trying to be really clever and work around our inherent limitations. But often we're not as smart as we think we are, and our clever cheat can bite us. So my advice is to design your system to work well without caching. You will thank yourself later, when you're finally dealing with those bites and realize you dodged a bullet.