Posted by upmostly 4 days ago
A lot of data pipeline work can function pretty well with each step producing artifacts that the next can consume. If you need unique items, use UUIDs. If you want to check whether an item exists, use a hash of something (maybe a URL). And each day goes in its own subdirectory. It works.
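The pattern described here can be sketched in a few lines (all names and the layout are illustrative, not from the original):

```python
# Artifact-per-step sketch: one subdirectory per day, items keyed by a URL hash,
# each item carrying a UUID. Purely illustrative names and layout.
import hashlib
import json
import uuid
from datetime import date
from pathlib import Path

def write_item(root: Path, url: str, payload: dict) -> Path:
    """Store one item under today's subdirectory, keyed by a hash of its URL."""
    day_dir = root / date.today().isoformat()  # each day gets its own subdirectory
    day_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    path = day_dir / f"{key}.json"
    if not path.exists():  # "does it exist?" is just a file-existence check
        payload = {"id": str(uuid.uuid4()), **payload}  # unique item id
        path.write_text(json.dumps(payload))
    return path
```

The existence check is where this shines and where it later bites: dedupe is free, but anything beyond key lookup means globbing and reading everything.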
Sometimes, though, you end up reinventing database constructs. Joins, indexing, versioning. A little of that is ok, but it gets gnarly fast. If you mess up, it could mean corrupting your data in a way that's irreversible. Or corrupting your data in a way that's reversible, but only with a lot of pain.
So a database has the benefit of enforcing consistency that a glob over json files doesn’t. It doesn’t mean every one-off script needs a blue-green Postgres DB, but there can be an inflection point.
In my (hypothetical, 'cause I never actually sat down and wrote it) case, I wanted the personal transactions for a month, and I realized I could just keep a single file per month and read the whole thing at once (also 'cause the application would display the whole month at once).
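A minimal sketch of that one-file-per-month idea (the paths and the transaction fields are made up for illustration):

```python
# One JSON file per month, read and rewritten whole, since the app shows a
# whole month at a time anyway. Illustrative sketch only.
import json
from pathlib import Path

def load_month(root: Path, year: int, month: int) -> list[dict]:
    """Read the whole month at once."""
    path = root / f"{year:04d}-{month:02d}.json"
    return json.loads(path.read_text()) if path.exists() else []

def add_transaction(root: Path, year: int, month: int, txn: dict) -> None:
    """Append one transaction and rewrite the single monthly file."""
    txns = load_month(root, year, month)
    txns.append(txn)
    (root / f"{year:04d}-{month:02d}.json").write_text(json.dumps(txns))
```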
Filesystems can be considered a key-value (or key-document) database. The funny thing about the example used in the link is that one could simply create a structure like `user/[id]/info.json` and access the user by ID directly instead of scanning files to find them. Then again, given the examples used, search by name would be a pain, and that's one point where databases handle things better.
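To make the contrast concrete, here's a sketch of both lookups under that assumed `user/[id]/info.json` layout: the ID lookup is a direct path, while the name search has to read every file.

```python
# Filesystem as key-value store: key lookup is cheap, anything else is a scan.
# The layout and the "name" field are assumptions for the example.
import json
from pathlib import Path

def get_user(root: Path, user_id: str) -> dict:
    # Direct access: user/[id]/info.json, no search needed
    return json.loads((root / "user" / user_id / "info.json").read_text())

def find_by_name(root: Path, name: str) -> list[dict]:
    # The painful case: every non-key query reads everything
    hits = []
    for info in (root / "user").glob("*/info.json"):
        user = json.loads(info.read_text())
        if user.get("name") == name:
            hits.append(user)
    return hits
```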
Now that said, if there's value to the "database" being human readable/editable, json is still well worth a consideration. Dealing with even sqlite is a pain in the ass when you just need to tweak or read something, especially if you're not the dev.
How? With SQL it's super easy to search, compare, and update data. That's what it's built for.
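For instance, with the `sqlite3` module from Python's standard library (table and column names invented for the example), searching, aggregating, and updating are each a single declarative statement:

```python
# Search, compare, and update with SQL instead of hand-rolled file scans.
# Schema and data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txns (category, amount) VALUES (?, ?)",
    [("food", 12.5), ("rent", 900.0), ("food", 7.25)],
)

# Search and aggregate in one statement, no manual loop over files:
total = conn.execute(
    "SELECT SUM(amount) FROM txns WHERE category = ?", ("food",)
).fetchone()[0]

# Updating is just as direct:
conn.execute("UPDATE txns SET category = 'groceries' WHERE category = 'food'")
```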
However, if all your application state can be represented in a single JSON file of less than a dozen MB, then yes, a database can be overkill.
Yes, agreed, but it's usually a lot easier to find the file in the first place, especially if the application follows the XDG spec. SQLite databases are usually buried somewhere because they aren't expected to be looked at.
1. You're paying very little for 1000x the features.
2. The chances your application will keep doing "pure ID lookups" forever are zero. The moment you need to query the data in any other way or have more than one writer you're going to have to throw all this nonsense code into the trash.
3. Do you need the 1.7x speedup? No, of course you don't. It's just optimizing for the sake of optimizing.
I'd have just used SQLite to begin with and not benchmarked anything. No custom code, no time wasted, great performance, amazing flexibility, all with minimum effort.
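For the pure-ID-lookup case being discussed, "just use SQLite to begin with" might look like this (the schema is an assumption; `sqlite3` ships with Python, so there's nothing to install):

```python
# Key-value lookups in SQLite: barely more code than the JSON-file version,
# and any future non-key query is just another SELECT plus maybe an index.
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in a real app
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, info TEXT)")

def put_user(user_id: str, info: str) -> None:
    conn.execute("INSERT OR REPLACE INTO users VALUES (?, ?)", (user_id, info))

def get_user(user_id: str):
    row = conn.execute("SELECT info FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None
```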