Do you even need a database?

Posted by upmostly 8 hours ago

Do you even need a database?(www.dbpro.app)

142 points | 208 comments

devilsdata 21 minutes ago|

This is a cool exercise, but I would hesitate to choose files over SQLite or another Dockerised relational database in production.

They are overoptimising for the simplest part of writing the application; the beginning. They've half-implemented an actual database, with none of the safety features. There are a lot of potential headaches that this article has avoided talking about; perhaps because they haven't experienced them yet.

See: https://danluu.com/file-consistency/

What happens when you need to start expanding the scope of this feature? Joining users on profiles, or users on orgs?

Ask yourself: how many shops have seriously written an application backed by files and stuck with it over the long-run? The answer is likely very few. Therefore, this is likely doubling up the work required.

There is a reason people reach for a database first. I'd strongly encourage anyone to avoid doing stuff like this.

ozgrakkurt 4 hours ago||

You need databases if you need any kind of atomicity. Doing atomic writes is extremely fragile if you are just on top of the filesystem.

This is also why many databases have persistence issues and can easily corrupt on-disk data on crash. Rocksdb on windows is a very simple example a couple years back. It was regularly having corruption issues when doing development with it.

gavinray 36 minutes ago||

  > Doing atomic writes is extremely fragile if you are just on top of the filesystem.

This is not true, at least in Linux.

  pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);

The requirements being that the write must be block-aligned and no larger than the underlying FS's guaranteed atomic write size

dkarl 3 hours ago|||

Honestly, at this point, if I had a design that required making atomic changes to files, I'd redo the design to use SQLite. The other way around sounds crazy to me.

"Why use spray paint when you can achieve the same effect by ejecting paint from your mouth in a uniform high-velocity mist?" If you happen to have developed that particular weird skill, by all means use it, but if you haven't, don't start now.

That probably sounds soft and lazy. I should learn to use my operating system's filesystem APIs safely. It would make me a better person. But honestly, I think that's a very niche skill these days, and you should consider if you really need it now and if you'll ever benefit from it in the future.

Also, even if you do it right, the people who inherit your code probably won't develop the same skills. They'll tell their boss it's impossibly dangerous to make any changes, and they'll replace it with a database.

duped 1 hour ago||

The problem is that most of the time when you want "atomic changes to files" the only safe API is copy the file, mutate it, then rename. That doesn't factor in concurrent writers or advisory locks.

If that kind of filesystem traffic is unsuitable for your application then you will reinvent journaling or write-ahead logging. And if you want those to be fast you'll implement checkpointing and indexes.

creatonez 3 hours ago|||

For the simple case, it isn't necessarily that fragile. Write the entire database to a temp file, then after flushing, move the temp file to overwrite the old file. All Unix filesystems will ensure the move operation is atomic. Lots of "we dump a bunch of JSON to the disk" use cases could be much more stable if they just did this.

Doesn't scale at all, though - all of the data that needs to be self-consistent needs to be part of the same file, so unnecessary writes go through the roof if you're only doing small updates on a giant file. Still gotta handle locking if there is risk of a stray process messing it up. And doing this only handles part of ACID.

jeffffff 2 hours ago||

don't forget to fsync the file before the rename! and you also need to fsync the directory after the rename!

goerch 3 hours ago|||

Nice, so we are already covering the A of ACID. And don't get me started about what OLAP databases like DuckDB can do for out of core workloads.

noselasd 1 hour ago|||

Yes, the code in the article will at one unlucky point end up with an empty file after a power outage.

At least write to a temp file(in the same filesystem), fsync the file and its folder and rename it over the original.

wasabi991011 1 hour ago|||

Yes, this is covered in the "When do you actually need a database?" section of the article.

vector_spaces 2 hours ago||

I mean, if your atomic unit is a single file and you can tolerate simple consistency models, flat files are perfectly fine. There are many use cases that fit here comfortably where a whole database would be overkill

z3ugma 6 hours ago||

At some point, don't you just end up making a low-quality, poorly-tested reinvention of SQLite by doing this and adding features?

freedomben 6 hours ago||

Sometimes yes, I've seen it. It even tends to happen on NoSQL databases as well. Three times I've seen apps start on top of Dynamo DB, and then end up re-implementing relational databases at the application level anyway. Starting with postgres would have been the right answer for all three of those. Initial dev went faster, but tech debt and complexity quickly started soaking up all those gains and left a hard-to-maintain mess.

leafarlua 6 hours ago|||

This always confuses me because we have decades of SQL and all its issues as well. Hundreds of experienced devs talking about all the issues in SQL and the quirks of queries when your data is not trivial.

One would think that for a startup of sorts, where things changes fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities are known, going for SQL becomes a natural path.

There is also cases for having both, and there is cases for graph-oriented databases or even columnar-oriented ones such as duckdb.

Seems to me, with my very limited experience of course, everything leads to same boring fundamental issue: Rarely the issue lays on infrastructure, and is mostly bad design decisions and poor domain knowledge. Realistic, how many times the bottleneck is indeed the type of database versus the quality of the code and the imlementation of the system design?

marcosdumay 4 hours ago|||

No, when things change fast and unpredictably, NoSQL is worse than when they are well-known and stable.

NoSQL gains you no speed at all in redesigning your system. Instead, you trade a few hard to do tasks in data migration into an unsurmountable mess of data inconsistency bugs that you'll never actually get into the end of.

> is mostly bad design decisions and poor domain knowledge

Yes, using NoSQL to avoid data migrations is a bad design decision. Usually created by poor general knowledge.

james_marks 3 hours ago|||

If the argument for NoSQL is, “we don’t know what our schema is going to be”, stop.

Stop and go ask more questions until you have a better understanding of the problem.

jampekka 2 hours ago||

Oftentimes better understanding of the problem needs trying out solutions. Armchair architectures tend to blow up in contact with reality.

freedomben 1 hour ago||

For sure, though with databases it's usually pretty clear even at the start whether your "objects" will be relational in nature. I can't think of a single time that hasn't been the case, over hundreds of apps/services I've been part of. Things like asynchronous jobs, message queues, even object storage, I fully agree though.

leafarlua 2 hours ago|||

Makes sense. But in this case, why NoSQL exists? What problems does it resolves and when should it be considered? I'm being naive, but fast changing environment has been one of the main advantages that I was taught from devs when it comes to NoSQL vs SQL (nosql being the choice for flexible schemas). So it is more about BASE vs ACID?

gf000 56 minutes ago|||

Probably the best use case would be something like a Facebook profile page for a given user.

It may not have a very rigid schema, you may later add several other optional fields.

You need very large scale (as in no of concurrent accesses), you want to shard the data by e.g. location. But also, the data is not "critical", your highschool not being visible temporarily for certain users is not an issue.

You mostly use the whole dataset "at the same time", you don't do a lot of WHERE, JOIN on some nested value.

In every other case I would rather reach for postgres with a JSONB column.

marcosdumay 2 hours ago|||

NoSQL was created to deal with scales where ACID becomes a bottleneck. It also shown itself useful for dealing with data that don't actually have an schema.

If you have either of those problems, you will know it very clearly.

Also, ironically, Postgres became one of the most scalable NoSQL bases out there, and one of the most flexible to use unstructured data too.

freedomben 1 hour ago||

Agreed. In my experience (YMMV), there was also a real adoption push in the js world from primarily front-end people that wanted to do some backend but didn't want to learn/deal with SQL databases. I don't say that with malice, I was also on-board the NoSQL train for a bit before I actually gained experience with the headaches it caused. The appeal of "just dump your JSON blob straight in" was (and still is) strong. Software is all about learning, and sometimes that learning is expensive. We've all built something we later regretted.

tracker1 4 hours ago||||

I think part of it is the scale in terms of the past decade and a half... The hardware and vertical scale you could get in 2010 is dramatically different than today.

A lot of the bespoke no-sql data stores really started to come to the forefront around 2010 or so. At that time, having 8 cpu cores and 10k rpm SAS spinning drives was a high end server. Today, we have well over 100 cores, with TBs of RAM and PCIe Gen 4/5 NVME storage (u.x) that is thousands of times faster and has a total cost lower than the servers from 2010 or so that your average laptop can outclass today.

You can vertically scale a traditional RDBMS like PostgreSQL to an extreme degree... Not to mention utilizing features like JSONB where you can have denormalized tables within a structured world. This makes it even harder to really justify using NoSQL/NewSQL databases. The main bottlenecks are easier to overcome if you relax normalization where necessary.

There's also the consideration of specialized databases or alternative databases where data is echo'd to for the purposes of logging, metrics or reporting. Not to mention, certain layers of appropriate caching, which can still be less complex than some multi-database approaches.

leafarlua 2 hours ago||

What about the microservices/serverless functions world? This was another common topic over the years, that using SQL with this type of system was not optimal, I believe the issue being the connections to the SQL database and stuff.

tracker1 2 hours ago||

I think a lot of the deference to microservices/serverless is for similar reasons... you can work around some of this if you use a connection proxy, which is pretty common for PostgreSQL...

That said, I've leaned into avoiding breaking up a lot of microservices unless/until you need them... I'm also not opposed to combining CQRS style workflows if/when you do need micro services. Usually if you need them, you're either breaking off certain compute/logic workflows first where the async/queued nature lends itself to your needs. My limited experience with a heavy micro-service application combined with GraphQL was somewhat painful in that the infrastructure and orchestration weren't appropriately backed by dedicated teams leading to excess complexity and job duties for a project that would have scaled just fine in a more monolithic approach.

YMMV depending on your specific needs, of course. You can also have microservices call natural services that have better connection sharing heuristics depending again on your infrastructure and needs... I've got worker pools that mostly operate of a queue, perform heavy compute loads then interact with the same API service(s) as everything else.

mike_hearn 4 hours ago||||

Disclaimer: I work part time on the DB team.

You could also consider renting an Oracle DB. Yep! Consider some unintuitive facts:

• It can be cheaper to use Oracle than MongoDB. There are companies that have migrated away from Mongo to Oracle to save money. This idea violates some of HN's most sacred memes, but there you go. Cloud databases are things you always pay for, even if they're based on open source code.

• Oracle supports NoSQL features including the MongoDB protocol. You can use the Mongo GUI tools to view and edit your data. Starting with NoSQL is very easy as a consequence.

• But... it also has "JSON duality views". You start with a collection of JSON documents and the database not only works out your JSON schemas through data entropy analysis, but can also refactor your documents into relational tables behind the scenes whilst preserving the JSON/REST oriented view e.g. with optimistic locking using etags. Queries on JSON DVs become SQL queries that join tables behind the scenes so you get the benefits of both NoSQL and SQL worlds (i.e. updating a sub-object in one place updates it in all places cheaply).

• If your startup has viral growth you won't have db scaling issues because Oracle DBs scale horizontally, and have a bunch of other neat performance tricks like automatically adding indexes you forgot you needed, you can materialize views, there are high performance transactional message queues etc.

So you get a nice smooth scale-up and transition from ad hoc "stuff some json into the db and hope for the best" to well typed data with schemas and properly normalized forms that benefit from all the features of SQL.

alexisread 3 hours ago|||

Good points, but Postgres has all those, along with much better local testing story, easier and more reliable CDC, better UDFs (in Python, Go etc.), a huge ecosystem of extensions for eg. GIS data, no licencing issues ever, API compatability with DuckDB, Doris and other DBs, and (this is the big one) is not Oracle.

danny_codes 9 minutes ago||||

But then you’d have to interact with Oracle.

So.

Yeah no sane person would be that stupid

tracker1 4 hours ago||||

I generally limit Oracle to where you are in a position to have a dedicated team to the design, deployment and management of just database operations. I'm not really a fan of Oracle in general, but if you're in a position to spend upwards of $1m/yr or more for dedicated db staff, then it's probably worth considering.

Even then, PostgreSQL and even MS-SQL are often decent alternatives for most use cases.

mike_hearn 3 hours ago||

That was true years ago but these days there's the autonomous database offering, where DB operations are almost all automated. You can rent them in the cloud and you just get the connection strings/wallet and go. Examples of stuff it automates: backups, scaling up/down, (as mentioned) adding indexes automatically, query plan A/B testing to catch bad replans, you can pin plans if you need to, rolling upgrades without downtime, automated application of security patches (if you want that), etc.

So yeah running a relational DB used to be quite high effort but it got a lot better over time.

tracker1 3 hours ago||

At that point, you can say the same for PostgreSQL, which is more broadly supported across all major and minor cloud platforms with similar features and I'm assuming a lower cost and barrier of entry. This is without signing with Oracle, Inc... which tends to bring a lot of lock-in behaviors that come with those feature sets.

TBF, I haven't had to use Oracle in about a decade at this point... so I'm not sure how well it competes... My experiences with the corporate entity itself leave a lot to be desired, let alone just getting setup/started with local connectivity has always been what I considered extremely painful vs common alternatives. MS-SQL was always really nice to get setup, but more recently has had a lot of difficulties, in particular with docker/dev instances and more under arm (mac) than alternatives.

I'm a pretty big fan of PG, which is, again, very widely available and supported.

mike_hearn 3 hours ago||

Autonomous DB can run on-premises or in any cloud, not just Oracle's cloud. So it's not quite the same.

I think PG doesn't have most of the features I named, I'm pretty sure it doesn't have integrated queues for example (SELECT FOR UPDATE SKIP LOCKED isn't an MQ system), but also, bear in mind the "postgres" cloud vendors sell is often not actually Postgres. They've forked it and are exploiting the weak trademark protection, so people can end up more locked in than they think. In the past one cloud even shipped a transaction isolation bug in something they were calling managed Postgres, that didn't exist upstream! So then you're stuck with both a single DB and a single cloud.

Local dev is the same as other DBs:

    docker run -d --name <oracle-db> container-registry.oracle.com/database/free:latest

See https://container-registry.oracle.com

Works on Intel and ARM. I develop on an ARM Mac without issue. It starts up in a few seconds.

Cost isn't necessarily much lower. At one point I specced out a DB equivalent to what a managed Postgres would cost for OpenAI's reported workload:

> I knocked up an estimate using Azure's pricing calculator and the numbers they provide, assuming 5TB of data (under-estimate) and HA option. Even with a 1 year reservation @40% discount they'd be paying (list price) around $350k/month. For that amount you can rent a dedicated Oracle/ExaData cluster with 192 cores! That's got all kinds of fancy hardware optimizations like a dedicated intra-cluster replication network, RDMA between nodes, predicate pushdown etc. It's going to perform better, and have way more features that would relieve their operational headache.

chrisweekly 2 hours ago|||

In the spirit of helpfulness (not pedantry) FYI "knocked up" means "impregnated". Maybe "put together"?

mike_hearn 2 hours ago||

Ah, this must be a British vs American English thing, thanks for the info.

Yes I meant it in this sense: "If you knock something up, you make it or build it very quickly, using whatever materials are available."

https://www.collinsdictionary.com/dictionary/english/knock-u...

tracker1 2 hours ago|||

And, again... most of my issues are with Oracle, Inc. So technical advantages are less of a consideration.

OtomotO 2 hours ago||||

If you have an option, never ever use Oracle!

Never!

freedomben 4 hours ago|||

I wanted to hate you for suggesting Oracle, but you defend it well! I had no idea

dalenw 5 hours ago||||

It's almost always a system design issue. Outside of a few specific use cases with big data, I struggle to imagine when I'd use NoSQL, especially in an application or data analytics scenario. At the end of the data, your data should be structured in a predictable manner, and it most likely relates to other data. So just use SQL.

greenavocado 5 hours ago||

System design issues are a product of culture, capabilities, and prototyping speed of the dev team

AlotOfReading 3 hours ago||||

There's plenty of middle ground between an unchanging SQL schema and the implicit schemas of "schemaless" databases. You can have completely fluid schemas with the full power of relational algebra (e.g. untyped datalog). You shouldn't be using NoSQL just because you want to easily change schemas.

ignoramous 22 minutes ago|||

> One would think that for a startup of sorts, where things changes fast and are unpredictable, NoSQL is the correct answer. And when things are stable and the shape of entities are known, going for SQL becomes a natural path.

NoSQL is the "correct" answer if your queries are KV oriented, while predictable performance and high availability are priority (true for most "control planes"). Don't think any well-designed system will usually need to "graduate" from NoSQL to SQL.

Prior: https://news.ycombinator.com/item?id=22249490

akdev1l 2 hours ago||||

> end up re-implementing relational databases at the application level anyway

This is by design, the idea is that scaling your application layer is easy but scaling your storage/db layer is not

Hence make the storage dumb and have the application do the joins and now your app scales right up

(But tbh I agree a lot of applications don’t reach the scale required to benefit from this)

tshaddox 3 hours ago||||

I've never used DynamoDB in production, but it always struck me as the type of thing where you'd want to start with a typical relational database, and only transition the critical read/write paths when you get to massive scale and have a very good understanding of your data access patterns.

icedchai 4 hours ago|||

Same. DynamoDB is almost never a good default choice unless you've thought very carefully about your current and future use cases. That's not to say it's always bad! At previous startups we did some amazing things with Dynamo.

noveltyaccount 6 hours ago|||

As soon as you need to do a JOIN, you're either rewriting a database or replatforming on Sqlite.

goerch 3 hours ago|||

a) Just heard today: JOINs are bad for performance b) How many columns can (an Excel) table have: no need for JOINs

datadrivenangel 2 hours ago||

vlookups are bad for performance. recursive vlookups even more so.

pgtan 4 hours ago|||

Here are two checks using joins, one with sqlite, one with the join builtin of ksh93:

  check_empty_vhosts () {
    # Check which vhost adapter doesn't have any VTD mapped
    start_sqlite
    tosql "SELECT l.vios_name,l.vadapter_name FROM vios_vadapter AS l
        LEFT OUTER JOIN vios_wwn_disk_vadapter_vtd AS r
    USING (vadapter_name,vios_name)
    WHERE r.vadapter_name IS NULL AND
      r.vios_name IS NULL AND
   l.vadapter_name LIKE 'vhost%';"
    endsql
    getsql
    stop_sqlite
  }

  check_empty_vhosts_sh () {
    # same as above, but on the shell
    join  -v 1  -t , -1 1 -2 1 \
   <(while IFS=, read vio host slot; do 
  if [[ $host == vhost* ]]; then
      print ${vio}_$host,$slot 
  fi
     done < $VIO_ADAPTER_SLOT | sort -t , -k 1)\
   <(while IFS=, read vio vhost vtd disk; do
  if [[ $vhost == vhost* ]]; then        
    print ${vio}_$vhost
  fi
     done < $VIO_VHOST_VTD_DISK | sort -t , -k 1)
  }

bachmeier 4 hours ago|||

Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.

Edit: I just submitted a link to Joe Armstrong's Minimum Viable Programs article from 2014. If the response to my comment is about the enterprise and imaginary scaling problems, realize that those situations don't apply to some programming problems.

locknitpicker 4 hours ago||

> Based on what's in the article, it wouldn't take much to move these files to SQLite or any other database in the future.

Why waste time screwing around with ad-hoc file reads, then?

I mean, what exactly are you buying by rolling your own?

bachmeier 4 hours ago||

You can avoid the overhead of working with the database. If you want to work with json data and prefer the advantages of text files, this solution will be better when you're starting out. I'm not going to argue in favor of a particular solution because that depends on what you're doing. One could turn the question around and ask what's special about SQLite.

pythonaut_16 4 hours ago|||

If your language supports it, what is the overhead of working with SQLite?

What's special about SQLite is that it already solves most of the things you need for data persistence without adding the same kind of overhead or trade offs as Postgres or other persistence layers, and that it saves you from solving those problems yourself in your json text files...

Like by all means don't use SQLite in every project. I have projects where I just use files on the disk too. But it's kinda inane to pretend it's some kind of burdensome tool that adds so much overhead it's not worth it.

cleversomething 4 hours ago||||

> what's special about SQLite

Battle-tested, extremely performant, easier to use than a homegrown alternative?

By all means, hack around and make your own pseudo-database file system. Sounds like a fun weekend project. It doesn't sound easier or better or less costly than using SQLite in a production app though.

ablob 4 hours ago||||

So you trade the overhead of SQL with the overhead of JSON?

locknitpicker 4 hours ago|||

> You can avoid the overhead of working with the database.

What overhead?

SQLite is literally more performant than fread/fwrite.

cleversomething 4 hours ago||

That's exactly what I was going to say. This seems more like a neat "look Ma, no database!" hobby project than an actual production recommendation.

hackingonempty 3 hours ago|||

Probably more like a low-quality, poorly-tested reinvention of BerkeleyDB.

whalesalad 4 hours ago|||

Reminds me of the infamous Robert Virding quote:

“Virding's First Rule of Programming: Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”

mrec 3 hours ago||

In case you weren't aware, that in itself is riffing on Greenspun's tenth rule:

https://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

randyrand 4 hours ago|||

“You Aren’t Gonna Need It” - one of the most important software principles.

Wait until you actually need it.

dkarl 3 hours ago|||

I interpret YAGNI to mean that you shouldn't invest extra work and extra code complexity to create capabilities that you don't need.

In this case, I feel like using the filesystem directly is the opposite: doing much more difficult programming and creating more complex code, in order to do less.

It depends on how you weigh the cost of the additional dependency that lets you write simpler code, of course, but I think in this case adding a SQLite dependency is a lower long-term maintenance burden than writing code to make atomic file writes.

The original post isn't about simplicity, though. It's about performance. They claim they achieved better performance by using the filesystem directly, which could (if they really need the extra performance) justify the extra challenge and code complexity.

goerch 2 hours ago||||

Is this what we do with education in general?

upmostly 3 hours ago|||

100%.

Premature optimisation I believe that's called.

I've seen it play out many times in engineering over the years.

trgn 2 hours ago|||

im sure, but honestly, i would love to have a db engine that just writes/reads csv or json. does it exist?

banana_giraffe 2 hours ago|||

DuckDB can do exactly this, once you get the API working in your system, it becomes something simple like

    SELECT \* from read_csv('example.csv');

Writing generally involves reading to an in-memory database, making whatever changes you want, then something like

    COPY new_table TO 'example.csv' (HEADER true, DELIMITER ',');

herpdyderp 2 hours ago||||

I wrote a CSV DB engine once! I can't remember why. For fun?

zabzonk 1 hour ago||

Microsoft actually provide an ODBC CSV data source out of the box.

akdev1l 2 hours ago|||

SQLite can do it

trgn 55 minutes ago||

it's storage file is a csv? or do you mean import/export to csv?

gorjusborg 6 hours ago||

Only if you get there and need it.

z3ugma 6 hours ago|||

but it's so trivial to implement SQLite, in almost any app or language...there are sufficient ORMs to do the joins if you don't like working with SQL directly...the B-trees are built in and you don't need to reason about binary search, and your app doesn't have 300% test coverage with fuzzing like SQLite does

you should be squashing bugs related to your business logic, not core data storage. Local data storage on your one horizontally-scaling box is a solved problem using SQLite. Not to mention atomic backups?

gorjusborg 6 hours ago|||

Honestly, there is zero chance you will implement anything close to sqlite.

What is more likely, if you are making good decisions, is that you'll reach a point where the simple approach will fail to meet your needs. If you use the same attitude again and choose the simplest solution based on your _need_, you'll have concrete knowledge and constraints that you can redesign for.

z3ugma 2 hours ago||

not re-implement SQLite, I mean "use SQLite as your persistence layer in your program"

e.g. worry about what makes your app unique. Data storage is not what makes your app unique. Outsource thinking about that to SQLite

hirvi74 6 hours ago||||

Sqlite is also the only major database to receive DO-178B certification, which allows Sqlite to legally operate in avionic environments and roles.

moron4hire 6 hours ago||||

Came here to also throw in a vote for it being so much easier to just use SQLite. You get so much for so very little. There might be a one-time up-front learning effort for tweaking settings, but that is a lot less effort than what you're going to spend on fiddling with stupid issues with data files all day, every day, for the rest of the life of your project.

tracker1 3 hours ago||

Even then... I'd argue for at least LevelDB over raw jsonl files... and I say this as someone who would regularly do ETL and backups to jsonl file formats in prior jobs.

9rx 6 hours ago|||

> and your app doesn't have 300% test coverage with fuzzing like SQLite does

Surely it does? Otherwise you cannot trust the interface point with SQLite and you're no further ahead. SQLite being flawless doesn't mean much if you screw things up before getting to it.

RL2024 6 hours ago||

That's true but relying on a highly tested component like SQLite means that you can focus your tests on the interface and your business logic, i.e. you can test that you are persisting to the your datastore rather than testing that your datastore implementation is valid.

9rx 5 hours ago||

Your business logic tests will already, by osmosis, exercise the backing data store in every conceivable way to the fundamental extent that is possible with testing given finite time. If that's not the case, your business logic tests have cases that have been overlooked. Choosing SQLite does mean that it will also be tested for code paths that your application will never touch, but who cares about that? It makes no difference if code that is never executed is theoretically buggy.

wmanley 4 hours ago||

Business logic tests will rarely test what happens to your data if a machine loses power.

9rx 4 hours ago||

Then your business logic contains unspecified behaviour. Maybe you have a business situation where power loss conditions being unspecified is perfectly acceptable, but if that is so it doesn't really matter what happens to your backing data store either.

upmostly 6 hours ago|||

Exactly. And most apps don't get there and therefore don't need it.

evanelias 6 hours ago||

Your article completely ignores operational considerations: backups, schema changes, replication/HA. As well as security, i.e. your application has full permissions to completely destroy your data file.

Regardless of whether most apps have enough requests per second to "need" a database for performance reasons, these are extremely important topics for any app used by a real business.

koliber 5 hours ago||

I love this article as it shows how fast computers really are.

There is one conclusion that I do not agree with. Near the end, the author lists cases where you will outgrow flat files. He then says that "None of these constraints apply to a lot of applications."

One of the constraints is "Multiple processes need to write at the same time." It turns out many early stage products need crons and message queues that execute on a separate worker. These multiple processes often need to write at the same time. You could finagle it so that the main server is the only one writing, but you'd introduce architectural complexity.

So while from the pure scale perspective I agree with the author, if you take a wider perspective, it's best to go with a database. And sqlite is a very sane choice.

If you need scale, cache the most often accessed data in memory and you have the best of both worlds.

My winning combo is sqlite + in-memory cache.

jerf 1 hour ago||

The one that gets me a lot, which is similar in practice to your point, is when I need server redundancy, even if one server is otherwise plenty for my task. As soon as I'm not running in one place, you need network data storage, and that kicks pretty hard in the direction of a network-accessible database. S3 works sometimes and the recent work on being able to atomically claim files has helped with some of the worst rough edges but it still doesn't take a lot to disqualify it, at least as the only store.

pseudosavant 3 hours ago|||

SQLite has become my new go-to when starting any project that needs a DB. The performance is very fast, and if anything is ever successful enough to outgrow SQLite, it wouldn't be that hard to switch it out for Postgres. Not having to maintain/backup/manage a separate database server is cheaper and easier.

koliber 2 hours ago||

Backups are super-simple as well.

I'm also a convert.

upmostly 3 hours ago||

Seeing the Rust 1M benches were an amazing reminder as to how fast stuff really is.

koliber 2 hours ago||

The reality is that things will be blazing fast in any language if you save things by PK in HashMaps.

orthogonal_cube 32 minutes ago||

SQLite did decently well but I think they should’ve done an additional benchmark with the database loaded completely into memory.

Since they’re using Go to accept requests and forwarding them to their SQLite connection, it may have been worthwhile to produce the same interface with Rust to demonstrate whether or not SQLite itself was hitting its performance limit or if Go had some hand in that.

Other than that, it’s a good demonstration of how a custom solution for a lightweight task can pay off. Keep it simple but don’t reinvent the wheel if the needs are very general.

kabir_daki 6 hours ago||

We built a PDF processing tool and faced this exact question early on.

For our use case — merge, split, compress — we went fully stateless. Files are processed in memory and never stored. No database needed at all.

The only time a database becomes necessary is when you need user accounts, history, or async jobs for large files. For simple tools, a database is often just added complexity.

The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

tetha 1 hour ago||

> The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

We have a bunch of these applications and they are a joy to work with.

Funny enough, even if you have a database, if you wonder if you need caches to hold state in your application server, the answer is, kindly, fuck no. Really, really horrible scaling problems and bugs are down that path.

There are use cases to store expensive to compute state in varnish (HTTP caching), memcache/redis (expensive, complex datastructures like a friendship graph), elasticsearch/opensearch (aggregated, expensive full-text search), but caching SQL results in an application server because the database is "slow" beyond a single transaction brings nothing but pain in the future. I've spent so much energy working around decisions born out of simple bad schema design decisions and tuning...

bevr1337 4 hours ago||

> The real question isn't "do you need a database" but "do you need state" — and often the answer is no.

This is a solid takeaway and applies to a lot of domains. Great observation

datadrivenangel 2 hours ago||

But if you have state, and that state needs to persist between program executions, then for the love of resiliency and all that is robust, consider using a SQL database!

forinti 6 hours ago||

Many eons ago I wrote a small sales web application in Perl. I couldn't install anything on the ISP's machine, so I used file-backed hashes: one for users, one for orders, another for something else.

As the years went by, I expected the client to move to something better, but he just stuck with it until he died after about 20 years, the family took over and had everything redone (it now runs Wordpress).

The last time I checked, it had hundreds of thousands of orders and still had good performance. The evolution of hardware made this hack keep its performance well past what I had expected it to endure. I'm pretty sure SQLite would be just fine nowadays.

da02 4 hours ago|

What type of product or service were they selling?

forinti 4 hours ago||

A calendar for cutting your hair according to the phases of the moon.

da02 4 hours ago||

Sounds like a tough business. The profit margins must have been razor thin.

forinti 4 hours ago||

Jokes aside, the guy made an impressive amount of money with this.

I should have charged him a percentage. Even if I had charged 0.5%, I would have made more money.

da02 2 hours ago||

Retail customers or B2B? What was stopping you from starting a similar business, but different product category?

shafoshaf 6 hours ago||

Relational Databases Aren’t Dinosaurs, They’re Sharks. https://www.simplethread.com/relational-databases-arent-dino...

The very small bonus you get on small apps is hardly worth the time you spend redeveloping the wheel.

adrian_b 3 hours ago|

Sharks vs. dinosaurs seems indeed an appropriate metaphor.

During Cretaceous, when dinosaurs were at their peak, sharks had already become very similar to the sharks of today, e.g. there were big sharks that differed very little from the white sharks and tiger sharks of today.

Then the dinosaurs have disappeared, together with the pterosaurs and the mosasaurs, and they have been replaced by other animals, but the sharks have continued to live until today with little changes, because they had already reached an optimized design that was hard to improve.

Besides the sharks, during Cretaceous there already existed along the dinosaurs other 2 groups of big predators that have changed little since then, crocodiles and big constrictor snakes similar to the pythons of today.

Therefore all 3 (sharks, crocodiles and big constrictor snakes) are examples of locally optimum designs that have been reached more than 70 million years ago, without needing after that any major upgrades.

ktzar 5 hours ago||

Writing your own storage is a great way to understand how databases work (if you do it efficiently, keeping indexes, correct data structures, etc.) and to come to the conclusion that if your intention wasn't just tinkering, you should've used a database from day 1.

throwway120385 3 hours ago|

I dunno. Even in embedded systems every time I've started without a database I've eventually come to need something like a database, and in every case I've found myself building essentially an ad-hoc poorly managed database into the application including marshalling/unmarshalling, file management, notification, and so on because each new feature over the top of regular files was just that much easier to add versus switching to a database system.

However the driving motivation for adding a database is not necessarily managing data, but the fact that the database system creates a nice abstraction layer around storing data of relational or non-relational form in non-volatile memory and controlling access to it while other systems are updating it. And because it's a nice abstraction, there are a lot of existing libraries that can take advantage of it in your language of choice without requiring you to completely invent all of that stuff over the top of the filesystem. That has knock-on effects when you're trying to add new functionality or new interaction patterns to an existing system.

And in cases where two or more processes need to communicate using the same data, a database gives you some good abstractions and synchronization primitives that make sense, whereas regular files or IPC require you to invent a lot of that stuff. You could use messaging to communicate updates to data but now you have two copies of everything, and you have to somehow atomize the updates so that either copy is consistent for a point in time. Why not use a database?

Knowing what I know today I would start with some kind of database abstraction even if it's not necessarily designed for transactional data, and I would make sure it handled the numerous concerns I have around data sharing, consistency, atomicity, and notification because if I don't have those things I eventually have to invent them to solve the reliability problems I otherwise run in to without them.

More comments...