
Posted by upmostly 4 days ago

Do you even need a database?(www.dbpro.app)
303 points | 296 comments | page 4
inasio 3 days ago|
There's a whole thing these days about building solvers (e.g. SAT or Ising) out of exotic hardware that does compute in memory. A while back I wondered if one could leverage distributed DB logic to build solvers for massive problems: something like compute in DB.
thutch76 3 days ago||
I love reading posts like these.

I will still reach for a database 99% of the time, because I like things like SQL and transactions. However, I've recently been working on a 100% personal project to manage some private data: extracting insights, graphing trends, etc. It's not high-volume data, so I decided to use just the file system, with data stored as YAML files and some simple indexing, and I haven't run into any performance issues yet. I probably never will at my scale and volume.

In this particular case having something that was human readable, and more importantly diffable, was more valuable to me than outright performance.

Having said that, I will still gladly reach for a database with a query language and all the guarantees that come with it 99% of the time.

jmaw 3 days ago||
Very interesting, I'd never heard of JSONL before: https://jsonlines.org/

Also notable mention for JSON5 which supports comments!: https://json5.org/

the_inspector 4 days ago||
In many cases, no. E.g. for caching in Python, diskcache is a good choice. For small amounts of data, a JSON file does the job (you pointed to JSONL as an option). But for larger collections that should be searchable/processable, Postgres is a good choice.

Memory of course, as you wrote, also seems reasonable in many cases.

upmostly 4 days ago|
[dead]
chrismorgan 3 days ago||
> The index format is simple: one line per record, exactly 58 bytes: <36-char UUID>:<20-digit byte offset in data file>\n.

It would be much better to write all of this as binary data, omitting separators.

• Since it’s fixed-width and simple, inspecting the data is still pretty easy: there are tools for working with binary data of a declared schema, or you could write a few-liner to convert it yourself. You don’t lose much by departing from ASCII.

• You might want to complicate it a little by writing a version tag at the start of the file or outside it so you can change the format more easily (e.g. if you ever add a third column). I will admit the explicit separators do make that easier. You can also leave that for later, it probably won’t hurt.

• UUID: 36 bytes → 16 bytes.

• Offset: 20 bytes (zero-padded base-ten integer) → 8 bytes.

• It removes one type of error altogether: now all bit patterns are syntactically valid.

• It’ll use less disk space, be cheaper to read, be cheaper to write, and probably take less code.

I also want to register alarm at the sample code given for func FindUserBinarySearch. To begin with, despite a return type of (*User, error), it always returns nil error—it swallows all I/O errors and ignores JSON decode errors. Then:

  entryID := strings.TrimRight(string(buf[:36]), " ")
That strings.TrimRight will only do anything if your data is corrupted.

  cmp := strings.Compare(entryID, id)
Not important when you control the writing, but worth noting that UUID string comparison is case-insensitive.

  offsetStr := strings.TrimLeft(string(buf[37:57]), "0")
Superfluous. ParseInt doesn’t mind leading zeroes, and it’ll probably skip them faster than a separate TrimLeft call.

  dataOffset, _ := strconv.ParseInt(offsetStr, 10, 64)
That’s begging to make data corruption difficult to debug. Most corruption will now become dataOffset 0. Congratulations! You are now root.
Panzerschrek 2 days ago||
I agree that using a database is overkill in many cases. And I don't understand at all who uses so-called in-memory databases, or why.
jmaw 3 days ago||
While this is certainly cool to see, and I love seeing how fast webservers can go, the counter-question "Do you even need 25,000 RPS and sub-ms latency?" comes to mind.

I don't choose a DB over a flat file for its speed. I choose a DB for the consistent interface and redundancy.

chuckadams 4 days ago||
I need a filesystem that does some database things. We got teased with that by WinFS and BeOS's BFS, but it seems the football always gets yanked away, and the mainstream of filesystems always reverts back to the APIs established in the 1980s.
tracker1 3 days ago|
FWIW, you can do some things like this on top of S3 Metadata.
chuckadams 3 days ago||
Transactions are one thing I want the most, and that's not going to happen on S3. Sure, I can reinvent them by hand, but the point is I want that baked in.
tracker1 3 days ago||
Yeah, the closest thing there is MS-SQL FILESTREAM, but even that has flaws and severe limitations... you can do similar transaction implementations for binary data storage in any given RDBMS, or get similar behavior by reading/writing to the filestream alongside a transactional row lock that corresponds to the underlying data. But that brings its own complexities.
827a 3 days ago||
I'm a big fan of using S3 as a database. A lot of apps can get a lot of mileage just doing that for a good chunk of their data: the data that only needs lookup by a single field (usually ID, but it doesn't have to be).
tracker1 3 days ago|
I worked in an org where a lot of records were denormalized to be used in a search database... since I went through that level of work anyway, I also fed the exports into S3 records as a "just in case" backup. That backup path became really useful in practice, since there was eventually a need for a "pending" version of records, separate from the "published" version.

In practice, a flat view of the record data took no fewer than 30 joins for a single view of what could/should have been one somewhat denormalized record. In the early 2010s that meant the main database was often kicked over under load, and it took a lot of effort to add appropriate caching and the search DB, which wound up handling most of the load on a smaller server.

pstuart 3 days ago|
In order to ask this question it's important to understand the lifecycle of the data in question. If it is constantly being updated and requires "liveness" (updates are reflected in queries immediately), the simple answer is: yes, you need a database.

But if you have data that is static or effectively static (data that is updated occasionally or batched), then serving via custom file handling can have its place.

If the records are fixed width and sorted on the key value, then it becomes trivial to do a binary search on the mmapped file. It's about as lightweight as could be asked for.

goerch 3 days ago|
But then you'll get two files to join?
pstuart 3 days ago||
I should have been clear about the assumption baked into that statement: the data in question is in a single file, with fixed-size fields, sorted by primary key. That precludes "looser" datasets, but I believe my point stands for the given context.