Posted by enether 7 days ago

Kafka is Fast – I'll use Postgres (topicpartition.io)
558 points | 392 comments
guywithahat 7 days ago|
> One camp chases buzzwords

> ...

> The other camp chases common sense

I don't really like these simplifications. One group obviously isn't just dumb; they're doing things for reasons you may not understand. I don't know enough about data science to make a call, but I'm guessing there were reasons to use Kafka due to hardware limits or scalability concerns at the time, and while those issues may not be as present today, that doesn't mean they used Kafka just because they heard a new word and wanted to repeat it.

sumtechguy 7 days ago||
Kafka and other message systems like it have their uses, but sometimes all you need is a database. Once you start doing realtime streaming, notifications, and event-type things, a messaging system is good. You can even back it up with a boring database. Would I start with Kafka? Probably not. I would start with a boring database, and if bashing on the DB over and over asking 'have you changed?' no longer works well enough, then you put in a messaging system.
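The "boring database" approach described above can be sketched with SQLite from the Python standard library. This is an illustrative sketch, not anything from the thread: the table, columns, and function names are all invented for the example, and a real Postgres version with concurrent consumers would claim rows with `FOR UPDATE SKIP LOCKED` instead of a plain `SELECT`/`UPDATE`.

```python
import sqlite3

# A minimal database-backed job queue: producers INSERT, consumers poll.
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        done INTEGER NOT NULL DEFAULT 0
    )
""")

def enqueue(payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()

def poll_one():
    """The 'have you changed?' loop body: claim the oldest unfinished job."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE done = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # nothing new; the caller sleeps and polls again
    conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (row[0],))
    conn.commit()
    return row[1]

enqueue("send-email")
enqueue("resize-image")
print(poll_one())  # send-email
print(poll_one())  # resize-image
print(poll_one())  # None
```

The polling loop is exactly the "bashing on the db saying 'have you changed'" part; swapping in a messaging system replaces that loop with a push.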
temporallobe 7 days ago||
Agree with this sentiment - it’s easy to be judgmental about these things, but project-level issues and decisions can be very complicated and engineers often have little to no visibility into them. We’re using Kafka for a gigantic pipeline where IMO any reasonably modern database would suffice (and may even be superior), but our performance requirements are unclear. At some point in the distant future, we may have a significant surge in data quantity and speed, requiring greater throughput and (de)serialization speed, but I am not convinced that Kafka ultimately helps us there. I imagine this is a case where the program leadership was sold a solution which we are now obligated to use.

This happens a LOT, and I have seen unnecessary and unused products cost companies millions over the years. For example, my team was doing analysis on replacing our existing Atlassian Data Center with other solutions, and in doing so, we discovered several underused/unused Atlassian plugins for which we are paying very high license fees. At some point, users over the years had requested some functionality for a specific workflow and the plugins were purchased. The people and projects went away or processes otherwise became OBE, but the plugins happily hummed along while the bills were paid.
this_user 7 days ago||
The real two camps seem to be:

1) People constantly chasing the latest technology with no regard for whether it's appropriate for the situation.

2) People constantly trying to shoehorn their favourite technology into everything with no regard for whether it's appropriate for the situation.

PeterCorless 7 days ago||
2) above is basically "Give a kid a hammer, and everything becomes a nail."

The third camp:

3) People who look at a task, then apply a tool appropriate for the task.

j45 7 days ago||
Kafka is anything but new. It does get shoehorned too.

Postgres has also been around for a long time, and a lot of people don’t know everything it can do beyond what we normally think of as a database.

Appropriateness is a nice way to look at it, as long as it’s clear whether it’s really about fit or about personal preferences and interpretations, and being righteous towards others with them.

Customers rarely care about the backend or what it’s developed in, except maybe for developer products. It’s a great way to waste time though.

spectraldrift 7 days ago||
> Should You Use Postgres? Most of the time - yes

This made me wonder about a tangential statistic that would, in all likelihood, be impossible to derive:

If we looked at all database systems running at any given time, what proportion does each technology represent (e.g., Postgres vs. MySQL vs. [your favorite DB])? You could try to measure this in a few ways: bytes written/read, total rows, dollars of revenue served, etc.

It would be very challenging to land on a widely agreeable definition. We'd quickly get into the territory of what counts as a "database" and whether to include file systems, blockchains, or even paper. Still, it makes me wonder. I feel like such a question would be immensely interesting to answer.

Because then we might have a better definition of "most of the time."

abtinf 7 days ago|
SQLite likely dominates all other databases combined on the metrics you mentioned, I would guess by at least an order of magnitude.

Server side. Client side. iOS, iPad, Mac apps. Uses in every field. Uses in aerospace.

Just think for a moment that literally every photo and video taken on every iPhone (and I would assume Android as well) ends up stored (either directly or via sizable amounts of metadata) in a SQLite db.

sublimefire 7 days ago||
Yes, it seems like it is absent in this discussion, but maybe it should have been “it” the whole time as the default option. I wonder if it could attain similar throughput numbers; I bet the article would feel slightly sarcastic then, though.
losvedir 7 days ago||
Maybe I missed it in the design here, but this pseudo-Kafka Postgres implementation doesn't really handle consumer groups very well. The great thing about Kafka consumer groups is it makes it easy to spread the load over several instances running your service. They'll all connect using the same group, and different partitions will be assigned to the different instances. As you scale up or down, the partition responsibilities will be updated accordingly.

You need some sort of server-side logic to manage that, and the consumer heartbeats, and generation tracking, to make sure that only the "correct" instances can actually commit the new offsets. Distributed systems are hard, and Kafka goes through a lot of trouble to ensure that you don't fail to process a message.
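The two pieces of server-side logic the comment names, spreading partitions across a consumer group and fencing commits from stale members, can be sketched in a few dozen lines. This is an in-memory illustration of the idea, not the actual Kafka group protocol; all class and method names are invented for the example.

```python
# Sketch of two things the comment says Kafka's broker does for you:
# 1) spreading partitions across the live members of a consumer group, and
# 2) rejecting offset commits from members of a stale generation.

class ConsumerGroup:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.generation = 0
        self.assignment = {}   # member id -> list of owned partitions
        self.offsets = {}      # partition -> committed offset

    def rebalance(self, members):
        """Round-robin partitions over members; bump the generation."""
        self.generation += 1
        self.assignment = {m: [] for m in members}
        for p in range(self.num_partitions):
            self.assignment[members[p % len(members)]].append(p)
        return self.generation

    def commit(self, member, generation, partition, offset):
        """Only the current generation's owner of a partition may commit."""
        if generation != self.generation:
            return False   # fenced: the group rebalanced in the meantime
        if partition not in self.assignment.get(member, []):
            return False
        self.offsets[partition] = offset
        return True

group = ConsumerGroup(num_partitions=4)
gen1 = group.rebalance(["a", "b"])
print(group.assignment)                # {'a': [0, 2], 'b': [1, 3]}
print(group.commit("a", gen1, 0, 42))  # True

group.rebalance(["a", "b", "c"])       # a member joined; new generation
print(group.commit("a", gen1, 0, 50))  # False: stale generation is fenced
```

Doing this on top of Postgres means someone has to own heartbeats, rebalance triggers, and the generation check in application code, which is the point being made above.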

mrkeen 7 days ago|
Right, the author's worldview is that Kafka is resume-driven development, used by people "for speed" (even though they are only pushing 500KB/s).

Of course the implementation based off that is going to miss a bit.

johnyzee 7 days ago||
Seems like you would at the very least need a fairly thick application layer on top of Postgres to make it look and act like a messaging system. At that point, seems like you have just built another messaging system.

Unless you're a five-person shop where everybody just agrees to use that one table, makes sure to manage transactions right, cron-jobs the retention, YOLOs the clustering, etc. etc.

Performance is probably last on the list of reasons to choose Kafka over Postgres.

j45 7 days ago|
You expose an API on top of Postgres, much like any other group of developers would, and call it a day.

There are several existing implementations of queues that increase the chance of finding what one is after: https://github.com/dhamaniasad/awesome-postgres

dagss 7 days ago||
There's a lot of client-side logic involved in managing read cursors and marking events as processed on the consumer side. Possibly also client-side error queues and so on.

I truly miss a good standard client-side library following the Kafka-in-SQL philosophy. I started on one at my previous job and we used it internally, but it never got good enough to be widely used elsewhere, and now I work somewhere else...

(PS: Talking about the pub/sub Kafka-like usecase, not the work queue FOR UPDATE usecase)
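The client-side bookkeeping described above (read cursors over an append-only log, advanced only after processing) can be sketched with stdlib SQLite. This is a hedged illustration of the pub/sub pattern the comment means, not any existing library; table, column, and consumer names are invented for the example.

```python
import sqlite3

# Append-only events table plus one cursor row per consumer, so multiple
# independent consumers can each read the full stream at their own pace.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        body TEXT NOT NULL
    );
    CREATE TABLE cursors (
        consumer TEXT PRIMARY KEY,
        last_id INTEGER NOT NULL DEFAULT 0
    );
""")

def publish(body):
    conn.execute("INSERT INTO events (body) VALUES (?)", (body,))
    conn.commit()

def read_batch(consumer, limit=10):
    """Return events past this consumer's cursor, without consuming them."""
    conn.execute(
        "INSERT OR IGNORE INTO cursors (consumer, last_id) VALUES (?, 0)",
        (consumer,))
    (last_id,) = conn.execute(
        "SELECT last_id FROM cursors WHERE consumer = ?",
        (consumer,)).fetchone()
    return conn.execute(
        "SELECT id, body FROM events WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit)).fetchall()

def commit_cursor(consumer, last_id):
    """Advance the cursor only after the batch was actually processed."""
    conn.execute(
        "UPDATE cursors SET last_id = ? WHERE consumer = ?",
        (last_id, consumer))
    conn.commit()

publish("e1"); publish("e2")
batch = read_batch("billing")
print([b for _, b in batch])                # ['e1', 'e2']
commit_cursor("billing", batch[-1][0])
print(read_batch("billing"))                # []
print([b for _, b in read_batch("audit")])  # ['e1', 'e2'] (own cursor)
```

Error queues, retries, and crash-safe "process exactly once per consumer" semantics are where this stops being a sketch and starts being the library the commenter wishes existed.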

bmcahren 7 days ago||
A huge benefit of single-database operations at scale is point-in-time recovery for the entire system thereby not having to coordinate recovery points between data stores. Alternatively, you can treat your queue as volatile depending on the purpose.
qsort 7 days ago||
I feel so seen lol. I work in data engineering and the first paragraph is me all the time. There are a lot of cool technologies (timeseries databases, vector databases, stuff like Synapse on Azure, "lakehouses" etc.) but they are mostly for edge cases.

I'm not saying they're useless, but if I see something like that lying around, it's more likely that someone put it there based on vibes rather than an actual engineering need. Postgres is good enough for OpenAI, chances are it's good enough for you.

ryandvm 7 days ago||
I think my only complaint about Kafka is the widespread misunderstanding that it is a suitable replacement for a work queue. I should not have to explain to an enterprise architect the distinction between a distributed work queue and an event streaming platform.
lisbbb 7 days ago||
It's not so much that they don't know as it is that they think Kafka is sexier, or, in my case, that it was mandated for everything because they were already paying for the cluster. I solved one problem, very flexibly, in Elastic and they weren't interested at all. It was Kafka or nothing. That's the reality in a lot of companies.
brikym 7 days ago||
If you don't mind Redis then use Redis Streams. It gives you an eventlog without worrying about postgres performance issues and has consumer groups.
tele_ski 7 days ago|
Been using Valkey streams recently and loving it. Took a bit to understand how to properly use it, but now that I've figured it out I'd highly recommend trying it. It's very easy to set up and get going, and it just works.
sc68cal 7 days ago|
> Postgres doesn’t seem to have any popular libraries for pub-sub9 use cases, so I had to write my own.

Ok so instead of running Kafka, we're going to spend development cycles building our own?

enether 7 days ago|
It would be nice if a library like pgmq got built. Not sure what the demand for that is, but it feels like there may be a niche