Posted by enether 7 days ago
Ease of use. In Ruby, if I want to use Kafka I can use Karafka, or Redis Streams via the redis library. Likewise, if Kafka is too complex to run, there are countless alternatives that work just as well - hell, even 0MQ with client libraries.
Now with the Postgres version I have to write my own stuff, and I might not know where that's gonna lead me.
Postgres is scalable, no one doubts that. But what people forget to mention is the ecosystem around certain tools.
There seem to be two planes of ease of use - the app layer (library) and the infra layer (hosting).
The app layer for Postgres is still in development, so if you currently want to run pub-sub (Kafka) on it, it will be extra work to develop that abstraction.
I hope somebody creates such a library. It's a one-time cost but then will make it easier for everybody.
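For what it's worth, Postgres already ships the bare pub-sub primitive; it's just not durable. A rough sketch with LISTEN/NOTIFY from Python (psycopg; the channel name and payload here are made up), which is roughly the piece such a library would wrap around a durable events table for replay:

    # Minimal LISTEN/NOTIFY sketch; producer and consumer would normally be
    # separate processes. NOTIFY is fire-and-forget, so this alone is not Kafka.
    import json
    import psycopg

    DSN = "dbname=app"  # placeholder connection string

    # Consumer process: subscribe to a channel and block on notifications.
    def consume():
        with psycopg.connect(DSN, autocommit=True) as conn:
            conn.execute("LISTEN orders")
            for note in conn.notifies():  # yields Notify objects as they arrive
                print(note.channel, json.loads(note.payload))

    # Producer process: publish a payload on the channel.
    def publish(payload):
        with psycopg.connect(DSN, autocommit=True) as conn:
            conn.execute("SELECT pg_notify('orders', %s)", (json.dumps(payload),))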
https://github.com/dhamaniasad/awesome-postgres
There is at least a Python example here.
It is significantly more work for the client-side implementation of event log consumers, which the article also talks about. For instance, persisting the client-side cursors. And I have not seen widely used standard implementations of those. (I started one myself once but didn't finish.)
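To make the cursor part concrete, here is a rough sketch of what persisting a client-side cursor looks like, assuming a made-up append-only events(id, payload) table and a consumer_cursors(consumer, last_id) table; this gives at-least-once processing, not exactly-once:

    # Hypothetical sketch: read a batch past the stored cursor, process it,
    # and advance the cursor in the same transaction, so a crash just re-reads.
    import psycopg

    def consume_batch(dsn, consumer, handle, batch_size=100):
        with psycopg.connect(dsn) as conn:
            row = conn.execute(
                "SELECT last_id FROM consumer_cursors WHERE consumer = %s FOR UPDATE",
                (consumer,),
            ).fetchone()
            last_id = row[0] if row else 0
            events = conn.execute(
                "SELECT id, payload FROM events WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, batch_size),
            ).fetchall()
            for event_id, payload in events:
                handle(event_id, payload)  # consumer-side processing
            if events:
                conn.execute(
                    "INSERT INTO consumer_cursors (consumer, last_id) VALUES (%s, %s) "
                    "ON CONFLICT (consumer) DO UPDATE SET last_id = EXCLUDED.last_id",
                    (consumer, events[-1][0]),
                )
            conn.commit()

One caveat: with concurrent producers, a row with a smaller id can commit after you've already read past it, so a real implementation needs more care than this.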
Postgres really is a startup's best friend most of the time. I'm building a new product that will deal with a good bit of reporting, and I had started looking at OLAP DBs for it, but hesitated to leave PG. This kind of seals it for me (and of course the reference to the classic "Just Use Postgres for Everything" post helps) that I should Just Use Postgres (R).
On top of being easy to host and something I'm already familiar with, the resources out there for something like PG are near endless. Plus the team working on it is doing constant good work to make it even more impressive.
I've been heads-down building a scheduling tool, and the number of times I've had to talk myself out of over-engineering is embarrassing. "Should I use Kafka for event streaming?" No. "Do I need microservices?" Probably not. "Can Postgres handle this?" Almost certainly yes.
The real skill is knowing when you've actually outgrown something vs. when you're just pattern-matching what Big Tech does. Most products never get to the scale where these distinctions matter—but they DO die from complexity-induced paralysis.
What's been your experience with that inflection point where you actually needed to graduate to more complex tooling? How did you know it was time?
This is the Achilles' heel of Kafka (and Pulsar): all streaming systems (or workarounds) with per-message-key acknowledgements incur O(n^2) costs in either computation, bandwidth, or storage for n messages [1].
[1] "What If We Could Rebuild Kafka from Scratch?" (top comment of the 220 comments)
Performance isn't a big deal for me. I had assumed that Kafka would give me things like decoupling, retry, dead-lettering, logging, schema validation, schema versioning, exactly once processing.
I like Postgres, and obviously I can write a queue on top of it, but it seems like quite a lot of effort?
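The core of it is smaller than it sounds, though. As a hedged sketch (table and column names invented), the usual FOR UPDATE SKIP LOCKED pattern lets many workers pull jobs safely, and because the ack shares a transaction with the processing, you get effectively exactly-once processing as long as the side effects live in the same database:

    # Sketch of a minimal Postgres job queue worker, assuming a table
    # jobs(id bigserial, payload jsonb, done boolean default false).
    import psycopg

    def claim_and_process(dsn, handle):
        # Claim one job; returns False when the queue is empty.
        with psycopg.connect(dsn) as conn:
            row = conn.execute(
                """
                SELECT id, payload FROM jobs
                WHERE NOT done
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """
            ).fetchone()
            if row is None:
                return False
            job_id, payload = row
            handle(payload)  # worker logic; runs inside the claiming transaction
            conn.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            conn.commit()
            return True

Retry, dead-lettering, schema validation and the rest are where the real effort goes, which is presumably the library gap people are pointing at.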
If you plan on retaining your topics indefinitely, schema evolution can become painful since you can't update existing records. Changing the number of partitions in a topic is also painful, and choosing the number initially is a difficult choice. You might want to build your own infrastructure for rewriting a topic and directing new writes to the new topic without duplication.
Kafka isn't really a replacement for a database or anything high-level like a ledger. It's really a replicated log, which is a low-level primitive that will take significant work to build into something else.
MQTT -> Redpanda (for message logs and replay, etc) -> Postgres/Timescaledb (for data) + S3 (for archive)
(and possibly Flink/RisingWave/Arroyo somewhere in order to do some alerting/incrementally updated materialized views/ etc)
this seems "simple enough" (but I don't have any experience with Redpanda) but is indeed one more moving part compared to MQTT -> Postgres (as a queue) -> Postgres/Timescaledb + S3
Questions:
1. My "fear" would be that if I use the same Postgres for the queue and for my business database, the "message ingestion" part could sometimes block the "business" part (locks, etc.)? Also, when I want to update the schema of my database without "stopping" the inflow of messages, I'm not sure that would be easy?
2. Also, since it would write messages to the queue and then delete them, there would be a lot of GC/vacuuming to do, compared to my business database which is mostly append-only?
3. And if I split the "Postgres queue" from the "Postgres database" as two different processes, I do have "one less tech to learn", but I still have to get used to pgmq, integrate it, etc. Is that really much easier than adding Redpanda?
4. I guess most Postgres queues are also "simple" and don't provide "fanout" for multiple things (e.g. I want to take one of my IoT messages, clean it up, store it in my TimescaleDB, and also archive it to S3, and also run an alert detector on it, etc.)
What would be the recommendation?
I need a durable queue but not indefinitely. Max a couple of hours.
What I want is Google PubSub but open source so I can self host.
For something larger, RabbitMQ can handle some pretty good workloads.
Note that Kafka has recently started investing in 'queues' with KIP-932, but they're still a long way off from implementing all of those features.
It can also be used in production. You do not have to build a distributed Pulsar cluster immediately. I have multiple projects running on a standalone Pulsar cluster, because it's easy to set up and requires almost no maintenance. Doing it that way makes compliance requirements for isolation simpler, with fewer fights. Everyone understands host/VM isolation; few understand Pulsar tenant isolation.
If you want a distributed Apache Pulsar cluster, be prepared to work for that. We run a cluster on bare metal. We considered Kubernetes, but performance was lacking. We are not Kubernetes experts.
The bigger question to ask is: will this storage engine be used to persist and retain data forever (like a database), or will it be used more for temporary transit of data from one spot to another?
Event-sourcing is when you buy something and get a receipt, and you stick it in a shoe-box for tax time.
A queue is you get given receipts, and you look at them in the correct order before throwing each one away.
I think my system is sort of both. I want to put some events in a queue for a finite set of time, process them as a single consolidated set, and then drop them all from the queue.
If you don't need a lot of perf but you place a premium on ergonomics and correctness, this sounds more like you want a workflow engine? https://github.com/meirwah/awesome-workflow-engines
I'm checking out the list.
It's mostly registering the Postgres database functions, which is a one-time cost.
There are also pre-made Postgres extensions that already run the queue.
These days I would consider starting with self-hosted Supabase, which has Postgres ready to tweak.
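pgmq is one of those: the extension creates the queue tables and exposes SQL functions, so from an app it looks roughly like this (Python via psycopg; the function signatures are from memory, so check the pgmq docs):

    # Rough sketch of driving the pgmq extension; queue name and payload are made up.
    import json
    import psycopg

    with psycopg.connect("dbname=app") as conn:  # placeholder DSN
        conn.execute("SELECT pgmq.create('iot_events')")  # one-time queue setup
        conn.execute(
            "SELECT * FROM pgmq.send('iot_events', %s::jsonb)",
            (json.dumps({"device": 42, "temp": 21.5}),),
        )
        # Read one message with a 30-second visibility timeout.
        msg = conn.execute(
            "SELECT msg_id, message FROM pgmq.read('iot_events', 30, 1)"
        ).fetchone()
        if msg:
            msg_id, message = msg
            # ...process, then delete (or pgmq.archive) to acknowledge.
            conn.execute("SELECT pgmq.delete('iot_events', %s)", (msg_id,))
        conn.commit()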
It seems to me that the hardest part of going for a MQ/distributed log like Kafka is re-working existing code to now handle the lack of ACID stuff. Things that are trivial with Postgres, like exactly once delivery, are huge undertakings without ACID.
Personally, I don't have much experience with this, so maybe I'm just missing something?
ACID is an aspirational ideal - not something that 'just works' if you have a database that calls itself ACID. What ACID promises is essentially "single-threaded thinking will work in a multi-threaded environment."
Here's a list of ways it falls short:
1) Settings: Postgres is 'Read Committed' by default (which is not quite full Isolation). You could change this, but you might not like the resulting performance drop (and it's probably not up to you unless you're the company DBA or something).
2) ACID=Single-node-only. Maybe some of the big players (Google?) have worked around this (Spanner?), but for your use case, the scope of a (correct) ACID transaction is essentially what you can stuff into a single SQL string and hand to a single database. It won't span to the frontend, or any partners you have, REST calls, etc. It's definitely useful to be able to make your single node transition all-or-nothing from valid state to valid state, but you still have all your distributed thinking to do (Two generals, CAP, exactly-once-delivery, idempotency, etc.) without help from ACID.
3) You can easily break ACID at the programming-language level. Let's say you intend to add 10 to a row. If you do a SELECT, add 10 to the result in your code, and then do an UPDATE, your transaction won't do what you intended under concurrency. If the value was 3 when you read it, all the database sees is you setting the value to 13. Under the default Read Committed level it will happily write 13 over a concurrent change (a classic lost update); under Serializable it may instead abort with a serialization error. Neither of those is 'just increment by 10'.
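A tiny sketch of that trap and the usual fix (table and column names invented):

    # Read-modify-write in application code vs. letting the database do the math.
    import psycopg

    def increment_lost_update(conn):
        # Broken under concurrency: two workers can both read 3 and both write 13,
        # silently losing one of the +10s (a lost update under Read Committed).
        val = conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0]
        conn.execute("UPDATE counters SET value = %s WHERE id = 1", (val + 10,))
        conn.commit()

    def increment_atomic(conn):
        # Safe: a single statement, so concurrent increments serialize on the row lock.
        conn.execute("UPDATE counters SET value = value + 10 WHERE id = 1")
        conn.commit()

(SELECT ... FOR UPDATE, or running at Serializable and retrying, are the other common ways out.)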
The reason I use Kafka is because it actually helps with distributed systems. We can't beat CAP, but if we want to have AP, we can at least have some 'eventual consistency', that is, your services won't be in exact lockstep from valid-state to valid-state (as a group), but if you give them the same facts, then they can at least end up in the same state. And that's what Kafka's for: you append facts onto it, then each service (which may in fact have its own ACID DB!) can move from valid-state to valid-state (even if external observers can see that one service is ahead of another one).
On the major projects I worked on, we were "instructed" to use Kafka for, I guess, internal political reasons. They already had Hadoop solutions that more or less worked, but the code was written by idiots in "Spark/Scala" (their favorite buzzword to act all high and mighty) and that code had zero tests (it was truly a "test in prod" situation there). The Hadoop system was managed by people who would parcel out compute resources politically, as in, their friends got all they wanted while everyone else got basically none. This was a major S&P company, Fortune 10, and the internal politics were abusive to say the least.
First, I frequently use Celery and Celery doesn't support using Postgres as a broker. It seems like it should, but I guess no one has stepped up to write that. So, when I use Celery, I end up also using Redis or RabbitMQ.
Second, if I need mqtt clients coming in from the internet at large, I don't feel comfortable exposing Postgres to that. Also, I'd rather use the mqtt ecosystem of libraries rather than having all of those devices talk Postgres directly.
Third, sometimes I want a size-constrained, memory-only database, or a database that automatically expires untouched records, and for either of those I usually reach for Redis. I imagine it would be worth making a reusable set of stored procedures to accomplish the auto-expiring of unused records, but I haven't implemented it. And I have no idea how to make Postgres be memory-only with a constrained memory size.
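For the auto-expiry half (not the memory-only or size-capped part), a hedged sketch of what those stored procedures could boil down to, with made-up table and column names; in practice the cleanup would run from pg_cron or any scheduler:

    # Assumed table: cache(key text primary key, value jsonb,
    #                       last_accessed timestamptz default now())
    # Touch rows on access, then periodically delete anything untouched.
    import psycopg

    def get(conn, key):
        # Touch the row on read so "untouched" really means unread.
        row = conn.execute(
            "UPDATE cache SET last_accessed = now() WHERE key = %s RETURNING value",
            (key,),
        ).fetchone()
        conn.commit()
        return row[0] if row else None

    def expire(conn, max_idle="1 hour"):
        # The periodic cleanup job.
        conn.execute(
            "DELETE FROM cache WHERE last_accessed < now() - %s::interval",
            (max_idle,),
        )
        conn.commit()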