Posted by enether 7 days ago
Ease of use. In Ruby, if I want to use Kafka I can use Karafka, or Redis Streams via the redis library. Likewise, if Kafka is too complex to run, there are countless alternatives that work just as well - hell, even 0MQ with client libraries.
Now with the Postgres version I have to write my own stuff, and I might not know where that's gonna lead me.
Postgres is scalable, no one doubts that. But what people forget to mention is the ecosystem around certain tools.
There seem to be two planes of ease of use - the app layer (library) and the infra layer (hosting).
The app layer for Postgres is still in development, so if you currently want to run pub-sub (Kafka) on it, it will be extra work to develop that abstraction.
I hope somebody creates such a library. It's a one-time cost but then will make it easier for everybody.
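For what it's worth, Postgres already ships the bare pub-sub primitive; it's just not durable. A rough sketch with LISTEN/NOTIFY from Python (psycopg; the channel name and payload here are made up), which is roughly the piece such a library would wrap around a durable events table for replay:

    # Minimal LISTEN/NOTIFY sketch; producer and consumer would normally be
    # separate processes. NOTIFY is fire-and-forget, so this alone is not Kafka.
    import json
    import psycopg

    DSN = "dbname=app"  # placeholder connection string

    # Consumer process: subscribe to a channel and block on notifications.
    def consume():
        with psycopg.connect(DSN, autocommit=True) as conn:
            conn.execute("LISTEN orders")
            for note in conn.notifies():  # yields Notify objects as they arrive
                print(note.channel, json.loads(note.payload))

    # Producer process: publish a payload on the channel.
    def publish(payload):
        with psycopg.connect(DSN, autocommit=True) as conn:
            conn.execute("SELECT pg_notify('orders', %s)", (json.dumps(payload),))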
https://github.com/dhamaniasad/awesome-postgres
There is at least a Python example here.
It is significantly more work for the client-side implementation of event log consumers, which the article also talks about. For instance, persisting the client-side cursors. And I have not seen widely used standard implementations of those. (I started one myself once but didn't finish.)
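To make the cursor part concrete, here is a rough sketch of what persisting a client-side cursor looks like, assuming a made-up append-only events(id, payload) table and a consumer_cursors(consumer, last_id) table; this gives at-least-once processing, not exactly-once:

    # Hypothetical sketch: read a batch past the stored cursor, process it,
    # and advance the cursor in the same transaction, so a crash just re-reads.
    import psycopg

    def consume_batch(dsn, consumer, handle, batch_size=100):
        with psycopg.connect(dsn) as conn:
            row = conn.execute(
                "SELECT last_id FROM consumer_cursors WHERE consumer = %s FOR UPDATE",
                (consumer,),
            ).fetchone()
            last_id = row[0] if row else 0
            events = conn.execute(
                "SELECT id, payload FROM events WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, batch_size),
            ).fetchall()
            for event_id, payload in events:
                handle(event_id, payload)  # consumer-side processing
            if events:
                conn.execute(
                    "INSERT INTO consumer_cursors (consumer, last_id) VALUES (%s, %s) "
                    "ON CONFLICT (consumer) DO UPDATE SET last_id = EXCLUDED.last_id",
                    (consumer, events[-1][0]),
                )
            conn.commit()

One caveat: with concurrent producers, a row with a smaller id can commit after you've already read past it, so a real implementation needs more care than this.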
Postgres really is a startup's best friend most of the time. I'm building a new product that will deal with a good bit of reporting, and I had started looking at OLAP DBs for it, but hesitated to leave PG. This kind of seals it for me (and of course the reference to the classic "Just Use Postgres for Everything" post helps) that I should Just Use Postgres (R).
On top of being easy to host and something I'm already familiar with, the resources out there for something like PG are near endless. Plus the team working on it is doing constant good work to make it even more impressive.
I've been heads-down building a scheduling tool, and the number of times I've had to talk myself out of over-engineering is embarrassing. "Should I use Kafka for event streaming?" No. "Do I need microservices?" Probably not. "Can Postgres handle this?" Almost certainly yes.
The real skill is knowing when you've actually outgrown something vs. when you're just pattern-matching what Big Tech does. Most products never get to the scale where these distinctions matter—but they DO die from complexity-induced paralysis.
What's been your experience with that inflection point where you actually needed to graduate to more complex tooling? How did you know it was time?
This is the Achilles' heel of Kafka (and Pulsar): all streaming systems (or workarounds) with per-message-key acknowledgements incur O(n^2) costs in either computation, bandwidth, or storage for n messages [1].
[1] "What If We Could Rebuild Kafka from Scratch?" (top comment of the 220 comments)
Performance isn't a big deal for me. I had assumed that Kafka would give me things like decoupling, retry, dead-lettering, logging, schema validation, schema versioning, exactly once processing.
I like Postgres, and obviously I can write a queue on top of it, but it seems like quite a lot of effort?
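The core of it is smaller than it sounds, though. As a hedged sketch (table and column names invented), the usual FOR UPDATE SKIP LOCKED pattern lets many workers pull jobs safely, and because the ack shares a transaction with the processing, you get effectively exactly-once processing as long as the side effects live in the same database:

    # Sketch of a minimal Postgres job queue worker, assuming a table
    # jobs(id bigserial, payload jsonb, done boolean default false).
    import psycopg

    def claim_and_process(dsn, handle):
        # Claim one job; returns False when the queue is empty.
        with psycopg.connect(dsn) as conn:
            row = conn.execute(
                """
                SELECT id, payload FROM jobs
                WHERE NOT done
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """
            ).fetchone()
            if row is None:
                return False
            job_id, payload = row
            handle(payload)  # worker logic; runs inside the claiming transaction
            conn.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            conn.commit()
            return True

Retry, dead-lettering, schema validation and the rest are where the real effort goes, which is presumably the library gap people are pointing at.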
If you plan on retaining your topics indefinitely, schema evolution can become painful since you can't update existing records. Changing the number of partitions in a topic is also painful, and choosing the number initially is a difficult choice. You might want to build your own infrastructure for rewriting a topic and directing new writes to the new topic without duplication.
Kafka isn't really a replacement for a database or anything high-level like a ledger. It's really a replicated log, which is a low-level primitive that will take significant work to build into something else.
MQTT -> Redpanda (for message logs and replay, etc) -> Postgres/Timescaledb (for data) + S3 (for archive)
(and possibly Flink/RisingWave/Arroyo somewhere in order to do some alerting/incrementally updated materialized views/ etc)
this seems "simple enough" (but I don't have any experience with Redpanda) but is indeed one more moving part compared to MQTT -> Postgres (as a queue) -> Postgres/Timescaledb + S3
Questions:
1. My "fear" would be that if I use the same Postgres for the queue and for my business database, the "message ingestion" part could sometimes block the "business" part (locks, etc.)? Also, when I want to update the schema of my database without "stopping" the inflow of messages, I'm not sure that would be easy?
2. Also, since it would write messages to the queue and then delete them, there would be a lot of GC/vacuuming to do, compared to my business database which is mostly append-only?
3. And if I split the "Postgres queue" from the "Postgres database" as two different processes, I do have "one less tech to learn", but I still have to get used to pgmq, integrate it, etc. Is that really much easier than adding Redpanda?
4. I guess most Postgres queues are also "simple" and don't provide "fanout" for multiple things (e.g. I want to take one of my IoT messages, clean it up, store it in my TimescaleDB, and also archive it to S3, and also run an alert detector on it, etc.)
What would be the recommendation?
I need a durable queue but not indefinitely. Max a couple of hours.
What I want is Google PubSub but open source so I can self host.
For something larger, RabbitMQ can handle some pretty good workloads.
Note that Kafka has recently started investing in 'queues' with KIP-932, but they're still a long way off from implementing all of those features.
It can also be used in production. You do not have to build a distributed Pulsar cluster immediately. I have multiple projects running on a standalone Pulsar cluster, because it's easy to set up and requires almost no maintenance. Doing it that way makes compliance requirements for isolation simpler, with fewer fights. Everyone understands host/VM isolation; few understand Pulsar tenant isolation.
If you want a distributed Apache Pulsar cluster, be prepared to work for that. We run a cluster on bare metal. We considered Kubernetes, but performance was lacking. We are not Kubernetes experts.
The bigger question to ask is: will this storage engine be used to persist and retain data forever (like a database), or will it be used more for temporary transit of data from one spot to another?
Event-sourcing is when you buy something and get a receipt, and you stick it in a shoe-box for tax time.
A queue is you get given receipts, and you look at them in the correct order before throwing each one away.
I think my system is sort of both. I want to put some events in a queue for a finite set of time, process them as a single consolidated set, and then drop them all from the queue.
If you don't need a lot of perf but you place a premium on ergonomics and correctness, this sounds more like you want a workflow engine? https://github.com/meirwah/awesome-workflow-engines
I'm checking out the list.
It's mostly registering the Postgres database functions, which is a one-time cost.
There are also pre-made Postgres extensions that already run the queue.
These days I would consider starting with self-hosted Supabase, which has Postgres ready to tweak.
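pgmq is one of those: the extension creates the queue tables and exposes SQL functions, so from an app it looks roughly like this (Python via psycopg; the function signatures are from memory, so check the pgmq docs):

    # Rough sketch of driving the pgmq extension; queue name and payload are made up.
    import json
    import psycopg

    with psycopg.connect("dbname=app") as conn:  # placeholder DSN
        conn.execute("SELECT pgmq.create('iot_events')")  # one-time queue setup
        conn.execute(
            "SELECT * FROM pgmq.send('iot_events', %s::jsonb)",
            (json.dumps({"device": 42, "temp": 21.5}),),
        )
        # Read one message with a 30-second visibility timeout.
        msg = conn.execute(
            "SELECT msg_id, message FROM pgmq.read('iot_events', 30, 1)"
        ).fetchone()
        if msg:
            msg_id, message = msg
            # ...process, then delete (or pgmq.archive) to acknowledge.
            conn.execute("SELECT pgmq.delete('iot_events', %s)", (msg_id,))
        conn.commit()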
It seems to me that the hardest part of going for a MQ/distributed log like Kafka is re-working existing code to now handle the lack of ACID stuff. Things that are trivial with Postgres, like exactly once delivery, are huge undertakings without ACID.
Personally, I don't have much experience with this, so maybe I'm just missing something?
ACID is an aspirational ideal - not something that 'just works' if you have a database that calls itself ACID. What ACID promises is essentially "single-threaded thinking will work in a multi-threaded environment."
Here's a list of ways it falls short:
1) Settings: Postgres is 'Read Committed' by default (which is not quite full Isolation). You could change this, but you might not like the resulting performance drop (and it's probably not up to you unless you're the company DBA or something).
2) ACID=Single-node-only. Maybe some of the big players (Google?) have worked around this (Spanner?), but for your use case, the scope of a (correct) ACID transaction is essentially what you can stuff into a single SQL string and hand to a single database. It won't span to the frontend, or any partners you have, REST calls, etc. It's definitely useful to be able to make your single node transition all-or-nothing from valid state to valid state, but you still have all your distributed thinking to do (Two generals, CAP, exactly-once-delivery, idempotency, etc.) without help from ACID.
3) You can easily break ACID at the programming-language level. Let's say you intend to add 10 to a row. If you do a SELECT, add 10 to the result in your code, and then do an UPDATE, your transaction won't do what you intended under concurrency. If the value was 3 when you read it, all the database sees is you setting the value to 13. Under the default Read Committed level it will happily write 13 over a concurrent change (a classic lost update); under Serializable it may instead abort with a serialization error. Neither of those is 'just increment by 10'.
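A tiny sketch of that trap and the usual fix (table and column names invented):

    # Read-modify-write in application code vs. letting the database do the math.
    import psycopg

    def increment_lost_update(conn):
        # Broken under concurrency: two workers can both read 3 and both write 13,
        # silently losing one of the +10s (a lost update under Read Committed).
        val = conn.execute("SELECT value FROM counters WHERE id = 1").fetchone()[0]
        conn.execute("UPDATE counters SET value = %s WHERE id = 1", (val + 10,))
        conn.commit()

    def increment_atomic(conn):
        # Safe: a single statement, so concurrent increments serialize on the row lock.
        conn.execute("UPDATE counters SET value = value + 10 WHERE id = 1")
        conn.commit()

(SELECT ... FOR UPDATE, or running at Serializable and retrying, are the other common ways out.)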
The reason I use Kafka is because it actually helps with distributed systems. We can't beat CAP, but if we want to have AP, we can at least have some 'eventual consistency', that is, your services won't be in exact lockstep from valid-state to valid-state (as a group), but if you give them the same facts, then they can at least end up in the same state. And that's what Kafka's for: you append facts onto it, then each service (which may in fact have its own ACID DB!) can move from valid-state to valid-state (even if external observers can see that one service is ahead of another one).
On the major projects I worked on, we were "instructed" to use Kafka for, I guess, internal political reasons. They already had Hadoop solutions that more or less worked, but the code was written by idiots in "Spark/Scala" (their favorite buzzword to act all high and mighty) and that code had zero tests (it was truly a "test in prod" situation there). The Hadoop system was managed by people who would parcel out compute resources politically, as in, their friends got all they wanted while everyone else got basically none. This was a major S&P company, Fortune 10, and the internal politics were abusive to say the least.
First, I frequently use Celery and Celery doesn't support using Postgres as a broker. It seems like it should, but I guess no one has stepped up to write that. So, when I use Celery, I end up also using Redis or RabbitMQ.
Second, if I need mqtt clients coming in from the internet at large, I don't feel comfortable exposing Postgres to that. Also, I'd rather use the mqtt ecosystem of libraries rather than having all of those devices talk Postgres directly.
Third, sometimes I want a size-constrained, memory-only database, or a database that automatically expires untouched records, and for either of those I usually reach for Redis. I imagine it would be worth making a reusable set of stored procedures to accomplish the auto-expiring of unused records, but I haven't implemented it. And I have no idea how to make Postgres be memory-only with a constrained memory size.
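For the auto-expiry half (not the memory-only or size-capped part), a hedged sketch of what those stored procedures could boil down to, with made-up table and column names; in practice the cleanup would run from pg_cron or any scheduler:

    # Assumed table: cache(key text primary key, value jsonb,
    #                       last_accessed timestamptz default now())
    # Touch rows on access, then periodically delete anything untouched.
    import psycopg

    def get(conn, key):
        # Touch the row on read so "untouched" really means unread.
        row = conn.execute(
            "UPDATE cache SET last_accessed = now() WHERE key = %s RETURNING value",
            (key,),
        ).fetchone()
        conn.commit()
        return row[0] if row else None

    def expire(conn, max_idle="1 hour"):
        # The periodic cleanup job.
        conn.execute(
            "DELETE FROM cache WHERE last_accessed < now() - %s::interval",
            (max_idle,),
        )
        conn.commit()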