Just Use Postgres for Durable Workflows

Posted by KraftyOne 1 hour ago

Just Use Postgres for Durable Workflows(www.dbos.dev)

137 points | 50 comments

buremba 50 minutes ago|

All you need is Postgres until you scale into TBs of data. We use Postgresql as a durable workflow engine, vector search, time-series data, BM25 search, OLTP/OLAP engine, and a queue. It's basically the only dependency we have for https://lobu.ai

The main benefit is centralizing all the data in one place so we don't need to worry about copying data in between multiple systems. Once something becomes the bottleneck, you can eventually migrate to a purpose specific tool to scale out.To be honest, LISTEN/NOTIFY in my opinion is the most fragile part of PG but it's fine as start until you scale out.

tibbon 1 minute ago||

But when you hit that wall, it is hard to stop and convince people to use different patterns and systems. I've seen so many tables go from "it will only be a few thousand rows" to suddenly several TB and then people are looking confused when performance and db admin tasks get really difficult.

I'm working at a scale where almost every day I have to ask people "are you use you need to treat that as relational data? It doesn't seem relational"

throwaway7783 39 minutes ago|||

I'm in the same camp. Do you use any specific extensions? Especially for OLAP and time series (partitioned tables + related extensions work fine, but curious if you use anything else)

buremba 24 minutes ago||

The native extensions are fine but I don't have good experience with any third party extensions, so far tried Timescale, pg_lake, citus, and pgvectorscale. They look very appealing but it's usually a trap as you can't get the value without using the vendor's cloud offerings.

I think if you grow enough to look for these extensions, it's usually better to bet on purpose-specific tooling. For example, I use DuckDB/Iceberg combination extensively for columnar data and connect DuckDB to PG when I need it.

hmaxdml 43 minutes ago|||

Listen/notify is poised to become much better in PG 18 and 19

stuartaxelowen 34 minutes ago||

Why’s that?

TkTech 12 minutes ago||

In pg19 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... will land, which significantly improves NOTIFY performance. Right now LISTEN/NOTIFY doesn't scale to very busy instances because a `NOTIFY` within a transaction takes a global lock.

ivanr 10 minutes ago||

More context: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...

pphysch 39 minutes ago||

I don't see logs mentioned. I agree with most those applications but would keep my OLAP stuff (metrics, logs, traces) in a separate store like VictoriaMetrics, both for capacity and read activity.

buremba 29 minutes ago||

Yeah I have logs in Sentry, which also uses Postgresql.

pragma_x 3 minutes ago||

I completely get the concept and agree - this is great way to build this kind of durability in a workflow system.

That said, my gamer-brain wants to call this "Save-scumming at scale." Which is to say, a lot of people already know that this approach works, but maybe they haven't made the connection to abstract CS stuff.

Another strategy that can be used to build robustness is to build your workflow out of idempotent operations. That can be useful for situations where the workflow state is too large to back up. Instead, you just run the job from the top and it's a bunch of no-ops until you start making progress again.

llimllib 53 minutes ago||

Armin Ronacher's `absurd` is an implementation of durable workflows for postgres:

https://lucumr.pocoo.org/2025/11/3/absurd-workflows/

https://github.com/earendil-works/absurd

https://earendil-works.github.io/absurd/

I've not used it, but it's worth comparing to other options

vrm 2 minutes ago|

If you don't need a ton of throughput I think `absurd` (and our Rust derivative `durable`) are very nice options that keep the client side extremely simple. It's also lightweight enough that a coding agent can keep the entire thing in its head easily and just run queries to look up state as needed.

halamadrid 5 minutes ago||

We work on disk log based architecture for workflows at Unmeshed (https://unmeshed.io/) which helps it to scale at a fraction of the cost of traditional workflow systems that are based on expensive databases.

Postgres is not cheap to run in the cloud at scale. We went for the cheapest infra, which is basically the disk storage.

stuartaxelowen 28 minutes ago||

My dream is, instead of separating data storage, state machines, valid state constraints, and the logic that transitions between valid states, we can actually unify these into some kernel of app state. Honestly, Postgres already has a lot of these capabilities, but I don’t see an obvious story on the app or product level, providing provably correct sets of states that apps can transition between, and which they can automatically expose to clients in informative ways (this user can like this post, but not edit). It looks colored Petri net shaped to me, but I don’t yet see a simple app state paradigm in the same way that the database has obvious successful boundaries.

opiniateddev 1 hour ago||

Conductor OSS does this quite well https://docs.conductor-oss.org/devguide/ai/index.html

https://github.com/agentspan-ai/agentspan which is essentially an agentic SDK layer for Conductor can convert any of your langgraph, openAI, vercel, or ADK agent and makes it durable and adds orchestration with no code changes.

throwaw12 1 hour ago||

Curious to know experience of people using DBOS and Temporal.

I have used Temporal in the past, works really good, my only problem with it was some limits on request payload or event sizes, created some inconveniences to us when building solutions. It also enforces good engineering practices, but sometimes you don't want to write special logic if your CSV file is larger than 2Mb, upload it to S3, pass link, then download it in the workflow.

What is your experience with DBOS? How does it compare to Temporal in terms of operational complexity, feature parity and anything else

pants2 37 minutes ago||

I thought Temporal was overly complex, but as you said the best part is it does enforce good engineering practices.

Then I tried their Cloud offering and was appalled at their pricing. I burned through the $1,000 free credits before I even got something to production. Didn't want to bother with running a local Temporal, either.

Best solution is to just take inspiration from their architecture and then do it yourself in Postgres, IMO.

switchbak 1 hour ago|||

They've just released an external storage approach to solve the large payload issue. I don't 100% love it (it's bolted on, not an intrinsic part), and it's an early release right now - but you can consider this effectively solved for now.

hilariously 1 hour ago||

That's good because back in the day if you were putting entire documents in a message queue I would laugh people out the door, putting something in object storage + linking is much more useful (though the distributed system part/backup current state part can be annoying!)

quard8 1 hour ago|||

we're using dbos for ai gen workflows and processing video files. understanding how to migrate from celery took time, but for our case it was worth it.

temporal_thr123 1 hour ago||

I run a large on-prem temporal setup - throwaway acct as they will likely out me.

Temporal is, in my opinion having run it in prod for over a year - poorly designed, slow and ridicliously heavy infra wise.

If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day, you're going to spend millions on infra, and it's still going to absolutely suck.

Try running their own benchmarks, the numbers are pathetic.

Their sales team is also absolutely appalling and desperate.

From a Developer standpoint, the SDK is quite nice though.

Don't get trapped into nexus, and if the sales team call you make sure legal is in the room.

temporal_thr123 55 minutes ago|||

Since I'm in a ranting mode -- here's a good example: you're limited to _ONE_ IO per shard in the history service:

https://github.com/temporalio/temporal/blob/e22e6304b3c4a409...

Temporal does a crazy amount of database operations and all of these are behind that mutex.

Oh, and you can't change the shard count on existing clusters.

Great stuff.

dakiol 38 minutes ago|||

Agree. Have worked in a codebase using Temporal, and is pretty much a nightmare. I don't know about the infra side, but from the developer side, all the abstractions they bring to the table are poorly designed. Wouldn't recommend

rafael-lua 5 minutes ago||

The "everything can be done in Postgres" crowd is crazy. It is like a religion at this point.

sgt 1 hour ago||

Continuously amazed by what you can do with few tools, as long as Postgres is a part of your toolkit.

I recently developed a distributed queue and it works really great - benchmarks great too, with no race conditions or conflicts. I used SKIP LOCKED so that workers can compete safely.

You can also have multiple workers across nodes avoid conflict by using session wide mutexes i.e. pg advisory lock.

bootsmann 22 minutes ago|

Advisory locks are preferred for this anyways because holding a lot of SELECT FOR UPDATE doesn’t scale too well.

Edit: Actually I checked this again and apparently the advice has now changed to the inverse.

vrm 1 hour ago|

Since DBOS doesn't support Rust, we implemented a very minimal Rust version of this at https://github.com/tensorzero/durable. It has been quite stable and extensible but of course you need to be very careful with the SQL implementations. Hope this is interesting to readers here.

More comments...