Posted by davidgu 2 days ago
ya, right. just make up some reason for not following the best practices
And that is assuming you have a solution for things like balancing, and routing to the correct shard.
did you post exactly the same comment some months ago?
> Most online resources chalk this up to connection churn, citing fork rates and the pid-per-backend yada, yada. This is all true but in my opinion misses the forest for the trees. The real bottleneck is the single-threaded main loop in the postmaster. Every operation requiring postmaster involvement is pulling from a fixed pool, the size of a single CPU core. A rudimentary experiment shows that we can linearly increase connection throughput by adding additional postmasters on the same host.
That said, proxies aren't perfect. https://jpcamara.com/2023/04/12/pgbouncer-is-useful.html outlines some dangers of using them (particularly when you might need session-level variables). My understanding is that PgDog does more tracking, which mitigates some of these issues, but others are fundamental to the model. They're not a drop-in component the way other "proxies" might be.
I believe they're just referring to having several completely-independent postgres instances on the same host.
In other words: say that postgres is maxing out at 2000 conns/sec. If the bottleneck actually was fork rate on the host, then having 2 independent copies of postgres on a host wouldn't improve the total number of connections per second that could be handled: each instance would max out at ~1000 conns/sec, since they're competing for process-spawning. But in reality that isn't the case, indicating that the fork rate isn't the bottleneck.
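A rough stand-in for that experiment (plain TCP, not Postgres — all names here are invented for illustration): treat each "postmaster" as a single-threaded accept loop. If one loop caps out at some N conns/sec and a second independent loop on the same host meaningfully raises total throughput, the per-loop serialization, not a host-wide limit like fork rate, is the bottleneck. This is only a sketch of the measurement's shape, not a claim about what numbers you'll see.

```python
import socket
import threading
import time

def make_listener():
    # Bind to an ephemeral port so independent "postmasters" never collide.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(128)
    srv.settimeout(0.05)  # so the loop can notice the stop flag
    return srv

def accept_loop(srv, stop):
    # Accept and immediately close each connection -- the analogue of the
    # postmaster's single-threaded main loop handing off to a backend.
    while not stop.is_set():
        try:
            conn, _ = srv.accept()
            conn.close()
        except socket.timeout:
            pass
    srv.close()

def hammer(port, duration, results):
    # One client per listener: open and close connections as fast as possible.
    deadline = time.monotonic() + duration
    n = 0
    while time.monotonic() < deadline:
        socket.create_connection(("127.0.0.1", port)).close()
        n += 1
    results.append(n)

def measure(n_loops, duration=1.0):
    # Total conns/sec handled by n_loops independent accept loops.
    stop, results = threading.Event(), []
    listeners = [make_listener() for _ in range(n_loops)]
    servers = [threading.Thread(target=accept_loop, args=(s, stop)) for s in listeners]
    clients = [threading.Thread(target=hammer, args=(s.getsockname()[1], duration, results))
               for s in listeners]
    for t in servers + clients:
        t.start()
    for t in clients:
        t.join()
    stop.set()
    for t in servers:
        t.join()
    return sum(results)

print("1 loop:", measure(1), "conns/sec;", "2 loops:", measure(2), "conns/sec")
```

The real version of this would run actual Postgres instances and drive them with something like pgbench, but the structure of the comparison is the same: same host, same client pressure, one instance vs. two.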
Since Postgres is a mature project, this is a non-trivial effort. See the Postgres wiki for some context: https://wiki.postgresql.org/wiki/Multithreading
But I'm hopeful that 2-3 years from now, we'll see this bear fruit. The recent asynchronous read I/O improvements in Postgres 18 show that Postgres can evolve; one just needs to be patient, potentially help contribute, and find workarounds (connection pooling, in this case).
They probably don't even need a database anyway for data that is likely write once, read many. You could store the JSON of the meeting in S3. It's not like people are going back in time and updating meeting records. It's more like a log file and logging systems and data structures should be enough here. You can then take that data and ingest it into a database later, or some kind of search system, vector database etc.
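A minimal sketch of that write-once, read-many idea, using a local append-only JSON-lines log as a stand-in for S3 objects (the field names and file path are made up for illustration):

```python
import json
from pathlib import Path

LOG = Path("meetings.jsonl")
LOG.unlink(missing_ok=True)  # fresh log for the demo

def record_meeting(meeting: dict) -> None:
    # Write once: append a record, like an S3 PUT of an immutable object.
    # Nothing is ever updated in place.
    with LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(meeting) + "\n")

def load_meetings() -> list:
    # Read many: scan the log. Later ingestion into a database, search
    # index, or vector store can replay this same file.
    with LOG.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

record_meeting({"id": "m1", "title": "standup", "transcript": "..."})
record_meeting({"id": "m2", "title": "retro", "transcript": "..."})
print(len(load_meetings()))  # -> 2
```

The point is the shape of the access pattern: appends only, no random updates, so a log (or object store) covers it without a database in the hot path.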
Database connections are designed this way on purpose; it's why connection pools exist. This design is suboptimal.
What you describe makes sense, of course, but few can build it without it being drastically worse than abusing a database like postgres. It's a sad state of affairs.
you can keep things synced across databases easily and keep it super duper simple.
If people are building things which actually require massive amounts of data stored in databases they should be able to charge accordingly.
They are cheap if you use a tiny fraction of a server for $20/mo or have 50 engineers working on the code
I would much rather spend 5k per month to make 1 million, keeping things extremely simple.