Posted by tanelpoder 3 days ago
I think there are solutions for that scale of data already, and simplicity is the best feature of DuckDB (at least for me).
This is a fair point, but I think there's a middle ground. DuckDB handles surprisingly large datasets on a single machine, but "surprisingly large" still has limits. If you're querying 10TB of parquet files across S3, even DuckDB needs help.
The question is whether Ray is the right distributed layer for this. Curious what the alternative would be—Spark feels like overkill, but rolling your own coordination is painful.
What you need is a multi-tenancy shared infrastructure that is elastic.
i think this is where spark shuffling comes in? but how does it work here.
https://duckdb.org/docs/stable/guides/performance/how_to_tun...
So what does this run on then?
No docs: it's not possible to find any deployment guides for running Ray on serverless platforms like Lambda, Cloud Functions, or even your own Firecracker VMs.
Instead, every other post mentions EKS or EC2.
The Ray team even expressly rejected Lambda support as far back as 2020 [0]. Uuuuuugh.
No thanks! shiver
I'd rather cut complexity for practically the same benefit and either do it on a single machine or have a thin, manageable layer on top of a truly serverless infra, like in this talk [1]: "Processing Trillions of Records at Okta with Mini Serverless Databases".