DuckDB Internals: Why Is DuckDB Fast? (Part 1)

Posted by marklit 3 days ago

DuckDB Internals: Why Is DuckDB Fast? (Part 1) (www.greybeam.ai)

358 points | 115 comments | page 2

snissn 13 hours ago|

I'm just curious - is duckdb too slow for people? This benchmark from clickhouse shows it being fairly slow compared to some options: https://jsonbench.com/

conradkay 10 hours ago|

That's for their `JSON` data types. In DuckDB it's just a string, meaning lots of queries will have to do JSON parsing on every row, but the inserts are very fast. Definitely a bit of a footgun when you actually just need STRUCT or MAP.

There's a talk about ClickHouse's approach from its creator: https://www.youtube.com/watch?v=xHj9mysh0GI, but the gist is that it maintains (sub)columns to store different paths in the JSON.

In other ways DuckDB has very good JSON support, like you can do `CREATE TABLE name AS SELECT * FROM 'data.json';` and it'll infer the schema when possible.
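
To make the parse-on-every-row cost concrete, here is a minimal plain-Python sketch. It is not DuckDB or ClickHouse code; the sample rows and the function names are made up for illustration. It contrasts re-parsing JSON text on each query with extracting one path into its own column once, at load time.

```python
import json

# Hypothetical events, stored the way a plain string column would hold them.
raw_rows = [
    '{"user": "a", "bytes": 120}',
    '{"user": "b", "bytes": 300}',
    '{"user": "a", "bytes": 80}',
]


def total_bytes_from_strings(rows):
    """Parse every row's JSON text each time the query runs."""
    return sum(json.loads(row)["bytes"] for row in rows)


def shred(rows, path):
    """Parse once at load time and keep one JSON path as its own column."""
    return [json.loads(row)[path] for row in rows]


def total_bytes_from_column(column):
    """Later queries read the extracted column without touching JSON."""
    return sum(column)


bytes_column = shred(raw_rows, "bytes")  # paid once, when the data is loaded
print(total_bytes_from_strings(raw_rows), total_bytes_from_column(bytes_column))
```

Both calls return the same total; the difference is that the first one pays the parsing cost on every query, while the second pays it once.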

hilariously 7 hours ago||

Not just that, you can SELECT * FROM read_json('folder/*') and read an entire schematically consistent folder of json files, or parquet files, or basically anything it supports with its various functions. Duckdb is insanely useful.

mcv 10 hours ago||

Is everything becoming columnar? Parquet stores data per column instead of per row because it improves compression. I get that. Arrow apparently is columnar, and now DuckDB also gets its efficiency by treating data as columns instead of rows?

I still need to wrap my head around how that works, but it's a fascinating development.

levanten 9 hours ago||

It depends on your task. In analytics, where you need to scan lots of data points within a few columns, columnar storage is very much the best. But for transactional workloads where you have to deal with specific entities, row-based would be more advantageous. There are hybrid systems that try to be both at the same time, but in my experience they end up not doing either very well.

brandmeyer 6 hours ago|||

Some day we'll get CREATE TABLE ... ( ... STORAGE ORDER COLUMN MAJOR) to have our transactional cake on the tables that need it and eat our analytics cake on the tables that need that.

But until then, separate tools for separate purposes isn't a bad place to be when those tools are both fantastic.

hilariously 7 hours ago|||

Often used to be referred to as HTAP, and yeah, in most data engineering it's moving things from OLTP to OLAP forms, and OLAP pretty much always benefits from columnar compression for aggregations and rollups.
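
One way to see why compression and aggregation go together on a column is run-length encoding. This is a minimal plain-Python sketch, not how DuckDB or Parquet actually encode data; the `unit_price` column and the helper names are made up for illustration. A column with repeated values collapses into a few runs, and a sum can be computed from the runs without decompressing.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_sum(runs):
    """Aggregate directly on the compressed form: one multiply per run."""
    return sum(value * length for value, length in runs)


# Hypothetical column: 12 values, but only three runs.
unit_price = [5] * 6 + [7] * 2 + [5] * 4
runs = rle_encode(unit_price)
print(runs, rle_sum(runs))
```

In a row layout the same prices would be interleaved with the other fields of each record, so runs like these would not be adjacent in storage.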

charlieflowers 1 hour ago|||

BTW, columnar is very similar to struct of arrays (SOA) and some of the reasons it works well overlap with SOA.
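
A minimal sketch of that comparison in plain Python, using made-up data and helper names: the same three records as an array of structs (row-oriented) and as a struct of arrays (column-oriented).

```python
from array import array

# Row-oriented ("array of structs"): each record keeps all its fields together.
rows = [
    {"user_id": 1, "country": "NL", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "NL", "amount": 4.5},
]

# Column-oriented ("struct of arrays"): one contiguous sequence per field.
columns = {
    "user_id": array("q", [1, 2, 3]),
    "country": ["NL", "US", "NL"],
    "amount": array("d", [10.0, 25.5, 4.5]),
}


def total_amount_rows(rows):
    """Visits every record even though only one field is needed."""
    return sum(row["amount"] for row in rows)


def total_amount_columns(columns):
    """Reads a single contiguous array and ignores the other fields."""
    return sum(columns["amount"])


def fetch_row(columns, i):
    """Rebuilding one entity touches every column: the transactional cost."""
    return {name: values[i] for name, values in columns.items()}


print(total_amount_rows(rows), total_amount_columns(columns), fetch_row(columns, 1))
```

The aggregate is cheaper in the columnar form, while fetching a single entity is cheaper in the row form, which is the analytics versus transactional trade-off discussed above.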

skeeter2020 2 hours ago|||

Compression is a side effect but not really the goal. To simplify, analytical queries often filter on a specific column value, and if these are laid out contiguously it makes disk-level reads much faster than rows that would involve read-skip-read-etc. In transactional systems data is typically written as rows though, so that's likely slower in a columnar system. As a general rule, read-heavy workflows with known access patterns are going to benefit from a columnar layout.

squirrellous 5 hours ago||

Those three things you mentioned kind of live in the same niche - offline data storage and querying. In that world yes everything has become columnar since it’s just better. Row-oriented is still the solution for online streaming use cases.

tdhz77 3 hours ago||

Is duckdb multi region active active?

bunsenhoneydew 11 hours ago||

DuckDB is a fantastic piece of tech. One of the best, if not the best, I’ve found in several years.

Panzerschrek 13 hours ago||

If DuckDB is so fast and has no data transfer overheads, does it need all this typical SQL machinery with filtering and joining via SELECT queries? Wouldn't it be simpler and faster to return all data to the caller code (all table rows, but only requested columns) and let it perform all other necessary data processing logic?

jauco 13 hours ago||

You’d end up implementing your own homegrown versions of hash join and query pushdown (skipping Parquet row groups entirely), etc., plus your own homegrown heuristics for selecting the right one (planning).

That can outperform a generic solution like this, of course, but for most cases it’s not less work to make it faster.

Also duckdb can give you access to an in memory representation (e.g. `fetch_arrow_table()`) so you have less “language data structure wrapping” overhead. And you can do filtering yourself on that. In most cases the “where” statements will win though.
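
As a rough idea of what "skipping row groups" means, here is a minimal plain-Python sketch. `RowGroup` is a hypothetical stand-in for a Parquet row group with min/max statistics, not a real Parquet or DuckDB API. A filter can rule out a whole group from its statistics without reading the values.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group plus its min/max statistics."""
    min_value: int
    max_value: int
    values: list


def make_group(values):
    return RowGroup(min(values), max(values), values)


def scan_greater_than(groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the group
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [make_group([1, 4, 7]), make_group([10, 12, 15]), make_group([20, 22, 30])]
matches, groups_read = scan_greater_than(groups, 14)
print(matches, groups_read)
```

Returning every row to the caller and filtering there gives up this kind of skipping, since the filter is no longer known at read time.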

efromvt 8 hours ago||

The SELECT machinery is the product with databases! SQL is often the shortest description of the processing logic, and the database has an efficient local execution engine that can prune/reduce data read based on the plan. Very hard to match in app, especially when joins get involved.
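
For a sense of what you would be reimplementing in app code, here is a minimal sketch of an inner hash join in plain Python. It is a textbook version with made-up table names and data, not DuckDB's implementation, and it ignores everything a real engine handles (memory limits, choosing the build side, other join types).

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Inner equi-join: hash one input on the key, then probe with the other."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    joined = []
    for row in probe_rows:
        # .get avoids inserting empty buckets for keys that have no match.
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 12},
    {"user_id": 3, "total": 99},
]
result = hash_join(users, orders, "user_id")
print(result)
```

Deciding which side to build on, or whether a different join algorithm fits better, is the planning part that the SQL version gets for free.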

pknerd 13 hours ago||

FTA:

> ..In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars

Does it mean it can perform all statistical computations as well if I want to use it for algo trading?

sigbottle 5 hours ago||

What the fuck.

I've never been that strong of an engineer. TIL that at one of my internships I was building DuckDB but for the company's private use cases. Well, trying to anyways. I didn't really get the whole picture, the pieces did not fit into place.

Didn't get the return offer obviously, probably because I didn't make the connection (or really a coherent narrative of what I was building). RIP. You live and you learn, I guess.

thefourthchime 14 hours ago||

I’m a huge fan, and I’ve been wanting to get to know the internals. Looking forward to digging in.

codingbear 14 hours ago||

duckdb is so nice coupled with claude code. Its extensive file support and some very interesting decisions on locally caching data (like from S3 or snowflake) make it easy to slice and dice almost any kind of tabular data.

blackoil 13 hours ago|

> duckdb is so nice coupled with claude code

Can you expand upon it? Do you mean claude code uses it to store its memory/state, or that it can do business queries using DuckDB?

medvezhenok 13 hours ago||

Claude code can write exploratory queries for you to give you a quick rundown on the shape of the data set, frequencies, missing values, etc etc (without having to load it into a more persistent data store or writing custom python scripts). I also find SQL snippets inherently more re-usable than custom python code.

You can also write a skill that CC can re-use if you're analyzing a lot of similar data sets with minor variance.

holografix 13 hours ago|

Why is DuckDB so popular when one can use Python + Pandas?

Better perf + SQL, is that mostly it?

refactor_master 13 hours ago||

The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:

  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;

  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )
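
For reference, the rule both snippets encode (a new session starts at a user's first event, or after a gap of more than 30 minutes) can be written as a minimal plain-Python sketch. The function name and the sample events are made up, and it compares exact durations, so behavior right at the 30-minute boundary may not match either snippet exactly.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap=timedelta(minutes=30)):
    """events: iterable of (user_id, event_time). Returns one dict per session."""
    ordered = sorted(events, key=itemgetter(0, 1))
    sessions = []
    for user_id, user_events in groupby(ordered, key=itemgetter(0)):
        session_id, prev_time = 0, None
        for _, event_time in user_events:
            # First event for this user, or a long enough gap: open a new session.
            if prev_time is None or event_time - prev_time > gap:
                session_id += 1
                sessions.append({
                    "user_id": user_id,
                    "session_id": session_id,
                    "session_start": event_time,
                    "session_end": event_time,
                    "event_count": 0,
                })
            current = sessions[-1]
            current["session_end"] = event_time
            current["event_count"] += 1
            prev_time = event_time
    return sessions


# Hypothetical sample data.
events = [
    ("u1", datetime(2024, 1, 1, 9, 0)),
    ("u1", datetime(2024, 1, 1, 9, 10)),
    ("u2", datetime(2024, 1, 1, 9, 5)),
    ("u1", datetime(2024, 1, 1, 10, 0)),
]
sessions = sessionize(events)
print(sessions)
```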

coldtea 8 hours ago|||

Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).

brikym 11 hours ago||||

Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.

SmirkingRevenge 19 minutes ago||||

Polars and duckdb interoperate so it's not really one or the other.

Other dataframe libraries work directly on duckdb also, like Ibis.

FWIW, I think SQLAlchemy does also.

In any case though, I find both the SQL and the code there equally readable.

willj 2 hours ago||||

How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)

sceadu 4 hours ago||||

I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.

porridgeraisin 12 hours ago||||

I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.

homebessguy 12 hours ago||||

"Languages come and go, but SQL is forever"

IshKebab 10 hours ago|||

That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)

tomjakubowski 4 hours ago||

The other thing DuckDB does quite well, and which is out of scope for polars, is its keeping of an absolute zoo of external data sources you can query with SQL from the same database client. It's an excellent data warehousing tool.

estetlinus 13 hours ago|||

Why would you prefer Python and Pandas over good old SQL? Pandas is so verbose and hard to debug, and most of the time it struggles to be performant on small datasets.

SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.

paytonjjones 13 hours ago|||

Pandas has lots and lots of problems.

Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.

It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.

RobinL 13 hours ago||

I wrote a blog post a while back to address this question here: https://www.robinlinacre.com/recommend_duckdb/
