DuckDB Internals: Why Is DuckDB Fast? (Part 1)

Posted by marklit 3 days ago

DuckDB Internals: Why Is DuckDB Fast? (Part 1)(www.greybeam.ai)

376 points | 122 commentspage 3

thefourthchime 15 hours ago|

I’m a huge fan, I’ve been wanting to know into the internals. Look forward to digging in.

codingbear 15 hours ago||

duckdb is so nice coupled with claude code. It extensive file support and some very interesting decisions on local caching data (like from S3 or snowflake) makes it easy to slice and dice almost any kind of tabular data.

blackoil 14 hours ago|

> duckdb is so nice coupled with claude code

Can you expand upon it? You mean claude code use it to store its memory/state or it can do business queries using DuckDB.

medvezhenok 14 hours ago||

Claude code can write exploratory queries for you to give you a quick rundown on the shape of the data set, frequencies, missing values, etc etc (without having to load it into a more persistent data store or writing custom python scripts). I also find SQL snippets inherently more re-usable than custom python code.

You can also write a skill that CC can re-use if you're analyzing a lot of similar data sets with minor variance.

holografix 14 hours ago||

Why is DuckDB so popular when one can use Python + Pandas?

Better perf + SQL is that mostly it?

refactor_master 14 hours ago||

The better question is, why is DuckDB so popular when one can use Polars which has a sane, lintable, typesafe API compared to the mess that is SQL:

  WITH lagged AS (
      SELECT 
          *, 
          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
      FROM events
  ),
  sessions AS (
      SELECT 
          *, 
          SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1)) 
              OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
      FROM lagged
  )
  SELECT
      user_id,
      session_id,
      MIN(event_time) AS session_start,
      MAX(event_time) AS session_end,
      COUNT(*) AS event_count
  FROM sessions
  GROUP BY ALL
  ORDER BY user_id, session_start;

  result = (
      df.sort(["user_id", "event_time"])
      .with_columns(
          session_id=(
              pl.when(pl.col("event_time").diff().is_null())
              .then(1)
              .when(pl.col("event_time").diff().dt.total_minutes() > 30)
              .then(1)
              .otherwise(0)
              .cum_sum()
              .over("user_id")
          )
      )
      .group_by(["user_id", "session_id"])
      .agg(
          session_start=pl.col("event_time").min(),
          session_end=pl.col("event_time").max(),
          event_count=pl.col("event_time").count(),
      )
      .sort(["user_id", "session_start"])
  )

coldtea 9 hours ago|||

Precisely to avoid the custom NIH Polars API, and use SQL which works everywhere (yes, inconsistencies aside).

brikym 13 hours ago||||

Polars typesafe? It doesn't show you any errors until runtime right? Kusto query language is the best I've seen at type safety and I wish open source DBs would steal some ideas from it.

SmirkingRevenge 1 hour ago||||

Polars and duckdb interoperate so it's not really one or the other.

Other dataframe libraries work directly on duckdb also, like Ibis.

FWIW, I think SQLAlchemy does also.

In any case though, I find both the SQL and the code there equally readable.

willj 3 hours ago||||

How can you tell if someone is a polars fan? Don’t worry, they’ll tell you. :)

sceadu 5 hours ago||||

I really hate SQL as a language for doing data-related tasks, but the core tech for it is often much better in terms of reliability (and is close to having performance parity IME)... specifically for working with larger-than-memory datasets. I think the team at Polars is working on improving larger-than-memory operations though.

porridgeraisin 13 hours ago||||

I understand the linting aspect but not gonna lie I understood the first one immediately way more than the 2nd one due to knowing SQL well.

homebessguy 13 hours ago||||

"Languages come and go, but SQL is forever"

IshKebab 11 hours ago|||

That does look nicer if you have a Parquet file and want to analyze it. But DuckDB is also a database - if you want a persistent, reliable and mutable data store I don't think Polars would be suitable would it? (Genuine question - you sound like an expert and I'm not.)

tomjakubowski 5 hours ago||

The other thing DuckDB does quite well, and which is out of scope for polars, is its keeping of an absolute zoo of external data sources you can query with SQL from the same database client. It's an excellent data warehousing tool.

estetlinus 14 hours ago|||

Why would you prefer Python and Pandas over good old SQL? Pandas is so verbose and hard to debug, most of the times struggle to be performant on small datasets.

SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.

paytonjjones 14 hours ago|||

Pandas has lots and lots of problems.

Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.

It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.

RobinL 14 hours ago||

I wrote a blog post a while back to address this question here: https://www.robinlinacre.com/recommend_duckdb/

f311a 14 hours ago||

I wish this article was not LLM written

sigbottle 6 hours ago||

What the fuck.

I've never been that strong of an engineer. TIL that at one of my internships I was building DuckDB but for the company's private use cases. Well, trying to anyways. I didn't really get the whole picture, the pieces did not fit into place.

Didn't get the return offer obviously, probably because I didn't make the connection (or really a coherent narrative of what I was building). RIP. You live and you learn, I guess.

bunbun69 4 hours ago||

Holy LLM slop article…

explodes 4 hours ago|

> You open a connection, send SQL over TCP (a protocol to send data over a network),

Thanks for the clarification, LLM!!

greenavocado 2 hours ago||

To be fair, most managers don't know what TCP is

pknerd 14 hours ago||

umm can we say it can replace SQLite?

pierregillesl 8 hours ago||

I wouldn’t see it as a replacement.

SQLite and DuckDB solve different problems and actually complement each other quite well. SQLite is excellent for transactional workloads (OLTP), while DuckDB shines for analytical workloads (OLAP), especially time-series data and aggregations.

We’ve been using both side-by-side in an open-source project for about two years: SQLite for configuration and transactional data, DuckDB for historical sensor data and analytics. So far, it’s been a very good combination.

skeeter2020 3 hours ago|||

DuckDB kind of created this false comparison by their own early positioning, but I've tried to charitably interpret it as modeling the spirit and motivations of SQLite, not literally being "the SQLite for Analytics". Aside from both being in-process databases they are very different.

tomjakubowski 3 hours ago|||

You can even use DuckDB to query SQLite :^)

3eb7988a1663 14 hours ago|||

OLAP vs OLTP. Sure you could use one for the other, but they have ideal use cases. Updating a single record in SQLite is going to be more efficient than doing the same in DuckDB.

steve_adams_86 14 hours ago||

They seem similar at a glance but they’re quite different. You can think of SQLite as a transactional database while DuckDB is better used as an analytical database.

I can see applications having valid reasons to use both. You can use SQLite as the catalog in duck lake systems, for example. SQLite is your metadata record, DuckDB is your ingestion/scanning/aggregating/joining engine.

pierregillesl 8 hours ago||

[dead]

kunal183 11 hours ago||

[flagged]

Omniloop 11 hours ago|

[flagged]

More comments...