Posted by marklit 3 days ago
There's a talk about ClickHouse's approach from its creator: https://www.youtube.com/watch?v=xHj9mysh0GI , but the gist is that it maintains (sub)columns to store different paths in the JSON
In other ways DuckDB has very good JSON support, like you can do `CREATE TABLE name AS `SELECT * FROM 'data.json';` and it'll infer the schema when possible.
I still need to wrap my head around how that works, but it's a fascinating development.
But until then, separate tools for separate purposes isn't a bad place to be when those tools are both fantastic.
Which can outperform a generic solution like this of course, but it’s not less work to make faster for most cases.
Also duckdb can give you access to an in memory representation (e.g. `fetch_arrow_table()`) so you have less “language data structure wrapping” overhead. And you can do filtering yourself on that. In most cases the “where” statements will win though.
> ..In-process means there's no server. You don't connect to DuckDB; you load it as a library inside your program, the same way you'd load NumPy or Polars
Does it mean it can perform all statistical computations as well if I want to use for algo trading?
I've never been that strong of an engineer. TIL that at one of my internships I was building DuckDB but for the company's private use cases. Well, trying to anyways. I didn't really get the whole picture, the pieces did not fit into place.
Didn't get the return offer obviously, probably because I didn't make the connection (or really a coherent narrative of what I was building). RIP. You live and you learn, I guess.
Can you expand upon it? You mean claude code use it to store its memory/state or it can do business queries using DuckDB.
You can also write a skill that CC can re-use if you're analyzing a lot of similar data sets with minor variance.
Better perf + SQL is that mostly it?
WITH lagged AS (
SELECT
*,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_time
FROM events
),
sessions AS (
SELECT
*,
SUM(COALESCE((date_diff('minute', prev_time, event_time) > 30)::INT, 1))
OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
FROM lagged
)
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count
FROM sessions
GROUP BY ALL
ORDER BY user_id, session_start;
vs result = (
df.sort(["user_id", "event_time"])
.with_columns(
session_id=(
pl.when(pl.col("event_time").diff().is_null())
.then(1)
.when(pl.col("event_time").diff().dt.total_minutes() > 30)
.then(1)
.otherwise(0)
.cum_sum()
.over("user_id")
)
)
.group_by(["user_id", "session_id"])
.agg(
session_start=pl.col("event_time").min(),
session_end=pl.col("event_time").max(),
event_count=pl.col("event_time").count(),
)
.sort(["user_id", "session_start"])
)Other dataframe libraries work directly on duckdb also, like Ibis.
FWIW, I think SQLAlchemy does also.
In any case though, I find both the SQL and the code there equally readable.
SQL has been around since the dawn of databases. I am happy to see a trend away from pandas.
Performance is definitely one of them, but it also has inconsistent and duplicated methods, inconsistent defaults (e.g. some methods are inplace by default), copy by reference issues, I could go on.
It was an early winner in an extremely popular language. That's really the main thing going for it, but alternatives have been a long time coming.