Posted by marklit 3 days ago
DuckDB is becoming a kind of data superglue between a lot of data ecosystems (GIS, observability, analytics, lakehouses, object storage, etc.) that typically don't talk to each other, and it's worth checking out in 2026.
* https://github.com/duckdb/extension-template
* https://duckdb.org/community_extensions/
I'm not very good at C++, but coupled with the extension template and codex I got a basic version of my extension working within an hour. Go for it!
Thanks in Advance
You probably don't realize this, but you're asking one of the hardest questions when starting a business, and one of the questions others are least likely to be able to answer for you.
"Finding" a niche and connecting with the business folks inside that niche is hard, and is inherently a personal journey.
There's an old writing adage, "Write about what you know", and the same adage works in business: Do business with what you know.
Your question goes into another issue that you have to resolve when building a business: going into a platform specialization necessarily means folks know about that platform or they know they need you to solve a problem they have with that platform.
In general, there are two ways out of each problem:
1. Build an ecosystem with DuckDB at its center that solves a business problem that a particular niche cares about.
2. Build a reputation solving problems with DuckDB that would attract those that know they have a problem with DuckDB.
Honestly, best of luck here. Becoming successful at business is hard if you're not already in tune with why folks buy and ensuring you're selling something they want to buy from you.
There is a theory called diffusion of innovation. The simple explanation is that there are 5 cohorts of buyers: innovators, early adopters (visionaries), pragmatists, conservatives and laggards. Innovators and visionaries are risk takers who will make bold moves to achieve order-of-magnitude results. Together they are called the early market, which represents roughly 16% of the market. The pragmatists and conservatives make up the mainstream market, which is about 68%.
In order to get into the mainstream market, you need solid adoption from the early market.
To choose a niche, you need to develop a solution that fits nicely into the buyers' expectations for different types of market participants. There is the market alternative and the product alternative. The market alternative is the solution that owns the highest proportion of market share. The product alternative is innovative tech that challenges the superiority of the market alternative.
You need to introduce a solution that fits in between those participants to stand out.
To choose a solution, go to industry trade events and talk to people about high value problems that aren’t solved by current participants. That is the purpose of industry associations, to solve difficult problems.
Visionaries and early adopters love new vendors. They will champion you through their organization if your solution will help them meet their goals.
Good luck
Define the smallest market possible or something like that. I’m not sales though.
We didn't know that for GP3 disks, you can increase not only IOPS but also Read/Write Throughput [1] which by default is 125 MB/s. So by default we were not seeing the performance we expected.
Once we increased the throughput of the EBS, it was amazing. So if you are not seeing the performance you read about online when using DuckDB, it may be something like that.
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-p...
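To see why the throughput cap matters for scan-heavy queries, here is a back-of-the-envelope sketch. Only the 125 MB/s default comes from the comment above; the 50 GB dataset size and the 500 MB/s raised value are made-up illustration numbers, and the estimate ignores caching, compression and CPU time.

```python
def scan_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Lower bound on the time to read size_gb once from a volume
    whose throughput is capped at throughput_mb_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_s


DEFAULT_MB_S = 125   # gp3 default quoted in the comment above
RAISED_MB_S = 500    # hypothetical provisioned value, not from the post
DATASET_GB = 50      # hypothetical dataset size

for label, rate in [("default", DEFAULT_MB_S), ("raised", RAISED_MB_S)]:
    print(f"{label}: at least {scan_seconds(DATASET_GB, rate):.0f} s per full scan")
```

Under these assumptions a full scan is disk-bound no matter how many cores DuckDB has available, which would match the "not seeing the performance we expected" symptom.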
Is Amazon running on super outdated legacy networking?
Recently at work I've been using it to analyse the Claude Code sessions of every engineer at our company (which we upload to S3), and it's been extremely helpful for finding gaps in devex and getting clear metrics to back up the impact of fixing them.
Another thing it's been really useful for is getting metrics on Claude skills usage and then diving into use cases by looking at the transcripts.
Other engineers that had never touched DuckDB were so impressed with how easy it is for AI agents to write queries on our dataset
I agree, and the dirty (not so) secret big data providers like Snowflake try to hide: the majority of your work is not big data and WILL fit on your local machine. My last company was spending $2M/yr on contract with Snowflake, and another million between Fivetran and Matillion. Of the 1200 clients using analytics maybe 2 had enough data to warrant "infinite scalability" and a dozen wanted Snowflake because they already had corporate warehouses in Snowflake (they probably didn't need it either). Turns out the Extract and Load could be handled by bog-standard C# code and a bunch of SQL, while almost everyone was better off with a DuckDB database running locally, often in the browser. You've probably heard YAGNI before (You Ain't Gonna Need It) but it's even more likely with "Big Data". #SmallDataConvert
The file fits in memory and can do all sort of computation in the browser itself. The backend is extremely simple, it just loads the JS and serves the parquet files.
It was also trivial to let the owner do their own queries, just give the schema to an LLM and let it use the charting library, no data hallucinations. If they need it in the dashboard they can either use that one or ask me to review that query.
To be honest, given how simple some things became, it's been really fun to work on.
Dangerous thing to assert. It’ll happily run SQL that works, but doesn’t necessarily correspond to intentions or unstated assumptions about the data.
It can only emit SQL and the json spec of the chart.
Since shipping I've reviewed dozens of queries and charts it produces answering the user. I'm yet to catch sonnet off guard.
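One way the "it can only emit SQL and the JSON spec of the chart" constraint could be enforced is a small validator in front of the query engine. This is a minimal sketch, not the commenter's implementation: the reply shape (`sql` and `chart` keys), the chart-type whitelist and the keyword blocklist are all assumptions. It only checks the shape of the reply; it cannot tell whether a valid query matches the user's intent, which is the concern raised above.

```python
import json
import re

# Hypothetical whitelist of chart types the dashboard knows how to render.
ALLOWED_CHART_TYPES = {"bar", "line", "scatter"}

# Crude keyword blocklist; it will also reject harmless queries that use
# these words as column names, which is acceptable for a sketch.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|copy|attach|install|load|pragma)\b",
    re.IGNORECASE,
)


def validate_reply(raw: str) -> dict:
    """Parse an LLM reply and accept it only if it is a single read-only
    query plus a chart spec with a known type. Raises ValueError otherwise."""
    reply = json.loads(raw)
    if not isinstance(reply, dict) or set(reply) != {"sql", "chart"}:
        raise ValueError("reply must be an object with exactly 'sql' and 'chart'")
    if not isinstance(reply["sql"], str) or not isinstance(reply["chart"], dict):
        raise ValueError("'sql' must be a string and 'chart' an object")

    sql = reply["sql"].strip().rstrip(";").strip()
    if ";" in sql:
        raise ValueError("only a single statement is allowed")
    if not re.match(r"(?is)^(select|with)\b", sql):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(sql):
        raise ValueError("statement contains a forbidden keyword")
    if reply["chart"].get("type") not in ALLOWED_CHART_TYPES:
        raise ValueError("unsupported chart type")
    return {"sql": sql, "chart": reply["chart"]}
```

Opening the data read-only is a stronger guard than keyword matching; the validator is just a cheap first filter.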
At the other end of the spectrum, working with random data on "what if?" and exploration tasks with DuckDB is fun again. It's so straightforward and fast, with tools and functions for pretty much everything.
For now building the 10% of the SaaS that you need still leaves you operating 100% of a new service/process
Nice! How do you set things up so that your engineers' Claude Code sessions upload to S3? Thanks for the help in advance
    UPDATE my_table
    SET x = file1.x,
        y = file2.y
    FROM 'first_file.csv' file1
    LEFT JOIN 's3://my_bucket/second_file.parquet' file2
      ON file1.id = file2.id
    WHERE my_table.id = file1.id;

This was a major factor in my initial adoption. Since then it has stuck because it’s also absurdly capable, versatile, and fast.
If it wasn’t so easy to use I suspect I wouldn’t have adopted it when I did. The ergonomics are crazy. It still impresses me regularly.
It has connectors for Postgres & other stores, so I find it faster to connect to a Postgres instance, pull all of the data from a table (even if the table is like 50GB - if you have 30 cores on the machine it will pull from Postgres using 30 cores in parallel, so it will only take a minute or two) - and then any analytical queries on the data are 10+ times faster in DuckDB over native Postgres (GROUP BY, regexp_replace, count(distinct...) etc).
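A rough sketch of the pull-then-analyze workflow described above, written as the DuckDB statements one might run. The connection string, table name and `_local` suffix are hypothetical. The statements (`INSTALL`/`LOAD postgres`, `ATTACH ... (TYPE postgres, READ_ONLY)`, `SET threads`) reflect the DuckDB Postgres extension as I understand it, so check them against the current docs. The helper only builds the statements, so it can be read and run without a live Postgres instance.

```python
def postgres_pull_statements(conn_str: str, table: str, threads: int) -> list[str]:
    """Return DuckDB statements that copy one Postgres table into a local
    DuckDB table. All arguments are caller-supplied, trusted values."""
    return [
        "INSTALL postgres",
        "LOAD postgres",
        # Upper bound on DuckDB worker threads used for the scan.
        f"SET threads = {threads}",
        f"ATTACH '{conn_str}' AS pg (TYPE postgres, READ_ONLY)",
        # Materialize the remote table locally so later queries never hit Postgres.
        f"CREATE TABLE {table}_local AS SELECT * FROM pg.public.{table}",
    ]


for stmt in postgres_pull_statements("dbname=shop host=127.0.0.1", "orders", 30):
    print(stmt + ";")
```

With the `duckdb` Python package, each statement would be passed to `con.execute(stmt)` on an open connection; the GROUP BY and regex queries then run against `orders_local`.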
This will give you some experience and you'll start to see applicable problem spaces for DuckDB in product areas, especially anything with BI or DW.
There are other embeddable options out there but I found DuckDB fit better for the potentially massive datasets, and also because of how naturally it ingests the types of data we work with, some of its unique features, and how trivial it was to learn and integrate with the project.
Otherwise I use it almost daily for doing guardrailed data exploration with LLMs. I prefer SQL over random DSLs in AWS or Sentry or what have you. I’ll ingest the data I need and just run SQL against it. I mentioned in another comment that I’ll tend to store more useful data (especially data I export routinely, like infra cost reports) on S3 and use a Rill instance to do basic exploration in a GUI (it will query remote parquet files).
* fastapi + duckdb + parquet for the backend for a relatively high profile website
* wasm duckdb + react for a few visualization websites
* yaml driven ETL from lots of sources, principally ugly spreadsheets, into usable data. More T than E or L really
For data I reference frequently, and especially which I know will grow over time, I’ve started using Rill because it makes ad-hoc exploration very smooth and low-friction.
My process tends to be something like:
1. Explore logs or some other at least somewhat structured dataset
2. Use Claude to find useful patterns and determine how I might benefit from this data in ways I wasn’t yet aware
3. See how often it’s useful for decision making
4. If it’s frequently useful, formalize it as a view in my Rill instance and refine the models to maximize their utility
DuckDB is fast for some specific workloads. If you use it for most other things, it is at least an order of magnitude slower than SQLite. It also has some limitations in terms of what SQL it will currently run (e.g. I immediately ran into an issue with recursive queries). That will probably get better with time.
[1] If you search HN for "sqlite" and "duckdb" you get 4,310 hits and 2,398 hits respectively. That's a very heavy skew, considering SQLite is everywhere and has been around for a quarter century, while DuckDB effectively appeared on the scene two years ago.
SQLite is awesome and I would love to see more posts about it, but the reality is one of the major reasons it's awesome is the no-drama/stability/it just works. DuckDB is seeing a lot of development on many fronts so there's a lot more to learn and talk about right now.
Yes, it's specifically promoted as a DBMS for OLAP workloads. And it's usually compared to ClickHouse, another analytical DBMS. So people who use it know why it's good.
This is where Arrow wins I think. Arrow CPP for example has very portable builds and the C interface is very usable for building bindings.
DuckDB is excellent, but it’s more a black box than a library.
Edit: after a conversation with a robot, it would seem that the DuckDB and Arrow CPP C APIs are complementary, so it's very possible for Arrow CPP and DuckDB to coexist in an app, each with its own strengths. Arrow CPP doesn't have a simple SQL story, for example.
So, being more specific, I don't know how I could get a static build of DuckDB with Parquet and httpfs (i.e. querying S3) working in an app store environment. It was a day's work to get Arrow CPP to call back into Swift for the transport layer.
However I do now see that DuckDB recently provided an extension point for providing your own transport layer, so my point might well be moot for that reason [2].
[1] https://github.com/duckdb/duckdb/issues/16190 [2] https://github.com/duckdb/duckdb/pull/17464
I do a lot of experiments with regexes, and if you get used to the RE2 syntax that DuckDB uses, you can see a 10-100x uplift in speed compared to Postgres on things like regexp_matches(), regexp_extract(), etc. (depending on query/table/machine specifics). It has quite powerful scripting with custom macros, and it fixes a lot of annoyances of SQL for me compared to Postgres.
I think if you have access to a machine with a lot of RAM / cores and a beefy data set, then it's basically like a RAMdisk version of Snowflake running locally on your machine.
(and of course the fact that it makes it convenient to read CSV/parquet, read/write from S3, etc) - it's a very ergonomic tool.
It’s not really a database in the traditional sense, there is no ACID complexity, it’s a library that lets you write SQL to query a tabular data file.
There are companies that write cluster computing engines with duckdb as the byte-cruncher at their heart, but usually it's more like NumPy, Pandas or Polars on steroids. Or SQLite, but for running OLAP queries.
The key thing is that this scaled horizontally pretty much forever, since each vehicle had a fixed amount of data per year we could tightly control the performance characteristics of the analysis. Adding more vehicles didn't make things slower, just linearly more expensive.
I vaguely remember the data from those containers also being used to process some aggregate analysis (like the each vehicle-container would output some data that would be consumed by another job that did aggregates). But I don't remember the specifics.
[1]: I believe we used JSONL or parquet format, but I didn't work in that part of the stack directly
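The scaling claim above (more vehicles means more cost, not more time) can be written down as a toy model. Everything here is an assumption for illustration: one container per vehicle, identical run time per vehicle, a flat price per container-minute, and a cap on how many containers run at once.

```python
import math


def fleet_estimate(vehicles: int, minutes_per_vehicle: int,
                   cost_per_container_minute: int, max_parallel: int):
    """Return (wall_clock_minutes, total_cost) for one container per vehicle.

    Wall-clock time depends only on how many 'waves' of containers are needed;
    cost grows linearly with the number of vehicles.
    """
    waves = math.ceil(vehicles / max_parallel)
    wall_clock = waves * minutes_per_vehicle
    total_cost = vehicles * minutes_per_vehicle * cost_per_container_minute
    return wall_clock, total_cost


# Doubling the fleet (and the parallelism) keeps time flat and doubles cost.
print(fleet_estimate(100, 5, 2, 100))
print(fleet_estimate(200, 5, 2, 200))
```

The flat wall-clock time only holds while parallelism can grow with the fleet; with a fixed `max_parallel`, extra vehicles queue up in additional waves.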
Still a bit raw, but getting there