Statistics that live in your SQL

Posted by caerbannogwhite 3 days ago

Statistics that live in your SQL (kolistat.com)

134 points | 19 comments

pacbard 1 day ago|

This fits a need I had when working with DuckDB: running statistical analyses directly within the database without having to spin up external tools (like R/Stata/Python).

I really appreciate the API for calling stats functions and retrieving results. This seamless integration was exactly what I was missing.

As for my concerns, the project currently gets you through about a first semester's worth of grad-level statistics or quantitative methods. It's sufficient for exploratory descriptive statistics, paired tests, and basic linear regressions. However, this also means it isn't close to the "production-level" statistics required for rigorous research work. At a minimum, it needs to support heteroskedasticity-robust standard errors (Huber-White and clustered; jackknife and bootstrapped as a non-parametric bonus), multilevel linear models, and generalized linear models (GLMs). I notice there is an open issue for GLM support, though it sounds like that will require a full rewrite of the inference backend. Including marginal estimates following a regression would also be highly useful, especially if GLMs are implemented.
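To make the Huber-White request concrete, here is a minimal sketch for a one-predictor regression in plain Python. It is a textbook illustration, not the extension's code or API; the function name `slope_standard_errors` and the sample data are made up for the example.

```python
import math


def slope_standard_errors(x, y):
    """Fit y = a + b*x by OLS and return the slope with three standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)

    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical variance: assumes one common error variance for every row.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx
    # HC0 (Huber-White): each row contributes its own squared residual,
    # weighted by its squared deviation (x_i - x_bar)^2.
    hc0_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    # HC1: HC0 with the small-sample factor n / (n - k), k = 2 parameters here.
    hc1_var = hc0_var * n / (n - 2)

    return {
        "slope": slope,
        "intercept": intercept,
        "se_classical": math.sqrt(classical_var),
        "se_hc0": math.sqrt(hc0_var),
        "se_hc1": math.sqrt(hc1_var),
    }


# Hypothetical data, only to show the call.
result = slope_standard_errors([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(result)
```

The point estimate is identical in all three cases; only the uncertainty attached to it changes, which is why robust errors can be layered on top of an existing fitter.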

Taking this library from an MVP to a production-ready replacement for SAS, R, or Stata will require significant effort. I am unsure about the market fit for a tool like this; organizations paying for SAS or Stata are unlikely to abandon them for an upstart project, and R has a deeply entrenched ecosystem that is impossible to replace. The situation feels similar to the dynamic between Octave and Matlab, or SageMath and Mathematica. It risks becoming a free alternative used primarily by those who cannot afford the paid products.

As others in the thread have pointed out, this extension includes functionality that is already handled well by other community extensions (like ggsql, stochastic, and read_stat). Because the ultimate goal seems to be providing a SAS-compatible frontend built on DuckDB, which is a huge and exciting undertaking, I wonder if the statistical backend might progress faster by focusing only on tests and regressions. Since it serves as the foundation for the other frontend services, zeroing in on the core stats and relying on the broader DuckDB ecosystem for the rest might make this massive scope a bit more manageable.

caerbannogwhite 1 day ago|

Thank you, that's a lot of high-quality feedback!

Agreed, it's not production-grade yet, and robust standard errors are the priority. HC and cluster-robust SEs are the biggest credibility gap, which is why I just added an (Eigen-based) linalg kernel for the next release: it's the groundwork so the regression fitter can ship HC0–HC3 and clustered SEs (and bootstrapped, via the existing bootstrap aggregate). Margins and GLMs (IRLS on the same kernel, so not really a backend rewrite) are the next layers.
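For readers unfamiliar with IRLS, here is a minimal sketch of the idea for a logistic regression with one predictor, in plain Python. It is not the extension's kernel: the function name `logistic_irls` and the data are hypothetical, and the 2x2 weighted system is solved by hand instead of through a linear-algebra library. With the canonical logit link, each IRLS step is the same as a Newton step, which is the form used here.

```python
import math


def logistic_irls(x, y, max_iter=25, tol=1e-10):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by iteratively reweighted least squares."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Accumulate X'WX (a 2x2 matrix) and the score X'(y - p).
        a00 = a01 = a11 = g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)  # working weight for the logit link
            a00 += w
            a01 += w * xi
            a11 += w * xi * xi
            g0 += yi - p
            g1 += (yi - p) * xi
        # Solve the 2x2 weighted normal equations for the update step.
        det = a00 * a11 - a01 * a01
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (a00 * g1 - a01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical, non-separable data so that the maximum likelihood estimate exists.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = logistic_irls(xs, ys)
print(b0, b1)
```

Each pass is a weighted least-squares solve, so a fitter that already has a solid linear solve can reuse it for GLMs, which is consistent with the claim above that this is not a backend rewrite.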

About the scope, agreed, and already narrowing. I'll credit posit ggsql as the real grammar-of-graphics tool; mine stays a minimal built-in. The plan is to go deep on the statistical core and lean on the DuckDB ecosystem for the rest.

For the market fit, I'd frame it a bit differently. I'm not trying to pry anyone off Stata/R for rigorous work; I know those aren't going anywhere soon. The niche is stats where the data already lives: people in SQL/DuckDB who currently round-trip to R/Python for a regression. R can run in the browser now (webR), but not in the same engine as the data. The-stats-duck runs inside duckdb-wasm, so there's no separate runtime and no marshalling. And DuckDB is already far faster than R for the ETL/wrangling around the stats.

The plan is to figure that out from real usage. For now, I'll focus on the core.

williamcotton 1 day ago||

The plotting aspect of this seems very similar to:

https://opensource.posit.co/blog/2026-04-20_ggsql_alpha_rele...

caerbannogwhite 1 day ago|

That's exactly the inspiration! I made a post here on HN about that a few weeks ago: https://news.ycombinator.com/item?id=48108815

My plan is to release a blog post about all current VISUALIZE features next week, explicitly mentioning Posit's ggsql alpha.

/edit: clarifications

thomasp85 1 day ago||

ggsql developer here. It's quite fun to see an alternative implementation of our syntax so early. Why did you decide on this path rather than working with the ggsql duckdb extension? (honest curious question - not trying to push you away from your path)

I can only imagine the load you might end up with if you have to keep feature parity with ggsql along with all the other features you have.

jochapjo 1 day ago|||

If you're interested, this isn't an alternative implementation of ggsql's syntax (I published this last year and it is based on a slightly modified layered grammar), but the SGL language is a similar take on the grammar of graphics + SQL idea: https://arxiv.org/pdf/2505.14690. Currently implemented as an R package: https://sgl-projects.github.io/rsgl/index.html.

caerbannogwhite 1 day ago||

Well done on laying it out that way! I'm just wondering how much work it would be on my side to support that, since I can see that you support cases like "count(*)" and "group by" inside visualize. I can see you have a full bison grammar; I only have a custom parser at the moment, but at least your implementation is in C. But I'm happy to follow thomasp85.

jochapjo 12 hours ago||

No worries, just wanted to mention it since you are both working on similar things and might find it interesting. Congratulations to both of you on your releases.

caerbannogwhite 1 day ago|||

First of all, nice to meet you! The honest answer is timing: the first VISUALIZE commit on my side was April 25th; ggsql-duckdb's first commit was April 23rd. So I genuinely didn't know it existed!

About the name: yours is the official Posit one, and you were there first, so I'll rename my branding; there should be one ggsql, and it's yours. Mine only exposes VISUALIZE as the keyword anyway.

The actual name of the extension is the-stats-duck, which runs inside duckdb-wasm (it powers an in-browser data tool) and emits a Vega-Lite spec inline for the host to render. Your implementation (which I think is Rust-based, with an in-process HTTP server that opens a browser) is a native pattern, but correct me if I'm wrong! Mine is deliberately thin and wasm-safe, not a whole engine.

About the parity, you're right, and I'm not chasing it; for real grammar-of-graphics, ggsql should be the tool! But, if that's ok with you, I'd love to keep the syntax aligned!
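As a rough picture of what "emitting a Vega-Lite spec for the host to render" can look like, here is a sketch in plain Python that builds a bar-chart spec for a Poisson pmf with mean 3 over k = 0..10, the same shape as the example query quoted elsewhere in this thread. The exact spec the extension emits is not shown in the thread, so the field choices below are assumptions; `poisson_pmf` and `bar_chart_spec` are hypothetical helper names, and `poisson_pmf` assumes `dpois` follows the R convention of (k, lambda).

```python
import json
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def bar_chart_spec(rows, x_field, y_field):
    """Build a minimal Vega-Lite bar chart spec with the data embedded inline."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": rows},
        "mark": "bar",
        "encoding": {
            "x": {"field": x_field, "type": "ordinal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


# One row per k in 0..10, matching range(0, 11) in the quoted query.
rows = [{"k": k, "pmf": poisson_pmf(k, 3)} for k in range(0, 11)]
spec = bar_chart_spec(rows, "k", "pmf")
print(json.dumps(spec, indent=2))
```

The result is plain JSON, so the host page only needs a Vega-Lite renderer; nothing in the spec is executable.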

thomasp85 1 day ago||

No objections at all. But it's probably good to describe it as ggsql-inspired rather than a full reimplementation, as it could otherwise lead to user confusion about what syntax is supported, etc.

And you are correct about how our extension is implemented, and it isn't currently wasm-ready.

caerbannogwhite 1 day ago||

Appreciate it, good call! I'll rename the ggsql bits and the docs to describe it as ggsql-inspired. I'll also point to ggsql for the real thing. Thanks for being kind about it!

caerbannogwhite 16 hours ago||

In reply to HackerThemAll's https://news.ycombinator.com/item?id=48659482 comment (which I think is a bit pretentious and unfair, and HN does not show my direct reply to the comment):

I think that's a red herring: the query is sandboxed DuckDB-WASM SQL (never executed as script), it's never injected into the DOM as HTML, and the page enforces a strict CSP that blocks inline script regardless. NoScript probably flags it because it's SQL-shaped text in a cross-site query string, and it matches its injection heuristic.
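To illustrate what "SQL-shaped text in a query string" means, here is a small sketch using Python's standard library. The endpoint and the `autorun` and `query` parameter names are taken from the NoScript warning quoted in this thread; how the site actually builds its links is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# The example query from the warning, collapsed onto one line.
sql = (
    "WITH pois AS (SELECT k, dpois(k, 3) AS pmf FROM range(0, 11) AS t(k)) "
    "VISUALIZE k AS x, pmf AS y FROM pois DRAW bar;"
)

# Percent-encode the SQL so it travels as an ordinary query-string value.
embed_url = "https://bedeverewise.app/embed?" + urlencode({"autorun": 1, "query": sql})
print(embed_url)

# The receiving page can recover the exact SQL text from the URL.
decoded = parse_qs(urlsplit(embed_url).query)["query"][0]
print(decoded == sql)
```

The parentheses, commas, and semicolon in the SQL are the kind of characters an injection heuristic pattern-matches on, even though the value is only ever handed to the SQL engine as text.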

geysersam 1 day ago||

Looks great!

One minor correction - the `summarize` function in duckdb can also be used in CTEs etc.

But you have to wrap the `summarize` in a `from` clause like this:

  with
    some_table as (from range(10)),
    x as (from (summarize some_table))
  from x;

caerbannogwhite 1 day ago|

Thank you! I'll add a note about it.

PashaGo 1 day ago||

Interesting, but I think it works only for quick ad-hoc analysis. For dashboards or deeper research, you still need other tools.

caerbannogwhite 1 day ago|

Yes, that's exactly its main purpose! I initially started because I needed a dataset browser. I work with clinical trials, so we usually get raw data files in all possible formats, from CSV to Excel and, of course, SAS formats. But since I was already using DuckDB, I thought about extending it a bit further, so you can quickly get a glance at the data.

HackerThemAll 1 day ago|

NoScript detected a potential Cross-Site Scripting attack

from https://kolistat.com to https://bedeverewise.app.

Suspicious data:

(URL) https://bedeverewise.app/embed?autorun=1&query=WITH pois AS (
SELECT k, dpois(k, 3) AS pmf
FROM range(0, 11) AS t(k)
)
VISUALIZE
k AS x
, pmf AS y
FROM pois
DRAW bar
;

so... no, thanks.

g8oz 1 day ago||

Both sites are from the same guy.

caerbannogwhite 1 day ago|||

Thank you for clarifying that! I don't think they got that, and they also trusted NoScript a bit too blindly.
