Posted by thomasp85 11 hours ago
I was kind of guessing that it doesn't run in a database, that it's a SQL-like syntax for a visualisation DSL handled by front end chart library.
That appears to be what is described in https://ggsql.org/get_started/anatomy.html
But then https://ggsql.org/faq.html has a section, "Can I use SQL queries inside the VISUALISE clause," which says, "Some parts of the syntax are passed on directly to the database".
The homepage says "ggsql interfaces directly with your database"
But it's not shown how that happens AFAICT
confused
ggsql connects directly with your database backend (if you wish - you can also run it with an in-memory DuckDB backend). Your visual query is translated into a SQL query for each layer of the visualisation and the resulting table is then used for rendering.
E.g.
VISUALISE page_views AS x FROM visits DRAW smooth
will create a SQL query that calculates a smoothing kernel over the data and returns points along that. Those points are then used to create the final line chart
As an alpha, we support just a few readers today: duckdb, sqlite, and an experimental ODBC reader. We have largely been focusing development mainly around driving duckdb with local files, though duckdb has extensions to talk to some other types of database.
The idea is that ggsql takes your visualisation query, and then generates a selection of SQL queries to be executed on the database. It sends these queries using the reader, then builds the resulting visualisation with the returned data. That is how we can plot a histogram from very many rows of data, the statistics required to produce a histogram are converted into SQL queries, and only a few points are returned to us to draw bars of the correct height.
By default ggsql will connect to an in-memory duckDB database. If you are using the CLI, you can use the `--reader` argument to connect to files on-disk or an ODBC URI.
If you use Positron, you can do this a little easier through its dedicated "Connections" pane, and the ggsql Jupyter kernel has a magic SQL comment that can be issued to set up a particular reader. I plan to expand a little more on using ggsql with these external tools in the docs soon.
I eventually found this readme https://github.com/posit-dev/ggsql/tree/main/ggsql-python which tells me far more than anything I found on the website
"SQL" and "databases" are different things
SQL is a declarative language for data manipulation. You can use SQL to query a database, but there's nothing special about databases. You can also write SQL to query other non-database sources like flat files, data streams, or data in a program's memory.
Conversely, you can query a database without SQL.
fond memories of quel.
Or is the idea that SQL is such a great language to write in that a lot of people will be thrilled to do their ggplots in this SQL-like language?
EDIT: OK, after looking at almost all of the documentation, I think I've finally figured it out. It's a standalone visualization app with a SQL-like API that currently has backends for DuckDB and SQLite and renders plots with Vegalite. They plan to support more backends and renderers in the future. As a commenter below said, it's supposed to help SQL specialists who don't know Python or R make visualizations.
In my experience, the only thing data fields share is SQL (analysts, scientists and engineers). As you said, you could do the same in R, but your project may not be written in R, or Python, but it likely uses an SQL database and some engine to access the data.
Also I've been using marimo notebooks a lot of analysis where it's so easy to write SQL cells using the background duckdb that plotting directly from SQL would be great.
And finally, I have found python APIs for plotting to be really difficult to remember/get used to. The amount of boilerplate for a simple scatterplot in matplotlib is ridiculous, even with a LLM. So a unified grammar within the unified query language would be pretty cool.
I mean, is it to avoid loading the full data into a dataframe/table in memory?
I just don't see what the pain point this solves is. ggplot solves quite a lot of this already, so I don't doubt that the authors know the domain well. I just don't see the why this.
What makes it interesting is the interface (SQL) coupled with the formalism (GoG). The actual visualization or runtime is an implementation detail (albeit an important one).
So it’s something. But for most uses just prompting your favorite LLM to generate the matplotlib code is much easier.
However, I don't see what the benefits of this are (other than having a simple DSL, but that creates the yet another DSL problme) over ggplot2. What do I gain by using this over ggplot2 in R?
The only problem, and the only reason I ever leave ggplot2 for visualizations, is how difficult it is to do anything "non-standard" that hasn't already had a geom created in the ggplot ecosystem. When you want to do something "different" it's way easier to drop into the primitive drawing operations of whatever you're using than it is to try to write the ggplot-friendly adapter.
Even wrapping common "partial specificiations" as a function (which should "just work" imo) is difficult depending on whether you're trying to wrap something that composes with the rest of the spec via `+` or via pipe (`|>`, the operator formerly known as `%>%`)
ggsql is (partly) about reaching new audiences and putting powerful visualisation in new places. If you live in R most of the time I wouldn't expect you to be the prime audience for this (though you may have fun exploring it since it contains some pretty interesting things ggplot2 doesn't have)
Usage:
lhs |> rhs
Arguments: lhs: expression producing a value.
rhs: a call expression.
Details:
[...] It is also possible to use a named argument with the placeholder
‘_’ in the ‘rhs’ call to specify where the ‘lhs’ is to be
inserted. The placeholder can only appear once on the ‘rhs’.We reached a similar conclusion for GFQL (oss graph dataframe query language), where we needed an LLM-friendly interface to our visualization & analytics stack, especially without requiring a code sandbox. We realized we can do quite rich GPU visual analytics pipelines with some basic extensions to opencypher . Doing SQL for the tabular world makes a lot of sense for the same reasons!
For the GFQL version (OpenCypher), an example of data loading, shaping, algorithmic enrichment, visual encodings, and first-class pipelines:
- overall pipelines: https://pygraphistry.readthedocs.io/en/latest/gfql/benchmark...
- declarative visual encodings as simple calls: https://pygraphistry.readthedocs.io/en/latest/gfql/builtin_c...
I devised a similar in spirit (inside SQL, very simplified vs GoG) approach that does degrade (but doesn't read as nice): https://sqlnb.com/spec
# %%
foo = 1
# %%
print(foo)
Above is notebook with two "cells" & also a valid Python script. Perhaps it matters less with SQL vs Python, but it's a nice property.
Feedback: A notable omission in the ggsql docs: I cannot find any mention of the possible outputs. Can I output a graphic in PDF? In SVG? PNG? How do I control things like output dimensions (e.g., width=8.5in, height=11in)?
The closest I got was finding these few lines of example code in the docs for the Python library:
# Display or save
chart.display() # In Jupyter
chart.save("chart.html") # Save to fileOur ggsql Jupyter kernel can use these vegalite specifications to output charts in a Quarto document, for example.
In the future we plan to create a new high performance writer module from scratch, avoiding this intermediate vegalite step, at which point we’ll have better answers for your questions!
The point of this is not to superseed ggplot2 in any way, but to provide a different approach which can do a lot of the things ggplot2 can, and some that it can't. But ggplot2 will remain more powerful for a lot of tasks in many years to come I predict
What made it click for me was the following snippet from: https://ggsql.org/get_started/grammar.html
> We’ve tried to make the learning curve as easy as possible by keeping the grammar close to the SQL syntax that you’re already familiar with. You’ll start with a classic SELECT statement to get the data that you want. Then you’ll use VISUALIZE (or VISUALISE ) to switch from creating a table of data to creating a plot of that data. Then you’ll DRAW a layer that maps columns in your data to aesthetics (visual properties), like position, colour, and shape. Then you tweak the SCALEs, the mappings between the data and the visual properties, to make the plot easier to read. Then you FACET the plot to show how the relationships differ across subsets of the data. Finally you finish up by adding LABELs to explain your plot to others. This allows you to produce graphics using the same structured thinking that you already use to design a SQL query.