Top
Best
New

Posted by robbie-c 1 day ago

I rewrote PostHog's SQL parser, 70x faster, while barely looking at the code(posthog.com)
141 points | 69 commentspage 2
lovasoa 1 day ago|
The thing I would have liked to know is why they don't use an existing fast SQL parser. Was being slightly incompatible with all existing SQL dialects a product requirement?
robbie-c 1 day ago||
Our SQL is very similar to ClickHouse SQL, in that we used ClickHouse SQL as a starting point as that's what our underlying DB is. We needed to have our own parser so that we could add additional language features on top.
bonzini 1 day ago||
I think you should clarify that (or whether) while you didn't look at the generated code, you are actually going to adjust it in the future.

How did the two approaches compare in terms of code readability?

robbie-c 13 hours ago||
The previous parser is mostly a declarative grammar file, which is extremely readable. It codegens a C++ parser, which is hard to read. It depends which of those you count as the previous parser's source code!

In the future, we'd make changes by modifying the ANTLR parser first, then using the same approach as in the blog post to get the new parser to parity. We have no plans to get rid of the C++ parser as an oracle!

bonzini 11 hours ago||
Sorry, by two approaches I meant the two parsers generated by the LLM, recursive descent and graph based; not the ANTLR one.

Sort of unexpected that you're keeping the old one as an oracle—but a very good idea for anyone that writes such a complicated parser, indeed!

nijave 23 hours ago|||
Yeah curious why they didn't use Presto/Trino, DuckDB, or Clickhouse SQL directly with UDFs and views to augment

Zuora exposes a Trino-based data warehouse which is quite nice and powerful

Besides the parser side, existing dev tools and docs automatically work, too

__s 1 day ago|||
This is pretty much the case with every SQL dialect
-warren 1 day ago||
I think thats exactly what indirectly happened. This guy didnt optimize the parser. Someone else did -- years ago. That work was pulled into the LLM and made it look like magic.
bonzini 1 day ago||
Note that it's not a particularly optimized algorithm: recursive descent + specialized subparser for expressions is simply the standard way to write parsers by hand. It's ANTLR which is super flexible but also dog slow.
robbie-c 1 day ago||
Yeah, one of the interesting parts to me while working on this is that the breakpoint for when it's worth writing your own parser vs accepting ANTLR's slowness has shifted massively. Previously it would have been someone's full-time job to maintain. Now with this approach you can get the best of both worlds.
boiler_up800 1 day ago||
Very good and interesting article, particularly the “loop” that he ended up with.

Amusing anecdotes on LLMs to:

> It did, in fact, make a lot of mistakes, kept doubting whether such a rewrite was even possible, and wanted to call it a day after each round of coding.

> Hilariously one of the most effective was to tell Claude to “think really hard about edge cases" in a background agent.

zingar 1 day ago||
This must the most compelling look I’ve seen at how software might work with LLMs doing a ton of heavy lifting.

There’s something kind of amazing here in that having read about property based testing I’m pretty confident I could apply it if I had a good use case.

sam_lowry_ 1 day ago||
Dunno about the parser, but you broke scrolling on your fancy website without noticing it also ;-)
spinachsalad 10 hours ago||
> there’s a test for SELECT SELECT FROM FROM WHERE WHERE AND AND which is completely valid SQL

Is this even true? I tried it in SQLite and there's a syntax error after first SELECT. It would work when "SELECT", "FROM" etc. are quoted, but that's not the same thing.

charlieflowers 8 hours ago|
I interpreted it as referring to a table named "FROM" and fields named "SELECT", "WHERE" and "AND".
duke_of_vandals 1 day ago||
How long did this take?
robbie-c 1 day ago||
It took about 2 days to get a proof of concept, and about a week to get something I could ship to production.

I skipped a few features for the PoC (like XML tag support, token positions), so most of the delta was adding those back in!

duke_of_vandals 1 day ago||
If you didn’t need to look at the code at all, why not write it in asm instead of Rust, and make it even faster?
robbie-c 1 day ago|||
Ha I did consider that! But 70x is plenty fast enough (we still have to query an actual database!) and the parser runs in a shared process on untrusted input, so it wasn't worth the security risk
ncruces 1 day ago||||
This is a good point.

If we don't care about inspecting the output, why don't we just replace the Rust compiler with something that generates assembly from Rust source, and then cross check the output with a few million random Rust programs?

orsorna 1 day ago||
About 1/1000 of the duration of their interview process where they gloat about wasting your time.
eudamoniac 9 hours ago||
Surprised anyone gets an interview there. My resume matched the job requirements pretty much exactly with 11yoe and I got an auto rejection mail at 1am.
ncruces 1 day ago||
You have a grammar file in a formal language, and want to generate a faster parser in another formal language.

What's wrong with the source language that it's better to use a sufficiently smart random code generator for the target language, and then fuzz the hell out of the output of it until it behaves the same as the slow translated code, than to create a sufficiently smart compiler from the source to target languages?

I mean this sounds like if we replaced GCC with a really smart random assembly generator and a fuzzer for the output.

sayrer 1 day ago||
ha, try to keep going. Run it under samply and Gungraun (need AMD64 for this)
westurner 1 day ago||
Could the agent traces from this be used to improve sqlglot?

tobymao/sqlglot: Python SQL Parser and Transpiler; with tests and support for 30+ dialects: https://github.com/tobymao/sqlglot

Ibis depends upon sqlglot: https://github.com/tobymao/sqlglot/network/dependents

akitowerns 18 hours ago|
[dead]
More comments...