Posted by ibobev 2 days ago
The lexer/parser is never the bottleneck. In fact, you can write those two by hand over a single weekend for a largish language. With LLMs, it takes 15 minutes if you have an unambiguous spec.
The biggest time sink, and the reason you will fail for sure, is the inability to restrict the scope of the project. You start with a limited feature set and produce the entire compiler/vm toolchain. Then you get greedy and fiddle with the type system, adding features that you have never used and probably never will. And now you have to change every single phase from start to end.
I mostly give up at this stage.
He went through straight recursive descendant parser and said same thing.
I think compiler courses teach from yacc, bison etc that's where this whole thing came from but in practice people discovered that hand written recursive descendant parsers are all you need.
Very true. I have a shelf full of books on compiler development and optimization. I have read them selectively, a chapter here, a chapter there. But that shelf is useless for a vast majority of people.
You might find it useful if you are developing a production-level compiler/vm (I cannot make this statement with a straight face while Python rules the world). But a simple and sensible architecture that uses recursive-descent parsing takes you a long way.
Most hobbyist compilers (and even some production ones) are written as a heavy front-end compiling down to C or LLVM. Very few people actually write their own backend.
Not any of the ones I have worked on, nor the ones I know about: they all use hand-written parsers. In practice, error reporting and recovery tends to be tedious and/or difficult with a generated parser, which is a serious issue for practical tools.
Parsing has turned out to be simpler, in practice, than the computing pioneers expected it to be, because simpler grammars are easier for both machines and humans to reason about. Instead of using sophisticated parser generators, we just design dumb grammars: that works out better all around.
I used to give up somewhere around the type system, too, but this time I’m approaching something vaguely useful. It even has a basic LSP.
It’s been both enjoyable and enlightening, and LLMs turn out to be an excellent pair designer as (in addition to implementation) they’re really good at summarising the impact of various decisions.
> the reason you will fail for sure, is the inability to restrict the scope of the project
This will be the reason, for sure. But then the scope of every project like this tends towards building an OS with it then replacing every piece of software, including all embedded devices :)
I cannot do slow. It is either burn the candle at both ends, or do nothing at all.
I am using LLMs this time as well, but I spent close to 400 hours over a period of 6-7 weeks on my project before I put it to the side temporarily (got bored once the thinking part was done). About 300 of those were spent on iterating over the language and VM specs and eliminating all ambiguities and needless features. The remaining 100 were used to produce the code --- the VM, the assembler and the compiler --- and to repeated rewrite it to conform to my way of doing things.
LLMs have let me become extremely choosy about which code I am willing to keep.
I've managed to get OpenCode setup such that I can have a productive discussion about the design or an issue / change then leave the LLM iterating for long periods while I do other work. It's instructed to maintain test coverage and treat quality very seriously - as a result there are over 5000 tests (some I suspect are useless...) and it's pretty rare to get a regression.
I'm pretty sure there are plenty of significant bugs and gaps, but also that once found it seems like all of them will be fixed pretty quickly by the LLM.
I just have to avoid looking at the code...
Even stuff like Crenshaw's Let's Build a Compiler was more useful to me than all these books that do lexical analysis using regular expressions. I have written lexers and parsers hundreds of times for all kinds of DSLs and config languages and not once have I used regular expressions to scan the text.
I am not quite sure what you mean by having a recursive descent parser, because you can write one manually, or you can generate one from a grammar, which would have the additional benefits of having a grammar. I recommend having a grammar.
With regard to portability: I've found cross-language parser generators especially unpleasant to work with. Instead, I just implement the parser in a language that runs on all platforms I care about.
Profiling results show that the amount of time spent lexing/parsing is negligible - less than 1% of the total compilation time.
The easiest syntax to copy if you’re looking for a high level language is Smalltalk.
But most of the time, I wouldn’t even use that. Simple imperative languages that look like BASIC works pretty well in most domains. If you simplify the syntax a little, it’s very easy to understand the compiler and use it for say when you want users to input code into existing systems.
Syntax is a minor issue but something that people are very opinionated about. You could technically build multiple front ends that share the typechecking, CFG validation, optimization, register allocation and byte code emission phases. But it is too much work for what is presently a personal project.
The language, Sapphire, is Ruby inspired, so the most interesting part is digging into the internals of the latter when I'm trying to figure out how something should work.
I wrote a compiler toolchain and debugger that takes a Turing machine description plus input string and emits an encoded tape runnable by a Universal Turing Machine [0]. I had some prior PL experience, but never did an end-to-end compiler pipeline, at least not this low level.
It started as a joke/experiment, but I couldn't believe how fast it pulled me into designing:
- a small low-level ASM for building the UTM
- an ABI for symbol widths and encoding grammar
- an interpreter used as the behavioral oracle
- raw TM transitions for each ASM instruction, generated by having an LLM iterate on candidate emissions and checked against the interpreter oracle
- a CFG-style IR to fix the LLM mess once direct ASM -> TM emission became too hard to keep sane (LLM did a decent job actually, I don't think I would have done a much better job without the IR either)
- a gdb-style debugger for raw transitions, ASM routines, and blocks
- a trace visualizer
- a bootstrapping experiment where an L1 UTM/input pair was itself run through an L2 UTM
- optimisation experiments
And every step came quite naturally and was easy to tie in with everything else. Each one was just the next local repair needed to make the previous layer tractable.
[0] Repo: https://github.com/ouatu-ro/mtm
The team implementing the survey system wound up using the same language to implement the runtime portion, something I never expected or designed in.
I don't recall anything about what it looked like now. I do remember it was a lot of fun to write.
The tail ends of a language implementation (parsing and code generation) are a fixed cost; the "middle end" can grow unbounded as more production-quality items are added.
My language: https://www.empirical-soft.com