Build a Compiler in Five Projects

(kmicinski.com)

194 points | by azhenley 1 day ago

4 comments

  • fjfaase 16 hours ago
    It surprises me that they are still teaching parsing techniques based on limitations from 40 years ago, when memory was scarce and you had to parse a file one character at a time. Why not start with a backtracking recursive descent parser on a file stored in memory? It can be made efficient with some simple caching. In an introductory course there is no need to aim for maximum performance if parsing a 10k-line program takes less than a second.
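A minimal sketch of what this commenter describes, with a hypothetical arithmetic grammar and invented function names (not from the course): a backtracking recursive-descent parser over an in-memory string, where a simple (rule, position) memo cache, packrat-style, keeps repeated attempts cheap:

```python
# Backtracking recursive-descent parsing over an in-memory string,
# with a (rule, position) memo cache (packrat-style).
# Hypothetical grammar:
#   expr   -> term ('+' term)*
#   term   -> factor ('*' factor)*
#   factor -> NUMBER | '(' expr ')'
def parse(text):
    memo = {}  # (rule_name, pos) -> (value, next_pos) or None

    def factor(pos):
        key = ('factor', pos)
        if key in memo:
            return memo[key]
        result = None
        if pos < len(text) and text[pos].isdigit():
            end = pos
            while end < len(text) and text[end].isdigit():
                end += 1
            result = (int(text[pos:end]), end)
        elif pos < len(text) and text[pos] == '(':
            sub = expr(pos + 1)
            if sub is not None and sub[1] < len(text) and text[sub[1]] == ')':
                result = (sub[0], sub[1] + 1)
        memo[key] = result  # cache failures too, so retries are O(1)
        return result

    def binop(rule, sub, op, fn, pos):
        key = (rule, pos)
        if key in memo:
            return memo[key]
        result = sub(pos)
        while result is not None and result[1] < len(text) and text[result[1]] == op:
            rhs = sub(result[1] + 1)
            if rhs is None:
                break  # backtrack to the last successful parse
            result = (fn(result[0], rhs[0]), rhs[1])
        memo[key] = result
        return result

    def term(pos):
        return binop('term', factor, '*', lambda a, b: a * b, pos)

    def expr(pos):
        return binop('expr', term, '+', lambda a, b: a + b, pos)

    r = expr(0)
    if r is None or r[1] != len(text):
        raise SyntaxError('parse error at position %d' % (0 if r is None else r[1]))
    return r[0]
```

This toy evaluates during parsing for brevity; a real front end would build an AST instead, but the caching structure is the same.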
    • norir 10 hours ago
      Parsing is strange in that many people tend to believe it is a solved problem and yet every project handles it slightly differently (and almost none do it truly well).

      I have been studying compiler design for several years and I have found that writing a simple parser by hand is the best way to go most of the time. There is a process to it: You start with a "Hello, world!" program and you parse it character by character with no separate lexer. You ensure that at each step, your parser makes an unambiguous decision at each character and never backtracks. The decision may be to enter a disambiguation function that also only moves forward. If the grammar gets in the way of preserving this property, change the grammar, not the parser design.

      If you follow that high-level algorithm, you will end up with a parser whose performance is linear in the number of characters, which is asymptotically the best you can hope for. It is both easy and simple to implement (provided you have solid programming fundamentals), and no caching is needed for efficiency.

      Deliberate backtracking in a compiler is an evil that should be avoided at all costs. It is potentially injecting exponentially poor performance into a developer's primary feedback loop which is a theft of their time for which they have little to no recourse.
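A tiny illustration of the discipline described above, with a toy grammar and names invented for this sketch: character-at-a-time, no separate lexer, and every decision moves strictly forward. Where two keywords share a prefix, a disambiguation step reads one more character and commits:

```python
# Forward-only, character-at-a-time parser with no separate lexer and no
# backtracking. Toy grammar (hypothetical):
#   stmt -> 'print' string | 'put' string
def parse_stmt(src):
    pos = 0

    def expect(s):
        # Consume the exact characters of s, never rewinding.
        nonlocal pos
        for ch in s:
            if pos >= len(src) or src[pos] != ch:
                raise SyntaxError('expected %r at position %d' % (s, pos))
            pos += 1

    def string_lit():
        # A double-quoted string with no escapes, scanned forward once.
        nonlocal pos
        expect('"')
        start = pos
        while pos < len(src) and src[pos] != '"':
            pos += 1
        lit = src[start:pos]
        expect('"')
        return lit

    expect('p')  # both keywords start with 'p'
    # Disambiguation point: one more character decides which keyword this
    # is, and from there we only move forward -- no backtracking.
    if pos < len(src) and src[pos] == 'r':
        expect('rint ')
        op = 'print'
    else:
        expect('ut ')
        op = 'put'
    arg = string_lit()
    if pos != len(src):
        raise SyntaxError('trailing input at position %d' % pos)
    return (op, arg)
```

Each character is examined exactly once, so the whole parse is linear in the input length, as the comment claims.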

      • fjfaase 9 hours ago
        I agree that if you want to write a production-grade parser, this is probably the best way to go. I also agree that parsing is not a solved problem for all cases, but that is true of many other problems as well. For many cases it is a solved problem, and it is often not the first thing you should focus on optimizing.

        If you teach a course about compiler construction, I think it might be better to teach your students how to write a grammar for some language and use an interactive parser that can parse input according to the grammar (and visualize the AST). See for example [1] and [2]. (Even if you feed it the C grammar, it succeeds in parsing thousands of lines of (preprocessed) C code at every keystroke. This interpreting parser is written in JavaScript and uses a simple caching strategy for performance.)

        For the scripting language [3] in some of the Bizzdesign modeling tools, a similar interactive parser was used (implemented in C++). This scripting language is also used internally for implementing the various meta-models. These scripts are parsed once, cached, and interpreted often.

        I think the same is true for many domain-specific languages (DSLs).

        [1] https://info.itemis.com/demo/agl/editor

        [2] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu...

        [3] https://help.bizzdesign.com/articles/#!horizzon-help/the-scr...

    • marcosdumay 9 hours ago
      Well, I can immediately think of two reasons:

      Backtracking parsers lead people into creating bad grammars. In principle, people are perfectly capable of creating simple context-free grammars and writing whatever parser they want to read them. But in practice your tools guide your decisions to a huge extent, and the less experience people have, the more true that becomes; so it's a really dangerous tool, in particular for students.

      Also, fully backtracking parsers have the most unpredictable and hardest-to-fix error conditions of all the options. There are middle grounds where the parser's execution is still predictable but you get most of the benefit of backtracking, but it takes a lot of complex engineering decisions to reach that optimum and keep your project close to it.

      Immediate edit: In a CS context there is one reason that is probably more important than any other. People use parsers as an application of automata and regular-language theory. Those two concepts are way more important than the practical implications of parsing.

      • fjfaase 8 hours ago
        What do you mean by bad grammars? Do you mean grammars that are hard to parse (requiring a lot of backtracking), or that it leads people to create bad languages?

        My experience is that if a backtracking parser lists all the terminals it was expecting at the first location it failed to get past (with some additional information about the rules in which they occurred), this usually gives enough information to understand what is wrong with the input or the grammar.

        • marcosdumay 8 hours ago
          I mean grammars that are hard for people to follow.
    • kmicinski 9 hours ago
      > In an introduction course there is no need to aim for maximum performance if parsing a 10k lines program takes less than a second.

      I'll do you one better. The main compiler in my course uses only six characters to parse every single project: `(read)`.

    • vintagedave 14 hours ago
      Are you referring to lookahead, as in allowing more ambiguous grammars?
      • fjfaase 11 hours ago
        Sorry, I meant to say: even if a grammar is not ambiguous, it can require unbounded look-ahead to be parsed correctly [1].

        The grammar of C is ambiguous. The statement "a * b;" can be parsed both as a declaration of the variable 'b' with type pointer-to-'a' and as an expression multiplying 'a' and 'b'. It all depends on whether 'a' is a type or not. In most cases it would be a declaration, because why multiply two variables and ignore the result? One trick to deal with this is to give precedence to the type-declaration grammar rule over the expression grammar rule. However, this is not something that can be done with many parser-generating tools.

        Yet the first C compilers were single-pass compilers with a single look-ahead lexical token, probably implemented as recursive descent parsers. This worked because the compiler kept a (reverse) list of all declarations, which allowed it to determine, when 'a' was parsed, whether it started a declaration or a statement, based on whether 'a' had previously been defined as a type.

        [1] https://stackoverflow.com/questions/12971191/grammars-that-r...
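The bookkeeping described above can be sketched like this (a toy illustration of the classic C typedef disambiguation, not the actual early-C implementation; all names are invented): the parser records every name declared as a type and consults that table when it meets an `a * b;` form:

```python
# Toy sketch of the classic C disambiguation: a table of known type
# names decides whether "a * b;" is a declaration or an expression.
known_types = {'int', 'char'}  # seeded with built-in types

def declare_typedef(name):
    # Would be called after parsing something like "typedef ... name;"
    known_types.add(name)

def classify(stmt):
    # Assumes the statement has the exact shape "a * b;"
    a, star, b = stmt.rstrip(';').split()
    assert star == '*'
    if a in known_types:
        return ('declaration', b, 'pointer to ' + a)
    return ('multiplication', a, b)
```

Because declarations are recorded as they are parsed, a single forward pass with one token of look-ahead suffices, just as the comment describes.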

      • fjfaase 14 hours ago
        No, even if a grammar is not ambiguous, it can require unbounded look-ahead to be parsed, although this is rarely the case for meaningful grammars such as those you would write for a programming language.

        What I wanted to say is that you do not need complex algorithms to implement a parser if you have a grammar that can be parsed with a single look-ahead lexical element.
  • UncleOxidant 23 hours ago
    The Essentials of Compilation book mentioned is only ~$24 on Amazon. Usually books like this are much more expensive. I ordered a copy.
    • declank 14 hours ago
      There is also a Python version, which is available for free. I do like the register allocation/graph colouring chapter.
    • almostgotcaught 23 hours ago
      looks like a fun book, but just be forewarned: real compiler engineering is nothing like what's covered there.
      • kryptiskt 17 hours ago
        I'm knee deep in clang at the moment and I'm so fed up with real compiler engineering. Give me Chez Scheme and the nanopass compiler any day. That is so much better than the big ball of mud that goes into a "real" compiler.
        • almostgotcaught 11 hours ago
          ..... Yes that's exactly my point that these cutesy books give a completely inaccurate picture of what the job is really like.
      • pjmlp 11 hours ago
        Most people just want to compile their toy language into machine code and be done with it.
      • anta40 23 hours ago
        Any recommendation for a more realistic book?

        I think hacking GCC/LLVM can be pretty challenging, but hey they are real, production-grade compilers and not just typical academic projects.

        • almostgotcaught 22 hours ago
          there are no good modern compiler books - everything that's been written down pales in comparison to what GCC/LLVM really involve. recently i found Engineering a Compiler by Cooper and Torczon when reviewing/prepping for interviews - it wasn't bad. also there's now LLVM Code Generation by Quentin Colombet but that's basically a code walk-through of LLVM (it doesn't cover any of the algos). and it was probably out of date the second it got published lol (not really but maybe). the truth is that trying to learn how to build a compiler from a single book is like trying to learn how to build a skyscraper from a single book.
          • sph 18 hours ago
            > the truth is that trying to learn how to build a compiler from a single book

            I think you conflate “learning to build a compiler for a toy language” with “being effective at working on a modern optimizing compiler suite like GCC/LLVM”

            The book is perfectly fine for the first use case, and never claims to touch upon the latter.

          • kmicinski 21 hours ago
            Respectfully, I think what you mean is that there are no books which give you the experience of hacking on LLVM for several years.
          • mamcx 11 hours ago
            A good counterpoint is that a lot of information about this is dense, cryptic, weird, confusing and hard to get.

            The major problem is not finding the sophisticated things, but understanding how to do them in simple-ish ways.

            Doing otherwise is a major waste of time!

            P.S.: And yes, even once you get the basics and learn the jargon, it is still a problem to find the neat tricks, but by then you have likely realized that there is nothing like reading the source... (sadly that source is in C or, worse, C++, but lately, with Rust gaining traction, at least it makes more sense!)

          • yu3zhou4 21 hours ago
            Is Dragon book still relevant? Do you recommend any other learning resources other than reading the source and contributing to llvm?
            • f1shy 17 hours ago
              IMHO absolutely. The basics of lexer and parser are still there. Some of the optimizations are also relevant. You just cannot expect to read the book and be able to write GCC or LLVM from scratch(1).

              For learning deeper about other advanced topics there is:

              https://www.cs.cornell.edu/courses/cs6120/2025fa/

              and

              https://mcyoung.xyz/2025/10/21/ssa-1/

              So maybe writing a compiler with exactly one FE (for a simple language) and one BE (for a simple architecture), with say 80% of the optimizations could be a doable project.

              (1) We should define what we mean by that, because there are thousands of front-ends and back-ends.

            • anta40 21 hours ago
              I heard that the new edition is updated with newer material like data-flow analysis, garbage collection, etc. Anyway, the book doesn't teach you how to build a basic working compiler, so you need to consult other materials.

              Try Andrew Appel's "Modern Compiler implementation in Java/C/ML" or Writing a C Compiler (https://norasandler.com/book) which is much more recent.

              Eventually, you'd want to hack GCC/LLVM because they are production-grade compilers.

            • kronnpp 17 hours ago
              I taught in the past and still like the trilogy of books

              > Modern Compiler Implementation by Andrew W. Appel

              It comes in three flavors C, ML (Meta Language), and Java

              https://www.cs.princeton.edu/~appel/modern/

              Writing a compiler in Standard ML is as natural as writing a grammar and denotational semantics.

              Compiler writing is becoming an extinct art.

              • kaz-inc 50 minutes ago
                I'll take the bait.

                Do you distinguish between writing a compiler and writing an optimizing compiler, and if so, how is writing an optimizing compiler an extinct art?

                Equality saturation, domination graphs, chordal register allocation, hardware-software codesign, etc there are many new avenues of research for compilers, and these are just the ones on the top of my head that are relevant to my work. Most optimization work is R&D and much of it is left unimplemented at scale, and things like the phase-ordering problem and IR validation are hard to do in practice, even given ample resources and time.

              • pjmlp 11 hours ago
                The ML version is my favourite and I can vouch for the books being quite interesting.

                For more modern folks, one can use the ML version as inspiration for doing the book exercises in OCaml, Haskell, F# or Rust.

                Writing compilers for a living, and CS research is a niche domain, not something I would consider an extinct art.

              • yu3zhou4 15 hours ago
                Thanks!

                Are you sure it’s an extinct art though? LLVM is flourishing, many interesting IRs come to life like MLIR, many ML-adjacent projects build their own compilers (PyTorch, Mojo, tinygrad), many big tech like Intel, AMD, Nvidia, Apple and others contribute to multiple different compilers, projects integrate one to another at different levels of abstraction (PyTorch -> Triton -> CUDA) - there is a lot of compilation going on from one language to another

                Not to mention many languages in a mainstream that weren’t that popular 10 years ago - think Rust, Zig, Go

                • norir 10 hours ago
                  The prominence of LLVM is a symptom of the dying of compiler writing as an art, not evidence of its vitality.
                  • almostgotcaught 8 hours ago
                    > compiler writing as an art

                    cooking is an art. software is engineering. no one would say "building skyscrapers as an art is dying".

                • pjmlp 11 hours ago
                  You should look into GraalVM as well, as it is another approach for compiler development.
              • alabhyajindal 13 hours ago
                I heard that the ML version was a translation of the C version, and is thus not easy to follow along. Or it may have been the other way around!
                • pjmlp 11 hours ago
                  The other way around, the best book is the ML version, the other two try to do ML in the respective language.

                  Ironically, now with modern Java you can do that much more easily than with the approach taken in the Java variant back in 1997.

            • bmn__ 12 hours ago
              > Is Dragon book still relevant?

              No, not at all; its teachings and techniques were surpassed four decades or so ago.

              The LALR algorithm is flawed: it only works for a subset of context-free grammars, not all of them. That alone is already a death blow. If you try out BNF grammars in the wild, it is nearly guaranteed that they are complex enough for LALR to shit itself with shift-reduce conflicts.

              The technique of generating and dumping source code is awkward, and the reasons that made it a necessity back then are no longer relevant. A good parser is simply a function call from a code library.

              The technique of tokenising and then parsing in a second pass is awkward and introduces errors, and again the reasons that made it a necessity back then are no longer relevant. A good parser works "on-line" (a term of art, not meaning "over a computer network" here), tokenising and parsing at the same time, in a single pass.

              The book precedes Unicode by a long time and you will not learn how to properly deal with text according to the rules laid out in its various relevant reports and annexes.

              The book does not take into consideration the syntactic and semantic niceties and features that regexes have gained since, which should by now definitely also be part of a grammar parser.

              > recommend any other learning resources

              Depends on what your goals are. For a broad and shallow theoretical introduction and to see what's out there, browse the slide decks of university lectures for this topic on the Web.

              • yu3zhou4 11 hours ago
                Thanks. I'd like to learn the most important things in traditional compilers that are still relevant and translate these skills into ML compilers
      • fragmede 22 hours ago
        Real compiler engineering covers a lot of ground. This book is an intro to it, not the whole everything. No need to posture about it.
  • AdityaSanthosh 1 day ago
    Hi, this seems like an interesting course. I didn't study compilers in my undergrad (I'm an electronics student), but I have been working as a programmer and have studied C and a bit of low-level languages. Is there any prerequisite compiler knowledge required for this course?
    • ktimespi 1 day ago
      The only prerequisite here is probably Racket, to follow along with the book
  • evnix 17 hours ago
    Most of these compiler projects and books would be 100x more popular and accessible if they were in Javascript
    • wiseowise 17 hours ago
      Just use LLM to translate example to whatever language you want.
      • derrida 15 hours ago
        Now now, isn't that what a compiler does?