
Every single time I try to do an online compilers class, my brain gets stuck at grammars. I understand the concept of grammars, but for whatever reason I can't go from a set of rules to actually writing out a grammar for them.

Probably will have to do an in-classroom course for it?



Have you ever looked into Pratt parsing? That was the “aha” moment for me, as everything prior to learning those was so incredibly opaque (especially the yacc/lex awkward DSLs and precedence stuff).

Good intro: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...

Project I made using a Pratt parser, effectively 0 dependencies (for the interesting bits at least): https://github.com/JacksonKearl/RollDice-Discord

Check the “parse.ts” for how simple the Pratt implementation is, “calculator-bot.ts” for how easy it is to define a simple language with infix operators and precedence and etc, and “dice-bot.ts” and “dice-expression.ts” for a complete implementation of an actually “interesting” language. The “tokenize.ts” file has the tokenizer, but it’s incredibly sketchy. Enter at your own risk. ;)

Edit: the Readme is a bit outdated. Proper order of operations is implemented
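If it helps, the core of a Pratt parser really is tiny. Here's a minimal Python sketch of the idea (my own toy, not the project's parse.ts): single digits as the only atoms, two infix operators, binding powers chosen arbitrarily.

```python
def parse(tokens, min_bp=0):
    """Pratt-parse a list of single-character tokens into a nested tuple AST."""
    binding_power = {"+": 10, "*": 20}  # higher binds tighter

    # Prefix position: the only atoms in this toy are single digits.
    tok = tokens.pop(0)
    assert tok.isdigit(), f"expected a number, got {tok!r}"
    left = int(tok)

    # Infix loop: consume operators at least as strong as min_bp.
    while tokens and tokens[0] in binding_power:
        bp = binding_power[tokens[0]]
        if bp < min_bp:
            break
        op = tokens.pop(0)
        right = parse(tokens, bp + 1)  # bp + 1 makes operators left-associative
        left = (op, left, right)
    return left

print(parse(list("1+2*3")))  # → ('+', 1, ('*', 2, 3))
print(parse(list("1*2+3")))  # → ('+', ('*', 1, 2), 3)
```

The whole trick is the min_bp threshold: a recursive call started for a high-precedence operator refuses to consume lower-precedence operators, so precedence falls out of a single comparison.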


Assuming you are talking about using a grammar-to-parser tool like yacc, you probably should at some point figure out what your hangup is there, but you shouldn't let that stop you from writing a compiler. I very rarely use such a tool when writing a compiler, and find that real-world compilers similarly rarely use one.

Regardless of whether you are writing a grammar or a recursive descent parser, start with a really simple language, say:

    Expr = "A" | "(" Expr* ")"
This is balanced parentheses, with the single-character token "A" allowed to appear anywhere inside. All of these are valid matches:

    (())
    ((A(A)A)A)
    (((((AAA)))))
The parse tree should be such that each node is a list where each element is either an A token or another parse tree node. Once you can parse that, then you can start moving on to more complicated grammars; a possible complication is to add "B" as a valid expression, with the caveat that B cannot be the sibling of a parenthesized expression. That is:

    (ABA)
is valid, but

    (B(A))
is not valid.
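For what it's worth, the original grammar (before the B complication) fits in one short recursive-descent function. Here is a Python sketch of my own that produces the nested-list parse tree described above:

```python
def parse_expr(chars, pos=0):
    """Parse Expr = "A" | "(" Expr* ")". Returns (node, next_position)."""
    if pos >= len(chars):
        raise SyntaxError("unexpected end of input")
    if chars[pos] == "A":
        return "A", pos + 1
    if chars[pos] == "(":
        pos += 1
        children = []
        # Expr*: keep parsing child expressions until the closing paren.
        while pos < len(chars) and chars[pos] != ")":
            child, pos = parse_expr(chars, pos)
            children.append(child)
        if pos >= len(chars):
            raise SyntaxError("missing closing parenthesis")
        return children, pos + 1  # skip ")"
    raise SyntaxError(f"unexpected character {chars[pos]!r}")

print(parse_expr("((A(A)A)A)"))  # → ([['A', ['A'], 'A'], 'A'], 10)
```

Note how the grammar maps almost one-to-one onto the code: each alternative in the rule becomes a branch, and the `*` becomes a loop.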

I used single characters as the primitive here, so perhaps the next step would be to add a tokenizer. "B" has special rules, so it will be the keyword "bananas". A will be any other alphabetic token.
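A tokenizer for that step could be as simple as this sketch (the rules are just the ones described above: "bananas" becomes a B token, any other alphabetic run becomes an A token, parentheses pass through, whitespace is skipped):

```python
import re

def tokenize(text):
    """Turn raw text into a list of tokens: 'A', 'B', '(' or ')'."""
    tokens = []
    for match in re.finditer(r"[A-Za-z]+|[()]|\S", text):
        word = match.group()
        if word == "bananas":
            tokens.append("B")     # the one keyword with special rules
        elif word.isalpha():
            tokens.append("A")     # any other alphabetic token
        elif word in "()":
            tokens.append(word)
        else:
            raise SyntaxError(f"unexpected character {word!r}")
    return tokens

print(tokenize("(apples bananas pears)"))  # → ['(', 'A', 'B', 'A', ')']
```

The parser then consumes this token list instead of raw characters, which is exactly the split between lexing and parsing that tools like lex/yacc formalize.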

If all of the above is already doable, then perhaps I've misunderstood where your hangup is. Let me know, perhaps I can help.


Should that grammar be:

    Expr = "A" | "(" Expr* ")" | ""
so that (()) is a valid match? Because then Expr can be an empty string also?


Typically Star (*) means zero or more while Plus (+) means one or more.


`Expr*` means zero or more, so the empty string is implicit.


I assume you're aiming to make your own compiler/interpreter.

Consider postponing the deep-dive into theory until you've tried solving the problem using what you already know. Once you know your way around a problem, digesting theory becomes much easier.

On one level, there's nothing magical about compilers and interpreters. It's the same old regular code [0] solving the same kinds of problems.

What makes it tricky is working on a meta-level in parallel, visualizing what's happening in two dimensions. Like writing Lisp macros if that makes any sense.

I think it's a shame that so many get stuck on parsing and type theory. The top priority should be to get a feedback loop up and running.

[0] https://github.com/codr7/cidk/blob/master/src/cidk/read.cpp


If it's literally going from rules to writing a grammar that is difficult to grok, perhaps this website might help for playing around (a bit like those regex websites):

http://instaparse-live.matt.is/

With the syntax from Instaparse:

https://github.com/Engelberg/instaparse/blob/master/README.m...


Programming-language syntax specifically is pretty daunting to formalize. Programming languages (other than Lisp) have huge numbers of syntactic “features”! And they usually contain binary expressions with precedences, which—if your grammar notation doesn’t have a specific DSL for these—introduces the special hell of coercing a parser-generator into implementing precedence-climbing. This makes for hella opaque grammars that you really can’t understand until you’ve had to implement precedence-climbing yourself at least once.

PL grammars also tend to have a lot of degrees of freedom in the “production” side of them; because it’s basically up to you (as the author of the compiler) what the resulting AST output by the generated parser will “need” to look like, as you also control the next stage that consumes the AST, and there’s nothing forcing that API between the generated parser and your “code DOM builder” to take any particular shape. This is bad, for https://en.m.wikipedia.org/wiki/Analysis_paralysis reasons. Things are a lot easier when you lock down an AST “shape” (i.e. a known data structure) that your generated parser should aim to output. (Once you have this, you can write a conformance test suite as a set of {text input, data-structure output} pairs!)

So, my suggestion would be to 1. try writing a grammar for something simpler than a complete language syntax, 2. where it’s obvious what the resulting AST “should” look like.

Personally, I threw myself into the deep end, and my first experience with grammars was in writing a PR for a library that parsed just function-signatures of a particular language, previously with a hand-rolled parser. I wrote a grammar for a parser-generator to replace this hand-rolled parser. This was pretty good as an exercise, as I just had to match the output of the existing parser.

But even then, it was a bit of a struggle trying to understand what this thing was I was parsing at the same time that I was trying to write the grammar for it.

I’d instead suggest, as a learning exercise, taking a text format you know well (for example, URLs or email addresses) that is specified rigidly by an RFC (including a BNF grammar!) and attempting to recreate the BNF grammar (or an equivalent for your parser-generator of choice) by reading the RFC, avoiding the BNF part until you’re ready to “check your work.”
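To give a flavor of what the finished exercise might produce, here is a deliberately over-simplified ABNF-style sketch for URLs. To be clear, this is my own toy approximation, nothing like the real RFC 3986 grammar; the point of the exercise is to derive something in this shape yourself and then compare it against the RFC's version.

```
url     = scheme "://" host [ ":" port ] [ "/" path ] [ "?" query ]
scheme  = 1*ALPHA
host    = label *( "." label )
label   = 1*( ALPHA / DIGIT / "-" )
port    = 1*DIGIT
path    = segment *( "/" segment )
segment = *( ALPHA / DIGIT / "-" / "_" / "." / "~" )
query   = *( segment / "=" / "&" )
```

Even at this toy scale you hit the interesting decisions: what repetition operator to use, where optionality lives, and which characters each piece may contain.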


And that's not even compiling yet! Consider that Kuper's 43-pass Scheme compiler starts its very first pass with the Scheme program already being a nested-list-based abstract syntax tree: no grammar to deal with. Maybe work in some Lisp so you can learn about compiling without the stumbling block of formal languages and their grammars.


We wrote a toy Oberon compiler in second-year computer science. I think it was really useful as an exercise, but it was time consuming, and this contributed to me dropping computer science due to the workload and practicals counting for more than 50% and tests for less than 50%.

For what it is worth, I think you can emulate this if you really want to (and have enough time); it may make you feel more confident in the topic. But to be clear, we did not do anything fancy. The project was done in C and wasn't done "cleverly". What I mean by that is that it (at least our compiler) was very straightforward: read in sentences based on separators; remove whitespace; identify keywords; set/identify variables; and a few steps later execute the commands (in C I think; we only did assembly after this).


Well, you can sidestep the problem at first by either writing your compiler in Prolog or Haskell, or by writing a Lisp compiler.

Or, alternatively you can dive into it and get a book or a course focused on parsers. It's not something that you can easily learn by just thinking about it, so being stuck on it is completely natural.


You can start with the AST and work toward the back end. Yes, parsing/grammars tend to be boring.



