PARSER AND LEXER GENERATORS – Turning the Source File into an Abstract Syntax Tree
By Reginald Bellamy / May 24, 2021 / No Comments / Exams of IT, Handling the scope of names, IT Certifications, Technical requirements
Manually constructing a parser and a lexer can be a tedious task, especially if you try to invent a new programming language and change the grammar very often. Luckily, some tools automate this task.
The classic Linux tools are flex (https://github.com/westes/flex) and bison (https://www.gnu.org/software/bison/). flex generates a lexer from a set of regular expressions, while bison generates an LALR(1) parser from a grammar description. Both tools generate C/C+ source code and can be used together.
Another popular tool is AntLR (https://www.antlr.org/). AntLR can generate a lexer, a parser, and an AST from a grammar description. The generated parser belongs to the LL(*) class, which means it is a top-down parser that uses a variable number of lookaheads to solve conflicts. The tool is written in Java but can generate source code for many popular languages, including C/C++.
All these tools require some library support. If you are looking for a tool that generates a self-contained lexer and parser, then Coco/R (https://ssw.jku.at/Research/Projects/Coco/) may be the tool for you. Coco/R generates a lexer and a recursive-descent parser from an LL(1) grammar description, similar to the one used in this book. The generated files are based on a template file that you can change if needed. The tool is written in C but ports to C++, Java, and other languages.
There are many other tools available, and they vary a lot in terms of the features and output languages they support. Of course, when choosing a tool, there are also trade-offs to consider. An LALR(1) parser generator such as bison can consume a wide range of grammars, and free grammars you can find on the internet are often LALR(1) grammars.
As a downside, these generators generate a state machine that needs to be interpreted at runtime, which can be slower than a recursive descent parser. Error handling is also more complicated. bison has basic support for handling syntax errors, but the correct use requires a deep understanding of how the parser works. Compared to this, AntLR consumes a slightly smaller grammar class but automatically generates error handling, and can also generate an AST. So, rewriting grammar so that it can be used with AntLR may speed up development later.
Performing semantic analysis
The parser we constructed in the previous section only checks the syntax of the input. The next step is to add the ability to perform semantic analysis. In the calc example in the previous chapter, the parser constructed an AST. In a separate phase, the semantic analyzer worked on this tree. This approach can always be used. In this section, we will use a slightly different approach and intertwine the parser and the semantic analyzer more.
What does the semantic analyzer need to do? Let’s take a look:
- For each declaration, the names of variables, objects, and more must be checked to ensure they have not been declared elsewhere.
- For each occurrence of a name in an expression or statement, it must be checked that the name is declared and that the desired use fits the declaration.
- For each expression, the resulting type must be computed. It is also necessary to compute if the expression is constant and if so, which value it has.
- For assignment and parameter passing, we must check that the types are compatible. Further, we must check that the conditions in IF and WHILE statements are of the BOOLEAN type.
That’s already a lot to check for such a small subset of a programming language!