Tuesday, May 13, 2014

A simple LL1 expression parser written in C#

If you just want the code and want to skip the details go to the bottom of the page.

The Problem

Recently I had this need to translate some simple SQL Server formulas to another language. The last time I had built a parser was more than 20 years ago in the pre-internet era. From what I can recall, that time I used C++ and I probably relied heavily on macro expansion to simplify the syntaxis. This time I needed to use C# since I was going to integrate the Parser as a function in SQL Server and also because C# is my favorite language to develop nowadays, but it doesn't have macro expansion which was going to make writing the grammar definition a little tortuous.

The LL1 grammar I needed to implement looks a little bit like this:

So I needed to write this grammar in C#. The logic can be written using basic functions you find in every language like ifs and the operators && and ||. But I needed to write it in a way that still resembled the grammar above for easier debugging and further enhancement. Therein laid the problem, how to write elegant C# code that behaves and looks like a grammar and do it in just a few hours.

There are tools out there that can take a grammar like that and generate the code for you. I gave a couple a quick try but realized the learning curve would take me longer than I could afford for this project so I looked for a simpler alternative using just C#.

The Solution

My solution is based on delegates. Delegates are values that represent functions which then you can call. The neat thing is that they can be stored in variables and passed as parameters. This is the declaration for my delegates:

delegate bool rule();
delegate bool ruleL(int L);

They both return a boolean and one takes an integer. So I declare a variable for each one of my rules:

ruleL expr, expr2;
rule wexpr, mexpr, whexp, whexp2, id2, list, list2;

The second element of the solution is using Lambda Expressions which is a very compact way of defining functions with a truly minimal syntax. There is no type declaration for the function or the parameters. no brakets, not even a return statement. The compiler inferrs all that from context. The syntax goes like this:

expr2   = (L) => L < 4;

Now expr2 is a function that returns true when the parameter is less than 4. With Lambda Expressions I can define the rule expr2 something like this (compare with grammar above):

expr2  = (L) => ( L < 4 && Token == "+"  && expr(4) && expr2(L)    )
       || ( L < 4 && Token == "-"   && expr(4) && expr2(L)    )
       || ( L < 5 && Token == "*" ...                                  
                                                                        ...        );
Looks promising, this syntax is functional in C# and is close enough to the grammar definition. But, it is missing the part that detects and handles errors in the input. For that I created the P function that wraps around each rule line. P stands for parser but I use the short name to keep the syntax as uncluttered as possible. Here is how it is used:

expr2  = (L) => P(() =>  L < 4 && Token == "+"  && expr(4) && expr2(L)    )
       || P(() =>  L < 4 && Token == "-"   && expr(4) && expr2(L)    )
       || P(() =>  L < 5 && Token == "*" ...                                     
                                                                        ...        );

A little bit more cluttered but not too bad. 
Here is how my grammar looks in C# now:

Notice how the P function is passed a lambda expression as a parameter. 
Here is my P function:

It is pretty simple really. The variable Position holds the current position in the input string. If the rule R fails and the position changed it means some Tokens were consumed which means the parser failed and should stop.

The Lexer

A simple parser deserves a simple lexer. The lexer is the part of the code in charge of reading the string and extracting the tokens. Writing one from scratch is tedious and laborious. The code needs to skip spaces and comments, recognize identifiers, numbers, strings and symbols. Fortunately .NET has a powerful Regex class that makes it much easier.

In the C# grammar above the lexer are the calls to the functions C() for matching symbols and W() for matching whole words. But the core of the lexer is the R function for matching any Regex, here is the code:

Again very simple If the pattern matches then the token is saved in the variable Token and the Position is advanced.  Functions C and W just call R.

The only part missing is the skipping of spaces and comments, again using Regex:


That's it, that's all. All put together in one class called Parser:

Notice the definitions for ID, NUMBER, EOS & NIL.

To test it create a Form with 2 textboxes and a button and call it like this:

Next post how to add semantics...

No comments: