When I started Gekko (nine years ago), I began by simply treating input (user input, or command files) as text strings, trying to figure out the meaning character by character. Such an approach quickly gets extremely messy, because you need to match, for instance, parentheses, while at the same time matching quoted strings and the like (and quoted strings may contain parentheses, etc.). I won’t even begin to talk about parsing floating-point numbers using an approach like this.

Much to my relief, I discovered a tokenizer. A tokenizer splits a string into ‘words’, that is, chunks of characters. It uses some logic to do this, and a good tokenizer has good logic. Using a tokenizer, a command file can be returned as an array of tokens, where each token is a string that also has a type. So the tokenizer may split “PRINT 1.2 * x1;” up into the following:

 PRINT   ident
 1.2     number
 *       operator
 x1      ident

An ‘ident’ is typically a chunk of letters and digits (including underscore); think of it as a variable name. A tokenizer may even return the blanks, too, if these have some use. For instance, in some languages “f(x)” may mean something different from “f (x)”. The first string may be a function call, whereas the second may be a kind of list with an ‘f’ element followed by a ‘(x)’ element. The tokenizer will also take care of quoted strings, and even commented-out parts of the input.
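To make this concrete, here is a minimal sketch in Python of what such a tokenizer might look like. It is not the tokenizer Gekko used, nor ANTLR's lexer. The token type names, the ‘//’ comment syntax, the single-quoted strings and the keep_blanks switch are all assumptions made for illustration.

```python
import re

# Hypothetical token types, tried in this order. The names, the '//' comment
# syntax and the single-quoted strings are assumptions for illustration only.
TOKEN_SPEC = [
    ("comment",  r"//[^\n]*"),
    ("string",   r"'[^']*'"),
    ("number",   r"\d+(?:\.\d+)?"),
    ("ident",    r"[A-Za-z_][A-Za-z0-9_]*"),
    ("blank",    r"\s+"),
    ("operator", r"[-+*/=]"),
    ("symbol",   r"[()\[\];,]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text, keep_blanks=False):
    """Split text into (chunk, type) pairs; comments are always dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = m.lastgroup  # name of the alternative that matched
        if kind != "comment" and (kind != "blank" or keep_blanks):
            tokens.append((m.group(), kind))
        pos = m.end()
    return tokens

for chunk, kind in tokenize("PRINT 1.2 * x1;"):
    print(f"{chunk:8}{kind}")
```

Because a quoted string is matched as one chunk, a parenthesis inside it never reaches the later stages. Calling tokenize("f (x)", keep_blanks=True) keeps the blank as its own token, so “f(x)” and “f (x)” can be told apart.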

Using a tokenizer helped a lot, but the code was still messy, since parentheses had to be matched ‘manually’, etc. Therefore, I switched to ANTLR some six years ago. I had tried out some other parsers, of the so-called LALR kind, but ANTLR uses LL(*), that is, a recursive top-down approach. I liked that idea, and ANTLR seemed popular.

Since then, I’ve learned to get along with ANTLR, but I have to say that some of the most time-consuming problems I’ve encountered have been parsing problems, with ANTLR getting stuck on something that I thought was simple enough. The problem is probably that I never got to learn ANTLR really thoroughly, from first principles. When it works, it is really good, and fast! But when it does not work, or does not parse some code as you thought it would, you feel like you want to pull out your hair. In that case, it simply feels like a black box that you do not understand.

One of the worst things has to do with blanks. Blanks appear all over the place, and usually they are just removed while parsing. But sometimes they matter. For instance, Gekko often uses free-standing brackets, like ‘[x*y]’, to denote a wild-card for looking up variables. But ‘a[x*y]’ is not a wild-card; it should be interpreted as an indexer, that is, looking for some element of ‘a’. So ‘a[x*y]’ and ‘a [x*y]’ (with a blank) should be parsed differently, but ANTLR just removes the blank if there is one. ANTLR could be set up to return the blanks, too, but then you would have to deal with blanks in all places of the ANTLR grammar. I believe there are ways around this problem, but I never figured them out, so instead I insert some cheat characters (before parsing) to tell the parser if there is a blank or not. For instance, ‘a[x*y]’ may become ‘a¨[x*y]’ to tell ANTLR that the ‘a’ and the rest are glued together. Using these cheat characters is a bit messy, but it works.
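The cheat-character idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Gekko's actual pre-pass: the function name mark_glued is made up, and I assume single-quoted strings that must be left untouched.

```python
import re

GLUE = "¨"  # the cheat character from the example above

# Either a whole quoted string (kept as-is), or a '[' that directly follows
# a letter, digit or underscore, meaning it is glued to an identifier.
PATTERN = re.compile(r"('[^']*')|(?<=[A-Za-z0-9_])\[")

def mark_glued(text):
    """Insert the cheat character before every '[' glued to an identifier."""
    return PATTERN.sub(lambda m: m.group(1) or GLUE + "[", text)

print(mark_glued("a[x*y]"))   # indexer: gets the cheat character
print(mark_glued("a [x*y]"))  # blank before '[': left unchanged
```

After this pass the grammar can treat ‘¨[’ as the start of an indexer and a plain ‘[’ as the start of a wild-card, even though the lexer still throws the blanks away.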

Another thing to realize is that ANTLR uses a tokenizer (also called a lexer) as a first step. The tokenizer has no sense of context; it just tokenizes without knowing much about anything. This can be awkward at times, and I have dreamt about so-called scannerless parsing.

All in all, using ANTLR is time-consuming when it does not work, or when it does not do what you expected (that is, interprets some code in the ‘wrong’ way). When a grammar rule has many possibilities, ANTLR may choose the ‘wrong’ possibility first, and it often gets stuck following that rule, and refuses to rewind and try some other possibilities. In principle, it should do this, but in practice I often find that it doesn’t. Maybe because I don’t understand ANTLR well enough.

On the other hand, organizing the parsing via a parser grammar (grammar rules) keeps a lot of things tidy and manageable. So I am more or less satisfied with ANTLR, but sometimes I wonder if I should not just use a standard tokenizer, and on top of that write my own parser. This would make parser debugging so much easier!
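For the curious, here is roughly what ‘a tokenizer plus my own parser’ could look like: a minimal recursive top-down (recursive descent) sketch in Python for arithmetic expressions. It is not Gekko code and the grammar is made up. The class and method names, and the nested tuples used as tree nodes, are assumptions for illustration.

```python
import re

# One token per match: a number, an identifier, or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):  # expr: term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):  # term: atom (('*' | '/') atom)*
        node = self.atom()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.atom())
        return node

    def atom(self):  # atom: '(' expr ')' | number | ident
        tok = self.take()
        if tok == "(":
            node = self.expr()
            self.take(")")  # the matching parenthesis falls out of the recursion
            return node
        if tok in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected {tok!r}")
        return ("number", float(tok)) if tok[0].isdigit() else ("ident", tok)

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree

print(parse("1.2 * (x1 + 2)"))
```

Each grammar rule is an ordinary method, so you can put a breakpoint in atom() and see exactly why a given input was read the way it was. The price is that every rule, and every tree node, has to be written by hand.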

I could say much more about ANTLR, and so far I’ve been using ANTLR 3. But the burning issue, and the reason I am writing this, is that ANTLR 4 includes a big change that seems really mind-boggling.

ANTLR 3 advocated the use of a so-called AST (abstract syntax tree). This tree is built while ANTLR digests the code, and it is really helpful when converting the input code to runnable C# code. It was even advised to parse that AST using a tree parser, so that you in effect end up with two parsers. Fortunately I did not follow that advice, because the people behind ANTLR 3 decided to completely remove support for ASTs in ANTLR 4!! It seems ASTs are not so useful for simple translation tasks, but for Gekko that omission is disastrous. And there is no sensible work-around. In principle, one can make ANTLR 4 output something akin to ASTs by handcrafting the tree nodes, but it would be a lot of work. So I’ll stick with ANTLR 3, which feels kind of bad. Normally, I like to upgrade to the most recent software packages whenever I can, but ANTLR 4 just fundamentally breaks existing work.

What I don’t understand is why there is no public outcry about this. A lot of people have invested heavily in ANTLR 3 and AST generation (often including tree parsers, too), and if you are writing a compiler, ASTs are more or less a must. So why are people not complaining about the missing ASTs in ANTLR 4? To me it seems like kind of a joke that ASTs were touted as the best thing since the invention of the wheel or sliced bread, with a lot of chapters in the ANTLR 3 books dedicated to them. But in ANTLR 4 (and its book), not a word on ASTs…

It is really a mystery. And not something that promotes loyalty to ANTLR.