ANTLR and Grammar Descriptions

ANTLR reads a grammar description and builds a set of parsing functions for a top-down parser of the language described. The language must be LL(k) for ANTLR to generate a proper parser. These functions parse the input token stream in a "recursive-descent" manner, with each function analyzing one part of the language. They attempt to match expected tokens with tokens returned from the lexer, and call other parsing functions to handle subparts. All of these functions call the lexer to retrieve a new token whenever an expected token is matched.

The parsing process is started when the function which parses the starting rule is called via an ANTLR macro which takes the starting rule and the input stream as arguments. Any rule in the grammar can be used as the starting rule, and any readable character stream can be used as the input stream.
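In the C-based PCCTS releases, for instance, a minimal driver might look like the following sketch. The rule name "paragraph" is illustrative, and this fragment will only compile together with the parser that ANTLR generates:

```c
/* A minimal main() for a PCCTS-generated C parser (a sketch). */
#include "stdpccts.h"   /* generated header pulling in the parser */

int main(void)
{
    ANTLR(paragraph(), stdin);   /* start parsing at rule "paragraph",
                                    reading tokens from standard input */
    return 0;
}
```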

Grammar Descriptions

A grammar description consists of a set of rules which describe the non-terminal parts of a language. Each rule consists of a rule name, a colon (:), and a rule description, terminated by a semicolon (;). For example, the rule describing a paragraph in written English might look like this:
	paragraph: (sentence)+;
Here, "paragraph" is the name of the rule, and "(sentence)+" is the rule description which shows how a paragraph is formed from its constituent parts. The colon separates the name and the rule description, and the semicolon marks the end of the rule. Spaces, tabs, and newlines are ignored in rules, so we can write a long rule in a format that helps clarify its structure.
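For example, the same rule could equally well be spread across several lines:

```antlr
paragraph
    : (sentence)+
    ;
```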

Rule Names

A rule name can be any C identifier which begins with a lower-case letter. When ANTLR generates a parser from this grammar description, each rule name will become the name of a function which will be called to parse the part of the language described by the rule. Each rule represents a non-terminal in the language (i.e. a part of the language made up of other parts). Thus, the rule names are the names of the non-terminals in the language.

Rule Descriptions

The rule description shows the possible ways in which a non-terminal can be formed from other parts of the language. It may contain terminal (a.k.a. token) names, non-terminal (rule) names, and meta-symbols which describe repetitions or alternatives in the structure of the non-terminal being described.

Terminals (Tokens)

The simplest kind of language part that the parser deals with is a terminal (a.k.a. token). Parsing functions call the lexer to receive a new token whenever the last one has been matched. Within a grammar rule, a token can be referred to by a character string or by a token type name. Here is an example:
	one_sentence: "The" "dog" "barked" "at" "the" "cat" PERIOD;
In this example, the rule name is "one_sentence". "The", "dog", "barked", "at", "the", and "cat" are all explicitly defined tokens. PERIOD is the name of a token type defined earlier in a DLG lexclass. PERIOD is most likely, but not necessarily, defined as the period character (i.e. "\."). It could be defined as some type of "smiley" ( ":\-\)" ). We do not know from the grammar rule alone.
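For example, PERIOD might have been defined with a #token directive like the following (a guess at its definition; the actual grammar could define it differently):

```antlr
#token PERIOD "\."       /* PERIOD is a literal period character */
```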

When the parsing function for "one_sentence" is running, the items in this series of tokens are processed one at a time from left to right. First, one_sentence() will compare the current token with the expected token "The". The lexer is always one step ahead of the parsing functions, so the current token is the one that was returned by the lexer after the last token match. If the current token matches "The", then the lexer is called to return a new current token, and the parsing function moves on to "dog". Otherwise, a syntax error is issued which shows the text of the current token and the text or name of the token that was expected by the parsing function. (How the parser continues after an error depends on the error actions defined and other things. To learn more about error handling, see Parr's book.) The parsing function continues this process until it reaches the semicolon, at which time it has recognized a "one_sentence", and returns to whatever function called it.

Typically, we must allow more flexibility when describing parts of a language than literal character strings and single-string tokens provide. It would not be feasible to write a rule such as "one_sentence" above for every possible sentence in the English language. Instead, we define token types that each represent many strings. For example, we may choose to represent a set of nouns as a single token type. Suppose we had defined the token types NOUN and VERB to represent some of the nouns and verbs of the English language, respectively. Then the following rule would match not only the sentence described by one_sentence above, but many other sentences as well:

	sentences: "The" NOUN VERB "at" "the" NOUN PERIOD;
For example, "The car turned at the corner" matches the above rule if "car" and "corner" match the definition for the token NOUN, and "turned" matches the definition of the token VERB.

Non-terminals (Rule Names)

Non-terminals are used to avoid rewriting a particular description in multiple rules, or to separate a large rule into smaller pieces. They are used by placing their rule name in a rule description. Here's an example:
	paragraph: TAB sentence sentence sentence;
Here, we've defined a "paragraph" to be made up of one TAB followed by exactly three "sentence"s. This isn't a very good definition for a paragraph, but it will serve for now. When the function which parses the non-terminal "paragraph", called "paragraph()", is called to parse the input token stream, the calling function may have already compared the current token with the token type "TAB" and found that they match. Otherwise, depending on the structure of the calling rule, paragraph() may not even have been called.

paragraph() itself starts by "matching" the current token with "TAB" and calls the lexer to receive a new current token. paragraph() then calls sentence() to attempt to recognize a sentence on the input stream. If the new current token is not in the FIRST set of the non-terminal "sentence", sentence() will issue a syntax error. When sentence() returns, paragraph() calls it again to attempt to recognize another sentence. When sentence() returns again, the process is repeated one last time.

Meta-symbols

Meta-symbols are used to indicate different forms for a part of the language or multiple occurrences of a subpart. The following meta-symbols are available for use in a rule description:
|
A vertical bar (a.k.a. pipe) separates alternatives; it marks the start of another possible form for the rule or subrule.
( subrule )
Parentheses cast the enclosed part of a description into a subrule. They have the effect of limiting the scope of alternatives.
( subrule )*
An asterisk (a.k.a. star) following a parentheses pair indicates that the enclosed subrule should be recognized zero or more times. This is effectively a "while" loop wrapped around the subrule.
( subrule )+
A plus sign following a parentheses pair indicates that the enclosed subrule should be recognized one or more times. This is effectively a "do-while" loop wrapped around the subrule.
{ subrule }
Braces cast the enclosed part of a description into a subrule, and indicate that it should be recognized once or not at all (i.e. the subrule is optional). They have the effect of limiting the scope of alternatives, and are effectively an "if" statement wrapped around the subrule.

Here are some examples to help clarify things (or confuse you more, whichever the case may be).

This assumes that an "introduction" starts with the token "K_INTRO", "forward" starts with the token "K_FORWARD", "chapter" starts with the token "K_CHAPTER", and "appendix" starts with the token "K_APPEND".
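Given those assumptions, a rule describing a book might combine several of the meta-symbols above (a sketch; the non-terminal names are illustrative):

```antlr
book
    : introduction         /* exactly one introduction  */
      { forward }          /* an optional forward       */
      (chapter)+           /* one or more chapters      */
      (appendix)*          /* zero or more appendices   */
    ;
```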