ANTLR and Grammar Descriptions

ANTLR reads a grammar description and builds a set of parsing functions for a top-down parser of the language described. The language must be LL(k) for ANTLR to generate a proper parser. These functions parse the input token stream in a "recursive-descent" manner, with each function analyzing one part of the language. They attempt to match expected tokens with tokens returned from the lexer, and call other parsing functions to handle subparts. All of these functions call the lexer to retrieve a new token whenever an expected token is matched.

The parsing process is started when the function which parses the starting rule is called via an ANTLR macro which takes the starting rule and the input stream as arguments. Any rule in the grammar can be used as the starting rule, and any readable character stream can be used as the input stream.
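In the C-based PCCTS releases, for instance, a minimal driver might look like the following sketch. The rule name "paragraph" is illustrative, and this fragment will only compile together with the parser that ANTLR generates:

```c
/* A minimal main() for a PCCTS-generated C parser (a sketch). */
#include "stdpccts.h"   /* generated header pulling in the parser */

int main(void)
{
    ANTLR(paragraph(), stdin);   /* start parsing at rule "paragraph",
                                    reading tokens from standard input */
    return 0;
}
```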

Grammar Descriptions

A grammar description consists of a set of rules which describe the non-terminal parts of a language. Each rule consists of a rule name, a colon (:), and a rule description, terminated by a semicolon (;). For example, the rule describing a paragraph in written English might look like this:
	paragraph: (sentence)+;
Here, "paragraph" is the name of the rule, and "(sentence)+" is the rule description which shows how a paragraph is formed from its constituent parts. The colon separates the name and the rule description, and the semicolon marks the end of the rule. Spaces, tabs, and newlines are ignored in rules, so we can write a long rule in a format that helps clarify its structure.
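For example, the same rule could equally well be spread across several lines:

```antlr
paragraph
    : (sentence)+
    ;
```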

Rule Names

A rule name can be any C identifier which begins with a lower-case letter. When ANTLR generates a parser from this grammar description, each rule name will become the name of a function which will be called to parse the part of the language described by the rule. Each rule represents a non-terminal in the language (i.e. a part of the language made up of other parts). Thus, the rule names are the names of the non-terminals in the language.

Rule Descriptions

The rule description shows the possible ways in which a non-terminal can be formed from other parts of the language. It may contain terminal (a.k.a. token) names, non-terminal (rule) names, and meta-symbols which describe repetitions or alternatives in the structure of the non-terminal being described.

Terminals (Tokens)

The simplest kind of language part that the parser deals with is a terminal (a.k.a. token). Parsing functions call the lexer to receive a new token whenever the last one has been matched. Within a grammar rule, a token can be referred to by a character string or by a token type name. Here is an example:
	one_sentence: "The" "dog" "barked" "at" "the" "cat" PERIOD;
In this example, the rule name is "one_sentence". "The", "dog", "barked", "at", "the", and "cat" are all explicitly defined tokens. PERIOD is the name of a token type defined earlier in a DLG lexclass. PERIOD is most likely, but not necessarily, defined as the period character (i.e. "\."). It could be defined as some type of "smiley" ( ":\-\)" ). We do not know from the grammar rule alone.
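For example, PERIOD might have been defined with a #token directive like the following (a guess at its definition; the actual grammar could define it differently):

```antlr
#token PERIOD "\."       /* PERIOD is a literal period character */
```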

When the parsing function for "one_sentence" is running, the items in this series of tokens are processed one at a time from left to right. First, one_sentence() will compare the current token with the expected token "The". The lexer is always one step ahead of the parsing functions, so the current token is the one that was returned by the lexer after the last token match. If the current token matches "The", then the lexer is called to return a new current token, and the parsing function moves on to "dog". Otherwise, a syntax error is issued which shows the text of the current token and the text or name of the token that was expected by the parsing function. (How the parser continues after an error depends on the error actions defined and other things. To learn more about error handling, see Parr's book.) The parsing function continues this process until it reaches the semicolon, at which time it has recognized a "one_sentence", and returns to whatever function called it.

Typically, we must allow more flexibility when describing parts of a language than literal character strings and single-string tokens provide. It would not be feasible to write a rule such as "one_sentence" above for every possible sentence in the English language. Instead, we define token types that each represent many strings. For example, we may choose to represent a set of nouns as a single token type. Suppose we had defined the token types NOUN and VERB to represent some of the nouns and verbs of the English language, respectively. Then the following rule would match not only the sentence described by one_sentence above, but many other sentences as well:

	sentences: "The" NOUN VERB "at" "the" NOUN PERIOD;
For example, "The car turned at the corner" matches the above rule if "car" and "corner" match the definition for the token NOUN, and "turned" matches the definition of the token VERB.

Non-terminals (Rule Names)

Non-terminals are used to avoid rewriting a particular description in multiple rules, or to separate a large rule into smaller pieces. They are used by placing their rule name in a rule description. Here's an example:
	paragraph: TAB sentence sentence sentence;
Here, we've defined a "paragraph" to be made up of one TAB followed by exactly three "sentence"s. This isn't a very good definition for a paragraph, but it will serve for now. When the function which parses the non-terminal "paragraph", called "paragraph()", is called to parse the input token stream, the calling function may have already compared the current token with the token type "TAB" and found that they match. Otherwise, depending on the structure of the calling rule, paragraph() may not even have been called.

paragraph() itself starts by "matching" the current token with "TAB" and calls the lexer to receive a new current token. paragraph() then calls sentence() to attempt to recognize a sentence on the input stream. If the new current token is not in the FIRST set of the non-terminal "sentence", sentence() will issue a syntax error. When sentence() returns, paragraph() calls it again to attempt to recognize another sentence. When sentence() returns again, the process is repeated one last time.

Meta-symbols

Meta-symbols are used to indicate different forms for a part of the language or multiple occurrences of a subpart. The following meta-symbols are available for use in a rule description:
|
A vertical bar (a.k.a. pipe) separates alternatives; it marks the start of another possible form for the rule or subrule.
( subrule )
Parentheses cast the enclosed part of a description into a subrule. They have the effect of limiting the scope of alternatives.
( subrule )*
An asterisk (a.k.a. star) following a parentheses pair indicates that the enclosed subrule should be recognized zero or more times. This is effectively a "while" loop wrapped around the subrule.
( subrule )+
A plus sign following a parentheses pair indicates that the enclosed subrule should be recognized one or more times. This is effectively a "do-while" loop wrapped around the subrule.
{ subrule }
Braces cast the enclosed part of a description into a subrule, and indicate that it should be recognized once or not at all (i.e. the subrule is optional). They have the effect of limiting the scope of alternatives, and are effectively an "if" statement wrapped around the subrule.

Here are some examples to help clarify things (or confuse you more, whichever the case may be).

This assumes that an "introduction" starts with the token "K_INTRO", "forward" starts with the token "K_FORWARD", "chapter" starts with the token "K_CHAPTER", and "appendix" starts with the token "K_APPEND".
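Given those assumptions, a rule describing a book might combine several of the meta-symbols above (a sketch; the non-terminal names are illustrative):

```antlr
book
    : introduction         /* exactly one introduction  */
      { forward }          /* an optional forward       */
      (chapter)+           /* one or more chapters      */
      (appendix)*          /* zero or more appendices   */
    ;
```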