The parsing process begins when the function that parses the starting rule is called via an ANTLR macro taking the starting rule and the input stream as arguments. Any rule in the grammar can be used as the starting rule, and any readable character stream can be used as the input stream.
paragraph: (sentence)+;

Here, "paragraph" is the name of the rule, and "(sentence)+" is the rule description which shows how a paragraph is formed from its constituent parts. The colon separates the name from the rule description, and the semicolon marks the end of the rule. Spaces, tabs, and newlines are ignored in rules, so we can write a long rule in a format that helps clarify its structure.
one_sentence: "The" "dog" "barked" "at" "the" "cat" PERIOD;

In this example, the rule name is "one_sentence". "The", "dog", "barked", "at", "the", and "cat" are all explicitly defined tokens. PERIOD is the name of a token type defined earlier in a DLG lexclass. PERIOD is most likely, but not necessarily, defined as the period character (i.e. "\."). It could be defined as some type of "smiley" ( ":\-\)" ). We do not know from the grammar rule alone.
When the parsing function for "one_sentence" is running, the items in this series of tokens are processed one at a time from left to right. First, one_sentence() will compare the current token with the expected token "The". The lexer is always one step ahead of the parsing functions, so the current token is the one that was returned by the lexer after the last token match. If the current token matches "The", then the lexer is called to return a new current token, and the parsing function moves on to "dog". Otherwise, a syntax error is issued which shows the text of the current token and the text or name of the token that was expected by the parsing function. (How the parser continues after an error depends on the error actions defined and other things. To learn more about error handling, see Parr's book.) The parsing function continues this process until it reaches the semicolon, at which time it has recognized a "one_sentence" and returns to whatever function called it.
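To make this token-by-token matching concrete, here is a minimal hand-written sketch of what a one_sentence() parsing function does. This is not the code ANTLR actually generates; the fixed token array, the cursor, and the match() helper are simplified assumptions standing in for the real lexer interface.

```c
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the lexer: a fixed token stream and a cursor.
   Real ANTLR-generated code calls the lexer for each new token instead. */
static const char *tokens[] = { "The", "dog", "barked", "at", "the", "cat", "." };
static int cursor = 0;

static const char *current(void) { return tokens[cursor]; }

/* Compare the current token with the expected text.  On a match, "call
   the lexer" (advance the cursor) and succeed; otherwise issue a syntax
   error showing both the found and the expected text, and fail. */
static int match(const char *expected) {
    if (strcmp(current(), expected) != 0) {
        fprintf(stderr, "syntax error: expected \"%s\", found \"%s\"\n",
                expected, current());
        return 0;
    }
    cursor++;
    return 1;
}

/* one_sentence: "The" "dog" "barked" "at" "the" "cat" PERIOD; */
static int one_sentence(void) {
    return match("The") && match("dog") && match("barked")
        && match("at") && match("the") && match("cat")
        && match(".");   /* PERIOD, assumed here to be "." */
}
```

The chain of && operators mirrors the left-to-right processing described above: matching stops at the first token that fails, just as the real parsing function stops at the first syntax error.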
Typically, we must allow more flexibility when describing parts of a language than we have when using only literal character strings and single-string tokens. It would not be feasible to write a rule such as "one_sentence" above for every possible sentence in the English language. Instead we define token types to represent multiple strings. For example, we may choose to represent a set of nouns as a single token type. Suppose we had defined the token types NOUN and VERB to represent some of the nouns and verbs of the English language, respectively. Then, the following rule would include not only the sentence described by one_sentence above, but also many other sentences:
sentences: "The" NOUN VERB "at" "the" NOUN PERIOD;

For example, "The car turned at the corner" matches the above rule if "car" and "corner" match the definition for the token NOUN, and "turned" matches the definition of the token VERB.
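In a real grammar the NOUN and VERB token types would be defined in the lexer specification, but the idea can be sketched with a small classifier. The word lists and the OTHER fallback below are purely illustrative assumptions:

```c
#include <string.h>

/* Hypothetical token types; a real grammar would define these in a
   DLG lexclass rather than with hard-coded word lists. */
typedef enum { NOUN, VERB, PERIOD, OTHER } TokenType;

static TokenType classify(const char *text) {
    static const char *nouns[] = { "dog", "cat", "car", "corner" };
    static const char *verbs[] = { "barked", "turned" };
    for (size_t i = 0; i < sizeof nouns / sizeof nouns[0]; i++)
        if (strcmp(text, nouns[i]) == 0) return NOUN;
    for (size_t i = 0; i < sizeof verbs / sizeof verbs[0]; i++)
        if (strcmp(text, verbs[i]) == 0) return VERB;
    if (strcmp(text, ".") == 0) return PERIOD;
    return OTHER;   /* not part of our tiny vocabulary */
}
```

With this classification, the parsing function for "sentences" compares token types rather than literal strings, which is what lets one rule cover "The dog barked at the cat." and "The car turned at the corner." alike.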
paragraph: TAB sentence sentence sentence;

Here, we've defined a "paragraph" to be made up of one TAB followed by exactly three "sentence"s. This isn't a very good definition for a paragraph, but it will serve for now. When the function which parses the non-terminal "paragraph", called "paragraph()", is called to parse the input token stream, the calling function may have already compared the current token with the token type "TAB" and found that they match. Otherwise, depending on the structure of the calling rule, paragraph() may not even have been called. paragraph() itself starts by "matching" the current token with "TAB" and calls the lexer to receive a new current token. paragraph() then calls sentence() to attempt to recognize a sentence on the input stream. If the new current token is not in the first set of the non-terminal "sentence", sentence() will issue a syntax error. When sentence() returns, paragraph() will call it again to attempt to recognize another sentence. When sentence() returns again, this process will be repeated one last time.
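The nesting of rule functions can be sketched as follows. Everything here is a simplification under assumed conventions: the tiny "sentence" grammar (one word plus a period), the shared token cursor, and the match() helper are all stand-ins for the real generated code.

```c
#include <string.h>

/* Token stream for one paragraph: a TAB followed by three one-word
   "sentences" (a deliberately tiny stand-in grammar). */
static const char *toks[] = { "\t", "Hi", ".", "Yes", ".", "No", ".", "" };
static int pos = 0;

static const char *cur(void) { return toks[pos]; }

static int match(const char *expected) {
    if (strcmp(cur(), expected) != 0) return 0;  /* syntax error in real code */
    pos++;
    return 1;
}

/* sentence: WORD PERIOD;  -- here, any non-empty word followed by "." */
static int sentence(void) {
    if (cur()[0] == '\0' || strcmp(cur(), ".") == 0) return 0;
    pos++;                       /* accept the word */
    return match(".");
}

/* paragraph: TAB sentence sentence sentence; */
static int paragraph(void) {
    return match("\t") && sentence() && sentence() && sentence();
}
```

Note how paragraph() never touches the input directly after matching TAB; it delegates to sentence() three times, exactly as the rule description specifies.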
Here are some examples to help clarify things (or confuse you more, whichever the case may be).
pet: DOG | CAT;

Here, a "pet" is either a DOG or a CAT. The vertical bar separates the two alternatives; the parser chooses between them by looking at the current token.
pet: CAT | DOG | HAMSTER | MOUSE | SNAKE ;

This version of "pet" has five alternatives.
compets : CAT (HAMSTER | SNAKE) | DOG (HAMSTER | MOUSE | SNAKE) | HAMSTER MOUSE ;

This says that we should either choose a CAT with either a HAMSTER or a SNAKE, or we should choose a DOG with a HAMSTER, a MOUSE, or a SNAKE, or we should choose a HAMSTER with a MOUSE. Any other pair won't fit this rule, and would be incompatible or out of alphabetical order. Note that once we choose a CAT, we must choose either a HAMSTER or a SNAKE: the alternatives "DOG (HAMSTER | MOUSE | SNAKE)" and "HAMSTER MOUSE" are no longer an option.
Note that there are no non-terminals in this rule. The parsing function compets() only calls the lexer and tries to match the tokens it returns with the ones listed in the rule. compets() will first compare the current token with CAT. If they match, it will choose the first alternative (i.e. "CAT (HAMSTER | SNAKE)"), call the lexer for the next token, and compare it with HAMSTER. If this matches, it returns successfully; otherwise, it compares the new token with SNAKE. If this matches, it returns successfully; otherwise it issues a syntax error and fails.
Suppose instead of matching CAT, the first "current" token matched DOG. Then compets() would have followed the alternative "DOG (HAMSTER | MOUSE | SNAKE)". If the first token had matched HAMSTER instead, compets() would have followed the third, and last, alternative.
If none of the first tokens from any of the alternatives had matched the first "current" token, then compets() would have failed. It is more likely that it would not even have been called by the calling function, because if other alternatives had been available, the caller would have checked the first sets before calling compets(), and would have avoided the call.
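The way compets() selects an alternative by its first token can be sketched as a switch on the current token. Again, the Token enum, the token-array interface, and the match() helper are assumptions for illustration, not ANTLR's generated code:

```c
/* Hypothetical token codes; a real grammar would define these via DLG. */
typedef enum { CAT, DOG, HAMSTER, MOUSE, SNAKE, END } Token;

static const Token *toks;   /* simulated input stream */
static int pos;

/* Try to match the current token; consume it only on success. */
static int match(Token t) {
    if (toks[pos] != t) return 0;
    pos++;
    return 1;
}

/* compets : CAT (HAMSTER | SNAKE)
           | DOG (HAMSTER | MOUSE | SNAKE)
           | HAMSTER MOUSE ; */
static int compets(const Token *input) {
    toks = input;
    pos = 0;
    switch (toks[pos]) {
    case CAT:                   /* first alternative */
        match(CAT);
        return match(HAMSTER) || match(SNAKE);
    case DOG:                   /* second alternative */
        match(DOG);
        return match(HAMSTER) || match(MOUSE) || match(SNAKE);
    case HAMSTER:               /* third, and last, alternative */
        match(HAMSTER);
        return match(MOUSE);
    default:
        return 0;   /* no alternative starts with this token */
    }
}
```

The switch makes the commitment explicit: once the first token selects the CAT branch, only HAMSTER or SNAKE can follow; the other alternatives are no longer considered.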
book: dedication {introduction} {forward} (chapter)+ (appendix)*;

Using this description, a book consists of a required "dedication", followed by an optional "introduction", followed by an optional "forward". The book may have neither, one, or both of the introduction and forward. If it does have both, they must occur in the order given above. Following these optional parts, the book must have one or more "chapters". After all of the chapters, there are zero or more "appendices".
Note that there are no terminals in this rule. The parsing function book() only calls other parsing functions, and performs loops while the current token is in the first set of the enclosed non-terminals. The structure for book() would look something like this:
book()
{
    dedication();
    if (nextoken == K_INTRO)   introduction();
    if (nextoken == K_FORWARD) forward();
    do {
        chapter();
    } while (nextoken == K_CHAPTER);
    while (nextoken == K_APPEND) {
        appendix();
    }
}
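The sketch above can be fleshed out into something runnable by simulating the token stream. The K_* token codes, the one-token-per-part simplification, and the end-of-input check are all assumptions added for the demonstration; they are not part of the original sketch:

```c
/* Hypothetical token codes for each part of a book, plus end-of-input. */
typedef enum { K_DEDIC, K_INTRO, K_FORWARD, K_CHAPTER, K_APPEND, K_EOF } Token;

static const Token *toks;   /* simulated input stream */
static int pos;
#define nextoken (toks[pos])

/* Each part is simplified to consume exactly one token. */
static void dedication(void)   { pos++; }
static void introduction(void) { pos++; }
static void forward(void)      { pos++; }
static void chapter(void)      { pos++; }
static void appendix(void)     { pos++; }

/* book: dedication {introduction} {forward} (chapter)+ (appendix)*; */
static int book(const Token *input) {
    toks = input;
    pos = 0;
    dedication();                              /* required */
    if (nextoken == K_INTRO)   introduction(); /* {introduction} */
    if (nextoken == K_FORWARD) forward();      /* {forward} */
    do {                                       /* (chapter)+ : one or more */
        chapter();
    } while (nextoken == K_CHAPTER);
    while (nextoken == K_APPEND) {             /* (appendix)* : zero or more */
        appendix();
    }
    return nextoken == K_EOF;
}
```

Note how the grammar operators map onto control flow: {…} becomes an if, (…)+ becomes a do-while (the body runs at least once), and (…)* becomes a plain while (the body may run zero times).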