START
", and stays in that mode until the mode is explicitly
changed in a lexer or parser action.
Each lexclass description contains a list of token types that the incoming character stream should be broken into. These tokens are analogous to the words, numbers, punctuation, and other symbols found in a book. For each token type, there is a regular expression (a.k.a. regex) which describes the input patterns that may match that token, and an optional action to be performed whenever the lexer recognizes a token of that type.
A lexclass description starts with the directive:
#lexclass classname
where classname is the name given to the mode described by this
lexclass.
The #lexclass
directive is followed by a list of token
descriptions. These define the tokens that the lexer may recognize when
operating in the mode defined by the lexclass. Each token description contains
a regular expression which describes the character input patterns that match
the token type, an optional name for the token, and an optional action to be
taken whenever an input pattern is recognized as this type of token.
The token description begins with the #token
directive. This may be followed by a token type name which can be used
in an ANTLR grammar to refer to character strings which match the token.
This is followed by a regular expression which describes the character strings
that match this token. Finally, a lexical action may follow, enclosed
in a double-</double-> pair (<<>>). This action will be
executed by the lexer whenever it recognizes an input pattern as this
token type.
For example, the following token matches the keyword "break" in a C source file:
#token K_BREAK "break"
This token has been named "K_BREAK
". The token name must begin
with an upper case letter, but is otherwise restricted only to alphanumeric
characters including the underscore. I usually use all capitals just to keep
tokens obvious when referred to in the grammar rules. The token name does not
have to be the same as the pattern matched, or even representative of the
pattern: we could just as easily have named this token "Lucy" if we wanted to.
Of course, choosing a representative name helps to make it clear which patterns
this token represents.
After the name, we see the regular expression "break
" enclosed in
quotes. This regex simply describes the character string "break". Whenever
this string appears on the input stream, the lexer built by DLG will know that
it matches the token type K_BREAK
. We will see later that it may
choose to recognize this string as a K_BREAK
, or it may choose to
recognize it as some other token type, or as part of a larger token.
Before we move on, I want to talk about concatenation. This is when two smaller patterns, which I'll refer to as subpatterns, are joined end-to-end, to form a single, longer pattern. For example, the concatenation of the strings "abc" and "xyz" (in this order) is the string "abcxyz". Note that this is an ordered operation, and requires some indication of the order to use. In the above regex for the keyword "break", the characters 'b', 'r', 'e', 'a', and 'k' where concatenated together in the order listed (from left to right) in the regex to form the string "break". Thus, the pattern "break" is actually made up of the subpatterns containing the individual letters used to spell it. Concatenation works for more complex patterns also, as you will see.
Simple patterns can be written out, just as we did for the keyword "break" above, but more complicated patterns require some special notation to describe.
For example, a slightly more complicated token description is needed to describe the input strings which represent decimal integers in the C programming language. This token must match any character string which starts with one of the characters in the range '1'-'9' and continues with zero or more characters in the range '0'-'9'. The string "2487054" must be matched by this token type, and so must the string "9872485".
Here is a typical DLG description of such a token:
#token INT "[1-9][0-9]*"
This token type is named "INT", and matches any string as described in the
preceding paragraph. The regex for this rule has two features that are new:
two pair of brackets, and an asterisk or "star". These symbols have special
meaning when used in a DLG regex, and modify how DLG interprets the characters
in the regex. Spaces and tabs are ignored within a DLG regex unless escaped
(we'll talk about that later), so they may be inserted into the regex to
increase clarity for the programmer.
Brackets, [], are used to enclose a set of characters which may match a single character in the input stream. Within a pair of brackets, a dash - can be used to describe all the characters whose ASCII values fall in the range between the ASCII values of the two characters it joins. For example, "[abcdez]" describes any single character in the set {a, b, c, d, e, z}. Because this pattern includes all the characters between, and including, 'a' and 'e', the pattern "[a-ez]" also describes any single character from this same set. As another example, the pattern "abc[dD]ef" matches either the string "abcdef" or the string "abcDef".
A vertical bar, |, is used to indicate a choice between two subpatterns. For example, the pattern "a|b" matches either the subpattern consisting of the character 'a' or the one consisting of the character 'b'.
The bar has the lowest precedence of the operators which can occur in a regex. Thus, the pattern "ab|c" matches either the string "ab" or the string "c", and does not match the string "ac". It is interpreted as "ab or c", not "a, followed by b or c". Here are some more examples:
ab c dIn words, this pattern matches either the string "ab", or either one of the characters in the set {c, d}.
while WHILEThus, it matches the word "while" written either in all lower case letters or in all upper case letters. It does not match mixed case spellings of "while".
On the other hand, the pattern "[wW][hH][iI][lL][eE]" represents the word "while" written using any combination of upper and lower case letters. Note that this pattern is the concatenation of the subpatterns which match either case of each individual character in the word "while".
Parenthesis, (), are used to group subpatterns together to force a different grouping than would normally occur. We saw that the pattern "ab|[cd]" represents one of the strings "ab", "c", or "d". To represent the strings "ab", "ac", and "ad" instead, we might regroup this pattern as:
"a(b|[cd])".
This new pattern matches an "a" followed immediately by either a "b" or one of
the characters "c" or "d". This could also be written as either
"a[b-d]", "a(b|c|d)",
or"ab|ac|ad",
Braces, {}, are used to indicate that the enclosed subpattern should be matched exactly one time or not at all. They also have the effect of grouping the enclosed characters into a single subpattern just as parentheses do. An example is:
"a{b|[cd]}".
Here, the pattern matches either an "a" by itself, or an "a" followed
immediately by either a "b" or one of characters "c" or "d". Thus, the
following strings (and only the following strings) match this pattern:
a
ab
ac
ad
The tilde, ~, is used to logically negate a set. When it precedes a pair of bracket-enclosed characters, it indicates the set of characters that are not in the enclosed set. This is best described by example, so here are a couple:
Subpatterns can be modified by the special characters asterisk, *, and the plus sign, +, which cause these subpatterns to be matched zero or more times, or one or more times, respectively. Here are some examples:
ab abc abcc . . .In other words, this pattern matches the string "ab" followed by zero or more 'c's.
abc abcc abccc . . .In words, this pattern matches the string "ab" followed by one or more 'c's. It does not match the string "ab", which the previous pattern does, because it requires there to be at least one 'c' at the end of the string.
ab abc abd abcc abcd abdc abdd . . .That is, this pattern matches the string "ab" followed by zero or more characters, each of which is from the set {c, d}.
abc abd abcc abcd abdc abdd . . .In other words, this pattern matches the string "ab" followed by one or more characters, each of which is from the set {c, d}. Again, this pattern does not match the string "ab", which the previous pattern does, because there must be at least one 'c' at the end of the string for this pattern to match it.
ab abcd abef abcdef abefcd abcdcdef abcdefef abcdefcd
Let's return for a moment to our earlier example. Here is the token description presented earlier for matching C integers:
#token INT "[1-9][0-9]*"
We see now that the part which says "[1-9]" is a subpattern description of the
set of characters {1, 2, 3, 4, 5, 6, 7, 8, 9}, and that it matches any one
character in this set. The subpattern "[0-9]*" indicates that zero or more
characters from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} should be matched.
The concatenation of these two subpatterns represents the concatenation of the
patterns that each individually recognize into a single, larger pattern. Thus,
the entire pattern "[1-9][0-9]*" matches any one character in the set
{1, 2, 3, 4, 5, 6, 7, 8, 9}, followed immediately by zero or more characters in
the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
There is one more special character I want to bring attention to. That character is the commercial-at character (which probably has some other name that I don't know): '@'. This character represents the end of file on the input stream. Thus "abcd@" represents the string "abcd" followed immediately by the end of the input stream.
DLG allows certain escape sequences to represent control and whitespace characters in regexs. The following escape sequences are understood:
Some characters are not printable and do not have a related escape sequence. These characters can be inserted into a regex using their ASCII code values. The following patterns are available for this purpose:
We have seen that certain characters have special meaning to DLG when used in a regex. In order to include these characters in the language we are trying to describe, there needs to be a way to indicate when they are being used in a regex without their special meanings. This is done by escaping them with a preceding backslash "\", just as in an escape sequence. For example, to represent a plus sign, we use the pattern "\+".
Here is a list of the special characters which must be escaped to be used without their special meanings:
| ( ) { } [ ] - ~ * + @ The "space" character The "tab" characterI'll give you a couple of hints here:
parser.dlg, line 94: syntax error at ">>" missing {...}
you have probably forgotten to escape a special character.
"[\ \t]"
matches either a space or a tab character, while
"[ \t]"
matches only a tab character.
If we place the token description for identifiers first in the lexclass description, then the string "case" will be recognized by the DLG-generated lexer as an identifier, not as a keyword. The proper ordering would be:
#token KEY_CASE "case"
#token ID "[a-zA-Z_][a-zA-Z_0-9]*"
Now, when the string "case" is read on the input stream, it will be recognized
as a keyword rather than as just another identifier. The string "caseone"
will be recognized as the single ID "caseone", rather than the keyword "case"
followed by the ID "one", because "caseone" is a longer string than "case".