DLG and lexclass descriptions:

DLG reads one or more lexclass descriptions and creates a DFA-based lexical analyzer (a.k.a. scanner or lexer) function. Each time this lexer function is called, it will read and store characters from an incoming stream until it recognizes a token. It then returns the token type to the calling function. The calling function may be handwritten or ANTLR-generated.

Lexclasses, modes, and tokens

The created lexer may have multiple lexical modes, each of which is described by a lexclass. Conventionally, these lexclasses are named using uppercase letters only. The lexer starts in the mode described by the lexclass "START", and stays in that mode until the mode is explicitly changed in a lexer or parser action.

Each lexclass description contains a list of token types that the incoming character stream should be broken into. These tokens are analogous to the words, numbers, punctuation, and other symbols found in a book. For each token type, there is a regular expression (a.k.a. regex) which describes the input patterns that may match that token, and an optional action to be performed whenever the lexer recognizes a token of that type.

A lexclass description starts with the directive:


	#lexclass classname

where classname is the name given to the mode described by this lexclass.

The #lexclass directive is followed by a list of token descriptions. These define the tokens that the lexer may recognize when operating in the mode defined by the lexclass. Each token description contains a regular expression which describes the character input patterns that match the token type, an optional name for the token, and an optional action to be taken whenever an input pattern is recognized as this type of token.

The token description begins with the #token directive. This may be followed by a token type name which can be used in an ANTLR grammar to refer to character strings which match the token. This is followed by a regular expression which describes the character strings that match this token. Finally, a lexical action may follow, enclosed in a double-</double-> pair (<<>>). This action will be executed by the lexer whenever it recognizes an input pattern as this token type.

For example, the following token matches the keyword "break" in a C source file:


	#token  K_BREAK        "break"

This token has been named "K_BREAK". The token name must begin with an upper case letter, but is otherwise restricted only to alphanumeric characters including the underscore. I usually use all capitals just to keep tokens obvious when referred to in the grammar rules. The token name does not have to be the same as the pattern matched, or even representative of the pattern: we could just as easily have named this token "Lucy" if we wanted to. Of course, choosing a representative name helps to make it clear which patterns this token represents.

After the name, we see the regular expression "break" enclosed in quotes. This regex simply describes the character string "break". Whenever this string appears on the input stream, the lexer built by DLG will know that it matches the token type K_BREAK. We will see later that it may choose to recognize this string as a K_BREAK, or it may choose to recognize it as some other token type, or as part of a larger token.

Before we move on, I want to talk about concatenation. This is when two smaller patterns, which I'll refer to as subpatterns, are joined end-to-end, to form a single, longer pattern. For example, the concatenation of the strings "abc" and "xyz" (in this order) is the string "abcxyz". Note that this is an ordered operation, and requires some indication of the order to use. In the above regex for the keyword "break", the characters 'b', 'r', 'e', 'a', and 'k' where concatenated together in the order listed (from left to right) in the regex to form the string "break". Thus, the pattern "break" is actually made up of the subpatterns containing the individual letters used to spell it. Concatenation works for more complex patterns also, as you will see.

DLG Regular Expressions (regexs)

Simple patterns can be written out, just as we did for the keyword "break" above, but more complicated patterns require some special notation to describe.

For example, a slightly more complicated token description is needed to describe the input strings which represent decimal integers in the C programming language. This token must match any character string which starts with one of the characters in the range '1'-'9' and continues with zero or more characters in the range '0'-'9'. The string "2487054" must be matched by this token type, and so must the string "9872485".

Here is a typical DLG description of such a token:


	#token  INT        "[1-9][0-9]*"

This token type is named "INT", and matches any string as described in the preceding paragraph. The regex for this rule has two features that are new: two pair of brackets, and an asterisk or "star". These symbols have special meaning when used in a DLG regex, and modify how DLG interprets the characters in the regex. Spaces and tabs are ignored within a DLG regex unless escaped (we'll talk about that later), so they may be inserted into the regex to increase clarity for the programmer.

Brackets, [], are used to enclose a set of characters which may match a single character in the input stream. Within a pair of brackets, a dash - can be used to describe all the characters whose ASCII values fall in the range between the ASCII values of the two characters it joins. For example, "[abcdez]" describes any single character in the set {a, b, c, d, e, z}. Because this pattern includes all the characters between, and including, 'a' and 'e', the pattern "[a-ez]" also describes any single character from this same set. As another example, the pattern "abc[dD]ef" matches either the string "abcdef" or the string "abcDef".

A vertical bar, |, is used to indicate a choice between two subpatterns. For example, the pattern "a|b" matches either the subpattern consisting of the character 'a' or the one consisting of the character 'b'.

The bar has the lowest precedence of the operators which can occur in a regex. Thus, the pattern "ab|c" matches either the string "ab" or the string "c", and does not match the string "ac". It is interpreted as "ab or c", not "a, followed by b or c". Here are some more examples:

Parenthesis, (), are used to group subpatterns together to force a different grouping than would normally occur. We saw that the pattern "ab|[cd]" represents one of the strings "ab", "c", or "d". To represent the strings "ab", "ac", and "ad" instead, we might regroup this pattern as:


	"a(b|[cd])".

This new pattern matches an "a" followed immediately by either a "b" or one of the characters "c" or "d". This could also be written as either

	"a[b-d]",
	"a(b|c|d)",

or

	"ab|ac|ad",

Braces, {}, are used to indicate that the enclosed subpattern should be matched exactly one time or not at all. They also have the effect of grouping the enclosed characters into a single subpattern just as parentheses do. An example is:


	"a{b|[cd]}".

Here, the pattern matches either an "a" by itself, or an "a" followed immediately by either a "b" or one of characters "c" or "d". Thus, the following strings (and only the following strings) match this pattern:

	a
	ab
	ac
	ad

The tilde, ~, is used to logically negate a set. When it precedes a pair of bracket-enclosed characters, it indicates the set of characters that are not in the enclosed set. This is best described by example, so here are a couple:

Subpatterns can be modified by the special characters asterisk, *, and the plus sign, +, which cause these subpatterns to be matched zero or more times, or one or more times, respectively. Here are some examples:

Let's return for a moment to our earlier example. Here is the token description presented earlier for matching C integers:


	#token  INT        "[1-9][0-9]*"

We see now that the part which says "[1-9]" is a subpattern description of the set of characters {1, 2, 3, 4, 5, 6, 7, 8, 9}, and that it matches any one character in this set. The subpattern "[0-9]*" indicates that zero or more characters from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} should be matched. The concatenation of these two subpatterns represents the concatenation of the patterns that each individually recognize into a single, larger pattern. Thus, the entire pattern "[1-9][0-9]*" matches any one character in the set {1, 2, 3, 4, 5, 6, 7, 8, 9}, followed immediately by zero or more characters in the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

There is one more special character I want to bring attention to. That character is the commercial-at character (which probably has some other name that I don't know): '@'. This character represents the end of file on the input stream. Thus "abcd@" represents the string "abcd" followed immediately by the end of the input stream.

DLG allows certain escape sequences to represent control and whitespace characters in regexs. The following escape sequences are understood:

\t
The "tab" character.
\n
The "newline" character.
\r
The "carriage return" character.
\b
The "backspace" character.

Some characters are not printable and do not have a related escape sequence. These characters can be inserted into a regex using their ASCII code values. The following patterns are available for this purpose:

\0nnn
The character that has the octal value nnn.
\0xnn
The character that has the hexadecimal value nn.
\mnn
The character that has the decimal value mnn, where m is between 1 and 9 inclusively.

We have seen that certain characters have special meaning to DLG when used in a regex. In order to include these characters in the language we are trying to describe, there needs to be a way to indicate when they are being used in a regex without their special meanings. This is done by escaping them with a preceding backslash "\", just as in an escape sequence. For example, to represent a plus sign, we use the pattern "\+".

Here is a list of the special characters which must be escaped to be used without their special meanings:

	|
	(
	)
	{
	}
	[
	]
	-
	~
	*
	+
	@
	The "space" character
	The "tab" character
I'll give you a couple of hints here:

Order of token descriptions

Note that there is a difference between an input string matching a pattern, and the lexer recognizing the input string as belonging to the token type described by that pattern. The lexer may choose to recognize the input string as a different token type which also matches the string. The DLG-created lexer will always choose to match the longest input string that it can to a token. If the longest possible input string can still match more than one token type, the lexer will choose the first of these listed in the lexclass.
This page was last modified .