DLG and lexclass descriptions:

DLG reads one or more lexclass descriptions and creates a DFA-based lexical analyzer (a.k.a. scanner or lexer) function. Each time this lexer function is called, it will read and store characters from an incoming stream until it recognizes a token. It then returns the token type to the calling function. The calling function may be handwritten or ANTLR-generated.

Lexclasses, modes, and tokens

The created lexer may have multiple lexical modes, each of which is described by a lexclass. Conventionally, these lexclasses are named using uppercase letters only. The lexer starts in the mode described by the lexclass "START", and stays in that mode until the mode is explicitly changed in a lexer or parser action.

Each lexclass description contains a list of token types that the incoming character stream should be broken into. These tokens are analogous to the words, numbers, punctuation, and other symbols found in a book. For each token type, there is a regular expression (a.k.a. regex) which describes the input patterns that may match that token, and an optional action to be performed whenever the lexer recognizes a token of that type.

A lexclass description starts with the directive:


	#lexclass classname

where classname is the name given to the mode described by this lexclass.

The #lexclass directive is followed by a list of token descriptions. These define the tokens that the lexer may recognize when operating in the mode defined by the lexclass. Each token description contains a regular expression which describes the character input patterns that match the token type, an optional name for the token, and an optional action to be taken whenever an input pattern is recognized as this type of token.

The token description begins with the #token directive. This may be followed by a token type name which can be used in an ANTLR grammar to refer to character strings which match the token. This is followed by a regular expression which describes the character strings that match this token. Finally, a lexical action may follow, enclosed in a double-</double-> pair (<<>>). This action will be executed by the lexer whenever it recognizes an input pattern as this token type.

For example, the following token matches the keyword "break" in a C source file:


	#token  K_BREAK        "break"

This token has been named "K_BREAK". The token name must begin with an upper case letter, but is otherwise restricted only to alphanumeric characters including the underscore. I usually use all capitals just to keep tokens obvious when referred to in the grammar rules. The token name does not have to be the same as the pattern matched, or even representative of the pattern: we could just as easily have named this token "Lucy" if we wanted to. Of course, choosing a representative name helps to make it clear which patterns this token represents.

After the name, we see the regular expression "break" enclosed in quotes. This regex simply describes the character string "break". Whenever this string appears on the input stream, the lexer built by DLG will know that it matches the token type K_BREAK. We will see later that it may choose to recognize this string as a K_BREAK, or it may choose to recognize it as some other token type, or as part of a larger token.

Before we move on, I want to talk about concatenation. This is when two smaller patterns, which I'll refer to as subpatterns, are joined end-to-end, to form a single, longer pattern. For example, the concatenation of the strings "abc" and "xyz" (in this order) is the string "abcxyz". Note that this is an ordered operation, and requires some indication of the order to use. In the above regex for the keyword "break", the characters 'b', 'r', 'e', 'a', and 'k' where concatenated together in the order listed (from left to right) in the regex to form the string "break". Thus, the pattern "break" is actually made up of the subpatterns containing the individual letters used to spell it. Concatenation works for more complex patterns also, as you will see.

DLG Regular Expressions (regexs)

Simple patterns can be written out, just as we did for the keyword "break" above, but more complicated patterns require some special notation to describe.

For example, a slightly more complicated token description is needed to describe the input strings which represent decimal integers in the C programming language. This token must match any character string which starts with one of the characters in the range '1'-'9' and continues with zero or more characters in the range '0'-'9'. The string "2487054" must be matched by this token type, and so must the string "9872485".

Here is a typical DLG description of such a token:


	#token  INT        "[1-9][0-9]*"

This token type is named "INT", and matches any string as described in the preceding paragraph. The regex for this rule has two features that are new: two pair of brackets, and an asterisk or "star". These symbols have special meaning when used in a DLG regex, and modify how DLG interprets the characters in the regex. Spaces and tabs are ignored within a DLG regex unless escaped (we'll talk about that later), so they may be inserted into the regex to increase clarity for the programmer.

Brackets, [], are used to enclose a set of characters which may match a single character in the input stream. Within a pair of brackets, a dash - can be used to describe all the characters whose ASCII values fall in the range between the ASCII values of the two characters it joins. For example, "[abcdez]" describes any single character in the set {a, b, c, d, e, z}. Because this pattern includes all the characters between, and including, 'a' and 'e', the pattern "[a-ez]" also describes any single character from this same set. As another example, the pattern "abc[dD]ef" matches either the string "abcdef" or the string "abcDef".

A vertical bar, |, is used to indicate a choice between two subpatterns. For example, the pattern "a|b" matches either the subpattern consisting of the character 'a' or the one consisting of the character 'b'.

The bar has the lowest precedence of the operators which can occur in a regex. Thus, the pattern "ab|c" matches either the string "ab" or the string "c", and does not match the string "ac". It is interpreted as "ab or c", not "a, followed by b or c". Here are some more examples:

The pattern "ab|[cd]" represents any of the following strings:
```
	ab
	c
	d
  
```
In words, this pattern matches either the string "ab", or either one of the characters in the set {c, d}.
The pattern "while|WHILE" represents either of the following strings:
```
	while
	WHILE
  
```
Thus, it matches the word "while" written either in all lower case letters or in all upper case letters. It does not match mixed case spellings of "while".
On the other hand, the pattern "[wW][hH][iI][lL][eE]" represents the word "while" written using any combination of upper and lower case letters. Note that this pattern is the concatenation of the subpatterns which match either case of each individual character in the word "while".

Parenthesis, (), are used to group subpatterns together to force a different grouping than would normally occur. We saw that the pattern "ab|[cd]" represents one of the strings "ab", "c", or "d". To represent the strings "ab", "ac", and "ad" instead, we might regroup this pattern as:


	"a(b|[cd])".

This new pattern matches an "a" followed immediately by either a "b" or one of the characters "c" or "d". This could also be written as either


	"a[b-d]",
	"a(b|c|d)",

or

	"ab|ac|ad",

Braces, {}, are used to indicate that the enclosed subpattern should be matched exactly one time or not at all. They also have the effect of grouping the enclosed characters into a single subpattern just as parentheses do. An example is:


	"a{b|[cd]}".

Here, the pattern matches either an "a" by itself, or an "a" followed immediately by either a "b" or one of characters "c" or "d". Thus, the following strings (and only the following strings) match this pattern:


	a
	ab
	ac
	ad

The tilde, ~, is used to logically negate a set. When it precedes a pair of bracket-enclosed characters, it indicates the set of characters that are not in the enclosed set. This is best described by example, so here are a couple:

The pattern "~[c]" represents any single character other than 'c'.
The pattern "~[a-dxz]" represents any single character not in the set {a, b, c, d, x, z}.
The pattern "ab~[c]" represents any three character string whose first two characters are "ab", and whose third character is any character other than 'c'.
The pattern "~[]" takes a bit of thought. It matches any single character that is not nothing, and is useful as a "catch-all" pattern.

Subpatterns can be modified by the special characters asterisk, *, and the plus sign, +, which cause these subpatterns to be matched zero or more times, or one or more times, respectively. Here are some examples:

The pattern "ab[c]*" represents any of the following strings:
```
	ab
	abc
	abcc
	  .
	  .
	  .
  
```
In other words, this pattern matches the string "ab" followed by zero or more 'c's.
The pattern "ab[c]+" represents any of these strings:
```
	abc
	abcc
	abccc
	  .
	  .
	  .
  
```
In words, this pattern matches the string "ab" followed by one or more 'c's. It does not match the string "ab", which the previous pattern does, because it requires there to be at least one 'c' at the end of the string.
The pattern "ab[cd]*" represents any of the following strings:
```
	ab
	abc
	abd
	abcc
	abcd
	abdc
	abdd
	  .
	  .
	  .
  
```
That is, this pattern matches the string "ab" followed by zero or more characters, each of which is from the set {c, d}.
The pattern "ab[cd]+" represents any of the following strings:
```
	abc
	abd
	abcc
	abcd
	abdc
	abdd
	  .
	  .
	  .
  
```
In other words, this pattern matches the string "ab" followed by one or more characters, each of which is from the set {c, d}. Again, this pattern does not match the string "ab", which the previous pattern does, because there must be at least one 'c' at the end of the string for this pattern to match it.
The pattern "ab(cd|ef)*" represents any of the following strings (and others):
```
	ab
	abcd
	abef
	abcdef
	abefcd
	abcdcdef
	abcdefef
	abcdefcd
  
```

Let's return for a moment to our earlier example. Here is the token description presented earlier for matching C integers:


	#token  INT        "[1-9][0-9]*"

We see now that the part which says "[1-9]" is a subpattern description of the set of characters {1, 2, 3, 4, 5, 6, 7, 8, 9}, and that it matches any one character in this set. The subpattern "[0-9]*" indicates that zero or more characters from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} should be matched. The concatenation of these two subpatterns represents the concatenation of the patterns that each individually recognize into a single, larger pattern. Thus, the entire pattern "[1-9][0-9]*" matches any one character in the set {1, 2, 3, 4, 5, 6, 7, 8, 9}, followed immediately by zero or more characters in the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

There is one more special character I want to bring attention to. That character is the commercial-at character (which probably has some other name that I don't know): '@'. This character represents the end of file on the input stream. Thus "abcd@" represents the string "abcd" followed immediately by the end of the input stream.

DLG allows certain escape sequences to represent control and whitespace characters in regexs. The following escape sequences are understood:

\t: The "tab" character.
\n: The "newline" character.
\r: The "carriage return" character.
\b: The "backspace" character.

Some characters are not printable and do not have a related escape sequence. These characters can be inserted into a regex using their ASCII code values. The following patterns are available for this purpose:

\0nnn: The character that has the octal value nnn.
\0xnn: The character that has the hexadecimal value nn.
\mnn: The character that has the decimal value mnn, where m is between 1 and 9 inclusively.

We have seen that certain characters have special meaning to DLG when used in a regex. In order to include these characters in the language we are trying to describe, there needs to be a way to indicate when they are being used in a regex without their special meanings. This is done by escaping them with a preceding backslash "\", just as in an escape sequence. For example, to represent a plus sign, we use the pattern "\+".

Here is a list of the special characters which must be escaped to be used without their special meanings:

	|
	(
	)
	{
	}
	[
	]
	-
	~
	*
	+
	@
	The "space" character
	The "tab" character

I'll give you a couple of hints here:

If you get a message from DLG such as:
```
	parser.dlg, line 94: syntax error at ">>" missing {...}
  
```
you have probably forgotten to escape a special character.
"[\ \t]" matches either a space or a tab character, while "[ \t]" matches only a tab character.

Order of token descriptions

Note that there is a difference between an input string matching a pattern, and the lexer recognizing the input string as belonging to the token type described by that pattern. The lexer may choose to recognize the input string as a different token type which also matches the string. The DLG-created lexer will always choose to match the longest input string that it can to a token. If the longest possible input string can still match more than one token type, the lexer will choose the first of these listed in the lexclass.

For example, given the choice between matching "float" to the token K_FLOAT or "floating" to the token WORD, the lexer will match the longer string, "floating", to WORD. Given the choice between matching "floating" to WORD or VERBS. It will choose the token type whose description appears first in the lexclass description.
As an example of how this may cause unexpected problems, consider the recognition of a keyword. A keyword will often match both its identity pattern (formed by spelling the word out) and the token pattern for identifiers in the language. For example, the C keyword "case" is usually matched by both the pattern "case" and the pattern "[a-zA-Z_][a-zA-Z_0-9]*", which is meant to match C identifiers such as variable, function, and label names.
If we place the token description for identifiers first in the lexclass description, then the string "case" will be recognized by the DLG-generated lexer as an identifier, not as a keyword. The proper ordering would be:
```
	#token	KEY_CASE	"case"
	#token	ID		"[a-zA-Z_][a-zA-Z_0-9]*"
  
```
Now, when the string "case" is read on the input stream, it will be recognized as a keyword rather than as just another identifier. The string "caseone" will be recognized as the single ID "caseone", rather than the keyword "case" followed by the ID "one", because "caseone" is a longer string than "case".

This page was last modified .