Adding actions to an ANTLR grammar:

A recognizer will parse an input token stream according to a grammar description, and may find syntax errors if any exist, but this is all it will do. Often, we need output written in a machine or assembly language or even in another human language. For this we need a language translator or a compiler. These read input in one language and generate output in a different language.

ANTLR allows us to add actions to the grammar file to store or retrieve information, generate output, and make semantic checks. These actions can be added within the grammar rules or at the top level. All actions are enclosed in a << ... >> pair, and may contain any legal C code (or code in whatever base language ANTLR is generating).

Top-level actions

Top-level actions occur outside of lexclass and grammar descriptions. They will be inserted directly into the top-level of the ANTLR-generated C code, so must be legal as top-level C code. This means that global variable declarations, file inclusions, function prototypes and definitions, and macro definitions can be placed in top-level actions.

A special top-level action, called the header, is included in all the C source files generated by ANTLR. It is useful for inserting file inclusions, external declarations of variables and function prototypes, and type, struct, and macro definitions in multiple files. All other top-level actions are placed in the parser source file only, so variable definitions (i.e. space allocations) and other non-shared code (such as the definition of main()) may be placed in non-header top-level actions.

The header

The header starts with the directive #header followed by an action to be inserted into every source file generated by ANTLR. Let's look at a typical header:


#header
<<
#include "charptr.h"

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define DEBUG   1

#define SINGLE          1
#define PLURAL          2
#define NONARRAY        3
#define CALL            4

#define sym     struct _sym
sym {
        char    *text;
        int     type;
        int     class;
        int     base;
        int     size;
};

extern sym symtab[];

#define ZZCOL
>>

This section of code starts by including the file "charptr.h". This file contains definitions for pre-defined code which allows character pointers to be used for handling lexeme texts. The next three lines include some of the standard header files for dealing with output, type definitions, etc. that may be used in later actions. After these, there is a definition for DEBUG which I use to turn on some debugging output in the rest of the actions. Then, there are some definitions which are used in the symbol table, the definition of the structure of the symbol table elements, and an extern declaration of the symbol table itself. Notice that we don't actually allocate (i.e. create) the symbol table here, because the header is copied into every source file and we want the table to exist only once. Finally, ZZCOL is defined. This tells DLG to track line and column information as it scans the input character stream.
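The split between declaring the table in the header and allocating it in one place can be sketched in plain C. This is a minimal illustration, not ANTLR output: the struct is trimmed to two fields, both "files" are shown in a single block so that it compiles on its own, and the helpers table_capacity() and slot() are hypothetical names introduced only for the example.

```c
#include <assert.h>
#include <stddef.h>

/* --- what the #header action contributes to every generated file --- */
#define sym     struct _sym
sym {
        char    *text;
        int     type;
};

extern sym symtab[];    /* declaration only: no storage is allocated */

/* --- what exactly one source file contributes --- */
sym     symtab[513];    /* the single definition: storage is allocated here */

/* Hypothetical helper: number of entries the table can hold. */
size_t table_capacity(void)
{
        return sizeof(symtab) / sizeof(symtab[0]);
}

/* Hypothetical helper: bounds-checked access to one entry. */
sym *slot(size_t i)
{
        assert(i < table_capacity());
        return &symtab[i];
}
```

Because the header only declares symtab, any number of source files can include it, while the linker sees just one table.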

Other top-level actions

Top-level actions, other than the header, are included in a single source file. Support function and variable definitions or declarations can be placed in these top-level actions. A main() which calls the ANTLR macro may be included in a top-level action to define a complete program.

Some PCCTS startup code may have to be included in a top-level action to initialize the token text storage system.

The length of any given action is limited, so a long action may have to be split into two or more shorter ones. Because all actions are inserted directly into the source, splitting one long action into consecutive actions has no effect on functionality.

Here is an example of another top-level action:


<<
sym     symtab[513];
sym     *symptr = &(symtab[0]);

int     indent = 0;     /* current output indent */
int     offset = 0;     /* stack offset for var */
int     reg = -1;       /* next register to use */

char buf[513];          /* Used by rval_array and fact to pass a string */

#include "charptr.c"

main(argc, argv)
int argc;
char **argv;
{
        if (argc != 1) {
                error("no command-line args; use redirection for file I/O");
        }

        ANTLR (prog(), stdin);

        return(0);
}

warn(fmt, a, b, c)
char *fmt;
int a, b, c;
{
        fprintf(stderr, "line %d: ", zzline);
        fprintf(stderr, fmt, a, b, c);
        fprintf(stderr, "\n");
}

error(fmt, a, b, c)
char *fmt;
int a, b, c;
{
        warn(fmt, a, b, c);
        exit(1);
}

>>

This action declares the actual symbol table, and some other global variables. It also includes the file charptr.c which contains startup code for the lexeme text system. Following this is the main() function for a compiler, which starts the parsing process by calling ANTLR with the name of the starting rule and the character stream that the lexer should read input from. After main(), the definitions for a few support functions are given. These can be called within other actions as needed.
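The warn() and error() helpers above pass a fixed number of int arguments, which only works by convention on older compilers. With an ANSI C compiler, the same helpers could be written with <stdarg.h>. The following is a sketch under assumptions: zzline is defined locally as a stand-in for the line counter kept by the scanner, vwarn() is a hypothetical shared helper, and warn() returns the number of characters written purely so that the result can be checked.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the scanner's line counter, defined here only so the
   sketch compiles on its own. */
int zzline = 1;

/* Hypothetical shared helper: prints "line N: <message>\n" to stderr.
   Returns the number of characters written, or a negative value on error. */
int vwarn(const char *fmt, va_list ap)
{
        int head, body;

        assert(fmt != NULL);
        head = fprintf(stderr, "line %d: ", zzline);
        if (head < 0)
                return head;
        body = vfprintf(stderr, fmt, ap);
        if (body < 0)
                return body;
        if (fputc('\n', stderr) == EOF)
                return -1;
        return head + body + 1;
}

/* Report a problem and continue parsing. */
int warn(const char *fmt, ...)
{
        va_list ap;
        int n;

        va_start(ap, fmt);
        n = vwarn(fmt, ap);
        va_end(ap);
        return n;
}

/* Report a problem and stop the compiler. */
void error(const char *fmt, ...)
{
        va_list ap;

        va_start(ap, fmt);
        vwarn(fmt, ap);
        va_end(ap);
        exit(1);
}
```

An action could then call, for example, warn("bad token %s", text) with arguments of any type that matches the format string.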

Simple in-line actions

Actions can be added to grammar rule descriptions in order to maintain information or generate output. These actions are inserted directly into the description. For example:


prog    :       << printf("#include \"nempl.h\"\n\n"); >>
                decls
                funcs
        ;

Here, when "prog" is executed, it starts by outputting a line to stdout which will include the header file "nempl.h" in the output code for the compiler. After this, it calls the parsing functions decls() and funcs().

Actions can also be placed after or between the parts of a description:


stc :
                K_INT           <<printf ( "SINGLE\n" );>>
        |       K_PLURAL        <<printf ( "PLURAL\n" );>>
        ;

This rule searches for either a K_INT token or a K_PLURAL token. If it finds a K_INT, it outputs the string "SINGLE\n" to stdout. If it finds a K_PLURAL, it outputs the string "PLURAL\n" to stdout. We'll see more examples later.

Declaring local variables inside a grammar rule

Sometimes it is handy to count the number of times a subrule has been executed, or to store some information (such as an array name) until we are ready to use it later in the rule. In these cases, local variables are quite useful.

Here is an example of how locals can be declared and used in a grammar rule:


paragraph:	<<
		{
			int count = 0;
		>>
		( sentence
		  <<
			count++;
		  >>
		)+
		<<
			printf ( "%d sentences found\n", count );
		}
		>>
		;

In this rule, the first action contains a left brace, {. This is used to open a new scope in the output C code. The closing action has the corresponding right brace, }, which is used to close this scope. We do this because the C definition allows new local variables to be defined whenever a new scope is opened (this is why you can have local variables in a function). The first action opens a new scope, declares the local variable count, and initializes it to 0. Each time the subrule containing "sentence" is executed (which happens only while the next token is in the first set of the non-terminal sentence), the variable "count" will be incremented. In the final action, we print out the number of sentences found in the paragraph, then close the scope opened in the first action.
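To see why the braces matter, it can help to picture the C function that the rule turns into. The following is a hand-written approximation, not actual ANTLR output. It assumes hypothetical token codes SENTENCE and T_EOF, reduces "matching a sentence" to consuming one token, and returns the count (or -1 when no sentence is present) only so that the behavior can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical token codes; a real parser receives tokens from the scanner. */
enum token { SENTENCE, T_EOF };

/* Hand-written approximation of the rule "paragraph".
   Returns the sentence count, or -1 if ( sentence )+ matched nothing. */
int paragraph(const enum token *input)
{
        assert(input != NULL);
        {                               /* scope opened by the first action */
                int count = 0;          /* local declared in that action */
                size_t pos = 0;         /* input cursor; not part of the grammar */

                while (input[pos] == SENTENCE) {
                        pos++;          /* stands in for matching one sentence */
                        count++;        /* action inside the subrule */
                }
                if (count == 0)
                        return -1;
                printf("%d sentences found\n", count);
                return count;
        }                               /* scope closed by the final action */
}
```

The variable count is visible to every action between the two braces and to nothing outside them.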

Passing arguments to a grammar rule

Sometimes we wish to pass information into a grammar rule. For example, we may have determined something in an earlier part of the calling rule, and need to know this to generate the proper output in the called rule.

We can modify the rule to indicate that arguments are expected when the rule is called in other rules. Suppose we have a rule for handling declarations that must put all global definitions in a global symbol table, and all local definitions in a local symbol table. We might define this rule as follows:


decls[int scope]:
		<<
			{
				int type;
		>>
		(  FLOAT <<	type = 0; >>
		 | INT   <<	type = 1; >>
		)
		NAME
		<<
				if ( $scope == GLOBAL )
					enter ( $2, type, gtable);
				else
					enter ( $2, type, ltable);
			}
		>>
		;

This rule takes the int "scope" as an argument, and uses it to choose which table to use. The dollar sign before "scope" in the final action indicates that "scope" is an argument to the rule, and has not been declared in an action. Note that we have also declared the local variable "type" here, inside a pair of braces.

Strictly speaking, the braces are not necessary: ANTLR automatically opens a new scope at the beginning of each rule and closes it at the end, so this particular pair is redundant. Also note that there is no dollar sign in front of "type" when it is referenced in the actions. This is because it is declared within an action, rather than being an argument or return value for an ANTLR rule.

In a rule which calls "decls", the notation looks like this:


	decls[n]

where n is a number or variable whose value will be passed to the function decls(). For example, our rule for "prog" could be changed to read as:


prog    :       << printf("#include \"nempl.h\"\n\n"); >>
                decls[0]
                funcs
        ;

Here, GLOBAL would be defined as the value 0, so that top-level declarations in the source code will be placed in the global symbol table by decls().

Multiple values can be passed to and accepted by rules by using comma-separated lists of values and argument declarations, respectively.
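In C terms, a rule argument behaves like an ordinary function parameter. The sketch below approximates the "decls" rule by hand under several assumptions: the symbol table is a hypothetical fixed-size array, enter() is a stand-in for whatever the real symbol-table code provides, the token codes T_FLOAT and T_INT are invented, and decls() returns the table it chose purely so that the choice can be checked.

```c
#include <assert.h>
#include <stddef.h>

#define GLOBAL  0
#define LOCAL   1

/* Hypothetical token codes for the two type keywords. */
enum token { T_FLOAT, T_INT };

/* Hypothetical, much simplified symbol table. */
struct table {
        const char      *names[16];
        int             types[16];
        int             used;
};

struct table gtable, ltable;

/* Hypothetical stand-in for enter(): append one name to a table. */
void enter(const char *name, int type, struct table *t)
{
        assert(name != NULL && t != NULL && t->used < 16);
        t->names[t->used] = name;
        t->types[t->used] = type;
        t->used++;
}

/* Approximation of decls[int scope] once the keyword and NAME are known. */
struct table *decls(int scope, enum token keyword, const char *name)
{
        int type;                       /* local declared in the first action */
        struct table *t;

        type = (keyword == T_FLOAT) ? 0 : 1;
        t = (scope == GLOBAL) ? &gtable : &ltable;  /* $scope in the grammar */
        enter(name, type, t);
        return t;
}
```

A caller that writes decls[0] in the grammar corresponds to a call such as decls(GLOBAL, T_INT, "x") in this sketch.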

Returning values from a grammar rule

Sometimes we use a non-terminal to parse some input information and find out some information for the calling rule. For example, we may want a rule to look for the base type names of the input language:


stc > [int class] :
                K_INT           <<$class = SINGLE;>>
        |       K_PLURAL        <<$class = PLURAL;>>
        ;

Here, we look for the tokens K_INT and K_PLURAL, and set class equal to a value which indicates which token we found. The dollar sign before class in the action indicates that the variable is a return value for the rule, and not defined in an action. The value is returned to the calling function when the rule completes. Multiple values can be returned to the caller by separating the values with commas.

In a rule which calls "stc", the notation looks like this:


	stc > [v]

where v is a variable whose value will be set when the function stc() returns. For example, we might see:


decls	:	<<
			{
				register int type;
		>>
		stc > [type]
		<<
				printf ( "type is %d\n", type );
			}
		>>
        ;

Note that here "type" is the return value from "stc", but it is not an argument or return value of "decls". We therefore need to allocate space for it with a local declaration, and access it without a dollar sign.
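One way to picture a rule with a single return value is as a C function that returns that value. The text does not show how ANTLR actually generates this code, so the following is only a hand-written sketch: the token codes are hypothetical, the -1 result for "no match" is an invented convention, and decls() returns the type only so that the example can be checked.

```c
#include <assert.h>
#include <stdio.h>

#define SINGLE  1
#define PLURAL  2

/* Hypothetical token codes; a real parser receives tokens from the scanner. */
enum token { K_INT, K_PLURAL, T_OTHER };

/* Approximation of "stc > [int class]". Returns -1 if neither token matches. */
int stc(enum token lookahead)
{
        int cls = -1;                   /* plays the role of $class */

        if (lookahead == K_INT)
                cls = SINGLE;
        else if (lookahead == K_PLURAL)
                cls = PLURAL;
        return cls;
}

/* Approximation of "decls": type is a plain local, so no dollar sign. */
int decls(enum token lookahead)
{
        int type;

        type = stc(lookahead);          /* stc > [type] in the grammar */
        if (type < 0)
                return -1;
        printf("type is %d\n", type);
        return type;
}
```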

Generalized format for a rule

We have introduced several concepts since first talking about the layout of a grammar rule, so let's see how they all fit together. The general format for a rule is:


	name [type1 arg1, ..., typeN argN] > [type1 rval1, ..., typeM rvalM]:
			alternate 1
		|	...
		|	alternate X
		;

There are other things (such as error actions), but I'm not going to go into them here.

Accessing lexeme texts within an action

Sometimes we need to access the actual text of a token. Suppose we have the following rule for declaring integers:


	intdecl:	INT
			VARNAME
		;

We would like to add an action to this rule to store the newly declared variable in the symbol table, but how do we get the name of the variable? The answer is dollar attributes. The elements of a rule are numbered starting at one. Actions do not count as elements, and subrules count as one element.

Let's look at an example to solidify this:


	arule:	<< printf ( "hello\n" ); >>
		WORD			/* This is $1 */
		NUM			/* This is $2 */
		( alt1 | alt2 )*	/* This is $3 */
		<< printf ( "First WORD: %s\n", $1 );
			 printf ( "First NUM: %s\n", $2 );
		>>
		WORD			/* This is $4 */
		NUM			/* This is $5 */
		<< printf ( "Second WORD: %s\n", $4 );
			 printf ( "Second NUM: %s\n", $5 );
		>>
		;

This should be sufficient to collect and move data to wherever you need it to generate the proper output. If you need some text from "alt1" or "alt2" to generate output from this rule, you can separate the ( alt1 | alt2 ) into a new rule, then declare a local character buffer in this rule, and pass its address into the new rule, which copies the text you need into the array.
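The buffer-passing approach described above can be sketched in C. This is an illustration under assumptions, not generated code: alts() is a hypothetical stand-in for the rule split out of ( alt1 | alt2 ), the lexeme is passed in as a plain string, and arule() returns the length of the copied text only so that the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the new rule: copies the lexeme text into the
   caller's buffer. Returns 1 if it fit, 0 if it had to be truncated. */
int alts(const char *lexeme, char *out, size_t size)
{
        size_t len;

        assert(lexeme != NULL && out != NULL && size > 0);
        len = strlen(lexeme);
        if (len >= size) {
                memcpy(out, lexeme, size - 1);
                out[size - 1] = '\0';
                return 0;
        }
        memcpy(out, lexeme, len + 1);   /* includes the terminating '\0' */
        return 1;
}

/* Stand-in for the calling rule, which owns the buffer. */
size_t arule(const char *lexeme)
{
        char text[32];                  /* local buffer declared in an action */

        alts(lexeme, text, sizeof text);        /* pass the buffer's address */
        printf("alt text: %s\n", text);
        return strlen(text);
}
```

Passing the buffer size along with the address keeps the called rule from writing past the end of the caller's array.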
