ANTLR allows us to add actions to the grammar file to store or retrieve information, generate output, and make semantic checks. These actions can be added within the grammar rules or at the top level. All actions are enclosed in a << ... >> pair, and may contain any legal C code (or code in whatever base language ANTLR is generating).
A special top-level action, called the header, is included in all the C
source files generated by ANTLR. It is useful for inserting file inclusions,
external declarations of variables and function prototypes, and type, struct,
and macro definitions in multiple files. All other top-level actions are
placed in the parser source file only, so variable definitions (i.e. space
allocations) and other non-shared code (such as the definition of main())
may be placed in non-header top-level actions.
The header is written as #header followed by an action to be inserted into
every source file generated by ANTLR. Let's look at a typical header:
#header
<<
#include "charptr.h"
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#define DEBUG 1
#define SINGLE 1
#define PLURAL 2
#define NONARRAY 3
#define CALL 4
#define sym struct _sym
sym {
    char *text;
    int type;
    int class;
    int base;
    int size;
};
extern sym symtab[];
#define ZZCOL
>>
This section of code starts by including the file "charptr.h". This file
contains the pre-defined declarations that allow character pointers to be
used for handling lexeme text. The next three lines include some of the
standard header files for dealing with output, type definitions, etc. that may
be used in later actions. After these, there is a definition for DEBUG, which
I use to turn on some debugging output in the rest of the actions. Then there
are some definitions which are used in the symbol table, the definition of the
structure of the symbol table elements, and an extern declaration of the symbol
table itself. Notice that we don't actually allocate (i.e. create) the symbol
table here, because the header is copied into every source file and we do not
want a separate table allocated in each one. Finally, ZZCOL is defined. This
tells DLG to track line and column information as it scans the input character
stream.
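The split between declaring the symbol table in the header and allocating it
elsewhere follows the usual C rule that an object may be declared many times
but defined only once. Here is a minimal single-file sketch of that pattern,
with the two halves marked by comments. The names sym_entry and
symtab_capacity() are made up for illustration and are not part of the
tutorial's compiler.

```c
#include <stddef.h>

/* --- what a shared header would carry: types and declarations only --- */
struct sym_entry {                  /* hypothetical stand-in for "sym" */
    char *text;
    int   type;
};
extern struct sym_entry symtab[];   /* declaration: no storage allocated */

/* --- what exactly one source file would carry: the definition --- */
struct sym_entry symtab[513];       /* definition: storage allocated once */

/* Illustrative helper: number of entries in the table defined above. */
size_t symtab_capacity(void)
{
    return sizeof symtab / sizeof symtab[0];
}
```

If the definition were placed in the header instead, every generated source
file would try to allocate its own table.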
A main() function which calls the ANTLR macro may be included in a top-level
action to define a complete program.
Some PCCTS startup code may have to be included in a top-level action to initialize the token text storage system.
The length of any single action is limited. Because all actions are inserted directly into the generated source, an action that is too long can be split into two or more consecutive actions with no effect on functionality.
Here is an example of another top-level action:
<<
sym symtab[513];
sym *symptr = &(symtab[0]);
int indent = 0; /* current output indent */
int offset = 0; /* stack offset for var */
int reg = -1; /* next register to use */
char buf[513]; /* Used by rval_array and fact to pass a string */
#include "charptr.c"
main(argc, argv)
int argc;
char **argv;
{
    if (argc != 1) {
        error("no command-line args; use redirection for file I/O");
    }
    ANTLR (prog(), stdin);
    return(0);
}
warn(fmt, a, b, c)
char *fmt;
int a, b, c;
{
    fprintf(stderr, "line %d: ", zzline);
    fprintf(stderr, fmt, a, b, c);
    fprintf(stderr, "\n");
}
error(fmt, a, b, c)
char *fmt;
int a, b, c;
{
    warn(fmt, a, b, c);
    exit(1);
}
>>
This action declares the actual symbol table, and some other global variables.
It also includes the file charptr.c, which contains startup code for the
lexeme text system. Following this is the main() function for a compiler,
which starts the parsing process by calling ANTLR with the name of the
starting rule and the character stream that the lexer should read input from.
After main(), the definitions for a few support functions are given. These
can be called within other actions as needed.
prog : << printf("#include \"nempl.h\"\n\n"); >>
decls
funcs
;
Here, when "prog" is executed, it first prints to stdout an #include line for
the header file "nempl.h", so that the code output by the compiler includes
that header. After this, it calls the parsing functions decls() and funcs().
Actions can also be placed between or after the elements of a rule:
stc :
K_INT <<printf ( "SINGLE\n" );>>
| K_PLURAL <<printf ( "PLURAL\n" );>>
;
This rule searches for either a K_INT token or a K_PLURAL token. If it finds a
K_INT, it outputs the string "SINGLE\n" to stdout. If it finds a K_PLURAL, it
outputs the string "PLURAL\n" to stdout. We'll see more examples later.
Here is an example of how locals can be declared and used in a grammar rule:
paragraph:  <<
            {
                int count = 0;
            >>
            (   sentence
                <<
                count++;
                >>
            )+
            <<
                printf ( "%d sentences found\n", count );
            }
            >>
         ;
In this rule, the first action contains a left brace, {, which opens a new
scope in the output C code. The closing action has the corresponding right
brace, }, which closes this scope. We do this because C allows new local
variables to be declared at the start of any new scope (this is why you can
have local variables in a function). The first action opens a new scope,
declares the local variable count, and initializes it to 0. Each time the
subrule containing "sentence" is executed (which happens only while the next
token is in the first set of the non-terminal sentence), the variable count
is incremented. In the final action, we print out the number of sentences
found in the paragraph, then close the scope opened in the first action.
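To see why the braces matter, it helps to picture the actions pasted, in
order, into the body of the C function generated for the rule. The following
is a hand-written approximation of that shape, not ANTLR's actual output. The
token codes, the tokens array, and the pos cursor are invented stand-ins for
the real scanner interface, and the function returns the count only so that a
caller can check it.

```c
#include <stdio.h>

enum { TOK_EOF = 0, TOK_SENTENCE = 1 };   /* hypothetical token codes */

/* Stand-in for the rule "sentence": simply consumes one token. */
static void sentence(const int *tokens, int *pos)
{
    (void)tokens;
    (*pos)++;
}

/* Hand-written approximation of a function for the rule "paragraph". */
int paragraph(const int *tokens, int *pos)
{
    int result;
    {                                   /* first action: opens the scope */
        int count = 0;                  /* local lives until the closing brace */

        do {                            /* ( sentence << count++; >> )+ */
            sentence(tokens, pos);
            count++;
        } while (tokens[*pos] == TOK_SENTENCE);

        printf("%d sentences found\n", count);
        result = count;
    }                                   /* last action: closes the scope */
    return result;
}
```

Without the opening brace, the declaration of count would land in the middle
of the function body, which older C compilers reject.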
We can modify the rule to indicate that arguments are expected when the rule is called in other rules. Suppose we have a rule for handling declarations that must put all global definitions in a global symbol table, and all local definitions in a local symbol table. We might define this rule as follows:
decls[int scope]:
            <<
            {
                int type;
            >>
            (   FLOAT   << type = 0; >>
            |   INT     << type = 1; >>
            )
            NAME
            <<
                if ( $scope == GLOBAL )
                    enter ( $2, type, gtable);
                else
                    enter ( $2, type, ltable);
            }
            >>
         ;
This rule takes the int "scope" as an argument, and uses it to choose which
table to use. The dollar sign before "scope" in the final action indicates
that "scope" is an argument to the rule, and has not been declared in an
action. Note that we have also declared the local variable "type" here. If
we had not, the braces would not have been necessary.
Actually, they are not necessary at all because ANTLR automatically opens a new scope at the beginning of each rule, and closes it at the end. Thus, this particular pair of braces is redundant. Also note that there is no dollar sign in front of "type" when it is referenced in the actions. This is because it is declared within an action, rather than being an argument or return value for an ANTLR rule.
In a rule which calls "decls", the notation looks like this:
decls[n]
where n is a number or variable whose value will be passed to the
function decls(). For example, our rule for "prog" could be
changed to read as:
prog : << printf("#include \"nempl.h\"\n\n"); >>
decls[0]
funcs
;
Here, GLOBAL would be defined as the value 0 so that top-level declarations in
the source code will be placed in the global symbol table by decls().
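A rule argument behaves much like an ordinary C function parameter. The
sketch below mimics what the final action of decls[int scope] does with it,
under several assumptions: the values of GLOBAL and LOCAL, the fixed-size
table layout, and this version of enter() are illustrative definitions, since
the text does not show the real ones.

```c
#define GLOBAL    0
#define LOCAL     1
#define TABLE_MAX 16

struct table {                      /* hypothetical symbol table layout */
    const char *names[TABLE_MAX];
    int         types[TABLE_MAX];
    int         used;
};

struct table gtable, ltable;        /* zero-initialized: both start empty */

/* Illustrative enter(): append a name and type; returns 0 on success. */
static int enter(const char *name, int type, struct table *t)
{
    if (t->used >= TABLE_MAX)
        return -1;                  /* table full */
    t->names[t->used] = name;
    t->types[t->used] = type;
    t->used++;
    return 0;
}

/* Mirrors the final action of decls: the argument selects the table. */
int declare(int scope, const char *name, int type)
{
    if (scope == GLOBAL)
        return enter(name, type, &gtable);
    else
        return enter(name, type, &ltable);
}
```

Calling declare(GLOBAL, "x", 1) corresponds to reaching the final action after
the rule was invoked as decls[0].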
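Multiple values can be passed to and accepted by rules by using comma-separated lists of values and argument declarations, respectively.
Rules can also return values. The return value is declared in brackets after a > that follows the rule name: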
stc > [int class] :
K_INT <<$class = SINGLE;>>
| K_PLURAL <<$class = PLURAL;>>
;
Here, we look for the tokens K_INT and K_PLURAL, and set class equal
to a value which indicates which token we found. The dollar sign before class
in the action indicates that the variable is a return value for the
rule, and not defined in an action. The value is returned to the calling
function when the rule completes. Multiple values can be returned to the
caller by separating the values with commas.
In a rule which calls "stc", the notation looks like this:
stc > [v]
where v is a variable whose value will be set when the function
stc() returns. For example, we might see:
decls :     <<
            {
                register int type;
            >>
            stc > [type]
            <<
                printf ( "type is %d\n", type );
            }
            >>
         ;
Note that here "type" was a return value from "stc", but it is not an argument
or return value of "decls". Thus, we need to allocate space for it with a local
declaration, and access it without using a dollar sign.
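The flow of a return value from stc into a local of the calling rule can be
pictured as a C function handing back a result that the caller stores in its
own variable. This is only an analogy under assumed names: the token codes
and the next_token parameter are invented, and the code ANTLR really
generates for return values may be organized differently.

```c
#include <stdio.h>

#define SINGLE 1
#define PLURAL 2

enum { K_INT = 10, K_PLURAL = 11 };     /* hypothetical token codes */

/* Analogue of "stc > [int class]": the result goes back to the caller.
 * Returns -1 if the token is neither K_INT nor K_PLURAL. */
int stc(int next_token)
{
    int result = -1;                    /* plays the role of $class */

    if (next_token == K_INT)
        result = SINGLE;
    else if (next_token == K_PLURAL)
        result = PLURAL;
    return result;
}

/* Analogue of "decls": type is a plain local, so no dollar sign is used. */
int decls(int next_token)
{
    int type;

    type = stc(next_token);             /* stc > [type] */
    printf("type is %d\n", type);
    return type;
}
```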
We have introduced several concepts since first talking about the layout of a grammar rule, so let's see how they all fit together. The general format for a rule is:
name [type1 arg1, ..., typeN argN] > [type1 rval1, ..., typeM rvalM]:
alternate 1
| ...
| alternate X
;
There are other things (such as error actions), but I'm not going to go into
them here.
intdecl: INT
VARNAME
;
We would like to add an action to this rule to store the newly declared
variable in the symbol table, but how do we get the name of the variable?
The answer is dollar attributes. The elements of a rule are numbered starting
at one. Actions do not count as elements, and subrules count as one element.
Let's look at an example to solidify this:
arule: << printf ( "hello\n" ); >>
WORD /* This is $1 */
NUM /* This is $2 */
( alt1 | alt2 )* /* This is $3 */
<< printf ( "First WORD: %s\n", $1 );
printf ( "First NUM: %s\n", $2 );
>>
WORD /* This is $4 */
NUM /* This is $5 */
<< printf ( "Second WORD: %s\n", $4 );
printf ( "Second NUM: %s\n", $5 );
>>
;
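The numbering rule (elements are counted, actions are not, and a subrule
counts once) can be checked mechanically. The sketch below is an illustration
only and uses an invented encoding: each item of a rule is tagged as a token,
a rule reference, a subrule, or an action, and the function reports which
dollar number that item would receive.

```c
enum item_kind { ITEM_TOKEN, ITEM_RULE, ITEM_SUBRULE, ITEM_ACTION };

/* Returns the dollar-attribute number of items[index], counting from 1,
 * or 0 if that item is an action (actions are not numbered). */
int dollar_number(const enum item_kind *items, int index)
{
    int i, number = 0;

    if (items[index] == ITEM_ACTION)
        return 0;
    for (i = 0; i <= index; i++) {
        if (items[i] != ITEM_ACTION)
            number++;               /* tokens, rules, subrules count once */
    }
    return number;
}
```

Encoding "arule" as action, token, token, subrule, action, token, token,
action, the second WORD sits at index 5 and is reported as 4, matching the
comment in the grammar.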
This should be sufficient to collect and move data to wherever you need it to
generate the proper output. If you need some text from "alt1" or "alt2" to
generate output from this rule, you can separate the ( alt1 | alt2 ) into a
new rule, then declare a local character buffer in this rule, and pass its
address into the new rule, which copies the text you need into the array.
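The buffer-passing technique described above amounts to ordinary C: the
calling rule owns a character array, and the new rule copies text into it. A
minimal sketch follows, assuming a hypothetical helper save_text() that an
action in the new rule would call with the lexeme it wants to keep. Passing
the buffer size as well is an addition made here to guard against overflow.

```c
#include <string.h>

/* Copy lexeme into buf (capacity size), always NUL-terminating the result.
 * Returns 1 if the whole lexeme fit, 0 if it was truncated. */
int save_text(char *buf, size_t size, const char *lexeme)
{
    size_t len = strlen(lexeme);

    if (size == 0)
        return 0;                       /* nowhere to store anything */
    if (len >= size) {
        memcpy(buf, lexeme, size - 1);  /* keep as much as fits */
        buf[size - 1] = '\0';
        return 0;
    }
    memcpy(buf, lexeme, len + 1);       /* includes the terminating NUL */
    return 1;
}
```

In the calling rule, the buffer would be a local such as char buf[513]
declared in the first action, with its address passed as a rule argument.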