Converting a PCCTS Grammar into C Code

Converting a PCCTS grammar to C code is a matter of repeatedly applying a set of transformations. This is what antlr does to generate a compiler in C, C++, Java, Sanskrit, whatever. In this tutorial we will see how to do this by hand.

Each grammar rule can be implemented by a single C function. PCCTS has one rule per nonterminal, which consists of a label, followed by a semicolon, followed by the alternatives.

For example, the following rule describes a program in a small language:


	program: { K_PROGRAM } (declaration)* (function)+;

This says that a program optionally starts with the token K_PROGRAM, followed by zero or more declarations, followed by one or more functions.

When translating a grammar rule to a C function, we replace the label with the declaration of void function whose name is the same as that of the label. This function will take no arguments (i.e. takes void). Nonterminals within the rule are replaced by calls in the function to other functions with the same name as the nonterminal. Terminals within the rule are replaced by calling match() with the name of the terminal as an argument. match() takes a token as an argument and checks to see if the next token from the lexer matches the argument. If they match, it returns TRUE (or PASSED), otherwise it returns FALSE (or FAILED).

Before we discuss at the above rule, let's look at a much simpler form:


	program: K_PROGRAM declaration function;

Applying the above transformations to this rule yields:



	void program (void)
	{
		match (K_PROGRAM);
		declaration();
		function();
	}

We see that there is a direct mapping of each part of the rule to the parts of the C function which implements it. After translating the rules for declaration and function (which I have not given), we would have a more complete recognizer for our language. When all the rules are translated, we will have a set of C functions, one for each grammar rule, which are ready to parse a source stream written in our language.

Let's go back to the original version of the rule for program and see how the meta-symbols {, }, *, and + translate to C code. The original rule was:


	program: { K_PROGRAM } (declaration)* (function)+;

Whenever we see a brace pair '{' and '}' around a subrule, our function will need to test to see if the next group of tokens will match that subrule. If so, it will need to execute the code which translates that subrule, otherwise, it needs to skip over the code for that subrule and look for the next part of the rule. This is equivalent to an if construct whose condition is the type of the next token on the input stream. The general form for { subrule } is:



	if ( next_token_on_input_stream == any_token_in_first_set_of_subrule )
	{
		Code to recognize subrule
	}

Graphically, it looks like this:


		           -------------------------
		           |                       |
		           |     -----------      \|/
	       	---> ------------| subrule |--------------->
		                 -----------

We see that we can either skip the subrule altogether, or go through it exactly one time before leaving. Thus, we execute the subrule zero or one times.

By the way, the act of looking at the next token on the input stream is referred to as performing look-ahead. Here, the look-ahead was by one token. Thus, we are building an LL(1) parser. (If you don't know what that means, you're probably heading for a C or worse in your Compiler's class).

In the above rule, the code to translate K_PROGRAM is:


	match (K_PROGRAM);

The "first set" of the subrule K_PROGRAM is {K_PROGRAM}. That is, there is only one possibility, K_PROGRAM. So the code for the construct:


	{ K_PROGRAM }

is:

	if ( next_token_on_input_stream == K_PROGRAM )
	{
		match (K_PROGRAM);
	}

I haven't described what next_token_on_input_stream means exactly, but I'll get to it later.

Whenever we see a subrule enclosed by parentheses and followed by a '+', we need to apply that subrule at least once, then continue to apply it until the next group of tokens no longer matches the first set of the subrule. We won't deal with error-checking here because everyone knows that programmers never make errors. That means that we parse the subrule, then check to see if the next token is in the first set of the subrule. If it is, we repeat the parse and check; otherwise, we move on.

Graphically, it looks like this:

		                 -----------
		---> ------------| subrule |--------------->
		          /|\    -----------       |
		           |                       |
		           -------------------------

Following the arrows, we see that we have to go through the subrule at least once, but can then leave or loop back to go through it again. Thus, we execute the subrule one or more times.

This is equivalent to a do-while loop whose body is the subrule and whose condition is the type of the next token on the input stream. The general form is:


	do {
		Code to recognize subrule
	} while ( next_token_on_input_stream == any_token_in_first_set_of_subrule );

From this we see that we parse the subrule at least once, and as many times as it appears. Note that this means that the subrule's first set must differ from that of whatever follows this construct.

Suppose that the first set of declaration is {INTEGER, FLOAT}. Then the code for the subrule (declaration)* becomes:


	do {
		declaration();
	} while (( next_token_on_input_stream == INTEGER) ||
		 ( next_token_on_input_stream == FLOAT));

Whenever we see a subrule enclosed by parentheses and followed by a '*', we need to apply that subrule until the next group of tokens no longer matches that subrule. We won't deal with error-checking here because everyone knows that programmers never make errors. That means that we first check to see if the next token is in the first set of the subrule. If it is, we parse the subrule then repeat the check; otherwise, we move on.

Graphically, it looks like this:


		      -----------------------------------
		      |                                 |
		      |          -----------           \|/
		---> ------------| subrule |--------------->
		          /|\    -----------       |
		           |                       |
		           -------------------------

Following the arrows, we see that we can completely skip over the subrule and leave, otherwise, we have to go through the subrule at least once, after which we can leave or loop back to go through it again. Thus, we execute the subrule zero or more times. Examining these diagrams makes it clear that "(subrule)*" is equivalent to "{(subrule)+}".

This is equivalent to a while loop whose body is the subrule and whose condition is the type of the next token on the input stream. The general form is:


        while ( next_token_on_input_stream == any_token_in_first_set_of_subrule ) {
                Code to recognize subrule
        }

From this we see that we parse the subrule only as many times as it appears. Again, note that this means that the subrule's first set must differ from that of whatever follows this construct.

Suppose that the first set of function is {READ, WRITE}. Then the code for the subrule (function)+ becomes:


	while ( ( next_token_on_input_stream == READ ) ||
		( next_token_on_input_stream == WRITE ) ) {
			function();
	}

Putting it all together

Now we need to go back and use these constructs to build a C function for the above rule. Let's look at the rule again:


	program: { K_PROGRAM } (declaration)* (function)+;

We see that this is really just the concatenation of the subrules we have discussed, so all we need to do is place them, in order, in the body of the function program. The result is:



	void program (void)
	{
		if ( next_token_on_input_stream == K_PROGRAM )
		{
			match (K_PROGRAM);
		}

		do {
			declaration();
		} while (( next_token_on_input_stream == INTEGER) ||
			 ( next_token_on_input_stream == FLOAT));

		while ( ( next_token_on_input_stream == READ ) ||
			( next_token_on_input_stream == WRITE ) ) {
				function();
		}
	}

Because these parts are sequential parts of the rule (i.e. not alternatives), we can make a syntax diagram for program by just connecting their diagrams serially:



		           -------------------------
		           |                       |
		           |     -------------    \|/
	program	---> ------------| K_PROGRAM |--------------->----
		                 -------------                   |
	                                                         |
	------<----------------------------------------------<----
	|
	|	      -----------------------------------
	|	      |                                 |
	|	      |        ---------------         \|/
	-----------> ----------| declaration |------------->------
		          /|\  ---------------     |             |
		           |                       |             |
		           -------------------------             |
	                                                         |
	------<----------------------------------------------<----
	|
	|	                ------------
	-----------> -----------| function |--------------->-------------->
		          /|\   ------------       |
		           |                       |
		           -------------------------

Alternates for the same rule are written in the grammar as "alt1 | alt2". To write C code for this, we can simply use either an if/else. More generally, for a rule such as (alt1 | alt2 | ... | altN), we can use an if/else-if/else construct or a switch statement with one case per alternative.

The general format for the if/else-if/else version is:


	if ( next_token_on_input_stream == any_token_in_first_set_of_atl1 ) {

                Code to recognize alt1

	} else if ( next_token_on_input_stream == any_token_in_first_set_of_atl2 ) {

                Code to recognize alt2

	...

	} else if ( next_token_on_input_stream == any_token_in_first_set_of_atlN ) {

                Code to recognize altN

	}

The general format for the switch version is:


	switch ( next_token_on_input_stream ) {

		case first_token_in_first_set_of_alt1:
		case second_token_in_first_set_of_alt1:
		...
		case M1th_token_in_first_set_of_alt1:

			Code to recognize alt1
			break;

		case first_token_in_first_set_of_alt2:
		case second_token_in_first_set_of_alt2:
		...
		case M2th_token_in_first_set_of_alt2:

			Code to recognize alt2
			break;

		...

		case first_token_in_first_set_of_altN:
		case second_token_in_first_set_of_altN:
		...
		case MNth_token_in_first_set_of_altN:

			Code to recognize altN
			break;
	}

Graphically, the rule looks like this with the syntax diagrams for the alternates connected in parallel:


		                  --------
		            ------| alt1 |------
		            |     --------     |
		            |        .        \|/
		rule ---> --|        .         X-------->
		            |        .        /|\
		            |     --------     |
		            ------| altN |------
		                  --------

Following the arrows, we see that we have can choose to go through alt1 once and leave or go through alt2 once and leave, or go through any one of the other alternates once and then leave. Thus, we execute exactly one of the alternatives, no more.

Well, hopefully that's enough updating for this year...

This page was last modified .