Reading a PCCTS Grammar

A PCCTS grammar is made up of a set of "rules" or "productions". Each rule starts with the name of a "non-terminal", which is followed by a colon (:), and consists of one or more "alternatives". The alternatives are presented after the colon, and are separated by "vertical pipes" (|). The rule is ended after, the last alternative, by a semicolon (;).

Before I explain the structure of an alternative, lets look at some examples of what we have talked about so far. Each of the following is a valid rule:


1.	prog: routine;

2.	routine: function | procedure;

3.	function : decl
		 | stat
		 ;

4.	procedure: decl
		 | stat
		 ;

The numbers at the beginning of the line are there just to make it clear where the rules begin for this tutorial -- they are not part of the grammar.

First, note that whitespace and newlines are ignored. Now let's look at each of the rules. Rule 1, describes the nonterminal "prog" and has only one alternative which is named "routine". This indicates that in the language described by this grammar, a program (prog) is made up of one routine (routine). Rule 2 shows that a routine is, in turn, made up of either a single "function", or a single "procedure". Finally, rules 3 and 4 indicate that a "function" or "routine" consists of either a "decl" or a "stat".

Okay, I think you probably get the basic idea here, so let's move on to terminals. A "terminal" is a specific "token" rather than a reference to a grammar rule. Terminals consist of things like keywords, numbers, identifiers, and puctuation marks. Let's add some terminals to the above language to make it more interesting, and also add parentheses whose meaning I'll go over in a second:


	prog: "Program" routine;

	routine: function | procedure;

	function : "function"
		   ( decl
		    |stat
		   )
		   "return"
		 ;

	procedure: "procedure"
		   ( decl
		    |
		     stat
		   )
		 ;

Now, a program starts with the keyword "Program" followed by a single routine. The routine can be either a single function or a single procedure. Functions start with the keyword "function" and consist of a "decl" or a "stat", followed by the keyword "return". Procedures start with the keyword "procedure" and also consist of a "decl" or a "stat".

The parentheses change the grouping of a rule by creating a "subrule". In the rule for "procedure", we have `"procedure" (decl|stat)'. This is interpreted as the keyword "procedure" followed by one of the two alternates, "decl" or "stat". If we were to remove the parentheses from this rule, we would have `"procedure" decl | stat', which would be interpreted as the keyword "procedure" followed by a "decl", OR a single "stat" (not a "procedure" followed by a "stat").

Tokens are not always written or represented literally in a grammar. Sometimes we have a "token name" which represents one or more possible lexemes which are all classified as the same type of token. Suppose that our lexer provides #defined codes to the parser which represent the token type. We might then see the following grammar instead of the previous one:


	prog: K_PROG routine;

	routine: function | procedure;

	function : K_FUNC
		   ( decl
		    |stat
		   )
		   K_RET
		 ;

	procedure: K_PROC
		   ( decl
		    |
		     stat
		   )
		 ;

Where the lexer and parser have a mutual agreement that the token types are defined as follows:



	K_PROG	"Program"
	K_FUNC	"function"
	K_PROC	"procedure"
	K_RET	"return"

This grammar describes exactly the same language that the previous grammar describes. This grammar, though, uses the token types rather than an explicit token description to describe that language.

If you haven't noticed, this grammar is incomplete: there are no definitions for "decl" or "stat". In order to have a complete language definition, we must provide these definitions. Let's define them as follows:


	decl: type WORD SEMI;

	stat: WORD ASSIGN expr SEMI;

	type: K_FLOAT | K_INT;

	expr: WORD | NUM;

Okay, so it's a stupid language, but so is English. I've also added the token types:



	WORD	"[a-zA-z_][a-zA-Z_0-9]*"	/* A C identifier */
	SEMI	";"				/* A semicolon */
	ASSIGN	"="				/* An equals sign */
	K_FLOAT "float"				/* The keyword "float" */
	K_INT	"int"				/* The keyword "int" */
	NUM	"[0-9]*"			/* A number */

Okay, it's a complete language now, but not very interesting. So far, we only allow one statement in each function or procedure, and only one function or procedure in each program. We can change this by adding some of the symbols '*', '+', and '{' and '}' after a subrule to indicate that it is optional or that it can be applied multiple times. Let's look at each of these in turn.

'*' indicates that the subrule can be applied 0 or more times. '+' indicates that the subrule must be applied 1 or more times. Each of these must follow the subrule definition, which is enclosed in a parentheses pair. Braces can be used to replace the parentheses around a subrule to indicate that it can be applied exactly zero or one times.

Let's redefine the above language to allow more flexibility:


	prog: K_PROG (routine)*;

	routine: function | procedure;

	function : K_FUNC (decl)* (stat)+ K_RET;

	procedure: K_PROC (decl)* (stat)+;

	decl: type WORD {LBRACK NUM RBRACK} SEMI;

	stat: WORD {LBRACK expr RBRACK} ASSIGN expr SEMI;

	type: K_FLOAT | K_INT;

	expr: WORD {LBRACK expr RBRACK} | NUM;

A program written in the new language still starts with the keyword "Program", but it may otherwise be empty, or contain one or more "routines". A single routine is still made up of exactly one function or procedure, but now functions and procedures are made up of zero or more "decls" followed by one or more "stats". Notice that we've removed the choice between the two.

Finally, we have redefined "decl", "stat", and "expr" to allow the use of arrays. In "decl", we have added an optional subrule to allow the size of the array to be described as in C. In "stat", we have added an optional subrule to allow an array element to be specified in the left-hand-side (LHS) of an assignment statement. In "expr", we have added an optional subrule to allow an array element to be specified in the right-hand-side (RHS) of an assignment statement. Notice that this requires that we define the tokens LBRACK and RBRACK. Let's define them as follows:


	LBRACK "["				/* A left-bracket */
	RBRACK "]"				/* A right-bracket */

This should be enough for you to figure out how to read a simple PCCTS grammar. For more information, you should see Terence Parr's book on PCCTS.

This page was last modified .