The compiler should break down naturally into the following functional modules:
One fundamental decision to be made is the number of passes to be used in compilation. Pascal has been explicitly designed to allow one-pass compilation; however, this generally produces rather poor code. Hence we would suggest a minimum of two passes: the first performing lexical and syntactic analysis, and the second semantic analysis and code generation. Additional passes can improve code optimization.
Another primary concern is the output object code. A production compiler should obviously produce directly relocatable loader code. On the other hand, for many systems programming functions, and while the compiler is being tested and refined, assembly language output may be more suitable.
The operating system interface modules should do whatever is necessary to open, create, and set up the following functional files.
The listing and error files are often mapped onto the same file. All input is read one character at a time. Output for the listing and error files is also one character, or a sequence of characters, at a time. The object file may be character (assembly language) or word (binary loader input) oriented.
This is the key to a successful Pascal compiler. The symbol table is used to collect and organize all information in the program. Each symbol table entry would typically include:
The key to a symbol table for a Pascal compiler is the type information. In addition to a few intrinsic types, users are allowed to define types. Types are treated just as other symbols, that is, they are put into the symbol table, but identified as type symbols, not variable name symbols. Thus, the type field of a symbol is a pointer to another symbol table entry, which is its type. The symbol table entry for a type depends upon the kind of type as follows:
This basic approach may require a fair amount of pointer following (e.g., to generate an array reference, we follow the variable's type pointer to its type, then the index pointer to the subrange to find the lower bound, then the type pointer to find the type of the component item), but it provides a general mechanism for creating and describing user-defined types.
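As a sketch of this organization (Python is used for these illustrations; the field names here are our own choices, not prescribed by the design), a symbol table entry might carry a kind tag, a type pointer to another entry, and kind-specific information:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Symbol:
    """One symbol table entry.  'kind' distinguishes variable names,
    type names, constants, and so on; 'type_ref' points at the symbol
    table entry describing this symbol's type."""
    name: str
    kind: str                      # 'var', 'type', 'const', 'field', ...
    type_ref: Optional["Symbol"] = None
    # kind-specific information, e.g. bounds for a subrange type:
    info: dict = field(default_factory=dict)

# An intrinsic type, a user-defined subrange of it, and a variable:
integer = Symbol("integer", "type")
t1 = Symbol("t1", "type", type_ref=integer, info={"low": 0, "high": 10})
x = Symbol("x", "var", type_ref=t1)

# Following the type pointer of a variable reaches its type entry:
assert x.type_ref is t1 and t1.type_ref is integer
```

The type field being a pointer to another symbol table entry is what lets user-defined types be treated uniformly with intrinsic ones.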
See Figure 1 for an example of the structure of the following declarations:

    type t1 = 0..10;
         r2 = record x: t1; y: t1; end;
         a3 = array [t1] of r2;
    var  x: array [0..10] of char;
         y: array [char] of a3;
Notice that some types are declared implicitly; that is, they are never explicitly named in a type declaration. These are "constant" types and can be treated as types with no name.
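The pointer following described earlier can be made concrete with the declarations above. In this hedged sketch the type entries are plain dictionaries (the representation is ours), and generating an array reference follows the chain variable, type pointer, array entry, index pointer, subrange entry, lower bound:

```python
# Hypothetical entries for the declarations above (fields are our own).
t1 = {"kind": "subrange", "low": 0, "high": 10, "base": "integer"}
r2 = {"kind": "record", "fields": {"x": t1, "y": t1}}
a3 = {"kind": "array", "index": t1, "component": r2}
v = {"kind": "var", "type": a3}   # a hypothetical variable of type a3

def array_lower_bound(var):
    # variable -> type pointer -> array entry -> index pointer ->
    # subrange entry -> lower bound, exactly the chain described above
    return var["type"]["index"]["low"]

assert array_lower_bound(v) == 0
```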
Symbol names can be treated in two ways. The traditional approach is to fix a maximum number of characters which are stored. If the maximum is selected at 5, then only the first 5 characters are stored and used to identify symbols. One could then limit all identifiers to 5 characters or less, or allow longer names, but only use the first 5. Another approach uses the first 5 characters plus the actual length of the input symbol; this distinguishes between "currentlevel" and "currentprocedure" but not between "currentl" and "currentp".
A better approach is to save the entire variable name. This is easy to do by using a string table. The string table is an array of characters. Each symbol table entry contains simply the length of the name and a pointer into the string table. The extra programming caused by this extra level of indirection is minimal.
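A minimal sketch of such a string table (the class and method names are ours): names are appended to one big character array, and each symbol table entry keeps only an (offset, length) pair.

```python
class StringTable:
    """Names live in one shared array of characters; symbol table
    entries store only an offset into it and a length."""
    def __init__(self):
        self.chars = []                 # the array of characters

    def add(self, name):
        offset = len(self.chars)
        self.chars.extend(name)
        return offset, len(name)        # what the symbol entry stores

    def get(self, offset, length):
        return "".join(self.chars[offset:offset + length])

st = StringTable()
entry = st.add("currentprocedure")      # full name saved, no truncation
assert st.get(*entry) == "currentprocedure"
```

Because the full name is kept, "currentlevel" and "currentprocedure" can never collide, unlike the truncation schemes above.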
The block structure of Pascal poses the standard problems of scope and identical variables defined at multiple levels. Although it is possible, when exiting a procedure block at level i, to delete all symbol table entries at level i or greater, leaving only the entries at lower levels, the better approach is to simply use the stack nature of the blocks. Allocate all space in the symbol table at the end of the table. Whenever the level increases, save the current free space pointer. Upon block exit, merely reset the free space pointer to the saved value. If a string table is used, a pointer will be needed for both. This is much easier and less error prone than individually deleting each element from the highest level.
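A sketch of this stack discipline (representations are ours): on block entry the free-space pointers of both tables are saved; on block exit they are reset, discarding every inner-level entry at once.

```python
# Symbol table entries are (offset, length) pairs into the string table.
symtab, strtab = [], []
saved = []                  # one (symtab mark, strtab mark) per open block

def enter_block():
    saved.append((len(symtab), len(strtab)))   # save free-space pointers

def declare(name):
    strtab.extend(name)
    symtab.append((len(strtab) - len(name), len(name)))

def exit_block():
    sym_mark, str_mark = saved.pop()           # reset both pointers
    del symtab[sym_mark:]
    del strtab[str_mark:]

declare("outer")
enter_block()
declare("inner1"); declare("inner2")
exit_block()                # both inner entries vanish together
assert len(symtab) == 1
```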
Notice that some symbol table entries will normally not be findable by referring to them by name, but must be in the symbol table anyway. For example, field names in records should not be found unless qualified by a record name. Unnamed types cannot be referred to by name. This implies the existence of a search structure which differs from the symbol table. We would suggest that an extra pointer be included in each symbol table entry which allows each entry to be linked into the search structure. Unnamed types are not linked into the search structure. Field names are not linked. The WITH statement implementation, however, would link them into the search structure for the duration of the WITH.
For searching, we suggest a chained hash table. A hash function of the variable name gives an index into a small hash table, each entry of which is the head of a chain of entries, ending with a null pointer. Entries are put on the chain by a stack mechanism (put on the front of the chain) so all searches automatically find the most recent version of the variable name.
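A sketch of this chained hash table (table size and hash function are arbitrary choices of ours): new entries are pushed on the front of their chain, so a search naturally finds the most recent declaration of a name first.

```python
TABLE_SIZE = 8                      # small table; the size is arbitrary

class Entry:
    def __init__(self, name, level, link):
        self.name, self.level, self.link = name, level, link

table = [None] * TABLE_SIZE

def bucket(name):
    return sum(map(ord, name)) % TABLE_SIZE   # any simple hash will do

def insert(name, level):
    i = bucket(name)
    table[i] = Entry(name, level, table[i])   # push on front of chain

def lookup(name):
    e = table[bucket(name)]
    while e is not None and e.name != name:   # chain ends in a null link
        e = e.link
    return e

insert("x", 1)       # outer declaration of x
insert("x", 2)       # inner redeclaration shadows it
assert lookup("x").level == 2
```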
Formal procedure parameters are another example of the use of a separate search structure and symbol table. When the procedure is being processed, formal parameter names are linked into the search structure, acting as normal variables in terms of scope. When the procedure is complete, the formal parameter names are deleted from the search structure, but remain in the symbol table attached to the procedure name's symbol table entry. Thus on procedure calls, the number, type, and order of parameters can be checked against the formal declaration.
On the basis of this discussion, we see that the following operations will be needed.
Classical lexical analysis techniques can be used. The lexical analyzer needs to return the following information to the syntax analysis routine:
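Typically this information includes the token class, the lexeme itself (or a symbol table pointer or constant value), and the source position. A minimal scanner sketch along those lines (the token classes, keyword set, and regular expression are our own choices, not taken from the design):

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z][A-Za-z0-9]*)|(:=|\.\.|[^\s]))")
KEYWORDS = {"begin", "end", "var", "const", "type", "array", "of", "record"}

def tokens(text):
    """Yield (token class, lexeme, source position) triples."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            break                         # only trailing whitespace left
        num, ident, op = m.groups()
        start = m.start(m.lastindex)      # position of the lexeme itself
        if num:
            yield ("number", num, start)
        elif ident:
            kind = "keyword" if ident in KEYWORDS else "identifier"
            yield (kind, ident, start)
        else:
            yield ("operator", op, start)
        pos = m.end()

toks = list(tokens("var x: 0..10;"))
assert toks[0] == ("keyword", "var", 0)
```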
Here several analysis techniques appear to be reasonable. The general function is of course to convert the token stream of the lexical scanner into a parse of the input program. The parse should then be communicated to the next stage, semantic analysis.
The traditional approach for Pascal is top-down recursive descent, with procedures based on the syntax diagrams. This is easy, intuitive and works reasonably well.
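The flavor of recursive descent can be shown with one procedure per syntax rule, in the style of the syntax diagrams. This sketch (our own, deliberately tiny) handles only expressions built from identifiers, "+", and parentheses; a token list stands in for the lexical analyzer:

```python
def parse_expression(toks, i=0):
    """expression ::= term { "+" term }"""
    i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        i = parse_term(toks, i + 1)
    return i                       # index just past what was consumed

def parse_term(toks, i):
    """term ::= identifier | "(" expression ")" """
    if toks[i] == "(":
        i = parse_expression(toks, i + 1)
        if toks[i] != ")":
            raise SyntaxError("expected )")
        return i + 1
    if toks[i].isalpha():
        return i + 1
    raise SyntaxError("expected identifier or (")

# Consumes the whole token stream iff the input parses:
assert parse_expression(["a", "+", "(", "b", "+", "c", ")"]) == 7
```

Each procedure mirrors one diagram, which is what makes the method easy and intuitive to write by hand.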
Another approach is to use one of the bottom-up parsing methods. Pascal is not simple precedence parsable because of conflicts with certain symbols, particularly with the equal sign and constant. The relation between "=" and <constant> can be:

    const <identifier> = <constant>
    <expression> <relop> <expression> and <relop> => "=" while <expression> => <constant>
    type <identifier> = <subrange> and <subrange> => <constant> .. <constant>.
Pascal is SLR(1) or LALR(1) parsable. Such parsers can be small, fast, and machine generated.
Most of semantic analysis involves type checking and the generation of intermediate code for the later code generation and optimization phases.
Type checking can be done at two levels. The first can be restricted mainly to checking the type pointers in the symbol table fields for equality. This means

    a: array [0..1] of char;
    b: array [0..1] of char;

are of different types, since each will have a pointer to a different (unnamed) type entry in the symbol table.
The obvious improvement is to check for type compatibility, not type identity. This requires checking that types are of the same structure, and that any subtypes are compatible. If a single routine COMPATIBLE(t1,t2) returns true or false if type t1 is or is not compatible with type t2, then the complexity of this routine can be hidden within this one routine and need not affect the remainder of the compiler. This is another example of the need for information hiding in the compiler.
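A sketch of such a routine, with type entries again as dictionaries (the representation and the precise rules below are our own; the point is that they live in one place):

```python
def compatible(t1, t2):
    """True iff t1 is compatible with t2: identical entries, or the
    same structure with compatible subtypes.  All policy is local."""
    if t1 is t2:
        return True                   # identical entries trivially match
    if t1["kind"] != t2["kind"]:
        return False
    if t1["kind"] == "scalar":
        return t1["name"] == t2["name"]
    if t1["kind"] == "subrange":
        return compatible(t1["base"], t2["base"])
    if t1["kind"] == "array":
        return (compatible(t1["index"], t2["index"])
                and compatible(t1["component"], t2["component"]))
    return False                      # other kinds omitted in this sketch

char = {"kind": "scalar", "name": "char"}
integer = {"kind": "scalar", "name": "integer"}
sub = {"kind": "subrange", "base": integer, "low": 0, "high": 1}
a = {"kind": "array", "index": sub, "component": char}
b = {"kind": "array", "index": sub, "component": char}
assert a is not b and compatible(a, b)   # distinct entries, same structure
```

Whether, say, subrange bounds must also match is exactly the kind of decision hidden inside this routine, invisible to the rest of the compiler.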
A further problem with types is to provide information for operations which may have operands of differing types (e.g. adding integers, reals, or a mixture; comparing any two types). In addition, these operations must identify the correct output type. It can be quite useful to limit expression types as much as possible to improve code generation. For example, the addition of two variables in the subrange 0..10 could result in a sum which is typed as INTEGER or, in the most restrictive case, 0..20. The INTEGER sum would need to be stored in a full word register, while the 0..20 type could potentially be stored in a small register, like an index register. This could be important in the code generation for a machine like the CDC 6600 or MIX computers with long and short registers.
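The subrange arithmetic above is simple interval arithmetic; a sketch for addition (tuples of bounds, a representation of ours):

```python
def add_result_type(t1, t2):
    """Given operand subranges (low, high), return the tightest
    subrange containing every possible sum."""
    return (t1[0] + t2[0], t1[1] + t2[1])

# Two operands in 0..10 yield a sum typed 0..20, not full INTEGER:
assert add_result_type((0, 10), (0, 10)) == (0, 20)
```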
The major problem in semantic analysis is the generation of appropriate intermediate code for later code generation. Several forms have been suggested, such as trees, reverse Polish, triples, quadruples, and so on. There does not appear to be any good description of these different techniques at the practical level. However, from the discussion in Aho and Ullman, "Principles of Compiler Design", we would lean towards the use of quadruples.
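To illustrate the quadruple form (this generator and its postfix input format are our own illustration, not a prescription): each quadruple is (operator, operand1, operand2, result), with compiler-generated temporaries t1, t2, ... holding intermediate results.

```python
def gen_quads(postfix):
    """Emit quadruples from a postfix token list, e.g. a + b * c
    given as ['a', 'b', 'c', '*', '+']."""
    quads, stack, temp = [], [], 0
    for tok in postfix:
        if tok in ("+", "-", "*", "/"):
            op2, op1 = stack.pop(), stack.pop()
            temp += 1
            result = f"t{temp}"              # fresh temporary
            quads.append((tok, op1, op2, result))
            stack.append(result)
        else:
            stack.append(tok)                # operand name
    return quads

quads = gen_quads(["a", "b", "c", "*", "+"])
assert quads == [("*", "b", "c", "t1"), ("+", "a", "t1", "t2")]
```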
Code generation is the process of converting the intermediate code from the semantic analysis routines into output code, which is then formatted and output by the code emitting routines. The code generation phase must worry about such things as memory allocation, register allocation, and instruction selection.
This is just a brief statement of some thoughts on writing a Pascal compiler. Hopefully, we will shortly have a compiler which satisfies most of the design criteria listed here.