











































































































































157070
MAC TR-
A PORTABLE COMPILER FOR
THE LANGUAGE C
Alan Snyder
Reproduced by U.S. Department of Commerce, Springfield, VA 22151
Work reported herein was supported in part by the Bell Telephone Laboratories, Inc., the National Science Foundation Research Grant GJ-34671, IBM Funds for research in Computer Science and by the Advanced Research Projects Agency of the Department of Defense under ARPA order no. 2095, ARPA Contract Number N00014-70-A-0362-0006 and ONR Task No. NR-049-189.
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
PROJECT MAC
CAMBRIDGE MASSACHUSETTS 02139
MAC TR-
A PORTABLE COMPILER FOR THE LANGUAGE C
May 1975
MASSACHUSETTS INSTITUTE OF TECHNOLOGY PROJECT MAC
CAMBRIDGE MASSACHUSETTS 02139
A PORTABLE COMPILER FOR THE LANGUAGE C
by
Alan Snyder
ABSTRACT
This paper describes the implementation of a compiler for the programming language C. The compiler has been designed to be capable of producing assembly-language code for most register-oriented machines with only minor recoding. Most of the machine-dependent information used in code generation is contained in a set of tables which are constructed automatically from a machine description provided by the implementer. In the machine description, the implementer models the target machine by defining a machine-dependent abstract machine for which the code generator produces intermediate code. The abstract machine is abstract in that it is a C machine: its registers and memory are defined in terms of primitive C data types and its instructions perform basic C operations. The abstract machine is machine-dependent in that there is a close correspondence between the registers of the abstract machine and those of the target machine, and between the behavior of the abstract machine instructions and the corresponding target machine instructions or instruction sequences. The implementer defines the translation from an abstract machine program to a target machine program by providing in the machine description a set of simple macro definitions for the abstract machine instructions. In addition, macro definitions may be provided in the form of C routines where additional processing capability is needed.
FIGURE 1 The GCOS Control Cards
APPENDIX I The Machine Description
1. Definition Statements
1.1 The TYPENAMES Statement
1.2 The REGNAMES Statement
1.3 The MEMNAMES Statement
1.4 The SIZE Statement
1.5 The ALIGN Statement
1.6 The CLASS Statement
1.7 The CONFLICT Statement
1.8 The SAVEAREASIZE Statement
1.9 The POINTER Statement
1.10 The OFFSETRANGE Statement
1.11 The RETURNREG Statement
1.12 The TYPE Statement
APPENDIX II The Intermediate Language: AMOPs
APPENDIX III The Intermediate Language: Keyword Macros
APPENDIX IV The HIS-6000 Machine Description
APPENDIX V The HIS-6000 C Routine Macro Definitions
APPENDIX VI Overall Description of the Compiler
1. Introduction
This paper describes the implementation of a compiler for the programming language C [1,2], an implementation language developed at Bell Laboratories and a descendant of the language BCPL [3]. The compiler has been designed to be capable of producing assembly-language code for most register-oriented machines with only minor recoding. Versions of the compiler exist for the Honeywell HIS-6000 and Digital Equipment Corporation PDP-10 computers.
C is a procedure-oriented language. It has four primitive data types (integers, characters, and single- and double-precision floating-point), four data type constructors (pointers, arrays, functions, and records), and a small but convenient set of control structures which encourage goto-less programming. An important characteristic of C is the minimal run-time support needed. Although C supports recursive procedures, C does not have built-in functions, I/O statements, block structure, string operations, dynamic arrays, dynamic storage allocation, or run-time type checking. The only run-time data structure is the stack of procedure activation records. Of course, to run any useful programs, an interface to the operating system is required, and a standard set of I/O routines has been defined in order to encourage portability. But the implementation of these routines is optional and separate from the task of implementing a C compiler which produces code for a given machine.
The compiler described in this paper was designed to be portable, that is, to be capable of generating code for many target machines with a minimum of recoding. When considering portability, three classes of machines can be defined:
Machines for which the compiler can produce reasonably efficient code: This class of machines is clearly a subset of the first class; the size of the subset is again determined by one's definition of reasonable. The better the correspondence between the target machine and the machine model implicit in the compiler, the better will be the object code produced. On the other hand, if the correspondence is poor, the compiler may be able to produce only threaded code or instructions to be interpreted by software.
This paper concentrates on the second class of machines, those for which the compiler can produce
1.2 Background
A compiler can be considered to consist of two logical phases, analysis and generation. The analysis phase performs lexical and syntactic analysis of the source program, producing as output some convenient internal representation of the program, along with a set of tables containing lexical information and other information derived from the declarative statements of the program. The generation phase then transforms the internal representation into an object language program, using the information contained in the tables produced by the analysis phase. One can confine the machine (object language) dependencies of a compiler to the generation phase by a suitable choice of internal representation, i.e. one which is machine-independent. On the other hand, it is not practical to also confine the source language dependencies of a compiler to the analysis phase since this would make the internal representation a universal language. Thus the generation phase of a compiler is both source-language-dependent and machine-dependent.
Most portable compilers require that the generation phase be completely rewritten for each target machine [7,8]. This effort may represent only about one-fifth of the effort needed to rewrite the entire compiler [8]. In the case of the BCPL compiler [9], for example, moving the compiler may require only three to four weeks under ideal conditions (but otherwise may require up to five months). However, it would be desirable if the amount of recoding necessary to generate code for a new machine could be reduced.
One approach is that advocated by Poole and Waite for writing portable programs [10,11]. They advocate that before writing a program to solve a particular problem, one define an abstract machine for which the program is then written. With this approach, in order to move the program to a new machine, one need only implement the abstract machine on the target machine, typically via a macro processor. The desired qualities of the abstract machine are that it contain operations and data objects convenient for expressing the problem solution, that it be sufficiently close to the target machines of interest so that acceptable code can easily be generated, and that the tools for implementing the abstract machine be easily obtainable on the target machines.
This technique can be applied to portable compilers by considering the problem to be the implementation of an arbitrary source language program. The operations and data objects convenient for expressing the problem solution are then those which are basic to the source language. With this technique, a compiler would be broken into two parts: a machine-independent translator from the source language to the abstract machine language and a machine-dependent translator from the abstract machine language to the target machine language. The translator from the abstract machine language to the target machine language should be smaller and simpler than the conventional generation phase would be; typically, it consists of a set of macro definitions which map each abstract machine instruction into the corresponding target machine instruction or instruction sequence. Moving the compiler to a new machine simply requires rewriting the macro definitions.
The major difficulty with the abstract machine approach to portable software is in determining the appropriate abstract machine. If the abstract machine is of a high level (i.e., very problem-oriented), then the program will be very portable but the implementation of the abstract machine will be difficult. On the other hand, if the abstract machine is of a low level (i.e., more machine-oriented), then, unless it corresponds closely to the target machine, either the code produced will be inefficient or the implementation will be complicated by optimization code.
The solution to this difficulty proposed by Poole and Waite is to define a hierarchy of abstract machines, ranging from a high-level problem-oriented abstract machine to a low-level, machine-oriented, and easy-to-implement abstract machine. In this solution, the higher-level abstract machines are implemented in terms of the lower-level abstract machines, and only the lowest-level abstract machine need be implemented on a target machine in order to transfer the program; once it is transferred, higher-level abstract machines may be implemented directly in terms of the target machine in order to improve efficiency. While this technique may be useful for transferring particular programs, it is unlikely that it will be acceptable in practical terms as a compilation technique because of the need for additional translation steps. An experiment by Brown [12] indicates that one may implement and then optimize a low-level abstract machine in about the same time as it takes to implement a higher-level abstract machine and that the resulting implementations are similarly efficient. Thus an alternative solution is to use a low-level abstract machine, but allow the implementer to optimize as desired; this solution is more likely to be acceptable as a compilation technique. A third solution will be advocated in this paper.
The technique of rewriting the generation phase requires that a non-trivial translator from the internal representation to the target machine language be written for each new target machine. Similarly, the abstract machine approach requires that a translator from the abstract machine language to the target machine language be written for each new target machine; if reasonably efficient code is desired and the abstract machine does not correspond very closely to the target machine, then this translator will also be non-trivial.
A more desirable goal for a portable compiler is that it have a generation phase which can be modified to produce code for a new target machine by a process which is largely automatic. Implicit in this goal is the requirement that the modification process obtain its knowledge about a target machine from a (non-procedural) description of the machine. An early effort in this direction was the SLANG system [13], which attacked the problem of describing a machine-dependent process (code generation) in a machine-independent way. In the SLANG system, source language constructs are translated into a set of basic operations called EMILs; the EMILs are translated into absolute machine code using macro definitions and instruction format definitions. The approach is similar to the abstract machine approach in that the EMILs can be considered to be the instructions of an abstract machine; the difference is that the code generation algorithm uses information contained in a machine description in order to tailor the EMIL program to the target machine. The EMILs differ from the instructions of a Poole and Waite abstract machine in that they are machine-oriented, rather than problem (source-language) oriented. In addition, the code generator does not seem to know about registers other than index registers, which implies that one will not be able to achieve the desired close correspondence between the abstract machine and most register-oriented machines. Nevertheless, the method of describing the instructions of a machine by providing simple instruction sequences which interpret the abstract machine instructions seems to be a good compromise between the desire to minimize coding and the difficulty of mathematically defining a machine and utilizing such a definition in generating code.
More recently, Miller [14] has explored the problem of constructing a code generator from a machine description. Miller proposes that a generation phase be constructed in two steps. In the first step, the language designer specifies the language-dependent part of the generation phase by writing a set of procedural machine-independent macro definitions for the operations of the internal representation produced by the analysis phase. These macro definitions define the operations of the internal representation, such as addition, in terms of machine-independent (i.e., language-oriented) primitives, such as integer addition, which are created by the language designer. In the second step, the implementer provides a description of the target machine which is used by an automatic code generation system named DMACS (Descriptive Macro System) in order to fill out the macro definitions of the first step and thereby produce a code generator for the target machine. As was the case with the SLANG system, the DMACS machine description defines the primitive operations by giving target machine code sequences which interpret them. In addition, however, the permitted locations of the operands (in terms of their being in memory or in particular registers) are specified, as are the corresponding result locations. Thus the primitives can be made to correspond very closely to the instructions of the target machine so that the code sequences in the machine description are simpler and the resulting object code is more efficient.
Both the SLANG system and DMACS are intended to be general in that they are not designed for a specific source language. However, true generality is difficult to obtain and the systems do reflect preconceived notions about source languages. It is believed that, since there are much more significant variations among languages than among machines, a practical implementation of a compiler for any interesting language requires that the system be designed specifically for that language. This idea was recognized to some extent in DMACS where the primitives are created by the language designer as
A code generation algorithm, if it is to be machine-independent, requires a model of a machine with which to work. This model may express such notions as memory, registers, addressing, operations, and hardware data types. In the machine description, the implementer defines his target machine in terms of this model and also specifies the form of the object language. The class of machines for which the code generator can produce acceptable code directly corresponds to the generality of the machine model.
The machine model used by the C compiler is a C machine: a machine whose registers and memory are described in terms of the primitive C data types and whose operations are primitive C operations. The implementer models the target machine in terms of a C machine, producing an abstract machine. The abstract machine may be very similar to or very different from the target machine, depending upon how closely the target machine fits the machine model. The code generation algorithm, using its machine model, produces code for the abstract machine. The "assembly" language of the abstract machine is called the intermediate language; an intermediate language program, which is in the form of a series of macro calls, is translated into the target machine assembly language using a set of macro definitions provided by the implementer in the machine description. Assembly language was chosen over machine language for the output of the compiler because it is far easier to describe and produce in a machine-independent manner than machine code or object modules.
The abstract C machine plays the same role in the C compiler as would a Poole and Waite abstract machine. The difference is that instead of there being one fixed abstract machine, there is a class of abstract machines, corresponding to the variability in the machine model. This variability allows the implementer to define a particular abstract machine which more closely resembles his target machine. The result is that the translation from the abstract machine language to the target machine language becomes simpler, and more efficient code is produced.
The process of modeling the target machine is described in chapter two. A detailed discussion of the code generation algorithm is presented in chapter three. Conclusions are presented in chapter four.
The code generator's model of a machine is an abstract C machine, a machine whose instructions perform the primitive operations of the C language. The data types of the abstract machine are the primitive C data types (characters, integers, and single- and double-precision floating point), supplemented by one or more pointer classes which are distinguished by their ability to resolve addresses. The basic addressable unit of the abstract machine memory is the byte, which holds a single character value (characters are the smallest C data type). Values of the other abstract machine data types occupy an integral number of bytes, possibly aligned in larger units of memory. The abstract machine has a set of registers which may be used to hold the operands of the abstract machine instructions. Each abstract machine register is capable of holding values of some subset of the abstract machine data types. The instructions of the abstract machine are three-address instructions. Each address may specify an abstract machine register or a location in memory; the mechanisms for referencing a memory location correspond to the primitive addressing modes in C.
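The machine model described above can be pictured as a small set of data structures. The following is only an illustrative sketch in C, not the compiler's actual tables: the type names, the two pointer classes, and the two-register machine are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Abstract machine data types: the primitive C types plus pointer classes.
   The two pointer classes here are hypothetical examples. */
enum am_type { T_CHAR, T_INT, T_FLOAT, T_DOUBLE, T_PTR0, T_PTR1, T_COUNT };

#define TBIT(t) (1u << (t))

/* One abstract machine register: a name and the set of types it can hold. */
struct am_reg {
    const char *name;
    unsigned    types;          /* bit i set => register can hold am_type i */
};

/* A reference (REF): either a register or a named memory location. */
struct am_ref {
    bool        in_reg;
    int         reg;            /* valid when in_reg is true  */
    const char *symbol;         /* valid when in_reg is false */
};

/* A three-address instruction: operator, result, and two operands. */
struct am_instr {
    const char   *amop;
    struct am_ref result, op1, op2;
};

/* Hypothetical two-register machine, for illustration only. */
static const struct am_reg regs[] = {
    { "A", TBIT(T_CHAR) | TBIT(T_INT) | TBIT(T_PTR0) | TBIT(T_PTR1) },
    { "F", TBIT(T_FLOAT) | TBIT(T_DOUBLE) },
};
enum { NREGS = sizeof regs / sizeof regs[0] };

/* True if register r may hold a value of type t. */
bool reg_can_hold(int r, enum am_type t) {
    assert(r >= 0 && r < NREGS);
    return (regs[r].types & TBIT(t)) != 0;
}

/* Return the first register able to hold type t, or -1 if none can. */
int first_reg_for(enum am_type t) {
    for (int r = 0; r < NREGS; r++)
        if (reg_can_hold(r, t))
            return r;
    return -1;
}
```

Representing the permitted types as a bit set keeps the "some subset of the abstract machine data types" rule a single mask test per register.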
In the machine description, the implementer describes the target machine in terms of this machine model by defining a particular abstract machine for which the code generator produces intermediate code. The implementer specifies the sizes and alignments of the primitive C data types and defines pointer classes as convenient. The implementer defines the abstract machine registers, which generally correspond to those registers of the target machine which are to be used in the evaluation of expressions. The implementer also specifies the registers which may hold values of each of the abstract machine data types. In addition, the implementer may specify that any two abstract machine registers conflict in the target machine, meaning that only one may hold a value at any one time. The implementer defines the abstract machine instructions in terms of their operand/result locations and possible side-effects on other registers. In addition, the implementer provides a set of macro definitions which implement the abstract machine instructions on the target machine.
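The conflict rule can be illustrated with a small register-bookkeeping sketch. This is an assumption-laden illustration rather than the compiler's own algorithm: the three registers (two single registers and a pair that overlaps both) and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical registers: A, Q, and AQ, where AQ overlaps both A and Q. */
enum { R_A, R_Q, R_AQ, NREGS };

/* conflict[i][j] is true when registers i and j cannot both hold a value. */
static const bool conflict[NREGS][NREGS] = {
    /*          A      Q      AQ   */
    /* A  */ { false, false, true  },
    /* Q  */ { false, false, true  },
    /* AQ */ { true,  true,  false },
};

static bool busy[NREGS];        /* which registers currently hold a value */

/* A register is free if it is not busy and no conflicting register is. */
bool reg_is_free(int r) {
    assert(r >= 0 && r < NREGS);
    if (busy[r])
        return false;
    for (int other = 0; other < NREGS; other++)
        if (busy[other] && conflict[r][other])
            return false;
    return true;
}

/* Try to claim register r; returns false if it (or a conflicting one) is in use. */
bool reg_claim(int r) {
    if (!reg_is_free(r))
        return false;
    busy[r] = true;
    return true;
}

/* Mark register r as no longer holding a value. */
void reg_release(int r) {
    assert(r >= 0 && r < NREGS);
    busy[r] = false;
}
```

With this table, claiming A leaves Q available but makes AQ unavailable until A is released.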
2.1 The Intermediate Language
The intermediate language is the assembly language of the abstract machine. Using the information contained in the tables constructed from the machine description, the code generator produces a translation of the source program in the intermediate language. An intermediate language program consists of a sequence of macro calls, each of which is expanded into one or more object language statements using the macro definitions provided in the machine description. There are two types of macros in the intermediate language: The first type are macros which represent the three-address abstract machine instructions. The second type are keyword macros which correspond to either assembly-language pseudo-operations or instructions implementing the primitive C control structures.
2.1.1 Abstract Machine Instructions
The abstract machine instructions are three-address instructions which perform the evaluation of C expressions. The operators of the abstract machine instructions are called abstract machine operators (AMOPs); the addresses are called references (REFs).
2.1.1.1 AMOPs
AMOPs are basic C operations which are qualified by the specific abstract machine data types of their operands. For example, in the HIS-6000 implementation there are four AMOPs corresponding to the C operator '+':
+i    integer addition
+d    double-precision floating-point addition
+p0   addition of an integer to a pointer to a byte-aligned object
+p1   addition of an integer to a pointer to a word-aligned object
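The choice among the four addition AMOPs listed above depends on the operand types. A minimal sketch of such a selection follows; the enum, the function names, and the widening step (characters to integers, single to double precision) are assumptions made for illustration and are not taken from the compiler itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of the abstract machine data types. */
enum am_type { T_CHAR, T_INT, T_FLOAT, T_DOUBLE, T_PTR0, T_PTR1 };

/* Assumed conversion step: widen small types before choosing an AMOP. */
enum am_type widen(enum am_type t) {
    assert((int)t >= T_CHAR && (int)t <= T_PTR1);
    if (t == T_CHAR)
        return T_INT;
    if (t == T_FLOAT)
        return T_DOUBLE;
    return t;
}

/* Pick the AMOP for '+' from the (already widened) operand types.
   Returns NULL when no addition AMOP applies to this combination. */
const char *select_add_amop(enum am_type left, enum am_type right) {
    if (left == T_INT    && right == T_INT)    return "+i";
    if (left == T_DOUBLE && right == T_DOUBLE) return "+d";
    if (left == T_PTR0   && right == T_INT)    return "+p0";
    if (left == T_PTR1   && right == T_INT)    return "+p1";
    return NULL;
}
```

For example, adding a character to an integer would, under these assumptions, widen the character first and then select integer addition.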
macro     function

HEAD      produce header statements, if needed
ENTRY     define an entry point
EXTRN     define an external reference
          define an integer constant
          define a character constant
          define a floating-point constant
          define a negative floating-point constant
          define a double-precision float constant
          define a negative double-precision constant
ADCONn    define a class n pointer constant
STRCON    define a pointer referencing a string constant
EQU       define a symbol
ZERO      define an area of storage initialized to zero
STATIC    define a static variable
STRING    define the string constants
ALIGN     force an alignment of the location counter
LN        define a line-number symbol
LABCON    define a label constant
LABDEF    define an internal label
          translate an internal identifier number into the corresponding assembler symbol
          produce an end statement, if needed

          produce the prolog code of a C function
          produce the epilog code of a C function
          produce a function call
          produce code for a return statement
          produce a jump to a label expression
          produce a switch jump (list version)
          produce a switch jump (table version)
The actual macro names which appear in an intermediate language program are abbreviations of the names listed above.
2.2 The Machine Description
The machine description is a "program" written in a special-purpose language, from which the machine-dependent tables of the generation phase are constructed. The machine description has two functions: (1) it defines the particular abstract machine for which the code generator produces intermediate code, and (2) it specifies the translation from an intermediate language program to the corresponding object language program.
The abstract machine is defined in two sections of the machine description. First, a set of definition statements defines the registers and memory of the abstract machine. Second, in the OPLOC section, the AMOPs are defined in terms of their operand/result locations. The translation from the intermediate language to the object language is specified by a set of macro definitions in the macro section of the machine description. More information on the writing of a machine description may be found in Appendix I; the machine description used in the HIS-6000 implementation is listed in Appendix IV.
2.2.1 Defining the Abstract Machine
In the machine description, the implementer first defines the registers of the abstract machine. For example, the statement
representation. For example, the macro definition for '+i' (integer addition) in the HIS-6000 implementation is
+i: " AD#R #S"
If the first operand location (which is also the result location) is the A register and the second operand is an external variable "X", then the code produced by this macro definition is
ADA X
which adds the contents of "X" to the A register. A macro definition can also contain character strings whose inclusion in the expansion of a macro call is conditional upon the locations of the operands and/or result. An example is the HIS-6000 macro definition for '<<' (left shift)
<<: (.intlit,): (.-intlit,):
which produces different code sequences depending upon whether or not the second operand (the number of bit-positions to shift) is an integer constant. A macro definition may include references to the arguments of the macro call using the character sequences #0, #1, ..., #9; a macro definition may include embedded macro calls, such as the "%o(#S)" in the last example, which returns the value of the integer constant.
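The substitution step of a simple macro definition can be sketched as follows. This is only an illustration of the idea, assuming a body in which one marker stands for the register of the first operand/result and another for the second operand; the marker spellings "#R" and "#S" and the function name expand_macro are assumptions of this sketch, not a transcription of the machine description language.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expand a macro body into out. "#R" is replaced by the name of the
   result/first-operand register and "#S" by the second operand.
   The marker syntax is an assumption made for this sketch. */
void expand_macro(const char *body, const char *reg, const char *second,
                  char *out, size_t outsize) {
    assert(outsize > 0);
    size_t n = 0;
    out[0] = '\0';
    for (const char *p = body; *p != '\0'; p++) {
        char single[2] = { *p, '\0' };
        const char *piece = single;          /* default: copy one character */
        if (p[0] == '#' && p[1] == 'R') {
            piece = reg;
            p++;                             /* skip the marker letter */
        } else if (p[0] == '#' && p[1] == 'S') {
            piece = second;
            p++;
        }
        size_t len = strlen(piece);
        if (n + len >= outsize)
            break;                           /* truncate rather than overflow */
        memcpy(out + n, piece, len + 1);     /* copies the terminator too */
        n += len;
    }
}
```

Called with the body "AD#R #S", the register name "A", and the second operand "X", this yields the statement "ADA X" from the integer-addition example above.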
A macro definition may also be specified in the form of a C routine. C routine macro definitions are used when processing is needed which is beyond the capabilities of the simple macro scheme so far described. C routine macro definitions may define global variables, perform arithmetic and logical operations, and select code sequences on conditions other than operand locations. In the present implementation, however, C routine macro definitions are unable to interact with the code generation algorithm. In the HIS-6000 implementation, C routine macro definitions are used to translate REFs into GMAP symbols, to translate the source language representations of identifiers and floating-point constants into GMAP, to define character string constants, and to buffer characters while defining storage for variables (GMAP does not have a byte location counter, as is assumed in the intermediate language). The C routine macro definitions used in the HIS-6000 implementation are listed in Appendix V.
LXL5 (^) •S
3. Generating Code for an Abstract Machine
The most interesting part of the compiler is the code generator since, unlike most code generators which produce code for a fixed target language, the code generator of the C compiler is designed to produce code for a class of abstract machines.
3.1 Functions of the Code Generator
The code generation process consists of three fairly distinct functions. First, there is the generation of intermediate language statements to define and initialize static data areas and constants. Second, there is the translation of source language control structures into labels and branches. Third, there is the translation of source language expressions into sequences of abstract machine operations.
The C compiler is designed to produce assembly language code for conventional machines; thus, the intermediate language statements for defining and initializing static data areas directly correspond to assembly language statements which define symbols, define constants, and align the location counter. The only complication is that the code generator must use the size and alignment information from the machine description in order to specify the sizes and alignments of data areas. More information and redundancy could be added to the intermediate language in order to accommodate a larger class of target languages; see [16] for examples. Another possible improvement would be to emit segment-specifying instructions so that the output could be segregated into different segments according to whether it is code, pure data, impure data, or uninitialized data.
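The use of size and alignment information when laying out static data can be sketched with a byte location counter. The structure and function names below are hypothetical, and the sizes used in the usage note are examples only, not values from any machine description.

```c
#include <assert.h>
#include <stddef.h>

/* Size and alignment of a data type, both in bytes (hypothetical table entry). */
struct type_info {
    size_t size;
    size_t align;
};

/* Round offset up to the next multiple of align (align must be nonzero). */
size_t align_up(size_t offset, size_t align) {
    assert(align > 0);
    return (offset + align - 1) / align * align;
}

/* Place an object of type t at or after *counter. Returns the object's
   offset and advances the location counter past it. */
size_t place_object(size_t *counter, struct type_info t) {
    size_t at = align_up(*counter, t.align);
    *counter = at + t.size;
    return at;
}
```

For instance, with a 1-byte character followed by a 4-byte integer aligned on 4 bytes, the character lands at offset 0, the integer at offset 4, and the counter ends at 8; the gap corresponds to a forced alignment of the location counter.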
The process of translating source language control structures into labels and branches is rather straightforward. The only complications come when emitting conditional branches which test the value of an expression; these problems are covered in the next section.
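As an illustration of how a control structure reduces to labels and branches, the following sketch writes the skeleton of a while loop as text. The output notation, the label numbering scheme, and the function names are all invented for this example and do not reproduce the intermediate language of the compiler.

```c
#include <assert.h>
#include <stdio.h>

static int next_label = 1;

/* Hand out a fresh internal label number. */
int new_label(void) {
    return next_label++;
}

/* Write the label/branch skeleton of "while (cond) body" into out.
   cond and body are passed as already-formatted text. Returns the
   number of characters the full skeleton needs, as snprintf does. */
int emit_while(char *out, size_t outsize, const char *cond, const char *body) {
    assert(out != NULL && outsize > 0);
    int top  = new_label();     /* label at the loop test      */
    int done = new_label();     /* label just past the loop    */
    return snprintf(out, outsize,
                    "L%d:\n"
                    "  if not (%s) goto L%d\n"
                    "  %s\n"
                    "  goto L%d\n"
                    "L%d:\n",
                    top, cond, done, body, top, done);
}
```

The conditional branch on the loop test is the only place where expression code is involved, which matches the remark above that the complications arise only there.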
3.2 Generating Code for Expressions
The generation of code for expressions is the most difficult part of the problem. The code generator must generate a correct sequence of abstract machine instructions to carry out the indicated operations. The operand and result locations it specifies in the abstract machine instructions must conform to the location definitions provided in the machine description. Moreover, the code generator must keep track of the locations of all intermediate results and correctly administer the abstract machine registers and temporary locations.
The generation of code for expressions is performed in two steps, semantic interpretation and code generation.
3.2.1 Semantic Interpretation
The code generator receives expressions in the form of syntax trees whose interior nodes are source language operators and whose leaf nodes are identifiers and constants. Thus, an expression can be considered to consist of a "top-level" operator along with zero or more operand expressions. The first step in the processing of an expression consists of translating a tree in this form to a more descriptive form whose interior nodes are AMOPs. This translation involves checking the data types of operands, inserting conversion operators where necessary, and choosing the appropriate AMOPs to express the semantics of the source language operators. The selection of an AMOP to replace a source language operator is based primarily on the data types of the operands. For example, on this basis, an addition operator may be translated into either integer addition, double-precision floating-point addition, or one of a number of pointer addition AMOPs. However, it is useful to be able to choose AMOPs also on the basis of what is provided in the machine description. The basic idea is that of defaults. If the semantics of a particular AMOP can be expressed in terms of a composition of more basic AMOPs, then the AMOP can be left undefined in the machine description; the code generator can use the equivalent composition of AMOPs instead. The advantage of having optional AMOPs is that the implementer need define one of
HMMMgi (^) -—MMHOi