15-411 Compiler Design, Fall 2020
Lab 4
Seth and co.
Test Programs Due: 11:59 pm, Tuesday, November 3, 2020
Checkpoint Due: 11:59 pm, Thursday, November 5, 2020
Compilers Due: 11:59 pm, Thursday, November 12, 2020
1 Introduction
The goal of the lab is to implement a complete compiler for the language L4. This language extends
L3 with pointers, arrays, and structs. With the ability to store global state, you should be able to
write a wide variety of interesting programs. As always, correctness is paramount, but you should
take care to make sure your compiler runs in reasonable time.
2 L4 Syntax
The lexical specification of L4 is changed by adding ’[’, ’]’, ’.’, and ’->’ as lexical tokens; see
Figure 1. Whitespace, token delimiting, and comments are unchanged from languages L1, L2, and
L3.
The syntax of L4 is defined by the (no longer context-free!) grammar in Figure 2. Ambiguities
in this grammar are resolved according to the operator precedence table in Figure 3 and the rule
that an else provides the alternative for the most recent eligible if. Note that according to this
precedence table, *x++ parses as *(x++), which isn’t valid syntax. While we could allow *x++
because *x is unambiguously an lvalue and (*x)++ is unambiguously a statement, we disallow it.
This is both to match the precedence rules and also to avoid confusion with the different semantics
that statement has in C.
L4 syntax is not context-free
As noted above, the grammar presented for L4 is no longer context-free. Consider, for example,
the statement
foo * bar;
If foo is a type name, then this is a declaration of a foo pointer named bar. If, however, foo is
not a type name, then this is a multiplication expression used as a statement.
For those of you using parser combinator libraries, you will be able to backtrack from a parse
decision based on whether an identifier is a type name, so this case should not be a problem.

ident ::= [A-Za-z_][A-Za-z0-9_]*

num ::= 〈decnum〉 | 〈hexnum〉

〈decnum〉 ::= 0 | [1-9][0-9]*

〈hexnum〉 ::= 0[xX][0-9a-fA-F]+

〈special characters〉 ::= ! ~ - + * / % << >> < > >= <= == != & ^ | && || = += -= *= /= %= <<= >>= &= |= ^= -> . -- ++ ( | ) [ ] , ; ? :

〈reserved keywords〉 ::= struct typedef if else while for continue break return assert true false NULL alloc alloc_array int bool void char string

Terminals referenced in the grammar are in bold. Other classifiers not referenced within the

grammar are in 〈angle brackets and in italics〉. ident, 〈decnum〉, and 〈hexnum〉 are described

using regular expressions.

Figure 1: Lexical Tokens

However, backtracking raises the specter of a performance bug, so you will need to closely consider

performance.

Those of you using parser generators will have a harder time, because the decision whether to

shift or reduce might be made well ahead of when an identifier is determined to be a type name or

not. Resolving this ambiguity is a bit tricky; below, we describe two approaches.

One way to handle this is to perform an ambiguous parse: use one rule to parse both the

declaration form and the expression form. Then undo an incorrect decision during elaboration. This

approach will almost certainly involve some other adjustment to various pieces of your grammar

and lexer. Looking over grammars and parsers from past years, we cannot honestly recommend

this technique—results seem to be largely contorted with low confidence in correctness.

Another option is to prevent incorrect decisions from being made. New type identifiers are

introduced at the top level (as a 〈gdecl〉), and so the parser can update mutable state to record

type identifiers as such. With this approach, the lexer can produce different tokens for type and

non-type identifiers. This was the solution intended by the designers of C, if one considers the

relevant footnote in their book.

This solves the parsing problem, but raises another. The lexer performs a lookahead in order

to find the longest match. This affects the lexing of a token which is used immediately after it is

introduced—consider:

typedef int foo;

foo func();

In this case, if the parser parses typedef int foo; the lexer may already have lexed the foo

at the beginning of the next line, so be careful! Despite this potential issue, we recommend this

approach because the grammar will continue to look natural and follow the actual understanding

of the language syntax.
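
The second approach can be sketched as follows. This is a minimal illustration in Python, not the course's reference design: the names tokenize, parse_toy, TYPE_IDENT, IDENT, and type_names are hypothetical, and the token set is cut down to what the example needs. The lexer is written as a generator, so each identifier is classified only when the parser asks for the next token. Under that assumption the parser has already recorded foo by the time the second foo is lexed, which avoids the lookahead problem described above.

```python
import re

# Hypothetical, heavily reduced token set; a real lexer covers all of Figure 1.
TOKEN_RE = re.compile(r"(?P<ident>[A-Za-z_][A-Za-z0-9_]*)|(?P<sym>[*;(),])")
KEYWORDS = {"typedef", "int", "bool", "void", "struct"}

def tokenize(src, type_names):
    """Lazily yield (kind, text); type_names is consulted at yield time."""
    pos = 0
    while True:
        while pos < len(src) and src[pos].isspace():
            pos += 1
        if pos == len(src):
            return
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r}")
        pos = m.end()
        text = m.group(0)
        if m.group("sym"):
            yield ("SYM", text)
        elif text in KEYWORDS:
            yield ("KEYWORD", text)
        elif text in type_names:
            yield ("TYPE_IDENT", text)   # identifier already declared as a type
        else:
            yield ("IDENT", text)

def parse_toy(src):
    """Stand-in for a parser: records the new name when a typedef ends."""
    type_names = set()
    seen, pending, in_typedef = [], None, False
    for kind, text in tokenize(src, type_names):
        seen.append((kind, text))
        if (kind, text) == ("KEYWORD", "typedef"):
            in_typedef = True
        elif in_typedef and kind == "IDENT":
            pending = text
        elif in_typedef and text == ";":
            type_names.add(pending)      # runs before the next token is lexed
            in_typedef = False
    return seen

tokens = parse_toy("typedef int foo; foo * bar;")
```

With this input the first foo is reported as IDENT and the second as TYPE_IDENT, so foo * bar; can be parsed as a declaration. A lexer that buffers tokens ahead of the parser would need extra care to get the same behavior.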

Operator                            Associates  Meaning
() [] -> .                          n/a         explicit parentheses, array subscript, field dereference, field select
! ~ - * ++ --                       right       logical not, bitwise not, unary minus, pointer dereference, increment, decrement
* / %                               left        integer times, divide, modulo
+ -                                 left        integer plus, minus
<< >>                               left        (arithmetic) shift left, right
< <= > >=                           left        integer comparison
== !=                               left        overloaded equality, disequality
&                                   left        bitwise and
^                                   left        bitwise exclusive or
|                                   left        bitwise or
&&                                  left        logical and
||                                  left        logical or
? :                                 right       conditional expression
= += -= *= /= %= &= ^= |= <<= >>=   right       assignment operators

Figure 3: Precedence of operators, from highest to lowest

3 L4 Semantics

The static and dynamic semantics for Lab 4 are described in the lecture notes discussing static

semantics, dynamic semantics, mutable store, and structs. Some specific points:

  • Elaboration will need to deal with the fact that A[f(x)] += 3 cannot be elaborated into

assign(A[f(x)], A[f(x)] + 3), because calling f(x) might have an effect, like printing or writing

to a pointer.

  • You will have to preserve at least some size information during elaboration to generate correct

code. We suggest you refer to the current (third) edition of Bryant and O’Hallaron’s

Computer Systems: A Programmer’s Perspective, or to the current semester’s 15-213 notes.

Pay particular attention to the effects of 32-bit operators on the upper 32 bits of the 64-bit

registers.

  • Struct definitions obey scoping rules of the other global declarations, that is, they are available

only after their point of definition. However, structs may be declared implicitly; see Section

2 of the notes.

  • The rules for struct declarations and definitions, mostly inherited from C, are carefully engineered

so that it should be possible to compute the size and field offsets of each struct without

referring to anything found later in the file. You probably want to store the sizes and field

offsets in global tables.

  • Like type definitions, struct declarations and definitions can appear in external files.
  • We expect your generated code to explicitly capture memory errors, rather than counting on

the operating system to notice and raise SIGSEGV (11). In order to enforce that, the signal

associated with memory errors will be SIGUSR2 (12). You can raise this signal explicitly with

the standard raise(sig) function.
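
The first point above, about A[f(x)] += 3, can be illustrated with a small sketch. It assumes a hypothetical mini-AST (Var, Num, Call, Index, BinOp, Assign) and a hypothetical helper elab_asnop; none of these names come from the lecture notes. The idea is to evaluate the array and index expressions once into fresh temporaries and then read and write through those. Bounds checks, null checks, and the exact evaluation order required by the dynamic semantics are not modeled here.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical mini-AST; a real compiler's elaborated tree will differ.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Num:
    value: int

@dataclass(frozen=True)
class Call:
    func: str
    args: tuple

@dataclass(frozen=True)
class Index:
    array: object
    index: object

@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object

@dataclass(frozen=True)
class Assign:
    target: object
    value: object

def elab_asnop(target, op, rhs, fresh=None):
    """Elaborate `target op= rhs` so the target's subexpressions run once."""
    fresh = fresh if fresh is not None else count()
    if isinstance(target, Var):
        # A plain variable has no effects, so duplicating it is harmless.
        return [Assign(target, BinOp(op, target, rhs))]
    if isinstance(target, Index):
        # Temp names are assumed not to clash with user identifiers.
        arr = Var(f"_t{next(fresh)}")
        idx = Var(f"_t{next(fresh)}")
        cell = Index(arr, idx)
        return [
            Assign(arr, target.array),   # evaluate the array expression once
            Assign(idx, target.index),   # evaluate the index (e.g. f(x)) once
            Assign(cell, BinOp(op, cell, rhs)),
        ]
    raise NotImplementedError("pointer and field targets are omitted here")

stmts = elab_asnop(Index(Var("A"), Call("f", (Var("x"),))), "+", Num(3))
```

The result is three assignments in which the call to f appears exactly once, unlike the naive assign(A[f(x)], A[f(x)] + 3).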

The default library for this lab, 15411-l4.h0, is a modification of the previous library that

uses 8-byte floating point values stored in pointers rather than 4-byte floating point values stored

in integers. We encourage you to use this library in some of your test cases!

For this lab, you do not need to lay out structs in a way that is compatible with C, but you

are encouraged to do so. You should respect the machine’s alignment requirements so that integers

and booleans are aligned at least 0 modulo 4 and addresses at least 0 modulo 8, but beyond that,

we will not require strict adherence. One reason for this flexibility is that we allow you to represent

structs and arrays containing boolean values however you want. Here is what we will potentially

test:

  • Integers must be stored in memory as 4 contiguous bytes (little-endian, as usual on x86-64)
  • Pointers must be stored in memory as 8 contiguous bytes (little-endian, as usual on x86-64)
  • Structs and arrays which contain only ints (and other structs which contain only ints, and so

on) must store the ints contiguously, in order, where a, the address of the struct or value of

the array, is the address of the first struct field or array element, a + 4 is the address of the

second struct field or array element, and so on.

  • Ditto for structs and arrays which contain only pointers, except that a + 8 is the address of

the second struct field or array element.
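
One layout that satisfies these constraints can be computed as in the sketch below. It is only an illustration under stated assumptions: the helper names align_up and layout and the type tags 'int', 'bool', and 'ptr' are hypothetical, and booleans are assumed to occupy 4 bytes even though the handout leaves their representation open. Each struct is laid out using only structs already present in the table, which matches the rule that sizes and offsets never depend on anything later in the file.

```python
# (size, alignment) in bytes; the 4-byte bool is an assumption of this sketch.
PRIMITIVES = {"int": (4, 4), "bool": (4, 4), "ptr": (8, 8)}

def align_up(n, a):
    """Round n up to the next multiple of a."""
    return (n + a - 1) // a * a

def layout(fields, structs):
    """Return (size, align, offsets) for a struct.

    fields  -- list of (field_name, type_name) in declaration order
    structs -- table of previously laid-out structs: name -> (size, align, offsets)
    """
    offset, max_align, offsets = 0, 1, {}
    for name, ty in fields:
        size, align = PRIMITIVES[ty] if ty in PRIMITIVES else structs[ty][:2]
        offset = align_up(offset, align)   # pad so the field is aligned
        offsets[name] = offset
        offset += size
        max_align = max(max_align, align)
    # Pad the total size so that array elements stay aligned.
    return align_up(offset, max_align), max_align, offsets

structs = {}
structs["pair"] = layout([("a", "int"), ("b", "int")], structs)
structs["node"] = layout([("x", "int"), ("next", "ptr"), ("y", "int")], structs)
structs["wrap"] = layout([("h", "int"), ("p", "pair")], structs)
```

Here pair gets offsets 0 and 4, and wrap keeps its three ints at 0, 4, and 8, as the int-only rule requires. For node, the pointer is padded out to offset 8 and the total size is rounded up to 24.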

4 Project Requirements

For this project, you are required to hand in test cases and a complete working compiler for L4

that produces correct target programs written in Intel x86-64 assembly language.

We also require that you document your code. Documentation includes both inline documentation

and a README document which explains the design decisions underlying the implementation

along with the general layout of the sources. If you use publicly available libraries, you are required

to indicate their use and source in the README file. If you are unsure whether it is appropriate

to use external code, please discuss it with course staff.

When we grade your work, we will use the gcc compiler to assemble and link the code you

generate into executables using the provided runtime environment on the lab machines.

Your compiler and test programs must be formatted and handed in as specified below. For this

project, you must also write and hand in at least 20 test programs, at least two of which must fail

to compile, at least two of which must generate a runtime error, and at least two of which must

execute correctly and return a value.

The command

% make lab

should generate the appropriate files so that

% bin/c0c

will run your L4 compiler. The command

% make clean

should remove all binaries, heaps, and other generated files.

Runtime Environment

Your compiler should accept a single, optional command line argument -l which must be given the

name of a file as an argument. For instance, we will be calling your compiler using the following

command: bin/c0c -l ../runtime/15411-l4.h0 $test.l4. Here, 15411-l4.h0 is the header

file mentioned above. You may not assume that the header file parses and typechecks correctly.

The 15411-l4.h0 header file describes a library for manipulating double-precision floating-point

values. The implementation of this library can be found in lab4/runtime/run411.c. You should

assume the types of the implementation match the types in the header file.

The GNU compiler and linker will be used to link your assembly to the implementations of the

external functions, so you need not worry much about the details of calling to external functions.

You should ensure that the code you generate adheres to the C ABI for Linux on x86-64. As a

reminder from lab 3, in order for the linking to work, you must adhere to the following conventions:

  • External functions must be called as named.
  • Non-external functions with name name must be called _c0_name. This ensures that non-

external function names do not accidentally conflict with names from the standard library, which

could cause assembly or linking to fail.

  • Non-external functions must be exported from (declared to be global in) the assembly file you

generate, so that our test harness can call them and verify your adherence to the ABI.

  • You may notice that the functions c0_alloc and c0_alloc_array are implemented in run411.c.

This is because run411.c is a modified version of the c0 reference runtime, which needs these

functions since the reference compiler targets C, rather than assembly. Please do not use

these functions. You must implement alloc and alloc_array yourself. In practice, this

means you should be calling calloc in your generated assembly.
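
The naming conventions above amount to a small mapping from source names to assembly symbols. The sketch below is one way to express it, with hypothetical helper names asm_symbol and export_directive and a hypothetical external function ext_fn; the set of external names is assumed to come from the header file passed with -l. The .globl directive is the GNU assembler's way of exporting a symbol.

```python
def asm_symbol(name, external_names):
    """External functions keep their name; all others get the _c0_ prefix."""
    return name if name in external_names else "_c0_" + name

def export_directive(name, external_names):
    """Return the directive that exports a non-external function, else None."""
    if name in external_names:
        return None   # defined in the runtime, nothing to export
    return ".globl " + asm_symbol(name, external_names)

externals = {"ext_fn"}   # hypothetical name declared in the header file
main_symbol = asm_symbol("main", externals)
```

With these definitions main becomes _c0_main, which is the symbol the runtime's main() expects, while ext_fn is called under its own name.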

The runtime environment defines a function main() which calls a function _c0_main() your

assembly code should provide and export. Your compiler will be tested in the standard Linux

environment on the lab machines; the produced assembly must conform to this environment.

What to Turn In

You may turn in code and have it autograded as many times as you like, without penalty. In fact,

we encourage you to hand in to verify that the autograder agrees with the driver results that you

use for development, and also as insurance against a last-minute rush. The submission with the

highest grade will count.

You will submit:

Before Tuesday, November 3, 11:59 pm At least 20 test cases, at least two of which generate

an error, at least two of which raise a runtime exception, and at least two of which return

a value. You will submit to the Test 4 assessment on Notolab. The directory tests should

only contain your test files. The autograder will test your test files and notify you if there is

a discrepancy between your answer and the outcome of the reference implementation. If you

feel the reference implementation is in error, please notify the instructors.

Before Thursday, November 5, 11:59 pm A compiler which can typecheck the language of

lab 4. You will submit to the Lab 4 Checkpoint assessment on Notolab, containing the

same files as the full Lab 4 (see below). The autograder will build your compiler, run it on all

existing test files using the “-t” switch, and log which files successfully compiled and which

did not. If a test fails to compile unexpectedly (or the reverse), we will mark the test as

failed. The checkpoint will grade typechecking only.

Before Thursday, November 12, 11:59 pm The complete compiler. You will submit to the

Lab 4 assessment on Notolab. The directory compiler/lab4 should contain only the sources

for your compiler and be submitted as described above. The autograder will build your

compiler, run it on all existing test files, link the resulting assembly files against our runtime

system (if compilation is successful), execute the binaries (each with a 6 second time limit),

and finally compare the actual with the expected results.

Please note that similar to previous labs, we will count your highest submission across all

submissions. Also, any submission past 11:59 pm on the due dates for either the tests or the

compiler will result in the usage of late days.

Checkpoint Scoring

Your compiler will be graded against the test cases, and your score is computed as follows. Note

that the checkpoint will only see whether a given test compiles successfully (or not), and will mark

tests as passed/failed accordingly.

20 * (% passed of new) +

60 * (% passed of large & only & basic)

= checkpoint subtotal

penalty = min(40, (1 point per failure) + (0.5 points per timeout))

checkpoint total = checkpoint subtotal - penalty

Lab Scoring

Your compiler will be graded against the test cases, and your score is computed as follows.

20 * (% passed of new) +

60 * (% passed of large & only & basic)

= lab subtotal

penalty = min(40, (1 point per failure) + (0.5 points per timeout))

lab total = lab subtotal - penalty
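
As a worked example of the scoring formula, the sketch below evaluates it for made-up numbers. The function name lab_total is hypothetical, pass rates are taken as fractions between 0 and 1, and failures and timeouts are assumed to be counted separately; the handout does not spell out those details.

```python
def lab_total(new_passed, other_passed, failures, timeouts):
    """Apply the scoring formula; pass rates are fractions in [0, 1]."""
    subtotal = 20 * new_passed + 60 * other_passed
    penalty = min(40, 1.0 * failures + 0.5 * timeouts)   # capped at 40 points
    return subtotal - penalty

# Made-up inputs: half of the new tests pass, all of the others pass,
# with 3 failures and 2 timeouts.
example = lab_total(0.5, 1.0, 3, 2)
```

For these inputs the subtotal is 10 + 60 = 70 and the penalty is 3 + 1 = 4, giving a total of 66. The same function applies to the checkpoint, whose formula is identical.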