Generating Compilers with Coco/R
1. Compilers
2. Grammars
3. Coco/R Overview
4. Scanner Specification
5. Parser Specification
6. Error Handling
7. LL(1) Conflicts
8. Case Study
Compilation Phases
character stream: val = 10 * val + i
lexical analysis (scanning)
token stream (each token consists of a token number and a token value):
- (ident) "val"
- (assign)
- (number) 10
- (times)
- (ident) "val"
- (plus)
- (ident) "i"
syntax analysis (parsing)
syntax tree: built over ident = number * ident + ident, with the inner nodes Term, Expression and Statement
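The scanning phase above can be sketched in a few lines of Python. This is only an illustration and not what Coco/R generates: the token names follow the slide, while the regular expressions, the `scan` function and the use of `None` for tokens without a value are assumptions made for the example.

```python
import re

# Token classes named on the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("ident",  r"[A-Za-z][A-Za-z0-9]*"),
    ("assign", r"="),
    ("times",  r"\*"),
    ("plus",   r"\+"),
    ("skip",   r"\s+"),          # white space separates tokens and is dropped
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Turn a character stream into a list of (kind, value) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        kind = m.lastgroup
        if kind != "skip":
            # Only token classes (ident, number) carry a value.
            value = m.group() if kind in ("ident", "number") else None
            tokens.append((kind, value))
        pos = m.end()
    return tokens


for token in scan("val = 10 * val + i"):
    print(token)
```

Each printed pair corresponds to one entry of the token stream listed above: the kind plays the role of the token number, the second component is the token value.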
Structure of a Compiler
- parser & sem. processing: the "main program"; directs the whole compilation
- scanner: provides tokens from the source code
- symbol table: maintains information about declared names and types
- code generation: generates machine code
Arrows in the diagram mark "uses" relationships and data flow.
EBNF Notation
Extended Backus-Naur form for writing grammars
- John Backus: developed the first Fortran compiler
- Peter Naur: edited the Algol60 report
Productions
    Statement = "write" ident "," Expression ";".
- Statement is the left-hand side; everything after "=" is the right-hand side
- "write", "," and ";" are literals
- ident is a terminal symbol
- Expression is a nonterminal symbol
- the final "." terminates a production
Metasymbols
    |        separates alternatives    a | b | c    means  a or b or c
    (...)    groups alternatives       a (b | c)    means  ab | ac
    [...]    optional part             [a] b        means  ab | b
    {...}    iterative part            {a} b        means  b | ab | aab | aaab | ...
By convention
- terminal symbols start with lower-case letters
- nonterminal symbols start with upper-case letters
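As a rough illustration of what the metasymbols mean, the following Python sketch recognizes the three example patterns over single-character symbols. The function names are hypothetical; the point is only that an optional part behaves like an `if`, an iterative part like a `while`, and a group of alternatives like a choice.

```python
def match_optional_a_then_b(s):
    """Recognize  [a] b : an optional 'a' followed by 'b'."""
    i = 0
    if i < len(s) and s[i] == "a":      # [a] behaves like an if
        i += 1
    return s[i:] == "b"


def match_repeated_a_then_b(s):
    """Recognize  {a} b : zero or more 'a' followed by 'b'."""
    i = 0
    while i < len(s) and s[i] == "a":   # {a} behaves like a while loop
        i += 1
    return s[i:] == "b"


def match_a_then_b_or_c(s):
    """Recognize  a (b | c) : 'a' followed by either 'b' or 'c'."""
    return len(s) == 2 and s[0] == "a" and s[1] in ("b", "c")


print(match_optional_a_then_b("ab"), match_optional_a_then_b("aab"))
print(match_repeated_a_then_b("aaab"), match_repeated_a_then_b("ba"))
print(match_a_then_b_or_c("ac"), match_a_then_b_or_c("a"))
```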
Example: Grammar for Arithmetic Expressions
Productions
    Expr = ["+" | "-"] Term {("+" | "-") Term}.
    Term = Factor {("*" | "/") Factor}.
    Factor = ident | number | "(" Expr ")".
Terminal symbols
- simple TS (just 1 instance): "+", "-", "*", "/", "(", ")"
- terminal classes (multiple instances): ident, number
Nonterminal symbols
    Expr, Term, Factor
Start symbol
    Expr
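The grammar above maps naturally onto a hand-written recursive descent parser with one method per nonterminal. The Python sketch below is such a parser that also evaluates the expression. Everything beyond the three productions is an assumption for illustration: the regular-expression tokenizer, the `env` dictionary that supplies values for identifiers, and the use of integer division for "/".

```python
import re

# Hypothetical tokenizer: numbers, identifiers and the six simple terminal symbols.
TOKEN = re.compile(r"\s*(?:(?P<number>\d+)|(?P<ident>[A-Za-z]\w*)|(?P<op>[-+*/()]))")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        kind = m.lastgroup
        value = m.group(kind)
        if kind == "op":
            kind = value                 # simple terminals are their own kind
        tokens.append((kind, value))
        pos = m.end()
    tokens.append(("eof", ""))
    return tokens


class Parser:
    def __init__(self, text, env=None):
        self.tokens = tokenize(text)
        self.pos = 0
        self.env = env or {}             # assumed source of identifier values

    @property
    def la(self):                        # lookahead token
        return self.tokens[self.pos]

    def get(self):
        tok = self.la
        if tok[0] != "eof":
            self.pos += 1
        return tok

    def expect(self, kind):
        if self.la[0] != kind:
            raise SyntaxError(f"{kind} expected, found {self.la[0]}")
        return self.get()

    def expr(self):                      # Expr = ["+" | "-"] Term {("+" | "-") Term}.
        sign = 1
        if self.la[0] in ("+", "-"):
            if self.get()[0] == "-":
                sign = -1
        value = sign * self.term()
        while self.la[0] in ("+", "-"):
            op = self.get()[0]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                      # Term = Factor {("*" | "/") Factor}.
        value = self.factor()
        while self.la[0] in ("*", "/"):
            op = self.get()[0]
            rhs = self.factor()
            value = value * rhs if op == "*" else value // rhs
        return value

    def factor(self):                    # Factor = ident | number | "(" Expr ")".
        kind, text = self.la
        if kind == "number":
            self.get()
            return int(text)
        if kind == "ident":
            self.get()
            return self.env[text]
        if kind == "(":
            self.get()
            value = self.expr()
            self.expect(")")
            return value
        raise SyntaxError("ident, number or '(' expected")

    def parse(self):
        value = self.expr()              # Expr is the start symbol
        self.expect("eof")
        return value


print(Parser("1 + 2 * 3").parse())
print(Parser("val * 10 + i", {"val": 2, "i": 5}).parse())
```

The optional sign becomes an `if`, each iteration `{...}` becomes a `while`, and the alternatives of Factor become an `if` chain driven by the lookahead token.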
Coco/R - Compiler Compiler / Recursive Descent
Facts
- Generates a scanner and a parser from an attributed grammar
  - scanner as a deterministic finite automaton (DFA)
  - recursive descent parser
- Developed at the University of Linz (Austria)
- There are versions for C#, Java, C/C++, VB.NET, Delphi, Modula-2, Oberon, ...
- GNU GPL open source: http://ssw.jku.at/Coco/
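How it works
- Coco/R reads the attributed grammar and generates the scanner and the parser
- the generated scanner and parser are compiled with csc, together with main and the user-supplied classes (e.g. symbol table)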
A Very Simple Example
Assume that we want to parse one of the following two alternatives
    red apple
    orange
We write a grammar ...
    Sample = "red" "apple" | "orange".
and embed it into a Coco/R compiler description (file Sample.atg)
    COMPILER Sample
    PRODUCTIONS
        Sample = "red" "apple" | "orange".
    END Sample.
We invoke Coco/R to generate a scanner and a parser
    >coco Sample.atg
    Coco/R (Aug 22, 2006)
    checking
    parser + scanner generated
    0 errors detected
Generated Parser
Grammar
    Sample = "red" "apple" | "orange".
Parser (the constants 1, 2 and 3 are token codes returned by the scanner)
    class Parser {
        ...
        void Sample () {
            if (la.kind == 1) {
                Get();
                Expect(2);
            } else if (la.kind == 3) {
                Get();
            } else SynErr(5);
        }
        ...
        Token la;   // lookahead token
        void Get () {
            la = Scanner.Scan();
            ...
        }
        void Expect (int n) {
            if (la.kind == n) Get(); else SynErr(n);
        }
        public void Parse () {
            Get();
            Sample();
        }
        ...
    }
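To see the control flow of this parser in action without running Coco/R, here is a small Python re-creation. It is a sketch, not generated code: the token codes 1, 2 and 3 for "red", "apple" and "orange" are inferred from the parser above, code 0 for end of input and the final `expect(EOF)` are assumptions added so that trailing input is rejected, and the word-splitting "scanner" is purely illustrative.

```python
# Token codes: 1..3 inferred from the generated parser, 0 for end of input is assumed.
EOF, RED, APPLE, ORANGE = 0, 1, 2, 3
KINDS = {"red": RED, "apple": APPLE, "orange": ORANGE}


class SampleParser:
    def __init__(self, text):
        self.words = iter(text.split())   # stand-in for the generated scanner
        self.la = None                    # lookahead token kind
        self.errors = 0

    def get(self):
        word = next(self.words, None)
        self.la = EOF if word is None else KINDS.get(word, -1)

    def syn_err(self, n):
        self.errors += 1                  # the real parser would report error n

    def expect(self, n):
        if self.la == n:
            self.get()
        else:
            self.syn_err(n)

    def sample(self):                     # Sample = "red" "apple" | "orange".
        if self.la == RED:
            self.get()
            self.expect(APPLE)
        elif self.la == ORANGE:
            self.get()
        else:
            self.syn_err(5)

    def parse(self):
        self.get()
        self.sample()
        self.expect(EOF)                  # assumption: reject trailing input
        return self.errors == 0


for text in ("red apple", "orange", "red orange"):
    print(text, "->", SampleParser(text).parse())
```

The lookahead token alone decides which alternative of Sample is taken, which is what makes the grammar suitable for recursive descent.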
A Slightly Larger Example
Parse simple arithmetic expressions
    calc 34 + 2 + 5
    calc 2 + 10 + 123 + 3
Coco/R compiler description (file Sample.atg)
    COMPILER Sample
    CHARACTERS
        digit = '0'..'9'.
    TOKENS
        number = digit {digit}.
    IGNORE '\r' + '\n'
    PRODUCTIONS
        Sample = {"calc" Expr}.
        Expr = Term {'+' Term}.
        Term = number.
    END Sample.
The generated scanner and parser will check the syntactic correctness of the input
    >coco Sample.atg
    >csc Compile.cs Scanner.cs Parser.cs
    >Compile Input.txt
Generated Parser
The productions now carry attributes and semantic actions, for example
    Sample                    (. int n; .)
    = { "calc" Expr<out n>    (. Console.WriteLine(n); .)
      }.
    ...
From this Coco/R generates
    class Parser {
        ...
        void Sample () {
            int n;
            while (la.kind == 2) {
                Get();
                Expr(out n);
                Console.WriteLine(n);
            }
        }
        void Expr (out int n) {
            int n1;
            Term(out n);
            while (la.kind == 3) {
                Get();
                Term(out n1);
                n = n + n1;
            }
        }
        void Term (out int n) {
            Expect(1);
            n = Convert.ToInt32(t.val);
        }
        ...
    }
Token codes
    1 ... number
    2 ... "calc"
    3 ... '+'
Building and running the compiler
    >coco Sample.atg
    >csc Compile.cs Scanner.cs Parser.cs
    >Compile Input.txt
With the input
    calc 1 + 2 + 3
    calc 100 + 10 + 1
Compile prints the sums 6 and 111.
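The effect of the semantic actions can be tried out with a Python sketch that mirrors the generated methods one to one. It is not Coco/R output: C# `out` parameters are modelled as return values, the regular-expression scanner and the end-of-input code 0 are assumptions, and results are collected in a list instead of being written to the console.

```python
import re

# Token codes from the slide; 0 for end of input is an assumption.
EOF, NUMBER, CALC, PLUS = 0, 1, 2, 3


def scan(text):
    """Illustrative scanner: returns a list of (kind, val) tokens."""
    kinds = {1: NUMBER, 2: CALC, 3: PLUS, 4: -1}   # group 4 catches unknown characters
    tokens = [(kinds[m.lastindex], m.group())
              for m in re.finditer(r"(\d+)|(calc)|(\+)|(\S)", text)]
    tokens.append((EOF, ""))
    return tokens


class CalcParser:
    def __init__(self, text):
        self.tokens = scan(text)
        self.pos = 0
        self.t = None                # most recently recognized token
        self.results = []            # stands in for Console.WriteLine

    @property
    def la(self):                    # lookahead token
        return self.tokens[self.pos]

    def get(self):
        self.t = self.la
        if self.t[0] != EOF:
            self.pos += 1

    def expect(self, kind):
        if self.la[0] == kind:
            self.get()
        else:
            raise SyntaxError(f"token {kind} expected, found {self.la[1]!r}")

    def sample(self):                # Sample = {"calc" Expr}.
        while self.la[0] == CALC:
            self.get()
            n = self.expr()
            self.results.append(n)   # semantic action of Sample

    def expr(self):                  # Expr = Term {'+' Term}.
        n = self.term()
        while self.la[0] == PLUS:
            self.get()
            n = n + self.term()
        return n

    def term(self):                  # Term = number.
        self.expect(NUMBER)
        return int(self.t[1])

    def parse(self):
        self.sample()
        self.expect(EOF)             # assumption: reject trailing input
        return self.results


print(CalcParser("calc 1 + 2 + 3\ncalc 100 + 10 + 1").parse())
```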
Structure of a Compiler Description
    [UsingClauses]              e.g. using System; using System.Collections;
    "COMPILER" ident
    [GlobalFieldsAndMethods]    e.g. int sum; void Add(int x) { sum = sum + x; }
    ScannerSpecification
    ParserSpecification
    "END" ident "."
ident denotes the start symbol of the grammar (i.e. the topmost nonterminal symbol)
Structure of a Scanner Specification
ScannerSpecification =
    ["IGNORECASE"]              Should the generated compiler be case-sensitive?
    ["CHARACTERS" {SetDecl}]    Which character sets are used in the token declarations?
    ["TOKENS" {TokenDecl}]      Here one has to declare all structured tokens (i.e. terminal symbols) of the grammar
    ["PRAGMAS" {PragmaDecl}]    Pragmas are tokens which are not part of the grammar
    {CommentDecl}               Here one can declare one or several kinds of comments for the language to be compiled
    {WhiteSpaceDecl}.           Which characters should be ignored (e.g. \t, \n, \r)?
Character Sets
Example
    CHARACTERS
        digit = "0123456789".           the set of all digits
        hexDigit = digit + "ABCDEF".    the set of all hexadecimal digits
        letter = 'A' .. 'Z'.            the set of all upper-case letters
        eol = '\r'.                     the end-of-line character
        noDigit = ANY - digit.          any character that is not a digit
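The set operations used in these declarations behave like ordinary set union and difference. The following Python sketch models the five declarations with built-in sets. Treating ANY as all 16-bit character codes is an assumption made for the illustration, and the variable names are simply the declared names in Python style.

```python
import string

digit = set("0123456789")                       # digit = "0123456789".
hex_digit = digit | set("ABCDEF")               # hexDigit = digit + "ABCDEF".
letter = set(string.ascii_uppercase)            # letter = 'A' .. 'Z'.
eol = {"\r"}                                    # eol = '\r'.

# Assumption: ANY is modelled as every character code from 0 to 0xFFFF.
ANY = {chr(code) for code in range(0x10000)}
no_digit = ANY - digit                          # noDigit = ANY - digit.

print(sorted(hex_digit))
print("F" in hex_digit, "f" in hex_digit)
print("x" in no_digit, "5" in no_digit)
```

Note that hexDigit as declared contains only the upper-case letters A to F, so a lower-case "f" is not a member.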
Valid escape sequences in character constants and strings
    \\    backslash         \r    carriage return    \f        form feed
    \'    apostrophe        \n    new line           \a        bell
    \"    quote             \t    horizontal tab     \b        backspace
    \0    null character    \v    vertical tab       \uxxxx    hex character value