Semantic Actions
Arbitrary C# code between (. and .)
IdentList (. int n; .) = ident (. n = 1; .) { ',' ident (. n++; .) } (. Console.WriteLine(n); .) .
local semantic declaration
semantic action
Semantic actions are copied to the generated parser without being checked by Coco/R
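To make the copying concrete, here is a hand-written sketch of roughly what a recursive-descent method for IdentList could look like once the semantic actions have been pasted in. The MiniParser class, its token kinds, and the Get/Expect helpers are simplified stand-ins invented for this example; they are not the code Coco/R actually emits.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a generated parser (not real Coco/R output).
public class MiniParser {
    public const int EOF = 0, Ident = 1, Comma = 2;   // assumed token kinds
    readonly IList<int> tokens;   // token kinds delivered by an imaginary scanner
    int pos = 0;
    int la;                       // kind of the lookahead token
    public int ErrorCount = 0;

    public MiniParser(IList<int> tokens) { this.tokens = tokens; Get(); }

    // Advance: the next token becomes the lookahead token.
    void Get() { la = pos < tokens.Count ? tokens[pos++] : EOF; }

    // Consume the expected token or report a syntax error.
    void Expect(int kind) {
        if (la == kind) Get();
        else { ErrorCount++; Console.WriteLine("token kind " + kind + " expected"); }
    }

    // IdentList (. int n; .) = ident (. n = 1; .) { ',' ident (. n++; .) } (. Console.WriteLine(n); .) .
    public int IdentList() {
        int n;                    // local semantic declaration
        Expect(Ident);
        n = 1;                    // semantic action
        while (la == Comma) {     // iteration { ',' ident }
            Get();
            Expect(Ident);
            n++;                  // semantic action
        }
        Console.WriteLine(n);     // semantic action
        return n;                 // returned only so the sketch is easy to test
    }
}

public static class Demo {
    public static void Main() {
        // token stream for: ident ',' ident ',' ident
        var p = new MiniParser(new[] {
            MiniParser.Ident, MiniParser.Comma, MiniParser.Ident,
            MiniParser.Comma, MiniParser.Ident });
        p.IdentList();
    }
}
```

The three statements marked as semantic actions appear in the method exactly as they were written between (. and .), which is why a typo in an action only shows up when the generated parser is compiled.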
Global semantic declarations
using System.IO;
COMPILER Sample
  Stream s;
  void OpenStream(string path) {
    s = File.OpenRead(path);
    ...
  }
  ...
PRODUCTIONS
  Sample = ... (. OpenStream("in.txt"); .) ...
END Sample.
- import of namespaces (using System.IO;)
- global semantic declarations (become fields and methods of the parser)
- semantic actions can access global declarations as well as imported classes
Attributes
For nonterminal symbols
output attributes
pass results of a production to the "caller"
Attributes are written in angle brackets after the symbol name.
actual attributes              formal attributes
... = ... Expr<...> ...        Expr<...> = ...
... = ... List<...> ...        List<...> = ...
For terminal symbols
no explicit attributes;
values are returned
by the scanner
adapter nonterminals necessary
Number<out int n> = number (. n = Convert.ToInt32(t.val); .) .
Ident<out string name> = ident (. name = t.val; .) .
Parser has two global token variables
Token t;   // most recently recognized token
Token la;  // lookahead token (not yet recognized)
input attributes
pass values from the "caller" to a production
... = ... IdentList<...> ...   IdentList<...> = ...
Frame Files
[Figure: Coco/R reads Sample.atg (scanner spec and parser spec) together with Scanner.frame and Parser.frame, and generates Scanner.cs and Parser.cs]
Scanner.frame snippet
public class Scanner {
  const char EOL = '\n';
  const int eofSym = 0;
-->declarations
  ...
  public Scanner (Stream s) {
    buffer = new Buffer(s, true);
    Init();
  }
  void Init () {
    pos = -1; line = 1; …
-->initialization
    ...
  }
- Coco/R inserts generated parts at positions marked by "-->..."
- Users can edit the frame files for adapting the generated scanner and parser to their needs
- Frame files are expected to be in the same directory as the compiler specification (e.g. Sample.atg)
Interface of the Generated Parser
public class Parser {
  public Scanner scanner;  // the scanner of this parser
  public Errors errors;    // the error message stream
  public Token t;          // most recently recognized token
  public Token la;         // lookahead token
  public Parser (Scanner scanner);
  public void Parse ();
  public void SemErr (string msg);
}
Parser invocation in the main program
public class MyCompiler {
  public static void Main (string[] arg) {
    Scanner scanner = new Scanner(arg[0]);
    Parser parser = new Parser(scanner);
    parser.Parse();
    Console.WriteLine(parser.errors.count + " errors detected");
  }
}
Syntax Error Handling
Syntax error messages are generated automatically
For invalid terminal symbols
production S = a b c.
input a x c
error message -- line ... col ...: b expected
For invalid alternative lists
production S = a (b | c | d) e.
input a x e
error message -- line ... col ...: invalid S
Error message can be improved by rewriting the production
productions S = a T e.
T = b | c | d.
input a x e
error message -- line ... col ...: invalid T
Syntax Error Recovery
The user must specify synchronization points where the parser should recover
Statement =
  SYNC
  ( Designator "=" Expr SYNC ';'
  | "if" '(' Expression ')' Statement ["else" Statement]
  | "while" '(' Expression ')' Statement
  | '{' {Statement} '}'
  | ...
  ).
The two SYNC symbols mark the synchronization points
What are good synchronization points?
Locations in the grammar where particularly "safe" tokens are expected
- start of a statement: if, while, do, ...
- start of a declaration: public, static, void, ...
- in front of a semicolon
What happens if an error is detected?
- parser reports the error
- parser continues to the next synchronization point
- parser skips input symbols until it finds one that is expected at the synchronization point
  while (la.kind is not accepted here) { la = scanner.Scan(); }
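The skipping step at a synchronization point can be sketched as follows. This is a minimal illustration under stated assumptions: the SyncDemo class, its token kinds, and the statementStart set are hypothetical names made up for the example, and the token stream is a plain list rather than a real scanner. The end-of-file kind is put into the set so that the skip loop is guaranteed to terminate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of recovery at a SYNC point (not real Coco/R output).
public class SyncDemo {
    // Assumed token kinds for this example only.
    public const int EOF = 0, Ident = 1, If = 2, While = 3, Semicolon = 4, Junk = 5;

    readonly IList<int> tokens;  // token kinds delivered by an imaginary scanner
    int pos = 0;
    public int La;               // kind of the lookahead token
    public int Skipped = 0;      // number of tokens thrown away during recovery
    public int ErrorCount = 0;

    public SyncDemo(IList<int> tokens) { this.tokens = tokens; Scan(); }

    void Scan() { La = pos < tokens.Count ? tokens[pos++] : EOF; }

    // Tokens expected at the start of a statement; EOF is included so the loop always ends.
    static readonly HashSet<int> statementStart = new HashSet<int> { Ident, If, While, EOF };

    // Called where the grammar says SYNC in front of a statement.
    public void SyncAtStatement() {
        if (statementStart.Contains(La)) return;   // lookahead is fine, nothing to skip
        ErrorCount++;                              // report the error once
        while (!statementStart.Contains(La)) {     // "la.kind is not accepted here"
            Scan();
            Skipped++;
        }
    }
}

public static class Demo {
    public static void Main() {
        var p = new SyncDemo(new[] { SyncDemo.Junk, SyncDemo.Semicolon, SyncDemo.If });
        p.SyncAtStatement();
        Console.WriteLine("skipped " + p.Skipped + " tokens, lookahead kind is now " + p.La);
    }
}
```

Choosing tokens such as if or while for the set works well because they rarely occur anywhere except at the start of a statement, so the parser is unlikely to resynchronize in the wrong place.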
Errors Class
Coco/R generates a class for error message reporting
public class Errors {
  public int count = 0;                                     // number of errors detected
  public TextWriter errorStream = Console.Out;              // error message stream
  public string errMsgFormat = "-- line {0} col {1}: {2}";  // 0=line, 1=column, 2=text

  // called by the programmer (via Parser.SemErr) to report semantic errors
  public void SemErr (int line, int col, string msg) {
    errorStream.WriteLine(errMsgFormat, line, col, msg);
    count++;
  }

  // called automatically by the parser to report syntax errors
  public void SynErr (int line, int col, int n) {
    string msg;
    switch (n) {
      case 0: msg = "..."; break;   // syntax error messages generated by Coco/R
      case 1: msg = "..."; break;
      ...
    }
    errorStream.WriteLine(errMsgFormat, line, col, msg);
    count++;
  }
}
Generating Compilers with Coco/R
1. Compilers
2. Grammars
3. Coco/R Overview
4. Scanner Specification
5. Parser Specification
6. Error Handling
7. LL(1) Conflicts
8. Case Study
Terminal Successors of Nonterminals
Those terminal symbols that can follow a nonterminal in the grammar
Expr = ["+" | "-"] Term {("+" | "-") Term}. Term = Factor {("*" | "/") Factor}. Factor = ident | number | "(" Expr ")".
Where does Expr occur on the right-hand side of a production? What terminal symbols can follow there?
Follow(Expr)   = ")", eof
Follow(Term)   = "+", "-", Follow(Expr)
               = "+", "-", ")", eof
Follow(Factor) = "*", "/", Follow(Term)
               = "*", "/", "+", "-", ")", eof
LL(1) Condition
For recursive descent parsing a grammar must be LL(1)
(parseable from Left to right with Left-canonical derivations and 1 lookahead symbol)
Definition
1. A grammar is LL(1) if all its productions are LL(1).
2. A production is LL(1) if all its alternatives start with different terminal symbols
S = a b | c.
LL(1)
First(a b) = {a} First(c) = {c}
S = a b | T. T = [a] c.
not LL(1)
First(a b) = {a} First(T) = {a, c}
In other words
The parser must always be able to select one of the alternatives by looking at the lookahead token.
S = (a b | T).
if the parser sees an "a" here it cannot decide which alternative to select
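The condition on a single production can be checked mechanically once the First sets of its alternatives are known. The sketch below assumes the First sets are supplied by hand, exactly as listed in the two examples above; LL1Check and Conflicts are names made up for this illustration, and computing First sets from a grammar is out of scope here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: tests whether the alternatives of one production
// start with pairwise different terminal symbols.
public static class LL1Check {
    // Returns every terminal that occurs in the First set of more than one alternative.
    public static HashSet<string> Conflicts(IEnumerable<IEnumerable<string>> firstSets) {
        var seen = new HashSet<string>();
        var conflicts = new HashSet<string>();
        foreach (var first in firstSets)
            foreach (var terminal in first.Distinct())
                if (!seen.Add(terminal))      // Add returns false if already present
                    conflicts.Add(terminal);
        return conflicts;
    }

    public static void Main() {
        // S = a b | c.            First(a b) = {a}   First(c) = {c}
        var s1 = Conflicts(new[] { new[] { "a" }, new[] { "c" } });
        // S = a b | T.  T = [a] c.  First(a b) = {a}   First(T) = {a, c}
        var s2 = Conflicts(new[] { new[] { "a" }, new[] { "a", "c" } });

        Console.WriteLine(s1.Count == 0 ? "LL(1)" : "not LL(1): " + string.Join(", ", s1));
        Console.WriteLine(s2.Count == 0 ? "LL(1)" : "not LL(1): " + string.Join(", ", s2));
    }
}
```

An empty result means the parser can always pick an alternative from the lookahead token alone; a non-empty result names the tokens on which it cannot decide.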
How to Remove Left Recursion
Left recursion is always an LL(1) conflict and must be eliminated
For example
IdentList = ident | IdentList "," ident.
(both alternatives start with ident)
generates the following phrases
derivation step                                phrase
IdentList                                      ident
IdentList "," ident                            ident "," ident
IdentList "," ident "," ident                  ident "," ident "," ident
IdentList "," ident "," ident "," ident        ...
can always be replaced by iteration
IdentList = ident {"," ident}.
Hidden LL(1) Conflicts
EBNF options and iterations are hidden alternatives
S = [a] b.   is equivalent to   S = a b | b.                   (a and b are arbitrary EBNF expressions)
S = {a} b.   is equivalent to   S = b | a b | a a b | ....
Rules
S = [a] b.   First(a) ∩ First(b) must be {}
S = {a} b.   First(a) ∩ First(b) must be {}
S = a [b].   First(b) ∩ Follow(S) must be {}
S = a {b}.   First(b) ∩ Follow(S) must be {}
Dangling Else
If statement in C# or Java
Statement = "if" "(" Expr ")" Statement ["else" Statement] | ....
This is an LL(1) conflict!
First("else" Statement) ∩ Follow(Statement) = {"else"}
It is even an ambiguity which cannot be removed
if (expr1) if (expr2) stat1; else stat2;
[Figure: two syntax trees for the statement above; in one the "else" belongs to the inner "if", in the other to the outer "if"]
We can build 2 different syntax trees!
Can We Ignore LL(1) Conflicts?
An LL(1) conflict is only a warning
The parser selects the first matching alternative
S = a b c | a d.
if the lookahead token is a, the parser selects the first alternative (a b c)
Example: Dangling Else
Statement = "if" "(" Expr ")" Statement [ "else" Statement ] | ....
If the lookahead token is "else" at the start of the option [ "else" Statement ], the parser starts parsing the option; i.e. the "else" belongs to the innermost "if"
if (expr1) if (expr2) stat1; else stat2;
[Figure: syntax tree in which "else stat2" is attached to the inner "if (expr2)" statement]
Luckily this is what we want here.
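The dangling-else behaviour can be reproduced with a small hand-written recursive-descent parser. This is a sketch under simplifying assumptions: IfParser is a hypothetical class, the input is already split into string tokens, conditions are single tokens, and the only other statement form is a single token followed by ";". The method returns a bracketed string that shows which "if" the "else" was attached to.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mini parser for
//   Statement = "if" "(" expr ")" Statement ["else" Statement] | stat ";".
public class IfParser {
    readonly IList<string> tokens;
    int pos = 0;

    public IfParser(IList<string> tokens) { this.tokens = tokens; }

    // Lookahead token (not yet consumed).
    string La { get { return pos < tokens.Count ? tokens[pos] : "<eof>"; } }

    string Take() { string t = La; pos++; return t; }

    void Expect(string s) {
        if (La != s) throw new Exception(s + " expected but found " + La);
        pos++;
    }

    // Returns a bracketed string that shows the structure of the parsed statement.
    public string Statement() {
        if (La == "if") {
            Take();
            Expect("(");
            string cond = Take();
            Expect(")");
            string thenPart = Statement();
            // The option is tried as soon as the lookahead token is "else",
            // so the "else" is consumed by the innermost "if" that is still open.
            if (La == "else") {
                Take();
                return "if(" + cond + "){" + thenPart + "}else{" + Statement() + "}";
            }
            return "if(" + cond + "){" + thenPart + "}";
        }
        string stat = Take();
        Expect(";");
        return stat;
    }
}

public static class Demo {
    public static void Main() {
        var input = "if ( expr1 ) if ( expr2 ) stat1 ; else stat2 ;".Split(' ');
        Console.WriteLine(new IfParser(input).Statement());
        // prints: if(expr1){if(expr2){stat1}else{stat2}}
    }
}
```

When the inner call to Statement returns, the lookahead token is "else", and the inner invocation takes it before the outer one ever gets a chance to look at it.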