COMPILER DESIGN
LECTURE NOTES
ON
COMPILER DESIGN
Prepared by
Dr. Subasish Mohapatra
Department of Computer Science and Application
College of Engineering and Technology, Bhubaneswar
Biju Patnaik University of Technology, Odisha
CONTENTS
Lecture- 1 Introduction to compiler & its phases
Lecture- 2 Overview of language processing system
Lecture- 3 Phases of a Compiler
Lecture- 4 Languages
Lecture- 5 Converting RE to NFA (Thompson Construction)
Lecture- 6 Lexical Analysis
Lecture- 7 Lexical Analyzer Generator
Lecture- 8 Basics of Syntax Analysis
Lecture- 9 Context-Free Grammar
Lecture- 10 Left Recursion
Lecture- 11 YACC
Lecture- 12 Top-down Parsing
Lecture- 13 Recursive Predictive Parsing
Lecture- 14 Non-recursive Predictive Parsing-LL(1)
Lecture- 15 LL(1) Grammar
Lecture- 16 Basics of Bottom-up parsing
Lecture- 17 Conflicts during shift-reduce parsing
Lecture- 18 Operator precedence parsing
Lecture- 19 LR Parsing
Lecture- 20 Construction of SLR parsing table
Lecture- 21 Construction of canonical LR(0) collection
Lecture- 22 Shift-Reduce & Reduce-Reduce conflicts
Lecture- 23 Construction of canonical LR(1) collection
Lecture- 24 Construction of LALR parsing table
Lecture- 25 Using ambiguous grammars
Lecture- 26 SYNTAX-DIRECTED TRANSLATION
Lecture- 27 Translation of Assignment Statements
Lecture- 28 Generating 3-address code for Numerical Representation of Boolean expressions
Lecture- 29 Statements that Alter Flow of Control
Lecture- 30 Postfix Translations
Lecture- 31 Array references in arithmetic expressions
Lecture- 32 SYMBOL TABLES
Lecture- 33 Intermediate Code Generation
Lecture- 34 Directed Acyclic Graph
Lecture- 35 Flow of control statements with Jump method
Lecture- 36 Backpatching
Lecture- 37 RUN TIME ADMINISTRATION
Lecture- 38 Storage Organization
Lecture- 39 ERROR DETECTION AND RECOVERY
Lecture- 40 Error Recovery in Predictive Parsing
Lecture- 41 CODE OPTIMIZATION
Lecture- 42 Local Optimizations
Module- 1
Lecture
INTRODUCTION TO COMPILERS AND ITS PHASES
A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language. The source language is typically a high-level language and the target language is typically machine language.
Source program -> COMPILER -> Target program
Necessity of compiler
- Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs.
- Techniques used in a parser can be used in a query processing system such as SQL.
- Much software with a complex front end may need techniques used in compiler design. For example, a symbolic equation solver that takes an equation as input must parse that equation.
- Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
Properties of Compiler
a) Correctness
   i) Correct output in execution.
   ii) It should report errors.
   iii) Correctly report if the programmer is not following language syntax.
b) Efficiency
c) Compile time and execution.
d) Debugging / Usability.
Compiler vs. Interpreter
Compiler:
- It translates the whole program at a time.
- Compiler is faster.
- Debugging is not easy.
- Compilers are not portable.
Interpreter:
- It translates statement by statement.
- Interpreter is slower.
- Debugging is easy.
- Interpreters are portable.
Types of compiler
1) Native code compiler
A compiler may produce binary output to run on the same computer and operating system on which the compiler itself runs. This type of compiler is called a native code compiler.
2) Cross compiler
A cross compiler is a compiler that runs on one machine and produces object code for another machine.
3) Bootstrap compiler
A compiler that has been implemented in its own language is called a bootstrap (self-hosting) compiler.
Lecture
OVERVIEW OF LANGUAGE PROCESSING SYSTEM
A source program may be divided into modules stored in separate files. A preprocessor collects all the separate files into the source program.
A preprocessor produces input to compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more
modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language
by what amounts to built-in macros.
ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They
began to use a mnemonic (symbol) for each machine instruction, which they would
subsequently translate into machine language. Such a mnemonic machine language is
now called an assembly language. Programs known as assemblers were written to
automate the translation of assembly language into machine language. The input to an
assembler is called the source program; the output is a machine language translation
(the object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were
machine language.
Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. Java
also uses an interpreter. The process of interpretation can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages
- Modification of the user program can be easily made and implemented as
  execution proceeds.
- The type of object that a variable denotes may change dynamically.
- Debugging a program and finding errors is a simplified task for a program used
  for interpretation.
- The interpreter for the language makes it machine independent.
Disadvantages
- The execution of the program is slower.
- Memory consumption is more.
Loader and Linker
Once the assembler produces an object program, that program must be placed into
memory and executed. The assembler could place the object program directly in memory
and transfer control to it, thereby causing the machine language program to be
executed. This would waste core by leaving the assembler in memory while the user's
program was being executed. Also, the programmer would have to retranslate the program
with each execution, thus wasting translation time. To overcome these problems of wasted
translation time and memory, system programmers developed another component called
the loader.
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into an object form
that the loader could "relocate" directly behind the user's program. The task of adjusting
programs so they may be placed in arbitrary core locations is called relocation. Relocation
loaders perform four functions.
Lecture
Phases of a Compiler
Each phase transforms the source program from one representation into another representation. The phases communicate with error handlers and the symbol table.
Lexical Analyzer
The lexical analyzer reads the source program character by character and returns the tokens of the source program. A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Example: In the line of code newval := oldval + 12, the tokens are:
  newval   (identifier)
  :=       (assignment operator)
  oldval   (identifier)
  +        (add operator)
  12       (a number)
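The tokenization of the example statement can be sketched in Python. This is a minimal illustration, not a production lexer: the token class names (`IDENTIFIER`, `ASSIGN`, `ADD`, `NUMBER`) and the function `tokenize` are assumptions made for this example, and the patterns cover only the lexemes that appear in the statement above.

```python
import re

# Hypothetical token classes for the example statement; the names are
# illustrative and do not come from the notes.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("ADD", r"\+"),
    ("SKIP", r"\s+"),  # whitespace separates tokens and is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token class, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token_class, lexeme in tokenize("newval := oldval + 12"):
    print(f"{lexeme:8} {token_class}")
```

Each named group in the combined pattern plays the role of one regular expression describing a token; a generated lexical analyzer would typically compile such expressions into a DFA rather than rely on a regex engine.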
- Puts information about identifiers into the symbol table.
- Regular expressions are used to describe tokens (lexical constructs).
- A (Deterministic) Finite State Automaton can be used in the implementation of a lexical analyzer.
Syntax Analyzer
- A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given program.
- A syntax analyzer is also called a parser.
- A parse tree describes a syntactic structure.
Example: For the line of code newval := oldval + 12, the parse tree will be:
                 assignment
               /     |      \
     identifier     :=      expression
         |                /     |     \
       newval    expression     +     expression
                     |                    |
                 identifier             number
                     |                    |
                   oldval                 12
- The syntax of a language is specified by a context free grammar (CFG).
- The rules in a CFG are mostly recursive.
- A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
- If it satisfies, the syntax analyzer creates a parse tree for the given program.
Example: The CFG used for the above parse tree is:
  assignment -> identifier := expression
  expression -> identifier
  expression -> number
  expression -> expression + expression
- Depending on how the parse tree is created, there are different parsing techniques.
- These parsing techniques are categorized into two groups:
  - Top-Down Parsing
  - Bottom-Up Parsing
- Top-Down Parsing:
  - Construction of the parse tree starts at the root and proceeds towards the leaves.
  - Efficient top-down parsers can be easily constructed by hand.
  - Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
- Bottom-Up Parsing:
  - Construction of the parse tree starts at the leaves and proceeds towards the root.
  - Normally, efficient bottom-up parsers are created with the help of software tools.
  - Bottom-up parsing is also known as shift-reduce parsing.
  - Operator-Precedence Parsing – simple, restrictive, easy to implement.
  - LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.
Semantic Analyzer
A semantic analyzer checks the source program for semantic errors and collects the type information needed for code generation. Type checking is an important part of the semantic analyzer. Normally, semantic information cannot be represented by the context-free languages used in syntax analyzers. The context-free grammars used in syntax analysis are therefore augmented with attributes (semantic rules). The result is syntax-directed translation, specified by attribute grammars.
Example: In the line of code newval := oldval + 12, the type of the identifier newval must match the type of the expression (oldval + 12).
Intermediate Code Generation
A compiler may produce explicit intermediate code representing the source program. This intermediate code is generally independent of the machine architecture, but its level is close to that of machine code.
Example:
Phases of a compiler are the sub-tasks that must be performed to complete the compilation process. Passes refer to the number of times the compiler has to traverse through the entire program.
Symbol Table Management:
A symbol table is a data structure that contains a record for each identifier, with fields for
the attributes of the identifier.
The type information about the identifier is detected during the lexical analysis phase and
is entered into the symbol table.
Example: Position = initial + rate*60;
Address   Symbol     Location   Attributes
1         Position   1000       id, float
2         initial    2000       id, float
3         rate       3000       id, float
4         60         4000       constant, int
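One way to picture such a table is as a dictionary keyed by symbol name. The sketch below is only an illustration under that assumption: the class `SymbolTable`, its methods `insert` and `lookup`, and the field names `location`, `kind`, and `type` are hypothetical, chosen to mirror the columns of the example table.

```python
class SymbolTable:
    """Maps each symbol name to a single record of attributes."""

    def __init__(self):
        self._records = {}

    def insert(self, name, **attributes):
        # Create the record on first sight; later calls update the same record.
        record = self._records.setdefault(name, {"name": name})
        record.update(attributes)
        return record

    def lookup(self, name):
        # Returns None when the name has not been entered.
        return self._records.get(name)


# Entries for the statement: Position = initial + rate*60;
table = SymbolTable()
table.insert("Position", location=1000, kind="id", type="float")
table.insert("initial", location=2000, kind="id", type="float")
table.insert("rate", location=3000, kind="id", type="float")
table.insert("60", location=4000, kind="constant", type="int")

print(table.lookup("rate"))
```

Because a second `insert` for the same name updates the existing record, a later phase can add attributes to an entry that an earlier phase created.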
Error Detection and Reporting:
Each phase can encounter errors. After detecting an error, a phase must deal with it
so that the process of compilation can continue.
The following are some errors encountered in each phase:
i) Lexical Analyzer - Misspelled token.
ii) Semantic Analyzer - Type mismatch.
iii) Syntax Analyzer - Missing parenthesis, too few operands.
iv) Intermediate code generation - Incompatible operands for an operator.
v) Code optimization - Unreachable statement.
vi) Code Generation - Memory restriction to store a variable.
Lecture
Languages
Terminology
- Alphabet: a finite set of symbols (e.g., ASCII characters)
- String: a finite sequence of symbols over an alphabet
- Sentence and word are also used to mean string
- ε is the empty string
- |s| is the length of string s.
- Language: sets of strings over some fixed alphabet
- ∅ the empty set is a language.
- {ε} the set containing empty string is a language
- The set of all possible identifiers is a language.
- Operators on Strings:
  - Concatenation: xy represents the concatenation of strings x and y.
    sε = s    εs = s
  - s^n = s s s .. s (n times)    s^0 = ε
Operations on Languages
- Concatenation: L1 L2 = { s1 s2 | s1 ∈ L1 and s2 ∈ L2 }
- Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
- Exponentiation: L^0 = {ε}    L^1 = L    L^2 = LL
- Kleene Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
- Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...
Examples:
- L1 = {a,b,c,d}    L2 = {1,2}
- L1 L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
- L1 ∪ L2 = {a,b,c,d,1,2}
- L1^3 = all strings of length three (using a,b,c,d)
- L1* = all strings using letters a,b,c,d, including the empty string
- L1+ = all strings using letters a,b,c,d; doesn't include the empty string
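The language operations can be tried out on small finite sets. The following Python sketch represents a language as a set of strings and ε as the empty string; the function names `concat`, `power`, and `closure_up_to` are assumptions made for this illustration. Since L* is infinite whenever L contains a non-empty string, the closure is only approximated up to a chosen power.

```python
def concat(l1, l2):
    """L1 L2 = { s1 s2 | s1 in L1 and s2 in L2 }."""
    return {s1 + s2 for s1 in l1 for s2 in l2}


def power(lang, n):
    """L^n, with L^0 = {ε} (ε is the empty string "")."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def closure_up_to(lang, max_power, positive=False):
    """Finite approximation of L* (or L+ when positive=True):
    the union of L^i for i from 0 (or 1) up to max_power."""
    result = set()
    for i in range(1 if positive else 0, max_power + 1):
        result |= power(lang, i)
    return result


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}

print(sorted(concat(L1, L2)))            # the eight strings a1, a2, ..., d2
print(sorted(L1 | L2))                   # union
print(len(power(L1, 3)))                 # number of strings of length three
print("" in closure_up_to(L1, 2))        # ε belongs to L1*
print("" in closure_up_to(L1, 2, True))  # ε does not belong to L1+
```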
Both deterministic and non-deterministic finite automata recognize regular sets. Which one should be used?
- deterministic – faster recognizer, but it may take more space
- non-deterministic – slower, but it may take less space
- Deterministic automata are widely used in lexical analyzers. First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
- S - a set of states
- Σ - a set of input symbols (alphabet)
- move - a transition function that maps state-symbol pairs to sets of states
- s0 - a start (initial) state
- F - a set of accepting states (final states)
ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
Example (transition graph):
0 is the start state s0
{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}
Transition Function:
State   a       b
0       {0,1}   {0}
1       {}      {2}
2       {}      {}
The language recognized by this NFA is (a|b)*ab
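The acceptance condition can be checked by tracking the set of all states the NFA could be in after each input symbol. The Python sketch below encodes the transition function of the example NFA; the names `NFA_MOVE` and `nfa_accepts` are assumptions made for this illustration, and ε-transitions are left out because the example has none.

```python
# Transition function of the example NFA: (state, symbol) -> set of states.
# Missing entries stand for the empty set {}.
NFA_MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
NFA_START = 0
NFA_ACCEPTING = {2}


def nfa_accepts(text):
    """Return True if some path spelling out text ends in an accepting state."""
    current = {NFA_START}
    for symbol in text:
        next_states = set()
        for state in current:
            next_states |= NFA_MOVE.get((state, symbol), set())
        current = next_states
    return bool(current & NFA_ACCEPTING)


for word in ["ab", "abab", "aba", "b", ""]:
    print(repr(word), nfa_accepts(word))
```

Only the strings ending in ab are accepted, which agrees with the language (a|b)*ab.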
Deterministic Finite Automaton (DFA)
A Deterministic Finite Automaton (DFA) is a special form of an NFA.
- No state has an ε-transition.
- For each symbol a and state s, there is at most one edge labeled a leaving s, i.e. the transition function maps a state-symbol pair to a state (not a set of states).
Example: The DFA to recognize the language (a|b)*ab is as follows.
0 is the start state s0
{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}
Transition Function:
State   a   b
0       1   0
1       1   2
2       1   0
Note that the entries in this function are single values and not sets of values (unlike the NFA).
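Because every entry is a single state, running a DFA needs only one current state and one table lookup per input symbol. The sketch below encodes the transition table of the example DFA in Python; the names `DFA_MOVE` and `dfa_accepts` are assumptions made for this illustration, and rejecting symbols outside Σ is a choice of this sketch rather than something the notes specify.

```python
# Transition table of the example DFA: (state, symbol) -> state.
DFA_MOVE = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
DFA_START = 0
DFA_ACCEPTING = {2}


def dfa_accepts(text):
    """Run the DFA over text; symbols outside the alphabet cause rejection."""
    state = DFA_START
    for symbol in text:
        if (state, symbol) not in DFA_MOVE:
            return False
        state = DFA_MOVE[(state, symbol)]
    return state in DFA_ACCEPTING


for word in ["ab", "aab", "bab", "abb", ""]:
    print(repr(word), dfa_accepts(word))
```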
[Figure: NFA constructed for the RE (a|b)*a]
Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
- From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus states which are connected by an ε-transition will be represented by the same state in the DFA.
- If it is possible to have multiple transitions based on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (i.e. the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
To perform this operation, let us define two functions:
- The ε-closure function takes a state and returns the set of states reachable from it based on zero or more ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
- The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
We can generalize both these functions to apply to sets of states by taking the union of the application to individual states. For example, if A, B and C are states, move({A,B,C}, 'a') = move(A, 'a') ∪ move(B, 'a') ∪ move(C, 'a').
The Subset Construction Algorithm is as follows:
    put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
    while (there is an unmarked state S1 in DS) do
    begin
        mark S1
        for each input symbol a do
        begin
            S2 <- ε-closure(move(S1,a))
            if (S2 is not in DS) then
                add S2 into DS as an unmarked state
            transfunc[S1,a] <- S2
        end
    end
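The subset construction can be sketched in Python as follows. This is an illustrative reading of the algorithm, not a definitive implementation: the function names, the use of the empty string as the label for ε-transitions, and the dictionary encoding of the NFA are all assumptions made for this example. One detail the pseudocode leaves open is what to do when move returns the empty set; this sketch keeps the empty set as an ordinary (dead) DFA state.

```python
from collections import deque

EPSILON = ""  # assumed label for ε-transitions in the NFA dictionary


def epsilon_closure(nfa_move, states):
    """States reachable from `states` using zero or more ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in nfa_move.get((state, EPSILON), set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def move(nfa_move, states, symbol):
    """Union of the single-state moves on `symbol`."""
    result = set()
    for state in states:
        result |= nfa_move.get((state, symbol), set())
    return result


def subset_construction(nfa_move, start, alphabet):
    start_set = epsilon_closure(nfa_move, {start})
    dfa_states = {start_set}       # DS
    unmarked = deque([start_set])  # states of DS not yet marked
    transfunc = {}
    while unmarked:
        s1 = unmarked.popleft()    # mark S1
        for a in alphabet:
            s2 = epsilon_closure(nfa_move, move(nfa_move, s1, a))
            if s2 not in dfa_states:
                dfa_states.add(s2)
                unmarked.append(s2)
            transfunc[(s1, a)] = s2
    return start_set, dfa_states, transfunc


# NFA with states 0, 1, 2 that recognizes (a|b)*ab.
EXAMPLE_NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
start_set, dfa_states, transfunc = subset_construction(EXAMPLE_NFA, 0, "ab")
for (source, symbol), target in sorted(
    transfunc.items(), key=lambda item: (sorted(item[0][0]), item[0][1])
):
    print(sorted(source), symbol, "->", sorted(target))
```

For this NFA the construction yields three DFA states, {0}, {0,1} and {0,2}. A DFA state is normally treated as accepting when it contains an accepting NFA state, so {0,2} would be the accepting state here.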