Complete Compiler Design Notes, Study notes of Compiler Design

University of Petroleum and Energy Studies Compiler Design

Full syllabus: Compiler design – Language Translation, Compilers, Lexical Analysis (Scanning), syntax analysis (parsing), top-down parsing, bottom-up parsing, Semantic analysis, Intermediate Code Generation, Symbol Tables, Runtime Environment, code optimisation, Control flow and Data flow analysis, Object code generation.

Typology: Study notes

2025/2026

Available from 04/16/2026

cheshtaa-khanna 🇮🇳



INDEX

UNIT I
Language Translation
Compilers
Lexical Analysis (Scanning)
Syntax Analysis (Parsing)

UNIT II
Top down parsing
Bottom up parsing

UNIT III
Semantic analysis
Intermediate Code Generation
Symbol Tables

UNIT IV
Runtime Environment
Code optimization

UNIT V
Control flow and Data flow analysis
Object code generation


UNIT-I

INTRODUCTION TO LANGUAGE PROCESSING:

As computers became an indispensable part of human life, many languages with different and more advanced features evolved to make it easier for users to communicate with the machine. Translator (mediator) software therefore became essential to bridge the large gap between human and machine understanding. This process is called Language Processing, a name that reflects its goal and intent. To understand it better, we first need to be familiar with the key terms and concepts explained below.

LANGUAGE TRANSLATORS:

A language translator is a computer program that translates a program written in one (source) language into an equivalent program in another (target) language. The source program is written in a high-level language, whereas the target language can be anything from the machine language of a target machine (ranging from a microprocessor to a supercomputer) to another high-level language.

• The two most commonly used translators are the compiler and the interpreter.

1. Compiler: A compiler is a program that reads a program in one language, called the source language, and translates it into an equivalent program in another language, called the target language. In addition, it reports the errors it finds in the source program to the user.

• If the target program is an executable machine-language program, the user can then run it to process inputs and produce outputs.

Figure 1.1: Running the target program (Input → Target Program → Output)

In addition to these translators, programs such as interpreters and text formatters may be used in a language processing system. To translate a high-level language program into an executable one, the compiler by default performs both the compiling and the linking functions.

Normally, the steps in a language processing system are: preprocessing the skeletal source program, which produces an expanded, ready-to-compile source program; compiling the result; assembling; and then linking and loading, which finally produces the equivalent executable code. Not all of these steps are mandatory. In some cases, the compiler performs the linking and loading functions implicitly.

The steps involved in a typical language processing system are shown in the following diagram.

Source Program [Example: filename.C]
→ Preprocessor
→ Modified Source Program [Example: filename.C]
→ Compiler
→ Target Assembly Program
→ Assembler
→ Relocatable Machine Code [Example: filename.obj]
→ Loader/Linker (together with library files and relocatable object files)
→ Target Machine Code [Example: filename.exe]

Figure 1.3: Context of a compiler in a language processing system

TYPES OF COMPILERS:

Based on the input it takes and the output it produces, a compiler can be classified as one of the following types:

• Traditional compilers (C, C++, Pascal): These convert a source program in a high-level language into its equivalent in native machine code or object code.

• Interpreters (LISP, SNOBOL, Java 1.0): These first convert the source code into intermediate code and then interpret (emulate) that intermediate code on the machine.

• Cross-compilers: These run on one machine and produce code for another machine.

• Incremental compilers: These separate the source into user-defined steps, compiling or recompiling step by step and interpreting the steps in a given order.

• Converters (e.g., COBOL to C++): These compile from one high-level language to another.

• Just-In-Time (JIT) compilers (Java, Microsoft .NET): These are runtime compilers from an intermediate language (byte code, MSIL) to executable or native machine code. They perform type-based verification, which makes the executable code more trustworthy.

• Ahead-of-Time (AOT) compilers (e.g., .NET ngen): These precompile Java or .NET intermediate code to native code before execution.

• Binary compilers: These compile object code of one platform into object code of another platform.

PHASES OF A COMPILER:

Because compilation is a complex task, a compiler typically proceeds in a sequence of phases. The phases communicate with each other via clearly defined interfaces. Generally, an interface consists of a data structure (e.g., a tree) and a set of exported functions. Each phase except the first works on an abstract intermediate representation of the source program, not on the source program text itself.

Compiler phases are the individual modules that are executed one after another to perform their respective sub-activities; their results are finally combined to give the target code.

It is desirable to have relatively few phases, since it takes time to read and write intermediate files. Figure 1.4 depicts the phases that a compiler goes through during compilation. A typical compiler has the following phases:

1. Lexical Analyzer (Scanner)
2. Syntax Analyzer (Parser)
3. Semantic Analyzer
4. Intermediate Code Generator (ICG)
5. Code Optimizer (CO)
6. Code Generator (CG)

In addition to these, a compiler also has symbol table management and error handling components. Not all the phases are mandatory in every compiler; for example, the Code Optimizer phase is optional in some compilers.

All of these phases are conceptually divided into the Front-end and the Back-end, according to whether they depend on the source language or on the target machine. This model is called the Analysis and Synthesis model of a compiler.

The Front-end of the compiler consists of the phases that depend primarily on the source language and are largely independent of the target machine. It includes the Scanner, the Parser, creation of the symbol table, the Semantic Analyzer, and the Intermediate Code Generator.

The Back-end of the compiler consists of the phases that depend on the target machine. These phases do not depend on the source language, only on the intermediate language. The Back-end includes aspects of the Code Optimization phase and code generation, along with the necessary error handling and symbol table operations.

LEXICAL ANALYZER (SCANNER): The Scanner is the first phase. It works as the interface between the compiler and the source program and performs the following functions:

• Reads the characters of the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier, a keyword, a punctuation mark, or a multi-character operator like :=.

• The character sequence forming a token is called the lexeme of the token.

• Generates a token-id for each lexeme, and enters an identifier's name in the symbol table if it is not already there.

• Removes comments and unnecessary white space.

The format of a token is <token name, attribute value>.

SYNTAX ANALYZER (PARSER): The Parser interacts with the Scanner and with the subsequent phase, the Semantic Analyzer, and performs the following functions:

• Groups the token stream received from the Scanner into syntactic structures, usually into a structure called a parse tree, whose leaves are tokens.

• Each interior node of this tree represents a group of tokens that logically belong together.

• In other words, it checks the syntax of the program's constructs.

SEMANTIC ANALYZER: This phase receives the syntax tree as input and checks the semantic correctness of the program. Even when the tokens are valid and syntactically correct, they may still be semantically incorrect. The semantic analyzer therefore checks the semantics (meaning) of the statements formed.

• The syntactically and semantically correct structures are produced here in the form of a syntax tree, a DAG, or some other sequential representation such as a matrix.

INTERMEDIATE CODE GENERATOR (ICG): This phase takes the syntactically and semantically correct structure as input and produces an equivalent intermediate representation of the source program. The intermediate code should have two important properties:

• It should be easy to produce.

• It should be easy to translate into the target program.

Examples of intermediate code forms are:

• Three-address code

• Polish notation, etc.
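To make the two intermediate forms mentioned above concrete, the following is a minimal sketch in C that turns a postfix (reverse Polish) expression into three-address code. The function name postfix_to_tac, the single-letter operands, and the temporary names t1, t2, ... are assumptions made for this illustration; they do not come from the notes.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_STACK 32

/* Translate a postfix expression over single-letter operands into
   three-address code, one instruction per line, written into out.
   Returns the number of temporaries created, or -1 on malformed input. */
int postfix_to_tac(const char *postfix, char *out, size_t out_size)
{
    char stack[MAX_STACK][8];   /* operand names: "a", "t1", ... */
    int top = 0, temps = 0;
    size_t used = 0;

    assert(postfix != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (const char *p = postfix; *p; p++) {
        if (isalpha((unsigned char)*p)) {
            if (top == MAX_STACK) return -1;
            snprintf(stack[top++], sizeof stack[0], "%c", *p);
        } else if (strchr("+-*/", *p)) {
            char lhs[8], rhs[8];
            if (top < 2) return -1;          /* an operator needs two operands */
            strcpy(rhs, stack[--top]);
            strcpy(lhs, stack[--top]);
            int n = snprintf(out + used, out_size - used, "t%d = %s %c %s\n",
                             temps + 1, lhs, *p, rhs);
            if (n < 0 || (size_t)n >= out_size - used) return -1;
            used += (size_t)n;
            temps++;
            snprintf(stack[top++], sizeof stack[0], "t%d", temps);
        } else {
            return -1;                       /* symbol outside the tiny language */
        }
    }
    return top == 1 ? temps : -1;            /* exactly one result must remain */
}
```

For the postfix string bcd*+ (that is, b + c * d), the sketch produces the two instructions t1 = c * d and t2 = b + t1. Each instruction has at most one operator and three addresses, which is what makes this form easy to translate into machine instructions.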

CODE OPTIMIZER: This phase is optional in some compilers, but it is useful and beneficial in terms of saving time, effort, and cost. It performs the following specific functions:

• Attempts to improve the intermediate code so that the resulting machine code runs faster. Typical functions include loop optimization, removal of redundant computations, strength reduction, frequency reduction, etc.

• Sometimes the data structures used to represent the intermediate forms are also changed.
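The effect of two of the transformations named above can be sketched at source level in C. An optimizer works on intermediate code rather than on C source, so the hand-written "after" version below is only an illustration; the function names sum_naive and sum_optimized are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Before optimization: i * 4 costs a multiplication in every iteration,
   and x * y is recomputed although it never changes inside the loop. */
long sum_naive(const long *a, size_t n, long x, long y)
{
    long sum = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        sum += a[i] + (long)i * 4 + x * y;
    return sum;
}

/* After optimization (written by hand to show the effect):
   - frequency reduction: x * y is computed once, outside the loop;
   - strength reduction: i * 4 is replaced by a running addition. */
long sum_optimized(const long *a, size_t n, long x, long y)
{
    long sum = 0;
    long xy = x * y;   /* loop-invariant value, computed once */
    long i4 = 0;       /* always equal to i * 4 */
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        sum += a[i] + i4 + xy;
        i4 += 4;       /* an addition replaces the multiplication */
    }
    return sum;
}
```

Both functions return the same value for every input; only the amount of work done inside the loop differs.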

CODE GENERATOR: This is the final phase of the compiler. It generates the target code, which normally consists of relocatable machine code, assembly code, or absolute machine code.

• Memory locations are selected for each variable used, and variables are assigned to registers.

• Intermediate instructions are translated into a sequence of machine instructions.

The compiler also performs symbol table management and error handling throughout the compilation process. The symbol table is a data structure that stores the source language constructs and tokens generated during compilation. Both components interact with all phases of the compiler.
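As a rough illustration of the symbol table described above, the following C sketch stores identifier names in a small hash table with linear probing. The names sym_install and sym_lookup, the table size, and the choice of hashing are assumptions for this example; real compilers store much more per entry (type, scope, storage location).

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 64
#define NAME_LEN   32

struct symbol { char name[NAME_LEN]; int used; };
static struct symbol table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the slot holding name, or -1 if it is not in the table. */
int sym_lookup(const char *name)
{
    assert(name != NULL);
    unsigned h = hash(name);
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        int slot = (int)((h + probe) % TABLE_SIZE);
        if (!table[slot].used) return -1;
        if (strcmp(table[slot].name, name) == 0) return slot;
    }
    return -1;
}

/* Enter name if it is absent and return its slot.
   Returns -1 if the table is full or the name is too long. */
int sym_install(const char *name)
{
    assert(name != NULL);
    if (strlen(name) >= NAME_LEN) return -1;
    unsigned h = hash(name);
    for (int probe = 0; probe < TABLE_SIZE; probe++) {
        int slot = (int)((h + probe) % TABLE_SIZE);
        if (!table[slot].used) {
            strcpy(table[slot].name, name);
            table[slot].used = 1;
            return slot;
        }
        if (strcmp(table[slot].name, name) == 0) return slot;
    }
    return -1;
}
```

Installing the same identifier twice returns the same slot, which mirrors the scanner's behaviour of entering a name only if it does not already exist.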

LEXICAL ANALYSIS:

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output tokens for each lexeme in the source program. This stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well.

When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. This process is shown in the following figure.

Figure 1.6 : Lexical Analyzer

When the lexical analyzer identifies the first token, it sends it to the parser. The parser receives the token and asks the lexical analyzer for the next one by issuing the getNextToken() command. This process continues until the lexical analyzer has identified all the tokens. During this process, the lexical analyzer discards white space and comments.

TOKENS, PATTERNS AND LEXEMES:

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take [ or match]. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example: In the following C language statement,

printf ("Total = %d\n", score) ;

both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching literal [or string].

Figure 1.7: Examples of Tokens
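The relationship between token names, patterns, and lexemes can be sketched with a small hand-written scanner in C. The token names (TOK_ID and so on), the struct layout, and the function get_next_token are assumptions made for this example; the patterns are deliberately simplified and keywords and comments are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum token_name { TOK_ID, TOK_NUMBER, TOK_LITERAL, TOK_PUNCT, TOK_EOF };

struct token {
    enum token_name name;   /* token name: the symbol the parser sees */
    char lexeme[64];        /* attribute value: here simply the matched text */
};

/* Scan one token starting at *src and advance *src past its lexeme.
   White space is skipped before the lexeme begins. */
struct token get_next_token(const char **src)
{
    assert(src != NULL && *src != NULL);
    struct token t = { TOK_EOF, "" };
    const char *p = *src;

    while (isspace((unsigned char)*p)) p++;
    const char *start = p;

    if (*p == '\0') {
        t.name = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_') p++;   /* pattern for id */
        t.name = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;                /* pattern for number */
        t.name = TOK_NUMBER;
    } else if (*p == '"') {
        p++;
        while (*p && *p != '"') {                              /* pattern for literal */
            if (*p == '\\' && p[1]) p++;                       /* skip escaped char */
            p++;
        }
        if (*p == '"') p++;
        t.name = TOK_LITERAL;
    } else {
        p++;                                                   /* any other single char */
        t.name = TOK_PUNCT;
    }

    size_t len = (size_t)(p - start);
    if (len >= sizeof t.lexeme) len = sizeof t.lexeme - 1;
    memcpy(t.lexeme, start, len);
    t.lexeme[len] = '\0';
    *src = p;
    return t;
}
```

Called repeatedly on the printf statement above, this sketch yields the id lexeme printf, the punctuation (, the literal "Total = %d\n", a comma, the id lexeme score, and then ) and ; before reporting end of input. A parser would drive it in the same way that getNextToken() is used in the description above.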

LEXICAL ANALYSIS Vs PARSING:

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and whitespace as syntactic units would be considerably more complex than one that can assume comments and whitespace have already been removed by the lexical analyzer.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found. In the figure, forward has passed the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input and move forward to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, the size of one buffer, we shall never overwrite the lexeme in its buffer before determining it.

Sentinels to Improve Scanner Performance:

If we use the above scheme as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we have, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof. Figure 1.8 shows the same buffer arrangement, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input.

Figure 1.8: Sentinels at the end of each buffer

Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 1.9 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.

switch ( *forward++ )
{
    case eof :
        if ( forward is at end of first buffer ) {
            reload second buffer;
            forward = beginning of second buffer;
        }
        else if ( forward is at end of second buffer ) {
            reload first buffer;
            forward = beginning of first buffer;
        }
        else /* eof within a buffer marks the end of input */
            terminate lexical analysis;
        break;
}

Figure 1.9: Use of switch-case with the sentinel
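The pseudocode of Figure 1.9 can be turned into a small runnable sketch in C. Several things here are assumptions made for the illustration: the source is read from an in-memory string rather than a file, the NUL character plays the role of eof, N is set to a tiny value so that reloads happen often, and the names scanner, reload, scanner_init, and next_char are hypothetical. The lexemeBegin pointer is left out to keep the sketch short.

```c
#include <assert.h>
#include <string.h>

#define N 4               /* buffer size; tiny so that reloads happen often */
#define SENTINEL '\0'     /* stands in for eof; assumed absent from the source */

struct scanner {
    const char *src;          /* unread remainder of the source text */
    char buf[2 * (N + 1)];    /* two halves, each N characters plus a sentinel */
    char *forward;
};

/* Fill one half with up to N characters. A sentinel at index N marks the
   end of the buffer; a sentinel before index N marks the end of input. */
static void reload(struct scanner *s, char *half)
{
    size_t n = strlen(s->src);
    if (n > N) n = N;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    half[N] = SENTINEL;
    s->src += n;
}

void scanner_init(struct scanner *s, const char *text)
{
    assert(s != NULL && text != NULL);
    s->src = text;
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next source character, or SENTINEL at end of input. */
char next_char(struct scanner *s)
{
    char *first = s->buf, *second = s->buf + N + 1;
    for (;;) {
        char c = *s->forward++;
        if (c != SENTINEL) return c;               /* common case: a single test */
        if (s->forward == first + N + 1) {          /* sentinel ended the first half */
            reload(s, second);
            s->forward = second;
        } else if (s->forward == second + N + 1) {  /* sentinel ended the second half */
            reload(s, first);
            s->forward = first;
        } else {                                    /* sentinel inside a half */
            s->forward--;                           /* stay on it for later calls */
            return SENTINEL;                        /* end of input */
        }
    }
}
```

For ordinary characters only the comparison against the sentinel is made; the buffer-boundary checks run only after a sentinel has actually been read, which is the saving the sentinel scheme is meant to provide.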

SPECIFICATION OF TOKENS:

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens.

LEX, the Lexical Analyzer Generator

Lex is a tool used to generate lexical analyzers. The input notation for the Lex tool is referred to as the Lex language, and the tool itself is the Lex compiler. Behind the scenes, the Lex compiler transforms the input patterns into a transition diagram and generates code in a file called lex.yy.c. This file is a C program, which the C compiler then compiles into object code. To use the tool we need to know how to write a program in the Lex language. The translation rules of a Lex program for some common tokens are given below.

In the rules below, id and number refer to regular definitions from the declarations section, which this excerpt does not show.

then        { return(THEN); }
else        { return(ELSE); }
{id}        { yylval = (int) installID(); return(ID); }
{number}    { yylval = (int) installNum(); return(NUMBER); }
"<"         { yylval = LT; return(RELOP); }
"<="        { yylval = LE; return(RELOP); }
"="         { yylval = EQ; return(RELOP); }
"<>"        { yylval = NE; return(RELOP); }
">"         { yylval = GT; return(RELOP); }
">="        { yylval = GE; return(RELOP); }

int installID() { /* function to install the lexeme, whose first character is pointed to by yytext, and whose length is yyleng, into the symbol table and return a pointer thereto */ }

int installNum() { /* similar to installID, but puts numerical constants into a separate table */ }

Figure 1.10: Lex program for common tokens

SYNTAX ANALYSIS (PARSER)

THE ROLE OF THE PARSER:

In our compiler model, the parser obtains a string of tokens from the lexical analyzer, as shown in Figure 2.1, and verifies that the string of token names can be generated by the grammar for the source language. We expect the parser to report any syntax errors in an intelligible fashion and to recover from commonly occurring errors so that it can continue processing the remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the compiler for further processing.

Figure 2.1: Parser in the Compiler

During parsing, the parser may encounter errors, and it presents the error information back to the user.

Syntactic errors include misplaced semicolons and extra or missing braces, that is, "{" or "}". As another example, in C or Java, the appearance of a case statement without an enclosing switch is a syntactic error (however, this situation is usually allowed by the parser and caught later in the processing, as the compiler attempts to generate code).

Based on the order in which the parse tree is constructed, parsing is classified into the following two types:

1. Top-down parsing: Parse tree construction starts at the root node and moves towards the child nodes (i.e., top-down order).
2. Bottom-up parsing: Parse tree construction begins at the leaf nodes and proceeds towards the root node (i.e., bottom-up order).

IMPORTANT (OR) EXPECTED QUESTIONS

1. What is a compiler? Explain the working of a compiler with your own example.
2. What is the lexical analyzer? Discuss the functions of the lexical analyzer.
3. Write short notes on tokens, patterns and lexemes.
4. Write short notes on the input buffering scheme. How do you change the basic input buffering algorithm to achieve better performance?
5. What do you mean by a lexical analyzer generator? Explain the LEX tool.

Department of Computer Science & Engineering Course File : Compiler Design

UNIT-II

TOP DOWN PARSING:

• Top-down parsing can be viewed as the problem of constructing a parse tree for the given input string, starting from the root and creating the nodes of the parse tree in preorder (depth-first, left to right).

• Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string.

Top-down parsers are classified into two variants: those that use backtracking and those that do not.

Non-backtracking parsing has two variants:

1. Table-driven predictive parsing
   i. LL(1) parsing
2. Recursive descent parsing

Backtracking parsing:

1. Brute force method
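To show what the recursive descent variant listed above looks like, here is a minimal sketch in C with one function per nonterminal and no backtracking. The grammar is the common textbook expression grammar, not one taken from these notes, and the single character i stands for an identifier token; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Grammar (assumed for this example):
     E  -> T E'        E' -> + T E' | epsilon
     T  -> F T'        T' -> * F T' | epsilon
     F  -> ( E ) | i   where the terminal i stands for an identifier */

static const char *look;   /* current lookahead symbol */
static int ok;             /* cleared on the first syntax error */

static void parse_E(void);

static void match(char expected)
{
    if (*look == expected) look++;   /* consume the matched terminal */
    else ok = 0;
}

static void parse_F(void)
{
    if (*look == '(') { match('('); parse_E(); match(')'); }
    else match('i');
}

static void parse_Tprime(void)
{
    if (*look == '*') { match('*'); parse_F(); parse_Tprime(); }
    /* otherwise choose T' -> epsilon and consume nothing */
}

static void parse_T(void) { parse_F(); parse_Tprime(); }

static void parse_Eprime(void)
{
    if (*look == '+') { match('+'); parse_T(); parse_Eprime(); }
    /* otherwise choose E' -> epsilon and consume nothing */
}

static void parse_E(void) { parse_T(); parse_Eprime(); }

/* Return 1 if input is a sentence of the grammar, 0 otherwise. */
int parse(const char *input)
{
    assert(input != NULL);
    look = input;
    ok = 1;
    parse_E();
    return ok && *look == '\0';
}
```

Each function decides which production to use by looking only at the current input symbol, so the parser never has to undo a choice. The sequence of calls traces the parse tree from the root downwards, in preorder.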

NON BACK TRACKING:

LL(1) Parsing or Predictive Parsing

LL(1) stands for: Left-to-right scan of the input, Leftmost derivation, and 1 input symbol of lookahead used in deciding each parsing action.

A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒* wα by a leftmost derivation.

The table-driven parser in the figure has:

• An input buffer that contains the string to be parsed, followed by a $ symbol that indicates the end of the input.

• A stack containing a sequence of grammar symbols, with $ at the bottom. Initially the stack contains the start symbol of the grammar on top of $.

• A parsing table containing the production rules to be applied. This is a two-dimensional array M[Nonterminal, Terminal].

• A parsing algorithm that takes the input string and determines whether it conforms to the grammar, using the parsing table and the stack to make this decision.

Figure 2.2: Model for table driven parsing
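The four components listed above can be sketched in C as follows. The grammar and its parsing table are the common textbook expression grammar, which is an assumption of this example and not taken from these notes. The letters e and t stand for the nonterminals E' and T', the character i stands for an identifier token, and the function name ll1_parse is hypothetical.

```c
#include <assert.h>
#include <string.h>

static const char NONTERMS[] = "EeTtF";    /* e = E', t = T' */
static const char TERMS[]    = "i+*()$";

/* Parsing table M[A][a]: the right-hand side used to expand A on
   lookahead a. NULL means error, "" means the epsilon production. */
static const char *const M[5][6] = {
    /*           i      +       *       (       )     $    */
    /* E  */ { "Te",  NULL,   NULL,   "Te",   NULL, NULL },
    /* E' */ { NULL,  "+Te",  NULL,   NULL,   "",   ""   },
    /* T  */ { "Ft",  NULL,   NULL,   "Ft",   NULL, NULL },
    /* T' */ { NULL,  "",     "*Ft",  NULL,   "",   ""   },
    /* F  */ { "i",   NULL,   NULL,   "(E)",  NULL, NULL },
};

/* Return 1 if input (given without the trailing $) is accepted. */
int ll1_parse(const char *input)
{
    char stack[128];
    size_t top = 0;
    const char *ip = input;

    assert(input != NULL);
    stack[top++] = '$';                     /* bottom-of-stack marker */
    stack[top++] = 'E';                     /* start symbol on top of $ */

    for (;;) {
        char X = stack[top - 1];
        char a = *ip ? *ip : '$';           /* $ marks the end of the input */
        const char *col = strchr(TERMS, a);
        if (col == NULL) return 0;          /* not a terminal of the grammar */
        if (X == '$') return *ip == '\0';   /* accept only if input is used up */

        const char *row = strchr(NONTERMS, X);
        if (row == NULL) {                  /* X is a terminal: it must match */
            if (X != a) return 0;
            top--;
            ip++;
        } else {                            /* X is a nonterminal: consult M */
            const char *rhs = M[row - NONTERMS][col - TERMS];
            if (rhs == NULL) return 0;
            top--;
            size_t len = strlen(rhs);
            if (top + len > sizeof stack) return 0;
            for (size_t k = len; k > 0; k--)    /* push the RHS in reverse */
                stack[top++] = rhs[k - 1];
        }
    }
}
```

Pushing the right-hand side in reverse keeps its leftmost symbol on top of the stack, so the parser expands nonterminals in the order of a leftmost derivation.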

The steps involved in constructing an LL(1) parser are:

1. Write the context-free grammar for the given input string.
2. Check for ambiguity. If the grammar is ambiguous, remove the ambiguity.
3. Check for left recursion. Remove left recursion if it exists.
4. Check for common prefixes. Perform left factoring if two or more alternatives of a nonterminal share a common prefix.
5. Compute the FIRST and FOLLOW sets.
6. Construct the LL(1) table.
7. Using the LL(1) algorithm, generate the parse tree as the output.
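As a worked illustration of the left-recursion step listed above, the standard transformation for immediate left recursion can be written as follows. The symbols A, A', α and β are generic placeholders, and the expression grammar used in the second part is a common textbook example rather than one taken from these notes.

```latex
% Immediate left recursion, where beta does not begin with A:
A \to A\alpha \mid \beta
\quad\Longrightarrow\quad
A \to \beta A', \qquad A' \to \alpha A' \mid \varepsilon

% Applied to E -> E + T | T, with alpha = +T and beta = T:
E \to E + T \mid T
\quad\Longrightarrow\quad
E \to T E', \qquad E' \to + T E' \mid \varepsilon
```

Both grammars generate the same strings, namely β followed by zero or more copies of α, but the transformed grammar no longer sends a top-down parser into an endless expansion of A.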

Context Free Grammar (CFG): A CFG is used to describe the syntax of programming language constructs. The CFG is denoted as G and is defined using a four-tuple notation.

Let G be a CFG; then G is written as G = (V, T, P, S)

Where

• V is a finite set of nonterminals. Nonterminals are syntactic variables that denote sets of strings. The sets of strings denoted by nonterminals help define the language generated by the grammar. Nonterminals impose a hierarchical structure on the language that is key to syntax analysis and translation.

• T is a finite set of terminals. Terminals are the basic symbols from which strings are formed. The term "token name" is a synonym for "terminal", and frequently we will use the word "token" for terminal when it is clear that we are talking about just the token name. We assume that the terminals are the first components of the tokens output by the lexical analyzer.

• P is a finite set of productions. The productions of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings.

• S is the start symbol of the grammar. One nonterminal is distinguished as the start symbol, and the set of strings it denotes is the language generated by the grammar.
