Languages, Regular Expression - Compiler Construction - Lecture Notes, Study notes of Compiler Construction

Languages, alphabets, sets of strings of characters, finite sequences of characters, regular expressions, finite automata, sets of transitions, and sets of accepting states are the main points of this lecture.

Typology: Study notes

2011/2012

Uploaded on 11/06/2012

asim.amjid

Sohail Aslam Compiler Construction Notes


Lecture 6

How to Describe Tokens?

Regular languages are the most popular formalism for specifying tokens because:

  • they are based on a simple and useful theory,
  • they are easy to understand, and
  • efficient implementations exist for generating lexical analysers from such languages.

Languages

Let Σ be a set of characters. Σ is called the alphabet. A language over Σ is a set of strings of characters drawn from Σ. Here are some examples of languages:

  • Alphabet = English characters; Language = English sentences
  • Alphabet = ASCII; Language = C++, Java, C# programs

Languages are sets of strings (finite sequences of characters). We need some notation for specifying which sets we want. For lexical analysis we care about regular languages. Regular languages can be described using regular expressions. Each regular expression is a notation for a regular language (a set of words). If A is a regular expression, we write L(A) to refer to the language denoted by A.

Regular Expression

A regular expression (RE) is defined inductively:

  a      an ordinary character from Σ
  ε      the empty string
  R|S    either R or S
  RS     R followed by S (concatenation)
  R*     concatenation of R zero or more times (R* = ε|R|RR|RRR|...)
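
The inductive definition maps naturally onto a small recursive data type. The following is a minimal sketch and not part of the original notes: the class names (Char, Eps, Alt, Cat, Star) and the helper `language` are hypothetical. Since R* usually denotes an infinite set, `language` only enumerates the strings of L(R) up to a chosen length.

```python
from dataclasses import dataclass

# Hypothetical node types, one per case of the inductive definition.
@dataclass(frozen=True)
class Char:      # a: an ordinary (single) character from the alphabet
    c: str

@dataclass(frozen=True)
class Eps:       # ε: the empty string
    pass

@dataclass(frozen=True)
class Alt:       # R|S
    r: object
    s: object

@dataclass(frozen=True)
class Cat:       # RS
    r: object
    s: object

@dataclass(frozen=True)
class Star:      # R*
    r: object

def language(expr, max_len):
    """Return the strings of L(expr) whose length is at most max_len."""
    if isinstance(expr, Char):
        return {expr.c} if max_len >= 1 else set()
    if isinstance(expr, Eps):
        return {""}
    if isinstance(expr, Alt):
        return language(expr.r, max_len) | language(expr.s, max_len)
    if isinstance(expr, Cat):
        return {x + y
                for x in language(expr.r, max_len)
                for y in language(expr.s, max_len)
                if len(x + y) <= max_len}
    if isinstance(expr, Star):
        # Repeatedly append non-empty strings of L(R) until nothing new fits.
        base = language(expr.r, max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {expr!r}")

# (ab)* restricted to strings of length <= 4
ab_star = Star(Cat(Char("a"), Char("b")))
print(sorted(language(ab_star, 4), key=len))   # ['', 'ab', 'abab']
```

Each branch of `language` mirrors one line of the definition, which is why the function needs no cases beyond the five constructors.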

Extensions to regular expressions are used as convenient notation for complex REs:

  R?       ε|R (zero or one R)
  R+       RR* (one or more R)
  (R)      R (grouping)
  [abc]    a|b|c (any of the listed characters)
  [a-z]    a|b|...|z (range)
  [^ab]    c|d|... (anything but ‘a’ or ‘b’)
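
That each extension is only shorthand for a core expression can be checked mechanically on short strings. The sketch below is an illustration, not part of the notes: it assumes Python's `re` module as a stand-in RE engine (it happens to support the same extension syntax), writes ε as an empty alternative, and uses a hypothetical helper `accepted` that enumerates every string over a small alphabet up to a given length.

```python
import re
from itertools import product

def accepted(pattern, alphabet, max_len):
    """Strings over `alphabet` of length <= max_len that `pattern` matches entirely."""
    compiled = re.compile(pattern)
    return {"".join(chars)
            for n in range(max_len + 1)
            for chars in product(alphabet, repeat=n)
            if compiled.fullmatch("".join(chars))}

# Each pair: (extension, equivalent core form).
pairs = [
    ("a?",    "(|a)"),     # R?    = ε|R
    ("a+",    "aa*"),      # R+    = RR*
    ("[abc]", "a|b|c"),    # [abc] = a|b|c
    ("[a-c]", "a|b|c"),    # [a-c] = a|b|c (range)
    ("[^ab]", "c|d"),      # [^ab] = c|d, assuming the alphabet is {a, b, c, d}
]
for ext, core in pairs:
    assert accepted(ext, "abcd", 3) == accepted(core, "abcd", 3)
```

The comparison is bounded by `max_len`, so it illustrates the equivalences rather than proving them.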

Here are some regular expressions and the strings in the languages they denote.

  RE        Strings in L(RE)
  a         “a”
  ab        “ab”
  a|b       “a”, “b”
  (ab)*     “”, “ab”, “abab”, ...
  (a|ε)b    “ab”, “b”
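
The table can be replayed as membership tests. This is a small sketch under the assumption that Python's `re` module stands in for the RE notation of the notes; `in_language` is a hypothetical helper, and ε is written as an empty alternative, so (a|ε)b becomes `(a|)b`.

```python
import re

# RE from the table -> strings listed as members of its language.
examples = {
    "a":     ["a"],
    "ab":    ["ab"],
    "a|b":   ["a", "b"],
    "(ab)*": ["", "ab", "abab"],
    "(a|)b": ["ab", "b"],
}

def in_language(pattern, w):
    """True if the whole string w belongs to the language denoted by pattern."""
    return re.fullmatch(pattern, w) is not None

for pattern, members in examples.items():
    for w in members:
        assert in_language(pattern, w), (pattern, w)

# A few non-members for contrast.
assert not in_language("ab", "a")
assert not in_language("(ab)*", "aba")
assert not in_language("(a|)b", "aab")
```

`re.fullmatch` is used rather than `re.search` because membership in L(R) concerns the entire string, not a substring of it.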

Here are examples of common tokens found in programming languages.

  digit         ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’
  integer       digit digit*
  identifier    [a-zA-Z_][a-zA-Z0-9_]*
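
These token definitions translate almost directly into patterns for an RE library. The sketch below assumes Python's `re` module; the function `classify` and its return values are hypothetical names chosen for illustration and do not describe how a generated lexical analyser works.

```python
import re

DIGIT = "[0-9]"                          # digit      = '0'|'1'|...|'9'
INTEGER = f"{DIGIT}{DIGIT}*"             # integer    = digit digit*
IDENTIFIER = "[a-zA-Z_][a-zA-Z0-9_]*"    # identifier = [a-zA-Z_][a-zA-Z0-9_]*

def classify(lexeme):
    """Return the token class whose RE matches the whole lexeme, or None."""
    if re.fullmatch(INTEGER, lexeme):
        return "integer"
    if re.fullmatch(IDENTIFIER, lexeme):
        return "identifier"
    return None

print(classify("2011"))      # integer
print(classify("_count1"))   # identifier
print(classify("9lives"))    # None
```

The last call returns None because an identifier may not start with a digit and an integer may not contain letters.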

Finite Automaton

We need a mechanism to determine whether an input string w belongs to L(R), the language denoted by regular expression R. Such a mechanism is called an acceptor.

Given an input string w and a language L, the acceptor answers yes if w ∈ L and no if w ∉ L.

The acceptor is based on finite automata (FA). A finite automaton consists of:

  • An input alphabet Σ
  • A set of states
  • A start (initial) state
  • A set of transitions
  • A set of accepting (final) states

A finite automaton accepts a string if we can follow transitions labeled with the characters of the string from the start state to some accepting state. Here are some examples of FA.

  • An FA that accepts only “1”
  • An FA that accepts any number of 1’s followed by a single 0
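
Both example automata can be written down as transition tables and run on input strings. The following is a minimal sketch, not taken from the notes: the class `FA`, the state names "A" and "B", and the choice of a deterministic automaton (at most one transition per state and character) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class FA:
    alphabet: frozenset     # input alphabet Σ
    states: frozenset       # set of states
    start: str              # start (initial) state
    transitions: dict       # (state, character) -> next state
    accepting: frozenset    # set of accepting (final) states

    def accepts(self, w):
        """Follow transitions labeled with the characters of w from the start state."""
        state = self.start
        for ch in w:
            # A missing transition means the string cannot be accepted.
            if (state, ch) not in self.transitions:
                return False
            state = self.transitions[(state, ch)]
        return state in self.accepting

# An FA that accepts only "1" (hypothetical state names).
only_one = FA(
    alphabet=frozenset("01"),
    states=frozenset({"A", "B"}),
    start="A",
    transitions={("A", "1"): "B"},
    accepting=frozenset({"B"}),
)

# An FA that accepts any number of 1's followed by a single 0.
ones_then_zero = FA(
    alphabet=frozenset("01"),
    states=frozenset({"A", "B"}),
    start="A",
    transitions={("A", "1"): "A", ("A", "0"): "B"},
    accepting=frozenset({"B"}),
)

print(only_one.accepts("1"), only_one.accepts("11"))                  # True False
print(ones_then_zero.accepts("1110"), ones_then_zero.accepts("100"))  # True False
```

The second automaton loops on state A for every 1 and moves to the accepting state B on 0; since B has no outgoing transitions, any further input is rejected.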