Implementation of Lexical Analysis
Compiler Design 1 (2011) 2
Outline
- Specifying lexical structure using regular expressions
- Finite automata
- Deterministic Finite Automata (DFAs)
- Non-deterministic Finite Automata (NFAs)
- Implementation of regular expressions RegExp ⇒ NFA ⇒ DFA ⇒ Tables
Notation
- For convenience, we use a variation of regular expression notation (allowing user-defined abbreviations)
- Union: A + B ≡ A | B
- Option: A + ε ≡ A?
- Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]
- Excluded range: complement of [a-z] ≡ [^a-z]
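For comparison, the same abbreviations exist in most regex libraries. The following is a minimal sketch using Python's `re` module; the choice of Python, the sample patterns, and the helper name `in_language` are assumptions made for illustration and are not part of the lecture.

```python
import re

# Union: A + B is written A|B
union = re.compile(r"if|else")
# Option: A + ε is written A?
option = re.compile(r"-?[0-9]")
# Range: 'a' + 'b' + ... + 'z' is written [a-z]
lower = re.compile(r"[a-z]")
# Excluded range: the complement of [a-z] is written [^a-z]
not_lower = re.compile(r"[^a-z]")

def in_language(pattern, s):
    """Return True iff the whole string s is in L(pattern)."""
    return pattern.fullmatch(s) is not None
```

Note that `fullmatch` is used so that the test is membership of the whole string, not of some substring.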
Regular Expressions in Lexical Specification
- Last lecture: a specification for the predicate s ∈ L(R)
- But a yes/no answer is not enough!
- Instead: partition the input into tokens
- We will adapt regular expressions to this goal
Regular Expressions ⇒ Lexical Spec. (1)
- Select a set of tokens
- Integer, Keyword, Identifier, OpenPar, ...
- Write a regular expression (pattern) for the lexemes of each token
- Integer = digit⁺ (one or more digits)
- Keyword = ‘if’ + ‘else’ + …
- Identifier = letter (letter + digit)*
- OpenPar = ‘(’
- …
Regular Expressions ⇒ Lexical Spec. (2)
- Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Integer + … = R₁ + R₂ + R₃ + …
Facts: If s ∈ L(R) then s is a lexeme
- Furthermore s ∈ L(Rᵢ) for some “i”
- This “i” determines the token that is reported
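As a rough illustration of building R and recovering the matching “i”, here is a sketch in Python. The concrete regexes, the list name `TOKEN_PATTERNS`, and the function `classify` are illustrative assumptions; only the token names come from the slides.

```python
import re

# Token patterns R1, R2, R3, ... in priority order.
TOKEN_PATTERNS = [
    ("Keyword", r"if|else"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer", r"[0-9]+"),
    ("OpenPar", r"\("),
]

# R = R1 + R2 + R3 + ...; each alternative is a named group so that the
# alternative that matched can be identified afterwards.
R = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def classify(s):
    """Return the token whose pattern matches all of s, or None if s is not in L(R)."""
    m = R.fullmatch(s)
    return m.lastgroup if m else None
```

Because the alternatives are tried in order, a string such as `if` that is in both L(Keyword) and L(Identifier) is reported as the earlier one.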
Regular Expressions ⇒ Lexical Spec. (3)
- Let input be x₁…xₙ
- (x₁ … xₙ are characters)
- For 1 ≤ i ≤ n check x₁…xᵢ ∈ L(R)?
- It must be that x₁…xᵢ ∈ L(Rⱼ) for some j (if there is a choice, pick the smallest such j)
- Remove x₁…xᵢ from the input and repeat from the first step on the remaining input
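The steps above can be sketched as a simple loop. The rule list `RULES` and the function `tokenize` are hypothetical names. The steps do not say which i to choose when several prefixes match, so this sketch assumes the longest matching prefix wins. It is quadratic and meant only to mirror the description, not to be efficient.

```python
import re

# Hypothetical rule list, in priority order (smaller index = higher priority).
RULES = [
    ("Keyword", re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
    ("OpenPar", re.compile(r"\(")),
]

def tokenize(text):
    """Split text into (token, lexeme) pairs by repeatedly removing a matching prefix."""
    tokens = []
    while text:
        # Try prefixes x1..xi from longest to shortest (assumption: longest wins).
        for i in range(len(text), 0, -1):
            prefix = text[:i]
            # Smallest j such that the prefix is in L(Rj).
            name = next((n for n, rx in RULES if rx.fullmatch(prefix)), None)
            if name is not None:
                tokens.append((name, prefix))
                text = text[i:]  # remove x1..xi from the input
                break
        else:
            raise ValueError(f"no rule matches at {text!r}")
    return tokens
```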
How to Handle Spaces and Comments?
- We could create a token Whitespace
- Whitespace = (‘ ’ + ‘\n’ + ‘\t’)⁺
- We could also add comments in there
- An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace
- Lexer skips spaces (preferred)
- Modify step 5 from before (the “It must be that …” step) as follows: It must be that xₖ…xᵢ ∈ L(Rⱼ) for some j such that x₁…xₖ₋₁ ∈ L(Whitespace)
- Parser is not bothered with spaces
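A minimal sketch of the preferred option, in which the lexer drops whitespace itself so that the parser never sees it. The rule list and the function name `tokenize_skipping_spaces` are assumptions for illustration; here the first rule that matches at the current position is taken.

```python
import re

WHITESPACE = re.compile(r"[ \n\t]+")
# Hypothetical token rules, in priority order.
RULES = [
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Integer", re.compile(r"[0-9]+")),
]

def tokenize_skipping_spaces(text):
    """Return the token names found in text, never reporting Whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()  # the skipped characters are in L(Whitespace)
            continue
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m:
                tokens.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens
```

With this variant the input from the example above yields a single Integer token instead of Whitespace Integer Whitespace.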
Regular Languages & Finite Automata
Basic formal language theory result:
Regular expressions and finite automata both define the class of regular languages.
Thus, we are going to use:
- Regular expressions for specification
- Finite automata for implementation (automatic generation of lexical analyzers)
Finite Automata
A finite automaton is a recognizer for the strings of a regular language
A finite automaton consists of
- A finite input alphabet Σ
- A set of states S
- A start state n
- A set of accepting states F ⊆ S
- A set of transitions state --input--> state
Finite Automata
- Transition s₁ --a--> s₂
- Is read: in state s₁, on input “a”, go to state s₂
- If end of input (or no transition possible)
- If in accepting state ⇒ accept
- Otherwise ⇒ reject
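The five components and the accept/reject rule can be sketched as a small data structure. The class name `DFA`, its field names, and the example automaton are illustrative assumptions. This sketch also assumes that getting stuck with input still remaining counts as reject, which is the usual convention when the whole string has to be recognized.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    alphabet: set      # finite input alphabet Σ
    states: set        # set of states S
    start: str         # start state n
    accepting: set     # accepting states F ⊆ S
    transitions: dict  # (state, input) -> state

    def accepts(self, s):
        state = self.start
        for ch in s:
            key = (state, ch)
            if key not in self.transitions:
                return False  # stuck before the end of input: reject
            state = self.transitions[key]
        return state in self.accepting  # end of input reached

# Illustrative automaton: any number of 1's followed by a single 0.
ones_then_zero = DFA(
    alphabet={"0", "1"},
    states={"A", "B"},
    start="A",
    accepting={"B"},
    transitions={("A", "1"): "A", ("A", "0"): "B"},
)
```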
Finite Automata State Graphs
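- A state: a circle
- The start state: a circle with an incoming arrow that has no source
- An accepting state: a double circle
- A transition: an arrow between two states, labeled with an input symbol (e.g. a)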
A Simple Example
- A finite automaton that accepts only “1”
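[Figure: state graph with a start state, an accepting state, and one transition labeled 1 between them]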
Another Simple Example
- A finite automaton accepting any number of 1’s followed by a single 0
- Alphabet: {0,1}
[Figure: state graph in which the start state loops on 1 and moves to an accepting state on 0]
And Another Example
- Alphabet {0,1}
- What language does this recognize?
[Figure: state graph with transitions labeled 0 and 1]
And Another Example
- Alphabet still { 0, 1 }
- The operation of the automaton is not completely defined by the input
- On input “11” the automaton could be in either state
[Figure: state graph with two transitions labeled 1]
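One way to make “could be in either state” concrete is to track the set of all states the automaton might be in. The sketch below uses a hypothetical automaton without ε-moves (not the one in the figure), in which state "A" has two transitions on input 1.

```python
def states_after(transitions, start, s):
    """Return every state the automaton could be in after reading s."""
    current = {start}
    for ch in s:
        current = {t for q in current for t in transitions.get((q, ch), set())}
    return current

def nfa_accepts(transitions, start, accepting, s):
    """Accept iff at least one possible run ends in an accepting state."""
    return bool(states_after(transitions, start, s) & accepting)

# Hypothetical automaton over {0, 1}: on input 1, "A" may stay in "A" or move to "B".
TRANSITIONS = {("A", "1"): {"A", "B"}, ("A", "0"): {"A"}}
```

For this automaton, `states_after(TRANSITIONS, "A", "11")` contains both "A" and "B".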
NFA vs. DFA (1)
- NFAs and DFAs recognize the same set of languages (regular languages)
- DFAs are easier to implement
- There are no choices to consider
NFA vs. DFA (2)
- For a given language the NFA can be simpler than the DFA
[Figure: an NFA and a DFA for the same language, with transitions labeled 0 and 1]
- DFA can be exponentially larger than NFA
Regular Expressions to Finite Automata
Lexical Specification ⇒ Regular expressions ⇒ NFA ⇒ DFA ⇒ Table-driven Implementation of DFA
Regular Expressions to NFA (1)
- For each kind of reg. expr, define an NFA
- Notation: NFA for regular expression M
[Figure: the NFA for M, drawn as a box labeled M]
Regular Expressions to NFA (2)
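- For AB: [Figure: the NFA for A, with an ε-transition from its accepting state to the start state of the NFA for B]
- For A + B: [Figure: a new start state with ε-transitions into the NFAs for A and B, and ε-transitions from both into a new accepting state]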
Regular Expressions to NFA (3)
- For A*: [Figure: the NFA for A with added ε-transitions that allow it to be skipped or repeated]
Example of Regular Expression → NFA conversion
- Consider the regular expression (1+0)*1
- The NFA is
[Figure: NFA with states A through J; labeled transitions C --1--> E, D --0--> F, and I --1--> J; all other transitions are ε-moves]
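The constructions for AB, A + B, and A* can be sketched in code. The sketch represents an NFA as a tuple `(start, accept, edges)`, where a label of `None` stands for ε. This representation and all function names are assumptions, and the ε-wiring used for A* is one common variant that may differ in detail from the figure. The result is checked by simulating the NFA for (1+0)*1.

```python
from itertools import count

_fresh = count()  # source of fresh state numbers

def symbol(a):
    s, t = next(_fresh), next(_fresh)
    return (s, t, [(s, a, t)])

def concat(A, B):  # AB: ε-move from A's accepting state to B's start state
    (sa, ta, ea), (sb, tb, eb) = A, B
    return (sa, tb, ea + eb + [(ta, None, sb)])

def union(A, B):  # A + B: new start and accepting states joined by ε-moves
    (sa, ta, ea), (sb, tb, eb) = A, B
    s, t = next(_fresh), next(_fresh)
    return (s, t, ea + eb + [(s, None, sa), (s, None, sb), (ta, None, t), (tb, None, t)])

def star(A):  # A*: ε-moves allow skipping A or repeating it
    sa, ta, ea = A
    s, t = next(_fresh), next(_fresh)
    return (s, t, ea + [(s, None, sa), (s, None, t), (ta, None, sa), (ta, None, t)])

def accepts(nfa, text):
    """Simulate the NFA on text, following ε-moves after every step."""
    start, accept, edges = nfa

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in edges:
                if src == q and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved)
    return accept in current

# (1+0)*1: any string of 0's and 1's that ends in 1.
loop = star(union(symbol("1"), symbol("0")))
nfa = concat(loop, symbol("1"))
```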
NFA to DFA. The Trick
- Simulate the NFA
- Each state of DFA = a non-empty subset of states of the NFA
- Start state = the set of NFA states reachable through ε-moves from NFA start state
- Add a transition S --a--> S’ to DFA iff
- S’ is the set of NFA states reachable from any state in S after seeing the input a, considering ε-moves as well
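The trick can be sketched as a worklist algorithm over sets of NFA states. The edge-list representation (label `None` for ε), the function names, and the small example NFA are assumptions made for illustration.

```python
def epsilon_closure(states, edges):
    """All NFA states reachable from the given states through ε-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is None and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction(start, accepting, edges, alphabet):
    """Return (dfa_start, dfa_accepting, dfa_transitions) for the given NFA."""
    dfa_start = epsilon_closure({start}, edges)
    transitions, dfa_accepting = {}, set()
    worklist, seen = [dfa_start], {dfa_start}
    while worklist:
        S = worklist.pop()
        if S & accepting:
            dfa_accepting.add(S)  # S contains an accepting NFA state
        for a in alphabet:
            moved = {dst for src, label, dst in edges if src in S and label == a}
            T = epsilon_closure(moved, edges)
            if not T:
                continue  # only non-empty subsets become DFA states
            transitions[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return dfa_start, dfa_accepting, transitions

# Hypothetical NFA: strings over {0, 1} that end in 1, with one ε-move.
EDGES = [("A", None, "B"), ("B", "0", "B"), ("B", "1", "B"), ("B", "1", "C")]
dfa_start, dfa_accepting, dfa_transitions = subset_construction("A", {"C"}, EDGES, "01")
```

Only subsets that are actually reachable from the start set are built, which is why the resulting DFA is often much smaller than the worst case.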
Implementation (Cont.)
- NFA → DFA conversion is at the heart of tools such as lex, ML-Lex or flex
- But, DFAs can be huge
- In practice, lex/ML-Lex/flex-like tools trade off speed for space in the choice of NFA and DFA representations
Theory vs. Practice
Two differences:
- DFAs recognize lexemes. A lexer must return a type of acceptance (token type) rather than simply an accept/reject indication.
- DFAs consume the complete string and accept or reject it. A lexer must find the end of the lexeme in the input stream and then find the next one, etc.
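Both differences can be illustrated with a small table-driven sketch. The DFA, its state names, the character classes, and the token names are hypothetical. The sketch also assumes that the end of a lexeme is the last position at which the DFA was in an accepting state before getting stuck (longest match), which this section does not spell out.

```python
# Hypothetical transition table over character classes.
START = "start"
TRANSITIONS = {
    ("start", "letter"): "ident",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("start", "digit"): "int",
    ("int", "digit"): "int",
}
# Each accepting state carries the token type it reports.
ACCEPTING = {"ident": "Identifier", "int": "Integer"}

def char_class(ch):
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

def next_token(text, pos):
    """Run the DFA from pos; return (token_type, end) for the longest lexeme found."""
    state, last, i = START, None, pos
    while i < len(text):
        state = TRANSITIONS.get((state, char_class(text[i])))
        if state is None:
            break  # the DFA is stuck: stop scanning
        i += 1
        if state in ACCEPTING:
            last = (ACCEPTING[state], i)  # remember the latest accepting position
    if last is None:
        raise ValueError(f"no lexeme starts at position {pos}")
    return last

def tokenize(text):
    """Repeatedly find the end of the current lexeme, then start on the next one."""
    tokens, pos = [], 0
    while pos < len(text):
        token_type, end = next_token(text, pos)
        tokens.append((token_type, text[pos:end]))
        pos = end
    return tokens
```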