
Chapter Seven: Regular Expressions

Docsity.com

*The first time a young student sees the mathematical constant π, it looks like just one more school artifact: one more arbitrary symbol whose definition to memorize for the next test. Later, if he or she persists, this perception changes. In many branches of mathematics and with many practical applications, π keeps on turning up. "There it is again!" says the student, thus joining the ranks of mathematicians for whom mathematics seems less like an artifact invented and more like a natural phenomenon discovered.*

*So it is with regular languages. We have seen that DFAs and NFAs have equal definitional power. It turns out that regular expressions also have exactly that same definitional power: they can be used to define all the regular languages, and only the regular languages. There it is again!*


Outline

• 7.1 Regular Expressions, Formally Defined
• 7.2 Regular Expression Examples
• 7.3 For Every Regular Expression, a Regular Language
• 7.4 Regular Expressions and Structural Induction
• 7.5 For Every Regular Language, a Regular Expression


Concatenation of Languages

• The concatenation of two languages $L_1$ and $L_2$ is $L_1L_2 = \{xy \mid x \in L_1 \text{ and } y \in L_2\}$

• The set of all strings that can be constructed by concatenating a string from the first language with a string from the second

• For example, if $L_1 = \{a, b\}$ and $L_2 = \{c, d\}$ then $L_1L_2 = \{ac, ad, bc, bd\}$
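For finite languages, the definition above can be tried out directly. The following is a minimal Python sketch; the function name `concat` and the use of Python sets of strings to model languages are assumptions made for illustration.

```python
def concat(l1, l2):
    """Return {xy | x in l1 and y in l2} for two finite languages."""
    return {x + y for x in l1 for y in l2}


L1 = {"a", "b"}
L2 = {"c", "d"}
print(sorted(concat(L1, L2)))  # ['ac', 'ad', 'bc', 'bd']
```

Concatenating with the empty language gives the empty language, because there is no `y` to choose.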


Kleene Closure of a Language

• The Kleene closure of a language $L$ is $L^* = \{x_1x_2 \ldots x_n \mid n \ge 0, \text{ with all } x_i \in L\}$

• The set of strings that can be formed by concatenating any number of strings, each of which is an element of $L$

• Not the same as $\{x^n \mid n \ge 0 \text{ and } x \in L\}$
• In $L^*$, each $x_i$ may be a different element of $L$
• For example, $\{ab, cd\}^* = \{\varepsilon, ab, cd, abab, abcd, cdab, cdcd, ababab, \ldots\}$
• For all $L$, $\varepsilon \in L^*$
• For all $L$ containing at least one string other than ε, $L^*$ is infinite
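Because $L^*$ is usually infinite, a program can only list part of it. The sketch below lists the strings of $L^*$ up to a chosen length and contrasts them with the smaller set $\{x^n\}$. The names `kleene_up_to` and `powers_up_to` and the length bound are assumptions made for illustration.

```python
def concat(l1, l2):
    """Return {xy | x in l1 and y in l2} for two finite languages."""
    return {x + y for x in l1 for y in l2}


def kleene_up_to(lang, max_len):
    """Strings of L* whose length is at most max_len."""
    result = {""}      # n = 0 gives the empty string, which is always in L*
    frontier = {""}
    while frontier:
        # Append one more element of L to each string found in the last round.
        frontier = {s for s in concat(frontier, lang)
                    if len(s) <= max_len and s not in result}
        result |= frontier
    return result


def powers_up_to(lang, max_len):
    """Strings x^n: one x from L repeated n times (not the same as L*)."""
    return {x * n for x in lang for n in range(max_len + 1)
            if len(x) * n <= max_len}


L = {"ab", "cd"}
print(sorted(kleene_up_to(L, 4), key=lambda s: (len(s), s)))
# ['', 'ab', 'cd', 'abab', 'abcd', 'cdab', 'cdcd']
print("abcd" in powers_up_to(L, 4))  # False
```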


Regular Expressions

• A regular expression is a string *r* that denotes
a language *L*(*r*) over some
alphabet Σ

• Regular expressions make special use of the symbols ε, ∅, +, *, and parentheses

• We will assume that these special symbols are not included in Σ

• There are six kinds of regular expressions…


The Six Regular Expressions

• The six kinds of regular expressions, and the languages they denote, are:
– Three kinds of *atomic* regular expressions:

• Any symbol $a \in \Sigma$, with $L(a) = \{a\}$
• The special symbol ε, with $L(\varepsilon) = \{\varepsilon\}$
• The special symbol ∅, with $L(\emptyset) = \{\}$

– Three kinds of *compound* regular expressions built from smaller regular expressions, here called $r$, $r_1$, and $r_2$:

• $(r_1 + r_2)$, with $L(r_1 + r_2) = L(r_1) \cup L(r_2)$
• $(r_1r_2)$, with $L(r_1r_2) = L(r_1)L(r_2)$
• $(r)^*$, with $L((r)^*) = (L(r))^*$

• The parentheses may be omitted, in which case * has highest precedence and + has lowest
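The six-case definition translates almost line for line into a recursive function. The sketch below is one possible encoding, not part of the original slides: regular expressions are modeled as nested tuples with the hypothetical tags `"sym"`, `"eps"`, `"empty"`, `"+"`, `"cat"`, and `"*"`, and the function `lang` returns only the strings of $L(r)$ up to a given length, since $L(r)$ may be infinite.

```python
def lang(r, max_len):
    """Strings of L(r) with length at most max_len (max_len >= 0)."""
    kind = r[0]
    if kind == "sym":                       # a symbol a, with L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "eps":                       # L(ε) = {ε}
        return {""}
    if kind == "empty":                     # L(∅) = {}
        return set()
    if kind == "+":                         # union
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "cat":                       # concatenation
        return {x + y
                for x in lang(r[1], max_len)
                for y in lang(r[2], max_len)
                if len(x + y) <= max_len}
    if kind == "*":                         # Kleene closure
        base = lang(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown kind of regular expression: {kind!r}")


# b followed by zero or more a's
b_then_as = ("cat", ("sym", "b"), ("*", ("sym", "a")))
print(sorted(lang(b_then_as, 3)))  # ['b', 'ba', 'baa']
```

Using tuples keeps the fully parenthesized structure explicit, so precedence never has to be parsed.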


Other Uses of the Name

• These are classical regular expressions
• Many modern programs use text patterns also called *regular expressions*:
– Tools like awk, sed and grep
– Languages like Perl, Python, Ruby, and PHP
– Language libraries like those for Java and the .NET languages
• All slightly different from ours and each other
• More about them in a later chapter
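As one small illustration of such a difference, Python's `re` module writes union as `|`, and uses `+` to mean "one or more". The helper `matches` below is a hypothetical name introduced for this sketch.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches the Python re pattern."""
    return re.fullmatch(pattern, text) is not None


# The classical expression ab+c denotes {ab, c}; in Python that is ab|c.
print(matches(r"ab|c", "c"))     # True
# In Python, ab+c means: a, then one or more b's, then c.
print(matches(r"ab+c", "c"))     # False
print(matches(r"ab+c", "abbc"))  # True
```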



*ab*

• Denotes the language {*ab*}
• Our formal definition permits this because

– *a* is an atomic regular expression denoting {*a*}
– *b* is an atomic regular expression denoting {*b*}
– Their concatenation (*ab*) is a compound regular expression denoting {*a*}{*b*} = {*ab*}
– Unnecessary parentheses can be omitted

• Thus any string *x* in Σ* can be used by itself
as a regular expression, denoting {*x*}


*ab*+*c*

• Denotes the language {*ab*, *c*}
• We omitted parentheses from the fully parenthesized form ((*ab*)+*c*)
• The inner pair is unnecessary because + has lower precedence than concatenation
• Thus any finite language can be defined using a regular expression
• Just list the strings, separated by +
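The "just list the strings" recipe can be sketched in a few lines of Python. The function name `finite_language_to_regex` is hypothetical, and the sketch assumes the strings contain no special symbols. It writes the empty string as ε and the empty language as ∅, the two cases a plain list cannot express.

```python
def finite_language_to_regex(strings):
    """Build a classical regular expression for a finite language."""
    if not strings:
        return "∅"  # the empty language has no strings to list
    # Sort for a predictable result; write the empty string as ε.
    return "+".join(s if s else "ε" for s in sorted(strings))


print(finite_language_to_regex({"ab", "c"}))  # ab+c
```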


$ba^*$

• Denotes the language $\{ba^n\}$: the set of strings consisting of $b$ followed by zero or more $a$s

• Not the same as $(ba)^*$, which denotes $\{(ba)^n\}$
• * has higher precedence than concatenation
• The Kleene star is the only way to define an infinite language using regular expressions
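Listing the first few strings of each language makes the difference concrete. This is a small sketch; the variable names and the cutoff of four strings per language are arbitrary choices.

```python
# First four strings of {b a^n}: the star applies only to a.
ba_star = ["b" + "a" * n for n in range(4)]
# First four strings of {(ba)^n}: the star applies to the whole group ba.
ba_group_star = ["ba" * n for n in range(4)]

print(ba_star)        # ['b', 'ba', 'baa', 'baaa']
print(ba_group_star)  # ['', 'ba', 'baba', 'bababa']
```

The only string the two lists share is `ba`.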


$(a+b)^*$

• Denotes $\{a,b\}^*$: the whole language of strings over the alphabet $\{a,b\}$

• The parentheses are necessary here, because * has higher precedence than +

• Without them, $a+b^*$ denotes $\{a\} \cup \{b\}^*$
• Reminder: the Kleene star does not mean "zero or more copies…" of a single string from $\{a,b\}$
• That would be $a^*+b^*$, which denotes $\{a\}^* \cup \{b\}^*$
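The three languages on this slide can be compared by listing their strings up to a small length. This is a sketch only; the helper `strings_over` and the bound `MAX_LEN` are assumptions made for illustration.

```python
from itertools import product

MAX_LEN = 2


def strings_over(symbols, max_len):
    """All strings over the given symbols with length at most max_len."""
    return {"".join(p)
            for n in range(max_len + 1)
            for p in product(symbols, repeat=n)}


all_strings = strings_over("ab", MAX_LEN)                      # (a+b)*
a_plus_b_star = {"a"} | strings_over("b", MAX_LEN)             # a+b*
a_star_plus_b_star = (strings_over("a", MAX_LEN)
                      | strings_over("b", MAX_LEN))            # a*+b*

print(sorted(a_plus_b_star))                     # ['', 'a', 'b', 'bb']
print(sorted(all_strings - a_star_plus_b_star))  # ['ab', 'ba']
```

The mixed strings `ab` and `ba` are exactly what $a^*+b^*$ misses at this length.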


*ab*+ε

• Denotes the language {*ab*, ε}
• Occasionally, we need to use the atomic regular expression ε to include ε in the language

• But it's not needed in (*a*+*b*)*+ε, because ε is already part of every Kleene star


∅

• Denotes {}
• There is no other way to denote the empty set with regular expressions
• That's all you should ever use ∅ for
• It is not useful in compounds:

– $L(r\emptyset) = L(\emptyset r) = \{\}$
– $L(r + \emptyset) = L(\emptyset + r) = L(r)$
– $L(\emptyset^*) = \{\varepsilon\}$
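These three identities follow directly from the definitions of concatenation, union, and Kleene closure given earlier. A short derivation, written out as a sketch:

$$
\begin{aligned}
L(r\emptyset) &= L(r)L(\emptyset) = \{xy \mid x \in L(r) \text{ and } y \in \{\}\} = \{\} \\
L(r + \emptyset) &= L(r) \cup L(\emptyset) = L(r) \cup \{\} = L(r) \\
L(\emptyset^*) &= \{x_1x_2 \ldots x_n \mid n \ge 0, \text{ with all } x_i \in \{\}\} = \{\varepsilon\}
\end{aligned}
$$

In the first line there is no $y$ to choose, so no string $xy$ exists. In the last line only $n = 0$ is possible, and the concatenation of zero strings is ε. The mirrored forms $L(\emptyset r)$ and $L(\emptyset + r)$ work the same way.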


More Examples

• $(a+b)(c+d)$
– Denotes $\{ac, ad, bc, bd\}$

• $(abc)^*$
– Denotes $\{(abc)^n\} = \{\varepsilon, abc, abcabc, \ldots\}$

• $a^*b^*$
– Denotes $\{a^nb^m\} = \{xy \mid x \in \{a\}^* \text{ and } y \in \{b\}^*\}$


More Examples

• $(a+b)^*aa(a+b)^*$
– Denotes $\{x \in \{a,b\}^* \mid x \text{ contains at least 2 consecutive } a\text{s}\}$

• $(a+b)^*a(a+b)^*a(a+b)^*$
– Denotes $\{x \in \{a,b\}^* \mid x \text{ contains at least 2 } a\text{s}\}$

• $(a^*b^*)^*$
– Denotes $\{a,b\}^*$, same as the simpler $(a+b)^*$
– Because $L(a^*b^*)$ contains both $a$ and $b$, and that's enough: we already have $L((a+b)^*) = \{a,b\}^*$
– In general, whenever $\Sigma \subseteq L(r)$, then $L((r)^*) = \Sigma^*$
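The last claim can be checked by brute force for short strings. The sketch below compares $L((a^*b^*)^*)$ and $L((a+b)^*)$ for all strings up to a small length. The helper names and the bound `MAX_LEN` are assumptions, and agreement up to a bound illustrates the claim without proving it.

```python
from itertools import product


def concat(l1, l2, max_len):
    """Concatenation of two languages, keeping strings up to max_len."""
    return {x + y for x in l1 for y in l2 if len(x + y) <= max_len}


def star(lang, max_len):
    """Kleene closure of a language, keeping strings up to max_len."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, lang, max_len) - result
        result |= frontier
    return result


MAX_LEN = 4
a_star_b_star = concat(star({"a"}, MAX_LEN), star({"b"}, MAX_LEN), MAX_LEN)
lhs = star(a_star_b_star, MAX_LEN)   # L((a*b*)*), up to MAX_LEN
rhs = star({"a", "b"}, MAX_LEN)      # L((a+b)*), up to MAX_LEN
sigma_star = {"".join(p)
              for n in range(MAX_LEN + 1)
              for p in product("ab", repeat=n)}

print(lhs == rhs == sigma_star)  # True
print(len(lhs))                  # 31
```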



Regular Expression to NFA

• Goal: to show that every regular expression defines a regular language

• Approach: give a way to convert any regular expression to an NFA for the same language

• Advantage: large NFAs can be composed from smaller ones using ε-transitions


Standard Form

• To make them easier to compose, our NFAs will all have the same standard form:
– Exactly one accepting state, not the start state

• That is, for any regular expression $r$, we will show how to construct an NFA $N$ with $L(N) = L(r)$, pictured like this:

[Figure: an NFA in standard form, labeled $r$]


Composing Example

• That form makes composition easy
• For example, given NFAs for $L(r_1)$ and $L(r_2)$, we can easily construct one for $L(r_1+r_2)$:

[Figure: NFA for $r_1+r_2$, built from the NFAs for $r_1$ and $r_2$]

• This new NFA still has our special form


Lemma 7.3

If *r* is any regular expression, there is some NFA *N* that has a single accepting state, not the same as the start state, with *L*(*N*) = *L*(*r*).

• Proof sketch:
– There are six kinds of regular expressions
– We will show how to build a suitable NFA for each kind


Proof Sketch: Atomic Expressions

• There are three kinds of atomic regular expressions
– Any symbol $a \in \Sigma$, with $L(a) = \{a\}$
– The special symbol ε, with $L(\varepsilon) = \{\varepsilon\}$
– The special symbol ∅, with $L(\emptyset) = \{\}$

[Figures: one NFA for each case: $a \in \Sigma$, ε, and ∅]


Proof: Compound Expressions

• There are three kinds of *compound* regular expressions:
– $(r_1 + r_2)$, with $L(r_1 + r_2) = L(r_1) \cup L(r_2)$

[Figure: NFA for $(r_1 + r_2)$, built from the NFAs for $r_1$ and $r_2$]


– $(r_1r_2)$, with $L(r_1r_2) = L(r_1)L(r_2)$

[Figure: NFA for $(r_1r_2)$, built from the NFAs for $r_1$ and $r_2$]

– $(r_1)^*$, with $L((r_1)^*) = (L(r_1))^*$

[Figure: NFA for $(r_1)^*$, built from the NFA for $r_1$]
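The six cases can be sketched as a recursive builder. The code below is one common way to wire the pieces together with ε-transitions; it may differ in detail from the constructions pictured in the slides. The tuple encoding of regular expressions and the names `NFA`, `build`, `eps_closure`, and `accepts` are assumptions made for illustration. Every NFA it returns has a fresh start state and a single, distinct accepting state, as the lemma requires.

```python
from itertools import count

_ids = count()  # source of fresh state names


class NFA:
    def __init__(self):
        self.start = next(_ids)
        self.accept = next(_ids)   # always different from start
        self.edges = []            # (source, label, target); label "" is ε


def build(r):
    """Build a standard-form NFA for a regular expression given as a tuple."""
    n, kind = NFA(), r[0]
    if kind == "sym":
        n.edges.append((n.start, r[1], n.accept))
    elif kind == "eps":
        n.edges.append((n.start, "", n.accept))
    elif kind == "empty":
        pass                       # no transitions: accept is unreachable
    elif kind == "+":
        n1, n2 = build(r[1]), build(r[2])
        n.edges += n1.edges + n2.edges + [
            (n.start, "", n1.start), (n.start, "", n2.start),
            (n1.accept, "", n.accept), (n2.accept, "", n.accept)]
    elif kind == "cat":
        n1, n2 = build(r[1]), build(r[2])
        n.edges += n1.edges + n2.edges + [
            (n.start, "", n1.start), (n1.accept, "", n2.start),
            (n2.accept, "", n.accept)]
    elif kind == "*":
        n1 = build(r[1])
        n.edges += n1.edges + [
            (n.start, "", n.accept),      # zero copies
            (n.start, "", n1.start),      # enter the inner NFA
            (n1.accept, "", n1.start),    # go around again
            (n1.accept, "", n.accept)]    # or finish
    else:
        raise ValueError(f"unknown kind of regular expression: {kind!r}")
    return n


def eps_closure(nfa, states):
    """All states reachable from the given states using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in nfa.edges:
            if src == s and label == "" and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, text):
    """Simulate the NFA on text by tracking the set of possible states."""
    current = eps_closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.accept in current


# b followed by zero or more a's
nfa = build(("cat", ("sym", "b"), ("*", ("sym", "a"))))
print(accepts(nfa, "baa"))  # True
print(accepts(nfa, "ab"))   # False
```

Keeping the start state free of incoming edges and the accepting state free of outgoing edges is what lets each case reuse the smaller NFAs unchanged.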
