Regular Expressions: An Alternative Specification Method for Regular Languages, Study notes of Computer Science

These notes introduce regular expressions as an alternative specification method for regular languages. Regular expressions are easier to construct and understand than automata, and they are commonly used in computer applications to describe patterns in texts. The notes cover the formal definition of regular expressions, their benefits, and examples of regular expressions over the ASCII alphabet.

Uploaded on 07/30/2009

ECS 120 Lesson 7 – Regular Expressions, Pt. 1

Oliver Kreylos

Friday, April 13th, 2001

1 Outline

Thus far, we have discussed one way to specify a (regular) language: giving a machine that reads a word and tells whether it is in the language or not. Though this is a valid and unambiguous specification, it is sometimes not a very helpful one. Specifying languages by automata has two major shortcomings: first, when given a language, it is often difficult to construct an automaton that accepts it; second, when given an automaton, it is often difficult to understand which language it accepts. Regular expressions are an alternative specification method for regular languages: they are easier to construct, and it is easier to see which language they describe by just looking at the expression. Both benefits stem from the fact that regular expressions describe the structure of the words contained in a language, rather than giving a machine that must be “run” in order to decide a word.

Regular expressions are very common in computer applications because they are a powerful way to describe patterns in texts. Text editors (in their search and replace functions), programming languages such as Perl, and UNIX utilities such as grep, awk and lex all use regular expressions to describe patterns. Programming language compilers typically use regular expressions to define the lowest-level constructs of program source code (“tokens”), and the stage of the compiler responsible for recognizing tokens (the “lexer” or “scanner”, which feeds the parser) is often constructed automatically from those regular expressions using the lex utility.

2 Regular Expressions

Regular expressions over an alphabet Σ define languages over Σ by describing the structure of words in a language. They are based on the three regular operations: Union, concatenation and Kleene Star. They are very similar to arithmetic expressions like 3 + (4 · 5): They consist of constants and operators, and they construct complex expressions from simpler building blocks. As opposed to arithmetic expressions, their values are not numbers, but languages. Examples:

  • hello specifies the language consisting of the single word hello.
  • hello ∪ world specifies the language consisting of the two words hello and world.
  • (aa)∗ specifies the language of all words consisting of an even number of a's.
  • a∗ ◦ bb, often written just as a∗bb, specifies the set of all words consisting of any number of a's followed by two b's.

Since every regular expression defines one language, we will write L(R) to denote the language defined by regular expression R.
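The four examples above can be checked experimentally. The following is a minimal Python sketch, not part of the original notes, that assumes the standard `re` module as a stand-in for the notation used here: union "∪" is written `|`, Kleene star "∗" is written `*`, and concatenation is juxtaposition. The helper name `in_language` and the table `EXAMPLES` are hypothetical.

```python
import re

# Each entry maps the notation used in the notes to an equivalent
# pattern in Python's `re` syntax.
EXAMPLES = {
    "hello": r"hello",
    "hello ∪ world": r"hello|world",
    "(aa)∗": r"(aa)*",
    "a∗bb": r"a*bb",
}


def in_language(pattern: str, word: str) -> bool:
    """Return True if the whole word belongs to the language of pattern."""
    # fullmatch requires the pattern to cover the entire word,
    # which corresponds to membership in L(R).
    return re.fullmatch(pattern, word) is not None


if __name__ == "__main__":
    for name, pattern in EXAMPLES.items():
        for word in ["", "hello", "world", "aa", "aaa", "bb", "aabb"]:
            print(f"{word!r:>8} in L({name}): {in_language(pattern, word)}")
```

Using `re.fullmatch` rather than `re.search` matters here: a search would report a hit whenever some substring matches, which is not the same as the word being a member of L(R).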

3 Formal Definition of Regular Expressions

Regular expressions over an alphabet Σ are defined in a recursive fashion, very similarly to arithmetic expressions. We start by defining the simplest regular expressions, and then define operations to create more complex ones from simpler building blocks:

  1. ∅ is a regular expression defining the empty language, L(∅) = ∅ ⊂ Σ∗.
  2. ε is a regular expression defining the language consisting only of the empty word, L(ε) = {ε} ⊂ Σ∗.
  3. If a ∈ Σ is a character, then a is a regular expression over Σ defining the language consisting of the single one-character word a, L(a) = {a} ⊂ Σ∗.

  • R₁ ∪ R₂∗ := R₁ ∪ (R₂∗). Kleene star has precedence over union.
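A recursive definition of this kind maps naturally onto a recursive data type. The following Python sketch is an illustration rather than part of the original notes: the class names (`Empty`, `Epsilon`, `Char`, `Union`, `Concat`, `Star`) and the helper `words` are hypothetical, and the three compound constructors are an assumption based on the regular operations named in Section 2 (union, concatenation and Kleene star). `words(r, n)` enumerates every word of L(r) of length at most n, which makes the effect of the precedence rule visible.

```python
from dataclasses import dataclass


class Regex:
    """Base class for regular expressions over some alphabet."""


@dataclass(frozen=True)
class Empty(Regex):      # ∅, the empty language
    pass


@dataclass(frozen=True)
class Epsilon(Regex):    # ε, the language containing only the empty word
    pass


@dataclass(frozen=True)
class Char(Regex):       # a single character a ∈ Σ
    c: str


@dataclass(frozen=True)
class Union(Regex):      # R1 ∪ R2 (assumed constructor)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Concat(Regex):     # R1 ◦ R2 (assumed constructor)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # R∗ (assumed constructor)
    inner: Regex


def words(r: Regex, n: int) -> set[str]:
    """Return all words of L(r) whose length is at most n (n >= 0)."""
    if isinstance(r, Empty):
        return set()
    if isinstance(r, Epsilon):
        return {""}
    if isinstance(r, Char):
        return {r.c} if n >= 1 else set()
    if isinstance(r, Union):
        return words(r.left, n) | words(r.right, n)
    if isinstance(r, Concat):
        return {u + v
                for u in words(r.left, n)
                for v in words(r.right, n)
                if len(u + v) <= n}
    if isinstance(r, Star):
        # Repeatedly append non-empty words of the inner language
        # until no new word of length <= n appears.
        base = words(r.inner, n) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= n} - result
            result |= frontier
        return result
    raise TypeError(f"not a regular expression: {r!r}")


if __name__ == "__main__":
    a, b = Char("a"), Char("b")
    # a ∪ b∗ is read as a ∪ (b∗), not as (a ∪ b)∗.
    print(sorted(words(Union(a, Star(b)), 2)))
    print(sorted(words(Star(Union(a, b)), 2)))
```

The two printed sets differ: a ∪ (b∗) contains a single a or any run of b's, while (a ∪ b)∗ contains every word over {a, b}.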

We also define the following shorthand notation:

  • If A = {a₁, a₂, ..., aₙ} ⊂ Σ ∪ {ε} is a set of characters from Σ or the symbol ε, then A is a shorthand for a₁ ∪ a₂ ∪ · · · ∪ aₙ, the regular expression denoting the language L(A) = {a₁, a₂, ..., aₙ} ⊂ Σ∗.

Here are some more relevant examples of regular expressions over the ASCII alphabet. In the following, let L := {A, ..., Z, a, ..., z} be the set of letters, and D := {0, ..., 9} the set of decimal digits.
  • DD∗ describes all words starting with a digit, followed by any number of digits. This is the set of all non-negative integers in decimal notation (leading zeros are allowed).
  • {+, -, ε}DD∗ describes the language of all integer constants with an optional sign.
  • {+, -, ε}(DD∗ ∪ DD∗.D∗ ∪ D∗.DD∗)(({E, e}{+, -, ε}DD∗) ∪ ε) describes the language of all floating-point constants with an optional sign and an optional exponent part, similar to those recognized by the C programming language.

  • (L ∪ _)(L ∪ D ∪ _)∗ describes the language of all valid identifiers in the C programming language (not taking reserved words into account).
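These ASCII examples translate almost symbol for symbol into the pattern syntax of practical tools. The sketch below is one possible rendering in Python's `re` module and is not part of the original notes; the constant names (`UNSIGNED_INT`, `SIGNED_INT`, `FLOAT`, `IDENT`) and the helper `matches` are hypothetical. It assumes that the shorthand {+, -, ε} corresponds to the optional character class `[+-]?`, and that the underscore is the extra character allowed in identifiers, as in the C rules for identifiers.

```python
import re

D = r"[0-9]"        # the set D of decimal digits
L = r"[A-Za-z]"     # the set L of letters
SIGN = r"[+-]?"     # {+, -, ε}: an optional sign

UNSIGNED_INT = rf"{D}{D}*"
SIGNED_INT = rf"{SIGN}{D}{D}*"
# Mantissa: digits only, digits with trailing point, or point with
# trailing digits; followed by an optional exponent part.
FLOAT = rf"{SIGN}({D}{D}*|{D}{D}*\.{D}*|{D}*\.{D}{D}*)([Ee]{SIGN}{D}{D}*)?"
IDENT = rf"({L}|_)({L}|{D}|_)*"


def matches(pattern: str, word: str) -> bool:
    """Return True if the whole word is in the language of pattern."""
    return re.fullmatch(pattern, word) is not None


if __name__ == "__main__":
    for word in ["42", "+007", "3.14", ".5E-3", "12.", ".", "1e", "_x1", "1x"]:
        print(f"{word!r:>8}  int={matches(SIGNED_INT, word)}  "
              f"float={matches(FLOAT, word)}  ident={matches(IDENT, word)}")
```

Writing `{D}{D}*` instead of the shorter `[0-9]+` is deliberate: it mirrors DD∗ from the notes, which use only union, concatenation and Kleene star.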

4 Equivalence of Regular Expressions and Finite State Machines

Earlier we claimed that the class of languages that can be described by regular expressions is exactly the class of regular languages. We are now going to prove this statement in two steps. First, we will show that the language L(R) generated by any regular expression R is accepted by some NFA M. Second, we will show that the language L(M) accepted by any automaton M is generated by some regular expression R.

5 Construction of Automata from Regular Expressions

We will prove by structural induction that for every regular expression there exists an automaton accepting the language it generates. This means we will follow the recursive definition of regular expressions: we first construct automata accepting the languages generated by the simplest regular expressions, and then show how to combine those automata to accept the languages generated by more complex ones. For all the following constructions, we will assume that all regular expressions are over some alphabet Σ.

Figure 1: An NFA M accepting the empty language, L(M) = ∅.

Figure 2: An NFA M accepting the language consisting only of the empty word, L(M) = {ε}.

5.1 Case 1: R = ∅

If R = ∅, then L(R) = ∅. The empty language is accepted by the NFA M∅ := ({q₀}, Σ, δ, q₀, ∅), where δ(q₀, a) = ∅ for all a ∈ Σ. Since M∅ has no accepting states, it accepts no word at all. A transition diagram for this automaton is shown in Figure 1.

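5.2 Case 2: R = ε

If R = ε, then L(R) = {ε}. The language consisting only of the empty word is accepted by the NFA Mε := ({q₀}, Σ, δ, q₀, {q₀}), where δ(q₀, a) = ∅ for all a ∈ Σ. The start state q₀ is accepting, so Mε accepts ε; reading any character leads to the empty set of states, so no other word is accepted. A transition diagram for this automaton is shown in Figure 2.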

5.3 Case 3: R = a

If R = a for some character a ∈ Σ, then L(R) = {a}. The language consisting only of the word a is accepted by the NFA Ma := ({q₀, q₁}, Σ, δ, q₀, {q₁}), where for all q ∈ {q₀, q₁} and x ∈ Σ:

  δ(q, x) = {q₁} if q = q₀ and x = a,
  δ(q, x) = ∅ otherwise.
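The three base-case automata are small enough to write down and run directly. The following Python sketch is an illustration under stated assumptions, not the construction as given in the notes: the `NFA` tuple type, the builder names (`nfa_empty`, `nfa_epsilon`, `nfa_char`) and the simulator `accepts` are hypothetical, states are represented by the strings "q0" and "q1", and the simulator handles only automata without ε-transitions, which is sufficient for these three cases.

```python
from typing import Callable, NamedTuple


class NFA(NamedTuple):
    """An NFA as a 5-tuple (Q, Σ, δ, q0, F)."""
    states: frozenset
    alphabet: frozenset
    delta: Callable[[str, str], frozenset]
    start: str
    accepting: frozenset


def nfa_empty(sigma) -> NFA:
    """Case 1: one state, no accepting states, no transitions."""
    return NFA(frozenset({"q0"}), frozenset(sigma),
               lambda q, x: frozenset(), "q0", frozenset())


def nfa_epsilon(sigma) -> NFA:
    """Case 2: one accepting start state, no transitions."""
    return NFA(frozenset({"q0"}), frozenset(sigma),
               lambda q, x: frozenset(), "q0", frozenset({"q0"}))


def nfa_char(sigma, a: str) -> NFA:
    """Case 3: a single transition from q0 to q1 on character a."""
    def delta(q: str, x: str) -> frozenset:
        return frozenset({"q1"}) if q == "q0" and x == a else frozenset()
    return NFA(frozenset({"q0", "q1"}), frozenset(sigma),
               delta, "q0", frozenset({"q1"}))


def accepts(m: NFA, word: str) -> bool:
    """Simulate m on word by tracking the set of reachable states."""
    current = {m.start}
    for x in word:
        current = {p for q in current for p in m.delta(q, x)}
    # The word is accepted if some reachable state is accepting.
    return bool(current & m.accepting)


if __name__ == "__main__":
    sigma = "ab"
    for word in ["", "a", "b", "aa"]:
        print(f"{word!r:>5}  M∅={accepts(nfa_empty(sigma), word)}  "
              f"Mε={accepts(nfa_epsilon(sigma), word)}  "
              f"Ma={accepts(nfa_char(sigma, 'a'), word)}")
```

Tracking the set of currently reachable states is one common way to simulate an NFA; once that set becomes empty, it stays empty, so the word is rejected.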