String Matching-Algorithm Design and Analysis for Strings-Lecture Slides, Slides of Design and Analysis of Algorithms

This lecture is part of the lecture series for the Design and Analysis of Algorithms course. The course was taught by Dr. Bhaskar Sanyal at Maulana Azad National Institute of Technology. It includes: String, Matching, Pattern, Exact, Searching, Keywords, Database, Substring, Subsequence, Brute-Force, Algorithm

Typology: Slides

2011/2012

Uploaded on 07/11/2012

dharmadaas

String Matching


String Matching Problem

• Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
• Example: T = “AGCTTGA”, P = “GCT”
• Applications:
    • Searching keywords in a file
    • Search engines (like Google and Openfind)
    • Database searching (GenBank)


A Brute-Force Algorithm

Align P with each position of T in turn and compare character by character.

Time: O(mn), where m = |P| and n = |T|.
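The slide's own listing is not preserved in this text, so the following is only a minimal Python sketch of a brute-force matcher. The function name `brute_force_match` and the choice to report 1-based positions are assumptions made for illustration.

```python
def brute_force_match(text: str, pattern: str) -> list[int]:
    """Return every 1-based position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    # Try each of the n - m + 1 possible alignments of the pattern.
    for start in range(n - m + 1):
        # Each alignment costs up to m character comparisons, hence O(mn).
        if text[start:start + m] == pattern:
            positions.append(start + 1)  # 1-based, as in the slide notation
    return positions


print(brute_force_match("AGCTTGA", "GCT"))  # [2]
```

On the example from the problem statement, P = “GCT” is found once in T = “AGCTTGA”, starting at position 2.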


Two-phase Algorithms

• Phase 1: Generate an array that indicates how to move the pattern.
• Phase 2: Make use of the array to move and match the string.
• KMP algorithm:
    • Proposed by Knuth, Morris and Pratt in 1977.
• Boyer-Moore algorithm:
    • Proposed by Boyer and Moore in 1977.


Second Case for KMP Algorithm

• The first symbol of P appears in P again.
• $T_7 \neq P_7$ in (a). We have to slide to $T_6$, since $P_6 = P_1 = T_6$.


Third Case for KMP Algorithm

• The prefix of P appears in P again.
• $T_8 \neq P_8$ in (a). We have to slide to $T_6$, since $P_{6,7} = P_{1,2} = T_{6,7}$.


Definition of the Prefix Function

$f(j) = k$, where $k$ is the largest $k < j$ such that $P_{1,k} = P_{j-k+1,j}$

$f(j) = 0$ if no such $k$ exists
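The definition can be checked directly, if inefficiently, by trying every candidate k. The Python sketch below does that. The function name `prefix_function_naive` is illustrative, and the pattern “ATCACATCATCA” is an assumption: the slides' pattern is not preserved in this text, and this string was chosen because it reproduces the values quoted on the later slides, such as f(4) = 1, f(9) = 4 and f(10) = 2.

```python
def prefix_function_naive(pattern: str) -> list[int]:
    """f[j] for j = 1..m, straight from the definition (f[0] is unused)."""
    m = len(pattern)
    f = [0] * (m + 1)
    for j in range(1, m + 1):
        # Try the largest candidate k first; k must stay below j.
        for k in range(j - 1, 0, -1):
            # P_{1,k} is pattern[:k]; P_{j-k+1,j} is pattern[j-k:j].
            if pattern[:k] == pattern[j - k:j]:
                f[j] = k
                break
    return f


f = prefix_function_naive("ATCACATCATCA")  # assumed pattern, not from the slides
print(f[1:])  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
```

This version takes cubic time in the worst case. It is meant only as a reference against which a faster computation can be compared.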


Calculation of the Prefix Function

To determine $f(5)$:

$f(4) = 1$, thus $P_4 = P_1$.

If $P_5 = P_2$, then we get $f(5) = f(4) + 1$;

If $P_5 \neq P_2$, then we check if $P_5 = P_1$;

Because $P_5 \neq P_1$, we get $f(5) = 0$.


Calculation of the Prefix Function

To determine $f(10)$:

$f(4) = 1$ because $P_4 = P_{f(4-1)+1} = P_1 = \text{"A"}$

$f(9) = 4$ because $P_9 = P_{f(9-1)+1} = P_4$

$P_{10} \neq P_{f(10-1)+1} = P_5$, since "T" $\neq$ "C"

$f(10) = 2$ because $P_{10} = P_{f^2(10-1)+1} = P_{f(f(10-1))+1} = P_{f(4)+1} = P_2 = \text{"T"}$
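The worked cases for f(5) and f(10) follow one pattern, which can be written as a recurrence. The form below is inferred from those steps rather than quoted from the slides. Here $f^{k}$ denotes $f$ applied $k$ times, with $f^{0}(x) = x$.

```latex
f(j) =
\begin{cases}
  f^{k}(j-1) + 1 & \text{if } k \ge 1 \text{ is the smallest integer with }
                   f^{k-1}(j-1) \ge 1 \text{ and } P_j = P_{f^{k}(j-1)+1}, \\
  0              & \text{if no such } k \text{ exists.}
\end{cases}
```

For $j = 10$ the first candidate, $k = 1$, fails because $P_{10} \neq P_5$. The second, $k = 2$, succeeds because $P_{10} = P_2$, which gives $f(10) = f^{2}(9) + 1 = f(4) + 1 = 2$.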


Computing the Failure Function

• The failure function can be represented by an array and can be computed in O(m) time
• The construction is similar to the KMP algorithm itself
• At each iteration of the while-loop, either
    • i increases by one, or
    • the shift amount i − j increases by at least one (observe that F(j − 1) < j)
• Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
    F[0] ← 0
    i ← 1
    j ← 0
    while i < m
        if P[i] = P[j]
            {we have matched j + 1 chars}
            F[i] ← j + 1
            i ← i + 1
            j ← j + 1
        else if j > 0 then
            {use failure function to shift P}
            j ← F[j − 1]
        else
            F[i] ← 0 {no match}
            i ← i + 1
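A direct Python transcription of the pseudocode above might look as follows. The function name `failure_function` is chosen here for illustration. Like the pseudocode, it indexes from 0, so `F[i]` corresponds to f(i + 1) in the 1-based prefix-function definition.

```python
def failure_function(pattern: str) -> list[int]:
    """0-indexed failure function: F[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    m = len(pattern)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if pattern[i] == pattern[j]:
            # We have matched j + 1 characters.
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            # Use the failure function to shift the pattern.
            j = F[j - 1]
        else:
            F[i] = 0  # no match
            i += 1
    return F


print(failure_function("abacab"))  # [0, 0, 1, 0, 1, 2]
```

Each pass through the loop either advances `i` or increases `i - j`, which is the argument behind the bound of 2m iterations.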


An Example for KMP Algorithm

(Figure omitted: Phase 1 and Phase 2 on an example. Surviving annotations: $f(4-1) + 1 = f(3) + 1 = 0 + 1 = 1$; $f(12) + 1 = 4 + 1 = 5$; matched.)
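The two phases can be sketched together in Python. This is a minimal illustration and not the slides' own listing: the function name `kmp_search`, the 0-indexed array `F`, the 1-based reported positions, and the convention that an empty pattern yields no matches are all assumptions made here.

```python
def kmp_search(text: str, pattern: str) -> list[int]:
    """Return every 1-based position where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern reports no matches

    # Phase 1: build the failure array from the pattern alone.
    F = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = F[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        F[i] = j

    # Phase 2: scan the text once; the text index never moves backwards.
    positions = []
    j = 0  # number of pattern characters currently matched
    for i in range(n):
        while j > 0 and text[i] != pattern[j]:
            j = F[j - 1]  # slide the pattern using the array
        if text[i] == pattern[j]:
            j += 1
        if j == m:
            positions.append(i - m + 2)  # 1-based start of this match
            j = F[j - 1]  # continue, allowing overlapping matches
    return positions


print(kmp_search("AGCTTGA", "GCT"))  # [2]
```

Phase 1 touches only the pattern and Phase 2 only the text.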


Time Complexity of KMP Algorithm

• Time complexity: O(m + n) (analysis omitted)
    • O(m) for computing function f
    • O(n) for searching P

Suffix Trees

• A suffix tree for S = “ATCACATCATCA”


Properties of a Suffix Tree

• Each tree edge is labeled by a substring of S.
• Each internal node has at least 2 children.
• Each $S_{(i)}$ (the suffix of S starting at position i) has its corresponding labeled path from root to a leaf, for $1 \le i \le n$.
• There are n leaves.
• No edges branching out from the same internal node can start with the same character.
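These properties can be checked on a small example with a naive quadratic-time construction. The Python sketch below is not the construction used in the slides. The names `Node`, `build_suffix_tree` and `check_properties` are illustrative, and the appended terminator "$" is an assumption, a common convention that guarantees every suffix ends at a leaf. With the terminator counted, the example string has 13 suffixes and so 13 leaves.

```python
class Node:
    def __init__(self):
        self.children = {}        # first char of edge label -> (label, child)
        self.suffix_start = None  # 1-based suffix index, set on leaves only


def build_suffix_tree(s: str) -> Node:
    """Naive O(n^2) construction: insert each suffix, splitting edges."""
    assert "$" not in s
    s += "$"  # assumed terminator, so that no suffix is a prefix of another
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node()
                leaf.suffix_start = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the remainder.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched: descend
                continue
            # Mismatch inside the edge: split it with a new internal node.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def check_properties(root: Node, n: int) -> bool:
    """Check: internal nodes have >= 2 children, and there are n leaves."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.children:
            leaves.append(node.suffix_start)
            continue
        if node is not root and len(node.children) < 2:
            return False
        for first, (label, child) in node.children.items():
            if label[0] != first:  # dict keys keep first characters distinct
                return False
            stack.append(child)
    return sorted(leaves) == list(range(1, n + 1))


root = build_suffix_tree("ATCACATCATCA")
print(check_properties(root, 13))  # True
```

Storing outgoing edges in a dictionary keyed by their first character enforces the last property by construction.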