Suffix Trees and String Matching, Slides of Computer Science

These slides cover suffix trees and their application to string matching. A suffix tree is a data structure for efficiently determining whether a given pattern is a substring of a larger string. The slides also cover related structures, including suffix arrays and enhanced suffix arrays, and their use in substring matching. They present the Knuth-Morris-Pratt algorithm as an alternative string matching method and compare it with suffix trees. They also give examples of suffix trees for different strings and explain how to construct them.

Typology: Slides

2012/2013

Uploaded on 03/19/2013

dharmakeerti

Suffix Trees

  • Suffix trees
  • Linearized suffix trees
  • Virtual suffix trees
  • Suffix arrays
  • Enhanced suffix arrays
  • Suffix cactus, suffix vectors, …

Suffix Trees

  • String … any sequence of characters.
  • Substring of string S … string composed of characters i through j, i <= j, of S.
      ➢ S = cater => ate is a substring.
      ➢ car is not a substring.
      ➢ Empty string is a substring of S.

Subsequence

  • Subsequence of string S … string composed of characters i_1 < i_2 < … < i_k of S.
      ➢ S = cater => ate is a subsequence.
      ➢ car is a subsequence.
      ➢ The empty string is a subsequence.
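The two definitions can be compared side by side with a minimal Python sketch. The function names `is_substring` and `is_subsequence` are illustrative and not from the slides; both treat the empty string as a match, as the slides do.

```python
def is_substring(s: str, p: str) -> bool:
    """True if p equals characters i through j of s for some i <= j, or p is empty."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(s: str, p: str) -> bool:
    """True if the characters of p occur in s at positions i_1 < i_2 < ... < i_k."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to and including the first match,
    # so later characters of p can only match later positions of s.
    return all(ch in remaining for ch in p)
```

With S = cater, `ate` passes both tests, while `car` is a subsequence but not a substring.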

String/Pattern Matching

  • You are given a source string S.
  • Answer queries of the form: is the string p_i a substring of S?
  • Knuth-Morris-Pratt (KMP) string matching.
      ➢ O(|S| + |p_i|) time per query.
      ➢ O(n|S| + Σ_i |p_i|) time for n queries.
  • Suffix tree solution.
      ➢ O(|S| + Σ_i |p_i|) time for n queries.
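For reference, a compact sketch of a KMP query in Python follows. The names `failure_function` and `kmp_contains` are illustrative. Each call scans all of S again, which is where the n|S| term for n queries comes from.

```python
def failure_function(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Answer one query 'is p a substring of s?' in O(|s| + |p|) time."""
    if not p:
        return True  # the empty string is a substring of every string
    fail = failure_function(p)
    k = 0  # number of characters of p matched so far
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
            if k == len(p):
                return True
    return False
```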

String Matching & Suffixes

  • p_i is a substring of S iff p_i is a prefix of some suffix of S.
  • Nonempty suffixes of S = sleeper are:
      ➢ sleeper
      ➢ leeper
      ➢ eeper
      ➢ eper
      ➢ per, er, and r.
  • Which of these are substrings of S?
      ➢ leep, eepe, pe, leap, peel
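The prefix-of-a-suffix characterization can be checked directly, if inefficiently, with a short Python sketch. The helper names `suffixes` and `is_substring_via_suffixes` are illustrative only.

```python
def suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(s: str, p: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))
```

For S = sleeper this reports leep, eepe, and pe as substrings, and leap and peel as non-substrings.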

Last Character Of S Repeats

  • When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
  • S = creeper
      ➢ creeper, reeper, eeper, eper, per, er, r
  • When the last character of S appears more than once in S, use an end of string character # to overcome this problem.
  • S = creeper#
      ➢ creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
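A small Python sketch makes the problem and the fix concrete. The helper name `prefix_suffix_pairs` is illustrative. It lists every pair (a, b) of suffixes in which a is a proper prefix of b.

```python
def prefix_suffix_pairs(s: str) -> list[tuple[str, str]]:
    """Pairs (a, b) of nonempty suffixes of s where a is a proper prefix of b."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]
```

For creeper the only such pair is (r, reeper). For creeper# there is none, because # occurs only at the end of the string.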

Suffix Tree For S = abbbabbbb#

[Figure: suffix tree for S = abbbabbbb#, with edges labeled by substrings such as abbb, b, abbbb#, b#, and #.]
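One way to build such a tree is to insert the suffixes one at a time into a compressed trie, as in the Python sketch below. This is a naive quadratic-time construction, not the linear-time algorithm, and the names `Node`, `build_suffix_tree`, and `leaf_starts` are illustrative. It assumes the last character of the string is a unique end-of-string marker such as #.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}  # first char of edge label -> (edge label, child Node)
        self.start = start  # 1-based suffix start for element (leaf) nodes, else None


def build_suffix_tree(s: str) -> Node:
    """Naive O(|s|^2) construction by inserting every suffix of s."""
    assert s and s.count(s[-1]) == 1, "s must end with a unique end-of-string character"
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            first = rest[0]
            if first not in node.children:
                node.children[first] = (rest, Node(start=i + 1))
                break
            label, child = node.children[first]
            # Length of the common prefix of the edge label and the rest of the suffix.
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched, keep descending
                continue
            # Mismatch inside the edge: split it with a new branch node.
            branch = Node()
            branch.children[label[k]] = (label[k:], child)
            branch.children[rest[k]] = (rest[k:], Node(start=i + 1))
            node.children[first] = (label[:k], branch)
            break
    return root


def leaf_starts(node: Node) -> list[int]:
    """Suffix start positions of the element nodes, visiting children in sorted order."""
    if not node.children:
        return [node.start]
    starts = []
    for first in sorted(node.children):
        starts += leaf_starts(node.children[first][1])
    return starts
```

For S = abbbabbbb# the root gets three outgoing edges labeled #, abbb, and b, and the tree has one element node per suffix.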

Suffix Array

  • Array that contains the start position of suffixes in lexicographic order.
  • abbbabbbb#
      ➢ Assume # < a < b
      ➢ # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
      ➢ SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
      ➢ LCP = length of longest common prefix between adjacent entries of SA.
      ➢ LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
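The SA and LCP values above can be reproduced with a naive Python sketch. It sorts the suffixes directly, so it is not a linear-time construction, and the function names `suffix_array` and `lcp_array` are illustrative. Positions are 1-based to match the slide, and the ordering # < a < b holds for Python's default character comparison.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Longest-common-prefix lengths of adjacent SA entries (one entry fewer than sa)."""
    def lcp(a: str, b: str) -> int:
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k

    return [lcp(s[sa[r] - 1:], s[sa[r + 1] - 1:]) for r in range(len(sa) - 1)]
```

The slide's final `-` entry marks the last suffix, which has no successor. The sketch simply omits it.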

Suffix Array

  • Less space than suffix tree
  • Linear time construction
  • Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity.
      ➢ Substring matching → binary search for p using SA.
      ➢ O(|p| log |S|).
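The binary search can be sketched in Python as follows. The function name `sa_contains` is illustrative, and `sa` is assumed to hold 1-based start positions as in the slides. Each of the O(log |S|) probes compares at most |p| characters.

```python
def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search the suffix array for a suffix that has p as a prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1  # convert the 1-based position to a Python index
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first suffix whose first |p| characters are >= p, if any.
    if lo == len(sa):
        return False
    start = sa[lo] - 1
    return s[start:start + len(p)] == p
```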

O(|p_i|) Time Substring Matching

[Figure: suffix tree for S = abbbabbbb#, drawn above the source string, with the example query patterns babb, abbba, and baba.]
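The search walks down from the root, consuming one character of p_i per step. The Python sketch below shows the idea on an uncompressed suffix trie built from nested dictionaries. That is a simplification of the compressed suffix tree in the slides and needs O(|S|^2) space, but the query loop still takes O(|p_i|) steps. The names `build_suffix_trie` and `trie_contains` are illustrative.

```python
def build_suffix_trie(s: str) -> dict:
    """Uncompressed trie of all suffixes of s (simplification: O(|s|^2) nodes)."""
    root: dict = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def trie_contains(root: dict, p: str) -> bool:
    """Follow p from the root; p is a substring iff the walk never falls off the trie."""
    node = root
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

For S = abbbabbbb#, the patterns babb and abbba are found, while baba is not.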

Find All Occurrences Of p_i

  • Search suffix tree for p_i.
  • Suppose the search for p_i is successful.
  • When search terminates at an element node, p_i appears exactly once in the source string S.

Search Terminates At Branch Node

[Figure: suffix tree for S = abbbabbbb#, drawn above the source string, with the example pattern ab, whose search terminates at a branch node.]

Find All Occurrences Of p_i

  • To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment suffix tree:
      ➢ Link all element nodes into a chain in inorder.
      ➢ Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
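The augmentation can be sketched in Python under two simplifying assumptions: the tree is an uncompressed trie rather than a compressed suffix tree, and the chain of element nodes is stored as a list of suffix start positions in lexicographic order, which is the inorder sequence of the element nodes. The names `Node`, `build_augmented_trie`, and `find_all` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.left = None   # index in the chain of the leftmost element node below
        self.right = None  # index in the chain of the rightmost element node below


def build_augmented_trie(s: str):
    """Return (root, chain); chain holds 1-based suffix starts in inorder."""
    chain = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    root = Node()
    root.left, root.right = 0, len(chain) - 1
    # Suffixes are inserted in chain order, so the first and last ranks that pass
    # through a node are exactly its leftmost and rightmost element nodes.
    for rank, start in enumerate(chain):
        node = root
        for ch in s[start - 1:]:
            node = node.children.setdefault(ch, Node())
            if node.left is None:
                node.left = rank
            node.right = rank
    return root, chain


def find_all(root: Node, chain: list[int], p: str) -> list[int]:
    """Start positions of all occurrences of p: O(|p|) walk plus one chain slice."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return chain[node.left:node.right + 1]
```

For S = abbbabbbb#, the search for ab yields the chain segment [1, 5], the two positions where ab starts.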

Augmented Suffix Tree

[Figure: augmented suffix tree for S = abbbabbbb#, drawn above the source string, with the example pattern b.]

Longest Repeating Substring

  • Find longest substring of S that occurs at least m > 1 times in S.
  • Label branch nodes with number of element nodes in subtree.
  • Find branch node with label >= m and max char# field.
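The labeling idea can be tried out with a Python sketch on an uncompressed suffix trie, a simplification of the suffix tree in the slides. Each node's `count` is the number of suffixes passing through it, which equals the number of element nodes below it. Path length stands in for the char# field, on the assumption that char# grows with the length of the string spelled out from the root. The function name `longest_repeated_substring` is illustrative.

```python
def longest_repeated_substring(s: str, m: int = 2) -> str:
    """Longest substring of s that occurs at least m times ('' if there is none)."""
    root = {"count": 0, "children": {}}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1  # one more suffix (element node) below this node
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["children"].items():
            if child["count"] >= m:  # only descend while the label stays >= m
                stack.append((child, path + ch))
    return best
```

For S = abbbabbbb# and m = 2 the result is abbb, which starts at positions 1 and 5.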

Longest Common Substring

  • Let $ be a new symbol.
  • Construct the suffix tree for the string U = S$T#.
      ➢ U = carport$airports#
      ➢ No repeating substring includes $.
      ➢ Find longest repeating substring that is both to left and right of $.
  • Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
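The same idea can be sketched in Python on an uncompressed suffix trie of U = S$T#, a simplification of the suffix tree in the slides. Each node records whether some suffix beginning in S and some suffix beginning in T pass through it, and the deepest node with both marks spells the answer. The sketch assumes that $ and # occur in neither S nor T, and the function name `longest_common_substring` is illustrative.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest string that is a substring of both s and t ('' if none)."""
    u = s + "$" + t + "#"

    def new_node() -> dict:
        return {"in_s": False, "in_t": False, "children": {}}

    root = new_node()
    for i in range(len(u)):
        from_s = i < len(s)               # suffix begins in S (left of $)
        from_t = len(s) < i < len(u) - 1  # suffix begins in T (right of $, before #)
        node = root
        for ch in u[i:]:
            node = node["children"].setdefault(ch, new_node())
            node["in_s"] |= from_s
            node["in_t"] |= from_t
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["children"].items():
            # Paths through $ come from a single suffix, so they never carry both marks.
            if child["in_s"] and child["in_t"]:
                stack.append((child, path + ch))
    return best
```

For S = carport and T = airports the result is rport.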