Suffix Trees and String Matching, Slides of Computer Science

These slides cover suffix trees and their application to string matching. A suffix tree is a data structure for efficiently determining whether a given pattern is a substring of a larger string. The slides also cover related structures, including suffix arrays and enhanced suffix arrays, and their use in substring matching. They compare the Knuth-Morris-Pratt algorithm, an alternative string matching method, with suffix trees, show example suffix trees, and discuss their construction.

Typology: Slides

2012/2013

Uploaded on 03/19/2013

dharmakeerti

Suffix Trees

Suffix trees
  • Linearized suffix trees
  • Virtual suffix trees
Suffix arrays
  • Enhanced suffix arrays
  • Suffix cactus, suffix vectors, …

Suffix Trees

• String: any sequence of characters.
• Substring of string S: the string composed of characters i through j of S, where i <= j.
  ◦ S = cater => ate is a substring.
  ◦ car is not a substring.
  ◦ The empty string is a substring of S.

Subsequence

• Subsequence of string S: the string composed of characters i1 < i2 < … < ik of S.
  ◦ S = cater => ate is a subsequence.
  ◦ car is a subsequence.
  ◦ The empty string is a subsequence.

String/Pattern Matching

• You are given a source string S.
• Answer queries of the form: is the string p_i a substring of S?
• Knuth-Morris-Pratt (KMP) string matching.
  ◦ O(|S| + |p_i|) time per query.
  ◦ O(n|S| + Σ_i |p_i|) time for n queries.
• Suffix tree solution.
  ◦ O(|S| + Σ_i |p_i|) time for n queries.

Suffix Tree For S = abbbabbbb#

[Figure: suffix tree for S = abbbabbbb#, with character positions 1 to 10 marked under the string]

Suffix Tree Construction

• See Web write up for algorithm.
• Time complexity
  ◦ |S| = n, alphabet size = r.
  ◦ O(nr) using array nodes.
  ◦ This is O(n) when r is a constant (r <= c for some constant c).
  ◦ O(n) expected time using a hash table.
  ◦ An O(n) time algorithm for large r is given in the reference cited in the Web write up.

Suffix Array

• Array that contains the start positions of the suffixes in lexicographic order.
• S = abbbabbbb#
  ◦ Assume # < a < b
  ◦ # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
  ◦ SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
  ◦ LCP = length of longest common prefix between adjacent entries of SA.
  ◦ LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]

Suffix Array

• Less space than suffix tree
• Linear time construction
• Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity.
  ◦ Substring matching: binary search for p using SA.
  ◦ O(|p| log |S|).

Search Terminates At Branch Node

[Figure: suffix tree for S = abbbabbbb#, annotated with the string ab]

Find All Occurrences Of p_i

• To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
  ◦ Link all element nodes into a chain in inorder.
  ◦ Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.

Augmented Suffix Tree

[Figure: augmented suffix tree for S = abbbabbbb#, annotated with the string b]

Longest Repeating Substring

• Find the longest substring of S that occurs at least m times in S, where m > 1.
• Label branch nodes with the number of element nodes in their subtree.
• Find the branch node with label >= m and max char# field.

Longest Repeating Substring

[Figure: suffix tree for S = abbbabbbb# with branch nodes labeled 2, 3, 5, 7, 10; the cases m = 2 and m = 5 are marked]

Longest Common Substring

• Given two strings S and T.
• Find the longest common substring.
• S = carport, T = airports
  ◦ Longest common substring = rport
  ◦ Longest common subsequence = arport
• Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
• Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
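
The suffix array slides for S = abbbabbbb# can be checked with a short Python sketch. This is an illustration under stated assumptions, not the construction the slides refer to: it builds SA by sorting the suffixes directly, which is far from linear time, and it finds the longest repeating substring from SA and LCP rather than from a suffix tree. The function names (suffix_array, lcp_array, is_substring, longest_repeated) are hypothetical, and the ordering # < a < b is assumed to come from Python's default character comparison.

```python
def suffix_array(s):
    """Return 1-based start positions of the suffixes of s in lexicographic order.

    Naive construction: sort the suffixes themselves (not linear time).
    """
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s, sa):
    """lcp[k] = length of the longest common prefix of suffixes sa[k] and sa[k + 1]."""
    def common_prefix(i, j):
        a, b = s[i - 1:], s[j - 1:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        return k
    return [common_prefix(i, j) for i, j in zip(sa, sa[1:])]


def is_substring(s, sa, p):
    """Binary search over SA for the first suffix whose |p|-prefix is >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and s.startswith(p, sa[lo] - 1)


def longest_repeated(s, sa, lcp, m):
    """Longest substring occurring at least m >= 2 times.

    m suffixes adjacent in SA share a prefix whose length is the minimum
    of the m - 1 LCP values between them.
    """
    best_len, best_pos = 0, 1
    w = m - 1
    for k in range(len(lcp) - w + 1):
        length = min(lcp[k:k + w])
        if length > best_len:
            best_len, best_pos = length, sa[k]
    return s[best_pos - 1:best_pos - 1 + best_len]


s = "abbbabbbb#"
sa = suffix_array(s)
lcp = lcp_array(s, sa)
print(sa)                               # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(lcp)                              # [0, 4, 0, 1, 1, 2, 2, 3, 3]
print(is_substring(s, sa, "bab"))       # True
print(longest_repeated(s, sa, lcp, 2))  # abbb
print(longest_repeated(s, sa, lcp, 5))  # bb
```

The LCP list here has nine entries because it omits the trailing "-" placeholder shown on the slide. The answers for m = 2 and m = 5 correspond to the branch nodes with labels 2 and 5 in the Longest Repeating Substring figure.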