Suffix Trees and String Matching

Suffix Trees
Suffix trees
• Linearized suffix trees
• Virtual suffix trees
Suffix arrays
• Enhanced suffix arrays
• Suffix cactus, suffix vectors, …

Suffix Trees
• String … any sequence of characters.
• Substring of string S … string composed of characters i through j, i <= j, of S.
  S = cater => ate is a substring.
  car is not a substring.
  The empty string is a substring of S.

Subsequence
• Subsequence of string S … string composed of characters i1 < i2 < … < ik of S.
  S = cater => ate is a subsequence.
  car is a subsequence.
  The empty string is a subsequence.

String/Pattern Matching
• You are given a source string S.
• Answer queries of the form: is the string pi a substring of S?
• Knuth-Morris-Pratt (KMP) string matching.
  O(|S| + |pi|) time per query.
  O(n|S| + Σi |pi|) time for n queries.
• Suffix tree solution.
  O(|S| + Σi |pi|) time for n queries.

Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#. Edge labels: abbb, b, #, abbbb#, b#. Node labels: 1 2 3 4 5.]

Suffix Tree For S = abbbabbbb#
[Figure: the same tree, with positions 1 to 10 written under the characters of abbbabbbb#. Element node labels, in the order shown: 1 5 4 3 2 6 7 8 9 10. Node labels: 1 2 3 4 5.]

Suffix Tree For S = abbbabbbb#
[Figure: the same tree with element node labels 1 5 4 3 2 6 7 8 9 10 and further numeric node fields: 1 1 4 8 2 1 5 2 3 4.]

Suffix Tree Construction
• See Web write up for algorithm.
• Time complexity
  |S| = n, alphabet size = r.
  O(nr) using array nodes.
  This is O(n) for r a constant (or r <= c).
  O(n) expected time using a hash table.
  O(n) time algorithm for large r in reference cited in Web write up.

Suffix Array
• Array that contains the start position of suffixes in lexicographic order.
• abbbabbbb#
  Assume # < a < b
  # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
  SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
  LCP = length of longest common prefix between adjacent entries of SA.
  LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]

Suffix Array
• Less space than suffix tree
• Linear time construction
• Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity.
  Substring matching: binary search for p using SA. O(|p| log |S|).

Search Terminates At Branch Node
[Figure: suffix tree for S = abbbabbbb#, element nodes labeled 1 5 4 3 2 6 7 8 9 10. Query shown: ab.]

Find All Occurrences Of pi
• To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree:
  Link all element nodes into a chain in inorder.
  Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.

Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb#, element nodes labeled 1 5 4 3 2 6 7 8 9 10. Query shown: b.]

Longest Repeating Substring
• Find the longest substring of S that occurs at least m times in S, m > 1.
• Label branch nodes with the number of element nodes in their subtree.
• Find the branch node with label >= m and max char# field.

Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb# with branch node labels 2, 3, 5, 7, 10. Cases marked: m = 2 and m = 5.]

Longest Common Substring
• Given two strings S and T.
• Find the longest common substring.
• S = carport, T = airports
  Longest common substring = rport
  Longest common subsequence = arport
• Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
• Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
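
The SA and LCP values for abbbabbbb#, the binary-search substring query, and the two "longest" problems can be checked with a short Python sketch. It is an illustration under stated assumptions, not the construction from the Web write up: the suffix array is built by naive sorting rather than in linear time, and the repeated-substring and common-substring queries use SA and LCP instead of an augmented suffix tree. All function names are hypothetical, and the separator characters in longest_common_substring are assumed not to occur in either input.

```python
def suffix_array(s):
    # Naive construction (assumption: not the linear-time algorithm):
    # sort the 1-based start positions by the text of each suffix.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def common_prefix_len(s, i, j):
    # Longest common prefix of the suffixes starting at 1-based positions i and j.
    n = 0
    while max(i, j) + n <= len(s) and s[i + n - 1] == s[j + n - 1]:
        n += 1
    return n


def lcp_array(s, sa):
    # lcp[k] compares the suffixes at sa[k] and sa[k + 1];
    # the last SA entry has no successor, so there are len(sa) - 1 values.
    return [common_prefix_len(s, sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


def occurrences(s, sa, p):
    # Binary search for the block of suffixes that start with p.
    # Each of the O(log |S|) probes compares at most |p| characters.
    def first_index(past_matches):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid] - 1: sa[mid] - 1 + len(p)]
            if prefix < p or (past_matches and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sorted(sa[first_index(False):first_index(True)])


def longest_repeated(s, sa, lcp, m):
    # A substring occurring at least m times is a common prefix of m consecutive
    # SA entries, so its length is the minimum of m - 1 consecutive LCP values.
    best_len, best_pos = 0, 0
    for k in range(len(lcp) - (m - 1) + 1):
        width = min(lcp[k:k + m - 1])
        if width > best_len:
            best_len, best_pos = width, sa[k]
    return s[best_pos - 1: best_pos - 1 + best_len]


def longest_common_substring(s, t):
    # Assumption: "\x01" and "\x00" occur in neither s nor t.
    joined = s + "\x01" + t + "\x00"
    sa = suffix_array(joined)
    lcp = lcp_array(joined, sa)
    best_len, best_pos = 0, 0
    for k, width in enumerate(lcp):
        # Only adjacent suffixes that start in different strings count.
        if (sa[k] <= len(s)) != (sa[k + 1] <= len(s)) and width > best_len:
            best_len, best_pos = width, sa[k]
    return joined[best_pos - 1: best_pos - 1 + best_len]


S = "abbbabbbb#"
SA = suffix_array(S)
LCP = lcp_array(S, SA)
print(SA)                                 # [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
print(LCP)                                # [0, 4, 0, 1, 1, 2, 2, 3, 3]
print(occurrences(S, SA, "ab"))           # [1, 5]
print(longest_repeated(S, SA, LCP, 2))    # abbb
print(longest_common_substring("carport", "airports"))  # rport
```

Python's default character order already places # before a before b, which matches the ordering assumed for the suffix array example. The sketch returns nine LCP values because it omits the trailing "-" placeholder for the last SA entry.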