Suffix Trees: A Fast Data Structure for String Operations - Prof. Fawzi Philip Emad, Study notes of Computer Science

Suffix trees are a data structure for efficient string operations such as searching for a query string and finding repeated strings and palindromes. Once a base text of size n has been preprocessed, a query string of size k can be searched for in time proportional to k. A suffix tree is essentially a trie of all suffixes of the text: each node can have several children, and the next character determines which child to follow. The tree can be compressed to save space by eliminating internal nodes that have only one child and labeling each internal node with the common prefix of all strings beneath it.

Uploaded on 07/29/2009

Suffix Trees

Suffix trees

  • A data structure used to allow for very fast string operations
  • After preparing a base text of size n, can search for a query string of size k in O(k) time
  • Can also do other things, such as find repeated strings and palindromes
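
As a rough illustration of the search claim, here is a minimal sketch of an uncompressed trie holding every suffix of the text. The class name SuffixTrieSketch and the use of a HashMap for children are assumptions made for this example and are not taken from the notes. The lookup follows one child per query character, so it takes at most k steps for a query of size k.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the course code.
class SuffixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    // Preparation step: insert every suffix of text + "$", one character per edge.
    // This uncompressed build is quadratic in the text size; it is for illustration only.
    SuffixTrieSketch(String text) {
        String s = text + "$";
        for (int start = 0; start < s.length(); start++) {
            Node node = root;
            for (int i = start; i < s.length(); i++) {
                node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
            }
        }
    }

    // Follows one child per query character, so the loop body runs at most k times.
    boolean contains(String query) {
        Node node = root;
        for (int i = 0; i < query.length(); i++) {
            node = node.children.get(query.charAt(i));
            if (node == null) {
                return false; // no suffix continues with this character
            }
        }
        return true; // the query is a prefix of at least one suffix
    }
}
```

For the text abbab, contains("bba") would be true and contains("aa") false. Every substring of the text is a prefix of some suffix, which is why a prefix lookup over suffixes answers substring queries.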

Suffix tree for abbab$

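[Figure: the suffix tree for abbab$, with edges labeled by the characters a, b, and $. It stores the six suffixes abbab$, bbab$, bab$, ab$, b$, and $.]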

That $ character

  • String is appended with a unique character (typically $)
  • Ensures that no suffix is a prefix of any other suffix
  • Ensures that each suffix ends in a leaf
    • makes a number of the algorithms simpler
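
The effect of the terminator can be checked directly. The following sketch (TerminatorSketch is a hypothetical name, and the brute-force check is only for illustration) tests whether any suffix of a string is a prefix of another suffix. It assumes the terminator character does not occur elsewhere in the text.

```java
// Hypothetical illustration class; not part of the course code.
class TerminatorSketch {
    // Returns true if some suffix of s is a proper prefix of a different suffix of s.
    static boolean someSuffixIsPrefixOfAnother(String s) {
        for (int i = 0; i < s.length(); i++) {
            for (int j = 0; j < s.length(); j++) {
                // Suffixes starting at different positions have different lengths,
                // so startsWith here means "is a proper prefix of".
                if (i != j && s.substring(j).startsWith(s.substring(i))) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Without the terminator, the suffix ab of abbab is a prefix of the suffix abbab, so it would end at an internal node. With abbab$, the corresponding suffix ab$ is not a prefix of abbab$, and every suffix ends in its own leaf.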

Compressed Suffix trees

  • No internal node has only one child
  • Internal nodes are labeled with the prefix of all strings beneath them
  • From an internal node n, the edge labeled c leads to the subtree containing all strings that consist of the common prefix followed by c

Building Suffix trees

  • There are fancy algorithms that build a suffix tree for a string of n characters in only O(n) time
    • we won’t be using one of those
  • Simply insert suffixes one at a time, inserting new internal nodes as needed
    • not too bad in practice

Useful data

  • Each node needs to store:
    • common prefix (String)
    • pointers to children
      • use an array
  • Useful to store
    • total number of suffixes stored in this tree
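
The node layout above, combined with inserting suffixes one at a time, might be sketched as follows. This is one possible arrangement, not the course's reference solution. The class name NaiveSuffixTree, the 128-slot child array (which assumes ASCII text), the leaf representation, and the signature of lengthCommonPrefix are all assumptions made for this example. The text is also assumed not to contain '$'.

```java
// Hypothetical sketch of a compressed suffix tree built by naive insertion.
class NaiveSuffixTree {
    static final class Node {
        final String prefix;   // common prefix of every suffix stored beneath this node
        final Node[] children; // indexed by the next character; null for a leaf
        int count;             // total number of suffixes stored in this subtree

        Node(String prefix, boolean leaf) {
            this.prefix = prefix;
            this.children = leaf ? null : new Node[128]; // assumes ASCII input
            this.count = leaf ? 1 : 0;
        }
    }

    final Node root = new Node("", false);

    NaiveSuffixTree(String text) {
        String s = text + "$"; // terminator guarantees every suffix ends in a leaf
        for (int i = 0; i < s.length(); i++) {
            insert(s.substring(i));
        }
    }

    private void insert(String suffix) {
        Node node = root;
        while (true) {
            node.count++;
            char next = suffix.charAt(node.prefix.length());
            Node child = node.children[next];
            if (child == null) {
                node.children[next] = new Node(suffix, true); // new leaf
                return;
            }
            int k = lengthCommonPrefix(child.prefix, suffix);
            if (k < child.prefix.length()) {
                // The suffix diverges inside the child's label: add an internal node
                // labeled with the shared part, with the old child and a new leaf below it.
                Node mid = new Node(suffix.substring(0, k), false);
                mid.children[child.prefix.charAt(k)] = child;
                mid.children[suffix.charAt(k)] = new Node(suffix, true);
                mid.count = child.count + 1;
                node.children[next] = mid;
                return;
            }
            node = child; // the child's label is a prefix of the suffix: descend
        }
    }

    // Number of suffixes that start with query, read off the stored counts.
    int countStartingWith(String query) {
        Node node = root;
        while (node != null) {
            int k = lengthCommonPrefix(node.prefix, query);
            if (k == query.length()) return node.count; // everything below matches
            if (k < node.prefix.length() || node.children == null) return 0;
            node = node.children[query.charAt(k)];
        }
        return 0;
    }

    static int lengthCommonPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

For abbab, this sketch produces internal nodes labeled ab and b under the root, and countStartingWith("b") reads the stored count of the b node rather than walking its subtree. That is one reason the per-node count is useful to store.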

What will we be doing

  • Find the length of the maximum repeated String
  • Find all strings with a minimum length and a minimum frequency
    • can leave out Strings that are prefixes of other answers
  • Find all suffixes that start with a given String
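
To pin down what the first two tasks ask for, here is a brute-force reference that a tree-based solution could be checked against on small inputs. It is deliberately not a suffix tree and would not satisfy the assignment on its own. The class and method names are hypothetical, and it assumes that occurrences may overlap and that a "repeated" string is one occurring at least twice. It does not prune answers that are prefixes of other answers.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical brute-force oracle for testing; not a suffix tree.
class RepeatOracle {
    // Counts (possibly overlapping) occurrences of sub in text.
    static int occurrences(String text, String sub) {
        int count = 0;
        for (int i = 0; i + sub.length() <= text.length(); i++) {
            if (text.startsWith(sub, i)) count++;
        }
        return count;
    }

    // Length of the longest string that occurs at least twice, or 0 if none does.
    static int maxRepeatedLength(String text) {
        for (int len = text.length() - 1; len >= 1; len--) {
            for (int i = 0; i + len <= text.length(); i++) {
                if (occurrences(text, text.substring(i, i + len)) >= 2) return len;
            }
        }
        return 0;
    }

    // All substrings of at least minLength characters occurring at least minFrequency times.
    static Set<String> frequentStrings(String text, int minLength, int minFrequency) {
        Set<String> result = new TreeSet<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + Math.max(minLength, 1); j <= text.length(); j++) {
                String sub = text.substring(i, j);
                if (occurrences(text, sub) >= minFrequency) result.add(sub);
            }
        }
        return result;
    }
}
```

In a suffix tree the same answers come from the internal nodes: a string occurs at least f times exactly when at least f suffixes start with it, so the longest repeated string corresponds to the longest internal-node prefix with at least two suffixes beneath it.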

Implementation

  • Only the methods of SuffixTree will be tested
    • you can implement it however you like, so long as you are actually building a suffix tree
    • as opposed to, for example, storing the text as a String and just using very expensive algorithms

Implementation Suggestion

  • An interface is provided
    • Node
  • You write subclasses
    • For example, EmptyNode and NonEmptyNode

Performance measurement

  • You are asked not to directly compare strings
  • Instead, use the SuffixTree.lengthCommonPrefix method, which records the number of characters compared
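
The notes do not show the method itself, so the following is only a guess at how a comparison-counting helper of this kind could work. The class name ComparisonCounter, the two-String signature, the static counter, and the accessor getCharactersCompared are all assumptions. The provided SuffixTree.lengthCommonPrefix may differ.

```java
// Hypothetical stand-in for a comparison-counting prefix helper.
class ComparisonCounter {
    private static long charactersCompared = 0;

    // Returns the length of the longest common prefix of a and b,
    // adding one to the counter for every pair of characters examined.
    static int lengthCommonPrefix(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        int i = 0;
        while (i < limit) {
            charactersCompared++;
            if (a.charAt(i) != b.charAt(i)) {
                break; // the mismatching pair was still compared, so it is counted
            }
            i++;
        }
        return i;
    }

    static long getCharactersCompared() {
        return charactersCompared;
    }
}
```

Routing every comparison through one method like this makes the counter a direct measure of how much character-level work the tree operations perform, which direct calls to String.equals or String.startsWith would hide.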