Suffix Trees and Suffix Arrays: Efficient String Searching Algorithms, Lecture notes of Bioinformatics

Notes of Introduction to Bioinformatics

Typology: Lecture notes

2016/2017

Uploaded on 11/21/2017

dr-maqsood-hayat

Suffix Trees and Suffix Arrays

Some problems

  • Given a pattern P = P[1..m], find all occurrences of P in a text S = S[1..n].
  • Another problem:
    • Given two strings S_1[1..n_1] and S_2[1..n_2], find their longest common substring.
    • Find i, j, k such that S_1[i .. i+k-1] = S_2[j .. j+k-1] and k is as large as possible.
  • Any solutions? How do you solve these problems (efficiently)?
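As a point of reference for the two problems above, here is a minimal sketch of straightforward baseline solutions that do not use suffix trees. The function names `find_occurrences` and `longest_common_substring` are illustrative and not part of the slides. Positions are reported 1-based to match the S[1..n] notation.

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Return the 1-based start positions of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    # Try each of the n - m + 1 alignments, comparing up to m characters at each.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


def longest_common_substring(s1: str, s2: str) -> tuple[int, int, int]:
    """Return 1-based (i, j, k) with s1[i..i+k-1] == s2[j..j+k-1] and k maximal."""
    best = (0, 0, 0)
    # prev[b] = length of the longest common suffix of s1[:a-1] and s2[:b]
    prev = [0] * (len(s2) + 1)
    for a in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for b in range(1, len(s2) + 1):
            if s1[a - 1] == s2[b - 1]:
                cur[b] = prev[b - 1] + 1
                if cur[b] > best[2]:
                    k = cur[b]
                    best = (a - k + 1, b - k + 1, k)
        prev = cur
    return best


print(find_occurrences("xa", "xabxac"))             # [1, 4]
print(longest_common_substring("xabxac", "abxab"))  # (2, 1, 4), i.e. "abxa"
```

The second function is a dynamic-programming table over all pairs of positions, so it takes time proportional to n_1 · n_2. That cost is the motivation for looking at a better data structure.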

Applications in Bioinformatics

  • Multiple genome alignment
    • Michael Hohl et al., 2002
    • Longest common substring problem
    • Common substrings of more than two strings
  • Selection of signature oligonucleotides for microarrays
    • Kaderali and Schliep, 2002
  • Identification of sequence repeats
    • Kurtz and Schleiermacher, 1999

Suffix trees

  • Any string of length m can be decomposed into m suffixes.
    • abcdefgh (length: 8)
    • 8 suffixes: h, gh, fgh, efgh, defgh, cdefgh, bcdefgh, abcdefgh
  • The suffixes can be stored in a suffix tree, and for a text of length n this tree can be generated in O(n) time.
  • A string pattern of length m can be searched in this suffix tree in O(m) time.
  • By comparison, a regular sequential search over the text would take O(n) time.
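The decomposition into suffixes can be sketched in a few lines. The helper names `suffixes` and `occurs` are illustrative. The second function shows the observation that makes suffix structures useful for searching: a pattern occurs in a text exactly when it is a prefix of some suffix of that text.

```python
def suffixes(s: str) -> list[str]:
    """Return all suffixes of s, shortest first."""
    return [s[i:] for i in range(len(s) - 1, -1, -1)]


def occurs(pattern: str, text: str) -> bool:
    """A pattern occurs in text exactly when it is a prefix of some suffix."""
    return any(suf.startswith(pattern) for suf in suffixes(text))


print(suffixes("abcdefgh"))
# ['h', 'gh', 'fgh', 'efgh', 'defgh', 'cdefgh', 'bcdefgh', 'abcdefgh']
print(occurs("cde", "abcdefgh"))  # True
```

Listing the suffixes explicitly like this costs quadratic space and gives no speed-up by itself. It only illustrates the idea that a suffix tree exploits by sharing common prefixes among the suffixes.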

Definition of a suffix tree

  • Let S = S[1..n] be a string of length n over a fixed alphabet Σ. A suffix tree for S is a tree with n leaves (representing n suffixes) and the following properties:
    • Every internal node other than the root has at least 2 children.
    • Every edge is labeled with a nonempty substring of S.
    • The edges leaving a given node have labels starting with different letters.
    • The concatenation of the labels of the path from the root to leaf i spells out the i-th suffix S[i..n] of S. We denote S[i..n] by S_i.

An example suffix tree

  • The suffix tree for the string:

        1 2 3 4 5 6
        x a b x a c

  • Does a suffix tree always exist?

Problem

  • Note that if a suffix is a prefix of another suffix, we cannot have a tree with the properties defined in the previous slides.
    • e.g. xabxa: the fourth suffix xa or the fifth suffix a won't be represented by a leaf node.

Solution: the terminal character $

  • Solution: insert a special terminal character, such as $, at the end of the string. Then no suffix is a prefix of another suffix: for example, xa$ is not a prefix of xabxa$.
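The effect of the terminal character can be checked directly. This is a minimal sketch; the function name `suffix_is_prefix_of_another` is illustrative, and `$` is assumed not to occur in the string itself.

```python
def suffix_is_prefix_of_another(s: str) -> bool:
    """True if some suffix of s is a proper prefix of another suffix of s."""
    sufs = [s[i:] for i in range(len(s))]
    return any(
        len(a) < len(b) and b.startswith(a)
        for a in sufs
        for b in sufs
    )


print(suffix_is_prefix_of_another("xabxa"))   # True: xa is a prefix of xabxa
print(suffix_is_prefix_of_another("xabxa$"))  # False: every suffix now ends in $
print(suffix_is_prefix_of_another("xabxac"))  # False: the example string needs no fix
```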

Suffix tree construction

  • Start with a root and a leaf numbered 1, connected by an edge labeled S$.
  • Enter suffixes S[2..n]$; S[3..n]$; ... ; S[n]$ into the tree as follows:
  • To insert K_i = S[i..n]$, follow the path from the root matching characters of K_i until the first mismatch at character K_i[j] (which is bound to happen).
    (a) If the matching cannot continue from a node, denote that node by w.
    (b) Otherwise the mismatch occurs at the middle of an edge, which has to be split.

Suffix tree construction - 2

  • If the mismatch occurs at the middle of an edge e = S[u..v]:
    • Let the label of that edge be a_1 ... a_l.
    • If the first k characters a_1 ... a_k matched and the mismatch occurred at character a_{k+1}, then create a new node w, and replace e by two edges S[u..u+k-1] and S[u+k..v], labeled by a_1 ... a_k and a_{k+1} ... a_l.
  • Finally, in both cases (a) and (b), create a new leaf numbered i, and connect w to it by an edge labeled with K_i[j..|K_i|].
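The insertion procedure above can be sketched as follows. This is the straightforward quadratic-time construction described on these slides, not a linear-time algorithm. The names `Node`, `build_suffix_tree` and `find_all` are illustrative. Edge labels are kept as 0-based half-open index pairs into the text (a Python convention, assumed here) and leaves are numbered from 1.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start = start    # label of the edge into this node is text[start:end]
        self.end = end
        self.leaf = leaf      # suffix number (1-based) for a leaf, None otherwise
        self.children = {}    # first character of an outgoing edge label -> child


def build_suffix_tree(s: str, terminal: str = "$"):
    """Insert S[1..n]$, S[2..n]$, ..., S[n]$ one by one (quadratic time)."""
    assert terminal not in s
    text = s + terminal
    root = Node()
    for i in range(len(s)):
        node, j = root, i                      # j scans K_i = text[i:]
        while True:
            child = node.children.get(text[j])
            if child is None:                  # case (a): stuck at a node
                w = node
                break
            k, length = 0, child.end - child.start
            while k < length and text[child.start + k] == text[j + k]:
                k += 1
            if k == length:                    # whole edge matched, keep walking
                node, j = child, j + k
                continue
            # Case (b): mismatch in the middle of the edge, so split it at offset k.
            w = Node(child.start, child.start + k)
            node.children[text[child.start]] = w
            child.start += k
            w.children[text[child.start]] = child
            j += k
            break
        w.children[text[j]] = Node(j, len(text), leaf=i + 1)   # new leaf i
    return root, text


def find_all(root, text, pattern):
    """Return the sorted 1-based start positions of pattern, walking from the root."""
    node, j = root, 0
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return []
        label = text[child.start:child.end]
        step = min(len(label), len(pattern) - j)
        if label[:step] != pattern[j:j + step]:
            return []
        node, j = child, j + step
    stack, leaves = [node], []
    while stack:                               # collect every leaf below the match
        cur = stack.pop()
        if cur.leaf is not None:
            leaves.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(leaves)


root, text = build_suffix_tree("xabxac")
print(find_all(root, text, "xa"))   # [1, 4]
```

Walking down the tree compares each pattern character at most once. Collecting the leaves below the matched position then reports every occurrence.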

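Example contd...

  • Inserting the fourth suffix xac$ will cause the first edge to be split: the edge labeled xabxac$ becomes an edge xa followed by an edge bxac$, and a new leaf edge c$ is attached at the split point.
  • The same thing happens for the second edge when ac$ is inserted.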

Example contd...

  • After inserting the remaining suffixes the tree will be completed.

Storing the edge labels efficiently

  • Note that we do not store the actual substrings S[i..j] of S in the edges, but only their start and end indices (i, j).
  • Nevertheless we keep thinking of the edge labels as substrings of S.
  • This will reduce the space complexity to O(n).
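A small sketch of why index pairs help. For a string whose characters are all distinct, every suffix hangs directly off the root, so storing the labels as explicit substrings copies every suffix in full. The helper `label` and the variable names are illustrative. Indices are 1-based and inclusive, as in the S[i..j] notation.

```python
def label(text: str, edge: tuple[int, int]) -> str:
    """Recover the substring S[i..j] from a 1-based inclusive index pair (i, j)."""
    i, j = edge
    return text[i - 1:j]


s = "abcdefgh"
text = s + "$"

# Explicit labels: each leaf edge holds a full copy of its suffix.
explicit = [text[i:] for i in range(len(s))]

# Index labels: each leaf edge holds only (start, end) into the shared text.
indexed = [(i + 1, len(text)) for i in range(len(s))]

assert [label(text, e) for e in indexed] == explicit

chars_stored = sum(len(x) for x in explicit)   # 9 + 8 + ... + 2 = 44 characters
ints_stored = 2 * len(indexed)                 # 2 integers per edge = 16 integers
print(chars_stored, ints_stored)               # 44 16
```

The explicit total grows quadratically with the string length, while the index representation always costs two integers per edge.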

Suffix tree applet

  • http://pauillac.inria.fr/~quercia/documents-info/Luminy-98/albert/JAVA+html/SuffixTreeGrow.html