Lecture 8: Bioinformatic Algorithms - Sequence Alignment and Suffix Trees

Part of the lecture notes for CMSC423: Bioinformatic Algorithms, Databases and Tools, Fall 2008. The lecture covers exact and inexact sequence alignment and the use of suffix trees for matching and substring searches. It also discusses suffix links and applications such as finding repeats and longest common substrings.


Uploaded on 02/13/2009

CMSC423 Fall 2008
CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 8
Sequence alignment:
• exact alignment
• inexact alignment
• dynamic programming, gapped alignment

Suffix trees for matching

• Suffix trees use O(n) space

• Suffix trees can be constructed in O(n) time

• Is CAT part of ATCATG?

• Match from the root, character by character

• If the query runs out, a match has been found

• Otherwise (no edge continues the query), there is no match

• Intuition: CAT is a prefix of some suffix

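[Figure: suffix tree for ATCATG$. Edges are labeled with a substring and its (start,end) positions in the text; leaves are numbered by suffix start position. From the root: AT (1,2), which branches to CATG$ (3,7) ending at leaf 1 and G$ (6,7) ending at leaf 4; T (2,2), which branches to CATG$ (3,7) ending at leaf 2 and G$ (6,7) ending at leaf 5; CATG$ (3,7) ending at leaf 3; G$ (6,7) ending at leaf 6; $ (7,7) ending at leaf 7.]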
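The character-by-character walk from the root can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it builds a naive suffix trie out of nested dictionaries (one character per edge, quadratic time and space), not the linear-size, linear-time suffix tree from the slides. The names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (one character per edge).

    Assumption: this is NOT a compressed suffix tree; it takes O(n^2)
    time and space, and is used only to illustrate the matching walk.
    """
    text += "$"  # terminator, as in the slides
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, query):
    """Match from the root, character by character."""
    node = trie
    for ch in query:
        if ch not in node:
            return False  # no edge continues the query: no match
        node = node[ch]
    return True  # ran out of query: it is a prefix of some suffix


trie = build_suffix_trie("ATCATG")
print(contains(trie, "CAT"))  # True
print(contains(trie, "CAG"))  # False
```

The lookup itself touches one node per query character, which is the same idea as matching along the edges of a real suffix tree.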

Other uses

• Finding repeats

  • internal nodes with multiple children: the string spelled from the root to such a node is DNA that occurs in multiple places in the genome

• Longest common substring of two strings

  • build a suffix tree of both strings, then find the deepest internal node that has leaves from both strings
  • or: build a suffix tree on one string and use suffix links to find the longest match

• Note: running time for matching is O(|Pattern|), not O(|Pattern| + |Text|) (though O(|Text|) was spent in preprocessing)
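The first longest-common-substring approach (one structure over both strings, then the deepest node reached by suffixes of both) can be sketched as follows. As an assumption made for brevity, the sketch uses an uncompressed trie of all suffixes rather than a true generalized suffix tree, so it runs in quadratic time; `longest_common_substring` and the `kids`/`owners` fields are hypothetical names.

```python
def longest_common_substring(s, t):
    """Deepest trie node that lies on suffixes of both s and t.

    Assumption: a naive suffix trie (O(n^2)) stands in for the
    generalized suffix tree described in the slides.
    """
    root = {"kids": {}, "owners": set()}
    for owner, text in ((0, s), (1, t)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)  # this node lies on a suffix of `owner`

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:  # reached from both strings
                stack.append((child, path + ch))
    return best


print(longest_common_substring("ATCATG", "GCATTA"))  # CAT
```

Marking each node with the set of strings that pass through it plays the role of "has leaves from both strings" in the suffix tree version.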

Why do we care?

• Suffix trees are used for

  • mapping reads to a genome (e.g. personal genomics)
  • comparing genomes (comparative genomics)
  • finding repeats
  • identifying genome signatures

• Exact matching – what to expect on exams

  • build a suffix tree for a string
  • answer some questions about one of the algorithms, e.g. for the Z algorithm: is it necessary that j be the farthest-reaching Z-value, or can it be any Z-value extending past i?
  • do something with the help of some of the algorithms (e.g. look for repeats that occur exactly twice, etc.)
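As a reference point for questions like the Z-algorithm one, here is a minimal sketch of the standard Z algorithm, which is not spelled out in this lecture. It keeps the farthest-reaching match window `[l, r)` seen so far. The function names and the use of `$` as a separator (assumed absent from both pattern and text) are choices made for this illustration.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l = r = 0  # [l, r) is the farthest-reaching prefix match found so far
    for i in range(1, n):
        if i < r:
            # Reuse the value computed for the mirrored position i - l,
            # capped at the part of the window that is known to match.
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1  # extend by explicit comparison
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def z_search(pattern, text):
    """0-based start positions of pattern in text (assumes '$' occurs in neither)."""
    m = len(pattern)
    z = z_array(pattern + "$" + text)
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]


print(z_array("aabxaab"))         # [7, 1, 0, 0, 3, 1, 0]
print(z_search("ANA", "BANANA"))  # [1, 3]
```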

Suffix arrays and compression

• Burrows-Wheeler transform

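BANANA, with the terminator appended: BANANA$

All rotations of BANANA$:

    BANANA$
    ANANA$B
    NANA$BA
    ANA$BAN
    NA$BANA
    A$BANAN
    $BANANA

Sorted rotations (each shown as a suffix followed by the characters that wrap around):

    $ BANANA
    A$ BANAN
    ANA$ BAN
    ANANA$ B
    BANANA$
    NA$ BANA
    NANA$ BA

BWT = last column of the sorted rotations, i.e. the character before each suffix: ANNB$AA. This string is then compressed.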

Note: occurrences of the same character appear in the same relative order in the last column as in the first column. This is useful for matching within the BWT.
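The construction above (append `$`, list all rotations, sort, read off the last column) translates directly into a short Python sketch. It is deliberately naive, materializing every rotation, and the function name `bwt` is chosen for this illustration.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via explicitly sorted rotations.

    Assumption: the terminator does not occur in text and sorts before
    every other character (true for '$' versus uppercase letters).
    """
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("BANANA"))  # ANNB$AA
```

Practical implementations typically avoid building all rotations and derive the same string from a suffix array instead.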

BWT – string matching

  • Look for “BANA”
  • Start at end (match right to left)
  • Find character in rightmost column
  • Identify corresponding range in first column
  • Switch back to last column
  • ...
  • How do we know the first A in the pattern is the 2nd/3rd from the top of the matrix?
  • Note: additional data needed: the number of times each letter appears before every position
  • Running time? O(len(P)) operations. Each may cost O(log(len(T)))

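[Figure: the sorted rotations of BANANA$ ($BANANA, A$BANAN, ANA$BAN, ANANA$B, BANANA$, NA$BANA, NANA$BA), used to trace the right-to-left match of BANA between the last and first columns.]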
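The right-to-left matching procedure can be sketched as a backward search over the BWT string. The sketch below makes two assumptions that are not in the slides: it precomputes a full table `occ` giving, for each letter, the number of occurrences before every position (constant-time lookup at the cost of extra space), and it names the first-column offsets `C`. The function names are likewise hypothetical.

```python
from collections import Counter


def bwt(text, terminator="$"):
    """BWT via sorted rotations (naive, for illustration only)."""
    s = text + terminator
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_matches(last, pattern):
    """Number of occurrences of pattern in the text whose BWT is `last`."""
    counts = Counter(last)
    # C[c]: first row (0-based) of the sorted matrix that starts with c.
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of times c appears in last[:i].
    occ = {c: [0] * (len(last) + 1) for c in counts}
    for i, ch in enumerate(last):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    lo, hi = 0, len(last)  # half-open range of rows still matching
    for ch in reversed(pattern):  # start at the end, match right to left
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


last = bwt("BANANA")
print(count_matches(last, "BANA"))  # 1
print(count_matches(last, "ANA"))   # 2
```

The `occ` table is what answers the question of which A in the last column corresponds to which A in the first column; storing it only at sampled positions trades lookup time for space.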