Divide and Conquer Multiple Sequence Alignment | CSC 8910, Papers of Computer Science

Material Type: Paper; Class: COMPUTER SCIENCE TOPICS SEMINR; Subject: COMPUTER SCIENCE; University: Georgia State University; Term: Unknown 1997;


Universität Bielefeld
Forschungsbericht der Technischen Fakultät
Abteilung Informationstechnik
Divide-and-Conquer Multiple Sequence Alignment
Jens Stoye
Report 97-02
Universität Bielefeld
Postfach 10 01 31
33501 Bielefeld
FRG


Imprint

Editors: Robert Giegerich, Alois Knoll, Peter Ladkin, Helge Ritter, Gerhard Sagerer, Ipke Wachsmuth

Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik, Postfach 10 01 31, 33501 Bielefeld, FRG

Contents

1 Introduction
    Preliminaries
    Overview
    Acknowledgements

2 Analysis of Differences: Sequence Alignment
    Introduction
    Global Sequence Alignment
    Pairwise Sequence Alignment
    Multiple Sequence Alignment
    Alignment Scores
    Single Letter Substitutions
    Pairwise Alignment Score
    Multiple Sequence Alignment Score
    The Problem
    Sequence Weighting

3 Calculating Distances: Dynamic Programming
    Optimal Alignment of Two Sequences
    General Gap Functions
    Optimal Alignment of More Than Two Sequences
    The Approach of Carrillo and Lipman
    NP-Completeness of Multiple Sequence Alignment

4 Basic Divide-and-Conquer Alignment
    Introduction
    Additional Cost Matrix of Two Sequences
    C-Optimal Families of Slicing Positions
    The Algorithm
    Variations of the Basic Algorithm
    Stopping Criteria
    Windowing
    Relaxing c
    Rapid Simultaneous Three-Way Alignment
    Combination with Fragment-Based Methods

5 Reducing Computation Time
    Basic Approach
    Analogy to the Approach of Carrillo and Lipman
    Computing an Upper Bound: calcChat
    Some O(k^ n^) Time Methods
    Iterative Methods
    Polynomial Time Methods
    Computing a C-Optimal Cut: calcCopt
    Monotony Bounds
    Preprocessing Subfamilies
    Approximate Slicing Positions

6 The Program DCA
    Some Implementational Aspects
    Incorporation of MSA
    Preprocessing Subfamilies
    Parameters of DCA

7 Evaluation of the Algorithm
    Families of Random Sequences
    Quality of the Alignments
    Dependence on the Recursion Stop Size
    Improvement by Windowing
    Relaxing c
    Approximate Slicing Positions
    Comparison of Different Realizations of calcChat
    Comparison of Different Realizations of calcCopt
    Dependence on Sequence Length and Number
    Dependence on Sequence Similarity

8 Results on Biological Sequences
    Six Tyrosine Kinases
    Four Famous Benchmark Problems
    Comparison with MSA
    Alignment and Phylogeny of RNase MRP RNA
    How Many Sequences

9 Conclusion

Bibliography

CHAPTER 1. INTRODUCTION

[Figure: The amount of DNA sequence data in GenBank. Horizontal axis: year (1984 to 1996); vertical axis: million bases (0 to 700).]

analyzed by methods very similar to those studied here. Kruskal gives a survey of these and further applications. Yet the focus of the present work will be on string comparison as a tool for the molecular biologist.

 Preliminaries

In this section we introduce our terminology, and we discuss some questions which have led to the development of a new algorithm, the Divide-and-Conquer Alignment algorithm (DCA). This section cannot be a self-contained introduction to computational biology; for this purpose the reader might use any one of the relevant textbooks.

The most prominent biological sequences are proteins and nucleic acid sequences. Proteins are polymers made up of amino acids connected linearly by peptide bonds, that is, they are polypeptides. They play an important role as enzymes in the metabolism of the cell. A protein sequence is usually a few hundred units long. On our level of abstraction, proteins can be viewed as sequences of letters drawn from the alphabet of the twenty amino acids occurring in living matter. The table below lists the full names, the three-letter acronyms and the one-letter code.

Amino acid      Three-letter  One-letter
Alanine         Ala           A
Arginine        Arg           R
Asparagine      Asn           N
Aspartic acid   Asp           D
Cysteine        Cys           C
Glutamine       Gln           Q
Glutamic acid   Glu           E
Glycine         Gly           G
Histidine       His           H
Isoleucine      Ile           I
Leucine         Leu           L
Lysine          Lys           K
Methionine      Met           M
Phenylalanine   Phe           F
Proline         Pro           P
Serine          Ser           S
Threonine       Thr           T
Tryptophan      Trp           W
Tyrosine        Tyr           Y
Valine          Val           V

Table: The twenty common amino acids

Nucleic acids are also polymers, made up of small molecules called nucleotides, which can be distinguished by the four bases they contain: adenine (A), cytosine (C), guanine (G), and either thymine (T) or uracil (U), for deoxyribonucleic acids (DNA) or ribonucleic acids (RNA), respectively. A and G are purines; C and T are pyrimidines. Nucleic acids typically contain from tens to thousands of units (for RNA) or millions of units (for DNA). Given one or several such sequences, many questions arising in molecular biology can be reformulated as string processing problems:

• The inference of physical mapping and sequence assembly, i.e., the reconstruction of a DNA molecule from the nucleic acid sequences of fragments within it.

• Molecular modelling, i.e., the determination of the three-dimensional structure of a protein from its sequence of amino acids.

• The assessment of structure-function correlations, mostly by conclusions from structurally similar or historically related molecules.

• The reconstruction of phylogenies, i.e., the inference of the evolutionary history of some species from their associated sequences.

• The exploration of statistical geometry in sequence space.

• The database search for sequences similar to a given sequence.

• The prediction of RNA secondary and tertiary structure, i.e., the way in which different segments of the sequence connect with each other.

Most of these applications require the comparison of sequences. An early approach to compare sequences of the same length is the computation of their Hamming distance.
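As a minimal sketch (the function name is ours and does not come from the report), the Hamming distance simply counts the positions at which two sequences of equal length differ:

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    # The two sequences differ at the third and at the last position.
    print(hamming_distance("ACGTAC", "ACCTAG"))  # 2
```

Because positions are compared one to one, this measure cannot express insertions or deletions; a single inserted letter shifts all following positions and may inflate the distance considerably.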


In the last decades, various simultaneous multiple alignment algorithms have been developed. Unfortunately, almost all of these methods either exhibit a prohibitive computational complexity or yield biologically implausible results. Current algorithms which try to optimize one of the standard score functions are limited to half a dozen sequences. With the algorithm DCA developed in this thesis, we improve this situation: By slicing the sequences at appropriate positions in a divide-and-conquer manner, DCA makes it possible to compute high-quality (but not necessarily optimal) simultaneous alignments of up to fourteen related sequences in a rather short time, requiring only moderate computer memory. Surprisingly, to our knowledge the divide-and-conquer principle, a well-established method in algorithmic computer science, has never been applied before systematically in such a simple way within this context.

 Overview

In Chapter 2, a formal introduction to the multiple sequence alignment problem is given. We present definitions of pairwise and multiple sequence alignments, followed by a discussion of the most commonly used alignment score functions. The multiple sequence alignment problem is formulated on this basis.

In Chapter 3, we recall the standard method for solving the alignment problem. Time and space limitations of this approach will be discussed, which make it necessary to accept fast but in general suboptimal solutions. The standard principles of computing such heuristic alignments are briefly summarized.

Chapter 4 is devoted to the presentation of the basic divide-and-conquer alignment algorithm which, in contrast to previous heuristic methods, computes simultaneous multiple sequence alignments optimizing a well-defined alignment score. Time and space complexity of the algorithm are analyzed, and several suggestions for variations and further uses of the algorithm are discussed.

In Chapter 5, we present improvements of the basic algorithm which allow an enormous increase of efficiency. Using branch-and-bound techniques, the search for appropriate slicing positions of the sequences is accelerated so that more than a dozen sequences can be aligned simultaneously by our method.

Chapter 6 gives a short description of the computer program DCA, which is part of this thesis.

The applicability of our algorithm is illustrated in Chapters 7 and 8. We establish the high quality of the computed alignments in mathematical as well as in biological terms and validate the theoretical time and space analyses. Using well-established benchmark problems, alignments produced by our algorithm are also compared to those computed with other methods.

Chapter 9 concludes the thesis by recalling its main results.


 Acknowledgements

Parts of this dissertation thesis have been published in advance, and two further papers are forthcoming. The work was carried out whilst being a doctoral fellow at the Center for Interdisciplinary Research on Structure Formation (FSPM) at the University of Bielefeld, Germany, and member of the Graduiertenkolleg Strukturbildungsprozesse, both of which have been a perfect work environment due to the interdisciplinary nature of these institutions, which made possible many instructive discussions with colleagues and friends.

I would like to thank my advisor, Prof. Dr. Robert Giegerich, for bringing me into contact with the sequence analysis aspects of molecular biology and for maintaining my enthusiasm. I would also like to thank Prof. Dr. Andreas Dress for introducing me to the particular subject of this thesis as well as for his helpful comments and the fruitful discussions during the development of the algorithm and its many variants. My special thanks go to Dr. Sören Perrey, who worked out parts of joint publications and who suggested several improvements regarding an earlier version of the manuscript. He also provided me with one of the examples used in this thesis. Finally, I am much obliged to Dr. Vincent Moulton for improving the English of the thesis.

CHAPTER 2. ANALYSIS OF DIFFERENCES: SEQUENCE ALIGNMENT

corresponding suffix, so that concatenating the prefix with the corresponding suffix yields $s$ again. We also say that $s$ is cut at slicing position $i$.

The formalism just defined facilitates a closer look at mutations as they are observed in biological sequences, so-called accepted mutations. Dayhoff et al. distinguish two principal kinds: large changes and point mutations.

Large changes in genetic sequences are believed to be caused by unequal crossing over of the chromosomes. These large-scale rearrangements can include entire genes. The subsequences $s$ and $t$ in the following list of the most important types of large changes can be several thousands or millions of nucleotides long.

• Inversions: A subsequence is reversed: $usv \to u s^{-1} v$, where $s^{-1}$ denotes the reversed subsequence.

• Translocations: Subsequences from different chromosomes are exchanged: $usv$ and $wtx \to utv$ and $wsx$. In most cases the exchanged subsequences are terminal, i.e., $u$ or $v$ and $w$ or $x$, respectively, are empty, in which case $s$ and $t$ are prefixes or suffixes of the chromosome sequence.

• Deletions: A subsequence is deleted: $usv \to uv$.

• Duplications: A subsequence is duplicated: $usvw \to usvsw$.

• Transpositions: Two subsequences on the same chromosome are exchanged: $usvtw \to utvsw$.

Point mutations are local changes of one or a few consecutive nucleotides during the copying process of DNA. Occurring within coding regions, they may be observed as changes of a single or a few amino acids in the translated protein. In the following formal description of the most important types of point mutations, the subsequences $a$ and $b$ are usually very short (e.g., one to five letters in case of amino acids).

• Insertions: A single letter or a small number of consecutive letters is inserted: $uv \to uav$.

• Deletions: A single letter or a small number of consecutive letters is deleted: $uav \to uv$.

• Substitutions: A short subsequence is substituted by a different sequence: $uav \to ubv$, $a \neq b$.

In applications of string comparison other than biological ones, further local mutations are considered:

• Swaps: Two single neighboring letters are exchanged: $uabv \to ubav$. Swaps play an important role in spelling correction.

• Compression and expansion: In speech recognition and other applications where continuous input streams are compared, it is often necessary to re-scale the incoming data. Scaling operations of this kind are also referred to as time or space warps.
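The string transformations listed above can be sketched as plain string operations. This is only an illustration: the function names are ours, and each argument corresponds to one of the subsequences $u, s, v, w, t, x, a, b$ used in the formal descriptions.

```python
def invert(u: str, s: str, v: str) -> str:
    """Inversion: usv -> u reversed(s) v."""
    return u + s[::-1] + v


def translocate(u, s, v, w, t, x):
    """Translocation: (usv, wtx) -> (utv, wsx)."""
    return u + t + v, w + s + x


def delete(u: str, s: str, v: str) -> str:
    """Deletion: usv -> uv (s is dropped)."""
    return u + v


def duplicate(u: str, s: str, v: str, w: str) -> str:
    """Duplication: usvw -> usvsw."""
    return u + s + v + s + w


def transpose(u: str, s: str, v: str, t: str, w: str) -> str:
    """Transposition: usvtw -> utvsw."""
    return u + t + v + s + w


def insert(u: str, a: str, v: str) -> str:
    """Insertion: uv -> uav."""
    return u + a + v


def substitute(u: str, a: str, v: str, b: str) -> str:
    """Substitution: uav -> ubv with a != b."""
    if a == b:
        raise ValueError("a substitution must change the subsequence")
    return u + b + v


def swap(u: str, a: str, b: str, v: str) -> str:
    """Swap of two neighboring letters: uabv -> ubav."""
    return u + b + a + v


if __name__ == "__main__":
    print(invert("AA", "CGT", "TT"))   # AATGCTT
    print(swap("th", "e", "i", "r"))   # thier
```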

In the study of genomic sequences, large changes and point mutations are contemplated in different situations. In the search for the correct order of sequence fragments on a chromosome (the so-called sequence assembly problem) and in the comparison of the gene pool of different species, large changes have to be considered. In contrast, point mutations play the most important role in the study of single proteins or other comparatively small regions of the genome (sequences of some hundreds of letters or more in length). Such sequences are often compared for common overall (also called global) or local similarities. Global and local sequence comparison can be handled by alignments. The divide-and-conquer algorithm developed in this thesis facilitates global sequence comparison by speeding up the search for global sequence alignments.

 Global Sequence Alignment

A standard method in computational molecular biology for presenting the result of sequence comparison is an alignment, which we formalize in this section. Multiple sequence alignments are a natural generalization of pairwise sequence alignments, which we introduce first.

 Pairwise Sequence Alignment

Assume that we are given two sequences $s$ and $t$ which are known to be globally related. In general, $s$ and $t$ are of different length. For better comparability, the lengths of the sequences are equalized: Blanks, denoted by dashes "-", are inserted into or at either end of $s$ and $t$ such that the two resulting sequences $\hat{s}$ and $\hat{t}$, respectively, are of the same length $N$. Apart from the length equalization, the most important aim of inserting blanks is the following: By selecting the location of blanks carefully, regions from $s$ may be located in $\hat{s}$ at the same position where similar regions of $t$ are located in $\hat{t}$. By writing the sequences $\hat{s}$ and $\hat{t}$ above each other, similarities and differences are then easier to observe.

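Definition (Pairwise Sequence Alignment)

An alignment of two sequences $s$ and $t$ over $\mathcal{A}$ is a matrix

$$A = \begin{pmatrix} \hat{s}_1 & \hat{s}_2 & \cdots & \hat{s}_N \\ \hat{t}_1 & \hat{t}_2 & \cdots & \hat{t}_N \end{pmatrix}$$

with two rows $\hat{s} = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_N$, $\hat{t} = \hat{t}_1 \hat{t}_2 \cdots \hat{t}_N$ and $N$ columns, $\max\{|s|, |t|\} \le N \le |s| + |t|$, where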


Using Stirling's formula, $f(n, n)$ can be approximated by

$$f(n, n) \approx (1 + \sqrt{2})^{2n+1} \, n^{-1/2}.$$
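To get a feeling for how quickly the number of alignments grows, the following sketch counts them exactly. The definition of $f$ is not part of this excerpt, so the code rests on an assumption: $f(n, m)$ is taken to be the number of alignments of two sequences of lengths $n$ and $m$ in which no column consists of two blanks. Under that assumption the last column is either a pair of letters or a letter opposite a blank, which gives the recurrence used below. The function name is ours.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of alignments of sequences of lengths n and m (assumed definition of f)."""
    if n == 0 or m == 0:
        # Only one possibility: every remaining letter faces a blank.
        return 1
    return (
        count_alignments(n - 1, m)        # last column: letter of s over a blank
        + count_alignments(n, m - 1)      # last column: blank over a letter of t
        + count_alignments(n - 1, m - 1)  # last column: letter over letter
    )


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 10):
        print(n, count_alignments(n, n))
```

Already for two sequences of length 5 this yields 1683 alignments, which illustrates why enumerating all alignments is not a practical way to find a best one.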

Notation

Two identical letters above each other in an alignment form a match, and two distinct ones form a mismatch or substitution. A blank in one sequence aligned with a letter $a$ in the other can be viewed as an insertion of $a$ into the first sequence or as a deletion of $a$ from the second sequence. Following Kruskal, we use the term indel to denote the event of a deletion or an insertion.
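A small sketch of this terminology, assuming that an alignment of two sequences is stored as two padded strings of equal length with "-" as the blank (the helper names are ours):

```python
BLANK = "-"


def strip_blanks(padded: str) -> str:
    """Recover the original sequence from a padded row."""
    return padded.replace(BLANK, "")


def alignment_columns(s_hat: str, t_hat: str):
    """Return the columns of a pairwise alignment as pairs of symbols."""
    if len(s_hat) != len(t_hat):
        raise ValueError("padded sequences must have the same length N")
    return list(zip(s_hat, t_hat))


def classify(column) -> str:
    """Label one column as 'match', 'mismatch' or 'indel'."""
    a, b = column
    if a == BLANK and b == BLANK:
        raise ValueError("a column must not consist of blanks only")
    if a == BLANK or b == BLANK:
        return "indel"
    return "match" if a == b else "mismatch"


if __name__ == "__main__":
    s_hat, t_hat = "ACGTT", "A-GTA"   # pads s = ACGTT and t = AGTA
    print(s_hat)
    print(t_hat)
    print([classify(c) for c in alignment_columns(s_hat, t_hat)])
```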

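The letters of the original sequence $s$ and of the padded sequence $\hat{s}$ in an alignment are connected by the following maps $s^{\leftarrow}$ and $s^{\rightarrow}$.

Definition

Assume a sequence $s$ over $\mathcal{A}$ and a corresponding aligned sequence $\hat{s}$ of length $N \ge |s|$, i.e., assume that $\hat{s}$ reproduces $s$ upon elimination of the blanks. For $j \in \{1, \dots, N\}$ let $s^{\leftarrow}(j)$ be the number of letters in $\hat{s}$ before position $j$ which are not blanks, plus one, i.e.,

$$s^{\leftarrow}(j) := \#\{k \in \{1, \dots, j-1\} \mid \hat{s}_k \neq -\} + 1.$$

Clearly, $s^{\leftarrow}$ is monotonically increasing, and the assumed relationship between $s$ and $\hat{s}$ implies $\{1, \dots, |s|\} \subseteq \{s^{\leftarrow}(1), s^{\leftarrow}(2), \dots, s^{\leftarrow}(N)\} \subseteq \{1, \dots, |s|+1\}$, with $s^{\leftarrow}(N) = |s| + 1$ if and only if $\hat{s}_N = -$. Hence, for $i \in \{1, \dots, |s|\}$ we define $s^{\rightarrow}(i)$ to be the largest $j$ with $s^{\leftarrow}(j) = i$, i.e.,

$$s^{\rightarrow}(i) := \max_{j \le N} \{ j \mid s^{\leftarrow}(j) = i \}.$$

Clearly, we have $s^{\leftarrow}(k) = i$ if and only if $s^{\rightarrow}(i-1) < k \le s^{\rightarrow}(i)$ for $i = 1, \dots, |s|$ (with $s^{\rightarrow}(0) := 0$ for convenience), and we have $\hat{s}_k = s_i$ if $k = s^{\rightarrow}(i)$ for some $i \in \{1, \dots, |s|\}$, and $\hat{s}_k = -$ otherwise.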
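The two index maps of this definition can be sketched as follows. The function names are ours; positions are 1-based as in the text, so the value for position $j$ (or letter $i$) is stored at list index $j-1$ (or $i-1$).

```python
BLANK = "-"


def to_sequence_index(s_hat: str) -> list:
    """For each aligned position j: 1 + number of non-blank letters before j."""
    values, letters_seen = [], 0
    for symbol in s_hat:
        values.append(letters_seen + 1)
        if symbol != BLANK:
            letters_seen += 1
    return values


def to_aligned_position(s_hat: str) -> list:
    """For each letter i of s: the largest aligned position j mapped to i."""
    length = len(s_hat.replace(BLANK, ""))
    last = {}
    for j, i in enumerate(to_sequence_index(s_hat), start=1):
        last[i] = j  # later positions overwrite earlier ones, leaving the maximum
    return [last[i] for i in range(1, length + 1)]


if __name__ == "__main__":
    s_hat = "A-C-"                      # padded version of s = AC
    print(to_sequence_index(s_hat))     # [1, 2, 2, 3]
    print(to_aligned_position(s_hat))   # [1, 3]
```

In the example, the second map returns positions 1 and 3, which are exactly the positions of the letters A and C in the padded row, while the first map reaches the value $|s| + 1 = 3$ at the last position because the row ends with a blank.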

Note that, apart from alignments, other equivalent or very similar data structures presenting the result of global sequence comparison are discussed in the literature, denoted as edit scripts, traces, or listings. In biological applications the term alignment and the above definitions have become generally accepted.

 Multiple Sequence Alignment

The concept of global pairwise alignment can be extended to the comparison of more than two sequences in a straightforward way.


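Definition (Multiple Sequence Alignment)

Consider a family* $\langle s_1, \dots, s_k \rangle$ of $k$ sequences over $\mathcal{A}$. A multiple alignment of $\langle s_1, \dots, s_k \rangle$ is given by a $k \times N$ matrix

$$A = \begin{pmatrix} \hat{s}_{11} & \hat{s}_{12} & \cdots & \hat{s}_{1N} \\ \vdots & \vdots & & \vdots \\ \hat{s}_{k1} & \hat{s}_{k2} & \cdots & \hat{s}_{kN} \end{pmatrix}$$

for some $N$, $\max\{|s_1|, \dots, |s_k|\} \le N \le \sum_{i=1}^{k} |s_i|$, where

• $\hat{s}_{ij} \in \mathcal{A} \cup \{-\}$ for all $1 \le i \le k$, $1 \le j \le N$,

• for each $i = 1, \dots, k$, the row $\hat{s}_i = \hat{s}_{i1} \hat{s}_{i2} \cdots \hat{s}_{iN}$ reproduces the sequence $s_i$ upon ignoring all of its blanks, and

• $A$ does not contain any column consisting of blanks only.

The set of all alignments of $S = \langle s_1, \dots, s_k \rangle$ is denoted by $\mathfrak{A}_S$.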

Similar to sequences, alignments with the same number of rows may be concatenated.

Definition (Projection)

Consider a family of sequences $S$ and an alignment $A \in \mathfrak{A}_S$. Given a subfamily $S' \subseteq S$, the alignment obtained by extracting from $A$ all rows corresponding to the sequences in $S'$ (where the columns consisting of blanks only are removed) is called the projection of $A$ on $S'$, denoted by $\pi_{S'}(A)$.
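The conditions of a multiple alignment and the projection onto a subfamily can be sketched as follows. This is an illustration under assumptions of ours: an alignment is stored as a list of equally long strings with "-" as the blank, and a subfamily is given by the list of row indices to keep.

```python
BLANK = "-"


def is_alignment(rows, sequences) -> bool:
    """Check the conditions of a multiple alignment of the given sequences."""
    if not rows or len(rows) != len(sequences):
        return False
    n = len(rows[0])
    if any(len(row) != n for row in rows):
        return False
    # Each row must reproduce its sequence once the blanks are ignored.
    if any(row.replace(BLANK, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # No column may consist of blanks only.
    return all(any(row[j] != BLANK for row in rows) for j in range(n))


def project(rows, keep):
    """Projection: keep the selected rows and drop columns made of blanks only."""
    selected = [rows[i] for i in keep]
    kept = [col for col in zip(*selected) if any(c != BLANK for c in col)]
    return ["".join(col[r] for col in kept) for r in range(len(selected))]


if __name__ == "__main__":
    rows = ["AC-GT", "AC-GT", "ACTGT"]
    print(is_alignment(rows, ["ACGT", "ACGT", "ACTGT"]))  # True
    print(project(rows, [0, 1]))  # ['ACGT', 'ACGT']
```

In the example, the third column contains a letter only in the last row, so it disappears when the alignment is projected onto the first two sequences.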

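Definition (Combination)

Consider a family of sequences $S$ and subfamilies $S_1, \dots, S_n \subseteq S$. Given alignments $A_1 \in \mathfrak{A}_{S_1}, \dots, A_n \in \mathfrak{A}_{S_n}$, an alignment $A \in \mathfrak{A}_S$ is called a combination of the $A_1, \dots, A_n$ if $\pi_{S_i}(A) = A_i$ for all $i \in \{1, \dots, n\}$.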

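Definition (Compatibility of Alignments)

Consider, as above, a family of sequences $S$ and subfamilies $S_1, \dots, S_n \subseteq S$. Alignments $A_1 \in \mathfrak{A}_{S_1}, \dots, A_n \in \mathfrak{A}_{S_n}$ are compatible if there exists a combination $A \in \mathfrak{A}_S$ of the $A_1, \dots, A_n$ such that $\pi_{S_i}(A) = A_i$ for all $i \in \{1, \dots, n\}$.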
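Whether a given alignment is a combination of some sub-alignments can be checked directly from the definition. The sketch below uses a representation of our own choosing (rows as strings, subfamilies as lists of row indices) and only tests one candidate alignment; deciding compatibility would additionally require searching for such a candidate, which is not attempted here.

```python
BLANK = "-"


def project(rows, keep):
    """Keep the selected rows and drop columns consisting of blanks only."""
    selected = [rows[i] for i in keep]
    kept = [col for col in zip(*selected) if any(c != BLANK for c in col)]
    return ["".join(col[r] for col in kept) for r in range(len(selected))]


def is_combination(rows, subfamilies, sub_alignments) -> bool:
    """True if projecting `rows` on every subfamily yields the given sub-alignment."""
    return all(
        project(rows, keep) == list(sub)
        for keep, sub in zip(subfamilies, sub_alignments)
    )


if __name__ == "__main__":
    candidate = ["AC-GT", "A-TGT", "ACTG-"]
    subfamilies = [[0, 1], [1, 2]]
    parts = [["AC-GT", "A-TGT"], ["A-TGT", "ACTG-"]]
    print(is_combination(candidate, subfamilies, parts))  # True
```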

* Note the use of angle brackets $\langle\ \rangle$ to designate a family of sequences instead of the usual set notation with braces. Sequence families in our context are lists of sequences which, in contrast to sets, are ordered and may contain more than one identical element. To denote subfamilies, we use the standard symbol $\subseteq$ from set notation.


have evolved independently. Therefore, only a (usually symmetric) substitution score function is needed, which quantifies for each pair of letters $(a, b) \in \mathcal{A}^2$ the preference of aligning letter $a$ with letter $b$, represented by an $|\mathcal{A}| \times |\mathcal{A}|$ substitution matrix. In general, there are two natural ways of defining substitution score functions:

(a) in form of a similarity score $s : \mathcal{A}^2 \to \mathbb{R}$, where letters $a, b$ with similar properties are scored with high (positive) values and different letters are scored with low (negative) values, or

(b) in form of a distance score $d : \mathcal{A}^2 \to \mathbb{R}$, where pairs of similar letters are scored with small values while the alignment of different letters is penalized by a high distance score.

The simplest possible distance score is the unit cost function: Non-identical letters are scored with the value 1, identities are scored with 0. The unit cost function is a member of a broad class of distance score functions with interesting and useful properties, the metric functions: $d : \mathcal{A}^2 \to \mathbb{R}$ is said to be a metric on $\mathcal{A}$ if, in addition to symmetry, the zero property $d(a, b) = 0 \Leftrightarrow a = b$ and the triangle inequality $d(a, b) \le d(a, c) + d(c, b)$ hold for all $a, b, c \in \mathcal{A}$. In case $d$ is a metric on $\mathcal{A}$, Sellers and Waterman et al. showed that for global sequence alignments the metric properties carry over to the set of sequences over $\mathcal{A}$.

On the other hand, in the context of local alignments, similarity scores with positive and negative values have been shown to be superior to distance scores. In a similarity score, the value zero has a special meaning since it separates the positively correlated pairs of letters from the negatively correlated pairs. Thus, a series of positively scored substitutions in an alignment indicates highly similar regions of the involved sequences. For global sequence alignment, it has been shown that under certain conditions distance and similarity scores are equivalent, and it is easy to derive the corresponding similarity score function from a given distance score function and vice versa.

Throughout this thesis, we will restrict the discussion of alignment scores and algorithms to distances. This choice is arbitrary, and with small changes in some definitions it is possible to apply the procedures developed in the following chapters also to a similarity score function. Furthermore, the algorithms described in this thesis do not require any of the metric properties except symmetry. Thus, the score matrices commonly used in biological applications which are non-metric, e.g., the PAM and Blosum series of amino acid substitution matrices, are applicable.

The effort made for the invention of biologically reasonable score functions differs significantly for nucleic acids and for amino acids. For nucleic acids, often rather simple score functions are used.* Besides the unit cost function mentioned above, score functions are widely established which distinguish only between transitions (exchanges of two purines or of two pyrimidines) and transversions (all other combinations). For amino acid sequences, more elaborate score matrices are used. They reflect differences in the genetic code, chemical properties of the amino acids, secondary structural propensities of amino acids in proteins, or (and these are the most popular ones) they are derived from empirical data by counting true matches and mismatches in databases of structural alignments and by computing the most reasonable substitution probabilities which might have led to the observed exchanges. A recent comparison of many different amino acid substitution matrices has also been published.

* This trend may change in the near future: Very recently, Agarwal and States published a nucleic acid substitution matrix based on an advanced statistical model.
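The unit cost function, a transition/transversion distance, and the metric properties can be illustrated with a short sketch. The cost values 1 and 2 for transitions and transversions are assumptions chosen for illustration only, not values from the report, and the function names are ours.

```python
from itertools import product

PURINES = set("AG")
PYRIMIDINES = set("CT")


def unit_cost(a: str, b: str) -> int:
    """Unit cost function: 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1


def transition_transversion_cost(a: str, b: str, transition=1, transversion=2):
    """Distance that only distinguishes transitions from transversions.

    The default costs are illustrative assumptions.
    """
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def is_metric(d, alphabet) -> bool:
    """Check symmetry, the zero property and the triangle inequality on the alphabet."""
    for a, b in product(alphabet, repeat=2):
        if d(a, b) != d(b, a):
            return False
        if (d(a, b) == 0) != (a == b):
            return False
    return all(
        d(a, b) <= d(a, c) + d(c, b) for a, b, c in product(alphabet, repeat=3)
    )


if __name__ == "__main__":
    print(is_metric(unit_cost, "ACGT"))              # True
    print(transition_transversion_cost("A", "G"))    # 1 (transition)
    print(transition_transversion_cost("A", "C"))    # 2 (transversion)
```

A check of this kind is also a quick way to see whether a given substitution matrix is non-metric; as stated above, the algorithms of this thesis only rely on symmetry.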

 Pairwise Alignment Score

As in the discussion of alignments, we begin with the case of two sequences $s$ and $t$. Based on letter-to-letter distances, there is a natural approach of defining a pairwise alignment score, motivated by the following consideration: Sequences $s$ and $t$ are supposed to have a low overall distance if it is possible to find an alignment such that at many positions the paired letters have low distances. In the case of aligning proteins, this can be read as: "If the chains of amino acids in both proteins contain similar residues over their whole length, the proteins are supposed to have similar main chain folding patterns and likely related functions." Of course, this is an idealistic view, and the existence of false negatives cannot be excluded. Procedures which weight different positions of a biological sequence according to their mutability or importance have been suggested. Formally speaking, we extend the distance score function $d$ to a function

$$\hat{d} : (\mathcal{A} \cup \{-\})^2 \to \mathbb{R}$$

which is identical to $d$ for all pairs of letters, i.e., $\hat{d}(a, b) = d(a, b)$ for all $a, b \in \mathcal{A}$. Additionally, aligning a letter from $\mathcal{A}$ with a blank is penalized by a high value, usually a constant, so that $\hat{d}(a, -) = \hat{d}(-, a)$ takes the same value for all $a \in \mathcal{A}$. Hence, a consecutive string of $l$ blanks receives $l$ times this value. That is the reason for calling this way of scoring blanks also additive or homogeneous gap costs. The alignment score of $A$ is then defined as the sum of the $\hat{d}$ values over all sites of the alignment.

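Definition (Alignment Score)

Assume sequences $s$ and $t$ over $\mathcal{A}$. Given an alignment $A = \begin{pmatrix} \hat{s} \\ \hat{t} \end{pmatrix} \in \mathfrak{A}_{\langle s, t \rangle}$ of length $N$ and an extended distance score function $\hat{d} : (\mathcal{A} \cup \{-\})^2 \to \mathbb{R}$, we define the alignment score of $A$ with respect to $\hat{d}$ by

$$w_d(A) := \sum_{1 \le i \le N} \hat{d}(\hat{s}_i, \hat{t}_i).$$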
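The alignment score with additive gap costs can be sketched as a sum over the columns of the alignment. The names and the numeric gap cost below are assumptions for illustration; rejecting a column of two blanks is also our choice, justified by the fact that alignments contain no such column.

```python
BLANK = "-"


def unit_cost(a: str, b: str) -> int:
    """Unit cost distance on letters: 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1


def extended_distance(a: str, b: str, d, gap_cost):
    """Extend a letter distance d to pairs that may contain one blank."""
    if a == BLANK and b == BLANK:
        raise ValueError("an alignment has no column consisting of blanks only")
    if a == BLANK or b == BLANK:
        return gap_cost           # constant penalty for a letter facing a blank
    return d(a, b)


def alignment_score(s_hat: str, t_hat: str, d, gap_cost):
    """Sum of the extended distances over all columns of a pairwise alignment."""
    if len(s_hat) != len(t_hat):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(extended_distance(a, b, d, gap_cost) for a, b in zip(s_hat, t_hat))


if __name__ == "__main__":
    # Two indels at gap cost 2 each and no mismatch.
    print(alignment_score("ACGT-", "A-GTT", unit_cost, gap_cost=2))  # 4
    # One mismatch and no indel.
    print(alignment_score("ACGT", "AGGT", unit_cost, gap_cost=2))    # 1
```

Since every blank contributes the same constant, a run of $l$ consecutive blanks contributes $l$ times the gap cost, which is the additive behaviour described above.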