Divide and Conquer Multiple Sequence Alignment | CSC 8910, Papers of Computer Science

Georgia State University (GSU)Computer Science

Material Type: Paper; Class: COMPUTER SCIENCE TOPICS SEMINR; Subject: COMPUTER SCIENCE; University: Georgia State University; Term: Unknown 1997;

Typology: Papers

Pre 2010

Uploaded on 08/31/2009


Universität Bielefeld

Forschungsbericht der Technischen Fakultät, Abteilung Informationstechnik

Divide-and-Conquer Multiple Sequence Alignment

Jens Stoye

Report 97-02

Universität Bielefeld, Postfach 10 01 31, 33501 Bielefeld, FRG


Imprint

Editors: Robert Giegerich, Alois Knoll, Peter Ladkin, Helge Ritter, Gerhard Sagerer, Ipke Wachsmuth

Technische Fakultät der Universität Bielefeld, Abteilung Informationstechnik, Postfach 10 01 31, 33501 Bielefeld, FRG

ISSN

Contents

1. Introduction
   Preliminaries; Overview; Acknowledgements

2. Analysis of Differences: Sequence Alignment
   Introduction; Global Sequence Alignment; Pairwise Sequence Alignment; Multiple Sequence Alignment; Alignment Scores; Single Letter Substitutions; Pairwise Alignment Score; Multiple Sequence Alignment Score; The Problem; Sequence Weighting

3. Calculating Distances: Dynamic Programming
   Optimal Alignment of Two Sequences; General Gap Functions; Optimal Alignment of More Than Two Sequences; The Approach of Carrillo and Lipman; NP-Completeness of Multiple Sequence Alignment

4. Basic Divide-and-Conquer Alignment
   Introduction; Additional Cost Matrix of Two Sequences; C-Optimal Families of Slicing Positions; The Algorithm; Variations of the Basic Algorithm; Stopping Criteria; Windowing; Relaxing c; Rapid Simultaneous Three-Way Alignment; Combination with Fragment-Based Methods

5. Reducing Computation Time
   Basic Approach; Analogy to the Approach of Carrillo and Lipman; Computing an Upper Bound (calcChat); Some O(k^… n^…) Time Methods; Iterative Methods; Polynomial Time Methods; Computing a C-Optimal Cut (calcCopt); Monotony Bounds; Preprocessing Subfamilies; Approximate Slicing Positions

6. The Program DCA
   Some Implementational Aspects; Incorporation of MSA; Preprocessing Subfamilies; Parameters of DCA

7. Evaluation of the Algorithm
   Families of Random Sequences; Quality of the Alignments; Dependence on the Recursion Stop Size; Improvement by Windowing; Relaxing c; Approximate Slicing Positions; Comparison of Different Realizations of calcChat; Comparison of Different Realizations of calcCopt; Dependence on Sequence Length and Number; Dependence on Sequence Similarity

8. Results on Biological Sequences
   Six Tyrosine Kinases; Four Famous Benchmark Problems; Comparison with MSA; Alignment and Phylogeny of RNase MRP RNA; How Many Sequences

9. Conclusion

Bibliography

Chapter 1: Introduction

[Figure: The amount of DNA sequence data in GenBank. Horizontal axis: year (1984 to 1996); vertical axis: million bases (0 to 700).]

analyzed by methods very similar to those studied here. Kruskal gives a survey of these and further applications. Yet the focus of the present work will be on string comparison as a tool for the molecular biologist.

Preliminaries

In this section we introduce our terminology and discuss some questions which have led to the development of a new algorithm, the Divide-and-Conquer Alignment algorithm (DCA). This section cannot be a self-contained introduction to computational biology; for this purpose the reader might use any one of the relevant textbooks.

The most prominent biological sequences are proteins and nucleic acid sequences. Proteins are polymers made up of amino acids connected linearly by peptide bonds, that is, they are polypeptides. They play an important role as enzymes in the metabolism of the cell. A protein sequence is usually a few hundred units long. On our level of abstraction, proteins can be viewed as sequences of letters drawn from the alphabet of the twenty amino acids occurring in living matter. The table below lists the full names, the three-letter acronyms, and the one-letter codes.

Table: The twenty common amino acids

Name            Three-letter  One-letter
Alanine         Ala           A
Arginine        Arg           R
Asparagine      Asn           N
Aspartic acid   Asp           D
Cysteine        Cys           C
Glutamine       Gln           Q
Glutamic acid   Glu           E
Glycine         Gly           G
Histidine       His           H
Isoleucine      Ile           I
Leucine         Leu           L
Lysine          Lys           K
Methionine      Met           M
Phenylalanine   Phe           F
Proline         Pro           P
Serine          Ser           S
Threonine       Thr           T
Tryptophan      Trp           W
Tyrosine        Tyr           Y
Valine          Val           V

Nucleic acids are also polymers, made up of small molecules called nucleotides, which can be distinguished by the four bases they contain: adenine (A), cytosine (C), guanine (G), and either thymine (T) or uracil (U) for deoxyribonucleic acids (DNA) or ribonucleic acids (RNA), respectively. A and G are purines; C and T are pyrimidines. Nucleic acids typically contain from tens to thousands of units for RNA, or millions of units for DNA.

Given one or several such sequences, many questions arising in molecular biology can be reformulated as string processing problems:

- The inference of physical mapping and sequence assembly, i.e. the reconstruction of a DNA molecule from the nucleic acid sequences of fragments within it.

- Molecular modelling, i.e. the determination of the three-dimensional structure of a protein from its sequence of amino acids.

- The assessment of structure-function correlations, mostly by conclusions from structurally similar or historically related molecules.

- The reconstruction of phylogenies, i.e. the inference of the evolutionary history of some species from their associated sequences.

- The exploration of statistical geometry in sequence space.

- The database search for sequences similar to a given sequence.

- The prediction of RNA secondary and tertiary structure, i.e. the way in which different segments of the sequence connect with each other.

Most of these applications require the comparison of sequences. An early approach to comparing sequences of the same length is the computation of their Hamming distance.
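As a minimal sketch of the Hamming distance mentioned above (the function name `hamming_distance` and the example sequences are illustrative assumptions, not taken from the thesis), the distance simply counts the positions at which two equally long sequences carry different letters:

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# Two sequences that differ by two substitutions.
print(hamming_distance("GATTACA", "GACTATA"))

# The same sequence shifted by one position: every position now differs,
# although the two strings are obviously closely related.
print(hamming_distance("ACGTACGT", "CGTACGTA"))
```

The second call illustrates why a position-by-position comparison is a poor fit for sequences that have undergone insertions or deletions, which is the situation alignments are designed to handle.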


In the last decades, various simultaneous multiple alignment algorithms have been developed. Unfortunately, almost all of these methods either exhibit a prohibitive computational complexity or yield biologically implausible results. Current algorithms which try to optimize one of the standard score functions are limited to half a dozen sequences. With the algorithm DCA developed in this thesis, we improve this situation: By slicing the sequences at appropriate positions in a divide-and-conquer manner, DCA makes it possible to compute high-quality, but not necessarily optimal, simultaneous alignments of up to fourteen related sequences in a rather short time, requiring only moderate computer memory. Surprisingly, to our knowledge, the divide-and-conquer principle, a well-established method in algorithmic computer science, has never been applied before systematically in such a simple way within this context.

Overview

In Chapter 2, a formal introduction to the multiple sequence alignment problem is given. We present definitions of pairwise and multiple sequence alignments, followed by a discussion of the most commonly used alignment score functions. The multiple sequence alignment problem is formulated on this basis.

In Chapter 3, we recall the standard method for solving the alignment problem. Time and space limitations of this approach will be discussed, which make it necessary to accept fast but, in general, suboptimal solutions. The standard principles of computing such heuristic alignments are briefly summarized.

Chapter 4 is devoted to the presentation of the basic divide-and-conquer alignment algorithm which, in contrast to previous heuristic methods, computes simultaneous multiple sequence alignments optimizing a well-defined alignment score. Time and space complexity of the algorithm are analyzed, and several suggestions for variations and further uses of the algorithm are discussed.

In Chapter 5, we present improvements of the basic algorithm which allow an enormous increase in efficiency. Using branch-and-bound techniques, the search for appropriate slicing positions of the sequences is accelerated so that more than a dozen sequences can be aligned simultaneously by our method.

Chapter 6 gives a short description of the computer program DCA, which is part of this thesis.

The applicability of our algorithm is illustrated in Chapters 7 and 8. We establish the high quality of the computed alignments in mathematical as well as in biological terms and validate the theoretical time and space analyses. Using well-established benchmark problems, alignments produced by our algorithm are also compared to those computed with other methods.

Chapter 9 concludes the thesis by recalling its main results.


Acknowledgements

Parts of this dissertation have been published in advance, and two further papers are forthcoming. The work was carried out while I was a doctoral fellow at the Center for Interdisciplinary Research on Structure Formation (FSPM) at the University of Bielefeld, Germany, and a member of the Graduiertenkolleg Strukturbildungsprozesse. Both have been a perfect work environment: the interdisciplinary nature of these institutions made possible many instructive discussions with colleagues and friends.

I would like to thank my advisor, Prof. Dr. Robert Giegerich, for bringing me into contact with the sequence analysis aspects of molecular biology and for maintaining my enthusiasm. I would also like to thank Prof. Dr. Andreas Dress for introducing me to the particular subject of this thesis, as well as for his helpful comments and the fruitful discussions during the development of the algorithm and its many variants. My special thanks go to Dr. Sören Perrey, who worked out parts of joint publications and who suggested several improvements regarding an earlier version of the manuscript. He also provided me with the example in Section …. Finally, I am much obliged to Dr. Vincent Moulton for improving the English of the thesis.

Chapter 2: Analysis of Differences: Sequence Alignment

corresponding suffix. So we have $s = s_{\le i}\, s_{>i}$, that is, concatenating the prefix and the suffix reproduces $s$. We also say that $s$ is cut at slicing position $i$.

The formalism just defined facilitates a closer look at mutations as they are observed in biological sequences, so-called accepted mutations. Dayhoff et al. distinguish two principal kinds: large changes and point mutations.

Large changes in genetic sequences are believed to be caused by unequal crossing-over of the chromosomes. These large-scale rearrangements can include entire genes. The subsequences $s$ and $t$ in the following list of the most important types of large changes can be several thousands or millions of nucleotides long.

- Inversions: A subsequence is reversed: $usv \to u\,s^{R}\,v$, where $s^{R}$ denotes $s$ read in reverse order.

- Translocations: Subsequences from different chromosomes are exchanged: $usv$ and $wtx$ become $utv$ and $wsx$. In most cases the exchanged subsequences are terminal, i.e. $u$ or $v$, and $w$ or $x$, respectively, are empty, in which case $s$ and $t$ are prefixes or suffixes of the chromosome sequence.

- Deletions: A subsequence is deleted: $usv \to uv$.

- Duplications: A subsequence is duplicated: $usvw \to usvsw$.

- Transpositions: Two subsequences on the same chromosome are exchanged: $usvtw \to utvsw$.
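The large changes listed above can be written down directly as string operations. The following is only an illustrative sketch: the function names and the toy sequences are assumptions made here, and real rearrangements act on subsequences that are thousands to millions of letters long.

```python
def inversion(u: str, s: str, v: str) -> str:
    """usv -> u s_reversed v: the subsequence s is reversed in place."""
    return u + s[::-1] + v


def translocation(u, s, v, w, t, x):
    """usv and wtx -> utv and wsx: s and t swap chromosomes."""
    return u + t + v, w + s + x


def deletion(u: str, s: str, v: str) -> str:
    """usv -> uv: the subsequence s is removed."""
    return u + v


def duplication(u: str, s: str, v: str, w: str) -> str:
    """usvw -> usvsw: a second copy of s appears after v."""
    return u + s + v + s + w


def transposition(u, s, v, t, w):
    """usvtw -> utvsw: s and t on the same chromosome are exchanged."""
    return u + t + v + s + w


print(inversion("AA", "CGT", "AA"))
print(translocation("AA", "CC", "", "GG", "TT", ""))  # terminal exchange
print(duplication("A", "CG", "T", "A"))
```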

Point mutations are local changes of one or a few consecutive nucleotides during the copying process of DNA. Occurring within coding regions, they may be observed as changes of a single or a few amino acids in the translated protein. In the following formal description of the most important types of point mutations, the subsequences $a$ and $b$ are usually very short (e.g. one to five letters in the case of amino acids).

- Insertions: A single letter or a small number of consecutive letters is inserted: $uv \to uav$.

- Deletions: A single letter or a small number of consecutive letters is deleted: $uav \to uv$.

- Substitutions: A short subsequence is substituted by a different sequence: $uav \to ubv$ with $a \ne b$.

In applications of string comparison other than biological ones, further local mutations are considered:

- Swaps: Two single neighboring letters are exchanged: $uabv \to ubav$. Swaps play an important role in spelling correction.

- Compression and expansion: In speech recognition and other applications where continuous input streams are compared, it is often necessary to rescale the incoming data. Scaling operations of this kind are also referred to as time or space warps.

In the study of genomic sequences, large changes and point mutations are contemplated in different situations. In the search for the correct order of sequence fragments on a chromosome (the so-called sequence assembly problem) and in the comparison of the gene pool of different species, large changes have to be considered. In contrast, point mutations play the most important role in the study of single proteins or other comparatively small regions of the genome (sequences of some hundreds up to … letters in length). Such sequences are often compared for common overall (also called global) or local similarities. Global and local sequence comparison can be handled by alignments. The divide-and-conquer algorithm developed in this thesis facilitates global sequence comparison by speeding up the search for global sequence alignments.

Global Sequence Alignment

A standard method in computational molecular biology for presenting the result of sequence comparison is an alignment, which we formalize in this section. Multiple sequence alignments are a natural generalization of pairwise sequence alignments, which we introduce first.

Pairwise Sequence Alignment

Assume that we are given two sequences $s$ and $t$ which are known to be globally related. In general, $s$ and $t$ are of different length. For better comparability, the lengths of the sequences are equalized: Blanks, denoted by dashes ('-'), are inserted into or at either end of $s$ and $t$ such that the two resulting sequences $s^*$ and $t^*$, respectively, are of the same length $N$. Apart from the length equalization, the most important aim of inserting blanks is the following: By selecting the location of blanks carefully, regions from $s$ may be located in $s^*$ at the same position where similar regions of $t$ are located in $t^*$. By writing the sequences $s^*$ and $t^*$ above each other, similarities and differences are then easier to observe.
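The padding idea described above can be sketched in a few lines. The sketch assumes that a padded sequence is stored as a plain Python string with '-' as the blank; the helper names `strip_blanks` and `is_padding_of` and the two example sequences are hypothetical and chosen only for illustration.

```python
BLANK = "-"


def strip_blanks(row: str) -> str:
    """Remove all blanks from a padded sequence."""
    return row.replace(BLANK, "")


def is_padding_of(row: str, seq: str) -> bool:
    """True if the padded row reproduces seq once its blanks are removed."""
    return strip_blanks(row) == seq


s, t = "ACGTT", "AGTTC"
s_padded = "ACGTT-"   # blank appended at the end of s
t_padded = "A-GTTC"   # blank inserted inside t

# Both padded sequences have the same length and still spell s and t.
assert len(s_padded) == len(t_padded)
assert is_padding_of(s_padded, s) and is_padding_of(t_padded, t)

# Writing the rows above each other makes the similar regions line up.
print(s_padded)
print(t_padded)
```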

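Definition (Pairwise Sequence Alignment)

An alignment of two sequences $s$ and $t$ over $\mathcal{A}$ is a matrix

$$A = \begin{pmatrix} s^*_1 & s^*_2 & \cdots & s^*_N \\ t^*_1 & t^*_2 & \cdots & t^*_N \end{pmatrix}$$

with two rows $s^* = s^*_1 s^*_2 \ldots s^*_N$, $t^* = t^*_1 t^*_2 \ldots t^*_N$ and $N$ columns, $\max\{|s|, |t|\} \le N \le |s| + |t|$, where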


Using Stirling's formula, $f(n, n)$ can be approximated by

$$f(n, n) \approx \left(1 + \sqrt{2}\right)^{2n+1} n^{-1/2}.$$

Notation

Two identical letters above each other in an alignment form a match, and two distinct ones form a mismatch or substitution. A blank in one sequence aligned with a letter $a$ in the other can be viewed as an insertion of $a$ into the first sequence or as a deletion of $a$ from the second sequence. Following Kruskal, we use the term indel to denote the event of a deletion or an insertion.
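The terminology above maps directly onto a column-by-column classification. The following sketch assumes that an alignment is given as two equally long strings with '-' as the blank; the names `classify_column` and `column_types` are illustrative, not taken from the thesis.

```python
BLANK = "-"


def classify_column(a: str, b: str) -> str:
    """Name the event represented by one column of a pairwise alignment."""
    if a == BLANK and b == BLANK:
        raise ValueError("a column must not consist of blanks only")
    if a == BLANK or b == BLANK:
        return "indel"          # insertion or deletion, depending on viewpoint
    return "match" if a == b else "mismatch"


def column_types(row_s: str, row_t: str) -> list:
    """Classify every column of the alignment given by two padded rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return [classify_column(a, b) for a, b in zip(row_s, row_t)]


print(column_types("ACGTT-", "A-GTAC"))
```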

The letters of the original sequence $s$ and of the padded sequence $s^*$ in an alignment are connected by the following maps $\check{s}$ and $\hat{s}$.

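Definition

Assume a sequence $s$ over $\mathcal{A}$ and a corresponding aligned sequence $s^*$ of length $N \ge |s|$, i.e. assume that $s^*$ reproduces $s$ upon elimination of the blanks. For $j \in \{1, \ldots, N\}$ let $\check{s}(j)$ be the number of letters in $s^*$ before position $j$ which are not blanks, plus one, i.e.

$$\check{s}(j) := 1 + \#\{k \in \{1, \ldots, j-1\} \mid s^*_k \ne -\}.$$

Clearly, $\check{s}$ is monotonically increasing, and the assumed relationship between $s$ and $s^*$ implies $\{1, \ldots, |s|\} \subseteq \{\check{s}(1), \check{s}(2), \ldots, \check{s}(N)\} \subseteq \{1, \ldots, |s|+1\}$, with $\check{s}(N) = |s|$ if and only if $s^*_N \ne -$. Hence, for $i \in \{1, \ldots, |s|\}$ we define $\hat{s}(i)$ to be the largest $j$ with $\check{s}(j) = i$, i.e.

$$\hat{s}(i) := \max_{j \le N} \{\, j \mid \check{s}(j) = i \,\}.$$

Clearly, we have $\check{s}(k) = i$ if and only if $\hat{s}(i-1) < k \le \hat{s}(i)$ for $i \le |s|$ (with $\hat{s}(0) := 0$ for convenience), and we have $s^*_k = s_i$ if $k = \hat{s}(i)$ for some $i \in \{1, \ldots, |s|\}$ and $s^*_k = -$ otherwise.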
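The two maps defined above can be computed in a single pass over the padded sequence. This is a sketch under stated assumptions: the padded sequence is a string with '-' as the blank, positions are 1-based as in the definition, and the function names are invented here for illustration.

```python
BLANK = "-"


def letters_before_plus_one(row: str) -> list:
    """For each position j = 1..N, return 1 + number of non-blank letters
    strictly before j. Entry j-1 of the result holds the value for j."""
    values, seen = [], 0
    for symbol in row:
        values.append(seen + 1)
        if symbol != BLANK:
            seen += 1
    return values


def letter_positions(row: str) -> list:
    """For each i = 1..|s|, return the 1-based position of the i-th letter.
    Entry i-1 of the result holds the value for i."""
    return [j for j, symbol in enumerate(row, start=1) if symbol != BLANK]


def largest_position_with_value(first_map: list, i: int) -> int:
    """Second map computed literally from its definition: the largest j
    whose first-map value equals i."""
    return max(j for j, value in enumerate(first_map, start=1) if value == i)


row = "A-CG-"
first_map = letters_before_plus_one(row)
second_map = letter_positions(row)
print(first_map, second_map)

# The direct computation agrees with the definition via the maximum.
assert all(
    largest_position_with_value(first_map, i) == second_map[i - 1]
    for i in range(1, len(second_map) + 1)
)
```

In the example the padded sequence ends with a blank, so the first map takes the value |s| + 1 at the last position, in line with the remark in the definition.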

Note that, apart from alignments, other equivalent or very similar data structures presenting the result of global sequence comparison are discussed in the literature, denoted as edit scripts, traces, or listings. In biological applications, the term alignment and the above definitions have become generally accepted.

Multiple Sequence Alignment

The concept of global pairwise alignment can be extended to the comparison of more than two sequences in a straightforward way.


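Definition (Multiple Sequence Alignment)

Consider a family¹ $\langle s_1, \ldots, s_k \rangle$ of $k$ sequences over $\mathcal{A}$. A multiple alignment of $\langle s_1, \ldots, s_k \rangle$ is given by a $k \times N$ matrix

$$A = \begin{pmatrix} s^*_{11} & s^*_{12} & \cdots & s^*_{1N} \\ \vdots & \vdots & & \vdots \\ s^*_{k1} & s^*_{k2} & \cdots & s^*_{kN} \end{pmatrix}$$

for some $N$, $\max\{|s_1|, \ldots, |s_k|\} \le N \le \sum_{i=1}^{k} |s_i|$, where

- $s^*_{ij} \in \mathcal{A} \cup \{-\}$ for all $i \le k$, $j \le N$,

- for each $i \le k$, the row $s^*_i = s^*_{i1} s^*_{i2} \ldots s^*_{iN}$ reproduces the sequence $s_i$ upon ignoring all of its blanks, and

- $A$ does not contain any column consisting of blanks only.

The set of all alignments of $S = \langle s_1, \ldots, s_k \rangle$ is denoted by $\mathbb{A}_S$.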
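The three conditions of the definition can be checked mechanically. The following sketch assumes that an alignment is stored as a list of equally long strings with '-' as the blank; the function name `is_multiple_alignment` and its optional `alphabet` parameter are assumptions made for this illustration.

```python
BLANK = "-"


def is_multiple_alignment(rows, seqs, alphabet=None) -> bool:
    """Check whether rows form a multiple alignment of the family seqs."""
    if not rows or len(rows) != len(seqs):
        return False
    n = len(rows[0])
    # All rows must have the same number N of columns.
    if any(len(row) != n for row in rows):
        return False
    # Every entry is a letter of the alphabet or a blank (if alphabet given).
    if alphabet is not None:
        allowed = set(alphabet) | {BLANK}
        if any(symbol not in allowed for row in rows for symbol in row):
            return False
    # Each row reproduces its sequence once the blanks are ignored.
    if any(row.replace(BLANK, "") != seq for row, seq in zip(rows, seqs)):
        return False
    # No column may consist of blanks only.
    return not any(all(row[j] == BLANK for row in rows) for j in range(n))


family = ["ACGT", "AGT", "ACT"]
alignment = ["ACGT",
             "A-GT",
             "AC-T"]
print(is_multiple_alignment(alignment, family, alphabet="ACGT"))
```

The bounds on N stated in the definition need no separate test in this sketch: they follow from the row and column conditions, since every column contains at least one letter and every row contains all letters of its sequence.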

Similar to sequences, alignments with the same number of rows may be concatenated by the concatenation operator.

Definition (Projection)

Consider a family of sequences $S$ and an alignment $A \in \mathbb{A}_S$. Given a subfamily $S' \subseteq S$, the alignment obtained by extracting from $A$ all rows corresponding to the sequences in $S'$, where the columns consisting of blanks only are removed, is called the projection of $A$ on $S'$, denoted by $\pi_{S'}(A)$.
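A projection as defined above amounts to selecting rows and then dropping the columns that have become blank-only. The sketch below assumes an alignment stored as a list of strings with '-' as the blank and identifies a subfamily by the 0-based row indices to keep; the function name `project` is chosen here for illustration.

```python
BLANK = "-"


def project(rows, keep):
    """Project an alignment onto the rows with the given 0-based indices,
    removing every column that consists of blanks only afterwards."""
    selected = [rows[i] for i in keep]
    columns = [
        column for column in zip(*selected)
        if any(symbol != BLANK for symbol in column)
    ]
    if not columns:
        return ["" for _ in selected]
    # Transpose the surviving columns back into rows.
    return ["".join(row) for row in zip(*columns)]


alignment = ["AC-GT",
             "A--GT",
             "ACTG-"]

# Rows 0 and 1 share a blank in the third column, so it disappears.
print(project(alignment, [0, 1]))
# Rows 0 and 2 have no blank-only column, so nothing is removed.
print(project(alignment, [0, 2]))
```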

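Definition (Combination)

Consider a family of sequences $S$ and subfamilies $S_1, \ldots, S_n \subseteq S$. Given alignments $A_1 \in \mathbb{A}_{S_1}, \ldots, A_n \in \mathbb{A}_{S_n}$, an alignment $A \in \mathbb{A}_S$ is called a combination of the $A_1, \ldots, A_n$ if $\pi_{S_i}(A) = A_i$ for all $i \in \{1, \ldots, n\}$.

Definition (Compatibility of Alignments)

Consider, as above, a family of sequences $S$ and subfamilies $S_1, \ldots, S_n \subseteq S$. Alignments $A_1 \in \mathbb{A}_{S_1}, \ldots, A_n \in \mathbb{A}_{S_n}$ are compatible if there exists a combination $A \in \mathbb{A}_S$ of the $A_1, \ldots, A_n$, i.e. an alignment such that $\pi_{S_i}(A) = A_i$ for all $i \in \{1, \ldots, n\}$.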
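Whether a given alignment is a combination of some sub-alignments can be tested by comparing projections. The sketch below only verifies a candidate; it does not search for one, so it can witness compatibility but cannot prove incompatibility. The representation (lists of strings, '-' as blank, subfamilies given as 0-based row indices) and the names `project` and `is_combination` are assumptions made here.

```python
BLANK = "-"


def project(rows, keep):
    """Rows with the given 0-based indices, without blank-only columns."""
    selected = [rows[i] for i in keep]
    columns = [c for c in zip(*selected) if any(x != BLANK for x in c)]
    if not columns:
        return ["" for _ in selected]
    return ["".join(row) for row in zip(*columns)]


def is_combination(rows, parts) -> bool:
    """True if, for every (indices, sub_alignment) pair in parts, the
    projection of rows onto indices equals sub_alignment."""
    return all(project(rows, idx) == list(sub) for idx, sub in parts)


candidate = ["AC-GT",
             "A--GT",
             "ACTG-"]
a_first_two = ["ACGT", "A-GT"]      # alignment of sequences 0 and 1
a_last_two = ["A--GT", "ACTG-"]     # alignment of sequences 1 and 2

# The candidate projects onto both sub-alignments, so they are compatible.
print(is_combination(candidate, [((0, 1), a_first_two), ((1, 2), a_last_two)]))

# A different alignment of sequences 0 and 1 is not reproduced by it.
print(is_combination(candidate, [((0, 1), ["ACGT", "AG-T"])]))
```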

¹ Note the use of angle brackets $\langle\ \rangle$ to designate a family of sequences instead of the usual set notation with braces $\{\ \}$. Sequence families in our context are lists of sequences which, in contrast to sets, are ordered and may contain more than one identical element. To denote subfamilies we use the standard symbol $\subseteq$ from set notation.


have evolved independently. Therefore, only a (usually symmetric) substitution score function is needed which quantifies, for each pair of letters $(a, b) \in \mathcal{A}^2$, the preference of aligning letter $a$ with letter $b$, represented by an $|\mathcal{A}| \times |\mathcal{A}|$ substitution matrix. In general, there are two natural ways of defining substitution score functions:

(a) in the form of a similarity score $s: \mathcal{A}^2 \to \mathbb{R}$, where letters $a$, $b$ with similar properties are scored with high positive values and different letters are scored with low negative values, or

(b) in the form of a distance score $d: \mathcal{A}^2 \to \mathbb{R}$, where pairs of similar letters are scored with small values while the alignment of different letters is penalized by a high distance score.

The simplest possible distance score is the unit cost function: non-identical letters are scored with the value 1, identities are scored with 0. The unit cost function is a member of a broad class of distance score functions with interesting and useful properties, the metric functions: $d: \mathcal{A}^2 \to \mathbb{R}$ is said to be a metric on $\mathcal{A}$ if, in addition to symmetry, the zero property $d(a, b) = 0 \Leftrightarrow a = b$ and the triangle inequality $d(a, b) \le d(a, c) + d(c, b)$ hold for all $a, b, c \in \mathcal{A}$. In case $d$ is a metric on $\mathcal{A}$, Sellers and Waterman et al. showed that for global sequence alignments the metric properties carry over to the set of sequences over $\mathcal{A}$.

On the other hand, in the context of local alignments, similarity scores with positive and negative values have been shown to be superior to distance scores. In a similarity score, the value zero has a special meaning since it separates the positively correlated pairs of letters from the negatively correlated pairs. Thus a series of positively scored substitutions in an alignment indicates highly similar regions of the involved sequences. For global sequence alignment it has been shown that, under certain conditions, distance and similarity scores are equivalent, and it is easy to derive the corresponding similarity score function from a given distance score function and vice versa.

Throughout this thesis we will restrict the discussion of alignment scores and algorithms to distances. This choice is arbitrary, and with small changes in some definitions it is possible to apply the procedures developed in the following chapters also to a similarity score function. Furthermore, the algorithms described in this thesis do not require any of the metric properties except symmetry. Thus the score matrices commonly used in biological applications, which are non-metric (e.g. the PAM and Blosum series of amino acid substitution matrices), are applicable.

The effort made for the invention of biologically reasonable score functions differs significantly for nucleic acids and for amino acids. For nucleic acids, often rather simple score functions are used.² Besides the unit cost function mentioned above, score functions are largely established which distinguish only between transitions (exchanges of two purines or of two pyrimidines) and transversions (all other combinations).

² This trend may change in the near future. Very recently, Agarwal and States published a nucleic acid substitution matrix based on an advanced statistical model.

For amino acid sequences, more elaborate score matrices are used. They reflect differences in the genetic code, chemical properties of the amino acids, or secondary structural propensities of amino acids in proteins, or (and these are the most popular ones) they are derived from empirical data by counting true matches and mismatches in databases of structural alignments and by computing the most reasonable substitution probabilities which might have led to the observed exchanges. A recent comparison covers a large number of different amino acid substitution matrices.
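The distance scores discussed above are easy to experiment with. The following sketch implements the unit cost function, a brute-force test of the three metric properties, and a simple transition/transversion score for DNA. The function names and the cost values 1 and 2 for transitions and transversions are illustrative assumptions, not values taken from the thesis.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def unit_cost(a: str, b: str) -> int:
    """Identities cost 0, all non-identical pairs cost 1."""
    return 0 if a == b else 1


def transition_transversion_cost(a, b, transition=1, transversion=2):
    """Distinguish only transitions (purine<->purine, pyrimidine<->pyrimidine)
    from transversions (all other exchanges). Cost values are hypothetical."""
    if a == b:
        return 0
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


def is_metric(d, alphabet) -> bool:
    """Check symmetry, the zero property and the triangle inequality by
    enumerating all pairs and triples of letters."""
    for a, b in product(alphabet, repeat=2):
        if d(a, b) != d(b, a):
            return False
        if (d(a, b) == 0) != (a == b):
            return False
    return all(
        d(a, b) <= d(a, c) + d(c, b)
        for a, b, c in product(alphabet, repeat=3)
    )


def lopsided(a: str, b: str) -> int:
    """A symmetric score that violates the triangle inequality."""
    if a == b:
        return 0
    return 5 if {a, b} == {"A", "C"} else 1


print(is_metric(unit_cost, "ACGT"))
print(is_metric(transition_transversion_cost, "ACGT"))
print(is_metric(lopsided, "ACGT"))
```

Since the algorithms of the thesis are stated to require only symmetry, a score such as `lopsided` would still be admissible as input even though it is not a metric.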

Pairwise Alignment Score

As in the discussion of alignments, we begin with the case of two sequences $s$ and $t$. Based on letter-to-letter distances, there is a natural approach to defining a pairwise alignment score, motivated by the following consideration: Sequences $s$ and $t$ are supposed to have a low overall distance if it is possible to find an alignment such that at many positions the paired letters have low distances. In the case of aligning proteins, this can be read as: If the chains of amino acids in both proteins contain similar residues over their whole length, the proteins are supposed to have similar main-chain folding patterns and likely related functions. Of course, this is an idealistic view, and the existence of false negatives cannot be excluded. Procedures which weight different positions of a biological sequence according to their mutability or importance have been suggested.

Formally speaking, we extend the distance score function $d$ to a function

$$\hat{d}: (\mathcal{A} \cup \{-\})^2 \to \mathbb{R}$$

which is identical to $d$ for all pairs of letters, i.e. $\hat{d}(a, b) = d(a, b)$ for all $a, b \in \mathcal{A}$. Additionally, aligning a letter from $\mathcal{A}$ with a blank is penalized by a high value, usually a constant $\gamma$: $\hat{d}(a, -) = \hat{d}(-, a) = \gamma$ for all $a \in \mathcal{A}$. Hence a consecutive string of $l$ blanks receives the value $l \cdot \gamma$. That is the reason for calling this way of scoring blanks also additive or homogeneous gap costs. The alignment score of $A$ is then defined as the sum of the $\hat{d}$ values over all sites of the alignment.

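Definition (Alignment Score)

Assume sequences $s$ and $t$ over $\mathcal{A}$. Given an alignment $A = \begin{pmatrix} s^* \\ t^* \end{pmatrix} \in \mathbb{A}_{\langle s, t \rangle}$ of length $N$ and an extended distance score function $\hat{d}: (\mathcal{A} \cup \{-\})^2 \to \mathbb{R}$, we define the alignment score of $A$ with respect to $\hat{d}$ by

$$w_d(A) := \sum_{i \le N} \hat{d}(s^*_i, t^*_i).$$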
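The alignment score defined above is a plain sum over the columns. The sketch below assumes an alignment given as two equally long strings with '-' as the blank, uses the unit cost function as the letter distance, and picks a gap cost of 2 purely for illustration; the function names and that constant are assumptions, not values from the thesis.

```python
BLANK = "-"


def unit_cost(a: str, b: str) -> int:
    """Letter distance: 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1


def extended_distance(a, b, d=unit_cost, gap_cost=2):
    """Extend d to blanks: a letter aligned with a blank costs gap_cost."""
    if a == BLANK and b == BLANK:
        raise ValueError("a column must not consist of blanks only")
    if a == BLANK or b == BLANK:
        return gap_cost
    return d(a, b)


def alignment_score(row_s, row_t, d=unit_cost, gap_cost=2):
    """Sum of the extended distances over all columns of the alignment."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(
        extended_distance(a, b, d, gap_cost) for a, b in zip(row_s, row_t)
    )


# Columns: match, indel, match, match, mismatch, indel.
print(alignment_score("ACGTT-", "A-GTAC"))

# A run of l consecutive blanks contributes l times the gap cost.
print(alignment_score("ACGT", "A--T"))
```

With homogeneous gap costs, a run of blanks is charged per blank; the contents list of the thesis mentions a section on general gap functions, which this sketch does not cover.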
