Alignment and Tree Inference - Lecture Notes | BSC 5936, Study notes of Biology

Material Type: Notes; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Fall 2005;

Alignment and tree inference
Peter Beerli
December 5, 2005
1 Evolution of sequences

Several earlier chapters discussed mutation models. Unfortunately, these models consider only point mutations; other processes such as duplication, gene conversion, inversion, insertion, and deletion are ignored. Typically we run a two-part analysis: we first align the sequences and then do the real analysis, ignoring everything except the point mutations and the insertion/deletion pattern. Only recently have procedures been developed that take multiple processes into account in a joint estimation, since none of these processes is independent of the phylogeny or of each other. In this chapter we will explore only the complication introduced by insertion and deletion patterns.

Figure 1: Processes that change a sequence: point mutation, inversion, duplication, conversion, insertion, deletion.

In the simplest procedure we produce genealogies or species phylogenies from already aligned sequences. Aligned sequences are the product of an alignment algorithm that minimizes the changes among several sequences. For example, look at this fragment:

AATACAATTC GACACGGGGG CCACTCACGA AAATTAGAAG AAGAGC

AATACCATTC GACACGGGGC CACTCACTAA AACTAGAAGA AAAGC

AATACCAATT CGACACGGGC CACTCACTAA AACTAGAAGA AGAGC

AATACAATTC GACACCGGGC CACTCACTTA AAACTAGAAG AAAAGC

AATACAATTC AACACTAAAA TTAGAAGAAA AGC

GAATCATTCC GACACCGGGC CACTCACTAA AACTAGAAGA AAAGC

We can see that the individual sequences do not line up at all. We could align them by hand, trying to minimize changes by inserting "gaps" so that nucleotides match up; after a lot of work we would end up with something like the alignment below.

AATAC-AATT CGACACGGGG GCCACT-CAC G-AAAATTAG AAGAAGAG-C

AATAC-CATT CGACACG-GG GCCACT-CAC T-AAAACTAG AAGAAAAG-C

AATACCAATT CGACAC--GG GCCACT-CAC T-AAAACTAG AAGAAGAG-C

AATAC-AATT CGACACC-GG GCCACT-CAC TTAAAACTAG AAGAAAAG-C

AATAC-AATT CAACAC---- ---------- T-AAAATTAG AAGAAAAG-C

GAATC-ATTC CGACACC-GG GCCACT-CAC T-AAAACTAG AAGAAAAGC-

Of course, alignment by hand is tedious and often problematic. Still, most current alignment programs have problems of their own, and tweaking by hand is often needed when working with real data.
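A hand-made alignment like the one above is easy to get wrong, so a quick consistency check is useful. The following is a minimal sketch, not part of the original notes: it verifies that all aligned rows have the same length and that removing the gaps restores the unaligned fragments. The helper names (`degap`, `check_alignment`, `variable_columns`) are hypothetical and chosen only for illustration.

```python
# Unaligned fragments and the hand alignment, copied from the text.
# The spaces that separate blocks of ten characters are stripped below.
RAW_TEXT = [
    "AATACAATTC GACACGGGGG CCACTCACGA AAATTAGAAG AAGAGC",
    "AATACCATTC GACACGGGGC CACTCACTAA AACTAGAAGA AAAGC",
    "AATACCAATT CGACACGGGC CACTCACTAA AACTAGAAGA AGAGC",
    "AATACAATTC GACACCGGGC CACTCACTTA AAACTAGAAG AAAAGC",
    "AATACAATTC AACACTAAAA TTAGAAGAAA AGC",
    "GAATCATTCC GACACCGGGC CACTCACTAA AACTAGAAGA AAAGC",
]
ALIGNED_TEXT = [
    "AATAC-AATT CGACACGGGG GCCACT-CAC G-AAAATTAG AAGAAGAG-C",
    "AATAC-CATT CGACACG-GG GCCACT-CAC T-AAAACTAG AAGAAAAG-C",
    "AATACCAATT CGACAC--GG GCCACT-CAC T-AAAACTAG AAGAAGAG-C",
    "AATAC-AATT CGACACC-GG GCCACT-CAC TTAAAACTAG AAGAAAAG-C",
    "AATAC-AATT CAACAC---- ---------- T-AAAATTAG AAGAAAAG-C",
    "GAATC-ATTC CGACACC-GG GCCACT-CAC T-AAAACTAG AAGAAAAGC-",
]

RAW = [s.replace(" ", "") for s in RAW_TEXT]
ALIGNED = [s.replace(" ", "") for s in ALIGNED_TEXT]


def degap(row, gap="-"):
    """Remove all gap characters from one aligned row."""
    return row.replace(gap, "")


def check_alignment(raw, aligned):
    """True if all rows are equally long and degapping restores the input."""
    same_length = len({len(row) for row in aligned}) == 1
    restores = all(degap(a) == r for a, r in zip(aligned, raw))
    return same_length and restores


def variable_columns(aligned):
    """Indices of columns that contain more than one symbol (gaps included)."""
    return [i for i, col in enumerate(zip(*aligned)) if len(set(col)) > 1]


if __name__ == "__main__":
    print("alignment consistent:", check_alignment(RAW, ALIGNED))
    print(len(variable_columns(ALIGNED)), "of", len(ALIGNED[0]), "columns vary")
```

Counting a gap as just another symbol in `variable_columns` is a simplification; a real analysis would treat gaps according to the chosen indel model.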

2 Pairwise alignment

Needleman and Wunsch (1970) developed the first algorithm to align sequences. This procedure does not take into account any tree structure (and therefore correlation) among sequences. The algorithm has 3 main steps,
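The description of the steps is cut off in this excerpt. As an illustration only, the textbook formulation of Needleman-Wunsch global alignment is usually presented as (1) initializing a score matrix, (2) filling it by dynamic programming, and (3) tracing back from the last cell; whether these are exactly the three steps the notes list is an assumption. The scoring values below (match +1, mismatch -1, linear gap penalty -1) are also assumptions, not taken from the notes.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialize the first row and column with cumulative gap costs.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: fill the matrix; each cell is the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Step 3: trace back from the bottom-right corner to recover one
    # optimal alignment (ties are broken in favor of the diagonal move).
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The matrix has (n + 1)(m + 1) cells, so time and memory grow with the product of the two sequence lengths. Different scoring values or an affine gap penalty can change which alignment is optimal.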

numbers of taxa. The most popular methods today are even more approximate (like the one used in the program ClustalV): they do not revisit alignments once they are chosen. These methods make it possible to align large data matrices but are less precise than the Sankoff et al. algorithm.

3.2 Probabilistic methods

A probabilistic model that takes into account insertions and deletions of single sites was developed by Bishop and Thompson in 1986. Their method could not handle superimposed gaps, and gaps were collapsed or expanded in single steps.

3.2.1 The Thorne-Kishino-Felsenstein model

During his thesis work, Jeff Thorne developed a model that allows for multiple gaps and is tractable (although it is still difficult to use on data sets with multiple sequences). He parametrized the model like this:

∗A ∼ G ∼ A ∼ C ∼ A ∼ T ∼ G ∼

On the left there is an immortal link (∗), which ensures that a sequence cannot shrink to nothing; each nucleotide has an attached link (∼). Each nucleotide can be deleted, and each link can add nucleotide-link pairs. The model contains a standard substitution model and, in addition, two rates that control the insertion/deletion process. Each link can insert a new site with rate λ. Each site has a constant risk μ of being deleted. Since all action happens in pairs of nucleotide and link, the immortal link to the left of everything cannot be deleted. A sequence of length n therefore grows at rate (n + 1)λ per unit time and shrinks at rate nμ per unit time. This type of birth-death process is well understood, and with μ > λ the equilibrium distribution of the sequence lengths is geometric. The equilibrium base composition is the one from the substitution model (Felsenstein 2003). The parametrization allows us to calculate transition probabilities for the three possible events that can happen to a particular sequence:

GTACCTAC → substitute → GTACTTAC

GTACCTAC → insert → GTACCATAC

GTACCTAC → delete → GTACTAC
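The claim that the length distribution is geometric can be checked numerically from the two rates alone. The sketch below is an illustration, not part of the original notes. The explicit form P(n) = (1 - λ/μ)(λ/μ)^n is not spelled out in the text; it follows from balancing the growth rate (n + 1)λ out of length n against the shrink rate (n + 1)μ out of length n + 1. The function names are hypothetical.

```python
def equilibrium_length_probs(lam, mu, max_len):
    """P(n) = (1 - lam/mu) * (lam/mu)**n for n = 0..max_len (needs lam < mu)."""
    if not 0 < lam < mu:
        raise ValueError("need 0 < lam < mu for a proper equilibrium")
    r = lam / mu
    return [(1 - r) * r**n for n in range(max_len + 1)]


def balance_residuals(lam, mu, max_len):
    """Probability flow n -> n+1 minus flow n+1 -> n; zero at equilibrium.

    A sequence of length n grows at rate (n + 1) * lam,
    and a sequence of length n + 1 shrinks at rate (n + 1) * mu.
    """
    p = equilibrium_length_probs(lam, mu, max_len)
    return [
        p[n] * (n + 1) * lam - p[n + 1] * (n + 1) * mu
        for n in range(max_len)
    ]


def mean_length(lam, mu):
    """Mean of the geometric distribution above: lam / (mu - lam)."""
    return lam / (mu - lam)


if __name__ == "__main__":
    lam, mu = 0.9, 1.0  # arbitrary example rates, not estimates from data
    worst = max(abs(x) for x in balance_residuals(lam, mu, 200))
    print("largest balance residual:", worst)
    print("expected sequence length:", mean_length(lam, mu))
```

As λ approaches μ the expected length λ/(μ - λ) grows without bound, which is why the condition μ > λ is needed for an equilibrium to exist.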

Given the rates and all possible events, one can set up a transition probability matrix that can be solved analytically (for more detail see Lunter et al. 2005). A specific alignment is the sum of all

Figure 2: Example of a transition from ancestral sequence to the current one (based on Lunter et al. 2005)

Figure 3: HMM for the TKF model