



Material Type: Notes; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Fall 2005;
Typology: Study notes




Several chapters discussed mutation models. Unfortunately, these models consider only point mutations; other processes such as duplication, gene conversion, inversion, insertion, and deletion are ignored. Typically we run a two-part analysis: we first align the sequences and then do the real analysis, ignoring everything except the point mutations and the insertion/deletion pattern. Only recently have procedures been developed that take multiple processes into account in a joint estimation, since none of these processes is independent of the phylogeny or of the others. In this chapter we will explore only the complication introduced by insertion and deletion patterns.
Figure 1: Processes that change a sequence: Point mutation, inversion, duplication, conversion, insertion, deletion.
In the simplest procedure we produce genealogies or species phylogenies from already aligned sequences. Aligned sequences are the product of an alignment algorithm that minimizes the changes among several sequences. For example, look at this fragment:
We can see that the individual sequences do not match at all. We could align them by hand, trying to minimize changes by inserting “gaps” so that nucleotides match up; after a lot of work we would end up with something like the sequence below.
Of course, alignment by hand is tedious and often problematic. Still, most current alignment programs have problems of their own, and tweaking by hand is often needed when working with real data.
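To make the idea of "minimizing changes by inserting gaps" concrete, here is a minimal sketch that counts mismatches and gap columns between two rows of an alignment. The two sequences are hypothetical and chosen only for illustration; they are not the fragment from the notes, and the function name `count_changes` is likewise an assumption.

```python
def count_changes(row_a, row_b, gap="-"):
    """Return (mismatches, gap_columns) for two equally long alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    mismatches = 0
    gap_columns = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gap_columns += 1      # column contains an insertion/deletion
        elif x != y:
            mismatches += 1       # column implies a point mutation
    return mismatches, gap_columns


# Hypothetical example: the same two sequences, padded naively vs. gapped sensibly.
naive = count_changes("ACGTTGCA", "ACTTGCA-")
gapped = count_changes("ACGTTGCA", "AC-TTGCA")
print("naive padding:", naive)
print("gap after AC: ", gapped)
```

Placing the single gap in the right column removes every mismatch, which is exactly the kind of trade-off an alignment algorithm has to score.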
2 Pairwise alignment
Needleman and Wunsch (1970) developed the first algorithm to align sequences. This procedure does not take into account any tree structure (and therefore correlation) among sequences. The algorithm has 3 main steps,
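A minimal sketch of global pairwise alignment in the style of Needleman and Wunsch, as it is commonly presented in textbooks: initialize a score matrix, fill it by dynamic programming, and trace back from the last cell. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions for illustration, not the scheme used in the notes or in the original paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (assumed scoring scheme).
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialize; the first row and column are all-gap prefixes.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: fill each cell with the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences, for illustration only.
nw_score, nw_a, nw_b = needleman_wunsch("ACGTTGCA", "ACTTGCA")
print(nw_score)
print(nw_a)
print(nw_b)
```

When several alignments share the best score, this traceback returns just one of them (it prefers the diagonal move); a fuller implementation would enumerate or sample the ties.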
numbers of taxa. Most popular today are even more approximate methods (like the one used in the program ClustalV), which do not revisit alignments once they are chosen. These methods can align large data matrices but are less precise than the Sankoff et al. algorithm.
A probabilistic model that takes into account insertions and deletions of single sites was developed by Bishop and Thompson in 1986. Their method could not handle superimposed gaps, and gaps were collapsed or expanded by single steps.
3.2.1 The Thorne-Kishino-Felsenstein model
During his thesis work, Jeff Thorne developed a model that allows for multiple gaps and is tractable (although it is still difficult to use on data sets with multiple sequences). He parametrized the model like this:
∗A ∼ G ∼ A ∼ C ∼ A ∼ T ∼ G ∼
On the left there is an immortal link, which is needed so that a sequence cannot shrink to nothing; each nucleotide has an attached link. Each nucleotide can be deleted, and each link can add nucleotide-link pairs. The model contains a standard substitution model and, in addition, two rates that control the insertion/deletion process. Each link inserts a new site with rate λ. Each site has a constant risk μ of being deleted. Since all action happens in pairs of nucleotide and link, the immortal link to the left of everything cannot be deleted. A whole sequence of length n therefore grows at rate (n + 1)λ per unit time and shrinks at rate nμ per unit time. This type of birth-death process is well understood, and with μ > λ the equilibrium distribution of the sequence lengths is geometric. The equilibrium base composition is the one from the substitution model (Felsenstein 2003). The parametrization allows the calculation of transition probabilities for the 3 possible events that can happen to a particular sequence:
GTACCTAC → substitute → GTACTTAC
GTACCTAC → insert → GTACCATAC
GTACCTAC → delete → GTACTAC
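The three events and the length dynamics can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by Thorne et al. or Lunter et al.: the function names are hypothetical, positions are 0-based, link 0 stands for the immortal link, and the equilibrium formula is the geometric distribution with ratio λ/μ that follows from balancing the growth rate (n + 1)λ against the shrink rate (n + 1)μ of the next longer sequence.

```python
def substitute(seq, pos, base):
    """Replace the nucleotide at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, link, base):
    """Insert a nucleotide to the right of a link; link 0 is the immortal link."""
    return seq[:link] + base + seq[link:]


def delete(seq, pos):
    """Delete the nucleotide (and its attached link) at 0-based position pos."""
    return seq[:pos] + seq[pos + 1:]


def indel_rates(n, lam, mu):
    """Total insertion and deletion rates for a sequence of n sites."""
    return (n + 1) * lam, n * mu


def equilibrium_length_pmf(n, lam, mu):
    """Geometric equilibrium probability of length n (requires lam < mu)."""
    if not 0 < lam < mu:
        raise ValueError("equilibrium requires 0 < lam < mu")
    r = lam / mu
    return (1 - r) * r ** n


seq = "GTACCTAC"
print(substitute(seq, 4, "T"))   # point mutation C -> T
print(insert(seq, 5, "A"))       # new site to the right of the fifth nucleotide's link
print(delete(seq, 3))            # one of the two C's removed

# Hypothetical rates, chosen only so that lam < mu.
lam, mu = 0.02, 0.03
print(indel_rates(len(seq), lam, mu))
print([round(equilibrium_length_pmf(n, lam, mu), 4) for n in range(4)])
```

With these assumed rates the ratio λ/μ is 2/3, so each additional site makes a given length two thirds as likely at equilibrium.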
Given the rates and all possible events, one can set up a transition probability matrix that can be solved analytically (for more detail see Lunter et al. 2005). A specific alignment is the sum of all
Figure 2: Example of a transition from ancestral sequence to the current one (based on Lunter et al. 2005)
Figure 3: HMM for the TKF model