Alignment and Tree Inference - Lecture Notes | BSC 5936, Study notes of Biology

Florida State University (FSU), Biology

Material Type: Notes; Class: ST:TEACH/LEARN SCIEN; Subject: BIOLOGICAL SCIENCES; University: Florida State University; Term: Fall 2005;


Alignment and tree inference

Peter Beerli

December 5, 2005



1 Evolution of sequences

In several chapters we talked about mutation models. Unfortunately, these mutation models look only at point mutations; other processes such as duplication, gene conversion, inversions, insertions, and deletions are ignored. Typically we run a two-part analysis: we first align the sequences and then do the real analysis, ignoring everything except the point mutations and the insertion/deletion pattern. Only recently have procedures been developed that take multiple processes into account in a joint estimation, since none of these processes is independent of the phylogeny or of each other. In this chapter we will explore only the complication introduced by insertion and deletion patterns.

Figure 1: Processes that change a sequence: point mutation, inversion, duplication, conversion, insertion, deletion.

In the simplest procedure we produce genealogies or species phylogenies from already aligned sequences. Aligned sequences are the product of an alignment algorithm that minimizes the changes among several sequences. For example, look at this fragment:

AATACAATTC GACACGGGGG CCACTCACGA AAATTAGAAG AAGAGC

AATACCATTC GACACGGGGC CACTCACTAA AACTAGAAGA AAAGC

AATACCAATT CGACACGGGC CACTCACTAA AACTAGAAGA AGAGC

AATACAATTC GACACCGGGC CACTCACTTA AAACTAGAAG AAAAGC

AATACAATTC AACACTAAAA TTAGAAGAAA AGC

GAATCATTCC GACACCGGGC CACTCACTAA AACTAGAAGA AAAGC

We can see that the individual sequences do not line up at all. We could align them by hand, inserting “gaps” so that nucleotides match up and changes are minimized; after a lot of work we would end up with something like the alignment below.

AATAC-AATT CGACACGGGG GCCACT-CAC G-AAAATTAG AAGAAGAG-C

AATAC-CATT CGACACG-GG GCCACT-CAC T-AAAACTAG AAGAAAAG-C

AATACCAATT CGACAC--GG GCCACT-CAC T-AAAACTAG AAGAAGAG-C

AATAC-AATT CGACACC-GG GCCACT-CAC TTAAAACTAG AAGAAAAG-C

AATAC-AATT CAACAC---- ---------- T-AAAATTAG AAGAAAAG-C

GAATC-ATTC CGACACC-GG GCCACT-CAC T-AAAACTAG AAGAAAAGC-

Of course, alignment by hand is tedious and often problematic. Most current alignment programs have their own little problems too, however, and tweaking by hand is often needed when working with real data.
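To make "minimizing the changes" concrete, here is a minimal Python sketch that scores a gapped alignment by a sum-of-pairs cost. The cost scheme (1 per mismatch, 1 per nucleotide-gap pair, 0 for gap-gap pairs) and the function name are assumptions chosen for illustration; they are not the criterion used in these notes or by any particular alignment program.

```python
from itertools import combinations


def sum_of_pairs_cost(rows, mismatch=1, gap=1):
    """Sum-of-pairs cost of an alignment (hypothetical scoring scheme).

    rows: equal-length strings over A, C, G, T and '-' (gap).
    Every pair of rows is compared column by column.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    cost = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # two gaps: nothing to explain
            if x == "-" or y == "-":
                cost += gap         # nucleotide against gap
            elif x != y:
                cost += mismatch    # two different nucleotides
    return cost


# First ten columns of the first three aligned rows shown above.
block = ["AATAC-AATT", "AATAC-CATT", "AATACCAATT"]
print(sum_of_pairs_cost(block))
```

A lower cost means fewer implied changes; an alignment program searches over gap placements to reduce a cost of this general kind.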

2 Pairwise alignment

Needleman and Wunsch (1970) developed the first algorithm to align sequences. This procedure does not take into account any tree structure (and therefore correlation) among sequences. The algorithm has 3 main steps,
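The list of steps is cut off in this copy of the notes. In the common textbook presentation of Needleman-Wunsch global alignment, the three steps are: initialize a score matrix, fill it by dynamic programming, and trace back from the last cell. The Python sketch below follows that presentation. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions for illustration, not values from the notes or from the 1970 paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: initialize the first row and column with gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Step 2: fill each cell with the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Step 3: trace back from the bottom-right cell to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = needleman_wunsch("AATACAATTC", "AATACCATTC")
print(top)
print(bottom)
print(total)
```

When several alignments share the best score, this traceback returns only one of them, preferring the diagonal move.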

numbers of taxa. Most popular today are even more approximate methods (like the one used in the program ClustalV), in which alignments are not revisited once they are chosen. These methods can align large data matrices but are less precise than the Sankoff et al. algorithm.

3.2 Probabilistic methods

A probabilistic model that takes into account insertions and deletions of single sites was developed by Bishop and Thompson in 1986. Their method could not handle superimposed gaps, and gaps were collapsed or expanded by single steps.

3.2.1 The Thorne-Kishino-Felsenstein model

During his thesis work, Jeff Thorne developed a model that allows for multiple gaps and is tractable (although it is still difficult to use on data sets with multiple sequences). He parametrized the model like this:

∗A ∼ G ∼ A ∼ C ∼ A ∼ T ∼ G ∼

On the left there is an immortal link, which is needed so that a sequence cannot shrink to nothing; each nucleotide has an attached link. Each nucleotide can be deleted, and each link can add nucleotide-link pairs. The model contains a standard substitution model and, in addition, two rates that control the insertion/deletion process. Each link can insert a new site with rate λ. Each site has a constant risk μ of being deleted. Since all action happens in pairs of nucleotide and link, the immortal link to the left of everything cannot be deleted. A sequence of length n therefore grows at rate (n + 1)λ per unit time and shrinks at rate nμ per unit time. This type of birth-death process is well understood, and with μ > λ the equilibrium distribution of the sequence lengths is a geometric distribution. The equilibrium base composition is the one from the substitution model (Felsenstein 2003). This parametrization makes it possible to calculate transition probabilities for the 3 possible events that can happen to a particular sequence:

GTACCTAC → substitute → GTACTTAC

GTACCTAC → insert → GTACCATAC

GTACCTAC → delete → GTACTAC
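As a minimal sketch of the bookkeeping described above, the Python below applies the three events to the example sequence (read without the stray spaces, i.e. as GTACCTAC) and computes the length-change rates (n + 1)λ and nμ. The explicit geometric form (1 - λ/μ)(λ/μ)^n is not spelled out in the notes; it follows from detailed balance of this birth-death process and is included here as an assumption-labelled standard result. All function names and the numeric rates are hypothetical.

```python
def substitute(seq, pos, base):
    """Replace the nucleotide at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


def insert_after(seq, pos, base):
    """Insert a nucleotide at the link that follows position pos."""
    return seq[:pos + 1] + base + seq[pos + 1:]


def delete(seq, pos):
    """Delete the nucleotide (and its link) at position pos."""
    return seq[:pos] + seq[pos + 1:]


def length_change_rates(n, lam, mu):
    """Return (growth rate, shrink rate) for a sequence of length n.

    n + 1 links (n ordinary plus the immortal one) each insert at rate lam;
    n nucleotides are each deleted at rate mu.
    """
    return (n + 1) * lam, n * mu


def equilibrium_length_pmf(n, lam, mu):
    """Geometric equilibrium probability of length n (requires lam < mu)."""
    if not 0 < lam < mu:
        raise ValueError("equilibrium requires 0 < lam < mu")
    ratio = lam / mu
    return (1 - ratio) * ratio ** n


ancestor = "GTACCTAC"
print(substitute(ancestor, 4, "T"))    # GTACTTAC
print(insert_after(ancestor, 4, "A"))  # GTACCATAC
print(delete(ancestor, 4))             # GTACTAC

# Hypothetical rates, chosen only to satisfy lam < mu.
print(length_change_rates(len(ancestor), lam=1.0, mu=2.0))
```

With these rates, detailed balance can be checked directly: the probability flow from length n up to n + 1 equals the flow from n + 1 back down to n.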

Given the rates and all possible events, one can set up a transition probability matrix that can be solved analytically (for more detail see Lunter et al. 2005). A specific alignment is the sum of all

Figure 2: Example of a transition from ancestral sequence to the current one (based on Lunter et al. 2005)

Figure 3: HMM for the TKF model
