Phylogenetics: Understanding Evolutionary History through Sequence Analysis, Study notes of Computer Science

University of Maryland Computer Science

An overview of phylogenetics, the field that studies the relationships between sequences and constructs phylogenetic trees representing their evolutionary history. The notes cover motivation, applications, tree construction algorithms, and the assessment of phylogenetic trees, along with the concepts of unrooted and rooted trees, leaves, joins, and branch lengths.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

koofers-user-m9f 🇺🇸

CMSC 838T – Lecture 5

- Phylogenetics
  - Study of evolutionary relationships (sequences / species)
  - Infer evolutionary relationship from shared features
  - May improve multiple sequence alignment (MSA)

Phylogenetics

- Phylogeny
  - Relationship between organisms with common ancestor
- Phylogenetic tree
  - Graph representing evolutionary history of sequence / species
- Premise
  - Members sharing common evolutionary history (i.e., common ancestor) are more related to each other
  - Can infer evolutionary relationship from shared features
- Long history of phylogenetics (from field of genetics)
  - Historically → based on analysis of observable features (e.g., morphology, behavior, geographical distribution)
  - Now → mostly analysis of DNA / RNA / amino acid sequences

Phylogenetics – Motivation & Alignment

- Goal of phylogenetics
  - Understand relationship of sequence to similar sequences
  - Construct phylogenetic tree representing evolutionary history
- Motivation / application
  - Identify closely related families
    - Use phylogenetic relationships to predict gene function
  - Follow changes in rapidly evolving species (e.g., viruses)
    - Analysis can reveal which genes are under selection
    - Provide epidemiology for tracking infections & vectors
  - Few direct applications
- Relationship to multiple sequence alignment (MSA)
  - Alignment of sequences should take evolution into account
  - More precise phylogenetic relationships ↔ improved MSA

Phylogenetics Overview

- Phylogenetic trees
- Tree construction algorithms
  - Distance methods
    - UPGMA
    - Neighbor-joining
  - Maximum parsimony
  - Maximum likelihood
- Assessing phylogenetic trees

Phylogenetic Trees

- Leaves / taxa
  - Original sequences
- Branches
  - Represent change
  - Length represents evolutionary distance
- Cluster / clade
  - All sequences in subtree with common ancestor (treated as single node)
- Join / node
  - Point of joining two leaves / clusters

[Figure: example tree drawn against a distance axis]

Phylogenetic Trees

- Use binary trees (evolution is bifurcating process)
  - Can approximate all tree shapes (w/ arbitrarily short edges)
  - Simplifies tree generation & analysis
- Trees can be represented in rectangular form
  - Alternative form of representation
  - Distance determined only by “height” of branch

[Figure: the same tree over leaves D, B, C, A drawn in normal form and in rectangular form]

Phylogenetic Trees

- Can label branches of tree with change to sequence

[Figure: tree with branches labeled by sequence changes (labels N, Y, L, S)]

Phylogenetic Trees – Distance

- (Evolutionary) Distance
  - Many possible measures
    - Fraction of sites that differ between two sequences
    - # of changes needed to convert one sequence to another
    - Pairwise alignment scores, normalized by average score for random alignment [Feng & Doolittle 1996]
      $\text{Score} = (S_{\text{actual}} - S_{\text{random}}) / (S_{\text{identical}} - S_{\text{random}})$
      where $S_{\text{identical}}$ = score for aligning identical sequences
- Distance matrix
  - Matrix of pairwise distances between all sequences
  - Used to generate tree
- Tree shape
  - Varies with construction method, distance metric

Seq.   A    B    C    D
A      –    8    7    12
B           –    9    14
C                –    11
D                     –
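
A minimal sketch of two of the distance measures listed above: the fraction of differing sites and the normalized alignment score. The function names and the toy alignment are hypothetical and not from the lecture, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def normalized_score(s_actual: float, s_random: float, s_identical: float) -> float:
    """Score = (S_actual - S_random) / (S_identical - S_random)."""
    return (s_actual - s_random) / (s_identical - s_random)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise p-distances keyed by (name_i, name_j) for each unordered pair."""
    return {(i, j): p_distance(seqs[i], seqs[j]) for i, j in combinations(seqs, 2)}


# Hypothetical toy alignment, used only for illustration.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACCTAGGA"}
matrix = distance_matrix(toy)
for pair, value in matrix.items():
    print(pair, value)
```

Only the upper triangle is stored because the distance is symmetric, which mirrors how the lecture's matrix is written.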

Tree Construction – UPGMA

- UPGMA (Unweighted Pair Group Method using Arithmetic Averages) [Sokal & Michener 1958]
- Algorithm
  1. Find pair of sequences A, B with smallest distance $D_{AB}$
  2. Insert join for A, B at tree height = $\tfrac{1}{2} D_{AB}$
  3. Update distance to new cluster as the average distance between pairs of sequences in each cluster
  4. Repeat until all sequences / clusters joined
  - Produces rooted tree
- Assumptions
  - Distances for tree are ultrametric
    - Branch lengths for 2 leaves same after join
  - Distances for tree are additive

[Figure: A and B joined at height $\tfrac{1}{2} D_{AB}$; C joined to cluster (AB) at height $\tfrac{1}{2} D_{C(AB)}$]
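
The UPGMA steps can be sketched as follows. This is an illustrative sketch, not the lecture's code: the function name `upgma` and the data layout are assumptions, and the cluster average is recomputed from leaf pairs at every step rather than updated incrementally, which is simple but not efficient. The input is the four-sequence distance matrix used in the lecture's example.

```python
from itertools import combinations


def upgma(names, dist):
    """UPGMA sketch. `dist` maps frozenset({x, y}) -> distance between two leaves.

    Returns the joins in order as (cluster_a, cluster_b, height), where a
    cluster is a tuple of leaf names and height = half the cluster distance.
    """
    clusters = [(n,) for n in names]

    def avg(c1, c2):
        # Average distance over all leaf pairs drawn from the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    joins = []
    while len(clusters) > 1:
        # Step 1: closest pair of clusters.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        # Step 2: insert the join at half the distance.
        joins.append((c1, c2, avg(c1, c2) / 2))
        # Step 3: merge; later averages use all leaves of the merged cluster.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins


# Distance matrix from the lecture's tree construction example.
d = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}.items()}

joins = upgma("ABCD", d)
for left, right, height in joins:
    print(left, right, round(height, 2))
```

The last join is the root, which is why the method produces a rooted tree.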

Tree Construction Example

Distance matrix

Sequences  A    B    C    D
A          –    8    7    12
B               –    9    14
C                    –    11
D                         –

[Figure: original tree]

Note that tree distances are additive (i.e., distance between X, Y = sum of lengths of edges connecting X, Y)
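
One way to test these properties directly on a distance matrix is sketched below. The three-point and four-point conditions used here are standard characterizations of ultrametric and additive distances; they are not stated in the lecture, and the function names are hypothetical.

```python
from itertools import combinations

# Distance matrix from the lecture's tree construction example.
d = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}.items()}


def dist(x, y):
    return d[frozenset((x, y))]


def is_ultrametric(names):
    """Three-point condition: in every triple the two largest distances are equal."""
    for a, b, c in combinations(names, 3):
        s = sorted([dist(a, b), dist(a, c), dist(b, c)])
        if s[1] != s[2]:
            return False
    return True


def is_additive(names):
    """Four-point condition: of the three pairwise sums, the two largest are equal."""
    for a, b, c, e in combinations(names, 4):
        s = sorted([dist(a, b) + dist(c, e),
                    dist(a, c) + dist(b, e),
                    dist(a, e) + dist(b, c)])
        if s[1] != s[2]:
            return False
    return True


additive = is_additive("ABCD")
ultrametric = is_ultrametric("ABCD")
print("additive:", additive, "ultrametric:", ultrametric)
```

For this matrix the check reports additive but not ultrametric, so the ultrametric assumption listed for UPGMA does not hold for the example data. The exact equality comparison works here because the distances are integers; measured distances would need a tolerance.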

Tree Construction Example – UPGMA

Distance matrices

       A    B    C    D
A      –    8    7    12
B           –    9    14
C                –    11
D                     –

A, C closest: join as A-C (height = ½ distance)

       A-C  B    D
A-C    –    8.5  11.5
B           –    14
D                –

A-C, B closest: join as A-C-B (height = ½ distance)

       A-C-B  D
A-C-B  –      12.33
D             –

UPGMA keeps all leaves in clusters and uses them in calculations

[Figure: UPGMA tree over leaves D, B, C, A with joins A-C and A-C-B]
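
The cluster distances in this example follow from averaging over all leaf pairs. A worked version of that arithmetic is given below; the symbols $h$ for join heights are notation introduced here, not the lecture's.

```latex
\begin{align*}
d_{(AC),B}  &= \tfrac{1}{2}\,(d_{A,B} + d_{C,B}) = \tfrac{1}{2}\,(8 + 9) = 8.5 \\
d_{(AC),D}  &= \tfrac{1}{2}\,(d_{A,D} + d_{C,D}) = \tfrac{1}{2}\,(12 + 11) = 11.5 \\
d_{(ACB),D} &= \tfrac{1}{3}\,(d_{A,D} + d_{C,D} + d_{B,D}) = \tfrac{1}{3}\,(12 + 11 + 14) \approx 12.33 \\
h_{AC} &= \tfrac{1}{2}\cdot 7 = 3.5, \qquad
h_{(AC)B} = \tfrac{1}{2}\cdot 8.5 = 4.25, \qquad
h_{(ACB)D} \approx \tfrac{1}{2}\cdot 12.33 \approx 6.17
\end{align*}
```

The third line averages over the three leaves A, C, B individually rather than averaging the two previous cluster distances, which is what "keeps all leaves in clusters" means in practice.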

Tree Construction – Neighbor-Joining

- Goal
  - Join closest neighbors (nodes w/ same parent) in tree
  - Avoids problem with UPGMA when rates of change differ
- Example
  - Closest leaves not neighbors in correct tree, but joined first by UPGMA
- Assumptions
  - Rate of change can differ
    - Branch lengths may differ after join
  - Branch lengths for tree are additive

[Figure: trees over leaves A, B, C, D illustrating the example]

Neighbor-Joining – Basic Principle

- Exploit principle for neighbor-joining algorithm
- Replace distance to C
  - Use normalized divergence $r_A$, $r_B$ (~ avg. distance to nodes)
  - We can calculate
    - $a = \tfrac{1}{2}(d_{A,B} + d_{A,C} - d_{B,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_A - r_B)$
    - $b = \tfrac{1}{2}(d_{A,B} + d_{B,C} - d_{A,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_B - r_A)$
    - $c = \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B}) \rightarrow \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B})$

       A    B          C
A      –    $d_{A,B}$  $d_{A,C}$
B           –          $d_{B,C}$
C                      –

[Figure: leaves A, B, C joined at node K by branches a, b, c; a second panel replaces C with $r_A$, $r_B$]

Simply treat all other nodes as C, and treat distance to C as r
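
The branch-length formulas for the three-leaf case follow from additivity. A short derivation, assuming A, B, C hang off the single join K by branches of length $a$, $b$, $c$:

```latex
\begin{align*}
d_{A,B} &= a + b, \qquad d_{A,C} = a + c, \qquad d_{B,C} = b + c \\
d_{A,B} + d_{A,C} - d_{B,C} &= (a + b) + (a + c) - (b + c) = 2a \\
\Rightarrow\quad a &= \tfrac{1}{2}\,(d_{A,B} + d_{A,C} - d_{B,C})
\end{align*}
```

The expressions for $b$ and $c$ follow in the same way by exchanging the roles of the leaves.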

Tree Construction – Neighbor-Joining

- Approach
  - To find closest pair of neighbors
    - Reduce branch length for a node by (approximately) the average distance of the node from all other nodes
    - Find smallest distance between nodes (after reduction)
- Definitions
  For all pairs of nodes A & B in set of all nodes L, let
  - $d_{A,B}$ = distance between A, B
  - $R_X = \sum_{N \in L} d_{X,N}$ (total distance from X to all N)
  - $r_X = R_X / (|L| - 2)$, where $|L|$ = # of nodes (normalized divergence from X to all other nodes)
  - $D_{A,B} = d_{A,B} - (r_A + r_B)$ (rate-corrected distance)
- Key property: 2 nodes w/ minimum D are always neighbors!

Tree Construction – Neighbor-Joining

- Algorithm [Saitou & Nei 1987, Studier & Keppler 1988]
  1. Begin with star tree & all sequences as nodes in L
  2. Find pair of nodes A & B ∈ L with minimum $D_{A,B}$
  3. Create & insert new join (node K) w/ branch lengths
     - $d_{A,K} = \tfrac{1}{2}(d_{A,B} + r_A - r_B)$
     - $d_{B,K} = \tfrac{1}{2}(d_{A,B} + r_B - r_A)$
  4. For remaining nodes C ∈ L, update distance to K as
     - $d_{K,C} = \tfrac{1}{2}(d_{A,C} + d_{B,C} - d_{A,B})$
  5. Insert K and remove A, B from L
  6. Repeat steps 2–5 until only two nodes left

[Figure: nodes A and B joined at new node K]
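
The neighbor-joining steps can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the `K1`, `K2`, ... naming of internal nodes, and the tie-breaking rule (first minimal pair in enumeration order) are choices made here, not part of the lecture. The input is the lecture's four-sequence example matrix.

```python
from itertools import combinations


def neighbor_joining(names, distances):
    """Neighbor-joining sketch. `distances` maps frozenset({x, y}) -> distance.

    Returns {frozenset({u, v}): branch length} for every edge of the
    unrooted tree.
    """
    d = dict(distances)
    nodes = list(names)
    edges = {}
    k = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Normalized divergence r_X = (total distance from X) / (|L| - 2).
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) / (n - 2)
             for x in nodes}
        # Step 2: pair with minimum rate-corrected distance D = d - (r_A + r_B).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k += 1
        new = f"K{k}"
        dab = d[frozenset((a, b))]
        # Step 3: branch lengths from A and B to the new join.
        edges[frozenset((a, new))] = (dab + r[a] - r[b]) / 2
        edges[frozenset((b, new))] = (dab + r[b] - r[a]) / 2
        # Step 4: distances from the new join to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - dab) / 2
        # Step 5: insert K, remove A and B.
        nodes = [x for x in nodes if x not in (a, b)] + [new]
    # Two nodes remain: connect them with their current distance.
    x, y = nodes
    edges[frozenset((x, y))] = d[frozenset((x, y))]
    return edges


d = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}.items()}

edges = neighbor_joining("ABCD", d)
for edge, length in edges.items():
    print(sorted(edge), length)
```

No root is placed, so the result is an unrooted tree described only by its edges.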

Tree Construction Example – Neighbor Joining

(Rate-corrected) distance matrix: distances $d$ above the diagonal, rate-corrected distances $D$ below it, normalized divergence $r$ in the last column

       A     B     C     D     r
A      –     8     7     12    13.5
B      -21   –     9     14    15.5
C      -20   -20   –     11    13.5
D      -20   -20   -21   –     18.5

Normalized divergence = $\sum d / (|L| - 2) = \sum d / 2$

Rate-corrected distances

$D_{A,B} = d_{A,B} - (r_A + r_B) = 8 - (13.5 + 15.5) = -21$

$D_{A,C} = d_{A,C} - (r_A + r_C) = 7 - (13.5 + 13.5) = -20$

$D_{A,D} = d_{A,D} - (r_A + r_D) = 12 - (13.5 + 18.5) = -20$

$D_{B,C} = d_{B,C} - (r_B + r_C) = 9 - (15.5 + 13.5) = -20$

$D_{B,D} = d_{B,D} - (r_B + r_D) = 14 - (15.5 + 18.5) = -20$

$D_{C,D} = d_{C,D} - (r_C + r_D) = 11 - (13.5 + 18.5) = -21$

[Figure: tree over leaves A, B, C, D]
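
The normalized divergences and rate-corrected distances for this example can be recomputed with a few lines. This is a sketch for checking the arithmetic; the variable names are chosen here for illustration.

```python
from itertools import combinations

names = "ABCD"
# Distance matrix from the lecture's example.
d = {frozenset(pair): value for pair, value in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}.items()}

n = len(names)
# r_X = (sum of distances from X to all other nodes) / (|L| - 2)
r = {x: sum(d[frozenset((x, y))] for y in names if y != x) / (n - 2) for x in names}
# D_{A,B} = d_{A,B} - (r_A + r_B)
D = {(a, b): d[frozenset((a, b))] - (r[a] + r[b]) for a, b in combinations(names, 2)}

print(r)
print(D)
```

Two pairs tie for the minimum, (A, B) and (C, D), so an implementation needs some tie-breaking rule.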

Tree Construction Example – Neighbor Joining

(Rate-corrected) distance matrix after A and B have been joined at $K_1$: distances $d$ above the diagonal, rate-corrected distances $D$ below it, $r$ in the last column

       C     D     K1    r
C      –     11    4     15
D      -24   –     9     20
K1     -24   -24   –     13

Averaged distance = $\sum d / (|L| - 2) = \sum d / 1$

All three rate-corrected distances tie at -24, so any of the three pairs can be joined next. Each choice gives the same branch lengths around the new node $K_2$.

Joining C, D

- Edge lengths for C, D
  - $d_{C,K_2} = \tfrac{1}{2}(d_{C,D} + r_C - r_D) = \tfrac{1}{2}(11 + 15 - 20) = 3$
  - $d_{D,K_2} = \tfrac{1}{2}(d_{C,D} + r_D - r_C) = \tfrac{1}{2}(11 + 20 - 15) = 8$
- Distances to $K_2$
  - $d_{K_2,K_1} = \tfrac{1}{2}(d_{K_1,C} + d_{K_1,D} - d_{C,D}) = \tfrac{1}{2}(4 + 9 - 11) = 1$

Joining C, $K_1$

- Edge lengths for C, $K_1$
  - $d_{C,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_C - r_{K_1}) = \tfrac{1}{2}(4 + 15 - 13) = 3$
  - $d_{K_1,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_{K_1} - r_C) = \tfrac{1}{2}(4 + 13 - 15) = 1$
- Distances to $K_2$
  - $d_{K_2,D} = \tfrac{1}{2}(d_{D,C} + d_{D,K_1} - d_{C,K_1}) = \tfrac{1}{2}(11 + 9 - 4) = 8$

Joining D, $K_1$

- Edge lengths for D, $K_1$
  - $d_{D,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_D - r_{K_1}) = \tfrac{1}{2}(9 + 20 - 13) = 8$
  - $d_{K_1,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_{K_1} - r_D) = \tfrac{1}{2}(9 + 13 - 20) = 1$
- Distances to $K_2$
  - $d_{K_2,C} = \tfrac{1}{2}(d_{C,D} + d_{C,K_1} - d_{D,K_1}) = \tfrac{1}{2}(11 + 4 - 9) = 3$

[Figure: resulting tree with leaves A, B, C, D and internal nodes K1, K2]

Tree Construction Example – Neighbor Joining

[Figure: original tree and neighbor-joining tree, with leaves A, B, C, D and internal nodes K1, K2]

Except for missing root, finds same tree topology
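
A quick consistency check is to confirm that path lengths in the neighbor-joining tree reproduce the original distance matrix. In the sketch below, the lengths for C, D and the K1–K2 edge are the values worked out in the example; the lengths for A and B (3 and 5) are not shown in this text and are computed here from the algorithm's branch-length formula, ½(8 + 13.5 − 15.5) and ½(8 + 15.5 − 13.5). The helper name `path_length` is hypothetical.

```python
# Edges of the neighbor-joining tree for the example.
edges = {("A", "K1"): 3, ("B", "K1"): 5, ("K1", "K2"): 1,
         ("C", "K2"): 3, ("D", "K2"): 8}

# Build an undirected adjacency map.
adj = {}
for (u, v), w in edges.items():
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w


def path_length(src, dst, prev=None):
    """Sum of edge lengths on the unique tree path from src to dst."""
    if src == dst:
        return 0
    for nxt, w in adj[src].items():
        if nxt != prev:
            rest = path_length(nxt, dst, src)
            if rest is not None:
                return w + rest
    return None  # dst is not reachable through this branch


original = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
            ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
recovered = {pair: path_length(*pair) for pair in original}
print(recovered == original)
```

Every pairwise path length matches the input, which is what additivity of the tree distances requires.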

Tree Construction – Maximum Parsimony

- Maximum parsimony [Fitch 1971]
  - Minimize number of sequence changes in tree
  - Assume fewest changes (mutations) = most likely (evolution)
- Informative site
  - Position with useful change information (for parsimony)
  - I.e., # of changes in position dependent on tree chosen
  - Must have ≥ 2 different bases / residues, such that each base / residue appears in ≥ 2 sequences

Seq1  A A G A G T G C A
Seq2  A G C C G T G C G
Seq3  A G A T A T C C A
Seq4  A G A G A T C C G

Informative sites: positions 5, 7 and 9
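
The informative-site rule can be sketched in a few lines. The function name is hypothetical, and the rule is implemented in its usual form: a column qualifies when at least two different bases each occur in at least two sequences.

```python
from collections import Counter

# The four aligned sequences from the slide.
seqs = {
    "Seq1": "AAGAGTGCA",
    "Seq2": "AGCCGTGCG",
    "Seq3": "AGATATCCA",
    "Seq4": "AGAGATCCG",
}


def informative_sites(seqs):
    """Return 1-based positions where >= 2 different bases each occur in >= 2 sequences."""
    sites = []
    for pos, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


sites = informative_sites(seqs)
print(sites)
```

Positions such as 2 (A, G, G, G) are variable but not informative, because the single A costs one change on every tree.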

Tree Construction – Maximum Parsimony

- Most parsimonious tree
  - Tree with fewest total # of changes at informative sites

Informative sites

Seq1  G G A
Seq2  G G G
Seq3  A C A
Seq4  A C G

[Figure: the three possible trees (Tree 1, Tree 2, Tree 3) over leaves GGA, GGG, ACA, ACG]

Sites changed: Tree 1 = 4, Tree 2 = 6, Tree 3 = 5

Tree Construction – Maximum Parsimony

- Algorithm
  - Generate all possible tree topologies
  - Count number of changes required
  - Select tree with minimum # changes
  - Use branch-and-bound to reduce search
    - Search trees with increasing # of leaves
    - Abandon subtree when # changes ≥ best completed tree
- Characteristics
  - Computationally expensive
  - Analyze only informative sites
  - Misleading if rates of changes vary among branches
  - Evolution is not always parsimonious
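
For four sequences there are only three unrooted topologies, so the exhaustive version of this search is easy to sketch. The code below counts changes per site with Fitch's set-intersection method on a tree rooted at its internal edge and picks the cheapest topology. It is a sketch under stated assumptions: the function names are hypothetical, the topologies are listed by hand rather than generated, and no branch-and-bound is done. Topologies are labeled by their pairing because the text does not say which pairing the slide calls Tree 2 or Tree 3.

```python
def fitch(tree, column):
    """Fitch's method for one site on a rooted binary tree.

    `tree` is a leaf name or a pair (left, right); `column` maps leaf -> base.
    Returns (candidate base set at this node, number of changes below it).
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint candidate sets force one extra change at this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_cost(tree, seqs, sites):
    """Total changes over the given 1-based site positions."""
    return sum(fitch(tree, {name: s[p - 1] for name, s in seqs.items()})[1]
               for p in sites)


seqs = {"Seq1": "AAGAGTGCA", "Seq2": "AGCCGTGCG",
        "Seq3": "AGATATCCA", "Seq4": "AGAGATCCG"}
informative = [5, 7, 9]

topologies = {
    "(Seq1,Seq2),(Seq3,Seq4)": (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    "(Seq1,Seq3),(Seq2,Seq4)": (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    "(Seq1,Seq4),(Seq2,Seq3)": (("Seq1", "Seq4"), ("Seq2", "Seq3")),
}
costs = {label: tree_cost(t, seqs, informative) for label, t in topologies.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

The three totals are 4, 5 and 6, the same set of counts given for the three trees in the example, with the minimum of 4 for the tree that pairs Seq1 with Seq2.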

Tree Construction – Maximum Likelihood

- Goal
  - Given the probability P(x|y,t) for a sequence y to evolve (mutate) to sequence x along an edge of length t (time)
  - Find tree that has highest probability of taking place
- Mutation probabilities
  - Bases: Jukes-Cantor model [Jukes-Cantor 1969, Kimura 1980]
  - Amino acids: PAM [Dayhoff+ 1978]
- Algorithm
  - Search over all tree topologies & sequence assignments
  - For each topology & assignment, search all branch lengths
- Characteristics
  - Very computationally expensive
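
To make P(x|y,t) concrete, the sketch below uses the Jukes-Cantor substitution probabilities for single bases and searches a grid of branch lengths for the value that maximizes the likelihood of one sequence given another. Several things here are assumptions rather than the lecture's method: `alpha` is taken as the rate of each specific substitution, sites are treated as independent, only a single edge between two observed sequences is considered (no tree search or ancestral sequence assignment), and the grid search stands in for a proper optimizer. The two sequences are Seq1 and Seq2 from the parsimony example.

```python
import math


def jc69_prob(x, y, t, alpha=1.0):
    """P(x | y, t) for single bases under Jukes-Cantor."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def log_likelihood(seq_x, seq_y, t, alpha=1.0):
    """Log-probability that seq_y evolves into seq_x along an edge of length t."""
    return sum(math.log(jc69_prob(a, b, t, alpha)) for a, b in zip(seq_x, seq_y))


seq1, seq2 = "AAGAGTGCA", "AGCCGTGCG"

# Search branch lengths on a grid (t > 0 so that log(0) never occurs).
grid = [i / 1000 for i in range(1, 2001)]
best_t = max(grid, key=lambda t: log_likelihood(seq1, seq2, t))

# Closed-form maximum for this two-sequence case, with p = fraction of differing sites.
p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
closed_form = -math.log(1 - 4 * p / 3) / 4

print(best_t, closed_form)
```

Even this one-edge case needs a search over t; a full maximum likelihood method repeats such searches for every topology and assignment, which is where the cost comes from.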

Phylogenetics Summary

- Phylogenetic prediction
  - Infer evolutionary relationships from shared features
  - May have application to sequence alignment, epidemiology
- Phylogenetic trees
  - May be ultrametric and / or additive
- Tree construction
  - Inexpensive distance-based (UPGMA, neighbor-joining)
  - Expensive (exhaustive) tree searches (parsimony, likelihood)
- Assessing phylogenetic trees
  - Algorithms always produce some tree (of varying accuracy)
  - Expert biology knowledge to assess correctness / significance

Where Are We Now?

- Bioinformatics topics covered
  - Molecular biology background
  - Pairwise sequence alignment
  - Multiple sequence alignment
  - Phylogenetics
- Remaining bioinformatics topics
  - Protein structure prediction
  - Gene assembly and prediction
  - Microarrays & expressed sequence tag (EST) analysis
  - Sequence / structure database search & organization
- High performance computing…

More Bioinformatics Terms

- Functional genomics
  - Identify function of genes in organism
- Comparative genomics
  - Identify genes
    - Related to other genes in organism
    - Related to genes in other species
  - Create evolutionary history of related genes
  - Locate insertions, deletions, substitutions occurring in evolution
- Proteomics
  - Identify & characterize all gene products (proteins) in organism
- Structural proteomics
  - Identify or predict 3D structure of all proteins in organism

More Bioinformatics Terms

- Pharmacogenomics
  - Application of genomic approaches to identify drug targets
    - Searching genomes for potential drug receptors
    - Examining characteristic gene expression in pathogens & hosts during infection for diagnostics or therapy targets
  - Cataloguing & processing info on pharmacology & genetics
- Pharmacogenetics
  - Identifying genetic causes for individualized drug responses
    - Identify genetic variation (e.g., SNPs) characteristic of particular patient response profiles
    - Use to improve administration & development of therapies
    - Identify receptive patient subsets, optimize drug dosages