













An overview of phylogenetics, a field of study that aims to understand the relationships between sequences and to construct phylogenetic trees representing their evolutionary history. The notes cover motivation, applications, tree construction algorithms, and the assessment of phylogenetic trees, along with the concepts of unrooted and rooted trees, leaves, joins, and branch lengths.














CMSC 838T – Lecture 5
- Study of evolutionary relationships (sequences / species)
- Infer evolutionary relationship from shared features
- May improve multiple sequence alignment (MSA)

- Relationship between organisms with common ancestor

- Graph representing evolutionary history of sequence / species

- Members sharing common evolutionary history (i.e., common ancestor) are more related to each other
- Can infer evolutionary relationship from shared features

- Historically → based on analysis of observable features (e.g., morphology, behavior, geographical distribution)
- Now → mostly analysis of DNA / RNA / amino acid sequences
- Understand relationship of sequence to similar sequences
- Construct phylogenetic tree representing evolutionary history

- Identify closely related families
  - Use phylogenetic relationships to predict gene function
- Follow changes in rapidly evolving species (e.g., viruses)
  - Analysis can reveal which genes are under selection
  - Provide epidemiology for tracking infections & vectors
- Few direct applications

- Alignment of sequences should take evolution into account
- More precise phylogenetic relationships ↔ improved MSA

- Distance methods
  - UPGMA
  - Neighbor-joining
- Maximum parsimony
- Maximum likelihood
- Original sequences

- Represent change
- Length represents evolutionary distance

- All sequences in subtree with common ancestor (treated as single node)

- Point of joining two leaves / clusters

- Can approximate all tree shapes (w/ arbitrarily short edges)
- Simplifies tree generation & analysis

- Alternative form of representation
- Distance determined only by “height” of branch

[Figure: the same tree drawn in normal form and in rectangular form]
- Many possible measures
  - Fraction of sites that differ between two sequences
  - # of changes needed to convert one sequence to another
  - Pairwise alignment scores, normalized by average score for random alignment [Feng & Doolittle 1996]

    $\text{Score} = (S_{\text{actual}} - S_{\text{random}}) / (S_{\text{identical}} - S_{\text{random}})$

    where $S_{\text{identical}}$ = score for aligning identical sequences
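As a rough illustration of two of the measures listed above, the sketch below computes the fraction of differing sites for a pair of already-aligned sequences and the normalized alignment score. The function names `p_distance` and `normalized_score` are hypothetical, and the sketch assumes gap-free sequences of equal length; the alignment scores themselves are taken as given inputs.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differing / len(seq_a)


def normalized_score(s_actual: float, s_random: float, s_identical: float) -> float:
    """Normalized score: (S_actual - S_random) / (S_identical - S_random)."""
    if s_identical == s_random:
        raise ValueError("S_identical and S_random must differ")
    return (s_actual - s_random) / (s_identical - s_random)


if __name__ == "__main__":
    # Made-up inputs, only to show the calls.
    print(p_distance("ACGT", "ACGA"))        # one of four sites differs
    print(normalized_score(30.0, 10.0, 50.0))
```

The normalized score is 1 when the actual score equals the score for identical sequences and 0 when it is no better than a random alignment.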
- Matrix of pairwise distances between all sequences
- Used to generate tree

- Varies with construction method, distance metric

Seq.   A    B    C    D
A      —    8    7   12
B           —    9   14
C                —   11
D                     —
- Distances for tree are ultrametric
  - Branch lengths for 2 leaves same after join
- Distances for tree are additive
Note that tree distances are additive (i.e., distance between X, Y = sum of lengths of edges connecting X, Y)
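One way to make these two properties concrete is to test a distance matrix directly. The sketch below uses the standard three-point condition (ultrametric) and four-point condition (additive); these tests are not stated in the notes and are brought in here as an assumption, as are the names `is_ultrametric` and `is_additive`. The data is the A to D example matrix.

```python
from itertools import combinations

# Example pairwise distances for sequences A-D.
DIST = {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
}


def d(x, y, dist=DIST):
    """Symmetric lookup of the distance between two leaves."""
    if x == y:
        return 0
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]


def is_ultrametric(names, dist=DIST, tol=1e-9):
    """Three-point condition: in every triple, the two largest distances are equal."""
    for x, y, z in combinations(names, 3):
        trio = sorted([d(x, y, dist), d(x, z, dist), d(y, z, dist)])
        if abs(trio[2] - trio[1]) > tol:
            return False
    return True


def is_additive(names, dist=DIST, tol=1e-9):
    """Four-point condition: of the three pairings of any quadruple,
    the two largest distance sums are equal."""
    for w, x, y, z in combinations(names, 4):
        sums = sorted([
            d(w, x, dist) + d(y, z, dist),
            d(w, y, dist) + d(x, z, dist),
            d(w, z, dist) + d(x, y, dist),
        ])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


if __name__ == "__main__":
    print(is_ultrametric("ABCD"), is_additive("ABCD"))
```

For the example matrix the pairing sums are 19, 21, and 21, so the distances are additive, while the triple A, B, C (distances 7, 8, 9) shows they are not ultrametric.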
UPGMA keeps all leaves in clusters and uses them in calculations
[Figure: UPGMA example. At each step the closest pair of clusters is joined; height of new node = ½ distance between them]
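The procedure described above (keep all leaves in each cluster, repeatedly join the closest pair, place the join at half the distance) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma` and the tuple-of-leaves cluster representation are assumptions, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges as (cluster_1, cluster_2, height) tuples.

    names: iterable of leaf names.
    dist:  dict mapping (leaf, leaf) pairs to distances (either key order).
    """
    # Each cluster is a tuple of its leaves, so all leaves stay available.
    clusters = [(n,) for n in names]

    def leaf_d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    def cluster_d(c1, c2):
        # Average distance over all leaf pairs spanning the two clusters.
        total = sum(leaf_d(x, y) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_d(*p))
        height = cluster_d(c1, c2) / 2  # height = 1/2 distance
        merges.append((c1, c2, height))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    dist = {
        ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
        ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
    }
    for c1, c2, h in upgma("ABCD", dist):
        print(c1, c2, round(h, 3))
```

On the example matrix for A to D this joins A and C first (distance 7, height 3.5), then adds B, then D.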
- Join closest neighbors (nodes w/ same parent) in tree
- Avoids problem with UPGMA when rates of change differ

- Closest leaves not neighbors in correct tree, but joined first by UPGMA

- Rate of change can differ
  - Branch lengths may differ after join
- Branch lengths for tree are additive
- Used normalized divergence $r_A$, $r_B$ (~ avg. distance to nodes)
- We can calculate
  - $a = \tfrac{1}{2}(d_{A,B} + d_{A,C} - d_{B,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_A - r_B)$
  - $b = \tfrac{1}{2}(d_{A,B} + d_{B,C} - d_{A,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_B - r_A)$
  - $c = \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B}) \rightarrow \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B})$

      A    B          C
A     —    d_{A,B}    d_{A,C}
B          —          d_{B,C}
C                     —

Simply treat all other nodes as C, and treat distance to C as r

- To find closest pair of neighbors
  - Reduce branch length for a node by (approximately) the average distance of the node from all other nodes
  - Find smallest distance between nodes (after reduction)

For all pairs of nodes A & B in set of all nodes L, let
  $d_{A,B}$ = distance between A, B
  $R_X = \sum_{N \in L} d_{X,N}$ (total distance from X to all N)
  $r_X = R_X / (|L| - 2)$, where $|L|$ = # of nodes (normalized divergence from X to all other nodes)
  $D_{A,B} = d_{A,B} - (r_A + r_B)$ (rate-corrected distance)
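The definitions above translate almost line for line into code. The sketch below computes the normalized divergence of every node and the rate-corrected distance of every pair; the function name `rate_corrected` and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations


def rate_corrected(names, dist):
    """Return (r, D): normalized divergences and rate-corrected distances.

    names: sequence of node names (the set L).
    dist:  dict mapping (node, node) pairs to distances (either key order).
    """
    names = list(names)
    if len(names) < 3:
        raise ValueError("need at least 3 nodes so that |L| - 2 > 0")

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    # r_X = R_X / (|L| - 2), with R_X the total distance from X to all others.
    r = {
        x: sum(d(x, n) for n in names if n != x) / (len(names) - 2)
        for x in names
    }
    # D_{A,B} = d_{A,B} - (r_A + r_B)
    D = {(a, b): d(a, b) - (r[a] + r[b]) for a, b in combinations(names, 2)}
    return r, D


if __name__ == "__main__":
    dist = {
        ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
        ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11,
    }
    r, D = rate_corrected("ABCD", dist)
    print(r)
    print(min(D, key=D.get), min(D.values()))
```

With the A to D example matrix this gives divergences of 13.5, 15.5, 13.5, and 18.5 for A, B, C, and D. The smallest rate-corrected distance is -21, shared by the pairs (A, B) and (C, D), so either pair can be joined first.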
Rate-corrected distances

$D_{A,B} = d_{A,B} - (r_A + r_B) = 8 - (13.5 + 15.5) = -21$
$D_{A,C} = d_{A,C} - (r_A + r_C) = 7 - (13.5 + 13.5) = -20$
$D_{A,D} = d_{A,D} - (r_A + r_D) = 12 - (13.5 + 18.5) = -20$
$D_{B,C} = d_{B,C} - (r_B + r_C) = 9 - (15.5 + 13.5) = -20$
$D_{B,D} = d_{B,D} - (r_B + r_D) = 14 - (15.5 + 18.5) = -20$
$D_{C,D} = d_{C,D} - (r_C + r_D) = 11 - (13.5 + 18.5) = -21$

Normalized divergence: $r_A = 13.5$, $r_B = 15.5$, $r_C = 13.5$, $r_D = 18.5$
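Distances among C, D, K1 with normalized divergence r

        C    D    K1    r
C       —   11    4    15
D            —    9    20
K1                —    13

Edge lengths for C, D

$d_{C,K_2} = \tfrac{1}{2}(d_{C,D} + r_C - r_D) = \tfrac{1}{2}(11 + 15 - 20) = 3$
$d_{D,K_2} = \tfrac{1}{2}(d_{C,D} + r_D - r_C) = \tfrac{1}{2}(11 + 20 - 15) = 8$

Distances to $K_2$

$d_{K_2,K_1} = \tfrac{1}{2}(d_{K_1,C} + d_{K_1,D} - d_{C,D}) = \tfrac{1}{2}(4 + 9 - 11) = 1$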
Edge lengths for C, $K_1$

$d_{C,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_C - r_{K_1}) = \tfrac{1}{2}(4 + 15 - 13) = 3$
$d_{K_1,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_{K_1} - r_C) = \tfrac{1}{2}(4 + 13 - 15) = 1$

Distances to $K_2$

$d_{K_2,D} = \tfrac{1}{2}(d_{D,C} + d_{D,K_1} - d_{C,K_1}) = \tfrac{1}{2}(11 + 9 - 4) = 8$
Edge lengths for D, $K_1$

$d_{D,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_D - r_{K_1}) = \tfrac{1}{2}(9 + 20 - 13) = 8$
$d_{K_1,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_{K_1} - r_D) = \tfrac{1}{2}(9 + 13 - 20) = 1$

Distances to $K_2$

$d_{K_2,C} = \tfrac{1}{2}(d_{C,D} + d_{C,K_1} - d_{D,K_1}) = \tfrac{1}{2}(11 + 4 - 9) = 3$
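The three worked joins above all use the same two formulas, so they can be checked with one small function. The sketch below is illustrative only: `nj_join` is a hypothetical name, and the distances among C, D, and K1 (11, 4, and 9) are the values used in the calculations above.

```python
def nj_join(a, b, names, dist):
    """Join nodes a and b into a new node K.

    Returns (edge_a, edge_b, to_new), where edge_a and edge_b are the edge
    lengths from a and b to K, and to_new maps every other node X to its
    distance from K.
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    n = len(names)
    # Normalized divergence r_X = (sum of distances from X) / (|L| - 2).
    r = {x: sum(d(x, y) for y in names if y != x) / (n - 2) for x in names}

    edge_a = 0.5 * (d(a, b) + r[a] - r[b])
    edge_b = 0.5 * (d(a, b) + r[b] - r[a])
    to_new = {
        x: 0.5 * (d(a, x) + d(b, x) - d(a, b))
        for x in names if x not in (a, b)
    }
    return edge_a, edge_b, to_new


if __name__ == "__main__":
    names = ["C", "D", "K1"]
    dist = {
        frozenset(("C", "D")): 11,
        frozenset(("C", "K1")): 4,
        frozenset(("D", "K1")): 9,
    }
    for pair in [("C", "D"), ("C", "K1"), ("D", "K1")]:
        print(pair, nj_join(*pair, names, dist))
```

Whichever pair is joined, the result is the same final star around the new node: edge lengths 3 for C, 8 for D, and 1 for K1.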
- Minimize number of sequence changes in tree
- Assume fewest changes (mutations) = most likely (evolution)

- Position with useful change information (for parsimony)
- I.e., # of changes in position dependent on tree chosen

- Tree with fewest total # of changes at informative sites

[Figure: three candidate trees with informative sites marked. Sites changed: Tree 1 = 4, Tree 2 = 6, Tree 3 = 5]
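Informative sites can be picked out of an alignment mechanically. The sketch below applies the usual criterion, which the notes do not spell out and is therefore an assumption here: a column is informative when at least two different characters each occur in at least two sequences. The function name and the four-sequence alignment are made up for illustration.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 0-based indices of parsimony-informative columns.

    alignment: list of equal-length, already-aligned sequences.
    """
    sites = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        # Informative: at least two character states, each seen in >= 2 sequences.
        shared_states = sum(1 for c in counts.values() if c >= 2)
        if shared_states >= 2:
            sites.append(i)
    return sites


if __name__ == "__main__":
    # Hypothetical alignment; only the last column is informative.
    alignment = ["AAGAG", "AGCCG", "AGATA", "AGAGA"]
    print(informative_sites(alignment))
```

Columns that are constant, or in which every variant character occurs only once, need the same number of changes on every tree and so cannot help choose between topologies.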
- Generate all possible tree topologies
- Count number of changes required
- Select tree with minimum # changes
- Use branch-and-bound to reduce search
  - Search trees with increasing # of leaves
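The "count number of changes required" step can be done per site with Fitch's algorithm. The notes do not name a counting method, so this choice is an assumption; the nested-tuple tree encoding, the names W to Z, and the alignment are likewise hypothetical. The sketch scores the three possible topologies for four sequences and picks the minimum.

```python
def fitch(tree, states):
    """Return (candidate state set, number of changes) for one site.

    tree:   a leaf name (str) or a pair (left, right) of subtrees.
    states: dict mapping leaf name to the character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on an edge below this node.
    return left_set | right_set, left_cost + right_cost + 1


def tree_cost(tree, seqs):
    """Total number of changes over all sites of the alignment."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences and the three topologies for four leaves.
SEQS = {"W": "AAGAG", "X": "AGCCG", "Y": "AGATA", "Z": "AGAGA"}
TREES = {
    "Tree 1": (("W", "X"), ("Y", "Z")),
    "Tree 2": (("W", "Y"), ("X", "Z")),
    "Tree 3": (("W", "Z"), ("X", "Y")),
}

if __name__ == "__main__":
    costs = {name: tree_cost(t, SEQS) for name, t in TREES.items()}
    print(costs, "best:", min(costs, key=costs.get))
```

In this made-up alignment the three trees differ only at the last column, which is its single informative site. Enumerating every topology is feasible for four leaves only; for larger inputs the search has to be pruned, for example with branch-and-bound.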
- Computationally expensive
- Analyze only informative sites
- Misleading if rates of changes vary among branches
- Evolution is not always parsimonious
- Given the probability P(x|y,t) for a sequence y to evolve (mutate) to sequence x along an edge of length t (time)
- Find tree that has highest probability of taking place

- Bases: Jukes-Cantor model [Jukes-Cantor 1969, Kimura 1980]
- Amino acids: PAM [Dayhoff+ 1978]
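For bases, P(x|y,t) under the Jukes-Cantor model has a simple closed form: a base stays the same with probability 1/4 + 3/4·e^(−4αt) and becomes any one specific other base with probability 1/4 − 1/4·e^(−4αt). The sketch below evaluates it for single bases and for whole sequences. The rate parameter `alpha`, the function names, and the assumption that sites evolve independently are choices made for this illustration, not details taken from the notes.

```python
import math


def jc_prob(x: str, y: str, t: float, alpha: float = 1.0) -> float:
    """P(x | y, t) for single bases under the Jukes-Cantor model.

    alpha is the rate of change to each of the three other bases.
    """
    decay = math.exp(-4.0 * alpha * t)
    if x == y:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc_seq_prob(seq_x: str, seq_y: str, t: float, alpha: float = 1.0) -> float:
    """P(seq_x | seq_y, t), assuming each site evolves independently."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must have the same length")
    return math.prod(jc_prob(a, b, t, alpha) for a, b in zip(seq_x, seq_y))


if __name__ == "__main__":
    for t in (0.0, 0.1, 1.0, 10.0):
        print(t, jc_prob("A", "A", t), jc_prob("C", "A", t))
```

At t = 0 no change is possible, and as t grows every base approaches probability 1/4 regardless of the starting base. A maximum likelihood search multiplies such edge probabilities over a whole tree and looks for the topology and branch lengths that maximize the product.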
- Search over all tree topologies & sequence assignments
- For each topology & assignment, search all branch lengths

- Very computationally expensive
- Infer evolutionary relationships from shared features
- May have application to sequence alignment, epidemiology

- May be ultrametric and / or additive

- Inexpensive distance-based (UPGMA, neighbor-joining)
- Expensive (exhaustive) tree searches (parsimony, likelihood)

- Algorithms always produce some tree (of varying accuracy)
- Expert biology knowledge to assess correctness / significance

- Molecular biology background
- Pairwise sequence alignment
- Multiple sequence alignment
- Phylogenetics

- Protein structure prediction
- Gene assembly and prediction
- Microarrays & expressed sequence tag (EST) analysis
- Sequence / structure database search & organization
- Identify function of genes in organism

- Identify genes
  - Related to other genes in organism
  - Related to genes in other species
- Create evolutionary history of related genes
- Locate insertions, deletions, substitutions occurring in evolution

- Identify & characterize all gene products (proteins) in organism

- Identify or predict 3D structure of all proteins in organism

- Application of genomic approaches to identify drug targets
  - Searching genomes for potential drug receptors
  - Examining characteristic gene expression in pathogens & hosts during infection for diagnostics or therapy targets
- Cataloguing & processing info on pharmacology & genetics

- Identifying genetic causes for individualized drug responses
  - Identify genetic variation (e.g., SNPs) characteristic of particular patient response profiles
  - Use to improve administration & development of therapies
  - Identify receptive patient subsets, optimize drug dosages