Phylogenetics: Understanding Evolutionary History through Sequence Analysis, Study notes of Computer Science

An overview of phylogenetics, the study of the relationships between sequences and the construction of phylogenetic trees representing their evolutionary history. The notes cover motivation, applications, tree construction algorithms, and the assessment of phylogenetic trees, along with the concepts of unrooted and rooted trees, leaves, joins, and branch lengths.

Uploaded on 02/13/2009

koofers-user-m9f
CMSC 838T – Lecture 5

- Phylogenetics
  - Study of evolutionary relationships (sequences / species)
  - Infer evolutionary relationships from shared features
  - May improve multiple sequence alignment (MSA)

Phylogenetics

- Phylogeny
  - Relationship between organisms with a common ancestor
- Phylogenetic tree
  - Graph representing the evolutionary history of sequences / species
- Premise
  - Members sharing a common evolutionary history (i.e., a common ancestor) are more closely related to each other
  - Can infer evolutionary relationships from shared features
- Long history of phylogenetics (from field of genetics)
  - Historically: based on analysis of observable features (e.g., morphology, behavior, geographical distribution)
  - Now: mostly analysis of DNA / RNA / amino acid sequences

Phylogenetics – Motivation & Alignment

- Goal of phylogenetics
  - Understand the relationship of a sequence to similar sequences
  - Construct a phylogenetic tree representing evolutionary history
- Motivation / application
  - Identify closely related families
    - Use phylogenetic relationships to predict gene function
  - Follow changes in rapidly evolving species (e.g., viruses)
    - Analysis can reveal which genes are under selection
    - Provide epidemiology for tracking infections & vectors
  - Few direct applications
- Relationship to multiple sequence alignment (MSA)
  - Alignment of sequences should take evolution into account
  - More precise phylogenetic relationships ↔ improved MSA

Phylogenetics Overview

- Phylogenetic trees
- Tree construction algorithms
  - Distance methods
    - UPGMA
    - Neighbor-joining
  - Maximum parsimony
  - Maximum likelihood
- Assessing phylogenetic trees

Phylogenetic Trees

- Leaves / taxa
  - Original sequences
- Branches
  - Represent change
  - Length represents evolutionary distance
- Cluster / clade
  - All sequences in a subtree with a common ancestor (treated as a single node)
- Join / node
  - Point of joining two leaves / clusters

[Figure: example tree, annotated "distance"]
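As a rough illustration of these terms, a tree can be held in a small recursive structure. The sketch below is an assumption for illustration only: the class name `Node` and its fields are hypothetical and do not come from the lecture. Leaves carry a name, joins carry children, and every node stores the length of the branch to its parent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A leaf (taxon) or a join (internal node) of a phylogenetic tree."""
    name: Optional[str] = None          # set for leaves, None for joins
    branch_length: float = 0.0          # length of the branch to the parent
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def clade(self) -> List[str]:
        """Names of all leaves in the subtree below this node."""
        if self.is_leaf():
            return [self.name]
        return [leaf for child in self.children for leaf in child.clade()]


# Hypothetical example: a join of A and B, then a join of that cluster with C.
ab = Node(children=[Node("A", 1.0), Node("B", 1.0)], branch_length=0.5)
root = Node(children=[ab, Node("C", 1.5)])
```

Here `ab.clade()` lists the leaves of the cluster formed by the first join, and `root.clade()` lists all leaves.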

Phylogenetic Trees

- Use binary trees (evolution is a bifurcating process)
  - Can approximate all tree shapes (w/ arbitrarily short edges)
  - Simplifies tree generation & analysis
- Trees can be represented in rectangular form
  - Alternative form of representation
  - Distance determined only by "height" of branch

[Figure: the same tree with leaves D, B, C, A drawn in normal form and in rectangular form]

Phylogenetic Trees

- Can label branches of tree with change to sequence

[Figure: tree with branches labeled by sequence changes; labels N, Y, L, S]

Phylogenetic Trees – Distance

- (Evolutionary) distance
  - Many possible measures
    - Fraction of sites that differ between two sequences
    - # of changes needed to convert one sequence to another
    - Pairwise alignment scores, normalized by average score for random alignment [Feng & Doolittle 1996]
      $\text{Score} = (S_{\text{actual}} - S_{\text{random}}) / (S_{\text{identical}} - S_{\text{random}})$
      where $S_{\text{identical}}$ = score for aligning identical sequences
- Distance matrix
  - Matrix of pairwise distances between all sequences
  - Used to generate tree
- Tree shape
  - Varies with construction method, distance metric

Seq.    A     B     C     D
A       —     8     7    12
B             —     9    14
C                   —    11
D                         —
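The first measure and the normalized score are simple enough to sketch in code. The following is a minimal illustration, not the lecture's implementation: the function names, the dictionary-of-pairs matrix layout, and the three short sequences are all assumptions made up for the example.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differing = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differing / len(seq_a)


def normalized_score(s_actual: float, s_random: float, s_identical: float) -> float:
    """Normalized alignment score: (S_actual - S_random) / (S_identical - S_random)."""
    return (s_actual - s_random) / (s_identical - s_random)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise p-distances keyed by frozenset({name1, name2})."""
    return {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}


# Made-up sequences, for illustration only.
example = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGTTA"}
matrix = distance_matrix(example)
```

Keying the matrix by unordered pairs keeps it symmetric by construction, so only the upper triangle is ever stored.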

Tree Construction – UPGMA

- UPGMA (Unweighted Pair Group Method using Arithmetic Averages) [Sokal & Michener 1958]
- Algorithm
  1. Find pair of sequences A, B with smallest distance $D_{AB}$
  2. Insert join for A, B at tree height = $\tfrac{1}{2} D_{AB}$
  3. Update distance to new cluster as the average distance between pairs of sequences, one from each cluster
  4. Repeat until all sequences / clusters joined
  5. Produces rooted tree
- Assumptions
  - Distances for tree are ultrametric
    - Branch lengths for 2 leaves same after join
  - Distances for tree are additive

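[Figure: leaves A and B joined at height $\tfrac{1}{2} D_{AB}$; C then joined to cluster (AB) at height $\tfrac{1}{2} D_{C(AB)}$]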

Tree Construction Example

Distance matrix

Sequences   A     B     C     D
A           —     8     7    12
B                 —     9    14
C                       —    11
D                             —

[Figure: original tree]

Note that tree distances are additive (i.e., distance between X, Y = sum of lengths of edges connecting X, Y)
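Additivity can be checked directly from the matrix with the four-point condition (for every four sequences, the two largest of the three pairwise sums are equal), and ultrametricity with the analogous three-point condition. The sketch below applies both to the example matrix; the function names and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

# Distance matrix from the example (upper triangle).
D = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
     ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}


def dist(x, y):
    """Symmetric lookup into the upper-triangle dictionary."""
    return 0 if x == y else D.get((x, y), D.get((y, x)))


def is_additive(taxa, tol=1e-9):
    """Four-point condition: the two largest of the three pairwise sums are equal."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([dist(w, x) + dist(y, z),
                       dist(w, y) + dist(x, z),
                       dist(w, z) + dist(x, y)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def is_ultrametric(taxa, tol=1e-9):
    """Three-point condition: the two largest of the three distances are equal."""
    for x, y, z in combinations(taxa, 3):
        d = sorted([dist(x, y), dist(x, z), dist(y, z)])
        if abs(d[2] - d[1]) > tol:
            return False
    return True
```

For this matrix the three sums are 19, 21, and 21, so `is_additive("ABCD")` holds, while `is_ultrametric("ABCD")` does not (A, B, C have distances 7, 8, 9). The example therefore satisfies the additive assumption but not the ultrametric one.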

Tree Construction Example – UPGMA

Distance matrices

A, C closest:

        A     B     C     D
A       —     8     7    12
B             —     9    14
C                   —    11
D                         —

A-C, B closest:

        A-C     B     D
A-C       —   8.5  11.5
B               —    14
D                     —

Final join:

        A-C-B      D
A-C-B       —  12.33
D                  —

UPGMA keeps all leaves in clusters and uses them in calculations.

Height = ½ distance

[Figure: UPGMA tree with leaves D, B, C, A and joins A-C, A-C-B]
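The UPGMA steps can be sketched compactly by keeping each cluster as a tuple of its leaves and averaging over all leaf pairs. This is a minimal sketch under stated assumptions, not the lecture's code: the function name `upgma`, the `frozenset`-keyed matrix, and the returned list of joins are choices made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster by UPGMA.

    names: list of leaf names; dist: dict mapping frozenset({a, b}) -> distance.
    Returns a list of (cluster_1, cluster_2, join_height) in join order,
    where each cluster is a tuple of leaf names.
    """
    clusters = [(n,) for n in names]
    joins = []

    def avg(c1, c2):
        # Average over all leaf pairs, one leaf from each cluster.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        joins.append((c1, c2, avg(c1, c2) / 2))   # height = 1/2 distance
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins


example = {frozenset(k): v for k, v in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}.items()}
joins = upgma(["A", "B", "C", "D"], example)
```

On the example matrix this joins A and C at height 3.5, then B at height 4.25 (distance 8.5), then D at half of 37/3 ≈ 12.33, in line with the matrices above.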

Tree Construction – Neighbor-Joining

- Goal
  - Join closest neighbors (nodes w/ same parent) in tree
  - Avoids problem with UPGMA when rates of change differ
- Example
  - Closest leaves not neighbors in correct tree, but joined first by UPGMA
- Assumptions
  - Rate of change can differ
    - Branch lengths may differ after join
  - Branch lengths for tree are additive

[Figure: example trees with leaves A, B, C, D]

Neighbor-Joining – Basic Principle

[Figure: leaves A, B, C joined at internal node K by branches of length a, b, c]

        A            B            C
A       —       $d_{A,B}$    $d_{A,C}$
B                    —       $d_{B,C}$
C                                 —

- Exploit principle for neighbor-joining algorithm
- Replace distance to C
  - Use normalized divergence $r_A$, $r_B$ (~ avg. distance to nodes)
  - We can calculate
    - $a = \tfrac{1}{2}(d_{A,B} + d_{A,C} - d_{B,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_A - r_B)$
    - $b = \tfrac{1}{2}(d_{A,B} + d_{B,C} - d_{A,C}) \rightarrow \tfrac{1}{2}(d_{A,B} + r_B - r_A)$
    - $c = \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B}) \rightarrow \tfrac{1}{2}(d_{B,C} + d_{A,C} - d_{A,B})$
- Simply treat all other nodes as C, and treat distance to C as r
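The three-leaf branch lengths follow from additivity alone. As a short worked derivation, using the same symbols a, b, c for the branches from A, B, C to the join K, each pairwise distance is the sum of the two branches on the path through K:

```latex
\begin{aligned}
d_{A,B} &= a + b, \qquad d_{A,C} = a + c, \qquad d_{B,C} = b + c \\
d_{A,B} + d_{A,C} - d_{B,C} &= (a + b) + (a + c) - (b + c) = 2a \\
\Rightarrow\quad a &= \tfrac{1}{2}\,(d_{A,B} + d_{A,C} - d_{B,C})
\end{aligned}
```

The expressions for b and c follow in the same way by exchanging the roles of the three leaves.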

Tree Construction – Neighbor-Joining

- Approach
  - To find closest pair of neighbors
    - Reduce branch length for a node by (approximately) the average distance of the node from all other nodes
    - Find smallest distance between nodes (after reduction)
- Definitions
  For all pairs of nodes A & B in the set of all nodes L, let
  - $d_{A,B}$ = distance between A, B
  - $R_X = \sum_{N \in L} d_{X,N}$ (total distance from X to all N)
  - $r_X = R_X / (|L| - 2)$, where $|L|$ = # of nodes (normalized divergence from X to all other nodes)
  - $D_{A,B} = d_{A,B} - (r_A + r_B)$ (rate-corrected distance)
- Key property: when distances are additive, the 2 nodes w/ minimum D are always neighbors!
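These definitions translate almost line for line into code. The sketch below evaluates them on the four-sequence example matrix used in these notes (A, B, C, D); the variable names and the `frozenset`-keyed dictionary are assumptions made for illustration.

```python
from itertools import combinations

d = {frozenset(k): v for k, v in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}.items()}
nodes = ["A", "B", "C", "D"]

# R_X: total distance from X to all other nodes; r_X = R_X / (|L| - 2)
R = {x: sum(d[frozenset((x, n))] for n in nodes if n != x) for x in nodes}
r = {x: R[x] / (len(nodes) - 2) for x in nodes}

# Rate-corrected distance D_{A,B} = d_{A,B} - (r_A + r_B)
D = {(a, b): d[frozenset((a, b))] - (r[a] + r[b])
     for a, b in combinations(nodes, 2)}
```

This gives r = 13.5, 15.5, 13.5, 18.5 for A, B, C, D, and D = -21 for the pairs (A, B) and (C, D) against -20 for every other pair, so A, B and C, D are the candidate neighbor pairs.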

Tree Construction – Neighbor-Joining

- Algorithm [Saitou & Nei 1987, Studier & Keppler 1988]
  1. Begin with star tree & all sequences as nodes in L
  2. Find pair of nodes A & B ∈ L with minimum $D_{A,B}$
  3. Create & insert new join (node K) w/ branch lengths
     - $d_{A,K} = \tfrac{1}{2}(d_{A,B} + r_A - r_B)$
     - $d_{B,K} = \tfrac{1}{2}(d_{A,B} + r_B - r_A)$
  4. For remaining nodes C ∈ L, update distance to K as
     - $d_{K,C} = \tfrac{1}{2}(d_{A,C} + d_{B,C} - d_{A,B})$
  5. Insert K and remove A, B from L
  6. Repeat steps 2−5 until only two nodes left

[Figure: nodes A and B joined at new node K]
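The six steps can be sketched as a short loop. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name, the `frozenset`-keyed matrix, the `K1`, `K2`, ... naming of joins, and the tie-breaking rule (first minimal pair found) are all choices made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Neighbor-joining sketch.

    names: list of leaf names; dist: dict mapping frozenset({a, b}) -> distance.
    Returns a list of (node, node, branch_length) edges of the unrooted tree.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        n = len(nodes)
        # Normalized divergence r_X = (sum of distances from X) / (|L| - 2)
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) / (n - 2)
             for x in nodes}
        # Pair with minimum rate-corrected distance; ties go to the first pair found.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - (r[p[0]] + r[p[1]]))
        k = f"K{next_id}"
        next_id += 1
        d_ab = d[frozenset((a, b))]
        edges.append((a, k, 0.5 * (d_ab + r[a] - r[b])))
        edges.append((b, k, 0.5 * (d_ab + r[b] - r[a])))
        # Distance from the new join K to every remaining node C.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((k, c))] = 0.5 * (d[frozenset((a, c))]
                                              + d[frozenset((b, c))] - d_ab)
        nodes = [x for x in nodes if x not in (a, b)] + [k]
    # Connect the last two nodes directly.
    x, y = nodes
    edges.append((x, y, d[frozenset((x, y))]))
    return edges


example = {frozenset(k): v for k, v in {
    ("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
    ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}.items()}
edges = neighbor_joining(["A", "B", "C", "D"], example)
```

On the four-sequence example matrix from these notes, the sketch joins A and B at K1 (branch lengths 3 and 5), then C and D at K2 (branch lengths 3 and 8), and connects K1 to K2 with length 1. Summing edges along each path reproduces the input distances, as expected for additive data.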

Tree Construction Example – Neighbor Joining

(Rate-corrected) distance matrix

        A     B     C     D      r
A       —     8     7    12   13.5
B     -21     —     9    14   15.5
C     -20   -20     —    11   13.5
D     -20   -20   -21     —   18.5

Upper triangle: distances $d$; lower triangle: rate-corrected distances $D$.

normalized divergence $r = \sum d / (|L| - 2) = \sum d / 2$

Rate-corrected distances

$D_{A,B} = d_{A,B} - (r_A + r_B) = 8 - (13.5 + 15.5) = -21$
$D_{A,C} = d_{A,C} - (r_A + r_C) = 7 - (13.5 + 13.5) = -20$
$D_{A,D} = d_{A,D} - (r_A + r_D) = 12 - (13.5 + 18.5) = -20$
$D_{B,C} = d_{B,C} - (r_B + r_C) = 9 - (15.5 + 13.5) = -20$
$D_{B,D} = d_{B,D} - (r_B + r_D) = 14 - (15.5 + 18.5) = -20$
$D_{C,D} = d_{C,D} - (r_C + r_D) = 11 - (13.5 + 18.5) = -21$

[Figure: tree with leaves A, B, C, D]

Tree Construction Example – Neighbor Joining

(Rate-corrected) distance matrix

        C     D    K1     r
C       —    11     4    15
D     -24     —     9    20
K1    -24   -24     —    13

averaged distance $r = \sum d / (|L| - 2) = \sum d / 1$

All three pairs tie at $D = -24$; here C and D are joined.

Edge lengths for C, D

$d_{C,K_2} = \tfrac{1}{2}(d_{C,D} + r_C - r_D) = \tfrac{1}{2}(11 + 15 - 20) = 3$
$d_{D,K_2} = \tfrac{1}{2}(d_{C,D} + r_D - r_C) = \tfrac{1}{2}(11 + 20 - 15) = 8$

Distances to $K_2$

$d_{K_2,K_1} = \tfrac{1}{2}(d_{K_1,C} + d_{K_1,D} - d_{C,D}) = \tfrac{1}{2}(4 + 9 - 11) = 1$

[Figure: trees with leaves A, B, C, D and joins $K_1$, $K_2$]

Tree Construction Example – Neighbor Joining

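(Rate-corrected) distance matrix

        C     D    K1     r
C       —    11     4    15
D     -24     —     9    20
K1    -24   -24     —    13

averaged distance $r = \sum d / (|L| - 2) = \sum d / 1$

Alternative with the same tied minimum $D = -24$: join C and $K_1$.

Edge lengths for C, $K_1$

$d_{C,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_C - r_{K_1}) = \tfrac{1}{2}(4 + 15 - 13) = 3$
$d_{K_1,K_2} = \tfrac{1}{2}(d_{C,K_1} + r_{K_1} - r_C) = \tfrac{1}{2}(4 + 13 - 15) = 1$

Distances to $K_2$

$d_{K_2,D} = \tfrac{1}{2}(d_{D,C} + d_{D,K_1} - d_{C,K_1}) = \tfrac{1}{2}(11 + 9 - 4) = 8$

[Figure: trees with leaves A, B, C, D and joins $K_1$, $K_2$]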

Tree Construction Example – Neighbor Joining

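(Rate-corrected) distance matrix

        C     D    K1     r
C       —    11     4    15
D     -24     —     9    20
K1    -24   -24     —    13

averaged distance $r = \sum d / (|L| - 2) = \sum d / 1$

Alternative with the same tied minimum $D = -24$: join D and $K_1$.

Edge lengths for D, $K_1$

$d_{D,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_D - r_{K_1}) = \tfrac{1}{2}(9 + 20 - 13) = 8$
$d_{K_1,K_2} = \tfrac{1}{2}(d_{D,K_1} + r_{K_1} - r_D) = \tfrac{1}{2}(9 + 13 - 20) = 1$

Distances to $K_2$

$d_{K_2,C} = \tfrac{1}{2}(d_{C,D} + d_{C,K_1} - d_{D,K_1}) = \tfrac{1}{2}(11 + 4 - 9) = 3$

[Figure: trees with leaves A, B, C, D and joins $K_1$, $K_2$]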

Tree Construction Example – Neighbor Joining

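[Figure: original tree and neighbor-joining tree, each with leaves A, B, C, D; the neighbor-joining tree has joins $K_1$, $K_2$]

Except for the missing root, neighbor-joining finds the same tree topology.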

Tree Construction – Maximum Parsimony

- Maximum parsimony [Fitch 1971]
  - Minimize number of sequence changes in tree
  - Assume fewest changes (mutations) = most likely (evolution)
- Informative site
  - Position with useful change information (for parsimony)
  - I.e., # of changes in position depends on tree chosen
  - Must have ≥ 2 different bases / residues that each appear in ≥ 2 sequences

Position  1 2 3 4 5 6 7 8 9
Seq1      A A G A G T G C A
Seq2      A G C C G T G C G
Seq3      A G A T A T C C A
Seq4      A G A G A T C C G
                  *   *   *

Informative sites: positions 5, 7, 9 (marked *)
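The informative-site rule is easy to check mechanically. The sketch below applies it to the four sequences above; the function name `informative_sites` is an assumption made for illustration.

```python
from collections import Counter


def informative_sites(seqs):
    """Return 1-based positions that are parsimony-informative.

    A site qualifies when at least two different characters each
    occur in at least two of the aligned sequences.
    """
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


alignment = ["AAGAGTGCA",   # Seq1
             "AGCCGTGCG",   # Seq2
             "AGATATCCA",   # Seq3
             "AGAGATCCG"]   # Seq4
print(informative_sites(alignment))  # [5, 7, 9]
```

Position 2, for example, is excluded because A occurs in only one sequence, and position 4 because all four bases differ.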

Tree Construction – Maximum Parsimony

- Most parsimonious tree
  - Tree with fewest total # of changes at informative sites

Informative sites

Seq1  G G A
Seq2  G G G
Seq3  A C A
Seq4  A C G

[Figure: three candidate trees (Tree 1, Tree 2, Tree 3) labeled with leaf and internal-node sequences at the informative sites]

Sites changed: Tree 1 = 4, Tree 2 = 6, Tree 3 = 5
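Counting the changes required by a given topology can be done with Fitch's small-parsimony procedure. The sketch below is an illustration under stated assumptions: trees are written as nested pairs, each four-leaf topology is rooted arbitrarily on its central edge (the count does not depend on where the root is placed), and the function names are hypothetical.

```python
def fitch_changes(tree, site):
    """Count the minimum number of changes at one site (Fitch's algorithm).

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    site: dict mapping leaf name -> character at this site.
    Returns (possible ancestral states, number of changes).
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_cost = fitch_changes(tree[0], site)
    right_states, right_cost = fitch_changes(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: take the union and charge one change.
    return left_states | right_states, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total changes over all sites for the given topology."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_changes(tree, {k: s[i] for k, s in seqs.items()})[1]
               for i in range(n_sites))


seqs = {"Seq1": "GGA", "Seq2": "GGG", "Seq3": "ACA", "Seq4": "ACG"}
topologies = {
    "(Seq1,Seq2),(Seq3,Seq4)": (("Seq1", "Seq2"), ("Seq3", "Seq4")),
    "(Seq1,Seq3),(Seq2,Seq4)": (("Seq1", "Seq3"), ("Seq2", "Seq4")),
    "(Seq1,Seq4),(Seq2,Seq3)": (("Seq1", "Seq4"), ("Seq2", "Seq3")),
}
lengths = {name: tree_length(t, seqs) for name, t in topologies.items()}
```

The three topologies need 4, 5, and 6 changes respectively, which agrees with the counts listed above. Which of the latter two topologies corresponds to Tree 2 and which to Tree 3 cannot be recovered from the flattened figure.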

Tree Construction – Maximum Parsimony

- Algorithm
  - Generate all possible tree topologies
  - Count number of changes required
  - Select tree with minimum # changes
  - Use branch-and-bound to reduce search
    - Search trees with increasing # of leaves
    - Abandon subtree when # changes ≥ best completed tree
- Characteristics
  - Computationally expensive
  - Analyze only informative sites
  - Misleading if rates of change vary among branches
  - Evolution is not always parsimonious
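To give a sense of why exhaustive search is expensive, the number of unrooted binary topologies on n leaves is (2n − 5)!! = 1 · 3 · 5 ··· (2n − 5). The short sketch below evaluates that count; the function name is an assumption made for illustration.

```python
def num_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted binary tree topologies: (2n - 5)!! for n >= 3."""
    if n_leaves < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count
```

Four leaves give 3 topologies, as in the parsimony example, while ten leaves already give 2,027,025, which is why pruning with branch-and-bound matters.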

Tree Construction – Maximum Likelihood

- Goal
  - Given the probability $P(x \mid y, t)$ for a sequence y to evolve (mutate) into sequence x along an edge of length t (time)
  - Find the tree that has the highest probability of taking place
- Mutation probabilities
  - Bases: Jukes-Cantor model [Jukes & Cantor 1969], Kimura two-parameter model [Kimura 1980]
  - Amino acids: PAM [Dayhoff+ 1978]
- Algorithm
  - Search over all tree topologies & sequence assignments
  - For each topology & assignment, search all branch lengths
- Characteristics
  - Very computationally expensive
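The building block of the search is the edge probability P(x | y, t). As a minimal sketch for bases under the Jukes-Cantor model, assuming sites evolve independently and a rate `alpha` of change to each of the three other bases (the default of 1.0, the function names, and the two short sequences are made up for illustration), the probability and a crude search over one branch length might look like this. It covers only a single edge, not the search over topologies.

```python
import math


def jc_prob(x: str, y: str, t: float, alpha: float = 1.0) -> float:
    """P(x | y, t) for a single base under the Jukes-Cantor model.

    alpha is the rate of change to each of the three other bases
    (a hypothetical default of 1.0 is used here).
    """
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def log_likelihood(seq_x: str, seq_y: str, t: float, alpha: float = 1.0) -> float:
    """log P(seq_x | seq_y, t), treating sites as independent."""
    return sum(math.log(jc_prob(x, y, t, alpha)) for x, y in zip(seq_x, seq_y))


# Crude one-dimensional search over branch length for two made-up sequences.
candidates = [i / 100 for i in range(1, 101)]
best_t = max(candidates, key=lambda t: log_likelihood("ACGTACGT", "ACGTACGA", t))
```

A full maximum-likelihood method repeats this kind of branch-length optimization for every edge of every candidate topology, which is where the cost comes from.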

Phylogenetics Summary

- Phylogenetic prediction
  - Infer evolutionary relationships from shared features
  - May have application to sequence alignment, epidemiology
- Phylogenetic trees
  - May be ultrametric and / or additive
- Tree construction
  - Inexpensive distance-based methods (UPGMA, neighbor-joining)
  - Expensive (exhaustive) tree searches (parsimony, likelihood)
- Assessing phylogenetic trees
  - Algorithms always produce some tree (of varying accuracy)
  - Expert biology knowledge needed to assess correctness / significance

Where Are We Now?

- Bioinformatics topics covered
  - Molecular biology background
  - Pairwise sequence alignment
  - Multiple sequence alignment
  - Phylogenetics
- Remaining bioinformatics topics
  - Protein structure prediction
  - Gene assembly and prediction
  - Microarrays & expressed sequence tag (EST) analysis
  - Sequence / structure database search & organization
- High performance computing…

More Bioinformatics Terms

- Functional genomics
  - Identify function of genes in organism
- Comparative genomics
  - Identify genes
    - Related to other genes in organism
    - Related to genes in other species
  - Create evolutionary history of related genes
  - Locate insertions, deletions, substitutions occurring in evolution
- Proteomics
  - Identify & characterize all gene products (proteins) in organism
- Structural proteomics
  - Identify or predict 3D structure of all proteins in organism

More Bioinformatics Terms

- Pharmacogenomics
  - Application of genomic approaches to identify drug targets
    - Searching genomes for potential drug receptors
    - Examining characteristic gene expression in pathogens & hosts during infection for diagnostics or therapy targets
  - Cataloguing & processing info on pharmacology & genetics
- Pharmacogenetics
  - Identifying genetic causes for individualized drug responses
    - Identify genetic variation (e.g., SNPs) characteristic of particular patient response profiles
    - Use to improve administration & development of therapies
    - Identify receptive patient subsets, optimize drug dosages
- Lots of data mining…