Lecture 4 - Multiple Sequence Alignment | CMSC 828T, Study notes of Computer Science

Material Type: Notes; Class: ROBERTS LEARN MANIP ACT; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;

Uploaded on 02/13/2009

CMSC 838T – Lecture 4

- Multiple sequence alignment (MSA)
  - Alignment containing multiple DNA / protein sequences
  - Look for conserved regions → similar function

Multiple Sequence Alignments - Motivation

- Identify highly conserved residues
  - Likely to be essential sites for structure / function
  - More precision from multiple sequences
  - Better structure / function prediction, pairwise alignments
- Building gene / protein families
  - Use conserved regions to guide search
- Basis for phylogenetic analysis
  - Infer evolutionary relationships between genes
- Develop primers & probes
  - Use conserved region to develop
    - Primers for PCR
    - Probes for DNA microarrays

Multiple Sequence Alignment (MSA)

- Outline
  - Basic concepts & terms
  - Global alignment
    - Optimal – dynamic programming (MSA)
    - Progressive – pairwise (PILEUP, CLUSTALW)
    - Iterative progressive (MULTALIN)
    - Block-based (DIALIGN)
  - Local alignment (motif finding)
    - Patterns (MOTIF, PROTOMAT)
    - Statistical profiles (HMMER2, PSI-BLAST)
  - Viewing & editing multiple sequence alignments

Terminology (for Proteins)

- Family
  - Group of proteins of similar biochemical function with (roughly) > 50% sequence identity when aligned
  - Family is transitive, even if sequence identity < 50%
    - A → B and B → C implies A → C
  - 1940 protein families in Protein DataBank (v1.61, Nov 2002)
- Superfamily
  - Group of protein families related by distant yet detectable sequence similarity
  - 1100 protein superfamilies in Protein DataBank (v1.61)
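The "> 50% sequence identity when aligned" criterion can be made concrete with a small sketch. The function name `percent_identity` is hypothetical, and the choice to skip any column containing a gap is an assumption; other denominators (for example, the full alignment length) are also in common use.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity of two already-aligned sequences.

    Assumption: columns where either sequence has a gap are skipped,
    so the denominator is the number of gap-free columns.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap columns do not count toward identity
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Two rows of a toy alignment; real inputs would come from an aligner.
    print(percent_identity("A-T", "AAT"))
```

Under the rule of thumb on this slide, a pair scoring above roughly 50 would be a candidate for the same family.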

Multiple Sequence Alignment - Block

- Ungapped conserved sequence pattern
- Types of blocks
  - Exact – composed of identical segments
  - Uniform – found in every sequence
  - Consistent – can be part of global MSA

[Figure: four example DNA sequences with shared blocks marked; labels: "non-consistent", "Diagonal"]

  A A T G T G T G A G A C T C
  C T T A G T G T A C C A C G
  A A C A T G T A T A C G T A C G
  A C G T G C C T A C T A

MSA - Global vs. Local Alignment

- Protein structure
  - Derived from a limited number of building blocks (domains) that have been mixed and shuffled through evolution
  - Proteins can thus share a global or local relationship
- Global sequence alignment
  - Alignment over entire sequence (near same length)
- Local sequence alignment
  - Alignment over part(s) of sequence

[Figure: global alignment vs. local alignment]

Multiple Sequence Alignment - Approaches

- Progressive global alignment
  - If sequences related over entire length
- Block-based global alignment
  - If related by large consistent blocks
- Local alignment
  - If related by small non-consistent blocks

Multiple Sequence Alignments - Issues

- No single "correct" answer
  - Ideally, find the single evolutionarily correct alignment
  - In practice, evolutionary history must be inferred
  - Try to find a sequence alignment that gives a good structure alignment
- Typical alignment target
  - Protein family with ~30% identity
- Computationally expensive
  - "Optimal" solution is exponential in the number of sequences
  - Greater reliance on (greedy) heuristics
- Benefits from user interaction
  - Select which sequences to include in alignment
  - Select which regions to align
  - Edit resulting alignments

Global MSA - Approach

- General approach to global MSA
  1. Find sequences to align (e.g., result of pairwise search)
  2. Locate region(s) of similar length to include in alignment
  3. Apply global alignment algorithm
  4. Refine alignment (repeat as needed)
    1. Inspect resulting alignment
      - Identify conserved physical / chemical properties
    2. Remove seriously misaligned sequences
    3. Reapply algorithm
    4. Add back remaining sequences
      - While preserving key features of alignment
  5. If looking for local alignment, reduce sequence length to highly conserved regions, align to conserved region

Global MSA - Dynamic Programming (DP)

- "Optimal" dynamic programming [Sankoff+ 1983]
  - Assume k sequences of length n
  - Attempt to maximize sum-of-pairs (SP) score
  - Build F, a k-dimensional table of length n+1 per dimension (about n^k elements)
  - Recursive formula → F(i) = max( F(i-1) + SP(column_i) )
- Complexity
  - O(n^k) entries to fill
  - Each entry combines O(2^k) other entries
  - Total cost = O(2^k n^k)
- Bounded search (MSA) [Carrillo & Lipman 1988]
  - Apply heuristic alignment, use resulting SP to bound search
  - Significant speed improvement, still limited to small values of k
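The sum-of-pairs (SP) objective can be sketched for an alignment that already exists. This only evaluates the score the DP tries to maximize; it does not search for the optimal alignment. The match / mismatch / gap values and the names `pair_score`, `sp_column`, and `sp_score` are illustrative assumptions, not values from the lecture.

```python
from itertools import combinations


def pair_score(x: str, y: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of symbols (assumed toy scoring scheme)."""
    if x == "-" and y == "-":
        return 0  # a gap aligned with a gap is conventionally free
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_column(column) -> int:
    """Sum pair_score over every unordered pair of symbols in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))


def sp_score(alignment) -> int:
    """Sum-of-pairs score of a whole alignment (list of equal-length rows)."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    return sum(sp_column(col) for col in zip(*alignment))


if __name__ == "__main__":
    rows = ["AAC", "ACG", "A-T", "AAT"]  # toy 4-sequence alignment
    print(sp_score(rows))
```

Each column costs k(k-1)/2 pair comparisons, which is a polynomial factor on top of the exponential table size listed under Complexity.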

Global MSA – Progressive Global Alignment

- Motivation
  - Reduce cost by building global alignment incrementally
- Approach
  1. Compute distance between all pairs of sequences
  2. Build simple guide tree reflecting distance between sequences
    - Use UPGMA (PILEUP) OR neighbor-joining (CLUSTALW)
  3. Align sequences following guide tree, starting at leaves
    - Align consensus sequences OR profiles
    - Use optimal or heuristic pairwise algorithms
    - Attempt to place gaps between conserved regions
- Problems
  - Greedy approach dependent on initial pairwise alignments
  - Cannot fix early mistakes (gaps cannot be removed)
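Step 2 of the approach (building a guide tree) can be sketched with a minimal UPGMA clustering over a precomputed distance table. The function name `upgma`, the nested-tuple tree representation, and the toy distances are all assumptions made for illustration; this is not how PILEUP or CLUSTALW store their trees.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by UPGMA.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair.
    Returns the tree as nested tuples, e.g. (("A", "B"), ("C", "D")).
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)                       # working copy of the distances
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))


if __name__ == "__main__":
    toy = {
        frozenset(("A", "B")): 2, frozenset(("C", "D")): 4,
        frozenset(("A", "C")): 10, frozenset(("A", "D")): 10,
        frozenset(("B", "C")): 10, frozenset(("B", "D")): 10,
    }
    print(upgma(["A", "B", "C", "D"], toy))
```

A progressive aligner would then walk this tree from the leaves upward, aligning the two children of each node. Because each merge is final, the sketch also shows why early mistakes cannot be undone.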

Global MSA – CLUSTALW

- Algorithm [Thompson+ 1994]
  - Calculate evolutionary distances from alignment scores
  - Performs pairwise alignment of profiles (probabilities of residues at each position) using dynamic programming
  - Later calculates consensus sequence from profile
- Heuristics for improving multiple alignments
  - Weight sequences to compensate for biased representation
  - Scoring matrix chosen based on expected similarity from tree
    - E.g., nearby → BLOSUM 80, distant → BLOSUM 50
  - Gap penalty modified by residue (function) at position
    - E.g., higher gap penalty for hydrophobic residues
  - Gap penalty higher if first gap in column & nearby gaps
  - Dynamically adjust guide tree to defer poor alignments

Local MSA (Motif-finding) - Approach

- Motivation
  - Find local regions of high similarity (motifs)
  - Align based on motifs
- Approach
  - Find motifs
    - Patterns
    - Blocks
    - Statistical profiles
      - Position-specific scoring matrix (PSSM)
      - Hidden Markov model (HMM)
  - Align sequences
    - Preserve motifs as much as possible

Terminology (for Protein Sequences)

- Pattern
  - Deterministic syntax describing well-conserved region
- Profile
  - Probabilistic syntax describing well-conserved region
  - Score-based representations
    - Position-specific scoring matrix (PSSM)
    - Hidden Markov model (HMM)
- Pattern & profile
  - Can be used to search for motifs / domains of biological significance that characterize protein family
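To illustrate what "deterministic syntax" means for a pattern, here is a sketch that converts a small subset of PROSITE-style notation into a Python regular expression. The slides do not name PROSITE, so the choice of syntax is an assumption, and both the helper name `prosite_to_regex` and the example pattern are made up for illustration.

```python
import re

# One pattern element: a residue, x (any), [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE-style syntax into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVM]")  # hypothetical motif
    hit = re.search(regex, "AACGGCAAAL")
    print(regex, hit.span() if hit else None)
```

A pattern either matches or it does not. A profile instead assigns a score to every candidate position, which is what makes it probabilistic.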

Significance of Patterns / Motifs

- DNA
  - Recognition sites of restriction endonucleases
  - Codons specifying the amino acid sequence of a protein
  - Intron splice sites
  - Promoter
  - Binding sites for regulatory proteins which activate or repress transcription
- Proteins
  - Presence of active sites
  - Prediction of protein secondary structure
  - Presence of signals used to localize the protein in the cell

Local MSA - Statistical Profile

- Position-specific scoring matrix (PSSM)
  - Summary representation for (aligned) conserved region
  - Stores probability of element at each position in sequence
  - Entries usually stored in log-odds form
  - Weight entries by 1) average proportion, 2) evolutionary distance
  - Consensus → most likely base / residue at each position

Example alignment:

  A A C
  A C G
  A - T
  A A T

Resulting profile:

  Pos  Cons  A     C     G     T     Gap
  1)   A     1.00  0.00  0.00  0.00  -
  2)   A     0.75  0.25  0.00  0.00  -
  3)   T     0.00  0.25  0.25  0.50  -
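A minimal sketch of building such a profile from an alignment follows. Gap handling is an assumption here (gaps are simply ignored when counting), so the frequencies for a column containing a gap need not match the slide's table. The uniform background of 0.25 and the pseudocount are also illustrative choices, and all function names are hypothetical.

```python
import math
from collections import Counter


def column_frequencies(alignment, alphabet="ACGT", pseudocount=0.0):
    """Per-column symbol frequencies; symbols outside the alphabet (gaps) are ignored."""
    freqs = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        if total == 0:
            raise ValueError("column has no countable symbols")
        freqs.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return freqs


def log_odds(freqs, background=0.25):
    """Convert frequencies to log2-odds against a uniform background.

    Requires strictly positive frequencies, so use a pseudocount > 0.
    """
    return [{a: math.log2(p / background) for a, p in col.items()} for col in freqs]


def consensus(freqs):
    """Most frequent symbol at each position."""
    return "".join(max(col, key=col.get) for col in freqs)


if __name__ == "__main__":
    rows = ["AAC", "ACG", "A-T", "AAT"]
    print(consensus(column_frequencies(rows)))
    for position in log_odds(column_frequencies(rows, pseudocount=1.0)):
        print({a: round(s, 2) for a, s in position.items()})
```

Scoring a candidate window against the profile then amounts to summing one log-odds entry per position.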

Local MSA - Statistical Profile

- Hidden Markov model (HMM)
  - Statistical summary representation for conserved region
  - Model stores probability of match, mismatch, insertions, and deletions at each position in sequence
  - Alignment of conserved region not necessary, but helpful

[Figure: profile HMM for the example alignment (A A C / A C G / A - T / A A T), drawn from begin to end with match, insert, and delete states and three A C G T emission tables]

Local MSA - HMMs

- HMM construction (HMMER2, PSI-BLAST)
  1. Initialize model with estimated amino acid transition probabilities, using PAM / BLOSUM / Dirichlet mixtures
  2. For each "training" sequence containing conserved region
    - Find all possible paths for sequence through model
      - Forward-backward algorithm = O(size × sequences)
    - Increase weight (probability) of each path taken
- HMM properties
  - Ideally use 20-100 training sequences to build model
    - With better initialization, smaller "training set" sufficient
  - Well grounded in probability theory (statistical significance)
  - Explicit gap penalties not needed (automatically trained)
  - Can extract consensus sequence (w/ dynamic programming)
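The "all possible paths" computation can be illustrated with the forward half of forward-backward on a generic two-state HMM. This is a deliberately simplified sketch: it is not a profile HMM with match / insert / delete states, it is not HMMER's implementation, and the state names and probabilities below are invented for the example.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Probability of a non-empty observation sequence, summed over all state paths."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model (assumed values): a "conserved" state that prefers A,
# and a "background" state that emits all bases equally.
STATES = ("conserved", "background")
START = {"conserved": 0.5, "background": 0.5}
TRANS = {
    "conserved": {"conserved": 0.9, "background": 0.1},
    "background": {"conserved": 0.2, "background": 0.8},
}
EMIT = {
    "conserved": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

if __name__ == "__main__":
    for seq in ("AAAA", "CGTC"):
        print(seq, forward(seq, STATES, START, TRANS, EMIT))
```

The loop does a fixed amount of work per symbol and per state pair. Training would combine this with the backward pass to re-estimate the probabilities.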

Calibrating Profiles for PSSM & HMM

- Profiling methods (PSSM, HMM)
  - Training set used to build profile may be biased / skewed
    - Over-represented sequences (common motifs)
    - Under-represented sequences (rare residues)
  - Resulting profile matches training set, not desired motif
- Weighting / calibration
  - Differentially weight sequences to compensate for non-representative sampling in training set
  - Similar sequences → lower weights
  - Rare sequences → higher weights
  - Maximum discrimination → set of weights that best differentiate between real matches and background noise
- Simulated annealing to avoid local maxima
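One simple way to get "similar sequences → lower weights" is position-based weighting in the style of Henikoff & Henikoff. The slides do not say which scheme is meant, so this is a swapped-in illustration rather than the lecture's method. Treating the gap character as an ordinary symbol is a further assumption, and the function name is hypothetical.

```python
from collections import Counter


def position_based_weights(alignment):
    """Per-sequence weights that down-weight redundant rows.

    In each column, a row earns 1 / (k * n), where k is the number of
    distinct symbols in the column and n is how many rows share this
    row's symbol. Weights are normalized to sum to 1.
    """
    weights = [0.0] * len(alignment)
    for col in zip(*alignment):
        counts = Counter(col)
        k = len(counts)
        for i, symbol in enumerate(col):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two identical rows and one distinct row.
    print(position_based_weights(["AAA", "AAA", "CCC"]))
```

The duplicated rows split the credit for their shared residues, so the distinct row ends up with the largest weight.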

Searching Based on MSA Profile

- Advantages
  - Searches based on domain, not sequences
  - Greatly improved sensitivity in practice
- Dependent on user selection / deletion of sequences
  - Once included in profile, sequence will score well
  - Including false positives (mismatches) reduces accuracy
    - Can use pairwise alignment to compare sequences
    - Demonstrate sequence can mutate into other sequences

Viewing & Editing Multiple Alignments

- Motivation
  - Use multiple sequence alignment as starting point
  - Improve usability / readability
    - Format alignments
    - Add annotations
  - Improve alignments manually with expert knowledge
    - Find biologically significant regions
- Multiple sequence alignment tools
  - Viewers
    - ClustalX, Jalview, Cinema, Sequence logos
  - Editors / annotation
    - SeqVu, MACAW

Viewing Multiple Sequence Alignments

- Coloring scheme
  - Helps better visualize conserved regions
- Example color code
  - AVFPMILW: RED, Small
    - (small + hydrophobic (including aromatic -Y))
  - DE: BLUE, Acidic
  - RHK: MAGENTA, Basic
  - STYHCNGQ: GREEN, Hydroxyl + Amine + Basic – Q
  - Others: Grey
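The example color code maps directly onto a lookup table. In this sketch the names `COLOR_GROUPS`, `residue_color`, and `color_sequence` are hypothetical. Because H appears in both the RHK and STYHCNGQ groups as listed, the code assumes the first listed group wins.

```python
# Groups in the order listed on the slide; earlier groups take priority
# for residues that appear twice (assumption: H is colored magenta).
COLOR_GROUPS = [
    ("AVFPMILW", "red"),      # small + hydrophobic
    ("DE", "blue"),           # acidic
    ("RHK", "magenta"),       # basic
    ("STYHCNGQ", "green"),    # hydroxyl + amine + basic - Q
]

RESIDUE_COLORS = {}
for members, color in COLOR_GROUPS:
    for residue in members:
        RESIDUE_COLORS.setdefault(residue, color)  # keep the first assignment


def residue_color(residue: str) -> str:
    """Color name for one residue; anything unlisted (including gaps) is grey."""
    return RESIDUE_COLORS.get(residue.upper(), "grey")


def color_sequence(sequence: str):
    """List of (residue, color) pairs for one row of an alignment."""
    return [(r, residue_color(r)) for r in sequence]


if __name__ == "__main__":
    print(color_sequence("MDK-S"))
```

A viewer would apply the same lookup to every cell of the alignment, so that a conserved column shows up as a vertical stripe of one color.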

Multiple Sequence Alignment (MSA)

- Summary
  - Many multiple sequence alignment algorithms
  - Most global alignment algorithms too expensive
    - Exception - progressive pairwise alignment (heuristic)
  - Local alignment algorithms try to find essential conserved regions
    - Can be very simple (matching motifs)
    - Or use heavy-duty statistical analysis models
  - Searches using MSA more sensitive than pairwise alignments
  - When using MSA to search / edit motifs
    - Knowledge of biochemistry provides major advantage