Lecture 4 - Multiple Sequence Alignment | CMSC 828T, Study notes of Computer Science

University of Maryland Computer Science

Material Type: Notes; Class: ROBERTS LEARN MANIP ACT; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

koofers-user-j3w 🇺🇸

CMSC 838T – Lecture 4

- Multiple sequence alignment (MSA)
  - Alignment containing multiple DNA / protein sequences
  - Look for conserved regions → similar function

Multiple Sequence Alignments - Motivation

- Identify highly conserved residues
  - Likely to be essential sites for structure / function
  - More precision from multiple sequences
  - Better structure / function prediction, pairwise alignments
- Building gene / protein families
  - Use conserved regions to guide search
- Basis for phylogenetic analysis
  - Infer evolutionary relationships between genes
- Develop primers & probes
  - Use conserved region to develop
    - Primers for PCR
    - Probes for DNA microarrays

Multiple Sequence Alignment (MSA)

- Outline
  - Basic concepts & terms
  - Global alignment
    - Optimal – dynamic programming (MSA)
    - Progressive – pairwise (PILEUP, CLUSTALW)
    - Iterative progressive (MULTALIN)
    - Block-based (DIALIGN)
  - Local alignment (motif finding)
    - Patterns (MOTIF, PROTOMAT)
    - Statistical profiles (HMMER2, PSI-BLAST)
  - Viewing & editing multiple sequence alignments

Terminology (for Proteins)

- Family
  - Group of proteins of similar biochemical function with (roughly) > 50% sequence identity when aligned
  - Family is transitive, even if sequence identity < 50%
    - A → B and B → C implies A → C
  - 1940 protein families in Protein DataBank (v1.61, Nov 2002)
- Superfamily
  - Group of protein families related by distant yet detectable sequence similarity
  - 1100 protein superfamilies in Protein DataBank (v1.61)

Multiple Sequence Alignment - Block

- Ungapped conserved sequence pattern
- Types of blocks
  - Exact – composed of identical segments
  - Uniform – found in every sequence
  - Consistent – can be part of global MSA

[Figure: four example sequences illustrating blocks; labels in the figure: "Diagonal", "non-consistent"]

  AATGTGTGAGACTC
  CTTAGTGTACCACG
  AACATGTATACGTACG
  ACGTGCCTACTA
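The block types above can be illustrated with a minimal sketch. It assumes a fixed block width and reads "exact" as identical segments, "uniform" as present in every sequence, and "consistent" (for a pair of blocks) as appearing in the same left-to-right order in every sequence. The function names and the toy sequences are hypothetical, not from the lecture.

```python
def exact_uniform_blocks(seqs, width):
    """Find identical ungapped segments of a given width present in every sequence.

    Returns {block: [start position in each sequence]} using the first
    occurrence of the block in each sequence.
    """
    def kmers(s):
        return {s[i:i + width] for i in range(len(s) - width + 1)}

    shared = set.intersection(*(kmers(s) for s in seqs))
    return {block: [s.find(block) for s in seqs] for block in sorted(shared)}


def is_consistent(starts_a, starts_b):
    """Simplified check: two blocks keep the same relative order in all sequences."""
    return (all(a < b for a, b in zip(starts_a, starts_b))
            or all(a > b for a, b in zip(starts_a, starts_b)))


# Toy sequences (hypothetical) sharing the exact block "ACGT"
toy = ["GGACGTAA", "TTACGTCC", "ACGTGGGG"]
blocks = exact_uniform_blocks(toy, width=4)
```

Two blocks whose order is swapped in at least one sequence fail `is_consistent`, which is one way to read the "non-consistent" label in the figure.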

MSA - Global vs. Local Alignment

- Protein structure
  - Derived from a limited number of building blocks (domains) that have been mixed and shuffled through evolution
  - Proteins can thus share a global or local relationship
- Global sequence alignment
  - Alignment over entire sequence (near same length)
- Local sequence alignment
  - Alignment over part(s) of sequence

[Figure: global alignment vs. local alignment]

Multiple Sequence Alignment - Approaches

- Progressive global alignment
  - If sequences related over entire length
- Block-based global alignment
  - If related by large consistent blocks
- Local alignment
  - If related by small non-consistent blocks

Multiple Sequence Alignments - Issues

- No single “correct” answer
  - Ideally, find the single evolutionarily correct alignment
  - In practice, evolutionary history must be inferred
  - Try to find sequence alignment → good structure alignment
- Typical alignment target
  - Protein family with ~30% identity
- Computationally expensive
  - “Optimal” solution exponential in number of sequences
  - Greater reliance on (greedy) heuristics
- Benefits from user interaction
  - Select which sequences to include in alignment
  - Select which regions to align
  - Edit resulting alignments

Global MSA - Approach

- General approach to global MSA
  1. Find sequences to align (e.g., result of pairwise search)
  2. Locate region(s) of similar length to include in alignment
  3. Apply global alignment algorithm
  4. Refine alignment (repeat as needed)
     1. Inspect resulting alignment
        - Identify conserved physical / chemical properties
     2. Remove seriously misaligned sequences
     3. Reapply algorithm
     4. Add back remaining sequences
        - While preserving key features of alignment
  5. If looking for local alignment, reduce sequence length to highly conserved regions, align to conserved region

Global MSA - Dynamic Programming (DP)

- “Optimal” dynamic programming [Sankoff+ 1983]
  - Assume k sequences of length n
  - Attempt to maximize sum-of-pairs (SP) score
  - Build F, a k-dimensional table of length n+1 ($n^k$ elements)
  - Recursive formula → $F(i) = \max\big( F(i-1) + SP(\mathrm{column}_i) \big)$
- Complexity
  - $O(n^k)$ entries to fill
  - Each entry combines $O(2^k)$ other entries
  - Total cost = $O(2^k n^k)$
- Bounded search (MSA) [Carrillo & Lipman 1988]
  - Apply heuristic alignment, use resulting SP to bound search
  - Significant speed improvement, still limited to small values of k
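The recurrence above can be made concrete with a small sketch of the unbounded k-dimensional dynamic program. It is only an illustration under stated assumptions: the match / mismatch / gap values are hypothetical, gap-gap pairs score zero, and the function returns the best SP score without reconstructing the alignment. Each table cell looks at up to 2^k - 1 predecessor cells, which is where the exponential cost comes from.

```python
from functools import lru_cache
from itertools import combinations, product

# Hypothetical scoring values, chosen only for illustration
MATCH, MISMATCH, GAP = 1, -1, -2


def sp_column(column):
    """Sum-of-pairs score of one alignment column ('-' marks a gap)."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                      # assumption: gap-gap pairs score 0
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATCH if a == b else MISMATCH
    return total


def msa_score(seqs):
    """Best SP score over all global alignments of seqs (exponential in k)."""
    k = len(seqs)
    # Every non-zero 0/1 vector says which sequences consume a residue
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def F(idx):
        if not any(idx):                  # origin of the k-dimensional table
            return 0
        best = None
        for m in moves:
            prev = tuple(i - d for i, d in zip(idx, m))
            if min(prev) < 0:
                continue                  # move would step outside the table
            column = tuple(s[i - 1] if d else "-"
                           for s, i, d in zip(seqs, idx, m))
            cand = F(prev) + sp_column(column)
            if best is None or cand > best:
                best = cand
        return best

    return F(tuple(len(s) for s in seqs))
```

The recursion is memoized rather than filled iteratively, so this sketch is only suitable for a few very short sequences.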

Global MSA – Progressive Global Alignment

- Motivation
  - Reduce cost by building global alignment incrementally
- Approach
  1. Compute distance between all pairs of sequences
  2. Build simple guide tree reflecting distance between sequences
     - Use UPGMA (PILEUP) OR neighbor-joining (CLUSTALW)
  3. Align sequences following guide tree, starting at leaves
     - Align consensus sequences OR profiles
     - Use optimal or heuristic pairwise algorithms
     - Attempt to place gaps between conserved regions
- Problems
  - Greedy approach dependent on initial pairwise alignments
  - Cannot fix early mistakes (gaps cannot be removed)
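Steps 1 and 2 of the approach above (pairwise distances, then a guide tree) can be sketched with UPGMA. This is a simplified illustration, not the PILEUP or CLUSTALW implementation: it assumes equal-length sequences and uses the fraction of differing positions as the distance, where a real tool would derive distances from pairwise alignments. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions; assumes a and b have equal length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Build a guide tree (nested tuples) from a dict {name: sequence}."""
    sizes = {name: 1 for name in seqs}    # cluster -> number of leaves
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    while len(sizes) > 1:
        # Greedily merge the two closest clusters
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # New distance is the size-weighted average of the old distances
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


toy = {"A": "ACGT", "B": "ACGA", "C": "TTGA"}
tree = upgma(toy)   # A and B are closest, so they are joined first
```

The progressive aligner would then walk this tree from the leaves upward, aligning A with B first and adding C to the resulting profile.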

Global MSA – CLUSTALW

- Algorithm [Thompson+ 1994]
  - Calculate evolutionary distances from alignment scores
  - Performs pairwise alignment of profiles (probabilities of residues at each position) using dynamic programming
  - Later calculates consensus sequence from profile
- Heuristics for improving multiple alignments
  - Weight sequences to compensate for biased representation
  - Scoring matrix chosen based on expected similarity from tree
    - E.g., nearby → BLOSUM 80, distant → BLOSUM 50
  - Gap penalty modified by residue (function) at position
    - E.g., higher gap penalty for hydrophobic residues
  - Gap penalty higher if first gap in column & nearby gaps
  - Dynamically adjust guide tree to defer poor alignments

Local MSA (Motif-finding) - Approach

- Motivation
  - Find local regions of high similarity (motifs)
  - Align based on motifs
- Approach
  - Find motifs
    - Patterns
    - Blocks
    - Statistical profiles
      - Position-specific scoring matrix (PSSM)
      - Hidden Markov model (HMM)
  - Align sequences
    - Preserve motifs as much as possible

Terminology (for Protein Sequences)

- Pattern
  - Deterministic syntax describing well-conserved region
- Profile
  - Probabilistic syntax describing well-conserved region
  - Score-based representations
    - Position-specific scoring matrix (PSSM)
    - Hidden Markov model (HMM)
- Pattern & profile
  - Can be used to search for motifs / domains of biological significance that characterize protein family
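As an example of a deterministic pattern, the sketch below translates a small subset of PROSITE-style pattern syntax into a Python regular expression. The lecture does not name PROSITE, so this choice of syntax is an assumption, and `prosite_to_regex` is a hypothetical helper that handles only `x`, `[..]`, `{..}`, and `(n)` / `(n,m)` repeats.

```python
import re


def prosite_to_regex(pattern):
    """Convert a subset of PROSITE-style syntax to a regex string.

    Supported: single residues, x (any residue), [ABC] (one of),
    {ABC} (none of), and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # [ABC] and single letters are already valid regex syntax
        if lo:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


# N-glycosylation site pattern: N, not P, S or T, not P
glyco = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(glyco, "AANGSAA")
```

A pattern either matches or it does not, which is the contrast with the score-based profiles listed above.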

Significance of Patterns / Motifs

- DNA
  - Recognition sites of restriction endonucleases
  - Codons specifying the amino acid sequence of a protein
  - Intron splice sites
  - Promoter
  - Binding sites for regulatory proteins which activate or repress transcription
- Proteins
  - Presence of active sites
  - Prediction of protein secondary structure
  - Presence of signals used to localize the protein in the cell

Local MSA - Statistical Profile

- Position-specific scoring matrix (PSSM)
  - Summary representation for (aligned) conserved region
  - Stores probability of element at each position in sequence
  - Entries usually stored in log-odds form
  - Weight entries by 1) average proportion, 2) evolutionary dist.
  - Consensus → most likely base / residue at each position

Example alignment:

  A A C
  A C G
  A - T
  A A T

Resulting profile:

  Pos  Cons  A     C     G     T     Gap
  1)   A     1.00  0.00  0.00  0.00  -
  2)   A     0.75  0.25  0.00  0.00  -
  3)   T     0.00  0.25  0.25  0.50  -
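A minimal sketch of how such a profile could be computed from the four example sequences. It assumes gaps are simply skipped when counting, so the frequencies for the gapped second column need not match the table above, and it uses a uniform background of 0.25 with a pseudocount for the log-odds form. Both choices are assumptions rather than the lecture's stated method.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def column_frequencies(alignment, pseudocount=0.0):
    """Per-column base frequencies; gap characters are ignored (assumption)."""
    freqs = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        freqs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return freqs


def consensus(freqs):
    """Most likely base at each position."""
    return "".join(max(ALPHABET, key=col.get) for col in freqs)


def log_odds(freqs, background=0.25):
    """Log2 ratio of observed frequency to a uniform background.

    Requires non-zero frequencies, so build freqs with pseudocount > 0.
    """
    return [{b: math.log2(p / background) for b, p in col.items()}
            for col in freqs]


alignment = ["AAC", "ACG", "A-T", "AAT"]
freqs = column_frequencies(alignment)
scores = log_odds(column_frequencies(alignment, pseudocount=1.0))
```

Scoring a candidate site then amounts to summing the log-odds entries for the base observed at each position.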

Local MSA - Statistical Profile

- Hidden Markov model (HMM)
  - Statistical summary representation for conserved region
  - Model stores probability of match, mismatch, insertions, and deletions at each position in sequence
  - Alignment of conserved region not necessary, but helpful

[Figure: profile HMM for the example alignment AAC / ACG / A-T / AAT, showing begin and end states, match states, insert states, delete states, and emissions over A, C, G, T]

Local MSA - HMMs

- HMM construction (HMMER2, PSI-BLAST)
  1. Initialize model with estimated amino acid transition probabilities, using PAM / BLOSUM / Dirichlet mixtures
  2. For each “training” sequence containing conserved region
     - Find all possible paths for sequence through model
       (forward-backward algorithm = O(size × sequences))
     - Increase weight (probability) of each path taken
- HMM properties
  - Ideally use 20-100 training sequences to build model
    - With better initialization, smaller “training set” sufficient
  - Well grounded in probability theory (statistical significance)
  - Explicit gap penalties not needed (automatically trained)
  - Can extract consensus sequence (w/ dynamic programming)
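The forward half of the forward-backward algorithm mentioned above sums the probability of a sequence over all paths through the model. The sketch below uses a generic two-state HMM rather than a full profile HMM with match / insert / delete states at each position. The state names and all probabilities are made-up toy values.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Probability of the observations summed over all state paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Toy model (hypothetical numbers): "match" prefers A, "insert" is uniform
STATES = ("match", "insert")
START = {"match": 0.9, "insert": 0.1}
TRANS = {
    "match": {"match": 0.8, "insert": 0.2},
    "insert": {"match": 0.6, "insert": 0.4},
}
EMIT = {
    "match": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "insert": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

p_aac = forward("AAC", STATES, START, TRANS, EMIT)
```

Training adds a backward pass and re-estimates the transition and emission probabilities from the expected path counts. That step is omitted here.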

Calibrating Profiles for PSSM & HMM

- Profiling methods (PSSM, HMM)
  - Training set used to build profile may be biased / skewed
    - Over-represented sequences (common motifs)
    - Under-represented sequences (rare residues)
  - Resulting profile matches training set, not desired motif
- Weighting / calibration
  - Differentially weight sequences to compensate for non-representative sampling in training set
  - Similar sequences → lower weights
  - Rare sequences → higher weights
  - Maximum discrimination → set of weights that best differentiate between real matches and background noise
- Simulated annealing to avoid local maxima
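One simple way to give similar sequences lower weights and rare sequences higher weights is position-based weighting in the style of Henikoff & Henikoff. The lecture does not name a specific scheme, so treat this as an illustrative substitute. The sketch treats the gap character as just another symbol, which is an assumption.

```python
from collections import Counter


def position_based_weights(alignment):
    """Normalized sequence weights from per-column residue diversity.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct symbols in the column and n is how many sequences share
    this sequence's symbol. Weights are summed over columns, then scaled
    to sum to 1.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for i, residue in enumerate(column):
            weights[i] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


# Three identical sequences and one outlier (toy data)
toy_weights = position_based_weights(["AAC", "AAC", "AAC", "TTG"])
```

In this toy set the three identical sequences share half of the total weight and the single outlier receives the other half.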

Searching Based on MSA Profile

- Advantages
  - Searches based on domain, not sequences
  - Greatly improved sensitivity in practice
- Dependent on user selection / deletion of sequences
  - Once included in profile, sequence will score well
  - Including false positives (mismatches) reduces accuracy
    - Can use pairwise alignment to compare sequences
    - Demonstrate sequence can mutate into other sequences

Viewing & Editing Multiple Alignments

- Motivation
  - Use multiple sequence alignment as starting point
  - Improve usability / readability
    - Format alignments
    - Add annotations
  - Improve alignments manually with expert knowledge
    - Find biologically significant regions
- Multiple sequence alignment tools
  - Viewers
    - ClustalX, Jalview, Cinema, Sequence logos
  - Editors / annotation
    - SeqVu, MACAW

Viewing Multiple Sequence Alignments

- Coloring scheme
  - Helps better visualize conserved regions
- Example color code
  - AVFPMILW: RED, Small
    - (small + hydrophobic (including aromatic -Y))
  - DE: BLUE, Acidic
  - RHK: MAGENTA, Basic
  - STYHCNGQ: GREEN, Hydroxyl + Amine + Basic – Q
  - Others: Grey
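The color code above maps directly onto a lookup table. In this sketch the groups are checked in the order listed, so a residue that appears in two groups (H is in both RHK and STYHCNGQ) takes the first match. That precedence is an assumption, and `residue_color` is a hypothetical helper.

```python
# Groups in the order given in the example color code
COLOR_GROUPS = [
    ("AVFPMILW", "red"),      # small + hydrophobic
    ("DE", "blue"),           # acidic
    ("RHK", "magenta"),       # basic
    ("STYHCNGQ", "green"),    # hydroxyl + amine + basic
]


def residue_color(residue):
    """Return the display color for a one-letter amino acid code."""
    for members, color in COLOR_GROUPS:
        if residue.upper() in members:
            return color
    return "grey"             # everything else, including gaps


def color_column(column):
    """Colors for one alignment column, e.g. to spot conserved chemistry."""
    return [residue_color(r) for r in column]
```

A column whose residues all map to the same color is conserved in chemical character even when the residues themselves differ.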

Multiple Sequence Alignment (MSA)

- Summary
  - Many multiple sequence alignment algorithms
  - Most global alignment algorithms too expensive
    - Exception - progressive pairwise alignment (heuristic)
  - Local alignment algorithms try to find essential conserved regions
    - Can be very simple (matching motifs)
    - Or use heavy-duty statistical analysis models
  - Searches using MSA more sensitive than pairwise alignments
  - When using MSA to search / edit motifs
    - Knowledge of biochemistry provides major advantage
