











Material Type: Notes; Class: ROBERTS LEARN MANIP ACT; Subject: Computer Science; University: University of Maryland; Term: Unknown 1989;
Typology: Study notes












CMSC 838T – Lecture 4
- Alignment containing multiple DNA / protein sequences
- Look for conserved regions → similar function
- Likely to be essential sites for structure / function
- More precision from multiple sequences
- Better structure / function prediction, pairwise alignments
- Use conserved regions to guide search
- Infer evolutionary relationships between genes
- Use conserved region to develop
  - Primers for PCR
  - Probes for DNA microarrays
- Basic concepts & terms
- Global alignment
  - Optimal – dynamic programming (MSA)
  - Progressive – pairwise (PILEUP, CLUSTALW)
  - Iterative progressive (MULTALIN)
  - Block-based (DIALIGN)
- Local alignment (motif finding)
  - Patterns (MOTIF, PROTOMAT)
  - Statistical profiles (HMMER2, PSI-BLAST)
- Viewing & editing multiple sequence alignments
Protein family
- Group of proteins of similar biochemical function with (roughly) > 50% sequence identity when aligned
- Family is transitive, even if sequence identity < 50%
  - A ≈ B and B ≈ C implies A ≈ C
- 1940 protein families in Protein DataBank (v1.61, Nov 2002)
Protein superfamily
- Group of protein families related by distant yet detectable sequence similarity
- 1100 protein superfamilies in Protein DataBank (v1.61)
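The transitivity rule for families can be sketched as single-linkage grouping: any two sequences above the identity threshold are linked, and links are chained. This is only an illustrative sketch. The function names, the 50% default, and the way percent identity is computed (matches over aligned columns, skipping gap-gap columns) are assumptions, not definitions from the lecture.

```python
def percent_identity(a, b):
    """Percent identity of two aligned, equal-length sequences.

    Assumption: columns where both sequences have a gap are skipped,
    and a gap never counts as a match.
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(pairs)


def group_families(aligned, threshold=50.0):
    """Group sequence indices so that membership is transitive (union-find)."""
    n = len(aligned)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if percent_identity(aligned[i], aligned[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# A and B share 60% identity, B and C share 60%, but A and C only 20%.
example = ["AAAAAAAAAA", "AAAAAACCCC", "AACCCCCCCC", "GGGGGGGGGG"]
families = group_families(example)
```

In this toy input the first and third sequences land in the same group only through the second one, which is the "A ≈ B and B ≈ C implies A ≈ C" case.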
- Exact – composed of identical segments
- Uniform – found in every sequence
- Consistent – can be part of global MSA
[Figure labels: non-consistent, Diagonal]
- Proteins are derived from a limited number of building blocks (domains) that have been mixed and shuffled through evolution
- Proteins can thus share a global or local relationship
- Global alignment: alignment over entire sequence (sequences of nearly the same length)
- Local alignment: alignment over part(s) of sequence
[Figure: Global Alignment vs. Local Alignment]
- If sequences related over entire length
- If related by large consistent blocks
- If related by small non-consistent blocks
- Ideally, find single evolutionarily correct alignment
- In practice, evolutionary history must be inferred
- Try to find sequence alignment ≈ good structure alignment
- Protein family with ~30% identity
- "Optimal" solution exponential in number of sequences
- Greater reliance on (greedy) heuristics
- Select which sequences to include in alignment
- Select which regions to align
- Edit resulting alignments
- Assume k sequences of length n
- Attempt to maximize sum-of-pairs (SP) score
- Build F, a k-dimensional table of length n+1 in each dimension (about n^k elements)
- Recursive formula: F(i) = max( F(i-1) + SP(column_i) ), with the max taken over the predecessor cells of i
- O(n^k) entries to fill
- Each entry combines O(2^k) other entries
- Total cost = O(2^k n^k)
- Apply heuristic alignment, use resulting SP to bound search
- Significant speed improvement, still limited to small values of k
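As a small illustration of the objective and the cost figures above, the sketch below computes the sum-of-pairs score of a finished alignment and estimates the work of the full dynamic program. The match / mismatch / gap values are placeholder assumptions (a real scorer would use a substitution matrix), and the function names are made up for this example.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols; the numeric values are illustrative only."""
    if a == "-" and b == "-":
        return 0          # two gaps in the same column cost nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum-of-pairs score of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(aligned):
    """Sum-of-pairs score of a whole alignment (equal-length rows)."""
    return sum(sp_column(col) for col in zip(*aligned))


def dp_work(n, k):
    """Rough operation count: (n+1)^k cells, each with 2^k - 1 predecessors."""
    return (n + 1) ** k * (2 ** k - 1)


alignment = ["ACG", "A-G", "ATG"]
total = sp_score(alignment)
```

For three sequences of length 100, `dp_work(100, 3)` is already about seven million cell updates, which hints at why the exact method only works for small k.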
- Reduce cost by building global alignment incrementally
- Greedy approach dependent on initial pairwise alignments
- Cannot fix early mistakes (gaps cannot be removed)
- Calculate evolutionary distances from alignment scores
- Performs pairwise alignment of profiles (probabilities of residues at each position) using dynamic programming
- Later calculates consensus sequence from profile
- Weight sequences to compensate for biased representation
- Scoring matrix chosen based on expected similarity from tree
  - E.g., nearby → BLOSUM 80, distant → BLOSUM 50
- Gap penalty modified by residue (function) at position
  - E.g., higher gap penalty for hydrophobic residues
- Gap penalty higher if first gap in column & nearby gaps
- Dynamically adjust guide tree to defer poor alignments
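The incremental idea can be sketched in a few lines: align the first two sequences, then keep aligning each new sequence against the growing alignment, treated as a profile of columns. This is a simplified sketch, not PILEUP or CLUSTALW. It assumes a fixed input order instead of a guide tree, a linear gap penalty, and simple match / mismatch scores, and all names are hypothetical.

```python
GAP = "-"


def column_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Average pairwise score between the symbols of two profile columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == GAP and b == GAP:
                continue
            if a == GAP or b == GAP:
                total += gap
            else:
                total += match if a == b else mismatch
    return total / (len(col_a) * len(col_b))


def align_profiles(rows_a, rows_b, gap=-2):
    """Global (Needleman-Wunsch style) alignment of two sets of aligned rows."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    n, m = len(cols_a), len(cols_b)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1], gap=gap),
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    # Trace back from the corner, inserting all-gap columns where needed.
    gap_a, gap_b = (GAP,) * len(rows_a), (GAP,) * len(rows_b)
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + column_score(
            cols_a[i - 1], cols_b[j - 1], gap=gap
        ):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or F[i][j] == F[i - 1][j] + gap):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def progressive_align(seqs):
    """Add sequences one at a time; gaps placed early are never removed."""
    rows = [seqs[0]]
    for s in seqs[1:]:
        rows = align_profiles(rows, [s])
    return rows


seqs = ["ACGT", "AGT", "ACGTT"]
rows = progressive_align(seqs)
```

Because each step only adds gap columns to what is already aligned, a poor early pairing stays in the final result, which is the "cannot fix early mistakes" limitation noted above.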
- Find local regions of high similarity (motifs)
- Align based on motifs
- Find motifs
  - Patterns
  - Blocks
  - Statistical profiles
    - Position-specific scoring matrix (PSSM)
    - Hidden Markov model (HMM)
- Align sequences
  - Preserve motifs as much as possible
- Pattern: deterministic syntax describing well-conserved region
- Profile: probabilistic syntax describing well-conserved region
- Score-based representations
  - Position-specific scoring matrix (PSSM)
  - Hidden Markov model (HMM)
- Can be used to search for motifs / domains of biological significance that characterize protein family
- Recognition sites of restriction endonucleases
- Codons specifying the amino acid sequence of a protein
- Intron splice sites
- Promoter
- Binding sites for regulatory proteins which activate or repress transcription
- Presence of active sites
- Prediction of protein secondary structure
- Presence of signals used to localize the protein in the cell
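A deterministic pattern maps naturally onto a regular expression. The sketch below is an illustration only: it searches for the EcoRI recognition site (GAATTC) in DNA and for an N-glycosylation-style pattern (N, then any residue except P, then S or T, then any residue except P) in protein. The helper name `find_motif` and the test sequences are made up for this example.

```python
import re


def find_motif(pattern, sequence):
    """Return 0-based start positions of all matches, including overlapping ones.

    The pattern is wrapped in a lookahead so the regex engine does not
    consume characters and can report overlapping occurrences.
    """
    regex = re.compile(f"(?=({pattern}))")
    return [m.start() for m in regex.finditer(sequence)]


ECORI_SITE = "GAATTC"              # restriction endonuclease recognition site
N_GLYCOSYLATION = "N[^P][ST][^P]"  # regex form of the pattern N-{P}-[ST]-{P}

dna_hits = find_motif(ECORI_SITE, "TTGAATTCAAGAATTC")
protein_hits = find_motif(N_GLYCOSYLATION, "MNGSANPTA")
```

A pattern either matches or it does not, which is why probabilistic profiles are the next step when conserved regions tolerate substitutions.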
- Summary representation for (aligned) conserved region
- Stores probability of element at each position in sequence
- Entries usually stored in log-odds form
- Weight entries by 1) average proportion, 2) evolutionary distance
- Consensus – most likely base / residue at each position
[Table: example profile with columns Cons, A, C, G, T, Gap]
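A minimal sketch of such a profile for DNA is shown below: per-column counts are turned into log-odds scores against a background distribution, and the consensus is read off as the best-scoring symbol per column. The uniform background, the pseudocount of 1, ignoring gaps when counting, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds PSSM (one dict per column) from aligned sequences.

    Assumptions: uniform background frequencies, gaps ignored,
    and a flat pseudocount added to every symbol.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            s: math.log2(((counts[s] + pseudocount) / total) / background)
            for s in alphabet
        })
    return pssm


def consensus(pssm):
    """Most likely symbol at each position."""
    return "".join(max(col, key=col.get) for col in pssm)


def score(pssm, segment):
    """Log-odds score of a segment with the same length as the profile."""
    return sum(col[ch] for col, ch in zip(pssm, segment))


training = ["ACGT", "ACGA", "ACCT", "TCGT"]
profile = build_pssm(training)
cons = consensus(profile)
```

Scoring a candidate segment is then a sum of per-position entries, so a segment close to the consensus scores higher than an unrelated one.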
- Statistical summary representation for conserved region
- Model stores probability of match, mismatch, insertions, and deletions at each position in sequence
- Alignment of conserved region not necessary, but helpful
[Figure: HMM with begin and end states, match states, insert states, and delete states; tables over A, C, G, T]
- Ideally use 20-100 training sequences to build model
  - With better initialization, smaller "training set" sufficient
- Well grounded in probability theory (statistical significance)
- Explicit gap penalties not needed (automatically trained)
- Can extract consensus sequence (w/ dynamic programming)
- Training set used to build profile may be biased / skewed
  - Over-represented sequences (common motifs)
  - Under-represented sequences (rare residues)
- Resulting profile matches training set, not desired motif
- Differentially weight sequences to compensate for non-representative sampling in training set
- Similar sequences → lower weights
- Rare sequences → higher weights
- Maximum discrimination → set of weights that best differentiate between real matches and background noise
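One common way to get "similar sequences, lower weights" is position-based weighting in the style of Henikoff and Henikoff. It is used here as an illustrative stand-in, not necessarily the scheme the lecture has in mind. In each column, a sequence receives 1 / (k * n), where k is the number of distinct symbols in the column and n is how many sequences share its symbol. The function name is hypothetical.

```python
from collections import Counter


def position_based_weights(aligned):
    """Return one weight per sequence, normalized to sum to 1."""
    weights = [0.0] * len(aligned)
    for column in zip(*aligned):
        counts = Counter(column)
        distinct = len(counts)
        for idx, symbol in enumerate(column):
            # Symbols shared by many sequences contribute less per sequence.
            weights[idx] += 1.0 / (distinct * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


# Three identical sequences and one outlier.
training = ["ACGT", "ACGT", "ACGT", "TGCA"]
weights = position_based_weights(training)
```

In this toy set the three identical sequences split half of the total weight between them and the single divergent sequence receives the other half.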
- Searches based on domain, not sequences
- Greatly improved sensitivity in practice
- Once included in profile, sequence will score well
- Including false positives (mismatches) reduces accuracy
  - Can use pairwise alignment to compare sequences
  - Demonstrate sequence can mutate into other sequences
- Use multiple sequence alignment as starting point
- Improve usability / readability
  - Format alignments
  - Add annotations
- Improve alignments manually with expert knowledge
  - Find biologically significant regions
- Viewers
  - ClustalX, Jalview, Cinema, Sequence logos
- Editors / annotation
  - SeqVu, MACAW
- Helps better visualize conserved regions
- AVFPMILW: RED, Small (small + hydrophobic (including aromatic - Y))
- DE: BLUE, Acidic
- RHK: MAGENTA, Basic
- STYHCNGQ: GREEN, Hydroxyl + Amine + Basic - Q
- Others: Grey
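The colour table above can be written as a lookup, sketched below. The listing puts H in both the MAGENTA and GREEN groups. This sketch assumes the groups are checked in the order listed, so H comes out MAGENTA. That tie-break and the function names are assumptions, not something the slide states.

```python
# Residue groups in the order given in the table above.
COLOR_GROUPS = [
    ("AVFPMILW", "RED"),
    ("DE", "BLUE"),
    ("RHK", "MAGENTA"),
    ("STYHCNGQ", "GREEN"),
]


def residue_color(residue):
    """Return the colour name for one residue; anything unlisted is GREY."""
    symbol = residue.upper()
    for residues, color in COLOR_GROUPS:
        if symbol in residues:
            return color
    return "GREY"


def color_row(sequence):
    """Colour every position of one aligned row (gaps come out GREY)."""
    return [residue_color(r) for r in sequence]


row_colors = color_row("AD-KS")
```

Applying the same lookup to every row of an alignment makes conserved columns show up as solid stripes of one colour.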
- Many multiple sequence alignment algorithms
- Most global alignment algorithms too expensive
  - Exception: progressive pairwise alignment (heuristic)
- Local alignment algorithms try to find essential conserved regions
  - Can be very simple (matching motifs)
  - Or use heavy-duty statistical analysis models
- Searches using MSA more sensitive than pairwise alignments
- When using MSA to search / edit motifs
  - Knowledge of biochemistry provides major advantage