Two Sequence Alignment & Scoring Matrices
BME 110: CompBio Tools
Todd Lowe April 08, 2008
Admin
• Reading:
  – Chapter 3 should be completed
  – Chapter 5 for Tuesday
• Homework #1 due tomorrow (Fri) 5pm
• Homework #2 assigned Tuesday
Full Genome Dot-Plot
Multiple Genome Alignment
Dot Plots
[Figure: dot plots with axes labeled P. calidifontis, P. arsenaticum, P. islandicum, and P. aerophilum]
Pair-wise Sequence Comparison
• Basis for relating biological information from a well-studied gene to a new sequence
• Many programs exist for pairwise comparison
• Some specialize in fast database searching and get “good” alignments
  – One sequence v. many thousands:
    • BLAST or FASTA
• Some are much slower, but guarantee the “optimal alignment”
  – Smith-Waterman is the de facto standard
Dot-plots: Dotlet
http://myhits.isb-sib.ch/cgi-bin/dotlet
Example:
• In Archaeal Genome browser, bring up Pyrobaculum aerophilum
• Select CRISPR2 region (chr:45,423-46,754) to compare to CRISPR6-7 region (chr:1,898,656-1,899,678)
• Get DNA, paste into Dotlet one at a time, giving descriptive labels; Zoom 1:5
• Are there direct or inverted repeats in each CRISPR (against itself)?
• Relative to each other, are these direct or inverted repeats?
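The idea behind a dot plot can be sketched in a few lines of Python. This is a minimal illustration, not how Dotlet itself is implemented: the function names (`dot_plot`, `render`, `reverse_complement`) and the exact-match `window` parameter are assumptions made for the example. A dot is placed wherever a window of one sequence exactly matches a window of the other. When a sequence is plotted against itself, direct repeats appear as diagonals parallel to the main diagonal. Plotting against the reverse complement is one way to expose inverted repeats.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) positions where the window-length
    substrings of seq_a (starting at i) and seq_b (starting at j) are identical."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the dot matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, base in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(base + " " + row)
    return "\n".join(lines)


def reverse_complement(seq):
    """Reverse complement of a DNA string (uppercase A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


if __name__ == "__main__":
    toy = "GATCGATC"  # hypothetical toy sequence containing a direct repeat of GATC
    print(render(toy, toy, dot_plot(toy, toy, window=1)))
```

Raising `window` filters out chance single-base matches, which plays a similar role to adjusting the window and zoom settings in an interactive tool.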
Assessing Alignment Significance
Most basic rules of thumb:
• Two nucleotide sequences: if at least 70% identical, they are likely homologous
• Two protein sequences: at least 25% identical over a 100 amino acid alignment
• Does not take into account precise length of alignment, or number of gaps!
• Not sufficient to quantitatively rank hits from a database search
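The rules of thumb above can be written down as a small check. This is only a sketch under stated assumptions: the function names are hypothetical, percent identity is computed over all alignment columns (gap columns included in the denominator, which is one of several conventions), and the 100-residue protein condition is applied to the alignment length.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")
    columns = len(aligned_a)
    if columns == 0:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / columns


def passes_rule_of_thumb(aligned_a, aligned_b, kind):
    """Apply the slide's thresholds: 70% for DNA, 25% over >= 100 columns for protein."""
    identity = percent_identity(aligned_a, aligned_b)
    if kind == "dna":
        return identity >= 70.0
    if kind == "protein":
        return identity >= 25.0 and len(aligned_a) >= 100
    raise ValueError("kind must be 'dna' or 'protein'")
```

As the slide notes, a single identity percentage ignores how long the alignment is and how many gaps it needed, so a check like this cannot rank database hits.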
The “Twilight Zone”
• Less than 25% sequence identity for two protein sequences
• May still be homologous, but only similarity of 3-D protein structures can verify similar function (structural comparison tools to detect these discussed later in quarter)
• Must have a good / near optimal alignment for most distantly related proteins
Dynamic Programming
• Fancy term for the type of algorithm used to get the “optimal” or best possible alignment between two sequences
• Needleman and Wunsch (1970): most basic method
  – Gives the “global” (end to end) best alignment
• Smith-Waterman is based closely on this algorithm, but allows for “local” alignments (best subsequence match only)
• See simple example of Global v. Local alignments in book, Figure 3.1 p.
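The difference between global and local dynamic programming can be sketched with one small scoring function. This is a simplified illustration in the spirit of Needleman-Wunsch and Smith-Waterman, not a reproduction of either paper: it assumes a linear gap penalty, returns only the optimal score (no traceback), and the name `align_score` and its default parameters are chosen for the example.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global (end to end) score, Needleman-Wunsch style.
    local=True:  best-scoring subsequence pair, Smith-Waterman style.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # A global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diagonal, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    # Global: bottom-right cell. Local: best cell anywhere in the table.
    return best if local else score[-1][-1]
```

The only differences between the two modes are the first row and column, the floor at zero, and where the final score is read from, which is why the slide describes Smith-Waterman as closely based on the global method.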
Basic Example
• Find best global alignment of two sequences:

    G A T C
    G T G C

• Which is better? Match +1, Mismatch –1, Gap –1

    G A T C
    |     |
    G T G C
    (Score = 0)

  OR

    G A T - C
    |   |   |
    G - T G C
    +1 –1 +1 –1 +1  (Score = 1)
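The two candidate alignments can be scored column by column. The sketch below assumes a gap penalty of -1, which is consistent with the per-column scores shown for the gapped alignment. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score a given alignment; returns (per-column scores, total)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    per_column = []
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            per_column.append(gap)
        elif x == y:
            per_column.append(match)
        else:
            per_column.append(mismatch)
    return per_column, sum(per_column)


if __name__ == "__main__":
    print(score_alignment("GATC", "GTGC"))    # ungapped alignment
    print(score_alignment("GAT-C", "G-TGC"))  # gapped alignment
```

With these parameters the gapped alignment scores 1 and the ungapped one scores 0. Changing the gap penalty to -2 drops the gapped alignment to -1, so the ungapped alignment becomes the better one.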
Moral: Scoring Model Matters!!
• For DNA, model can be very simple:
  – +1 match, -1 mismatch
• However, not all mutations have equal likelihood:
  – Transition: A <–> G or C <–> T (more likely)
  – Transversion: A <–> C or G <–> T (less likely)
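A scoring function that separates transitions from transversions might look like the following sketch. The classification rests on base chemistry: a transition stays within the purines (A, G) or within the pyrimidines (C, T), while a transversion crosses between the two classes. The score values and function names are hypothetical, chosen only so that transitions are penalized less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of DNA bases as match, transition, or transversion."""
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def dna_score(x, y, match=1.0, transition=-0.5, transversion=-1.0):
    """Score a base pair; the default values are illustrative assumptions."""
    values = {"match": match, "transition": transition, "transversion": transversion}
    return values[substitution_type(x, y)]
```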
Same Matrix (*10)

       A   C   G   T
   A   6
   C   1
   G   2
   T   1

Actual values not important, only values relative to each other
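The point that only relative values matter can be checked with a small experiment. The sketch below assumes a symmetric matrix with 6 for a match, 2 for a transition, and 1 for a transversion, which agrees with the surviving A column but is otherwise an assumption. The helper names are hypothetical. Multiplying every entry by the same positive constant leaves the ranking of candidate alignments unchanged. For gapped alignments the gap penalty would have to be scaled too.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def build_matrix(match, transition, transversion):
    """Build a symmetric 4x4 DNA scoring matrix as a dict keyed by base pair."""
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[(x, y)] = match
            elif frozenset((x, y)) in TRANSITIONS:
                matrix[(x, y)] = transition
            else:
                matrix[(x, y)] = transversion
    return matrix


def ungapped_score(a, b, matrix):
    """Sum matrix scores over the columns of an ungapped alignment."""
    return sum(matrix[(x, y)] for x, y in zip(a, b))


def rank(query, candidates, matrix):
    """Order candidates from best to worst ungapped score against the query."""
    return sorted(candidates, key=lambda c: ungapped_score(query, c, matrix), reverse=True)


if __name__ == "__main__":
    small = build_matrix(0.6, 0.2, 0.1)
    scaled = build_matrix(6, 2, 1)  # the same matrix multiplied by 10
    candidates = ["GCTC", "GGTC", "GATC"]
    print(rank("GATC", candidates, small) == rank("GATC", candidates, scaled))
```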
Protein Matrices, Same Idea
• Original: Dayhoff matrix aka PAM
• PAM = Point Accepted Mutation (often expanded as “percent accepted mutations”)
• Based on small number of correctly aligned proteins
• Simply count how often each amino acid is substituted for another
• Frequency of substitutions reflects the properties of amino acids relative to each other
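The counting step described above can be sketched as follows. This covers only the raw tally of substitutions in trusted alignments. Turning counts into an actual PAM matrix involves further normalization that is not shown here. The function name and the toy aligned pairs are hypothetical.

```python
from collections import Counter


def count_substitutions(aligned_pairs, gap="-"):
    """Tally how often each unordered pair of different residues is aligned.

    aligned_pairs: iterable of (row_a, row_b) strings of equal length.
    Gap columns and identical residues are skipped.
    """
    counts = Counter()
    for row_a, row_b in aligned_pairs:
        for x, y in zip(row_a, row_b):
            if x == gap or y == gap or x == y:
                continue
            counts[frozenset((x, y))] += 1
    return counts


if __name__ == "__main__":
    # Toy alignments, not real protein data.
    pairs = [("LIVE", "LVVD"), ("AI-E", "AVKD")]
    for pair, n in count_substitutions(pairs).items():
        print("/".join(sorted(pair)), n)
```

Using an unordered pair as the key treats I to V and V to I as the same event, which matches the symmetric way substitution matrices are usually presented.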