Pairwise Sequence Comparison - Slides | BME 110

Material Type: Notes; Class: Computational Biology Tools; Subject: Biomolecular Engineering; University: University of California-Santa Cruz; Term: Unknown 2009;

Uploaded on 09/17/2009

koofers-user-520

Pair-wise Sequence Comparison

  • Basis for relating biological information from a well-studied gene to a new sequence
  • Many programs exist for pairwise comparison
  • Some specialize in fast database searching and get “good” alignments
    • One sequence vs. many thousands: BLAST or FASTA
  • Some are much slower, but guarantee the “optimal” alignment
    • Smith-Waterman is the de facto standard

What is Optimal??

  • How do we get an “optimal” alignment?
  • Optimal to whom?
  • Optimal based on a scoring model:
    • Substitution scoring matrix
    • Insertion / deletion scoring (gap penalties)
  • Caution: Just because an alignment is optimal for a given scoring scheme doesn’t mean it is biologically correct!!
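
To make the scoring model concrete, here is a minimal Python sketch that scores an alignment that has already been built. The function name `score_alignment` and the default values (+1 match, -1 mismatch, a flat cost per gap position) are assumptions chosen for illustration; they do not come from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of an existing alignment ('-' marks a gap).

    The default scores are illustrative assumptions, not course values.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # insertion / deletion column
        elif a == b:
            total += match        # identical residues
        else:
            total += mismatch     # substitution
    return total


# Two candidate alignments of the same pair of sequences.
print(score_alignment("GATC", "GTGC"))             # ungapped, gap cost -2
print(score_alignment("GAT-C", "G-TGC"))           # gapped, gap cost -2
print(score_alignment("GAT-C", "G-TGC", gap=-1))   # gapped, gap cost -1
```

With a gap cost of -2 the ungapped pairing scores higher (0 vs. -1); with a gap cost of -1 the gapped pairing scores higher (1 vs. 0). Which alignment counts as "optimal" depends entirely on the scoring scheme.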

Dynamic Programming

  • Fancy term for the type of algorithm used to get the “optimal” (best possible) alignment between two sequences
  • Needleman and Wunsch (1970): most basic method
    • Gives the “global” (end-to-end) best alignment
  • Smith-Waterman is based closely on this algorithm, but allows for “local” alignments (best subsequence match only)
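
The relationship between the two algorithms can be sketched in a few lines of Python. This is a simplified, score-only version under assumed parameters (+1 match, -1 mismatch, -1 per gap position); the function name `best_score` is hypothetical. The only differences in local mode are that the borders start at 0, no cell is allowed to drop below 0, and the answer is the largest cell anywhere in the table.

```python
def best_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Fill the dynamic programming table and return the best score.

    local=False: global (Needleman-Wunsch style), score of the last cell.
    local=True:  local (Smith-Waterman style), best cell anywhere.
    Scores are illustrative assumptions with a linear gap cost.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                table[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,     # gap in b
                table[i][j - 1] + gap,     # gap in a
            )
            if local:
                cell = max(cell, 0)        # restart instead of going negative
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


print(best_score("AAAGATC", "GATCTTT"))              # global
print(best_score("AAAGATC", "GATCTTT", local=True))  # local
```

For these two sequences the global score is -2, because the unmatched ends must be paid for, while the local score is 4, the shared GATC segment on its own.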

Basic Example

  • Find the best global alignment of two sequences:

    G A T C
    G T G C
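
One way to work this example by machine is a small Needleman-Wunsch sketch with traceback. The slides do not state the scores used for the example, so the values below (+1 match, -1 mismatch, and a linear gap cost) are assumptions, and `needleman_wunsch` is an illustrative name rather than a library function.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Linear gap cost; all score values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + s,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and table[i][j] == table[i - 1][j] + gap:
            top.append(a[i - 1])      # residue of a against a gap
            bottom.append("-")
            i -= 1
        else:
            top.append("-")           # residue of b against a gap
            bottom.append(b[j - 1])
            j -= 1
    return table[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("GATC", "GTGC", gap=-1))
print(needleman_wunsch("GATC", "GTGC", gap=-2))
```

Under these assumed scores, a gap cost of -1 gives the gapped alignment GAT-C / G-TGC with score 1, while a gap cost of -2 gives the ungapped alignment GATC / GTGC with score 0. The sequences are the same; only the scoring scheme changed.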

Moral: Scoring Model Matters!!

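  • For DNA, the model can be very simple:
    • +1 match, -1 mismatch
  • However, not all mutations are equally likely:
    • Transition: A <–> G or C <–> T (purine to purine, or pyrimidine to pyrimidine)
      • more likely
    • Transversion: A <–> C, A <–> T, G <–> C, or G <–> T (purine to pyrimidine, or vice versa)
      • less likely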
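
The transition/transversion distinction can be expressed as a short Python sketch. The function names and the penalty values (-1 for a transition, -2 for a transversion) are assumptions for illustration; the only fact relied on is that A and G are purines and C and T are pyrimidines.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of DNA bases as match, transition, or transversion."""
    if x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
        raise ValueError("expected bases from A, C, G, T")
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def score_pair(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned column; the values are illustrative assumptions."""
    kind = substitution_type(x, y)
    return {"match": match,
            "transition": transition,
            "transversion": transversion}[kind]


for pair in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(pair, substitution_type(*pair), score_pair(*pair))
```

Penalizing transversions more heavily than transitions is one simple way to encode "less likely" in a DNA scoring model.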

Kimura Two-parameter Scoring Matrix

         A      C      G      T
    A   0.6    0.1    0.2    0.1
    C   0.1    0.6    0.1    0.2
    G   0.2    0.1    0.6    0.1
    T   0.1    0.2    0.1    0.6

Actual values not important, only values relative to each other

Same Matrix (× 10)

         A      C      G      T
    A    6      1      2      1
    C    1      6      1      2
    G    2      1      6      1
    T    1      2      1      6

Actual values not important, only values relative to each other
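
A quick way to check the "relative values" point is to rank the same candidates under both matrices. The sketch below rebuilds the two matrices shown above from three parameters (match, transition, transversion); the helper names `two_parameter_matrix`, `ungapped_score`, and `rank` are hypothetical, and only ungapped comparisons are scored to keep it short.

```python
BASES = "ACGT"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def two_parameter_matrix(match, transition, transversion):
    """Build a DNA scoring matrix as a dict keyed by (base, base)."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = match
            elif frozenset((x, y)) in TRANSITIONS:
                matrix[x, y] = transition
            else:
                matrix[x, y] = transversion
    return matrix


def ungapped_score(a, b, matrix):
    """Score two equal-length sequences column by column, no gaps."""
    return sum(matrix[x, y] for x, y in zip(a, b))


def rank(query, targets, matrix):
    """Order targets from best to worst score against the query."""
    return sorted(targets,
                  key=lambda t: ungapped_score(query, t, matrix),
                  reverse=True)


small = two_parameter_matrix(0.6, 0.2, 0.1)   # first matrix above
scaled = two_parameter_matrix(6, 2, 1)        # same matrix times 10

targets = ["GCTC", "GGTC", "GATC"]
print(rank("GATC", targets, small))
print(rank("GATC", targets, scaled))
```

Both calls return the same order, because multiplying every entry by the same positive constant rescales every score without changing which one is larger. In a gapped alignment the gap penalties would need to be scaled by the same factor for this to hold.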

Protein Matrices, Same Idea

  • Original: Dayhoff matrix, aka PAM
  • PAM = Point accepted mutations
  • Based on a small number of correctly aligned proteins
  • Simply count how often each amino acid is substituted for another
  • Frequency of substitutions reflects the properties of the amino acids relative to each other
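
The counting step can be sketched in Python as follows. The aligned peptide pairs are made-up toy data, and `count_substitutions` is a hypothetical helper. Real matrices go on to normalize these counts and convert them to log-odds scores; that part is omitted here.

```python
from collections import Counter


def count_substitutions(aligned_pairs):
    """Count how often each unordered pair of residues shares a column.

    aligned_pairs: iterable of (seq1, seq2) strings of equal length,
    with '-' marking gaps. Gap columns are skipped.
    """
    counts = Counter()
    for top, bottom in aligned_pairs:
        if len(top) != len(bottom):
            raise ValueError("aligned sequences must have equal length")
        for x, y in zip(top, bottom):
            if x == "-" or y == "-":
                continue
            # Sort so that (K, R) and (R, K) land in the same bucket.
            counts[tuple(sorted((x, y)))] += 1
    return counts


# Toy alignments, invented purely for illustration.
pairs = [("MKLV", "MRLV"), ("MKIV", "MKLV")]
for residue_pair, n in sorted(count_substitutions(pairs).items()):
    print(residue_pair, n)
```

In this toy data the K/R and I/L substitutions are each seen once, while the conserved M and V columns are each seen twice.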

BLASTp Scoring Matrices

    less divergent                      more divergent
    BLOSUM 80        BLOSUM 62          BLOSUM 45
    PAM1             PAM120             PAM

  • BLOcks amino acid SUbstitution Matrices
    • Calculated directly from substitution frequencies in local, ungapped alignments of biochemically related sequences
    • Number indicates the highest sequence similarity between the sequences used
  • Point Accepted Mutations
    • Derived from global alignments of closely related sequences (85% identity) using an evolutionary model to extrapolate to lower identities
    • Number indicates evolutionary distance
  • If in doubt, use BLOSUM.
    • More suited to searching databases using local alignment.
    • No assumed model of evolutionary divergence.

Other BLASTp Parameters

  • Gap penalties
    • The harder it is to open/extend a gap, the fewer gaps will be made. If you’re looking for closely related sequences, gap penalties should be higher.
  • Databases
    • NR (non-redundant, translated gene sequences)
    • SwissProt
    • PDB
    • Phylogenetically specific (e.g. Archaea only)
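
The open/extend distinction in the gap penalty bullet above can be illustrated with an affine gap cost. In this sketch a gap of length k costs `gap_open + gap_extend * k`; that convention, the function names, and the values 10 and 1 are assumptions for illustration, not settings taken from the slides or from any particular BLAST version.

```python
import re


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap of the given length (illustrative values)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def total_gap_cost(aligned, gap_open=10, gap_extend=1):
    """Sum the affine cost over every run of '-' in one aligned sequence."""
    runs = re.findall(r"-+", aligned)
    return sum(affine_gap_cost(len(run), gap_open, gap_extend)
               for run in runs)


print(total_gap_cost("AC---GT"))   # one gap of length 3
print(total_gap_cost("A-C-G-T"))   # three gaps of length 1
```

Both strings contain three gap positions, but the single long gap costs 13 while the three separate gaps cost 33. Raising `gap_open` makes each new gap more expensive, so fewer gaps appear in the best-scoring alignments.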