Lecture outline
• Sequence alignment
– Why do we need to align sequences?
– Evolutionary relationships
• Gaps and scoring matrices
• Dynamic programming
– Global alignment (Needleman & Wunsch)
– Local alignment (Smith & Waterman)
• Database searches
– BLAST
– FASTA
Complete DNA Sequences
Whole genome sequencing projects for more than 2000 species
Sequence conservation implies function
Alignment is the key to
• Finding important regions
• Determining function
• Uncovering the evolutionary forces
Sequence alignment
• Comparing DNA/protein sequences for
– Similarity
– Homology
• Prediction of function
• Construction of phylogeny
• Shotgun assembly
– End-space-free alignment / overlap alignment
• Finding motifs
Sequence Alignment
Procedure of comparing two (pairwise) or more (multiple) sequences by searching for a series of individual characters that are in the same order in the sequences

VLSPADKTNVKAAWGKVGAHAGYEG
|||
VLSEGDWQLVLHVWAKVEADVAGEG
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings
$x = x_1 x_2 \ldots x_M$,
$y = y_1 y_2 \ldots y_N$,
an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence
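The definition can be checked mechanically. The sketch below is only an illustration: it assumes gaps are written as '-', the helper name is_valid_alignment is hypothetical, and the rule that a column may not hold two gaps is an interpretation of "either a letter, or a gap in the other sequence". The test strings are the pair used later in the lecture.

```python
def is_valid_alignment(x, y, row_x, row_y, gap="-"):
    """Check that (row_x, row_y) is an alignment of strings x and y.

    Assumption: gaps are written with the character '-'.
    """
    # Both rows must have the same number of columns.
    if len(row_x) != len(row_y):
        return False
    # Removing the gaps must give back the original strings.
    if row_x.replace(gap, "") != x or row_y.replace(gap, "") != y:
        return False
    # Each column pairs a letter with a letter or a gap, never a gap with a gap.
    return all(not (a == gap and b == gap) for a, b in zip(row_x, row_y))


x = "AGGCTAGTT"
y = "AGCGAAGTTT"
print(is_valid_alignment(x, y, "AGGCTA-GTT-", "AG-CGAAGTTT"))  # True
print(is_valid_alignment(x, y, "AGGCTAGTT", "AGCGAAGTTT"))     # False: rows differ in length
```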
Orthologs and paralogs
Understanding evolutionary relationships
"Nothing in biology makes sense except in the light of evolution"
Dobzhansky, 1973
Differing rates of DNA evolution
• Functional/selective constraints (particular features of coding regions, particular features in 5' untranslated regions)
• Variation among different gene regions with different functions (different parts of a protein may evolve at different rates).
• Within proteins, variations are observed between
– surface and interior amino acids in proteins (order of magnitude difference in rates in haemoglobins)
– charged and non-charged amino acids
– protein domains with different functions
– regions which are strongly constrained to preserve particular functions and regions which are not
– different types of proteins: those with constrained interaction surfaces and those without
Common assumptions
• All nucleotide sites change independently
• The substitution rate is constant over time and in different lineages
• The base composition is at equilibrium
• The conditional probabilities of nucleotide substitutions are the same for all sites, and do not change over time
• Most of these are not true in many cases…
A simple alignment
• Let us try to align two short nucleotide sequences:
– AATCTATA and AAGATA
• Without considering any gaps (insertions/deletions) there are 3 possible ways to align these sequences

AATCTATA
AAGATA

AATCTATA
 AAGATA

AATCTATA
  AAGATA

• Which one is better?
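One way to compare the three placements is to count identical positions in each. The sketch below assumes that this count is the criterion of interest, which the slide does not state, and the function name is hypothetical.

```python
def ungapped_match_counts(long_seq, short_seq):
    """Count matching positions for every ungapped placement of short_seq
    along long_seq (short_seq is never allowed to overhang)."""
    counts = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        counts.append(sum(a == b for a, b in zip(window, short_seq)))
    return counts


print(ungapped_match_counts("AATCTATA", "AAGATA"))  # [4, 1, 3]
```

Under this criterion the first placement (offset 0) has the most matches.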
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
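The counts quoted for the three candidate alignments can be reproduced column by column. This is a minimal sketch, assuming '-' marks a gap and that every gap character is counted once. The helper name alignment_stats is hypothetical.

```python
def alignment_stats(row_x, row_y, gap="-"):
    """Return (matches, mismatches, gaps) for two equal-length aligned rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            gaps += 1          # a letter facing a gap character
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


for top, bottom in [
    ("AGGCTAGTT-", "AGCGAAGTTT"),
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),
]:
    print(alignment_stats(top, bottom))
# (6, 3, 1)
# (7, 1, 3)
# (7, 0, 5)
```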
Good alignments require gaps
• Maximal consecutive run of spaces in alignment
– Matching mRNA (cDNA) to DNA
– Shortening of DNA/protein sequences
– Slippage during replication
– Unequal crossing-over during meiosis
– …
• We need to have a scoring function that considers gaps also
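If a gap is taken to be a maximal consecutive run of spaces, the number of gaps in a row can be smaller than the number of gap characters. The sketch below, which assumes '-' as the gap character and uses a hypothetical helper name, lists the run lengths for one aligned row.

```python
from itertools import groupby


def gap_lengths(row, gap="-"):
    """Lengths of the maximal consecutive runs of gap characters in one row."""
    return [len(list(run)) for char, run in groupby(row) if char == gap]


print(gap_lengths("AA--GATA"))  # [2]: one gap of length 2
print(gap_lengths("AA-G-ATA"))  # [1, 1]: two gaps of length 1
```

Both rows contain two gap characters, but only the second contains two separate gaps. A scoring function may want to treat these cases differently.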
Simple alignment with gaps
• Considering gapped alignments vastly increases the number of possible alignments:

AATCTATA
AAG-AT-A

AATCTATA
AA-G-ATA

AATCTATA
AA--GATA

more?
• If gap penalty is -1 what will be the new scores?
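The growth in the number of alignments, and the effect of a gap penalty, can be explored with a short sketch. It rests on several assumptions that are not in the slides: gaps are placed only in the shorter sequence, a match scores +1 and a mismatch 0, and the function names are hypothetical. Only the gap penalty of -1 comes from the slide.

```python
from itertools import combinations


def gapped_versions(seq, total_len, gap="-"):
    """Yield every way to pad seq with gaps up to total_len columns."""
    for gap_cols in combinations(range(total_len), total_len - len(seq)):
        letters = iter(seq)
        yield "".join(gap if i in gap_cols else next(letters)
                      for i in range(total_len))


def score(row_x, row_y, match=1, mismatch=0, gap_penalty=-1, gap="-"):
    """Column-wise score; the match and mismatch values are assumptions."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == gap or b == gap:
            total += gap_penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


x = "AATCTATA"
candidates = list(gapped_versions("AAGATA", len(x)))
print(len(candidates))  # 28

for row_y in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(row_y, score(x, row_y))
# AAG-AT-A 1
# AA-G-ATA 3
# AA--GATA 3

print(max(score(x, c) for c in candidates))  # 3
```

Even with gaps restricted to the shorter sequence there are 28 candidates instead of 3. Allowing gaps in both sequences would give many more.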