Sequence Alignment: DNA and Protein, Scoring Functions and Homology, Study notes of Bioinformatics

These notes cover the importance of sequence alignment in recognizing common sequences and homology in DNA and proteins. They describe various types of mutations and their probabilities, alignment methods, and scoring strategies. They also introduce the concept of point accepted mutation (PAM) and amino acid pair probabilities, which are essential for estimating the probability of homology.

Typology: Study notes

Pre 2010

Uploaded on 02/12/2009

koofers-user-aje

BINF 730
Lecture 2
Sequence Alignment
DNA Sequence Alignment – Why?
Recognition sites might be common – restriction enzymes, start sequences, stop sequences, other regulatory sequences
Homology – common evolutionary progenitor

Mutations

-Deletions

-Insertions

-Transitional substitution (purine-purine A-G, pyrimidine-pyrimidine T-C)

-Transversional substitution (purine-pyrimidine, pyrimidine-purine)

Example

Start with ACGTACGT and apply mutations for 9540 generations with the following probabilities:

Deletion 0.

Insertion 0.

Transitional substitution 0.

Transversional substitution 0.
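
The mutation model in this example can be simulated directly. The following is a minimal sketch, assuming that each position undergoes at most one event per generation and that an insertion adds one random base after the current one. The probability values in the usage line are hypothetical placeholders (the actual values are not preserved in these notes), and the names `mutate_once` and `evolve` are illustrative only.

```python
import random

# A transition swaps within purines (A, G) or within pyrimidines (C, T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
# A transversion swaps a purine for a pyrimidine or vice versa.
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}


def mutate_once(seq, rng, p_del, p_ins, p_ts, p_tv):
    """Apply one generation of mutations (assumed: one event per position)."""
    out = []
    for base in seq:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base
        elif r < p_del + p_ins:
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after this base
        elif r < p_del + p_ins + p_ts:
            out.append(TRANSITION[base])
        elif r < p_del + p_ins + p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[base]))
        else:
            out.append(base)  # no mutation
    return "".join(out)


def evolve(seq, generations, rng, p_del, p_ins, p_ts, p_tv):
    """Repeat mutate_once for the given number of generations."""
    for _ in range(generations):
        seq = mutate_once(seq, rng, p_del, p_ins, p_ts, p_tv)
    return seq


# Hypothetical probabilities, chosen only for illustration.
rng = random.Random(0)
print(evolve("ACGTACGT", 9540, rng, p_del=1e-4, p_ins=1e-4, p_ts=6e-4, p_tv=2e-4))
```

Running the simulation with different seeds gives different descendants of the same starting sequence, which is the situation an alignment method has to untangle.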

Example

Or, using Gotoh’s algorithm with mismatch penalty 3 and gap penalty function g(k) = 2 + 2k for a gap of length k:

ACACG--GTCCTAATAATGGCC
-CAGGAAGATCT--TAGTT--C

The alignment depends on the algorithm used!
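
To make the gap penalty function concrete, the sketch below computes the total cost of a given gapped alignment under g(k) = 2 + 2k and a mismatch penalty of 3. It assumes that matches cost 0 and that the total is a cost to be minimized; the notes give only the mismatch and gap penalties, so both are assumptions. The function names and the short test alignment are hypothetical, and this scores an existing alignment rather than implementing Gotoh's algorithm itself.

```python
def gap_penalty(k, open_cost=2, extend_cost=2):
    """Affine gap penalty g(k) = open_cost + extend_cost * k for a gap of length k."""
    return open_cost + extend_cost * k


def alignment_cost(row_a, row_b, mismatch=3, match=0):
    """Total cost of a gapped alignment given as two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    cost = 0
    gap_run = 0      # length of the gap currently being read
    gap_row = None   # which row holds the current gap: "a", "b", or None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gap symbols")
        current = "a" if a == "-" else "b" if b == "-" else None
        # A gap ends when the column type changes; charge it as one unit.
        if current != gap_row and gap_run:
            cost += gap_penalty(gap_run)
            gap_run = 0
        gap_row = current
        if current is not None:
            gap_run += 1
        else:
            cost += match if a == b else mismatch
    if gap_run:
        cost += gap_penalty(gap_run)
    return cost


# Hypothetical alignment: one gap of length 2 and one mismatch.
print(alignment_cost("AC--GT", "ACTTGA"))  # 2 + 2*2 for the gap, plus 3 -> 9
```

Because the gap is charged once per run rather than once per symbol, one long gap is cheaper than several short ones, which is the design choice an affine penalty encodes.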

Protein sequence alignment

A. Homologous proteins

i. Common evolutionary origin
ii. Structural similarity
iii. Functional similarity

B. Conserved regions

i. Functional domains
ii. Evolutionary similarity
iii. Structural motif

Example 3.

Choosing the best alignment

•Every alignment has a score

•Choose the alignment with the highest score

•Must choose an appropriate scoring function

•Scoring function based on an evolutionary model with insertions, deletions, and substitutions

•Use a substitution score matrix – contains an entry for every amino acid pair

Statistical approach

  • Let s and s’ be two amino acid sequences of length n for which we want to compute an alignment score

  • Assume only substitutions occur (no insertions or deletions)

  • Works for local alignment
  • Odds Ratio and Log Odds Ratio

Odds Ratio and Log Odds Ratio

The score for aligning s and s’ is based on comparing the hypothesis that the two sequences are generated randomly with the hypothesis that they come from a common ancestor. Let R denote the random model, and let $q_A$ be the probability of producing amino acid A under R (based on the relative frequency at which A is found in proteins). The probability for the null hypothesis (that s and s’ do not stem from a common ancestor) is

$$P(s, s' \mid R) = \prod_{1 \le i \le n} q_{s_i}\, q_{s'_i} = \prod_{1 \le i \le n} q_{s_i} \prod_{1 \le i \le n} q_{s'_i}$$

The second hypothesis (the homologous hypothesis), that s and s’ arise from a common ancestor sequence r of length n, is based on the evolutionary model (E). The probability that the amino acids A and B are aligned, and hence have been derived from a common ancestor amino acid C, is denoted $p_{A,B}$. The probability for the homologous hypothesis is

$$P(s, s' \mid E) = \prod_{1 \le i \le n} p_{s_i, s'_i}$$

How this probability is determined will be explained later.

The odds ratio compares the homologous hypothesis with the null hypothesis:

$$\frac{P(s, s' \mid E)}{P(s, s' \mid R)} = \frac{\prod_{1 \le i \le n} p_{s_i, s'_i}}{\prod_{1 \le i \le n} q_{s_i} \prod_{1 \le i \le n} q_{s'_i}} = \prod_{1 \le i \le n} \frac{p_{s_i, s'_i}}{q_{s_i}\, q_{s'_i}}$$

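To achieve a scoring function that is additive rather than multiplicative, the log odds ratio can be used:

$$s_{A,B} = \log \frac{p_{A,B}}{q_A\, q_B}$$

The score of the alignment is then the sum of $s_{s_i, s'_i}$ over the n aligned positions.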
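
The relationship between the odds ratio and the log odds score can be checked numerically. The sketch below uses a hypothetical two-letter alphabet with made-up values for the background frequencies q and the pair probabilities p (real amino acid values are not given in these notes), and the function names are illustrative. It shows that summing the per-position log odds scores gives the logarithm of the odds ratio.

```python
import math

# Hypothetical two-letter alphabet; all values are illustrative only.
q = {"A": 0.5, "B": 0.5}                      # background frequencies (model R)
p = {("A", "A"): 0.4, ("B", "B"): 0.4,        # pair probabilities (model E)
     ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds(a, b):
    """Log odds score for aligning a with b: log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(s, t):
    """Additive score: sum of log odds over aligned positions (no gaps)."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(log_odds(a, b) for a, b in zip(s, t))


def odds_ratio(s, t):
    """Multiplicative form: product of p_ab / (q_a * q_b) over positions."""
    ratio = 1.0
    for a, b in zip(s, t):
        ratio *= p[(a, b)] / (q[a] * q[b])
    return ratio


s, t = "AAB", "ABB"
print(alignment_score(s, t), math.log(odds_ratio(s, t)))  # the two agree
```

With these toy values an identical pair scores above zero and a mismatched pair scores below zero, so a positive total favors the homologous hypothesis over the random one.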

PAM and Amino Acid Pair Probabilities

We now have

$$p_{A,B} = \frac{n_{AB}(s, s')}{n}$$

which is the relative frequency of a pair (A,B) in the alignment of s and s’, where $n_{AB}(s, s')$ is the number of times the amino acids A and B are aligned in one column in the alignment of s and s’, and n is the length of s and s’.

To find a value for $n_{AB}$, some homologous sequences are needed. To do this, Dayhoff and co-workers used local sequence alignment.
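
Counting aligned pairs is simple to express in code. The sketch below estimates the relative frequency of each pair as the number of columns in which A and B are aligned divided by the alignment length n. It assumes a gap-free alignment of two equal-length sequences and counts ordered pairs (A,B) exactly as they appear; whether (A,B) and (B,A) should be pooled is not specified in these notes. The function name and example sequences are hypothetical.

```python
from collections import Counter


def pair_frequencies(s, t):
    """Relative frequency n_AB / n of each aligned pair (A, B)."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    if not s:
        raise ValueError("sequences must not be empty")
    n = len(s)
    counts = Counter(zip(s, t))  # n_AB for every pair seen in a column
    return {pair: count / n for pair, count in counts.items()}


# Hypothetical aligned sequences of length 3.
freqs = pair_frequencies("AAB", "ABB")
for pair, freq in sorted(freqs.items()):
    print(pair, freq)
```

In practice the counts would be accumulated over many aligned sequence pairs rather than a single short one, since one alignment gives very few observations per amino acid pair.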

Problem – They used sequence alignment to find a substitution matrix (substitution score matrix) for sequence alignment – which comes first, the chicken or the egg?

Answer – Use only very closely related sequences (sequences that differ in at most 15% of their amino acids).

Caveat – The substitution matrix is only valid for closely related protein sequences
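
The 15% criterion above can be expressed as a simple filter. This is a minimal sketch, assuming two gap-free sequences of equal length compared position by position; the function names and the example sequences are hypothetical.

```python
def fraction_differing(s, t):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    if not s:
        raise ValueError("sequences must not be empty")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def closely_related(s, t, max_diff=0.15):
    """True if the sequences differ in at most max_diff of their positions."""
    return fraction_differing(s, t) <= max_diff


# Hypothetical 20-residue sequences differing at 3 positions (15%).
ref = "ACDEFGHIKLMNPQRSTVWY"
alt = "ACDEFGHIKLMNPQRSTAAA"
print(fraction_differing(ref, alt), closely_related(ref, alt))
```

Only pairs passing such a filter would contribute to the pair counts, which is why the resulting matrix is valid only for closely related sequences.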