Computational Biology: More on Sequence Operations, Slides of Computational Biology

This document, part of a larger set of notes for a computational biology course, focuses on sequence operations. It discusses the representation and matching of sequences, including bit-coding and matching one or more characters. It also covers automating probability calculations using nucleotide frequencies.

Typology: Slides

2010/2011

Uploaded on 11/02/2011

Computational Biology, Part A
More on Sequence Operations
Robert F. Murphy

Representation and Matching of Sequences

Matching one character - with character variables

• Assume two character variables “C” and “Q”
• Test for exact match
    If(Q=C) {...}
• Need complicated statements to handle wildcards
    If(Q=C | (Q='A'&(C='A'|C='R'|C='W'|C='M'|C='D'|C='H'|C='V'|C='N')|Q='C'&...)) {...}
• Can build into a function
    If(TestBase(Q,C)) {...}
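A minimal Python sketch of what a function like TestBase could look like, assuming the standard IUPAC ambiguity codes. The names `IUPAC` and `bases_match` are illustrative and do not come from the slides; two codes are treated as matching when they share at least one possible base.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_match(q: str, c: str) -> bool:
    """Return True if codes q and c have at least one base in common."""
    return bool(set(IUPAC[q]) & set(IUPAC[c]))


# Codes that match a plain 'A', in alphabetical order.
print("".join(sorted(c for c in IUPAC if bases_match("A", c))))  # ADHMNRVW
```

The printed codes are the same eight that appear in the long wildcard condition for Q='A'.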

Efficient method to match one character

• Convert char to int 0-25
• Create 26x26 matrix showing which matches which
• Lookup two characters to be compared to find value
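The lookup-table idea can be sketched as follows. This assumes uppercase letters A-Z mapped to 0-25 and the standard IUPAC codes; `code_index`, `MATCH` and `lookup_match` are hypothetical names. Letters that are not nucleotide codes simply never match.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def code_index(ch: str) -> int:
    """Convert an uppercase letter to an integer in the range 0-25."""
    return ord(ch) - ord("A")


# 26x26 table, filled once; MATCH[i][j] is True when letters i and j match.
MATCH = [[False] * 26 for _ in range(26)]
for q, q_bases in IUPAC.items():
    for c, c_bases in IUPAC.items():
        MATCH[code_index(q)][code_index(c)] = bool(set(q_bases) & set(c_bases))


def lookup_match(q: str, c: str) -> bool:
    """Compare two characters with a single table lookup."""
    return MATCH[code_index(q)][code_index(c)]
```

The cost of working out which codes overlap is paid once when the table is built; each later comparison is two index conversions and one lookup.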

Matching one character - with bit coding

• Assume two integer variables “I” and “J”
• Test for exact match
    If(I=J) {...}
• Test for match with wildcards (no lookup!)
    If(I&J) {...}
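A small sketch of bit coding in Python. The particular bit assignment (A=1, C=2, G=4, T=8) is an assumption, since the slides do not say which bit stands for which base; `BASE_BITS` and `encode` are illustrative names.

```python
# One bit per base (this particular assignment is an assumption).
BASE_BITS = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}

# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def encode(code: str) -> int:
    """Return the bit mask for a code: the OR of the bits of its bases."""
    mask = 0
    for base in IUPAC[code]:
        mask |= BASE_BITS[base]
    return mask


i, j = encode("A"), encode("R")
print(i == j)       # False: not an exact match
print(bool(i & j))  # True: the codes share the base A
```

With this coding, a nonzero bitwise AND means the two codes have at least one base in common, so no table lookup is needed.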

Matching more than one character - pattern matching

• Example: recognition site for a restriction enzyme
• Input sequence string into variable Seq
• Define Site as string of characters or masks
    EcoRI recognizes GAATTC
    AccI recognizes GTMKAC
• Create function to search a sequence for that site
    Find(Site,LenSite,Seq,LenSeq)
    for each position in Seq, see if Site matches starting there
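One way the Find function could be written in Python, as a sketch only. It assumes a bit-mask coding of the IUPAC codes with A=1, C=2, G=4, T=8, and returns 0-based start positions. `find_sites` and `MASK` are hypothetical names, and Python strings carry their own lengths, so the LenSite and LenSeq arguments are dropped.

```python
BASE_BITS = {"A": 1, "C": 2, "G": 4, "T": 8}
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Bit mask per code; summing is the same as OR-ing because the bits are distinct.
MASK = {code: sum(BASE_BITS[b] for b in bases) for code, bases in IUPAC.items()}


def find_sites(site: str, seq: str) -> list[int]:
    """Return every 0-based position in seq where site matches."""
    site_masks = [MASK[ch] for ch in site]
    seq_masks = [MASK[ch] for ch in seq]
    hits = []
    # For each position in seq, see if site matches starting there.
    for start in range(len(seq_masks) - len(site_masks) + 1):
        window = seq_masks[start:start + len(site_masks)]
        if all(s & q for s, q in zip(site_masks, window)):
            hits.append(start)
    return hits


print(find_sites("GAATTC", "TTGAATTCAA"))          # [2]
print(find_sites("GTMKAC", "AAGTCGACTTGTATACGG"))  # [2, 10]
```

The second call shows the ambiguous AccI site matching both GTCGAC and GTATAC in the same sequence.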

Automating the Calculation

• Goal: Calculate probability of occurrence of a sequence that may include ambiguous bases
• What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations
• When using dinucleotide probabilities, have to be careful about how the probabilities are combined

Illustration

• Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?

Expansions

• pART = pA (pAA + pAG) (pAT + pGT) [eq.1]
• pART = pA pAA pAT + pA pAA pGT + pA pAG pAT + pA pAG pGT
• pART = pA (pAA pAT + pAG pGT) [eq.2]
• pART = pA pAA pAT + pA pAG pGT

Proof

• pART = pAAT + pAGT
• pAAT = pA pAA pAT
• pAGT = pA pAG pGT
• pART = pA pAA pAT + pA pAG pGT
• This matches equation 2 on previous slide
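A quick numeric check of the two candidate formulas. The probability values below are made up purely for illustration and are not taken from any real sequence; exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical values, chosen only to make the arithmetic easy to follow.
pA = Fraction(1, 4)
pAA, pAG = Fraction(3, 10), Fraction(2, 10)
pAT, pGT = Fraction(1, 10), Fraction(4, 10)

# Sum over the two concrete sequences that ART allows: AAT and AGT.
p_paths = pA * pAA * pAT + pA * pAG * pGT

eq1 = pA * (pAA + pAG) * (pAT + pGT)  # adds first, then multiplies
eq2 = pA * (pAA * pAT + pAG * pGT)    # multiplies along each path, then adds

print(p_paths, eq1, eq2)  # 11/400 1/16 11/400
```

With these numbers, equation 1 comes out larger because its expansion also counts the cross terms pA pAA pGT and pA pAG pAT, which do not correspond to any single sequence.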

More complicated probability illustration

• What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?
• Using equal mononucleotide frequencies
    pA = pC = pG = pT = 1/4
    pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/4 = 1/64
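The mononucleotide calculation can be automated with a short function. This is a sketch: `mono_probability` is a hypothetical name, the IUPAC table is the standard one, and the same function would accept observed frequencies in place of the equal ones used here.

```python
from fractions import Fraction

# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mono_probability(pattern: str, freqs: dict) -> Fraction:
    """Multiply, position by position, the summed frequency of the allowed bases."""
    prob = Fraction(1)
    for code in pattern:
        prob *= sum(freqs[base] for base in IUPAC[code])
    return prob


equal = {base: Fraction(1, 4) for base in "ACGT"}
print(mono_probability("ARYT", equal))  # 1/64
```

Adding within a position and multiplying across positions is safe here only because mononucleotide frequencies treat each position as independent of its neighbours.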

Illustration (continued)

• Using observed mononucleotide frequencies:
    pARYT = pA (pA + pG) (pC + pT) pT
• Using dinucleotide frequencies:
    pARYT = pA (pAA (pAC pCT + pAT pTT) + pAG (pGC pCT + pGT pTT))

Multiply then add

• We conclude that for such strings our rule should be “multiply dinucleotide probabilities along each allowed path and then add the results”
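The rule can be written almost word for word by enumerating every allowed path. In this sketch, `pattern_probability` is a hypothetical name, `mono` holds the probability of the first base, and `di[xy]` is read the way the slides use pXY, as the factor for base y following base x. The pattern is assumed to be non-empty.

```python
from fractions import Fraction
from itertools import product

# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pattern_probability(pattern: str, mono: dict, di: dict):
    """Multiply probabilities along each allowed path, then add the results."""
    total = 0
    # product(...) yields one tuple of concrete bases per allowed path.
    for path in product(*(IUPAC[code] for code in pattern)):
        p = mono[path[0]]
        for prev, nxt in zip(path, path[1:]):
            p *= di[prev + nxt]
        total += p
    return total


# Hypothetical probabilities, for illustration only.
mono = {"A": Fraction(1, 4)}
di = {"AA": Fraction(3, 10), "AG": Fraction(2, 10),
      "AT": Fraction(1, 10), "GT": Fraction(4, 10)}
print(pattern_probability("ART", mono, di))  # 11/400
```

The number of paths grows with every ambiguous position, so plain enumeration suits short features such as restriction sites better than long patterns.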

How do we program this?

• “for” loops?
• Nested “if” structure?
• Other?
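One possible answer to “Other?”, offered as a sketch rather than as the approach the course goes on to use: keep a running total for each base the sequence could currently end in, and update those totals one position at a time. This gives the same sum over paths without listing the paths. `pattern_probability_dp` is a hypothetical name, `di[xy]` is the factor for base y following base x, and the pattern is assumed to be non-empty.

```python
from fractions import Fraction

# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pattern_probability_dp(pattern: str, mono: dict, di: dict):
    """Sum over all allowed paths using one running total per ending base."""
    # ending[b] = total probability of all allowed prefixes that end in base b.
    ending = {base: mono[base] for base in IUPAC[pattern[0]]}
    for code in pattern[1:]:
        ending = {
            nxt: sum(p * di[prev + nxt] for prev, p in ending.items())
            for nxt in IUPAC[code]
        }
    return sum(ending.values())


# With every probability equal to 1/4, ARYT gives 1/64.
mono = {b: Fraction(1, 4) for b in "ACGT"}
di = {a + b: Fraction(1, 4) for a in "ACGT" for b in "ACGT"}
print(pattern_probability_dp("ARYT", mono, di))  # 1/64
```

Each step touches at most 4 x 4 base pairs, so the work grows linearly with the pattern length instead of with the number of paths.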