





















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This document, part of a larger set of notes for a computational biology course, focuses on sequence operations. It discusses the representation and matching of sequences, including bit-coding and matching one or more characters. It also covers automating probability calculations using nucleotide frequencies.
Typology: Slides
1 / 29
This page cannot be seen from the preview
Don't miss anything!






















Robert F. Murphy
Assume two character variables "C” and “Q” test for exact match If(Q=C) {...} need complicated statements to handle wildcards If(Q=C | (Q=„A‟&(C=„A‟|C=„R‟‟| C=„W‟ | C=„M‟ | C=„D‟ | C=„H‟‟| C=„V‟ | C=„N‟)|Q=„C‟&...)) {...} can build into a function If(TestBase(Q,C)) {...}
Convert char to int 0-
Create 26x26 matrix showing which matches which
Lookup two characters to be compared to find value
Assume two integer variables “I” and “J” test for exact match If(I=J) {...} test for match with wildcards (no lookup!) If(I&J) {...}
Example: recognition site for a restriction enzyme Input sequence string into variable Seq Define Site as string of characters or masks EcoRI recognizes GAATTC AccI recognizes GTMKAC Create function to search a sequence for that site Find(Site,LenSite,Seq,LenSeq) for each position in Seq, see if Site matches starting there
Goal: Calculate probability of occurrence of a sequence that may include ambiguous bases
What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations
When using dinucleotide probabilities, have to be careful about how the probabilities are combined
Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?
pART=pA(pAA+pAG)(pAT+pGT) [eq.1]
pART=pApAApAT + pApAApGT
pART=pA(pAApAT+pAGpGT) [eq.2]
pART= pApAApAT + pApAGpGT
pART=pAAT+pAGT
pAAT=pApAApAT
pAGT=pApAGpGT
pART= pApAApAT + pApAGpGT
This matches equation 2 on previous slide
What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?
Using equal mononucleotide frequencies
pA = pC = pG = pT = 1/ pARYT = 1/4 * (1/4 + 1/4) * (1/4 + 1/4) * 1/ = 1/
Using observed mononucleotide frequencies: pARYT = pA (pA + pG) (pC + pT) pT
Using dinucleotide frequencies:
pARYT = pA (pAA (pACpCT + pATpTT) + pAG (pGCpCT + pGTpTT) )
We conclude that for such strings our rule should be “multiply dinucleotide probabilities along each allowed path and then add the results”
“for” loops?
Nested “if” structure?
Other?