Assignment 3 - Unlocking the Code | STOR 072, Assignments of Statistics

Material Type: Assignment; Professor: Provan; Class: FYS UNLOCKING THE CODE; Subject: STATISTICS AND OPERATIONS RESEARCH; University: University of North Carolina - Chapel Hill; Term: Unknown 1989;


Uploaded on 03/11/2009

OR072: Assignment #3
1. Recall that a restriction enzyme cuts a DNA molecule at points in the base sequence exhibiting a
specific “word” of A/C/G/T bases. For example, a restriction enzyme might cut only after seeing the
sequence ATT.
a. How many sequences of length 3 can be formed using the letters A, C, G, and T?
b. Assuming any one of these 3-letter “words” is equally likely to occur in a typical DNA
sequence, what are the chances of seeing the exact sequence ATT at a particular point in the
sequence?
c. Using the fact that the average number of steps you would take down a DNA strand before
hitting ATT is 1/(the chances of seeing ATT), on average how long would each fragment be
after cutting at all ATTs using this enzyme?
d. Try (a)-(c) for a four-base sequence, such as ATTG, to find out how big your fragments would
be. Continue this process to find out how big a “word” your restriction enzyme would need to
cut at, if you want to obtain DNA fragments of about size 1000. How about size 15,000?
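The counting in parts (a)-(d) can be sketched in a few lines of Python. This is only an illustration under the assignment's own assumption that every word of a given length is equally likely; the function names (`word_count`, `site_probability`, `expected_fragment_length`, `closest_word_length`) and the `max_k` search limit are invented here and are not part of the assignment.

```python
from fractions import Fraction


def word_count(k: int, alphabet_size: int = 4) -> int:
    """Number of distinct words of length k over A/C/G/T."""
    return alphabet_size ** k


def site_probability(k: int) -> Fraction:
    """Chance of seeing one specific k-letter word at a given position,
    assuming all words of that length are equally likely."""
    return Fraction(1, word_count(k))


def expected_fragment_length(k: int) -> Fraction:
    """Average fragment length, using the rule from part (c):
    1 / (the chances of seeing the word)."""
    return 1 / site_probability(k)


def closest_word_length(target: int, max_k: int = 12) -> int:
    """Word length whose average fragment size is nearest to target.
    max_k is an arbitrary search limit chosen for this sketch."""
    return min(
        range(1, max_k + 1),
        key=lambda k: abs(expected_fragment_length(k) - target),
    )


if __name__ == "__main__":
    for k in (3, 4):
        print(k, word_count(k), site_probability(k), expected_fragment_length(k))
```

Calling `closest_word_length(1000)` or `closest_word_length(15000)` repeats the "continue this process" step of part (d) without tabulating each word length by hand.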
2. Use the BLAST webpage to find the best match for the following DNA sequences. In each case, give
the first match that is a real organism (rather than a vector of some sort), give its E-value, and give the
common name for that organism.
mystery sequence #1:
ATGTGCCATG TTAAGCGTAT TTAACACTGA TGATTACAGT CCAGCTGCGC AACAGAATAT
TCCTGCTCTC CGGAGAAGCT CTTCCTTCAT TTGCGCTGAA AGCTGTAGCT CTAAGTATCA
GTGTGAAGCA GGAGAAAACA GTAAAGGCAG CGTCCAGGAT AGAGTGAAGC GACCCATGAA
CGCATTCATT GTGTGGTCTC GGATCAGCAG GCGCAAGATG GCTCTAGAGA ATCCCAAAAT
GCGAAACTCA GAGATCAGCA AGCAGCTGGG ATACCAGTGG AAAATCCTTA CCGAAGCCGA
TAAATGGCCA TTCTTCCAGG AGGCACAGAA ACTACAGGCC ATGCATAGAG AGAAATACCC
GAATTATAAG TATCGACCTC GTCGGAAGGC GAAGATGCTG CAAAACAGTT GCAGTTTGCT
TCCGGCAGAT CCCTCTTCGG TCCCTGCCAG AGAAGTGTAC AACAACAGGT TGTACAGGGA
TGACTGTACC AAAGCCACGC ACTCAAGAAT GCAGCACCAG TTAGTCCACT TACCGCCCAT
CAACACAGCC AGCTCACCGC AGCAACGGGA CCGCTACAGC CACTCGATTC CAATCATATG
CCAAAGCTGT AG
Find the 6 errors in the best match, giving the location number of each error and whether it is a
mismatch, a gap in the databank sequence, or a gap in your sequence.
mystery sequence #2:

In Michael Crichton's Jurassic Park (p. 103), a putative dinosaur DNA sequence is given. What is the
nearest match in the database to this sequence? Is Crichton pulling one over on us? In the output
screen, scroll down to the first diagram of the first match (the one with the letter pairings separated
by |). Do you see anything unusual about the pattern of mismatches? Extra credit for the correct
interpretation of this odd match. (Hint: The sequence is formatted exactly the way it appears in the
book. Further, the mismatches have nothing to do with biology or BLAST.)
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
mystery sequence #3 (this is a protein sequence):
MAHETSFNDA LDYIYIANSM NDRAFLIAEP HPEQPNVDGQ DQDDAELEEL DDMAVTDDGQ
LEDTNNNNNS KRYYSSGKRR ADFIGSLALK PPPTDVNTTT TTAGSPLATA ALAAAAASAS
VAAAAARITA KAAHRALTTK QDATSSPASS PALQLIDMDN NYTNVAVGLG AMLLNDTLLL
EGNDSSLFGE MLANRSGQLD LINGTGGLNV TTSKVAEDDF TQLLRMAVTS VLLGLMILVT
IIGNVFVIAA IILERNLQNV ANYLVASLAV ADLFVACLVM PLGAVYEISQ GWILGPELCD
IWTSCDVLCC TASILHLVAI AVDRYWAVTN IDYIHSRTSN RVFMMIFCVW TAAVIVSLAP
QFGWKDPDYL QRIEQQKCMV SQDVSYQVFA TCCTFYVPLL VILALYWKIY QTARKRIHRR
RPRPVDAAVN NNQPDGGAAT DTKLHRLRLR LGRFSTAKSK TGSAVGVSGP ASGGRALGLV
DGNSTNTVNT VEDTEFSSSN VDSKSRAGVE APSTSGNQIA TVSHLVALAK QQGKSTAKSS
AAVNGMAPSG RQEDDGQRPE HGEQEDREEL EDQDEQVGPQ PTTATSAMTA AGTNESEDQC
KANGVEVLED PQLQQQLEQV QQLQKSVKSG GGGGASTSNA TTITSISALS PQTPTSQGVG
IAAAAAGPMT AKTSTLTSCN QSHPLCGTAN ESPSTPEPRS RQPTTPQQQP HQQAHQQQQQ
QQQLSSIANP MQKVNKRKET LEAKRERKAA KTLAIITGAF VVCWLPFFVM ALTMPLCAAC
QISDSVASLF LWLGYFNSTL NPVIYTIFSP EFRQAFKRIL FGGHRPVHYR SGKL
Also find the nearest match to sequence #3 among humans and among rats. In each case, give the
E-value of the match.
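For the error-finding task on mystery sequence #1, the three error types (mismatch, gap in the databank sequence, gap in your sequence) can be told apart mechanically once the two aligned rows are written out with `-` marking a gap. The sketch below is a hypothetical helper, not a BLAST feature: the name `classify_differences` and the toy strings are invented for illustration and are not taken from any real BLAST output.

```python
def classify_differences(query_aln: str, subject_aln: str):
    """Compare two aligned rows column by column.

    query_aln   -- your sequence, with '-' wherever the alignment has a gap
    subject_aln -- the databank sequence, in the same gapped form
    Returns a list of (alignment column, query position, kind) tuples,
    both counts starting at 1.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have the same length")

    differences = []
    query_pos = 0  # number of real (non-gap) query bases seen so far
    for column, (q, s) in enumerate(zip(query_aln, subject_aln), start=1):
        if q != "-":
            query_pos += 1
        if q == s:
            continue  # the bases agree, nothing to report
        if q == "-":
            kind = "gap in your sequence"
        elif s == "-":
            kind = "gap in the databank sequence"
        else:
            kind = "mismatch"
        differences.append((column, query_pos, kind))
    return differences


if __name__ == "__main__":
    # Toy rows made up for this sketch.
    toy_query = "ATGT-CCATG"
    toy_subject = "ATGTGCGA-G"
    for item in classify_differences(toy_query, toy_subject):
        print(item)
```

For a gap in your sequence, the reported query position is that of the last real base before the gap, since the gap itself has no position in the query.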