Hash Functions and Load, Python Dictionaries, Matching DNA Sequences Solved Problems, Exercises of Algorithms and Programming

Massachusetts Institute of Technology (MIT), Algorithms and Programming

Problem set with solutions on Hash Functions and Load, Python Dictionaries, Matching DNA Sequences

Typology: Exercises

2019/2020

Uploaded on 04/30/2020

jeny 🇺🇸


Introduction to Algorithms: 6.006

Massachusetts Institute of Technology 7 October, 2011

Professors Erik Demaine and Srini Devadas Problem Set 4

Problem Set 4

Both theory and programming questions are due Friday, 14 October at 11:59PM.

Remember that for the written response question, your goal is to communicate. Full credit will

be given only to a correct solution which is described clearly. Convoluted and obtuse descriptions

might receive low marks, even when they are correct. Also, aim for concise solutions, as it will

save you time spent on write-ups, and also help you conceptualize the key idea of the problem.

We will provide the solutions to the problem set 10 hours after the problem set is due, which

you will use to find any errors in the proof that you submitted. You will need to submit a critique

of your solutions by Thursday, October 20th, 11:59PM. Your grade will be based on both your

solutions and your critique of the solutions.

Problem 4-1. [35 points] Hash Functions and Load

(a) Imagine that an algorithm requires us to hash strings containing English phrases.

Knowing that strings are stored as sequences of characters, Alyssa P. Hacker decides

to simply use the sum of those character values (modulo the size of her hash table)

as the string’s hash. Will the performance of her implementation match the expected

value shown in lecture?

1. Yes, the sum operation will space strings out nicely by length.

2. Yes, the sum operation will space strings out nicely by the characters they contain.

3. No, because reordering the words in a string will not produce a different hash.

4. No, because the independence condition of the simple uniform hashing assumption is violated.

Solution: No, because reordering the words in a string will not produce a different

hash.
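
The collision described in the solution can be checked directly. The following is a minimal sketch, assuming that "character values" means Unicode code points; the function name `char_sum_hash` and the table size are made up for the illustration and do not come from the problem set.

```python
def char_sum_hash(s: str, table_size: int) -> int:
    """Hash a string as the sum of its character code points, modulo the table size."""
    return sum(ord(ch) for ch in s) % table_size


TABLE_SIZE = 1024  # arbitrary size, chosen only for this illustration

h1 = char_sum_hash("the dog bit the man", TABLE_SIZE)
h2 = char_sum_hash("the man bit the dog", TABLE_SIZE)

# Addition is commutative, so any reordering of the same characters
# (swapped words, anagrams) lands in the same slot.
print(h1 == h2)  # True
```

Because English phrases frequently share the same words in different orders, collisions of this kind are systematic rather than random.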

(b) Alyssa decides to implement both collision resolution and dynamic resizing for her

hash table. However, she doesn’t want to do more work than necessary, so she wonders

if she needs both to maintain the correctness and performance she expects. After all,

if she has dynamic resizing, she can resize to avoid collisions; and if she has collision

resolution, collisions don’t cause correctness issues. Which statement about these two

properties is true?

1. Dynamic resizing alone will preserve both properties.

2. Dynamic resizing alone will preserve correctness, but not performance.


3. Collision resolution alone will preserve performance, but not correctness.

4. Both are necessary to maintain performance and correctness.

Solution: Both are necessary to maintain performance and correctness. Without collision resolution, correctness is lost: two keys could have an actual hash collision, and then no amount of resizing will let both be entered into the table. Without dynamic resizing, the load factor will get large, and every lookup will degrade to linear time (assuming chaining).

(c) Suppose that Alyssa decides to implement resizing. If Alyssa is enlarging a table of size m into a table of size m′, and the table contains n elements, what is the best time complexity she can achieve?

Θ(m)
Θ(m′)
Θ(n)
Θ(nm′)
Θ(m + m′)
Θ(m + n)
Θ(m′ + n)

Solution: Θ(m′ + n). It takes O(m′) time to create a new hash table (allocating the memory can take constant time, but it then needs to be initialized). It takes O(m + n) time to go through each slot in the old table and copy each item. In total, it comes out to Θ(m′ + m + n), but since m < m′, the answer is just Θ(m′ + n).

(d) In lecture, we discussed doubling the size of our hash table. Ivy H. Crimson begins to implement this approach (that is, she lets m′ = 2m) but stops when it occurs to her that she might be able to avoid wasting half of the memory the table occupies on empty space by letting m′ = m + k instead, where k is some constant. Does this work? If so, why do you think we don’t do it? There is a good theoretical reason as well as several additional practical concerns; a complete answer will touch on both points.

Solution: Theoretically, our cost per insertion will now be O(n) even after amortization. Loosely speaking, we were able to achieve O(1) amortized cost because we performed an O(n)-time operation only once every O(n) steps. Now, however, we’re performing this O(n) operation every O(1) steps (once every k insertions, with k constant). Practically, the computer will play more nicely with operations based around doubling (doubling is a fast operation, allocating memory blocks of sizes that are powers of two has plenty of advantages, etc.).
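
The amortized argument can be made concrete by counting how many element copies each growth policy causes. This is a simplified sketch, not part of the problem set: it assumes the table is resized only when it is completely full, it counts only the per-element copies (ignoring the Θ(m′) cost of initializing the new table), and the names `total_copy_work`, `initial_size`, and the choice k = 8 are illustrative.

```python
def total_copy_work(num_inserts, grow, initial_size=8):
    """Count element copies caused by resizing while inserting num_inserts items.

    `grow` maps the current table size m to the new size m'.
    A resize happens when the table is full and copies every stored element.
    """
    size = initial_size
    stored = 0
    copies = 0
    for _ in range(num_inserts):
        if stored == size:
            copies += stored      # rehash every existing element
            size = grow(size)
        stored += 1
    return copies


n = 100_000
doubling = total_copy_work(n, lambda m: 2 * m)   # m' = 2m
additive = total_copy_work(n, lambda m: m + 8)   # m' = m + k with k = 8

print("copies per insert, doubling:", doubling / n)
print("copies per insert, additive:", additive / n)
```

With doubling, the copies form a geometric series, so the total stays below 2n and the cost per insert is bounded by a constant. With additive growth, the copies form an arithmetic series that grows as n²/k, so the cost per insert grows linearly with n.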

Problem 4-2. [10 points] Python Dictionaries

We’re going to get started by checking out a file from Python’s Subversion repository at svn.python.org. The Python project operates a web frontend to their version control system, so we’ll be able to do this using a browser.

and their meanings.) These sequences are very long, so comparing subsequences of them quickly is important. We’ve provided code in kfasta.py that reads the .fa files storing this data.

(a) Let’s start with subsequenceHashes, which returns all length-k subsequences and their hashes (and perhaps other information, if there’s anything else you might find useful).

Hint: There will likely be many of these matches; the DNA sequences are tens of millions of nucleotides long. To avoid keeping them all in memory at once, implement your function as a generator. See the Python reference materials available online for details if you aren’t familiar with this important language construct.

(b) Implement Multidict and verify that your work passes the simple sanity tests provided.

Multidict should behave just like a Python dictionary, except that it can store multiple values per key. If no values exist for a key, it returns an empty list; otherwise, it returns the list of associated values. You may (and probably should) use the Python dictionary in your implementation.

(c) Now it’s time to implement getExactSubmatches. Ignore the parameter m for the time being; we’ll get to that in the next part. Again, implementing this function as a generator is probably a good idea. (You will probably have many, many matches; think about the combinatorics of the situation briefly.)

As a hint, consider that much of the work has already been done by Multidict and subsequenceHashes; also take a peek at the RollingHash implementation we’ve given you. With these building blocks, your solution probably does not need to be very complex (or more than a few lines).

This function should return pairs of offsets into the inputs. A tuple (x, y) being returned indicates that the k-length subsequence at position x in the first input matches the subsequence at position y in the second input.

We’ve provided a simple sanity test; your solution should be correct at this point (that is, dnaseq.py will produce the right output) but it’ll probably be too slow to be useful. If you like, you can try running it on the first portion of two inputs; we’ve provided two such prefixes (the short files in the data directory) that might be helpful.

(d) The most significant reason why your solution is presently too slow to be useful is that you are hashing and inserting into your hash table tens of millions of elements, and then performing tens of millions of lookups into that hash table.

Implement intervalSubsequenceHashes, which returns the same thing as subsequenceHashes except that it hashes only one in m subsequences. (A good implementation will not do more work than is necessary.)

Modify your implementation of getExactSubmatches to honor m only for sequence A. Consider why we still see approximately the same result, and why we can’t further improve performance by applying this technique to sequence B as well.
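
The building blocks described in parts (a) through (d) can be sketched on small in-memory strings. This is a toy illustration under several assumptions, not the reference solution: the function names, signatures, and the `put`/`get` method names are hypothetical, the inputs are plain Python strings rather than streams read by kfasta.py, and Python's built-in `hash` on each slice stands in for the provided RollingHash (whose interface is not shown here), so each hash costs O(k) instead of O(1).

```python
class Multidict:
    """Dictionary-like container that stores a list of values per key."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table.setdefault(key, []).append(value)

    def get(self, key):
        # An absent key yields an empty list rather than raising KeyError.
        return self._table.get(key, [])


def interval_subsequence_hashes(seq, k, m):
    """Yield (hash, offset, subsequence) for every m-th length-k subsequence."""
    for i in range(0, len(seq) - k + 1, m):
        sub = seq[i:i + k]
        yield hash(sub), i, sub


def subsequence_hashes(seq, k):
    """Yield (hash, offset, subsequence) for every length-k subsequence."""
    return interval_subsequence_hashes(seq, k, 1)


def exact_submatches(a, b, k, m=1):
    """Yield (x, y) where a[x:x+k] == b[y:y+k], sampling every m-th offset of a."""
    table = Multidict()
    for h, x, sub in interval_subsequence_hashes(a, k, m):
        table.put(h, (x, sub))
    for h, y, sub_b in subsequence_hashes(b, k):
        for x, sub_a in table.get(h):
            if sub_a == sub_b:  # guard against distinct strings sharing a hash
                yield (x, y)


print(list(exact_submatches("ACGTACGT", "TACG", 4)))  # [(3, 0)]
```

Sampling only sequence A still checks every offset of B against the table, which is why the interval parameter is applied to one side only in this sketch.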

(e) Run comparisons between the two human samples (paternal and maternal) and between the paternal sample and each of the animal samples.

Feel free to take a peek at how the image-generation code works. Conceptually, what it’s doing is keeping track of how many of your (x, y) match tuples land in each of a two-dimensional grid of bins, each of which corresponds to a pixel in the output image. At the end, it normalizes the counts so that the highest count observed is totally black and an empty bin is white.

Think for a second about what a perfect match (e.g., comparing a sequence to itself) should look like. Try comparing the two human samples you have (maternal and paternal), one of the humans against the chimp sample, and then against the dog sample. Make sure your results make sense! We’ve posted what our reference solution produced for the human-human comparison, the human-chimp comparison, and the human-dog comparison.

Please submit the code that you wrote. (You should only have had to modify dnaseq.py, so that’s all you need to submit.)
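
The binning and normalization described above can be sketched as follows. This is not the provided image-generation code: the function names, the row/column orientation, the 0 to 255 grayscale range, and the linear scaling between white and black are all assumptions made for the illustration.

```python
def bin_matches(matches, len_a, len_b, width, height):
    """Count how many (x, y) match tuples fall into each cell of a width x height grid."""
    counts = [[0] * width for _ in range(height)]
    for x, y in matches:
        col = min(x * width // len_a, width - 1)    # offset in the first input
        row = min(y * height // len_b, height - 1)  # offset in the second input
        counts[row][col] += 1
    return counts


def to_grayscale(counts):
    """Map counts to pixel values: 255 (white) for empty bins, 0 (black) for the peak."""
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[255] * len(row) for row in counts]
    return [[255 - (255 * c) // peak for c in row] for row in counts]


# Comparing a sequence of length 100 with itself at k = 1 granularity:
# every offset i matches itself, so all matches are of the form (i, i).
self_matches = [(i, i) for i in range(100)]
counts = bin_matches(self_matches, 100, 100, 4, 4)
pixels = to_grayscale(counts)
for row in pixels:
    print(row)
```

In this toy self-comparison, only the diagonal bins receive matches, so the output is a black diagonal on a white background.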
