





Two randomized algorithms: one for finding percentiles in an array and another for finding the minimum cut in an undirected graph. The selection algorithm uses a divide-and-conquer approach to find the kth smallest element in an array, while the minimum cut algorithm uses a randomized method to find the minimum cut in a graph. The analysis of both algorithms is provided.






CSE 103: Probability and statistics Winter 2010
Suppose UCSD tracks this year’s graduating class in computer science and finds out everyone’s salary ten years down the line. What might these numbers look like? Well, if there are (say) 100 students, the spread might be roughly:
The mean salary would then be something like 100K, which UCSD could report with pride in its brochure for prospective students. Now suppose that one student managed to strike it rich and become a billionaire. Accordingly, take the spread of salaries above and convert one of the 200K salaries to 1000000K. What would be the new mean salary? Answer: at least 10 million dollars! (Do you see why?) If UCSD were to report this number, nobody would take it seriously, despite its being perfectly truthful. The problem is that the mean is extremely sensitive to outliers – it is very easily thrown off by a single number that is unusually small or unusually large. In many circumstances, therefore, the preferred summary statistic is the median, or 50th percentile. For the salary data, for instance, the median would remain unchanged (at around 100K) even if a few people were to become billionaires, or if a few more people were to lose their jobs. We’re also interested in other percentiles – the 25th, 75th, and so on. How can we compute these for a very large data set (for instance, a data set giving the salary of everyone in the US)?
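The arithmetic behind the billionaire example can be checked with a small sketch. The salary spread below is hypothetical (invented here so that the mean and median both come out near 100K); only the outlier substitution follows the text.

```python
from statistics import mean, median

# Hypothetical salary spread for 100 students, in thousands of dollars.
# These numbers are made up for illustration; they are not the original table.
salaries = [0] * 10 + [50] * 20 + [100] * 40 + [150] * 20 + [200] * 10

mean_before = mean(salaries)
median_before = median(salaries)

# One student with a 200K salary becomes a billionaire (1,000,000K).
with_outlier = salaries.copy()
with_outlier[with_outlier.index(200)] = 1_000_000

mean_after = mean(with_outlier)
median_after = median(with_outlier)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

With this made-up spread, the single outlier alone contributes 1,000,000K / 100 = 10,000K to the mean, while the middle of the sorted list does not move.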
Here is the problem, formally.
Selection
Input: An array S[1 · · · n] of n numbers; an integer k between 1 and n.
Output: The kth smallest number in the array.
The median corresponds to k = ⌈n/2⌉, while k = 1 retrieves the very smallest element. The pth percentile (0 ≤ p ≤ 100) can be obtained with k = ⌈pn/100⌉. The most natural algorithm for this problem is:
Sort S and return S[k]
The running time here is dominated by that of sorting, which is O(n log n). This is pretty good, but we’d like something faster since we often need to compute percentiles of enormous data sets.
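As a baseline, the sort-based approach might look like the sketch below. The function names are hypothetical, and clamping k to at least 1 (so that p = 0 maps to the minimum) is an assumption added here, since ⌈pn/100⌉ would otherwise give k = 0.

```python
import math

def select_by_sorting(S, k):
    """Return the k-th smallest element of S (k is 1-based), in O(n log n)."""
    if not 1 <= k <= len(S):
        raise ValueError("k must be between 1 and len(S)")
    return sorted(S)[k - 1]

def percentile_rank(n, p):
    """Rank k = ceil(p*n/100), clamped to at least 1 (an added assumption)."""
    return max(1, math.ceil(p * n / 100))

S = [2, 36, 5, 21, 8, 13, 11, 20, 5, 4, 1]
k_median = math.ceil(len(S) / 2)          # k = ceil(n/2)
print(select_by_sorting(S, k_median))     # median of the example array
print(select_by_sorting(S, percentile_rank(len(S), 75)))
```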
Here’s a randomized (and recursive) procedure for selection. For any number v, imagine splitting array S into three categories: elements smaller than v, those equal to v (there might be duplicates), and those greater than v. Call these SL, Sv, and SR respectively. For instance, if the array S : 2 36 5 21 8 13 11 20 5 4 1
is split on v = 5, the three subarrays generated are
SL : 2 4 1
Sv : 5 5
SR : 36 21 8 13 11 20
The search can instantly be narrowed down to one of these sublists. If we want, say, the eighth-smallest element of S, we know it must be the third-smallest element of SR since |SL| + |Sv| = 5. That is, selection(S, 8) = selection(SR, 3). More generally, by checking k against the sizes of the subarrays, we can quickly determine which of them holds the desired element:
selection(S, k) =
    selection(SL, k)                  if k ≤ |SL|
    v                                 if |SL| < k ≤ |SL| + |Sv|
    selection(SR, k − |SL| − |Sv|)    if k > |SL| + |Sv|.
The three sublists SL, Sv , SR can be computed from S in linear time, scanning left-to-right. We then recurse on the appropriate sublist. The effect of the split is thus to shrink the number of elements from |S| to at most max{|SL|, |SR|}. How much of an improvement is this, and what is the final running time?
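The recursive procedure can be sketched as follows. This is a minimal illustration, assuming that v is chosen uniformly at random from S; the `rng` parameter is a hypothetical addition for reproducibility. It uses three list comprehensions rather than a single left-to-right scan, which is still linear time per split.

```python
import random

def selection(S, k, rng=random):
    """Return the k-th smallest element of S (k is 1-based)."""
    if not 1 <= k <= len(S):
        raise ValueError("k must be between 1 and len(S)")
    v = rng.choice(S)                  # assumption: uniformly random splitter
    SL = [x for x in S if x < v]       # elements smaller than v
    Sv = [x for x in S if x == v]      # copies of v (never empty)
    SR = [x for x in S if x > v]       # elements greater than v
    if k <= len(SL):
        return selection(SL, k, rng)
    if k <= len(SL) + len(Sv):
        return v
    return selection(SR, k - len(SL) - len(Sv), rng)

S = [2, 36, 5, 21, 8, 13, 11, 20, 5, 4, 1]
print(selection(S, 8))  # 13, the eighth-smallest element
```

Whatever splitters are drawn, the answer is the same; only the number of recursive calls varies from run to run.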
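By how much does a single split reduce the size of the array? Well, this depends on the choice of v. In the worst case, v is always the smallest or largest element of the array, so each split shrinks the array by just one element: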
(time to process array of n elements) = (time to split) + (time to process array of n − 1 elements).
Since the time to split is linear, O(n), this works out to a total running time of n + (n−1) + (n−2) + · · · = O(n^2), which is really bad. Fortunately, this case is unlikely to occur. The probability of consistently picking an element v which is the smallest or largest is minuscule:

$$
\frac{2}{n} \cdot \frac{2}{n-1} \cdot \frac{2}{n-2} \cdots \frac{2}{2} = \frac{2^{n-1}}{n!}
$$

(do you see where this expression comes from?).
The two algorithms we’ve seen, for finding percentiles and for sorting, are both guaranteed to return the correct answer. But if you run them multiple times on the same input, their running times will fluctuate, even though the answer will be the same every time. Therefore we are interested in their expected running time. We call these Las Vegas algorithms.

There’s another type of algorithm, called a Monte Carlo algorithm, that always has the same running time on any given input, but is not guaranteed to return the correct answer. It merely guarantees that it has some probability p > 0 of being correct. Therefore, if you run it multiple times on an input, you’ll get many different answers, of which roughly a p fraction will be correct. In many cases, it is possible to look through these answers and figure out which one(s) are right.

In a Monte Carlo algorithm, how much does the probability of success increase if you run it multiple times, say k times?

Pr(wrong every time) = (1 − p)^k ≤ e^(−pk).
To make this less than some δ, it is enough to run the algorithm (1/p) log(1/δ) times. For instance, to make the failure probability less than 1 in a million, just run the algorithm 20/p times (a million is roughly 2^20).
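The repetition count can be computed directly. The sketch below assumes the logarithm is the natural one (which is what the bound e^(−pk) ≤ δ calls for); the function names are hypothetical.

```python
import math

def runs_needed(p, delta):
    """Smallest integer k with exp(-p * k) <= delta, i.e. k >= (1/p) ln(1/delta)."""
    return math.ceil(math.log(1 / delta) / p)

def failure_probability(p, k):
    """Exact probability that k independent runs are all wrong."""
    return (1 - p) ** k

p, delta = 0.1, 1e-6
k = runs_needed(p, delta)
print(k, failure_probability(p, k))
print("20/p =", 20 / p)  # the rule of thumb from the text is a safe overestimate
```

Since ln(10^6) is about 13.8, running the algorithm 20/p times comfortably meets the one-in-a-million target.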
Suppose a mail order company has the resources to prepare two different versions of its catalog, and it wishes to target each version towards a particular sector of its customer base. The data it has is a list of its regular customers, along with their purchase histories. How should this set of customers be partitioned into two coherent groups? One way to do this is to create a graph with a node for each of the regular customers, and an edge between any two customers whose purchase patterns are similar. The goal is then to divide the nodes into two pieces which have very few edges between them. More formally, the minimum cut of an undirected graph G = (V, E) is a partition of the nodes into two groups V1 and V2 (that is, V = V1 ∪ V2 and V1 ∩ V2 = ∅), so that the number of edges between V1 and V2 is minimized. In the graph below, for instance, the minimum cut has size two and partitions the nodes into V1 = {a, b, e, f} and V2 = {c, d, g, h}.
[Figure: an undirected graph on eight nodes, with a, b, c, d in the top row and e, f, g, h in the bottom row.]
Here’s a randomized algorithm for finding the minimum cut:
An example is shown in Figure 4.1. Notice how some nodes end up having multiple edges between them.
[Figure panels: the graph after each contraction. The sequence of choices is:]
14 edges to choose from. Pick b − f (probability 1/14).
13 edges to choose from. Pick g − h (probability 1/13).
12 edges to choose from. Pick d − gh (probability 1/6).
10 edges to choose from. Pick a − e (probability 1/10).
Pick ab − ef (probability 4/9).
Pick c − dgh (probability 3/5).
Figure 4.1. Karger’s algorithm at work.
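The contraction process illustrated in Figure 4.1 can be sketched in code. This is a hedged illustration rather than the notes' own pseudocode: it assumes the graph is connected, represents merged nodes with a union-find structure, and keeps parallel edges as repeated entries in the edge list so that a uniformly random pick weights them correctly. The example graph (two triangles joined by one edge) is hypothetical, not the graph from the figure.

```python
import random

def karger_cut(nodes, edges, rng=random):
    """One run of random contraction on a connected graph.

    Returns (cut_size, group), where group maps each node to its merged node.
    """
    leader = {v: v for v in nodes}

    def find(v):
        # Follow leader pointers to the representative of v's merged node.
        while leader[v] != v:
            leader[v] = leader[leader[v]]
            v = leader[v]
        return v

    remaining = len(nodes)
    live = list(edges)
    while remaining > 2:
        # Drop self-loops; parallel edges stay as separate entries.
        live = [(a, b) for a, b in live if find(a) != find(b)]
        a, b = rng.choice(live)        # pick a remaining edge at random
        leader[find(a)] = find(b)      # contract it: merge its endpoints
        remaining -= 1
    cut = [(a, b) for a, b in live if find(a) != find(b)]
    return len(cut), {v: find(v) for v in nodes}

def min_cut(nodes, edges, runs, rng=random):
    """Repeat the Monte Carlo run and keep the smallest cut found."""
    return min(karger_cut(nodes, edges, rng)[0] for _ in range(runs))

# Hypothetical example: two triangles joined by the single edge c-d.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")]
print(min_cut(nodes, edges, runs=200, rng=random.Random(0)))
```

A single run always returns some valid cut, but not necessarily the minimum one, which is why the runs are repeated and the smallest result kept.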
When you give a search phrase to Google (or any other search engine), you immediately get back a list of documents containing that phrase. But there are billions of documents on the web – how does Google look through all of them so quickly? The answer is, it uses hashing.
Suppose you have a large collection of items x1, ..., xn that you want to store (for instance, all the documents on the web), where these items are drawn from some set U (for instance, the set of all conceivable documents). The requirements are:
Here’s a picture of the data structure.
The storage used is O(n). What about the query time?
Suppose query q is picked at random, so that it is equally likely to hash to any of the locations 1, 2, ..., n. What is the expected query time?
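$$
\begin{aligned}
\text{Expected query time} &= \sum_{i=1}^{n} \Pr(q \text{ hashes to location } i) \cdot (\text{length of list at } T[i]) \\
&= \frac{1}{n} \sum_{i=1}^{n} (\text{length of list at } T[i]) \\
&= \frac{1}{n} \cdot n = 1
\end{aligned}
$$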
So the average query time is constant!
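A table of this kind might be sketched as below. This is an assumption-laden illustration: the class and method names are hypothetical, Python lists stand in for the linked lists at each T[i], and Python's built-in `hash` stands in for the hash function.

```python
class ChainedHashTable:
    """Table T with n slots; each slot holds a list (chain) of items."""

    def __init__(self, items):
        self.n = max(1, len(items))            # one slot per stored item
        self.T = [[] for _ in range(self.n)]
        for x in items:
            self.T[self._slot(x)].append(x)

    def _slot(self, x):
        # Built-in hash() is a stand-in for the hash function.
        return hash(x) % self.n

    def __contains__(self, q):
        # Query time is proportional to the length of one chain.
        return q in self.T[self._slot(q)]

    def average_chain_length(self):
        return sum(len(chain) for chain in self.T) / self.n

items = ["apple", "banana", "cherry", "date"]
table = ChainedHashTable(items)
print("apple" in table, "zebra" in table)
print(table.average_chain_length())
```

Because n items are spread over n slots, the chain lengths always sum to n, so the average chain length is 1 no matter how unevenly the items land.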
What is the worst case query time; that is, what is the length of the longest linked list in T? Equivalently, when you throw n balls in n bins, what is the size of the largest bin? We’ll see that with very high probability, no bin gets ≥ log n balls. For any bin i, let Ei be the event that it gets ≥ log n balls.
$$
\Pr(E_i) \le \binom{n}{\log n} \left(\frac{1}{n}\right)^{\log n}.
$$
(Do you see why?) It turns out, using all sorts of calculations, that this is at most 1/n^2. Therefore,
$$
\Pr(\text{some bin gets} \ge \log n \text{ balls}) = \Pr(E_1 \cup E_2 \cup \cdots \cup E_n) \le \Pr(E_1) + \cdots + \Pr(E_n) \le \frac{1}{n}.
$$
For instance, if you throw a million balls into a million bins, then the chance that there is a bin with ≥ 20 balls is at most 1 in a million. Getting back to hashing, this means that the worst case query time is (with high probability) O(log n).
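The balls-and-bins claim is easy to explore empirically. The sketch below is a simulation only, not a proof; it assumes the logarithm is base 2 (consistent with the 20-ball threshold for a million bins), and the function name is hypothetical.

```python
import math
import random

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's count."""
    bins = [0] * n
    for _ in range(n):
        bins[rng.randrange(n)] += 1
    return max(bins)

n = 1000
rng = random.Random(0)
observed = max_load(n, rng)
print(observed, "vs log2(n) =", math.log2(n))
```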
Here’s a variant on the balls and bins setup. As usual, you have before you a row of n bins, along with a collection of n identical balls. But now, when throwing each ball, you pick two bins at random and you put the ball in whichever of them is less full. It turns out, using an analysis that is too complicated to get into here, that under this small change, the maximum bin size will be just O(log log n) instead of O(log n). This inspires an alternative hashing scheme:
The storage requirement is still O(n), the average query time is still O(1), but now the worst case query time drops to O(log log n).
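The effect of choosing between bins can also be simulated. In this sketch the parameter d (the number of candidate bins per ball) is a hypothetical generalization: d = 1 is the plain balls-and-bins process, and d = 2 is the variant described above. In a hashing scheme built on this idea, the candidate bins would presumably come from two hash functions, which is an assumption not spelled out here.

```python
import random

def max_load_with_choices(n, d, rng):
    """Throw n balls into n bins; each ball goes to the least full of d random bins."""
    bins = [0] * n
    for _ in range(n):
        candidates = [rng.randrange(n) for _ in range(d)]
        target = min(candidates, key=lambda i: bins[i])  # least full candidate
        bins[target] += 1
    return max(bins)

n = 1000
one_choice = max_load_with_choices(n, 1, random.Random(0))
two_choices = max_load_with_choices(n, 2, random.Random(0))
print("one choice:", one_choice, "two choices:", two_choices)
```

Running this for several values of n gives a feel for how slowly the two-choice maximum grows.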