Randomized Algorithms: Selection Algorithm and Minimum Cut, Study notes of Computer Science

University of California - San Diego Computer Science

Two randomized algorithms: one for finding percentiles in an array and another for finding the minimum cut in an undirected graph. The selection algorithm uses a divide-and-conquer approach to find the kth smallest element in an array, while the minimum cut algorithm uses a randomized method to find the minimum cut in a graph. The analysis of both algorithms is provided.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010

koofers-user-37b 🇺🇸

CSE 103: Probability and statistics Winter 2010

Topic 4 — Randomized algorithms

4.1 Finding percentiles

4.1.1 The mean as a summary statistic

Suppose UCSD tracks this year’s graduating class in computer science and finds out everyone’s salary ten years down the line. What might these numbers look like? Well, if there are (say) 100 students, the spread might be roughly:

- A few people with zero salary (unemployed)
- A few grad students with salary around 20K
- A few part-timers with salary around 50K
- A whole lot of software engineers with salaries between 100K and 200K

The mean salary would then be something like 100K, which UCSD could report with pride in its brochure for prospective students.

Now suppose that one student managed to strike it rich and become a billionaire. Accordingly, take the spread of salaries above and convert one of the 200K salaries to 1000000K. What would be the new mean salary? Answer: at least 10 million dollars! (Do you see why?) If UCSD were to report this number, nobody would take it seriously, despite its being perfectly truthful.

The problem is that the mean is extremely sensitive to outliers – it is very easily thrown off by a single number that is unusually small or unusually large. In many circumstances, therefore, the preferred summary statistic is the median, or 50th percentile. For the salary data, for instance, the median would remain unchanged (at around 100K) even if a few people were to become billionaires, or if a few more people were to lose their jobs.

We’re also interested in other percentiles – the 25th, 75th, and so on. How can we compute these for a very large data set (for instance, a data set giving the salary of everyone in the US)?
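To make the arithmetic concrete, here is a small Python sketch. The head-counts (5 unemployed, 5 grad students, 10 part-timers, 80 engineers) are hypothetical, chosen only to match the rough spread described above; salaries are in thousands of dollars.

```python
from statistics import mean, median

# Hypothetical class of 100 students; salaries in thousands of dollars (K).
salaries = [0] * 5 + [20] * 5 + [50] * 10 + [100] * 40 + [150] * 30 + [200] * 10

mean_before = mean(salaries)
median_before = median(salaries)

# One engineer strikes it rich: replace a single 200K salary with 1000000K.
rich = sorted(salaries)
rich[-1] = 1_000_000

mean_after = mean(rich)
median_after = median(rich)

print(mean_before, median_before)  # mean is near 100K, median is 100K
print(mean_after, median_after)    # mean jumps above 10000K, median is unchanged
```

The single outlier contributes 1000000K / 100 = 10000K to the mean on its own, which is where the "at least 10 million dollars" figure comes from.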

4.1.2 Selection

Here is the problem, formally.

Selection
Input: An array S[1 ··· n] of n numbers; an integer k between 1 and n
Output: The kth smallest number in the array.

The median corresponds to k = ⌈n/2⌉, while k = 1 retrieves the very smallest element. The pth percentile (0 < p ≤ 100) can be obtained with k = ⌈pn/100⌉.

The most natural algorithm for this problem is:

Sort S and return S[k]

The running time here is dominated by that of sorting, which is O(n log n). This is pretty good, but we’d like something faster since we often need to compute percentiles of enormous data sets.
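As a baseline, the sort-based approach can be sketched in a few lines of Python. The function names and the sample array are illustrative, not from the notes. Python lists are 0-indexed, so the kth smallest element sits at index k − 1; the sketch also rejects p = 0, since that would give k = 0, which is outside the range 1 to n.

```python
import math

def select_by_sorting(S, k):
    """Return the k-th smallest element of S (k is 1-indexed, as in the text)."""
    if not 1 <= k <= len(S):
        raise ValueError("k must be between 1 and n")
    return sorted(S)[k - 1]  # sorted() copies S; the sort costs O(n log n)

def percentile(S, p):
    """Return the p-th percentile of S using k = ceil(p * n / 100)."""
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    k = math.ceil(p * len(S) / 100)
    return select_by_sorting(S, k)

data = [2, 36, 5, 21, 8, 13, 11, 20, 5, 4, 1]  # illustrative input, n = 11
print(percentile(data, 50))  # the median: k = ceil(11 / 2) = 6
```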

4.1.3 A randomized algorithm

Here’s a randomized (and recursive) procedure for selection. For any number v, imagine splitting array S into three categories: elements smaller than v, those equal to v (there might be duplicates), and those greater than v. Call these SL, Sv, and SR respectively. For instance, if the array

S : 2 36 5 21 8 13 11 20 5 4 1

is split on v = 5, the three subarrays generated are

SL : 2 4 1
Sv : 5 5
SR : 36 21 8 13 11 20

The search can instantly be narrowed down to one of these sublists. If we want, say, the eighth-smallest element of S, we know it must be the third-smallest element of SR since |SL| + |Sv| = 5. That is, selection(S, 8) = selection(SR, 3). More generally, by checking k against the sizes of the subarrays, we can quickly determine which of them holds the desired element:

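$$
\mathrm{selection}(S, k) =
\begin{cases}
\mathrm{selection}(S_L, k) & \text{if } k \le |S_L| \\
v & \text{if } |S_L| < k \le |S_L| + |S_v| \\
\mathrm{selection}(S_R,\; k - |S_L| - |S_v|) & \text{if } k > |S_L| + |S_v|.
\end{cases}
$$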

The three sublists SL, Sv , SR can be computed from S in linear time, scanning left-to-right. We then recurse on the appropriate sublist. The effect of the split is thus to shrink the number of elements from |S| to at most max{|SL|, |SR|}. How much of an improvement is this, and what is the final running time?
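The recursion can be sketched in Python as follows. This is a minimal illustration, not the notes' own code: it assumes v is picked uniformly at random from S (the analysis in the next section talks about "picking" v, but the excerpt does not spell out the rule), and it builds the three sublists with three separate scans rather than a single left-to-right pass, which is still linear time.

```python
import random

def selection(S, k, rng=random):
    """Return the k-th smallest element of S (k is 1-indexed)."""
    if not 1 <= k <= len(S):
        raise ValueError("k must be between 1 and len(S)")
    v = rng.choice(S)                 # assumption: pivot chosen uniformly at random
    SL = [x for x in S if x < v]      # elements smaller than v
    Sv = [x for x in S if x == v]     # copies of v (never empty, since v is in S)
    SR = [x for x in S if x > v]      # elements greater than v
    if k <= len(SL):
        return selection(SL, k, rng)
    if k <= len(SL) + len(Sv):
        return v
    return selection(SR, k - len(SL) - len(Sv), rng)

S = [2, 36, 5, 21, 8, 13, 11, 20, 5, 4, 1]
print(selection(S, 8))  # the eighth-smallest element of the example array
```

Because Sv always contains at least one element, each recursive call works on a strictly smaller list, so the procedure terminates whatever pivots are drawn.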

4.1.4 Running time analysis

By how much does a single split reduce the size of the array? Well, this depends on the choice of v.

Worst case. When v is either the smallest or largest element in the array, the array shrinks by just one element. If we keep getting unlucky in this way, then

(time to process array of n elements) = (time to split) + (time to process array of n − 1 elements).

Since the time to split is linear, O(n), this works out to a total running time of n + (n − 1) + (n − 2) + ··· = O(n^2), which is really bad. Fortunately, this case is unlikely to occur. The probability of consistently picking an element v which is the smallest or largest is minuscule:

$$
\frac{2}{n} \cdot \frac{2}{n-1} \cdot \frac{2}{n-2} \cdots \le \frac{2^n}{n!}
$$

(do you see where this expression comes from?).
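One way to work out both quantities, under two assumptions not spelled out above (the n elements are distinct, and v is picked uniformly at random at every level): an array of m elements has exactly two unlucky choices of v, its smallest and its largest element, so the chance of an unlucky pick at that level is 2/m. Multiplying over the levels m = n, n − 1, ..., 2 gives the probability of being unlucky every time, and summing the split costs gives the worst-case running time:

$$
n + (n-1) + \cdots + 1 = \frac{n(n+1)}{2} = O(n^2),
\qquad
\prod_{m=2}^{n} \frac{2}{m} = \frac{2^{n-1}}{n!} \le \frac{2^{n}}{n!}.
$$

For n = 10, for example, the product is 2^9/10! = 512/3628800, or about 1.4 × 10^{-4}, and it shrinks very rapidly as n grows.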

Best case. The best possible case is that v just happens to be the element we are looking for, that is, the kth smallest element in the array. In this case, the running time is O(n), the time for a single split. This case is also unlikely, but it is certainly a lot more likely than the worst case. In fact, the probability of it occurring is at least 1/n. (Why? When might it be more than 1/n?)

4.3 Two types of randomized algorithms

The two algorithms we’ve seen – for finding percentiles and for sorting – are both guaranteed to return the correct answer. But if you run them multiple times on the same input, their running times will fluctuate, even though the answer will be the same every time. Therefore we are interested in their expected running time. We call these Las Vegas algorithms.

There’s another type of algorithm – called a Monte Carlo algorithm – that always has the same running time on any given input, but is not guaranteed to return the correct answer. It merely guarantees that it has some probability p > 0 of being correct. Therefore, if you run it multiple times on an input, you’ll get many different answers, of which roughly a p fraction will be correct. In many cases, it is possible to look through these answers and figure out which one(s) are right.

In a Monte Carlo algorithm, how much does the probability of success increase if you run it multiple times, say k times?

$$
\Pr(\text{wrong every time}) = (1 - p)^k \le e^{-pk}.
$$

To make this less than some δ, it is enough to run the algorithm (1/p) ln(1/δ) times. For instance, to make the failure probability less than 1 in a million, just run the algorithm 20/p times (a million is roughly 2^20, so ln(10^6) < log2(10^6) ≈ 20).
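The repetition bound is easy to check numerically. The sketch below uses hypothetical values p = 0.1 and δ = 10^-6 and helper names of my own choosing; it computes the smallest whole number of runs satisfying k ≥ (1/p) ln(1/δ) and compares it with the more generous 20/p.

```python
import math

def runs_needed(p, delta):
    """Smallest integer k with k >= (1/p) * ln(1/delta)."""
    if not (0 < p <= 1 and 0 < delta < 1):
        raise ValueError("need 0 < p <= 1 and 0 < delta < 1")
    return math.ceil(math.log(1 / delta) / p)

def failure_probability(p, k):
    """Probability that k independent runs are all wrong."""
    return (1 - p) ** k

p, delta = 0.1, 1e-6             # hypothetical success probability and target
k = runs_needed(p, delta)
generous = math.ceil(20 / p)     # the "20/p" rule of thumb from the text

print(k, failure_probability(p, k))
print(generous, failure_probability(p, generous))
```

Both run counts push the failure probability below δ; the 20/p rule simply overshoots, because log base 2 is larger than the natural log.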

4.4 Karger’s minimum cut algorithm

4.4.1 Clustering via graph cuts

Suppose a mail order company has the resources to prepare two different versions of its catalog, and it wishes to target each version towards a particular sector of its customer base. The data it has is a list of its regular customers, along with their purchase histories. How should this set of customers be partitioned into two coherent groups?

One way to do this is to create a graph with a node for each of the regular customers, and an edge between any two customers whose purchase patterns are similar. The goal is then to divide the nodes into two pieces which have very few edges between them.

More formally, the minimum cut of an undirected graph G = (V, E) is a partition of the nodes into two nonempty groups V1 and V2 (that is, V = V1 ∪ V2 and V1 ∩ V2 = ∅), so that the number of edges between V1 and V2 is minimized. In the graph below, for instance, the minimum cut has size two and partitions the nodes into V1 = {a, b, e, f} and V2 = {c, d, g, h}.

[Figure: an undirected graph on eight nodes, drawn with a, b, c, d in the top row and e, f, g, h in the bottom row.]

4.4.2 Karger’s algorithm

Here’s a randomized algorithm for finding the minimum cut:

Repeat until just two nodes remain:
- Pick an edge of G at random and collapse its two endpoints into a single node
For the two remaining nodes u1 and u2, set V1 = {nodes that went into u1} and V2 = {nodes that went into u2}

An example is shown in Figure 4.1. Notice how some nodes end up having multiple edges between them.

- Nodes a, b, c, d, e, f, g, h: 14 edges to choose from. Pick b − f (probability 1/14).
- Nodes a, c, d, e, g, h, bf: 13 edges to choose from. Pick g − h (probability 1/13).
- Nodes a, c, d, e, bf, gh: 12 edges to choose from. Pick d − gh (probability 1/6).
- Nodes a, c, e, bf, dgh: 10 edges to choose from. Pick a − e (probability 1/10).
- Nodes c, ae, bf, dgh: 9 edges to choose from. Pick ae − bf (probability 4/9).
- Nodes c, abef, dgh: 5 edges to choose from. Pick c − dgh (probability 3/5).
- Nodes abef, cdgh. Done: just two nodes remain.

Figure 4.1. Karger’s algorithm at work.
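The contraction procedure can be sketched in Python as below. Two things here are assumptions rather than part of the notes. First, the edge list: the figure's drawing is not available in this text, so `EDGES` is a hypothetical graph (two 4-cliques joined by the edges b − c and f − g) chosen because it reproduces the edge counts 14, 13, 12, 10, 9, 5 in Figure 4.1. Second, the sketch repeats the procedure many times and keeps the smallest cut found, in the spirit of the Monte Carlo repetition described in section 4.3; the input graph is assumed to be connected.

```python
import random

def contract_once(edges, rng):
    """One run of the contraction procedure on a connected multigraph.

    edges: list of (u, v) pairs. Returns (V1, V2, number of crossing edges).
    """
    label = {v: v for e in edges for v in e}   # node -> supernode containing it
    groups = {v: {v} for v in label}           # supernode -> original nodes in it
    live = list(edges)                         # edges between distinct supernodes
    while len(groups) > 2:
        u, v = rng.choice(live)                # uniform over the remaining edges
        keep, gone = label[u], label[v]
        for x in groups[gone]:                 # collapse `gone` into `keep`
            label[x] = keep
        groups[keep] |= groups.pop(gone)
        # Drop self-loops; parallel edges are deliberately kept.
        live = [(x, y) for x, y in live if label[x] != label[y]]
    v1, v2 = groups.values()
    return v1, v2, len(live)

def best_of(edges, runs, seed=0):
    """Repeat the procedure and keep the smallest cut seen."""
    rng = random.Random(seed)
    return min((contract_once(edges, rng) for _ in range(runs)),
               key=lambda result: result[2])

# Hypothetical edge list consistent with the counts in Figure 4.1.
EDGES = [("a", "b"), ("a", "e"), ("a", "f"), ("b", "e"), ("b", "f"), ("e", "f"),
         ("c", "d"), ("c", "g"), ("c", "h"), ("d", "g"), ("d", "h"), ("g", "h"),
         ("b", "c"), ("f", "g")]

v1, v2, size = best_of(EDGES, runs=500)
print(sorted(v1), sorted(v2), size)
```

Keeping parallel edges matters: a pair of supernodes joined by several edges is proportionally more likely to be picked, which is what the probabilities 1/6, 4/9 and 3/5 in the figure reflect.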

4.5 Hashing

4.5.1 The Google problem

When you give a search phrase to Google (or any other search engine), you immediately get back a list of documents containing that phrase. But there are billions of documents on the web – how does Google look through all of them so quickly? The answer is, it uses hashing.

4.5.2 The hashing framework

Suppose you have a large collection of items x1, ..., xn that you want to store (for instance, all the documents on the web), where these items are drawn from some set U (for instance, the set of all conceivable documents). The requirements are:

- The total storage space used should be O(n).
- Given a query q ∈ U, it should be possible to very rapidly determine whether q is one of the stored items xi.

4.5.3 A simple solution using randomization

- Pick a completely random function h : U → {1, 2, ..., n}. This is the hash function.
- Create a table T of size n, each of whose entries is a pointer to a linked list, initialized to null.
- Store each xi in the linked list at T[h(xi)]. We say xi hashes to location h(xi).
- Given a query q, look through the linked list at T[h(q)] to see if it’s there.
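The four steps above can be sketched in Python as follows. Several details are simplifications of mine: Python lists stand in for linked lists, locations run from 0 to n − 1 instead of 1 to n, and the "completely random function" is simulated by drawing a random location the first time a key is seen and remembering it in a dictionary. That memo is only a device for illustration; a real implementation would use a compact hash function instead.

```python
import random

class ChainedHashTable:
    """Hash table with chaining, using a lazily sampled random function h."""

    def __init__(self, items, seed=0):
        items = list(items)
        self.n = len(items)                        # table size equals item count
        self._rng = random.Random(seed)
        self._h = {}                               # remembered values of h
        self.table = [[] for _ in range(self.n)]   # one (initially empty) list per slot
        for x in items:
            self.table[self.h(x)].append(x)        # x hashes to location h(x)

    def h(self, x):
        """Return h(x), sampling it uniformly from 0..n-1 on first use."""
        if x not in self._h:
            self._h[x] = self._rng.randrange(self.n)
        return self._h[x]

    def contains(self, q):
        """Scan only the list at T[h(q)]."""
        return q in self.table[self.h(q)]

docs = [f"doc{i}" for i in range(1000)]   # hypothetical items
t = ChainedHashTable(docs)
print(t.contains("doc42"), t.contains("not-stored"))
```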

Here’s a picture of the data structure.

[Figure: the table T of size n; one of its entries points to the linked list of all xi that hash to 3.]

The storage used is O(n). What about the query time?

4.5.4 Average query time

Suppose query q is picked at random, so that it is equally likely to hash to any of the locations 1, 2, ..., n. What is the expected query time?

$$
\text{Expected query time}
= \sum_{i=1}^{n} \Pr(q \text{ hashes to location } i) \cdot (\text{length of list at } T[i])
= \frac{1}{n} \sum_{i=1}^{n} (\text{length of list at } T[i])
= \frac{1}{n} \cdot n = 1.
$$

So the average query time is constant!
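The calculation does not depend on how the items happen to be spread over the table: the list lengths always add up to n. A short Python check, using exact fractions and a hypothetical table size, makes this visible.

```python
import random
from fractions import Fraction

def expected_query_time(n, seed=0):
    """Sum over locations i of Pr(q hashes to i) * (length of list at T[i])."""
    rng = random.Random(seed)
    lengths = [0] * n
    for _ in range(n):                   # each item lands in a uniformly random slot
        lengths[rng.randrange(n)] += 1
    return sum(Fraction(1, n) * length for length in lengths)

print(expected_query_time(1000))
```

Changing the seed changes the individual list lengths but not the result, since the lengths still sum to n.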

4.5.5 Worst case query time, and a balls-in-bins problem

What is the worst case query time; that is, what is the length of the longest linked list in T? Equivalently, when you throw n balls in n bins, what is the size of the largest bin? We’ll see that with very high probability, no bin gets ≥ log n balls (logarithms here are base 2).

For any bin i, let Ei be the event that it gets ≥ log n balls.

$$
\Pr(E_i) \le \binom{n}{\log n} \left( \frac{1}{n} \right)^{\log n}.
$$

(Do you see why?) It turns out, using all sorts of calculations, that this is at most 1/n^2 for large enough n. Therefore,

$$
\Pr(\text{some bin gets} \ge \log n \text{ balls})
= \Pr(E_1 \cup E_2 \cup \cdots \cup E_n)
\le \Pr(E_1) + \cdots + \Pr(E_n)
\le \frac{1}{n}.
$$

For instance, if you throw a million balls into a million bins, then the chance that there is a bin with ≥ 20 balls is at most 1 in a million. Getting back to hashing, this means that the worst case query time is (with high probability) O(log n).
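A quick simulation gives a feel for the bound. This is only an empirical sketch with a hypothetical n = 2^14 and 20 trials, not a proof: it throws n balls into n bins and reports the fullest bin seen across all trials next to log2 n.

```python
import math
import random

def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's size."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

rng = random.Random(0)
n = 2 ** 14                                   # hypothetical problem size
worst = max(max_load(n, rng) for _ in range(20))
print(worst, math.log2(n))                    # largest bin seen vs. log2(n) = 14
```

In practice the largest bin comes out well below log2 n, which is consistent with the bound being an upper limit rather than a tight estimate.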

4.5.6 The power of two choices

Here’s a variant on the balls and bins setup. As usual, you have before you a row of n bins, along with a collection of n identical balls. But now, when throwing each ball, you pick two bins at random and you put the ball in whichever of them is less full. It turns out, using an analysis that is too complicated to get into here, that under this small change, the maximum bin size will be just O(log log n) instead of O(log n). This inspires an alternative hashing scheme:

- Pick two completely random functions h1, h2 : U → {1, 2, ..., n}.
- Create a table T of size n, each of whose entries is a pointer to a linked list, initialized to null.
- For each xi, store it in either the linked list at T[h1(xi)] or T[h2(xi)], whichever is shorter.
- Given a query q, look through both the linked list at T[h1(q)] and at T[h2(q)] to see if it’s there.

The storage requirement is still O(n), the average query time is still O(1), but now the worst case query time drops to O(log log n).
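The effect of the second choice shows up clearly in simulation. The sketch below is an empirical illustration only, with a hypothetical n = 2^14; it compares the fullest bin under one random choice per ball against two choices, breaking ties in favour of the first bin drawn (a detail the text leaves open).

```python
import random

def max_load_one_choice(n, rng):
    """Each ball goes into one uniformly random bin."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

def max_load_two_choices(n, rng):
    """Each ball draws two random bins and goes into the less full one."""
    loads = [0] * n
    for _ in range(n):
        i, j = rng.randrange(n), rng.randrange(n)
        target = i if loads[i] <= loads[j] else j   # ties go to the first bin
        loads[target] += 1
    return max(loads)

n = 2 ** 14                                         # hypothetical problem size
one = max_load_one_choice(n, random.Random(1))
two = max_load_two_choices(n, random.Random(1))
print(one, two)                                     # two choices gives the smaller maximum
```

The price of the smaller maximum is that every query must inspect two lists instead of one, as in the last step of the scheme above.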
