Linear Time Selection/Median

Lecture notes on median-finding and selection in an unsorted array. They recall that sorting (e.g., with Mergesort or Heapsort) solves selection in O(n log n) time, introduce Randomized Quickselect and argue its correctness and O(n) expected running time, and then discuss a deterministic linear-time algorithm for the same problem and its use in a deterministic Quicksort.

601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz
Topic: Linear time selection/median Date: 9/9/21

4.1 Introduction and Problem Definition

We saw last lecture a way to sort in time O(n log n): Randomized Quicksort. There are also other sorting algorithms with similar time bounds, most notably Mergesort and Heapsort (you should all know both of these already). In this lecture we will discuss a related problem with some surprisingly efficient algorithms: median-finding, or more generally, selection.

The median problem is the following: given an unsorted array, find and return the median element. In other words, given an array of length n, find and return the (n/2)nd smallest element. The selection problem is only slightly more general: given an array of length n and a value k ≤ n, find and return the kth smallest element. From now on we’ll mostly talk about selection.

It is obvious that selection can be done in time O(n log n): we can sort the array (using, e.g., mergesort), and then return the kth smallest element. Can we do any better?
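As a baseline, the sort-then-index approach can be sketched in a few lines of Python. The function name `select_by_sorting` and the convention that k is 1-indexed are assumptions made here for illustration.

```python
def select_by_sorting(a, k):
    """Return the k-th smallest element of a (k is 1-indexed) in O(n log n) time."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    # sorted() returns a new list, so the caller's array is left untouched.
    return sorted(a)[k - 1]
```

For example, `select_by_sorting([5, 2, 9, 1, 7], 3)` returns 5.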

It turns out that the answer is yes! We can do selection in O(n) time, both randomized (worst-case expected time) and deterministic.

There are a few easy cases, which we can do to warm up. For example, suppose k = 1. Then we are trying to find the smallest element, which we can do by simply scanning the array in O(n) time and keeping track of the smallest. Similarly, if k = n a simple scan also suffices. In general, this strategy works whenever k = O(1) or k = n − O(1), since we can just keep track of the k smallest/largest elements we see while we do a scan.

This doesn’t work for k = n/2, though. If we kept track of the k smallest elements, then when considering a new element in the scan we would have to figure out its place in the smallest k, which takes time Θ(log k) = Θ(log n) (upper bound via binary search, lower bound something we’ll see next week). So the total time would be Θ(n log k) = Θ(n log n).
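One way to realize "keep track of the k smallest elements" during a scan is a max-heap of size k. This is a sketch under that assumption (the notes do not prescribe a data structure), and the name `kth_smallest_by_scan` is hypothetical. Each update costs O(log k), so the scan is O(n) when k = O(1) but O(n log n) when k = n/2, which matches the discussion above.

```python
import heapq

def kth_smallest_by_scan(a, k):
    """Return the k-th smallest element of a (1-indexed) with one scan.

    Keeps the k smallest values seen so far in a max-heap, simulated by
    storing negated values in Python's min-heap (so values must be numeric).
    """
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    heap = []  # holds the negations of the k smallest values seen so far
    for x in a:
        if len(heap) < k:
            heapq.heappush(heap, -x)       # O(log k)
        elif x < -heap[0]:
            # x beats the largest of the current k smallest: swap it in.
            heapq.heapreplace(heap, -x)    # O(log k)
    return -heap[0]  # largest of the k smallest = k-th smallest
```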

4.2 Randomized Quickselect

The idea here is to use randomized quicksort, but instead of recursing on both sides we only recurse on the side which has the desired element. Slightly more formally, suppose we are given an array A of length n and an integer k ≤ n. Then Randomized Quickselect does the following:

  1. If n = 1, return the element.
  2. Pick a pivot element p uniformly at random from A.
  3. Compare each element of A to p, creating subarrays L of elements less than p and G of elements greater than p.
  4. (a) If |L| = k − 1 then return p.
     (b) If |L| > k − 1 then return Quickselect(L, k).
     (c) If |L| < k − 1 then return Quickselect(G, k − |L| − 1).
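The steps above can be sketched in Python as follows. The steps as written implicitly assume distinct elements; as an assumption beyond the notes, this sketch also counts the elements equal to the pivot so that duplicates are handled, and with distinct elements that count is 1 and the three cases reduce to 4(a)-(c). The `rng` parameter is a hypothetical addition that makes runs reproducible.

```python
import random

def quickselect(a, k, rng=random):
    """Return the k-th smallest element of a (1-indexed), expected O(n) time."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    if len(a) == 1:                        # step 1
        return a[0]
    p = rng.choice(a)                      # step 2: uniformly random pivot
    less = [x for x in a if x < p]         # step 3: the subarray L
    greater = [x for x in a if x > p]      # step 3: the subarray G
    equal = len(a) - len(less) - len(greater)  # copies of p (1 if distinct)
    if k <= len(less):                     # step 4(b): answer lies in L
        return quickselect(less, k, rng)
    if k <= len(less) + equal:             # step 4(a): the pivot is the answer
        return p
    # step 4(c): discard L and the copies of p, then adjust the rank
    return quickselect(greater, k - len(less) - equal, rng)
```

For example, `quickselect([5, 2, 9, 1, 7], 3)` returns 5 regardless of which pivots are drawn; only the running time is random.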

Correctness is easy to argue by induction: on every call to Quickselect(X, a), the element we were originally looking for (the kth smallest of A) is the ath smallest of X (do at home!).

To argue running time, first note that the same intuition from quicksort continues to hold: we expect the pivot to split the array approximately in half, so after O(log n) iterations we will find the element we are looking for. This might seem like it would give a bound of O(n log n), but in each iteration the number of comparisons we make also goes down by a factor of (approximately) 2, and thus the total number of comparisons is only O(n).
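To see where the O(n) comes from, suppose (as an idealization only, since the algorithm does not guarantee it) that every pivot splits the current array exactly in half. Iteration j then works on an array of length n/2^j and makes fewer than n/2^j comparisons, so the total is bounded by a geometric series:

$$
\sum_{j=0}^{\log_2 n} \frac{n}{2^j} \;\le\; n \sum_{j=0}^{\infty} \frac{1}{2^j} \;=\; 2n .
$$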

Let’s make this a little more formal. Let T(n) be the expected running time of Quickselect on an array of length n. As with quicksort, splitting the array around a pivot takes n − 1 comparisons. Each possible split is equally likely, i.e. |L| is uniformly distributed between 0 and n − 1 (and the same holds for |G|). Whether we recurse on L or on G depends on k and on the split, but since we only want an upper bound we can assume that we always recurse on whichever side has more elements: T(n) ≤ T(n + 1) for all n, so this can only make the algorithm take longer.

Thus we can write the following recurrence relation:

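$$
\begin{aligned}
T(n) &\le (n-1) + \sum_{i=0}^{n-1} \frac{1}{n} \max\bigl(T(i),\, T(n-i-1)\bigr) \\
&\le (n-1) + \sum_{i=0}^{n/2-1} \frac{1}{n}\, T(n-i-1) + \sum_{i=n/2}^{n-1} \frac{1}{n}\, T(i) \\
&= (n-1) + \frac{2}{n} \sum_{i=n/2}^{n-1} T(i).
\end{aligned}
$$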

Now let’s use our guess-and-check method, with the guess T (n) ≤ 4 n.

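$$
\begin{aligned}
T(n) &\le (n-1) + \frac{2}{n} \sum_{i=n/2}^{n-1} 4i = (n-1) + 4 \cdot \frac{2}{n} \sum_{i=n/2}^{n-1} i \\
&= (n-1) + 4 \cdot \frac{2}{n} \left( \sum_{i=1}^{n-1} i - \sum_{i=1}^{n/2-1} i \right) \\
&= (n-1) + 4 \cdot \frac{2}{n} \left( \frac{n(n-1)}{2} - \frac{(n/2)(n/2-1)}{2} \right) \\
&\le (n-1) + 4 \left( (n-1) - \frac{n/2-1}{2} \right) \\
&\le (n-1) + 4 \cdot \frac{3n}{4} \\
&\le 4n.
\end{aligned}
$$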
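As a sanity check on the guess, we can evaluate the recurrence numerically. The sketch below treats the final upper bound as an equality, takes T(0) = T(1) = 0 as base cases, and rounds n/2 down for odd n; all three are assumptions made for this illustration, and the name `recurrence_values` is hypothetical.

```python
def recurrence_values(n_max):
    """Tabulate T(n) = (n - 1) + (2/n) * sum_{i = n//2}^{n-1} T(i) for n <= n_max."""
    T = [0.0] * (n_max + 1)          # T[0] = T[1] = 0: nothing to compare
    prefix = [0.0] * (n_max + 2)     # prefix[j] = T[0] + ... + T[j-1]
    for n in range(1, n_max + 1):
        if n >= 2:
            tail = prefix[n] - prefix[n // 2]   # T[n//2] + ... + T[n-1]
            T[n] = (n - 1) + 2.0 * tail / n
        prefix[n + 1] = prefix[n] + T[n]
    return T

if __name__ == "__main__":
    T = recurrence_values(1000)
    # The ratio T(n)/n should stay below 4, in line with the guess T(n) <= 4n.
    print(max(T[n] / n for n in range(1, 1001)))
```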

4.3 Deterministic Algorithm

What if we want a deterministic algorithm? Somewhat amazingly, this turns out to be possible. The basic idea is to deterministically find a pivot that will result in a more-or-less even split, and [...]

[...] step 4 takes time at most T(7n/10) (by Lemma 4.3.1). So the total running time is

T(n) ≤ T(7n/10) + T(n/5) + cn.

It’s a good exercise to draw out the recursion tree to see what’s going on, but we can also solve this by guess-and-check. We guess that T(n) ≤ 10cn. When we check this, we get that

T(n) ≤ 10c(7n/10) + 10c(n/5) + cn = 9cn + cn = 10cn.
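The recursion-tree view gives the same constant. The two subproblems of a call on size m have total size at most 7m/10 + m/5 = 9m/10, so the subproblem sizes at depth j of the tree sum to at most (9/10)^j n, and the non-recursive work at that depth is at most c(9/10)^j n. Summing over all depths:

$$
T(n) \;\le\; \sum_{j=0}^{\infty} c \left(\frac{9}{10}\right)^{j} n \;=\; \frac{cn}{1 - 9/10} \;=\; 10cn .
$$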

4.4 Deterministic Quicksort

We can now use the deterministic selection algorithm from Section 4.3 to get a deterministic version of Quicksort which only uses O(n log n) comparisons in the worst case (recall that traditional Quicksort uses Θ(n^2) comparisons in the worst case, while randomized Quicksort uses O(n log n) in expectation). The algorithm is simple: when deciding on a pivot, use deterministic selection to find the median, and then use that as the pivot. This splits the input in half, so the total number of comparisons is

T(n) = 2T(n/2) + cn = O(n log n),

where the cn term is the number of comparisons used for selection plus the number used to split the array on the pivot.
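A minimal sketch of this idea in Python follows. The description of the deterministic selection algorithm is not available in this text, so `select` below uses the standard median-of-medians rule with groups of 5, which is an assumption (it is consistent with the recurrence T(n) ≤ T(7n/10) + T(n/5) + cn, but the step numbering of the notes is not reproduced). Both function names are hypothetical, and elements equal to the pivot are grouped separately so that duplicates are handled.

```python
def select(a, k):
    """Deterministically return the k-th smallest element of a (1-indexed)."""
    if not 1 <= k <= len(a):
        raise ValueError("k must satisfy 1 <= k <= len(a)")
    if len(a) <= 5:
        return sorted(a)[k - 1]
    # Median of each group of (at most) 5 elements: about n/5 medians.
    groups = [sorted(a[i:i + 5]) for i in range(0, len(a), 5)]
    medians = [g[(len(g) - 1) // 2] for g in groups]
    # Pivot = median of the medians, found recursively.
    p = select(medians, (len(medians) + 1) // 2)
    less = [x for x in a if x < p]
    greater = [x for x in a if x > p]
    equal = len(a) - len(less) - len(greater)
    if k <= len(less):
        return select(less, k)
    if k <= len(less) + equal:
        return p
    return select(greater, k - len(less) - equal)

def deterministic_quicksort(a):
    """Return a sorted copy of a, always pivoting on the exact median."""
    if len(a) <= 1:
        return list(a)
    p = select(a, (len(a) + 1) // 2)       # median found deterministically
    less = [x for x in a if x < p]
    equal = [x for x in a if x == p]
    greater = [x for x in a if x > p]
    # Each side has at most about half the elements, so the depth is O(log n).
    return deterministic_quicksort(less) + equal + deterministic_quicksort(greater)
```

For example, `deterministic_quicksort([3, 1, 2, 3, 0])` returns `[0, 1, 2, 3, 3]`.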