Greedy Algorithms: Storing Files on Tape and Huffman Coding, Papers of Algorithms and Programming

Two applications of greedy algorithms: one for optimally storing files on a tape and the other for creating a prefix-free binary code using huffman's algorithm. The first problem involves minimizing the expected cost of accessing files based on their access frequencies and lengths. The second problem deals with creating an optimal prefix-free binary code for encoding characters based on their frequencies.

Typology: Papers

Pre 2010

Uploaded on 03/16/2009

koofers-user-d4u
koofers-user-d4u 🇺🇸

9 documents

1 / 8

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Algorithms Non-Lecture A: Greedy Algorithms
The point is, ladies and gentleman, greed is good. Greed works, greed is right.
Greed clarifies, cuts through, and captures the essence of the evolutionary
spirit. Greed in all its forms, greed for life, money, love, knowledge has marked
the upward surge in mankind. And greed—mark my words—will save not only
Teldar Paper but the other malfunctioning corporation called the USA.
Michael Douglas as Gordon Gekko, Wall Street (1987)
There is always an easy solution to every human problem—
neat, plausible, and wrong.
H. L. Mencken, New York Evening Mail (November 16, 1917)
A Greedy Algorithms
A.1 Storing Files on Tape
Suppose we have a set of nfiles that we want to store on a tape. In the future, users will want to
read those files from the tape. Reading a file from tape isn’t like reading from disk; first we have to
fast-forward past all the other files, and that takes a significant amount of time. Let L[1 .. n]be an
array listing the lengths of each file; specifically, file ihas length L[i]. If the files are stored in order
from 1to n, then the cost of accessing the kth file is
cost(k) =
k
X
i=1
L[i].
The cost reflects the fact that before we read file kwe must first scan past all the earlier files on the
tape. If we assume for the moment that each file is equally likely to be accessed, then the expected
cost of searching for a random file is
E[cost] =
n
X
k=1
cost(k)
n=
n
X
k=1
k
X
i=1
L[i]
n.
If we change the order of the files on the tape, we change the cost of accessing the files; some
files become more expensive to read, but others become cheaper. Different file orders are likely
to result in different expected costs. Specifically, let π(i)denote the index of the file stored at
position ion the tape. Then the expected cost of the permutation πis
E[cost(π)] =
n
X
k=1
k
X
i=1
L[π(i)]
n.
Which order should we use if we want the expected cost to be as small as possible? The answer
is intuitively clear; we should store the files in order from shortest to longest. So let’s prove this.
Lemma 1. E[cost(π)] is minimized when L[π(i)] L[π(i+ 1)] for all i.
Proof: Suppose L[π(i)] > L[π(i+ 1)] for some i. To simplify notation, let a=π(i)and b=π(i+ 1).
If we swap files aand b, then the cost of accessing aincreases by L[b], and the cost of accessing b
decreases by L[a]. Overall, the swap changes the expected cost by (L[b]L[a])/n. But this change
is an improvement, because L[b]< L[a]. Thus, if the files are out of order, we can improve the
expected cost by swapping some mis-ordered adjacent pair.
1
pf3
pf4
pf5
pf8

Partial preview of the text

Download Greedy Algorithms: Storing Files on Tape and Huffman Coding and more Papers Algorithms and Programming in PDF only on Docsity!

The point is, ladies and gentleman, greed is good. Greed works, greed is right. Greed clarifies, cuts through, and captures the essence of the evolutionary spirit. Greed in all its forms, greed for life, money, love, knowledge has marked the upward surge in mankind. And greed—mark my words—will save not only Teldar Paper but the other malfunctioning corporation called the USA. — Michael Douglas as Gordon Gekko, Wall Street (1987)

There is always an easy solution to every human problem— neat, plausible, and wrong. — H. L. Mencken, New York Evening Mail (November 16, 1917)

A Greedy Algorithms

A.1 Storing Files on Tape

Suppose we have a set of n files that we want to store on a tape. In the future, users will want to read those files from the tape. Reading a file from tape isn’t like reading from disk; first we have to fast-forward past all the other files, and that takes a significant amount of time. Let L[1 .. n] be an array listing the lengths of each file; specifically, file i has length L[i]. If the files are stored in order from 1 to n, then the cost of accessing the kth file is

cost (k) =

∑^ k

i=

L[i].

The cost reflects the fact that before we read file k we must first scan past all the earlier files on the tape. If we assume for the moment that each file is equally likely to be accessed, then the expected cost of searching for a random file is

E[ cost ] =

∑^ n

k=

cost (k) n

∑^ n

k=

∑^ k

i=

L[i] n

If we change the order of the files on the tape, we change the cost of accessing the files; some files become more expensive to read, but others become cheaper. Different file orders are likely to result in different expected costs. Specifically, let π(i) denote the index of the file stored at position i on the tape. Then the expected cost of the permutation π is

E[ cost (π)] =

∑^ n

k=

∑^ k

i=

L[π(i)] n

Which order should we use if we want the expected cost to be as small as possible? The answer is intuitively clear; we should store the files in order from shortest to longest. So let’s prove this.

Lemma 1. E[ cost (π)] is minimized when L[π(i)] ≤ L[π(i + 1)] for all i.

Proof: Suppose L[π(i)] > L[π(i + 1)] for some i. To simplify notation, let a = π(i) and b = π(i + 1). If we swap files a and b, then the cost of accessing a increases by L[b], and the cost of accessing b decreases by L[a]. Overall, the swap changes the expected cost by (L[b] − L[a])/n. But this change is an improvement, because L[b] < L[a]. Thus, if the files are out of order, we can improve the expected cost by swapping some mis-ordered adjacent pair. 

This example gives us our first greedy algorithm. To minimize the total expected cost of accessing the files, we put the file that is cheapest to access first, and then recursively write everything else; no backtracking, no dynamic programming, just make the best local choice and blindly plow ahead. If we use an efficient sorting algorithm, the running time is clearly O(n log n), plus the time required to actually write the files. To prove the greedy algorithm is actually correct, we simply prove that the output of any other algorithm can be improved by some sort of swap. Let’s generalize this idea further. Suppose we are also given an array f [1 .. n] of access frequencies for each file; file i will be accessed exactly f [i] times over the lifetime of the tape. Now the total cost of accessing all the files on the tape is

Σ cost (π) =

∑^ n

k=

f [π(k)] ·

∑^ k

i=

L[π(i)]

∑^ n

k=

∑^ k

i=

f [π(k)] · L[π(i)]

Now what order should store the files if we want to minimize the total cost? We’ve already proved that if all the frequencies are equal, then we should sort the files by increasing size. If the frequencies are all different but the file lengths L[i] are all equal, then intuitively, we should sort the files by decreasing access frequency, with the most-accessed file first. And in fact, this is not hard to prove by modifying the proof of Lemma 1. But what if the sizes and the frequencies are both different? In this case, we should sort the files by the ratio L/f.

Lemma 2. Σ cost (π) is minimized when L[π(i)] F [π(i)]

L[π(i + 1)] F [π(i + 1)]

for all i.

Proof: Suppose L[π(i)]/F [π(i)] > L[π(i + 1)]/F [π(i + i)] for some i. To simplify notation, let a = π(i) and b = π(i + 1). If we swap files a and b, then the cost of accessing a increases by L[b], and the cost of accessing b decreases by L[a]. Overall, the swap changes the total cost by L[b]F [a] − L[a]F [a]. But this change is an improvement, since

L[a] F [a]

L[b] F [b]

=⇒ L[b]F [a] − L[a]F [a] < 0.

Thus, if the files are out of order, we can improve the total cost by swapping some mis-ordered adjacent pair. 

A.2 Scheduling Classes

The next example is slightly less trivial. Suppose you decide to drop out of computer science at the last minute and change your major to Applied Chaos. The Applied Chaos department has all of its classes on the same day every week, referred to as “Soberday” by the students (but interestingly, not by the faculty). Every class has a different start time and a different ending time: AC 101 (‘Toilet Paper Landscape Architecture’) starts at 10:27pm and ends at 11:51pm; AC 666 (‘Immanentizing the Eschaton’) starts at 4:18pm and ends at 7:06pm, and so on. In the interests of graduating as quickly as possible, you want to register for as many classes as you can. (Applied Chaos classes don’t require any actual work .) The University’s registration computer won’t let you register for overlapping classes, and no one in the department knows how to override this ‘feature’. Which classes should you take? More formally, suppose you are given two arrays S[1 .. n] and F [1 .. n] listing the start and finish times of each class. Your task is to choose the largest possible subset X ∈ { 1 , 2 ,... , n} so that for any pair i, j ∈ X, either S[i] > F [j] or S[j] > F [i]. We can illustrate the problem by drawing each class as a rectangle whose left and right x-coordinates show the start and finish times. The goal is to find a largest subset of rectangles that do not overlap vertically.

This algorithm clearly runs in O(n log n) time. To prove that this algorithm actually gives us a maximal conflict-free schedule, we use an ex- change argument, similar to the one we used for tape sorting. We are not claiming that the greedy schedule is the only maximal schedule; there could be others. (See the figures on the previous page.) All we can claim is that at least one of the maximal schedules is the one that the greedy algorithm produces.

Lemma 3. At least one maximal conflict-free schedule includes the class that finishes first.

Proof: Let f be the class that finishes first. Suppose we have a maximal conflict-free schedule X that does not include f. Let g be the first class in X to finish. Since f finishes before g does, f cannot conflict with any class in the set S \ {g}. Thus, the schedule X′^ = X ∪ {f } \ {g} is also conflict-free. Since X′^ has the same size as X, it is also maximal. 

To finish the proof, we call on our old friend, induction.

Theorem 4. The greedy schedule is an optimal schedule.

Proof: Let f be the class that finishes first, and let L be the subset of classes the start after f finishes. The previous lemma implies that some optimal schedule contains f , so the best schedule that contains f is an optimal schedule. The best schedule that includes f must contain an optimal schedule for the classes that do not conflict with f , that is, an optimal schedule for L. The greedy algorithm chooses f and then, by the inductive hypothesis, computes an optimal schedule of classes from L. 

The proof might be easier to understand if we unroll the induction slightly.

Proof: Let 〈g 1 , g 2 ,... , gk〉 be the sequence of classes chosen by the greedy algorithm. Suppose we have a maximal conflict-free schedule of the form

〈g 1 , g 2 ,... , gj− 1 , cj , cj+1,... , cm〉,

where the classes ci are different from the classes chosen by the greedy algorithm. By construction, the jth greedy choice gj does not conflict with any earlier class g 1 , g 2 ,... , gj− 1 , and since our schedule is conflict-free, neither does cj. Moreover, gj has the earliest finish time among all classes that don’t conflict with the earlier classes; in particular, gj finishes before cj. This implies that gj does not conflict with any of the later classes cj+1,... , cm. Thus, the schedule

〈g 1 , g 2 ,... , gj− 1 , gj , cj+1,... , cm〉,

is conflict-free. (This is just a generalization of Lemma 3, which considers the case j = 1.) By induction, it now follows that there is an optimal schedule 〈g 1 , g 2 ,... , gk, ck+1,... , cm〉 that includes every class chosen by the greedy algorithm. But this is impossible unless k = m; if there were a class ck+1 that does not conflict with gk, the greedy algorithm would choose more than k classes. 

A.3 General Structure

The basic structure of this correctness proof is exactly the same as for the tape-sorting problem: an inductive exchange argument.

  • Assume that there is an optimal solution that is different from the greedy solution.
  • Find the ‘first’ difference between the two solutions.
  • Argue that we can exchange the optimal choice for the greedy choice without degrading the solution.

This argument implies by induction that there is an optimal solution that contains the entire greedy solution. Sometimes, as in the scheduling problem, an additional step is required to show no optimal solution strictly improves the greedy solution.

A.4 Huffman codes

A binary code assigns a string of 0s and 1s to each character in the alphabet. A binary code is prefix-free if no code is a prefix of any other. 7-bit ASCII and Unicode’s UTF-8 are both prefix-free binary codes. Morse code is a binary code, but it is not prefix-free; for example, the code for S (· · ·) includes the code for E (·) as a prefix. Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves. The code word for any symbol is given by the path from the root to the corresponding leaf; 0 for left, 1 for right. The length of a codeword for a symbol is the depth of the corresponding leaf. (Note that the code tree is not a binary search tree. We don’t care at all about the sorted order of symbols at the leaves. (In fact. the symbols may not have a well-defined order!) Suppose we want to encode messages in an n-character alphabet so that the encoded message is as short as possible. Specifically, given an array frequency counts f [1 .. n], we want to compute a prefix-free binary code that minimizes the total encoded length of the message:^2

∑^ n

i=

f [i] · depth (i).

In 1952, David Huffman developed the following greedy algorithm to produce such an optimal code:

HUFFMAN: Merge the two least frequent letters and recurse.

For example, suppose we want to encode the following helpfully self-descriptive sentence, discov- ered by Lee Sallows:^3

This sentence contains three a’s, three c’s, two d’s, twenty-six e’s, five f’s, three g’s, eight h’s, thirteen i’s, two l’s, sixteen n’s, nine o’s, six r’s, twenty-seven s’s, twenty-two t’s, two u’s, five v’s, eight w’s, four x’s, five y’s, and only one z.

To keep things simple, let’s forget about the forty-four spaces, nineteen apostrophes, nineteen commas, three hyphens, and one period, and just encode the letters. Here’s the frequency table:

A C D E F G H I L N O R S T U V W X Y Z 3 3 2 26 5 3 8 13 2 16 9 6 27 22 2 5 8 4 5 1 (^2) This looks almost exactly like the cost of a binary search tree, but the optimization problem is very different: code trees are not search trees! (^3) A. K. Dewdney. Computer recreations. Scientific American, October 1984. Douglas Hofstadter published a few earlier examples of Lee Sallows’ self-descriptive sentences in his Scientific American column in January 1982.

Lemma 5. Let x and y be the two least frequent characters (breaking ties between equally frequent characters arbitrarily). There is an optimal code tree in which x and y are siblings.

Proof: I’ll actually prove a stronger statement: There is an optimal code in which x and y are siblings and have the largest depth of any leaf. Let T be an optimal code tree with depth d. Since T is a full binary tree, it has at least two leaves at depth d that are siblings. (Verify this by induction!) Suppose those two leaves are not x and y, but some other characters a and b. Let T ′^ be the code tree obtained by swapping x and a. The depth of x increases by some amount ∆, and the depth of a decreases by the same amount. Thus,

cost (T ′) = cost (T ) − (f [a] − f [x])∆.

By assumption, x is one of the two least frequent characters, but a is not, which implies that f [a] ≥ f [x]. Thus, swapping x and a does not increase the total cost of the code. Since T was an optimal code tree, swapping x and a does not decrease the cost, either. Thus, T ′^ is also an optimal code tree (and incidentally, f [a] actually equals f [x]). Similarly, swapping y and b must give yet another optimal code tree. In this final optimal code tree, x and y as maximum-depth siblings, as required. 

Now optimality is guaranteed by our dear friend the Recursion Fairy! Essentially we’re relying on the following recursive definition for a full binary tree: either a single node, or a full binary tree where some leaf has been replaced by an internal node with two leaf children.

Theorem 6. Huffman codes are optimal prefix-free binary codes.

Proof: If the message has only one or two different characters, the theorem is trivial. Otherwise, let f [1 .. n] be the original input frequencies, where without loss of generality, f [1] and f [2] are the two smallest. To keep things simple, let f [n + 1] = f [1] + f [2]. By the previous lemma, we know that some optimal code for f [1 .. n] has characters 1 and 2 as siblings. Let T ′^ be the Huffman code tree for f [3 .. n + 1]; the inductive hypothesis implies that T ′^ is an optimal code tree for the smaller set of frequencies. To obtain the final code tree T , we replace the leaf labeled n + 1 with an internal node with two children, labelled 1 and 2. I claim that T is optimal for the original frequency array f [1 .. n]. To prove this claim, we can express the cost of T in terms of the cost of T ′^ as follows. (In these equations, depth (i) denotes the depth of the leaf labelled i in either T or T ′; if the leaf appears in both T and T ′, it has the same depth in both trees.)

cost (T ) =

∑^ n

i=

f [i] · depth (i)

n∑+

i=

f [i] · depth (i) + f [1] · depth (1) + f [2] · depth (2) − f [n + 1] · depth (n + 1)

= cost (T ′) + f [1] · depth (1) + f [2] · depth (2) − f [n + 1] · depth (n + 1) = cost (T ′) + (f [1] + f [2]) · depth (T ) − f [n + 1] · ( depth (T ) − 1) = cost (T ′) + f [1] + f [2]

This equation implies that minimizing the cost of T is equivalent to minimizing the cost of T ′; in particular, attaching leaves labeled 1 and 2 to the leaf in T ′^ labeled n + 1 gives an optimal code tree for the original frequencies. 

Oh, I almost forgot! To actually implement Huffman codes efficiently, we keep the characters in a min-heap, keyed by frequency. We also construct the code tree by keeping three arrays of indices, listing the left and right children and the parent of each node. The root of the tree is the node with index 2 n − 1.

BUILDHUFFMAN(f [1 .. n]): for i ← 1 to n L[i] ← 0 ; R[i] ← 0 INSERT(i, f [i]) for i ← n to 2 n − 1 x ← EXTRACTMIN( ) y ← EXTRACTMIN( ) f [i] ← f [x] + f [y] L[i] ← x; R[i] ← y P [x] ← i; P [y] ← i INSERT(i, f [i]) P [2n − 1] ← 0

The algorithm performs O(n) min-heap operations. If we use a balanced binary tree as the heap, each operation requires O(log n) time, so the total running time of BUILDHUFFMAN is O(n log n). Finally, here are simple algorithms to encode and decode messages:

HUFFMANENCODE(A[1 .. k]): m ← 1 for i ← 1 to k HUFFMANENCODEONE(A[i]) HUFFMANENCODEONE(x): if x < 2 n − 1 HUFFMANENCODEONE(P [x]) if x = L[P [x]] B[m] ← 0 else B[m] ← 1 m ← m + 1

HUFFMANDECODE(B[1 .. m]): k ← 1 v ← 2 n − 1 for i ← 1 to m if B[i] = 0 v ← L[v] else v ← R[v] if L[v] = 0 A[k] ← v k ← k + 1 v ← 2 n − 1

©^ c Copyright 2006 Jeff Erickson. Released under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License (http://creativecommons.org/licenses/by-nc-sa/2.5/)