Lecture Notes on Approximation Algorithms, Lecture notes of Algorithms and Programming

These lecture notes cover the topic of approximation algorithms, which are used to find solutions that are as close as possible to the optimal solution for optimization problems. The notes discuss the definition of approximation algorithms, their applications, and linear programming techniques for designing them. The notes also cover the weighted vertex cover problem and its solution using integer programming. from Cornell University's Fall 2018 CS 6820: Algorithms course.

Typology: Lecture notes

2017/2018

Uploaded on 05/11/2023

skips
skips 🇺🇸

4.4

(11)

222 documents

1 / 17

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Cornell University, Fall 2018 CS 6820: Algorithms
Lecture notes on approximation algorithms November 2018
1 Introduction: Approximation Algorithms
For many important optimization problems, there is no known polynomial-time algorithm to
compute the exact optimum. In fact, when we discuss the topic of NP-completeness later in the
semester, we’ll see that a great many such problems are all equivalently hard to solve, in the
sense that the existence of a polynomial-time algorithm for solving any one of them would imply
polynomial-time algorithms for all the rest.
The study of approximation algorithms arose as a way to circumvent the apparent hardness of
these problems by relaxing the algorithm designer’s goal: instead of trying to compute an exactly
optimal solution, we aim to compute a solution whose value is as close as possible to that of the
optimal solution. However, in some situations it is desirable to run an approximation algorithm
even when there exists a polynomial-time algorithm for computing an exactly optimal solution.
For example, the approximation algorithm may have the benefit of faster running time, a lower
space requirement, or it may lend itself more easily to a parallel or distributed implementation.
These considerations become especially important when computing on “big data,” where the
input size is so astronomical that a running time which is a high-degree polynomial of the input
size (or even quadratic, for that matter) cannot really be considered an efficient algorithm, at
least on present-day hardware.
To make the definition of approximation algorithms precise, we say that an algorithm ALG
for a maximization problem is an α-approximation algorithm (or that its approximation factor
is α) if the following inequality holds for every input instance x:
ALG(x)OPT(x)α·ALG(x).
Here OPT(x) denotes the value of the problem’s objective function when evaluated on the op-
timal solution to input instance x, and ALG(x) denotes the algorithm’s output when it runs
on input instance x. Note that the definition only requires the algorithm to output a number
(approximating the value of the optimal solution) and not the approximate solution itself. In
most cases, it is possible to design the algorithm so that it also outputs a solution attaining the
value ALG(x), but in these notes we adopt a definition of approximation algorithm that does
not require the algorithm to do so.
Similarly, for a minimization problem, an α-approximation algorithm must satisfy
OPT(x)ALG(x)α·OPT(x).
Note that in both cases the approximation factor αis a number greater than or equal to 1.
2 Approximation Algorithms Based on Linear Program-
ming
Linear programming is an extremely versatile technique for designing approximation algorithms,
because it is one of the most general and expressive problems that we know how to solve in
polynomial time. In this section we’ll discuss three applications of linear programming to the
design and analysis of approximation algorithms.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Lecture Notes on Approximation Algorithms and more Lecture notes Algorithms and Programming in PDF only on Docsity!

Cornell University, Fall 2018 CS 6820: Algorithms Lecture notes on approximation algorithms November 2018

1 Introduction: Approximation Algorithms

For many important optimization problems, there is no known polynomial-time algorithm to compute the exact optimum. In fact, when we discuss the topic of NP-completeness later in the semester, we’ll see that a great many such problems are all equivalently hard to solve, in the sense that the existence of a polynomial-time algorithm for solving any one of them would imply polynomial-time algorithms for all the rest. The study of approximation algorithms arose as a way to circumvent the apparent hardness of these problems by relaxing the algorithm designer’s goal: instead of trying to compute an exactly optimal solution, we aim to compute a solution whose value is as close as possible to that of the optimal solution. However, in some situations it is desirable to run an approximation algorithm even when there exists a polynomial-time algorithm for computing an exactly optimal solution. For example, the approximation algorithm may have the benefit of faster running time, a lower space requirement, or it may lend itself more easily to a parallel or distributed implementation. These considerations become especially important when computing on “big data,” where the input size is so astronomical that a running time which is a high-degree polynomial of the input size (or even quadratic, for that matter) cannot really be considered an efficient algorithm, at least on present-day hardware. To make the definition of approximation algorithms precise, we say that an algorithm ALG for a maximization problem is an α-approximation algorithm (or that its approximation factor is α) if the following inequality holds for every input instance x:

ALG(x) ≤ OPT(x) ≤ α · ALG(x).

Here OPT(x) denotes the value of the problem’s objective function when evaluated on the op- timal solution to input instance x, and ALG(x) denotes the algorithm’s output when it runs on input instance x. Note that the definition only requires the algorithm to output a number (approximating the value of the optimal solution) and not the approximate solution itself. In most cases, it is possible to design the algorithm so that it also outputs a solution attaining the value ALG(x), but in these notes we adopt a definition of approximation algorithm that does not require the algorithm to do so. Similarly, for a minimization problem, an α-approximation algorithm must satisfy OPT(x) ≤ ALG(x) ≤ α · OPT(x).

Note that in both cases the approximation factor α is a number greater than or equal to 1.

2 Approximation Algorithms Based on Linear Program-

ming

Linear programming is an extremely versatile technique for designing approximation algorithms, because it is one of the most general and expressive problems that we know how to solve in polynomial time. In this section we’ll discuss three applications of linear programming to the design and analysis of approximation algorithms.

2.1 LP Rounding Algorithm for Weighted Vertex Cover

In an undirected graph G = (V, E), if S ⊆ V is a set of vertices and e is an edge, we say that S covers e if at least one endpoint of e belongs to S. We say that S is a vertex cover if it covers every edge. In the weighted vertex cover problem, one is given an undirected graph G = (V, E) and a weight wv ≥ 0 for each vertex v, and one must find a vertex cover of minimum combined weight. We can express the weighted vertex cover problem as an integer program, by using decision variables xv for all v ∈ V that encode whether v ∈ S. For any set S ⊆ V we can define a vector x, with components indexed by vertices of G, by specifying that

xv =

1 if v ∈ S 0 otherwise.

S is a vertex cover if and only if the constraint xu + xv ≥ 1 is satisfied for every edge e = (u, v). Conversely, if x ∈ { 0 , 1 }V^ satisfies xu + xv ≥ 1 for every edge e = (u, v) then the set S = {v | xv = 1} is a vertex cover. Thus, the weighted vertex cover problem can be expressed as the following integer program.

min

v∈V wvxv s.t. xu + xv ≥ 1 ∀e = (u, v) ∈ E

xv ∈ { 0 , 1 } ∀v ∈ V

To design an approximation algorithm for weighted vertex cover, we will transform this integer program into a linear program by relaxing the constraint xv ∈ { 0 , 1 } to allow the variables xv to take fractional values. min

v∈V wvxv s.t. xu + xv ≥ 1 ∀e = (u, v) ∈ E

xv ≥ 0 ∀v ∈ V

It may seem more natural to replace the constraint xv ∈ { 0 , 1 } with xv ∈ [0, 1] rather than xv ≥ 0, but the point is that an optimal solution of the linear program will never assign any of the variables xv a value strictly greater than 1, because the value of any such variable could always be reduced to 1 without violating any constraints, and this would only improve the objective function

v wvxv. Thus, writing the constraint as^ xv^ ≥^ 0 rather than^ xv^ ∈^ [0,^ 1] is without loss of generality. It is instructive to present an example of a fractional solution of (2) that achieves a strictly lower weight than any integer solution. One such example is when G is a 3-cycle with vertices u, v, w, each having weight 1. Then the vector x =

2 ,^

1 2 ,^

1 2

satisfies all of the constraints of (2) and the objective function evaluates to 32 at x. In contrast, the minimum weight of a vertex cover of the 3-cycle is 2. We can solve the linear program (2) in polynomial time, but as we have just seen, the solution may be fractional. In that case, we need to figure out how we are going to post-process the fractional solution to obtain an actual vertex cover. In this case, the natural idea of rounding to the nearest integer works. Let x be an optimal solution of the linear program (2) and define

x˜v =

1 if xv ≥ 1 / 2 0 otherwise.

The dual LP insists that we maximize the combined price of all edges, subject to the con- straint that each vertex has enough wealth to pay for all the edges it covers. Rather than exactly maximizing the combined price of all edges, we will set edge prices using a natural (but subopti- mal) greedy heuristic: go through the edges in arbitrary order, increasing the price of each one as much as possible without violating the dual constraints. This results in the following algorithm.

Algorithm 1 Primal-dual algorithm for vertex cover

1: Initialize S = ∅, ye = 0 ∀e ∈ E, sv = 0 ∀v ∈ V. 2: for all e ∈ E do 3: δ = min{wu − su, wv − sv} 4: ye = ye + δ 5: su = su + δ 6: sv = sv + δ 7: if su = wu then 8: S = S ∪ {u} 9: end if 10: if sv = wv then 11: S = S ∪ {v} 12: end if 13: end for 14: return S

The variables sv keep track of the sum

e∈δ(v) ye^ (i.e., the left-hand side of the dual constraint corresponding to vertex v) as it grows during the execution of the algorithm. The rule for updat- ing S by inserting each vertex v such that sv = wv is inspired by the principle of complementary slackness from the theory of linear programming duality: if x∗^ is an optimal solution of a primal linear program and y∗^ is an optimal solution of the dual, then for every i such that x∗ i 6 = 0 the ith^ dual constraint must be satisfied with equality by y∗; similarly, for every j such that y j∗ 6 = 0, the jth^ primal constraint is satisfied with equality by x∗. Thus, it is natural that our decisions of which vertices to include in our vertex cover (primal solution) should be guided by keeping track of which dual constraints are tight (sv = wv). It is clear that each iteration of the main loop runs in constant time, so the algorithm runs in linear time. At the end of the loop processing edge e = (u, v), at least one of the vertices u, v must belong to S. Therefore, S is a vertex cover. To conclude the analysis we need to prove that the approximation factor is 2. To do so, we note the following loop invariants — statements that hold at the beginning and end of each execution of the for loop, though not necessarily in the middle. Each of them is easily proven by induction on the number of iterations of the for loop.

  1. y is a feasible vector for the dual linear program.
  2. sv =

e∈δ(v) ye.

  1. S = {v | sv = wv}.

v∈V sv^ = 2^

e∈E ye. Now the proof of the approximation factor is easy. Recalling that

e∈E ye^ ≤^

v∈OPT wv^ by weak duality, we find that ∑

v∈S

wv =

v∈S

sv ≤

v∈V

sv = 2

e∈E

ye ≤ 2

v∈OPT

wv.

2.3 Greedy Algorithm for Weighted Set Cover

Vertex cover is a special case of the set cover problem, in which there is a set U of n elements, and there are m subsets S 1 ,... , Sm ⊆ U , with positive weights w 1 ,... , wm. The goal is to choose a subcollection of the⋃ m subsets (indexed by an index set I ⊆ { 1 ,... , m}), such that

i∈I Si^ =^ U^ , and to minimize the combined weight^

i∈I wi.^ We will analyze the following natural greedy algorithm that chooses sets according to a “minimum weight per new element covered” criterion. (The variable T in the pseudocode below keeps track of the set of elements that are not yet covered by

i∈I Si.)

Algorithm 2 Greedy algorithm for set cover

1: Initialize I = ∅, T = U. 2: while T 6 = ∅ do 3: i = arg mink

wk |T ∩Sk |

∣ 1 ≤ k ≤ m, T ∩ Sk 6 = ∅

4: I = I ∪ {i}. 5: T = T \ Si. 6: end while 7: return I

It is clear that the algorithm runs in polynomial time and outputs a valid set cover. To analyze the approximation ratio, we will use the linear programming relaxation of set cover and its dual.

min

∑m i=1 wixi s.t.

i:j∈Si xi^ ≥^1 ∀j^ ∈^ U xi ≥ 0 ∀i = 1,... , m

max

j∈U yj s.t.

j∈Si yj^ ≤^ wi^ ∀i^ = 1,... , m yj ≥ 0 ∀j ∈ U

It will be helpful to rewrite the greedy set cover algorithm by adding some extra lines that do not influence the choice of which sets to place in I , but merely compute extra data relevant to the analysis. Specifically, in the course of choosing sets to include in I , we also compute a vector z indexed by elements of U. This is not a feasible solution of the dual LP, but at the end of the algorithm we scale it down to obtain another vector y that is feasible for the dual LP. The scale factor α will constitute an upper bound on the algorithm’s approximation ratio. This is called the method of dual fitting.

values zj assigned during that iteration of the while loop were set equal to the cost-effectiveness of that set. Summing the bounds (9) for q = 0,... , p − 1, we find that

j∈Si

zj ≤ wi ·

p

p − 1

< wi ·

∫ (^) p

1

dt t

= wi · (1 + ln p).

The lemma follows upon dividing both sides by α.

3 Randomized Approximation Algorithms

Randomized techniques give rise to some of the simplest and most elegant approximation algo- rithms. This section gives several examples.

3.1 A Randomized 2-Approximation for Max-Cut

In the max-cut problem, one is given an undirected graph G = (V, E) and a positive weight we for each edge, and one must output a partition of V into two subsets A, B so as to maximize the combined weight of the edges having one endpoint in A and the other in B. We will analyze the following extremely simple randomized algorithm: assign each vertex at random to A to B with equal probability, such that the random decisions for the different vertices are mutually independent. Let E(A, B) denote the (random) set of edges with one endpoint in A and the other endpoint in B. The expected weight of our cut is

E

e∈E(A,B)

we

e∈E

we · Pr(e ∈ E(A, B)) =

e∈E

we.

Since the combined weight of all edges in the graph is an obvious upper bound on the weight of any cut, this shows that the expected weight of the cut produced by our algorithm is at least half the weight of the maximum cut.

3.1.1 Derandomization using pairwise independent hashing

In analyzing the expected weight of the cut defined by our randomized algorithm, we never really used the full power of our assumption that the random decisions for the different vertices are mutually independent. The only property we needed was that for each pair of vertices u, v, the probability that u and v make different decisions is exactly 12. It turns out that one can achieve this property using only k = dlog 2 (n)e independent random coin tosses, rather than n independent random coin tosses. Let F 2 denote the field { 0 , 1 } under the operations of addition and multiplication modulo 2. Assign to each vertex v a distinct vector x(v) in the vector space Fk 2 ; our choice of k = dlog 2 (n)e ensures that the vector space contains enough elements to assign a distinct one to each vertex. Now let r be a uniformly random vector in Fk 2 , and partition the vertex set V into the subsets

Ar = {v | r · x(v) = 0} Br = {v | r · x(v) = 1}.

For any edge e = (u, v), the probability that e ∈ E(Ar, Br) is equal to the probability that r · (x(v) − x(u)) is nonzero. For any fixed nonzero vector w ∈ Fk 2 , we have Pr(r · w 6 = 0) = (^12) because the set of r satisfying r · w = 0 is a linear subspace of Fk 2 of dimension k − 1, hence exactly 2k−^1 of the 2k^ possible vectors r have zero dot product with w and the other 2k−^1 of them have nonzero dot product with w. Thus, if we sample r ∈ Fk 2 uniformly at random, the expected weight of the cut defined by (Ar, Br) is at least half the weight of the maximum cut. The vector space Fk 2 has only 2k^ = O(n) vectors in it, which suggests a deterministic alter- native to our randomized algorithm. Instead of choosing r at random, we compute the weight of the cut (Ar, Br) for every r ∈ Fk 2 and take the one with maximum weight. This is at least as good as choosing r at random, so we get a deterministic 2-approximation algorithm at the cost of increasing the running time by a factor of O(n).

3.1.2 Derandomization using conditional expectations

A different approach for converting randomization approximation algorithms into deterministic ones is the method of conditional expectations. In this technique, rather than making all of our random decisions simultaneously, we make them sequentially. Then, instead of making the decisions by choosing randomly between two alternatives, we evaluate both alternatives according to the conditional expectation of the objective function if we fix the decision (and all preceding ones) but make the remaining ones at random. Then we choose the alternative that optimizes this conditional expectation. To apply this technique to the randomized max-cut algorithm, we imagine maintaining a partition of the vertex set into three sets A, B, C while the algorithm is running. Sets A, B are the two pieces of the partition we are constructing. Set C contains all the vertices that have not yet been assigned. Initially C = V and A = B = ∅. When the algorithm terminates C will be empty. At an intermediate stage when we have constructed a partial partition (A, B) but C contains some unassigned vertices, we can imagine assigning each element of C randomly to A or B with equal probability, independently of the other elements of C. If we were to do this, the expected weight of the random cut produced by this procedure would be

w(A, B, C) =

e∈E(A,B)

we +

e∈E(A,C)

we +

e∈E(B,C)

we +

e∈E(C,C)

we.

This suggests the following deterministic algorithm that considers vertices one by one, assigning them to either A or B using the function w(A, B, C) to guide its decisions.

Algorithm 4 Derandomized max-cut algorithm using method of conditional expectations

1: Initialize A = B = ∅, C = V. 2: for all v ∈ V do 3: Compute w(A + v, B, C − v) and w(A, B + v, C − v). 4: if w(A + v, B, C − v) > w(A, B + v, C − v) then 5: A = A + v 6: else 7: B = B + v 8: end if 9: C = C − v 10: end for 11: return A, B

3.1.3 Semidefinite programming and the Goemans-Williamson algorithm

So far, in our discussion of max-cut, we have made no mention of linear programming. It’s worth considering for a moment the question of whether the natural linear programming relaxation of the max-cut problem can achieve an approximation factor better than 2. It’s actually not so easy to write down the natural linear programming relaxation of max-cut. We can define decision variables {xv | v ∈ V }, taking values in [0, 1], with the intended semantics that xv = 0 if v ∈ A and∑ xv = 1 if v ∈ B. The trouble is that the natural way to write the objective function is

e∈E |xu^ −^ xv|, which is not a linear function because the absolute value function is non-linear. A workaround is to define a variable ye for each edge e, with the intended semantics that ye = 1 if e crosses the cut (A, B) and otherwise ye = 0. This suggests the following linear programming relaxation of max-cut.

max

e∈E ye s.t. ye ≤ xu + xv ∀e = (u, v) ∈ E

ye ≤ (1 − xu) + (1 − xv) ∀e = (u, v) ∈ E

0 ≤ xv ≤ 1 ∀v ∈ V

As noted earlier, for every partition of V into two sets A, B we can set xv = 0 if v ∈ A, xv = 1 if v ∈ B, and ye = 1 if and only if e crosses the cut (A, B); this yields a valid integer solution of the linear program (10) such that the LP objective value equals the number of edges crossing the cut. Unfortunately, for every graph the linear program (10) also has a fractional solution whose objective value equals m, the number of edges in the graph. Namely, setting xv = 12 for all v and ye = 1 for all e satisfies the constraints of the linear program and achieves the objective value m. Since there exists graphs whose maximum cut contains barely more than half the edges (e.g., a complete graph on n vertices), solving this LP relaxation of max-cut only yields a 2-approximation, the same approximation factor achieved by the much simpler algorithms presented above. For many years, it was not known whether any polynomial-time approximation algorithm for max-cut could achieve an approximation factor better than 2. Then in 1994, Michel Goe- mans and David Williamson discovered an algorithm with approximation factor roughly 1.14, based on semidefinite programming (SDP). Since then, SDP has found an increasing number of applications algorithm design, not only in approximation algorithms (where SDP has many other applications besides max-cut), but also in machine learning and high-dimensional statistics, coding theory, and other areas. So what is semidefinite programming? An SDP is an optimization problem that seeks the maximum of a linear function over the set of symmetric positive semidefinite n × n matrices, subject to linear inequality constraints. See Lemma 2 below for the definition and some ba- sic properties of symmetric positive semidefinite matrices. For now, we limit ourselves to the following remarks motivating why semidefinite programming is a powerful tool.

  1. Semidefinite programming is solvable (to any desired precision) in polynomial time. The solution may contain irrational numbers, even if the input is made up exclusively of rational numbers, which is why the parenthetical remark about “to any desired precision” was necessary.
  1. One of the main themes of algorithm design is, “When you have a discrete optimization problem that you don’t know how to solve, formulate a related continuous optimization problem that you do know how to solve, and then try to figure out how to transform a solution of the continuous problem into a solution (either exact or approximate) or the original discrete problem. An obvious consequence of this theme is: any time someone discovers a continuous optimization problem that can be solved by an efficient algorithm, that’s a potential opportunity to design better algorithms for discrete optimization problems too. This remark alone could justify the importance of SDP’s in algorithm design.
  2. Any linear program can be reformulated as a semidefinite program, by optimizing over the set of diagonal positive semidefinite matrices. Thus, SDP is at least as powerful as LP.
  3. Often, one thinks of LP as relaxing a discrete optimization problem by allowing { 0 , 1 }- valued quantities to take continuous (scalar) values. In the same spirit, SDP can be thought of as further relaxing the problem by allowing scalar quantities to take (potentially high- dimensional) vector values.

To define semidefinite programming, we start with a lemma about real symmetric matrices. Any matrix satisfying the equivalent conditions listed in the lemma is called a symmetric positive semidefinite (PSD) matrix. The notation A  0 denotes the fact that A is PSD.

Lemma 2. For a real symmetric matrix A, the following properties are equivalent.

  1. Every eigenvalue of A is non-negative.
  2. A can be expressed as a weighted sum of the form

A =

∑^ m

i=

ciyiyT i (11)

where the coefficients ci are non-negative.

  1. A = XXT^ for some matrix X.
  2. There exists a vector space containing vectors x 1 ,... , xn such that A is the matrix of dot products of these vectors, i.e. for 1 ≤ i, j ≤ n, the dot product xi · xj occurs in the ith^ row and jth^ column of A.
  3. For every vector z, A satisfies the inequality zTAz ≥ 0.

Proof. To prove that the first property implies the second, we use the fact that every real sym- metric matrix is orthogonally diagonalizable, i.e. that A can be written in the form A = QDQT where Q is an orthogonal matrix and D is a diagonal matrix whose entries are the eigenvalues of A. Defining ci to be the ith^ diagonal entry of D, and yi to be the ith^ column of Q, the equation A = QDQT^ expands as A =

∑n i=1 ciyiy

T i , as desired. It is elementary to prove that (2) → (3) → (5) → (1), as follows. If A is represented by a weighted sum as in (11), then A = XXT^ where X is a matrix with k columns whose ith^ column is

ci · yi. If A = XXT, then for every z we have zTAz = (zTX)(XTz) = ‖XTz‖^2 ≥ 0. If the inequality zTAz ≥ 0 is satisfied for every vector z, then in particular it is satisfied whenever z is an eigenvector of A with eigenvalue λ. This implies that 0 ≤ zTAz = zT(λz) = λ‖z‖^2 and hence that λ ≥ 0. Having proven that (1) → (2) in the preceding paragraph, we may now conclude that

3.2 A Randomized 2-Approximation for Vertex Cover

For the unweighted vertex cover problem (the special case of weighted vertex cover in which wv = 1 for all v) the following incredibly simple algorithm is a randomized 2-approximation.

Algorithm 6 Randomized approximation algorithm for unweighted vertex cover

1: Initialize S = ∅. 2: for all e = (u, v) ∈ E do 3: if neither u nor v belongs to S then 4: Randomly choose u or v with equal probability. 5: Add the chosen vertex into S. 6: end if 7: end for 8: return S

Clearly, the algorithm runs in linear time and always outputs a vertex cover. To analyze its approximation ratio, as usual, we define an appropriate loop invariant. Let OPT denote any vertex cover of minimum cardinality. Let Si denote the contents of the set S after completing the ith^ iteration of the loop. We claim that for all i,

E[|Si ∩ OPT|] ≥ E[|Si \ OPT|]. (15)

The proof is by induction on i. In a loop iteration in which e = (u, v) is already covered by Si− 1 , we have Si = Si− 1 so (15) clearly holds. In a loop iteration in which e = (u, v) is not yet covered, we know that at least one of u, v belongs to OPT. Thus, the left side of (15) has probability at least 1/2 of increasing by 1, while the right side of (15) has probability at most 1/2 of increasing by 1. This completes the proof of the induction step. Consequently, letting S denote the random vertex cover generated by the algorithm, we have E[|S ∩ OPT|] ≥ E[|S \ OPT|] from which it easily follows that E[|S|] ≤ 2 · |OPT|. The same algorithm design and analysis technique can be applied to weighted vertex cover. In that case, we choose a random endpoint of an uncovered edge (u, v) with probability inversely proportional to the weight of that endpoint.

Algorithm 7 Randomized approximation algorithm for weighted vertex cover

1: Initialize S = ∅. 2: for all e = (u, v) ∈ E do 3: if neither u nor v belongs to S then 4: Randomly choose u with probability (^) wuw+vwv and v with probability (^) wuw+uwv. 5: Add the chosen vertex into S. 6: end if 7: end for 8: return S

The loop invariant is

E

[

v∈Si∩OPT

wv

]

≥ E

v∈Si\OPT

wv

In a loop iteration when (u, v) is uncovered, the expected increase in the left side is at least wuwv wu+wv whereas the expected increase in the right side is at most^

wuwv wu+wv.

4 Linear Programming with Randomized Rounding

Linear programming and randomization turn out to be a very powerful when used in combination. We will illustrate this by presenting an algorithm of Raghavan and Thompson for a problem of routing paths in a network to minimize congestion. The analysis of the algorithm depends on the Chernoff bound, a fact from probability theory that is one of the most useful tools for analyzing randomized algorithms.

4.1 The Chernoff bound

The Chernoff bound is a very useful theorem concerning the sum of a large number of independent random variables. Roughly speaking, it asserts that for any fixed β > 1, the probability of the sum exceeding its expected value by a factor greater than β tends to zero exponentially fast as the expected sum tends to infinity.

Theorem 3. Let X 1 ,... , Xn be independent random variables taking values in [0, 1], let X denote their sum, and let μ = E [X]. For every β > 1 ,

Pr (X ≥ βμ) < e−μ[β^ ln(β)−(β−1)]. (16)

For every β < 1 , Pr (X ≤ βμ) < e−μ[β^ ln(β)−(β−1)]. (17)

Proof. The key idea in the proof is to make use of the moment-generating function of X, defined to be the following function of a real-valued parameter t:

MX (t) = E

[

etX^

]

From the independence of X 1 ,... , Xn, we derive:

MX (t) = E

[

etX^1 etX^2 · · · etXn^

]

∏^ n

i=

E

[

etXi^

]

To bound each term of the product, we reason as follows. Let Yi be a { 0 , 1 }-valued random variable whose distribution, conditional on the value of Xi, satisfies Pr(Yi = 1 | Xi) = Xi. Then for each x ∈ [0, 1] we have

E

[

etYi

∣ (^) Xi = x

]

= xet^ + (1 − x)e^0 ≥ etx^ = E

[

etXi

∣ (^) Xi = x

]

where the inequality in the middle of the line uses the fact that etx^ is a convex function. Since this inequality holds for every value of x, we can integrate over x to remove the conditioning, obtaining E

[

etYi^

]

≥ E

[

etXi^

]

Letting μi denote E[Xi] = Pr(Yi = 1) we find that [ etXi^

]

[

etYi^

]

= μiet^ + (1 − μi) = 1 + μi(et^ − 1) ≤ exp

μi(et^ − 1)

where exp(x) denotes ex, and the last inequality holds because 1 + x ≤ exp(x) for all x. Now substituting this upper bound back into (18) we find that

E

[

etX^

]

∏^ n

i=

exp

μi(et^ − 1)

= exp

μ(et^ − 1)

The first step in designing our approximation algorithm is to come up with a linear program- ming relaxation. To do so, we define a decision variable xi,e for each i = 1,... , k and each e ∈ E, denoting whether or not e belongs to Pi, and we allow this variable to take fractional values. The resulting linear program can be written as follows, using δ+(v) to denote the set of edges leaving v and δ−(v) to denote the set of edges entering v.

min r

s.t.

e∈δ+(v) xi,e^ −^

e∈δ−(v) xi,e^ =

1 if v = si − 1 if v = ti 0 if v 6 = si, ti

∀i = 1,... , k, v ∈ V

∑k i=1 xi,e^ ≤^ ce^ ·^ r^ ∀e^ ∈^ E xi,e ≥ 0 ∀i = 1,... , k, e ∈ E

When (xi,e) is a { 0 , 1 }-valued vector obtained from a collection of paths P 1 ,... , Pk by setting xi,e = 1 for all e ∈ Pi, the first constraint ensures that Pi is a path from si to ti while the second one ensures that the congestion of each edge is bounded above by r. Our approximation algorithm solves the linear program (24), does some postprocessing of the solution to obtain a probability distribution over paths for each terminal pair (si, ti), and then outputs an independent random sample from each of these distributions. To describe the postprocessing step, it helps to observe that the first LP constraint says that for every i ∈ { 1 ,... , k}, the values xi,e define a network flow of value 1 from si to ti. Define a flow to be acyclic if there is no directed cycle C with a positive amount of flow on each edge of C. The first step of the postprocessing is to make the flow (xi,e) acyclic, for each i. If there is an index i ∈ { 1 ,... , k} and a directed cycle C such that xi,e > 0 for every edge e ∈ C, then we can let δ = min{xi,e | e ∈ C} and we can modify xi,e to xi,e − δ for every e ∈ C. This modified solution still satisfies all of the LP constraints, and has strictly fewer variables xi,e taking nonzero values. After finitely many such modifications, we must arrive at a solution in which each of the flow (xi,e), 1 ≤ i ≤ k is acyclic. Since this modified solution is also an optimal solution of the linear program, we may assume without loss of generality that in our original solution (xi,e) the flow was acyclic for each i. Next, for each i ∈ { 1 ,... , k} we take the acyclic flow (xi,e) and represent it as a probability distribution over paths from si to ti, i.e. a set of ordered pairs (P, πP ) such that P is a path from si to ti, πP is a positive number interpreted as the probability of sampling P , and the sum of the probabilities πP over all paths P is equal to 1. The distribution can be constructed using the following algorithm.

Algorithm 8 Postprocessing algorithm to construct path distribution

1: Given: Source si, sink ti, acyclic flow xi,e of value 1 from si to ti. 2: Initialize Di = ∅. 3: while there is a path P from si to ti such that xi,e > 0 for all e ∈ P do 4: πP = min{xi,e | e ∈ P } 5: Di = Di ∪ {(P, πP )}. 6: for all e ∈ P do 7: xi,e = xi,e − πP 8: end for 9: end while 10: return Di

Each iteration of the while loop strictly reduces the number of edges with xi,e > 0, hence the algorithm must terminate after selecting at most m paths. When it terminates, the flow (xi,e) has value zero (as otherwise there would be a path from si to ti with positive flow on each edge) and it is acyclic because (xi,e) was initially acyclic and we never put a nonzero amount of flow on an edge whose flow was initially zero. The only acyclic flow of value zero is the zero flow, so when the algorithm terminates we must have xi,e = 0 for all e. Each time we selected a path P , we decreased the value of the flow by exactly πP. The value was initially 1 and finally 0, so the sum of πP over all paths P is exactly 1 as required. For any given edge e, the value xi,e decreased by exactly πP each time we selected a path P containing e, hence the combined probability of all paths containing e is exactly xi,e. Performing the postprocessing algorithm 8 for each i, we obtain probability distributions D 1 ,... , Dk over paths from si to ti, with the property that the probability of a random sample from Di traversing edge e is equal to xi,e. Now we draw one independent random sample from each of these k distributions and output the resulting k-tuple of paths, P 1 ,... , Pk. We claim that with probability at least 1/2, the parameter maxe∈E {`e/ce} is at most αr, where α = (^) log log(23 log(2mm)). This follows by a direct application of Corollary 4 of the Chernoff bound. For any given edge e, we can define independent random variables X 1 ,... , Xk by specifying that

Xi =

(ce · r)−^1 if e ∈ Pi 0 otherwise.

These are independent and the expectation of their sum is

∑k i=1 xi,e/(ce^ ·^ r), which is at most 1 because of the second LP constraint above. Applying Corollary 4 with N = 2m, we find that the probability of X 1 + · · · + Xk exceeding α is at most 1/(2m). Since X 1 + · · · Xk = e(ce · r)−^1 , this means that the probability ofe/ce exceeding αr is at most 1/(2m). Summing the probabilities of these failure events for each of the m edges of the graph, we find that with probability at least 1 /2, none of the failure events occur and maxe∈E {e/ce} is bounded above by αr. Now, r is a lower bound on the parameter maxe∈E {e/ce} for any k-tuple of paths with the specified source- sink pairs, since any such k-tuple defines a valid LP solution and r is the optimum value of the LP. Consequently, our randomized algorithm achieves approximation factor α with probability at least 1/2.