Set Cover Problem: Unweighted and Weighted Formulations and Approximation Algorithms, Study notes of Advanced Algorithms

The set cover problem is a well-known NP-complete problem in computer science. The problem is presented in both the ordinary and the hitting set formulations, and an optimal approximation algorithm based on a greedy strategy and an approximation algorithm derived from an integer programming formulation are introduced. The document also covers weighted set covering and its application to routing data packets.

Typology: Study notes

2012/2013

Uploaded on 04/23/2013

atasi
Lecture 8: Set Cover

Abstract

This lecture focuses on the "Set Cover" problem, which is one of Karp's original 21 NP-complete problems [2]. Two formulations are given and one optimal approximation algorithm based on a greedy strategy is introduced. Further, the problem is generalized to weighted elements, and an approximation algorithm derived from an Integer Programming (IP) formulation is presented.

1 Unweighted Set Covering

There are two different ways to look at the set covering problem. First we introduce the ordinary formulation, then we introduce the hitting set formulation.

1.1 Ordinary Formulation

We define $P$, a collection of subsets of a set $X$, as follows:

$$X = \{x_1, \dots, x_n\} \tag{1}$$

$$P = \{P_1, \dots, P_r\} \tag{2}$$

where $x_i$ represents a skill, $P_j$ a person, and $x_i \in P_j$ if person $P_j$ possesses the skill $x_i$, for $i = 1, \dots, n$ and $j = 1, \dots, r$.

The goal is for an employer to find the minimum number of people who cover all the skills, namely to find the smallest set $R \subseteq P$ such that the people in $R$ cover all the skills:

$$\bigcup_{P_j \in R} P_j = X \tag{3}$$

An example with 6 people and 12 skills is illustrated in Figure 1. In the example, picking the set $\{P_1, P_2, P_4, P_6\}$ ensures that every skill is covered. However, the set $\{P_3, P_4, P_5\}$ also covers every skill and has size 3, smaller than the former set of size 4.

Figure 1: An example of the set covering problem in terms of the ordinary formulation.
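To make the ordinary formulation concrete, the following is a minimal sketch that finds a smallest cover by exhaustive search. The instance (4 people, 6 skills) and the names `minimum_cover`, `skills`, and `people` are illustrative assumptions; it is not the data of Figure 1, and the exponential search is only suitable for tiny inputs.

```python
from itertools import combinations


def minimum_cover(universe, people):
    """Return a smallest list of names whose skill sets together cover `universe`.

    Tries every group of size 1, 2, ... in turn, so the running time is
    exponential in the number of people.
    """
    names = sorted(people)
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            covered = set().union(*(people[name] for name in group))
            if universe <= covered:
                return list(group)
    return None  # no cover exists: some skill is held by nobody


# Hypothetical instance (NOT the data of Figure 1).
skills = set(range(1, 7))
people = {
    "P1": {1, 2, 3},
    "P2": {4, 5},
    "P3": {1, 4, 6},
    "P4": {2, 3, 5},
}

best = minimum_cover(skills, people)
print(best)
```

In this instance no single person covers all six skills, so the search moves on to pairs and stops at the first pair whose union is the whole skill set.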

1.2 Hitting Set Formulation

The following formulation is equivalent:

$$P = \{p_1, \dots, p_r\} \tag{4}$$

$$X = \{X_1, \dots, X_n\} \tag{5}$$

where $X_i$ represents a skill, $p_j$ a person, and $p_j \in X_i$ if person $p_j$ possesses the skill $X_i$, for $i = 1, \dots, n$ and $j = 1, \dots, r$. As before, the goal is for an employer to find the minimum number of people who cover all the skills, namely to find the smallest set $H \subseteq P$ such that the people in $H$ cover all the skills:

$$H = \arg\min_{H \subseteq P} |H| \quad \text{s.t.} \quad |H \cap X_i| \neq 0, \ \forall i = 1, 2, \dots, n \tag{6}$$
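The equivalence of the two formulations can be illustrated with a short sketch that turns "person to skills" data into "skill to holders" data and then tests the hitting condition $|H \cap X_i| \neq 0$. The function names and the three-person instance are assumptions made for illustration only.

```python
def to_hitting_set(people):
    """Map each skill to the set of people who possess it (the sets X_i)."""
    holders = {}
    for person, skills in people.items():
        for skill in skills:
            holders.setdefault(skill, set()).add(person)
    return holders


def is_hitting_set(chosen, holders):
    """True if `chosen` intersects every X_i, i.e. |H ∩ X_i| != 0 for all i."""
    return all(chosen & group for group in holders.values())


# Hypothetical instance in the ordinary formulation: person -> skills.
people = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"c"}}

holders = to_hitting_set(people)
print(holders)
print(is_hitting_set({"p1", "p2"}, holders))
print(is_hitting_set({"p1"}, holders))
```

A set of people is a cover in the ordinary formulation exactly when it is a hitting set for the sets produced by `to_hitting_set`.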

1.3 A Solution to the Problem

As mentioned, this problem is NP-hard. However, we can obtain an $O(\log n)$-approximation algorithm, and it can be proved [1] that no better asymptotic approximation factor can be achieved unless $P = NP$.
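The algorithm's own description is not part of this excerpt. As a hedged sketch, the following assumes the standard greedy strategy named in the abstract: at each stage pick the set that covers the most still-uncovered elements, and charge a total cost of 1 for that stage, split evenly among the newly covered elements. The names `greedy_set_cover` and `charge`, the tie-breaking rule, and the instance are illustrative assumptions.

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover with per-element charges.

    Returns (chosen, charge): `chosen` lists the picked set names in order,
    and `charge[x]` is 1 / (number of elements newly covered in the stage
    that first covered x).
    """
    uncovered = set(universe)
    chosen, charge = [], {}
    while uncovered:
        # Ties are broken by name so that the result is deterministic.
        best = max(sorted(sets), key=lambda name: len(sets[name] & uncovered))
        newly = sets[best] & uncovered
        if not newly:
            raise ValueError("some element belongs to no set")
        for x in newly:
            charge[x] = 1 / len(newly)
        chosen.append(best)
        uncovered -= newly
    return chosen, charge


# Hypothetical instance on which greedy is not optimal:
# {"top", "bottom"} is a cover of size 2, but greedy starts with "big".
universe = {1, 2, 3, 4, 5, 6}
sets = {"top": {1, 2, 3}, "bottom": {4, 5, 6}, "big": {1, 2, 4, 5}}

chosen, charge = greedy_set_cover(universe, sets)
print(chosen, sum(charge.values()))
```

Because every stage distributes exactly one unit of cost, the charges always sum to the number of sets chosen.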

The total cost $C = \sum_{x \in X} C_x$ is the number of people hired by this algorithm, because each stage charges a total cost of 1, distributed over the skills covered in that stage. Now we show that $C$ is $O(C^* \log n)$. First, we have:

$$C = \sum_{x \in X} C_x \leq \sum_{S \in H^*} \sum_{x \in S} C_x \tag{9}$$

This inequality holds because every skill is counted exactly once on the left-hand side, while in the optimal solution $H^*$, which is a cover, every skill is counted at least once. Second, we bound the right-hand side using two lemmas.

Lemma 1.3.1: For every set $S$ belonging to $P$,

$$\sum_{x \in S} C_x \leq H(|S|) \tag{10}$$

where $H(k) = \sum_{i=1}^{k} 1/i$ denotes the $k$-th harmonic number.

Proof of Lemma: Fix $S \in P$. For $i = 1, \dots, C$, let $S_i$ be the set selected by the algorithm in stage $i$, and let $u_i = |S - (S_1 \cup S_2 \cup \dots \cup S_i)|$ be the number of elements of $S$ remaining uncovered after $S_1, \dots, S_i$ have been selected; thus $u_0 = |S|$. Let $k$ be the first stage at which $u_k = 0$. Clearly $u_{i-1} \geq u_i$, and $u_{i-1} - u_i$ elements of $S$ are covered for the first time by $S_i$:

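$$\sum_{x \in S} C_x = \sum_{i=1}^{k} (u_{i-1} - u_i) \cdot \frac{1}{|S_i - (S_1 \cup S_2 \cup \dots \cup S_{i-1})|} \tag{11}$$

but, since the algorithm picks $S_i$ greedily, $S$ cannot cover more new elements than $S_i$ does, so we have

$$|S_i - (S_1 \cup S_2 \cup \dots \cup S_{i-1})| \geq |S - (S_1 \cup S_2 \cup \dots \cup S_{i-1})| = u_{i-1} \tag{12}$$

and therefore

$$\sum_{x \in S} C_x \leq \sum_{i=1}^{k} \frac{u_{i-1} - u_i}{u_{i-1}} \leq \sum_{i=1}^{k} \left( H(u_{i-1}) - H(u_i) \right) = H(u_0) - H(u_k) \leq H(|S|) \tag{13}$$

where the second inequality is based on the following lemma from calculus.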

Lemma 1.3.2: Given two integers $a$ and $b$ with $0 \leq a \leq b$ and $b \geq 1$,

$$H(b) - H(a) = \sum_{i=a+1}^{b} \frac{1}{i} \geq \frac{b - a}{b} \tag{14}$$
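As a sanity check, the inequality $H(b) - H(a) \geq (b - a)/b$ can be verified exactly with rational arithmetic. The sketch below reads the lemma with $a \leq b$, so that the sum has $b - a$ terms, each at least $1/b$; the helper names and the tested range are illustrative assumptions.

```python
from fractions import Fraction


def harmonic(k):
    """k-th harmonic number H(k) = 1 + 1/2 + ... + 1/k, with H(0) = 0."""
    return sum(Fraction(1, i) for i in range(1, k + 1))


def lemma_holds(a, b):
    """Check H(b) - H(a) >= (b - a) / b for integers 0 <= a <= b, b >= 1."""
    return harmonic(b) - harmonic(a) >= Fraction(b - a, b)


# Exhaustive check over a small range of (a, b) pairs.
all_ok = all(lemma_holds(a, b) for b in range(1, 30) for a in range(0, b + 1))
print(all_ok)
```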

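Thus, by applying Lemma 1.3.1 to every set in the optimal cover $H^*$, where $C^* = |H^*|$:

$$C \leq \sum_{S \in H^*} \sum_{x \in S} C_x \leq \sum_{S \in H^*} H(|S|) \leq C^* \cdot H\left(\max_{S \in P} |S|\right) \leq C^* \cdot (\ln n + 1) \tag{15}$$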

This completes the proof of the theorem.

2 Weighted Set Covering

Based on the hitting set formulation, we assign weights to the people, which can be thought of as salaries. Denote by $w_i$ the weight of person $p_i$, where $i = 1, 2, \dots, r$. Some people are more expensive to hire than others. The goal is again to find a subset $H \subseteq P$ that covers all the skills, but instead of minimizing its size we minimize $\sum_{p_i \in H} w_i$. Next we formulate this as an integer programming problem.

2.1 Integer Programming Formulation for Weighted Set Covering

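For $1 \leq i \leq r$, define the indicator variable $V_i$:

$$V_i = \begin{cases} 1 & \text{if } p_i \in H \\ 0 & \text{otherwise} \end{cases}$$

The goal is to minimize $\sum_{i=1}^{r} w_i \cdot V_i$, subject to $\sum_{p_i \in X_j} V_i \geq 1$ for all $X_j \in X$.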
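For a tiny instance, the integer program can be solved by enumerating all 0/1 assignments, which makes the objective and the covering constraints explicit. This is only an illustrative sketch: the function name, the 0-based person indices, and the instance are assumptions, and the enumeration is exponential in the number of people.

```python
from itertools import product


def solve_ip_brute_force(weights, skill_holders):
    """Minimize sum(w_i * V_i) over V in {0,1}^r.

    Constraint: for every skill, the V_i of its holders must sum to >= 1.
    Returns (best_value, best_assignment), or (None, None) if infeasible.
    """
    r = len(weights)
    best_value, best_v = None, None
    for v in product((0, 1), repeat=r):
        feasible = all(sum(v[i] for i in holders) >= 1 for holders in skill_holders)
        if not feasible:
            continue
        value = sum(w * vi for w, vi in zip(weights, v))
        if best_value is None or value < best_value:
            best_value, best_v = value, v
    return best_value, best_v


# Hypothetical instance: person 0 has both skills but a salary of 5;
# persons 1 and 2 have one skill each and a salary of 2.
weights = [5, 2, 2]
skill_holders = [{0, 1}, {0, 2}]  # X_j given as sets of person indices

result = solve_ip_brute_force(weights, skill_holders)
print(result)
```

Here the minimum-weight solution hires two people even though a single person would suffice, which is exactly the difference between the weighted and unweighted objectives.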

Now we show that this randomized algorithm is an $O(\log n)$-approximation algorithm, which is optimal with respect to $n$. First, the expected weight (cost) of the partial cover produced by one round is:

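$$E\left[\sum_{i=1}^{r} w_i \cdot V_i\right] = \sum_{i=1}^{r} w_i \cdot E[V_i] = \sum_{i=1}^{r} w_i \cdot \hat{V}_i = \text{OPT cost of LP} \leq \text{OPT cost of IP} \tag{18}$$

Here $V_i$ denotes the rounded 0/1 indicator and $\hat{V}_i$ the optimal fractional value in the LP relaxation, so that $E[V_i] = \hat{V}_i$.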

Then we calculate the probability that $X_i$ is covered. Suppose $X_i$ contains $p_1, \dots, p_k$. By the LP constraint for $X_i$, we know that:

$$\sum_{j=1}^{k} \hat{V}_j \geq 1 \tag{19}$$

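$$\begin{aligned}
\Pr[X_i \text{ is covered}] &= 1 - \Pr[\text{none of } p_1, \dots, p_k \text{ is chosen}] \\
&= 1 - \Pr[p_1 \text{ isn't chosen} \wedge p_2 \text{ isn't chosen} \wedge \dots \wedge p_k \text{ isn't chosen}] \\
&= 1 - (1 - \hat{V}_1) \cdot \ldots \cdot (1 - \hat{V}_k) \\
&= 1 - \prod_{j=1}^{k} (1 - \hat{V}_j) \\
&\geq 1 - (1 - 1/k)^k \geq 1 - \frac{1}{e}
\end{aligned} \tag{20}$$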
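The bound on the per-round coverage probability can be checked numerically. The sketch below evaluates $1 - \prod_j (1 - \hat{V}_j)$ for fractional values that sum to at least 1; the function name and the test vectors are illustrative assumptions, and independence of the individual choices is assumed as in the derivation.

```python
import math


def coverage_probability(v_hat):
    """Probability that at least one person is chosen when person j is
    picked independently with probability v_hat[j]."""
    return 1 - math.prod(1 - v for v in v_hat)


bound = 1 - 1 / math.e

# Uniform case v_hat[j] = 1/k: this is the worst case for a given k.
uniform_ok = all(coverage_probability([1 / k] * k) >= bound for k in range(1, 51))

# A non-uniform vector whose entries also sum to 1.
skewed = coverage_probability([0.5, 0.3, 0.2])
print(uniform_ok, skewed, bound)
```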

Next we show that if we repeat the round $2 \lg m$ times, where $m$ is the number of skills, then with high probability all of the skills are covered. We have proved that for any skill, the probability that it is covered in one round is at least $1 - 1/e$, which is at least one half. Thus,

$$\Pr[\text{skill } i \text{ is not covered after } 2 \lg m \text{ rounds}] \leq \left(1 - \frac{1}{2}\right)^{2 \lg m} = \frac{1}{m^2} \tag{21}$$

Therefore, by the union bound, the probability that there exists some skill not covered after $2 \lg m$ rounds satisfies:

$$\Pr[\text{some skill not covered}] \leq \sum_{i=1}^{m} \Pr[\text{skill } i \text{ is not covered}] \leq \sum_{i=1}^{m} \frac{1}{m^2} = \frac{1}{m} \tag{22}$$

With high probability, all the skills are covered after $2 \lg m$ rounds. As calculated before, the expected weight of each round is at most OPT, where OPT is the optimal weight of the IP problem. In all, after $2 \lg m$ rounds the expected weight is at most $2 \lg m \cdot \text{OPT}$, which proves that the algorithm is an $O(\log m)$-approximation algorithm.

3 An Application for Set Covering

Motivation. Due to the large number of web users, it is extremely important to find an efficient and effective way to route data packets. During routing, there are two quantities we want to optimize: the first is the length of the packet headers, which contain the information about the paths towards the destination; the second is the size of the tables that map each source-destination pair to the shortest routing path for that pair. Next we introduce a method based on graph spanners that reduces the storage for all shortest paths (which is usually of order $n^2$, where $n$ is the number of nodes) to $n^{1.5}$, with the guarantee that the shortest paths are degraded by a factor of at most 3. This method uses the set cover solution elaborated above. Given an undirected graph $G = (V, E)$, the goal is to construct a spanning graph $G' = (V, E')$ such that:

$$d_{G'}(u, v) \leq 3\, d_G(u, v), \quad \forall u, v \in V \tag{23}$$

and the number of edges $|E'|$ in $G'$ is $O(n^{1.5} \log n)$ instead of $O(n^2)$. The idea is that for each vertex $v$ we only store its $m$ closest neighbors; if we set $m = n^{0.5}$, then we need to store at least $n \cdot n^{0.5} = n^{1.5}$ vertices along the shortest paths. This forms the spanning graph $G'$. Then we define a set $L$ of landmarks for the packets and routing tables, which should intuitively satisfy the following two conditions:

  • L is not too big to maintain;

References

[1] Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, pages 634–652, July 1998.

[2] R. M. Karp. Reducibility Among Combinatorial Problems. In R. E. Miller and J. W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.