Efficient Implementation of Disjoint Set Union-Find: Kruskal's Algorithm and Cost Analysis, Study notes of Data Structures and Algorithms

An efficient implementation of disjoint set union-find, a data structure used in Kruskal's algorithm for finding a minimum spanning tree. The notes cover the three operations MAKESET, UNION, and FIND and their costs, and present two heuristics, union by rank and path compression, that improve performance. An analysis shows that the total running time of a sequence of operations is O((m+n) log* n), where log* n is the number of times the log function must be applied to n before the result is less than or equal to 1.

UC Berkeley—CS 170: Efficient Algorithms and Intractable Problems Handout 12
Lecturer: David Wagner March 11, 2003
Notes 12 for CS 170
1 Disjoint Set Union-Find
Kruskal’s algorithm for finding a minimum spanning tree used a structure for maintaining
a collection of disjoint sets. Here, we examine efficient implementations of this structure.
It supports the following three operations:
MAKESET(x) - create a new set containing the single element x.
UNION(x,y) - replace the two sets containing x and y by their union.
FIND(x) - return the name of the set containing the element x. For our purposes this
will be a canonical element in the set containing x.
We will consider how to implement this efficiently, where we measure the cost of doing
an arbitrary sequence of m UNION and FIND operations on n initial sets created by
MAKESET. The minimum possible cost would be O(m+n), i.e., cost O(1) for each call
to MAKESET, UNION, or FIND. Our ultimate implementation will be nearly this cheap,
and indeed be this cheap for all practical values of m and n.
The simplest implementation one could imagine is to represent each set as a linked list,
where we keep track of both the head and the tail. The canonical element is the tail of
the list (the final element reached by following the pointers in the other list elements), and
UNION simply concatenates lists. In this case FIND has maximal cost proportional to the
length of the list, since following each pointer costs O(1), and UNION has cost O(1), to
point the tail of one set to the head of the other. The worst case cost is attained by doing
n UNIONs, to get a single set, and then m FINDs on the head of the list, for a total cost
of O(mn), much larger than our target O(m+n).
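The worst case for the linked-list representation can be checked by counting pointer hops. The Python below is only an illustrative sketch: it assumes the elements are the integers 0..n-1, uses a hypothetical array `nxt` for the successor pointers, and hard-codes the concatenations instead of tracking heads and tails.

```python
def find(nxt, x):
    # Walk to the tail (the canonical element), counting pointers followed.
    hops = 0
    while nxt[x] is not None:
        x = nxt[x]
        hops += 1
    return x, hops

n, m = 8, 5
# Each element starts as a one-element list: no successor.
nxt = [None] * n
# Concatenate one list at a time: the tail i of the list 0..i points to head i+1.
for i in range(n - 1):
    nxt[i] = i + 1
# m FINDs on the head of the single remaining list.
total = sum(find(nxt, 0)[1] for _ in range(m))
print(total)  # m * (n - 1) pointer hops
```

Every FIND on the head walks the whole list, so the total grows as the product of m and n.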
To do a better job, we need a more clever data structure. Let us think about how to
improve the above simple one. First, instead of taking the union by concatenating lists, we
simply make the tail of one list point to the tail of the other, as illustrated below. That
way the maximum cost of FIND on any element of the union will have cost proportional to
the maximum of the two list lengths (plus one, if both have the same length), rather than
the sum.
[Figure: UNION, with the tail of one list pointing to the tail of the other]
More generally, we see that a sequence of UNIONs will result in a tree representing each
set, with the root of the tree as the canonical element. To simplify coding, we will mark the
root by setting the pointer in the root to point to itself. This leads to the following initial
implementations of MAKESET and FIND:

procedure MAKESET(x) ... initial implementation
    p(x) := x

function FIND(x) ... initial implementation
    if x ≠ p(x) then
        return FIND(p(x))
    else return x

It is convenient to add a fourth operation LINK(x,y), where x and y are required to be two roots. LINK changes the parent pointer of one of the roots, say x, and makes it point to y. It returns the root of the composite tree, y. Then UNION(x,y) = LINK(FIND(x), FIND(y)).

But this by itself is not enough to reduce the cost; if we are so unlucky as to make the root of the bigger tree point to the root of the smaller tree, n UNION operations can still lead to a single chain of length n, and the same cost as above.

This motivates the first of our two heuristics: UNION BY RANK. This simply means that we keep track of the depth (or RANK) of each tree, and make the shorter tree point to the root of the taller tree; code is shown below. Note that if we take the UNION of two trees of the same RANK, the RANK of the UNION is one larger than the common RANK, and otherwise equal to the max of the two RANKs. This will keep the RANK of a tree of n nodes from growing past O(log n), but m UNIONs and FINDs can then still cost O(m log n).
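Before turning to the rank-based code, the unlucky case without UNION BY RANK can be reproduced in a few lines. This is a minimal Python sketch, not part of the handout: the parent pointers live in a dictionary `p`, and `link_naive` and `depth` are hypothetical helper names introduced for the illustration.

```python
def makeset(p, x):
    p[x] = x                      # a root points to itself

def find(p, x):
    # Initial implementation, written iteratively.
    while p[x] != x:
        x = p[x]
    return x

def link_naive(p, x, y):
    # Always makes root x point to root y, ignoring tree sizes.
    p[x] = y
    return y

def depth(p, x):
    # Number of pointers followed from x to its root.
    d = 0
    while p[x] != x:
        x = p[x]
        d += 1
    return d

n = 6
p = {}
for x in range(n):
    makeset(p, x)
root = 0
for x in range(1, n):
    # Unlucky order: the root of the bigger tree points to the new singleton.
    root = link_naive(p, find(p, root), find(p, x))
print(depth(p, 0))  # n - 1: the tree is a single chain
```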

procedure MAKESET(x) ... final implementation
    p(x) := x
    RANK(x) := 0

function LINK(x,y)
    if RANK(x) > RANK(y) then swap x and y
    if RANK(x) = RANK(y) then RANK(y) := RANK(y) + 1
    p(x) := y
    return(y)

The second heuristic, PATH COMPRESSION, is motivated by observing that since each FIND operation traverses a linked list of vertices on the way to the root, one could make later FIND operations cheaper by making each of these vertices point directly to the root:

function FIND(x) ... final implementation
    if x ≠ p(x) then
        p(x) := FIND(p(x))
        return(p(x))
    else return(x)
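The final MAKESET, LINK, and FIND can be put together as a runnable sketch. The Python below follows the pseudocode under a few assumptions that are not in the handout: the class name `DisjointSets` is hypothetical, elements are any hashable values stored in dictionaries, and `union` returns early when both elements already share a root, a guard the pseudocode leaves to the caller.

```python
class DisjointSets:
    def __init__(self):
        self.p = {}       # parent pointers; a root points to itself
        self.rank = {}    # RANK of each element

    def makeset(self, x):
        self.p[x] = x
        self.rank[x] = 0

    def find(self, x):
        # PATH COMPRESSION: every vertex on the path ends up pointing to the root.
        if self.p[x] != x:
            self.p[x] = self.find(self.p[x])
        return self.p[x]

    def link(self, x, y):
        # UNION BY RANK: x and y must be roots; the lower-rank root points to the other.
        if self.rank[x] > self.rank[y]:
            x, y = y, x
        if self.rank[x] == self.rank[y]:
            self.rank[y] += 1
        self.p[x] = y
        return y

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx     # assumption: already in the same set, nothing to do
        return self.link(rx, ry)

ds = DisjointSets()
for x in "abcde":
    ds.makeset(x)
ds.union("a", "b")
ds.union("c", "d")
ds.union("b", "d")
print(ds.find("a") == ds.find("c"))  # True: a, b, c, d are now one set
print(ds.find("e") == ds.find("a"))  # False: e is still alone
```

After the call `ds.find("a")`, the element `a` points directly at the root instead of at its old parent `b`, which is the effect PATH COMPRESSION is after.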







[Figure: PATH COMPRESSION]

We will prove below that any sequence of m UNION and FIND operations on n elements takes at most O((m+n) log* n) steps, where log* n is the number of times you must iterate the log function on n before the result is less than or equal to 1.
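To get a feel for how slowly the iterated logarithm log* n grows, it can be computed directly by applying the logarithm until the value drops to 1 or below. The sketch below assumes base-2 logarithms, which the notes do not state explicitly, and the function name `log_star` is introduced only for this illustration.

```python
import math

def log_star(n):
    # Count how many times log2 must be applied before the value is <= 1.
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

for n in (2, 16, 65536, 2 ** 65536):
    print(log_star(n))  # 1, 3, 4, 5
```

Even for n = 2^65536 the value is only 5, which is why the bound behaves like O(m+n) for all practical values of m and n.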

We will show that each find operation takes O(log* n) time, plus some additional time that is paid for using the tokens of the vertices that are visited during the find operation. In the end, we will have used at most O((m + n) log* n) time. Let us define the token distribution. If an element u has (at the end of the m operations) rank in the range (k, 2^k] then we will give (at the beginning) 2^k tokens to it.

Lemma 3 We are distributing a total of at most n log* n tokens.

Proof: Consider the group of elements of rank in the range (k, 2^k]: we are giving 2^k tokens to each of them, and there are at most n/2^k elements in the group, so we are giving a total of at most n tokens to that group. In total we have at most log* n groups, and the lemma follows. □
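The counting in Lemma 3 can be checked on a concrete size. The sketch below is an illustration under stated assumptions: it takes n = 2^16, assumes ranks never exceed log2 n = 16, starts the groups at k = 0, and uses the hypothetical helper `rank_groups` to list the ranges (k, 2^k].

```python
def rank_groups(max_rank):
    # Ranges (k, 2**k] covering the ranks 1..max_rank, starting from k = 0.
    groups = []
    k = 0
    while k < max_rank:
        groups.append((k, 2 ** k))
        k = 2 ** k
    return groups

n = 2 ** 16
max_rank = 16                      # assumption: ranks are bounded by log2(n)
groups = rank_groups(max_rank)
# Each group has at most n / 2**k elements, each given 2**k tokens.
total_tokens = sum((n // 2 ** k) * 2 ** k for k, _ in groups)
print(groups)                      # [(0, 1), (1, 2), (2, 4), (4, 16)]
print(total_tokens)                # 262144, that is 4 * n
```

There are four groups here, matching log* 65536 = 4, and each contributes at most n tokens.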

We need one more observation to keep in mind.

Lemma 4 At any time, for every u that is not a root, rank[u] < rank[p[u]].

Proof: After the initial series of makeset, this is an invariant that is maintained by each find and each union operation. □

We can now prove our main result

Theorem 5 Any sequence of operations involving m find operations can be completed in O((m + n) log* n) time.

Proof: Apart from the work needed to perform find, each operation only requires constant time (for a total of O(m) time). We now claim that each find takes O(log* n) time, plus time that is paid for using tokens (and we also want to prove that we do not run out of tokens).

The accounting is done as follows: the running time of a find operation is a constant times the number of pointers that are followed until we get to the root. When we follow a pointer from u to v (where v = p[u]) we charge the cost to find if u and v belong to different groups, or if u is a root, or if u is a child of a root; and we charge the cost to u if u and v are in the same group (charging the cost to u means removing a token from u's allowance). By Lemma 4 the ranks strictly increase along the path, so the path can cross from one group to another at most log* n times. Since there are at most log* n groups, we are charging only O(log* n) work to find.

How can we make sure we do not run out of tokens? When find arrives at a node u and charges u, it will also happen that u will move up in the tree, and become a child of the root (while previously it was a grandchild or a farther descendant); in particular, u now points to a vertex whose rank is larger than the rank of the vertex it was pointing to before. Let k be such that u belongs to the range group (k, 2^k]; then u has 2^k tokens at the beginning. At any time, u either points to itself (while it is a root) or to a vertex of higher rank. Each time u is charged by a find operation, u gets to point to a parent node of higher and higher rank. Then u cannot be charged more than 2^k times, because after that the parent of u will move to another group. □