Spectral Relaxation for K-means Clustering: A More Efficient Approach, Study notes of Algorithms and Programming

These notes cover the k-means clustering algorithm and its limitations, specifically its tendency to converge to local minima. They then introduce spectral relaxation, which formulates the sum-of-squares minimization problem in k-means as a trace maximization problem with special constraints. Relaxing those constraints yields a maximization problem with globally optimal solutions, which can be used to guide the clustering. The notes also include mathematical derivations and references to related work.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009


CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration - Clustering

Instructor: Jieping Ye

1 Introduction

  • One important method for data compression and classification is to organize data points in clusters: A cluster is a subset of the set of data points that are close together, using some distance measure.
  • A loose definition of clustering is the process of organizing data into groups whose members are similar in some way. A cluster is therefore a collection of data points that are "similar" to one another and "dissimilar" to the data points belonging to other clusters.
  • One can compute the mean value of each cluster separately and use the means as representatives of the clusters. Equivalently, the means can be used as basis vectors, and all the data points are represented by their coordinates with respect to this basis.
  • Clustering algorithms can be applied in many fields:
    • Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
    • Biology: classification of plants and animals given their features;
    • WWW: document clustering; clustering weblog data to discover groups of similar access patterns.
  • There are several methods for computing a clustering. One of the most important is the k-means algorithm.

2 K-means Clustering

  • We assume that we have $n$ data points $\{x_i\}_{i=1}^{n} \in \mathbb{R}^m$, which we organize as columns in a matrix $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}$.
  • Let $\Pi = \{\pi_j\}_{j=1}^{k}$ denote a partitioning of the data in $X$ into $k$ clusters:

$$\pi_j = \{ v \mid x_v \text{ belongs to cluster } j \}.$$

  • Let the mean, or the centroid, of cluster $\pi_j$ be

$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$

where $n_j$ is the number of elements in $\pi_j$.

  • We describe the K-means algorithm based on the Euclidean distance measure.
  • The tightness or coherence of cluster $\pi_j$ can be measured as the sum

$$q_j = \sum_{v \in \pi_j} \| x_v - c_j \|^2.$$

  • The closer the vectors are to the centroid, the smaller the value of $q_j$. The quality of a clustering can be measured as the overall coherence,

$$Q(\Pi) = \sum_{j=1}^{k} \sum_{v \in \pi_j} \| x_v - c_j \|^2.$$

  • In the k-means algorithm we seek a partitioning that has optimal coherence, in the sense that it is the solution of the minimization problem $\min_{\Pi} Q(\Pi)$.
  • The K-means algorithm
    • Initialization: Choose k initial centroids.
    • Form k clusters by assigning all data points to the closest centroid.
    • Recompute the centroid of each cluster.
    • Repeat the second and third steps until convergence.
  • The initial partitioning is often chosen randomly. The algorithm usually converges rather quickly, but it is not guaranteed to find the global minimum.
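
The four steps above can be sketched in Python with NumPy. This is a minimal illustration, not the course's reference implementation: the function names `kmeans` and `coherence`, the `n_iter` and `seed` parameters, and the choice to initialize the centroids from randomly selected data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the columns of the m-by-n matrix X into k clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    # Initialization (assumed strategy): use k distinct data points as centroids.
    C = X[:, rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every centroid to every point, shape (k, n).
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
        new_labels = d2.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Recompute the centroid of each non-empty cluster.
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:
                C[:, j] = members.mean(axis=1)
    return labels, C

def coherence(X, labels, C):
    """Overall coherence Q: sum of squared distances to the assigned centroids."""
    return float(((X - C[:, labels]) ** 2).sum())
```

Running `kmeans` with several different seeds and keeping the partition with the smallest `coherence` is a common way to reduce, though not eliminate, the dependence on the starting point.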

3 Spectral Relaxation for K-means Clustering

  • Despite the popularity of K-means clustering, one of its major drawbacks is that it is prone to local minima. Much research has been done on computing refined initial points and on adding explicit constraints to the sum-of-squares cost function so that the search converges to a better local minimum.
  • Zha et al. tackled the problem from a different angle: they formulate the sum-of-squares minimization in K-means as a trace maximization problem with special constraints. Relaxing the constraints leads to a maximization problem that possesses globally optimal solutions.
    • Spectral Relaxation for K-means Clustering. H. Zha, X. He, C. Ding, H. Simon, and M. Gu. NIPS 2001.

3.1 Spectral Relaxation

  • Recall that the $n$ data points $\{x_i\}_{i=1}^{n} \in \mathbb{R}^m$, which we organize as columns in a matrix $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}$, are partitioned into $k$ clusters $\Pi = \{\pi_j\}_{j=1}^{k}$, where $\pi_j = \{ v \mid x_v \text{ belongs to cluster } j \}$. The mean, or the centroid, of cluster $\pi_j$ is

$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$

where $n_j$ is the number of elements in $\pi_j$.

  • We may derive the following lower bound for the minimum of the sum-of-squares cost function:

$$\min_{\Pi} Q(\Pi) \;\ge\; \operatorname{trace}(X^T X) - \max_{Y^T Y = I_k} \operatorname{trace}\left( Y^T X^T X Y \right) \;=\; \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(X),$$

where $\sigma_i(X)$ is the $i$-th largest singular value of $X$.
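
The bound rests on the following identity, sketched here along the lines of the cited paper by Zha et al. The matrices $X_j$ and $Y$ and the all-ones vector $e$ are notation introduced for this sketch. Let $X_j$ hold the columns of $X$ that belong to cluster $\pi_j$, and let $e$ be the all-ones vector of length $n_j$, so that $c_j = X_j e / n_j$. Then

$$q_j = \| X_j - c_j e^T \|_F^2 = \operatorname{trace}(X_j^T X_j) - \frac{e^T X_j^T X_j e}{n_j}.$$

Define $Y \in \mathbb{R}^{n \times k}$ by $Y_{vj} = 1/\sqrt{n_j}$ if $v \in \pi_j$ and $Y_{vj} = 0$ otherwise. Its columns are orthonormal, $Y^T Y = I_k$, and summing $q_j$ over $j$ gives

$$Q(\Pi) = \operatorname{trace}(X^T X) - \operatorname{trace}(Y^T X^T X Y).$$

Minimizing $Q$ is therefore the same as maximizing the second trace over matrices $Y$ of this special form. Keeping only the constraint $Y^T Y = I_k$ enlarges the feasible set, so the maximum can only grow, which gives the inequality. The relaxed maximum is the sum of the $k$ largest eigenvalues of $X^T X$, that is, $\sum_{i=1}^{k} \sigma_i^2(X)$, and $\operatorname{trace}(X^T X) = \sum_{i=1}^{\min\{m,n\}} \sigma_i^2(X)$, which gives the final equality.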

  • It is easy to see from the above derivation that we can replace $X$ with $X - a e^T$, where $a$ is an arbitrary vector and $e$ is the vector of all ones. If we choose $a$ to be the mean of all data in $X$, then the relaxed maximization problem is equivalent to Principal Component Analysis (PCA).
  • Let $Y^*$ be the $n$-by-$k$ matrix whose columns are the eigenvectors of $X^T X$ corresponding to the $k$ largest eigenvalues. Each row of $Y^*$ corresponds to a data vector. This can be viewed as transforming the original data vectors, which live in an $m$-dimensional space, into new data vectors that live in a $k$-dimensional space. One might be tempted to compute the cluster assignment by applying the ordinary K-means method to those data vectors in the reduced-dimension space.
  • More references
    • K-means Clustering via Principal Component Analysis. Chris Ding and Xiaofeng He. ICML 2004.
    • A Unified View of Kernel k-means, Spectral Clustering and Graph Partitioning. I.S. Dhillon, Y. Guan, and B. Kulis. UTCS Technical Report #TR-04-25. http://www.cs.utexas.edu/users/kulis/pubs/spectral techreport.pdf
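
The relaxed solution and the lower bound from Section 3.1 can be computed with a singular value decomposition, as in the sketch below. The function names `spectral_embedding` and `sum_of_squares` are made up for this example. It relies on the fact that the right singular vectors of $X$ are eigenvectors of $X^T X$, so the eigenvalue problem is never formed explicitly.

```python
import numpy as np

def spectral_embedding(X, k):
    """Return Y* (n-by-k) and the lower bound on min Q for the m-by-n matrix X."""
    # Singular values come back in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = Vt[:k].T                          # rows of Y are the k-dimensional data vectors
    lower_bound = float((s[k:] ** 2).sum())
    return Y, lower_bound

def sum_of_squares(X, labels):
    """Q(Pi) for the partition encoded by an integer label per column of X."""
    q = 0.0
    for j in np.unique(labels):
        Xj = X[:, labels == j]
        q += ((Xj - Xj.mean(axis=1, keepdims=True)) ** 2).sum()
    return float(q)
```

Comparing `sum_of_squares` for a partition returned by K-means against `lower_bound` gives a rough sense of how far that partition could be from the global optimum.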

4 Matrix Approximations using Clustering

  • Given any partitioning $\Pi = \{\pi_j\}_{j=1}^{k}$ of the data in $X$ into $k$ clusters, we can approximate a document vector by the closest mean (centroid) vector. In other words, if a document vector is in cluster $\pi_j$, we can approximate it by the mean vector $c_j$.
  • This leads to the matrix approximation $X \approx \hat{X}$ such that, for $1 \le i \le n$, its $i$-th column is the mean vector closest to the data point $x_i$.
  • We can express $\hat{X}$ as a low-rank matrix approximation as follows:

$$\hat{X} = [c_1, c_2, \cdots, c_k] \, I,$$

where $I \in \mathbb{R}^{k \times n}$ indicates the cluster membership. More specifically, $I_{ij} = 1$ if $x_j$ belongs to the $i$-th cluster $\pi_i$, and $I_{ij} = 0$ otherwise. Denote $C = [c_1, c_2, \cdots, c_k]$.

  • The matrix approximation $\hat{X}$ has rank at most $k$. It is thus natural to compare the approximation power of $\hat{X}$ to that of the best possible rank-$k$ approximation to the data matrix $X$ based on SVD.
    • Best rank-$k$ approximation: $X_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ consist of the top $k$ left and right singular vectors of $X$, respectively, and $\Sigma_k$ contains the top $k$ singular values.
  • Empirical studies showed that, for each fixed $k$, the approximation error for the $k$-truncated SVD is significantly lower than that for $\hat{X}$.
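
The comparison can be tried out on any data matrix with the sketch below. The function names `centroid_approximation` and `truncated_svd` are hypothetical, and the labels are assumed to be integers in `0..k-1`, one per column of `X`.

```python
import numpy as np

def centroid_approximation(X, labels, k):
    """Build X_hat = C @ I from a partition of the columns of X."""
    m, n = X.shape
    C = np.zeros((m, k))
    I = np.zeros((k, n))                  # I[i, j] = 1 if x_j belongs to cluster i
    I[labels, np.arange(n)] = 1.0
    for j in range(k):
        if I[j].any():                    # skip empty clusters
            C[:, j] = X[:, labels == j].mean(axis=1)
    return C @ I

def truncated_svd(X, k):
    """Best rank-k approximation X_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

Because $\hat{X}$ has rank at most $k$, the Frobenius error of `truncated_svd(X, k)` can never exceed that of `centroid_approximation(X, labels, k)`. How large the gap is depends on the data and the partition.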

5 Concept Decompositions

  • It can be shown that by approximating each document vector by a linear combination of the concept vectors it is possible to obtain significantly better matrix approximations.
    • Concept Decompositions for Large Sparse Text Data using Clustering. I.S. Dhillon and D.S. Modha. Machine Learning, 2001.
  • Consider any partitioning $\Pi = \{\pi_j\}_{j=1}^{k}$ of the data in $X$ into $k$ clusters, and let $\{c_j\}_{j=1}^{k}$ denote the $k$ cluster centroids. Define the concept matrix as an $m \times k$ matrix such that, for $1 \le j \le k$, the $j$-th column of the matrix is the centroid (concept) vector $c_j$, that is, $C = [c_1, c_2, \cdots, c_k]$. Assuming linear independence of the $k$ concept vectors, it follows that the concept matrix has rank $k$.
  • For any partitioning of the data, we define the corresponding concept decomposition $\tilde{X}_k$ of the data matrix $X$ as the least-squares approximation of $X$ onto the column space of the concept matrix $C$. We can write the concept decomposition as an $m \times n$ matrix $\tilde{X}_k = C Z^*$, where $Z^*$ is a $k \times n$ matrix that is to be determined by solving the following least-squares problem:

$$Z^* = \arg\min_{Z} \| X - C Z \|_F^2.$$

  • It is well known that a closed-form solution exists for the least-squares problem above, namely,

$$Z^* = \left( C^T C \right)^{-1} C^T X.$$

  • Although the above equation is intuitively pleasing, it does not constitute an efficient and numerically stable way to compute the matrix $Z^*$. Instead, we can use the QR decomposition of the concept matrix.
    • Let $C = QR$ be the thin QR decomposition of $C$; then $Z^* = \left( R^T R \right)^{-1} R^T Q^T X$. It follows that

$$\tilde{X}_k = C Z^* = Q R \left( R^T R \right)^{-1} R^T Q^T X = Q Q^T X.$$

Here we assume that $R$ is nonsingular, i.e., the $k$ cluster centroids in $C$ are linearly independent.

  • Exercise: show that the concept decomposition $\tilde{X}_k$ is a better matrix approximation than the centroid approximation $\hat{X}$ of Section 4.
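
The QR route can be sketched as follows. The function name `concept_decomposition` is an assumption for this example, and the code presumes that the centroids in `C` are linearly independent, as the notes do. Since $R$ is then nonsingular, $Z^* = (R^T R)^{-1} R^T Q^T X$ simplifies to $R^{-1} Q^T X$, which is what the code solves.

```python
import numpy as np

def concept_decomposition(X, C):
    """Return (Z, X_tilde) with X_tilde = C @ Z the least-squares fit of X by C."""
    Q, R = np.linalg.qr(C)                # thin QR: Q is m-by-k, R is k-by-k
    QtX = Q.T @ X
    Z = np.linalg.solve(R, QtX)           # Z* = R^{-1} Q^T X, valid when R is nonsingular
    X_tilde = Q @ QtX                     # X_tilde = Q Q^T X
    return Z, X_tilde
```

This also hints at the exercise above: $\hat{X} = C I$ lies in the column space of $C$, and $\tilde{X}_k$ is the closest matrix to $X$ with columns in that space, so its Frobenius error cannot be larger.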

5.1 Empirical Observations

  • The approximation power (when measured using the Frobenius norm) of concept decompositions is comparable to the best possible approximations by truncated SVD. An important advantage of concept decompositions is that they are computationally more efficient and require much less memory than truncated SVD.
  • When applied to document clustering, concept decompositions produce concept vectors that are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors obtained from SVD are global in the word space and are dense. Nonetheless, the subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them.