Spectral Relaxation for K-means Clustering: A More Efficient Approach, Study notes of Algorithms and Programming

Arizona State University (ASU) - Tempe Algorithms and Programming

These notes describe the k-means clustering algorithm and its main limitation, the tendency to converge to local minima. They then introduce spectral relaxation, which formulates the sum-of-squares minimization in k-means as a trace maximization problem with special constraints. Relaxing those constraints yields a maximization problem that has globally optimal solutions and gives a lower bound on the k-means cost. The notes also include mathematical derivations and references to related work.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009


CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration - Clustering

Instructor: Jieping Ye


1 Introduction

One important method for data compression and classification is to organize data points in clusters. A cluster is a subset of the set of data points that are close together under some distance measure.
A loose definition of clustering is the process of organizing data into groups whose members are similar in some way. A cluster is therefore a collection of data points that are "similar" to one another and "dissimilar" to the data points belonging to other clusters.
One can compute the mean value of each cluster separately, and use the means as representatives of the clusters. Equivalently, the means can be used as basis vectors, and all the data points are represented by their coordinates with respect to this basis.
Clustering algorithms can be applied in many fields:
- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
- Biology: classification of plants and animals given their features;
- WWW: document clustering; clustering weblog data to discover groups of similar access patterns.
There are several methods for computing a clustering. One of the most important is the k-means algorithm.

2 K-means Clustering

We assume that we have $n$ data points $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^m$, which we organize as columns in a matrix

$$X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}.$$

Let $\Pi = \{\pi_j\}_{j=1}^{k}$ denote a partitioning of the data in $X$ into $k$ clusters:

$$\pi_j = \{v \mid x_v \text{ belongs to cluster } j\}.$$

Let the mean, or the centroid, of cluster $\pi_j$ be

$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$

where $n_j$ is the number of elements in $\pi_j$.

We describe the K-means algorithm based on the Euclidean distance measure.

The tightness or coherence of cluster $\pi_j$ can be measured as the sum

$$q_j = \sum_{v \in \pi_j} \|x_v - c_j\|^2.$$

The closer the vectors are to the centroid, the smaller the value of $q_j$. The quality of a clustering can be measured as the overall coherence,

$$Q(\Pi) = \sum_{j=1}^{k} \sum_{v \in \pi_j} \|x_v - c_j\|^2.$$

In the k-means algorithm we seek a partitioning that has optimal coherence, in the sense that it is the solution of the minimization problem $\min_{\Pi} Q(\Pi)$.
The K-means algorithm
- Initialization: Choose k initial centroids.
- Form k clusters by assigning all data points to the closest centroid.
- Recompute the centroid of each cluster.
- Repeat the second and third steps until convergence.
The initial partitioning is often chosen randomly. The algorithm usually converges rather quickly, but there is no guarantee that it finds the global minimum.
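The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the parameters `n_iter` and `seed`, the choice of random data points as initial centroids, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for this example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means on the columns of X (shape m x n).

    Returns (labels, C, Q): the cluster index of each column, the m x k
    centroid matrix, and the overall coherence Q of the final partitioning.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Initialization: pick k distinct data points as the initial centroids.
    C = X[:, rng.choice(n, size=k, replace=False)].astype(float)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every centroid to every point (k x n).
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
        new_labels = d2.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Recompute the centroid of each cluster.
        for j in range(k):
            members = X[:, labels == j]
            if members.shape[1] > 0:  # assumption: keep old centroid if empty
                C[:, j] = members.mean(axis=1)
    # Overall coherence: sum of squared distances to the assigned centroids.
    Q = float(((X - C[:, labels]) ** 2).sum())
    return labels, C, Q
```

Because the result depends on the initial centroids, a common remedy is to call such a routine with several seeds and keep the partitioning with the smallest `Q`; this still gives no guarantee of reaching the global minimum.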

3 Spectral Relaxation for K-means Clustering

Despite the popularity of K-means clustering, one of its major drawbacks is that it is prone to local minima. Much research has been done on computing refined initial points and adding explicit constraints to the sum-of-squares cost function for K-means clustering so that the search can converge to a better local minimum.
Zha et al. tackled the problem from a different angle: formulate the sum-of-squares minimization in K-means as a trace maximization problem with special constraints. Relaxing the constraints leads to a maximization problem that possesses optimal global solutions.
- Spectral Relaxation for K-means Clustering. H. Zha, X. He, C. Ding, H. Simon, and M. Gu. NIPS 2001.

3.1 Spectral Relaxation

Recall that the $n$ data points $\{x_i\}_{i=1}^{n} \subset \mathbb{R}^m$, which we organize as columns in a matrix $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m \times n}$, are partitioned into $k$ clusters $\Pi = \{\pi_j\}_{j=1}^{k}$, with $\pi_j = \{v \mid x_v \text{ belongs to cluster } j\}$. The mean, or the centroid, of cluster $\pi_j$ is

$$c_j = \frac{1}{n_j} \sum_{v \in \pi_j} x_v,$$

where $n_j$ is the number of elements in $\pi_j$.

We may derive the following lower bound for the minimum of the sum-of-squares cost function:

$$\min_{\Pi} Q(\Pi) \ge \operatorname{trace}(X^T X) - \max_{Y^T Y = I_k} \operatorname{trace}\left(Y^T X^T X Y\right) = \sum_{i=k+1}^{\min\{m,n\}} \sigma_i^2(X),$$

where $\sigma_i(X)$ is the $i$-th largest singular value of $X$.
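The derivation behind this bound can be sketched as follows. It is a reconstruction of the standard argument, not a quotation of the original notes. The symbols $X_j$ (the columns of $X$ belonging to cluster $\pi_j$), $e$ (a vector of all ones of the appropriate length) and $Y$ (the normalized cluster indicator matrix) are notation assumed for this sketch.

```latex
\begin{align*}
q_j &= \sum_{v \in \pi_j} \|x_v - c_j\|^2
     = \sum_{v \in \pi_j} \|x_v\|^2 - n_j \|c_j\|^2
     = \operatorname{trace}(X_j^T X_j) - \frac{e^T X_j^T X_j e}{n_j}, \\
Q(\Pi) &= \sum_{j=1}^{k} q_j
     = \operatorname{trace}(X^T X) - \operatorname{trace}(Y^T X^T X Y),
\qquad
Y_{vj} =
\begin{cases}
  1/\sqrt{n_j}, & v \in \pi_j, \\
  0,            & \text{otherwise.}
\end{cases}
\end{align*}
```

The indicator matrix satisfies $Y^T Y = I_k$. Minimizing $Q(\Pi)$ is therefore the same as maximizing $\operatorname{trace}(Y^T X^T X Y)$ over indicator matrices of this special form. Dropping the indicator structure and keeping only the constraint $Y^T Y = I_k$ can only increase the maximum, which gives the inequality. By the Ky Fan theorem the relaxed maximum equals $\sum_{i=1}^{k} \sigma_i^2(X)$, and since $\operatorname{trace}(X^T X) = \sum_{i} \sigma_i^2(X)$, the difference is the sum of the remaining squared singular values.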

It is easy to see from the above derivation that we can replace $X$ with $X - a e^T$, where $a$ is an arbitrary vector and $e$ is the vector of all ones. If we choose $a$ to be the mean of all data in $X$, then the relaxed maximization problem is equivalent to Principal Component Analysis (PCA).
Let $Y^*$ be the $n$-by-$k$ matrix consisting of the eigenvectors of $X^T X$ corresponding to the $k$ largest eigenvalues. Each row of $Y^*$ corresponds to a data vector. This can be considered as transforming the original data vectors, which live in an $m$-dimensional space, to new data vectors which live in a $k$-dimensional space. One might be tempted to compute the cluster assignment by applying the ordinary K-means method to those data vectors in the reduced dimension space.
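As a rough illustration of the relaxed problem, the sketch below computes $Y^*$ and the lower bound with NumPy. The function name `spectral_relaxation` and the `center` flag are assumptions made for this example. It uses the fact that the right singular vectors of $X$ are eigenvectors of $X^T X$, and it assumes $k \le \min\{m, n\}$.

```python
import numpy as np


def spectral_relaxation(X, k, center=True):
    """Return (Y, bound) for data stored as the columns of X (m x n).

    Y is n x k with orthonormal columns: the eigenvectors of X^T X for the
    k largest eigenvalues. bound is sum_{i>k} sigma_i(X)^2, the relaxed
    lower bound on the minimum of the sum-of-squares cost.
    """
    if center:
        # Replace X by X - a e^T, with a the mean of all data points.
        X = X - X.mean(axis=1, keepdims=True)
    # Singular values come back in decreasing order; rows of Vt are the
    # right singular vectors, i.e. eigenvectors of X^T X.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = Vt[:k].T
    bound = float((s[k:] ** 2).sum())
    return Y, bound
```

Each row of the returned `Y` is the $k$-dimensional representation of one data point, so the rows can be handed to any k-means routine. Since the cost $Q(\Pi)$ does not change when every data point is shifted by the same vector, the bound computed from the centered matrix is still a valid lower bound for the original data.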
More references
- K-means Clustering via Principal Component Analysis. Chris Ding and Xiaofeng He. ICML 2004.
- A Unified View of Kernel k-means, Spectral Clustering and Graph Partitioning. I.S. Dhillon, Y. Guan, and B. Kulis. UTCS Technical Report #TR-04-25. http://www.cs.utexas.edu/users/kulis/pubs/spectral techreport.pdf

4 Matrix Approximations using Clustering

Given any partitioning $\Pi = \{\pi_j\}_{j=1}^{k}$ of the data in $X$ into $k$ clusters, we can approximate a document vector by the closest mean (centroid) vector. In other words, if a document vector is in cluster $\pi_j$, we can approximate it by the mean vector $c_j$.
This leads to the matrix approximation $X \approx \hat{X}$ such that, for $1 \le i \le n$, its $i$-th column is the mean vector closest to the data point $x_i$.
We can express $\hat{X}$ as a low-rank matrix approximation as follows:

$$\hat{X} = [c_1, c_2, \cdots, c_k]\, I,$$

where $I \in \mathbb{R}^{k \times n}$ indicates the cluster membership. More specifically, $I_{ij} = 1$ if $x_j$ belongs to the $i$-th cluster $\pi_i$, and $I_{ij} = 0$ otherwise. Denote $C = [c_1, c_2, \cdots, c_k]$.

The matrix approximation $\hat{X}$ has rank at most $k$. It is thus natural to compare the approximation power of $\hat{X}$ to that of the best possible rank-$k$ approximation to the data matrix $X$ based on SVD.
- Best rank-$k$ approximation: $X_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ consist of the top $k$ left and right singular vectors of $X$, respectively, and $\Sigma_k$ contains the top $k$ singular values.
Empirical studies showed that, for each fixed $k$, the approximation error for the $k$-truncated SVD is significantly lower than that for $\hat{X}$.
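The two approximations can be compared on any data set with a few lines of NumPy. The sketch below is illustrative only: the function names `centroid_approximation` and `truncated_svd` are made up for this example, `labels` is assumed to hold a cluster index in `0..k-1` for every column of `X`, and every cluster is assumed to be non-empty.

```python
import numpy as np


def centroid_approximation(X, labels, k):
    """Return (X_hat, C, I) with X_hat = C I for data in the columns of X."""
    m, n = X.shape
    # Column j of C is the centroid of cluster j (assumed non-empty).
    C = np.column_stack([X[:, labels == j].mean(axis=1) for j in range(k)])
    # Membership matrix: I[i, j] = 1 if x_j belongs to cluster i, else 0.
    I = np.zeros((k, n))
    I[labels, np.arange(n)] = 1.0
    return C @ I, C, I


def truncated_svd(X, k):
    """Best rank-k approximation X_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

With these helpers, `np.linalg.norm(X - X_hat)` and `np.linalg.norm(X - truncated_svd(X, k))` give the two Frobenius-norm errors. Because `X_hat` has rank at most $k$, the truncated SVD error can never be larger; how much smaller it is depends on the data.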

5 Concept Decompositions

It can be shown that by approximating each document vector by a linear combination of the concept vectors it is possible to obtain significantly better matrix approximations.
- Concept Decompositions for Large Sparse Text Data using Clustering. I.S. Dhillon and D.S. Modha. Machine Learning, 2001.
Consider any partitioning $\Pi = \{\pi_j\}_{j=1}^{k}$ of the data in $X$ into $k$ clusters, and let $\{c_j\}_{j=1}^{k}$ denote the $k$ cluster centroids. Define the concept matrix as an $m \times k$ matrix such that, for $1 \le j \le k$, the $j$-th column of the matrix is the centroid (concept) vector $c_j$, that is, $C = [c_1, c_2, \cdots, c_k]$. Assuming linear independence of the $k$ concept vectors, it follows that the concept matrix has rank $k$.
For any partitioning of the data, we define the corresponding concept decomposition $\tilde{X}_k$ of the data matrix $X$ as the least-squares approximation of $X$ onto the column space of the concept matrix $C$. We can write the concept decomposition as an $m \times n$ matrix $\tilde{X}_k = C Z^*$, where $Z^*$ is a $k \times n$ matrix that is to be determined by solving the following least-squares problem:

$$Z^* = \arg\min_{Z} \|X - CZ\|_F^2.$$

It is well known that a closed-form solution exists for the least-squares problem above, namely,

$$Z^* = \left(C^T C\right)^{-1} C^T X.$$

Although the above equation is intuitively pleasing, it does not constitute an efficient and numerically stable way to compute the matrix $Z^*$. Instead, we can use the QR decomposition of the concept matrix.
- Let $C = QR$ be the thin QR decomposition of $C$; then $Z^* = \left(R^T R\right)^{-1} R^T Q^T X$. It follows that

$$\tilde{X}_k = C Z^* = QR \left(R^T R\right)^{-1} R^T Q^T X = Q Q^T X.$$

Here we assume that $R$ is nonsingular, i.e., the $k$ cluster centroids in $C$ are linearly independent.
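The QR route can be sketched in NumPy as below. The function name `concept_decomposition` is an assumption for this example, and the code assumes $m \ge k$ and linearly independent centroids, so that $R$ is square and nonsingular. Since $R$ is invertible, $(R^T R)^{-1} R^T = R^{-1}$, and $Z^*$ is obtained by solving $R Z = Q^T X$ rather than by forming $C^T C$.

```python
import numpy as np


def concept_decomposition(X, C):
    """Return (X_tilde, Z) where Z minimizes ||X - C Z||_F.

    X is m x n (data in columns) and C is m x k (centroids in columns).
    """
    # Thin QR: Q is m x k with orthonormal columns, R is k x k upper triangular.
    Q, R = np.linalg.qr(C)
    # Z* = R^{-1} Q^T X, computed by solving R Z = Q^T X (R must be nonsingular).
    Z = np.linalg.solve(R, Q.T @ X)
    # X_tilde = C Z* = Q Q^T X: projection of X onto the column space of C.
    X_tilde = Q @ (Q.T @ X)
    return X_tilde, Z
```

A generic solver such as `np.linalg.solve` does not exploit the triangular shape of `R`; a dedicated triangular solve would be the natural refinement where one is available.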

Show that the concept decomposition $\tilde{X}_k$ is a better matrix approximation than $\hat{X}_k$.
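One possible line of argument, offered as a sketch and not as the intended solution: the centroid approximation of Section 4 has the form $C I$, with $I$ the cluster membership matrix, so it is one of the candidates over which the least-squares problem minimizes.

```latex
\begin{align*}
\|X - \tilde{X}_k\|_F
  = \min_{Z \in \mathbb{R}^{k \times n}} \|X - CZ\|_F
  \le \|X - C I\|_F
  = \|X - \hat{X}_k\|_F .
\end{align*}
```

Equality holds only when the membership matrix itself solves the least-squares problem, so the concept decomposition is never worse in the Frobenius norm.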

5.1 Empirical Observations

The approximation power (when measured using the Frobenius norm) of concept decompositions is comparable to the best possible approximations by truncated SVD. An important advantage of concept decompositions is that they are computationally more efficient and require much less memory than truncated SVD.
When applied to document clustering, concept decompositions produce concept vectors which are localized in the word space, are sparse, and tend towards orthonormality. In contrast, the singular vectors obtained from SVD are global in the word space and are dense. Nonetheless, the subspaces spanned by the concept vectors and the leading singular vectors are quite close in the sense of small principal angles between them.
