







These notes cover two popular clustering algorithms: K-means and hierarchical agglomerative clustering (HAC). K-means is a centroid-based method that aims to find clusters in which the distances between examples are small compared to the distances to examples in other clusters. HAC is a non-parametric algorithm that defines clusters directly from the examples and maintains the distances between pairs of clusters. Both algorithms require an underlying distance metric and can be used for various data types.








We turn now to the increasingly important topic of unsupervised learning. The first problem we study is that of clustering. One challenge will be to decide on a criterion by which to judge the quality of a hypothesis. We first discuss a parametric method, the K-means clustering algorithm. Then we look at hierarchical agglomerative clustering (HAC), which is a non-parametric method. Parametric methods are often preferred for high-dimensional spaces, where non-parametric methods can suffer from the “curse of dimensionality.” Looking ahead, we will then formulate clustering as a probabilistic model using a mixture-of-Gaussians model. An approach to training within this probabilistic framework is delayed until next class.
1 The Clustering Problem
The clustering problem is defined as follows: given a set of examples $D = \{x_1, \ldots, x_n\}$, with each example $x_i \in X$ for feature space $X$, and a number $K$ of clusters, learn a hypothesis,

$$h : D \to \{1, \ldots, K\} \qquad (1)$$

that assigns each example to a cluster. Throughout we assume that $X \subseteq \mathbb{R}^m$, so that an example is a vector in m-dimensional real space. Features may be discrete or continuous. Note that we are not necessarily interested in good performance on new examples. Rather, the task at hand for clustering is to identify structure in existing data. (This said, the K-means and probabilistic approaches will also immediately assign a cluster to new examples.)

This is an example of unsupervised learning, where the training set D includes descriptions of examples but no labels. For example, think about being given a collection of feature vectors describing different animals and being asked to identify informative clusters of the different animals. Loosely speaking, the intuitive task is to find some “hidden structure,” or patterns, in the data.

One way to think about the clustering problem is that the task is to partition examples into clusters so that each example is closer (under some distance metric) to the other examples in its own cluster than to examples in other clusters.
Another way to think about clustering, and unsupervised learning more generally, is to find a probabilistic model of the data, such that the observed data is generated with high probability according to the model. The parameters of such a probabilistic model can be viewed as providing a description of the data, representing hidden structure. This will prove to be a general and powerful approach and represents much of the current research frontier in machine learning.

A general challenge for the problem of unsupervised learning is to find a good measure of the performance of a learning algorithm. In supervised learning, we measure performance by running an algorithm on the training set to produce a classifier, and then measuring the accuracy on a test set. This method does not work for unsupervised learning, because we do not have target labels with which to judge generalization performance. For clustering, we want to ask whether a learning algorithm does a good job of grouping similar examples together. One possible way to judge a clustering algorithm is to manually examine the clusters produced and see if they are intuitively coherent. But this is clearly not a systematic or formalized method. We would prefer some objective way of telling whether an unsupervised learning algorithm produces a good model of our data (e.g., a clustering that captures meaningful structure, or a parameterized probabilistic model that nicely explains the data). Probabilistic approaches will be more successful in providing such a quantified analysis than other approaches. Still, we consider non-probabilistic approaches first, and in particular two very well known algorithms: K-means clustering and hierarchical agglomerative clustering.
2 K-means Clustering^1
Intuitively, we would like to find clusters such that the distances between examples in the same cluster are small compared with the distances to examples in other clusters. We formalize this in K-means clustering by introducing a set of prototypes $\{\mu_1, \ldots, \mu_K\}$, with $\mu_k \in X$, to represent the center of each of K clusters. The idea is to find prototypes that minimize the total squared distance (for some distance metric) from each example to its closest prototype. The examples for which the same prototype is the closest form a cluster, and in the minimum error solution each prototype will be positioned at the center (appropriately defined) of its cluster.

K-means is a parametric method because the learned hypothesis is represented by a small number of parameters: in this case the set of prototype vectors.

Formally, we associate a binary indicator variable $r_{ik} \in \{0, 1\}$ with each example $x_i$ and cluster $k$, such that $\sum_k r_{ik} = 1$ for every $i$, with $r_{ik} = 1$ indicating that example $x_i$ is associated with prototype (and thus cluster) $k$.
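As a rough illustration of how prototypes and assignments interact, here is a minimal sketch of the standard alternating K-means procedure (often called Lloyd's algorithm). It assumes squared Euclidean distance and initial prototypes supplied by the caller; the function names are illustrative and not taken from the notes.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(examples, prototypes, max_iters=100):
    """Alternate assignment and update steps until assignments stop changing.

    examples:   list of m-dimensional points (tuples or lists of numbers)
    prototypes: list of K initial prototype vectors (assumption: the caller
                picks the initialization, e.g. K randomly chosen examples)
    Returns (prototypes, assignments); assignments[i] is the k with r_ik = 1.
    """
    prototypes = [list(p) for p in prototypes]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each example goes to its closest prototype.
        new_assignments = [
            min(range(len(prototypes)),
                key=lambda k: squared_distance(x, prototypes[k]))
            for x in examples
        ]
        if new_assignments == assignments:
            break  # converged: no example changed cluster
        assignments = new_assignments
        # Update step: move each prototype to the mean of its examples.
        for k in range(len(prototypes)):
            members = [x for x, a in zip(examples, assignments) if a == k]
            if members:  # an empty cluster keeps its previous prototype
                prototypes[k] = [sum(col) / len(members) for col in zip(*members)]
    return prototypes, assignments


def total_squared_error(examples, prototypes, assignments):
    """Sum of squared distances from each example to its assigned prototype."""
    return sum(squared_distance(x, prototypes[a])
               for x, a in zip(examples, assignments))
```

Each pass cannot increase the total squared error, which is why the loop terminates, but the solution reached depends on the initial prototypes.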
(^1) This section is based in part on Bishop (2007).
each prototype is O(Kmn), and the work to determine the new prototype positions is O(mn). There will be some number of iterations T before convergence, and T tends (empirically) to be considerably less than n.

Sometimes K-means clustering performs very poorly. In particular, this occurs when the inductive hypothesis fails to hold. Consider again the example of the disc inside the ring from the lecture. The ring has the “wrong shape,” and K-means is incapable of identifying it as a cluster and will produce completely the wrong result. (Would a transformation of the data help?)
Figure 1: Local minima with K-means
As with most greedy hill-climbing algorithms, local minima can be a problem for K-means. Figure 1 shows a situation that could be problematic. We can recognize that there are two clusters, as well as an outlying example. Indeed, this solution does provide a minimum error clustering. However, there is another clustering that is a local minimum: the outlying example is assigned to one cluster on its own, and all other examples are assigned to the other cluster, with a mean at A.
3 Hierarchical Agglomerative Clustering
The second clustering algorithm we study is hierarchical agglomerative clustering (HAC).^3 HAC is an example of an instance-based or non-parametric algorithm; the clusters are defined directly by the examples they contain. The hypothesis is not defined by a small number of parameters that instantiate a class of models, but rather by an arbitrary (hierarchical) clustering of the data. That HAC is non-parametric brings benefits and disadvantages: on the plus side, it does not restrict possible clusters to those that fit a parametric model and avoids the strong inductive hypothesis of K-means. On the negative side, it can fail miserably with high-dimensional data (the so-called “curse of dimensionality”), and is also more computationally expensive.

(^3) This section is based in part on Duda et al., 2001.

As its name suggests, HAC generates a hierarchy, which can be useful and natural in some applications. This is achieved by “glomming” (or merging) clusters together to create larger and larger clusters. The algorithm begins with singleton clusters containing individual examples. At each iteration, the algorithm chooses two clusters to merge, and replaces them with their union. The algorithm is as follows:
HAC({x_1, ..., x_n}, K) =
    B = {{x_1}, ..., {x_n}}
    Repeat until |B| = K
        Let c_i, c_j be the two closest clusters in B
        Remove c_i and c_j from B
        Insert c_i ∪ c_j into B
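The pseudocode above can be sketched in Python as follows. This is a naive version under stated assumptions: it uses the min (single-linkage) cluster distance with Euclidean distance between examples as the default, and it recomputes cluster distances in every round rather than caching them. The names `hac`, `d_min`, and `euclidean` are illustrative.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d_min(c1, c2):
    """Min (single-linkage) distance: closest pair across the two clusters."""
    return min(euclidean(x, y) for x in c1 for y in c2)


def hac(examples, K, cluster_distance=d_min):
    """Merge the two closest clusters until only K clusters remain."""
    clusters = [[x] for x in examples]  # B: start from singleton clusters
    while len(clusters) > K:
        # Find the index pair (i < j) of the two closest clusters in B.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[i] + clusters[j]
        # Remove c_j first (j > i) so that index i is still valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)  # insert the union into B
    return clusters
```

Passing a different `cluster_distance` function changes the linkage rule without touching the merge loop.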
In order to complete the specification of the algorithm, we need to supply a distance function (or metric) between clusters, so that we can determine which pair is the closest. There are several possible ways to measure the distance between two clusters c and c′. The choice of distance function greatly affects the behavior of the HAC algorithm. Possibilities include the following:
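(min) $d_{\min}(c, c') = \min_{x \in c,\, y \in c'} d(x, y)$
(max) $d_{\max}(c, c') = \max_{x \in c,\, y \in c'} d(x, y)$
(mean) $d_{\text{mean}}(c, c') = \frac{1}{|c| \cdot |c'|} \sum_{x \in c,\, y \in c'} d(x, y)$
(centroid) $d_{\text{cent}}(c, c') = d\left( \frac{1}{|c|} \sum_{x \in c} x,\; \frac{1}{|c'|} \sum_{y \in c'} y \right)$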
All of these measures rely in turn on some underlying distance metric $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}_{\geq 0}$ between individual examples; by varying the underlying metric, one can get many more possibilities. For example, a simple and common metric is based on the $L^1$ norm (or Manhattan distance), $d(x, y) = \|x - y\|_1 = \sum_{j=1}^{m} |x_j - y_j|$, and another common alternative is based on the $L^2$ (or Euclidean) norm, $d(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$.
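To make the four cluster distances and the two underlying metrics concrete, here is a small sketch in plain Python. Clusters are assumed to be lists of equal-length numeric tuples, and the underlying metric `d` is passed as a parameter; all function names are illustrative.

```python
import math


def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def d_min(c1, c2, d=euclidean):
    """Distance between the closest pair of examples across the clusters."""
    return min(d(x, y) for x in c1 for y in c2)


def d_max(c1, c2, d=euclidean):
    """Distance between the farthest pair of examples across the clusters."""
    return max(d(x, y) for x in c1 for y in c2)


def d_mean(c1, c2, d=euclidean):
    """Average distance over all cross-cluster pairs."""
    return sum(d(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))


def centroid(c):
    """Coordinate-wise mean of the examples in a cluster."""
    return [sum(col) / len(c) for col in zip(*c)]


def d_cent(c1, c2, d=euclidean):
    """Distance between the two cluster centroids."""
    return d(centroid(c1), centroid(c2))
```

For the clusters `[(0, 0), (0, 2)]` and `[(3, 0), (3, 2)]` under the Manhattan metric, the cross-cluster distances are 3, 5, 5, and 3, so the min, max, and mean distances are 3, 5, and 4, while the centroids (0, 1) and (3, 1) are 3 apart.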
The HAC algorithm needs to do some bookkeeping to maintain the distances between pairs of clusters in B; the same distance should never have to be computed more than once. In addition, it may be a good idea to store all the distances between pairs of examples. For example, one algorithmic approach is to determine the pairwise distances between all examples once and for all, requiring O(n^2 m) work. Given this, for the min distance metric between clusters, in each round we can identify the pair of examples in two different clusters with the minimal distance, which is an additional O(n^2) work. For K clusters we need to run for n − K rounds, and so for O(n^3 + n^2 m) steps altogether, since K is typically a small constant; this is O(n^3) whenever m is at most n. This is a naive implementation and you should be able to think of things to do to speed up the computation. But we can certainly see that the algorithm is scaling at least as O(n^2 m), compared to O(KmnT) for K-means with T rounds, and can be considerably slower than K-means since T is typically much less than n.

What are the main properties of the HAC algorithm? Since it is a non-parametric method, it makes no prior assumptions about the actual shape of the clusters and is able to learn clusters with complex shape.
For clustering, we will consider two particularly simple parametric forms. The first is the mixture-of-Gaussians model, which is well suited for continuous domains. This assumes that the examples are generated from a mixture of multivariate Gaussian distributions. The second is a Naive Bayes model, which is well suited to discrete attributes. This assumes that all attributes are conditionally independent of each other given that the example falls into a particular cluster.
Random variables A and B are written in bold and take on values either from a discrete or continuous domain. In this primer we assume the values are discrete. If the domain is continuous then integrals take the role of summations, and probabilities become probability density functions. If the set of possible values of A is $\mathcal{A}$, then we have $\sum_{a \in \mathcal{A}} P(A = a) = 1$ (the sum rule), and $P(A = a) \geq 0$ for all possible values a. The joint probability P(A, B) is the probability that A and B take on particular values simultaneously. We have P(A, B) = P(B, A), which means that P(A = a, B = b) = P(B = b, A = a) for all $a \in \mathcal{A}$ and $b \in \mathcal{B}$, where $\mathcal{B}$ is the domain of values that can be adopted by B. P(A | B) is the conditional probability of A given B, defined as P(A | B) = P(A, B)/P(B). From this, we have the product rule
P (A, B) = P (A)P (B | A) = P (B)P (A | B) (3)
From this we also obtain Bayes rule,

$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)} \qquad (4)$$
The above rules continue to hold when conditioning; for example, when conditioning on B we have $\sum_{a} P(A = a \mid B) = 1$. Similarly, when introducing random variable C we have
P (A, B | C) = P (A | C)P (B | A, C) (5)
Two random variables are independent if P(A, B) = P(A)P(B), from which it follows that P(A | B) = P(A). Another useful concept is marginalizing, which holds that $P(B) = \sum_{a} P(A = a, B)$ and gives the unconditional probability P(B = b), for some value b, without any knowledge about the value of A. For example, the denominator in Bayes rule, P(A), can be determined as $\sum_{b} P(A, B = b) = \sum_{b} P(A \mid B = b)\, P(B = b)$.
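The rules in this primer can be checked numerically on a small table. The sketch below uses a made-up joint distribution over two binary variables (the values are purely illustrative) and exact rational arithmetic, so that the sum rule, marginalization, and Bayes rule can be compared exactly.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(A = a, B = b); the numbers are made up.
joint = {
    ("a0", "b0"): F(1, 10), ("a0", "b1"): F(3, 10),
    ("a1", "b0"): F(2, 10), ("a1", "b1"): F(4, 10),
}
B_VALUES = ("b0", "b1")


def marginal_A(a):
    """P(A = a) = sum over b of P(A = a, B = b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)


def marginal_B(b):
    """P(B = b) = sum over a of P(A = a, B = b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)


def cond_A_given_B(a, b):
    """P(A = a | B = b) = P(A = a, B = b) / P(B = b)."""
    return joint[(a, b)] / marginal_B(b)


def bayes_B_given_A(b, a):
    """Bayes rule; the denominator P(A = a) is found by marginalizing over B."""
    denom = sum(cond_A_given_B(a, bi) * marginal_B(bi) for bi in B_VALUES)
    return cond_A_given_B(a, b) * marginal_B(b) / denom
```

Here `bayes_B_given_A("b0", "a0")` agrees with the direct computation `joint[("a0", "b0")] / marginal_A("a0")`; both give 1/4.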
In the case of unsupervised learning, the basic probabilistic approach is to consider a parameterized probabilistic model,
P (X = x | θ), (6)
parameterized by a vector of parameters θ, where random variable X adopts values in feature space X. For a discrete feature space, then P (X = x | θ) is the probability that random variable X adopts value x. For a continuous feature space, then P (X = x | θ) denotes the probability density function. The result of learning is, in the simplest case, a vector of learned parameters θ. This is the case for the maximum likelihood method introduced today and also the more advanced maximum a posteriori method. A more sophisticated, full Bayesian approach, instead reasons directly about a distribution on possible parameters. We leave this to one side for now. Given such a parameterization θ then many interesting inference problems are possible. We dig more into this next class. For now, let us consider the very special cases of classification and clustering.
4.2.1 Classification
In the case of supervised learning, in the probabilistic method we consider a probabilistic model,
P (X = x, Y = y | θ) (7)
where random variable Y adopts values drawn from possible labels $\mathcal{Y}$. Given learned parameters θ, then for classification, we can label a new example x by finding

$$\arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x, \theta) = \arg\max_{y \in \mathcal{Y}} \frac{P(Y = y, X = x \mid \theta)}{P(X = x \mid \theta)} \qquad (8)$$

$$= \arg\max_{y \in \mathcal{Y}} P(Y = y, X = x \mid \theta) \qquad (9)$$
where the idea is to find the class label for which the conditional probability is maximized, given example x. The first equality follows from the definition of conditional probability, and the second by noting that the denominator is constant for all choices of y.
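A small sketch of this decision rule, assuming a hypothetical learned joint model over one discrete feature and two labels (the table values and names are made up for illustration). It shows that maximizing the joint probability picks the same label as maximizing the conditional probability, since the denominator does not depend on y.

```python
# Hypothetical learned joint model P(X = x, Y = y | theta); values are made up.
joint = {
    ("small", "cat"): 0.30, ("small", "dog"): 0.10,
    ("large", "cat"): 0.05, ("large", "dog"): 0.55,
}
LABELS = ["cat", "dog"]


def classify(x):
    """arg max over y of P(Y = y, X = x | theta), with no normalization."""
    return max(LABELS, key=lambda y: joint[(x, y)])


def posterior(x, y):
    """P(Y = y | X = x, theta): the joint divided by the marginal P(X = x | theta)."""
    evidence = sum(joint[(x, yi)] for yi in LABELS)
    return joint[(x, y)] / evidence


def classify_by_posterior(x):
    """Same decision, computed from the normalized conditional probabilities."""
    return max(LABELS, key=lambda y: posterior(x, y))
```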
4.2.2 Clustering
For clustering, the idea is to associate a so-called “latent” variable with each example. The latent variable is hidden, in that it is not present in the data, and is interpreted as the cluster associated with an example. Let Y denote this latent variable, where the random variable Y takes on values in $\{1, \ldots, K\}$ if we are looking for K clusters. Given this, and assuming that the data is used to learn a parameterized model (note that it is not clear yet how this can be done: one of the variables
4.3.1 Example: Univariate Gaussian
First, the univariate (single variable) Gaussian (Normal) distribution has density function:

$$P(X = x \mid \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$$
The parameters are θ = (μ, σ^2 ), where μ = E[X] and σ^2 is the variance E[(X − E[X])^2 ] = E[X^2 ] − μ^2. Here, E[·] is the standard expectation operator. The form of the univariate Gaussian is quite familiar:
[Figure: the univariate Gaussian density f(x), plotted for x from -10 to 10.]
It is symmetric about the mean. It has a single mode, at the mean. The density decays the further away you get from the mean. The lower the variance, the faster the density decays, and the sharper the mode. Given data D = {x 1 ,... , xn}, with xi ∈ R, then we can now solve for the parameters θML that maximize the likelihood of the data. We take the log, and obtain for a single instance
$$\ln P(X = x_i \mid \theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}(x_i - \mu)^2 \qquad (15)$$
Summing across all data points,
$$\ln P(D \mid \theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 - \frac{n}{2}\ln(\sigma^2) - \frac{n}{2}\ln(2\pi) \qquad (16)$$
Take the partial derivative with respect to μ, and equate to zero,
$$\frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \qquad (17)$$
and so
$$\mu_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (18)$$
Similarly, taking the partial derivative with respect to parameter σ^2 and setting to zero, we obtain
$$\frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu_{ML})^2 - \frac{n}{2\sigma^2} = 0 \qquad (19)$$
which solves to
$$\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_{ML})^2 \qquad (20)$$
Both expressions are just what we might expect. The maximum likelihood estimate of the mean is the sample mean, and the maximum likelihood estimate of the variance is the variance of the sample considered as the population.^5
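As a quick numerical check of these closed-form estimates, the sketch below computes the maximum likelihood mean and variance for a small made-up data set and evaluates the log likelihood of Eq. (16). The function names are illustrative.

```python
import math


def gaussian_log_likelihood(data, mu, sigma2):
    """ln P(D | theta) for i.i.d. univariate Gaussian data, as in Eq. (16)."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return (-squared_error / (2 * sigma2)
            - (n / 2) * math.log(sigma2)
            - (n / 2) * math.log(2 * math.pi))


def fit_gaussian_ml(data):
    """Return (mu_ML, sigma2_ML): the sample mean and the divide-by-n variance."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2
```

For the data `[2, 4, 4, 4, 5, 5, 7, 9]` the estimates are a mean of 5 and a variance of 4, and nudging either parameter away from these values lowers the log likelihood.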
We can now consider the problem of clustering. Consider a model with K clusters. We will assume that the data is generated by first assigning a cluster to an example (this is the latent variable) and then, conditioned on the cluster, assigning a feature vector according to a multivariate Gaussian distribution. The result is the well known mixture-of-Gaussian model:
$$P(X = x \mid \theta) = \sum_{k=1}^{K} \pi_k \cdot N(X = x \mid \mu_k, \Sigma_k), \qquad (21)$$
where parameters $\{\pi_1, \ldots, \pi_K\}$ define the probability for generating an example from the kth cluster, and $N(X = x \mid \mu_k, \Sigma_k)$ is the density function associated with a multivariate Gaussian with mean $\mu_k$ and covariance $\Sigma_k$, so that the remaining parameters are $\{\mu_k, \Sigma_k\}_{k=1,\ldots,K}$. Each multivariate Gaussian is called a component of the distribution. Equivalently, we can make latent variable Y (taking on values $\mathcal{Y} = \{1, \ldots, K\}$) explicit, and consider model
P (X = x, Y = y | θ) = P (Y = y | π)P (X = x | Y = y, μ, Σ), (22)
where $P(X = x \mid Y = y, \mu, \Sigma)$ is the multivariate Gaussian distribution corresponding to cluster y. Given this, and given maximum likelihood parameters $\theta_{ML}$, we would assign to example x the cluster that maximizes

$$\arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x, \theta_{ML}) = \arg\max_{y \in \mathcal{Y}} P(Y = y, X = x \mid \theta_{ML}) \qquad (23)$$

$$= \arg\max_{y \in \mathcal{Y}} P(Y = y \mid \theta_{ML})\, P(X = x \mid Y = y, \theta_{ML}) \qquad (24)$$
This can be easily evaluated. The first term is the prior $\pi_y$ (in $\theta_{ML}$) associated with cluster y. The second term is given by the multivariate Gaussian, parameterized according to $\theta_{ML}$.
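The assignment rule in Eq. (24) can be sketched as follows. To stay in plain Python, this illustration assumes univariate Gaussian components (a scalar mean and variance per cluster) rather than the multivariate case, takes the parameters as given, and indexes clusters from 0; the function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def mixture_density(x, pis, mus, sigma2s):
    """P(X = x | theta): the prior-weighted sum of component densities."""
    return sum(pi * normal_pdf(x, mu, s2)
               for pi, mu, s2 in zip(pis, mus, sigma2s))


def assign_cluster(x, pis, mus, sigma2s):
    """arg max over y of pi_y * N(x | mu_y, sigma2_y); returns a 0-based index."""
    scores = [pi * normal_pdf(x, mu, s2)
              for pi, mu, s2 in zip(pis, mus, sigma2s)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With equal variances and a point exactly halfway between two means, the component densities tie and the prior alone decides the assignment.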
(^5) Digging a bit more deeply, this estimator of variance turns out to be biased downwards, and the correction known as “Bessel's correction,” which makes the denominator n − 1 instead of n, can be introduced to fix this problem if necessary. The difference becomes negligible for large enough n. This indicates a small problem with the approach of maximum likelihood.
and includes (K − 1) + mK + m(m + 1)K/2 parameters. In principle, we can now find values for these parameters that maximize the probability (or likelihood) of the data, i.e. P(D | θ). Given this we could then cluster the data using the approach explained above in Eq. (24). Next time we will see a powerful algorithm (the EM algorithm) to learn these parameters. The key challenge is that the cluster assignments are a latent variable (and not seen in the data). Without this complication it is a fairly simple matter to fit maximum likelihood parameters for a mixture-of-Gaussians model.
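The parameter count quoted above can be broken into its three terms. The short sketch below (function name illustrative) tallies K − 1 free mixing priors, since they must sum to one, plus m mean entries and m(m + 1)/2 free entries of a symmetric covariance matrix per component.

```python
def mixture_parameter_count(K, m):
    """Free parameters of a K-component mixture of m-dimensional Gaussians."""
    priors = K - 1                        # the priors sum to 1, so one is determined
    means = m * K                         # one m-dimensional mean per component
    covariances = m * (m + 1) * K // 2    # symmetric m-by-m matrix per component
    return priors + means + covariances
```

For example, K = 3 clusters in m = 2 dimensions gives 2 + 6 + 9 = 17 parameters.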