Clustering: A Data Analysis Technique for Grouping Similar Objects, Summaries of Process Control

Clustering is a data analysis technique used to group a set of physical or abstract objects into classes of similar objects. The process involves identifying clusters, which are collections of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. Clustering can be used for data compression, data exploration, and data preprocessing. These notes cover various clustering algorithms, including hierarchical, density-based, and grid-based methods, and their advantages and disadvantages.

Typology: Summaries

2021/2022

Uploaded on 01/16/2022

Jaqua

UNIT : 4 - CLUSTERING

4.1 Cluster Analysis:

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.

4.1.1 Applications:

Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also help in the identification of areas of similar land use in an earth observation database, in the identification of groups of houses in a city according to house type, value, and geographic location, and in the identification of groups of automobile insurance policy holders with a high average claim cost. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.

Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.

4.1.2 Typical Requirements Of Clustering In Data Mining:

  • Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
  • Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
  • Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
  • Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
  • Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.

  • Density-Based Methods
  • Grid-Based Methods

4.2.1 Partitioning Methods:

A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n (n being the number of objects). That is, it classifies the data into k groups, which together satisfy the following requirements:

  • Each group must contain at least one object.
  • Each object must belong to exactly one group.

A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.

4.2.2 Hierarchical Methods:

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.

  • The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
  • The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in a cluster of its own, or until a termination condition holds.

Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. There are two approaches to improving the quality of hierarchical clustering:

  • Perform careful analysis of object "linkages" at each hierarchical partitioning, such as in Chameleon, or
  • Integrate hierarchical agglomeration and other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation.

4.2.3 Density-Based Methods:

  • Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes.
  • Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
  • DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
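The density idea above (grow a cluster while each point's neighborhood of a given radius holds at least a minimum number of points) can be illustrated with a small DBSCAN-style sketch. This is a simplified illustration in plain Python, not a reference implementation; the names `eps`, `min_pts`, `region_query`, and `dbscan` are assumptions chosen for this example, and the neighborhood search is a brute-force scan.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, idx, eps):
    """Indexes of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # points[j] is also a core point, so keep expanding through it.
                frontier.extend(j_neighbors)
        cluster_id += 1
    return labels
```

With two well-separated groups and one isolated point, the isolated point is labelled as noise while each group receives its own cluster label, regardless of the groups' shapes.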

4.2.4 Grid-Based Methods:

Grid-based methods quantize the object space into a finite number of cells that form a grid structure.

First, the k-means algorithm randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$

where
  • E is the sum of the square error for all objects in the data set,
  • p is the point in space representing a given object, and
  • m_i is the mean of cluster C_i.

The k-means partitioning algorithm:

The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
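A minimal sketch of the k-means iteration described above, in plain Python. The names `k_means`, `max_iter`, and `seed` are illustrative assumptions rather than part of the notes. Initial centers are drawn at random from the objects, and the loop stops when the assignments no longer change, which is one common way to detect that the square-error criterion has stopped improving.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, assignment, square_error) for the given points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k objects act as the initial means
    assignment = None
    for _ in range(max_iter):
        # Assign every object to the cluster whose mean is nearest.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centers[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object moved, so the means would not change either
        assignment = new_assignment
        # Recompute the mean of each cluster (an empty cluster keeps its center).
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centers[i] = mean_point(members)
    # Square-error criterion: sum of squared distances to the assigned mean.
    error = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, error
```

For example, `k_means([(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)], k=2)` separates the three points near the origin from the three points near (8.5, 8.5).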

Clustering of a set of objects based on the k-means method.

The k-Medoids Method:

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. This effect is particularly exacerbated by the use of the square-error function. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$$

where
  • E is the sum of the absolute error for all objects in the data set,
  • p is the point in space representing a given object in cluster C_j, and
  • o_j is the representative object of C_j.

Four cases of the cost function for k-medoids clustering

The k-Medoids Algorithm:

The k-medoids algorithm for partitioning based on medoid or central objects.
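The following is a small sketch of a k-medoids procedure in the spirit of the description above: actual objects serve as representatives, and a representative is swapped for a non-representative object whenever the swap lowers the absolute-error criterion. It is a simplified, brute-force illustration, not the exact algorithm from the figure. The names `total_cost` and `k_medoids` are hypothetical, and taking the first k objects as initial medoids is an arbitrary assumption.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Absolute-error criterion: each object contributes its distance
    to the nearest representative object (medoid)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k):
    """Return (medoids, cost) after repeated improving swaps."""
    medoids = list(points[:k])  # arbitrary initial representatives (assumption)
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Try replacing each medoid with each non-medoid object.
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            trial_cost = total_cost(points, trial)
            if trial_cost < cost:  # keep the swap only if the error decreases
                medoids, cost, improved = trial, trial_cost, True
    return medoids, cost
```

Because every accepted swap strictly lowers the cost and there are finitely many medoid sets, the loop terminates. On the one-dimensional data `[(1.0,), (2.0,), (3.0,), (4.0,), (100.0,)]` with k = 1, the chosen medoid is `(3.0,)`, whereas the mean of the same data is 22, which shows how a single extreme value pulls the mean but not the medoid.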

The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.

4.4 Hierarchical Clustering Methods:

A hierarchical clustering method works by grouping data objects into a tree of clusters. The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it. Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down fashion.

4.4.1 Agglomerative hierarchical clustering:

This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
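As an illustration of the bottom-up strategy, the sketch below starts with one cluster per object and repeatedly merges the two closest clusters until a target number of clusters remains. The two intercluster distances shown (minimum and maximum pairwise distance, often called single and complete linkage) are examples of the similarity definitions in which such methods differ. The function names and the `target_clusters` termination condition are assumptions made for this example.

```python
import math


def single_link(c1, c2):
    """Intercluster distance as the minimum pairwise distance."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def complete_link(c1, c2):
    """Intercluster distance as the maximum pairwise distance."""
    return max(math.dist(p, q) for p in c1 for q in c2)


def agglomerative(points, target_clusters, linkage=single_link):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # the merge is never undone
        del clusters[j]
    return clusters
```

Calling `agglomerative([(1.0,), (2.0,), (9.0,), (10.0,), (25.0,)], 3)` yields the groups {1, 2}, {9, 10}, and {25}; asking for 2 clusters then merges {1, 2} with {9, 10}.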

4.4.2 Divisive hierarchical clustering:

This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster falling within a certain threshold.
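The top-down strategy and its two termination conditions can be sketched as follows. The splitting rule used here (seed the two halves with the two farthest-apart objects and assign every other object to the nearer seed) is only one simple heuristic chosen for illustration; the notes do not prescribe a particular splitting rule, and the names `diameter`, `split`, `divisive`, and `max_diameter` are assumptions.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest pairwise distance inside a cluster (0 for a single object)."""
    if len(cluster) < 2:
        return 0.0
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))


def split(cluster):
    """Split a cluster in two around its two farthest-apart objects."""
    a, b = max(combinations(cluster, 2), key=lambda pq: math.dist(*pq))
    left = [p for p in cluster if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in cluster if math.dist(p, a) > math.dist(p, b)]
    return left, right


def divisive(points, target_clusters, max_diameter=0.0):
    """Start with one cluster and keep splitting the widest one."""
    clusters = [list(points)]
    while len(clusters) < target_clusters:
        widest = max(clusters, key=diameter)
        if diameter(widest) <= max_diameter:
            break  # every cluster is already within the diameter threshold
        clusters.remove(widest)
        clusters.extend(split(widest))
    return clusters
```

The loop stops either when the desired number of clusters is reached or when no cluster is wider than `max_diameter`, mirroring the two example termination conditions named above.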

2. OPTICS: Ordering Points To Identify the Clustering Structure

  • Produces a special order of the database with respect to its density-based clustering structure
      – This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
  • Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
  • Can be represented graphically or using visualization techniques

3. DENCLUE:

  • Solid mathematical foundation
  • Good for data sets with large amounts of noise
  • Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  • Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  • But needs a large number of parameters

4.6 Grid-Based Clustering Methods:

  • STING
  • CLIQUE

STING: A Statistical Information Grid

  • Each cell at a high level is partitioned into a number of smaller cells at the next lower level
  • Statistical information for each cell is calculated and stored beforehand and is used to answer queries
  • Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells, such as count, mean, standard deviation (s), min, max, and type of distribution (normal, uniform, etc.)
  • Uses a top-down approach to answer spatial data queries
  • Starts from a pre-selected layer, typically one with a small number of cells
  • For each cell in the current level, computes the confidence interval
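To make concrete how the parameters of a higher-level cell can be derived from its lower-level cells without revisiting the raw data, here is a small sketch for count, mean, standard deviation, min, and max. The `CellStats` class and the function names are hypothetical. The sketch assumes non-empty cells and uses the population standard deviation, and it leaves out the distribution type.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    count: int
    mean: float
    std: float      # population standard deviation (the "s" parameter)
    minimum: float
    maximum: float


def leaf_stats(values):
    """Statistics of a lowest-level cell, computed directly from its values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))


def merge_stats(children):
    """Statistics of a parent cell, computed only from its children's statistics."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's
    # E[x^2] = std^2 + mean^2; the variance follows as E[x^2] - mean^2.
    second_moment = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    var = max(second_moment - mean ** 2, 0.0)  # guard against rounding below zero
    return CellStats(
        n,
        mean,
        math.sqrt(var),
        min(c.minimum for c in children),
        max(c.maximum for c in children),
    )
```

Merging the statistics of cells holding [1, 2, 3] and [4, 5] gives the same count, mean, standard deviation, min, and max as computing them directly over [1, 2, 3, 4, 5], which is why the stored statistics can be rolled up level by level.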

Advantages:

  • Query-independent, easy to parallelize, incremental update
  • Query processing time is O(K), where K is the number of grid cells at the lowest level

Disadvantages:

  • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

CLIQUE (Clustering In QUEst)

  • Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
  • CLIQUE can be considered as both density-based and grid-based
      – It partitions each dimension into the same number of equal-length intervals
      – It partitions an m-dimensional data space into non-overlapping rectangular units
      – A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter (a density threshold)
      – A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps

  • Partition the data space and find the number of points that lie inside each cell of the partition.
  • Identify the subspaces that contain clusters using the Apriori principle.
  • Identify clusters:
      – Determine dense units in all subspaces of interest.
      – Determine connected dense units in all subspaces of interest.
  • Generate a minimal description for the clusters:
      – Determine maximal regions that cover a cluster of connected dense units for each cluster.
      – Determine the minimal cover for each cluster.
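The first two steps can be sketched for the step from one-dimensional to two-dimensional subspaces. The code below is a simplified illustration rather than the full CLIQUE algorithm: it assumes every dimension shares the same value range `[lo, hi]`, and the names `cell_index`, `dense_units_1d`, `dense_units_2d`, and `threshold` are hypothetical. The Apriori principle appears as the rule that a two-dimensional unit is only considered if both of its one-dimensional projections are dense.

```python
from collections import Counter
from itertools import combinations


def cell_index(value, lo, hi, n_intervals):
    """Index of the equal-length interval containing value (hi goes in the last one)."""
    width = (hi - lo) / n_intervals
    return min(int((value - lo) / width), n_intervals - 1)


def dense_units_1d(points, n_intervals, lo, hi, threshold):
    """Map each dimension to the set of its dense intervals."""
    n = len(points)
    dense = {}
    for dim in range(len(points[0])):
        counts = Counter(cell_index(p[dim], lo, hi, n_intervals) for p in points)
        # A unit is dense if its fraction of all points exceeds the threshold.
        dense[dim] = {cell for cell, c in counts.items() if c / n > threshold}
    return dense


def dense_units_2d(points, dense_1d, n_intervals, lo, hi, threshold):
    """Map each pair of dimensions to its dense two-dimensional units."""
    n = len(points)
    result = {}
    for d1, d2 in combinations(sorted(dense_1d), 2):
        # Apriori pruning: only combine intervals that are dense on their own.
        candidates = {(a, b) for a in dense_1d[d1] for b in dense_1d[d2]}
        counts = Counter(
            (cell_index(p[d1], lo, hi, n_intervals),
             cell_index(p[d2], lo, hi, n_intervals))
            for p in points
        )
        result[(d1, d2)] = {u for u in candidates if counts[u] / n > threshold}
    return result
```

The remaining steps (joining connected dense units into clusters and generating a minimal description) would operate on the sets returned by `dense_units_2d` and are not shown here.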