Cluster analysis- DATA MINING CONCEPT BY Aashish R Dandekar (SVPCET,Nagpur), Summaries of Technology

Cluster analysis, a fundamental method in data mining, groups similar data points into clusters, revealing hidden patterns within the data. It is crucial for exploratory data analysis. Different algorithms, like k-means, hierarchical clustering, and density-based clustering, can be used based on the analysis requirements and data nature.

Typology: Summaries

2023/2024

Uploaded on 07/06/2024

Aashish-R-Dandekar

Aashish R. Dandekar
CSE(DS), SVPCET, Nagpur
Department of CSE(DS)
Cluster Analysis
Cluster Analysis: An Overview
Cluster analysis, also referred to as clustering, is a fundamental method in data mining
aimed at grouping similar data points. The primary objective of cluster analysis is to
partition a dataset into clusters, where each cluster contains data points that are more
similar to each other than to those in other clusters. This technique is pivotal in
exploratory data analysis, enabling the identification of hidden patterns or relationships
within the data. Various algorithms can be employed for cluster analysis, including k-
means, hierarchical clustering, and density-based clustering. The selection of an
appropriate algorithm hinges on the specific requirements of the analysis and the
nature of the data.
Cluster analysis is a form of unsupervised learning: it works on unlabeled data. A
cluster is a group of similar data points. For instance, consider a dataset describing
different types of vehicles, such as cars, buses, and bicycles. Because the learning is
unsupervised, the dataset carries no predefined class labels. Cluster analysis groups
the unlabeled records by similarity, and an analyst can then label the resulting groups,
for example as a cluster of cars, a cluster of buses, and so on.
Assigning every data point to a cluster in this way gives otherwise unlabeled data a
working set of labels, which facilitates further analysis.
Properties of Clustering
1. Clustering Scalability: Given the vast amounts of data in contemporary
databases, a clustering algorithm must scale to handle extensive datasets
effectively. An algorithm that does not scale may fail to produce appropriate
results on such data.
2. High Dimensionality: The algorithm should be capable of handling high-dimensional
spaces, including datasets that contain few objects relative to the number of dimensions.
  3. Algorithm Usability with Multiple Data Types: Clustering algorithms should accommodate various data types, including discrete, categorical, interval-based, and binary data.
  4. Dealing with Unstructured Data: Clustering algorithms must handle missing values, noisy data, and errors effectively. This capability is crucial for organizing unstructured data into coherent clusters, thereby aiding data experts in processing and discovering new patterns.
  5. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and actionable, so that the resulting groups can be readily understood.

Clustering Methods

Clustering methods can be categorized into the following types:
  • Partitioning Method: This method partitions the data into clusters. For a dataset with p objects, n partitions are created, each representing a cluster, where n < p. Two conditions must be met: each object should belong to only one group, and no group should be empty. Iterative relocation, which involves moving objects between groups to improve partitioning, is a common technique in this method.
  • Hierarchical Method: This method creates a hierarchical decomposition of the data objects. It can be further classified based on how the decomposition is formed:
      o Agglomerative Approach (Bottom-Up): Initially, each object forms its own cluster. Clusters are merged iteratively based on their similarity until a termination condition is met.
      o Divisive Approach (Top-Down): All objects start in a single cluster, which is then split iteratively into smaller clusters until a termination condition is met.
    The quality of hierarchical clustering can be improved by carefully analyzing object linkages at each hierarchical partitioning, or by first using a hierarchical agglomerative algorithm to group objects into micro-clusters and then performing macro-clustering on the micro-clusters.
  • Density-Based Method: This method is based on the density of data points. A cluster keeps growing as long as the density in the neighborhood exceeds a specified threshold: for each data point in a cluster, the neighborhood of a given radius must contain at least a minimum number of points.
  • Grid-Based Method: This method quantizes the object space into a finite number of cells that form a grid structure. It offers fast processing time, which depends only on the number of cells in each dimension of the quantized space and not on the number of data objects.
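
The iterative relocation described under the partitioning method can be illustrated with a minimal k-means-style sketch. The notes do not prescribe this code: the function name `kmeans`, the Euclidean distance, the caller-supplied starting centers, and the sample points are all assumptions made for illustration, and the sketch does not repair a group that becomes empty.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Iterative relocation sketch: reassign points, then recompute centers.

    Assumption: Euclidean distance and caller-supplied starting centers.
    Empty groups are not repaired here; their center is simply kept.
    """
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Relocation step: move each object to the group with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partition is stable
        assignment = new_assignment
        # Update step: recompute each center as the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

# Hypothetical 2-D data with two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each object ends up in exactly one group, which matches the first condition of the partitioning method; the result depends on the starting centers, so a practical implementation would usually try several.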

Types of Data

  1. Interval-Scaled Variables
      o Continuous measurements on a roughly linear scale (e.g., weight, height, temperature).
      o Affected by measurement units (e.g., meters vs. inches).
      o Standardization is often necessary to give equal weight to all variables.
  2. Binary Variables
      o Variables that can take only two values (e.g., gender: male/female).
      o Simple matching coefficient for symmetric binary variables.
      o Jaccard coefficient for asymmetric binary variables.
  3. Nominal Variables
      o Categorical variables with more than two states (e.g., color: red, yellow, blue).
      o Methods: Simple matching, converting to multiple binary variables.
  4. Ordinal Variables
      o Variables where order is important (e.g., ranks).
      o Can be treated as interval-scaled variables after ranking and normalization.
  5. Ratio-Scaled Variables
      o Positive measurements on a nonlinear scale (e.g., exponential growth).
      o Often require logarithmic transformation before treating as interval-scaled variables.
  6. Mixed-Type Variables
      o Data containing a combination of the above types.
      o Requires specialized preprocessing techniques to handle mixed data types effectively.

Types of Data Structures in Cluster Analysis
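
As a rough illustration of the two structures covered in this section, the sketch below turns a small data matrix into a dissimilarity matrix. It also shows standardization of interval-scaled variables and the simple matching and Jaccard measures for binary variables mentioned under Types of Data. All function names and sample values are hypothetical, and standardizing with the population standard deviation is one common choice assumed here, not something the notes specify.

```python
import math
from statistics import mean, pstdev

def standardize(data_matrix):
    """Rescale each column (variable) to zero mean and unit standard deviation."""
    columns = list(zip(*data_matrix))
    means = [mean(col) for col in columns]
    # Assumption: a constant column is left unscaled to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in data_matrix]

def simple_matching(a, b):
    """Dissimilarity for symmetric binary variables: share of mismatching positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Dissimilarity for asymmetric binary variables: joint 0/0 matches are ignored."""
    mismatches = sum(x != y for x, y in zip(a, b))
    both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
    denom = mismatches + both_one
    return mismatches / denom if denom else 0.0

def dissimilarity_matrix(data_matrix, distance=math.dist):
    """Build the n x n object-by-object matrix from an object-by-variable matrix."""
    n = len(data_matrix)
    return [[distance(data_matrix[i], data_matrix[j]) for j in range(n)] for i in range(n)]

# Hypothetical interval-scaled data: rows are objects, columns are height (cm), weight (kg).
data = [[170.0, 65.0], [180.0, 80.0], [160.0, 50.0]]
d = dissimilarity_matrix(standardize(data))

# Hypothetical binary data, compared with both binary measures.
a, b = [1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]
print(simple_matching(a, b), jaccard(a, b))
```

Passing the measure as the `distance` argument keeps the matrix construction independent of the variable type, for example `dissimilarity_matrix(binary_rows, distance=jaccard)`.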
  1. Data Matrix (Object by Variable Structure)
      o Rows represent objects, and columns represent variables.
  2. Dissimilarity Matrix (Object by Object Structure)
      o Stores proximities (e.g., distances) between all pairs of n objects in an n × n matrix.

Types of Outliers in Data Mining

  1. Global (Point) Outliers
      o Data points that deviate significantly from the overall distribution.
      o Detection: Statistical methods (z-score, Mahalanobis distance), machine learning algorithms (isolation forest, one-class SVM).
  2. Collective Outliers
      o Groups of data points that deviate collectively from the overall distribution.
      o Detection: Clustering algorithms, density-based methods.
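
The z-score method listed for global outliers can be sketched as follows. This is a minimal illustration for one-dimensional data: the function name, the threshold values, and the sample readings are assumptions, not part of the notes.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds the threshold."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return []  # all values identical, so nothing deviates
    return [i for i, v in enumerate(values) if abs(v - m) / s > threshold]

# Hypothetical sensor readings with one obviously deviating value.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7, 10.1, 25.0]

# With only n = 10 values the largest possible z-score is sqrt(n - 1) = 3,
# so a threshold below 3 is used for this small sample.
print(zscore_outliers(readings, threshold=2.5))  # [9]
```

Because the outlier itself inflates the mean and standard deviation, the z-score is a blunt tool on small samples; the other methods named in the list are less sensitive to this effect.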


  • Handles clusters of arbitrary shape and noise effectively.
  • Requires two parameters: eps (the neighborhood radius) and MinPts (the minimum number of points within eps).

Steps:
  1. Identify core points, that is, points with at least MinPts neighbors within eps.
  2. Form clusters by recursively including density-connected points.
  3. Label points that are not part of any cluster as noise.
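
The three steps above, with their eps and MinPts parameters (the parameters used by DBSCAN), can be sketched in plain Python. This is an illustrative brute-force version, not an optimized implementation; it assumes Euclidean distance, counts a point as its own neighbor, and treats a point as core when it has at least `min_pts` neighbors. The names `dbscan` and `NOISE` and the sample points are hypothetical.

```python
import math

NOISE = -1  # label used for points that belong to no cluster

def dbscan(points, eps, min_pts):
    n = len(points)
    # Neighborhood of each point: indices within eps (the point itself included).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: identify core points.
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Step 2: grow a new cluster outward from an unassigned core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            if not is_core[current]:
                continue  # border points join the cluster but do not extend it
            for j in neighbors[current]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    frontier.append(j)
        cluster_id += 1
    # Step 3: whatever was never reached is noise.
    return [NOISE if lab is None else lab for lab in labels]

# Hypothetical data: two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The pairwise neighbor search is quadratic in the number of points; practical implementations usually replace it with a spatial index.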

Grid-Based Method for Distance-Based Outlier Detection

Grid-based outlier detection involves:

  1. Constructing grid cells.
  2. Assigning data points to appropriate cells.
  3. Merging dense cells.
  4. Pruning safe regions and applying outlier detection to candidate cells.
  5. Using density measures to assign outlierness degrees.
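
A simplified sketch of these grid steps is given below. It is one possible reading, not the specific algorithm the notes have in mind: the cell side is set equal to the search radius, dense cells are simply collected into one safe set instead of being merged geometrically, points in dense cells are assumed to be non-outliers, and the outlierness degree 1 / (1 + neighbor count) is an illustrative density measure. All names and sample values are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def grid_outlier_scores(points, radius, dense_threshold):
    """Score points in sparse grid cells; points in dense cells are pruned as safe."""
    dim = len(points[0])
    # Steps 1-2: build cells of side `radius` and assign every point to its cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(x / radius) for x in p)].append(idx)
    # Step 3 (simplified): collect dense cells into one safe region.
    safe = {c for c, members in cells.items() if len(members) >= dense_threshold}
    # With cell side == radius, any neighbor within `radius` lies in an adjacent cell.
    offsets = list(product((-1, 0, 1), repeat=dim))
    scores = {}
    for cell, members in cells.items():
        if cell in safe:
            continue  # Step 4: prune safe regions, keep only candidate cells
        for i in members:
            count = 0
            for off in offsets:
                adjacent = tuple(c + o for c, o in zip(cell, off))
                for j in cells.get(adjacent, ()):
                    if j != i and math.dist(points[i], points[j]) <= radius:
                        count += 1
            # Step 5: fewer neighbors within the radius means a higher outlierness degree.
            scores[i] = 1.0 / (1.0 + count)
    return scores

# Hypothetical data: a dense group near the origin, one nearby point, one far point.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (0.3, 0.4), (5.5, 5.5), (0.8, 1.2)]
scores = grid_outlier_scores(points, radius=1.0, dense_threshold=3)
print(scores)  # {4: 1.0, 5: 0.5}
```

Only the points in sparse cells are compared against their neighbors, which is where the speed advantage of the grid comes from.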
