Machine Learning Map from Scikit-learn: Clustering Techniques and Evaluation, Lecture notes of Machine Learning

[Week 8] Clustering and Dimensionality Reduction

Typology: Lecture notes

2018/2019

Uploaded on 04/20/2019

kefart
The University of Sydney
COMP5310: Principles of Data Science
W8: Clustering and Dimensionality Reduction
Presented by Ali Anaissi
School of IT

Overview of Week

Unsupervised Learning:

  • We’ll focus on unsupervised machine learning techniques:
    • Association rule mining
    • Dimensionality reduction
    • Clustering
    • Outlier detection
    • Etc.

Machine Learning Map from Scikit-learn

http://scikit-learn.org/stable/tutorial/machine_learning_map/

Clustering: Group Similar Objects

  • Group data points into clusters such that
    • Data points in one cluster are more similar to one another.
    • Data points in separate clusters are less similar to one another.
  • A distance function specifies the “closeness” of two objects.
  • Intra-cluster distances are minimized; inter-cluster distances are maximized.


Similarity and Dissimilarity Between Objects

  • Distances are normally used to measure the similarity or dissimilarity between two data objects
  • Some popular ones include: Minkowski distance:

    $$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

    where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
  • If q = 1, d is Manhattan distance:

    $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
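To make the formula concrete, here is a minimal Python sketch of the Minkowski distance. The function name `minkowski_distance` and the two sample objects are illustrative assumptions, not part of the lecture. Setting q = 1 gives the Manhattan distance; q = 2 gives the familiar Euclidean distance.

```python
def minkowski_distance(x, y, q=2):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    if q < 1:
        raise ValueError("q must be at least 1")
    # Sum the q-th powers of the per-attribute differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Two hypothetical 3-dimensional data objects (values chosen for illustration).
i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)

manhattan = minkowski_distance(i, j, q=1)  # |1-4| + |2-6| + |3-3|
euclidean = minkowski_distance(i, j, q=2)  # sqrt(3^2 + 4^2 + 0^2)
print(manhattan, euclidean)
```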

Data Structures

  • Data matrix: n observations (tuples/objects, one per row) with p attributes (measurements/dimensions, one per column).

    $$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

  • Dissimilarity matrix: d(i,j) is the dissimilarity between objects i and j
    • expresses the pairwise dissimilarities (distances) between observations in the data set
    • the desired data input to some clustering algorithms
    • it is an n × n (objects by objects) table:

    $$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
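The relationship between the two structures can be sketched in a few lines of Python: each entry d(i,j) of the dissimilarity matrix is a distance between rows i and j of the data matrix. The helper name `dissimilarity_matrix`, the use of Minkowski distance with q = 1, and the sample data are assumptions made for illustration.

```python
def dissimilarity_matrix(data, q=1):
    """Build the n x n matrix of pairwise Minkowski distances between rows of data."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):  # compute the lower triangle only
            dist = sum(abs(a - b) ** q for a, b in zip(data[i], data[j])) ** (1.0 / q)
            d[i][j] = d[j][i] = dist  # mirror it, since d(i, j) = d(j, i)
    return d


# Hypothetical data matrix with n = 3 observations and p = 2 attributes.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data, q=1)
for row in D:
    print(row)
```

Only the lower triangle is computed explicitly; the upper triangle is filled by symmetry, which mirrors why the matrix is usually drawn in triangular form.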

Clustering for Understanding

  • Group related documents for browsing
  • Group genes, proteins, or cells that have similar functionality
  • Group stocks with similar price fluctuations
  • etc

Hierarchical Clustering

[Figure: original data items and the corresponding hierarchical data items]

Hierarchical Clustering

Strategies for hierarchical clustering generally fall into two types:

  • Agglomerative: a "bottom up" approach. Each object starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive: a "top down" approach. All objects start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Hierarchical Algorithm

Steps in Hierarchical Algorithm:

  • The first step generates the distance matrix between every pair of data items, as shown in the table below. In this case the items are: {a}, {b}, {c}, {d}, {e}, {f}.

          a    b    c    d    e    f
     a    0  184  222  177  216  231
     b  184    0   45  123  128  200
     c  222   45    0  129  121  203
     d  177  123  129    0   46   83
     e  216  128  121   46    0   83
     f  231  200  203   83   83    0

Hierarchical Algorithm

  • Next step is to merge the closest data items.
    • In this case: {b, c} are merged (distance 45, the smallest off-diagonal entry).
    • Therefore, the first clustering process generates: {a}, {b, c}, {d}, {e}, {f}.
    • The distances between the new cluster {b, c} and the other items are not yet known (marked ?):

            a  b,c    d    e    f
     a      0    ?  177  216  231
     b,c    ?    0    ?    ?    ?
     d    177    ?    0   46   83
     e    216    ?   46    0   83
     f    231    ?   83   83    0
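The merge step amounts to scanning the distance matrix for its smallest off-diagonal entry. The sketch below uses the distance values from the lecture's table; the function name `closest_pair` and the variable names are illustrative assumptions.

```python
labels = ["a", "b", "c", "d", "e", "f"]

# Distance matrix from the lecture's table (symmetric, zero diagonal).
D = [
    [0, 184, 222, 177, 216, 231],
    [184, 0, 45, 123, 128, 200],
    [222, 45, 0, 129, 121, 203],
    [177, 123, 129, 0, 46, 83],
    [216, 128, 121, 46, 0, 83],
    [231, 200, 203, 83, 83, 0],
]


def closest_pair(dist):
    """Return (distance, row, col) of the smallest entry below the diagonal."""
    best = None
    for i in range(len(dist)):
        for j in range(i):  # j < i, so the diagonal is never considered
            if best is None or dist[i][j] < best[0]:
                best = (dist[i][j], j, i)
    return best


d_min, p, r = closest_pair(D)
merged = (labels[p], labels[r])
print(merged, d_min)
```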

Hierarchical Algorithm with Single Linkage

  • Repeat the distance calculation process based on single linkage: the distance between two clusters is the smallest distance between any pair of their members.
  • Apply the merging process based on the previous merge results.
  • In this case: {d, e} are merged.
  • The successive results are:
    • {a}, {b, c}, {d, e}, {f}
    • {a}, {b, c}, {d, e, f}
    • {a}, {b, c, d, e, f}
    • {a, b, c, d, e, f}

Distance matrix after merging {b, c}, recomputed with single linkage:

            a  b,c    d    e    f
     a      0  184  177  216  231
     b,c  184    0  123  121  200
     d    177  123    0   46   83
     e    216  121   46    0   83
     f    231  200   83   83    0
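The whole procedure can be sketched as a short, deliberately naive Python loop: repeatedly find the two clusters with the smallest single-linkage distance and merge them. The function name `single_linkage` and the merge-history format are assumptions for illustration; the distance values are those of the lecture's table. This is a teaching sketch, not an efficient implementation.

```python
labels = ["a", "b", "c", "d", "e", "f"]
D = [
    [0, 184, 222, 177, 216, 231],
    [184, 0, 45, 123, 128, 200],
    [222, 45, 0, 129, 121, 203],
    [177, 123, 129, 0, 46, 83],
    [216, 128, 121, 46, 0, 83],
    [231, 200, 203, 83, 83, 0],
]


def single_linkage(labels, dist):
    """Agglomerative clustering with single linkage; returns the merge history."""
    clusters = [frozenset([k]) for k in range(len(labels))]  # one cluster per item
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i):
                # Single linkage: smallest distance between any two members.
                d = min(dist[p][r] for p in clusters[i] for r in clusters[j])
                if best is None or d < best[0]:
                    best = (d, j, i)
        d, j, i = best
        merged = clusters[j] | clusters[i]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((d, sorted(labels[k] for k in merged)))
    return history


history = single_linkage(labels, D)
for distance, members in history:
    print(distance, members)
```

Each history entry records the linkage distance at which a merge happened and the members of the newly formed cluster, which is the information a dendrogram displays.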

Resultant Hierarchical Clustering

[Figure: original data items and the resulting hierarchical data items]