Clustering in data mining: lecture notes, Applications of Computer Sciences

14.11.2001 Data mining: Clustering
Course on Data Mining (581550-4)

[Course schedule graphic: Intro/Ass. Rules, Episodes, Text Mining, Home Exam, Clustering, KDD Process, Appl./Summary; dates shown: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]

Today 14.11.

Today's subject:
o Classification, clustering

Next week's program:
o Lecture: Data mining process
o Exercise: Classification, clustering
o Seminar: Classification, clustering

Cluster analysis: Overview

What is cluster analysis?
Similarity and dissimilarity
Types of data in cluster analysis
Major clustering methods
Partitioning methods
Hierarchical methods
Outlier analysis
Summary

What is cluster analysis?

Cluster: a collection of data objects
o similar to one another within the same cluster
o dissimilar to the objects in the other clusters

Aim of clustering: to group a set of data objects into clusters

Applications of clustering

Marketing: discovering distinct customer groups in a purchase database

Land use: identifying areas of similar land use in an earth observation database

Insurance: identifying groups of motor insurance policy holders with a high average claim cost

City-planning: identifying groups of houses according to their house type, value, and geographical location

What is good clustering?

A good clustering method will produce high-quality clusters with
o high intra-class similarity
o low inter-class similarity

The quality of a clustering result depends on
o the similarity measure used
o the implementation of the similarity measure

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
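
The intra-class versus inter-class criterion can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses Euclidean distance as the dissimilarity (so high similarity means small distance), and the function name `mean_intra_inter` and the sample points are made up for this example.

```python
from itertools import combinations
from math import dist


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    # Visit every unordered pair of objects exactly once.
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        if lp == lq:
            intra.append(dist(p, q))  # same cluster
        else:
            inter.append(dist(p, q))  # different clusters
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical data: two tight groups that lie far apart.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (8.0, 9.0)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small within-cluster mean together with a large between-cluster mean corresponds to high intra-class and low inter-class similarity.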

Requirements of clustering in data mining (2)

Ability to deal with noise and outliers
Insensitivity to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability

Similarity and dissimilarity between objects (1)

There is no single definition of similarity or dissimilarity between data objects

The definition of similarity or dissimilarity between objects depends on
o the type of the data considered
o what kind of similarity we are looking for

Type of data in cluster analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Complex data types

Interval-scaled variables (1)

Continuous measurements of a roughly linear scale

For example, weight, height and age

The measurement unit can affect the cluster analysis

To avoid dependence on the measurement unit, we should standardize the data
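
One way to standardize a variable is to convert it to z-scores. This excerpt does not show which formula the lecture uses, so the sketch below assumes z-scores based on the mean absolute deviation; the function name `standardize` and the height values are hypothetical.

```python
def standardize(values):
    """Rescale one variable to z-scores using the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    # Mean absolute deviation: the average distance of the values from the mean.
    mad = sum(abs(v - mean) for v in values) / n
    if mad == 0:
        return [0.0] * n  # a constant variable carries no information
    return [(v - mean) / mad for v in values]


# The same heights in centimetres and in metres (made-up values).
heights_cm = [150.0, 160.0, 170.0, 180.0]
heights_m = [h / 100 for h in heights_cm]
z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
print(z_cm)
print(z_m)
```

Both unit choices give the same standardized values up to floating-point rounding, so the measurement unit no longer influences the distances.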

Interval-scaled variables (3)

One popular group of distance measures for interval-scaled variables is the Minkowski distances

\[
d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\,}
\]

where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two $p$-dimensional data objects, and $q$ is a positive integer
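
The Minkowski distance translates almost directly into code. This is a minimal sketch, not a reference implementation; the function name `minkowski` and the two sample objects are chosen for illustration.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    # Sum the q-th powers of the coordinate differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)


i = (1.0, 2.0)
j = (4.0, 6.0)
d1 = minkowski(i, j, 1)
d2 = minkowski(i, j, 2)
print(d1, d2)
```

For these two objects the coordinate differences are 3 and 4, so order 1 gives 3 + 4 = 7 and order 2 gives the square root of 9 + 16, which is 5.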

Interval-scaled variables (4)

If $q = 1$, the distance measure is Manhattan (or city block) distance

\[
d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|
\]

If $q = 2$, the distance measure is Euclidean distance

\[
d(i,j) = \sqrt{\,|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2\,}
\]
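
The choice between the two special cases can change which objects count as close. The sketch below uses hypothetical points `a` and `b` picked so that the nearest neighbour of the origin differs between Manhattan and Euclidean distance; the helper names are illustrative.

```python
from math import sqrt


def manhattan(i, j):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(i, j))


def euclidean(i, j):
    """Straight-line distance: root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))


origin = (0.0, 0.0)
a = (3.0, 3.0)  # Manhattan 6, Euclidean about 4.24
b = (5.0, 0.0)  # Manhattan 5, Euclidean 5

nearest_manhattan = min((a, b), key=lambda p: manhattan(origin, p))
nearest_euclidean = min((a, b), key=lambda p: euclidean(origin, p))
print(nearest_manhattan, nearest_euclidean)
```

Under Manhattan distance `b` is the nearer point, while under Euclidean distance `a` is, which is one reason the similarity measure affects the clustering result.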

Binary variables (2)

Simple matching coefficient (invariant similarity, if the binary variable is symmetric):

\[
d(i,j) = \frac{b + c}{a + b + c + d}
\]

Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):

\[
d(i,j) = \frac{b + c}{a + b + c}
\]
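
Both coefficients can be computed from four counts. The counts a, b, c and d are not defined in this excerpt, so the sketch below assumes the usual 2x2 contingency convention: a counts variables that are 1 for both objects, b those that are 1 for i and 0 for j, c those that are 0 for i and 1 for j, and d those that are 0 for both. Function names and sample vectors are illustrative.

```python
def contingency(i, j):
    """Return the counts (a, b, c, d) for two 0/1 vectors of equal length."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def simple_matching_dissimilarity(i, j):
    """(b + c) / (a + b + c + d): joint absences count as agreement."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c): joint absences are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero objects are treated as identical
    return (b + c) / (a + b + c)


u = [1, 0, 1, 0, 0, 0, 0, 0]
v = [1, 1, 0, 0, 0, 0, 0, 0]
sm = simple_matching_dissimilarity(u, v)
jc = jaccard_dissimilarity(u, v)
print(sm, jc)
```

Here a = 1, b = 1, c = 1 and d = 5, so the many shared zeros pull the simple matching value down to 2/8 while the Jaccard value stays at 2/3.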

Binary variables (3)

Example: dissimilarity between binary variables:
a patient record table with eight attributes, of which
o gender is a symmetric attribute, and
o the remaining attributes are asymmetric binary

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
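
The dissimilarities for this table can be worked out with the Jaccard formula, (b + c) / (a + b + c). The sketch below rests on assumptions not stated in this excerpt: the values Y and P are coded as 1 and N as 0, and only the six asymmetric attributes after Gender are compared. The last column is left unnamed because its header is truncated in the source.

```python
from itertools import combinations

# Values of the six asymmetric attributes in table order (Gender omitted).
patients = {
    "Jack": "YNPNNN",
    "Mary": "YNPNPN",
    "Jim": "YPNNNN",
}


def encode(row):
    """Assumed coding: Y and P become 1, N becomes 0."""
    return [0 if value == "N" else 1 for value in row]


def jaccard_dissimilarity(i, j):
    """(b + c) / (a + b + c) for two 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(i, j) if x != y)  # b + c
    return mismatches / (a + mismatches)


d = {
    (p, q): jaccard_dissimilarity(encode(patients[p]), encode(patients[q]))
    for p, q in combinations(patients, 2)
}
for pair, value in d.items():
    print(pair, round(value, 2))
```

Under these assumptions Jack and Mary are the most similar pair (0.33), and Mary and Jim the least similar (0.75), with Jack and Jim in between (0.67).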