











Material Type: Notes; Professor: Bellaachia; Class: Data Mining; Subject: Computer Science; University: George Washington University; Term: Unknown 1989;
Typology: Study notes












2.1. Definitions
2.2. General Applications
o Text mining: document categorization, detection of topics, summarization
o Text mining: Web log analysis, detection of groups of similar access patterns
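Data matrix (n objects, p variables):

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (n by n, lower triangular):

\[
\begin{bmatrix}
0      &        &        &        \\
d(2,1) & 0      &        &        \\
d(3,1) & d(3,2) & 0      &        \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]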
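As a rough illustration of how the two structures relate, the sketch below builds the lower-triangular dissimilarity matrix from a small data matrix. The function name `dissimilarity_matrix` and the sample data are hypothetical, and Euclidean distance is an assumption made only for this example; the notes do not fix a distance at this point.

```python
import math

def dissimilarity_matrix(data):
    """Build the lower-triangular dissimilarity matrix of an n-by-p data matrix.

    Row i of the result holds d(i, 0), ..., d(i, i-1) followed by the
    zero diagonal entry. Indices are 0-based here, 1-based in the notes.
    Euclidean distance is an assumption of this sketch.
    """
    rows = []
    for i in range(len(data)):
        row = [math.dist(data[i], data[j]) for j in range(i)]
        row.append(0.0)  # d(i, i) = 0
        rows.append(row)
    return rows

# Hypothetical data matrix: 3 objects, 2 variables.
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) for a symmetric dissimilarity.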
4.1. Standardize data
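Compute the standardized measurement (z-score) of each value:

\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]

Where

\[
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
\]

\[
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
\]

Here m_f is the mean of variable f and s_f is its mean absolute deviation.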
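The standardization can be sketched for a single variable f as follows. The function name `standardize` and the sample column are hypothetical; the arithmetic follows the definitions of m_f, s_f, and z_if in this section.

```python
def standardize(column):
    """Standardize one variable using the mean absolute deviation.

    column: the values x_1f, ..., x_nf of a single variable f.
    Returns the list of z_if = (x_if - m_f) / s_f.
    """
    n = len(column)
    m_f = sum(column) / n                         # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n   # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; s_f is zero")
    return [(x - m_f) / s_f for x in column]

# Hypothetical measurements of one variable.
z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)  # [-1.5, -0.5, 0.5, 1.5]
```

Because s_f uses absolute rather than squared deviations, a single extreme value inflates it less than it would inflate the standard deviation.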
Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
4.2. Binary variables
Contingency table for binary data:

                     Object j
                     1        0        sum
Object i    1        a        b        a+b
            0        c        d        c+d
            sum      a+c      b+d      p
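Simple matching coefficient (symmetric binary variables):

\[
d(i,j) = \frac{b + c}{a + b + c + d}
\]

Jaccard coefficient (asymmetric binary variables):

\[
d(i,j) = \frac{b + c}{a + b + c}
\]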
4.3. Nominal Variables
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Dissimilarities to compute from the table: d(jack, mary), d(jack, jim), d(jim, mary).
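The pairwise dissimilarities for the patient table can be sketched as below. Several choices here are assumptions, not statements from the notes: Y and P are coded as 1 and N as 0, Gender is treated as a symmetric attribute and left out, and the remaining six attributes are compared with the asymmetric measure d(i,j) = (b + c) / (a + b + c). The helper names are hypothetical.

```python
def contingency(x, y):
    """Count a, b, c, d for two equal-length 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def asymmetric_dissimilarity(x, y):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        raise ValueError("no attribute is present in either object")
    return (b + c) / (a + b + c)

# Fever, Cough, and the four test columns; Y/P -> 1, N -> 0 (assumed coding).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_dissimilarity(jim, mary), 2))   # 0.75
```

Under these assumptions Jack and Mary are the most similar pair, and Jim and Mary the least similar.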
4.5. Ratio-scaled variables
4.6. Variables of mixed types
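The dissimilarity between objects i and j combines the contributions of all p variables:

\[
d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
\]

where \delta_{ij}^{(f)} weights the contribution of variable f (for example, 0 when a value is missing). The contribution d_{ij}^{(f)} depends on the type of f:
o f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, and d_{ij}^{(f)} = 1 otherwise
o f is interval-based: use the normalized distance
o f is ordinal or ratio-scaled:
  o compute the ranks r_{if} and

    \[
    z_{if} = \frac{r_{if} - 1}{M_f - 1}
    \]

    where M_f is the number of ordered states of f
  o treat z_{if} as interval-scaled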
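A minimal sketch of the mixed-type combination follows. The function name, the `kinds` and `params` arguments, and the sample objects are hypothetical. Two further assumptions are made: a missing value (`None`) gets weight 0 and every other variable gets weight 1, and the "normalized distance" for an interval variable is taken to be the absolute difference divided by the variable's range.

```python
def mixed_dissimilarity(x, y, kinds, params):
    """Combine per-variable dissimilarities over variables of mixed types.

    kinds[f]  : "nominal" (also binary), "interval", or "ordinal"
    params[f] : range (max - min) for interval variables,
                number of ordered states M_f for ordinal variables
                (values given as ranks 1..M_f), unused for nominal ones.
    """
    numerator = 0.0
    denominator = 0.0
    for xf, yf, kind, param in zip(x, y, kinds, params):
        if xf is None or yf is None:
            continue  # weight 0: the variable does not contribute
        if kind == "nominal":
            d = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d = abs(xf - yf) / param
        elif kind == "ordinal":
            # Map ranks to [0, 1], then treat as interval-scaled.
            zx = (xf - 1) / (param - 1)
            zy = (yf - 1) / (param - 1)
            d = abs(zx - zy)
        else:
            raise ValueError(f"unknown variable type: {kind}")
        numerator += d
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is comparable for this pair")
    return numerator / denominator

kinds = ["nominal", "interval", "ordinal"]
params = [None, 40.0, 3]
obj_i = ["red", 10.0, 1]
obj_j = ["blue", 20.0, 3]
print(mixed_dissimilarity(obj_i, obj_j, kinds, params))  # 0.75
```

Each per-variable term lies in [0, 1], so the combined value does too, which keeps variables of different types on a comparable footing.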
5.1. Major approaches
5.2. Partitioning approach
A few variants of k-means differ in:
o Selection of the initial k means
o Dissimilarity calculations
o Strategies to calculate cluster means
Handling categorical data: k-modes (Huang’98)
o Replacing means of clusters with modes
o Using new dissimilarity measures to deal with categorical objects
o Using a frequency-based method to update modes of clusters
o A mixture of categorical and numerical data: k-prototype method
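The k-modes idea listed above can be sketched as follows. This is a simplified illustration in the spirit of k-modes, not Huang's exact algorithm: the dissimilarity is assumed to be a plain count of mismatching attributes, modes are recomputed in batch after each assignment pass, and the initial modes are supplied by the caller. All names and the sample data are hypothetical.

```python
from collections import Counter

def mismatch(x, y):
    """Number of attributes on which two categorical objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def update_mode(members):
    """Frequency-based update: most common category per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def k_modes(objects, initial_modes, max_iter=20):
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign each object to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda k: mismatch(obj, modes[k]))
            for obj in objects
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Replace each cluster's mode; an empty cluster keeps its old mode.
        for k in range(len(modes)):
            members = [o for o, lab in zip(objects, labels) if lab == k]
            if members:
                modes[k] = update_mode(members)
    return labels, modes

objects = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("a", "x")]
labels, modes = k_modes(objects, initial_modes=[("a", "x"), ("b", "z")])
print(labels)  # [0, 0, 1, 1, 0]
print(modes)   # [('a', 'x'), ('b', 'z')]
```

As with k-means, the result depends on the initial modes, which is one of the points on which the variants above differ.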
8.2. Divisive Analysis: DIANA
8.3. Analysis of hierarchical clustering
9.2. Distance-Based Approach