Cluster Analysis - Basic Statistics for Behavioral Sciences - Lecture Notes, Study notes of Statistics for Psychologists

Cluster Analysis, Hierarchical Cluster Analyses, Stimulus, Given Proximity, Pairwise Combinations, Dissimilarity Data, Euclidean Distance, Minkowski Metric, and Correlation are some of the topics covered in these lecture notes.

Typology: Study notes

2011/2012

Uploaded on 11/21/2012

ashakiran
Ch. 14: Cluster Analysis (CA)
I. Situation
A. Given proximity or distance data obtained from all possible
pairwise combinations of stimuli, or regular subject-by-variable
data from a usual data collection, CA forms clusters based on
the distance between stimuli (variables).
B. Many similar techniques have been developed in different
areas (biology, sociology, psychology). SAS lists 11 methods.
C. CA methods can be classified into hierarchical or non-
hierarchical cluster analyses.
1. Hierarchical Cluster Analysis
a) Starts with n clusters (each stimulus is its own cluster).
b) The two closest clusters are merged to form a new
cluster that replaces the two old clusters.
c) Merging two closest clusters is repeated until only
one cluster is left.
d) Different methods have different ways to compute the
distance between two clusters.
2. Non-hierarchical Cluster Analysis: the n stimuli are
partitioned directly into g clusters, without using a
hierarchical method.
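The merging steps a) through c) of hierarchical cluster analysis can be sketched in a few lines of Python. This is only an illustration, not the procedure used by SAS or any particular package: the function names, the choice of single linkage as the default cluster distance, and the 4-stimulus distance matrix are all assumptions made for the example.

```python
def single_link(a, b, dist):
    """Distance between clusters a and b: the smallest pairwise distance."""
    return min(dist[i][j] for i in a for j in b)


def agglomerate(dist, cluster_distance=single_link):
    """Merge the two closest clusters until one is left; return the merge history."""
    # Step a: start with n clusters, one per stimulus.
    clusters = [(i,) for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        # Step b: find the closest pair of clusters.
        d, p, q = min(
            (cluster_distance(clusters[p], clusters[q], dist), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        history.append((clusters[p], clusters[q], d))
        merged = tuple(sorted(clusters[p] + clusters[q]))
        # The merged cluster replaces the two old ones; step c is the loop itself.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return history


# Hypothetical symmetric distance matrix for 4 stimuli (illustrative values).
dist = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
history = agglomerate(dist)
for left, right, d in history:
    print(left, "+", right, "merged at distance", d)
```

Swapping in a different `cluster_distance` function is what step d) refers to: the merging loop stays the same and only the definition of between-cluster distance changes.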
II. Similarity (Proximity) and Dissimilarity (Distance) data
A. Euclidean distance
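$$d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})'(\mathbf{x}-\mathbf{y})} = \sqrt{\sum_{i=1}^{p}(x_i - y_i)^2}$$
The statistical distance weights the differences by the inverse
of the covariance matrix and reduces to the Euclidean distance
when S is the identity matrix:
$$d_S(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})'\,\mathbf{S}^{-1}(\mathbf{x}-\mathbf{y})}$$
where
x = (x1, x2, ..., xp)',
y = (y1, y2, ..., yp)',
p = the number of stimuli (the length of each vector), and
S = sample covariance matrix.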
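As a small numeric check of the Euclidean distance, the sketch below computes it from the sum of squared differences and also as a quadratic form. The function names, the example vectors, and passing an already inverted covariance matrix (`s_inv`) are assumptions made for illustration; with the identity matrix in place of the inverse of S the two results coincide.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def statistical_distance(x, y, s_inv):
    """sqrt((x - y)' S^{-1} (x - y)); s_inv is the inverse covariance matrix."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    quad = sum(diff[i] * s_inv[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(quad)


# Illustrative vectors with p = 3 components.
x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

d_euclid = euclidean(x, y)
d_stat = statistical_distance(x, y, identity)
print(d_euclid, d_stat)
```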
B. Minkowski metric (general formula)
$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{p} |x_i - y_i|^{r} \right)^{1/r}$$
where p = the number of stimuli, and r = the order of the power.
If r = 2, it is the Euclidean distance. If r = 1, it is the
city-block (Manhattan) distance.
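The two special cases of the Minkowski metric can be checked with a short sketch. The function name and the example points are assumptions chosen so the arithmetic is easy to follow by hand.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


# Illustrative points: coordinate differences of 3 and 4.
x = [0.0, 0.0]
y = [3.0, 4.0]

city_block = minkowski(x, y, 1)  # |3| + |4|
euclid = minkowski(x, y, 2)      # sqrt(3^2 + 4^2)
print(city_block, euclid)
```

Larger values of r give more weight to the coordinate with the largest difference.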
C. Correlation (closeness of the shapes)
The squared Euclidean distance $d^2(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p}(x_i - y_i)^2$ can be
expressed as
$$d^2(\mathbf{x}, \mathbf{y}) = (v_x - v_y)^2 + p(\bar{x} - \bar{y})^2 + 2 v_x v_y (1 - r_{xy}),$$
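where
$$v_x = \sqrt{\sum_{i=1}^{p}(x_i - \bar{x})^2}, \quad \bar{x} = \frac{1}{p}\sum_{i=1}^{p} x_i, \quad v_y = \sqrt{\sum_{i=1}^{p}(y_i - \bar{y})^2}, \quad \bar{y} = \frac{1}{p}\sum_{i=1}^{p} y_i,$$
and $r_{xy}$ = Pearson product-moment correlation coefficient.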
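The decomposition of the squared Euclidean distance into a spread term, a level term, and a shape (correlation) term can be verified numerically. The sketch below takes v_x and v_y as the square roots of the sums of squared deviations, which is what the identity requires; the function name and the example profiles are assumptions made for illustration.

```python
import math


def decompose(x, y):
    """Return (direct d^2, d^2 from the decomposition, r_xy) for two profiles."""
    p = len(x)
    mean_x, mean_y = sum(x) / p, sum(y) / p
    v_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    v_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    r_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (v_x * v_y)
    direct = sum((a - b) ** 2 for a, b in zip(x, y))
    rebuilt = (v_x - v_y) ** 2 + p * (mean_x - mean_y) ** 2 + 2 * v_x * v_y * (1 - r_xy)
    return direct, rebuilt, r_xy


# Two illustrative profiles with the same shape but different level and spread.
x = [1.0, 2.0, 3.0, 4.0]
y = [12.0, 14.0, 16.0, 18.0]

direct, rebuilt, r_xy = decompose(x, y)
print(direct, rebuilt, r_xy)
```

In this example the profiles are perfectly correlated, so the shape term vanishes and the whole distance comes from the differences in spread and mean level.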

III. Different methods of making clusters in Hierarchical Cluster Analysis
A. Single Linkage: makes clusters based on the minimum distance between one stimulus in one cluster and one stimulus in the other cluster.
B. Complete Linkage: makes clusters based on the maximum distance between one stimulus in one cluster and one stimulus in the other cluster.
C. Average Linkage: makes clusters based on the average distance between pairs of stimuli, one from each cluster.
D. Centroid Linkage: makes clusters based on the Euclidean distance between the cluster means (centroids).
E. Median Method: to avoid the impact of cluster size, the midpoint of the two merged clusters is used in place of the size-weighted centroid.
F. Ward's Method: makes clusters by merging the pair that gives the minimum increase in SSE.
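The first four linkage rules differ only in how the distance between two clusters is summarized, which a short sketch can make concrete. The function names and the two small clusters of points are assumptions made for illustration; the median method and Ward's method are left out because they depend on the merge history or on the SSE rather than on a simple summary of pairwise distances.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(a, b):
    """All distances between a point in cluster a and a point in cluster b."""
    return [euclidean(x, y) for x in a for y in b]


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return [sum(pt[k] for pt in cluster) / len(cluster) for k in range(len(cluster[0]))]


def single_linkage(a, b):
    return min(pairwise(a, b))


def complete_linkage(a, b):
    return max(pairwise(a, b))


def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)


def centroid_linkage(a, b):
    return euclidean(centroid(a), centroid(b))


# Two illustrative clusters on a line.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[4.0, 0.0], [6.0, 0.0]]

results = {
    "single": single_linkage(cluster_a, cluster_b),
    "complete": complete_linkage(cluster_a, cluster_b),
    "average": average_linkage(cluster_a, cluster_b),
    "centroid": centroid_linkage(cluster_a, cluster_b),
}
print(results)
```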

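IV. Ward's minimum-variance method
A. Model
1. $$D_{AB} = \frac{\|\bar{Y}_A - \bar{Y}_B\|^2}{\frac{1}{n_A} + \frac{1}{n_B}},$$ where
$D_{AB}$: squared distance between cluster A and cluster B,
$\bar{Y}_A$, $\bar{Y}_B$: mean vectors for cluster A and cluster B, and
$n_A$, $n_B$: sample sizes for cluster A and cluster B.
$\|\bar{Y}_A - \bar{Y}_B\| = \sqrt{\sum_{k=1}^{p}(\bar{Y}_{Ak} - \bar{Y}_{Bk})^2}$ = Euclidean length of the difference between the two mean vectors.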

2. For any distance or dissimilarity data, $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p}(x_i - y_i)^2 / 2$. Ward's Method joins the two clusters A and B that minimize $I_{AB}$ (the increase in SSE), which is the same as minimizing the between-cluster distance $D_{AB}$.
$$I_{AB} = SSE_{AB} - (SSE_A + SSE_B),$$ where
$$SSE_A = \sum_{i=1}^{n_A} (\mathbf{y}_i - \bar{Y}_A)'(\mathbf{y}_i - \bar{Y}_A)$$
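The claim that minimizing the increase in SSE is the same as minimizing the between-cluster distance can be checked numerically. The sketch below assumes the usual form of Ward's distance, the squared Euclidean length of the difference in mean vectors divided by (1/n_A + 1/n_B); the function names and the two small clusters are illustrative assumptions.

```python
def mean_vector(cluster):
    """Component-wise mean of the observations in a cluster."""
    n = len(cluster)
    return [sum(obs[k] for obs in cluster) / n for k in range(len(cluster[0]))]


def sse(cluster):
    """Sum of squared deviations of each observation from the cluster mean."""
    m = mean_vector(cluster)
    return sum(sum((v - mk) ** 2 for v, mk in zip(obs, m)) for obs in cluster)


def ward_distance(a, b):
    """Squared distance between mean vectors divided by (1/n_A + 1/n_B)."""
    sq = sum((u - v) ** 2 for u, v in zip(mean_vector(a), mean_vector(b)))
    return sq / (1.0 / len(a) + 1.0 / len(b))


def sse_increase(a, b):
    """I_AB = SSE_AB - (SSE_A + SSE_B), with AB the merged cluster."""
    return sse(a + b) - (sse(a) + sse(b))


# Two illustrative clusters with n_A = 2 and n_B = 3.
cluster_a = [[0.0, 0.0], [2.0, 0.0]]
cluster_b = [[5.0, 1.0], [7.0, 3.0], [6.0, 2.0]]

d_ab = ward_distance(cluster_a, cluster_b)
i_ab = sse_increase(cluster_a, cluster_b)
print(d_ab, i_ab)
```

For two single-observation clusters the denominator is 2, which matches the halved squared distance used for raw dissimilarity data.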