Standardizing Data and Computing Proximities: Methods and Measures, Study notes of Mathematical Statistics

These notes describe the methods the PROXIMITIES procedure offers for standardizing cases or variables: Z, RANGE, RESCALE, MAX, MEAN, and SD. They also cover the ABSOLUTE, REVERSE, and RESCALE transformations and introduce proximity measures for continuous, frequency count, and binary data, such as EUCLID, CORRELATION, and RR.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
PROXIMITIES

Standardizing Cases or Variables

Either cases or variables can be standardized. The following methods of standardization are available:

Z

PROXIMITIES subtracts the mean from each value for the variable or case being standardized and then divides by the standard deviation of the values. If a standard deviation is 0, PROXIMITIES sets all values for the case or variable to 0.

RANGE

PROXIMITIES divides each value for the variable or case being standardized by the range of the values. If the range is 0, PROXIMITIES leaves all values unchanged.

RESCALE

From each value for the variable or case being standardized, PROXIMITIES subtracts the minimum value and then divides by the range. If a range is 0, PROXIMITIES sets all values for the case or variable to 0.50.

MAX

PROXIMITIES divides each value for the variable or case being standardized by the maximum of the values. If the maximum of a set of values is 0, PROXIMITIES uses an alternate process to produce a comparable standardization: it divides by the absolute magnitude of the smallest value and adds 1.

MEAN

PROXIMITIES divides each value for the variable or case being standardized by the mean of the values. If a mean is 0, PROXIMITIES adds one to all values for the case or variable to produce a mean of 1.

SD

PROXIMITIES divides each value for the variable or case being standardized by the standard deviation of the values. PROXIMITIES does not change the values if their standard deviation is 0.
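
The six rules above, including their zero-denominator fallbacks, can be sketched in Python. This is an illustration rather than the procedure's own code: the function name `standardize` is hypothetical, the use of the sample standard deviation (n - 1 divisor) is an assumption, and the all-zero case under MAX, which the text does not cover, is left unchanged here. At least two values are needed because of the standard deviation.

```python
from statistics import mean, stdev


def standardize(values, method):
    """Standardize one case or variable (a list of numbers) by the named method."""
    lo, hi = min(values), max(values)
    rng = hi - lo
    avg = mean(values)
    sd = stdev(values)  # assumption: sample standard deviation (n - 1 divisor)

    if method == "Z":
        return [0.0] * len(values) if sd == 0 else [(v - avg) / sd for v in values]
    if method == "RANGE":
        return list(values) if rng == 0 else [v / rng for v in values]
    if method == "RESCALE":
        return [0.5] * len(values) if rng == 0 else [(v - lo) / rng for v in values]
    if method == "MAX":
        if hi != 0:
            return [v / hi for v in values]
        if lo == 0:
            # All values are 0; not covered by the text, so left unchanged (assumption).
            return list(values)
        # Alternate process: divide by |smallest value| and add 1.
        return [v / abs(lo) + 1 for v in values]
    if method == "MEAN":
        return [v + 1 for v in values] if avg == 0 else [v / avg for v in values]
    if method == "SD":
        return list(values) if sd == 0 else [v / sd for v in values]
    raise ValueError(f"unknown method: {method}")
```

For example, `standardize([-4, -2, 0], "MAX")` takes the alternate path because the maximum is 0, and the result has a maximum of 1, as ordinary MAX standardization would.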

Transformations

Three transformations are available for the values PROXIMITIES computes or reads:

ABSOLUTE

Take the absolute values of the proximities.

REVERSE

Transform similarity values into dissimilarities, or vice versa, by changing the signs of the coefficients.

RESCALE

RESCALE standardizes the proximities by first subtracting the smallest value and then dividing by the range.

If you specify more than one transformation, PROXIMITIES does them in the order listed above: first ABSOLUTE, then REVERSE, then RESCALE.
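
The fixed order of the transformations matters, because REVERSE after ABSOLUTE gives a different result than the other way round. A minimal Python sketch follows; the function name `transform` and its keyword flags are hypothetical, and leaving the values alone when the range is 0 is an assumption, since the text does not say what happens in that case.

```python
def transform(proximities, absolute=False, reverse=False, rescale=False):
    """Apply the requested transformations in the fixed order
    ABSOLUTE, then REVERSE, then RESCALE."""
    values = list(proximities)
    if absolute:
        values = [abs(v) for v in values]
    if reverse:
        values = [-v for v in values]
    if rescale:
        lo, hi = min(values), max(values)
        if hi != lo:  # assumption: a zero range leaves the values unchanged
            values = [(v - lo) / (hi - lo) for v in values]
    return values
```

With all three flags set, the input `[-2, 1, 4]` becomes `[2, 1, 4]`, then `[-2, -1, -4]`, and finally values between 0 and 1 with the originally largest magnitude mapped to 0.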

CHEBYCHEV

The distance between two items is the maximum absolute difference between the values for the items.

$$\mathrm{CHEBYCHEV}(x, y) = \max_i \lvert x_i - y_i \rvert$$

BLOCK

The distance between two items is the sum of the absolute differences between the values for the items.

$$\mathrm{BLOCK}(x, y) = \sum_i \lvert x_i - y_i \rvert$$

MINKOWSKI(p)

The distance between two items is the pth root of the sum of the absolute differences to the pth power between the values for the items.

$$\mathrm{MINKOWSKI}(x, y) = \left( \sum_i \lvert x_i - y_i \rvert^{p} \right)^{1/p}$$

POWER(p, r)

The distance between two items is the rth root of the sum of the absolute differences to the pth power between the values for the items.

$$\mathrm{POWER}(x, y) = \left( \sum_i \lvert x_i - y_i \rvert^{p} \right)^{1/r}$$
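
The four distances above differ only in how the absolute differences are combined, which a short Python sketch makes visible. The function names are illustrative and the inputs are assumed to be equal-length sequences of numbers. MINKOWSKI is written as the special case of POWER with r = p.

```python
def chebychev(x, y):
    """Maximum absolute difference between paired values."""
    return max(abs(a - b) for a, b in zip(x, y))


def block(x, y):
    """Sum of absolute differences between paired values."""
    return sum(abs(a - b) for a, b in zip(x, y))


def power(x, y, p, r):
    """r-th root of the sum of absolute differences raised to the p-th power."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)


def minkowski(x, y, p):
    """POWER with r equal to p."""
    return power(x, y, p, p)
```

For x = (0, 0) and y = (3, 4), the differences are 3 and 4, so CHEBYCHEV is 4, BLOCK is 7, and MINKOWSKI with p = 2 is the ordinary Euclidean distance 5.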

Measures for Frequency Count Data

CHISQ

The magnitude of this dissimilarity measure depends on the total frequencies of the two cases or variables whose proximity is computed. Expected values are from the model of independence of cases (or variables), x and y.

$$\mathrm{CHISQ}(x, y) = \sqrt{ \sum_i \frac{\left( x_i - E(x_i) \right)^2}{E(x_i)} + \sum_i \frac{\left( y_i - E(y_i) \right)^2}{E(y_i)} }$$

PH

This is the CHISQ measure normalized by the square root of the combined frequency. Therefore, its value does not depend on the total frequencies of the two cases or variables whose proximity is computed.

$$\mathrm{PH}(x, y) = \frac{\mathrm{CHISQ}(x, y)}{\sqrt{N}}$$

Here N is the combined frequency of x and y.
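
As a rough illustration of the two frequency-count measures, the Python sketch below treats x and y as the two rows of a contingency table and takes the expected values from the independence model, that is, column total times row total divided by the combined frequency N. The function names are illustrative. Skipping categories in which both counts are 0 is an assumption, and both items are assumed to have a positive total.

```python
from math import sqrt


def chisq(x, y):
    """Chi-square dissimilarity between two vectors of frequency counts."""
    total_x, total_y = sum(x), sum(y)
    n = total_x + total_y
    stat = 0.0
    for xi, yi in zip(x, y):
        col = xi + yi
        if col == 0:  # assumption: skip categories with no observations
            continue
        ex = col * total_x / n  # expected count for x under independence
        ey = col * total_y / n  # expected count for y under independence
        stat += (xi - ex) ** 2 / ex + (yi - ey) ** 2 / ey
    return sqrt(stat)


def ph(x, y):
    """CHISQ divided by the square root of the combined frequency."""
    return chisq(x, y) / sqrt(sum(x) + sum(y))
```

Doubling every count, for instance going from x = (10, 20), y = (20, 10) to x = (20, 40), y = (40, 20), increases CHISQ but leaves PH unchanged, which is the normalization the text describes.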

Measures for Binary Data

PROXIMITIES constructs a 2 × 2 contingency table for each pair of items in turn. It uses this table to compute a proximity measure for the pair.

                     Item 2
                     Present   Absent
Item 1   Present     a         b
         Absent      c         d

PROXIMITIES computes all binary measures from the values of a, b, c, and d. These values are tallies across variables (when the items are cases) or tallies across cases (when the items are variables).
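
The tallying step can be sketched in Python as follows. The function name `tally` is illustrative, and any truthy value is assumed to mean "present".

```python
def tally(item1, item2):
    """Count the four cells of the 2 x 2 table for two binary sequences.

    a: present in both        b: present in item 1 only
    c: present in item 2 only d: absent from both
    """
    a = b = c = d = 0
    for p, q in zip(item1, item2):
        if p and q:
            a += 1
        elif p:
            b += 1
        elif q:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

For item 1 = (1, 1, 0, 0, 1) and item 2 = (1, 0, 1, 0, 1), the positions tally to a = 2, b = 1, c = 1, and d = 1.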

Rogers and Tanimoto Similarity Measure

$$\mathrm{RT}(x, y) = \frac{a + d}{a + d + 2(b + c)}$$

Sokal and Sneath Similarity Measure 2

$$\mathrm{SS2}(x, y) = \frac{a}{a + 2(b + c)}$$

Kulczynski Similarity Measure 1

This measure has a minimum value of 0 and no upper limit. It is undefined when there are no nonmatches (b = 0 and c = 0). Therefore, PROXIMITIES assigns an artificial upper limit of 9999.999 to K1 when it is undefined or exceeds this value.

$$\mathrm{K1}(x, y) = \frac{a}{b + c}$$

Sokal and Sneath Similarity Measure 3

This measure has a minimum value of 0, has no upper limit, and is undefined when there are no nonmatches (b = 0 and c = 0). As with K1, PROXIMITIES assigns an artificial upper limit of 9999.999 to SS3 when it is undefined or exceeds this value.

$$\mathrm{SS3}(x, y) = \frac{a + d}{b + c}$$
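
The artificial upper limit shared by K1 and SS3 can be sketched in Python like this. The helper `capped_ratio` and the constant name `CAP` are hypothetical; only the value 9999.999 and the two ratios come from the text.

```python
CAP = 9999.999  # artificial upper limit given in the text


def capped_ratio(numerator, nonmatches):
    """Return numerator / nonmatches, or CAP when the ratio is
    undefined (no nonmatches) or exceeds CAP."""
    if nonmatches == 0:
        return CAP
    return min(numerator / nonmatches, CAP)


def k1(a, b, c, d):
    """Kulczynski similarity measure 1: a / (b + c)."""
    return capped_ratio(a, b + c)


def ss3(a, b, c, d):
    """Sokal and Sneath similarity measure 3: (a + d) / (b + c)."""
    return capped_ratio(a + d, b + c)
```

With a = 2, b = 1, c = 1, d = 1, K1 is 2 / 2 = 1 and SS3 is 3 / 2 = 1.5. With b = c = 0 both functions return the cap.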

Conditional Probabilities

The following three binary measures yield values that you can interpret in terms of conditional probability. All three are similarity measures.

Kulczynski Similarity Measure 2

This yields the average conditional probability that a characteristic is present in one item given that the characteristic is present in the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.

$$\mathrm{K2}(x, y) = \frac{1}{2} \left( \frac{a}{a + b} + \frac{a}{a + c} \right)$$

Sokal and Sneath Similarity Measure 4

This yields the conditional probability that a characteristic of one item is in the same state (present or absent) as the characteristic of the other item. The measure is an average over both items acting as predictors. It has a range of 0 to 1.

$$\mathrm{SS4}(x, y) = \frac{1}{4} \left( \frac{a}{a + b} + \frac{a}{a + c} + \frac{d}{b + d} + \frac{d}{c + d} \right)$$

Hamann Similarity Measure

This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of –1 to +1 and is monotonically related to SM, SS1, and RT.

$$\mathrm{HAMANN}(x, y) = \frac{(a + d) - (b + c)}{a + b + c + d}$$
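
The three conditional-probability measures can be written directly from the cell counts, as in the Python sketch below. The function names are illustrative, and the sketch does not guard against zero denominators (for example a + b = 0), since the text does not say how such cases are handled.

```python
def k2(a, b, c, d):
    """Average of P(present in one | present in the other), both directions."""
    return (a / (a + b) + a / (a + c)) / 2


def ss4(a, b, c, d):
    """Average of the four conditional probabilities of a matching state."""
    return (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d)) / 4


def hamann(a, b, c, d):
    """P(same state) minus P(different state)."""
    return ((a + d) - (b + c)) / (a + b + c + d)
```

With a = 2, b = 1, c = 1, d = 1: K2 averages 2/3 and 2/3, SS4 averages 2/3, 2/3, 1/2, and 1/2, and HAMANN is (3 - 2) / 5.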

Predictability Measures

The following four binary measures assess the association between items as the predictability of one given the other. All four measures yield similarities.

Yule’s Q (Similarity)

This is the 2 × 2 version of Goodman and Kruskal’s ordinal measure gamma. Like Yule’s Y, Q is a function of the cross-product ratio for a 2 × 2 table and has a range of –1 to +1.

$$\mathrm{Q}(x, y) = \frac{ad - bc}{ad + bc}$$
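
To illustrate the link to the cross-product ratio, the Python sketch below computes Q from the cell counts and, separately, from the ratio ad / bc using the standard identity Q = (ratio - 1) / (ratio + 1). The function names are illustrative, and the sketch does not handle ad + bc = 0 or bc = 0.

```python
def yule_q(a, b, c, d):
    """Yule's Q from the four cell counts."""
    return (a * d - b * c) / (a * d + b * c)


def yule_q_from_ratio(ratio):
    """Yule's Q from the cross-product ratio ad / bc."""
    return (ratio - 1) / (ratio + 1)
```

With a = 4, b = 1, c = 2, d = 3, the cross-product ratio is 12 / 2 = 6 and both functions give 5 / 7.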

Other Binary Measures

The remaining binary measures available in PROXIMITIES are either binary equivalents of association measures for continuous variables or measures of special properties of the relation between items.

Ochiai Similarity Measure

This is the binary form of the cosine. It has a range of 0 to 1 and is a similarity measure.

$$\mathrm{OCHIAI}(x, y) = \sqrt{ \left( \frac{a}{a + b} \right) \left( \frac{a}{a + c} \right) }$$

Sokal and Sneath Similarity Measure 5

This is a similarity measure. Its range is 0 to 1.

$$\mathrm{SS5}(x, y) = \frac{ad}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$

Fourfold Point Correlation (Similarity)

This is the binary form of the Pearson product-moment correlation coefficient. Phi is a similarity measure, and its range is –1 to +1.

$$\mathrm{PHI}(x, y) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}}$$
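
OCHIAI, SS5, and PHI are all built from the same four marginal totals, which the Python sketch below makes explicit. The function names, including the helper `_margin_root`, are illustrative, and none of the functions guard against an empty margin (a zero denominator).

```python
from math import sqrt


def _margin_root(a, b, c, d):
    """Square root of the product of the four marginal totals."""
    return sqrt((a + b) * (a + c) * (b + d) * (c + d))


def ochiai(a, b, c, d):
    """Binary form of the cosine."""
    return sqrt((a / (a + b)) * (a / (a + c)))


def ss5(a, b, c, d):
    """Sokal and Sneath similarity measure 5."""
    return (a * d) / _margin_root(a, b, c, d)


def phi(a, b, c, d):
    """Fourfold point correlation."""
    return (a * d - b * c) / _margin_root(a, b, c, d)
```

With a = 2, b = 1, c = 1, d = 1, the margin product is 3 · 3 · 2 · 2 = 36, so SS5 is 2 / 6 and PHI is (2 - 1) / 6, while OCHIAI is the square root of (2/3)(2/3).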

Binary Euclidean Distance

This is a distance measure. Its minimum value is 0, and it has no upper limit.

$$\mathrm{BEUCLID}(x, y) = \sqrt{b + c}$$

Binary Squared Euclidean Distance

This is also a distance measure. Its minimum value is 0, and it has no upper limit.

$$\mathrm{BSEUCLID}(x, y) = b + c$$

Size Difference

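This is a dissimilarity measure with a minimum value of 0. Because the absolute difference between b and c cannot exceed a + b + c + d, its value cannot exceed 1.

$$\mathrm{SIZE}(x, y) = \frac{(b - c)^2}{(a + b + c + d)^2}$$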

Pattern Difference

This is also a dissimilarity measure. Its range is 0 to 1.

$$\mathrm{PATTERN}(x, y) = \frac{bc}{(a + b + c + d)^2}$$

Binary Shape Difference

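This is also a dissimilarity measure. Its value lies between 0 and 1, because the numerator below is never negative and never exceeds the square of a + b + c + d.

$$\mathrm{BSHAPE}(x, y) = \frac{(a + b + c + d)(b + c) - (b - c)^2}{(a + b + c + d)^2}$$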
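
The five binary distance and dissimilarity measures above depend only on the nonmatch counts b and c and on the total number of tallies, as this Python sketch shows. The function names are illustrative, and the total a + b + c + d is assumed to be positive.

```python
from math import sqrt


def beuclid(a, b, c, d):
    """Binary Euclidean distance."""
    return sqrt(b + c)


def bseuclid(a, b, c, d):
    """Binary squared Euclidean distance."""
    return b + c


def size(a, b, c, d):
    """Size difference."""
    return (b - c) ** 2 / (a + b + c + d) ** 2


def pattern(a, b, c, d):
    """Pattern difference."""
    return (b * c) / (a + b + c + d) ** 2


def bshape(a, b, c, d):
    """Binary shape difference."""
    n = a + b + c + d
    return (n * (b + c) - (b - c) ** 2) / n ** 2
```

With a = 2, b = 3, c = 1, d = 4, there are 10 tallies and 4 nonmatches, so BEUCLID is 2, BSEUCLID is 4, SIZE is 4 / 100, PATTERN is 3 / 100, and BSHAPE is (40 - 4) / 100.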