Multiple Correspondence Analysis: A Technique for Analyzing Categorical Data, Study notes of Mathematical Statistics

Multiple correspondence analysis (MCA) is a statistical method for exploring relationships among multiple categorical variables. It assigns numerical values to objects and categories so that the data can be visualized in a low-dimensional space in which objects within the same category are close together and objects in different categories are far apart. These notes cover the steps involved in performing MCA, including calculating object scores and category quantifications and updating these values iteratively to minimize the objective function.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
sangawar 🇮🇳



MULTIPLE CORRESPONDENCE

Multiple Correspondence Analysis, also known as homogeneity analysis, quantifies nominal (categorical) data by assigning numerical values to the cases (objects) and categories, such that in the low-dimensional representation of the data, objects within the same category are close together and objects in different categories are far apart. Each object is as close as possible to the category points of categories that apply to the object. In this way, the categories divide the objects into homogeneous subgroups. Variables are considered homogeneous when they classify objects that are in the same categories into the same subgroups.

Notation

The following notation is used throughout this chapter unless otherwise stated:

$n$: Number of analysis cases (objects)

$n_w$: Weighted number of analysis cases: $n_w = \sum_{i=1}^{n} w_i$

$n_{tot}$: Total number of cases (analysis + supplementary)

$w_i$: Weight of object $i$; $w_i = 1$ if cases are unweighted; $w_i = 0$ if object $i$ is supplementary.

$W$: Diagonal $n_{tot} \times n_{tot}$ matrix, with $w_i$ on the diagonal.

$m$: Number of analysis variables

$m_w$: Weighted number of analysis variables ($m_w = \sum_{j=1}^{m} v_j$)

$m_{tot}$: Total number of variables (analysis + supplementary)

$H$: The data matrix (category indicators), of order $n_{tot} \times m_{tot}$, after discretization, imputation of missings, and listwise deletion, if applicable.

$p$: Number of dimensions

For variable $j$, $j = 1, \ldots, m_{tot}$:

$v_j$: Variable weight; $v_j = 1$ if the weight for variable $j$ is not specified or if variable $j$ is supplementary

$k_j$: Number of categories of variable $j$ (number of distinct values in $h_j$, thus including supplementary objects)

$G_j$: Indicator matrix for variable $j$, of order $n_{tot} \times k_j$. The elements of $G_j$ are defined, for $i = 1, \ldots, n_{tot}$ and $r = 1, \ldots, k_j$, as

$$g_{(j)ir} = \begin{cases} 1 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \\ 0 & \text{when the } i\text{th object is not in the } r\text{th category of variable } j \end{cases}$$

$D_j$: Diagonal $k_j \times k_j$ matrix containing the weighted univariate marginals, i.e., the weighted column sums of $G_j$ ($D_j = G_j' W G_j$)

$M_j$: Diagonal $n_{tot} \times n_{tot}$ matrix, with diagonal elements defined as

$$m_{(j)ii} = \begin{cases} 0 & \text{when the } i\text{th observation is missing and the missing strategy for variable } j \text{ is passive} \\ 0 & \text{when the } i\text{th object is in the } r\text{th category of variable } j \text{ and the } r\text{th category is only used by supplementary objects (i.e., when } d_{(j)rr} = 0\text{)} \\ v_j & \text{otherwise} \end{cases}$$

$M_* = \sum_j M_j$
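As a minimal sketch of the notation above, the following NumPy snippet builds an indicator matrix $G_j$ and the diagonal matrix of weighted marginals $D_j = G_j' W G_j$. The function name `indicator_matrix`, the toy data, and the choice to give missing codes (values ≤ 0) an all-zero row are illustrative assumptions, not part of the procedure itself.

```python
import numpy as np

def indicator_matrix(h, k):
    """Return the n_tot x k indicator matrix for category codes h in 1..k.

    Assumption: codes <= 0 mark missing values and produce an all-zero row.
    """
    h = np.asarray(h)
    G = np.zeros((h.size, k))
    valid = h > 0
    G[np.nonzero(valid)[0], h[valid] - 1] = 1.0
    return G

# Hypothetical data: 5 objects, one variable with 3 categories.
h_j = np.array([1, 3, 2, 3, 0])            # the last value is missing
w = np.array([1.0, 1.0, 2.0, 1.0, 1.0])    # object weights (diagonal of W)

G_j = indicator_matrix(h_j, k=3)
D_j = G_j.T @ np.diag(w) @ G_j             # D_j = G_j' W G_j
```

Because every row of $G_j$ contains at most one nonzero entry, $D_j$ comes out diagonal, with the weighted category frequencies on its diagonal.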

The quantification matrices and parameter vectors are:

$X$: Object scores, of order $n_{tot} \times p$

$X_w$: Weighted object scores ($X_w = WX$)

Grouping equal intervals of specified size

First the intervals are defined as lowest value + interval size, lowest value + 2 × interval size, etc. Then cases with values in the $k$th interval are assigned to category $k$.
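The grouping rule can be sketched in a few lines of NumPy. The function name `group_equal_intervals` and the sample values are hypothetical, and the text does not say which interval a value lying exactly on a boundary belongs to; this sketch assumes intervals are closed on the left.

```python
import numpy as np

def group_equal_intervals(x, size):
    """Assign each value to category k when it falls in the k-th interval.

    Assumption: interval k is [lowest + (k-1)*size, lowest + k*size).
    """
    x = np.asarray(x, dtype=float)
    lowest = x.min()
    return np.floor((x - lowest) / size).astype(int) + 1

values = np.array([2.0, 3.5, 4.0, 7.9, 8.0])
categories = group_equal_intervals(values, size=2.0)
```

With an interval size of 2 and a lowest value of 2, the values 2.0 and 3.5 share category 1, while 4.0 starts category 2 under the left-closed assumption.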

Imputation of Missing Values

When there are variables with missing values specified to be treated as active (impute mode or extra category), the $k_j$'s for these variables are first computed before listwise deletion. Next, the category indicator with the highest weighted frequency (the mode; the smallest if multiple modes exist) or $k_j + 1$ (extra category) is imputed. Then listwise deletion is applied, if applicable. Finally, the $k_j$'s are adjusted.
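A rough sketch of the two active imputation strategies follows. The function name `impute_missing`, the `strategy` argument, and the convention that codes ≤ 0 mark missing values are assumptions made for illustration; listwise deletion and the final adjustment of $k_j$ are not shown.

```python
import numpy as np

def impute_missing(h, w, k, strategy="mode"):
    """Impute missing category codes (assumed to be coded <= 0).

    strategy="mode":  impute the category with the highest weighted frequency,
                      taking the smallest category when several modes exist.
    strategy="extra": impute the extra category k + 1.
    """
    h = np.asarray(h).copy()
    w = np.asarray(w, dtype=float)
    missing = h <= 0
    if strategy == "extra":
        h[missing] = k + 1
        return h
    # Weighted frequency of categories 1..k over the non-missing objects.
    freq = np.bincount(h[~missing], weights=w[~missing], minlength=k + 1)[1:]
    mode = int(np.argmax(freq)) + 1   # argmax returns the first index on ties
    h[missing] = mode
    return h

h_j = np.array([1, 2, 2, 0, 1, 3])
unit_w = np.ones(6)
by_mode = impute_missing(h_j, unit_w, k=3)                  # categories 1 and 2 tie
by_extra = impute_missing(h_j, unit_w, k=3, strategy="extra")
```

Relying on `np.argmax` returning the first maximal index is what implements the "smallest if multiple modes exist" rule here.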

Configuration

MULTIPLE CORRESPONDENCE can read a configuration from a file to be used as the initial configuration or as a fixed configuration in which to fit variables. For an initial configuration, see step 1 in the Objective Function Optimization section. A fixed configuration $X$ is centered and orthonormalized as described in the optimization section in step 3 (with $X$ instead of $Z$) and step 4 (except for the factor $n_w^{1/2}$), and the result is postmultiplied with $\Lambda^{1/2}$ (this leaves the configuration unchanged if it is already centered and orthogonal). The analysis variables are set to supplementary and variable weights are set to one. Then MULTIPLE CORRESPONDENCE proceeds as described in the Supplementary Variables section.

Objective Function Optimization

Objective Function

The MULTIPLE CORRESPONDENCE objective is to find object scores $X$ and a set of $Y_j$ (for $j = 1, \ldots, m$) so that the function

$$\sigma(X; Y) = (n_w p)^{-1} \sum_j \mathrm{tr}\left( (X - G_j Y_j)' M_j W (X - G_j Y_j) \right)$$

is minimal, under the normalization restriction $X' M_* W X = n_w m_w I$ ($I$ is the $p \times p$ identity matrix). The inclusion of $M_j$ in $\sigma(X; Y)$ ensures that there is no influence of passive missing values (missing values in variables that have missing option passive, or missing option not specified). $M_*$ contains the number of active data values for each object. The object scores are also centered; i.e., they satisfy $u' M_* W X = 0$, with $u$ denoting an $n$-vector of ones.
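The loss can be evaluated directly once $X$, the $G_j$, and the $Y_j$ are available. The sketch below assumes the loss is the sum over variables of $\mathrm{tr}((X - G_j Y_j)' M_j W (X - G_j Y_j))$ scaled by $1/(n_w p)$; the scaling factor, the function name `mca_loss`, and the toy inputs are assumptions, and the diagonal matrices $M_j$ and $W$ are passed as vectors.

```python
import numpy as np

def mca_loss(X, G_list, Y_list, m_list, w):
    """Weighted squared distance between object scores and category points.

    m_list holds the diagonal of each M_j; w holds the diagonal of W.
    Assumption: the overall scale factor is 1 / (n_w * p).
    """
    n_w = w.sum()
    p = X.shape[1]
    total = 0.0
    for G, Y, m in zip(G_list, Y_list, m_list):
        R = X - G @ Y                                 # residuals for variable j
        total += np.sum((m * w)[:, None] * R ** 2)    # tr(R' M_j W R)
    return total / (n_w * p)

X = np.array([[1.0], [-1.0]])
G = np.eye(2)                          # each object in its own category
ones = np.ones(2)
loss_perfect = mca_loss(X, [G], [X.copy()], [ones], ones)
loss_zero_y = mca_loss(X, [G], [np.zeros((2, 1))], [ones], ones)
```

When every category point coincides with the objects it contains, the loss is zero; setting a diagonal entry of $M_j$ to zero removes that object's residual for variable $j$, which is how passive missing values drop out.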

Optimization

Optimization is achieved by executing the following iteration scheme:

  1. Initialization
  2. Update category quantifications
  3. Update object scores
  4. Orthonormalization
  5. Convergence test: repeat (2) through (4) or continue
  6. Rotation

Steps (1) through (6) are explained below.

(1) Initialization

If an initial configuration is not specified, the object scores $X$ are initialized with random numbers. Then $X$ is orthonormalized (see step 4) so that $u' M_* W X = 0$ and $X' M_* W X = n_w m_w I$, yielding $X_w^{+}$.
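One way to obtain random starting scores that satisfy both restrictions is sketched below. It centers with the weights on the diagonal of $M_* W$ and then rescales with a Cholesky factor; this is a swapped-in technique chosen for brevity, not necessarily what step 4 of the procedure does. The function name `center_and_orthonormalize` and the toy dimensions are hypothetical.

```python
import numpy as np

def center_and_orthonormalize(X, m_star, w, m_w):
    """Return scores with u'M*WX = 0 and X'M*WX = n_w * m_w * I.

    m_star and w hold the diagonals of M* and W. The Cholesky-based
    rescaling is an illustrative choice, not the procedure's own step 4.
    """
    c = m_star * w                         # diagonal of M* W
    n_w = w.sum()                          # weighted number of analysis cases
    X = X - (c @ X) / c.sum()              # weighted centering
    A = X.T @ (c[:, None] * X)             # X' M* W X before rescaling
    L = np.linalg.cholesky(A)              # A = L L'
    return np.sqrt(n_w * m_w) * np.linalg.solve(L, X.T).T   # X L^{-T}

rng = np.random.default_rng(0)
n, p, m = 6, 2, 3                          # objects, dimensions, variables
w = np.ones(n)                             # unweighted cases
m_star = np.full(n, float(m))              # no missing values, unit variable weights
X0 = center_and_orthonormalize(rng.normal(size=(n, p)), m_star, w, m_w=float(m))
```

Postmultiplying the centered scores by $L^{-T}$ keeps them centered, because each new column is a linear combination of centered columns.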

(2) Update category quantifications; loop across variables $j = 1, \ldots, m$ (variables $1, \ldots, m$ are analysis variables):

With fixed current values $X_w^{+}$, the update of $Y_j$ is

$$Y_j^{+} = D_j^{-1} G_j' X_w^{+}$$
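The update places each category point at the weighted centroid of the object scores in that category. A minimal sketch, assuming every category of the variable contains at least one object with nonzero weight (so that $D_j$ is invertible); the function name and toy data are hypothetical.

```python
import numpy as np

def update_quantifications(G, w, X):
    """Compute D_j^{-1} G_j' W X: weighted category centroids of the object scores."""
    Xw = w[:, None] * X                  # X_w = W X
    d = G.T @ w                          # diagonal of D_j
    return (G.T @ Xw) / d[:, None]

# Four objects in categories 1, 1, 2, 2 with one-dimensional scores.
G = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
w = np.array([1.0, 3.0, 1.0, 1.0])
X = np.array([[0.0], [4.0], [2.0], [6.0]])
Y = update_quantifications(G, w, X)
```

In this example the first category centroid is pulled toward the second object because that object carries three times the weight of the first.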

(3) Update object scores

First the auxiliary score matrix $Z$ is computed as

Supplementary Objects

To compute the object scores for supplementary objects, steps (2) and (3) are repeated after convergence, with the zeros in $W$ temporarily set to ones in computing $Z$ and $X^{+}$. If a supplementary object has missing values, passive treatment is applied.

Supplementary Variables

The quantifications for supplementary variables are computed after convergence, by executing step (2) once.

Diagnostics

Maximum Rank (may be issued as a warning when exceeded)

The maximum rank $p_{max}$ indicates the maximum number of dimensions that can be computed for any data set. In general

$$p_{max} = \min\left\{ n - 1,\ \Big(\sum_j k_j\Big) - m \right\}$$

if there are no variables with missing values to be treated as passive. If there are variables with missing values to be treated as passive, the maximum rank is

$$p_{max} = \min\left\{ n - 1,\ \Big(\sum_j k_j\Big) - \max(m_1, 1) \right\}$$

with $m_1$ the number of variables without missing values to be treated as passive.
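Both cases of the maximum-rank rule fit in one small function. This is only a sketch of the arithmetic: the name `max_rank` and its arguments are hypothetical, and the category counts passed in are assumed to already exclude categories used only by supplementary objects.

```python
def max_rank(n, k, m1=None):
    """Maximum number of dimensions for n analysis objects.

    k  : list with the number of categories k_j of each analysis variable
    m1 : number of variables without passive missing values;
         None means no variable has passive missing values.
    """
    m = len(k)
    if m1 is None:
        return min(n - 1, sum(k) - m)
    return min(n - 1, sum(k) - max(m1, 1))

rank_complete = max_rank(100, [3, 4, 2])        # no passive missing values
rank_passive = max_rank(100, [3, 4, 2], m1=0)   # every variable has passive missings
rank_small_n = max_rank(5, [3, 4, 2])           # few objects: n - 1 is the bound
```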

Here $k_j$ is exclusive of supplementary objects (that is, a category used only by supplementary objects is not counted in computing the maximum rank). Although the number of nontrivial dimensions may be less than $p_{max}$ when $m = 2$, MULTIPLE CORRESPONDENCE does allow dimensionalities all the way up to $p_{max}$. When, due to empty categories in the actual data, the rank deteriorates below the specified dimensionality, the program stops.

Descriptives

The descriptives table gives the weighted univariate marginals and the weighted number of missing values (system missing, user-defined missing, and values $\le 0$) for each variable.

Fit and Loss Measures

When the HISTORY option is in effect, the following fit and loss measures are reported:

(a) Fit (VAF). This is the quantity TFIT as defined in step (5).

(b) Loss. This is $\sigma(X; Y)$.

Model Summary

Cronbach’s Alpha

Cronbach’s Alpha per dimension ($s = 1, \ldots, p$):

$$\alpha_s = \frac{m_w \left( \lambda_s^{1/2} - 1 \right)}{\lambda_s^{1/2} \left( m_w - 1 \right)}.$$

Total Cronbach’s Alpha is

$$\alpha = \frac{m_w \left( \sum_s \lambda_s^{1/2} - 1 \right)}{\sum_s \lambda_s^{1/2} \left( m_w - 1 \right)}$$

with $\lambda_s$ the $s$th diagonal element of $\Lambda$ as computed in step (4) during the last iteration.
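As a small worked illustration of the per-dimension formula, the function below takes the eigenvalue of a dimension (that is, $\lambda_s^{1/2}$) and the weighted number of variables $m_w$. The function name and the example values are hypothetical.

```python
def cronbach_alpha(eigenvalue, m_w):
    """Per-dimension Cronbach's alpha from the eigenvalue lambda_s^(1/2)."""
    return m_w * (eigenvalue - 1.0) / (eigenvalue * (m_w - 1.0))

# Four unit-weighted variables; eigenvalues of 2.0 and 1.0 for two dimensions.
alphas = [cronbach_alpha(e, m_w=4.0) for e in (2.0, 1.0)]
```

An eigenvalue of exactly 1 gives an alpha of 0, and eigenvalues below 1 give negative values, so only dimensions with eigenvalues above 1 have positive alpha.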

Variance Accounted For

Variance Accounted For per dimension ($s = 1, \ldots, p$):

$$\mathrm{VAF}_s = n_w^{-1} \sum_j v_j \, \mathrm{tr}\left( y_{(j)s}' D_j \, y_{(j)s} \right)$$

(% of variance is $\mathrm{VAF}_s \times 100 / m_w$),

Eigenvalue

Eigenvalue per dimension:

$$\lambda_s^{1/2} = \mathrm{VAF}_s,$$

After transformation

After transformation, $p$ correlation matrices are computed ($s = 1, \ldots, p$):

$$R_{(s)} = n_w^{-1} Q_{(s)}' W Q_{(s)},$$

with $q_{(s)j} = n_w^{1/2} G_j \, y_{(j)s} \left( y_{(j)s}' D_j \, y_{(j)s} \right)^{-1/2}$.

Usually, for the higher eigenvalues, the first eigenvalue of $R_{(s)}$ is equal to $\lambda_s^{1/2}$ (see Model Summary section). The lower values of $\Lambda^{1/2}$ are in most cases the second or subsequent eigenvalues of $R_{(s)}$.

If there are missing values specified to be treated as passive, the mode of the quantified variable or the quantification of an extra category (as specified in syntax; if not specified, the default (mode) is used) is imputed before computing correlations. Then the eigenvalues of the correlation matrix do not equal $\Lambda^{1/2}$ (see Model Summary section). The quantification of an extra category is computed as

$$y_{(j)(k_j+1)s} = \left( \sum_{i \in I} w_i \right)^{-1} \sum_{i \in I} w_i x_{is},$$

with $I$ an index set recording which objects have missing values.

For the eigenvalue decomposition of $R$ (to compute the eigenvalues), first row $j$ and column $j$ are removed from $R$ if $j$ is a supplementary variable, and then $r_{ij}$ is multiplied by $(v_i v_j)^{1/2}$.
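The extra-category quantification described in this section is the weighted mean of the object scores of the objects with a missing value on the variable. A minimal sketch, with a hypothetical function name and toy data, and with the index set $I$ passed as a boolean mask:

```python
import numpy as np

def extra_category_quantification(X, w, missing):
    """Weighted mean of the object scores over the objects flagged as missing."""
    w_missing = w[missing]
    return (w_missing @ X[missing]) / w_missing.sum()

X = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])   # object scores, p = 2
w = np.array([1.0, 1.0, 2.0])                        # object weights
missing = np.array([False, True, True])              # index set I
y_extra = extra_category_quantification(X, w, missing)
```

The result has one value per dimension, matching a single row of the quantification matrix for the variable.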

Discrimination measures

The discrimination measures are the dimensionwise variances of the quantified variables, which are equal to the dimensionwise squared correlations of the quantified variables with the object scores. For variable $j$ and dimension $s$ the discrimination measure is

$$\mathrm{Discr}_{js} = n_w^{-1} \, y_{(j)s}' D_j \, y_{(j)s},$$

which is equal to the squared correlation between $G_j y_{(j)s}$ and $x_s$.
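The discrimination measures, and the variance accounted for that is built from them as $\mathrm{VAF}_s = \sum_j v_j \mathrm{Discr}_{js}$, can be sketched as follows. The function name and the toy quantifications are hypothetical, and $D_j$ is passed as the vector of its diagonal elements.

```python
import numpy as np

def discrimination(Y, d, n_w):
    """Discr_js = y_(j)s' D_j y_(j)s / n_w for every dimension s of one variable."""
    return (d[:, None] * Y ** 2).sum(axis=0) / n_w

# Two variables with identical (hypothetical) quantifications in p = 2 dimensions.
Y_list = [np.array([[1.0, -0.5], [-1.0, 0.5]])] * 2
d_list = [np.array([2.0, 2.0])] * 2      # weighted category frequencies
v = np.array([1.0, 1.0])                 # variable weights
n_w = 4.0

discr = np.array([discrimination(Y, d, n_w) for Y, d in zip(Y_list, d_list)])
vaf = v @ discr                          # one value per dimension
```

Each row of `discr` belongs to a variable and each column to a dimension, so summing the weighted rows gives the per-dimension variance accounted for.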

Object Scores and quantifications

Normalization

If $\Lambda^{1/2}$ gives the eigenvalues, then $\Lambda^{1/4}$ gives the singular values, which can be used to spread the inertia over the object scores $X$ and the category quantifications $Y$. During the optimization phase, variable principal normalization is used; then $X^n = X$ and $Y^n = Y$. Otherwise,

$$X^n = X \left( m_w^{-1} \Lambda \right)^{a/4}$$

$$Y^n = Y \left( m_w^{-1} \Lambda \right)^{(b-1)/4},$$

with $a = (1 + q)/2$, $b = (1 - q)/2$, and $q$ any real value in the closed interval $[-1, 1]$, except for independent normalization: then there is no $q$ value and $a = b = 1$. $q = -1$ is equal to variable principal normalization, $q = 1$ is equal to object principal normalization, and $q = 0$ is equal to symmetrical normalization.

Contributions object scores

Mass

The mass of object $i$ is

$$\mathrm{Mass}_i = \frac{m_{*ii}}{\mathrm{tr}(M_* W)}$$

Inertia

The inertia of object i is

  • (^) , 0 ( )

Inertia Mass ij^ ij

j i i ii (^) j h j h

v m d

= (^) ∑ − ,

where $d_{(j)h_{ij}}$ is the frequency of the category of object $i$ on variable $j$, and $h_{ij} \ne 0$ indicates that a variable is excluded if object $i$ has a missing value on that variable and the missing option for the variable is passive.

Contribution of point to inertia of dimension

The contribution of object i to the inertia of dimension s is

( )

Mass (^) j r mw

− , and then the total inertia for variable $j$ is $v_j (k_j - 1) / m_w$.

Contribution of point to inertia of dimension

The contribution of category $r$ of variable $j$ to the inertia of dimension $s$ is

$$\mathrm{Contribution}_{(j)rs} = \frac{y_{(j)rs}^2 \, d_{(j)rr}}{n_w m_w \, \mathrm{Inertia}_s},$$

the total contribution of variable $j$ to the inertia of dimension $s$ is $v_j \, \mathrm{Discr}_{js} / \lambda_s^{1/2}$.

Contribution of dimension to inertia of point

The contribution of dimension $s$ to the inertia of category $r$ of variable $j$ is

$$\mathrm{Contribution}_{s(j)r} = \frac{y_{(j)rs}^2 \, d_{(j)rr}}{n_w m_w \, \mathrm{Inertia}_{(j)r}}.$$

Residuals

Residuals gives plots per dimension of $G_j y^n_{(j)s}$ against the approximation $x^n_s$.
