Statistical Analysis of Contingency Tables: Crosstabs and Associated Statistics, Study notes of Mathematical Statistics

An in-depth analysis of various statistics used in the statistical analysis of contingency tables, including marginal and cell statistics, chi-square statistics, Goodman and Kruskal's tau, and Cohen's kappa. It covers formulas, standard errors, and degrees of freedom for these statistics.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
CROSSTABS
The notation and statistics refer to bivariate subtables defined by a row variable X
and a column variable Y, unless specified otherwise. By default, CROSSTABS
deletes cases with missing values on a table-by-table basis.
Notation
The following notation is used throughout this chapter unless otherwise stated:
$X_i$: Distinct values of the row variable arranged in ascending order: $X_1 < X_2 < \cdots < X_R$
$Y_j$: Distinct values of the column variable arranged in ascending order: $Y_1 < Y_2 < \cdots < Y_C$
$f_{ij}$: Sum of cell weights for cases in cell $(i, j)$
$c_j = \sum_{i=1}^{R} f_{ij}$, the jth column subtotal
$r_i = \sum_{j=1}^{C} f_{ij}$, the ith row subtotal
$W = \sum_{j=1}^{C} c_j = \sum_{i=1}^{R} r_i$, the grand total
Marginal and Cell Statistics
Count
count $= f_{ij}$
Expected Count

$$E_{ij} = \frac{r_i c_j}{W}$$

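Row Percent

$$\text{row percent} = 100 \times \left(f_{ij} / r_i\right)$$

Column Percent

$$\text{column percent} = 100 \times \left(f_{ij} / c_j\right)$$

Total Percent

$$\text{total percent} = 100 \times \left(f_{ij} / W\right)$$

Residual

$$R_{ij} = f_{ij} - E_{ij}$$

Standardized Residual

$$SR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}$$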
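The cell statistics above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `cell_statistics`, the dictionary keys, and the example table are hypothetical, and every row and column subtotal is assumed to be positive.

```python
import math

def cell_statistics(table):
    """Per-cell statistics for a table of cell weights f[i][j]."""
    row_totals = [sum(row) for row in table]         # r_i
    col_totals = [sum(col) for col in zip(*table)]   # c_j
    grand_total = sum(row_totals)                    # W
    stats = []
    for i, row in enumerate(table):
        stats_row = []
        for j, f in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            residual = f - expected
            stats_row.append({
                "count": f,
                "expected": expected,
                "row_percent": 100 * f / row_totals[i],
                "column_percent": 100 * f / col_totals[j],
                "total_percent": 100 * f / grand_total,
                "residual": residual,
                "std_residual": residual / math.sqrt(expected),
            })
        stats.append(stats_row)
    return stats

# Hypothetical 2 x 2 table of cell weights.
cell = cell_statistics([[10, 20], [30, 40]])[0][0]
```

For the first cell of this hypothetical table the expected count is 30 × 40 / 100 = 12, so the residual is -2.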

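Yates Continuity Corrected for 2 × 2 Tables

$$\chi^2_c = \begin{cases} \dfrac{W\left(\left|f_{11}f_{22} - f_{12}f_{21}\right| - 0.5W\right)^2}{r_1 r_2 c_1 c_2} & \text{if } \left|f_{11}f_{22} - f_{12}f_{21}\right| > 0.5W \\[2ex] 0 & \text{otherwise} \end{cases}$$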

The degrees of freedom are 1.
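A minimal sketch of the continuity-corrected statistic for a 2 × 2 table follows. The function name `yates_chi_square` and the example counts are hypothetical, and all four marginal subtotals are assumed to be positive.

```python
def yates_chi_square(f11, f12, f21, f22):
    """Continuity-corrected chi-square for a 2 x 2 table (1 degree of freedom)."""
    r1, r2 = f11 + f12, f21 + f22    # row subtotals
    c1, c2 = f11 + f21, f12 + f22    # column subtotals
    w = r1 + r2                      # grand total W
    diff = abs(f11 * f22 - f12 * f21)
    if diff <= 0.5 * w:
        # The correction would overshoot, so the statistic is set to 0.
        return 0.0
    return w * (diff - 0.5 * w) ** 2 / (r1 * r2 * c1 * c2)

chi2_c = yates_chi_square(10, 20, 30, 40)
```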

Mantel-Haenszel Test of Linear Association

$$\chi^2_{MH} = (W - 1)\, r^2$$

where r is the Pearson correlation coefficient to be defined later. The degrees of freedom are 1.

Other Measures of Association

Phi Coefficient

For a table that is not 2 × 2,

$$\phi = \sqrt{\frac{\chi^2_p}{W}}$$

where $\chi^2_p$ is the Pearson chi-square statistic. For a 2 × 2 table only, $\phi$ is equal to the Pearson correlation coefficient, so that the sign of $\phi$ matches that of the correlation coefficient.

Coefficient of Contingency

$$CC = \left(\frac{\chi^2_p}{\chi^2_p + W}\right)^{1/2}$$

Cramér’s V

$$V = \left(\frac{\chi^2_p}{W(q - 1)}\right)^{1/2}$$

where $q = \min\{R, C\}$.
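The three chi-square-based measures can be computed together, as in the sketch below. It assumes that the chi-square in these formulas is the usual Pearson statistic, the sum of (f_ij - E_ij)² / E_ij over all cells, whose definition is not part of the visible text. The function names are hypothetical, and the sketch returns the unsigned phi, so the sign convention for 2 × 2 tables is not applied.

```python
import math

def pearson_chi_square(table):
    """Usual Pearson chi-square (assumed form): sum of (f - E)^2 / E."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    w = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = rows[i] * cols[j] / w
            chi2 += (f - expected) ** 2 / expected
    return chi2

def association_measures(table):
    """Unsigned phi, coefficient of contingency, and Cramer's V."""
    chi2 = pearson_chi_square(table)
    w = sum(sum(row) for row in table)
    q = min(len(table), len(table[0]))
    return {
        "phi": math.sqrt(chi2 / w),
        "contingency": math.sqrt(chi2 / (chi2 + w)),
        "cramers_v": math.sqrt(chi2 / (w * (q - 1))),
    }

# Hypothetical table with perfect association.
perfect = association_measures([[5, 0], [0, 5]])
```

With perfect association in a 2 × 2 table, phi and Cramér's V both reach 1, while the coefficient of contingency stays below 1.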

Measures of Proportional Reduction in Predictive Error

Lambda

Let $f_{im}$ and $f_{mj}$ be the largest cell count in row $i$ and column $j$, respectively. Also, let $r_m$ be the largest row subtotal and $c_m$ the largest column subtotal. Define $\lambda_{Y|X}$ as the proportion of relative error in predicting an individual’s Y category that can be eliminated by knowledge of the X category. $\lambda_{Y|X}$ is computed as

$$\lambda_{Y|X} = \frac{\sum_{i=1}^{R} f_{im} - c_m}{W - c_m}$$

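The standard errors are

$$\mathrm{ASE}_0 = \frac{\sqrt{\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta^r_{ij} + \delta^c_{ij} - \delta^r_i - \delta^c_j\right)^2 - \left(\sum_{i=1}^{R} f_{im} + \sum_{j=1}^{C} f_{mj} - c_m - r_m\right)^2 / W}}{2W - r_m - c_m}$$

$$\mathrm{ASE}_1 = \frac{\sqrt{\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}\left(\delta^r_{ij} + \delta^c_{ij} - \delta^r_i - \delta^c_j + \lambda\left(\delta^r_i + \delta^c_j\right)\right)^2 - 4W\lambda^2}}{2W - r_m - c_m}$$

These expressions involve both the row and the column maxima, so $\lambda$ here is the symmetric lambda, $\lambda = \left(\sum_{i} f_{im} + \sum_{j} f_{mj} - c_m - r_m\right) / \left(2W - r_m - c_m\right)$,

where

$$\delta^r_{ij} = \begin{cases} 1 & \text{if } i \text{ is row index for } f_{mj} \\ 0 & \text{otherwise} \end{cases} \qquad \delta^r_i = \begin{cases} 1 & \text{if } i \text{ is index for } r_m \\ 0 & \text{otherwise} \end{cases}$$

and where

$$\delta^c_{ij} = \begin{cases} 1 & \text{if } j \text{ is column index for } f_{im} \\ 0 & \text{otherwise} \end{cases} \qquad \delta^c_j = \begin{cases} 1 & \text{if } j \text{ is index for } c_m \\ 0 & \text{otherwise} \end{cases}$$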
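As a small illustration of the asymmetric lambda with Y as the predicted variable, the sketch below sums the largest cell in each row, subtracts the largest column subtotal, and divides by the grand total minus that subtotal. The function name and the example tables are hypothetical; the standard errors are not computed.

```python
def lambda_y_given_x(table):
    """Proportional reduction in error when predicting Y (columns) from X (rows)."""
    col_totals = [sum(col) for col in zip(*table)]
    w = sum(col_totals)                               # grand total W
    c_m = max(col_totals)                             # largest column subtotal
    sum_row_max = sum(max(row) for row in table)      # sum over i of f_im
    return (sum_row_max - c_m) / (w - c_m)

lam_strong = lambda_y_given_x([[30, 10], [10, 50]])
lam_none = lambda_y_given_x([[10, 20], [30, 40]])
```

In the second hypothetical table the same column holds the largest count in every row, so knowing X does not change the prediction and lambda is 0.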

Goodman and Kruskal’s Tau (Goodman & Kruskal, 1954)

Similarly defined is Goodman and Kruskal’s tau ($\tau$):

$$\tau_{Y|X} = \frac{W\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}^2 / r_i - \sum_{j=1}^{C} c_j^2}{W^2 - \sum_{j=1}^{C} c_j^2},$$

with standard error $\mathrm{ASE}_1$ [formula not recoverable from the extracted text], in which

$$\delta = W^2 - \sum_{j=1}^{C} c_j^2 \quad \text{and} \quad v = W\sum_{i=1}^{R}\sum_{j=1}^{C} f_{ij}^2 / r_i - \sum_{j=1}^{C} c_j^2.$$

$\tau_{X|Y}$ and its standard error can be obtained by interchanging the roles of X and Y.

The significance level is based on the chi-square distribution, since

$$(W - 1)(C - 1)\,\tau_{Y|X} \sim \chi^2_{(R-1)(C-1)}$$

$$(W - 1)(R - 1)\,\tau_{X|Y} \sim \chi^2_{(R-1)(C-1)}$$
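The tau statistic and the chi-square quantity used for its significance level can be sketched as follows. The function name and the example table are hypothetical, row subtotals are assumed to be positive, and the standard error is not computed. Passing the transposed table gives the version with X as the predicted variable.

```python
def goodman_kruskal_tau(table):
    """Return (tau_Y|X, chi-square statistic, degrees of freedom)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    w = sum(rows)
    sum_c2 = sum(c * c for c in cols)
    # v = W * sum_ij f_ij^2 / r_i - sum_j c_j^2 ; delta = W^2 - sum_j c_j^2
    v = w * sum(f * f / rows[i] for i, row in enumerate(table) for f in row) - sum_c2
    delta = w * w - sum_c2
    tau = v / delta
    chi2 = (w - 1) * (len(cols) - 1) * tau
    df = (len(rows) - 1) * (len(cols) - 1)
    return tau, chi2, df

tau, chi2, df = goodman_kruskal_tau([[5, 0], [0, 5]])

# tau_X|Y: interchange the roles of X and Y by transposing the table.
transposed = [list(col) for col in zip(*[[10, 20], [20, 40]])]
tau_xy, _, _ = goodman_kruskal_tau(transposed)
```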

where

$$P = \sum_{i,j} f_{ij}\left[\ln\left(\frac{r_i c_j}{W f_{ij}}\right)\right]^2$$

The formulas for $U_{X|Y}$ can be obtained by interchanging the roles of X and Y. A symmetric version of the two asymmetric uncertainty coefficients is defined as follows:

$$U = \frac{2\left[U(X) + U(Y) - U(XY)\right]}{U(X) + U(Y)}$$

with asymptotic standard errors

$$\mathrm{ASE}_1 = \frac{2}{W\left[U(X) + U(Y)\right]^2}\sqrt{\sum_{i,j} f_{ij}\left\{U(XY)\ln\left(\frac{r_i c_j}{W^2}\right) - \left[U(X) + U(Y)\right]\ln\left(\frac{f_{ij}}{W}\right)\right\}^2},$$

or

$$\mathrm{ASE}_0 = \frac{2\sqrt{P - W\left[U(X) + U(Y) - U(XY)\right]^2}}{W\left[U(X) + U(Y)\right]}$$
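A sketch of the symmetric uncertainty coefficient follows. It assumes that U(X), U(Y), and U(XY) are the entropies of the row, column, and cell distributions computed with natural logarithms, and that empty cells contribute nothing to the sums. Those definitions are not part of the visible text, so treat them as assumptions; the function names are hypothetical.

```python
import math

def entropy(weights, total):
    """-sum (w/W) ln(w/W), skipping zero weights (assumed convention)."""
    return -sum(w / total * math.log(w / total) for w in weights if w > 0)

def symmetric_uncertainty(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    w = sum(rows)
    u_x = entropy(rows, w)                                 # U(X)
    u_y = entropy(cols, w)                                 # U(Y)
    u_xy = entropy([f for row in table for f in row], w)   # U(XY)
    return 2 * (u_x + u_y - u_xy) / (u_x + u_y)

u_perfect = symmetric_uncertainty([[5, 0], [0, 5]])
u_independent = symmetric_uncertainty([[10, 20], [20, 40]])
```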

Cohen’s Kappa

Cohen’s kappa ($\kappa$), defined only for square tables ($R = C$), is computed as

$$\kappa = \frac{W\sum_{i=1}^{R} f_{ii} - \sum_{i=1}^{R} r_i c_i}{W^2 - \sum_{i=1}^{R} r_i c_i}$$

with variance

$$\mathrm{var}_1 = W\left\{\frac{\sum_i f_{ii}\left(W - \sum_i f_{ii}\right)}{\left(W^2 - \sum_i r_i c_i\right)^2} + \frac{2\left(W - \sum_i f_{ii}\right)\left(2\sum_i f_{ii}\sum_i r_i c_i - W\sum_i f_{ii}(r_i + c_i)\right)}{\left(W^2 - \sum_i r_i c_i\right)^3} + \frac{\left(W - \sum_i f_{ii}\right)^2\left(W\sum_{i,j} f_{ij}(r_j + c_i)^2 - 4\left(\sum_i r_i c_i\right)^2\right)}{\left(W^2 - \sum_i r_i c_i\right)^4}\right\}$$

$$\mathrm{var}_0 = \frac{W^2\sum_i r_i c_i + \left(\sum_i r_i c_i\right)^2 - W\sum_i r_i c_i (r_i + c_i)}{W\left(W^2 - \sum_i r_i c_i\right)^2}$$
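The kappa computation can be illustrated with a short sketch. It assumes the count-based form of the usual large-sample variance and of the variance under the null hypothesis of chance agreement; the function name `cohen_kappa` and the example table are hypothetical.

```python
def cohen_kappa(table):
    """Return (kappa, var_1, var_0) for a square table of cell weights."""
    n = len(table)
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    w = sum(rows)
    diag = sum(table[i][i] for i in range(n))        # sum of f_ii
    rc = sum(rows[i] * cols[i] for i in range(n))    # sum of r_i c_i
    d = w * w - rc
    kappa = (w * diag - rc) / d

    term1 = diag * (w - diag) / d ** 2
    term2 = 2 * (w - diag) * (
        2 * diag * rc - w * sum(table[i][i] * (rows[i] + cols[i]) for i in range(n))
    ) / d ** 3
    term3 = (w - diag) ** 2 * (
        w * sum(table[i][j] * (rows[j] + cols[i]) ** 2
                for i in range(n) for j in range(n))
        - 4 * rc ** 2
    ) / d ** 4
    var1 = w * (term1 + term2 + term3)

    var0 = (
        w * w * rc + rc ** 2
        - w * sum(rows[i] * cols[i] * (rows[i] + cols[i]) for i in range(n))
    ) / (w * d ** 2)
    return kappa, var1, var0

kappa, var1, var0 = cohen_kappa([[20, 5], [10, 15]])
```

For this hypothetical table the observed agreement is 0.7 and the chance agreement is 0.5, which gives a kappa of 0.4.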

$$\mathrm{ASE}_1 = \frac{1}{D_r D_c}\sqrt{\sum_{i,j} f_{ij}\left(2\sqrt{D_r D_c}\,(C_{ij} - D_{ij}) + \tau_b v_{ij}\right)^2 - W^3\tau_b^2\left(D_r + D_c\right)^2},$$

where

$$v_{ij} = r_i D_c + c_j D_r$$

Under the independence assumption, the standard error is

$$\mathrm{ASE}_0 = 2\sqrt{\frac{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}(P - Q)^2}{D_r D_c}},$$

Kendall’s Tau- c

$$\tau_c = \frac{q(P - Q)}{W^2(q - 1)}$$

with standard error

$$\mathrm{ASE}_1 = \frac{2q}{(q - 1)W^2}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}(P - Q)^2},$$

or, under the independence assumption,

$$\mathrm{ASE}_0 = \frac{2q}{(q - 1)W^2}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}(P - Q)^2},$$

where

$$q = \min\{R, C\}$$

Gamma

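Gamma ($\gamma$) is estimated by

$$\gamma = \frac{P - Q}{P + Q}$$

with standard error

$$\mathrm{ASE}_1 = \frac{4}{(P + Q)^2}\sqrt{\sum_{i,j} f_{ij}\left(Q C_{ij} - P D_{ij}\right)^2},$$

or, under the hypothesis of independence,

$$\mathrm{ASE}_0 = \frac{2}{P + Q}\sqrt{\sum_{i,j} f_{ij}\left(C_{ij} - D_{ij}\right)^2 - \frac{1}{W}(P - Q)^2},$$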
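The ordinal measures above all rest on the concordance quantities P, Q, C_ij, and D_ij, whose definitions are not part of the visible text. The sketch below assumes the usual ones: C_ij is the total weight in cells lying above-left or below-right of cell (i, j), D_ij is the total weight in cells lying above-right or below-left, and P and Q are the f-weighted sums of C_ij and D_ij. The function names are hypothetical, and the quadruple loop is written for clarity rather than speed.

```python
import math

def concordance_sums(table):
    """Return (P, Q, C, D) under the assumed definitions of C_ij and D_ij."""
    n_rows, n_cols = len(table), len(table[0])
    conc = [[0] * n_cols for _ in range(n_rows)]
    disc = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(n_rows):
                for k in range(n_cols):
                    if (h < i and k < j) or (h > i and k > j):
                        conc[i][j] += table[h][k]
                    elif (h < i and k > j) or (h > i and k < j):
                        disc[i][j] += table[h][k]
    p = sum(table[i][j] * conc[i][j] for i in range(n_rows) for j in range(n_cols))
    q = sum(table[i][j] * disc[i][j] for i in range(n_rows) for j in range(n_cols))
    return p, q, conc, disc

def gamma_and_tau_c(table):
    """Return (gamma, ASE_1 of gamma, tau-c)."""
    p, q, conc, disc = concordance_sums(table)
    w = sum(sum(row) for row in table)
    m = min(len(table), len(table[0]))    # q = min{R, C} in the text
    gamma = (p - q) / (p + q)
    ase1 = 4 / (p + q) ** 2 * math.sqrt(sum(
        f * (q * conc[i][j] - p * disc[i][j]) ** 2
        for i, row in enumerate(table) for j, f in enumerate(row)
    ))
    tau_c = m * (p - q) / (w * w * (m - 1))
    return gamma, ase1, tau_c

p_sum, q_sum, _, _ = concordance_sums([[10, 20], [30, 40]])
gamma, ase1, tau_c = gamma_and_tau_c([[10, 20], [30, 40]])
```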

Somers’ d

Somers’ d with row variable X as the independent variable is calculated as

$$d_{Y|X} = \frac{P - Q}{D_r}$$

with standard error

$$\mathrm{ASE}_1 = \frac{2}{D_r^2}\sqrt{\sum_{i,j} f_{ij}\left\{D_r\left(C_{ij} - D_{ij}\right) - (P - Q)(W - r_i)\right\}^2},$$

or, under the hypothesis of independence,

where

$$\mathrm{cov}(X, Y) = \sum_{i,j} X_i Y_j f_{ij} - \left(\sum_{i=1}^{R} X_i r_i\right)\left(\sum_{j=1}^{C} Y_j c_j\right) / W,$$

$$S(X) = \sum_{i=1}^{R} X_i^2 r_i - \left(\sum_{i=1}^{R} X_i r_i\right)^2 / W$$

and

$$S(Y) = \sum_{j=1}^{C} Y_j^2 c_j - \left(\sum_{j=1}^{C} Y_j c_j\right)^2 / W$$

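The variance of r is

$$\mathrm{var}_1 = \frac{1}{T^4}\sum_{i,j} f_{ij}\left\{T\left(X_i - \bar{X}\right)\left(Y_j - \bar{Y}\right) - \frac{\mathrm{cov}(X, Y)}{2T}\left[\left(X_i - \bar{X}\right)^2 S(Y) + \left(Y_j - \bar{Y}\right)^2 S(X)\right]\right\}^2,$$

with $T = \sqrt{S(X)\,S(Y)}$.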

If the null hypothesis is true,

$$\mathrm{var}_0 = \frac{\sum_{i,j} f_{ij} X_i^2 Y_j^2 - \left(\sum_{i,j} f_{ij} X_i Y_j\right)^2 / W}{\left(\sum_{i} r_i X_i^2\right)\left(\sum_{j} c_j Y_j^2\right)},$$

where

$$\bar{X} = \sum_{i=1}^{R} X_i r_i / W$$

and

$$\bar{Y} = \sum_{j=1}^{C} Y_j c_j / W$$

Under the hypothesis that $\rho = 0$,

$$t = \frac{r\sqrt{W - 2}}{\sqrt{1 - r^2}}$$

is distributed as a t with $W - 2$ degrees of freedom.
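The quantities cov(X, Y), S(X), and S(Y) combine into the correlation and its t statistic as sketched below. The sketch assumes the usual form r = cov(X, Y) / sqrt(S(X) S(Y)), which is not shown in the visible text, and treats the category values as numeric scores. The function name and the example scores are hypothetical.

```python
import math

def pearson_r_and_t(table, x_scores, y_scores):
    """Return (r, t) for a table of cell weights with numeric category scores."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    w = sum(rows)
    sum_x = sum(x * r for x, r in zip(x_scores, rows))
    sum_y = sum(y * c for y, c in zip(y_scores, cols))
    cov = sum(
        x_scores[i] * y_scores[j] * f
        for i, row in enumerate(table) for j, f in enumerate(row)
    ) - sum_x * sum_y / w
    s_x = sum(x * x * r for x, r in zip(x_scores, rows)) - sum_x ** 2 / w
    s_y = sum(y * y * c for y, c in zip(y_scores, cols)) - sum_y ** 2 / w
    r = cov / math.sqrt(s_x * s_y)
    t = r * math.sqrt(w - 2) / math.sqrt(1 - r * r)   # W - 2 degrees of freedom
    return r, t

r, t = pearson_r_and_t([[10, 20], [30, 40]], [1, 2], [1, 2])
```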

Spearman Correlation

The Spearman’s rank correlation coefficient $r_s$ is computed by using rank scores $R_i$ for $X_i$ and $C_j$ for $Y_j$. These rank scores are defined as follows:

$$R_i = \sum_{k<i} r_k + \frac{r_i + 1}{2} \quad \text{for } i = 1, 2, \ldots, R$$

$$C_j = \sum_{h<j} c_h + \frac{c_j + 1}{2} \quad \text{for } j = 1, 2, \ldots, C$$

The formulas for $r_s$ and its asymptotic variance can be obtained from the Pearson formulas by substituting $R_i$ and $C_j$ for $X_i$ and $Y_j$, respectively.
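The rank scores are midranks: each category receives the average rank of the cases that fall in it. A minimal sketch, with a hypothetical function name and example subtotals:

```python
def rank_scores(totals):
    """Midrank score for each category, given its subtotals in ascending order."""
    scores = []
    cumulative = 0   # total weight of all lower categories
    for t in totals:
        scores.append(cumulative + (t + 1) / 2)
        cumulative += t
    return scores

row_scores = rank_scores([30, 70])   # R_i from row subtotals
col_scores = rank_scores([40, 60])   # C_j from column subtotals
```

These scores can then be used in place of the category values in a Pearson-type computation to obtain the Spearman coefficient.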

Eta

Asymmetric $\eta$ with the column variable Y as dependent is

$$\eta_Y = \sqrt{1 - \frac{S_{YW}}{S(Y)}}$$

where

$$v = \left[\frac{f_{12}}{f_{11}\left(f_{11} + f_{12}\right)} + \frac{f_{22}}{f_{21}\left(f_{21} + f_{22}\right)}\right]^{1/2}$$

The relative risk for column 2 and the confidence interval are computed similarly.
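The quantity v is the standard error of the log relative risk for column 1. The sketch below assumes the usual definitions, which are not part of the visible text: the relative risk is the ratio of the two row proportions for column 1, and the confidence limits are the relative risk times exp(±z·v). The function name and the default z = 1.96 are hypothetical choices.

```python
import math

def relative_risk_column1(f11, f12, f21, f22, z=1.96):
    """Return (relative risk, v, lower limit, upper limit) for column 1."""
    risk_ratio = (f11 / (f11 + f12)) / (f21 / (f21 + f22))
    v = math.sqrt(f12 / (f11 * (f11 + f12)) + f22 / (f21 * (f21 + f22)))
    lower = risk_ratio * math.exp(-z * v)   # assumed log-scale interval
    upper = risk_ratio * math.exp(z * v)
    return risk_ratio, v, lower, upper

rr, v, lower, upper = relative_risk_column1(10, 20, 30, 40)
```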

McNemar-Bowker’s Test

This statistic is used to test if a square table is symmetric.

Notations

$n$: Dimension of the table (both row and column)

$p_{ij}$: Unknown population cell probability of row i and column j

$n_{ij}$: Observed cell count of row i and column j

Algorithm

Given an $n \times n$ square table, the McNemar-Bowker statistic is used to test the hypothesis $H_0: p_{ij} = p_{ji}$ for all $(i < j)$.

A Special Case: 2x2 Tables

For a 2x2 table, the statistic reduces to the classical McNemar (1947) statistic, for which an exact p-value can be computed. The two-tailed probability level is

$$2\sum_{i=0}^{\min(n_{12},\, n_{21})} \binom{n_{12} + n_{21}}{i}\left(\frac{1}{2}\right)^{n_{12} + n_{21}}$$
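The exact two-tailed level is a doubled binomial tail with success probability 1/2 over the discordant cells. A minimal sketch, with a hypothetical function name; capping the result at 1 is an added assumption, since the doubled tail can exceed 1 when the two discordant counts are close.

```python
from math import comb

def mcnemar_exact_p(n12, n21):
    """Two-tailed exact level from the discordant counts n12 and n21."""
    n = n12 + n21
    tail = sum(comb(n, i) for i in range(min(n12, n21) + 1))
    # Cap at 1.0 (assumption, not stated in the text).
    return min(1.0, 2 * tail * 0.5 ** n)

p_unbalanced = mcnemar_exact_p(1, 9)
p_balanced = mcnemar_exact_p(5, 5)
```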

Conditional Independence and Homogeneity

Cochran’s and Mantel-Haenszel statistics test the independence of two dichotomous variables, controlling for one or more other categorical variables. These “other” categorical variables define a number of strata, across which these statistics are computed. The Breslow-Day statistic is used to test homogeneity of the common odds ratio, which is a weaker condition than the conditional independence (i.e., homogeneity with a common odds ratio of 1) tested by Cochran’s and Mantel-Haenszel statistics. Tarone’s statistic is the Breslow-Day statistic adjusted for the use of a consistent but inefficient estimator of the common odds ratio, such as the Mantel-Haenszel estimator.

Notation and Definitions

The addition of strata requires the following modifications to the notation:

$K$: The number of strata.

$f_{ijk}$: Sum of cell weights for cases in the ith row of the jth column of the kth stratum.

$c_{jk} = \sum_{i=1}^{R} f_{ijk}$, the jth column subtotal of the kth stratum.

$r_{ik} = \sum_{j=1}^{C} f_{ijk}$, the ith row subtotal of the kth stratum.

$n_k = \sum_{j=1}^{C} c_{jk} = \sum_{i=1}^{R} r_{ik}$, the grand total of the kth stratum.