Categorical Data Analysis: Contingency Tables and Test of Independence - Prof. Michael Hudgens, Study notes of Data Analysis & Statistical Methods

An introduction to categorical data analysis, focusing on contingency tables and the test of independence. It covers the two-way contingency table, its notation, the scenarios in which it arises, the test of independence, the Pearson chi-square statistic, and the test for trend.

Categorical Data: Contingency Tables
Bios 662
Michael G. Hudgens, Ph.D.
http://www.bios.unc.edu/mhudgens
2006-10-17 17:11


Contingency Tables

  • Two-way (r × c) contingency table:

                 j
    i      1      2      ···    c
    1      n_11   n_12   ···    n_1c
    2      n_21   n_22   ···    n_2c
    ...    ...    ...    ...    ...
    r      n_r1   n_r2   ···    n_rc

  • Notation:

$$
n_{i\cdot} = \sum_{j=1}^{c} n_{ij}, \qquad n_{\cdot j} = \sum_{i=1}^{r} n_{ij}
$$

Contingency Table: Example

  • A survey of physicians asked about the size of the community in which they were reared and the size of the community in which they practice

                        Practice
    Reared     <5k   5-49k   50-99k   100k+   Total
    <5k         40      38       32      37     147
    5-49k       26      42       35      33     136
    50-99k      24      26       34      31     115
    100k+       30      39       53      60     182
    Total      120     145      154     161     580
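As a small illustration of the marginal totals $n_{i\cdot}$ and $n_{\cdot j}$, the following Python sketch recomputes the Total row and column of the physician table. The variable names are illustrative and are not part of the original notes.

```python
# Observed counts from the physician survey (rows: reared, columns: practice).
# Category order in both dimensions: <5k, 5-49k, 50-99k, 100k+.
counts = [
    [40, 38, 32, 37],
    [26, 42, 35, 33],
    [24, 26, 34, 31],
    [30, 39, 53, 60],
]

# Row totals n_i. : sum over the columns j within each row i.
row_totals = [sum(row) for row in counts]

# Column totals n_.j : sum over the rows i within each column j.
col_totals = [sum(col) for col in zip(*counts)]

# Grand total N.
grand_total = sum(row_totals)

print(row_totals)   # [147, 136, 115, 182]
print(col_totals)   # [120, 145, 154, 161]
print(grand_total)  # 580
```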

Contingency Table: Example

  • A case-control study was conducted to investigate the relationship between age at first birth and breast cancer

                     Age at 1st birth
               <20   20-24   25-29   30-34   ≥35   Total
    Case       320    1206    1011     463   220    3220
    Control   1422    4432    2893    1092   406   10245
    Total     1742    5638    3904    1555   626   13465

Contingency Tables

  • Breast cancer example
    $H_0$: distribution across ages is the same for cases and controls
    $H_0 : \pi_{ij} = \pi_{i'j}$; $j = 1, 2, \ldots, c$
  • Test of homogeneity/association
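To make the null hypothesis concrete, the sketch below computes the sample proportion in each age category separately for cases and controls, using the counts from the breast cancer table. These row-wise proportions estimate the $\pi_{ij}$; under $H_0$ the two rows should look alike up to sampling variation. The function name `proportions` is a hypothetical helper, not something defined in the notes.

```python
# Observed counts by age at first birth: <20, 20-24, 25-29, 30-34, >=35.
cases = [320, 1206, 1011, 463, 220]
controls = [1422, 4432, 2893, 1092, 406]
labels = ["<20", "20-24", "25-29", "30-34", ">=35"]

def proportions(row):
    """Return each count divided by the row total."""
    total = sum(row)
    return [count / total for count in row]

case_props = proportions(cases)
control_props = proportions(controls)

# Print the two estimated distributions side by side.
for label, p_case, p_ctrl in zip(labels, case_props, control_props):
    print(f"{label:>6}: cases {p_case:.3f}, controls {p_ctrl:.3f}")
```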

Test of Independence or Association

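  • Under $H_0$, the expected frequency in the $(i, j)$ cell is

$$
E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}
$$

    where $N$ is the total number of observations in the table

  • Consider the breast cancer example
    • If $H_0$ is true, we would expect the proportion of women with first birth before age 20 to be $\hat{\pi}_{\cdot 1} = \frac{n_{11} + n_{21}}{N} = \frac{n_{\cdot 1}}{N}$
    • There are $n_{1\cdot}$ cases, so we would expect $E_{11} = n_{1\cdot} \frac{n_{\cdot 1}}{N} = \frac{n_{1\cdot}\, n_{\cdot 1}}{N}$ cases to be < 20 years old at first birth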
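The two-step reasoning above (pooled column proportion, then multiply by the number of cases) can be traced numerically with the counts from the breast cancer table. This is only a sketch; the variable names mirror the notation and are otherwise arbitrary.

```python
# Counts from the breast cancer table; column 1 is age at first birth < 20.
n_11, n_21 = 320, 1422   # cases and controls in column 1
n_1_dot = 3220           # total number of cases (row 1 total)
N = 13465                # total number of women in the study

# Pooled proportion of women in column 1, ignoring case/control status.
n_dot_1 = n_11 + n_21
pi_hat_dot_1 = n_dot_1 / N

# Expected number of cases in column 1 if H0 holds.
E_11 = n_1_dot * pi_hat_dot_1

print(round(pi_hat_dot_1, 4))  # 0.1294
print(round(E_11, 1))          # 416.6
```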

Test of Independence

  • Under $H_0$, $X^2 \sim \chi^2_{(r-1)(c-1)}$
  • Physician's Example:

$$
(r - 1)(c - 1) = 3 \times 3 = 9
$$

$$
C_{.05} = \{X^2 : X^2 > \chi^2_{.95,\,9} = 16.92\}
$$

Physician’s Example

  • Expected values

                        Practice
    Reared     <5k    5-49k   50-99k   100k+   Total
    <5k        30.4    36.8     39.0    40.8     147
    5-49k      28.1    34.0     36.1    37.8     136
    50-99k     23.8    28.8     30.5    31.9     115
    100k+      37.7    45.5     48.3    50.5     182
    Total       120     145      154     161     580
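The expected values can be reproduced from the observed physician counts, and the Pearson statistic can then be compared with the critical value 16.92 quoted in these notes. The sketch below assumes the usual form $X^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$ with $O_{ij}$ the observed count; the variable names are illustrative.

```python
# Observed physician counts (rows: reared, columns: practice).
observed = [
    [40, 38, 32, 37],
    [26, 42, 35, 33],
    [24, 26, 34, 31],
    [30, 39, 53, 60],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# Expected count in cell (i, j) under H0: (row total * column total) / N.
expected = [[r * c / N for c in col_totals] for r in row_totals]

# Pearson chi-square statistic summed over all cells.
x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 16.92  # 0.95 quantile of chi-square with 9 df, as quoted in the notes
reject = x2 > critical_value

for exp_row in expected:
    print([round(e, 1) for e in exp_row])
print("df =", df, "X^2 =", round(x2, 2), "reject H0:", reject)
```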

Breast Cancer Example

  • Underlying probabilities

                      Age at 1st birth
               <20    20-24   25-29   30-34   ≥35    Total
    Case       π_11   π_12    π_13    π_14    π_15   1
    Control    π_21   π_22    π_23    π_24    π_25   1

Breast Cancer Example

  • Null hypothesis

$$
H_0 : \pi_{1j} = \pi_{2j} \quad \text{for } j = 1, 2, 3, 4, 5
$$

  • Can use same statistic

$$
X^2 = \sum_{i=1}^{2} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(c-1)}
$$

Breast Cancer Example

  • Test statistic

$$
X^2 = \frac{(320 - 416.6)^2}{416.6} + \cdots + \frac{(406 - 476.3)^2}{476.3} = 130.3
$$

  • Rejection region

$$
C_{.05} = \{X^2 : X^2 > \chi^2_{.95,\,4} = 9.49\}
$$

  • Reject $H_0$, since 130.3 > 9.49
  • The age distributions are not the same
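The arithmetic behind the statistic 130.3 can be checked with a short Python sketch. The function `pearson_chi_square` is a hypothetical helper written for this illustration. The p-value line relies on the closed-form upper tail of the chi-square distribution with 4 degrees of freedom, $P(X > x) = e^{-x/2}(1 + x/2)$, which is a shortcut specific to that case and is not taken from the notes.

```python
import math

observed = [
    [320, 1206, 1011, 463, 220],     # cases
    [1422, 4432, 2893, 1092, 406],   # controls
]

def pearson_chi_square(table):
    """Return (X^2, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            x2 += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

x2, df = pearson_chi_square(observed)

# Upper-tail probability for df = 4 only: exp(-x/2) * (1 + x/2).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)

print(round(x2, 1), df)  # 130.3 4
print(x2 > 9.49)         # True
```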

Asymptotic Approximation

  • Note that the $\chi^2$ distribution for $X^2$ is an approximation
  • The approximation works well if $E_{ij} \geq 5$ for all $i, j$
  • If $E_{ij} < 5$ for some cell, a generalization of Fisher's exact test can be employed, or categories can be combined
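The rule of thumb above is easy to check before running the test. The following sketch lists every cell whose expected count falls below 5. The helper name `small_expected_cells` and the small `sparse` table are made up for illustration and do not come from the notes.

```python
def small_expected_cells(table, threshold=5.0):
    """Return (i, j, E_ij) for each cell whose expected count is below threshold."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n
            if e < threshold:
                flagged.append((i, j, e))
    return flagged

# Physician table from the notes: no cell should be flagged.
physicians = [
    [40, 38, 32, 37],
    [26, 42, 35, 33],
    [24, 26, 34, 31],
    [30, 39, 53, 60],
]
print(small_expected_cells(physicians))

# Hypothetical sparse table (invented counts) where the approximation is doubtful.
sparse = [[3, 1], [2, 9]]
print(small_expected_cells(sparse))
```

An empty list suggests the chi-square approximation is reasonable by this criterion; a non-empty list points to the cells that might be merged with neighbouring categories or handled with an exact test.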

Test for Trend

  • Consider a 2 × c table
  • The $\chi^2$ test for homogeneity does not tell us how the probabilities differ
  • Rather, it tells us only whether they differ
  • If the categories of the column variable are ordered, a more powerful test is possible

Test for Trend

  • Suppose the columns (exposure) are ordered
  • Rows = disease (yes/no)
  • Interested in detecting alternatives where the probability of disease is proportional to exposure
  • I.e., looking for a monotonic dose-response type relationship
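This portion of the notes does not state which trend statistic is used. One widely used choice for a 2 × c table with ordered columns is the Cochran-Armitage test, sketched below under that assumption. The equally spaced scores 1, ..., 5 for the age groups are also an assumption, and `trend_statistic` is a hypothetical helper name. Under $H_0$ the statistic is approximately chi-square with 1 degree of freedom, whose 0.95 quantile is 3.84.

```python
def trend_statistic(cases, totals, scores):
    """Cochran-Armitage trend statistic for a 2 x c table (approx. chi-square, 1 df)."""
    n = sum(totals)
    p_bar = sum(cases) / n  # overall proportion of cases

    # Score-weighted difference between observed and expected case counts.
    t = sum(x * (a - m * p_bar) for x, a, m in zip(scores, cases, totals))

    # Variance of t under H0, treating the column totals as fixed.
    sum_mx = sum(m * x for m, x in zip(totals, scores))
    sum_mx2 = sum(m * x * x for m, x in zip(totals, scores))
    var_t = p_bar * (1 - p_bar) * (sum_mx2 - sum_mx ** 2 / n)

    return t ** 2 / var_t

# Breast cancer data: cases and column totals by age at first birth.
cases = [320, 1206, 1011, 463, 220]
totals = [1742, 5638, 3904, 1555, 626]
scores = [1, 2, 3, 4, 5]  # assumed equally spaced scores for the ordered categories

x2_trend = trend_statistic(cases, totals, scores)
print(round(x2_trend, 1), x2_trend > 3.84)
```

Because this statistic concentrates on a single linear direction, it uses 1 degree of freedom rather than c − 1, which is the source of the extra power against monotone alternatives.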