Categorical Variables, Observed Values - Study Guide | STAT 541, Exams of Biostatistics

Material Type: Exam; Class: Introduction to Biostatistics; Subject: STATISTICS; University: University of Wisconsin - Madison; Term: Spring 2008;


Uploaded on 09/02/2009

Ismor Fischer, 8/24/2008 Stat 541 / 6-51

6.3 Several Samples

§ 6.3.1 Proportions

General formulation: Suppose that I and J are two general categorical variables, having r and c categories, respectively. Then there is a total of r × c possible disjoint events:

“I = i and J = j” for i = 1, 2, …, r and j = 1, 2, …, c. Let $\pi_{i|j}$ = the conditional probability P(I = i | J = j) = P(“I = i” ∩ “J = j”) / P(J = j), so that $\pi_{1|j} + \pi_{2|j} + \dots + \pi_{r|j} = 1$ for each j = 1, 2, …, c. We wish to test the null hypothesis that, for each category i of I, the probabilities $\pi_{i|j}$ are equal over all the categories j of J. That is,

H0: “There is no association between (categories of) I and (categories of) J.”

$\pi_{1|1} = \pi_{1|2} = \pi_{1|3} = \dots = \pi_{1|c}$ and

$\pi_{2|1} = \pi_{2|2} = \pi_{2|3} = \dots = \pi_{2|c}$ and

… and

$\pi_{r|1} = \pi_{r|2} = \pi_{r|3} = \dots = \pi_{r|c}$

versus

HA: “There is an association between (categories of) I and (categories of) J.”

At least one of these equalities is false, i.e., $\pi_{i|j} \neq \pi_{i|k}$ for some i and some j ≠ k.

Important Note: We can formulate a similar but distinct null hypothesis, by considering probabilities $\pi_{j|i}$ conditioned on the categories of I, rather than J as above, but with a basically equivalent interpretation of non-association between them. Examples below…

Much as before, we can construct an r × c contingency table of n observed values, where r = # rows, and c = # columns.

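| Categories of I \ Categories of J | 1 | 2 | 3 | … | c | Row total |
|---|---|---|---|---|---|---|
| 1 | $O_{11}$ | $O_{12}$ | $O_{13}$ | … | $O_{1c}$ | $R_1$ |
| 2 | $O_{21}$ | $O_{22}$ | $O_{23}$ | … | $O_{2c}$ | $R_2$ |
| 3 | $O_{31}$ | $O_{32}$ | $O_{33}$ | … | $O_{3c}$ | $R_3$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ |
| r | $O_{r1}$ | $O_{r2}$ | $O_{r3}$ | … | $O_{rc}$ | $R_r$ |
| Column total | $C_1$ | $C_2$ | $C_3$ | … | $C_c$ | n |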

For i = 1, 2, …, r and j = 1, 2, …, c , the following are obtained:

Observed Values: $O_{ij}$ = #(I = i, J = j), whole numbers ≥ 0

Expected Values: $E_{ij} = \dfrac{R_i C_j}{n}$, real numbers (i.e., with decimals) ≥ 0

where the row marginals $R_i = O_{i1} + O_{i2} + O_{i3} + \dots + O_{ic}$,

and the column marginals $C_j = O_{1j} + O_{2j} + O_{3j} + \dots + O_{rj}$

Test Statistic:

$$\mathrm{X}^2 = \sum_{\text{all } i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{df}, \qquad \text{where } \nu = df = (r - 1)(c - 1)$$

[Figure: $\chi^2_\nu$ distribution, density curves for ν = 1, 2, 3, 4, 5, 6, 7]

Comments:

• The Chi-squared Test is valid, provided 80% or more of the $E_{ij}$ ≥ 5. For small expected values, lumping categories together increases the numbers in the corresponding cells. Example: The five age categories “18-24,” “25-39,” “40-49,” “50-64,” and “65+” might be lumped into three categories “18-39,” “40-64,” and “65+” if appropriate. Caution: Categories should be deemed contextually meaningful before using $\chi^2$.

• Remarkably, the same Chi-squared statistic can be applied in different scenarios, including tests of different null hypotheses H0 on the same contingency table, as shown in the following examples.

• If $Z_1, Z_2, \dots, Z_d$ are independent N(0, 1) random variables, then $Z_1^2 + Z_2^2 + \dots + Z_d^2 \sim \chi^2_d$.
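The definitions above translate almost line for line into code. The following is a minimal sketch in plain Python, not part of the original notes: the function names `chi_squared_table` and `share_of_large_expected` and the 2 × 3 table of counts are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def chi_squared_table(observed: List[List[float]]) -> Tuple[float, int, List[List[float]]]:
    """Return (chi-squared statistic, df, expected table) for an r x c table."""
    r, c = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]                       # R_i
    col_totals = [sum(row[j] for row in observed) for j in range(c)]  # C_j
    n = sum(row_totals)
    # Expected values under H0: E_ij = R_i * C_j / n
    expected = [[row_totals[i] * col_totals[j] / n for j in range(c)]
                for i in range(r)]
    # Sum of (O_ij - E_ij)^2 / E_ij over all cells
    stat = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(r) for j in range(c))
    return stat, (r - 1) * (c - 1), expected


def share_of_large_expected(expected: List[List[float]], threshold: float = 5.0) -> float:
    """Fraction of cells with E_ij >= threshold (the notes ask for at least 80%)."""
    cells = [e for row in expected for e in row]
    return sum(e >= threshold for e in cells) / len(cells)


# Hypothetical 2 x 3 table of counts (illustration only, not from the notes).
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, df, expected = chi_squared_table(observed)
print(stat, df, share_of_large_expected(expected))
```

Converting the statistic into a p-value requires the upper tail of the $\chi^2_{df}$ distribution, which a statistics library or a printed table would supply; that step is left out here.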

because then...

• 15/150 = 20/200 = 25/250 (= pooled proportion $\hat{\pi}_{\text{None}}$ = 60/600), true since all = 0.1

• 30/150 = 40/200 = 50/250 (= pooled proportion $\hat{\pi}_{\text{Minor}}$ = 120/600), true since all = 0.2

• 45/150 = 60/200 = 75/250 (= pooled proportion $\hat{\pi}_{\text{Mod}}$ = 180/600), true since all = 0.3

• 60/150 = 80/200 = 100/250 (= pooled proportion $\hat{\pi}_{\text{Major}}$ = 240/600), true since all = 0.4.

If the null hypothesis is rejected based on the data, then the alternative is that at least one of its four statements is false. For that corresponding improvement category, one of the three treatment populations is significantly different from the others. This is referred to as a Chi-squared Test of Homogeneity.

However, we can also consider another null hypothesis: “ Within each treatment group, there is no significant difference in the improvement response rates.” That is, for each treatment category J , the conditional probabilities P ( J | I ) down all improvement categories I are equal, and in particular, equal to the corresponding fixed column marginal probability P ( J ), that is, P ( J | I ) = P ( J ). But recall that, by definition, any two events A and B are statistically independent if P ( A | B ) = P ( A ). Therefore...

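H0: “Treatment and response are statistically independent.”

$\pi_{\text{PT+Rx} \mid \text{None}} = \pi_{\text{PT+Rx} \mid \text{Minor}} = \pi_{\text{PT+Rx} \mid \text{Moderate}} = \pi_{\text{PT+Rx} \mid \text{Major}}$ (= $\pi_{\text{PT+Rx}}$) and

$\pi_{\text{PT+OTC} \mid \text{None}} = \pi_{\text{PT+OTC} \mid \text{Minor}} = \pi_{\text{PT+OTC} \mid \text{Moderate}} = \pi_{\text{PT+OTC} \mid \text{Major}}$ (= $\pi_{\text{PT+OTC}}$) and

$\pi_{\text{PT only} \mid \text{None}} = \pi_{\text{PT only} \mid \text{Minor}} = \pi_{\text{PT only} \mid \text{Moderate}} = \pi_{\text{PT only} \mid \text{Major}}$ (= $\pi_{\text{PT only}}$)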

[Also, recall that an equivalent definition of independence is P(A ∩ B) = P(A) × P(B), so that, for example, P(“None” ∩ “PT + Rx”) = P(None) × P(PT + Rx), and likewise for all the other cells: P(“I = i” ∩ “J = j”) = P(I = i) × P(J = j). Rewriting this yields “expected value $E_{ij}$” / n = (“Row marginal $R_i$” / n) × (“Column marginal $C_j$” / n), i.e., the familiar formula $E_{ij} = R_i C_j / n$.]

If the null hypothesis is true, then the same expected table would yield the following.

• 15/60 = 30/120 = 45/180 = 60/240 ($\hat{\pi}_{\text{PT+Rx}}$ = 150/600), true, since all = 1/4.

• 20/60 = 40/120 = 60/180 = 80/240 ($\hat{\pi}_{\text{PT+OTC}}$ = 200/600), true, since all = 1/3.

• 25/60 = 50/120 = 75/180 = 100/240 ($\hat{\pi}_{\text{PT only}}$ = 250/600), true, since all = 5/12.

This is referred to as a Chi-squared Test of Independence.
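Both sets of equalities can be checked with exact fractions. The sketch below is an illustration rather than part of the notes: it rebuilds the expected table from the marginal totals quoted in the text (row totals 60, 120, 180, 240; column totals 150, 200, 250; n = 600) and confirms that the column-conditional and row-conditional proportions are constant. The variable names are hypothetical.

```python
from fractions import Fraction

rows = ["None", "Minor", "Moderate", "Major"]   # improvement categories
cols = ["PT + Rx", "PT + OTC", "PT only"]       # treatment categories
row_totals = [60, 120, 180, 240]                # R_i
col_totals = [150, 200, 250]                    # C_j
n = 600

# Expected table under H0: E_ij = R_i * C_j / n
expected = [[Fraction(R * C, n) for C in col_totals] for R in row_totals]

# Homogeneity view: within a row, E_ij / C_j is the same for every treatment.
col_conditional = [[e / C for e, C in zip(row, col_totals)] for row in expected]

# Independence view: within a column, E_ij / R_i is the same for every response.
row_conditional = [[e / R for e in row] for row, R in zip(expected, row_totals)]

for name, props in zip(rows, col_conditional):
    print(name, [str(p) for p in props])
for j, name in enumerate(cols):
    print(name, [str(row[j]) for row in row_conditional])
```

Using `Fraction` avoids rounding, so equality of the proportions can be tested exactly rather than to a tolerance.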

The same Chi-squared statistic

$$\mathrm{X}^2 = \sum_{\text{all cells}} \frac{(\text{Obs} - \text{Exp})^2}{\text{Exp}}$$

on df = (r – 1)(c – 1) is used for both null hypothesis tests! The exact interpretation depends on the design of the experiment, i.e., whether two or more populations are being compared for homogeneity with respect to a set of responses, or whether two categorical variables are independent of one another. MORAL: In general, if the null hypothesis is rejected in either scenario, then there is an association between the two categorical variables I and J.

Exercise: Conduct (both versions of) the Chi-squared Test for this 4 × 3 table.

The Birds and the Bees

An Application of the Chi-squared Test to Basic Genetics

Inherited biological traits among humans (e.g., right- or left-handedness) and other organisms are transmitted from parents to offspring via “unit factors” called genes, discrete regions of DNA that are located on chromosomes, which are tightly coiled within the nucleus of a cell. Most human cells normally contain 46 chromosomes, arranged in 23 pairs (“diploid”); hence, two copies of each gene. Each copy can be either dominant (say, A = right-handedness) or recessive (a = left-handedness) for a given trait. The trait that is physically expressed in the organism – i.e., its phenotype – is determined by which of the three possible combinations of pairs AA, Aa, aa of these two “alleles” A and a occurs in its genes – i.e., its genotype – and its interactions with environmental factors: AA is “homozygous dominant” for right-handedness, Aa is “heterozygous dominant” (or “hybrid”) for right-handedness, and aa is “homozygous recessive” for left-handedness.

However, reproductive cells (“gametes”: egg and sperm cells) only have 23 chromosomes, thus a single copy of each gene (“haploid”). When male and female parents reproduce, the “zygote” receives one gene copy – either A or a – from each parental gamete, restoring diploidy in the offspring. With two traits, say handedness and eye color (B = brown, b = blue), there are nine possible genotypes: AABB, AABb, AAbb, AaBB, AaBb, Aabb, aaBB, aaBb, aabb, resulting in four possible phenotypes. (AaBb is known as a “dihybrid.”)

According to Mendel’s Law of Independent Assortment, segregation of the alleles of one allelic pair during gamete formation is independent of the segregation of the alleles of another allelic pair. Therefore, a homozygous dominant parent AABB has gametes AB, and a homozygous recessive parent aabb has gametes ab; crossing them consequently results in all dihybrid AaBb offspring in the so-called F1 (or “first filial”) generation, having gametes AB, Ab, aB, and ab, as shown below.

Parental Genotypes: AABB × aabb

Parental Gametes: AB and ab

F1 Genotype: AaBb

F1 Gametes: AB, Ab, aB, ab

It follows that further crossing two such AaBb genotypes results in expected phenotype frequencies in the F2 (“second filial”) generation that follow a 9:3:3:1 ratio, shown in the 4 × 4 Punnett square below.

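F2 Genotypes: each of the male gametes AB, Ab, aB, ab pairs with each of the female gametes AB, Ab, aB, ab, giving 16 equally likely combinations.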

| Phenotypes | Expected Frequencies |
|---|---|
| 1 = Right-handed, Brown-eyed | 9/16 = 0.5625 |
| 2 = Right-handed, Blue-eyed | 3/16 = 0.1875 |
| 3 = Left-handed, Brown-eyed | 3/16 = 0.1875 |
| 4 = Left-handed, Blue-eyed | 1/16 = 0.0625 |
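The 9:3:3:1 ratio can be recovered by brute-force enumeration of the 16 gamete pairings. This is a small sketch in plain Python, assuming complete dominance of A over a and of B over b as described in the text; the helper names `phenotype` and `genotype` are hypothetical.

```python
from collections import Counter
from itertools import product

gametes = ["AB", "Ab", "aB", "ab"]  # gametes produced by an AaBb dihybrid


def phenotype(male: str, female: str) -> int:
    """Phenotype number 1-4, using the numbering of the phenotype table."""
    right_handed = "A" in (male[0], female[0])  # at least one dominant A
    brown_eyed = "B" in (male[1], female[1])    # at least one dominant B
    if right_handed:
        return 1 if brown_eyed else 2
    return 3 if brown_eyed else 4


def genotype(male: str, female: str) -> str:
    """Order-free genotype label such as 'AaBb' (uppercase sorts first)."""
    return "".join(sorted(male[0] + female[0])) + "".join(sorted(male[1] + female[1]))


pairs = list(product(gametes, repeat=2))            # 16 equally likely pairings
counts = Counter(phenotype(m, f) for m, f in pairs)
ratio = [counts[k] for k in (1, 2, 3, 4)]
distinct_genotypes = {genotype(m, f) for m, f in pairs}

print(ratio)                    # [9, 3, 3, 1]
print(len(distinct_genotypes))  # 9
```

Dividing each count by 16 gives the expected frequencies 0.5625, 0.1875, 0.1875, and 0.0625.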

For example, in a random sample of n = 400 such individuals, the expected phenotypic values under the null hypothesis H0: $\pi_1$ = 0.5625, $\pi_2$ = 0.1875, $\pi_3$ = 0.1875, $\pi_4$ = 0.0625 are as follows.

| Phenotype | 1 | 2 | 3 | 4 | Total |
|---|---|---|---|---|---|
| Expected Values | 225 | 75 | 75 | 25 | n = 400 |

These would be compared with the observed values, say

| Phenotype | 1 | 2 | 3 | 4 | Total |
|---|---|---|---|---|---|
| Observed Values | 234 | 67 | 81 | 18 | n = 400 |

via the Chi-squared Goodness of Fit Test:

$$\mathrm{X}^2 = \frac{(+9)^2}{225} + \frac{(-8)^2}{75} + \frac{(+6)^2}{75} + \frac{(-7)^2}{25} = 3.653 \text{ on } df = 3.$$

Because this is less than the .05 Chi-squared critical score of 7.815, the p-value is greater than .05 (its exact value = 0.301), and hence the data are consistent with the 9:3:3:1 ratio in the null hypothesis, which is therefore not rejected at the α = .05 significance level. If this model had been rejected, however, then this would suggest a possible violation of the original assumption of independent assortment of allelic pairs. This is indeed the case in genetic linkage, where the two genes are located in close proximity to one another on the same chromosome.
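The goodness-of-fit arithmetic can be reproduced in a few lines. This is a sketch using only the standard library: the helper `chi2_sf_df3` is a hypothetical name, and it relies on the closed-form upper tail of the chi-squared distribution that holds specifically for 3 degrees of freedom, so it is not a general-purpose p-value routine.

```python
import math

observed = [234, 67, 81, 18]   # observed phenotype counts from the text
ratio = [9, 3, 3, 1]           # Mendelian ratio under H0
n = sum(observed)

# Expected counts: n times 9/16, 3/16, 3/16, 1/16
expected = [n * w / sum(ratio) for w in ratio]

# Goodness-of-fit statistic: sum of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf_df3(x: float) -> float:
    """P(chi-squared with 3 df > x); this closed form is valid only for df = 3."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(chi_sq)
print(round(chi_sq, 3), round(p_value, 3))
```

For other degrees of freedom, a statistics library or a table of critical values (such as the 7.815 quoted above) would be the practical choice.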

If two alleles A and a occur with respective frequencies p and q (= 1 – p ) in a population, then observed genotype frequencies can be compared with those expected from The Hardy-Weinberg Law (namely p^2 for AA, 2 pq for Aa, and q^2 for aa) via a similar Chi-squared Test.
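As a rough illustration of that comparison, the sketch below computes Hardy-Weinberg expected counts and the Chi-squared statistic. The genotype counts are hypothetical, the function name `hardy_weinberg_expected` is an assumption, and the allele frequency p is estimated from the sample itself, in which case the usual convention is df = 3 − 1 − 1 = 1 rather than 2.

```python
from typing import List


def hardy_weinberg_expected(n: int, p: float) -> List[float]:
    """Expected counts of AA, Aa, aa among n individuals: n*p^2, n*2pq, n*q^2."""
    q = 1 - p
    return [n * p * p, n * 2 * p * q, n * q * q]


# Hypothetical observed counts of genotypes AA, Aa, aa (illustration only).
observed = [30, 40, 30]
n = sum(observed)

# Estimate p as the sample frequency of allele A: each AA carries two copies,
# each Aa carries one, out of 2n allele copies in total.
p_hat = (2 * observed[0] + observed[1]) / (2 * n)

expected = hardy_weinberg_expected(n, p_hat)
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(p_hat, expected, chi_sq)
```

If p were instead known in advance for the population, it would be passed in directly and the test would keep df = 2.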

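F2 Punnett square (rows: male gametes; columns: female gametes; the number in parentheses is the phenotype, 1 to 4):

| Male \ Female | AB | Ab | aB | ab |
|---|---|---|---|---|
| AB | AABB (1) | AABb (1) | AaBB (1) | AaBb (1) |
| Ab | AABb (1) | AAbb (2) | AaBb (1) | Aabb (2) |
| aB | AaBB (1) | AaBb (1) | aaBB (3) | aaBb (3) |
| ab | AaBb (1) | Aabb (2) | aaBb (3) | aabb (4) |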

§ 6.3.3 Means

Assume k independent, normally-distributed groups $X_1 \sim N(\mu_1, \sigma_1)$, $X_2 \sim N(\mu_2, \sigma_2)$, …, $X_k \sim N(\mu_k, \sigma_k)$, e.g., corresponding to different treatments. We wish to compare the treatment means with each other in order to determine if there is a significant difference among any of the groups. Hence…

H0: $\mu_1 = \mu_2 = \dots = \mu_k$, i.e., “There is no difference in treatment means, i.e., no treatment effect.”

vs.

HA: “There is at least one treatment mean $\mu_i$ that is different from the others.”

[Figure: distributions of the groups $X_1, X_2, \dots, X_k$ with standard deviations $\sigma_1, \sigma_2, \dots, \sigma_k$, illustrating H0: $\mu_1 = \mu_2 = \dots = \mu_k$]

Key Strategy

The “total variation” in this system can be decomposed into two disjoint sources:

• variation between the groups (via a “treatment” $s^2$ measure)

• variation within the groups (as measured by $s^2_{\text{pooled}}$).

If the former is significantly larger than the latter (i.e., if the ratio is significantly > 1), then there must be a genuine treatment effect, and the null hypothesis can be rejected.

Recall (from the comment at the end of 2.3) that sample variance has the general form

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\text{Sum of Squares}}{\text{degrees of freedom}} = \frac{SS}{df}.$$

That is, SS = (n − 1) $s^2$. Using this fact, the powerful technique of Analysis of Variance (ANOVA) separates the total variation of the system into its two disjoint sources (known as “partitioning sums of squares”), so that a formal test statistic can then be formulated, and a decision regarding the null hypothesis ultimately reached. However, in order to apply this, it is necessary to make the additional assumption of equivariance, i.e., $\sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \dots = \sigma_k^2$, testable using the methods of the preceding section.

[Figure 1: $t_4$ distribution]

Example: For simplicity, take k = 2 balanced samples, say of size $n_1$ = 3 and $n_2$ = 3, from two independent, normally distributed populations:

$X_1$: {$x_{11}, x_{12}, x_{13}$} = {50, 53, 71}

$X_2$: {$x_{21}, x_{22}, x_{23}$} = {1, 4, 25}

The null hypothesis H0: $\mu_1 = \mu_2$ is to be tested against the alternative HA: $\mu_1 \neq \mu_2$ at the α = .05 level of significance, as usual. In this case, the difference in magnitudes between the two samples appears sufficiently substantial that significance seems evident, despite the small sample sizes.

The following summary statistics are an elementary exercise:

$\bar{x}_1$ = 58, $\bar{x}_2$ = 10, $s_1^2$ = 129, $s_2^2$ = 171

Also, the grand mean is calculated as:

$$\bar{x} = \frac{50 + 53 + 71 + 1 + 4 + 25}{3 + 3} = \frac{3(58) + 3(10)}{3 + 3} = 34$$

Therefore,

$$s^2_{\text{pooled}} = \frac{(3 - 1)(129) + (3 - 1)(171)}{(3 - 1) + (3 - 1)} = \frac{600}{4} = \frac{SS_{\text{Error}}}{df_{\text{Error}}} = 150.$$
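These summary statistics are easy to confirm with Python's standard `statistics` module. The snippet is only a check of the arithmetic above; the variable names are illustrative.

```python
from statistics import mean, variance

x1 = [50, 53, 71]
x2 = [1, 4, 25]

m1, m2 = mean(x1), mean(x2)            # group means
v1, v2 = variance(x1), variance(x2)    # sample variances (n - 1 denominator)
grand_mean = mean(x1 + x2)             # mean of all six observations

# SS = (n - 1) * s^2 for each group, added to give the error sum of squares
ss_error = (len(x1) - 1) * v1 + (len(x2) - 1) * v2
df_error = (len(x1) - 1) + (len(x2) - 1)
s2_pooled = ss_error / df_error

print(m1, m2, v1, v2, grand_mean, ss_error, df_error, s2_pooled)
```

Note that `statistics.variance` is the sample variance; `statistics.pvariance` would divide by n instead and give different numbers.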

We are now in a position to carry out formal testing of the null hypothesis.

Method 1. (Old way: two-sample t-test) In order to use the t-test, we must first verify equivariance $\sigma_1^2 = \sigma_2^2$. The computed sample variances of 129 and 171 are certainly sufficiently close that this condition is reasonably satisfied. (Or, check that the ratio 129/171 is between 0.25 and 4.) Now, recall from the formula for standard error, that:

$$\text{s.e.} = \sqrt{s^2_{\text{pooled}} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{150 \left( \frac{1}{3} + \frac{1}{3} \right)} = 10.$$

Hence,

$$p\text{-value} = 2\,P(\bar{X}_1 - \bar{X}_2 \geq 48) = 2\,P\!\left(T_4 \geq \frac{48}{10}\right) = 2\,P(T_4 \geq 4.8) = 2(.0043) = .0086 < .05$$

so the null hypothesis is (strongly) rejected; a significant difference exists at this level.
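The t-test numbers above can be verified without a statistics library. The sketch below is an illustration only: `t4_upper_tail` is a hypothetical helper built on the closed-form CDF of the t distribution that is valid for exactly 4 degrees of freedom, so it would not work for other sample sizes.

```python
import math

n1, n2 = 3, 3
mean1, mean2 = 58, 10
s2_pooled = 150
df = n1 + n2 - 2   # 4 degrees of freedom

std_error = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
t_score = (mean1 - mean2) / std_error


def t4_upper_tail(t: float) -> float:
    """P(T_4 >= t), from the closed-form CDF that holds only for df = 4."""
    w = 1 + t * t / 4
    cdf = 0.5 + (3 / 8) * (t / math.sqrt(w)) * (1 - (t * t) / (12 * w))
    return 1 - cdf


p_value = 2 * t4_upper_tail(t_score)   # two-sided p-value
print(round(std_error, 4), round(t_score, 4), round(p_value, 4))
```

As a sanity check on the helper, the familiar two-sided .05 critical value for 4 degrees of freedom, 2.776, gives an upper tail of about .025.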

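ANOVA Table

| Source | df | SS (“Sum of Squares”) | MS = SS/df (“Mean Squares”) | Test Statistic $F = MS_{\text{Trt}} / MS_{\text{Err}}$ ($F_{1,4}$ distribution) | p-value |
|---|---|---|---|---|---|
| Treatment | 1 | 3456 | 3456 (= $s^2_{\text{between}}$) | 23.04 | .0086 |
| Error | 4 | 600 | 150 (= $s^2_{\text{within}}$) | | |
| Total | 5 | 4056 | − | | |

[Figure 2: $F_{1,4}$ distribution, showing the score 23.04 and the corresponding p-value]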

The $F_{1,4}$-score of 23.04 is certainly much greater than 1 (roughly the ratio expected under the null hypothesis of no treatment difference, when both mean squares estimate the same variance), and is in fact greater than 7.71, the $F_{1,4}$ critical value for α = .05. Hence the small p-value, and significance is established.

In fact, the ratio $SS_{\text{Trt}} / SS_{\text{Total}}$ = 3456/4056 = 0.852 indicates that 85.2% of the total variation in response is due to the treatment effect!

Comment: Note that 23.04 = (±4.8)$^2$, i.e., $F_{1,4} = t_4^2$. In general, $F_{1,df} = t_{df}^2$ for any df. Hence the two tests are mathematically equivalent to each other. Compare Figs 1 and 2.
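The equivalence can be checked numerically from the summary statistics of the example. The following is a short sketch, with illustrative variable names, that rebuilds the two sums of squares, forms the F-score, and compares it with the squared t-score.

```python
import math

# Summary statistics quoted in the example
n = [3, 3]
means = [58, 10]
variances = [129, 171]
grand_mean = 34

# Between-group (treatment) and within-group (error) sums of squares
ss_trt = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))
ss_err = sum((ni - 1) * v for ni, v in zip(n, variances))
df_trt, df_err = len(n) - 1, sum(n) - len(n)

ms_trt = ss_trt / df_trt
ms_err = ss_err / df_err           # this is the pooled variance
f_score = ms_trt / ms_err

# Two-sample t-score built from the same pooled variance
t_score = (means[0] - means[1]) / math.sqrt(ms_err * (1 / n[0] + 1 / n[1]))

share_explained = ss_trt / (ss_trt + ss_err)
print(ss_trt, ss_err, f_score, t_score ** 2, round(share_explained, 3))
```

The agreement of `f_score` with `t_score ** 2` is up to floating-point rounding, which is why a tolerance rather than exact equality is the safer comparison.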

General ANOVA formulation

Consider now the general case of k independent, normally-distributed, equivariant groups.

| | |
|---|---|
| Treatment Groups | $X_1 \sim N(\mu_1, \sigma_1)$, $X_2 \sim N(\mu_2, \sigma_2)$, …, $X_k \sim N(\mu_k, \sigma_k)$ |
| Sample Sizes | $n_1 + n_2 + \dots + n_k = n$ |
| Group Means | $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$ |
| Group Variances | $s_1^2, s_2^2, \dots, s_k^2$ |
| Grand Mean | $\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n}$ |
| Pooled Variance | $s^2_{\text{within}} = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + \dots + (n_k - 1) s_k^2}{n - k}$ |
| Null Hypothesis | H0: $\mu_1 = \mu_2 = \dots = \mu_k$ ⇔ “No treatment difference exists.” |
| Alternative Hyp. | HA: $\mu_i \neq \mu_j$ for some i ≠ j ⇔ “A treatment difference exists.” |

| Source | df | SS | MS | F-statistic | p-value |
|---|---|---|---|---|---|
| Treatment | k − 1 | $\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$ | $s^2_{\text{between}}$ | $F = MS_{\text{Trt}} / MS_{\text{Err}} \sim F_{k-1,\, n-k}$ | 0 ≤ p ≤ 1 |
| Error | n − k | $\sum_{i=1}^{k} (n_i - 1) s_i^2$ | $s^2_{\text{within}}$ | | |
| Total | n − 1 | $\sum_{\text{all } i,j} (x_{ij} - \bar{x})^2$ | − | | |

[Figure: $F_{k-1,\, n-k}$ distribution, showing the F-score and the corresponding p-value]

Comments:

• This is referred to as the overall F-test of significance. If the null hypothesis is rejected, then (the mean value of at least) one of the treatment groups is different from the others. But which one(s)?

• Nonparametric form of ANOVA: Kruskal-Wallis Test

• Appendix > Geometric Viewpoint > ANOVA
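The general formulation can be sketched as a single function that returns the entries of the ANOVA table. This is a minimal illustration in plain Python, not the author's code: the name `one_way_anova` and the unbalanced k = 3 data are hypothetical, and the p-value is omitted because it needs the upper tail of the $F_{k-1,\,n-k}$ distribution from a statistics library or table.

```python
from typing import Dict, Sequence


def one_way_anova(groups: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Partition sums of squares for k groups and return the ANOVA table entries."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Treatment SS: sum over groups of n_i * (group mean - grand mean)^2
    ss_trt = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error SS: squared deviations from each group's own mean, (n_i - 1) * s_i^2
    ss_err = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total SS: squared deviations of every observation from the grand mean
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)

    ms_trt = ss_trt / (k - 1)   # s^2 between
    ms_err = ss_err / (n - k)   # s^2 within (pooled variance)
    return {
        "df_trt": k - 1, "df_err": n - k, "df_total": n - 1,
        "ss_trt": ss_trt, "ss_err": ss_err, "ss_total": ss_total,
        "ms_trt": ms_trt, "ms_err": ms_err, "F": ms_trt / ms_err,
    }


# The k = 2 example from the notes
example = one_way_anova([[50, 53, 71], [1, 4, 25]])

# Hypothetical unbalanced data with k = 3 groups (illustration only)
table = one_way_anova([[4, 6, 8], [10, 12], [1, 2, 3, 6]])
print(example["F"], table["ss_trt"], table["ss_err"], table["ss_total"])
```

In both cases the treatment and error sums of squares add up to the total, which is the “partitioning sums of squares” identity the technique rests on.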