Statistics 431: Paired Samples and Two-Sample Inference, Study notes of Statistics

A portion of the lecture notes from a Statistics 431 course, focusing on paired samples and two-sample inference. The notes cover the basics of paired samples, the difference between means for paired data, and the benefits of using paired rather than unpaired data. They also introduce inference about two population proportions and provide large-sample tests and confidence intervals.

Uploaded on 03/28/2010

koofers-user-pq1

Statistics 431:
Statistical Inference
Lecture 8: More on two-sample inference

Paired samples: basics

  • Our two samples $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ up to now have been unpaired: there's no correspondence between observations in the first sample and observations in the second. (The number of $X_i$'s is not even the same as the number of $Y_i$'s, in general.)
  • In some situations where $m = n$, there is a natural matching of each $X_i$ with one $Y_i$, to form the $n$ pairs $(X_i, Y_i)$.
  • Examples:
    • $n$ patients in a clinical trial are given a sedative. The drowsiness of each patient $i$ is measured one hour ($X_i$) and two hours ($Y_i$) after treatment.
    • $n$ houses are selected at random in Philadelphia. At each house $i$, the radon level is measured in the basement ($X_i$) and on the highest floor ($Y_i$).
    • For each of $n$ sets of identical twins, we measure the body mass index of the older ($X_i$) and younger ($Y_i$) twin.
  • Instead of comparing the sampled $X_i$'s to the sampled $Y_i$'s, as for unpaired data, we should focus on comparing the pairs $(X_i, Y_i)$ to each other.

Why look at pairs?

  • Include pairing

[Figure: "Time by Patient". Drowsiness (vertical axis, ticks from 8 to 12) at +1h and +2h, one small panel per patient; surviving panel labels run up to P20.]

Difference between means: paired test

  • The data: $(X_1, Y_1), \dots, (X_n, Y_n) \sim p(x, y)$, IID. We are not assuming the $X_i$'s are independent of the $Y_i$'s as we did with unpaired data.
  • $E X_i = \mu_1$, $E Y_i = \mu_2$.
  • Define the within-pair differences $D_1 = X_1 - Y_1, \dots, D_n = X_n - Y_n$.
  • Let $\mu_D = E D_i$ (we know $\mu_D = \mu_1 - \mu_2$), and let $\sigma_D^2 = \mathrm{Var}(D_i)$.
  • Under $H_0: \mu_D = \Delta_0$ (the same null as with unpaired data), and substituting in the sample variance, we get

$$T = \frac{\bar{D} - \Delta_0}{S_D / \sqrt{n}} \approx N(0, 1)$$

for large samples. For small samples, $T \sim t_{n-1}$, once we add the assumption that each $D_i$ is normal.

  • As before:
    • $H_A: \mu_D \neq \Delta_0$ ⇒ reject when $|T| > c_\alpha$
    • $H_A: \mu_D > \Delta_0$ ⇒ reject when $T > c_\alpha$
    • $H_A: \mu_D < \Delta_0$ ⇒ reject when $T < c_\alpha$.
  • As before, set $c_\alpha$ using $N(0, 1)$ in large samples and $t_{n-1}$ in small samples.
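
The paired statistic can be sketched in a few lines of Python. This is only an illustration: the function names and the drowsiness scores are made up for the example, and the p-value uses the large-sample $N(0, 1)$ reference. With a sample as small as the one shown, the $t_{n-1}$ reference would be the appropriate one, but the standard library has no t distribution.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_statistic(x, y, delta0=0.0):
    """Return (T, n) for H0: mu_D = delta0, where D_i = X_i - Y_i."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    d = [xi - yi for xi, yi in zip(x, y)]  # within-pair differences
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)  # sample standard deviation of the differences (divisor n - 1)
    return (d_bar - delta0) / (s_d / math.sqrt(n)), n


def two_sided_p_value_normal(t_stat):
    """Two-sided p-value from the N(0, 1) reference (large-sample approximation)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Hypothetical drowsiness scores at +1h (x) and +2h (y); not data from the notes.
x = [10.0, 11.0, 9.0, 12.0, 10.0]
y = [9.0, 9.0, 8.0, 10.0, 10.0]

t_stat, n = paired_t_statistic(x, y)
p_value = two_sided_p_value_normal(t_stat)
print(f"T = {t_stat:.4f} with n = {n} pairs, normal-approximation p = {p_value:.4f}")
```

For a one-sided alternative, the same `t_stat` is compared against a one-sided critical value instead.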

Collect paired or unpaired data?

  • We can be more precise about the benefits of paired data.
  • Just like $\mathrm{Var}(\bar{X}) = \sigma_X^2 / m$ and $\mathrm{Var}(\bar{Y}) = \sigma_Y^2 / n$, we know that $\mathrm{Var}(\bar{D}) = \sigma_D^2 / n$.
  • But $\sigma_D^2 = \mathrm{Var}(D_i) = \mathrm{Var}(X_i - Y_i) = \sigma_X^2 + \sigma_Y^2 - 2 \cdot \mathrm{Cov}(X_i, Y_i)$.
  • So: when $X_i$ and $Y_i$ are positively correlated, $\mathrm{Var}(\bar{D})$ is smaller than when they are uncorrelated.
  • Very often, the within-pair correlation is positive, rather than negative. In a paired analysis, the positive correlation is reflected in $S_D^2$, which usually turns out smaller than $S_X^2 + S_Y^2$.
  • However: with $n$ $X_i$'s and $n$ $Y_i$'s, we have $2n$ unpaired observations, but only $n$ pairs. So the reference t distribution in a small sample has $2n - 2$ degrees of freedom for unpaired data (with the pooled-variance test), which is better than $n - 1$ for paired data.
  • Despite this, for moderate $n$, the (usually) smaller SE of $\bar{D}$ leads to increased testing power via a larger $T$ statistic, and narrower CIs.
  • If $X_i$ and $Y_i$ are negatively correlated in a paired dataset, you still need to do a paired analysis, but you would have been better off without pairing!
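
The variance identity above also holds exactly for the sample versions, $S_D^2 = S_X^2 + S_Y^2 - 2 \cdot \widehat{\mathrm{Cov}}(X, Y)$, which a small numerical check makes concrete. The sketch below uses made-up, positively correlated data; the helper `sample_cov` and the variable names are illustrative, not part of the notes.

```python
from statistics import mean, variance


def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


# Hypothetical paired measurements; not data from the notes.
x = [10.0, 11.0, 9.0, 12.0, 10.0]
y = [9.0, 9.0, 8.0, 10.0, 10.0]
d = [a - b for a, b in zip(x, y)]
n = len(d)

s2_x, s2_y, s2_d = variance(x), variance(y), variance(d)
cov_xy = sample_cov(x, y)

# Estimated Var(D-bar) under pairing versus the unpaired estimate
# of Var(X-bar - Y-bar), which ignores the covariance term.
se_paired_sq = s2_d / n
se_unpaired_sq = s2_x / n + s2_y / n

print(f"S_D^2 = {s2_d:.3f}, S_X^2 + S_Y^2 - 2*Cov = {s2_x + s2_y - 2 * cov_xy:.3f}")
print(f"paired SE^2 = {se_paired_sq:.3f}, unpaired SE^2 = {se_unpaired_sq:.3f}")
```

Because the sample covariance here is positive, the paired squared standard error comes out smaller than the unpaired one, as the bullets above describe.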

Inference about two population proportions

  • Recall the one-sample setup: $p$ is the proportion of "successes" in a population, i.e. the fraction of the population possessing some characteristic.
  • Now we have two populations: the success fraction in population "A" is $p_1$, and in "B" it is $p_2$.
  • We draw $m$ times independently from population "A", and call $X$ the number of successes in the sample. Similarly, we make $n$ independent draws from population "B", and call $Y$ the number of successes.
  • Then $X \sim \mathrm{Bin}(m, p_1)$ and $Y \sim \mathrm{Bin}(n, p_2)$. We assume the two samples are drawn independently, so $X$ is independent of $Y$.
  • Example: in 1954, a large-scale randomized controlled experiment was conducted to study Salk's polio vaccine. Randomizing a child to the control (placebo) group is like drawing from a population having probability $p_1$ of "success" (contracting polio). Randomizing a child to the treatment (vaccination) group is like drawing from a population with probability $p_2$ of contracting polio.
  • In 1954, people were very interested in $p_1 - p_2$.
  • If $p_1 = p_2 = p$, then $X \sim \mathrm{Bin}(m, p)$, and, independently, $Y \sim \mathrm{Bin}(n, p)$. But then $X + Y \sim \mathrm{Bin}(m + n, p)$.
  • So a natural estimator of $p$ under $H_0: p_1 = p_2 = p$ is just $\hat{p} = (X + Y)/(m + n)$.
  • This results in the large-sample test statistic

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/m + 1/n)}}$$

and the procedures
    • $H_A: p_1 - p_2 \neq 0$ ⇒ reject when $|Z| > z_{\alpha/2}$
    • $H_A: p_1 - p_2 > 0$ ⇒ reject when $Z > z_\alpha$
    • $H_A: p_1 - p_2 < 0$ ⇒ reject when $Z < -z_\alpha$.
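
As a rough sketch of the pooled test, the statistic $Z$ can be computed directly from the counts, with $\hat{p}_1 = X/m$ and $\hat{p}_2 = Y/n$. The function name and the counts below are hypothetical (they are not the Salk trial figures), and the two-sided decision at $\alpha = 0.05$ is just one of the three procedures.

```python
import math
from statistics import NormalDist


def two_proportion_z(x, m, y, n):
    """Large-sample Z statistic for H0: p1 = p2, using the pooled estimate of p."""
    p1_hat = x / m
    p2_hat = y / n
    p_pool = (x + y) / (m + n)  # estimator of the common p under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / m + 1 / n))
    return (p1_hat - p2_hat) / se


# Hypothetical counts: 60 successes in 200 draws from "A", 40 in 200 from "B".
z = two_proportion_z(60, 200, 40, 200)

alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 point of N(0, 1)
reject_two_sided = abs(z) > z_crit

print(f"Z = {z:.4f}, critical value = {z_crit:.4f}, reject H0: {reject_two_sided}")
```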

Two population proportions: large-sample CIs

  • With both $m$ and $n$ large, and substituting sample proportions in the denominator,

$$Z = \frac{\hat{p}_1 - \hat{p}_2 - (p_1 - p_2)}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/m + \hat{p}_2(1 - \hat{p}_2)/n}} \approx N(0, 1).$$

  • Try this at home: write down the $100(1 - \alpha)\%$ pivoting confidence statement for $p_1 - p_2$, using $Z$. Rearrange it to obtain the confidence interval

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{m} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n}}.$$

  • A suggested correction when $m$ and $n$ are not huge: replace $\hat{p}_1 = X/m$ with $\tilde{p}_1 = (X + 1)/(m + 2)$ and $\hat{p}_2$ with the analogous $\tilde{p}_2$. This can improve the quality of the normal approximation, hence the correctness of the CI.
  • When $m$ or $n$ is quite small? A different approach is needed, which we won't cover. (Take more statistics courses.)
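
The interval and its suggested correction can be sketched as follows. The function name, the `adjusted` flag, and the counts are hypothetical. The correction is applied exactly as worded above: only the proportions are replaced, and the denominators inside the square root stay at $m$ and $n$. The notes do not say whether the denominators should change too, so that choice is an assumption.

```python
import math
from statistics import NormalDist


def diff_proportions_ci(x, m, y, n, alpha=0.05, adjusted=False):
    """Large-sample 100(1 - alpha)% CI for p1 - p2.

    With adjusted=True, p-hat is replaced by p-tilde = (successes + 1) / (draws + 2).
    Assumption: the denominators under the square root remain m and n.
    """
    if adjusted:
        p1, p2 = (x + 1) / (m + 2), (y + 1) / (n + 2)
    else:
        p1, p2 = x / m, y / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * math.sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    return p1 - p2 - half_width, p1 - p2 + half_width


# Hypothetical counts: 60 successes in 200 draws from "A", 40 in 200 from "B".
lo, hi = diff_proportions_ci(60, 200, 40, 200)
lo_adj, hi_adj = diff_proportions_ci(60, 200, 40, 200, adjusted=True)

print(f"unadjusted 95% CI: ({lo:.4f}, {hi:.4f})")
print(f"adjusted   95% CI: ({lo_adj:.4f}, {hi_adj:.4f})")
```

With samples of this size the two intervals are nearly identical; the adjustment matters more as $m$ and $n$ shrink.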