Permutation tests, Exercises of Genetics

A permutation test gives a simple way to compute the sampling distribution for any test statistic, under the strong null hypothesis that a set of genetic ...

Typology: Exercises

2022/2023

Uploaded on 05/11/2023

bridge
bridge 🇺🇸

4.9

(13)

287 documents

1 / 27

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Permutation tests
Ken Rice
Thomas Lumley
UW Biostatistics
Seattle, June 2008
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b

Partial preview of the text

Download Permutation tests and more Exercises Genetics in PDF only on Docsity!

Permutation tests

Ken Rice Thomas Lumley

UW Biostatistics

Seattle, June 2008

Overview

  • Permutation tests
  • A mean
  • Smallest p-value across multiple models
  • Cautionary notes

Permutations

To estimate the sampling distribution of the test statistic we need many samples generated under the strong null hypothesis.

If the null hypothesis is true, changing the exposure would have no effect on the outcome. By randomly shuffling the exposures we can make up as many data sets as we like.

If the null hypothesis is true the shuffled data sets should look like the real data, otherwise they should look different from the real data.

The ranking of the real test statistic among the shuffled test statistics gives a p-value

Example: null is true

gender outcome

Data

gender outcome

Shuffling outcomes

gender outcome

Shuffling outcomes (ordered)

Means

Our first example is a difference in mean outcome in a dominant model for a single SNP

make up some ‘true’ data

carrier<-rep(c(0,1), c(100,200)) null.y<-rnorm(300) alt.y<-rnorm(300, mean=carrier/2)

In this case we know from theory the distribution of a difference in means and we could just do a t-test.

We can compare the t-test results to the results of a permutation test on the mean difference.

Means: t-test

t.test(null.y~carrier, var.equal=TRUE) Two Sample t-test

data: null.y by carrier t = 0.5348, df = 298, p-value = 0. alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.1740603 0. sample estimates: mean in group 0 mean in group 1 0.067211652 0.

t.test(alt.y~carrier, var.equal=TRUE) Two Sample t-test

data: alt.y by carrier t = -5.0415, df = 298, p-value = 8.033e- alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.8381897 -0. sample estimates: mean in group 0 mean in group 1 -0.1405250 0.

Means: permutation test

hist(many.truenull) abline(v=null.diff, lwd=2, col="purple") mean(abs(many.truenull) > abs(null.diff))

hist(many.falsenull) abline(v=alt.diff, lwd=2, col="purple") mean(abs(many.falsenull) > abs(alt.diff))

Example: null is true

616 shuffled differences exceeed true difference: p = 617/ 1001

How many permutations?

With 1000 permutations the smallest possible p-value is 0.001, and the uncertainty near p = 0.05 is about ±1%

If we have multiple testing we may need much more precision.

Using 100,000 permutations reduces the uncertainty near p = 0 .05 to ± 0 .1% and allows p-values as small as 0.00001.

A useful strategy is to start with 1000 permutations and continue to larger numbers only if p is small enough to be interesting, eg p < 0 .1.

Parallel computing of permutations is easy: just run R on multiple computers.

Debugging

R error messages are sometimes opaque, because the error occurs in a low-level function.

traceback() reports the entire call stack, which is useful for seeing where the error really happened.

Suppose our outcome variable were actually a data frame with one column rather than a vector:

one.test(carrier,ywrong) Error in ‘[.data.frame‘(y, xstar == 1) : undefined columns selected

We didn’t know we were calling ‘[.data.frame‘, so we don’t understand the message

Debugging

The post-mortem debugger lets you look inside the code where the error occurred.

options(error=recover) one.test(carrier,ywrong) Error in ‘[.data.frame‘(y, xstar == 1) : undefined columns selected

Enter a frame number, or 0 to exit

1: one.test(carrier, ywrong) 2: mean(y[xstar == 1]) 3: y[xstar == 1] 4: ‘[.data.frame‘(y, xstar == 1)

Selection: 1 Called from: eval(expr, envir, enclos)

Debugging

Browse[1]> str(xstar) num [1:300] 1 1 1 1 1 1 1 0 1 1 ... Browse[1]> str(y) ’data.frame’: 300 obs. of 1 variable: $ null.y: num 1.265 0.590 -0.722 0.676 -0. ...

Turn the post-mortem debugger off with

options(error=NULL)

Minimum p-value

dat <- data.frame(y=rep(0:1,each=100), SNP1=rbinom(200,2,.1), SNP2=rbinom(200,2,.2),SNP3=rbinom(200,2,.2), SNP4=rbinom(200,2,.4),SNP5=rbinom(200,2,.1), SNP6=rbinom(200,2,.2),SNP7=rbinom(200,2,.2), SNP8=rbinom(200,2,.4))

head(dat) y SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP 1 0 0 0 0 0 0 1 0 0 2 0 0 1 0 1 0 1 0 2 3 0 0 1 0 1 1 0 0 0 4 0 0 0 1 1 0 0 0 0 5 0 0 1 0 1 1 0 0 0 6 0 0 0 0 1 0 1 0 1

Minimum p-value

oneZ<-function(outcome, snp){ model <- glm(outcome~snp, family=binomial()) coef(summary(model))["snp","z value"] } maxZ<-function(outcome, snps){ allZs <- sapply(snps, function(snp) oneZ(outcome, snp)) max(abs(allZs)) }

true.maxZ<-maxZ(dat$y, dat[,-1])

manypermZ<-replicate(10000, maxZ(sample(dat$y), dat[,-1]))