



















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
A permutation test gives a simple way to compute the sampling distribution for any test statistic, under the strong null hypothesis that a set of genetic ...
Typology: Exercises
1 / 27
This page cannot be seen from the preview
Don't miss anything!




















Ken Rice Thomas Lumley
UW Biostatistics
Seattle, June 2008
To estimate the sampling distribution of the test statistic we need many samples generated under the strong null hypothesis.
If the null hypothesis is true, changing the exposure would have no effect on the outcome. By randomly shuffling the exposures we can make up as many data sets as we like.
If the null hypothesis is true the shuffled data sets should look like the real data, otherwise they should look different from the real data.
The ranking of the real test statistic among the shuffled test statistics gives a p-value
gender outcome
Data
gender outcome
Shuffling outcomes
gender outcome
Shuffling outcomes (ordered)
Our first example is a difference in mean outcome in a dominant model for a single SNP
carrier<-rep(c(0,1), c(100,200)) null.y<-rnorm(300) alt.y<-rnorm(300, mean=carrier/2)
In this case we know from theory the distribution of a difference in means and we could just do a t-test.
We can compare the t-test results to the results of a permutation test on the mean difference.
t.test(null.y~carrier, var.equal=TRUE) Two Sample t-test
data: null.y by carrier t = 0.5348, df = 298, p-value = 0. alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.1740603 0. sample estimates: mean in group 0 mean in group 1 0.067211652 0.
t.test(alt.y~carrier, var.equal=TRUE) Two Sample t-test
data: alt.y by carrier t = -5.0415, df = 298, p-value = 8.033e- alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.8381897 -0. sample estimates: mean in group 0 mean in group 1 -0.1405250 0.
hist(many.truenull) abline(v=null.diff, lwd=2, col="purple") mean(abs(many.truenull) > abs(null.diff))
hist(many.falsenull) abline(v=alt.diff, lwd=2, col="purple") mean(abs(many.falsenull) > abs(alt.diff))
616 shuffled differences exceeed true difference: p = 617/ 1001
With 1000 permutations the smallest possible p-value is 0.001, and the uncertainty near p = 0.05 is about ±1%
If we have multiple testing we may need much more precision.
Using 100,000 permutations reduces the uncertainty near p = 0 .05 to ± 0 .1% and allows p-values as small as 0.00001.
A useful strategy is to start with 1000 permutations and continue to larger numbers only if p is small enough to be interesting, eg p < 0 .1.
Parallel computing of permutations is easy: just run R on multiple computers.
R error messages are sometimes opaque, because the error occurs in a low-level function.
traceback() reports the entire call stack, which is useful for seeing where the error really happened.
Suppose our outcome variable were actually a data frame with one column rather than a vector:
one.test(carrier,ywrong) Error in ‘[.data.frame‘(y, xstar == 1) : undefined columns selected
We didn’t know we were calling ‘[.data.frame‘, so we don’t understand the message
The post-mortem debugger lets you look inside the code where the error occurred.
options(error=recover) one.test(carrier,ywrong) Error in ‘[.data.frame‘(y, xstar == 1) : undefined columns selected
Enter a frame number, or 0 to exit
1: one.test(carrier, ywrong) 2: mean(y[xstar == 1]) 3: y[xstar == 1] 4: ‘[.data.frame‘(y, xstar == 1)
Selection: 1 Called from: eval(expr, envir, enclos)
Browse[1]> str(xstar) num [1:300] 1 1 1 1 1 1 1 0 1 1 ... Browse[1]> str(y) ’data.frame’: 300 obs. of 1 variable: $ null.y: num 1.265 0.590 -0.722 0.676 -0. ...
Turn the post-mortem debugger off with
options(error=NULL)
dat <- data.frame(y=rep(0:1,each=100), SNP1=rbinom(200,2,.1), SNP2=rbinom(200,2,.2),SNP3=rbinom(200,2,.2), SNP4=rbinom(200,2,.4),SNP5=rbinom(200,2,.1), SNP6=rbinom(200,2,.2),SNP7=rbinom(200,2,.2), SNP8=rbinom(200,2,.4))
head(dat) y SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP 1 0 0 0 0 0 0 1 0 0 2 0 0 1 0 1 0 1 0 2 3 0 0 1 0 1 1 0 0 0 4 0 0 0 1 1 0 0 0 0 5 0 0 1 0 1 1 0 0 0 6 0 0 0 0 1 0 1 0 1
oneZ<-function(outcome, snp){ model <- glm(outcome~snp, family=binomial()) coef(summary(model))["snp","z value"] } maxZ<-function(outcome, snps){ allZs <- sapply(snps, function(snp) oneZ(outcome, snp)) max(abs(allZs)) }
true.maxZ<-maxZ(dat$y, dat[,-1])
manypermZ<-replicate(10000, maxZ(sample(dat$y), dat[,-1]))