


Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Instructions for homework 5 in stat 421, a fall 2008 course. Students are required to analyze a flux dataset, understand a function named pval.delta, and test a hypothesis about randomly generated data using the central limit theorem (clt) or simulation. The goal is to determine the probability that the mean of a dataset follows a normal distribution and to find the confidence interval for a parameter expressing the shift in means due to a difference in fluxes.
Typology: Assignments
1 / 4
This page cannot be seen from the preview
Don't miss anything!



Stat 421, Fall 2008 Fritz Scholz Homework 5 Due: October 31, 2008 Problem 1: Using the Flux data set, assume that for some reason we received less data, say 5 for flux X and 5 for flux Y. In particular, work with the flux data set, where the rows/boards 1,5,6,9,10,13,14, have been removed. Explain line by line what the following function does, by embedding comment lines # ...... Note that there is a function fun0 defined within pval.delta. That works as long as you define the function before you call it. See what happens if you define the function after you first call it. pval.delta = function (delta=0) { fun0=function(ind,y){mean(y[ind])-mean(y[-ind])}
flux0=flux[-c(1,5,6,9,10,13,14,15),] SIR=flux0$SIR FLUX=flux0$FLUX sir.delta=SIR sir.delta[FLUX=="Y"]=sir.delta[FLUX=="Y"]-delta ybar.xbar.observed=mean(SIR[FLUX=="Y"])-delta- mean(SIR[FLUX=="X"]) ybar.xbar.ref.dist=combn(1:10,5,FUN=fun0,y=sir.delta) pval=mean(abs(ybar.xbar.ref.dist)>=.999999*abs(ybar.xbar.obs erved)) pval } Think of delta as a possible effect due to flux difference. By subtracting out that delta from all SIR values that were treated with Y we remove such a difference effect. The test of the null hypothesis (no difference in flux) should then lead to acceptance (high p-values) as long as we remove the right amount delta from the SIR values obtained under FLUX Y. In effect we test the hypothesis that the effect of FLUX=Y is a delta shift in the SIR values relative to the SIR values under FLUX=X. After removing such a delta shift from SIR under FLUX Y we treat
the situation as though there is no FLUX effect, which represented our null hypothesis when we first introduced this example. Exercise this function for various values of delta and record the resulting output pval. After getting a feel for this function try to determine for which set of delta values your p-values stay at or above .05. Write a loop function , say delta.plot, that exercises the above function for an appropriate grid-vector of delta values, gets the resulting vector of p-values, and plots the latter against delta, drawing a horizontal line at the level .05 (using abline(h=.05)). Give the code and the plot. The set of delta values with p-value >= .05 is an interval. Try to determine the interval end points to within one decimal accuracy, e.g, [-2.1,-.5]. This may take some chiseling in the choices of the delta vector in your loop function. What you will have determined is a 95% confidence interval for delta, a parameter that expresses the shift in means due to a difference in fluxes. This is a confidence interval for delta based on the randomization test. Does this interval contain zero? Would you have accepted the hypothesis of no flux difference without making any delta adjustment? [ Further commentary: Such confidence intervals (sets) exploit the basic duality between confidence sets and tests of hypotheses. Our hypothesis H(delta.0): delta=delta.0 expresses the fact that after subtracting delta.0 from all SIR.Y values there is no more difference between FLUX X and FLUX Y. This can be examined by testing H.0: no difference in Fluxes in view of our modified SIR values, i.e., SIR.X and SIR.Y-delta.0. The fact that such delta shift in the SIR responses represent the only possible way in which FLUX X and FLUX Y may differ is a strong assumption. The basic duality between confidence sets and tests of hypotheses consists in defining as 1-alpha level confidence set the set C(SIR.X,SIR.Y) of all delta.0 for which we cannot reject the hypothesis H(delta.0) at level alpha based on SIR.X and SIR.Y, or for which we cannot reject the hypothesis H.0 based on SIR.X and SIR.Y-delta.0. Then delta.0 is in C(SIR.X,SIR.Y) if and only if H(delta.0): delta=delta.0 is accepted by the corresponding level alpha test. We then have P(delta.0 in C(SIR.X,SIR.Y) |H(delta.0)) = P(H(delta.0) is accepted|H(delta.0)) =1-alpha.
To apply the CLT, find the mean and variance of each Bi and from that the mean and variance of Bidi (treating the di as given constants) and from that the mean and variance of Dbar, using the rules of expectation and variance. Based on this find the normal approximation to the full reference distribution of the Dbar values and give the approximate two-sided p-value of Dbar.observed. Examine the variance ratio criterion that plays an important role in the CLT. What is its numerical size here, expressed in terms of the variances of Bidi/n? For the simulation approach generate random vectors (B 1 ,…, Bn) and evaluate Dbar each time, using the given vector (d 1 , …, dn ). Hint: to generate such a vector (B 1 ,…, Bn), note that rbinom(n,1,.5) generates a vector of length n filled with 0’s and 1’s, independently chosen with probability ½ each. You can transform 0’s and 1’s into -1 and +1, respectively, using rbinom(n,1,.5)*a-b for appropriate a and b, try various choices of a and b to get the hang of it. Do this generation of Dbar Nsim=10000 times in a loop as part of a function. This function should also plot a histogram of the Nsim generated Dbar values and show superimposed the normal density determined in the previous approach. This simulated histogram is also a good approximation to the full reference distribution. Use this simulated set of Nsim Dbar values to estimate the two-sided p-value for Dbar.observed. Give the details of your normal approximation, the value of the variance ratio criterion, your simulation code, the histogram plot with normal density, and the two p-value assessments. Try to make the code general enough so that it can be applied to a data set D.vec of any length n>1.