Statistical Analysis: Model Interpretation & Comparison of Survival Models, Exams of Mathematics

Statistical analysis of weight data using linear regression models lm1, lm2, and lm3. It also includes the rewrite of anova table for lmsf.y model and analysis of variance for eggs data using glm models. Furthermore, it discusses the concept of right censoring in survival data analysis and derives an expression for the survivor function in terms of discrete hazard function. Lastly, it presents a case study on estimating survivor functions for time to death from lung cancer and other causes using proportional hazards models.

Typology: Exams

2012/2013

Uploaded on 02/27/2013

lakshanya
lakshanya 🇮🇳

4.4

(25)

98 documents

1 / 18

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
MATHEMATICAL TRIPOS Part III
Thursday, 31 May, 2012 9:00 am to 12:00 pm
PAPER 37
APPLIED STATISTICS
Attempt no more than FOUR questions, with at most THREE from Section A.
There are SIX questions in total.
The questions carry equal weight.
STATIONERY REQUIREMENTS SPECIAL REQUIREMENTS
Cover sheet None
Treasury Tag
Script paper
You may not start to read the questions
printed on the subsequent pages until
instructed to do so by the Invigilator.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12

Partial preview of the text

Download Statistical Analysis: Model Interpretation & Comparison of Survival Models and more Exams Mathematics in PDF only on Docsity!

MATHEMATICAL TRIPOS Part III

Thursday, 31 May, 2012 9:00 am to 12:00 pm

PAPER 37

APPLIED STATISTICS

Attempt no more than FOUR questions, with at most THREE from Section A.

There are SIX questions in total. The questions carry equal weight.

STATIONERY REQUIREMENTS SPECIAL REQUIREMENTS

Cover sheet None Treasury Tag Script paper

You may not start to read the questions printed on the subsequent pages until instructed to do so by the Invigilator.

SECTION A

(a) Write down the form of an ordinary linear model for a n × 1 response vector Y , n × p covariate matrix X of rank p, and p × 1 parameter vector β. Derive the form of the maximum likelihood estimator for β, and any other parameters.

(b) Define the residuals, and derive their joint distribution under the model. [Hint: You may use the fact that if Z ∼ N ( 0 , Σ) for some vector Z, then AZ ∼ N ( 0 , AΣAT^ ) for any suitable matrix A.]

A nutritionist measures the weights and heights of n = 100 students, n 1 = 45 female and n 2 = 55 male. Let the weight of student j = 1,... , ni of sex i = 1, 2 be Yij , and their height be xij. The nutritionist would like to be able to predict the weight of other students, as closely as possible, from their sex and height.

(c) Explain the meaning and purpose of using the command I(height-165) in the R output below. Write out the model lm2 algebraically, and give the estimates and standard errors for each regression parameter; make sure you interpret the results qualitatively in terms of the original data.

(d) Considering all the R output, write a concise summary for the nutritionist about your findings. For any hypothesis tests you use, state the hypotheses and null distribution. Comment in particular on the difference between the slope parameters for height in models lm1 and lm2.

(e) The summaries give the maximum of the residuals for each model; by comparing this to the residual standard errors, suggest any comments you might make to the nutritionist about the data and the models.

Part III, Paper 37

Min 1Q Median 3Q Max -11.9844 -3.5162 -0.0463 3.1702 24.

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 58.4506 0.8593 68.020 < 2e- sexM 4.6710 1.7504 2.668 0. I(height - 165) 0.6696 0.2000 3.349 0. sexM:I(height - 165) -0.0977 0.2838 -0.344 0.

Residual standard error: 5.737 on 96 degrees of freedom Multiple R-squared: 0.4178,Adjusted R-squared: 0. F-statistic: 22.96 on 3 and 96 DF, p-value: 2.737e-

Part III, Paper 37

A farmer is measuring the yields of two varieties of wheat under various conditions. He has 4 fields, each of equal size. In each of the three years 2004, 2005, 2006, he planted each of the two varieties of wheat in two fields, and gave each either a high or low dose of fertiliser. The data are given in the table below.

Variety Fertiliser Year 2004 2005 2006 1 Low 5.26 6.93 5. High 8.08 9.15 8. 2 Low^ 9.91^ 11.03^ 8. High 6.39 8.84 7.

Explain what is meant by a factor, in the context of linear models. Why might we want to treat the year as a factor when analysing the wheat yields? Look at the edited R output below. Write down the model being fitted in lmSF.Y algebraically, remembering to define each symbol you use, and any constraints. In the output from the anova() command below, some of the values have been removed and replaced with xx. Rewrite the ANOVA table, filling in the missing entries from the columns Df and Mean Sq; explain how to calculate the missing entry in the column labelled F value, and give a rough approximation to its value. What is the hypothesis test being tested in the row labelled seed:fert? State the hypotheses, null distribution, and your conclusion clearly. Based on this ANOVA table only, would you suggest a simpler model to try to fit? Explain your answer. Now look at the output from the summary() command. How would you interpret the results of this model fit, and what would you tell the farmer about the yields from the two varieties of wheat under the different conditions?

seed [1] 1 1 1 1 1 1 2 2 2 2 2 2 Levels: 1 2 fert [1] Low Low Low High High High Low Low Low High High High Levels: Low High year [1] 2004 2005 2006 2004 2005 2006 2004 2005 2006 2004 2005 2006 Levels: 2004 2005 2006 yield [1] 5.26 6.93 5.99 8.08 9.15 8.36 9.91 11.03 8.19 6. [11] 8.84 7.

lmSF.Y = lm(yield ~ seed*fert + year) anova(lmSF.Y) Analysis of Variance Table

Part III, Paper 37 [TURN OVER

Adult beetles of the genus Tribolium eat the eggs of their own species as well as those of closely related species. An experiment was conducted on adult beetles of three species, denoted by A, B, and C, to determine if any of these species has evolved a preference for eggs of the other species, that is, if any of the three species can recognise and avoid eggs of its own species - clearly an evolutionary advantage! The experiment was conducted on three different days with a number of repetitions. A number of adult beetles of each species was isolated, and presented with a vial of 150 eggs, 50 of each type. Two days later, the number of eggs remaining of each type was recorded. The data are presented in the table below.

Adult species A B C Eggs Total Eggs Total Eggs Total

Day 1

Day 2

Day 3

The column Eggs gives the number of eggs of the other species that were eaten over the 2 day period, and the column Total gives the total number of eggs eaten. So for species A, Eggs gives the number of eggs of species B and C that were eaten. The data show that only 2 experiments were conducted for species B in Day 2, and none for species C in day

  1. The R output below contains an analysis of these data.

(a) Write down the algebraic form of the model fitted in eggs.glm1, defining your notation carefully and giving any constraints explicitly. (b) In the analysis of deviance table, six entries have been replaced by xx. Find these values. (c) Using deviances, test whether the probability of eating eggs of the other species depends on the day in which the experiment started, controlling for Species. Give the null hypothesis, the test statistic, its null distribution, the result, and your conclusion in words. (d) Using your preferred model, obtain an estimate of the log-odds of eating eggs of the other species for an adult beetle of Species A. Explain how to obtain an approximate 95% confidence interval for this log-odds, stating any asymptotic distribution results used. (e) The confidence interval in (d) equals (0. 7586 , 1 .065). What do you conclude about the eating preferences of adult beetles of Species A?

Part III, Paper 37 [TURN OVER

eggs <- c(61,60,44,54,68,65,77,78,63,53,36,46,73,62,73,63,80,44,45,51,

  • 52,38,30)

total <- c(86,84,68,75,90,88,101,108,99,88,64,84,115,100,

  • 117,102,125,66,61,70,79,61,53)

prop <- eggs / total Species <- as.factor(c(rep("A",times=9),rep("B",times=8),rep("C",times=6))) Day <- as.factor(c(rep(1,3), rep(2,3), rep(3,3), rep(1,3), rep(2,2),

  • rep(3,3), rep(1,3), rep(2,3)))

eggs.glm1 <- glm(prop ~ Day + Species, family=binomial, weights=total) summary(eggs.glm1) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.87083 0.10612 8.206 2.28e- Day2 0.04578 0.11641 0.393 0. Day3 0.06941 0.12270 0.566 0. SpeciesB -0.46029 0.10701 -4.301 1.70e- SpeciesC -0.20026 0.13919 -1.439 0. (Dispersion parameter for binomial family taken to be 1) Null deviance: 33.631 on 22 degrees of freedom Residual deviance: 14.646 on 18 degrees of freedom anova(eggs.glm1, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: prop Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL xx xx Day xx xx 20 33.358 0. Species 2 xx 18 xx 8.647e- eggs.glm2 <- glm(prop ~ Species, family=binomial, weights=total) summary(eggs.glm2) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.91191 0.07824 11.656 < 2e- SpeciesB -0.45905 0.10684 -4.297 1.73e- SpeciesC -0.21877 0.13289 -1.646 0. (Dispersion parameter for binomial family taken to be 1)

Null deviance: 33.631 on 22 degrees of freedom Residual deviance: 14.987 on 20 degrees of freedom

Part III, Paper 37

parameter by the following formula.

φˆ =

∑^ n

i=

(Yi − μˆi)^2 μ ˆi

(n − p),

where Yi = counti is the response.

phi <- sum((count-acf.glm2$fitted.values)^2/acf.glm2$fitted.values)/ phi [1] 1.

How would you modify the test of hypothesis above to take account of over- dispersion? Give the test statistic and its null distribution.

Part III, Paper 37

SECTION B

(a) In survival data analysis, what is

(i) right censoring?

(ii) the relationship between the censoring distribution and the survival time distribu- tion if right censoring is assumed to be non-informative?

(iii) implied, at each time point, for those subjects who are censored and those who are still under observation when right censoring is assumed non-informative in a study?

(b) Let T be a positive discrete survival time random variable which takes unique ordered failure time points t 1 , t 2 ,... , tn (i.e. 0 = t 0 < t 1 <... < tn). In a study of time to failure of m subjects, interest lies in estimating the survivor function, S(t) = P(T > t), of T. Denote by dj the number of subjects that are observed to fail at tj and nj the number of subjects who were still at risk immediately prior to tj. Define f (tj ) and h(tj ) to be the probability mass function and discrete hazard function of failing at tj respectively. The latter corresponds to the conditional probability of failing at tj , given still alive beyond the previous failure time point tj− 1. Assume that censoring in this study is non-informative. Derive an expression for the survivor function at time t in terms of only the discrete hazard function. From this, write down the “Kaplan-Meier” estimator of the survivor function.

(c) In a recently completed randomised controlled clinical trial investigating the effect of chemotherapy versus radiotherapy on time to death after surgical removal of a localised cancerous lung tumour in elderly patients, both the time to death from lung cancer and the time to death from other causes since surgery are of interest. In particular, the estimated probabilities of dying from lung cancer and dying from other causes within 5 years of surgery, and the estimated effect of treatment on survival are required. The data collected are in the form (Ti, Di), where Ti represents either the time to death, if observed, or the time until the end of study if death has been unobserved for the ith patient, and is recorded in the variable time within the data-set, cancer.dat. Di takes the value 0, 1 or 2 depending on whether death is unobserved (status= 0), death is observed to be due to lung cancer (status= 1) or death is observed to be from other causes (status= 2) for the ith patient. Dr PH Cox, the study statistician, is responsible for analysing the data. Initially, Dr Cox decides to describe the data, ignoring treatment group status, by producing two overall “Kaplan-Meier”-type survival curves. The first curve is for time to death from lung cancer, where deaths from other causes are treated as censored data. The second is for the time to death from other causes where now deaths from lung cancer are treated as censored data. The data from subjects who have not died by the end of the study are treated as censored in both situations. Afterwards Dr Cox analyses time to death from lung cancer and

Part III, Paper 37 [TURN OVER

0 2 4 6 8

Time from Surgery (in years)

Probability

Lung Cancer Deaths

Other Deaths

Part III, Paper 37 [TURN OVER

> lungcancerdeath.km <- survfit(Surv(time,(status==1))~1,data=cancer.dat)

summary(lungcancerdeath.km) Call: survfit(formula = Surv(time, (status == 1)) ~ 1, data = cancer.dat)

lungcancerdth.cox <- coxph(Surv(time,(status==1))~trt,data=cancer.dat) summary(lungcancerdth.cox) Call: coxph(formula = Surv(time, (status == 1)) ~ trt, data = cancer.dat)

n= 80, number of events= 31

coef exp(coef) se(coef) z Pr(>|z|) trt 1.3113 3.7111 0.4346 3.018 0.00255 **


Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1

exp(coef) exp(-coef) lower .95 upper. trt 3.711 0.2695 1.583 8.

otherdeath.cox <- coxph(Surv(time,(status==2))~trt,data=cancer.dat) summary(otherdeath.cox) Call: coxph(formula = Surv(time, (status == 2)) ~ trt, data = cancer.dat)

n= 80, number of events= 34

coef exp(coef) se(coef) z Pr(>|z|) trt 0.5903 1.8045 0.3640 1.621 0.

exp(coef) exp(-coef) lower .95 upper. trt 1.805 0.5542 0.8841 3.

Part III, Paper 37

A behavioural researcher approaches two statisticians (Dr Quasi and Dr Pascal) with data collected from a one academic year follow-up study of m Part IIB and Part III Mathematical Tripos students. The researcher has collected data on the number of times each student in the study has been to any Night Club in Cambridge during each of the three terms (Michaelmas, Lent and Easter), and is interested in modelling these count profiles over time and determining what variables can affect the number of visits in a term. The researcher has brought the Night Club data to the two statisticians in the form of a vector of counts Yi = (Yi 1 , Yi 2 , Yi 3 )T^ for the ith subject chronologically ordered over the three terms in the academic year. Additional information on the student is recorded in a baseline covariate vector xi, which includes details on the distance of the student’s college to the nearest Night Club, whether the student is reading for Part IIB or Part III of the Mathematical Tripos and the gender of the student. Furthermore, the term effect is represented by the factor variable, αj (j = 1, 2 , 3 with 1, 2 , 3 corresponding to Michaelmas, Lent, Easter respectively). The researcher believes that there may be some students who over their entire time at the University will never be inclined towards attending a Night Club during term time. Both Drs Quasi and Pascal recognise that there will be correlation between the components of Yi. Moreover they realise that they need to incorporate into their models the researcher’s belief that there is a proportion of students who will never be inclined towards attending a Night Club during term time. Dr Quasi decides to model the data as follows. He assumes that

log E(Yij |xi, αj , β 0 , βT^ , π) = log(μij (1 − π)) = β 0 + βT^ xi + αj + log(1 − π) V ar(Yij |xi, αj , β 0 , βT^ , π) = (1 − π)μij (1 + (π + τ )μij ) Cov(Yij , Yik|xi, αj , αk, β 0 , βT^ , π) = μij μikτ (j 6 = k),

where π ∈ [0, 1] and τ > 0 are parameters that accommodate the possibility a student will never be inclined to attend a Night Club during term time and potential unexplained heterogeneity respectively. However, Dr Pascal decides to adopt the following alternative strategy. She assumes that conditional on a random effect bi, the counts {Yij : j = 1, 2 , 3 } on the ith student are independent Poisson random variables with

E(Yij |bi; xi, αj , β 0 , βT^ ) = ηij = bi exp(β 0 + βT^ xi + αj ) V ar(Yij |bi; xi, αj , β 0 , βT^ ) = ηij Cov(Yij , Yik|bi; xi, αj , αk , β 0 , βT^ ) = 0 (j 6 = k).

She assumes that the bi’s are independently and identically distributed random variables and that bi is from a mixture distribution with probability density function

fbi (u) =

π if u = 0 (1 − π) (1/τ^ )

1 /τ Γ(1/τ ) u

(1/τ )− (^1) exp(−u/τ ) if u > 0

Part III, Paper 37 [TURN OVER