Applied Statistics - Mathematical Tripos - Final Exam, Exams of Mathematics

This is the final exam paper in Applied Statistics from the Mathematical Tripos. Key points are: applied statistics, normal random variables with mean zero and common variance, random vectors, the least squares estimator, the algebraic form of a fitted model, dimensional vectors, unknown parameter values and covariate values.

Typology: Exams

2012/2013

Uploaded on 02/26/2013

devaku
MATHEMATICAL TRIPOS Part III
Tuesday, 1 June, 2010 9:00 am to 12:00 pm
PAPER 37
APPLIED STATISTICS
Attempt no more than FOUR questions.
There are FIVE questions in total.
The questions carry equal weight.
STATIONERY REQUIREMENTS    SPECIAL REQUIREMENTS
Cover sheet                None
Treasury Tag
Script paper
You may not start to read the questions
printed on the subsequent pages until
instructed to do so by the Invigilator.

Suppose that $Y = (Y_1, \ldots, Y_n)^T$ satisfies $Y = X\beta + \varepsilon$, where $X$ is a known $n \times p$ matrix with rank $p$ $(< n)$, $\beta = (\beta_1, \ldots, \beta_p)^T$ is unknown, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ where $\varepsilon_1, \ldots, \varepsilon_n$ are independent normal random variables with mean zero and variance $\sigma^2$, and where $v^T$ denotes the transpose of $v$. Derive the least squares estimator $\hat{\beta}$ of $\beta$. Explain what is meant by the vector $\hat{Y}$ of fitted values and by the vector $\hat{\varepsilon}$ of residuals. Find the distribution of $\hat{\varepsilon}$. Show that $\hat{Y}$ is in the space spanned by the columns of $X$. Show that $X^T \hat{\varepsilon} = 0$ and interpret this result.

[You may assume without proof that, for an $m \times 1$ random vector $W$ and a $k \times m$ (constant) matrix $A$, $\mathrm{cov}(AW) = A\,\mathrm{cov}(W)A^T$.]
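The orthogonality property in this question can be checked numerically. The following R sketch is only an illustration on simulated data: the sample size, design matrix, coefficients and noise level are all hypothetical, and the estimator is computed from the standard normal equations rather than taken from the paper.

```r
# Hypothetical simulated data (not from the exam paper)
set.seed(1)
n <- 20
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # n x p design, full rank
beta <- c(2, -1, 0.5)                                # assumed "true" parameters
Y <- drop(X %*% beta + rnorm(n, sd = 0.3))

# Least squares estimator via the normal equations: (X^T X)^{-1} X^T Y
beta_hat <- solve(crossprod(X), crossprod(X, Y))

# Fitted values are a linear combination of the columns of X
Y_hat <- drop(X %*% beta_hat)

# Residuals
eps_hat <- Y - Y_hat

# X^T eps_hat should be zero up to rounding error
orth <- max(abs(crossprod(X, eps_hat)))
print(orth)

# Agreement with lm() fitted without an extra intercept column
same_as_lm <- isTRUE(all.equal(drop(beta_hat), unname(coef(lm(Y ~ X - 1)))))
print(same_as_lm)
```

Solving the normal equations with `solve(crossprod(X), ...)` avoids forming an explicit inverse; `lm()` itself uses a QR decomposition, so the two results agree only up to numerical tolerance.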

Gas chromatography is a technique used to detect small amounts of a substance using a gas chromatograph. The edited R output below refers to a study in which five gas chromatograph readings were taken for each of four specimens containing different (known) amounts of the substance. The aim of the study is to calibrate the chromatograph by relating the actual amount of the substance to the chromatograph reading. In the R output reading contains the chromatograph readings and amount contains the amount of the substance in nanograms. The plots are also included below the output.

Write down the algebraic form of the model fitted in gas1.lm, together with any assumptions, and discuss whether or not this model seems to be satisfactory. Explain briefly what is shown in the boxcox plot and explain what you conclude from it. Write down the model fitted in gas2.lm. What features of the plot for this model might lead you to fit model gas3.lm? Using the gas3.lm model, explain how to obtain an estimate of the expected chromatograph reading when the amount of substance is 3.0 nanograms.

gasdata
   amount reading
1    0.25      6.
2    0.25      7.
3    0.25      6.
4    0.25      6.
5    0.25      7.
6    1.00     29.
7    1.00     30.
8    1.00     30.
9    1.00     29.
10   1.00     29.
11   5.00    211.
12   5.00    204.
13   5.00    212.
14   5.00    213.
15   5.00    205.
16  20.00    929.
17  20.00    905.
18  20.00    922.
19  20.00    928.
20  20.00    919.
gas1.lm <- lm(reading~amount)
plot(gas1.lm$fitted.values,gas1.lm$residuals)
library(MASS)

Part III, Paper 37

The table below shows car insurance premiums for various categories of policyholders with 0, 3, 6 or 9 points on their driving licenses. For each category of policyholder the top row gives the premiums for third party fire and theft only policies and the bottom row gives the premiums for comprehensive policies.

Number of points        0    3    6    9
21 year old male      306  384  384  409
                      500  555  555  605
21 year old female    266  304  279  287
                      435  430  464  478
30 year old female    177  177  177  213
                      320  325  325  268
40 year old male      154  162  162  189
                      230  230  230  295

In the (edited) R output below, Gender, Age, Policy and Points are factors, and corner point constraints are used.

(a) Comment on any obvious deficiencies of the data.

(b) Write down the algebraic form of the model fitted in insurance1.lm, defining your notation carefully and writing down the assumptions and constraints explicitly. You are given that the residual sum of squares for this model is 19512.

(c) You are given that the model insurance2.lm has residual sum of squares equal to 22323. What hypothesis is being tested by the test statistic whose value is f, and why does the test statistic take this form? What is the result of this hypothesis test? Write down your conclusion in words.

(d) Write down the algebraic form of the model fitted in insurance3.lm, again explicitly writing down the assumptions and constraints. Test whether this model is an improvement over insurance2.lm, and summarise in words how premiums depend on age, gender, policy type and the number of points. What is the estimated comprehensive policy premium for a 40 year old female policyholder with 6 points on her license?
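As a rough illustration of the comparison in part (c), the following R sketch recomputes the F statistic from the two residual sums of squares that appear in the expression for f in the R output (22323 and 19512) and compares it with the 5% critical value. The variable names are illustrative and are not part of the original output.

```r
# Residual sums of squares taken from the expression for f in the R output
rss_reduced <- 22323   # insurance2.lm (Points collapsed to two levels)
rss_full    <- 19512   # insurance1.lm (Points with four levels)

df_diff  <- 2          # parameters dropped when moving to the reduced model
df_resid <- 24         # residual degrees of freedom of the fuller model

# F statistic for comparing nested normal linear models
f <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_resid)

# 5% critical value and p-value from the F distribution on (2, 24) df
crit    <- qf(0.95, df_diff, df_resid)
p_value <- pf(f, df_diff, df_resid, lower.tail = FALSE)

# TRUE here means the reduced model is not rejected at the 5% level
keep_reduced <- f < crit
print(c(f = f, crit = crit, p_value = p_value))
print(keep_reduced)
```

The statistic takes this form because, under the smaller model, the scaled drop in residual sum of squares and the scaled residual sum of squares of the larger model are independent chi-squared variables.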

x
 [1] 306 384 384 409 500 555 555 605 266 304 279 287 435 430 464 478 177 177 177
[20] 213 320 325 325 368 154 162 162 189 230 230 230 295
Gender
 [1] M M M M M M M M F F F F F F F F F F F F F F F F M M M M M M M M
Levels: F M
Age
 [1] 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 30 30 30 30 30 30 30 30 40
[26] 40 40 40 40 40 40 40
Levels: 21 30 40
Points
 [1] 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9 0 3 6 9
Levels: 0 3 6 9
Policy
 [1] 3rd  3rd  3rd  3rd  comp comp comp comp 3rd  3rd  3rd  3rd  comp comp comp
[16] comp 3rd  3rd  3rd  3rd  comp comp comp comp 3rd  3rd  3rd  3rd  comp comp
[31] comp comp
Levels: 3rd comp
insurance1.lm <- lm(x~Age+Gender+Policy+Points)

Points2 <- factor(rep(c(1,1,1,2),times=8))
Points2
 [1] 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2
Levels: 1 2
insurance2.lm <- lm(x~Age+Gender+Policy+Points2)
f <- ((22323-19512)/2)/(19512/24)
f
[1] 1.
qf(0.95,2,24)
[1] 3.

insurance3.lm <- lm(x~Age*Policy + Gender + Points2)
anova(insurance3.lm)
           Df Sum Sq Mean Sq F value   Pr(>F)
Age         2 275639  137820 329.850 < 2.2e-
Policy      1 167476  167476 400.827 < 2.2e-
Gender      1  35627   35627  85.267 2.276e-
Points2     1  10438   10438  24.981 4.177e-
Age:Policy  2  12295    6147  14.713 6.754e-
Residuals  24  10028     418
summary(insurance3.lm)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       269.760      9.094  29.665 < 2e-
Age30             -94.187     13.520  -6.966 3.33e-
Age40            -207.812     13.520 -15.370 6.38e-
Policycomp        175.375     10.220  17.159 5.61e-
GenderM            94.375     10.220   9.234 2.28e-
Points22           41.708      8.345   4.998 4.18e-
Age30:Policycomp  -26.875     17.702  -1.518 0.
Age40:Policycomp  -95.875     17.702  -5.416 1.46e-

Residual standard error: 20.44 on 24 degrees of freedom
Multiple R-Squared: 0.
F-statistic: 171.5 on 7 and 24 DF, p-value: < 2.2e-
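One way to read the insurance3.lm output is sketched below in R. It assumes corner point constraints with baseline levels Age 21, Gender F, Policy 3rd and Points2 level 1 (so that 6 points falls in the baseline Points2 level, as in the definition of Points2), and it reuses the residual sum of squares 22323 for insurance2.lm from the expression for f. The object names are illustrative only.

```r
# Comparing insurance3.lm with insurance2.lm: the two models differ only
# by the Age:Policy interaction (2 extra parameters)
rss_2 <- 22323    # insurance2.lm, from the expression for f in the output
rss_3 <- 10028    # insurance3.lm, Residuals row of the anova table
f_interaction <- ((rss_2 - rss_3) / 2) / (rss_3 / 24)
crit <- qf(0.95, 2, 24)
print(c(f_interaction = f_interaction, crit = crit))

# Fitted comprehensive premium for a 40 year old female with 6 points:
# baseline Gender (F) and baseline Points2 (level 1) contribute nothing,
# so only these coefficients from summary(insurance3.lm) are needed
coefs <- c(intercept         = 269.760,
           age40             = -207.812,
           policy_comp       = 175.375,
           age40_policy_comp = -95.875)
premium <- sum(coefs)
print(premium)
```

The interaction F statistic computed this way should agree with the Age:Policy row of the anova table, because that term is entered last in the sequential decomposition.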


summary(blow3.glm)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -9.5621     0.7499  -12.75 <2e-
lT            2.2164     0.2079   10.66 <2e-
S             4.5086     0.5159    8.74 <2e-


The (edited) R output below refers to a study into the effectiveness of some particular traffic control measures in reducing accident rates. In each of eight locations, there are data on the number of accidents over a number of years before and after the installation of the traffic control measures. In the R output below, loc contains the location identifiers (numbers between 1 and 8), befaft contains indicators of whether the observation was taken before or after installation (1 denotes before, 2 denotes afterwards), years contains the length of the observation period (in years), and nacc contains the number of accidents that occurred during that observation period. Corner point constraints are used.

(a) Explain what is calculated in line (*).

(b) Write down the algebraic form of the model fitted in traffic1.glm, defining your notation carefully and stating any assumptions. Using the output to summary(traffic1.glm), show how to obtain an estimate of the ratio r of the accident rate after installation to the accident rate before installation. Explain how to obtain an approximate 95% confidence interval for r.

(c) Write down the algebraic form of the model in traffic2.glm. Why do you think this model is fitted? Comment on the fit of the model.

(d) Write a short paragraph giving relevant formal statistical analysis and your conclusions about the effect of the traffic measures on accident rates.
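The quantities in parts (a) and (b) can be reproduced from the data vectors listed in the R output. The R sketch below is a hedged illustration: it re-enters befaft, years and nacc as printed, recomputes the crude rate ratio, and forms a Wald-type interval on the log scale using the Befaft2 estimate and standard error from summary(traffic1.glm). The normal approximation for the interval is an assumption, and the object names are illustrative.

```r
# Data as listed in the R output
befaft <- rep(c(1, 2), times = 8)
years  <- c(9, 2, 9, 2, 8, 3, 8, 2, 9, 2, 8, 2, 9, 2, 8, 3)
nacc   <- c(13, 0, 6, 2, 30, 4, 20, 0, 10, 0, 15, 6, 7, 1, 13, 2)

# Pooled accident rates per year before and after installation
r1 <- sum(nacc[befaft == 1]) / sum(years[befaft == 1])
r2 <- sum(nacc[befaft == 2]) / sum(years[befaft == 2])
ratio <- r2 / r1
print(ratio)

# Estimate and standard error for Befaft2 from summary(traffic1.glm)
est <- -0.69901
se  <- 0.27466

# Approximate 95% interval: build it for log(r), then exponentiate
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)
print(ci)

# Refitting the model shows that exp(coefficient) matches the crude ratio
fit <- glm(nacc ~ offset(log(years)) + factor(befaft), family = poisson)
ratio_glm <- unname(exp(coef(fit)[2]))
print(ratio_glm)
```

With a single two-level factor and an offset, the Poisson maximum likelihood fit reproduces the pooled rates exactly, which is why the crude ratio and the exponentiated coefficient coincide.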

loc
 [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8
befaft
 [1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
years
 [1] 9 2 9 2 8 3 8 2 9 2 8 2 9 2 8 3
nacc
 [1] 13  0  6  2 30  4 20  0 10  0 15  6  7  1 13  2
Befaft <- factor(befaft)
Loc <- factor(loc)
r1 <- sum(nacc[befaft==1])/sum(years[befaft==1])
r2 <- sum(nacc[befaft==2])/sum(years[befaft==2])
r2/r1 # line (*)
[1] 0.
traffic1.glm <- glm(nacc~offset(log(years))+Befaft,poisson)
summary(traffic1.glm)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.51669    0.09366   5.517 3.45e-
Befaft2     -0.69901    0.27466  -2.545 0.
Null deviance: 58.589 on 15 degrees of freedom
Residual deviance: 50.863 on 14 degrees of freedom
exp(-0.69901)
[1] 0.
traffic2.glm <- glm(nacc~offset(log(years))+Loc+Befaft,poisson)
anova(traffic2.glm,test="Chisq")
     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                    15    58.
Loc   7   32.564         8     26.025 3.191e-


A researcher has collected hospital data for swine influenza-related admissions during the middle period of the 2009 UK epidemic. Specifically, she has recorded the dates of admission, swine influenza-related death and discharge, and the time still in hospital since admission if a patient has yet to be discharged or to die from swine influenza-related causes at the time of data collection. She approaches you with the data and is particularly interested in the case fatality ratio $\theta$ associated with hospitalisation (i.e. the proportion of swine influenza-related hospital cases who eventually die from the disease) and the conditional distribution corresponding to the time of death given that a case will eventually die ($I = 1$) from swine influenza-related causes (with distribution function $F(t \mid I = 1)$ and density $f(t \mid I = 1)$). The conditional distribution corresponding to the time to recovery (i.e. discharge) given that a case will eventually recover ($I = 2$) from the illness (with distribution function $F(t \mid I = 2)$ and density $f(t \mid I = 2)$) may also be of interest. You recognise that this is a survival analysis problem and offer to help her analyse the data.

By appropriately defining all notation used:

(a) Identify which type(s) of patients correspond to right-censored observations.

(b) Write down the likelihood contributions for a case (i.e. a swine influenza-related admitted patient) who

(i) dies in hospital at time t after admission;
(ii) recovers and is discharged at time t after admission;
(iii) remains in hospital at time t after admission.

(c) Derive an E-M algorithm, giving full details for the E-step, that can be used to estimate the parameters of interest to the researcher given that the conditional densities, $f(t \mid I = 1)$ and $f(t \mid I = 2)$, associated with time to swine influenza-related death and time to recovery given eventual death from swine influenza-related causes and eventual recovery respectively, are log-normal densities with parameters $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$.

[Hint: if $X$ has a log-normal distribution with parameter $(\mu, \sigma)$, then $Y = \log(X)$ has a normal distribution with mean $\mu$ and variance $\sigma^2$. Also, if $Y$ has a $N(\mu, \sigma^2)$ distribution, then, writing $z = (y - \mu)/\sigma$, we have

$$E(Y \mid Y > y) = \mu + \sigma\psi(z),$$

$$E\left[\left(\frac{Y - a}{b}\right)^2 \,\middle|\, Y > y\right] = \frac{1}{b^2}\left\{\sigma^2[1 - \omega(z)] + [(\mu - a) + \sigma\psi(z)]^2\right\}$$

for constants $a$ and $b$ $(\neq 0)$, and

$$\mathrm{var}(Y \mid Y > y) = \sigma^2[1 - \omega(z)],$$

where

$$\psi(z) = \frac{\phi(z)}{1 - \Phi(z)} \quad \text{and} \quad \omega(z) = \psi(z)[\psi(z) - z],$$

and where $\phi(\cdot)$ and $\Phi(\cdot)$ are the density and distribution function respectively for a standard normal distribution.]
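The truncated-normal moments in the hint, read as conditioning on Y exceeding the threshold y, can be sanity-checked by numerical integration. The R sketch below uses arbitrary, hypothetical values of mu, sigma and y; it only verifies the conditional mean and variance formulas and is not an E-M implementation.

```r
# Hypothetical parameter values chosen only for the check
mu    <- 1
sigma <- 2
y     <- 0.5

z     <- (y - mu) / sigma
psi   <- dnorm(z) / (1 - pnorm(z))   # psi(z) = phi(z) / (1 - Phi(z))
omega <- psi * (psi - z)             # omega(z) = psi(z) [psi(z) - z]

# P(Y > y), the normalising constant of the truncated density
tail_prob <- 1 - pnorm(z)

# Conditional first and second moments by numerical integration
m1 <- integrate(function(t) t * dnorm(t, mu, sigma), y, Inf,
                rel.tol = 1e-10)$value / tail_prob
m2 <- integrate(function(t) t^2 * dnorm(t, mu, sigma), y, Inf,
                rel.tol = 1e-10)$value / tail_prob

mean_formula <- mu + sigma * psi
var_formula  <- sigma^2 * (1 - omega)

print(c(numeric = m1, formula = mean_formula))
print(c(numeric = m2 - m1^2, formula = var_formula))
```

In an E-step for right-censored log-normal times, expectations of this kind would be evaluated at the log of the censoring time, with the current parameter estimates in place of mu and sigma.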


END OF PAPER
