Applied Statistics Comprehensive Exam, Study notes of Statistics

This comprehensive exam consists of 10 questions pertaining to methodological statistical topics. 1 This Ph. D level exam will run from 8:30 AM to 3:30 PM. 2 ...

Typology: Study notes

2021/2022

Uploaded on 08/01/2022

hal_s95
hal_s95 🇵🇭

4.4

(655)

10K documents

1 / 135

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Applied Statistics Comprehensive Exam
August 2013
Ph.D Methods Exam
This comprehensive exam consists of 10 questions pertaining to methodological statistical
topics.
1 This Ph.D level exam will run from 8:30 AM to 3:30 PM.
2 Please label each page with your identification number.
DO NOT USE YOUR NAME OR BEAR NUMBER.
3 Please write only on one side of each page.
4 Please leave one inch margins on all sides of each page.
5 Please number all pages consecutively.
6 Please label the day number (Day 1 or Day 2) on each page.
7 Please begin each question on a new page, and number each question.
8 Please do not staple pages together.
9 No wireless devices, formula sheets, or other outside materials are permitted.
10 Statistical tables and paper will be provided.
11 Relax and good luck!
I have read and understand the rules of this exam.
Signature: Date:
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59
pf5a
pf5b
pf5c
pf5d
pf5e
pf5f
pf60
pf61
pf62
pf63
pf64

Partial preview of the text

Download Applied Statistics Comprehensive Exam and more Study notes Statistics in PDF only on Docsity!

Applied Statistics Comprehensive Exam

August 2013

Ph.D Methods Exam

This comprehensive exam consists of 10 questions pertaining to methodological statistical topics.

1 This Ph.D level exam will run from 8:30 AM to 3:30 PM.

2 Please label each page with your identification number.

DO NOT USE YOUR NAME OR BEAR NUMBER.

3 Please write only on one side of each page.

4 Please leave one inch margins on all sides of each page.

5 Please number all pages consecutively.

6 Please label the day number (Day 1 or Day 2) on each page.

7 Please begin each question on a new page, and number each question.

8 Please do not staple pages together.

9 No wireless devices, formula sheets, or other outside materials are permitted.

10 Statistical tables and paper will be provided.

11 Relax and good luck!

I have read and understand the rules of this exam.

Signature: Date:

1.) Cancer rehabilitation researchers are interested in evaluating patients’ post-treatment cardiopulmonary function using a continuous measure of “VO2 peak.” They would like to compare this measure across four cancer stages (I, II, III, IV) while controlling for gender (female / male) and also patient age (measured in years). Researchers are not interested in testing hypotheses across gender and different ages, as it is accepted that cardiopulmonary function differs for males and females and at different ages.

i. Describe an appropriate model that could be used to assess differences in cardiopul- monary function across cancer stages while accounting for gender and age.

ii. State the assumptions of your model.

iii. Your model must include an assumption about the relationship between age and lung capacity. Describe how your model could be adjusted to change this assumption.

iv. Provide an interpretation of the intercept / constant term in your model. Is this meaningful for making conclusions?

v. Provide an interpretation of the term(s) associated with cancer stage in your model. Give a detailed expression that could be used to test for the significance of this term, including the steps in the testing process, any calculations or formulas involved, and the distribution of any test statistic(s) used.

2.) Higher education researchers are interested in trends of GPA for first-generation col- lege students during their first four semesters in college, and the corresponding effects of motivation and substance abuse. For their study they randomly selected 50 first-generation college students and initially classified their “motivation” level into one of three groups (low, medium, high). At the end of each of the first four semesters of school for each student, semester GPA is recorded as well as a self-reported continuous measure of “substance abuse.”

i. As a factor in a longitudinal panel study, how would you classify “motivation”?

ii. Clearly describe a model for GPA, accounting for all of the factors described.

iii. State the assumptions of your model. Specifically, what have you assumed about the effect of “time”?

iv. Describe a process that could be used to test the effect of “motivation” on GPA. Include all steps in the testing process, any calculations or formulas involved, degrees of freedom and the distribution of any test statistic(s) used.

v. Assuming researchers are interested in assessing a “time trend” across the four semesters, describe in detail the types of trends that could be considered as well as how these trends could be assessed using your model.

6.) Suppose we wish to study the effects of three factors on corn yields: amount of nitrogen added, planting depth, and planting date. The nitrogen and depth factors each have two levels, and the date factor has three levels. There are 24 plots available for this experiment: twelve are on a farm near Greeley, CO, and twelve are on a different farm near Brighton, CO.

i. Describe the experimental design you would use. Specifically describe the process for assigning treatments to EUs (plots). Briefly explain why you selected that design.

ii. Construct a partial ANOVA table that includes sources of variation, degrees of freedom, expected mean squares, and appropriate F-ratios.

7.) Given the data, use the Sign Test to test H 0 : ˜μ = 8.41 versus H 1 : ˜μ > 8 .41.

  1. 30 , 9. 50 , 9. 60 , 8. 75 , 8. 40 , 9. 10 , 9. 25 , 9. 80 , 10. 05 , 8. 15 , 10. 00 , 9. 60 , 9. 80 , 9. 20 , 9. 30

8.) Compare and contrast stratified sampling to simple random sampling. What do these designs have in common? How are they different? Give examples/applications of each design. Under what conditions is stratified sampling preferred over simple random sampling.

9.) The observations for delivery time, number of cases, and distance walked by the router drive were collected in four cities. A model was developed that relates delivery time y to cases x 1 , distance x 2 , and the city in which the delivery was made. Based on SAS output on pages 5-7, answer the following questions.

i. Is there an indication that delivery site is an important variable?

ii. What conclusions can you draw regarding model adequacy?

10.) Based on a random sample of 3 Colorado high school classrooms, researchers have recorded a measure of proficiency (proficient / not proficient) for a total of 93 students ( in the first classroom, 27 in the second, and 35 in the third). As proficiency is determined by standardized exams, researchers would like to know if high school GPA is a reasonable predictor of proficiency. They would also like to control for gender (male = 1 / female = 0) and block by class.

i. Based on researcher interest, construct an appropriate model for “proficiency.”

ii. Using the output on pages 8-10, assess the fit of this model.

iii. Using the output on pages 8-10, provide an interpretation for the coefficient for GPA, for gender, and also for the coefficient for the “class 2” indicator.

iv. Describe how your model would change if classes were treated as random blocks.

The LOGISTIC Procedure

Model Information

Data Set WORK.PROFICIENCY Response Variable Proficiency Number of Response Levels 2 Model binary logit Optimization Technique Fisher’s scoring

Number of Observations Read 93 Number of Observations Used 93

Response Profile

Ordered Total Value Proficiency Frequency

Probability modeled is Proficiency=1.

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Partition for the Hosmer and Lemeshow Test

Proficiency = 1 Proficiency = 0 Group Total Observed Expected Observed Expected

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

Applied Statistics Comprehensive Exam

August 2014

Ph.D Methods Exam

This comprehensive exam consists of 10 questions pertaining to methodological statistical topics.

1 This Ph.D level exam will run from 8:30 AM to 3:30 PM.

2 Please label each page with your identification number.

DO NOT USE YOUR NAME OR BEAR NUMBER.

3 Please write only on one side of each page.

4 Please leave one inch margins on all sides of each page.

5 Please number all pages consecutively.

6 Please label the day number (Day 1 or Day 2) on each page.

7 Please begin each question on a new page, and number each question.

8 Please do not staple pages together.

9 No wireless devices, formula sheets, or other outside materials are permitted.

10 Statistical tables and paper will be provided.

11 Relax and good luck!

I have read and understand the rules of this exam.

Signature: Date:

3.) Consider a 2^2 factorial design with two factors A and B. The levels of the factors may be arbitrarily called “low” and “high”. Consider the following data where an yield was recorded when the above mentioned factorial experiment was run in a completely randomized design with three replicates.

Factor Replicate Total A B Treatment Combination I II III − − A low, B low 28 25 27 80

  • − A high, B low 36 32 32 100 − + A low, B high 18 19 23 60
    • A high, B high 31 30 29 90

You may want to construct a standard order table (also known as Yates’ order) in order to answer the following questions.

i. Obtain the estimates of main effects of A, B, and AB interaction.

ii. Obtain the sum of squares estimates SSA, SSB , and SSAB for A, B, and AB, respec- tively.

iii. The total corrected sum of squares for this experiment, SST is 323. Calculate the sum of squares for the error, SSE by subtracting SSA, SSB , and SSAB from the SST.

iv. Construct the analysis of variance table including the calculated F -statistic. Comment on the significance of the main effects and the interaction.

4.) Consider a Two-Factor ANOVA model:

Yijk = μ + αi + βj + (αβ)ij + ijk,

where i = 1, 2, j = 1, 2 , 3, and k = 1, 2.

i. Write the model in the form Y = Xβ + , giving Y, X, and β explicitly.

ii. Provide (but do not simplify!) an expression for the Least Squares Estimator βˆ.

iii. Determine whether the linear expression βk − βl is estimable, for any combination k and l. What is the importance of identifying “estimable” functions?

iv. Describe how to find the Best Linear Unbiased Estimator for  

β 1 − β 2 β 2 − β 3 β 3 − β 1

Explain in what sense the estimator is “best.”

5.) [This problem should be answered based on a 7-page SAS output on pages 7 through 13.]

The admission officer of a graduate school has used an “index” of undergraduate GPA and graduate management aptitude test (GMAT) scores to help decide which applicants should be admitted to the graduate programs. The scatter plot of GPA vs GMAT (shown in the attached SAS output) shows recent applicants who have been classified as “Admit (A)”, “Borderline (B)”, and “Reject (R)”.

A discriminant analysis and classification have been performed on the data and the results are shown in the attached SAS output. Answer the following questions. Note: when you answer, make sure to include the associated statistics. For example, if you decide to reject a null hypothesis, you should mention the value of the appropriate test statistic and the corresponding p-value.

i. Is there significant association between admission status (admitted, rejected, border- line) and the scores on GPA and GMAT?

ii. If there is significant association, we would like to perform a discriminant analysis. How many discriminant functions (DF) are possible for the given problem?

iii. Comment on the significance of the discriminant function(s).

iv. What is the overall effect size for the discriminant analysis? Comment on the effect size of each of the discriminant functions.

v. Write the classification functions corresponding to each discriminant function. Use the classification function(s) for classifying an applicant as “Admit” or “Reject” or “Borderline” who has GPA = 3.7 and GMAT score = 650

vi. In the SAS output, both resubstitution summary and crossvalidation summary for classification are provided. Comment on the error of misclassification based on these output.

vii. Is there a reason to believe that the classification function produces noticeably higher error rate than what we would have obtained by chance alone? Would you use the discriminant functions obtained from this analysis to classify an applicant to either admit, reject, or borderline? Justify.

9.) The Berkeley Guidance Study was a longitudinal monitoring of girls born in Berkeley, California between January 1928 and June 1929, and followed for at least 18 years. The variables are described as follows.

Variable Description HT18 Age 18 height (cm) HT2 Age 2 height (cm) LG9 Age 9 leg circumference (cm) ST9 Age 9 strength (kg) WT9 Age 9 weight (kg) WT18 Age 18 weight (kg)

Use the SAS output on page 14 to answer the following questions.

i. Test for significance of regression for the relationship between HT 18 and HT 2 , LG 9 , ST 9 , W T 9 and W T 18.

ii. Assess multicollinearity in the model, and describe the results.

iii. Use the model with 5 regressors to find the prediction for x 0 with the following values

HT 2 LG 9 ST 9 W T 9 W T 18

  1. 4 26. 61 62 30. 1 76. 3

iv. Using the partial F test, determine the contribution of W T 9 and W T 18 to the model. Note F. 05 (2, 130) = 3. 065.

10.) Health researchers are interested in explaining the likelihood of myocardial infarction (MI, “heart attack”) using professional attributes. For their study they randomly selected 75 individuals between 55 and 65 years of age and recorded each individual’s annual income (in thousands of dollars), whether the individual has a college degree, and whether the individual has experienced at least one MI within the last 10 years.

i. Describe an appropriate Generalized Linear Model for this research situation. Clearly explain each term in your systematic component.

ii. Using the output on pages 15 to 16, assess the fit of this model.

iii. Using the output on pages 15 to 16, provide an interpretation of the coefficient for “college degree.” Also provide an interpretation of the coefficient for “annual income.”

iv. The significance of each independent variable can be assessed using Wald Statistics. Briefly explain the process of a Wald hypothesis test.

v. Suppose the researchers also want to model the variation in MI (using a variance multiplier). Thinking of the properties of variance, describe an appropriate Generalized Linear Model for modeling variance (assume the same independent variables as part i).

SAS Output for Question 5

1

gmat

300

400

500

600

700

gpa

2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.

admit A A A^ Admit^ R R R^ Reject^ B B BBorderline

A

A A

A A

A

A

A

A

A A (^) A A

A

A

A

A A

A A A

A A

A

A

A

A

A A

A

A R R

R

R R

R R

R

R

R

R R

R R (^) R

R

R

R

R R R

R

R

R

R

R R R

BB

B

B

B

B

B

B B B

B

B

B

B

B

B (^) B B B

B

B B

B B B

B

SAS Output for Question 5

The SAS System The DISCRIM Procedure Canonical Discriminant Analysis

3

Generalized Squared Distance to admit From admit Admit Borderline Reject Admit^0 10.06344^ 31. Borderline 10.06344 0 7. Reject 31.28880 7.43364 0

Multivariate Statistics and F Approximations S=2 M=-0.5 N=39. Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.12637661 73.43 4 162 <. Pillai's Trace 1.00963002^ 41.80^4 164 <. Hotelling-Lawley Trace 5.83665601^ 117.72^4 96.17^ <. Roy's Greatest Root 5.64604452^ 231.49^2 82 <. NOTE: F Statistic for Roy's Greatest Root is an upper bound. NOTE: F Statistic for Wilks' Lambda is exact.

Canonical Correlation

Adjusted Canonical Correlation

Approximate Standard Error

Squared Canonical Correlation

Eigenvalues of Inv(E)H = CanRsq/(1-CanRsq) Eigenvalue Difference Proportion Cumulative 1* 0.921702^ 0.920516^ 0.016417^ 0.849535^ 5.6460^ 5.4554^ 0.9673^ 0. 2 0.400119^.^ 0.091641^ 0.160095^ 0.1906^ 0.0327^ 1.

Test of H0: The canonical correlations in the current row and all that follow are zero Likelihood Ratio

Approximate F Value Num DF Den DF Pr > F 1 0.12637661 73.43 4 162 <. 2 0.83990454 15.63 1 82 0.

SAS Output for Question 5

The SAS System The DISCRIM Procedure Canonical Discriminant Analysis

4

Total Canonical Structure Variable Can1 Can gpa 0.969922^ -0. gmat 0.662832 0.

Between Canonical Structure Variable Can1 Can gpa 0.994118^ -0. gmat 0.897852^ 0.

Pooled Within Canonical Structure Variable Can1 Can gpa 0.860161 -0. gmat 0.350860 0.