




























































































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This comprehensive exam consists of 10 questions pertaining to methodological statistical topics. 1 This Ph. D level exam will run from 8:30 AM to 3:30 PM. 2 ...
Typology: Study notes
1 / 135
This page cannot be seen from the preview
Don't miss anything!





























































































August 2013
Ph.D Methods Exam
This comprehensive exam consists of 10 questions pertaining to methodological statistical topics.
1 This Ph.D level exam will run from 8:30 AM to 3:30 PM.
2 Please label each page with your identification number.
3 Please write only on one side of each page.
4 Please leave one inch margins on all sides of each page.
5 Please number all pages consecutively.
6 Please label the day number (Day 1 or Day 2) on each page.
7 Please begin each question on a new page, and number each question.
8 Please do not staple pages together.
9 No wireless devices, formula sheets, or other outside materials are permitted.
10 Statistical tables and paper will be provided.
11 Relax and good luck!
I have read and understand the rules of this exam.
Signature: Date:
1.) Cancer rehabilitation researchers are interested in evaluating patients’ post-treatment cardiopulmonary function using a continuous measure of “VO2 peak.” They would like to compare this measure across four cancer stages (I, II, III, IV) while controlling for gender (female / male) and also patient age (measured in years). Researchers are not interested in testing hypotheses across gender and different ages, as it is accepted that cardiopulmonary function differs for males and females and at different ages.
i. Describe an appropriate model that could be used to assess differences in cardiopul- monary function across cancer stages while accounting for gender and age.
ii. State the assumptions of your model.
iii. Your model must include an assumption about the relationship between age and lung capacity. Describe how your model could be adjusted to change this assumption.
iv. Provide an interpretation of the intercept / constant term in your model. Is this meaningful for making conclusions?
v. Provide an interpretation of the term(s) associated with cancer stage in your model. Give a detailed expression that could be used to test for the significance of this term, including the steps in the testing process, any calculations or formulas involved, and the distribution of any test statistic(s) used.
2.) Higher education researchers are interested in trends of GPA for first-generation col- lege students during their first four semesters in college, and the corresponding effects of motivation and substance abuse. For their study they randomly selected 50 first-generation college students and initially classified their “motivation” level into one of three groups (low, medium, high). At the end of each of the first four semesters of school for each student, semester GPA is recorded as well as a self-reported continuous measure of “substance abuse.”
i. As a factor in a longitudinal panel study, how would you classify “motivation”?
ii. Clearly describe a model for GPA, accounting for all of the factors described.
iii. State the assumptions of your model. Specifically, what have you assumed about the effect of “time”?
iv. Describe a process that could be used to test the effect of “motivation” on GPA. Include all steps in the testing process, any calculations or formulas involved, degrees of freedom and the distribution of any test statistic(s) used.
v. Assuming researchers are interested in assessing a “time trend” across the four semesters, describe in detail the types of trends that could be considered as well as how these trends could be assessed using your model.
6.) Suppose we wish to study the effects of three factors on corn yields: amount of nitrogen added, planting depth, and planting date. The nitrogen and depth factors each have two levels, and the date factor has three levels. There are 24 plots available for this experiment: twelve are on a farm near Greeley, CO, and twelve are on a different farm near Brighton, CO.
i. Describe the experimental design you would use. Specifically describe the process for assigning treatments to EUs (plots). Briefly explain why you selected that design.
ii. Construct a partial ANOVA table that includes sources of variation, degrees of freedom, expected mean squares, and appropriate F-ratios.
7.) Given the data, use the Sign Test to test H 0 : ˜μ = 8.41 versus H 1 : ˜μ > 8 .41.
8.) Compare and contrast stratified sampling to simple random sampling. What do these designs have in common? How are they different? Give examples/applications of each design. Under what conditions is stratified sampling preferred over simple random sampling.
9.) The observations for delivery time, number of cases, and distance walked by the router drive were collected in four cities. A model was developed that relates delivery time y to cases x 1 , distance x 2 , and the city in which the delivery was made. Based on SAS output on pages 5-7, answer the following questions.
i. Is there an indication that delivery site is an important variable?
ii. What conclusions can you draw regarding model adequacy?
10.) Based on a random sample of 3 Colorado high school classrooms, researchers have recorded a measure of proficiency (proficient / not proficient) for a total of 93 students ( in the first classroom, 27 in the second, and 35 in the third). As proficiency is determined by standardized exams, researchers would like to know if high school GPA is a reasonable predictor of proficiency. They would also like to control for gender (male = 1 / female = 0) and block by class.
i. Based on researcher interest, construct an appropriate model for “proficiency.”
ii. Using the output on pages 8-10, assess the fit of this model.
iii. Using the output on pages 8-10, provide an interpretation for the coefficient for GPA, for gender, and also for the coefficient for the “class 2” indicator.
iv. Describe how your model would change if classes were treated as random blocks.
The LOGISTIC Procedure
Model Information
Data Set WORK.PROFICIENCY Response Variable Proficiency Number of Response Levels 2 Model binary logit Optimization Technique Fisher’s scoring
Number of Observations Read 93 Number of Observations Used 93
Response Profile
Ordered Total Value Proficiency Frequency
Probability modeled is Proficiency=1.
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Partition for the Hosmer and Lemeshow Test
Proficiency = 1 Proficiency = 0 Group Total Observed Expected Observed Expected
Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square DF Pr > ChiSq
August 2014
Ph.D Methods Exam
This comprehensive exam consists of 10 questions pertaining to methodological statistical topics.
1 This Ph.D level exam will run from 8:30 AM to 3:30 PM.
2 Please label each page with your identification number.
3 Please write only on one side of each page.
4 Please leave one inch margins on all sides of each page.
5 Please number all pages consecutively.
6 Please label the day number (Day 1 or Day 2) on each page.
7 Please begin each question on a new page, and number each question.
8 Please do not staple pages together.
9 No wireless devices, formula sheets, or other outside materials are permitted.
10 Statistical tables and paper will be provided.
11 Relax and good luck!
I have read and understand the rules of this exam.
Signature: Date:
3.) Consider a 2^2 factorial design with two factors A and B. The levels of the factors may be arbitrarily called “low” and “high”. Consider the following data where an yield was recorded when the above mentioned factorial experiment was run in a completely randomized design with three replicates.
Factor Replicate Total A B Treatment Combination I II III − − A low, B low 28 25 27 80
You may want to construct a standard order table (also known as Yates’ order) in order to answer the following questions.
i. Obtain the estimates of main effects of A, B, and AB interaction.
ii. Obtain the sum of squares estimates SSA, SSB , and SSAB for A, B, and AB, respec- tively.
iii. The total corrected sum of squares for this experiment, SST is 323. Calculate the sum of squares for the error, SSE by subtracting SSA, SSB , and SSAB from the SST.
iv. Construct the analysis of variance table including the calculated F -statistic. Comment on the significance of the main effects and the interaction.
4.) Consider a Two-Factor ANOVA model:
Yijk = μ + αi + βj + (αβ)ij + ijk,
where i = 1, 2, j = 1, 2 , 3, and k = 1, 2.
i. Write the model in the form Y = Xβ + , giving Y, X, and β explicitly.
ii. Provide (but do not simplify!) an expression for the Least Squares Estimator βˆ.
iii. Determine whether the linear expression βk − βl is estimable, for any combination k and l. What is the importance of identifying “estimable” functions?
iv. Describe how to find the Best Linear Unbiased Estimator for
β 1 − β 2 β 2 − β 3 β 3 − β 1
Explain in what sense the estimator is “best.”
5.) [This problem should be answered based on a 7-page SAS output on pages 7 through 13.]
The admission officer of a graduate school has used an “index” of undergraduate GPA and graduate management aptitude test (GMAT) scores to help decide which applicants should be admitted to the graduate programs. The scatter plot of GPA vs GMAT (shown in the attached SAS output) shows recent applicants who have been classified as “Admit (A)”, “Borderline (B)”, and “Reject (R)”.
A discriminant analysis and classification have been performed on the data and the results are shown in the attached SAS output. Answer the following questions. Note: when you answer, make sure to include the associated statistics. For example, if you decide to reject a null hypothesis, you should mention the value of the appropriate test statistic and the corresponding p-value.
i. Is there significant association between admission status (admitted, rejected, border- line) and the scores on GPA and GMAT?
ii. If there is significant association, we would like to perform a discriminant analysis. How many discriminant functions (DF) are possible for the given problem?
iii. Comment on the significance of the discriminant function(s).
iv. What is the overall effect size for the discriminant analysis? Comment on the effect size of each of the discriminant functions.
v. Write the classification functions corresponding to each discriminant function. Use the classification function(s) for classifying an applicant as “Admit” or “Reject” or “Borderline” who has GPA = 3.7 and GMAT score = 650
vi. In the SAS output, both resubstitution summary and crossvalidation summary for classification are provided. Comment on the error of misclassification based on these output.
vii. Is there a reason to believe that the classification function produces noticeably higher error rate than what we would have obtained by chance alone? Would you use the discriminant functions obtained from this analysis to classify an applicant to either admit, reject, or borderline? Justify.
9.) The Berkeley Guidance Study was a longitudinal monitoring of girls born in Berkeley, California between January 1928 and June 1929, and followed for at least 18 years. The variables are described as follows.
Variable Description HT18 Age 18 height (cm) HT2 Age 2 height (cm) LG9 Age 9 leg circumference (cm) ST9 Age 9 strength (kg) WT9 Age 9 weight (kg) WT18 Age 18 weight (kg)
Use the SAS output on page 14 to answer the following questions.
i. Test for significance of regression for the relationship between HT 18 and HT 2 , LG 9 , ST 9 , W T 9 and W T 18.
ii. Assess multicollinearity in the model, and describe the results.
iii. Use the model with 5 regressors to find the prediction for x 0 with the following values
HT 2 LG 9 ST 9 W T 9 W T 18
iv. Using the partial F test, determine the contribution of W T 9 and W T 18 to the model. Note F. 05 (2, 130) = 3. 065.
10.) Health researchers are interested in explaining the likelihood of myocardial infarction (MI, “heart attack”) using professional attributes. For their study they randomly selected 75 individuals between 55 and 65 years of age and recorded each individual’s annual income (in thousands of dollars), whether the individual has a college degree, and whether the individual has experienced at least one MI within the last 10 years.
i. Describe an appropriate Generalized Linear Model for this research situation. Clearly explain each term in your systematic component.
ii. Using the output on pages 15 to 16, assess the fit of this model.
iii. Using the output on pages 15 to 16, provide an interpretation of the coefficient for “college degree.” Also provide an interpretation of the coefficient for “annual income.”
iv. The significance of each independent variable can be assessed using Wald Statistics. Briefly explain the process of a Wald hypothesis test.
v. Suppose the researchers also want to model the variation in MI (using a variance multiplier). Thinking of the properties of variance, describe an appropriate Generalized Linear Model for modeling variance (assume the same independent variables as part i).
1
gmat
300
400
500
600
700
gpa
2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.
admit A A A^ Admit^ R R R^ Reject^ B B BBorderline
A
A A
A A
A
A
A
A
A A (^) A A
A
A
A
A A
A A A
A A
A
A
A
A
A A
A
A R R
R
R R
R R
R
R
R
R R
R R (^) R
R
R
R
R R R
R
R
R
R
R R R
BB
B
B
B
B
B
B B B
B
B
B
B
B
B (^) B B B
B
B B
B B B
B
The SAS System The DISCRIM Procedure Canonical Discriminant Analysis
3
Generalized Squared Distance to admit From admit Admit Borderline Reject Admit^0 10.06344^ 31. Borderline 10.06344 0 7. Reject 31.28880 7.43364 0
Multivariate Statistics and F Approximations S=2 M=-0.5 N=39. Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.12637661 73.43 4 162 <. Pillai's Trace 1.00963002^ 41.80^4 164 <. Hotelling-Lawley Trace 5.83665601^ 117.72^4 96.17^ <. Roy's Greatest Root 5.64604452^ 231.49^2 82 <. NOTE: F Statistic for Roy's Greatest Root is an upper bound. NOTE: F Statistic for Wilks' Lambda is exact.
Canonical Correlation
Adjusted Canonical Correlation
Approximate Standard Error
Squared Canonical Correlation
Eigenvalues of Inv(E)H = CanRsq/(1-CanRsq) Eigenvalue Difference Proportion Cumulative 1* 0.921702^ 0.920516^ 0.016417^ 0.849535^ 5.6460^ 5.4554^ 0.9673^ 0. 2 0.400119^.^ 0.091641^ 0.160095^ 0.1906^ 0.0327^ 1.
Test of H0: The canonical correlations in the current row and all that follow are zero Likelihood Ratio
Approximate F Value Num DF Den DF Pr > F 1 0.12637661 73.43 4 162 <. 2 0.83990454 15.63 1 82 0.
The SAS System The DISCRIM Procedure Canonical Discriminant Analysis
4
Total Canonical Structure Variable Can1 Can gpa 0.969922^ -0. gmat 0.662832 0.
Between Canonical Structure Variable Can1 Can gpa 0.994118^ -0. gmat 0.897852^ 0.
Pooled Within Canonical Structure Variable Can1 Can gpa 0.860161 -0. gmat 0.350860 0.