



An analysis of two datasets: one on breast cancer mortality and adult white female population in North Carolina, South Carolina, and Georgia, and the other on population density and vehicle thefts in 18 Chicago districts. The analysis includes visualizing the data, fitting linear and quadratic regression models, identifying outliers, and testing hypotheses.
Typology: Exams




Problem: The cancer data set gives values for breast cancer mortality from 1950 to 1960 (y) and the adult white female population in 1960 (x) for 301 counties in North Carolina, South Carolina, and Georgia. Explore whether you can model y as a function of x. Defend your model on statistical grounds. Provide any other information that you can extract from this data set so that health professionals can use it. Do not forget to do residual analysis for your models.

First, plot mortality against population (the graph below). The plot shows a roughly linear relationship: mortality grows with population, i.e., mortality is roughly proportional to population. Intuitively, the larger the population, the more people suffer from breast cancer and the larger the number of deaths, so this assumption is plausible. If a community has no people at all, there will be no deaths, so we can additionally assume that the line passes through the origin, i.e. there is no intercept.

So we use the model Yi = βXi (i = 1, 2, ..., 301). Under this model, the 'best' line is Mortality = 0.00356 * Population. The graph below shows that this line fits the data fairly well. The 'goodness-of-fit ratio' R-square is 95.98%, which means about 95.98% of the variation in mortality can be explained by the variation in population, so this model fits the data very well.

To be safe, we can also fit a model with an intercept, i.e. Yi = α + βXi (i = 1, 2, ..., 301). Under this model, the estimated line is Mortality = -0.52612 + 0.00358 * Population, and R-square is 93.52%. Intuitively, the intercept (-0.52612, i.e. less than one person) is so small that it can be ignored. We can justify this statistically by testing the hypothesis 'intercept is 0' vs. 'intercept is not 0'. The result (in the table below) gives no evidence that the intercept differs from zero.
We can also test the slope, β = 0 vs. β ≠ 0. From the result, we conclude that β ≠ 0.
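The two fits and the intercept test described above can be sketched in plain Python. This is only an illustration under stated assumptions: the county data are not reproduced here, so `pop` and `mort` are synthetic stand-ins, and the p-value uses a normal approximation to the t distribution (reasonable with about 300 observations) rather than whatever the original software computed. Note that many regression packages report an uncentered R-square for a no-intercept model, which would explain why the through-origin R-square can exceed the R-square of the model with an intercept; we do not know which definition the original analysis used.

```python
import math
import random
from statistics import NormalDist


def fit_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = b * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_with_intercept(x, y):
    """Return (intercept, slope, se_intercept) for y = a + b * x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_intercept = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))
    return intercept, slope, se_intercept


def two_sided_p(t):
    """Two-sided p-value from a normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Synthetic stand-in for the 301 counties (NOT the real data).
random.seed(0)
pop = [random.uniform(500, 80000) for _ in range(301)]
mort = [0.0036 * p + random.gauss(0, 0.03 * math.sqrt(p)) for p in pop]

b0 = fit_through_origin(pop, mort)
a1, b1, se_a1 = fit_with_intercept(pop, mort)

sse0 = sum((m - b0 * p) ** 2 for p, m in zip(pop, mort))
sse1 = sum((m - a1 - b1 * p) ** 2 for p, m in zip(pop, mort))
ybar = sum(mort) / len(mort)
sst_centered = sum((m - ybar) ** 2 for m in mort)
sst_uncentered = sum(m * m for m in mort)

# The same through-origin fit gives two different R-square values.
r2_origin_uncentered = 1 - sse0 / sst_uncentered
r2_origin_centered = 1 - sse0 / sst_centered
r2_intercept = 1 - sse1 / sst_centered

t_intercept = a1 / se_a1
print(f"through origin: slope={b0:.5f}, uncentered R^2={r2_origin_uncentered:.4f}")
print(f"with intercept: a={a1:.4f}, b={b1:.5f}, R^2={r2_intercept:.4f}")
print(f"intercept test: t={t_intercept:.2f}, approx p={two_sided_p(t_intercept):.4f}")

# Sanity check against the reported intercept statistic t = -0.54:
# the normal approximation gives a p-value of about 0.59.
print(f"approx p for t=-0.54: {two_sided_p(-0.54):.4f}")
```

The last line is a quick consistency check: a test statistic of -0.54 corresponds to a two-sided p-value close to the reported 0.5876.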
Parameter    Estimate   Test statistic   P-value   Significant?
Intercept    -0.5261    -0.54            0.5876    No
Population   0.00358    65.69            <0.0001   Yes

To check this model, plot the residuals (the differences between the predicted and the observed mortality, i.e. the prediction 'error') against population (graph below). The graph shows a trend: the residuals increase with population. To address this, square root and log transformations were tried (on x and y individually, and on both). The graphs are shown as follows:

1. Take the square root of population (squpop = square root of population):
2. Take the square root of mortality (squmort = square root of mortality):
3. Take the square root of both population and mortality ('squmort' stands for square root of mortality, 'squpop' stands for square root of population):
small, almost all the residuals are less than zero, while when the population is large, almost all the residuals are greater than zero. These patterns suggest that these data transformations are not appropriate. In data transformation 3 (square root of both) we observe no obvious trend in the residual plot, so perhaps this is the model we are seeking. The regression results confirm this. Some of the results are listed below:

Parameter   Estimate   Root of MSE   P-value   Is intercept significant?   R-square
Intercept   0.05892    0.82098       <0.0001   Yes                         0.

With this transformation we removed the trend in the residuals and increased R-square. We also found that most of the residuals lie within a horizontal band of ±2σ around zero, which suggests that the variability of the Yi's is about constant. So we use this model, i.e., Square root of mortality = 0.05892 * square root of population. In other words, the square root of mortality is proportional to the square root of population. For example, when the population is 10000, the predicted square root of mortality is about 5.9, so the predicted number of deaths is about 35.
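A quick check of the back-transformation, using only the two slopes quoted in the text (0.05892 for the square-root model and 0.00356 for the through-origin linear model). Because the square-root model predicts the square root of mortality, its prediction has to be squared to return to the original scale. The function names are illustrative only.

```python
import math

SQRT_MODEL_SLOPE = 0.05892   # sqrt(mortality) = 0.05892 * sqrt(population)
LINEAR_MODEL_SLOPE = 0.00356  # mortality = 0.00356 * population


def predict_sqrt_model(population):
    """Predict mortality by squaring the predicted square root."""
    return (SQRT_MODEL_SLOPE * math.sqrt(population)) ** 2


def predict_linear_model(population):
    """Predict mortality from the through-origin linear model."""
    return LINEAR_MODEL_SLOPE * population


population = 10000
print(f"sqrt model:   {predict_sqrt_model(population):.1f}")
print(f"linear model: {predict_linear_model(population):.1f}")
```

Squaring the square-root model gives mortality = 0.05892^2 * population, i.e. a proportional model with slope of about 0.00347, so the two models give similar point predictions; the transformation mainly changes how the residuals behave.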
Problem: Data on population density (pd) and vehicle thefts (vtt) in 18 Chicago districts are given. Run a regression with vtt as the dependent variable and pd as the independent variable. Plot the residuals against pd. Is there any outlier? If so, explain. If appropriate, delete outliers and re-estimate the model. Test the hypothesis that the slope is 0 against the alternative that it is different from 0, using 5% as the level of significance.

First, the scatter plot of the raw data is given as follows:

Under the linear model Yi = α + βXi, i = 1, 2, ..., 18, the residual plot (against pd) is given below:
From these graphs, we notice an outlier: downtown Chicago, which has a much higher vehicle theft rate than the trend displayed by the other 17 districts. This is probably because the crime rate is generally higher in the downtown area. Some of the results are given as follows:

Parameter   Estimate   P-value   Significant?   Root MSE   R-square
Slope       -0.0033    0.0005    Yes
Intercept   88.95      <0.0001   Yes
From the above, we can see that under this model R-square is only 54.67% and the residuals increase with pd, which implies that this model is not good enough. We can try to fit the data with a quadratic model, Yi = α + βXi + γXi^2. Some of the results are given below:

Parameter      Estimate   P-value   Significant?   Root MSE   R-square
Square of pd   2.42E-7    <0.0001   Yes
pd             -0.01215   <0.0001   Yes
Intercept      159.87     <0.0001   Yes
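A minimal sketch of how a quadratic least-squares fit can be computed by solving the normal equations. The district data are not reproduced here, so the example uses noise-free synthetic values generated from the reported coefficients, and the fit simply recovers them. Density is expressed in thousands (an assumption made for numerical stability), which rescales the reported coefficients to -12.15 and 0.242. All function and variable names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(x, y):
    """Least-squares (a, b, c) for y = a + b*x + c*x^2 via normal equations."""
    sums = [sum(xi ** k for xi in x) for k in range(5)]
    normal = [[sums[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    return solve_linear_system(normal, rhs)


# Synthetic, noise-free stand-in for the 18 districts (NOT the real data).
pd_thousands = [2.0 * k for k in range(1, 19)]  # 2, 4, ..., 36
vtt = [159.87 - 12.15 * x + 0.242 * x * x for x in pd_thousands]

a, b, c = fit_quadratic(pd_thousands, vtt)
print(f"intercept={a:.2f}, linear={b:.3f}, quadratic={c:.4f}")
```

With real, noisy data the same functions apply; only the recovered coefficients and the residuals would differ.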
This gives a higher R-square (0.92), but the residual plot still shows the same trend. We first consider the linear model Yi = α + βXi, i = 1, 2, ..., 18. Since the single data point for downtown Chicago changed the direction of the fitted line dramatically, we can delete the outlier and fit again. The scatter plot, residual plot, and some of the results are given below:
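The delete-and-refit step, followed by the slope test at the 5% level, can be sketched as follows. Everything here is an assumption for illustration: the data are made up (17 well-behaved districts plus one planted outlier), the outlier rule (largest absolute residual exceeding twice the root MSE) is a rough rule of thumb and not necessarily the one used in the original analysis, and the critical value 2.131 is the standard two-sided 5% value of the t distribution with 15 degrees of freedom (17 points minus 2 parameters).

```python
import math

T_CRIT_DF15 = 2.131  # two-sided 5% critical value, t distribution, 15 df


def ols(x, y):
    """Simple linear regression y = a + b*x with basic diagnostics."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    root_mse = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": root_mse / math.sqrt(sxx),
        "root_mse": root_mse,
        "residuals": residuals,
    }


# Made-up data (NOT the Chicago data): 17 regular districts + 1 outlier.
noise = [3, -2, 1, -4, 2, 0, -1, 4, -3, 2, -2, 1, 0, -1, 3, -2, 1]
density = [2000 + 2000 * i for i in range(17)]
thefts = [40 + 0.0005 * d + e for d, e in zip(density, noise)]
density.append(3000)   # planted outlier: low density ...
thefts.append(300.0)   # ... but very many thefts

full = ols(density, thefts)

# Flag the observation with the largest absolute residual.
worst = max(range(len(density)), key=lambda i: abs(full["residuals"][i]))
is_outlier = abs(full["residuals"][worst]) > 2 * full["root_mse"]

if is_outlier:
    density_kept = [d for i, d in enumerate(density) if i != worst]
    thefts_kept = [t for i, t in enumerate(thefts) if i != worst]
else:
    density_kept, thefts_kept = density, thefts

refit = ols(density_kept, thefts_kept)

# H0: slope = 0 vs. H1: slope != 0 (critical value valid for 17 points).
t_slope = refit["slope"] / refit["se_slope"]
reject_h0 = abs(t_slope) > T_CRIT_DF15

print(f"slope with outlier:    {full['slope']:.5f}")
print(f"slope without outlier: {refit['slope']:.5f}")
print(f"t = {t_slope:.2f}, reject H0 at 5%: {reject_h0}")
```

In this made-up example the single planted point is enough to flip the sign of the fitted slope, which mirrors the kind of effect described in the text; the actual estimates for the Chicago data would have to come from the real observations.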