Exploratory Data Analysis for Cancer Mortality and Vehicle Thefts Data, Exams of Statistics

An analysis of two datasets: one on breast cancer mortality and adult white female population in North Carolina, South Carolina, and Georgia, and the other on population density and vehicle thefts in 18 Chicago districts. The analysis includes visualizing the data, fitting linear and quadratic regression models, identifying outliers, and testing hypotheses.

Typology: Exams

Pre 2010

Uploaded on 09/17/2009

koofers-user-5ic
Note: Only the first two parts are posted here.
Part 1:
Problem: The cancer data set gives values for breast cancer mortality from 1950 to 1960 (y) and the adult
white female population in 1960 (x) for 301 counties in North Carolina, South Carolina, and Georgia.
Explore whether you can model y as a function of x. Defend your model on statistical grounds. Provide any
other information that you can extract from this data set so that health professionals can use it. Do not
forget to do residual analysis for your models.
First, plot mortality against population (the graph below). The plot shows a roughly linear relationship:
mortality grows with population and is roughly proportional to it.
This is plausible: the larger the population, the more people suffer from breast cancer, and so the larger
the number of deaths. A community with no people has no deaths, so we can additionally assume that the
line passes through the origin, i.e. that there is no intercept. So we use the model Yi = βXi
(i = 1, 2, …, 301).
Under this model, the least-squares line is Mortality = 0.00356*Population. The graph below shows that
this line fits the data fairly well. The goodness-of-fit measure R-square is 95.98%, which means that about
95.98% of the variation in mortality is explained by the variation in population, so this model fits the
data very well.
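The no-intercept fit can be sketched in a few lines. This is a minimal illustration, not the software used for the analysis: the data pairs are hypothetical, and computing R-square about zero (rather than about the mean) is an assumption about the convention behind the reported 95.98%.

```python
def fit_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = b*x."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx


def r_square_uncentered(x, y, b):
    """R-square about zero: 1 - SSE / sum(y^2).

    Assumption: this is the convention used for a no-intercept fit.
    """
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    syy = sum(yi * yi for yi in y)
    return 1.0 - sse / syy


# Hypothetical (population, deaths) pairs, NOT the 301-county data set.
population = [1000, 2500, 5000, 12000, 30000]
deaths = [3, 10, 17, 44, 105]

b = fit_through_origin(population, deaths)
r2 = r_square_uncentered(population, deaths, b)
print(f"slope = {b:.5f}, uncentered R-square = {r2:.4f}")
```

The slope is simply the sum of x*y divided by the sum of x squared, which is what forcing the line through the origin amounts to.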
To be safe, we also fit a model with an intercept, i.e. Yi = α + βXi (i = 1, 2, …, 301). Under this
model, the estimated line is Mortality = -0.52612 + 0.00358*Population, and R-square is 93.52%. (This
value is not directly comparable with the 95.98% above if, as is common, the software computes R-square
for a no-intercept model about zero rather than about the mean.)
Intuitively, the intercept (-0.52612, i.e. less than one person in absolute value) is small enough to
ignore. We can justify this statistically by testing the hypothesis 'intercept is 0' against 'intercept
is not 0'. The result (in the table below) gives no evidence that the intercept differs from zero. We can
also test the slope, β = 0 against β ≠ 0; from the result, we conclude that β ≠ 0.
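As a rough sketch of how such tests are computed, the function below fits Yi = α + βXi and returns the t statistic (estimate divided by its standard error) for each coefficient. The five data points are hypothetical, and the p-value helper uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as the 301 counties here.

```python
import math


def ols_with_t(x, y):
    """Fit y = a + b*x by least squares; return estimates and t statistics."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)  # residual variance estimate
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return {"a": a, "b": b, "t_a": a / se_a, "t_b": b / se_b}


def two_sided_p_normal(t):
    """Two-sided p-value from the normal approximation (assumption: large df)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical data, only to show the function running.
res = ols_with_t([1, 2, 3, 4, 5], [5.1, 7.9, 11.0, 14.1, 16.9])
print(res)

# Applying the approximation to the reported intercept t statistic of -0.54.
p = two_sided_p_normal(-0.54)
print(round(p, 3))
```

With the reported intercept t statistic of -0.54 the approximation gives a p-value of roughly 0.59, in line with the tabulated 0.5876.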

Parameter    Estimate   Test statistic   P-value   Significant?
Intercept    -0.5261    -0.54            0.5876    No
Population   0.00358    65.69            <0.0001   Yes

To check this model, plot the residuals (the differences between the predicted and observed numbers of deaths, i.e. the prediction errors) against population (graph below). From the graph we see that there exists a trend: the residuals increase with population. To address this, square root and log transformations were tried (on x and y individually, and on both). The graphs are shown as follows:

1. Take the square root of population (squpop = square root of population):
2. Take the square root of mortality (squmort = square root of mortality):
3. Take the square root of both population and mortality (squmort = square root of mortality, squpop = square root of population):

small, almost all the residuals are less than zero, while when population is large, almost all the residuals are greater than zero. These patterns suggest that these data transformations are not appropriate. In transformation 3 (square root of both) we observe no obvious trend in the residual plot, so perhaps this is the model we are seeking. The regression results confirm this. Some of the results are listed below:

Parameter   Estimate   Root MSE   P-value   Is intercept significant?   R-square
Intercept   0.05892    0.82098    <0.0001   Yes                         0.

With this transformation, we remove the trend in the residuals and increase R-square. Most of the residuals also lie within a horizontal ±2σ band around zero, which suggests that the variability of the Yi's is about constant. So we use this model, i.e. Square root of mortality = 0.05892 * square root of population.

Under this model, the square root of mortality is proportional to the square root of population. For example, when the population is 10000, the square root of mortality is about 5.9, so the number of deaths is about 35.
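Because the model is fitted on the square-root scale, a prediction on the original scale requires squaring the fitted value. The sketch below does this with the reported coefficient 0.05892 and also shows one way to check the ±2σ band, using the reported root MSE as the estimate of σ. The function names and the sample residuals in the usage note are hypothetical.

```python
import math

SLOPE_SQRT = 0.05892  # coefficient reported for the square-root model
ROOT_MSE = 0.82098    # root MSE reported for the square-root model


def predict_mortality(population):
    """Back-transform sqrt(mortality) = SLOPE_SQRT * sqrt(population)."""
    root = SLOPE_SQRT * math.sqrt(population)
    return root ** 2


def share_within_band(residuals, root_mse=ROOT_MSE, k=2.0):
    """Fraction of residuals inside the +/- k * root_mse band around zero."""
    inside = sum(1 for r in residuals if abs(r) <= k * root_mse)
    return inside / len(residuals)


print(round(predict_mortality(10000), 1))  # 34.7
print(round(SLOPE_SQRT ** 2, 5))           # 0.00347
```

The second printed value is the slope implied on the original scale, since squaring both sides gives mortality = 0.05892^2 * population; it is close to the 0.00356 of the untransformed fit. A call such as `share_within_band([0.1, -0.5, 2.0, -1.9])` returns the share of (square-root scale) residuals inside the band.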

Part 2.

Problem: Data on population density (pd) and vehicle thefts (vtt) in 18 Chicago districts are given. Run a regression with vtt as the dependent variable and pd as the independent variable. Plot the residuals against pd. Is there any outlier? If so, explain. If appropriate, delete outliers and re-estimate the model. Test the hypothesis that the slope is 0 against the alternative that it is different from 0, using 5% as the level of significance.

First, the scatter plot of the raw data is given as follows:

Under the linear model Yi = α + βXi, i = 1, 2, …, 18, the residual plot (against pd) is given below:

From these graphs, we notice an outlier, downtown Chicago, which has a much higher vehicle theft rate than the trend displayed by the other 17 districts. This is probably because the crime rate is generally higher in the downtown area. Some of the results are given as follows:

Parameter   Estimate   P-value   Significant?   Root MSE   R-square
Slope       -0.0033    0.0005    Yes
Intercept   88.95      <0.0001   Yes
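An outlier of this kind can also be flagged numerically rather than by eye. The sketch below is one simple approach, not necessarily the one used here: it divides each raw residual by the root MSE and flags points beyond a threshold of 2. The threshold, the function names, and the ten data points are all assumptions for illustration; the actual 18-district data are not reproduced.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b


def flag_outliers(x, y, threshold=2.0):
    """Indices whose raw residual exceeds threshold * root MSE in size.

    Note: residuals are scaled by root MSE only, not adjusted for leverage.
    """
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    root_mse = math.sqrt(sum(r * r for r in resid) / (len(x) - 2))
    return [i for i, r in enumerate(resid) if abs(r) > threshold * root_mse]


# Hypothetical districts: thefts fall with density, except one unusual point.
pd_values = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
vtt_values = [88, 83, 81, 79, 175, 71, 69, 67, 62, 60]

print(flag_outliers(pd_values, vtt_values))  # [4]
```

A single extreme point inflates the root MSE itself, so with very few observations even a gross outlier may not cross the threshold; the residual plot remains the primary check.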

From the above, we can see that under this model R-square is only 54.67% and the residuals increase with pd, which implies that this model is not good enough. We can try to fit the data with a quadratic model, Yi = α + βXi + γXi^2. Some of the results are given below:

Parameter      Estimate   P-value   Significant?   Root MSE   R-square
Square of Pd   2.42E-7    <0.0001   Yes
Pd             -0.01215   <0.0001   Yes
Intercept      159.87     <0.0001   Yes
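A quadratic fit is still linear least squares in the three coefficients, so it reduces to solving a 3x3 system of normal equations. The following is a minimal sketch under stated assumptions: the data are hypothetical, density is expressed in thousands to keep the system well conditioned, and the helper names are illustrative.

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= f * a[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (a[i][n] - s) / a[i][i]
    return sol


def fit_quadratic(x, y):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    s = [sum(xi ** k for xi in x) for k in range(5)]  # sums of x^0 .. x^4
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    m = [[s[0], s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    return solve3(m, t)


# Hypothetical data lying exactly on y = 160 - 12*x + 0.25*x^2 (x in thousands).
density = [1, 2, 3, 4, 5, 6]
thefts = [148.25, 137.0, 126.25, 116.0, 106.25, 97.0]

coef = fit_quadratic(density, thefts)
print([round(c, 4) for c in coef])
```

Because the hypothetical points lie exactly on a parabola, the fit recovers its coefficients up to rounding; real data would leave residuals to examine as before.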

This gives a higher R-square (0.92), but the residual plot still shows the same trend. We first consider the linear model Yi = α + βXi, i = 1, 2, …, 18. Since the single data point for downtown Chicago changes the direction of the fitted line dramatically, we can delete this outlier and fit again. The scatter plot, residual plot and some of the results are given below:
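For illustration only, and not a reproduction of the original output, the delete-and-refit step together with the 5% slope test from the problem statement can be sketched as follows. The 18 data points are hypothetical, with the first one playing the role of the outlier. After deleting one of 18 districts the slope test has 15 degrees of freedom, for which the two-sided 5% critical value of t is 2.131.

```python
import math

T_CRIT_15DF = 2.131  # two-sided 5% critical value, t with 15 degrees of freedom


def slope_and_t(x, y):
    """Least-squares slope of y = a + b*x and its t statistic for H0: b = 0."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / sxx)
    return b, b / se_b


def refit_without(x, y, drop_index):
    """Refit after deleting the observation at drop_index."""
    keep = [i for i in range(len(x)) if i != drop_index]
    return slope_and_t([x[i] for i in keep], [y[i] for i in keep])


# Hypothetical 18 districts: a linear trend with small alternating noise,
# plus one unusually high value at index 0.
pd_values = [1000 * (i + 1) for i in range(18)]
vtt_values = [90 - 0.003 * p + 2 * (-1) ** i for i, p in enumerate(pd_values)]
vtt_values[0] = 300.0

b_all, t_all = slope_and_t(pd_values, vtt_values)
b_trim, t_trim = refit_without(pd_values, vtt_values, 0)
print(f"all 18 points:   slope = {b_all:.5f}, t = {t_all:.2f}")
print(f"outlier removed: slope = {b_trim:.5f}, t = {t_trim:.2f}")
print("reject slope = 0 at 5%:", abs(t_trim) > T_CRIT_15DF)
```

The null hypothesis that the slope is 0 is rejected when the absolute t statistic exceeds the critical value.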