Exploratory Data Analysis for Cancer Mortality and Vehicle Thefts Data, Exams of Statistics

University of Illinois - Chicago Statistics

An analysis of two datasets: one on breast cancer mortality and adult white female population in North Carolina, South Carolina, and Georgia, and the other on population density and vehicle thefts in 18 Chicago districts. The analysis includes visualizing the data, fitting linear and quadratic regression models, identifying outliers, and testing hypotheses.

Typology: Exams

Pre 2010

Uploaded on 09/17/2009

koofers-user-5ic 🇺🇸


Note: Only the first two parts are posted here.

Part 1:

Problem: The cancer data set gives values for breast cancer mortality from 1950 to 1960 (y) and the adult white female population in 1960 (x) for 301 counties in North Carolina, South Carolina, and Georgia. Explore whether you can model y as a function of x. Defend your model on statistical grounds. Provide any other information that you can extract from this data set so that health professionals can use it. Do not forget to do residual analysis for your models.

First, plot mortality against population (the graph below). The plot shows a roughly linear relationship: mortality grows with population and is roughly proportional to it.

Intuitively, the larger the population, the more people suffer from breast cancer and the larger the number of deaths, so this assumption is plausible. If a community has no people at all, there will be no mortality, so we can additionally assume that the line passes through the origin, i.e. there is no intercept. So we use the model Yi = βXi (i = 1, 2, …, 301).

Under this model, the ‘best’ line is given by Mortality = 0.00356 * Population. From the graph below, we can see that this line fits the data fairly well. The ‘goodness-of-fit ratio’ R-square is 95.98%, which means that about 95.98% of the variation in mortality can be explained by the variation in population, so this model fits the data very well.
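The through-origin fit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `fit_through_origin` and the five data points are made up for the example and are not the 301-county data. It also reports R-square about the origin, which is how regression software commonly defines R-square when the intercept is dropped; that convention may be one reason the no-intercept R-square is not directly comparable with the R-square of a model that has an intercept.

```python
def fit_through_origin(x, y):
    """Least-squares slope for the model y = beta * x (no intercept)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    beta = sxy / sxx
    sse = sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))
    # Without an intercept, R-square is taken about the origin:
    # 1 - SSE / sum(y^2), rather than 1 - SSE / sum((y - ybar)^2).
    r2_uncentered = 1.0 - sse / sum(yi * yi for yi in y)
    return beta, r2_uncentered


# Made-up illustrative values, NOT the cancer data set.
population = [1000.0, 2000.0, 5000.0, 10000.0, 20000.0]
deaths = [4.0, 7.0, 17.0, 36.0, 72.0]

beta, r2 = fit_through_origin(population, deaths)
print(f"slope = {beta:.5f}, R-square about the origin = {r2:.4f}")
```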

To be safe, we can also fit a model with an intercept, i.e. Yi = α + βXi (i = 1, 2, …, 301). Under this model, the estimated line is Mortality = -0.52612 + 0.00358 * Population, and R-square is 93.52%. Intuitively, the intercept (-0.52612, i.e. less than one person) is so small that it can be ignored. We can justify this statistically by testing the hypothesis ‘intercept is 0’ vs. ‘intercept is not 0’. The result (in the table below) gives no evidence that the intercept differs from zero. We can also test slope β = 0 vs. β ≠ 0; from the result, we conclude that β ≠ 0.
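The two tests above rest on t statistics of the form estimate / standard error. The sketch below shows one way to compute them by hand; the names `ols_with_t_stats` and `approx_two_sided_p` and the small data set are assumptions for illustration, not the author's code or data. The p-value helper uses a normal approximation to the t distribution, which is reasonable only for large samples such as the 301 counties here.

```python
import math
from statistics import NormalDist


def ols_with_t_stats(x, y):
    """Fit y = alpha + beta * x; return (estimate, t statistic) for each parameter."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)  # residual variance estimate on n - 2 degrees of freedom
    se_slope = math.sqrt(mse / sxx)
    se_intercept = math.sqrt(mse * (1.0 / n + xbar ** 2 / sxx))
    return {
        "intercept": (intercept, intercept / se_intercept),
        "slope": (slope, slope / se_slope),
    }


def approx_two_sided_p(t_stat):
    """Two-sided p-value from a normal approximation (large samples only)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


# Made-up illustrative values, NOT the cancer data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
fit = ols_with_t_stats(x, y)
print(fit)

# Applied to the intercept's t statistic reported in the text (-0.54):
p_intercept = approx_two_sided_p(-0.54)
print(f"approximate p-value for the intercept: {p_intercept:.3f}")
```

With the reported t statistic of -0.54, the approximation lands close to the reported p-value of 0.5876, far above any usual significance level.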


Parameter    Estimate   Test statistic   P-value   Significant?
Intercept    -0.5261    -0.54            0.5876    No
Population   0.00358    65.69            <0.0001   Yes

To check this model, plot the residuals (the differences between the predicted and the observed number of deaths, i.e. the prediction ‘error’) against population (graph below). The graph shows a trend: the residuals increase with population. To address this, square root and log transformations were tried (on x and y individually, and on both). The graphs are shown as follows:

1. Take the square root of population (squpop = square root of population):
2. Take the square root of mortality (squmort = square root of mortality):
3. Take the square root of both population and mortality (‘squmort’ stands for square root of mortality, ‘squpop’ stands for square root of population):
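The residual check described above is normally done by eye on a plot, but the same idea can be sketched numerically by comparing the spread of the residuals for small and large x. The helper name `residual_spread_by_half` and the data below are hypothetical and chosen only to show a spread that grows with x. The sketch uses the common convention residual = observed minus fitted; flipping the sign does not change the spread.

```python
import statistics


def residual_spread_by_half(x, y, slope, intercept=0.0):
    """Return residuals sorted by x, plus their spread in the low-x and high-x halves."""
    pairs = sorted(zip(x, y))
    resid = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(resid) // 2
    low_sd = statistics.pstdev(resid[:half])
    high_sd = statistics.pstdev(resid[half:])
    return resid, low_sd, high_sd


# Made-up illustrative values, NOT the cancer data set.
x = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
y = [0.4, 0.7, 1.0, 1.9, 1.2, 2.9]

resid, low_sd, high_sd = residual_spread_by_half(x, y, slope=0.0035)
print(f"spread for small x: {low_sd:.3f}, spread for large x: {high_sd:.3f}")
```

A much larger spread in the high-x half is the numeric counterpart of a residual plot that fans out, which is the situation a variance-stabilizing transformation such as the square root is meant to address.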

small, almost all the residuals are less than zero, while when population is large, almost all the residuals are greater than zero. These patterns suggest that these data transformations are not appropriate.

In data transformation 3 (square root of both) we observe no obvious trend in the residual plot, so this may be the model we are seeking. The regression results confirm it. Some of the results are listed below:

Parameter   Estimate   Root MSE   P-value   Is intercept significant?   R-square
Intercept   0.05892    0.82098    <0.0001   Yes                         0.

With this transformation, we remove the trend in the residuals and increase R-square. We also find that most of the residuals lie within a horizontal ±2σ band around zero, which suggests that the variability of the Yi’s is about constant. So we use this model: Square root of mortality = 0.05892 * square root of population. In other words, the square root of mortality is proportional to the square root of population. For example, when population is 10000, the square root of mortality is about 5.9, so the number of deaths is about 35.
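Because the final model is fitted on the square-root scale, a prediction has to be squared to get back to a count of deaths. The short sketch below works through that arithmetic with the coefficient 0.05892 reported in the text; the function name `predicted_mortality` is an assumption for illustration.

```python
import math

SQRT_SLOPE = 0.05892  # coefficient reported in the text for the square-root model


def predicted_mortality(population, slope=SQRT_SLOPE):
    """Return (predicted sqrt of mortality, predicted mortality on the original scale)."""
    root = slope * math.sqrt(population)
    return root, root ** 2


root, deaths = predicted_mortality(10_000)
print(f"sqrt(mortality) = {root:.3f}, mortality = {deaths:.1f}")
# prints: sqrt(mortality) = 5.892, mortality = 34.7
```

The value near 6 is the predicted square root; the predicted number of deaths is its square, about 35, which agrees with the through-origin linear fit (0.00356 * 10000 is about 36).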

Part 2.

Problem: Data on population density (pd) and vehicle thefts (vtt) in 18 Chicago districts are given. Run a regression with vtt as the dependent variable and pd as the independent variable. Plot the residuals against pd. Is there any outlier? If so, explain. If appropriate, delete outliers and re-estimate the model. Test the hypothesis that the slope is 0 against the alternative that it is different from 0, using 5% as the level of significance.

First, the scatter plot of the raw data is given as follows:

Under the linear model Yi = α + βXi (i = 1, 2, …, 18), the residual plot (against pd) is given below:

From these graphs, we notice an outlier, downtown Chicago, which has a much higher vehicle theft rate than the trend displayed by the other 17 districts. This is probably because the crime rate is generally higher in the downtown area. Some of the results are given as follows:

Parameter   Estimate   P-value   Significant?   Root MSE   R-square
Slope       -0.0033    0.0005    Yes
Intercept   88.95      <0.0001   Yes

From the above, we can see that under this model R-square is only 54.67% and the residuals increase with pd, which implies that this model is not good enough. We can try to fit the data with a quadratic model, such as Yi = α + βXi + γXi^2. Some of the results are given below:

Parameter      Estimate   P-value   Significant?   Root MSE   R-square
Square of pd   2.42E-7    <0.0001   Yes
pd             -0.01215   <0.0001   Yes
Intercept      159.87     <0.0001   Yes

This gives a higher R-square (0.92), but the residual plot still shows the same trend. We first consider the model Yi = α + βXi (i = 1, 2, …, 18). Since we have noticed that the single data point for downtown Chicago changes the direction of the fitted line dramatically, we can delete the outlier and fit again. The scatter plot, residual plot and some of the results are given below:
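The delete-and-refit step, followed by the 5% two-sided test of slope = 0, can be sketched as follows. Everything here is illustrative: the function name `slope_t_test`, the eight made-up districts, and the planted outlier in the last position are assumptions, not the Chicago data. The critical values are standard two-sided 5% points of the t distribution; the entries for 16 and 15 degrees of freedom correspond to 18 and 17 observations.

```python
import math

# Two-sided 5% critical values of the t distribution, keyed by degrees of freedom.
T_CRIT_975 = {5: 2.571, 6: 2.447, 15: 2.131, 16: 2.120}


def slope_t_test(x, y):
    """Fit y = alpha + beta * x; return (slope, t statistic, reject H0: beta = 0 at 5%)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    t_stat = slope / math.sqrt(sse / (n - 2) / sxx)
    return slope, t_stat, abs(t_stat) > T_CRIT_975[n - 2]


# Made-up illustrative values; the last point is a planted outlier
# (low density, very high thefts), loosely mimicking a downtown district.
density = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 500.0]
thefts = [12.0, 19.0, 33.0, 41.0, 48.0, 62.0, 69.0, 300.0]

slope_all, t_all, reject_all = slope_t_test(density, thefts)
slope_trim, t_trim, reject_trim = slope_t_test(density[:-1], thefts[:-1])

print(f"all points:      slope = {slope_all:.5f}, t = {t_all:.2f}")
print(f"outlier removed: slope = {slope_trim:.5f}, t = {t_trim:.2f}, reject = {reject_trim}")
```

In this toy example a single extreme point is enough to flip the sign of the fitted slope, which is the same kind of effect the text attributes to the downtown district.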
