Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
Material Type: Assignment; Class: Methods of Applied Statistics; Subject: Statistics; University: University of Illinois - Urbana-Champaign; Term: Unknown 1989;
Typology: Assignments
1 / 7
13.1 Consider only the 3 predictors: status, income and sex. dependant variable = gamble. (1.) Plot gamble vs status for sex = 1 and plot gamble vs status for sex = 0. (2.) Do the same in (1) using income. Comment on the above plots: Is a linear relationship obvious? Iis there any difference between slopes for males vs females?
library(faraway) data(teengamb) attach(teengamb) teengamb[1:3,] sex status income verbal gamble 1 1 51 2.0 8 0 2 1 28 2.5 8 0 3 1 37 2.0 6 0 teen1<-subset(teengamb,sex==1,select=c(status,income,gamble)) teen0<-subset(teengamb,sex==0,select=c(status,income,gamble)) ?teengamb
Study of teenage gambling in Britain Description: The 'teengamb' data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain.
This data frame contains the following columns: 'sex' 0=male, 1=female 'status' Socioeconomic status score based on parents' occupation 'income' in pounds per week 'verbal' verbal score in words out of 12 correctly defined 'gamble' expenditure on gambling in pounds per year
par(mfrow=c(2,2)) plot(teen1$gamble~teen1$status, main="1. gamble vs status for female") g1<-lm(teen1$gamble~teen1$status) abline(g1)
plot(teen0$gamble~teen0$status, main="2. gamble vs status for male") g2<-lm(teen0$gamble~teen0$status) abline(g2)
plot(teen1$gamble~teen1$income, main="3. gamble vs income for female") g3<-lm(teen1$gamble~teen1$income)
abline(g3) plot(teen0$gamble~teen0$income, main="4. gamble vs income for male") g4<-lm(teen0$gamble~teen0$income) abline(g4)
20 30 40 50 60
0
5
10
15
20
1. gamble vs status for female
teen1$status
teen1$gamble
20 30 40 50 60 70
0
50
100
150
2. gamble vs status for male
teen0$status
teen0$gamble
2 4 6 8 10
0
5
10
15
20
3. gamble vs income for female
teen1$income
teen1$gamble
2 4 6 8 10 14
0
50
100
150
4. gamble vs income for male
teen0$income
teen0$gamble
For gamble vs status, there are linear trends for both female and male although females have a positive trend and males have a negative trend. For gamble vs income, females have a very weak linear trend in positive direction, but the slope is almost 0 and the linear model doesn't seem to fit well. In contrast, males have a relatively stronger positive linear trend.
(3.) Using AIC or BIC or Backward elimination, find the best set of predictors among status, income, sex, statussex, incomesex, status*income. (4.) Interpret the best model. Find fitted equations for males vs females.
g<-lm(gamble~status+income+sex+status:sex+income:sex+status:income, data=teengamb)
AIC
step(g) Start: AIC= 284. gamble ~ status + income + sex + status:sex + income:sex + status:income
Df Sum of Sq RSS AIC
Step: AIC= 282. gamble ~ status + income + sex + income:sex + status:income
Df Sum of Sq RSS AIC
Call: lm(formula = gamble ~ status + income + sex + income:sex + status:income, data = teengamb)
Coefficients: (Intercept) status income sex income:sex -29.8693 0.5899 13.8884 15.1803 -9. status:income -0.
According to AIC criterion, the best model is gamble ~ status + income + sex + income:sex + status:income. It seems that the interaction between status and sex is not significant.
gamble = -29.8693 + (0.5899)status + (13.8884)income + (15.1803)sex +(-9.1007)income:sex + (-0.1698)*status:income
Therefore, for females (sex=1), gamble = (-29.8693+15.1803) + (0.5899)status + (13.8884-9.1007)income
Backward Elimination (alpha=0.1) summary(g)
g1<-lm(gamble~status+income+status:sex+income:sex+status:income,data=teengamb) summary(g1)
g2<-lm(gamble~status+income+income:sex+status:income,data=teengamb) summary(g2)
g3<-lm(gamble~income+income:sex+status:income,data=teengamb) summary(g3) Estimate Std. Error t value Pr(>|t|) (Intercept) 3.55217 5.09364 0.697 0. income 9.81500 1.53470 6.395 9.72e-08 *** income:sex -7.14418 1.28127 -5.576 1.51e-06 *** income:status -0.09215 0.03390 -2.718 0.00943 **
resid<-g3$resid ; fitted<-g3$fitted par(mfrow=c(1,2)) plot(fitted,resid) qqnorm(resid); qqline(resid)
0 20 40 60 80 100
0
20
40
60
80
fitted
resid
-2 -1 0 1 2
0
20
40
60
80
Normal Q-Q Plot
Theoretical Quantiles
Sample Quantiles
The residuals vs fitted plot implies the nonconstant variance, and the normal QQ plot indicates a problem of nonnormality. Therefore, we better try some transformation for the response variable 'gamble'. We will fit the model again with the sqrt(gamble). (We can also try log transformation.)
gamblenew<-sqrt(gamble)
With this new response variable 'gamblenew', we will first fit the full model, and do the backward elimination to find the best submodel.
out<-lm(gamblenew~status+income+sex+status:sex+income:sex+status:income,data=teen gamb) summary(out)
out1<-lm(gamblenew~status+income+status:sex+income:sex+status:income,data=teengam b) summary(out1)
out2<-lm(gamblenew~status+income+income:sex+status:income,data=teengamb) summary(out2) Estimate Std. Error t value Pr(>|t|) (Intercept) -1.306508 1.564998 -0.835 0. status 0.061596 0.030470 2.022 0.0496 * income 1.127340 0.233780 4.822 1.89e-05 ***
income:sex -0.592807 0.129840 -4.566 4.30e-05 *** status:income -0.012898 0.005181 -2.489 0.0168 *
resid<-out2$resid fitted<-out2$fitted par(mfrow=c(1,2)) plot(fitted,resid) qqnorm(resid) qqline(resid)
2 4 6 8 10
-^ -^
0
2
4
fitted
resid
-2 -1 0 1 2
-^ -^
0
2
4
Normal Q-Q Plot
Theoretical Quantiles
Sample Quantiles
These plots look a lot better than the first plots although these still don't seem completely satisfactory for our constant variance and normality assumption. You can further try another transformations to improve these plots. Now for the remaining main effects - 'status' and 'income', we need to check the linear relationship between gamblenew and each of them. To see this, we'll use the added variable plots.
par(mfrow=c(1,2))
a1<-lm(gamblenew~income+income:sex+status:income) b1<-lm(status~income+income:sex+status:income) resid_a1<-a1$resid resid_b1<-b1$resid plot(resid_b1,resid_a1)
a2<-lm(gamblenew~status+income:sex+status:income) b2<-lm(income~status+income:sex+status:income)
resid_a2<-a2$resid resid_b2<-b2$resid plot(resid_b2,resid_a2)
-15 -10 -5 0 5 10 15
-^ -^
0
2
4
resid_b
resid_a
-2 0 2 4
-^ -^
0
2
4
6
resid_b
resid_a
The first added variable plot (gamblenew vs status) shows a weak linear relationship between gamblenew and status. And the second plot(gamblenew vs income) shows a strong linear relationship between gamblenew and income. Therefore, linear model seems to be fine.
Therefore, according to the Backward Elimination with transformed response, the best model is gamblenew~status+income+income:sex+status:income,data=teengamb.
gamblenew = -1.306508+(0.061596)status+(1.127340 )income +(-0.592807)income:sex+(-0.012898)status:income
For females (sex=1), gamblenew = -1.306508+(0.061596)status+(1.127340-0.592807 )income +(-0.012898)*status:income
For males (sex=0), gamblenew = -1.306508+(0.061596)status+(1.127340 )income+(-0.012898)*status:income