Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Solved Problems on Methods of Applied Statistics - Assignment 4 | STAT 420, Assignments of Data Analysis & Statistical Methods

Material Type: Assignment; Class: Methods of Applied Statistics; Subject: Statistics; University: University of Illinois - Urbana-Champaign; Term: Unknown 1989;

Typology: Assignments

Pre 2010

Uploaded on 03/11/2009

koofers-user-wca-1
koofers-user-wca-1 🇺🇸

10 documents

1 / 7

Toggle sidebar

Related documents


Partial preview of the text

Download Solved Problems on Methods of Applied Statistics - Assignment 4 | STAT 420 and more Assignments Data Analysis & Statistical Methods in PDF only on Docsity!

13.1 Consider only the 3 predictors: status, income and sex. dependant variable = gamble. (1.) Plot gamble vs status for sex = 1 and plot gamble vs status for sex = 0. (2.) Do the same in (1) using income. Comment on the above plots: Is a linear relationship obvious? Iis there any difference between slopes for males vs females?

library(faraway) data(teengamb) attach(teengamb) teengamb[1:3,] sex status income verbal gamble 1 1 51 2.0 8 0 2 1 28 2.5 8 0 3 1 37 2.0 6 0 teen1<-subset(teengamb,sex==1,select=c(status,income,gamble)) teen0<-subset(teengamb,sex==0,select=c(status,income,gamble)) ?teengamb

Study of teenage gambling in Britain Description: The 'teengamb' data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain.

This data frame contains the following columns: 'sex' 0=male, 1=female 'status' Socioeconomic status score based on parents' occupation 'income' in pounds per week 'verbal' verbal score in words out of 12 correctly defined 'gamble' expenditure on gambling in pounds per year

par(mfrow=c(2,2)) plot(teen1$gamble~teen1$status, main="1. gamble vs status for female") g1<-lm(teen1$gamble~teen1$status) abline(g1)

plot(teen0$gamble~teen0$status, main="2. gamble vs status for male") g2<-lm(teen0$gamble~teen0$status) abline(g2)

plot(teen1$gamble~teen1$income, main="3. gamble vs income for female") g3<-lm(teen1$gamble~teen1$income)

abline(g3) plot(teen0$gamble~teen0$income, main="4. gamble vs income for male") g4<-lm(teen0$gamble~teen0$income) abline(g4)

20 30 40 50 60

0

5

10

15

20

1. gamble vs status for female

teen1$status

teen1$gamble

20 30 40 50 60 70

0

50

100

150

2. gamble vs status for male

teen0$status

teen0$gamble

2 4 6 8 10

0

5

10

15

20

3. gamble vs income for female

teen1$income

teen1$gamble

2 4 6 8 10 14

0

50

100

150

4. gamble vs income for male

teen0$income

teen0$gamble

For gamble vs status, there are linear trends for both female and male although females have a positive trend and males have a negative trend. For gamble vs income, females have a very weak linear trend in positive direction, but the slope is almost 0 and the linear model doesn't seem to fit well. In contrast, males have a relatively stronger positive linear trend.

(3.) Using AIC or BIC or Backward elimination, find the best set of predictors among status, income, sex, statussex, incomesex, status*income. (4.) Interpret the best model. Find fitted equations for males vs females.

g<-lm(gamble~status+income+sex+status:sex+income:sex+status:income, data=teengamb)

AIC

step(g) Start: AIC= 284. gamble ~ status + income + sex + status:sex + income:sex + status:income

Df Sum of Sq RSS AIC

  • status:sex 1 71.5 14950.9 282. 14879.4 284.
  • status:income 1 2946.1 17825.5 291.
  • income:sex 1 4310.1 19189.5 294.

Step: AIC= 282. gamble ~ status + income + sex + income:sex + status:income

Df Sum of Sq RSS AIC 14950.9 282.

  • status:income 1 3514.0 18464.8 290.
  • income:sex 1 6696.5 21647.3 298.

Call: lm(formula = gamble ~ status + income + sex + income:sex + status:income, data = teengamb)

Coefficients: (Intercept) status income sex income:sex -29.8693 0.5899 13.8884 15.1803 -9. status:income -0.

According to AIC criterion, the best model is gamble ~ status + income + sex + income:sex + status:income. It seems that the interaction between status and sex is not significant.

gamble = -29.8693 + (0.5899)status + (13.8884)income + (15.1803)sex +(-9.1007)income:sex + (-0.1698)*status:income

Therefore, for females (sex=1), gamble = (-29.8693+15.1803) + (0.5899)status + (13.8884-9.1007)income

  • (-0.1698)status:income And for males (sex=0), gamble = -29.8693 + (0.5899)status + (13.8884)income + (-0.1698)status:income

Backward Elimination (alpha=0.1) summary(g)

remove 'sex'

g1<-lm(gamble~status+income+status:sex+income:sex+status:income,data=teengamb) summary(g1)

remove 'status:sex'

g2<-lm(gamble~status+income+income:sex+status:income,data=teengamb) summary(g2)

remove 'status'

g3<-lm(gamble~income+income:sex+status:income,data=teengamb) summary(g3) Estimate Std. Error t value Pr(>|t|) (Intercept) 3.55217 5.09364 0.697 0. income 9.81500 1.53470 6.395 9.72e-08 *** income:sex -7.14418 1.28127 -5.576 1.51e-06 *** income:status -0.09215 0.03390 -2.718 0.00943 **

all significant - stop elimination - final model - g

Diagnostics

resid<-g3$resid ; fitted<-g3$fitted par(mfrow=c(1,2)) plot(fitted,resid) qqnorm(resid); qqline(resid)

0 20 40 60 80 100

0

20

40

60

80

fitted

resid

-2 -1 0 1 2

0

20

40

60

80

Normal Q-Q Plot

Theoretical Quantiles

Sample Quantiles

The residuals vs fitted plot implies the nonconstant variance, and the normal QQ plot indicates a problem of nonnormality. Therefore, we better try some transformation for the response variable 'gamble'. We will fit the model again with the sqrt(gamble). (We can also try log transformation.)

Transform the response variable

gamblenew<-sqrt(gamble)

With this new response variable 'gamblenew', we will first fit the full model, and do the backward elimination to find the best submodel.

Backward Elimination using the transformed response variable (alpha=0.1)

out<-lm(gamblenew~status+income+sex+status:sex+income:sex+status:income,data=teen gamb) summary(out)

remove 'sex'

out1<-lm(gamblenew~status+income+status:sex+income:sex+status:income,data=teengam b) summary(out1)

remove 'status:sex'

out2<-lm(gamblenew~status+income+income:sex+status:income,data=teengamb) summary(out2) Estimate Std. Error t value Pr(>|t|) (Intercept) -1.306508 1.564998 -0.835 0. status 0.061596 0.030470 2.022 0.0496 * income 1.127340 0.233780 4.822 1.89e-05 ***

income:sex -0.592807 0.129840 -4.566 4.30e-05 *** status:income -0.012898 0.005181 -2.489 0.0168 *

all significant - stop elimination

final model - out

Diagnostics

resid<-out2$resid fitted<-out2$fitted par(mfrow=c(1,2)) plot(fitted,resid) qqnorm(resid) qqline(resid)

2 4 6 8 10

-^ -^

0

2

4

fitted

resid

-2 -1 0 1 2

-^ -^

0

2

4

Normal Q-Q Plot

Theoretical Quantiles

Sample Quantiles

These plots look a lot better than the first plots although these still don't seem completely satisfactory for our constant variance and normality assumption. You can further try another transformations to improve these plots. Now for the remaining main effects - 'status' and 'income', we need to check the linear relationship between gamblenew and each of them. To see this, we'll use the added variable plots.

Added Variable Plots

1. gamblenew vs status

2. gamblenew vs income

par(mfrow=c(1,2))

a1<-lm(gamblenew~income+income:sex+status:income) b1<-lm(status~income+income:sex+status:income) resid_a1<-a1$resid resid_b1<-b1$resid plot(resid_b1,resid_a1)

a2<-lm(gamblenew~status+income:sex+status:income) b2<-lm(income~status+income:sex+status:income)

resid_a2<-a2$resid resid_b2<-b2$resid plot(resid_b2,resid_a2)

-15 -10 -5 0 5 10 15

-^ -^

0

2

4

resid_b

resid_a

-2 0 2 4

-^ -^

0

2

4

6

resid_b

resid_a

The first added variable plot (gamblenew vs status) shows a weak linear relationship between gamblenew and status. And the second plot(gamblenew vs income) shows a strong linear relationship between gamblenew and income. Therefore, linear model seems to be fine.

Therefore, according to the Backward Elimination with transformed response, the best model is gamblenew~status+income+income:sex+status:income,data=teengamb.

gamblenew = -1.306508+(0.061596)status+(1.127340 )income +(-0.592807)income:sex+(-0.012898)status:income

For females (sex=1), gamblenew = -1.306508+(0.061596)status+(1.127340-0.592807 )income +(-0.012898)*status:income

For males (sex=0), gamblenew = -1.306508+(0.061596)status+(1.127340 )income+(-0.012898)*status:income