Incorporating Categorical Predictors in Linear Regression with Dummy Variables, Study notes of Statistics

University of Iowa (UI), Statistics

These notes explain how to integrate categorical predictors into linear regression models by constructing dummy variables. They cover the interpretation of the coefficients in a simple model with one covariate and one dummy variable, and provide an example using apple tree size and yield data with a categorical pruning method variable. They also discuss the significance of the dummy variable as a regressor and the creation of plots to distinguish the pruning methods.


Uploaded on 09/17/2009

22s:152 Applied Linear Regression

Chapter 7: Dummy Variable Regression

————————————————————

So far, we’ve only considered quantitative variables in our models.

We can integrate categorical predictors by constructing artificial variables (known as dummy variables or indicator variables).

We’ll illustrate here with a binary predictor (e.g. Male/Female).

• Pick one category as the default (say Female).

• Define $z_i = 0$ if observation $i$ is a Female, otherwise $z_i = 1$ (denotes Male).

• A simple model with 1 covariate and 1 dummy variable:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$$

where $Y_i$ and $x_i$ are continuous variables, and $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.

– Note:

$E(Y \mid x) = \beta_0 + \beta_1 x$ for females

$E(Y \mid x) = (\beta_0 + \beta_2) + \beta_1 x$ for males

– Information from both the males and the females was used to estimate $\beta_1$ and $\sigma^2$.

– $\beta_0$ is interpreted as the female Y-intercept.

– $\beta_2$ is interpreted as the difference in expected value of Y for identical x units (same x) in the two groups.
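As a short worked step (a sketch using only the two conditional means stated in these notes), subtracting the female line from the male line at the same value of x shows why $\beta_2$ is the vertical gap between the two parallel lines:

```latex
\begin{aligned}
E(Y \mid x, z = 1) - E(Y \mid x, z = 0)
  &= \bigl[(\beta_0 + \beta_2) + \beta_1 x\bigr] - \bigl[\beta_0 + \beta_1 x\bigr] \\
  &= \beta_2 .
\end{aligned}
```

The difference does not depend on x, which is what makes this model additive: the two groups share the slope $\beta_1$ and differ only in intercept.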

• Example: Apple yield and tree size.

A botany student wants to model the relationship between apple tree size (diameter) and yield (bushels). She also has information on the pruning method used on the trees (Pyramid or Flattop).

The dependent variable Bushels is quantitative, as is the independent variable Diameter, but Pruning is a categorical or qualitative variable.

Consider the model we previously described:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$$

where $z_i = 0$ if observation $i$ was a Flattop pruning, otherwise $z_i = 1$ (denotes Pyramid pruning).

The data:

> botany.data=read.csv("botany.csv")
> attach(botany.data)
> head(botany.data)
  Diameter Bushels Pruning
1       20   10.56 Pyramid
2       14    6.14 Pyramid
3       16    6.30 Pyramid
4       13    6.38 Pyramid
5       18    8.65 Pyramid
6       17    7.02 Pyramid
> unique(Pruning)
[1] Pyramid Flattop
Levels: Flattop Pyramid
> is.factor(Pruning)
[1] TRUE
> is.numeric(Pruning)
[1] FALSE
> plot(Diameter,Bushels,pch=16)


[Figure: scatterplot of Bushels (vertical axis) against Diameter (horizontal axis).]

There does appear to be a linear relationship between tree diameter and yield in bushels.

The simple linear regression:

> lm.out=lm(Bushels ~ Diameter)
> summary(lm.out)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.18886    0.75898  -2.884  0.00988 **
Diameter     0.62361    0.05185  12.028 4.86e-10 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.133 on 18 degrees of freedom
Multiple R-Squared: 0.8894, Adjusted R-squared: 0.
F-statistic: 144.7 on 1 and 18 DF, p-value: 4.86e-

> abline(lm.out)

[Figure: scatterplot of Bushels against Diameter with the fitted regression line added.]

Does Pruning Method also make a significant impact on yield?

First, we’ll create the dummy variable:

Allocate space for the new vector:

> pruning.dummy=rep(0,nrow(botany.data))
> pruning.dummy
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If Pruning equals "Pyramid", code it as 1.

> pruning.dummy[Pruning=="Pyramid"]=1
> pruning.dummy
 [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
> data.frame(Pruning,pruning.dummy)
   Pruning pruning.dummy
1  Pyramid             1
2  Pyramid             1
3  Pyramid             1
4  Pyramid             1
5  Pyramid             1
6  Pyramid             1
7  Pyramid             1
8  Pyramid             1
9  Pyramid             1
10 Pyramid             1
11 Pyramid             1
12 Flattop             0
13 Flattop             0
14 Flattop             0
15 Flattop             0
16 Flattop             0
17 Flattop             0
18 Flattop             0
19 Flattop             0
20 Flattop             0

Fit a model with both Diameter and Pruning (as a dummy variable).

> lm.out.2=lm(Bushels ~ Diameter + pruning.dummy)
> summary(lm.out.2)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.90616    0.75416  -2.528   0.0217 *
Diameter       0.63352    0.05038  12.574 4.91e-10 ***
pruning.dummy -0.76259    0.49468  -1.542   0.

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.092 on 17 degrees of freedom
Multiple R-Squared: 0.9029, Adjusted R-squared: 0.
F-statistic: 79.06 on 2 and 17 DF, p-value: 2.458e-
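The hand-built dummy variable makes the coding explicit, but R can also build the same indicator automatically when a factor is placed in the model formula. The sketch below uses simulated data (an assumption for illustration only; these are not the botany.csv values, and the coefficients used to generate them are made up) to show that the two approaches give the same fitted coefficients when Flattop is the baseline level.

```r
# Simulated stand-in data: hypothetical values, NOT the botany.csv data.
set.seed(1)
n <- 20
Diameter <- round(runif(n, 6, 20))
Pruning  <- factor(rep(c("Pyramid", "Flattop"), times = c(11, 9)))
Bushels  <- -2 + 0.6 * Diameter - 0.8 * (Pruning == "Pyramid") + rnorm(n)

# Hand-built dummy: 1 for Pyramid, 0 for Flattop (same coding as the notes).
pruning.dummy <- as.numeric(Pruning == "Pyramid")

# Fit once with the explicit dummy and once with the factor itself.
fit.dummy  <- lm(Bushels ~ Diameter + pruning.dummy)
fit.factor <- lm(Bushels ~ Diameter + Pruning)

# Factor levels are sorted alphabetically, so Flattop is the baseline and
# R creates an indicator for Pyramid, reported as "PruningPyramid".
coef(fit.dummy)
coef(fit.factor)
```

With R's default treatment contrasts, the first factor level plays the role of the category coded 0, so the coding scheme still needs to be kept in mind when reading the output.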

Inclusion of Interaction

Returning to the earlier model with a male/female binary variable.

• A slightly more complicated model:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i$$

– Note:

$E(Y \mid x) = \beta_0 + \beta_1 x$ for females

$E(Y \mid x) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x$ for males

– The xz interaction term allows for a different slope for each group.

– xz may be called a slope dummy variable.

– This model allows for a separate regression line for each group.

• What is the benefit to bringing the data for the two groups together?

– If there is constant variance, we’ll have a better estimate for $\sigma^2$ by pooling the data (we may not pool if we don’t think the groups have a common $\sigma^2$).

– We can run tests comparing the two groups.

– More degrees of freedom.

• To test if the lines are parallel, test if the interaction term is significant. $H_0: \beta_3 = 0$.

• If there’s no significant interaction, we can test whether there are two separate lines for the two groups, or if one is sufficient to describe the data. $H_0: \beta_2 = 0$.

• If there is significant interaction, the effect of covariate x on the response is different for different values of z.

Let’s fit the interaction model to the tree data.

> lm.out.3=lm(Bushels ~ Diameter + pruning.dummy + Diameter*pruning.dummy)
> summary(lm.out.3)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.00130    0.98989  -2.022   0.
Diameter                0.64078    0.06988   9.170 9.05e-08 ***
pruning.dummy          -0.53930    1.52761  -0.353   0.
Diameter:pruning.dummy -0.01618    0.10434  -0.155   0.

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.124 on 16 degrees of freedom
Multiple R-Squared: 0.9031, Adjusted R-squared: 0.
F-statistic: 49.69 on 3 and 16 DF, p-value: 2.487e-

> lm.out.3$coefficients
           (Intercept)               Diameter          pruning.dummy Diameter:pruning.dummy
           -2.00129614             0.64077682            -0.53929972                    -0.
                    β0                     β1                     β2                     β3
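The test of whether the lines are parallel can be read from the interaction row of the coefficient table, or obtained by comparing the additive and interaction models with a partial F-test. The sketch below uses simulated data (the variables x, z and y and the generating coefficients are hypothetical, not the tree data) to show that both routes give the same p-value when a single term is being tested.

```r
# Simulated data for illustration only (hypothetical values).
set.seed(2)
n <- 20
x <- runif(n, 6, 20)                   # quantitative covariate
z <- rep(c(1, 0), times = c(11, 9))    # dummy variable
y <- -2 + 0.6 * x - 0.8 * z + rnorm(n)

fit.additive    <- lm(y ~ x + z)       # parallel lines
fit.interaction <- lm(y ~ x * z)       # expands to x + z + x:z

# Route 1: t-test of H0: beta_3 = 0 from the coefficient table.
p.t <- summary(fit.interaction)$coefficients["x:z", "Pr(>|t|)"]

# Route 2: partial F-test comparing the nested models.
p.F <- anova(fit.additive, fit.interaction)[2, "Pr(>F)"]

c(t.test = p.t, F.test = p.F)
```

The F-test form becomes the more useful one when several interaction terms are tested at once, for example with a categorical variable that has more than two categories.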

Model allows for different intercepts:

> intercept.Flattop=lm.out.2$coefficients[1]
> intercept.Flattop
(Intercept)
        -1.
> intercept.Pyramid=lm.out.2$coefficients[1] + lm.out.2$coefficients[3]
> intercept.Pyramid
(Intercept)
        -2.

Model allows for different slopes:

> slope.Flattop=lm.out.3$coefficients[2]
> slope.Flattop
Diameter

> slope.Pyramid=lm.out.3$coefficients[2] + lm.out.3$coefficients[4]
> slope.Pyramid
Diameter
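As a worked check of the arithmetic, the group-specific quantities can be recomputed by hand from the rounded estimates in the summary tables. Note that the code takes the intercepts from lm.out.2 and the slopes from lm.out.3, so the same sources are used here:

```latex
\begin{aligned}
\text{Intercept (Flattop)} &= -1.90616 \\
\text{Intercept (Pyramid)} &= -1.90616 + (-0.76259) = -2.66875 \\
\text{Slope (Flattop)} &= \hat\beta_1 = 0.64078 \\
\text{Slope (Pyramid)} &= \hat\beta_1 + \hat\beta_3 = 0.64078 + (-0.01618) = 0.62460
\end{aligned}
```

For comparison, the intercepts implied by the interaction model lm.out.3 alone would be $-2.00130$ for Flattop and $-2.00130 + (-0.53930) = -2.54060$ for Pyramid.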

The separate fitted line for each group:

> plot(Diameter,Bushels,type="n") ## Don’t plot the points
> points(Diameter[1:11],Bushels[1:11],pch=1,col=1)
> points(Diameter[12:20],Bushels[12:20],pch=9,col=4)
> legend(8,10,c("Pyramid","Flattop"),col=c(1,4),pch=c(1,9))
> abline(intercept.Flattop,slope.Flattop,col=4)
> abline(intercept.Pyramid,slope.Pyramid,col=1)

[Figure: Bushels against Diameter, with separate plotting symbols and fitted lines for the Pyramid and Flattop groups.]

You can’t see much difference, but the fitted line for Flattop has a slightly steeper slope.

• When interpreting models with dummy variables, it’s important to keep in mind the coding scheme that was used (which category is associated with 1 and which with 0, for example).

• Numerous covariates can be used in conjunction with dummy variables, though we only showed 1 covariate in this example.

• A model with no interaction terms is said to be additive.

• If an interaction term between two variables is included in the model, then the ‘main effects’ for those variables should also be included. So, if xz is in the model, x and z should be included as well.

• If there is significant interaction, we do not consider the tests for ‘main effects’. If there is NOT significant interaction, we can consider the tests for ‘main effects’. We look at the interaction significance first, and then proceed from there.

Categorical Variables with Many Categories

Consider the categorical variable of Rank for a professor, which has the categories of Assistant, Associate, and Full.

We can represent this categorical variable with 2 dummy variables:

Category   D1  D2
Assistant   1   0
Associate   0   1
Full        0   0

Suppose we wish to predict Salary based on Rank and the quantitative variable Grants, which tells how much grant money a professor has brought in.
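The two-dummy coding in the table can be built the same way as the pruning dummy earlier in these notes. The sketch below uses a short, made-up Rank vector (hypothetical values, for illustration only) with Full as the baseline category coded 0 on both dummies.

```r
# Hypothetical ranks for five professors (illustration only).
Rank <- c("Assistant", "Full", "Associate", "Full", "Assistant")

# D1 flags Assistant, D2 flags Associate; Full is the baseline (0, 0).
D1 <- as.numeric(Rank == "Assistant")
D2 <- as.numeric(Rank == "Associate")

data.frame(Rank, D1, D2)
```

A variable with k categories needs only k - 1 dummies, because the remaining category is identified by all of the dummies being 0.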