Incorporating Categorical Predictors in Linear Regression with Dummy Variables, Study notes of Statistics

These notes explain how to integrate categorical predictors into linear regression models by constructing dummy variables. They cover the interpretation of the coefficients in a simple model with one covariate and one dummy variable, and give an example using apple tree size and yield data with a categorical pruning-method variable. They also discuss the significance of the dummy variable as a regressor and the creation of plots that distinguish the pruning methods.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-1wo
22s:152 Applied Linear Regression
Chapter 7:
Dummy Variable Regression
————————————————————
So far, we’ve only considered quantitative variables in our models.
We can integrate categorical predictors by constructing artificial variables (known as dummy variables or indicator variables).
We’ll illustrate here with a binary predictor (e.g. Male/Female).
  • Pick one category as the default (say Female).
  • Define $z_i = 0$ if observation $i$ is a Female, otherwise $z_i = 1$ (denotes Male).
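The 0/1 coding described above can be sketched in R as follows. This is a minimal illustration: the vector `sex` and its values are hypothetical and do not come from the notes.

```r
# Hypothetical data: 'sex' is an illustrative factor, not part of the notes.
sex <- factor(c("Female", "Male", "Male", "Female", "Male"))

# Indicator coded 0 for the default category (Female) and 1 otherwise.
z <- as.integer(sex != "Female")
print(z)

# model.matrix() builds the same indicator column automatically; the first
# factor level (Female, alphabetically) is treated as the default category.
X <- model.matrix(~ sex)
print(X)
```

The column that `model.matrix()` names `sexMale` should match the hand-built `z`, which is why a factor can usually be passed to `lm()` directly.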
A simple model with 1 covariate and 1 dummy variable:
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \epsilon_i$
where $Y_i$ and $x_i$ are continuous variables, and $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
  • Note:
    $E(Y \mid x) = \beta_0 + \beta_1 x$ for females
    $E(Y \mid x) = (\beta_0 + \beta_2) + \beta_1 x$ for males
  • Information from both the males and the females is used to estimate $\beta_1$ and $\sigma^2$.
  • $\beta_0$ is interpreted as the female Y-intercept.
  • $\beta_2$ is interpreted as the difference in the expected value of Y between the two groups for units with the same x.
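To see why $\beta_2$ is the vertical gap between the two lines, subtract the two conditional means at a common value of x. Writing the group explicitly through z (z = 1 for males, z = 0 for females):

```latex
\begin{aligned}
E(Y \mid x, z = 1) - E(Y \mid x, z = 0)
  &= \bigl[(\beta_0 + \beta_2) + \beta_1 x\bigr] - \bigl[\beta_0 + \beta_1 x\bigr] \\
  &= \beta_2 .
\end{aligned}
```

The difference does not depend on x, so the two fitted lines in this model are parallel.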
Example: Apple yield and tree size.
A botany student wants to model the relationship between apple tree size (diameter) and yield (bushels). She also has information on the pruning method used on the trees (Pyramid or Flattop).
The dependent variable Bushels is quantitative, as is the independent variable Diameter, but Pruning is a categorical or qualitative variable.
Consider the model we previously described:
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \epsilon_i$
where $z_i = 0$ if observation $i$ was given Flattop pruning, otherwise $z_i = 1$ (denotes Pyramid pruning).
The data:
> botany.data=read.csv("botany.csv")
> attach(botany.data)
> head(botany.data)
Diameter Bushels Pruning
1 20 10.56 Pyramid
2 14 6.14 Pyramid
3 16 6.30 Pyramid
4 13 6.38 Pyramid
5 18 8.65 Pyramid
6 17 7.02 Pyramid
> unique(Pruning)
[1] Pyramid Flattop
Levels: Flattop Pyramid
> is.factor(Pruning)
[1] TRUE
> is.numeric(Pruning)
[1] FALSE
> plot(Diameter,Bushels,pch=16)

[Figure: scatterplot of Bushels versus Diameter]

There does appear to be a linear relationship between tree diameter and yield in bushels.

The simple linear regression:

> lm.out=lm(Bushels ~ Diameter)
> summary(lm.out)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.18886    0.75898  -2.884  0.00988 **
Diameter     0.62361    0.05185  12.028 4.86e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.133 on 18 degrees of freedom
Multiple R-Squared: 0.8894, Adjusted R-squared: 0.
F-statistic: 144.7 on 1 and 18 DF, p-value: 4.86e-

> abline(lm.out)

[Figure: scatterplot of Bushels versus Diameter with the fitted regression line]

Does Pruning Method also make a significant impact on yield?

First, we’ll create the dummy variable:

Allocate space for the new vector:

> pruning.dummy=rep(0,nrow(botany.data))
> pruning.dummy
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If Pruning equals "Pyramid", code it as 1.

> pruning.dummy[Pruning=="Pyramid"]=1
> pruning.dummy
 [1] 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
> data.frame(Pruning,pruning.dummy)
   Pruning pruning.dummy
1  Pyramid             1
2  Pyramid             1
3  Pyramid             1
4  Pyramid             1
5  Pyramid             1
6  Pyramid             1
7  Pyramid             1
8  Pyramid             1
9  Pyramid             1
10 Pyramid             1
11 Pyramid             1
12 Flattop             0
13 Flattop             0
14 Flattop             0
15 Flattop             0
16 Flattop             0
17 Flattop             0
18 Flattop             0
19 Flattop             0
20 Flattop             0

Fit a model with both Diameter and Pruning (as a dummy variable).

> lm.out.2=lm(Bushels ~ Diameter + pruning.dummy)
> summary(lm.out.2)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.90616    0.75416  -2.528   0.0217 *
Diameter       0.63352    0.05038  12.574 4.91e-10 ***
pruning.dummy -0.76259    0.49468  -1.542 0.
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.092 on 17 degrees of freedom
Multiple R-Squared: 0.9029, Adjusted R-squared: 0.
F-statistic: 79.06 on 2 and 17 DF, p-value: 2.458e-
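Building the dummy by hand is not the only route: `lm()` can code a factor itself. The sketch below uses simulated data (not the botany data; the variable names and generating values are assumptions made for illustration) to check that the two approaches give the same fit.

```r
set.seed(1)  # simulated data, for illustration only (not the botany data)
n <- 20
diameter <- runif(n, 6, 20)
pruning  <- factor(rep(c("Pyramid", "Flattop"), times = c(11, 9)))
bushels  <- -2 + 0.6 * diameter - 0.8 * (pruning == "Pyramid") + rnorm(n)

# Manual 0/1 dummy with Flattop as the default category
dummy <- as.integer(pruning == "Pyramid")
fit.manual <- lm(bushels ~ diameter + dummy)

# Let lm() code the factor; Flattop is the first level, so it is the default
fit.factor <- lm(bushels ~ diameter + pruning)

# Same estimates, only the coefficient labels differ
print(coef(fit.manual))
print(coef(fit.factor))
```

Because factor levels are ordered alphabetically by default, Flattop becomes the default category here, matching the manual coding used in the notes.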

Inclusion of Interaction

Returning to the earlier model with a male/female binary variable.

A slightly more complicated model:
$Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \epsilon_i$

  • Note:
    $E(Y \mid x) = \beta_0 + \beta_1 x$ for females
    $E(Y \mid x) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x$ for males
  • The xz interaction term allows for a different slope for each group.
  • xz may be called a slope dummy variable.
  • This model allows a separate regression line for each group.
  • What is the benefit of bringing the data for the two groups together?
    - If there is constant variance, we’ll have a better estimate of $\sigma^2$ by pooling the data (we may not pool if we don’t think the groups share a common $\sigma^2$).
    - We can run tests comparing the two groups.
    - More degrees of freedom.
  • To test if the lines are parallel, test if the interaction term is significant: $H_0: \beta_3 = 0$.
  • If there’s no significant interaction, we can test whether there are two separate lines for the two groups, or whether one line is sufficient to describe the data: $H_0: \beta_2 = 0$.
  • If there is significant interaction, the effect of covariate x on the response is different for different values of z.

Let’s fit the interaction model to the tree data.

> lm.out.3=lm(Bushels ~ Diameter + pruning.dummy + Diameter*pruning.dummy)
> summary(lm.out.3)
Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            -2.00130    0.98989  -2.022 0.
Diameter                0.64078    0.06988   9.170 9.05e-08 ***
pruning.dummy          -0.53930    1.52761  -0.353 0.
Diameter:pruning.dummy -0.01618    0.10434  -0.155 0.
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.124 on 16 degrees of freedom
Multiple R-Squared: 0.9031, Adjusted R-squared: 0.
F-statistic: 49.69 on 3 and 16 DF, p-value: 2.487e-
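The test of parallel lines, $H_0: \beta_3 = 0$, can also be run as a comparison of nested models. The following is a rough sketch on simulated data (the variables `x`, `z`, `y` and the generating values are assumptions for illustration, not the tree data).

```r
set.seed(2)  # simulated data, for illustration only
n <- 40
x <- runif(n, 5, 20)
z <- rep(c(0, 1), each = n / 2)
y <- 1 + 0.5 * x + 1.5 * z + rnorm(n)   # generated with no interaction

fit.add <- lm(y ~ x + z)          # additive model: parallel lines
fit.int <- lm(y ~ x + z + x:z)    # interaction model: separate slopes

# Partial F-test of H0: beta3 = 0 (additive model versus interaction model)
ftab <- anova(fit.add, fit.int)
print(ftab)

# With a single added term, this F statistic equals the squared t statistic
# reported for x:z in summary(fit.int)
t.int <- summary(fit.int)$coefficients["x:z", "t value"]
print(c(F = ftab$F[2], t.squared = t.int^2))
```

Since only one term is added, the F-test and the t-test on the interaction row of the summary table give the same p-value, so either can be used.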

> lm.out.3$coefficients
           (Intercept)               Diameter          pruning.dummy Diameter:pruning.dummy
           -2.00129614             0.64077682            -0.53929972            -0.
                    β0                     β1                     β2                     β3

Model allows for different intercepts:

> intercept.Flattop=lm.out.3$coefficients[1]
> intercept.Flattop
(Intercept)
  -2.001296
> intercept.Pyramid=lm.out.3$coefficients[1] + lm.out.3$coefficients[3]
> intercept.Pyramid
(Intercept)
  -2.540596

Model allows for different slopes:

> slope.Flattop=lm.out.3$coefficients[2]
> slope.Flattop
Diameter

> slope.Pyramid=lm.out.3$coefficients[2] + lm.out.3$coefficients[4]
> slope.Pyramid
Diameter

The separate fitted line for each group:

> plot(Diameter,Bushels,type="n") ## Don’t plot the points
> points(Diameter[1:11],Bushels[1:11],pch=1,col=1)
> points(Diameter[12:20],Bushels[12:20],pch=9,col=4)
> legend(8,10,c("Pyramid","Flattop"),col=c(1,4),pch=c(1,9))
> abline(intercept.Flattop,slope.Flattop,col=4)
> abline(intercept.Pyramid,slope.Pyramid,col=1)

[Figure: Bushels versus Diameter, with points and fitted lines for Pyramid and Flattop]

You can’t see much difference, but the fitted line for Flattop has a slightly steeper slope.
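As a worked check, the two lines implied by the interaction model can be written out from the rounded estimates printed by summary(lm.out.3), with x the Diameter and z = 1 for Pyramid:

```latex
\begin{aligned}
\text{Flattop } (z = 0):\quad \hat{Y} &= -2.00130 + 0.64078\,x \\
\text{Pyramid } (z = 1):\quad \hat{Y} &= (-2.00130 - 0.53930) + (0.64078 - 0.01618)\,x \\
                                      &= -2.54060 + 0.62460\,x
\end{aligned}
```

The slopes differ by only 0.01618 bushels per unit of diameter, which is consistent with the two lines looking nearly parallel.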

  • When interpreting models with dummy variables, it’s important to keep in mind the coding scheme that was used (which category is associated with 1 and which with 0, for example).
  • Numerous covariates can be used in conjunction with dummy variables, though we only showed 1 covariate in this example.
  • A model with no interaction terms is said to be additive.
  • If an interaction term between two variables is included in the model, then the ‘main effects’ for those variables should also be included. So, if xz is in the model, x and z should be included as well.
  • If there is significant interaction, we do not consider the tests for ‘main effects’. If there is NOT significant interaction, we can consider the tests for ‘main effects’. We look at the interaction significance first, and then proceed from there.

Categorical Variables with Many Categories

Consider the categorical variable of Rank for a professor, which has the categories of Assistant, Associate, and Full.

We can represent this categorical variable with 2 dummy variables:

Category   D1  D2
Assistant   1   0
Associate   0   1
Full        0   0

Suppose we wish to predict Salary based on Rank and the quantitative variable Grants, which tells how much grant money a professor has brought in.
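The two-dummy coding for Rank can be reproduced in R along these lines. This is only a sketch: the six ranks below are made-up values, and listing Full first in `levels` is one assumed way of making it the default category.

```r
# Hypothetical ranks for six professors (illustrative values only)
rank <- factor(c("Assistant", "Full", "Associate", "Full", "Assistant", "Associate"),
               levels = c("Full", "Assistant", "Associate"))  # Full first = default

# Dropping the intercept column leaves the two indicator columns:
# D1 flags Assistant and D2 flags Associate; Full is 0 on both.
D <- model.matrix(~ rank)[, -1]
colnames(D) <- c("D1", "D2")
print(data.frame(rank, D))
```

A variable with three categories needs only two dummies, since the default category is identified by both being 0.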