Regression Analysis with Dummy Variables: Comparing Means and Interactions, Study notes of Economic Analysis

These notes show how to use dummy variables in regression analysis to compare means between two or more groups. They cover the Bernoulli distribution of a dummy variable, its use as a regressor, and the interpretation of the coefficients when comparing means. They also discuss the limitation of assuming no interaction between variables and how to relax this restriction by using interaction terms.

Typology: Study notes

2021/2022

Uploaded on 09/12/2022

pratic
Chapter 7, Dummy Variable

1. A dummy variable takes only the values 1 and 0. The numbers 1 and 0 have no numerical (quantitative) meaning; they are used only to represent groups. In short, a dummy variable is categorical (qualitative).

(a) For instance, we may have a sample (or population) that includes both females and males. Then a dummy variable can be defined as $D = 1$ for female and $D = 0$ for male. Such a dummy variable divides the sample into two subsamples (or two sub-populations): one for females and one for males.

(b) A dummy variable follows a Bernoulli distribution, which is characterized by the parameter $p$:

$$D = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p \end{cases} \tag{1}$$

2. Consider using a dummy variable as a regressor:

$$Y = \beta_0 + \beta_1 D + u \tag{2}$$

Regression (2) can be broken into two separate regressions:

$$Y = \begin{cases} \beta_0 + u, & \text{when } D = 0 \\ (\beta_0 + \beta_1) + u, & \text{when } D = 1 \end{cases} \tag{3}$$

Taking expectations of (3) leads to

$$E(Y \mid D = 0) = \beta_0 \tag{4}$$
$$E(Y \mid D = 1) = \beta_0 + \beta_1 \tag{5}$$

and

$$\beta_0 = E(Y \mid D = 0) \tag{6}$$
$$\beta_1 = E(Y \mid D = 1) - E(Y \mid D = 0) \tag{7}$$

Therefore $\beta_0$ is the mean of $Y$ conditional on $D = 0$ (the mean of $Y$ in the sub-population with $D = 0$), and $\beta_1$ is the difference in mean $Y$ between the two sub-populations.
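The step from the two-regime form of the regression to the conditional means relies on the error having zero mean within each group, $E(u \mid D) = 0$, which is the usual regression assumption. Under that assumption, a short sketch of the derivation, together with the overall mean implied by $P(D = 1) = p$, is:

```latex
\begin{align*}
E(Y \mid D = 0) &= \beta_0 + E(u \mid D = 0) = \beta_0, \\
E(Y \mid D = 1) &= \beta_0 + \beta_1 + E(u \mid D = 1) = \beta_0 + \beta_1, \\
E(Y) &= (1 - p)\,E(Y \mid D = 0) + p\,E(Y \mid D = 1) = \beta_0 + \beta_1 p .
\end{align*}
```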

3. The sample mean is the estimate of the population mean, so the estimated coefficients in (2) have the following interpretation:

$$\hat\beta_0 = \bar y_{D=0} \tag{8}$$
$$\hat\beta_1 = \bar y_{D=1} - \bar y_{D=0} \tag{9}$$

where $\bar y_{D=0}$ denotes the average $Y$ in the sub-sample for which $D = 0$, and $\bar y_{D=1}$ denotes the average $Y$ in the sub-sample for which $D = 1$. Equation (2) provides a simple way to carry out a comparison-of-means test (or two-sample t test) between the two groups. The null hypothesis of the two-sample t test says that there is no difference between the two groups:

$$H_0: \beta_1 = 0$$

This hypothesis is rejected at the 5% significance level when the p-value for $\hat\beta_1$ is less than 0.05.
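The equality between the OLS coefficients and the sub-sample means can be checked numerically. The sketch below uses a small made-up data set (the values of `d` and `y` are hypothetical, not taken from the wage data) and the textbook formulas for a simple regression; `ols_simple` is a helper name introduced only for this illustration.

```python
from statistics import mean

# Hypothetical toy data (not from the notes): d is the dummy, y the outcome.
d = [0, 0, 0, 0, 1, 1, 1]
y = [7.0, 8.0, 6.5, 7.5, 4.0, 5.0, 4.5]

def ols_simple(x, y):
    """Return (intercept, slope) of the OLS fit y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0_hat, b1_hat = ols_simple(d, y)

# Group means computed directly from the two sub-samples.
y_bar_0 = mean([yi for di, yi in zip(d, y) if di == 0])
y_bar_1 = mean([yi for di, yi in zip(d, y) if di == 1])

print(b0_hat, y_bar_0)            # intercept vs. mean of the D = 0 group
print(b1_hat, y_bar_1 - y_bar_0)  # slope vs. difference in group means
```

Each printed pair should agree up to floating-point rounding, which is the content of the two equations for the estimated coefficients.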

4. For example, let $Y$ be wage, with $D = 1$ for female and $D = 0$ for male. Consider the regression $\text{wage} = \beta_0 + \beta_1 D + u$. Then $\hat\beta_0$ is the average wage for males, and $\hat\beta_1$ equals the average female wage minus the average male wage. The two average wages are significantly different if $\hat\beta_1$ is statistically significant.
5. Now consider a regression with an additional regressor $X$:

$$Y = \beta_0 + \beta_1 D + \beta_2 X + u \tag{10}$$

which can be rewritten as

$$Y = \begin{cases} \beta_0 + \beta_2 X + u, & \text{when } D = 0 \\ (\beta_0 + \beta_1) + \beta_2 X + u, & \text{when } D = 1 \end{cases}$$

It follows that

$$E(Y \mid X, D = 0) = \beta_0 + \beta_2 X \tag{12}$$
$$E(Y \mid X, D = 1) = (\beta_0 + \beta_1) + \beta_2 X \tag{13}$$
$$\beta_1 = E(Y \mid X, D = 1) - E(Y \mid X, D = 0) \tag{14}$$

so $\beta_1$ measures the change in mean $Y$ across the two groups, holding $X$ constant (or given

7. Suppose we have two subsamples, one for females and one for males, and we want to estimate the effect of education on wage. We have two options. Option 1 is to run two separate regressions, one for females and one for males. Option 2 is to pool (merge) the two subsamples and run just one regression. Which option is better?

(a) Essentially, this problem is about whether the relationship between education and wage depends on gender.

(b) To answer this question, we pool the two subsamples and run regression (16). The point is that we need to use a dummy variable and an interaction term. The null hypothesis is that gender does not matter, so

$$H_0: \beta_1 = \beta_3 = 0 \tag{18}$$

We can use an F test (called the Chow test in this context) for this hypothesis.

i. If the p-value is less than 0.05, $H_0$ is rejected, so gender matters. We need to keep the dummy and the interaction term in (16). That means running two separate regressions, one for females and one for males, is the better idea.

ii. If the p-value is greater than 0.05, $H_0$ is not rejected, so there is no evidence that gender matters. We can drop the dummy and the interaction term from (16). That means running one regression using both subsamples is the better idea.
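The F statistic behind the Chow test can be sketched from sums of squared residuals (SSR). The example assumes regression (16) is the pooled regression of $Y$ on the dummy, $X$, and their interaction, as written out in the example section; because that model gives each group its own intercept and slope, its SSR equals the sum of the two groupwise SSRs. The data and the helper `ssr_simple` are hypothetical, and the p-value is omitted because it requires the F distribution, which is not in the Python standard library.

```python
from statistics import mean

def ssr_simple(x, y):
    """Sum of squared residuals from the OLS fit y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data (not the wage data used in the notes).
educ_m = [8, 10, 12, 14, 16]
wage_m = [4.1, 5.3, 6.2, 7.6, 8.4]
educ_f = [8, 10, 12, 14, 16]
wage_f = [3.0, 3.9, 4.4, 5.6, 6.0]

# Restricted model (beta1 = beta3 = 0): one line fitted to the pooled sample.
ssr_r = ssr_simple(educ_m + educ_f, wage_m + wage_f)
# Unrestricted model: dummy plus interaction, i.e. one line per group.
ssr_ur = ssr_simple(educ_m, wage_m) + ssr_simple(educ_f, wage_f)

n = len(wage_m) + len(wage_f)
q = 2  # number of restrictions under H0
k = 4  # parameters in the unrestricted model (beta0, ..., beta3)
f_stat = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k))
print(f_stat)
```

A large value of `f_stat` relative to the $F(q, n - k)$ distribution corresponds to a small p-value, that is, to rejecting the hypothesis that gender does not matter.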

8. What if we have information about both gender and marital status? Option 1 is to define two dummy variables as

$$D_1 = \begin{cases} 1, & \text{female} \\ 0, & \text{male} \end{cases}$$

$$D_2 = \begin{cases} 1, & \text{married} \\ 0, & \text{unmarried} \end{cases}$$

and use them to run the regression

$$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + u \tag{21}$$

For this regression we can show

$$E(Y \mid D_1, D_2) = \begin{cases} \beta_0, & \text{if } D_1 = 0, D_2 = 0 \\ \beta_0 + \beta_1, & \text{if } D_1 = 1, D_2 = 0 \\ \beta_0 + \beta_2, & \text{if } D_1 = 0, D_2 = 1 \\ \beta_0 + \beta_1 + \beta_2, & \text{if } D_1 = 1, D_2 = 1 \end{cases}$$

Now we can see that regression (21) is restrictive because it assumes

$$E(Y \mid D_1 = 1, D_2 = 1) - E(Y \mid D_1 = 1, D_2 = 0) = E(Y \mid D_1 = 0, D_2 = 1) - E(Y \mid D_1 = 0, D_2 = 0) \tag{22}$$

In words, when $D_2$ changes from 0 to 1, the change in mean $Y$ does not depend on $D_1$. This is a kind of no-interaction restriction. Let $Y$ be wage. Then the no-interaction restriction says that when a person's marital status changes, the change in mean wage does not depend on the person's gender.
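To see why the restriction holds, substitute the four conditional means implied by the regression on the two dummies into each side. Both sides reduce to the same coefficient:

```latex
\begin{align*}
E(Y \mid D_1 = 1, D_2 = 1) - E(Y \mid D_1 = 1, D_2 = 0)
  &= (\beta_0 + \beta_1 + \beta_2) - (\beta_0 + \beta_1) = \beta_2, \\
E(Y \mid D_1 = 0, D_2 = 1) - E(Y \mid D_1 = 0, D_2 = 0)
  &= (\beta_0 + \beta_2) - \beta_0 = \beta_2 .
\end{align*}
```

The marriage effect is $\beta_2$ for both genders, so the model cannot represent a marriage effect that differs between females and males.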

9. In order to relax the no-interaction restriction, we can define four dummy variables (because we have four groups of people) as

$$E_1 = \begin{cases} 1, & \text{female and married} \\ 0, & \text{otherwise} \end{cases}$$

$$E_2 = \begin{cases} 1, & \text{female and unmarried} \\ 0, & \text{otherwise} \end{cases}$$

$$E_3 = \begin{cases} 1, & \text{male and married} \\ 0, & \text{otherwise} \end{cases}$$

$$E_4 = \begin{cases} 1, & \text{male and unmarried} \\ 0, & \text{otherwise} \end{cases}$$

and run a regression using only three of them:

$$Y = \beta_0 + \beta_1 E_1 + \beta_2 E_2 + \beta_3 E_3 + u \tag{23}$$

If we use all four dummies, then $E_1 + E_2 + E_3 + E_4 = 1$ for every observation, so the dummies are perfectly collinear with the intercept term. This situation is called the dummy variable trap. In order to avoid the dummy variable trap, we leave out one dummy.
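The collinearity behind the dummy variable trap can be illustrated directly. The sketch below builds the four group dummies for a handful of made-up observations (the list `people` and the helper `group_dummies` are hypothetical) and checks that they always sum to one, which is exactly the intercept column.

```python
# Hypothetical observations (female, married); not taken from the wage data.
people = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 0), (0, 1)]

def group_dummies(female, married):
    """Return (E1, E2, E3, E4) for one observation, as defined in the text."""
    e1 = int(female == 1 and married == 1)  # female and married
    e2 = int(female == 1 and married == 0)  # female and unmarried
    e3 = int(female == 0 and married == 1)  # male and married
    e4 = int(female == 0 and married == 0)  # male and unmarried
    return e1, e2, e3, e4

rows = [group_dummies(f, m) for f, m in people]

# Every observation belongs to exactly one group, so the four dummies add up
# to the intercept column of ones: this is the dummy variable trap.
sums = [sum(r) for r in rows]
print(sums)

# Dropping E4 makes "male and unmarried" the base group absorbed by beta0.
design = [(1, e1, e2, e3) for e1, e2, e3, _ in rows]
print(design)
```

With $E_4$ left out, $\beta_0$ is the mean of the base group (male and unmarried), and each remaining coefficient is the difference between that group's mean and the base group's mean.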

Example: Chapter 7

  1. We use the data file 311_wage1.dta, downloadable at my webpage. See Example 7.1 in the textbook for details.
  2. For the first observation, wage = 3.1, educ = 11, female = 1 (so the person is female), and married = 0 (so the person is unmarried). Both female and married are dummy variables, for which the values 1 and 0 have no quantitative meaning.
  3. The command tab is used to tabulate the proportions (probabilities) for a dummy variable. In this case 52.09 percent of the observations are male (female = 0), and 47.91 percent are female.
  4. Next we run regression (2), i.e., we regress wage on the dummy variable female. The estimated intercept $\hat\beta_0 = \bar y_{D=0} = 7.099489$ is the average wage for males. The estimated slope $\hat\beta_1 = \bar y_{D=1} - \bar y_{D=0} = -2.51183$ is the average female wage minus the average male wage. In this example females earn less than males on average, since $\hat\beta_1$ is negative. The p-value for $\hat\beta_1$ is less than 0.05, so we reject the null hypothesis that the average female wage equals the average male wage. In other words, the two average wages differ significantly.
  5. Alternatively, we can summarize wage separately for females and males. The commands are

sort female
by female: sum wage

On average a male earns 7.099489 and a female earns 4.587659. The difference is 4.587659 − 7.099489 = −2.51183, which is the same as $\hat\beta_1$ reported by regression (2). This finding confirms that

Regressing Y on a dummy variable carries out the two-sample t test.
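The arithmetic linking the two sub-sample averages to the regression output can be reproduced in a few lines. The two averages are the values reported in the notes; the variable names are chosen only for this illustration.

```python
# Sub-sample averages reported in the notes.
mean_wage_male = 7.099489
mean_wage_female = 4.587659

beta0_hat = mean_wage_male                     # intercept = male average
beta1_hat = mean_wage_female - mean_wage_male  # slope = female minus male
print(round(beta1_hat, 5))  # prints -2.51183
```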

  6. Next we run regression (16) using X = educ:

$$\text{wage} = \beta_0 + \beta_1\,\text{female} + \beta_2\,\text{educ} + \beta_3\,(\text{educ} \times \text{female}) + u$$

(a) The estimated intercept is $\hat\beta_0 = 0.2004963$. It measures the average male wage when educ = 0.

(b) $\hat\beta_1 = -1.198523$. It measures the average female wage when educ = 0 minus the average male wage when educ = 0. In other words, when educ = 0, the predicted female wage is 0.2004963 + (−1.198523) = −0.9980267. This number is not very meaningful, since no female in this sample has zero education (two males have zero educ, and you can see them using the command list if educ==0).

(c) $\hat\beta_2 = 0.539476$. So the male wage rises by 0.539476 when educ rises by 1 unit.

(d) $\hat\beta_3 = -0.085999$. So the female wage rises by 0.539476 + (−0.085999) = 0.453477 when educ rises by 1 unit.

(e) The null hypothesis that the relationship between wage and educ does not depend on gender (i.e., there is NO difference in the regression functions across females and males) can be formulated as $H_0: \beta_1 = \beta_3 = 0$. The F test for a difference in regression functions across groups is called the Chow test. The Stata command to conduct the Chow test is test female fe, where fe is the interaction term female*educ. The output shows F = 33.51 with p-value < 0.05, so we reject the null hypothesis. That means there IS a difference in the regression functions across females and males. In other words, the relationship between wage and educ depends on gender.

(f) Note that $\hat\beta_1$ and $\hat\beta_3$ are individually insignificant (the p-values are 0.366 and 0.407, respectively), whereas the Chow test indicates that they are jointly significant. The lesson is that focusing only on individual coefficients can be misleading.
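The group-specific intercepts and slopes follow mechanically from the four pooled estimates. The sketch below plugs in the coefficients reported in the notes; the function `predicted_wage` and the choice of 12 years of education for the illustration are assumptions made here, not part of the original example.

```python
# Pooled estimates reported in the notes for regression (16).
b0, b1, b2, b3 = 0.2004963, -1.198523, 0.539476, -0.085999

def predicted_wage(female, educ):
    """Fitted wage from the pooled regression with dummy and interaction."""
    return b0 + b1 * female + b2 * educ + b3 * female * educ

male_intercept, male_slope = b0, b2
female_intercept, female_slope = b0 + b1, b2 + b3
print(round(female_intercept, 7), round(female_slope, 6))

# Fitted gender gap (female minus male) at a given level of education;
# 12 years is only an illustrative choice.
gap_at_12 = predicted_wage(1, 12) - predicted_wage(0, 12)
print(gap_at_12)
```

Because the slopes differ, the fitted gap equals $\hat\beta_1 + \hat\beta_3 \cdot \text{educ}$ and changes with the level of education, which is what the interaction term allows.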

  7. Because the relationship between wage and educ depends on gender, we can run two separate (groupwise) regressions, one using females and one using males. The Stata command is by female: reg wage educ. The coefficients in the male regression are the same as $\hat\beta_0$ and $\hat\beta_2$ reported by the pooled regression (16). The female results can also be derived from the pooled regression (16). In other words,

Regressing on a dummy and interaction terms is as informative as groupwise regressions.

The pooled regression (16) has one big advantage over groupwise regressions: we can run the Chow test based on (16).
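The equivalence between groupwise regressions and the pooled regression with a dummy and an interaction term can be verified without the wage data. The sketch below fits one line per group on hypothetical data, builds the implied pooled coefficients, and checks that the pooled residuals are orthogonal to every regressor, which is the first-order condition that defines the OLS solution. The helper `ols_simple` and all data values are assumptions for illustration.

```python
from statistics import mean

def ols_simple(x, y):
    """Return (intercept, slope) of the OLS fit y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

# Hypothetical toy data (not the wage data used in the notes).
educ_m, wage_m = [8, 10, 12, 14, 16], [4.1, 5.3, 6.2, 7.6, 8.4]
educ_f, wage_f = [9, 11, 12, 15, 16], [3.0, 3.9, 4.4, 5.6, 6.0]

a_m, s_m = ols_simple(educ_m, wage_m)  # male intercept and slope
a_f, s_f = ols_simple(educ_f, wage_f)  # female intercept and slope

# Pooled coefficients implied by the two groupwise fits.
b0, b2 = a_m, s_m
b1, b3 = a_f - a_m, s_f - s_m

# Check the OLS first-order conditions of the pooled regression: residuals
# must be orthogonal to each regressor (1, female, educ, female*educ).
data = ([(0, e, w) for e, w in zip(educ_m, wage_m)]
        + [(1, e, w) for e, w in zip(educ_f, wage_f)])
moments = [0.0, 0.0, 0.0, 0.0]
for female, educ, wage in data:
    resid = wage - (b0 + b1 * female + b2 * educ + b3 * female * educ)
    for j, reg in enumerate((1, female, educ, female * educ)):
        moments[j] += resid * reg
print(moments)
```

All four entries of `moments` should be zero up to floating-point rounding, so the coefficients assembled from the groupwise fits are the pooled OLS estimates.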

  8. Finally, you are shown how to define a set of dummy variables to represent the multiple categories of gender and marital status. In theory we should define four dummies since

Do File

* Do file for dummy variable (chapter 7)
set more off
clear
capture log close
cd "I:\311"
log using 311log.txt, text replace
use 311_wage1.dta, clear

* show first 5 observations
list wage educ female married in 1/

* tabulate female
tab female

* run regression using dummy
reg wage female

* compare the means for male and female
sort female
by female: sum wage

* run regression using dummy and interaction term
gen fe = female*educ
reg wage female educ fe

* chow test
test female fe

* run separate regressions for male and female
by female: reg wage educ

* multiple category
gen d1 = 0
replace d1 = 1 if female == 0 & married ==
gen d2 = 0
replace d2 = 1 if female == 1 & married ==
gen d3 = 0
replace d3 = 1 if female == 1 & married ==
reg wage d1 d2 d
log close