







These notes explain how to use dummy variables in regression analysis to compare means between two or more groups. They cover the concept of dummy variables following a Bernoulli distribution, their use as regressors, and the interpretation of coefficients when comparing means. They also discuss the limitation of assuming no interaction between variables and how to relax this restriction with interaction terms.
Typology: Study notes








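A dummy variable D follows a Bernoulli distribution:
\[ D = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p \end{cases} \tag{1} \]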
\[ Y = \beta_0 + \beta_1 D + u \tag{2} \]
Regression (2) can be broken into two separate regressions as
\[ Y = \begin{cases} \beta_0 + u, & \text{when } D = 0 \\ (\beta_0 + \beta_1) + u, & \text{when } D = 1 \end{cases} \tag{3} \]
Taking expectations of (3) conditional on D, and assuming E(u | D) = 0, leads to
\[ E(Y \mid D = 0) = \beta_0 \tag{4} \]
\[ E(Y \mid D = 1) = \beta_0 + \beta_1 \tag{5} \]
and
\[ \beta_0 = E(Y \mid D = 0) \tag{6} \]
\[ \beta_1 = E(Y \mid D = 1) - E(Y \mid D = 0) \tag{7} \]
Therefore $\beta_0$ is the mean of Y conditional on D = 0 (the mean of Y in the sub-population with D = 0), and $\beta_1$ is the difference in mean Y between the two sub-populations.
\[ \hat\beta_0 = \bar{y}_{D=0} \tag{8} \]
\[ \hat\beta_1 = \bar{y}_{D=1} - \bar{y}_{D=0} \tag{9} \]
where $\bar{y}_{D=0}$ denotes the average of Y in the sub-sample for which D = 0, and $\bar{y}_{D=1}$ denotes the average of Y in the sub-sample for which D = 1. Equation (2) provides a simple way to carry out a comparison-of-means test (a two-sample t test) between the two groups. The null hypothesis of the two-sample t test is that there is no difference between the two groups, $H_0: \beta_1 = 0$. This hypothesis is rejected at the 5% significance level when the p-value for $\hat\beta_1$ is less than 0.05.
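As an illustration of (8) and (9), the following minimal Python sketch checks that the OLS intercept and slope from regressing Y on a dummy coincide with the group averages. The data are a hypothetical toy sample, not taken from these notes, and the variable names are chosen only for this example.

```python
from statistics import mean

# Hypothetical toy sample: d is the 0/1 dummy, y is the outcome.
d = [0, 0, 0, 1, 1, 1, 1]
y = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# Simple-regression OLS formulas:
# slope = S_dy / S_dd, intercept = y_bar - slope * d_bar.
d_bar, y_bar = mean(d), mean(y)
s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
s_dd = sum((di - d_bar) ** 2 for di in d)
b1 = s_dy / s_dd
b0 = y_bar - b1 * d_bar

# Group averages computed directly from the two sub-samples.
y_bar_0 = mean(yi for di, yi in zip(d, y) if di == 0)
y_bar_1 = mean(yi for di, yi in zip(d, y) if di == 1)

print(b0, y_bar_0)            # intercept vs. average of the D = 0 group
print(b1, y_bar_1 - y_bar_0)  # slope vs. difference in group averages
```

Up to floating-point rounding, each printed pair should agree, which is what equations (8) and (9) state.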
\[ Y = \beta_0 + \beta_1 D + \beta_2 X + u \tag{10} \]
which can be rewritten as
\[ Y = \begin{cases} \beta_0 + \beta_2 X + u, & \text{when } D = 0 \\ (\beta_0 + \beta_1) + \beta_2 X + u, & \text{when } D = 1 \end{cases} \tag{11} \]
It follows that
\[ E(Y \mid X, D = 0) = \beta_0 + \beta_2 X \tag{12} \]
\[ E(Y \mid X, D = 1) = (\beta_0 + \beta_1) + \beta_2 X \tag{13} \]
\[ \beta_1 = E(Y \mid X, D = 1) - E(Y \mid X, D = 0) \tag{14} \]
so $\beta_1$ measures the difference in mean Y between the two groups, holding X constant (or given X).
(a) Essentially, this problem asks whether the relationship between education and wage depends on gender.
(b) To answer this question, we pool the two subsamples and run regression (16). The point is that we need to use a dummy variable and an interaction term. The null hypothesis is that gender does not matter, so
\[ \beta_1 = \beta_3 = 0 \tag{18} \]
We can use an F test (called the Chow test in this context) for this hypothesis.
i. If the p-value is less than 0.05, $H_0$ is rejected, so gender matters. We need to keep the dummy and interaction term in (16). That means running two separate regressions, one for females and one for males, is the better idea.
ii. If the p-value is greater than 0.05, $H_0$ is not rejected, so there is no evidence that gender matters. We can drop the dummy and interaction term from (16). That means running one regression using both subsamples is the better idea.
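The F statistic behind this test can be sketched with the standard formula that compares the sum of squared residuals (SSR) of the restricted and unrestricted regressions. This is a minimal sketch assuming homoskedastic errors. The function name `chow_f` and all numbers are hypothetical. Here q = 2 restrictions are tested, and the unrestricted regression has k = 4 parameters (intercept, dummy, educ, interaction).

```python
def chow_f(ssr_restricted, ssr_unrestricted, q, n, k):
    """F statistic for q linear restrictions.

    ssr_restricted:   SSR of the pooled regression without dummy and interaction
    ssr_unrestricted: SSR of the regression with dummy and interaction
    n: sample size; k: number of parameters in the unrestricted regression
    """
    numerator = (ssr_restricted - ssr_unrestricted) / q
    denominator = ssr_unrestricted / (n - k)
    return numerator / denominator


# Hypothetical SSR values and sample size, for illustration only.
f_stat = chow_f(ssr_restricted=1200.0, ssr_unrestricted=1000.0, q=2, n=104, k=4)
print(f_stat)
```

The p-value would then come from the F distribution with (q, n - k) degrees of freedom, which statistical software reports directly.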
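\[ D_1 = \begin{cases} 1, & \text{female} \\ 0, & \text{male} \end{cases} \tag{19} \]
\[ D_2 = \begin{cases} 1, & \text{married} \\ 0, & \text{unmarried} \end{cases} \tag{20} \]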
and use them to run the regression of
\[ Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + u \tag{21} \]
For this regression we can show
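\[ E(Y \mid D_1, D_2) = \begin{cases} \beta_0, & \text{if } D_1 = 0, D_2 = 0 \\ \beta_0 + \beta_1, & \text{if } D_1 = 1, D_2 = 0 \\ \beta_0 + \beta_2, & \text{if } D_1 = 0, D_2 = 1 \\ \beta_0 + \beta_1 + \beta_2, & \text{if } D_1 = 1, D_2 = 1 \end{cases} \]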
Now we can see that regression (21) is restrictive because it assumes
\[ E(Y \mid D_1 = 1, D_2 = 1) - E(Y \mid D_1 = 1, D_2 = 0) = E(Y \mid D_1 = 0, D_2 = 1) - E(Y \mid D_1 = 0, D_2 = 0). \tag{22} \]
In words, when $D_2$ changes from 0 to 1, the change in mean Y does not depend on $D_1$. This is a no-interaction restriction. Let Y be wage. Then the no-interaction restriction says that when a person's marital status changes, the change in mean wage does not depend on the person's gender.
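The restriction can be checked numerically. The sketch below is an illustration under stated assumptions: it adds a product term $D_1 D_2$ with coefficient `b3`, which is not part of regression (21), and all coefficient values are hypothetical. With `b3 = 0` the model reduces to (21) and the marriage gap is the same for both genders. With `b3` nonzero the two gaps differ by exactly `b3`.

```python
def cell_mean(b0, b1, b2, b3, d1, d2):
    """E(Y | D1 = d1, D2 = d2) for Y = b0 + b1*D1 + b2*D2 + b3*D1*D2 + u."""
    return b0 + b1 * d1 + b2 * d2 + b3 * d1 * d2


def marriage_gap(b0, b1, b2, b3, d1):
    """Change in mean Y when D2 goes from 0 to 1, for a given value of D1."""
    return cell_mean(b0, b1, b2, b3, d1, 1) - cell_mean(b0, b1, b2, b3, d1, 0)


# Hypothetical coefficients; b3 = 0 reproduces the additive regression (21).
additive = dict(b0=5.0, b1=-1.0, b2=2.0, b3=0.0)
interacted = dict(b0=5.0, b1=-1.0, b2=2.0, b3=-1.5)

for name, coefs in [("additive", additive), ("interacted", interacted)]:
    gap_female = marriage_gap(d1=1, **coefs)
    gap_male = marriage_gap(d1=0, **coefs)
    # The last number is the difference between the two gaps.
    print(name, gap_female, gap_male, gap_female - gap_male)
```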
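\[ E_1 = \begin{cases} 1, & \text{female and married} \\ 0, & \text{otherwise} \end{cases} \]
\[ E_2 = \begin{cases} 1, & \text{female and unmarried} \\ 0, & \text{otherwise} \end{cases} \]
\[ E_3 = \begin{cases} 1, & \text{male and married} \\ 0, & \text{otherwise} \end{cases} \]
\[ E_4 = \begin{cases} 1, & \text{male and unmarried} \\ 0, & \text{otherwise} \end{cases} \]
and run a regression using only three of them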
\[ Y = \beta_0 + \beta_1 E_1 + \beta_2 E_2 + \beta_3 E_3 + u \tag{23} \]
If we use all four dummies, then $E_1 + E_2 + E_3 + E_4 = 1$ for every observation, so the dummies are perfectly collinear with the intercept term. This situation is called the dummy variable trap. To avoid the dummy variable trap, we leave out one dummy.
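The collinearity behind the dummy variable trap can be seen directly by constructing the four dummies. The sketch below uses a hypothetical sample in which each (female, married) combination appears once, and the helper name `dummies` is chosen only for this example.

```python
from itertools import product

# Hypothetical sample: each (female, married) combination appears once.
rows = list(product([0, 1], repeat=2))


def dummies(female, married):
    """Return (E1, E2, E3, E4) as defined in the text."""
    e1 = int(female == 1 and married == 1)
    e2 = int(female == 1 and married == 0)
    e3 = int(female == 0 and married == 1)
    e4 = int(female == 0 and married == 0)
    return e1, e2, e3, e4


intercept = [1 for _ in rows]

# With all four dummies, their sum reproduces the intercept column exactly.
sum_of_four = [sum(dummies(f, m)) for f, m in rows]
print(sum_of_four == intercept)

# After dropping E4, the sum is no longer constant across observations.
sum_of_three = [sum(dummies(f, m)[:3]) for f, m in rows]
print(sum_of_three == intercept)
```

The first comparison holds and the second does not, which is why leaving out one dummy removes the exact linear dependence with the intercept.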
sort female
by female: sum wage
On average a male earns 7.099489, and a female earns 4.587659. The difference (female minus male) is 4.587659 − 7.099489 = −2.511830.
Regressing Y on a dummy variable carries out the two-sample t test.
\[ \text{wage} = \beta_0 + \beta_1\,\text{female} + \beta_2\,\text{educ} + \beta_3\,(\text{educ} \times \text{female}) + u \]
(a) The estimated intercept is $\hat\beta_0 = 0.2004963$. It measures the average male wage when educ = 0.
(b) $\hat\beta_1 = -1.198523$. It measures the average female wage when educ = 0 minus the average male wage when educ = 0. In other words, when educ = 0, a female earns 0.2004963 + (−1.198523) = −0.9980267. This number is not very meaningful since in this sample no female has zero education (two males have zero educ, and you can see them using the command list if educ==0).
(c) $\hat\beta_2 = 0.539476$. So the male wage rises by 0.539476 when educ rises by 1 unit.
(d) $\hat\beta_3 = -0.085999$. So the female wage rises by 0.539476 + (−0.085999) = 0.453477 when educ rises by 1 unit.
(e) The null hypothesis that the relationship between wage and educ does not depend on gender (that there is NO difference in regression functions between females and males) can be formulated as $H_0: \beta_1 = \beta_3 = 0$. The F test for a difference in regression functions across groups is called the Chow test. The Stata command to conduct the Chow test is test female fe. The output shows F = 33.51 with p-value < 0.05, so we reject the null hypothesis. That means there IS a difference in regression functions between females and males. In other words, the relationship between wage and educ depends on gender.
(f) Note that $\hat\beta_1$ and $\hat\beta_3$ are individually insignificant (the p-values are 0.366 and 0.407, respectively), whereas the Chow test indicates that they are jointly significant. The lesson is that focusing only on individual coefficients can be misleading.
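The arithmetic in (b) and (d) can be reproduced from the reported estimates. The sketch below takes the four coefficient values from the text. The function `predicted_wage` and the choice of educ = 12 are hypothetical additions for illustration.

```python
# Coefficient estimates as reported in the text.
b0 = 0.2004963   # intercept
b1 = -1.198523   # coefficient on female
b2 = 0.539476    # coefficient on educ
b3 = -0.085999   # coefficient on educ * female

# Implied group-specific regression lines.
male_intercept, male_slope = b0, b2
female_intercept, female_slope = b0 + b1, b2 + b3

print(round(female_intercept, 7))  # -0.9980267
print(round(female_slope, 6))      # 0.453477


def predicted_wage(educ, female):
    """Fitted mean wage from the pooled regression with interaction."""
    return b0 + b1 * female + b2 * educ + b3 * educ * female


# Fitted female-minus-male wage gap at a hypothetical educ = 12.
print(predicted_wage(12, 1) - predicted_wage(12, 0))
```

Because the interaction coefficient is nonzero, the fitted gap between the two groups changes with educ rather than staying constant.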
Regressing on a dummy and an interaction term is as informative as groupwise regressions.
The pooled regression (16) has one big advantage over groupwise regressions: we can run the Chow test based on (16).