






An introduction to multiple regression analysis with dummy variables, also known as qualitative or categorical variables. It shows how to generate dummy variables and use them to bring qualitative information into standard regression models with continuous dependent variables. It also covers how to combine quantitative and qualitative information, test for statistical discrimination, and consider interactions involving dummy variables.







Wooldridge, Introductory Econometrics, 2d ed. Chapter 7: Multiple regression analysis with binary (dummy) variables
We often consider relationships between observed outcomes and qualitative factors: models in which a continuous dependent variable is related to a number of explanatory factors, some of which are quantitative, and some of which are qualitative. In econometrics, we also consider models of qualitative dependent variables, but we will not explore those models in this course due to time constraints. However, we can readily evaluate the use of qualitative information in standard regression models with continuous dependent variables.
Qualitative information often arises in terms of some coding, or index, which takes on a number of values: for instance, we may know in which one of the six New England states each of the individuals in our sample resides. The data themselves may be coded with two-letter codes such as "MA", "RI", "ME", etc. How can we use this factor in a regression equation? In the data, state takes on six distinct values. We must create six binary variables, or dummy variables, each of which refers to one state: that variable will be 1 if the individual comes from that state, and 0 otherwise. We can generate this set of 6 variables easily in Stata with the command tab state, gen(st), which will create 6 new variables in our dataset: st1, st2, ... st6. Each of these variables is a dummy; that is, it contains only 0 or 1 values. If we add up these variables, we get exactly a vector of 1's, suggesting that we will never want to use all 6 variables
in a regression (since by knowing the values of any 5...) We may also find the proportions of each state's citizens in our sample very easily: summ st* will give the descriptive statistics of all 6 variables, and the mean of each dummy is the sample proportion living in that state. How can we use these dummy variables? Say that we wanted to know whether incomes differed significantly across the 6-state region. What if we regressed income on any five of these dummies?
$$income = \beta_0 + \beta_1 st_1 + \beta_2 st_2 + \beta_3 st_3 + \beta_4 st_4 + \beta_5 st_5 + u \qquad (1)$$
where I have suppressed the observation subscripts. What are the regression coefficients in this case? $\beta_0$ is the average income in the 6th state, the dummy for which is excluded from the regression. $\beta_1$ is the difference between the income in state 1 and the income in state 6. $\beta_2$ is the difference between the income in state 2 and the income in state 6, and so on. What is the ordinary "ANOVA F" in this context, the test that all the slopes are equal to zero? Precisely the test of the null hypothesis that the mean incomes $\mu_1, \dots, \mu_6$ of the six states are equal (equivalently, that $\beta_1 = \dots = \beta_5 = 0$):
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6 \qquad (2)$$
versus the alternative that not all six of the state means are the same value. It turns out that we can test this same hypothesis by excluding any one of the dummies, and including the remaining five in the regression. The coefficients will differ, but the value of the ANOVA F will be identical for any of these regressions. In fact, this regression is an example of “classical one-way ANOVA”–testing whether a qualitative factor (in this
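The steps described so far can be sketched in Stata as follows. This is only an illustration: the data are simulated (the seed, sample size, and income equation are made up), while the variable names state, income, and st1 ... st6 follow the text.

```stata
* Simulated data for illustration only (all numbers here are assumptions)
clear
set seed 12345
set obs 600
generate byte code = 1 + floor(6*runiform())
generate str2 state = word("CT MA ME NH RI VT", code)
generate income = 40 + 2*code + rnormal(0, 5)

* Create one dummy per state: st1 ... st6 (assigned in alphabetical order)
tabulate state, generate(st)

* The mean of each dummy is the sample proportion living in that state
summarize st1-st6

* Regress income on any five dummies; st6 is the excluded state here
regress income st1-st5

* Joint test that all five slopes are zero (matches the ANOVA F in the header)
test st1 st2 st3 st4 st5

* Excluding a different state changes the coefficients but not the ANOVA F
regress income st2-st6
```

Because tabulate assigns the dummies in alphabetical order of the state codes, st3, st4, and st6 correspond to ME, NH, and VT in this simulated sample.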
evaluates that expression and returns true (1) or false (0). The vertical bar (|) is Stata's OR operator; since every person in the sample lives in one and only one state, we must use OR to phrase the condition that they live in northern New England. But there is another way to generate this dummy, given that we have defined st1...st6 for the regression above. Let's say that Vermont, New Hampshire and Maine have been coded as st6, st4 and st3, respectively. We may just gen nen = st3+st4+st6, since the sum of mutually exclusive dummies must be another dummy. To check, the resulting nen will have a mean equal to the proportion of the sample that lives in northern New England; the equivalent nes dummy will have a mean equal to the proportion living in southern New England; and the sum of those two means will of course be 1. We can then run a simplified form of our model as regress inc nen; the ANOVA F statistic for that regression tests the null hypothesis that incomes in northern and southern New England do not differ significantly. Since we have excluded nes, the "slope" coefficient on nen measures the amount by which northern New England income differs from southern New England income; the mean income for southern New England is the constant term. If we want point and interval estimates for those means, we should regress inc nen nes, noc.
Regression with continuous and dummy variables
In the above examples, we have estimated "pure ANOVA" models: regression models in which all of the explanatory variables are dummies. In econometric research, we often want to combine quantitative and qualitative information, including some regressors that are measurable and others that are dummies. Consider the simplest example: we have data on individuals' wages, years of education, and their gender. We could create two gender dummies, male and female, but we will only need one in the analysis: say, female. We create this variable as gen female = (gender=="F"). We can then estimate the model:
$$wage = \beta_0 + \beta_1\, educ + \beta_2\, female + u \qquad (3)$$
The constant term in this model now becomes the wage for a male with zero years of education. Male wages are predicted as $b_0 + b_1\, educ$, while female wages are predicted as $b_0 + b_1\, educ + b_2$. The gender differential is thus $b_2$. How would we test for the existence of "statistical discrimination": that, say, females with the same qualifications are paid a lower wage? This would be a test of $H_0: \beta_2 = 0$ against the one-sided alternative $\beta_2 < 0$. The t statistic for $b_2$ will provide us with this hypothesis test.
What is this model saying about wage structure? Wages are a linear function of the years of education. If $b_2$ is significantly different from zero, then there are two "wage profiles": parallel lines in (educ, wage) space, each with a slope of $b_1$, with their intercepts differing by $b_2$.
What if we wanted to expand this model to consider the possibility that wages differ by both gender and race? Say that each worker is classified as race=white or race=black. Then we could gen black = (race=="black") to create the dummy variable, and add it to (3). What, now, is the constant term? The wage for a white male with zero years of education. Is
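A minimal Stata sketch of model (3), run on simulated data. The data-generating numbers are assumptions made up for illustration; the variable names gender, female, wage, and educ follow the text, and the choice of 12 years of education for the predicted profiles is arbitrary.

```stata
* Simulated data for illustration only (all numbers here are assumptions)
clear
set seed 2024
set obs 500
generate str1 gender = cond(runiform() < 0.5, "F", "M")
generate educ = 8 + floor(9*runiform())
generate wage = 5 + 0.8*educ - 2*(gender == "F") + rnormal(0, 3)

* The logical expression evaluates to 1 (true) or 0 (false)
generate female = (gender == "F")

* Model (3): common slope on educ, intercept shift for females
regress wage educ female

* Predicted wage at 12 years of education for males, then for females
lincom _cons + 12*educ
lincom _cons + 12*educ + female
```

The t statistic reported for female in the regress output is the test of a zero gender differential; the two lincom results differ by exactly the coefficient on female, which is the vertical distance between the two parallel wage profiles.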
where we include any 5 of the 6 st variables designating the New England states. The test that wage levels differ significantly due to state of residence is the joint test that the coefficients on all five included state dummies are zero. A judgment concerning the relevance of state of residence should be made on the basis of this joint test (an F-test with 5 numerator degrees of freedom).
Note that if the dependent variable were measured in log form, the coefficients on dummies would be interpreted as percentage changes; if (5) were respecified to place log(wage) as the dependent variable, the coefficient on education would measure the percentage return to education (by how many percent the wage changes for each additional year of education), while the coefficient on the female dummy would measure the (approximate) percentage difference in wage levels between females and males, ceteris paribus. The state dummies would, likewise, measure the percentage difference in wage levels between that state and the excluded state (number 6).
We must be careful when working with variables that have an ordinal interpretation, and are thus coded in numeric form, to treat them as ordinal. For instance, if we model the interest rate corporations must pay to borrow as a function of their credit rating, we consider that Moody's and Standard and Poor's assign credit ratings somewhat like grades (AAA, AA, A, et cetera). Those could be coded as 1,2,...,7. Just as we can agree that an "A" grade is better than a "B", a triple-A bond rating results in a lower borrowing cost than a double-A rating. But while GPAs are measured on a clear four-point scale, the bond ratings are merely ordinal, or ordered: everyone agrees on the rating scale, but the differential between the borrowing rates of the top two rating classes might be much smaller than that between two adjacent lower classes, especially if the lower of those denotes "below investment grade", which will reduce the market for such bonds. Thus, although we might have a numeric index corresponding to the ratings, we should not assume that the effect of moving from one rating class to the next is constant; we should not treat the index as a cardinal measure.
Clearly, the appropriate way to proceed is to create dummy variables for each rating class, and include all but one of those variables in a regression of the borrowing rate on bond rating and other relevant factors. For instance, if we leave out the dummy for the top-rated (triple-A) class, all of the ratings class dummies' coefficients will then measure the degree to which those borrowers' bonds bear higher rates than those of triple-A borrowers. But we could just as well leave out the dummy for the lowest rating class, and measure the effects of ratings classes relative to the worst credits' cost of borrowing.
Interactions involving dummy variables
Just as continuous variables may be interacted in regression equations, so can dummy variables. We might, for instance, have one set of dummies indicating the gender of respondents and another set indicating their marital status. We could regress wages on these two dummies, which gives rise to a classification of mean wages conditional on the two factors (a classic "two-way ANOVA" setup).
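To make the two-way classification concrete, suppose, purely as an illustration, that the two dummies are named female and married and that wage is the dependent variable. These names and the coefficient labels are assumptions, not taken from the text. The regression and the four implied cell means would then be:

```latex
\[
wage = \beta_0 + \beta_1\, female + \beta_2\, married + u
\]
\[
\begin{array}{l|cc}
             & married = 0       & married = 1 \\ \hline
female = 0   & \beta_0           & \beta_0 + \beta_2 \\
female = 1   & \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_2
\end{array}
\]
```

Without an interaction term, the marital-status differential $\beta_2$ is restricted to be the same for men and women; adding the product of the two dummies as a third regressor would relax that restriction.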
with two-way ANOVA (considering two factors' effects), imagine that instead of marital status we consider race, with three categories. To run the model without interactions, we would include two of the race dummies in the regression alongside the gender dummy; the constant term would be the mean wage of a white male (the excluded class). What if we wanted to include interactions? Then we would define the product of the gender dummy with each of the two included race dummies, and include those two regressors as well. The test for the significance of interactions is now a joint test that these two coefficients are zero.
A second extension of the interaction concept is far more important: what if we want to consider a regular regression, on quantitative variables, but want to allow for different slopes for different categories of observations? Then we create interaction effects between the dummies that define those categories and the measured variables. For instance, with wages, education, and the female dummy from above:
$$wage = \beta_0 + \beta_1\, female + \beta_2\, educ + \beta_3\, (female \times educ) + u$$
Here, we are in essence estimating two separate regressions in one: a regression for males, with an intercept of $\beta_0$ and a slope of $\beta_2$, and a regression for females, with an intercept of $(\beta_0 + \beta_1)$ and a slope of $(\beta_2 + \beta_3)$. Why would we want to do this? We could clearly estimate the two separate regressions, but if we did that, we could not conduct any tests (e.g. do males and females have the same intercept? The same slope?). If we use interacted dummies, we can run one regression, and test all of the special cases of this model which are nested within: that the slopes are the same, that the intercepts are the same, and the "pooled" case
in which we need not distinguish between males and females. Since each of these special cases merely involves restrictions on this general form, we can run this equation and then just conduct the appropriate tests.
If we extended this logic to include race, as defined above, as an additional factor, we would include two of the race dummies and interact each with the measured variable. This would be a model without interactions between gender and race, where the effects of the two factors are considered to be independent, but it would allow us to estimate different regression lines for each combination of gender and race, and test for the importance of each factor.
These interaction methods are often used to test hypotheses about the importance of a qualitative factor. For instance, in a sample of companies from which we are estimating their profitability, we may want to distinguish between companies in different industries, or companies that underwent a significant merger, or companies that were formed within the last decade, and evaluate whether their expenditures on R&D or advertising have the same effects across those categories.
All of the necessary tests involving dummy variables and interacted dummy variables may be easily specified and computed, since models without interacted dummies (or without certain dummies in any form) are merely restricted forms of more general models in which they appear. Thus, the standard "subset F" testing strategy that we have discussed for the testing of joint hypotheses on the coefficient vector may be readily applied in this context. The text describes how a "Chow test"
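The nested tests described above can be sketched in Stata as follows. This is an illustration on simulated data: the numbers in the data-generating step, the variable names wage and educ, and the interaction name femeduc are all assumptions.

```stata
* Simulated data for illustration only (all numbers here are assumptions)
clear
set seed 4321
set obs 500
generate female = (runiform() < 0.5)
generate educ = 8 + floor(9*runiform())
generate wage = 5 + 0.8*educ - 1*female - 0.1*female*educ + rnormal(0, 3)

* Interaction of the dummy with the measured variable
generate femeduc = female*educ

* General model: separate intercepts and slopes for males and females
regress wage female educ femeduc

* Same slope for males and females?
test femeduc

* Same intercept for males and females?
test female

* Pooled case: same intercept and same slope (joint "subset F" test)
test female femeduc
```

Each test command imposes one of the restrictions nested within the general model, so a single regression is enough to evaluate all three special cases.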