







In the context of regression analysis we usually think of the variables as being quantitative: monetary magnitudes, years of experience, the percentage of people having some characteristic of interest, and so on. Sometimes, however, we want to bring qualitative variables into play. For example, after allowing for differences attributable to experience and education level, does gender, or marital status, make a difference to people’s pay? Does race make a difference to pay, or to the chance of becoming unemployed? Did the coming of NAFTA^1 make a significant difference to the trade patterns of the USA? In all of these cases the variable we’re interested in is qualitative or categorical; it can be given a numerical coding of some sort but in itself it is non-numerical.
Such variables can be brought within the scope of regression analysis using the method of dummy variables. This method is quite general, but let’s start with the simplest case, where the qualitative variable in question is a binary variable, having only two possible values (male versus female, pre-NAFTA versus post-NAFTA). The standard approach is to code the binary variable with the values 0 and 1. For instance, we might make a gender dummy variable with the value 1 for males in our sample and 0 for females, or make a NAFTA dummy variable by assigning a 0 in years prior to NAFTA and a 1 in years when NAFTA was in force.
Consider the gender example. Suppose we have data on a sample of men and women, giving their years of work experience and their salaries. We’d expect salary to increase with experience, but we’d like to know whether, controlling for experience, gender makes any difference to pay. Let y_i denote individual i’s salary and x_i denote his or her years of experience. Let D_i (our gender dummy) be 1 for all men in the sample and 0 for the women. (We could assign the 0s and 1s the other way round; it makes no substantive difference, we just have to remember which way round it is when we come to interpret the results.) Now we estimate (say, using OLS) the model
$$y_i = \alpha + \beta x_i + \gamma D_i + \epsilon_i \qquad (1)$$
In effect, we’re getting “two regressions for the price of one”. Think about the men in the sample. Since they all have a value of 1 for D_i, equation (1) becomes
$$y_i = \alpha + \beta x_i + \gamma \cdot 1 + \epsilon_i = \alpha + \beta x_i + \gamma + \epsilon_i = (\alpha + \gamma) + \beta x_i + \epsilon_i$$
Since the women all have D_i = 0, their version of the equation is
$$y_i = \alpha + \beta x_i + \gamma \cdot 0 + \epsilon_i = \alpha + \beta x_i + \epsilon_i$$
(^1) The North American Free Trade Agreement, which came into force in 1994.
Thus the male and female variants of our model have different intercepts, α + γ for the men and just α for the women. Suppose we conjecture that men might be paid more, after allowing for experience. If this is true, we’d expect it to show up in the form of a positive value of our estimate for the parameter γ. We can test the idea that gender makes a difference by testing the null hypothesis H_0: γ = 0. If our estimate of γ is positive and statistically significant we reject the null and conclude that men are paid more.
We could, of course, simply calculate the mean salary of the men in the sample and the mean for women and compare them (perhaps doing a t-test for the difference of two means). But that would not accomplish the same as the above approach, since it would not control for years of experience. It could be that male salaries are higher on average, but the men also have more experience on average, and the difference in salary by gender is entirely explained by difference in experience levels. By running a regression including both experience and a gender dummy variable we can distinguish this possibility from the possibility that, over and above any effects of differential experience levels, there is a systematic difference by gender.
Here’s output from a regression of this sort run in gretl, using data7-2 from among the Ramanathan practice files. Actually, rather than experience I’m using EDUC (years of education beyond 8th grade when hired) as the control variable. As you can see, in this instance men were paid more, controlling for education level. The GENDER coefficient is positive and significant; it appears that men were paid about $550 more than women with the same educational level.
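A script along the following lines should run the regression just described. It is a minimal sketch, not the author's own script: it assumes the Ramanathan practice files are installed where gretl can find them, and that GENDER in data7-2 is coded 1 for men and 0 for women, as the interpretation in the text implies. The summary command is added only to show the raw comparison of mean wages by gender, with no control for education.

```hansl
# Assumption: the Ramanathan data files are installed, so that
# "open data7-2" finds the dataset named in the text.
open data7-2

# Raw comparison: summary statistics for WAGE at each value of GENDER
summary WAGE --by=GENDER

# Regression with EDUC as the control and GENDER as the 0/1 dummy
ols WAGE const EDUC GENDER

# Implied intercepts: alpha for women, alpha + gamma for men.
# Coefficients are in the order of the regressor list above.
matrix b = $coeff          # b[1] = const, b[2] = EDUC, b[3] = GENDER
printf "Intercept, women: %.3f\n", b[1]
printf "Intercept, men:   %.3f\n", b[1] + b[3]
```

The two printed intercepts correspond to α and α + γ in the men's and women's versions of equation (1).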
OLS estimates using the 49 observations 1–
Dependent variable: WAGE

Variable    Coefficient    Std. Error    t-statistic    p-value
const           856.231       227.835         3.7581    0.000481
EDUC            108.061        32.439         3.3312    0.001712
GENDER          549.072       152.732         3.5950    0.000788

Mean of dep. var.    1820.204     S.D. of dep. variable      648.
ESS                  13077037     Std Err of Resid. (σ̂)      533.
R^2                  0.351        Adjusted R^2 (R̄^2)         0.323
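Two pieces of arithmetic may help in reading this output. The first two lines below simply plug the reported estimates into the women's and men's versions of equation (1). The last line is a standard property of OLS when the model includes both a constant and the dummy (the residuals then average zero within each group): it shows how a raw difference in mean wages mixes the gender effect with any difference in mean education. The bars denote group sample means and the subscripts M and F denote men and women; this notation is introduced here for illustration and is not used elsewhere in the handout.

```latex
\begin{aligned}
\text{Women:}\quad \widehat{\text{WAGE}} &= 856.231 + 108.061\,\text{EDUC} \\
\text{Men:}\quad \widehat{\text{WAGE}} &= (856.231 + 549.072) + 108.061\,\text{EDUC}
  = 1405.303 + 108.061\,\text{EDUC} \\[4pt]
\bar{y}_{M} - \bar{y}_{F} &= \hat{\gamma} + \hat{\beta}\,(\bar{x}_{M} - \bar{x}_{F})
\end{aligned}
```

If the two groups had the same mean education, the raw difference in means would equal the estimated GENDER coefficient; otherwise the two differ by the slope times the gap in mean education.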
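There are two main ways in which the basic idea of dummy variables can be extended: to qualitative variables with more than two categories, and to differences in slope as well as intercept across categories.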
Suppose we have a qualitative variable that is coded as 0, 1, 2 and so on (as is the case with a lot of data available from government sources such as the Census Bureau). We saw above that we can’t use such a coding as is, for the purposes of regression analysis; we’ll have to convert the information into an appropriate set of 0/1 dummy variables first. You could do this using formulas in a spreadsheet, but it’s easier to do it in gretl.
Suppose you have a variable in the current dataset called RACE, which is coded 0, 1, 2, 3 and so on, and you want to create a set of dummy variables to represent the different RACE categories. There are two possibilities here: (1) you want a full set of dummies (with just one omitted category, as discussed above), or (2) you want to “collapse” the categorization to eliminate some unnecessary detail. To get the full set of dummies (that is, k − 1 of them, where k is the number of categories), use the dummify function. This takes the name of the original variable as its argument and returns a list, that is, a named object that stands in for the names of several variables. Here’s an example:
    list RACEDUMS = dummify(RACE)
    ols WAGE const EDUC RACEDUMS
Note a few things about this:
If you want to collapse the original coding you have to create the dummy variables manually. Suppose RACE originally had, say, 8 categories but you want to boil this down to white, black and “other”. And let’s say “other” should be the omitted category. First you must take note of the original code numbers for white and black: let’s say these are 1 and 2 respectively. Then you could do:
    series white = (RACE==1)
    series black = (RACE==2)
    ols WAGE const EDUC white black
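To see what such definitions produce, here is a small sketch on artificial data. Everything in it is hypothetical: the dataset has one observation for each of eight RACE codes, with 1 and 2 standing for white and black as assumed in the text. The series named other is not needed for the regression; it is built only to check that every observation falls into exactly one of the three collapsed categories.

```hansl
# Artificial data for illustration only: 8 observations, one per
# hypothetical RACE code 1..8 (1 = white, 2 = black)
nulldata 8
series RACE = index

series white = (RACE == 1)
series black = (RACE == 2)
# The omitted category, built here only to check the coding
series other = (RACE != 1) && (RACE != 2)

# Each row should show exactly one 1 among white, black and other
print RACE white black other --byobs
```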
The expressions (RACE==1) and (RACE==2) are Boolean (logical) expressions. That is, (RACE==1) gives a result of 1 when the condition evaluates as true, i.e. where RACE does equal 1, and 0 when the condition is false, i.e. for any other values of RACE. And similarly for (RACE==2). For another example, consider the categorization of educational attainment offered in the Current Population Survey.
00  Children
31  Less than 1st grade
32  1st, 2nd, 3rd, or 4th grade
33  5th or 6th grade
34  7th and 8th grade
35  9th grade
36  10th grade
37  11th grade
38  12th grade no diploma
39  High school graduate
40  Some college but no degree
41  Associates degree-occup./vocational
42  Associates degree-academic program
43  Bachelors degree (BA, AB, BS)
44  Masters degree (MA, MS, MEng, MEd, MSW, MBA)
45  Prof. school degree (MD, DDS, DVM, LLB, JD)
46  Doctorate degree (PhD, EdD)

(^2) You can adjust this if you wish: see the entry for dummify in the gretl Function Reference.
Suppose we want to make out of this a three-way classification, the categories being “no High school diploma”, “High school diploma but no Bachelors Degree”, and “Bachelors degree or higher”. If the variable shown above is called AHGA, then in gretl we could define two dummy variables thus:
    series E1 = (AHGA>38) && (AHGA<43)
    series E2 = AHGA > 42
The “&&” (logical AND) in the first formula means that E1 will get value 1 only if both conditions, (AHGA>38) and (AHGA<43), are satisfied, corresponding to “High school diploma but no Bachelors Degree”, while the definition of E2 corresponds to “Bachelors degree or higher”. Those without a High school diploma are the omitted category, with 0s for both E1 and E2.
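The coding can be checked on artificial data with a sketch like the one below, which builds one observation for each education code in the list (0, then 31 through 46). The series E0 is hypothetical and serves only as a check. One point the check brings out: as coded, observations with code 00 (children) also get 0 for both E1 and E2, so they land in the omitted category along with codes 31 to 38. Whether to exclude them is a judgment about the analysis, so the final smpl line is only a suggestion.

```hansl
# Artificial data for illustration only: one observation for each
# CPS education code listed above (0, then 31 through 46)
nulldata 17
series AHGA = (index == 1) ? 0 : 29 + index

series E1 = (AHGA > 38) && (AHGA < 43)
series E2 = (AHGA > 42)
# Omitted category as coded in the text: everything below 39,
# which includes code 0 (children) as well as codes 31-38
series E0 = (AHGA < 39)

# Each row should show exactly one 1 among E0, E1 and E2
print AHGA E0 E1 E2 --byobs

# Hypothetical refinement: drop children from the sample before estimating
smpl AHGA > 0 --restrict
```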
The regression models above allow the intercept of the regression to differ across qualitative categories. In all cases so far, however, we have imposed a common slope, β, with respect to the (quantitative) independent variable x. We might want to allow the slope to differ too. For example, it might be that while men and women are both paid more highly if they have more experience or education, the degree to which experience or education brings higher pay may differ for men and women. Note that this is a different point from simply saying that men and women at the same level of education or experience are paid differently. To allow for this sort of thing we can define an interaction term, by multiplying a dummy variable into x. Let’s go back to equation (1) but add a new variable S such that S_i = D_i x_i. The model then becomes
$$y_i = \alpha + \beta x_i + \gamma D_i + \delta S_i + \epsilon_i \qquad (2)$$
which breaks out for men and women as:
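Men:
$$\begin{aligned} y_i &= \alpha + \beta x_i + \gamma \cdot 1 + \delta x_i \cdot 1 + \epsilon_i \\ &= \alpha + \beta x_i + \gamma + \delta x_i + \epsilon_i \\ &= (\alpha + \gamma) + (\beta + \delta) x_i + \epsilon_i \end{aligned}$$
Women:
$$\begin{aligned} y_i &= \alpha + \beta x_i + \gamma \cdot 0 + \delta x_i \cdot 0 + \epsilon_i \\ &= \alpha + \beta x_i + \epsilon_i \end{aligned}$$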
This now allows for different slopes (β + δ for men, just β for women) as well as different intercepts. To test whether gender makes any difference (either to the intercept or the slope) we would use an F-test of H_0: γ = δ = 0.
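In gretl, equation (2) and the joint test might be set up as in the sketch below. It assumes the same data7-2 file used earlier, with EDUC playing the role of x and GENDER the role of D; the name S follows the notation in the text. The omit command with the --test-only option reports a joint test that the coefficients on the listed variables are zero, without replacing the estimated model.

```hansl
# Assumption: the Ramanathan data files are installed; GENDER is the
# 0/1 dummy and EDUC plays the role of x, as in the earlier regression
open data7-2

# Interaction term S_i = D_i * x_i
series S = GENDER * EDUC

# Equation (2): separate intercepts and slopes by gender
ols WAGE const EDUC GENDER S

# Joint F-test of H0: gamma = delta = 0 (coefficients on GENDER and S),
# leaving the estimated model in place
omit GENDER S --test-only
```

A small p-value for the joint test would lead us to reject the null that gender makes no difference to either the intercept or the slope; no result is claimed here, since the interaction model is not estimated in the text.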