Chapter 6

Analysis with Qualitative Information: Dummy Variables

In previous chapters, the dependent and independent variables in our regression equations had a quantitative meaning. That is, the magnitude of the variable carried useful information, for example, years of education, years of experience, unemployment rate, or wage. In this chapter we will analyze how to introduce qualitative information into a regression equation. Examples of qualitative information include marital status, gender, race, industry (manufacturing, retail, etc.), and geographical region (south, north, west, etc.).

6.1 Describing qualitative information

Qualitative factors often come in the form of binary information: a person is female or male; a person does or does not own a computer; a person is married or not. In all these cases the relevant information can be captured by a binary variable, also called a dummy variable or zero-one variable. In defining a dummy variable we must decide which event is assigned a value of one and which a value of zero. Table 6.1 shows how two dummy variables (female and married) look in the data set.

Table 6.1 A partial listing of the data in Wage.xls

person   wage    educ   exper   female   married
1        3.10    11     2       1        0
2        3.24    12     22      1        1
3        3.00    11     2       0        0
4        6.00    8      44      0        1
5        5.30    12     7       0        1
...      ...     ...    ...     ...      ...
525      11.56   16     5       0        1
526      3.50    14     5       1        0
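
As a minimal sketch of the coding decision described in Section 6.1, the snippet below turns text labels into the two dummy variables for the five fully listed rows of Table 6.1. The label strings ("female", "male", "married", "single") and the helper name to_dummy are assumptions made for illustration; Wage.xls already stores both columns as zeros and ones.

```python
# First five rows of Table 6.1, with the two qualitative variables written
# as labels. The labels are hypothetical: the data file stores 0/1 directly.
people = [
    {"person": 1, "wage": 3.10, "gender": "female", "marital": "single"},
    {"person": 2, "wage": 3.24, "gender": "female", "marital": "married"},
    {"person": 3, "wage": 3.00, "gender": "male", "marital": "single"},
    {"person": 4, "wage": 6.00, "gender": "male", "marital": "married"},
    {"person": 5, "wage": 5.30, "gender": "male", "marital": "married"},
]


def to_dummy(value, event):
    """Return 1 if `value` is the event coded as one, else 0."""
    return 1 if value == event else 0


# We choose "female" and "married" as the events assigned the value one.
female = [to_dummy(p["gender"], "female") for p in people]
married = [to_dummy(p["marital"], "married") for p in people]

print(female)   # [1, 1, 0, 0, 0]
print(married)  # [0, 1, 0, 1, 1]
```

Choosing "male" as the event coded as one would simply flip the first column; the information captured is the same.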

Fig. 6.1 Graph of wage = β₀ + δ₀ female + β₁ educ for δ₀ < 0.

6.2 A single dummy independent variable

The simplest case is when we have a single dummy independent variable. Let’s consider the following model:

wage = β₀ + δ₀ female + β₁ educ + ε   (6.1)

We use the parameter δ₀ to emphasize that female is a dummy variable: female = 1 if the person is a woman and female = 0 if the person is a man. The parameter δ₀ is the difference in hourly wage between females and males with the same amount of education (and the same error term ε). Thus, the coefficient δ₀ indicates whether there is discrimination against women: if δ₀ < 0, then for the same level of education women earn less than men on average. The case δ₀ < 0 is depicted in Figure 6.1 as an intercept shift between males and females.

Let’s estimate the following more interesting model:

wage = β₀ + δ₀ female + β₁ educ + β₂ exper + β₃ tenure + ε   (6.2)

The regression output in Gretl is:

Model 2: OLS, using observations 1-
Dependent variable: wage


wagê = 7.09949 − 2.512 female
       (0.21001)  (0.30341)

N = 526   R̄² = 0.1140   F(1, 524) = 68.537   σ̂ = 3.4763
(standard errors in parentheses)

The expected (predicted) wage for females is wagê = 7.099 − 2.512(1) = 4.587, while the expected wage for males is wagê = 7.099 − 2.512(0) = 7.099. This comparison does not control for differences in education, experience, or tenure. Once we control for those differences, the wage gap between the two groups is smaller: δ̂₀ = −1.81.

What is the interpretation of the coefficient on a dummy variable if the dependent variable is in logs? Here the coefficient has a percentage interpretation. Let’s say we want to estimate the following equation:

log(wage) = β₀ + δ₀ female + β₁ educ + β₂ exper + β₃ tenure + ε   (6.4)

that has the following Gretl estimation output:

logwagê = 0.501348 − 0.301 female + 0.0874623 educ + 0.00462938 exper + 0.0173670 tenure
          (0.10190)  (0.037246)     (0.0069389)      (0.0016271)        (0.0029762)

N = 526   R̄² = 0.3876   F(4, 521) = 84.072   σ̂ = 0.41596
(standard errors in parentheses)

The coefficient on female, δ₀, implies that for the same levels of education, experience, and tenure, women earn approximately 100(0.301) = 30.1% less than men.
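
The 30.1% figure relies on the approximation that 100 times the coefficient is the percentage change. As a hedged side calculation, which is not part of the Gretl output, the exact percentage difference implied by a log dependent variable is 100·(exp(δ₀) − 1). The sketch below compares the two, using the coefficient −0.301 quoted in the text.

```python
import math

# Coefficient on female from the log(wage) equation, as quoted in the text.
delta_female = -0.301

# Approximate percentage difference: 100 * coefficient.
approx_pct = 100 * delta_female

# Exact percentage difference implied by a log dependent variable:
# 100 * (exp(coefficient) - 1).
exact_pct = 100 * (math.exp(delta_female) - 1)

print(f"approximate: {approx_pct:.1f}%")
print(f"exact:       {exact_pct:.1f}%")
```

The approximation is good for small coefficients and becomes less accurate as the coefficient grows in absolute value, as it does here.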

6.3 Dummy variables for multiple categories

One can use several dummy variables in the same equation. For example, we can add the dummy variable married to Equation 6.3 to obtain:

wage = β₀ + δ₀ female + δ₁ married + ε   (6.5)

In Gretl we have,

wagê = 6.18043 […] female + 1.33948 married
       (0.29634) (0.30261)   (0.30971)

N = 526   R̄² = 0.1429   F(2, 523) = 44.779   σ̂ = 3.4190
(standard errors in parentheses)

The coefficient on married gives the (approximate) difference in wages between married and unmarried individuals. Based on these results, married individuals have higher hourly wages. One important limitation of Equation 6.5 is that it restricts the effect of marital status on wages to be the same for men and women. If we want to let this effect differ by gender, we can estimate an alternative model with additional categories. In particular we need four categories: (1) married men, (2) married women, (3) single men, and (4) single women. We must select a base group (for example, single men) and create dummy variables for the other three groups.

marrmale = married × (1 − female)
marrfem = married × female
singfem = (1 − married) × female
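
The three definitions above can be sketched in a few lines. The example uses the (female, married) pairs of the seven rows listed in Table 6.1; the extra singmale column is added only to verify that the four groups are mutually exclusive and exhaustive, and is not a regressor in the model.

```python
# (female, married) pairs for the seven rows listed in Table 6.1.
rows = [(1, 0), (1, 1), (0, 0), (0, 1), (0, 1), (0, 1), (1, 0)]

groups = []
for female, married in rows:
    marrmale = married * (1 - female)
    marrfem = married * female
    singfem = (1 - married) * female
    # Base group (single men): the three dummies above are all zero.
    singmale = (1 - married) * (1 - female)
    groups.append((marrmale, marrfem, singfem, singmale))

# Every person belongs to exactly one of the four groups.
assert all(sum(g) == 1 for g in groups)

for g in groups:
    print(g)
```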

The equation we want to estimate is:

log(wage) = β₀ + δ₀ marrmale + δ₁ marrfem + δ₂ singfem + ε   (6.6)

and the estimation output is:

logwagê = 1.5201 […] marrmale − 0.0797 marrfem − 0.1316 singfem
          (0.050987) (0.061554)  (0.065524)       (0.066804)

N = 526   R̄² = 0.2087   F(3, 522) = 47.149   σ̂ = 0.47284
(standard errors in parentheses)

Each δ coefficient is interpreted relative to the base group. For example, δ̂₂ = −0.1316 means that single females earn approximately 13.16% lower hourly wages than single men (the base group).
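
To make the base-group logic concrete, the sketch below computes the predicted log wage of each group from the reported estimates, and shows that two non-base groups are compared by subtracting their coefficients. Only the numbers legible in the output are used; the variable names are illustrative.

```python
# Reported estimates from the log(wage) regression on the group dummies.
# The coefficient on marrmale is not legible in the source, so it is left out.
intercept = 1.5201        # single men (base group)
coef_marrfem = -0.0797
coef_singfem = -0.1316

# Predicted log wage for each group = intercept + that group's coefficient.
single_men = intercept
married_women = intercept + coef_marrfem
single_women = intercept + coef_singfem

# Comparing two non-base groups: subtract their coefficients.
married_vs_single_women = coef_marrfem - coef_singfem

print(round(single_women, 4), round(married_women, 4),
      round(married_vs_single_women, 4))
```

The last number is the estimated log-wage difference between married and single women. Its standard error cannot be read off the output directly; re-estimating the model with a different base group is one way to obtain it.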

6.4 Incorporating ordinal information

Suppose we want to estimate the effect of city credit ratings on the municipal bond interest rate (MBR). The credit rating (CR) is an ordinal variable; suppose it ranges from zero (worst credit) to four (best credit). Under these considerations, a potential candidate for our model is:

MBR = β₀ + β₁ CR + other factors + ε   (6.7)

where other factors are just other variables in the model. One concern with this specification is that it is hard to interpret a one-unit increase in CR. It is easy to talk about an additional year of education or an additional dollar of income, but credit ratings usually have only an ordinal meaning. Moreover, it is restrictive to assume that each additional unit increase in CR has the same effect on MBR. An alternative approach is to create separate dummy variables for each of the values of CR, that is,

6.5 Interactions involving dummy variables

Fig. 6.2 Graph of wage = β₀ + δ₀ female + β₁ educ + δ₁ educ × female.

6.5.1 Allowing for different slopes

Consider the case where we want to estimate the effect of education on hourly wage and, in addition, we want the marginal effect of education to differ by gender. This can be done by interacting educ with female and estimating the following model:

wage = β₀ + δ₀ female + β₁ educ + δ₁ (female × educ) + ε   (6.10)

A graphical representation of this model is presented in Figure 6.2. The output in Gretl is

wagê = 0.200496 […] female + 0.539476 educ − 0.0859990 female × educ
       (0.84356)  (1.3250)   (0.064223)      (0.10364)

N = 526   R̄² = 0.2555   F(3, 522) = 61.070   σ̂ = 3.1865
(standard errors in parentheses)
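
As a short worked example of how to read this output, the sketch below recovers the estimated return to education for each group and the t-ratio on the interaction term. It uses only the reported coefficient and standard error; the variable names are illustrative.

```python
# Reported estimates for wage on female, educ, and female x educ.
b_educ = 0.539476           # slope for men (female = 0)
d_interaction = -0.0859990  # change in the slope for women
se_interaction = 0.10364

slope_men = b_educ
slope_women = b_educ + d_interaction

# t-ratio for H0: the return to education is the same for both groups.
t_interaction = d_interaction / se_interaction

print(f"slope for men:   {slope_men:.4f}")
print(f"slope for women: {slope_women:.4f}")
print(f"t-ratio on female x educ: {t_interaction:.2f}")
```

The t-ratio is well below 2 in absolute value, so in this specification the difference in slopes is not statistically significant at conventional levels.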


6.5.2 Testing for differences in regression functions across groups

So far we have seen that interacting a dummy variable with other independent variables is a powerful tool. We can now use this tool to test the null hypothesis that two groups follow the same regression function, against the alternative that one or more of the slopes differ across the two groups. Suppose we want to test whether the same regression model describes college GPA for males and for females. The model is

cumgpa = β₀ + β₁ sat + β₂ hsperc + β₃ tothrs + ε,   (6.11)

where cumgpa is cumulative college GPA, sat is the SAT score, hsperc is the high school rank percentile, and tothrs is the total hours of college courses. The regression results in Gretl are

cumgpâ = 0.929111 […] sat − 0.006379 hsperc + 0.01198 tothrs
         (0.22855)  (0.000208)  (0.00157)     (0.000931)

N = 732   R̄² = 0.2323   F(3, 728) = 74.717   σ̂ = 0.86711
(standard errors in parentheses)

To allow for a difference in the intercept we just need to include the dummy variable female. Then, to allow differences in the slope parameters we need to include interaction terms for each of the variables and female. That is

cumgpa = β₀ + δ₀ female + β₁ sat + δ₁ sat · female
         + β₂ hsperc + δ₂ hsperc · female
         + β₃ tothrs + δ₃ tothrs · female + ε   (6.12)

The parameter δ₀ is the difference in the intercepts between females and males, δ₁ is the slope difference with respect to sat between females and males, and so on. The null hypothesis that cumgpa follows the same model for females and males is

H₀: δ₀ = δ₁ = δ₂ = δ₃ = 0   (6.13)

If at least one of the δⱼ is different from zero, then the model is different for men and women. After creating the interaction terms, the estimated model in Gretl is

Model 2: OLS, using observations 1-
Dependent variable: cumgpa

                coefficient    std. error     t-ratio   p-value
const           1.21398        0.264828       4.584     5.37e-06    ***
sat             0.000611312    0.000235026    2.601     0.0095      ***
hsperc          -0.00596745    0.00177646     -3.359    0.0008      ***
tothrs          0.0103004      0.00109284     9.425     5.65e-020   ***
female          -1.11364       0.528539       -2.107    0.0355      **
satfemale       0.00111674     0.000500034    2.233     0.0258      **
hspercfemale    5.07597e-05    0.00410253     0.01237   0.
tothrsfemale    0.00555989     0.00206958     2.686     0.0074      ***
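
One way to read the interacted model is to recover the regression line of each group from it. In the sketch below (dictionary names are illustrative), the male line is given by the base coefficients and the female line by the base coefficients plus the corresponding female terms. Up to rounding, the sums agree with the female-only regression reported later in this section (0.100346, 0.00172805, −0.00591669, 0.0158603).

```python
# Coefficients from the unrestricted (interacted) model reported above.
base = {"const": 1.21398, "sat": 0.000611312,
        "hsperc": -0.00596745, "tothrs": 0.0103004}
shift = {"const": -1.11364,      # coefficient on female
         "sat": 0.00111674,      # satfemale
         "hsperc": 5.07597e-05,  # hspercfemale
         "tothrs": 0.00555989}   # tothrsfemale

# Males (female = 0) keep the base coefficients; females add the shifts.
female_line = {name: base[name] + shift[name] for name in base}

for name, value in female_line.items():
    print(f"{name:8s} {value: .6g}")
```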


F = [(RSS − RSS_UR) / RSS_UR] × [(n − 2k) / q]

where RSS is the residual sum of squares of the restricted model estimated in Equation 6.11 and RSS_UR is the residual sum of squares of the unrestricted model in Equation 6.12. Here n is the sample size, k is the number of parameters we are estimating, and q is the number of restrictions when comparing the model in Equation 6.11 with the model in Equation 6.12. Substituting the values we obtain,

F =

An alternative way to calculate this F statistic is to follow the formula,

F = [(RSS − (RSS₁ + RSS₂)) / (RSS₁ + RSS₂)] × [(n − 2k) / k]

where RSS is the residual sum of squares of the model estimated in Equation 6.11. RSS₁ and RSS₂ are the residual sums of squares of the model in Equation 6.11 estimated using only the females in the sample (RSS₁) and using only the males in the sample (RSS₂). As before, n is the sample size and k is the number of parameters we are estimating. The estimation of Equation 6.11 with just females is:

Model 5: OLS, using observations 1-
Dependent variable: cumgpa

          coefficient    std. error     t-ratio   p-value
const     0.100346       0.481095       0.2086    0.
sat       0.00172805     0.000464216    3.723     0.0003      ***
hsperc    -0.00591669    0.00388949     -1.521    0.
tothrs    0.0158603      0.00184854     8.580     4.82e-015   ***

Mean dependent var   2.268611    S.D. dependent var   1.
Sum squared resid    143.6897    S.E. of regression   0.
R-squared            0.367483    Adjusted R-squared   0.
F(3, 176)            34.08447    P-value(F)           2.03e-
Log-likelihood       -235.1319   Akaike criterion     478.
Schwarz criterion    491.0356    Hannan-Quinn         483.

and with just males is:

Model 6: OLS, using observations 1-
Dependent variable: cumgpa

          coefficient    std. error     t-ratio   p-value
const     1.21398        0.260270       4.664     3.90e-06    ***
sat       0.000611312    0.000230981    2.647     0.0084      ***
hsperc    -0.00596745    0.00174588     -3.418    0.0007      ***
tothrs    0.0103004      0.00107403     9.590     3.06e-020   ***

Mean dependent var   2.019638    S.D. dependent var   0.
Sum squared resid    390.6194    S.E. of regression   0.
R-squared            0.186740    Adjusted R-squared   0.
F(3, 548)            41.94377    P-value(F)           2.06e-
Log-likelihood       -687.8093   Akaike criterion     1383.
Schwarz criterion    1400.873    Hannan-Quinn         1390.

Using the formula in Equation 6.17,

F =

which is the same result as in Equation 6.15. This version of the F test is also known as the Chow test. A large F statistic is evidence against the null hypothesis. In our example the F statistic of 4.4227 has an associated p-value of 0.0015, below the usual 0.05 (or 5%). Hence, we reject the null hypothesis that there is no difference between the equation for females and the equation for males. This means that the two groups follow different regression functions, and we are better off estimating Equation 6.12 instead of Equation 6.11.

To estimate Equation 6.11 with just the female portion of the data, we need to change the sample. In Gretl, go to Sample → Restrict, based on criterion..., and in the window that opens select “use dummy variable” and then female. Once the sample is restricted, estimate the model by Ordinary Least Squares again.
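
The Chow statistic can be checked by hand from the numbers reported in this section. The sketch below is a rough verification under one stated assumption: the pooled RSS of Equation 6.11 is not printed in the text, so it is recovered from the reported standard error of the regression (σ̂ = 0.86711) as σ̂²(n − k). Because σ̂ is rounded, the result matches 4.4227 only to about two decimals.

```python
# Inputs quoted in the section.
n = 732                 # total sample size (180 females + 552 males)
k = 4                   # parameters in Equation 6.11, including the intercept
rss_females = 143.6897  # sum of squared residuals, females only
rss_males = 390.6194    # sum of squared residuals, males only

# Assumption: the pooled RSS is not printed in the text, so it is recovered
# from the reported standard error of the regression, sigma_hat^2 * (n - k).
sigma_hat = 0.86711
rss_pooled = sigma_hat ** 2 * (n - k)

rss_unrestricted = rss_females + rss_males
chow_f = (rss_pooled - rss_unrestricted) / rss_unrestricted * (n - 2 * k) / k

print(f"pooled RSS       = {rss_pooled:.2f}")
print(f"unrestricted RSS = {rss_unrestricted:.4f}")
print(f"Chow F           = {chow_f:.2f}")
```

The group sizes of 180 and 552 follow from the degrees of freedom of the two separate regressions, F(3, 176) and F(3, 548), each with four estimated parameters.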

6.6 The dummy variable trap

The dummy variable trap occurs when there is an exact linear relationship among the variables in the regression model. That is why we do not include both female and male in the same regression equation: female + male = 1, which is exactly the constant term. The same occurs when we have more than two categories, so we should always omit one of the categories (the base group). That is why singmale does not appear in Equation 6.6 (marrmale + singmale + marrfem + singfem = 1).
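
The trap can be seen numerically. OLS requires the matrix X'X to be invertible, and an exact linear relationship among the columns of X makes its determinant zero. The sketch below is a small illustration using the female column of the first five rows of Table 6.1; the helper functions det3 and cross_product are written here for the example and are not part of any library.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def cross_product(columns):
    """Return X'X for a design matrix given as a list of columns."""
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in columns]
            for ci in columns]


female = [1, 1, 0, 0, 0]          # first five rows of Table 6.1
male = [1 - f for f in female]    # complementary dummy
const = [1] * len(female)         # intercept column

# With the intercept, female, and male all included, X'X is singular.
trap = det3(cross_product([const, female, male]))

# Dropping male (the base group) removes the exact linear dependence.
xtx = cross_product([const, female])
ok = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]

print(trap, ok)  # 0 6
```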