Qualitative Variables and Regression Analysis

Allin Cottrell

September 25, 2015

1 Introduction

In the context of regression analysis we usually think of the variables as being quantitative: monetary

magnitudes, years of experience, the percentage of people having some characteristic of interest, and so

on. Sometimes, however, we want to bring qualitative variables into play. For example, after allowing for

differences attributable to experience and education level, does gender, or marital status, make a difference

to people’s pay? Does race make a difference to pay, or to the chance of becoming unemployed? Did the

coming of NAFTA¹ make a significant difference to the trade patterns of the USA? In all of these cases

the variable we’re interested in is qualitative or categorical; it can be given a numerical coding of some

sort but in itself it is non-numerical.

Such variables can be brought within the scope of regression analysis using the method of dummy

variables. This method is quite general, but let’s start with the simplest case, where the qualitative

variable in question is a binary variable, having only two possible values (male versus female, pre-NAFTA

versus post-NAFTA).

The standard approach is to code the binary variable with the values 0 and 1. For instance we might

make a gender dummy variable with the value 1 for males in our sample and 0 for females, or make a

NAFTA dummy variable by assigning a 0 in years prior to NAFTA and a 1 in years when NAFTA was

in force.
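As a minimal sketch of this 0/1 coding, the Python fragment below builds both dummies from small made-up samples. The labels, years, and variable names are illustrative assumptions, not data from these notes; the 1994 cutoff is the year NAFTA came into force.

```python
# Hypothetical sample: the labels and variable names are illustrative only.
genders = ["male", "female", "female", "male", "female"]

# Gender dummy: 1 for males, 0 for females.
gender_dummy = [1 if g == "male" else 0 for g in genders]

# Hypothetical annual observations around NAFTA's start in 1994.
years = [1991, 1992, 1993, 1994, 1995, 1996]

# NAFTA dummy: 0 in years prior to NAFTA, 1 in years when it was in force.
nafta_dummy = [1 if year >= 1994 else 0 for year in years]

print(gender_dummy)  # [1, 0, 0, 1, 0]
print(nafta_dummy)   # [0, 0, 0, 1, 1, 1]
```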

2 Gender and salary

Consider the gender example. Suppose we have data on a sample of men and women, giving their years

of work experience and their salaries. We’d expect salary to increase with experience, but we’d like to

know whether, controlling for experience, gender makes any difference to pay. Let $y_i$ denote individual

$i$’s salary and $x_i$ denote his or her years of experience. Let $D_i$ (our gender dummy) be 1 for all men

in the sample and 0 for the women. (We could assign the 0s and 1s the other way round; it makes no

substantive difference, as long as we remember which way round it is when we come to interpret the

results.) Now we estimate (say, using OLS) the model

$$y_i = \alpha + \beta x_i + \gamma D_i + \epsilon_i \qquad (1)$$
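To make the estimation step concrete, here is a rough sketch that fits equation (1) by OLS on simulated data, using only the Python standard library. The "true" parameter values, the sample size, and the helper names `solve_linear_system` and `ols` are all assumptions made for illustration; in practice one would normally use a statistics package rather than solving the normal equations by hand.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Simulated data; every number here is made up for illustration.
rng = random.Random(0)
alpha, beta, gamma = 30.0, 1.5, 5.0           # assumed values, salary in $1000s
n = 200
x = [rng.uniform(0, 30) for _ in range(n)]    # years of experience
d = [rng.randint(0, 1) for _ in range(n)]     # D_i: 1 = male, 0 = female
y = [alpha + beta * xi + gamma * di + rng.gauss(0, 1)
     for xi, di in zip(x, d)]

# Each row of the design matrix is (1, x_i, D_i).
rows = [[1.0, xi, float(di)] for xi, di in zip(x, d)]
a_hat, b_hat, g_hat = ols(rows, y)
print(f"alpha_hat={a_hat:.2f}, beta_hat={b_hat:.2f}, gamma_hat={g_hat:.2f}")
```

With a sample of this size and modest noise, the three estimates should land close to the assumed values, with the coefficient on the dummy picking up the simulated pay difference at any given level of experience.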

In effect, we’re getting “two regressions for the price of one”. Think about the men in the sample.

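Since they all have a value of 1 for $D_i$, equation (1) becomes

$$
\begin{aligned}
y_i &= \alpha + \beta x_i + \gamma \cdot 1 + \epsilon_i \\
    &= \alpha + \beta x_i + \gamma + \epsilon_i \\
    &= (\alpha + \gamma) + \beta x_i + \epsilon_i
\end{aligned}
$$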
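The rearrangement can be checked numerically. The short sketch below uses hypothetical coefficient values (not estimates from any data) and confirms that setting the dummy to 1 in equation (1), with the error term left out, gives a line with intercept $\alpha + \gamma$ and slope $\beta$.

```python
# Hypothetical coefficient values, chosen only for illustration.
alpha, beta, gamma = 30.0, 1.5, 5.0

def systematic_part(x, d):
    """Equation (1) without the error term: alpha + beta*x + gamma*D."""
    return alpha + beta * x + gamma * d

def male_line(x):
    """The line for men: intercept (alpha + gamma), slope beta."""
    return (alpha + gamma) + beta * x

for years_exp in (0, 10, 20):
    # With D = 1 the two expressions agree at every level of experience.
    assert systematic_part(years_exp, 1) == male_line(years_exp)
    print(years_exp, male_line(years_exp))
```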

¹ The North American Free Trade Agreement, which came into force in 1994.
