Regression Models: Including Categorical Variables

ISyE 3103 Supply Chain Modeling: Transportation and Logistics

Spring 2006

Regression Models with Indicator Variables

1 Categorical Variables

One of the advantages regression-based forecasting models have over time series extrapolation models

(e.g. moving average and exponential smoothing) is that regression models can include other

explanatory variables that are not necessarily related to time. It seems logical that demand (and

other dependent variables) could be a function of factors such as advertising expenditures and price

in addition to time series elements. Some of these factors could be categorical in nature as opposed

to quantitative. Examples of categorical variables include gender, color, region of the country,

and the political party currently in power. As we will see, categorical variables are also useful in

modeling seasonal effects in regression models.

Regression models require that all of the variables included, both independent and dependent, be

quantitative in nature. Consequently, we must devise a method of translating these categories into

numbers. We will define the levels of a categorical variable as the distinct values that variable

can assume. It is not sufficient for us to haphazardly assign an integer (or any other) value to

each level in order to convert it. Creating integer values necessarily implies some kind of ordering

on the levels, which does not make any sense in the context of a categorical variable.

Consider the following example. Suppose that we wanted to model the starting salary for Georgia

Tech ISyE graduates and we thought that an important factor might be the color of the graduate’s

hair. For simplicity, let us assume that there are three hair colors: light, dark, and red. If we

were to create an integer variable called COLOR where COLOR=0 denotes light hair, COLOR=1

denotes red hair, and COLOR=2 denotes dark hair, we would be imposing an artificial ordering on

the levels of the COLOR variable. This ordering renders the regression coefficient meaningless since

we would be implying that dark hair is two “colors” higher (or lower) than light hair, with each step shifting the predicted salary by the same amount.
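The effect of the artificial ordering can be seen in a small numerical sketch. The salary figures below are invented purely for illustration (they are not from the course data), and the fit is an ordinary simple least-squares regression of salary on the integer-coded COLOR variable. Because a single slope multiplies COLOR, the fitted salaries for light, red, and dark hair are forced to be equally spaced, whatever the actual group averages are.

```python
# Hypothetical starting salaries (in $1000s), invented for illustration only.
salaries = {
    "light": [50.0, 52.0],
    "red": [60.0, 62.0],
    "dark": [53.0, 55.0],
}
# The arbitrary integer coding from the text: light=0, red=1, dark=2.
coding = {"light": 0, "red": 1, "dark": 2}

# Flatten into paired x (COLOR) and y (salary) observations.
x = [coding[c] for c, vals in salaries.items() for _ in vals]
y = [s for vals in salaries.values() for s in vals]

# Simple least-squares fit of salary = b0 + b1 * COLOR.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Fitted salary for each hair color under the integer coding.
fitted = {c: b0 + b1 * code for c, code in coding.items()}
# Actual average salary for each hair color.
group_means = {c: sum(v) / len(v) for c, v in salaries.items()}

print("fitted:     ", fitted)
print("group means:", group_means)
```

In this made-up data the red-haired group has the highest average, but the integer coding cannot represent that: the fitted value for red always lies exactly halfway between the fitted values for light and dark.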

2 Using Indicator Variables

In order to incorporate categorical variables into a regression model, we must create indicator

variables¹ that take on a value of either 0 or 1. To represent a categorical variable that has c levels,

we require c−1 indicator variables. If we created c indicator variables along with a constant term

(β₀) in the regression model, we would get an error from our statistical software program because

the c indicator columns would sum to the constant column, which makes XᵀX (the matrix built from

the data matrix X) singular.² You must choose³ one of the levels to be

the base case, and then each of the other levels has a corresponding indicator variable that is equal

to 1 if the observation exhibits that level and 0 otherwise.
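The singularity problem can be checked directly on a toy data set. The sketch below is only an illustration under stated assumptions: the six hair-color observations are made up, and the helper names gram and det are hypothetical, written here in plain Python with exact rational arithmetic rather than taken from any statistics package. It builds the data matrix once with a constant plus all c = 3 indicators and once with a constant plus c−1 = 2 indicators (light hair as the base case), then compares the determinants of XᵀX.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix stored as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det(M):
    """Determinant by Gaussian elimination using exact fractions."""
    A = [[Fraction(v) for v in row] for row in M]
    n = len(A)
    result = Fraction(1)
    for k in range(n):
        # Find a row at or below k with a nonzero entry in column k.
        pivot = next((r for r in range(k, n) if A[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot available: the matrix is singular
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            result = -result  # a row swap flips the sign of the determinant
        result *= A[k][k]
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    return result


# Made-up observations, for illustration only.
colors = ["light", "red", "dark", "red", "light", "dark"]
levels = ["light", "red", "dark"]

# Constant column plus an indicator for every level (c = 3 indicators).
X_all = [[1] + [int(c == lev) for lev in levels] for c in colors]
# Constant column plus indicators for c - 1 = 2 levels; "light" is the base case.
X_base = [[1] + [int(c == lev) for lev in levels[1:]] for c in colors]

det_all = det(gram(X_all))
det_base = det(gram(X_base))
print("det with c indicators:    ", det_all)
print("det with c - 1 indicators:", det_base)
```

With all three indicators the determinant is exactly zero, because the indicator columns add up to the constant column in every row. Dropping the base-case indicator removes that dependency and gives a nonzero determinant.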

Continuing our hair color example from above, suppose we denote light hair as the base case. We

must then create indicator variables⁴ RED and DARK to represent individuals with those particular

¹ Indicator variables are often called “dummy variables” or “binary variables.”

² Recall that a singular matrix has no inverse, and we must be able to invert that matrix in order to use least-squares

regression.

³ The specific choice of the base level does not affect the fitted values. Any of the levels is as good a base as any of the

others; the choice only changes how the coefficients are interpreted.

⁴ Note that these variables must be created. Data sets generally do not arrive with these 0’s and 1’s already included. Once

you choose an indicator variable mechanism to represent a categorical variable, you must go through the data and

determine the appropriate values of each indicator variable for each data observation.
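The bookkeeping described in footnote 4 might look roughly like the following. This is a minimal sketch, not a prescribed procedure: the function name make_indicators and the sample observations are hypothetical, and most statistical packages offer their own way of doing the same thing. Given the raw category for each observation and a chosen base case, it returns one 0/1 column for every other level.

```python
def make_indicators(values, base):
    """Build 0/1 indicator columns for every level except the base case.

    Returns a dict mapping an upper-case level name (e.g. "RED") to a list
    with one 0/1 entry per observation. The function name and the naming
    convention are illustrative assumptions.
    """
    other_levels = sorted(set(values) - {base})
    return {
        level.upper(): [int(v == level) for v in values]
        for level in other_levels
    }


# Made-up hair-color observations, for illustration only.
hair = ["light", "red", "dark", "dark", "light"]
indicators = make_indicators(hair, base="light")

for name, column in indicators.items():
    print(name, column)
```

An observation with light hair is the one whose RED and DARK entries are both 0, which is why no separate column is needed for the base case.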

3 Interpreting Regression Coefficients for Indicator Variables

4 Time Series Example