Regression Models: Including Categorical Variables, Study notes of Systems Engineering

These notes discuss the advantages of regression models over time series extrapolation models and introduce indicator variables as a way to incorporate categorical variables into regression analysis. They cover the creation and interpretation of indicator variables and work through an example application: modeling demand for refrigerated trucks based on temperature and day of the week.

Uploaded on 08/05/2009

koofers-user-5zq-1

ISyE 3103 Supply Chain Modeling: Transportation and Logistics

Spring 2006

Regression Models with Indicator Variables

1 Categorical Variables

One of the advantages regression-based forecasting models have over time series extrapolation models (e.g., moving average, exponential smoothing) is that regression models can include other explanatory variables that are not necessarily related to time. It seems logical that demand (and other dependent variables) could be a function of factors such as advertising expenditures and price in addition to time series elements. Some of these factors could be categorical in nature as opposed to quantitative. Examples of categorical variables include gender, color, region of the country, and the political party currently in power. As we will see, categorical variables are also useful in modeling seasonal effects in regression models.

Regression models require that all of the variables included, both independent and dependent, be quantitative in nature. Consequently, we must devise a method of translating these categories into numbers. We will refer to the distinct values that a categorical variable can assume as its levels. It is not sufficient to haphazardly assign each level an integer (or any other) value in order to convert it. Assigning integer values necessarily implies some kind of ordering on the levels, which does not make any sense in the context of a categorical variable.

Consider the following example. Suppose that we wanted to model the starting salary for Georgia Tech ISyE graduates and we thought that an important factor might be the color of the graduate's hair. For simplicity, let us assume that there are three hair colors: light, dark, and red. If we were to create an integer variable called COLOR where COLOR=0 denotes light hair, COLOR=1 denotes red hair, and COLOR=2 denotes dark hair, we would be imposing an artificial ordering on the levels of the COLOR variable. This ordering renders the regression coefficient meaningless, since we would be implying that dark hair is two "colors" higher (or lower) than light hair.
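To make the problem concrete, here is a short worked illustration. It assumes, purely for the sake of the example, that salary is regressed on the integer COLOR variable alone; the symbols $\beta_0$ and $\beta_1$ are introduced here and are not part of the original notes.

$$
\begin{aligned}
E[\text{SALARY} \mid \text{light}] &= \beta_0 + \beta_1 \cdot 0 = \beta_0 \\
E[\text{SALARY} \mid \text{red}]   &= \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1 \\
E[\text{SALARY} \mid \text{dark}]  &= \beta_0 + \beta_1 \cdot 2 = \beta_0 + 2\beta_1
\end{aligned}
$$

Under this coding the model is forced to say that the dark-versus-light difference is exactly twice the red-versus-light difference, a restriction that comes from the arbitrary numbering rather than from the data.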

2 Using Indicator Variables

In order to incorporate categorical variables into a regression model, we must create indicator variables^1 that take on a value of either 0 or 1. To represent a categorical variable that has c levels, we require c − 1 indicator variables. If we created c indicator variables along with a constant term ($\beta_0$) in the regression model, the c indicator columns would sum to the constant column, and we would get an error from our statistical software program because the data matrix (or more accurately, $X^T X$) is singular.^2 You must choose^3 one of the levels to be the base case, and then each of the other levels has a corresponding indicator variable that is equal to 1 if the observation exhibits that level and 0 otherwise.
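The singularity can be checked numerically. The following is a minimal sketch using NumPy on a small made-up sample of six graduates; the data and the variable names (`colors`, `X_full`, `X_base`) are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Hypothetical sample: hair color of six graduates (illustrative data only).
colors = ["light", "red", "dark", "dark", "light", "red"]
levels = ["light", "red", "dark"]  # c = 3 levels

# One 0/1 column per level, plus a column of ones for the constant term.
indicators = np.array([[1.0 if obs == lev else 0.0 for lev in levels] for obs in colors])
intercept = np.ones((len(colors), 1))

X_full = np.hstack([intercept, indicators])         # constant + c indicators
X_base = np.hstack([intercept, indicators[:, 1:]])  # constant + (c - 1) indicators, light = base

rank_full = np.linalg.matrix_rank(X_full.T @ X_full)
rank_base = np.linalg.matrix_rank(X_base.T @ X_base)

print(X_full.shape[1], rank_full)  # more columns than rank: X^T X is singular
print(X_base.shape[1], rank_base)  # rank equals column count: X^T X is invertible
```

With all three indicators plus the constant, the matrix has four columns but only rank three, because the indicator columns add up to the column of ones. Dropping the base-level column removes the redundancy.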

Continuing our hair color example from above, suppose we denote light hair as the base case. We must then create indicator variables^4 RED and DARK to represent individuals with those particular hair colors. If we list a given observation's indicator variable values as an ordered pair (RED, DARK), a red-haired person would be denoted by (1, 0), a dark-haired person by (0, 1), and a light-haired person by (0, 0).

(^1) Indicator variables are often called "dummy variables" or "binary variables."

(^2) Recall that a singular matrix has no inverse, and we must be able to invert that matrix in order to use least-squares regression.

(^3) The specific choice of the base level does not matter. Any of the levels is as good a base as any of the others.

(^4) Note that these variables must be created. No data set that you receive will include all of these 0's and 1's. Once you choose an indicator variable mechanism to represent a categorical variable, you must go through the data and determine the appropriate values of each indicator variable for each data observation.
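Going through the data to build the 0/1 values is mechanical, so it is easy to automate. The sketch below is one possible way to do it in plain Python; the helper name `make_indicators` and the sample list `hair` are hypothetical and chosen only for illustration.

```python
def make_indicators(values, levels, base):
    """Return one dict of 0/1 indicators per observation, omitting the base level."""
    non_base = [lev for lev in levels if lev != base]
    return [{lev.upper(): int(v == lev) for lev in non_base} for v in values]


# Three hypothetical graduates, with light hair as the base case.
hair = ["red", "dark", "light"]
rows = make_indicators(hair, levels=["light", "red", "dark"], base="light")

for color, row in zip(hair, rows):
    print(color, (row["RED"], row["DARK"]))
```

Each observation receives a (RED, DARK) pair, and the base level is the one for which every indicator is zero.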

If we wanted to include a second categorical variable in a regression model, we would again need one fewer indicator variable than the number of levels of the new categorical variable. The definition of the base case must then be extended to include the chosen base level for the second categorical variable.

To illustrate this concept in the context of our previous example, suppose we thought that the gender of the graduate was an important factor in his or her starting salary. Obviously this variable has two levels: male and female. Let us select the female level as our base case. Consequently, we will create one indicator variable, MALE, that is equal to 1 if the person is male and zero otherwise. The base case of the model now corresponds to a light-haired female graduate. There are six possible types of graduates since we have two genders and three hair colors. We now confirm that these three indicator variables, (RED, DARK, MALE), are sufficient to characterize the six types of graduates.

Graduate Type          (RED, DARK, MALE)
Light-haired female    (0, 0, 0)
Light-haired male      (0, 0, 1)
Red-haired female      (1, 0, 0)
Red-haired male        (1, 0, 1)
Dark-haired female     (0, 1, 0)
Dark-haired male       (0, 1, 1)
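The same check can be done programmatically. This is a small sketch, with dictionary names (`hair_codes`, `gender_codes`, `codes`) that are assumptions made for the example, that enumerates every combination and confirms that no two graduate types share a code.

```python
from itertools import product

# Indicator values for each level; the base levels are coded as all zeros.
hair_codes = {"light": (0, 0), "red": (1, 0), "dark": (0, 1)}  # (RED, DARK)
gender_codes = {"female": (0,), "male": (1,)}                   # (MALE,)

# Concatenate the two codes to get (RED, DARK, MALE) for each graduate type.
codes = {
    f"{hair}-haired {gender}": hair_codes[hair] + gender_codes[gender]
    for hair, gender in product(hair_codes, gender_codes)
}

for graduate_type, code in codes.items():
    print(graduate_type, code)

n_distinct = len(set(codes.values()))
print(n_distinct)  # number of distinct (RED, DARK, MALE) codes
```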

3 Interpreting Regression Coefficients for Indicator Variables

Regression coefficients (true $\beta_i$ parameters) typically represent the change in the dependent variable as the corresponding independent variable increases by one unit. Estimated coefficients ($\hat{\beta}_i$ statistics) represent the change in the expected value of the dependent variable as the independent variable increases by one. These interpretations must be modified for indicator variables since they are binary and we have a base case. Regression coefficients for indicator variables signify the change in the dependent variable as our corresponding categorical variable changes from the base level to the level represented by that coefficient's indicator variable.

Suppose that we have the following estimated regression model

$$\widehat{\text{SALARY}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{GPA} + \hat{\beta}_2\,\text{RED} + \hat{\beta}_3\,\text{DARK} + \hat{\beta}_4\,\text{MALE},$$

where GPA is a quantitative variable^5 representing the graduate's final Georgia Tech grade point average. Recall that we defined our base graduate as a light-haired female. Consequently, the estimated salary for any light-haired female graduate is $\widehat{\text{SALARY}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{GPA}$.

We can interpret $\hat{\beta}_2$ as the change in expected salary for a red-haired female compared with a light-haired female when the two graduates have the same final grade point average. Similarly, $\hat{\beta}_3$ represents the change in expected salary for a dark-haired female over that of a light-haired female with the same GPA. $\hat{\beta}_4$ corresponds to the change in expected salary for a light-haired male compared with a light-haired female for graduates with the same GPA. We can also determine the change in expected salary when both categorical variables shift from the base levels. For example, the change in expected salary for a red-haired male compared with a light-haired female (with the same GPA) equals $\hat{\beta}_2 + \hat{\beta}_4$.

(^5) We are allowed to use quantitative independent variables along with indicator variables in regression models. This will enable us to build models of trend and seasonality.

Table 2: First two weeks of data with indicator variable values

TIME  TEMP  MON  TUE  WED  THU  DEMAND
   1    13    1    0    0    0      67
   2    10    0    1    0    0      54
   3     9    0    0    1    0      51
   4    10    0    0    0    1      46
   5    10    0    0    0    0      62
   6    15    1    0    0    0      65
   7    14    0    1    0    0      55
   8    14    0    0    1    0      47
   9    15    0    0    0    1      45
  10    13    0    0    0    0      64

The interpretation of $\hat{\beta}_1$ is free from these complications since GPA is a quantitative variable. It simply represents the change in expected salary for a given type of graduate as the graduate's final grade point average increases by one point.
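The combined shift for a red-haired male follows directly from substituting indicator values into the estimated salary model. As a short worked derivation, where $g$ is a symbol introduced here for the common grade point average:

$$
\begin{aligned}
\widehat{\text{SALARY}}_{\text{red male}} &= \hat{\beta}_0 + \hat{\beta}_1 g + \hat{\beta}_2(1) + \hat{\beta}_3(0) + \hat{\beta}_4(1) \\
\widehat{\text{SALARY}}_{\text{light female}} &= \hat{\beta}_0 + \hat{\beta}_1 g + \hat{\beta}_2(0) + \hat{\beta}_3(0) + \hat{\beta}_4(0) \\
\widehat{\text{SALARY}}_{\text{red male}} - \widehat{\text{SALARY}}_{\text{light female}} &= \hat{\beta}_2 + \hat{\beta}_4
\end{aligned}
$$

The constant and the GPA term cancel because both graduates share the same $g$, which is why the comparison is stated for graduates with the same GPA.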

4 Time Series Example

The Belgian Trucking Company^6 needs to determine the number of refrigerated^7 trucks to satisfy the transportation demand between Antwerp and Brussels on a daily basis. The demand for refrigerated trucks is dependent on the daily temperature (because more refrigerated vans are needed when the outside temperature increases), and it also appears that there are trend and daily seasonality elements present in the data. Six weeks' worth of data are provided in Table 1.

Since the data exhibit daily seasonality within each week, we need to create appropriate indicator variables to model this seasonality in the regression model. There are five workdays in each week, so we need four indicator variables. Let us choose Friday as our base day and create variables MON, TUE, WED, and THU. Table 2 contains the first two weeks of data observations, including the indicator variable values. The subsequent weeks of data are modified in the same fashion to obtain the data set on which we will fit the regression model.
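The day-of-week indicators can be generated from the TIME index alone. The sketch below assumes, consistent with Table 2, that TIME = 1 falls on a Monday and that only the five workdays are counted; the function name `day_indicators` is hypothetical.

```python
DAYS = ["MON", "TUE", "WED", "THU", "FRI"]  # FRI is the base day


def day_indicators(time_index):
    """Map a 1-based workday index to (MON, TUE, WED, THU); Friday gives all zeros."""
    day = DAYS[(time_index - 1) % 5]
    return tuple(int(day == d) for d in DAYS[:-1])


# Reproduce the indicator columns for the first two weeks (TIME = 1..10).
for t in range(1, 11):
    print(t, day_indicators(t))
```

Applying the same function to TIME = 11 through 30 would fill in the indicator values for the remaining four weeks.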

Regressing demand on all of the independent variables (2 quantitative and 4 indicator), we obtain the following estimated regression function.
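As a rough sketch of how such a fit might be carried out, the code below runs ordinary least squares with NumPy on the ten rows shown in Table 2. Because it uses only the first two weeks rather than the full six-week data set, its coefficients should not be expected to match the estimated regression function reported in these notes; the variable names are assumptions made for the example.

```python
import numpy as np

# First two weeks of data, transcribed from Table 2.
# Columns: TIME, TEMP, MON, TUE, WED, THU, DEMAND
table2 = np.array([
    [1, 13, 1, 0, 0, 0, 67],
    [2, 10, 0, 1, 0, 0, 54],
    [3, 9, 0, 0, 1, 0, 51],
    [4, 10, 0, 0, 0, 1, 46],
    [5, 10, 0, 0, 0, 0, 62],
    [6, 15, 1, 0, 0, 0, 65],
    [7, 14, 0, 1, 0, 0, 55],
    [8, 14, 0, 0, 1, 0, 47],
    [9, 15, 0, 0, 0, 1, 45],
    [10, 13, 0, 0, 0, 0, 64],
], dtype=float)

# Design matrix: a column of ones for the constant, then the six predictors.
X = np.hstack([np.ones((len(table2), 1)), table2[:, :6]])
y = table2[:, 6]

# Least-squares solution; rank reports the number of independent columns in X.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

names = ["const", "TIME", "TEMP", "MON", "TUE", "WED", "THU"]
for name, value in zip(names, coef):
    print(f"{name:>5}: {value: .3f}")
```

With Friday omitted as the base day, the design matrix has full column rank, so the least-squares solution is unique.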

$$\widehat{\text{DEMAND}} = 72.551 + 0.474\,\text{TIME} - 0.849\,\text{TEMP} + 5.476\,\text{MON} - 11.538\,\text{TUE} - 17.811\,\text{WED} - 22.143\,\text{THU}$$

(^6) This example is adapted from Ghiani, Laporte, and Musmanno's Introduction to Logistics Systems Planning and Control (2004; pg. 72), published by John Wiley & Sons.

(^7) These are also known as "reefer" trucks.
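To see how the estimated regression function is used, the following sketch evaluates it for two rows of Table 2. The coefficients are copied from the equation above; the function name `predict_demand` and the dictionary layout are illustrative choices, not part of the original notes.

```python
# Coefficients copied from the estimated regression function.
CONST, B_TIME, B_TEMP = 72.551, 0.474, -0.849
DAY_EFFECT = {"MON": 5.476, "TUE": -11.538, "WED": -17.811, "THU": -22.143}


def predict_demand(time, temp, day):
    """Evaluate the fitted model; day is 'MON'..'FRI', with FRI as the base day."""
    # The base day has no indicator, so it contributes no adjustment.
    return CONST + B_TIME * time + B_TEMP * temp + DAY_EFFECT.get(day, 0.0)


monday_week1 = predict_demand(1, 13, "MON")  # Table 2, TIME = 1 (observed demand 67)
friday_week1 = predict_demand(5, 10, "FRI")  # Table 2, TIME = 5 (observed demand 62)
print(round(monday_week1, 3), round(friday_week1, 3))
```

For these two rows the model predicts roughly 67.46 and 66.43 trucks, against observed demands of 67 and 62.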

All of the t-tests for the regression coefficients are significant at even the $\alpha = 0.01$ level, as is the overall F-test for the model as a whole. The measures of fit for this model are $R^2 = 0.9606$ and $R^2_{\text{adj}} = 0.9503$, which indicate that we have an extremely good fit.
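The two fit measures can be cross-checked against each other. The calculation below assumes the usual adjusted $R^2$ formula and a sample of $n = 30$ observations (six weeks of five workdays) with $k = 6$ independent variables; neither the formula nor $n$ is stated explicitly in the notes, so treat both as assumptions.

$$
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} = 1 - (1 - 0.9606)\,\frac{29}{23} \approx 1 - 0.0497 = 0.9503
$$

Under these assumptions the result agrees with the reported value.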

As we expected, the daily temperature is inversely related to the reefer demand. We would estimate that on any given day a one-degree (Celsius) increase in temperature results in a decrease of 0.849 trucks in the expected demand. Similarly, the regression coefficients for the indicator variables denote the change in expected demand for that day of the week compared with our base day of Friday.

The interpretation of the coefficient for the TIME variable is somewhat complicated here. We would ordinarily say that demand should increase by an average of 0.474 as each day goes by. The problem here, though, is that an increase of one day would correspond to a different level in the daily seasonality categorical variable, so we are unable to keep the level of the seasonal variable constant if we increase by one day.

To deal with this problem, we should consider what happens to the expected demand as the time changes by 5 days.^8 This allows us to keep the seasonal variable at a constant level. Consequently, it makes sense for us to say that moving a week forward in time results in an average increase in demand of $0.474 \times 5 = 2.37$ reefer trucks (as long as the temperature stays constant).
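Written out, the weekly comparison holds everything except TIME fixed. In the sketch below, $t$ is the workday index, $T$ the temperature, and $s_d$ the estimated day-of-week effect for day $d$ (zero for Friday); these symbols are introduced here for illustration.

$$
\begin{aligned}
\widehat{\text{DEMAND}}(t + 5, T, d) - \widehat{\text{DEMAND}}(t, T, d)
&= \bigl[72.551 + 0.474\,(t + 5) - 0.849\,T + s_d\bigr] - \bigl[72.551 + 0.474\,t - 0.849\,T + s_d\bigr] \\
&= 0.474 \times 5 = 2.37
\end{aligned}
$$

The day effect $s_d$ cancels only because $t$ and $t + 5$ fall on the same weekday, which is exactly why a five-day step is used instead of a one-day step.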

(^8) Note that we only increase by 5 instead of 7 because the weekend days are not included in the model’s trend formulation. We have modeled the situation as if weekends did not exist. (Sometimes that’s the way it feels since they go by so quickly!)