Understanding R-squared Value and Adjusted R-squared in Regression Analysis, Lecture notes of Business Statistics

An in-depth explanation of the R-squared value and its limitations in evaluating the fit of a regression model. The notes also introduce adjusted R-squared, discuss how it differs from R-squared, and cover the importance of considering the number of independent variables in a model and the relationship between R-squared and adjusted R-squared.

Typology: Lecture notes

2021/2022

Uploaded on 08/05/2022

nguyen_99


R-Squared Notes:

So far, we have not focused on the R-squared value to evaluate how “well” our model fits the data. Why? Because too much emphasis can be placed on this particular measure, and if you go on to study “time-series” data, you will see that the R-squared value can be extremely misleading.

Things to note:

• There is no threshold value that R-squared must reach before you can claim that your model does a good job of explaining the variation in the dependent variable. It is simply an estimate of how much variation can be explained.

• A small R-squared value implies that the error variance is large relative to the variance of y, which means that we may have a hard time precisely estimating the $\beta$ coefficients. BUT, this can be offset by a large sample size. This is true even if we have not controlled for many unobserved factors, which is what leads to the large error term. EXAMPLE: suppose that some incoming students at a large university are RANDOMLY given grants to buy computer equipment. If the amount of the grant is truly randomly determined, we can estimate the ceteris paribus effect of the grant amount on subsequent college grade point average by using simple regression analysis. Because of the random assignment, all of the other factors affecting GPA would be UNCORRELATED with the grant size. Now, it seems pretty unlikely that grant size would explain very much of the variation in GPA, so the R-squared from this simple regression would probably be pretty low, BUT we might still (with a large enough N) get a reasonably precise estimator for the effect of the grant. (NOTE: we don’t need to worry about omitted variable bias since all the omitted variables would be uncorrelated with the grant size!)

• The relative CHANGE in the R-squared value when variables are added to an equation provides A LOT OF USEFUL INFORMATION. This is related to the joint F-tests that we talked about earlier in testing joint restrictions.
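
The grant example in the second note can be checked with a small simulation. The sketch below is illustrative only: the grant range, the true effect of 0.02 GPA points per unit of grant, the noise level, and the function names `simple_ols` and `simulate_grants` are all assumptions made for the demonstration, not values from these notes.

```python
import random


def simple_ols(x, y):
    """Simple regression of y on x; returns (slope, slope std. error, R-squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    # Unbiased error-variance estimate uses N - k - 1 = N - 2 here.
    se_slope = (rss / (n - 2) / sxx) ** 0.5
    return slope, se_slope, r2


def simulate_grants(n, true_effect=0.02, seed=0):
    """Hypothetical data: randomly assigned grants, GPA driven mostly by noise."""
    rng = random.Random(seed)
    grant = [rng.uniform(0, 10) for _ in range(n)]  # assumed grant range
    gpa = [2.8 + true_effect * g + rng.gauss(0, 0.5) for g in grant]
    return simple_ols(grant, gpa)


for n in (200, 20000):
    slope, se, r2 = simulate_grants(n)
    print(f"N = {n}: slope = {slope:.4f}, std. error = {se:.4f}, R2 = {r2:.4f}")
```

Under these assumptions the R-squared stays low at both sample sizes, while the standard error of the slope shrinks as N grows, which is the point of the example.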

R-squared and Adjusted R-squared Value: what happens when we add regressors to our equation.

• Recall that R-squared is the ratio of the explained SS to the total SS, or:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{RSS/N}{TSS/N}$$

Now, why is it helpful to write R-squared in this fashion? Think about the following: let $\sigma_y^2$ be the population variance of y (unobserved by us) and $\sigma_\varepsilon^2$ be the population variance of the random disturbance term (again, unobserved by us). Define the POPULATION R-squared to be:

$$R^2_{\text{pop}} = 1 - \frac{\sigma_\varepsilon^2}{\sigma_y^2}$$

which tells us the proportion of the variation of y in the population explained by the independent variables. But we don’t observe the population variances. So, we can use estimators for them:

$$R^2 = 1 - \frac{RSS/N}{TSS/N}$$

Okay: so RSS/N is our ESTIMATOR for $\sigma_\varepsilon^2$ and TSS/N is our estimator for $\sigma_y^2$ in the “usual” R-squared. That is, the usual R-squared is an estimator for the POPULATION R-squared. BUT WE KNOW THAT BOTH OF THESE ESTIMATORS ARE BIASED (numerator and denominator). We can instead use unbiased estimators for $\sigma_\varepsilon^2$ and $\sigma_y^2$. In particular, we could use:

RSS/(N-k-1) and TSS/(N-1),

where N is the sample size and k is the number of independent variables.

If we do this, we can get an ADJUSTED R-squared value that is given by:

$$\bar{R}^2 = 1 - \frac{RSS/(N-k-1)}{TSS/(N-1)}$$

(Recall that our UNBIASED estimator for the variance of the error term is RSS/(N-k-1).)

BUT: something to keep in mind is that the ratio of unbiased estimators DOES NOT LEAD TO AN UNBIASED ESTIMATOR. And, in fact, the adjusted R-squared is not generally thought to be a better estimator for the population R-squared than the usual R-squared value.
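
As a quick numerical check of the two definitions, the sketch below computes the usual R-squared, 1 - (RSS/N)/(TSS/N), and the adjusted R-squared, 1 - [RSS/(N-k-1)]/[TSS/(N-1)], from the same sums of squares. The function names and the input numbers (RSS = 40, TSS = 100, N = 25, k = 4) are hypothetical and chosen only for illustration.

```python
def r_squared(rss, tss):
    """Usual R-squared: 1 - (RSS/N)/(TSS/N); the N's cancel."""
    return 1.0 - rss / tss


def adjusted_r_squared(rss, tss, n, k):
    """Adjusted R-squared, using the degrees-of-freedom corrected variance estimators."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than parameters (N > k + 1)")
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))


# Hypothetical sums of squares, sample size, and number of regressors.
rss, tss, n, k = 40.0, 100.0, 25, 4
print("R-squared:         ", r_squared(rss, tss))
print("Adjusted R-squared:", adjusted_r_squared(rss, tss, n, k))
```

With these inputs the adjusted value comes out below the usual one, because RSS is divided by the smaller N-k-1 while TSS is divided by N-1.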

So, how do the adjusted and regular R-squared differ?

  1. The adjusted R-squared value takes into account the number of INDEPENDENT variables in the model, whereas the regular R-squared does not.
  2. In fact, if we add a new independent variable to our model, the adjusted R-squared value will go up ONLY if the t-statistic on the coefficient estimator of the new variable is GREATER THAN ONE in absolute value. (If you add MORE THAN ONE independent variable, the adjusted R-squared will go up only if the F-statistic for the JOINT SIGNIFICANCE of all the new variables is greater than one.) SO: this is a weaker criterion than the individual t-test or the F-test alone, since at the usual levels of significance we reject the null only when the test statistic is well ABOVE one.
  3. The relationship between the adjusted and regular R-squared values is:

     $$\bar{R}^2 = 1 - (1 - R^2)\,\frac{N-1}{N-k-1}$$
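
The “greater than one” threshold in item 2 follows from a short calculation. The sketch below assumes a single regressor is added, so that the larger model has k regressors and residual sum of squares $RSS_{ur}$, the smaller model has k-1 regressors and residual sum of squares $RSS_r$, and the standard identity $t^2 = F$ for a single restriction applies. The subscripts are labels introduced here for illustration. Because TSS is the same in both models, the adjusted R-squared rises exactly when RSS divided by its degrees of freedom falls.

```latex
\begin{align*}
t^2 = F &= \frac{RSS_r - RSS_{ur}}{RSS_{ur}/(N-k-1)}
  \;\Longrightarrow\;
  RSS_r = RSS_{ur}\left(1 + \frac{t^2}{N-k-1}\right) \\
\bar{R}^2 \text{ rises}
  &\iff \frac{RSS_{ur}}{N-k-1} < \frac{RSS_r}{N-k} \\
  &\iff \frac{N-k}{N-k-1} < 1 + \frac{t^2}{N-k-1} \\
  &\iff t^2 > 1 .
\end{align*}
```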

Okay: so, now why would we ever look at the adjusted R-squared value and not the R-squared value?

Using the Adjusted R-squared to Choose Between Non-nested Models.

R-squared will ALWAYS go up if you add RHS variables. Why? Because the RSS can never go up when you add additional variables to your equation. And, if that’s so, looking at the R-squared alone and whether it goes up doesn’t tell you if you’ve got a “better” model.
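
A small experiment makes this concrete. The sketch below is a minimal illustration on simulated data, not part of the original notes: y depends on one real regressor, and pure-noise “junk” regressors are then added one at a time. The helper names `solve` and `fit_stats`, the sample size, and the data-generating process are all assumptions made for the demonstration.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_stats(columns, y):
    """OLS of y on an intercept plus the given columns; returns (R2, adjusted R2)."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    y_bar = sum(y) / n
    rss = sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1.0 - rss / tss
    adj = 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))
    return r2, adj


rng = random.Random(1)
n = 50
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]          # assumed true model
junk = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(5)]  # unrelated to y

results = []
for extra in range(len(junk) + 1):
    r2, adj = fit_stats([x] + junk[:extra], y)
    results.append((r2, adj))
    print(f"{1 + extra} regressors: R2 = {r2:.4f}, adjusted R2 = {adj:.4f}")
```

The R-squared column never decreases as junk regressors are added, while the adjusted R-squared is free to fall, which is why the raw R-squared cannot by itself signal a “better” model.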

trying to show how much of the variation in the LHS variable is explained by the data. But Var(y) and Var(ln y) are going to be DIFFERENT. So, this just doesn’t make sense.