Understanding the Relationship Between X and Y through Regression Analysis, Study notes of Social Statistics and Data Analysis

These notes cover the concept of regression analysis, focusing on the use of scatterplots and linear regression to identify the relationship between two variables, x and y. They cover the calculation of regression lines, prediction equations, and error terms, as well as the importance of the least squares criterion in determining the 'best fit' line. They also discuss r-square and its role in measuring the proportion of variance in y that can be explained by its linear relationship with x.

Typology: Study notes

2011/2012

Uploaded on 01/23/2012

desmond
Chapter 6
Bivariate Correlation & Regression
6.1 Scatterplots and Regression Lines
6.2 Estimating a Linear Regression Equation
6.3 R-Square and Correlation
6.4 Significance Tests for Regression Parameters

Scatterplot: a positive relation

Visually display relation of two variables on X-Y coordinates

50 U.S. states: Y = per capita income, X = % of adults with a BA degree

Positive relation: increasing X related to higher values of Y

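[Scatterplot: per capita income (Y) against % of adults with a BA degree (X) for the 50 states, with CT and MS labeled]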

Summarize scatter by regression line

Use linear regression to estimate the "best-fit" line through the points:

How can we use sample data on the Y & X variables to estimate population parameters for the best-fitting line?

Slopes and intercepts

We learned in algebra that a line is uniquely located in a coordinate system by specifying: (1) its slope ("rise over run"); and (2) its intercept (where it crosses the Y-axis)

A bivariate linear relationship has the equation:

Y = a + bX

where:

b is slope

a is intercept

DRAW THESE 2 LINES (on X and Y axes running from 0 to 6):

Y = 0 + 2X

Y = 3 - 0.5X
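A quick way to check a hand-drawn line is to tabulate a few points first. The sketch below does that for the two lines above over X = 0 to 6; the helper name `line_value` is only illustrative and is not part of the notes.

```python
def line_value(a, b, x):
    """Return Y = a + b*X for intercept a and slope b."""
    return a + b * x

xs = list(range(0, 7))  # X = 0, 1, ..., 6, matching the axis in the exercise

# Line 1: Y = 0 + 2X (intercept 0, slope 2)
line_1 = [line_value(0, 2, x) for x in xs]
# Line 2: Y = 3 - 0.5X (intercept 3, slope -0.5)
line_2 = [line_value(3, -0.5, x) for x in xs]

for x, y1, y2 in zip(xs, line_1, line_2):
    print(f"X={x}  line 1: Y={y1}  line 2: Y={y2}")
```

Each line needs only two points to draw: the intercept at X = 0 and any one other point.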

Regression error

The regression error, or residual, for the ith case is the difference between the observed value of the dependent variable for that case and the value predicted by the regression equation. Subtract the prediction equation from the linear regression model to identify the ith case's error term:

Linear regression model:

$$Y_i = a + b_{YX} X_i + e_i$$

Prediction equation:

$$\hat{Y}_i = a + b_{YX} X_i$$

Error term:

$$Y_i - \hat{Y}_i = e_i$$

An analogy: In weather forecasting, an error is the difference between the weatherperson's predicted high temperature for today and the actual high temperature observed today: Observed temp 86° - Predicted temp 91° = Error -5°
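The same subtraction, observed minus predicted, can be sketched in a few lines. The weather numbers come from the analogy above; the prediction equation (a = 1.0, b = 2.0) and the three data points are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def residual(observed, predicted):
    """Error term e_i = Y_i - Y_hat_i (observed minus predicted)."""
    return observed - predicted

# Weather analogy from the notes: observed 86, predicted 91
weather_error = residual(86, 91)

# Hypothetical prediction equation and data, for illustration only
a, b = 1.0, 2.0
xs = [1.0, 2.0, 3.0]
ys = [3.5, 4.0, 7.5]

predicted = [a + b * x for x in xs]
residuals = [residual(y, y_hat) for y, y_hat in zip(ys, predicted)]

print(weather_error)
print(residuals)
```

A negative residual means the equation over-predicted that case; a positive residual means it under-predicted.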

The Least Squares criterion

Scatterplot for state Income & Education has a positive slope

Ordinary least squares (OLS) is a method for estimating the regression equation coefficients, the intercept (a) and slope (b), that minimize the sum of squared errors

To plot the regression line, we apply a criterion yielding the "best fit" of a line through the cloud of points
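The least squares idea can be checked numerically. The sketch below uses the standard closed-form OLS slope, b = sum of (X - mean X)(Y - mean Y) divided by sum of (X - mean X)^2, which is not shown in this part of the notes, together with the intercept formula a = mean Y - b * mean X. The five data points and the function names are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (a, b): the OLS intercept and slope for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared residuals for the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = ols_fit(xs, ys)
best = sum_squared_errors(a, b, xs, ys)

# Nudging either coefficient away from the OLS values raises the error sum
shifted_intercept = sum_squared_errors(a + 0.1, b, xs, ys)
shifted_slope = sum_squared_errors(a, b + 0.1, xs, ys)

print(a, b, best, shifted_intercept, shifted_slope)
```

Any other line through the same points gives a larger sum of squared errors than the OLS line, which is what "best fit" means under this criterion.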

OLS estimator of the intercept, a

The OLS estimator for the intercept (a) simply changes the mean of Y (the dependent variable) by an amount equaling the regression slope's effect for the mean of X:

$$a = \bar{Y} - b\bar{X}$$

Two important facts arise from this relation:

(1) The regression line always goes through the point of both variables' means!

(2) When the regression slope is zero, for every X we only predict that Y equals the intercept a, which is also the mean of the dependent variable!

$$b_{YX} = 0 \quad\Rightarrow\quad a = \bar{Y}$$
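To see why fact (1) holds, substitute the mean of X into the prediction equation and use the intercept formula from this slide:

$$\hat{Y} = a + b\bar{X} = (\bar{Y} - b\bar{X}) + b\bar{X} = \bar{Y}$$

So the predicted value at $X = \bar{X}$ is always $\bar{Y}$. Fact (2) is the special case $b = 0$, where $a = \bar{Y} - 0 \cdot \bar{X} = \bar{Y}$.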

Use these two bivariate regression equations, estimated from the 50 States data, to calculate some predicted values:

$$\hat{Y}_i = a + b_{YX} X_i$$

  1. Regress income on bachelor's degree:

$$\hat{Y}_i = \$9.9 + 0.77 X_i$$

What predicted incomes for:

Xi = 12%: Y=____________ Xi = 28%: Y=____________

  2. Regress poverty percent on female labor force percent:

$$\hat{Y}_i = 45.2\% - 0.53 X_i$$

What predicted poverty % for: Xi = 55%: Y=____________ Xi = 70%: Y=____________
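Getting a predicted value is just a matter of plugging X into the prediction equation. The sketch below assumes the income equation reads Y-hat = 9.9 + 0.77 X (intercept 9.9 in the notes' dollar units, slope 0.77) and uses a hypothetical X of 20%, which is not one of the exercise values. The function and constant names are illustrative.

```python
def predict(a, b, x):
    """Predicted Y for a case with score x: Y_hat = a + b*x."""
    return a + b * x

# Coefficients as read from the income-on-education equation (assumed)
INCOME_INTERCEPT = 9.9
INCOME_SLOPE = 0.77

# Hypothetical state where 20% of adults hold a BA degree
x_hypothetical = 20
predicted_income = predict(INCOME_INTERCEPT, INCOME_SLOPE, x_hypothetical)

print(round(predicted_income, 2))
```

The exercise values can be checked the same way by changing `x_hypothetical`.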

Errors in regression prediction

Every regression line through a scatterplot also passes through the means of both variables; i.e., the point $(\bar{X}, \bar{Y})$

We can use this relationship to divide the variance of Y into a double deviation from:

(1) the regression line

(2) the Y-mean line

Then calculate a sum of squares that reveals how strongly Y is predicted by X.

Illinois double deviation

In the Income-Education scatterplot, show the difference between the mean and Illinois' (IL) Y-score as the sum of two deviations:

$$Y_i - \bar{Y} = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$

$Y_i - \hat{Y}_i$: Error deviation of observed and predicted scores

$\hat{Y}_i - \bar{Y}$: Regression deviation of predicted score from the mean
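The double deviation is easy to verify with numbers. The values below are hypothetical (they are not Illinois' actual scores) and the function name is illustrative; the point is only that the two pieces always add back up to the total deviation from the mean.

```python
def double_deviation(y_observed, y_predicted, y_mean):
    """Split (Y_i - mean) into an error part and a regression part."""
    error_deviation = y_observed - y_predicted        # Y_i - Y_hat_i
    regression_deviation = y_predicted - y_mean       # Y_hat_i - mean of Y
    return error_deviation, regression_deviation

# Hypothetical case: observed 30.0, predicted 27.0, mean of Y 25.0
error_dev, regression_dev = double_deviation(30.0, 27.0, 25.0)
total_dev = 30.0 - 25.0

print(error_dev, regression_dev, total_dev)
```

Here the case sits 5 units above the mean: 2 of those units are accounted for by the regression line and 3 are left over as error.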

Naming the sums of squares

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

Each result of the preceding partition has a name:

TOTAL sum of squares

REGRESSION sum of squares

ERROR sum of squares

SSTOTAL = SSERROR + SSREGRESSION

The relative proportions of the two terms on the right indicate how well or poorly we can predict the variance in Y from its linear relationship with X

The SSTOTAL should be familiar to you: it's the numerator of the variance of Y (see the Notes for Chapter 2). When we partition the sum of squares into the two components, we're analyzing the variance of the dependent variable in a regression equation.

Hence, this method is called the analysis of variance or ANOVA.
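The partition SSTOTAL = SSERROR + SSREGRESSION can be confirmed on a small example. The sketch below fits an OLS line using the standard closed-form slope (not shown in this part of the notes) and then computes the three sums of squares; the data and function names are hypothetical.

```python
def ols_fit(xs, ys):
    """Return (a, b): the OLS intercept and slope for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def sums_of_squares(xs, ys):
    """Return (SS_total, SS_error, SS_regression) for the OLS line."""
    a, b = ols_fit(xs, ys)
    y_bar = sum(ys) / len(ys)
    y_hat = [a + b * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_error = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    return ss_total, ss_error, ss_regression

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

ss_total, ss_error, ss_regression = sums_of_squares(xs, ys)
print(ss_total, ss_error, ss_regression)
```

The identity holds for the OLS line specifically; for an arbitrary line through the points the error and regression sums generally do not add up to the total.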

Coefficient of Determination

If we had no knowledge about the regression slope (i.e., bYX = 0 and thus SSREGRESSION = 0), then our only prediction is that the score of Y for every case equals the mean (which also equals the equation's intercept a; see slide #10 above).

But, if bYX ≠ 0, then we can use information about the ith case's score on X to improve our predicted Y for case i. We'll still make errors, but the stronger the Y-X linear relationship, the more accurate our predictions will be.

$$\hat{Y}_i = a + b_{YX} X_i$$

$$\hat{Y}_i = a + (0) X_i$$

$$\hat{Y}_i = a$$
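The coefficient of determination puts a number on this improvement. Using the standard definition, R^2 = SSREGRESSION / SSTOTAL, it is the proportion of the variance in Y explained by its linear relationship with X. The sketch below also fills in a missing sum of squares from the other two, using SSTOTAL = SSERROR + SSREGRESSION; the function names and the input values are hypothetical.

```python
def complete_partition(ss_regression=None, ss_error=None, ss_total=None):
    """Fill in whichever one of the three sums of squares is missing."""
    if ss_total is None:
        ss_total = ss_regression + ss_error
    elif ss_regression is None:
        ss_regression = ss_total - ss_error
    elif ss_error is None:
        ss_error = ss_total - ss_regression
    return ss_regression, ss_error, ss_total

def r_squared(ss_regression, ss_total):
    """Proportion of the variance in Y explained by X."""
    return ss_regression / ss_total

# Hypothetical sums of squares, for illustration only
ss_reg, ss_err, ss_tot = complete_partition(ss_regression=30.0, ss_error=70.0)
r2 = r_squared(ss_reg, ss_tot)

print(ss_tot, r2)
```

R^2 runs from 0 (the slope is zero, so X adds nothing beyond the mean of Y) to 1 (every point lies exactly on the regression line).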

Find the R^2 for these 50-States bivariate regression equations

  1. R-square for regression of income on education

SSREGRESSION = 409. SSERROR = 342. SSTOTAL = 751.

R^2 = _________

  2. R-square for poverty-female labor force equation

SSREGRESSION = ______ SSERROR = 321. SSTOTAL = 576.

R^2 = _________

Here are some R^2 problems from the 2008 GSS

  1. R-square for church attendance regressed on age

SSREGRESSION = 67, SSERROR = 2,861, SSTOTAL = _________

R^2 = _________

  2. R-square for sex frequency-age equation

SSREGRESSION = 1,511, SSERROR = _____________, SSTOTAL = 10,502

R^2 = _________