Regression Analysis: Lecture File 4 by N. Christopher Phillips - Prof. N. Phillips, Study notes of Probability and Statistics

A lecture file from math 243, where n. Christopher phillips explains regression analysis through examples and calculations. Topics such as regression lines, correlation, and the importance of understanding outliers and lurking variables.

Typology: Study notes

Pre 2010

Uploaded on 07/29/2009

koofers-user-ok3
koofers-user-ok3 🇺🇸

9 documents

1 / 16

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Math 243: Lecture File 4
N. Christopher Phillips
9 April 2009
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 1 / 61
Regression line: Example 2
The regression line depends on which variable is the explanatory variable.
See Example 5.3 in the book for an example with real data. Here is a more
dramatic example with fictitious data.
Data: (2,4),(5,10),(8,4).(Again, just three points.)
0
2
4
6
8
10
2
4
6
8
10
12
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 2 / 61
Example 2 (continued)
Data: (2,4),(5,10),(8,4).
The correlation is r= 0.
So the regression line has slope zero, and turns out to be
b
y= 6.
0
2
4
6
8
10
2
4
6
8
10
12
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 3 / 61
Example 2 (continued)
Exchange the explanatory and response variables.
The data was: (2,4),(5,10),(8,4).
It is now: (4,2),(10,5),(4,8).
0
2
4
6
8
10
12
2
4
6
8
10
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 4 / 61
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff

Partial preview of the text

Download Regression Analysis: Lecture File 4 by N. Christopher Phillips - Prof. N. Phillips and more Study notes Probability and Statistics in PDF only on Docsity!

Math 243: Lecture File 4

N. Christopher Phillips

9 April 2009

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 1 / 61

Regression line: Example 2

The regression line depends on which variable is the explanatory variable. See Example 5.3 in the book for an example with real data. Here is a more dramatic example with fictitious data.

Data: (2, 4), (5, 10), (8, 4). (Again, just three points.)

0 2 4 6 8 10

2

4

6

8

10

12

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 2 / 61

Example 2 (continued)

Data: (2, 4), (5, 10), (8, 4). The correlation is r = 0. So the regression line has slope zero, and turns out to be

̂ y = 6.

0 2 4 6 8 10

2

4

6

8

10

12

Example 2 (continued)

Exchange the explanatory and response variables. The data was: (2, 4), (5, 10), (8, 4). It is now: (4, 2), (10, 5), (4, 8).

0 2 4 6 8 10 12

2

4

6

8

10

Example 2 (continued)

Data with explanatory and response variables switched: (4, 2), (10, 5), (4, 8). The correlation is still r = 0. So the regression line again has slope zero, and turns out to be ̂ y = 5.

0 2 4 6 8 10 12

2

4

6

8

10

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 5 / 61

Example 2 (continued)

Data: (2, 4), (5, 10), (8, 4).

Both regression lines on the same plot, with the original choice of explanatory variable:

0 2 4 6 8 10

2

4

6

8

10

12

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 6 / 61

Some good and bad points about the least squares

regression line

It is easy to calculate, compared to other ways of trying to fit a line to data. It is suitable for data whose deviations from lying on a straight line are likely to be approximately normally distributed. It is not suitable for nonlinear relationships. It has the same problems that the mean and standard deviation do. For example, it is not resistant.

Regression lines for previously displayed scatterplots

Here are the scatterplots from the lecture of 7 April, with regression lines added. (Example 9, with its poor choice of scale on the vertical axis, has been omitted.)

Observe that sometimes the line has little to do with the pattern in the scatterplot, and other times it is closely related.

Example 1.

2 4 6 8 10 12 14

2

4

6

8

10

12

14

Clear positive association, roughly linear. Regression line: ̂y ≈ 1 .63308 + 0. 789497 x. Correlation r ≈ 0. 915883 r 2 ≈ 0 .838842: The change in x explains about 84% of the change in y. N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 13 / 61

Example 2.

2 4 6 8 10 12 14

2

4

6

8

10

12

14

Clear positive association, roughly linear, one outlier. (The other points are as in Example 1.)

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 14 / 61

Example 2 (continued).

Correlation r ≈ 0. 671778. (It was r ≈ 0 .915883 without the outlier.)

r 2 ≈ 0 .451285: The change in x explains about 45% of the change in y. (Without the outlier, r 2 ≈ 0. 838842 .)

Regression line: ̂y ≈ 2 .57551 + 0. 595541 x. (It was ̂ y ≈ 1 .63308 + 0. 789497 x without the outlier.)

Comparing the graphs, one sees that the regression line did not move a great deal.

The outlier is influential for r and r 2 , but it is not influential for the regression line.

However, regression lines are not resistant: outliers in other locations can be influential.

Example 3.

2 4 6 8 10 12 14

2

4

6

8

10

12

14

Very strong positive association, linear. Regression line: ̂y ≈ 0 .425816 + 0. 937028 x. Correlation r ≈ 0. 991102

r 2 ≈ 0 .982284: The change in x explains about 98% of the change in y.

Example 4.

5 10 15 20

5

10

15

Weak positive association. Regression line: ̂y ≈ 5 .05521 + 0. 463938 x. Correlation r ≈ 0. 483459 r 2 ≈ 0 .233732: The change in x explains about 23% of the change in y.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 17 / 61

Example 5.

2 4 6 8 10 12 14

5

10

15

Strong negative association, roughly linear. Regression line: ̂y ≈ 16. 3048 − 1. 06155 x.

Correlation r ≈ − 0. 950719 r 2 ≈ 0 .903866: The change in x explains about 90% of the change in y. N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 18 / 61

Example 6.

2 4 6 8 10 12 14

2

4

6

8

10

12

14

Very strong negative association, linear. Regression line: ̂y ≈ 13. 5742 − 0. 937028 x. Correlation r ≈ − 0. 991102 r 2 ≈ 0 .982284: The change in x explains about 98% of the change in y.

Example 7.

2.5 5 7.5 10 12.5 15

2

4

6

8

10

12

14

16

Weak negative association. Regression line: ̂y ≈ 11. 553 − 0. 463938 x. Correlation r ≈ − 0. 483459 r 2 ≈ 0 .233732: The change in x explains about 23% of the change in y.

Example 12.

2 4 6 8 10 12 14

5

10

15

20

Strong nonlinear negative association. Regression line: ̂y ≈ 23. 2928 − 1. 25295 x.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 25 / 61

Example 12 (continued).

The regression line does not match the pattern very well. Correlation r ≈ − 0. 913696 r 2 ≈ 0 .834841: The change in x explains about 83% of the linear change in y.

r 2 understates the strength of the pattern, because the association is nonlinear.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 26 / 61

Example 13.

2 4 6 8 10 12 14

5

10

15

20

Very strong nonlinear negative association. Regression line: ̂y ≈ 23. 0764 − 1. 21755 x.

As in Example 12, the regression line does not match the pattern very well. The pattern is stronger, but r and r 2 are similar to that example.

Outliers and influential points

Which kinds of points are likely to be influential for the correlation? Which kinds of points are likely to be influential for the regression line?

Original data

r 2 ≈ 0. 623402.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 29 / 61

Original data, with more white space (to allow room to add outliers):

r 2 ≈ 0. 623402.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 30 / 61

Example 1.

Example 1 (continued).

r 2 ≈ 0 .61667: little change from the original value of 0. 623402.

Example 4.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 37 / 61

Example 4 (continued).

r 2 ≈ 0 .442955: somewhat less than the original value of 0. 623402.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 38 / 61

Example 5.

Example 5 (continued).

r 2 ≈ 0 .781681: considerably bigger than the original value of 0. 623402. The red point is influential for r 2.

Example 5 (continued).

r 2 ≈ 0 .781681: considerably bigger than the original value of 0. 623402. The red point is influential for r 2.

This is a lot to have rest on just one data point. See the discussion on pages 129–131 for a similar situation with real data. Warning: You should be suspicious of any outcome involving an influential point.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 41 / 61

Example 6.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 42 / 61

Example 6 (continued).

r 2 ≈ 0 .391031: considerably smaller than the original value of 0. 623402.

Example 7.

Residual plot for the previous example

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 49 / 61

Example whose residuals show no pattern.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 50 / 61

Residual plot for the previous example

Facts about the regression line (continued).

A change in one standard deviation in x gives a change of r standard deviations in ̂y. Thus, the closer the correlation is to zero, the less ̂y responds to changes in x. r 2 measures the success of the regression. (See earlier lectures.) The regression line goes through (x, y ) and has slope rsy /sx. (This was stated before.)

The regression line is not resistant. See examples above. Be very careful with extrapolation! (Examples below.) Caution: Lurking variables can spoil conclusions based on regression. (Examples below.) Caution: Association does not imply causation. (Examples below.)

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 53 / 61

Regression on the TI-83 calculator: See earlier notes.

Caution on notation: Our book uses

̂ y = a + bx,

as does its picture of the screen of the TI-83 calculator. My TI- calculator (several years old) has

ŷ = ax + b.

Make sure you pay attention to which coefficient is which.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 54 / 61

Hazard: extrapolation.

We find a regression line to predict the height of a child between 0 and 10 years old from the child’s age. Will the resulting prediction for the height of a 25 year old be reasonable? What about a 50 year old?

No. People stop growing in their teens.

A regression line for the best time in the 50 yard dash each year, with x being the year, predicts that eventually the winning time will be negative.

Hazard: Lurking variables.

A lurking variable is one which has an important effect on the relationships among the variables studied but which itself is not studied.

Example: Smoking causes lung cancer.

Observational studies show, for example, that people who smoke are more likely to get lung cancer, even that, say, 60–65 year old men who smoke are more likely to get lung cancer that 60–65 year old men who don’t smoke.

Experiments show that rats which smoke are more likely to get lung cancer that rats which don’t smoke. Here, the rats are divided in two groups, and rats in one group are subjected to cigarette smoke while the rats in the other group are not. We can’t do this with people.

See the end of Chapter 5 for a number of reasons which, taken together, show that the evidence that smoking causes lung cancer in people is very strong even without doing experiments on people.

N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 61 / 61