









Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
A lecture file from math 243, where n. Christopher phillips explains regression analysis through examples and calculations. Topics such as regression lines, correlation, and the importance of understanding outliers and lurking variables.
Typology: Study notes
1 / 16
This page cannot be seen from the preview
Don't miss anything!










N. Christopher Phillips
9 April 2009
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 1 / 61
The regression line depends on which variable is the explanatory variable. See Example 5.3 in the book for an example with real data. Here is a more dramatic example with fictitious data.
Data: (2, 4), (5, 10), (8, 4). (Again, just three points.)
0 2 4 6 8 10
2
4
6
8
10
12
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 2 / 61
Data: (2, 4), (5, 10), (8, 4). The correlation is r = 0. So the regression line has slope zero, and turns out to be
̂ y = 6.
0 2 4 6 8 10
2
4
6
8
10
12
Exchange the explanatory and response variables. The data was: (2, 4), (5, 10), (8, 4). It is now: (4, 2), (10, 5), (4, 8).
0 2 4 6 8 10 12
2
4
6
8
10
Data with explanatory and response variables switched: (4, 2), (10, 5), (4, 8). The correlation is still r = 0. So the regression line again has slope zero, and turns out to be ̂ y = 5.
0 2 4 6 8 10 12
2
4
6
8
10
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 5 / 61
Data: (2, 4), (5, 10), (8, 4).
Both regression lines on the same plot, with the original choice of explanatory variable:
0 2 4 6 8 10
2
4
6
8
10
12
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 6 / 61
It is easy to calculate, compared to other ways of trying to fit a line to data. It is suitable for data whose deviations from lying on a straight line are likely to be approximately normally distributed. It is not suitable for nonlinear relationships. It has the same problems that the mean and standard deviation do. For example, it is not resistant.
Here are the scatterplots from the lecture of 7 April, with regression lines added. (Example 9, with its poor choice of scale on the vertical axis, has been omitted.)
Observe that sometimes the line has little to do with the pattern in the scatterplot, and other times it is closely related.
2 4 6 8 10 12 14
2
4
6
8
10
12
14
Clear positive association, roughly linear. Regression line: ̂y ≈ 1 .63308 + 0. 789497 x. Correlation r ≈ 0. 915883 r 2 ≈ 0 .838842: The change in x explains about 84% of the change in y. N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 13 / 61
2 4 6 8 10 12 14
2
4
6
8
10
12
14
Clear positive association, roughly linear, one outlier. (The other points are as in Example 1.)
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 14 / 61
Correlation r ≈ 0. 671778. (It was r ≈ 0 .915883 without the outlier.)
r 2 ≈ 0 .451285: The change in x explains about 45% of the change in y. (Without the outlier, r 2 ≈ 0. 838842 .)
Regression line: ̂y ≈ 2 .57551 + 0. 595541 x. (It was ̂ y ≈ 1 .63308 + 0. 789497 x without the outlier.)
Comparing the graphs, one sees that the regression line did not move a great deal.
The outlier is influential for r and r 2 , but it is not influential for the regression line.
However, regression lines are not resistant: outliers in other locations can be influential.
2 4 6 8 10 12 14
2
4
6
8
10
12
14
Very strong positive association, linear. Regression line: ̂y ≈ 0 .425816 + 0. 937028 x. Correlation r ≈ 0. 991102
r 2 ≈ 0 .982284: The change in x explains about 98% of the change in y.
5 10 15 20
5
10
15
Weak positive association. Regression line: ̂y ≈ 5 .05521 + 0. 463938 x. Correlation r ≈ 0. 483459 r 2 ≈ 0 .233732: The change in x explains about 23% of the change in y.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 17 / 61
2 4 6 8 10 12 14
5
10
15
Strong negative association, roughly linear. Regression line: ̂y ≈ 16. 3048 − 1. 06155 x.
Correlation r ≈ − 0. 950719 r 2 ≈ 0 .903866: The change in x explains about 90% of the change in y. N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 18 / 61
2 4 6 8 10 12 14
2
4
6
8
10
12
14
Very strong negative association, linear. Regression line: ̂y ≈ 13. 5742 − 0. 937028 x. Correlation r ≈ − 0. 991102 r 2 ≈ 0 .982284: The change in x explains about 98% of the change in y.
2.5 5 7.5 10 12.5 15
2
4
6
8
10
12
14
16
Weak negative association. Regression line: ̂y ≈ 11. 553 − 0. 463938 x. Correlation r ≈ − 0. 483459 r 2 ≈ 0 .233732: The change in x explains about 23% of the change in y.
2 4 6 8 10 12 14
5
10
15
20
Strong nonlinear negative association. Regression line: ̂y ≈ 23. 2928 − 1. 25295 x.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 25 / 61
The regression line does not match the pattern very well. Correlation r ≈ − 0. 913696 r 2 ≈ 0 .834841: The change in x explains about 83% of the linear change in y.
r 2 understates the strength of the pattern, because the association is nonlinear.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 26 / 61
2 4 6 8 10 12 14
5
10
15
20
Very strong nonlinear negative association. Regression line: ̂y ≈ 23. 0764 − 1. 21755 x.
As in Example 12, the regression line does not match the pattern very well. The pattern is stronger, but r and r 2 are similar to that example.
Which kinds of points are likely to be influential for the correlation? Which kinds of points are likely to be influential for the regression line?
r 2 ≈ 0. 623402.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 29 / 61
Original data, with more white space (to allow room to add outliers):
r 2 ≈ 0. 623402.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 30 / 61
r 2 ≈ 0 .61667: little change from the original value of 0. 623402.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 37 / 61
r 2 ≈ 0 .442955: somewhat less than the original value of 0. 623402.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 38 / 61
r 2 ≈ 0 .781681: considerably bigger than the original value of 0. 623402. The red point is influential for r 2.
r 2 ≈ 0 .781681: considerably bigger than the original value of 0. 623402. The red point is influential for r 2.
This is a lot to have rest on just one data point. See the discussion on pages 129–131 for a similar situation with real data. Warning: You should be suspicious of any outcome involving an influential point.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 41 / 61
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 42 / 61
r 2 ≈ 0 .391031: considerably smaller than the original value of 0. 623402.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 49 / 61
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 50 / 61
A change in one standard deviation in x gives a change of r standard deviations in ̂y. Thus, the closer the correlation is to zero, the less ̂y responds to changes in x. r 2 measures the success of the regression. (See earlier lectures.) The regression line goes through (x, y ) and has slope rsy /sx. (This was stated before.)
The regression line is not resistant. See examples above. Be very careful with extrapolation! (Examples below.) Caution: Lurking variables can spoil conclusions based on regression. (Examples below.) Caution: Association does not imply causation. (Examples below.)
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 53 / 61
Regression on the TI-83 calculator: See earlier notes.
Caution on notation: Our book uses
̂ y = a + bx,
as does its picture of the screen of the TI-83 calculator. My TI- calculator (several years old) has
ŷ = ax + b.
Make sure you pay attention to which coefficient is which.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 54 / 61
We find a regression line to predict the height of a child between 0 and 10 years old from the child’s age. Will the resulting prediction for the height of a 25 year old be reasonable? What about a 50 year old?
No. People stop growing in their teens.
A regression line for the best time in the 50 yard dash each year, with x being the year, predicts that eventually the winning time will be negative.
A lurking variable is one which has an important effect on the relationships among the variables studied but which itself is not studied.
Example: Smoking causes lung cancer.
Observational studies show, for example, that people who smoke are more likely to get lung cancer, even that, say, 60–65 year old men who smoke are more likely to get lung cancer that 60–65 year old men who don’t smoke.
Experiments show that rats which smoke are more likely to get lung cancer that rats which don’t smoke. Here, the rats are divided in two groups, and rats in one group are subjected to cigarette smoke while the rats in the other group are not. We can’t do this with people.
See the end of Chapter 5 for a number of reasons which, taken together, show that the evidence that smoking causes lung cancer in people is very strong even without doing experiments on people.
N. Christopher Phillips () Math 243: Lecture File 4 9 April 2009 61 / 61