Regression: Understanding Correlation and Predicting Values, Exams of Statistics

An introduction to regression analysis, focusing on simple linear regression. It covers the definition of regression, the calculation of the regression line, and the use of the normal curve to estimate percentile ranks. The document also explains how to predict individual values using the regression method and discusses the regression effect and errors.

Typology: Exams

Pre 2010

Uploaded on 08/19/2009

koofers-user-6tj
Regression
The idea behind the calculation of the coefficient of correlation is
that the scatterplot of the data corresponds to a cloud that follows
a straight line. This idea can be formalized by regression methods.
In this class we will:
Consider the definition of simple linear regression
Find a method to predict an individual value
Use the normal curve to estimate the percentile ranks
Describe the regression effect
Compute the regression errors and their RMS
Study the behavior of regression errors
AMS-5: Statistics
Regression
The regression method describes how one variable depends on
another.
[Figure: scatter plot, August Temperatures vs Elevation in Northern
California; horizontal axis: elevation (feet), vertical axis:
temperature (degrees).]
The Northern California temperature data have an average altitude of
3,524 feet and an SD of 1,839 feet, and an average temperature of
70.3 degrees and an SD of 6.5 degrees. The correlation between
temperature and altitude is -0.76.
The cloud of points shows a mild negative association between the
two variables, as does the value of r. Can we use the values of
altitude to estimate the average values of temperature?
The regression line for y on x estimates the average value of y
corresponding to each value of x.
How does the regression line work?
[Figure: the regression line in the x-y plane; a horizontal step of
one SD of x corresponds to a vertical step of r × SD of y.]
Associated with an increase of one SD in x there is an increase
of r × SD of y in y, on average.
Clearly, if the correlation coefficient is negative, then the average
value of y decreases as x increases.
In the temperature and altitude example, r is negative, so an increase
in altitude of 1,839 feet (one SD) is associated with a decrease of
0.76 × 6.5 = 4.94 degrees in the average temperature.
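The slope rule can be sketched in a few lines of Python using the summary statistics quoted for the Northern California data. The function name `regression_estimate` and its parameter names are illustrative choices, not part of the original notes.

```python
def regression_estimate(x, mean_x, sd_x, mean_y, sd_y, r):
    """Estimate the average y at a given x with the regression method."""
    z_x = (x - mean_x) / sd_x        # x expressed in standard units
    return mean_y + r * z_x * sd_y   # move r * z_x SDs of y away from the mean of y


# Summary statistics quoted in the notes (altitude in feet, temperature in degrees).
MEAN_ALT, SD_ALT = 3524, 1839
MEAN_TEMP, SD_TEMP = 70.3, 6.5
R = -0.76

at_mean = regression_estimate(MEAN_ALT, MEAN_ALT, SD_ALT, MEAN_TEMP, SD_TEMP, R)
one_sd_up = regression_estimate(MEAN_ALT + SD_ALT, MEAN_ALT, SD_ALT, MEAN_TEMP, SD_TEMP, R)

# Average temperature at the mean altitude, one SD higher, and the change between them.
print(round(at_mean, 2), round(one_sd_up, 2), round(one_sd_up - at_mean, 2))
```

Because r is negative, the change printed last is negative: one SD more altitude goes with a lower average temperature.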


How do we use the method to predict an individual value?

If we consider two variables x and y and we want to predict the value of y for a specific value of x, we use the average value of y that corresponds to the value of x according to the regression method.

Example: The first-year GPAs and the Math SAT scores for the students of a university produce the following data:

average SAT score = 550, SD = 80

average 1st-year GPA = 2.6, SD = 0.6

r = 0.4

We want to predict the 1st-year GPA of a student with a SAT score of 650.

The student's SAT score in standard units is (650 − 550)/80 = 1.25, so the score is 1.25 SDs above average. An increase of one SD above the average SAT score is associated with an average increase of 0.4 × 0.6 = 0.24 GPA points. This implies that our student will have an increase of 1.25 × 0.4 × 0.6 = 0.3 points of GPA above average. Since the average GPA is 2.6, the predicted GPA is 2.9.

This is the average GPA that we expect for students with SAT scores around 650.
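The three steps of the prediction (convert to standard units, multiply by r, convert back) can be sketched as follows. The values SD of SAT = 80, SD of GPA = 0.6 and r = 0.4 are taken as assumptions consistent with the 1.25-SD figure and the 2.6 average used in the worked example; the function name is hypothetical.

```python
def predict_in_standard_units(x, mean_x, sd_x, mean_y, sd_y, r):
    """Return (z-score of x, predicted y) using the regression method."""
    z_x = (x - mean_x) / sd_x   # step 1: x in standard units
    z_y = r * z_x               # step 2: predicted y in standard units
    y_hat = mean_y + z_y * sd_y # step 3: back to the original units of y
    return z_x, y_hat


# Assumed summary values for the SAT / GPA example.
z_sat, predicted_gpa = predict_in_standard_units(
    x=650, mean_x=550, sd_x=80, mean_y=2.6, sd_y=0.6, r=0.4
)
print(z_sat, round(predicted_gpa, 2))
```

The prediction is an estimate of the average GPA among students with that SAT score, not a guarantee for any one student.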

WARNING:

You can use the regression method on new subjects provided that they are similar to the ones that were used to produce the averages, SDs and r used in the regression method.

In the previous example the method will not be valid for students of a different institution.

We can use the regression method and the normal curve to produce estimates of the percentile ranks.

Example: In the previous example suppose a student has a percentile rank of 90% for the SAT scores. That is, only 10% of the scores are higher than his. What is the predicted percentile rank for the 1st-year GPA of this student?

Using the normal curve we have that a 90% probability corresponds to a z score of 1.3. This means that the student's SAT score is 1.3 SDs above average. This corresponds to being 0.4 × 1.3 ≈ 0.5 SDs above the average GPA, and this corresponds to an accumulated probability, under the normal curve, of approximately 69%.
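The percentile calculation can be sketched with the standard library's `statistics.NormalDist`. This assumes both variables are approximately normal and takes r = 0.4 as an assumed value for the SAT / GPA example; the function name is illustrative.

```python
from statistics import NormalDist


def predicted_percentile(percentile_x, r):
    """Predicted percentile rank of y given the percentile rank of x."""
    std_normal = NormalDist()                # mean 0, SD 1
    z_x = std_normal.inv_cdf(percentile_x)   # percentile rank -> standard units
    z_y = r * z_x                            # regression method in standard units
    return std_normal.cdf(z_y)               # standard units -> percentile rank


gpa_rank = predicted_percentile(0.90, 0.4)
print(f"{gpa_rank:.3f}")
```

Without rounding the z scores the result comes out just under 0.70; the hand calculation rounds to 1.3 and 0.5, which gives roughly 69%.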

[Figure: normal density curve for the scores; horizontal axis: scores (40 to 160), vertical axis: density.]

According to the normal curve, the first possibility is more likely, since the mean is 100 and so the interval above 140 has less probability than the one below 140. Under this scenario, the second test is more likely to produce a value below 140.

A symmetric situation is valid for those scoring, say, 80 IQ. It is likely that the true score is above 80 with a negative chance error, and so the second score is likely to be above the first.

[Figure: normal density curve for the scores; horizontal axis: scores (40 to 160), vertical axis: density.]

In other words, if a student scores above average in the first test, it is likely that the true score is lower than the observed one. If the student takes the test again, chances are that the second score will be lower than the first. A symmetric situation is true for a person scoring below average in the first test. This explains the regression effect.
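A small simulation can illustrate the regression effect. The model below is an assumption made for illustration only: each observed score is a true score (mean 100, SD 15) plus an independent chance error (SD 10), and the cutoff of 130 is arbitrary. None of these values come from the notes.

```python
import random


def average(values):
    return sum(values) / len(values)


def simulate_retest(n_people=50_000, true_sd=15.0, error_sd=10.0,
                    cutoff=130.0, seed=0):
    """Average first and second scores of people whose first score >= cutoff."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    first_scores, second_scores = [], []
    for _ in range(n_people):
        true_score = rng.gauss(100.0, true_sd)
        first = true_score + rng.gauss(0.0, error_sd)   # observed = true + chance error
        second = true_score + rng.gauss(0.0, error_sd)  # fresh chance error on the retest
        if first >= cutoff:
            first_scores.append(first)
            second_scores.append(second)
    return average(first_scores), average(second_scores)


first_avg, second_avg = simulate_retest()
print(round(first_avg, 1), round(second_avg, 1))
```

Under this model the high scorers' second average falls below their first average but stays above 100: part of their first score was a positive chance error that does not repeat.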

Regression errors

The regression method can be used to predict y from x. But actual values differ from predictions. These are the regression errors:

error = actual value of y − predicted value of y

Some of the errors defined in this way are positive and some are negative, reflecting the fact that some observations are above and some are below the regression line.

How do we measure the error in a regression? The overall size of the error is measured using the root-mean-square (RMS), as we did to obtain the SD. This is equal to

$$\text{RMS error} = \sqrt{\frac{(\text{error}_1)^2 + (\text{error}_2)^2 + \cdots + (\text{error}_N)^2}{N}}$$

where N is the number of points in the scatter diagram.

Since the RMS error is an average, it reflects how far a typical point is from the regression line. So the RMS error is like an SD. Actually, it is to the regression line what the SD is to the average.

As a rule of thumb we have that:

About 68% of the points on a scatter diagram fall inside the strip whose edges are parallel to the regression line and one RMS error away.

About 95% of the points on a scatter diagram fall inside the strip whose edges are parallel to the regression line and two RMS errors away.

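[Figure: two scatter diagrams with a strip around the regression line; about 68% of the points lie within one RMS error of the line and about 95% within two RMS errors.]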

What if we ignore the values of x?

Then our prediction for y is the average of y. In this case the RMS error coincides with the SD of y.
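The RMS calculation, and the fact that predicting the average of y for everyone gives an RMS error equal to the SD of y, can be checked on a few numbers. The data and the predictions below are hypothetical, made up only to exercise the arithmetic.

```python
from math import sqrt
from statistics import pstdev


def rms(values):
    """Root-mean-square: square, average, then take the square root."""
    return sqrt(sum(v * v for v in values) / len(values))


# Hypothetical actual values of y and hypothetical regression predictions.
actual = [3.0, 5.0, 4.0, 8.0]
predicted = [3.5, 4.5, 5.0, 7.0]

errors = [a - p for a, p in zip(actual, predicted)]  # error = actual - predicted
rms_error = rms(errors)

# Ignoring x: predict the average of y for every point.
mean_y = sum(actual) / len(actual)
baseline_rms = rms([a - mean_y for a in actual])

print(round(rms_error, 3), round(baseline_rms, 3), round(pstdev(actual), 3))
```

The last two printed numbers agree because the SD (computed here with divisor N, as `pstdev` does) is by definition the RMS of the deviations from the average.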

Computing the RMS error

We saw that the error that corresponds to a prediction where the values of x are ignored corresponds to the SD of y. The overall size of the error for a regression using x has to be smaller than the SD. How much smaller?

$$\text{RMS error} = \sqrt{1 - r^2} \times \text{SD of } y$$

We observe the following features:

The units of the RMS error are the same as the units of the variable being predicted.

Perfect correlation corresponds to zero RMS error.

Zero correlation corresponds to maximum RMS error (equal to SD of y).

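California temperature

In the California temperature example we had that the SD of y is 6.5 degrees and the correlation is -0.76, then

$$\text{RMS error} = \sqrt{1 - (-0.76)^2} \times 6.5 \text{ degrees} \approx 4.22 \text{ degrees}$$

So, in this case, knowing the altitude reduces the typical prediction error from 6.5 degrees (the SD of y) to 4.22 degrees.

Shoe sizes

In the shoe sizes example we had that the SD of y is 2.45 inches and the correlation is 0.93, then

$$\text{RMS error} = \sqrt{1 - 0.93^2} \times 2.45 \text{ inches} \approx 0.90 \text{ inches}$$

So we observe that knowing the shoe size produces a dramatic reduction of the typical prediction error, from 2.45 inches (the SD of y) to 0.90 inches.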
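Both worked examples can be reproduced with a short function implementing RMS error = sqrt(1 − r²) × SD of y. The function name `regression_rms_error` is an illustrative choice; the inputs are the values quoted in the two examples.

```python
from math import sqrt


def regression_rms_error(r, sd_y):
    """RMS error of the regression line for y on x, in the units of y."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return sqrt(1.0 - r * r) * sd_y


temperature_rms = regression_rms_error(-0.76, 6.5)  # degrees
shoe_rms = regression_rms_error(0.93, 2.45)         # inches
print(round(temperature_rms, 2), round(shoe_rms, 2))
```

The sign of r does not matter because only r² enters the formula, and the two boundary cases behave as stated in the feature list: r = ±1 gives zero, r = 0 gives the SD of y.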
