



An introduction to regression analysis, focusing on simple linear regression. It covers the definition of regression, the calculation of the regression line, and the use of the normal curve to estimate percentile ranks. The document also explains how to predict individual values using the regression method and discusses the regression effect and errors.




Regression
The idea behind the calculation of the coefficient of correlation is that the scatterplot of the data corresponds to a cloud that follows a straight line. This idea can be formalized by regression methods.
In this class we will:
Consider the definition of simple linear regression
Find a method to predict an individual value
Use the normal curve to estimate the percentile ranks
Describe the regression effect
Compute the regression errors and their RMS
Study the behavior of regression errors
The regression method describes how one variable depends on another.
[Figure: August Temperatures vs Elevation in Northern California. Scatterplot with elevation in feet (2000 to 8000) and temperature in degrees (60 to 85).]
The Northern California temperature data have an average altitude of 3,524 feet and a SD of 1,839 feet; an average temperature of 70.3 degrees and a SD of 6.5 degrees. The correlation between temperature and altitude is -0.76.
AMS-5: Statistics
The cloud of points shows a mild negative association between the two variables, as does the value of r. Can we use the values of altitude to estimate the average values of temperature? The regression line for y on x estimates the average value of y corresponding to each value of x.
How does the regression line work?
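[Figure: the regression line. A run of one SD of x along the horizontal axis corresponds to a rise of r × SD of y along the vertical axis.]
Associated with an increase of one SD in x there is an increase of r SDs in y, on average.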
Clearly, if the correlation coefficient is negative, then the average value of y decreases as x increases.
In the temperature and altitude example, an increase in height of 1,839 feet (one SD of altitude) produces a change of about -4.95 degrees in the average temperature, that is, r SDs of temperature.
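The rule above can be turned into a line in the original units. The sketch below is only an illustration: the function names are hypothetical, the SD of temperature (6.5 degrees) is the value quoted in the RMS error example of these notes, and the intercept uses the standard fact that the regression line passes through the point of averages. With these rounded summary values the change per SD of altitude comes out near -4.94 degrees.

```python
def regression_line(avg_x, sd_x, avg_y, sd_y, r):
    """Return (slope, intercept) of the regression line for y on x."""
    slope = r * sd_y / sd_x            # change in average y per unit of x
    intercept = avg_y - slope * avg_x  # line goes through the point of averages
    return slope, intercept


# Summary statistics quoted for the Northern California data.
slope, intercept = regression_line(
    avg_x=3524, sd_x=1839, avg_y=70.3, sd_y=6.5, r=-0.76
)


def estimated_avg_temperature(altitude_feet):
    """Regression estimate of the average temperature at a given altitude."""
    return intercept + slope * altitude_feet


# Moving one SD of altitude above the average changes the estimate by r * SD_y.
change_per_sd = estimated_avg_temperature(3524 + 1839) - estimated_avg_temperature(3524)
print(round(change_per_sd, 2))
```

Working with slope and intercept is equivalent to working in standard units; it is just more convenient when many altitudes have to be converted.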
How do we use the method to predict an individual value?
If we consider two variables x and y and we want to predict the value of y for a specific value of x, we use the average value of y that corresponds to that value of x according to the regression method.
Example:
The first-year GPAs and the Math SAT scores of the students of a university produce the following data, together with the two SDs and the correlation r:
average SAT score = 550
average 1st-year GPA = 2.6
We want to predict the 1st-year GPA of a student with a SAT score of 650.
The student's SAT score in standard units is (650 - 550)/80 = 1.25, so the score is 1.25 SDs above average. An increase of one SD above the average SAT score produces an increase of r × (SD of GPA) points in the average GPA. This implies that our student will have an increase of 1.25 × r × (SD of GPA) points of GPA above average. Since the average GPA is 2.6, the predicted GPA is 2.6 plus that increase. This is the average GPA that we expect for students with SAT scores around 650.
You can use the regression method on new subjects provided that they are similar to the ones that were used to produce the averages, SDs and r used in the regression method.
In the previous example the method will not be valid for students of a different institution.
We can use the regression method and the normal curve to produce estimates of the percentile ranks.
Example:
In the previous example suppose a student has a percentile rank of 90% for the SAT scores. That is, only 10% of the scores are higher than his. What is the predicted percentile rank for the 1st-year GPA of this student?
Using the normal curve, a 90% probability corresponds to a z score of 1.3. This means that the student's SAT score is 1.3 SDs above average. This corresponds to being r × 1.3, or about 0.5, SDs above the average GPA, and this corresponds to an accumulated probability, under the normal curve, of approximately 69%.
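The three steps (percentile to standard units, multiply by r, standard units back to a percentile) can be sketched with Python's standard library. This is an illustration under stated assumptions: the function name is hypothetical, both variables are assumed to follow the normal curve, and r = 0.4 is an assumed value chosen to agree with the 0.5 SDs quoted above. Because the code does not round the z score to 1.3, its answer differs slightly from the hand calculation.

```python
from statistics import NormalDist


def predicted_percentile(x_percentile, r):
    """Predict the percentile rank of y from the percentile rank of x.

    Assumes both variables follow the normal curve.
    """
    std_normal = NormalDist()               # mean 0, SD 1
    z_x = std_normal.inv_cdf(x_percentile)  # x in standard units
    z_y = r * z_x                           # regression estimate, standard units
    return std_normal.cdf(z_y)              # back to a percentile rank


# r = 0.4 is an assumption made for illustration only.
rank = predicted_percentile(0.90, r=0.4)
print(f"{rank:.1%}")
```

Note that the predicted rank is closer to 50% than the original rank whenever r is strictly between -1 and 1.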
[Figure: normal density curve of the test scores. Horizontal axis: scores (40 to 160); vertical axis: density.]
According to the normal curve, the first possibility is more likely, since the mean is 100 and so the interval above 140 has less probability than the one below 140. Under this scenario, the second test is more likely to produce a value below 140.
A symmetric situation is valid for those scoring, say, 80. It is likely that the true score is above 80 with a negative chance error, and so the second score is likely to be above the first.
[Figure: normal density curve of the test scores. Horizontal axis: scores (40 to 160); vertical axis: density.]
In other words, if a student scores above average in the first test, it is likely that the true score is lower than the observed one. If the student takes the test again, chances are that the second score will be lower than the first. A symmetric situation is true for a person scoring below average in the first test. This explains the regression effect.
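A small simulation can make the argument concrete. The sketch below assumes each observed score is a true score plus an independent chance error; the mean of 100 matches the example, while the SDs, the cutoffs of 130 and 70, and the sample size are hypothetical choices made only for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical settings: true scores centered at 100, plus chance error.
TRUE_MEAN, TRUE_SD, ERROR_SD = 100, 15, 5

first, second = [], []
for _ in range(20000):
    true_score = random.gauss(TRUE_MEAN, TRUE_SD)
    first.append(true_score + random.gauss(0, ERROR_SD))   # first test
    second.append(true_score + random.gauss(0, ERROR_SD))  # retest


def mean(values):
    return sum(values) / len(values)


# People who scored high (or low) on the first test, with their retest scores.
high = [(a, b) for a, b in zip(first, second) if a >= 130]
low = [(a, b) for a, b in zip(first, second) if a <= 70]

high_first, high_second = mean([a for a, _ in high]), mean([b for _, b in high])
low_first, low_second = mean([a for a, _ in low]), mean([b for _, b in low])

print("high group: first", round(high_first, 1), "second", round(high_second, 1))
print("low group:  first", round(low_first, 1), "second", round(low_second, 1))
```

In this setup nobody's true score changes between tests, yet the high group averages lower on the retest and the low group averages higher, which is the regression effect.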
Regression errors
The regression method can be used to predict y from x. But actual values differ from predictions. These are the regression errors:
error = actual value of y - predicted value of y
Some of the errors defined in this way are positive and some are negative, reflecting the fact that some observations are above and some are below the regression line.
How do we measure the error in a regression?
The overall size of the error is measured using the root-mean-square (RMS), as we did to obtain the SD. This is equal to
$$\sqrt{\frac{(\text{error } 1)^2 + (\text{error } 2)^2 + \cdots + (\text{error } n)^2}{n}}$$
where n is the number of points in the scatter diagram.
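As a minimal sketch of this computation, the function below squares the errors, averages them, and takes the square root. The actual and predicted values are made-up numbers used only to exercise the formula.

```python
from math import sqrt


def rms(errors):
    """Root-mean-square of a list of errors."""
    return sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical actual and predicted values of y for four points.
actual = [3.0, 5.0, 4.0, 8.0]
predicted = [2.0, 6.0, 4.0, 6.0]

# error = actual value of y - predicted value of y
errors = [a - p for a, p in zip(actual, predicted)]

print(errors)
print(rms(errors))
```

Squaring before averaging keeps positive and negative errors from cancelling each other.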
Since the RMS error is an average it reflects how far a typical point is from the regression line. So the RMS error is like a SD. Actually it is to the regression line what the SD is to the average.
As a rule of thumb we have that
About 68% of the points on a scatter diagram fall inside the strip whose edges are parallel to the regression line and one RMS error away from it.
About 95% of the points on a scatter diagram fall inside the strip whose edges are parallel to the regression line and two RMS errors away from it.
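[Figure: two scatter diagrams of y against x with strips around the regression line. One RMS error on either side covers 68% of the points; two RMS errors on either side cover 95%.]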
What if we ignore the values of x? Then our prediction for y is the average of y. In this case the RMS error coincides with the SD of y.
Computing the RMS error
We saw that the error for a prediction that ignores the values of x corresponds to the SD of y. The overall size of the error for a regression using x has to be smaller than the SD. How much smaller?
$$\text{RMS error} = \sqrt{1 - r^2} \times \text{SD of } y$$
We observe the following features:
The units of the RMS error are the same as the units of the variable being predicted.
Perfect correlation corresponds to zero RMS error.
Zero correlation corresponds to maximum RMS error (equal to the SD of y).
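The formula and its features can be checked numerically. The sketch below is an illustration only: the data are synthetic, the helper names are hypothetical, and the SD is computed by dividing by n, which is the convention under which the RMS of the errors from the regression line matches the formula exactly.

```python
import random
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """SD computed as the RMS of deviations from the average (divide by n)."""
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))


def correlation(xs, ys):
    """Average of the products of the two variables in standard units."""
    mx, my, sx, sy = mean(xs), mean(ys), sd(xs), sd(ys)
    return mean([((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)])


def rms_error_from_r(r, sd_y):
    return sqrt(1 - r ** 2) * sd_y


# Synthetic data, made up for this check.
random.seed(1)
xs = [random.gauss(50, 10) for _ in range(500)]
ys = [0.5 * x + random.gauss(0, 5) for x in xs]

r = correlation(xs, ys)
slope = r * sd(ys) / sd(xs)
intercept = mean(ys) - slope * mean(xs)
errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rms_direct = sqrt(mean([e ** 2 for e in errors]))  # RMS of the actual errors
rms_formula = rms_error_from_r(r, sd(ys))          # shortcut formula
print(round(rms_direct, 4), round(rms_formula, 4))

# The two extreme cases from the list of features, using an SD of 6.5.
print(rms_error_from_r(1.0, 6.5), rms_error_from_r(0.0, 6.5))
```

The shortcut needs only r and the SD of y, so the RMS error can be obtained without computing any individual prediction.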
California temperature
In the California temperature example the SD of y is 6.5 degrees and the correlation is -0.76, so
$$\text{RMS error} = \sqrt{1 - (-0.76)^2} \times 6.5 \text{ degrees} = 4.22 \text{ degrees}$$
So, in this case, knowing the altitude reduces the SD from 6.5 to 4.22 degrees.
Shoe sizes
In the shoe sizes example the SD of y is 2.45 inches and the correlation is 0.93, so
$$\text{RMS error} = \sqrt{1 - 0.93^2} \times 2.45 \text{ inches} = 0.90 \text{ inches}$$
So we observe that knowing the shoe size produces a dramatic reduction of the SD, from 2.45 to 0.90 inches.