







An overview of scatterplots, correlation, and regression equations as tools to explore the relationship between quantitative variables. The author uses data from various colleges and universities to demonstrate the concepts of scatterplots, positive and negative associations, and regression lines. Students can use this document as a reference for understanding the basics of statistical analysis and data visualization.








College Example: Data were collected from the 25 top liberal
arts colleges and the 25 top research universities. The data
set contains the variables TYPE, SAT, ACCEPT, SPENT, Top_HS,
PHD, and GRADS.
Explore the bivariate relationships between pairs of the
variables SAT, ACCEPT, and Top_HS.
College             TYPE       SAT  ACCEPT   SPENT  Top_HS  PHD  GRADS
Amherst             Lib Arts  1315      22   26636      85   81     93
Swarthmore          Lib Arts  1310      24   27487      78   93     88
Williams            Lib Arts  1336      28   23772      86   90     93
Bowdoin             Lib Arts  1300      24   25703      78   95     90
Wellesley           Lib Arts  1250      49   27879      76   91     86
Pomona              Lib Arts  1320      33   26668      79   98     80
Wesleyan (CT)       Lib Arts  1290      35   19948      73   87     91
Middlebury          Lib Arts  1255      25   24718      65   89     92
Smith               Lib Arts  1195      57   25271      65   90     87
Davidson            Lib Arts  1230      36   17721      77   94     89
Vassar              Lib Arts  1287      43   20179      53   90     84
Carleton            Lib Arts  1300      40   19504      75   82     80
Claremont           Lib Arts  1260      36   20377      68   94     74
Oberlin             Lib Arts  1247      54   23591      64   98     77
Washington & Lee    Lib Arts  1234      29   17998      61   89     78
Grinnell            Lib Arts  1244      67   22301      65   79     73
Mount Holyoke       Lib Arts  1200      61   23358      47   83     83
Colby               Lib Arts  1200      46   18872      52   75     84
Hamilton            Lib Arts  1215      38   20722      51   86     85
Bates               Lib Arts  1240      36   17554      58   81     88
Haverford           Lib Arts  1285      35   19418      71   91     87
Colgate             Lib Arts  1258      38   17520      61   78     85
Bryn Mawr           Lib Arts  1255      56   18847      70   81     84
Occidental          Lib Arts  1170      49   20192      54   93     72
Barnard             Lib Arts  1220      53   17653      69   98     80
Harvard             Univ      1370      18   46918      90   99     90
Stanford            Univ      1370      18   61921      92   96     88
Yale                Univ      1350      19   52468      90   97     93
Princeton           Univ      1340      17   48123      89   99     93
Cal Tech            Univ      1400      31  102262      98   98     75
MIT                 Univ      1357      30   56766      95   98     86
Duke                Univ      1310      25   39504      91   95     91
Dartmouth           Univ      1306      25   35804      86  100     95
Cornell             Univ      1280      30   37137      85   90     83
Columbia            Univ      1268      29   45879      78   93     90
U of Chicago        Univ      1300      45   38937      74  100     73
Brown               Univ      1281      24   24201      80   98     90
U Penn              Univ      1280      41   30882      87   99     86
Berkeley            Univ      1176      37   23665      95   93     68
Johns Hopkins       Univ      1290      48   45460      69   58     86
Rice                Univ      1327      24   26730      85   95     88
UCLA                Univ      1142      43   26859      96  100     61
U Va.               Univ      1218      37   19365      77   91     88
Georgetown          Univ      1278      24   23115      79   89     89
UNC                 Univ      1109      32   19684      82   84     73
U Michigan          Univ      1195      60   21853      71   93     77
Carnegie Mellon     Univ      1225      64   33607      52   84     77
Northwestern        Univ      1230      47   28851      77   79     82
Washington U (MO)   Univ      1225      54   39883      71   98     76
U of Rochester      Univ      1155      56   38597      52   96     73
Positive/Negative Association
Two variables have a positive association when the values of
one variable tend to increase as the
values of the other variable increase.
Two variables have a negative association when the values of
one variable tend to decrease as the
values of the other variable increase.
Positive Association: SAT versus Top_HS
Schools with a higher percentage of students who graduated
in the top 10% of their graduating class tend to have students
with a higher median SAT.
Example 5.2 Driver Age and Maximum
Legibility Distance of Highway Signs
Example: Carbon Dioxide Concentration
Trends in CO2 concentration over the last two
centuries (Source of data: WorldWatch
Institute).
Data file: car_dio.MTW
[Scatterplot: CO2 (ppm) versus Year]
There is a positive association between
Year and the CO2 concentration level.
The association, however, is nonlinear.
Consider the College data. Explore the relationship
between pairs of variables.
Groups
Different plotting symbols or colors can represent different subgroups.
This is an example where there is little relationship between
the two variables, but a bogus relationship is shown because
two groups are displayed together.
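To see how displaying subgroups together can produce an apparent relationship, here is a minimal sketch in Python. The numbers are made up for illustration and are not taken from the College data; `pearson_r` is a hypothetical helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: two subgroups, each with no linear relationship.
group_a_x, group_a_y = [1, 2, 3], [2, 1, 2]
group_b_x, group_b_y = [7, 8, 9], [8, 7, 8]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_pooled = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print("within group A:", round(r_a, 3))
print("within group B:", round(r_b, 3))
print("groups combined:", round(r_pooled, 3))
```

Within each made-up group the correlation is zero, yet the combined data show a strong positive correlation, because the two groups sit at different locations on both axes.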
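[Scatterplot with Top_HS on one axis; plotting symbols distinguish TYPE: Lib Arts, Univ]
What does this graph tell you?
[Second scatterplot; plotting symbols distinguish TYPE: Lib Arts, Univ]
What does this graph tell you?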
5.2 Describing Linear Patterns with a Regression Line
Two purposes of the regression line:
to estimate the average value of y at any specified value of x
to predict the value of y for an individual, given that individual's x value
When the best equation for describing the
relationship between x and y is a straight line,
the equation is called the regression line.
SAT = 1371 - 2.83 Accept
S = 50.0586, R-Sq = 36.8%, R-Sq(adj) = 35.5%
If a school has a 50% acceptance rate, then on average the
median SAT score for that school is 1371 - 2.83(50) = 1229.5.
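The prediction can be checked with a short Python sketch. The function name `predict_sat` is a hypothetical helper, and the coefficients are the rounded values from the fitted equation, so results are approximate.

```python
# Rounded coefficients from the fitted line SAT = 1371 - 2.83 Accept.
INTERCEPT = 1371.0
SLOPE = -2.83

def predict_sat(accept_pct):
    """Estimated average median SAT for an acceptance rate given in percent."""
    return INTERCEPT + SLOPE * accept_pct

predicted = predict_sat(50)
print(round(predicted, 1))
```

The acceptance rates in the data range from 17 to 67 percent, so the line should only be used for acceptance rates in roughly that range.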
Prediction Errors and Residuals
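Prediction error = difference between the observed value of y
and the predicted value ŷ.
Residual = y - ŷ
[Figure: an observed point (x_i, y_i), its predicted value ŷ_i on the LS line, and the residual (y_i - ŷ_i) between them]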
Components in a regression model
Regression equation: ŷ = 577 - 3x
Example 5.2 Driver Age and Maximum
Legibility Distance of Highway Signs (cont)
The residual can be computed for all 30 observations.
Positive residual => observed value higher than predicted.
Negative residual => observed value lower than predicted.
x = Age   y = Distance   ŷ = 577 - 3x          Residual = y - ŷ
18        510            577 - 3(18) = 523     510 - 523 = -13
20        590            577 - 3(20) = 517     590 - 517 = 73
22        516            577 - 3(22) = 511     516 - 511 = 5
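The residual calculations for the three listed drivers can be reproduced with a small Python sketch. The helper name `predicted_distance` is an assumption made for this example; the equation and the three (age, distance) pairs come from the table.

```python
def predicted_distance(age):
    """Predicted legibility distance from the regression equation 577 - 3x."""
    return 577 - 3 * age

# (age, observed distance) pairs for three of the 30 observations.
observations = [(18, 510), (20, 590), (22, 516)]

residuals = []
for age, distance in observations:
    y_hat = predicted_distance(age)
    residual = distance - y_hat  # observed minus predicted
    residuals.append(residual)
    print(age, distance, y_hat, residual)
```

A positive residual means the driver read the sign from farther away than the line predicts; a negative residual means the opposite.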
Least Squares Line and Formulas
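Least squares line: the line that minimizes
the sum of squared prediction errors.
Slope:
$$ b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} $$
Intercept:
$$ b_0 = \bar{y} - b_1 \bar{x} $$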
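The least squares formulas for the slope b1 and intercept b0 translate directly into code. The following is a minimal Python sketch; the function name `least_squares` and the toy data are assumptions made for illustration, not values from the examples in these notes.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Toy data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = least_squares(xs, ys)
print(b0, b1)
```

Because the toy points fall exactly on a line, the fitted intercept and slope recover that line; with real data the line passes through the point of means and the residuals do not all vanish.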
5.3 Measuring Strength and Direction with Correlation
The strength of the relationship is
determined by the closeness of the points to
a straight line.
The direction is determined by whether one
variable generally increases or generally
decreases when the other variable increases.
Correlation r indicates the strength and
the direction of a straight-line relationship.
Interpretation of r and a Formula
When r = 0, the least squares line is flat, so knowing x does not
change the predicted value of y.
$$ r = \frac{1}{n-1} \sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
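The correlation formula averages the products of standardized x and y values, dividing by n - 1. A minimal Python sketch follows; `correlation_r` is a hypothetical helper and the data are toy values chosen so the result is easy to check.

```python
import statistics

def correlation_r(xs, ys):
    """Correlation r as the sum of products of z-scores divided by n - 1."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # sample standard deviation (n - 1 divisor)
    s_y = statistics.stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Toy data: a perfectly increasing and a perfectly decreasing line.
r_up = correlation_r([1, 2, 3, 4], [10, 20, 30, 40])
r_down = correlation_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Points that fall exactly on a rising line give r = +1 and points on a falling line give r = -1, regardless of the units of x and y.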
Example 5.1 Height and Handspan (cont)
Regression equation: Handspan = -3 + 0.35 Height
Correlation r = +0.74 =>
a somewhat strong positive linear relationship.
Example 5.9 Hours of Sleep
and Hours of Study
Relationship between reported hours of sleep the previous
24 hours and the reported hours of study during the same
period for a sample of 116 college students.
Correlation r = –0.
=> a not too strong
negative association.
A Correlation applet
Interpretation of r^2 and a Formula
Squared correlation r^2 is between 0 and 1
and indicates the proportion of variation in
the response explained by x.
SSTO = sum of squares total = sum of squared
differences between observed y values and the mean of y.
SSE = sum of squared errors (residuals) = sum
of squared differences between observed y values
and predicted values based on least squares line.
$$ r^2 = \frac{\mathrm{SSTO} - \mathrm{SSE}}{\mathrm{SSTO}} $$
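As a rough check of the SSTO and SSE definitions, the sketch below fits a least squares line to a small made-up data set and computes r^2 as (SSTO - SSE) / SSTO. The data and variable names are assumptions for illustration only.

```python
# Toy data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# SSTO: squared differences between observed y values and the mean of y.
ssto = sum((y - y_bar) ** 2 for y in ys)
# SSE: squared differences between observed y values and predicted values.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r_squared = (ssto - sse) / ssto
print(round(ssto, 2), round(sse, 2), round(r_squared, 2))
```

For these toy values the line explains 60% of the variation in y, which matches the square of the correlation computed directly from the same data.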
Interpretation of r^2
Example 5.6: Left and Right Handspans
r^2 = 0.90 => span of one hand is very
predictable from span of other hand.
Example 5.8: TV viewing and Age
r^2 = 0.014 => only about 1.4% of the variation
in daily TV viewing is explained by age;
knowing a person's age doesn't
help much in predicting amount
of daily TV viewing.
Regression in Minitab:
Stat > Regression > Regression
Regression Analysis: SAT versus ACCEPT
The regression equation is
SAT = 1371 - 2.83 ACCEPT
Predictor      Coef   SE Coef       T    P
Constant    1371.05     21.45   63.92   0.
ACCEPT      -2.8300    0.5351   -5.29   0.
S = 50.0586 R-Sq = 36.8% R-Sq(adj) = 35.5%
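Two columns of the Minitab output can be cross-checked by hand: each T value is the coefficient divided by its standard error, and adjusted R-Sq follows from R-Sq and the sample size. The Python sketch below assumes n = 50 (the 25 colleges plus 25 universities) and one predictor, and uses the rounded values printed in the output, so agreement is only to the displayed precision.

```python
# Values copied from the Minitab output (rounded as printed).
coef_constant, se_constant = 1371.05, 21.45
coef_accept, se_accept = -2.8300, 0.5351
r_sq = 0.368

n = 50  # assumed sample size: 25 liberal arts colleges + 25 universities
p = 1   # number of predictors (ACCEPT)

# T statistic = coefficient / standard error of the coefficient.
t_constant = coef_constant / se_constant
t_accept = coef_accept / se_accept

# Adjusted R-squared penalizes R-squared for the number of predictors.
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)

print(round(t_constant, 2), round(t_accept, 2), round(100 * r_sq_adj, 1))
```

The recomputed values agree with the T column and the R-Sq(adj) entry to the precision shown in the output.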
5.4 Why the Answers May Not Make Sense
Allowing outliers to overly influence the results
Using correlation and a straight-line
equation to describe curvilinear data
Example 5.4 Height and Foot Length (cont)
Regression equation
uncorrected data: 15.4 + 0.13 height
corrected data: -3.2 + 0.42 height
Correlation
uncorrected data: r = 0.
corrected data: r = 0.
Three outliers were
data entry errors.
Example 5.10 Earthquakes in US
Correlation
all data: r = 0.
w/o SF: r = –0.
San Francisco
earthquake of 1906.
Case Study 5.1 A Weighty Issue
Relationship between Actual and Ideal Weight
[Scatterplot panels: Females, Males]