Analyzing Variable Relationship: Scatterplots & Regression Analysis - Jamshidian, 2005, Study notes of Mathematical Statistics

An overview of scatterplots, correlation, and regression equations as tools to explore the relationship between quantitative variables. The author uses data from various colleges and universities to demonstrate the concepts of scatterplots, positive and negative associations, and regression lines. Students can use this document as a reference for understanding the basics of statistical analysis and data visualization.

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009


Chapter 5: Relationships Between Quantitative Variables

Three Tools We Will Use

  • Scatterplot: a two-dimensional graph of data values
  • Correlation: a statistic that measures the strength and direction of a linear relationship
  • Regression equation: an equation that describes the average relationship between a response variable and an explanatory variable

5.1 Looking for Patterns with Scatterplots

Questions to Ask about a Scatterplot

  • What is the average pattern? Does it look like a straight line or is it curved?
  • What is the direction of the pattern?
  • How much do individual points vary from the average pattern?
  • Are there any unusual data points?

Looking for Patterns with Scatterplots

Minitab Command

Graph > Scatterplot > Simple

Graph > Scatterplot > With Group


College Example: Data were collected from the 25 top liberal arts colleges and the 25 top research universities. The data set contains the following variables:

  • Name: name of each school
  • School_Type: coded 'LibArts' for liberal arts and 'Univ' for university
  • SAT: median combined Math and Verbal SAT score of students
  • Accept: % of applicants accepted
  • Spent: money spent per student, in dollars
  • Top_hs: % of students in the top 10% of their high school graduating class
  • PhD: % of faculty at the institution who have PhD degrees
  • Grads: % of students at the institution who eventually graduate

Explore the bivariate relationship between pairs of the variables SAT, Accept, and Top_hs.

College            TYPE       SAT  ACCEPT   SPENT  Top_HS  PHD  GRADS
Amherst            Lib Arts  1315      22   26636      85   81     93
Swarthmore         Lib Arts  1310      24   27487      78   93     88
Williams           Lib Arts  1336      28   23772      86   90     93
Bowdoin            Lib Arts  1300      24   25703      78   95     90
Wellesley          Lib Arts  1250      49   27879      76   91     86
Pomona             Lib Arts  1320      33   26668      79   98     80
Wesleyan (CT)      Lib Arts  1290      35   19948      73   87     91
Middlebury         Lib Arts  1255      25   24718      65   89     92
Smith              Lib Arts  1195      57   25271      65   90     87
Davidson           Lib Arts  1230      36   17721      77   94     89
Vassar             Lib Arts  1287      43   20179      53   90     84
Carleton           Lib Arts  1300      40   19504      75   82     80
Claremont          Lib Arts  1260      36   20377      68   94     74
Oberlin            Lib Arts  1247      54   23591      64   98     77
Washington & Lee   Lib Arts  1234      29   17998      61   89     78
Grinnell           Lib Arts  1244      67   22301      65   79     73
Mount Holyoke      Lib Arts  1200      61   23358      47   83     83
Colby              Lib Arts  1200      46   18872      52   75     84
Hamilton           Lib Arts  1215      38   20722      51   86     85
Bates              Lib Arts  1240      36   17554      58   81     88
Haverford          Lib Arts  1285      35   19418      71   91     87
Colgate            Lib Arts  1258      38   17520      61   78     85
Bryn Mawr          Lib Arts  1255      56   18847      70   81     84
Occidental         Lib Arts  1170      49   20192      54   93     72
Barnard            Lib Arts  1220      53   17653      69   98     80
Harvard            Univ      1370      18   46918      90   99     90
Stanford           Univ      1370      18   61921      92   96     88
Yale               Univ      1350      19   52468      90   97     93
Princeton          Univ      1340      17   48123      89   99     93
Cal Tech           Univ      1400      31  102262      98   98     75
MIT                Univ      1357      30   56766      95   98     86
Duke               Univ      1310      25   39504      91   95     91
Dartmouth          Univ      1306      25   35804      86  100     95
Cornell            Univ      1280      30   37137      85   90     83
Columbia           Univ      1268      29   45879      78   93     90
U of Chicago       Univ      1300      45   38937      74  100     73
Brown              Univ      1281      24   24201      80   98     90
U Penn             Univ      1280      41   30882      87   99     86
Berkeley           Univ      1176      37   23665      95   93     68
Johns Hopkins      Univ      1290      48   45460      69   58     86
Rice               Univ      1327      24   26730      85   95     88
UCLA               Univ      1142      43   26859      96  100     61
U Va.              Univ      1218      37   19365      77   91     88
Georgetown         Univ      1278      24   23115      79   89     89
UNC                Univ      1109      32   19684      82   84     73
U Michigan         Univ      1195      60   21853      71   93     77
Carnegie Mellon    Univ      1225      64   33607      52   84     77
Northwestern       Univ      1230      47   28851      77   79     82
Washington U (MO)  Univ      1225      54   39883      71   98     76
U of Rochester     Univ      1155      56   38597      52   96     73

Positive/Negative Association

  • Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
  • Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.
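A rough numeric check on the direction of an association, offered here as a sketch rather than as the method used in these notes, is the sign of the summed products of deviations from the two means. The example uses the first six liberal arts rows of the College data (Amherst through Pomona); the helper names `cross_product_sum` and `direction` are hypothetical.

```python
# First six rows of the College data (Amherst ... Pomona).
sat = [1315, 1310, 1336, 1300, 1250, 1320]
accept = [22, 24, 28, 24, 49, 33]
top_hs = [85, 78, 86, 78, 76, 79]


def cross_product_sum(x, y):
    """Sum of (x_i - mean(x)) * (y_i - mean(y)). Hypothetical helper."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))


def direction(x, y):
    """Label the association by the sign of the cross-product sum."""
    s = cross_product_sum(x, y)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "none"


print(direction(top_hs, sat))  # positive
print(direction(accept, sat))  # negative
```

A sign check like this says nothing about strength or about curvature, so it complements a scatterplot rather than replacing it.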


[Figure: Scatterplot of SAT vs Top_HS (x-axis: Top_HS, y-axis: SAT)]

Positive association: schools with a higher percentage of students who graduated in the top 10% of their high school class tend to have students with a higher median SAT score.

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs

  • We see a negative association with a linear pattern.
  • We will use a straight-line equation to model this relationship.

Example: Carbon Dioxide Concentration

Trends in CO2 concentration over the last two centuries (source of data: WorldWatch Institute).

Data file: car_dio.MTW

[Figure: Scatterplot of CO2 (ppm) vs Year (x-axis: Year, y-axis: CO2 (ppm))]

There is a positive association between Year and the CO2 concentration level. The association, however, is nonlinear.

Consider the College data. Explore the relationship between pairs of variables.

Groups

  • Use different plotting symbols or colors to represent different subgroups.

This is an example where there is little relationship between the two variables, but a bogus relationship is shown because two groups are displayed together.

[Figure: Scatterplot of ACCEPT vs Top_HS, points marked by TYPE (Lib Arts, Univ); x-axis: Top_HS, y-axis: ACCEPT]

What does this graph tell you?

[Figure: Scatterplot of ACCEPT vs SAT, points marked by TYPE (Lib Arts, Univ); x-axis: SAT, y-axis: ACCEPT]

What does this graph tell you?

5.2 Describing Linear Patterns with a Regression Line

Two purposes of the regression line:

  • to estimate the average value of y at any specified value of x
  • to predict the value of y for an individual, given that individual’s x value

When the best equation for describing the relationship between x and y is a straight line, the equation is called the regression line.

SAT = 1371 – 2.83 Accept

[Figure: Fitted Line Plot, SAT = 1371 - 2.830 ACCEPT (x-axis: ACCEPT, y-axis: SAT); S = 50.0586, R-Sq = 36.8%, R-Sq(adj) = 35.5%]

If a school has a 50% acceptance rate, then on average the median SAT score for that school is 1371 – 2.83(50) = 1229.5.
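The fitted line can be wrapped in a small function to estimate the average median SAT at any acceptance rate. This is a minimal sketch using the rounded coefficients quoted above; the function name `predict_sat` is hypothetical.

```python
def predict_sat(accept_pct):
    """Estimated average median SAT from the fitted line SAT = 1371 - 2.83 * Accept."""
    return 1371 - 2.83 * accept_pct


# A school that accepts 50% of its applicants:
print(predict_sat(50))  # about 1229.5
```

The same value serves both purposes of the regression line: it estimates the average for all schools with a 50% acceptance rate and predicts the median SAT of one such school. Acceptance rates in the data run from 17% to 67%, so values far outside that range would be extrapolation.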

Prediction Errors and Residuals

  • Prediction error = difference between the observed value of y and the predicted value $\hat{y}$.
  • Residual = $y - \hat{y}$

[Figure: Components in a regression model: a data point $(x_i, y_i)$, its predicted value on the LS (least squares) line, and the residual between them]

Example 5.2 Driver Age and Maximum Legibility Distance of Highway Signs (cont)

Regression equation: $\hat{y} = 577 - 3x$

x = Age   y = Distance   $\hat{y} = 577 - 3x$    Residual
18        510            577 – 3(18) = 523     510 – 523 = -13
20        590            577 – 3(20) = 517     590 – 517 = 73
22        516            577 – 3(22) = 511     516 – 511 = 5

Can compute the residual for all 30 observations.

Positive residual => observed value higher than predicted.

Negative residual => observed value lower than predicted.
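The residual calculation for the three drivers listed in Example 5.2 can be sketched as follows. Only the regression equation and the three (age, distance) pairs come from the notes; the names `predicted_distance`, `observations`, and `residuals` are hypothetical.

```python
def predicted_distance(age):
    """Predicted legibility distance from the line y-hat = 577 - 3 * age."""
    return 577 - 3 * age


# (age, observed distance) pairs from the example.
observations = [(18, 510), (20, 590), (22, 516)]

# Residual = observed y minus predicted y.
residuals = [distance - predicted_distance(age) for age, distance in observations]

print(residuals)  # [-13, 73, 5]
```

The 20-year-old's residual of 73 means that driver read the sign from much farther away than the line predicts for that age.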

Least Squares Line and Formulas

  • Least Squares Regression Line: minimizes the sum of squared prediction errors.
  • SSE = Sum of squared prediction errors.
  • Formulas for Slope and Intercept:

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
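As a check on the slope and intercept formulas, the sketch below applies them to the ACCEPT and SAT columns of the College data (50 schools, in table order). The function name `least_squares` is hypothetical; the result should agree with the fitted line SAT = 1371 - 2.83 ACCEPT quoted in these notes, up to rounding.

```python
accept = [
    22, 24, 28, 24, 49, 33, 35, 25, 57, 36,
    43, 40, 36, 54, 29, 67, 61, 46, 38, 36,
    35, 38, 56, 49, 53, 18, 18, 19, 17, 31,
    30, 25, 25, 30, 29, 45, 24, 41, 37, 48,
    24, 43, 37, 24, 32, 60, 64, 47, 54, 56,
]
sat = [
    1315, 1310, 1336, 1300, 1250, 1320, 1290, 1255, 1195, 1230,
    1287, 1300, 1260, 1247, 1234, 1244, 1200, 1200, 1215, 1240,
    1285, 1258, 1255, 1170, 1220, 1370, 1370, 1350, 1340, 1400,
    1357, 1310, 1306, 1280, 1268, 1300, 1281, 1280, 1176, 1290,
    1327, 1142, 1218, 1278, 1109, 1195, 1225, 1230, 1225, 1155,
]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = least_squares(accept, sat)
print(round(b0, 2), round(b1, 2))  # close to 1371.05 and -2.83
```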

5.3 Measuring Strength and Direction with Correlation

  • The strength of the relationship is determined by the closeness of the points to a straight line.
  • The direction is determined by whether one variable generally increases or generally decreases when the other variable increases.

Correlation r indicates the strength and the direction of a straight-line relationship.

Interpretation of r and a Formula

  • r is always between –1 and +1
  • magnitude indicates the strength
  • r = –1 or +1 indicates a perfect linear relationship
  • sign indicates the direction
  • r = 0 indicates a slope of 0, so knowing x does not change the predicted value of y
  • Formula for correlation:

$$r = \frac{1}{n-1} \sum_i \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$
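The correlation formula can be sketched directly in code: standardize each x and y with its sample standard deviation, multiply the pairs, and divide the sum by n - 1. The function name `correlation` is hypothetical. The exact-line data are made up for illustration, and the six (ACCEPT, SAT) pairs are the first six rows of the College data.

```python
import math


def correlation(x, y):
    """Sample correlation r computed from standardized x and y values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divisor n - 1).
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    products = (
        ((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)


# Points exactly on a line give r = +1 or -1, up to floating-point rounding.
r_up = correlation([1, 2, 3, 4], [3, 5, 7, 9])
r_down = correlation([1, 2, 3, 4], [9, 7, 5, 3])

# First six rows of the College data: ACCEPT vs SAT.
r_college = correlation(
    [22, 24, 28, 24, 49, 33], [1315, 1310, 1336, 1300, 1250, 1320]
)
print(r_up, r_down, r_college)
```

For the six College rows r comes out negative, matching the downward slope of the SAT versus ACCEPT line.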

Example 5.1 Height and Handspan (cont)

Regression equation: Handspan = -3 + 0.35 Height

Correlation r = +0.74 => a somewhat strong positive linear relationship.

Example 5.9 Hours of Sleep and Hours of Study

Relationship between reported hours of sleep the previous 24 hours and the reported hours of study during the same period for a sample of 116 college students.

Correlation r = –0. => a not too strong negative association.

A Correlation applet

Interpretation of $r^2$ and a Formula

Squared correlation $r^2$ is between 0 and 1 and indicates the proportion of variation in the response explained by x.

SSTO = sum of squares total = sum of squared differences between observed y values and $\bar{y}$.

SSE = sum of squared errors (residuals) = sum of squared differences between observed y values and predicted values based on the least squares line.

$$r^2 = \frac{SSTO - SSE}{SSTO}$$
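To see how SSTO and SSE combine into the squared correlation, the sketch below fits a least squares line to a small made-up data set and computes (SSTO - SSE) / SSTO. The data and the function name `r_squared` are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def r_squared(x, y):
    """Return (SSTO, SSE, r^2) for the least squares line fitted to x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Least squares slope and intercept.
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    b0 = mean_y - b1 * mean_x
    # SSTO: squared distances of y from its mean.
    ssto = sum((yi - mean_y) ** 2 for yi in y)
    # SSE: squared residuals from the fitted line.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return ssto, sse, (ssto - sse) / ssto


# Hypothetical data: the fitted line is y-hat = 2.2 + 0.6 x.
ssto, sse, r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(ssto, sse, r2)  # approximately 6, 2.4 and 0.6
```

Here the line removes 3.6 of the 6 units of total squared variation, so about 60% of the variation in y is explained by x.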

Interpretation of $r^2$

Example 5.6: Left and Right Handspans

$r^2$ = 0.90 => span of one hand is very predictable from span of other hand.

Example 5.8: TV viewing and Age

$r^2$ = 0.014 => only about 1.4% of the variation in daily TV viewing is explained by age; knowing a person’s age doesn’t help much in predicting amount of daily TV viewing.


Regression in Minitab:

Stat > Regression > Regression

Regression Analysis: SAT versus ACCEPT

The regression equation is
SAT = 1371 - 2.83 ACCEPT

Predictor     Coef  SE Coef      T   P
Constant   1371.05    21.45  63.92  0.
ACCEPT     -2.8300   0.5351  -5.29  0.

S = 50.0586   R-Sq = 36.8%   R-Sq(adj) = 35.5%

5.4 Why the Answers May Not Make Sense

  • Allowing outliers to overly influence the results
  • Combining groups inappropriately
  • Using correlation and a straight-line equation to describe curvilinear data
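The second pitfall, combining groups inappropriately, can be illustrated with made-up numbers. In the sketch below, x and y are uncorrelated within each of two hypothetical groups, yet pooling the groups produces a correlation near 1 because one group sits higher on both variables. The data and the function name `correlation` are illustrative assumptions, not values from the notes.

```python
import math


def correlation(x, y):
    """Sample correlation r; returns 0.0 when the cross-product sum is zero."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical groups with the same V-shaped pattern.
group_a_x, group_a_y = [1, 2, 3], [2, 1, 2]
group_b_x, group_b_y = [11, 12, 13], [12, 11, 12]

r_a = correlation(group_a_x, group_a_y)
r_b = correlation(group_b_x, group_b_y)
r_pooled = correlation(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_a, r_b)   # no linear association within either group
print(r_pooled)   # strong positive correlation once the groups are pooled
```

Plotting the two groups with different symbols, as suggested in Section 5.1, would make the source of the pooled correlation visible immediately.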

Example 5.4 Height and Foot Length (cont)

Regression equation

uncorrected data: 15.4 + 0.13 height

corrected data: -3.2 + 0.42 height

Correlation

uncorrected data: r = 0.

corrected data: r = 0.

Three outliers were

data entry errors.

Example 5.10 Earthquakes in US

Correlation

all data: r = 0.

w/o SF: r = –0.

San Francisco earthquake of 1906.

Case Study 5.1 A Weighty Issue

Relationship between Actual and Ideal Weight

[Figure panels: Females, Males]