Regression and Correlation: Calculating the Sample Correlation Coefficient, Study notes of Data Analysis & Statistical Methods

A lecture note from a statistics course (STA100) on regression and correlation. It covers the correlation coefficient as a measure of the strength of the linear association between two variables, gives formulas for calculating the sample covariance and the sample correlation coefficient, and works an example problem calculating the correlation coefficient between two sets of data. The lecture also mentions the importance of the correlation coefficient in judging the goodness of fit of a straight-line model.

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

koofers-user-k5v
STA100 Lecture 23
Regression and Correlation
Text sections 10.1, 10.2, 10.3
Our last presentation involved the prediction of Men’s Body Temperature from Men’s Systolic Blood
Pressure. That is we built a model
$$\text{temp}_{men} = \text{slope} \cdot \text{systolic}_{men} + \text{intercept}.$$
As a review, we’ll do this for the women in the study.
$$\text{temp}_{women} = \text{slope} \cdot \text{systolic}_{women} + \text{intercept}$$
Women   systolic   diastolic   temperature
1       138        70          36.2
2       145        81          36.8
3       121        83          36.7
4       134        85          37.4
5       144        78          36.7
6       109        72          36
7       146        84          36.2
8       108        62          37.1
9       120        68          37.1
10      123        75          36.8
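Recall that we would like to find the best slope and will use the formulas
$$SS_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)\left(\sum x_i\right)$$
$$SS_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$$
$$\text{slope} = b = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)}{\sum x_i x_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum x_i\right)}$$
Sometimes students don’t immediately see that these are equivalent to the formulas in your text (on
page 511, blue box). Just multiply the top and bottom of each fraction by $n$ to get
$$\text{slope} = b = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{n \sum x_i x_i - \left(\sum x_i\right)\left(\sum x_i\right)}$$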

I prefer the $SS_{xx}$ and $SS_{xy}$ terms because they set you up for future work (ANOVA) and make the formulas easier to remember.
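As a quick check that the two ways of writing the slope really agree, here is a minimal Python sketch using the women's systolic and temperature data from these notes. The function names `slope_ss_form` and `slope_textbook_form` are made up for this illustration; they are not from the text or from any library.

```python
# Women's data from the table in these notes.
systolic = [138, 145, 121, 134, 144, 109, 146, 108, 120, 123]
temperature = [36.2, 36.8, 36.7, 37.4, 36.7, 36.0, 36.2, 37.1, 37.1, 36.8]


def slope_ss_form(x, y):
    """Slope written as SS_xy / SS_xx."""
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    ss_xx = sum(a * a for a in x) - sum(x) * sum(x) / n
    return ss_xy / ss_xx


def slope_textbook_form(x, y):
    """Same slope after multiplying numerator and denominator by n."""
    n = len(x)
    numerator = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
    denominator = n * sum(a * a for a in x) - sum(x) * sum(x)
    return numerator / denominator


b_ss = slope_ss_form(systolic, temperature)
b_text = slope_textbook_form(systolic, temperature)
# The two values should match up to floating-point rounding.
print(b_ss, b_text)
```

Both functions compute the same quantity, so which one to use by hand is purely a matter of taste.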

Moving to Excel:

x, systolic   y, temperature   xx       xy        yy
138           36.2             19044    4995.6    1310.44
145           36.8             21025    5336      1354.24
121           36.7             14641    4440.7    1346.89
134           37.4             17956    5011.6    1398.76
144           36.7             20736    5284.8    1346.89
109           36               11881    3924      1296
146           36.2             21316    5285.2    1310.44
108           37.1             11664    4006.8    1376.41
120           37.1             14400    4452      1376.41
123           36.8             15129    4526.4    1354.24
sums 1288     367              167792   47263.1   13470.72

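$$SS_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)\left(\sum x_i\right) = 167792 - \frac{1}{10}(1288)(1288) = 1897.6$$

$$SS_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 47263.1 - \frac{1}{10}(1288)(367) = -6.5$$

$$b = \frac{SS_{xy}}{SS_{xx}} = \frac{-6.5}{1897.6} \approx -0.0034$$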

Not much of a slope. The intercept is

$$a = \bar{y} - b \cdot \bar{x} = \frac{367}{10} - (-0.0034)\left(\frac{1288}{10}\right) \approx 37.14$$
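The same arithmetic can be done outside Excel. The sketch below starts from the column sums in the table above and works through $SS_{xx}$, $SS_{xy}$, the slope and the intercept; the variable names are illustrative choices, not anything from the text.

```python
# Column sums for the women's systolic (x) and temperature (y) data.
n = 10
sum_x, sum_y = 1288, 367.0
sum_xx, sum_xy = 167792, 47263.1

ss_xx = sum_xx - sum_x * sum_x / n   # corrected sum of squares for x
ss_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products

b = ss_xy / ss_xx                    # slope
a = sum_y / n - b * (sum_x / n)      # intercept: y-bar minus b times x-bar

print(round(ss_xx, 1), round(ss_xy, 1), round(b, 4), round(a, 2))
```

Note that the code carries the unrounded slope into the intercept, whereas the hand calculation plugs in the rounded value $-0.0034$; to two decimal places both give the same intercept.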

See if you can interpret these numbers off of the graph.

[...] say pretty well where a $y$ value will lie. Here’s how you can think about this. Look at the simpler pictures below.

If you look at the first you will see that most of the above-average x values go along with the above-average y values. It is the same thing with the below-average x values and the below-average y values. Think about the deviations (distances from the mean). This means that

• When $x_i - \bar{x} > 0$ it is pretty common to find $y_i - \bar{y} > 0$ as well. A positive times a positive is positive, so this means that $(x_i - \bar{x})(y_i - \bar{y}) > 0$, too.
• On the other side, when $x_i - \bar{x} < 0$ it is pretty common to find $y_i - \bar{y} < 0$ as well. A negative times a negative is positive, so this means that $(x_i - \bar{x})(y_i - \bar{y}) > 0$.
• There aren’t many above-average x values associated with below-average y values. So there aren’t many terms where $x_i - \bar{x} > 0$ and $y_i - \bar{y} < 0$. That means there aren’t many terms where $(x_i - \bar{x})(y_i - \bar{y}) < 0$.
• Just to be complete, there aren’t many below-average x values associated with above-average y values. So there aren’t many terms where $x_i - \bar{x} < 0$ and $y_i - \bar{y} > 0$. Again, that means there aren’t many terms where $(x_i - \bar{x})(y_i - \bar{y}) < 0$.

Summing up, in the first graph most of the products $(x_i - \bar{x})(y_i - \bar{y})$ give positive numbers and we would expect the sum of these terms to take us pretty far in the positive direction and we expect $\sum (x_i - \bar{x})(y_i - \bar{y})$ to be on the large side.
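To see this sign-counting argument on real numbers, the sketch below uses the women's systolic and diastolic readings from the table in these notes. That choice of data is an assumption made for illustration; it is not the data behind the pictures. The code counts how many products of deviations are positive and how many are negative, then adds them up.

```python
# Women's systolic and diastolic readings from the table in these notes.
systolic = [138, 145, 121, 134, 144, 109, 146, 108, 120, 123]
diastolic = [70, 81, 83, 85, 78, 72, 84, 62, 68, 75]

x_bar = sum(systolic) / len(systolic)
y_bar = sum(diastolic) / len(diastolic)

# One product of deviations per woman.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(systolic, diastolic)]

n_positive = sum(1 for p in products if p > 0)
n_negative = sum(1 for p in products if p < 0)
total = sum(products)

print(n_positive, n_negative, round(total, 1))
```

Eight of the ten products come out positive, so the sum lands well into positive territory, which is the pattern described for the first picture.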

[Figure: two scatterplots of y against x, each with the x axis running from -4 to 4 and the y axis from 0 to 6.]

This is not true in the second graph. In this case the positives tend to cancel the negatives and we’d expect $\sum (x_i - \bar{x})(y_i - \bar{y})$ to be on the small side. We have the following definition.

Given a set of ordered pairs of the form $(x_i, y_i)$ we define the sample covariance to be the average of the product of the deviations.

$$cov = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{SS_{xy}}{n - 1}$$
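The two expressions for the covariance can be checked against each other numerically. This is a small sketch using the women's systolic and temperature data from these notes; the function names are hypothetical and chosen only for this example.

```python
systolic = [138, 145, 121, 134, 144, 109, 146, 108, 120, 123]
temperature = [36.2, 36.8, 36.7, 37.4, 36.7, 36.0, 36.2, 37.1, 37.1, 36.8]


def covariance_from_deviations(x, y):
    """Sum of products of deviations, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def covariance_from_ss(x, y):
    """SS_xy computed with the shortcut formula, divided by n - 1."""
    n = len(x)
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return ss_xy / (n - 1)


cov_dev = covariance_from_deviations(systolic, temperature)
cov_ss = covariance_from_ss(systolic, temperature)
# Both routes should give the same number up to rounding error.
print(cov_dev, cov_ss)
```

The shortcut version needs only running sums, which is why it suits a spreadsheet or a hand calculation.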

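To get to the formula for the correlation coefficient given in your text you need to convert the $x$ terms and the $y$ terms to standard units, $\frac{x_i - \bar{x}}{s_x}$ and $\frac{y_i - \bar{y}}{s_y}$.

Don’t worry too much, but if you follow through on the algebra you get

$$\frac{\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1} = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}}$$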

We’re done, and we call the last term on the right the sample correlation coefficient.
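For readers who do want to follow the algebra, one way to fill in the steps is sketched below. It assumes the usual sample standard deviations $s_x = \sqrt{SS_{xx}/(n-1)}$ and $s_y = \sqrt{SS_{yy}/(n-1)}$, along with $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$; these definitions are standard but are not spelled out in this lecture.

```latex
\begin{align*}
\frac{1}{n-1}\sum \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  &= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{(n-1)\, s_x s_y} \\
  &= \frac{SS_{xy}}{(n-1)\sqrt{\dfrac{SS_{xx}}{n-1}}\,\sqrt{\dfrac{SS_{yy}}{n-1}}} \\
  &= \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = r
\end{align*}
```

The two square roots each contribute a factor of $\sqrt{n-1}$ in the denominator, and together they cancel the $(n-1)$ in front.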

You should at this point have the sense that whether a straight line really is a good model is measured (somehow) by the sample correlation coefficient. Computing it isn’t that difficult if you take the terms one at a time and practice, practice, practice.

Consider the following problem from our text.

What is the optimal time for a scuba diver to be on the bottom? That depends on the depth of the dive. The U.S. Navy has done a lot of research on this topic. The Navy defines the “optimal time” to be the time at each depth for the best balance between the length of work period and the decompression time after surfacing. Let $x$ = the depth of dive in meters, and let $y$ = the optimal time.

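x, depth of dive   y, optimal time   xx         xy        yy
14.1               2.58              198.81     36.378    6.6564
24.3               2.08              590.49     50.544    4.3264
30.2               1.58              912.04     47.716    2.4964
38.3               1.03              1466.89    39.449    1.0609
51.3               0.75              2631.69    38.475    0.5625
20.5               2.38              420.25     48.79     5.6644
22.7               2.2               515.29     49.94     4.84
sums 201.4         12.6              6735.46    311.292   25.607

$\sum x_i = 201.4$, $\sum y_i = 12.6$, $\sum x_i y_i = 311.292$, $\sum x_i x_i = 6735.46$, $\sum y_i y_i = 25.607$. Here we go.

$$SS_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 311.292 - \frac{1}{7}(201.4)(12.6) = -51.2280$$

$$SS_{xx} = \sum x_i x_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum x_i\right) = 6735.46 - \frac{1}{7}(201.4)(201.4) = 940.8943$$

$$SS_{yy} = \sum y_i y_i - \frac{1}{n}\left(\sum y_i\right)\left(\sum y_i\right) = 25.607 - \frac{1}{7}(12.6)(12.6) = 2.9270$$

This gives us

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}} = \frac{-51.2280}{\sqrt{940.8943 \cdot 2.9270}} \approx -0.976$$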
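The scuba calculation can be reproduced with a few lines of Python. This is a sketch under the same shortcut formulas used above; `corrected_sum` and `correlation` are hypothetical helper names introduced only for this example.

```python
import math

# Scuba data from the problem above: depth of dive (x) and optimal time (y).
depth = [14.1, 24.3, 30.2, 38.3, 51.3, 20.5, 22.7]
optimal_time = [2.58, 2.08, 1.58, 1.03, 0.75, 2.38, 2.20]


def corrected_sum(u, v):
    """Shortcut formula: sum of u*v minus (sum u)(sum v)/n.

    Gives SS_xy for two different lists, or SS_xx / SS_yy when u is v.
    """
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


def correlation(x, y):
    """Sample correlation coefficient r = SS_xy / sqrt(SS_xx * SS_yy)."""
    return corrected_sum(x, y) / math.sqrt(corrected_sum(x, x) * corrected_sum(y, y))


ss_xy = corrected_sum(depth, optimal_time)
ss_xx = corrected_sum(depth, depth)
ss_yy = corrected_sum(optimal_time, optimal_time)
r = correlation(depth, optimal_time)

print(round(ss_xy, 4), round(ss_xx, 4), round(ss_yy, 4), round(r, 3))
```

Using one helper for all three sums of squares mirrors the point that $SS_{xx}$, $SS_{yy}$ and $SS_{xy}$ share the same pattern.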

We will see more clearly in the next lecture how to use this number. This lecture was mainly about the calculations. Here’s yet another presentation problem.

Calculate two correlation coefficients: the correlation between systolic and diastolic blood pressure (should be pretty high) and between systolic and temperature (should be pretty low).

Women   systolic   diastolic   temperature
1       138        70          36.2
2       145        81          36.8
3       121        83          36.7
4       134        85          37.4
5       144        78          36.7
6       109        72          36
7       146        84          36.2
8       108        62          37.1
9       120        68          37.1
10      123        75          36.8
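After working the problem by hand, one way to check the answers is to compute each correlation straight from the standard-units definition. The sketch below does that with Python's standard `statistics` module (`mean` and `stdev`, where `stdev` divides by $n - 1$); the function name `correlation_standard_units` is made up for this illustration.

```python
import statistics

systolic = [138, 145, 121, 134, 144, 109, 146, 108, 120, 123]
diastolic = [70, 81, 83, 85, 78, 72, 84, 62, 68, 75]
temperature = [36.2, 36.8, 36.7, 37.4, 36.7, 36.0, 36.2, 37.1, 37.1, 36.8]


def correlation_standard_units(x, y):
    """Average (dividing by n - 1) of the products of standard units."""
    n = len(x)
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    s_x, s_y = statistics.stdev(x), statistics.stdev(y)
    z_products = [((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)]
    return sum(z_products) / (n - 1)


r_sys_dia = correlation_standard_units(systolic, diastolic)
r_sys_temp = correlation_standard_units(systolic, temperature)

print(round(r_sys_dia, 3), round(r_sys_temp, 3))
```

If the hand calculation with $SS_{xy}$, $SS_{xx}$ and $SS_{yy}$ is right, it should agree with these values to within rounding.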