Dr. Neal, Spring 2009
MATH 203

Covariance and Correlation

Suppose we have a set of paired measurements $(X, Y)$. Eventually we will be testing whether or not the measurements are independent; i.e., whether or not one has absolutely no effect on the other. For example, let $X$ be the sum of the last four digits in your social security number and let $Y$ be the sum of the last four digits in your telephone number. It would make sense that one sum's value has no effect on the other sum's value. In other words, there is no correlation between the two.

But when there is some correlation, we will measure the degree of linear dependence between the measurements. For example, for people buying their first life insurance policy, let $X$ be the age of the insured and let $Y$ be the monthly premium. In this case, there probably is some correlation: as the age increases, the monthly premium tends to increase as well.

Covariance

Recall that the variance of a measurement $X$ is given by "the average of the squares minus the square of the average." When we have a census of measurements $x_1, x_2, \ldots, x_N$, we can write

$$\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2$$

and then the true standard deviation is $\sigma_X = \sqrt{\sigma_X^2}$.

If we have two measurements $X$ and $Y$, then we can compute the covariance by the average product minus the product of the averages:

$$\mathrm{Cov}(X, Y) = \mu_{XY} - \mu_X \mu_Y$$

Note that $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.

We have previously noted that the average of a sum is the sum of the averages: $\mu_{X+Y} = \mu_X + \mu_Y$. It is not always the case, however, that the average of a product equals the product of the averages. But we can say the following:

When measurements $X$ and $Y$ are independent, then $\mu_{XY} = \mu_X \mu_Y$; hence, $\mathrm{Cov}(X, Y) = 0$.

Example 1.
Let $X$ be an integer randomly chosen from {2, 3, 4} and let $Y$ be an integer randomly chosen from {3, 4, 5, 6, 7}. Compute $\mu_{XY}$ and $\mu_X \mu_Y$.

Solution. Because each of 2, 3, 4 is equally likely to be chosen, $\mu_X = \frac{2 + 3 + 4}{3} = 3$. Likewise, $\mu_Y = 5$, and thus $\mu_X \mu_Y = 15$. Intuitively, $X$ and $Y$ should be independent of each other, so it should be the case that $\mu_{XY} = \mu_X \mu_Y = 15$. But we can compute the average product directly. By multiplying each possible $X$ value by each possible $Y$ value, we obtain all 15 possible products:

{6, 8, 10, 12, 14, 9, 12, 15, 18, 21, 12, 16, 20, 24, 28}.

Thus, $\mu_{XY} = \frac{6 + 8 + 10 + \cdots + 28}{15} = \frac{225}{15} = 15$. Hence, $\mathrm{Cov}(X, Y) = \mu_{XY} - \mu_X \mu_Y = 15 - 15 = 0$.

Example 2. Let $X$ be an integer randomly chosen from {0, 1, 2, 3, 4} and let $Y$ be an integer randomly chosen from 0 to $4 - X$. Compute $\mathrm{Cov}(X, Y)$.

Solution. Clearly, $\mu_X = 2$. But computing $\mu_Y$ is a little more involved:

X    Possible Y         Y Average
0    {0, 1, 2, 3, 4}    2
1    {0, 1, 2, 3}       1.5
2    {0, 1, 2}          1
3    {0, 1}             0.5
4    {0}                0

Because each of the five $X$ values is equally likely, the overall $Y$ average is $\mu_Y = \frac{2 + 1.5 + 1 + 0.5 + 0}{5} = 1$.

By multiplying each possible $X$ value by each possible $Y$ value, we again obtain 15 possible products:

{0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 2, 4, 0, 3, 0}.

Thus, $\mu_{XY} = \frac{1 + 2 + 3 + 2 + 4 + 3}{15} = \frac{15}{15} = 1$. Strictly speaking, these 15 products are not equally likely, because each $X$ value has probability 1/5 no matter how many $Y$ values it allows. Weighting each $X$ value by 1/5 and using the $Y$ averages from the table gives the same result: $\mu_{XY} = \frac{1}{5}(0 \cdot 2 + 1 \cdot 1.5 + 2 \cdot 1 + 3 \cdot 0.5 + 4 \cdot 0) = 1$. Hence, $\mathrm{Cov}(X, Y) = \mu_{XY} - \mu_X \mu_Y = 1 - 2 = -1$. In this case, the possible values of $Y$ are clearly dependent on the value of $X$.

Example 3. Compute $\mathrm{Cov}(X, Y)$ if $Y$ is a direct function of $X$ given by $Y = 4X^2 + 3$ and $X$ has the following distribution:

X       -2     -1     0      1      2
Prob    0.1    0.2    0.05   0.25   0.40

Example 5. Find the correlation between tar and nicotine for the following 10 brands of cigarettes:

Brand              Tar (mg)    Nicotine (mg)
Alpine             14.1        0.86
Benson & Hedges    16.0        1.06
Bull Durham        29.8        2.03
Camel Lights       8.0         0.67
Carlton            4.1         0.4
Chesterfield       15.0        1.04
Golden Lights      8.8         0.76
Kent               12.4        0.95
Kool               16.6        1.12
L&M                14.9        1.02

Solution.
First, we enter the data into lists, say lists L1 and L2, and then compute the basic statistics with the 2-Var Stats command.

$$\rho = \frac{\mu_{XY} - \mu_X \mu_Y}{\sigma_X \sigma_Y} \approx \frac{164.438 / 10 - 13.97 \times 0.991}{6.525496 \times 0.40357} \approx 0.9871$$

Because $\rho$ is near 1, we have a strong linear relationship between tar and nicotine. As the tar level increases, the nicotine level tends to increase in a linear manner.

The LinReg(ax+b) command from the STAT CALC menu will compute the correlation directly. Enter the command LinReg(ax+b) L1,L2 (or use whatever lists hold the paired data). (Note: If the value of r does not appear, then bring up and enter the DiagnosticOn command from the CATALOG. Then re-enter the LinReg command.)

The LinReg command displays the correlation as the symbol $r$, which is used to denote the sample correlation. If the paired data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ represents all possible measurements, then we use $\rho$ for the true correlation, just as we use $\mu$ for the true average. But if we only have a sample, then we use $r$, just as we use $\bar{x}$. However, just as $\bar{x}$ is computed the same way as $\mu$, the sample correlation $r$ is computed the same way as $\rho$.

Sample Correlation

Formally, the sample correlation $r$ on sample data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ is defined by

$$r = \frac{\overline{xy} - (\bar{x})(\bar{y})}{\sqrt{\overline{x^2} - (\bar{x})^2}\,\sqrt{\overline{y^2} - (\bar{y})^2}} = \frac{\dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\dfrac{1}{n}\sum_{i=1}^{n} x_i\right) \times \left(\dfrac{1}{n}\sum_{i=1}^{n} y_i\right)}{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - (\bar{x})^2} \times \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} y_i^2 - (\bar{y})^2}}$$

This value of $r$ is used to estimate the true correlation $\rho$. Of course, it is easily calculated with the LinReg command after entering the paired data into lists.
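
For readers without the calculator at hand, the same quantities can be cross-checked with a short Python sketch. This is only an illustration, not part of the course materials: the helper names (`expectation`, `covariance`, `correlation`) and the representation of a distribution as a list of (x, y, probability) triples are assumptions made here. For Example 2 it further assumes that each $X$ value has probability 1/5 and that, given $X$, each permitted $Y$ value is equally likely.

```python
from math import sqrt

def expectation(dist, f):
    """Average of f(x, y) over a list of (x, y, probability) triples."""
    return sum(p * f(x, y) for x, y, p in dist)

def covariance(dist):
    """Average product minus the product of the averages."""
    mean_x = expectation(dist, lambda x, y: x)
    mean_y = expectation(dist, lambda x, y: y)
    mean_xy = expectation(dist, lambda x, y: x * y)
    return mean_xy - mean_x * mean_y

def correlation(dist):
    """Covariance divided by the product of the standard deviations."""
    var_x = expectation(dist, lambda x, y: x * x) - expectation(dist, lambda x, y: x) ** 2
    var_y = expectation(dist, lambda x, y: y * y) - expectation(dist, lambda x, y: y) ** 2
    return covariance(dist) / sqrt(var_x * var_y)

# Example 1: X from {2, 3, 4}, Y from {3, ..., 7}; all 15 pairs equally likely.
ex1 = [(x, y, 1 / 15) for x in (2, 3, 4) for y in range(3, 8)]
cov_ex1 = covariance(ex1)

# Example 2: X uniform on {0, ..., 4}; given X, Y assumed uniform on {0, ..., 4 - X}.
ex2 = [(x, y, (1 / 5) / (5 - x)) for x in range(5) for y in range(5 - x)]
cov_ex2 = covariance(ex2)

# Example 5: tar and nicotine for the 10 brands, each brand weighted 1/10.
tar = [14.1, 16.0, 29.8, 8.0, 4.1, 15.0, 8.8, 12.4, 16.6, 14.9]
nicotine = [0.86, 1.06, 2.03, 0.67, 0.4, 1.04, 0.76, 0.95, 1.12, 1.02]
ex5 = [(t, n, 1 / len(tar)) for t, n in zip(tar, nicotine)]
r_ex5 = correlation(ex5)

print(f"Example 1 covariance:  {cov_ex1:.4f}")
print(f"Example 2 covariance:  {cov_ex2:.4f}")
print(f"Example 5 correlation: {r_ex5:.4f}")
```

Up to floating-point rounding, the printed values should agree with the results worked out above: covariance 0 in Example 1, covariance -1 in Example 2, and a correlation of about 0.9871 in Example 5. Storing explicit probabilities, rather than a bare list of pairs, lets the same helpers handle Example 2, where the pairs are not all equally likely.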