Analyzing Relationship: Scatterplots, Regression, and Hypothesis Testing, Exams of Probability and Statistics

An introduction to analyzing the relationship between quantitative variables using scatterplots, linear regression, and hypothesis testing. It includes an example examining the relationship between ice cream consumption and temperature, as well as explanations of linear and non-linear relationships, correlation coefficients, and hypothesis testing for a linear relationship. The document also covers the importance of interpreting results with caution, avoiding data snooping, and making a Bonferroni adjustment.

Typology: Exams

Pre 2010

Uploaded on 08/08/2009

koofers-user-x6r

Stat Handout 5 Linear Regression Math 382

5.1 Introduction

Suppose we want to investigate the relationship between two continuous random variables. Note that we will look at the relationship between a qualitative random variable and a continuous random variable with ANOVA, and at the relationship between two qualitative random variables with a Chi-Squared Test.

Example: Ice Cream Consumption. Suppose we are thinking of opening an ice cream business and we want to know whether there is a relationship between the amount of ice cream consumed and the temperature. We decided to collect data over a 30-week time period from March to July. For each week, we recorded the average amount of ice cream consumed (per person) as well as the mean temperature. The data are presented on the last page of the handout.

We have seen data like this before. It is paired data. For each of the 30 weeks, we have two pieces of information. However, we are not interested in the difference of the two means in this case. Our new question is, is there a pattern or predictable relationship that occurs between the two variables? In other words, as temperature increases, what happens to ice cream consumption (does it increase, decrease, or stay the same)?

If we do see a pattern that is strong enough, we can build a model from our sample data that allows us to make predictions about the population, and help our ice cream business be more successful.

To examine the relationship between two quantitative variables (we will call them X and Y), we use a scatterplot: a two-dimensional grid with a horizontal axis (for the X variable) and a vertical axis (for the Y variable).

[Figure: Plot of Ice Cream Consumption vs. Temperature. Horizontal axis: Mean Temp (F), 0 to 80. Vertical axis: Pints per Person.]

Looking at the data, do you see a relationship between temperature and ice cream consumption? If so, describe this relationship.

Types of Relationship between Quantitative Variables

Two quantitative variables can be related to each other in a number of ways. Some variables have a linear relationship; others have relationships that are best described as non-linear.

Variables that have a non-linear relationship might be related through a simple curve (such as the number of fruit flies that multiply over time), or the relationship might be much more involved, requiring more complicated functions to describe (such as stock market prices over time).

5.2 Measuring the Strength of the Linear Relationship

A scatterplot can help give us a general idea as to whether or not there is a linear relationship between two variables, how strong the relationship appears to be in the sample, and what the direction of the relationship is.

Direction of the Linear Relationship. If the pattern goes uphill, we say the linear relationship is positive (both variables increase together or decrease together). For example, height and weight of male adults exhibit a positive linear relationship. If the pattern goes downhill, we say the linear relationship is negative (as one variable increases, the other decreases). For example, we hope that as the number of police officers increases, the number of crimes decreases.

Strength of the Linear Relationship. The strength of the linear relationship is a measure of how closely the pattern of observed values resembles a straight line. If the data points line up perfectly, we say there is a perfect linear relationship. If the points lie quite close to the line overall, we say the relationship is strong. If the points don't have too much of a pattern, yet seem to resemble a cloud going uphill, the relationship is weak. If the points are scattered everywhere (or in cases where a different type of pattern exists), we say there is no linear relationship.

To measure the strength and direction of the linear relationship between two quantitative variables, we calculate the correlation coefficient. We draw conclusions about the population correlation coefficient, $\rho$, by calculating the sample correlation coefficient, $r$.

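$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{y_i-\bar{y}}{s_y}\right)\left(\frac{x_i-\bar{x}}{s_x}\right) = \frac{SS_{xy}}{\sqrt{SS_x\,SS_y}}$$

where

$$SS_{xy} = \sum_{i=1}^{n}(y_i-\bar{y})(x_i-\bar{x}) = \sum_{i=1}^{n}x_i y_i - n\,\bar{x}\,\bar{y},$$

$$SS_x = (n-1)\,s_x^2 = \sum_{i=1}^{n}x_i^2 - n\,\bar{x}^2,$$

and

$$SS_y = (n-1)\,s_y^2 = \sum_{i=1}^{n}y_i^2 - n\,\bar{y}^2.$$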
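As a rough check on the two expressions for r, the sketch below computes the sample correlation both ways for the ice cream data listed at the end of the handout. It is only an illustration: the function names are made up here, and it uses nothing beyond Python's standard library.

```python
import math
from statistics import mean, stdev

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]


def correlation_standardized(x, y):
    """r as the sum of products of standardized values, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


def correlation_sums_of_squares(x, y):
    """r as SS_xy / sqrt(SS_x * SS_y), using the computational shortcuts."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_y = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return ss_xy / math.sqrt(ss_x * ss_y)


r_standardized = correlation_standardized(temps, pints)
r_sums = correlation_sums_of_squares(temps, pints)
print(f"r (standardized form)    = {r_standardized:.4f}")
print(f"r (sums-of-squares form) = {r_sums:.4f}")
```

Both functions should print the same value, since the two formulas are algebraically identical.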

As noted previously, the conclusions that a researcher is able to draw from any scientific study depend on how the study was designed, how the data were collected, and how strong the evidence was. In scientific studies where relationships are found, we have to be very careful about how our results are interpreted. The two most important questions we need to address are listed below.

I. Correlation vs. Causation:

If one variable, X, is found to be correlated to another variable, Y (a significant linear relationship is found), does that mean that a change in X causes a change in Y? When does correlation imply causation?

Recall, a designed experiment that controls for other possible variables is the best way to determine a cause and effect relationship.

In cases where it is not possible to conduct a designed experiment, it is difficult to show that two related variables indeed have a causal relationship. Researchers must conduct many observational studies in many different situations to see if they can repeat the same results. The highest-profile example of this is the case of showing that smoking causes lung cancer.

For instance, suppose we found that the variable "pints per person of ice cream" is highly correlated with "the number of lifeguard emergencies on a beach". Should we conclude that ice cream causes accidents?

II. Data Snooping:

Suppose many hypothesis tests were conducted, and the significant results were reported. Do the significant results show evidence of an actual relationship within the population, or is the relationship merely showing up due to a large number of hypothesis tests conducted on the data from this sample? (This misuse of statistics is called "data snooping," or "fishing for results.")

  • We know that whenever a single hypothesis test is conducted, there is a chance that the sample data lead you to reject $H_0$ when $H_0$ was really true. In other words, you found a significant relationship in the sample, but there really was no such relationship in the population. (Remember, this is a Type I Error.) What is this chance?
  • Suppose we use α = 0.05. Then there is a 5% chance of a Type I error for any single hypothesis test that is done. Using this idea, suppose we had a big data set and tested 100 possible relationships, each with a 5% chance of making a Type I error. How many "false" relationships would we likely find, just by chance?
  • How can we make it "harder" to find "false relationships"? Since we are doing so many tests, perhaps we should require a smaller p-value each time. If you were going to do 100 hypothesis tests, what cutoff would you suggest for the p-value on each test to avoid making so many Type I errors?
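The arithmetic behind the second question can be sketched as follows. The calculation assumes that all 100 null hypotheses are true and that the tests are independent of each other; the independence assumption is added here for illustration and is not stated in the handout.

```python
alpha = 0.05     # chance of a Type I error on any single test
num_tests = 100  # number of tests, all assumed to have a true null hypothesis

# Expected number of "false" relationships found just by chance.
expected_false_positives = alpha * num_tests

# Chance of at least one Type I error, assuming the tests are independent.
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one Type I error): {prob_at_least_one:.3f}")
```

Under these assumptions we would expect about 5 spurious "significant" results, and it is almost certain (probability above 0.99) that at least one appears.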

Bonferroni Adjustment

To avoid problems associated with data snooping, we should view the p-values from our hypothesis tests with some caution in the case where many tests are conducted. We should make it harder to find a significant result on each test, and thus lower the chance of making a Type I error. One way to do this is to make a "Bonferroni adjustment."

How to Make a Bonferroni Adjustment:

The Bonferroni adjustment asks the researcher to compare their usual p-value to 0.05/n, where n is the number of hypothesis tests being conducted. This adjustment should be made when at least 5 tests are conducted.

For example, if 10 hypothesis tests were conducted, our new cutoff for the p-value for each test is 0.05/10 = 0.005. If your p-value for any of these 10 tests is less than 0.005, then you reject $H_0$ for that test. Otherwise, you can't reject $H_0$.
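A minimal sketch of the adjustment described above. The function name `bonferroni_reject` and the ten p-values are hypothetical, invented purely for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one decision per test: True means reject H0 at cutoff alpha / n."""
    cutoff = alpha / len(p_values)
    return [p < cutoff for p in p_values]


# Hypothetical p-values from 10 hypothesis tests (made up for illustration).
p_values = [0.001, 0.004, 0.012, 0.03, 0.049, 0.20, 0.35, 0.51, 0.77, 0.92]

decisions = bonferroni_reject(p_values)
unadjusted = [p < 0.05 for p in p_values]

print("rejected without adjustment:", sum(unadjusted))
print("rejected with Bonferroni cutoff 0.05/10:", sum(decisions))
```

With these made-up values, five tests fall below 0.05, but only the two with p-values under 0.005 remain significant after the adjustment.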

When evaluating results presented to you, be sure to check whether an adjustment was made for the number of tests conducted. Sometimes people report only the significant results and don't tell you how many hypothesis tests they conducted before they found that result. That is what we mean by data snooping, or fishing for results.

5.3 Least Squares Regression Line

If there is a linear relationship between two continuous variables, we assume that the equation of the regression line for the population is

$$Y_{ij} = \beta_0 + \beta_1 X_i + \varepsilon_{ij}$$

where $\beta_0$ is the y-intercept, $\beta_1$ is the slope (the amount by which y changes when x increases by one unit), and $\varepsilon_{ij}$ is the error (the vertical distance from an observation to the line).

Assumptions:

  1. We are investigating only linear relationships.
  2. $E(Y_i) = \beta_0 + \beta_1 X_i$ for each $X_i$.
  3. The $\varepsilon_{ij}$'s are normally distributed with mean 0 and variance $\sigma^2$ for each $X_i$.

If we take a random sample of n observations, we measure the explanatory variable, X, and the response variable, Y, for each individual. The equation of the least-squares regression line is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \quad \text{for } i = 1, 2, \ldots, n.$$

Note that $\hat{y}_i$ is the point estimate of $E(Y_i) = \beta_0 + \beta_1 X_i$.

The estimated slope and estimated y-intercept are

$$\hat{\beta}_1 = r\,\frac{s_y}{s_x}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x}$$

where $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation of the $x_i$, $\bar{y}$ is the sample mean and $s_y$ is the sample standard deviation of the $y_i$, and $r$ is the sample correlation.
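The slope and intercept formulas can be tried on the ice cream data from the table at the end of the handout. This is a sketch only; the helper names are invented here, and the data lists are copied from that table.

```python
from statistics import mean, stdev

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]


def sample_correlation(x, y):
    """Sample correlation r (standardized-product form)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


def least_squares_line(x, y):
    """Return (intercept, slope) with slope = r * s_y / s_x."""
    slope = sample_correlation(x, y) * stdev(y) / stdev(x)
    intercept = mean(y) - slope * mean(x)
    return intercept, slope


b0, b1 = least_squares_line(temps, pints)
print(f"predicted pints per person = {b0:.4f} + {b1:.5f} * (mean temp)")
```

Because the intercept is defined as the mean of y minus the slope times the mean of x, the fitted line always passes through the point of sample means.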

[Figure: Mean Temp (F) Line Fit Plot. Horizontal axis: Mean Temp (F), 0 to 80. Vertical axis: Pints per person. Series: Pints per person and Predicted Pints per person.]

We can use the residuals to help us assess the fit of our regression line.

[Figure: Mean Temp (F) Residual Plot. Horizontal axis: Mean Temp (F), 0 to 80. Vertical axis: Residuals.]

The residual plot should look like a random scattering of the points about the line y = 0. If there is a pattern in the residual plot, then the line is not providing a good fit.
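Residuals are simply observed values minus fitted values, so they are easy to compute before plotting. The sketch below does this for the ice cream data from the table at the end of the handout. It uses the slope in the form SS_xy / SS_x, which is algebraically the same as r times s_y / s_x; the helper name `fit_line` is made up for illustration.

```python
from statistics import mean

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]


def fit_line(x, y):
    """Least-squares (intercept, slope), with slope = SS_xy / SS_x."""
    x_bar, y_bar = mean(x), mean(y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = ss_xy / ss_x
    return y_bar - slope * x_bar, slope


b0, b1 = fit_line(temps, pints)

# Residual = observed y minus fitted y, one per week.
residuals = [y - (b0 + b1 * x) for x, y in zip(temps, pints)]

# Week (1-based) whose observation lies farthest from the line vertically.
largest = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
print(f"sum of residuals: {sum(residuals):.2e}")
print(f"largest |residual| is in week {largest + 1}: {residuals[largest]:+.3f}")
```

The residuals of a least-squares line with an intercept always sum to zero (up to rounding), which is why the residual plot is centered on the line y = 0.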

Notice that in both the scatterplot and the residual plot of our example there seems to be one observation that is somewhat unusual compared to the rest. Observations like these are called outliers; they are points that lie outside the overall pattern of the other observations. Points that are outliers in the y direction have large residuals. Some outliers are very influential because removing them would markedly change the results (or the regression equation). Points that are outliers in the x direction are often influential for the least squares regression line. If you have a point that is an outlier, you may want to look at the results with and without that point.

5.4 Inference for Least Squares Regression

Recall that we assumed the model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ fits our data, where the $\varepsilon_i$ are independent and normally distributed with mean 0 and variance $\sigma^2$, and $E(Y_i) = \beta_0 + \beta_1 X_i = \mu_{Y|X}$. We call $\sigma$ the error of prediction; it is how far, on average, actual y-values will differ from the expected average $\mu_{Y|X}$. The estimated expected average is $\hat{\mu}_{Y|X} = \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$. Thus, $\sigma$ is estimated by

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}$$

Comment: $(y_i - \hat{y}_i)^2 = (\hat{\varepsilon}_i)^2 = (\text{residual})^2$.

Confidence Interval and Hypothesis Test for the Slope

A $(1-\alpha)100\%$ confidence interval for the slope $\beta_1$ is

$$\hat{\beta}_1 \pm t_{\alpha/2}\,SE(\hat{\beta}_1)$$

where $t_{\alpha/2}$ has $df = n - 2$ and

$$SE(\hat{\beta}_1) = \frac{s}{\sqrt{n-1}\;s_x}$$

To test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$ we use the test statistic

$$t_c = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$$

This tests for a significant linear relationship between X and Y. Note that this test statistic is the same as the test statistic for testing $H_0: \rho = 0$ versus $H_a: \rho \neq 0$.
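To see the slope inference on actual numbers, the sketch below computes s, the standard error of the slope, the test statistic, and a 95% confidence interval for the ice cream data from the table at the end of the handout. Two things are assumptions brought in from outside the handout: the critical value 2.048 (the usual t-table entry for 0.025 in the upper tail with 28 degrees of freedom) and the correlation form of the statistic, r·sqrt(n − 2)/sqrt(1 − r²), which is the standard statistic for testing ρ = 0.

```python
import math
from statistics import mean

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]

n = len(temps)
x_bar, y_bar = mean(temps), mean(pints)
ss_x = sum((x - x_bar) ** 2 for x in temps)
ss_y = sum((y - y_bar) ** 2 for y in pints)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, pints))

# Least-squares estimates.
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar

# s estimates sigma: square root of (sum of squared residuals) / (n - 2).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temps, pints))
s = math.sqrt(sse / (n - 2))

# SE of the slope: s / (sqrt(n - 1) * s_x), which equals s / sqrt(SS_x).
se_b1 = s / math.sqrt(ss_x)
t_c = b1 / se_b1

# Same statistic written in terms of the sample correlation r.
r = ss_xy / math.sqrt(ss_x * ss_y)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

T_CRIT = 2.048  # assumed t-table value: t_{0.025} with df = n - 2 = 28
ci_low, ci_high = b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1

print(f"s = {s:.4f}, SE(slope) = {se_b1:.5f}")
print(f"t_c = {t_c:.3f} (slope form), {t_from_r:.3f} (correlation form)")
print(f"95% CI for slope: ({ci_low:.5f}, {ci_high:.5f})")
```

The two versions of the test statistic should agree, which illustrates the note that testing the slope and testing ρ lead to the same conclusion.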

The coefficient of determination, $r^2$, measures the proportion of the variation in the response variable that is explained by the regression equation:

$$r^2 = \frac{\text{explained variation}}{\text{total variation}} = \frac{SSR}{SST}$$
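The ratio SSR/SST can be checked against the square of the sample correlation. The sketch below does so for the ice cream data from the table at the end of the handout, taking SSR as the sum of squared deviations of the fitted values from the mean of y and SST as the sum of squared deviations of the observed values from the mean of y. These are the usual definitions, assumed here because the handout does not spell them out.

```python
import math
from statistics import mean

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]

x_bar, y_bar = mean(temps), mean(pints)
ss_x = sum((x - x_bar) ** 2 for x in temps)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, pints))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in temps]

ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(pints, fitted))  # unexplained variation
sst = sum((y - y_bar) ** 2 for y in pints)              # total variation

r_squared = ssr / sst
r = ss_xy / math.sqrt(ss_x * sst)  # SS_y is the same quantity as SST

print(f"SSR / SST = {r_squared:.4f}")
print(f"r squared = {r ** 2:.4f}")
```

The two printed values should match, and SST should equal SSR plus SSE.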

Our example: Let’s determine both the prediction interval for a future response and the confidence interval for the mean response when temperature is 55.

Why is the prediction interval wider than the confidence interval?
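One way to explore this question numerically is sketched below for the ice cream data from the table at the end of the handout. The interval formulas are the standard textbook ones and are an assumption here, since they are not written out in this handout: the confidence interval for the mean response uses s·sqrt(1/n + (x* − x̄)²/SS_x), and the prediction interval for a single future response uses s·sqrt(1 + 1/n + (x* − x̄)²/SS_x). The critical value 2.048 is the usual t-table entry for a 95% interval with 28 degrees of freedom.

```python
import math
from statistics import mean

# Ice cream data from the table at the end of the handout (weeks 1 to 30).
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.470, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]

n = len(temps)
x_bar, y_bar = mean(temps), mean(pints)
ss_x = sum((x - x_bar) ** 2 for x in temps)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, pints))
b1 = ss_xy / ss_x
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(temps, pints))
s = math.sqrt(sse / (n - 2))

T_CRIT = 2.048  # assumed t-table value: t_{0.025} with df = 28
x_star = 55     # temperature of interest
y_hat = b0 + b1 * x_star

# Standard error for the MEAN response at x_star (uncertainty in the line).
se_mean = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / ss_x)
# Standard error for ONE future response: adds the scatter of an observation.
se_pred = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / ss_x)

ci = (y_hat - T_CRIT * se_mean, y_hat + T_CRIT * se_mean)
pi = (y_hat - T_CRIT * se_pred, y_hat + T_CRIT * se_pred)

print(f"predicted pints at {x_star} F: {y_hat:.4f}")
print(f"95% CI for mean response:   ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"95% PI for future response: ({pi[0]:.4f}, {pi[1]:.4f})")
```

The extra 1 under the square root is the whole difference: a single future week varies around the line by about s on top of the uncertainty in where the line itself is.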

Week  Pints per person  Mean Temp (F)    Week  Pints per person  Mean Temp (F)
 1    0.386             41               16    0.381             63
 2    0.374             56               17    0.47              72
 3    0.393             63               18    0.443             72
 4    0.425             68               19    0.386             67
 5    0.406             69               20    0.342             60
 6    0.344             65               21    0.319             44
 7    0.327             61               22    0.307             40
 8    0.288             47               23    0.284             32
 9    0.269             32               24    0.326             27
10    0.256             24               25    0.309             28
11    0.286             28               26    0.359             33
12    0.298             26               27    0.376             41
13    0.329             32               28    0.416             52
14    0.318             40               29    0.437             64
15    0.381             55               30    0.548             71

*Source: Koteswara Rao Kadiyala (1970). "Testing for the independence of regression disturbances." Econometrica, 38, 97-117. Appears in: A Handbook of Small Data Sets, D. J. Hand, et al., editors (1994). Chapman and Hall: London.

Homework 17 Due May 5

1. Use the data from 1993 Consumer Reports on New Cars

(http://www.nmt.edu/~lballou/93cars.xls)

(a) Is the average price of the car related to the average MPG?

(b) Is there a relationship between city and highway MPG?

Use scatterplots, calculate and interpret the correlation coefficient, and test to determine if there is a linear relationship. If there is a linear relationship, determine the least squares regression line. Look at the residual plots and determine if you think the model does a good job fitting the data.

2. The following is an illustration of the famous Moore's Law for computer chips.

X = Year (− 1900, for ease of computation)
Y = number of transistors (in 1000)

X        Y
71.50    2.
78.75    31
82.75    110
85.25    280
89.75    1200
93.25    3100
95.25    5500

(a) Make a scatterplot of the data. Is the growth linear?

(b) Let's try to fit the exponential growth model using a transformation:

if $Y = a_1 \exp(a_2 X)$ then

$\ln(Y) = \ln(a_1) + a_2 X$

Make a regression analysis of ln(Y) on X. Does this model do a good job fitting the data?

(c) Predict the number of transistors in the year 2005. Did this prediction come true?
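The log transformation in part (b) can be sketched in general terms as follows. To avoid giving away the homework, the example uses synthetic data generated from Y = 2·exp(0.5·X); these numbers are made up and are not the transistor counts. The function names are also hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares (intercept, slope) for y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = ss_xy / ss_x
    return y_bar - slope * x_bar, slope


def fit_exponential(x, y):
    """Fit Y = a1 * exp(a2 * X) by regressing ln(Y) on X; return (a1, a2)."""
    log_y = [math.log(v) for v in y]  # requires every y to be positive
    ln_a1, a2 = fit_line(x, log_y)
    return math.exp(ln_a1), a2


# Synthetic data (made up): exact exponential growth with a1 = 2, a2 = 0.5.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

a1, a2 = fit_exponential(xs, ys)
print(f"a1 = {a1:.4f}, a2 = {a2:.4f}")

# A prediction is made on the log scale and then transformed back.
x_new = 6
print(f"predicted Y at X = {x_new}: {a1 * math.exp(a2 * x_new):.3f}")
```

Because the synthetic data follow the model exactly, the fit recovers a1 = 2 and a2 = 0.5; with real data, the residual plot of ln(Y) against X is the place to judge whether the transformed model fits well.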