Linear Regression - Lecture Slides | STAT 301, Exams of Statistics

Material Type: Exam; Professor: Fischer; Class: Introduction to Statistical Methods; Subject: STATISTICS; University: University of Wisconsin - Madison; Term: Spring 2009;

Linear Regression
Lisa Chung
Biostatistics, UW-Madison
June 15th, 2009

Reference

Dr. Ismor Fischer’s Spring 2005 STAT 301 Lecture Notes

Introductory Statistics with R, Peter Dalgaard, 2002, Springer

Correlation

Correlation - Example

[Figure: scatter plot of the Old Faithful geyser eruption data. x-axis: waiting time to next eruption (min); y-axis: eruption time (min). corr ≈ 0.90]
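The correlation shown on the plot can be reproduced with a short sketch. It assumes the slide uses the `faithful` data frame that ships with base R (the same one passed to `lm()` later in these slides); the variable names `r_builtin` and `r_manual` are only illustrative.

```r
# Pearson correlation between waiting time and eruption duration,
# using the `faithful` data frame from base R (assumed to be the slide's data).
data(faithful)

r_builtin <- cor(faithful$waiting, faithful$eruptions)

# The same quantity from its definition: s_xy / (s_x * s_y)
r_manual <- cov(faithful$waiting, faithful$eruptions) /
  (sd(faithful$waiting) * sd(faithful$eruptions))

print(round(c(builtin = r_builtin, manual = r_manual), 4))
```

Both routes should agree, since `cor()` is defined as the covariance scaled by the two standard deviations.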

Simple Linear Regression

Simple Linear Regression - Parameter Estimation

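Model: $Y = \beta_0 + \beta_1 X + \epsilon$, $\epsilon \sim N(0, \sigma^2)$

$\beta_0, \beta_1$: regression coefficients

Estimate $\beta_0, \beta_1$ by the values that minimize the Error (or Residual) Sum of Squares

Slope: $b_1 = \hat{\beta}_1 = s_{xy} / s_x^2$

Intercept: $b_0 = \hat{\beta}_0 = \bar{y} - b_1 \bar{x}$

Then $\hat{Y} = b_0 + b_1 X$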
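As a rough check of the slope and intercept formulas, the sketch below computes them by hand and compares them with `lm()`. It assumes the built-in `faithful` data; the helper `rss()` is a hypothetical name introduced only for this illustration.

```r
data(faithful)
x <- faithful$waiting
y <- faithful$eruptions

b1 <- cov(x, y) / var(x)        # slope: s_xy / s_x^2
b0 <- mean(y) - b1 * mean(x)    # intercept: y-bar - b1 * x-bar

# Residual sum of squares for a candidate line a + b * x
rss <- function(a, b) sum((y - (a + b * x))^2)

fit <- lm(eruptions ~ waiting, data = faithful)
print(c(b0 = b0, b1 = b1))
print(coef(fit))

# Moving the slope away from b1 should only increase the RSS
print(c(at_estimate = rss(b0, b1), perturbed = rss(b0, b1 + 0.01)))
```

The hand-computed values should match `coef(fit)`, and any perturbed line should give a larger residual sum of squares.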

Simple Linear Regression - Example

[Figure: Old Faithful geyser eruption data with the fitted regression line. x-axis: waiting time to next eruption (min); y-axis: eruption time (min).]

eruption time = −1.87 + 0.0756 × (waiting time)

> model <- lm(eruptions ~ waiting, data = faithful)

Fitted model: eruption time = −1.87 + 0.0756 × (waiting time) + residual

> summary(model)

Coefficients:

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
waiting      0.075628   0.002219   34.09   <2e-16 ***

Residual standard error: 0.4965 on 270 degrees of freedom

Multiple R-Squared: 0.8115, Adjusted R-squared: 0.8108

F-statistic: 1162 on 1 and 270 DF, p-value: < 2.2e-16
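The fitted coefficients can also be pulled out of the model object and used for prediction. The following is a minimal sketch; the 80-minute waiting time is a hypothetical value chosen only for illustration, and `new_wait`, `pred` and `by_hand` are illustrative names.

```r
data(faithful)
model <- lm(eruptions ~ waiting, data = faithful)

# Coefficient table as a numeric matrix (Estimate, Std. Error, t value, p-value)
coefs <- coef(summary(model))
print(round(coefs[, c("Estimate", "Std. Error")], 6))

# Predicted eruption duration for a hypothetical 80-minute wait
new_wait <- data.frame(waiting = 80)
pred <- predict(model, newdata = new_wait)

# The same prediction by hand from b0 + b1 * x
by_hand <- coefs["(Intercept)", "Estimate"] + coefs["waiting", "Estimate"] * 80
print(c(predict = unname(pred), by_hand = by_hand))
```

With the estimates reported in the summary, the hand calculation is −1.874016 + 0.075628 × 80, or roughly 4.18 minutes.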

Coefficient of Determination

$R^2 = \frac{SS_{reg}}{SS_{Total}} = 1 - \frac{SS_{Err}}{SS_{Total}}$, $0 \le R^2 \le 1$

We can assess how well the model fits the data with $R^2$; that is, $R^2 \times 100\%$ of the total variation is due to the linear association between the variables, as determined by the least squares regression line.

$R^2$ can be computed in two ways:

by squaring the linear correlation coefficient

by explicitly calculating the ratio $SS_{reg} / SS_{Total}$ from the regression line.
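Both routes to the coefficient of determination can be checked numerically. This sketch assumes the `faithful` example from the earlier slides; the names `ss_total`, `ss_err` and `ss_reg` are illustrative.

```r
data(faithful)
model <- lm(eruptions ~ waiting, data = faithful)
y <- faithful$eruptions

ss_total <- sum((y - mean(y))^2)              # total variation
ss_err   <- sum(residuals(model)^2)           # unexplained (residual) variation
ss_reg   <- sum((fitted(model) - mean(y))^2)  # variation explained by the line

r2_ratio <- ss_reg / ss_total                 # explicit ratio
r2_resid <- 1 - ss_err / ss_total             # complement of the error share
r2_corr  <- cor(faithful$waiting, y)^2        # squared correlation coefficient

print(round(c(ratio = r2_ratio, resid = r2_resid, corr = r2_corr), 4))
```

For simple linear regression the three values coincide, and they should agree with the Multiple R-Squared of 0.8115 in the summary output.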

Diagnostics

Regression Diagnostics

Model Assumptions:

The model is correct: check the scatter plot and $R^2$

Errors $\epsilon_i$, i = 1, 2, ..., n, are independent of each other

Errors are normally distributed with mean 0 and equal variance $\sigma^2$: check the residual plot and the Q-Q plot (normal probability plot).

> par(mfrow = c(1,4))

> plot(model)

[Figure: the four diagnostic plots produced by plot(model). Panels: Residuals vs Fitted (residuals against fitted values); Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location (standardized residuals against fitted values); Residuals vs Leverage (standardized residuals against leverage, with Cook's distance).]
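The quantities behind the four diagnostic panels can also be inspected numerically. The sketch below uses standard base R extractors on the `faithful` fit; the cutoff of 2 for standardized residuals is a common rule of thumb assumed here, not something stated on the slides.

```r
data(faithful)
model <- lm(eruptions ~ waiting, data = faithful)

res     <- residuals(model)       # raw residuals (Residuals vs Fitted panel)
std_res <- rstandard(model)       # standardized residuals (Q-Q, Scale-Location)
lev     <- hatvalues(model)       # leverage (Residuals vs Leverage panel)
cook    <- cooks.distance(model)  # Cook's distance

# Least-squares residuals with an intercept sum to (numerically) zero
print(sum(res))

# Observations whose standardized residual exceeds 2 in absolute value
print(which(abs(std_res) > 2))

# Largest leverage and largest Cook's distance
print(c(max_leverage = max(lev), max_cook = max(cook)))
```

A useful sanity check is that the leverages sum to the number of fitted coefficients, which is 2 for this model.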

Other Possible Models

Dummy Variable

A dummy variable takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome.

Example: (STAT 998, Fall 2006, by Prof. Nordheim)

A group of bacterial pathogens is known to cause damage to soybeans. In mid-August, 25 different fields were scored for pathogen damage. The score for each field was a number from 1 to 10 (1: negligible damage ... 10: severe damage). The scoring was based on a visual examination from a small plane flying overhead. Each field had a weather station associated with it for obtaining climatological data. Also, the type of crop grown on each field the previous year was noted, since this might affect pathogen damage.

Rainfall: total precipitation in inches for the 30 days prior to the date of scoring.

Crop History: crop planted on each field during the previous growing season (soybeans = 1 and oats = 2).

[Figure: scatter plot of score against rainfall, with separate plotting symbols for soybean and oats fields.]

Variables in the data set: score, rainfall, crop

score ~ rainfall + crop

[Figure: score against rainfall for the soybean and oats fields under the model score ~ rainfall + crop.]

score ~ rainfall * crop

[Figure: score against rainfall for the soybean and oats fields under the model score ~ rainfall * crop.]
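The difference between the two formulas can be sketched on made-up data. In R, `rainfall + crop` fits a common slope with an intercept shift for the dummy variable, while `rainfall * crop` adds an interaction so the slope may differ by crop as well. The data below are simulated purely for illustration and are not the soybean pathogen data; every name and number in the block is an assumption.

```r
# Hypothetical data: simulated for illustration only,
# NOT the soybean pathogen data from the example.
set.seed(1)
n <- 25
rainfall <- runif(n, min = 2, max = 5)
crop <- factor(rep(c("soybean", "oats"), length.out = n),
               levels = c("soybean", "oats"))
score <- 1 + 1.2 * rainfall + 1.5 * (crop == "oats") + rnorm(n, sd = 0.8)
fields <- data.frame(score, rainfall, crop)

# R turns the factor into a 0/1 dummy column (cropoats) in the design matrix
design <- model.matrix(~ rainfall + crop, data = fields)
print(head(design))

# Additive model: one common slope, intercept shifted by the dummy
fit_add <- lm(score ~ rainfall + crop, data = fields)

# Interaction model: the dummy also changes the slope
fit_int <- lm(score ~ rainfall * crop, data = fields)

print(coef(fit_add))
print(coef(fit_int))
```

Setting soybean as the first factor level makes it the reference category, so the `cropoats` coefficient is read as the shift for oats fields relative to soybean fields.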

Logistic Regression

1. Model

Sometimes we wish to model binary outcomes, variables that can take only two possible values (e.g. die/survive, disease/no disease, etc.)

Let $p = \Pr(Y = 1 \mid x_1, \ldots, x_k)$

Use a linear model for the transformed probabilities:

$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$

$\text{logit}(p)$ is called the log odds, $\text{logit}(p) \in (-\infty, \infty)$

(other choice: use $\log(-\log(p))$)
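The logit transform and its inverse are short enough to write out directly. This is a small sketch; `logit` and `inv_logit` are illustrative helper names, while `qlogis()` and `plogis()` are the equivalent base R functions.

```r
# Logit (log odds) and its inverse, written out explicitly
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)
print(eta)              # log odds; 0 corresponds to p = 0.5
print(inv_logit(eta))   # maps the log odds back to probabilities

# Base R provides the same pair as qlogis() and plogis()
print(all.equal(eta, qlogis(p)))
print(all.equal(inv_logit(eta), plogis(eta)))
```

Because the logit maps (0, 1) onto the whole real line, a linear predictor of any value always corresponds to a valid probability.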

3. Fitting in R: use the class of generalized linear models.

1) On tabular data

[data] Concerning hypertension, from Altman (1991)

  smoking obesity snoring n.hypertense n.total
1      No      No      No            5      60
2     Yes      No      No            2      17
3      No     Yes      No            1       8
4     Yes     Yes      No            0       2
5      No      No     Yes           35     187
6     Yes      No     Yes           13      85
7      No     Yes     Yes           15      51
8     Yes     Yes     Yes            8      23

(* use of ’gl()’, generating levels)

> resp <- cbind(n.hypertense, n.total - n.hypertense)

> glm(resp ~ smoking + obesity + snoring,
+     family = binomial("logit"))
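The slide's commands rely on variables that are defined elsewhere. A self-contained sketch that rebuilds the table and fits the model might look like the following; the `gl()` calls are an assumption about how the factor columns were generated (the slide only hints at `gl()`), and the names `no_yes`, `hyp` and `fit` are illustrative.

```r
# Rebuild the 8-row table; gl() generates factor levels in a regular pattern
no_yes  <- c("No", "Yes")
smoking <- gl(2, 1, 8, labels = no_yes)   # No, Yes, No, Yes, ...
obesity <- gl(2, 2, 8, labels = no_yes)   # No, No, Yes, Yes, ...
snoring <- gl(2, 4, 8, labels = no_yes)   # four No, then four Yes
n.total      <- c(60, 17, 8, 2, 187, 85, 51, 23)
n.hypertense <- c(5, 2, 1, 0, 35, 13, 15, 8)
hyp <- data.frame(smoking, obesity, snoring, n.hypertense, n.total)
print(hyp)

# Two-column response: (hypertensive, not hypertensive) counts per row
resp <- cbind(n.hypertense, n.total - n.hypertense)
fit <- glm(resp ~ smoking + obesity + snoring, family = binomial("logit"))
print(summary(fit))
```

Passing a two-column matrix of successes and failures is how `glm()` accepts grouped binomial data, so each table row contributes its full count to the likelihood.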

Coefficients:

            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.37766    0.38018  -6.254    4e-10 ***
smokingYes  -0.06777    0.27812  -0.244   0.8075
obesityYes   0.69531    0.28509   2.439   0.0147 *
snoringYes   0.87194    0.39757   2.193   0.0283 *

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 14.1259 on 7 degrees of freedom

Residual deviance: 1.6184 on 4 degrees of freedom

AIC: 34.

Number of Fisher Scoring iterations: 4
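One way to read the coefficients is on the odds scale. The sketch below copies the printed estimates into a vector and exponentiates them; the chosen covariate pattern (an obese, snoring non-smoker) is only an illustrative example, and `beta`, `odds_ratio` and `p_hat` are illustrative names.

```r
# Estimates copied from the printed coefficient table
beta <- c("(Intercept)" = -2.37766, smokingYes = -0.06777,
          obesityYes = 0.69531, snoringYes = 0.87194)

# exp(coefficient) is the multiplicative change in the odds of hypertension
odds_ratio <- exp(beta[-1])
print(round(odds_ratio, 3))

# Fitted probability for an obese, snoring non-smoker
eta <- beta["(Intercept)"] + beta["obesityYes"] + beta["snoringYes"]
p_hat <- plogis(unname(eta))
print(round(p_hat, 3))

# Observed proportion in the corresponding table row (15 of 51)
print(round(15 / 51, 3))
```

Under this model, obesity and snoring each roughly double the odds of hypertension, while the smoking estimate is close to zero on the log-odds scale and not statistically significant.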