Pearson's Correlation Coefficient and Simple Linear Regression - Prof. Salvador Gezan, Study notes of Data Analysis & Statistical Methods

These notes cover Pearson's correlation coefficient and its use in simple linear regression: the concept of correlation, the formula for calculating Pearson's correlation coefficient, and its significance. They also discuss the assumptions and goals of simple linear regression, as well as the estimation and prediction methods. An example using seed weight and length data is provided.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009


Chapter 11

Simple Linear Regression

Correlation

- Correlation is a measure of the strength and direction of a linear relationship between two variables, here labeled $x$ and $y$.
- In general statistical usage, correlation refers to the amount of departure of the two variables from independence.
- The parameter $\rho_{yx}$ is the population correlation coefficient and is sometimes called Pearson's product-moment correlation coefficient.
- Sample correlation coefficient:

  $$ r_{yx} = \frac{SS_{xy}}{\sqrt{SS_{yy}\,SS_{xx}}} = \frac{n\sum y_i x_i - \sum y_i \sum x_i}{\sqrt{\left[n\sum y_i^2 - \left(\sum y_i\right)^2\right] \times \left[n\sum x_i^2 - \left(\sum x_i\right)^2\right]}} $$

- Note that high correlation does not necessarily imply causation.
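To see why the two forms of the sample correlation coefficient agree, a short derivation may help. This is only a sketch, and it assumes the usual shortcut definitions of the sums of squares (the same ones listed under Summary Calculations): each sum of squares carries a factor of $1/n$, and these factors cancel between numerator and denominator.

```latex
\begin{align*}
SS_{xy} &= \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}
         = \frac{1}{n}\Bigl[n\sum x_i y_i - \sum x_i \sum y_i\Bigr], \\
SS_{xx} &= \frac{1}{n}\Bigl[n\sum x_i^2 - \bigl(\textstyle\sum x_i\bigr)^2\Bigr], \qquad
SS_{yy}  = \frac{1}{n}\Bigl[n\sum y_i^2 - \bigl(\textstyle\sum y_i\bigr)^2\Bigr], \\
r_{yx}  &= \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}
         = \frac{\tfrac{1}{n}\bigl[n\sum x_i y_i - \sum x_i \sum y_i\bigr]}
                {\tfrac{1}{n}\sqrt{\bigl[n\sum x_i^2 - (\sum x_i)^2\bigr]
                                   \bigl[n\sum y_i^2 - (\sum y_i)^2\bigr]}} .
\end{align*}
```

Because $SS_{xx}$ and $SS_{yy}$ cannot be negative, $r_{yx}$ takes the sign of $SS_{xy}$.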

Testing Correlation

Simple Linear Regression

- Simplest linear regression model:

  $$ y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) $$

- $y$ is the response (or dependent) variable
- $x$ is the explanatory (or independent) variable
- $\beta_0$ and $\beta_1$ are the parameters to be estimated
- $\varepsilon$ is the random deviation ('error' or 'residual') caused by:
  - uncontrolled factors,
  - measurement errors,
  - missing variables in the model,
  - rounding of numbers, etc.

[Figure: three panels labeled $\beta_1 > 0$ (Positive Association), $\beta_1 < 0$ (Negative Association) and $\beta_1 = 0$ (No Association).]

- Regression allows us to estimate the most probable value of $y$ for a given value of $x$.
- Expressed statistically as $E(y \mid x)$ (read as "the expectation of $y$ given $x$").
- For the simple linear model we have:

  $$ E(y \mid x) = \beta_0 + \beta_1 x $$

- $y$ is assumed to follow a Normal distribution and, if we have no information on $x$, our best estimate of $y$ corresponds to the mean of $y$, $E(y)$, estimated by $\hat{y} = \bar{y}$.
- However, if we have some additional information from a correlated variable $x$, then we can improve our estimate, and we use $E(y \mid x) = \beta_0 + \beta_1 x$, for which the distributions are much narrower.

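Fitting the Model: LS

Summary Calculations

$$ SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} $$

$$ SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n} $$

$$ SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} $$

Parameter Estimates

$$ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} $$

$$ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} $$

$$ SSE = SS_{yy} - \frac{\left(SS_{xy}\right)^2}{SS_{xx}} $$

$$ s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2} $$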
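The parameter estimates can be motivated by a short derivation. The sketch below assumes that "LS" stands for ordinary least squares, that is, choosing $\beta_0$ and $\beta_1$ to minimize the sum of squared deviations; the symbol $Q$ is introduced here only for illustration and does not appear in the notes.

```latex
\begin{align*}
Q(\beta_0,\beta_1) &= \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2, \\
\frac{\partial Q}{\partial \beta_0} &= -2\sum \bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  \;\Longrightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \\
\frac{\partial Q}{\partial \beta_1} &= -2\sum x_i \bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  \;\Longrightarrow\; \sum x_i (y_i - \bar{y}) = \hat\beta_1 \sum x_i (x_i - \bar{x}), \\
\hat\beta_1 &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
             = \frac{SS_{xy}}{SS_{xx}} .
\end{align*}
```

The last step uses $\sum \bar{x}\,(y_i - \bar{y}) = 0$ and $\sum \bar{x}\,(x_i - \bar{x}) = 0$, so subtracting $\bar{x}$ inside each sum leaves it unchanged.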

Example

- Several morphological traits from 190 seeds obtained from a line of diploid wheat (Triticum monococcum) were measured automatically with a Single-Kernel Characterization System. The variables recorded were diameter, length, weight, moisture content and hardness of each seed.

[Figure: scatter plot of Weight (vertical axis) against Length (horizontal axis).]

Response ($y$)? Predictor ($x$)?

Model:

$$ \text{weight}_i = \beta_0 + \beta_1\,\text{length}_i + \varepsilon_i, \qquad \text{that is,} \qquad y_i = \beta_0 + \beta_1 x_i + \varepsilon_i $$

Summary statistics:

$$ n = 190, \quad p = 2, \quad k = 2 $$

$$ \sum x_i = 626.08, \quad \sum y_i = 5445, \quad \sum x_i^2 = 2082.30, \quad \sum y_i^2 = 163336, \quad \sum y_i x_i = 18273.03 $$

Calculations:

$$ SS_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 2082.30 - \frac{626.08^2}{190} = 19.268 $$

$$ SS_{yy} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} = 163336 - \frac{5445^2}{190} = 7293.763 $$

$$ SS_{xy} = \sum y_i x_i - \frac{\sum x_i \sum y_i}{n} = 18273.03 - \frac{626.08 \times 5445}{190} = 330.895 $$

$$ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{330.895}{19.268} = 17.173 $$

$$ \hat\beta_0 = \frac{1}{n}\left(\sum y_i - \hat\beta_1 \sum x_i\right) = \frac{1}{190}\left(5445 - 17.173 \times 626.08\right) = -27.931 $$
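As an illustration, the same sums of squares can be plugged into the sample correlation coefficient defined in the Correlation section. The arithmetic below is a sketch that uses the rounded values from the example, so the last digit may differ slightly from software output.

```latex
\begin{align*}
r_{yx} &= \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}
        = \frac{330.895}{\sqrt{19.268 \times 7293.763}}
        \approx \frac{330.895}{374.88}
        \approx 0.883 .
\end{align*}
```

The sign of $r_{yx}$ matches the sign of $\hat\beta_1$, since both share the numerator $SS_{xy}$.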

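SAS code to plot the data, fit the model, and check the residuals:

/* Scatter plot of the raw data */
proc gplot data=Seeds;
  plot weight*length;
run;

/* Fit the simple linear regression and save predicted values and studentized residuals */
proc reg data=Seeds;
  model weight = length;
  output out=resdata p=pred student=studres;
run;

/* Studentized residuals against predicted values, with a reference line at 0 */
proc gplot data=resdata;
  plot studres*pred / vref=0;
run;

/* Normality checks on the studentized residuals */
proc univariate data=resdata noprint;
  var studres;
  probplot studres / normal(mu=est sigma=est);
  histogram studres / normal;
run;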

[Figure: Fitted and observed relationship with 95% confidence limits; weight (vertical axis) against length (horizontal axis), with the fitted line $\hat{y}_i$ as a function of $x_i$.]
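The fitted line can be used to estimate $E(y \mid x)$ at a chosen length. The following is a minimal sketch, assuming the rounded estimates $\hat\beta_0 = -27.931$ and $\hat\beta_1 = 17.173$ from the example. The value $x = 3.5$ is a hypothetical length chosen only for illustration; the notes do not state the measurement units.

```latex
\begin{align*}
\hat{y} &= \hat\beta_0 + \hat\beta_1 x = -27.931 + 17.173\,x, \\
x = \bar{x} = \tfrac{626.08}{190} \approx 3.2952: \quad
\hat{y} &\approx -27.931 + 17.173 \times 3.2952 \approx 28.66, \\
x = 3.5: \quad
\hat{y} &\approx -27.931 + 17.173 \times 3.5 \approx 32.17 .
\end{align*}
```

At $x = \bar{x}$ the prediction equals $\bar{y} = 5445/190 \approx 28.66$ up to rounding, as expected from $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$: the fitted line passes through the point of means.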

Model Assumptions

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \qquad (i = 1 \ldots n) $$

Assumption 1: $E(\varepsilon_i) = 0$
The expected mean of the residuals, $\varepsilon_i$, is assumed to be zero.

Assumption 2: $Var(\varepsilon_i) = \sigma^2$
The variance of any residual is equal to a constant value common to all residuals (homoscedasticity/homogeneity of variances).

Assumption 3: $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$
The residuals are independent.

Assumption 4: $x_i$ are nonstochastic
The explanatory variable $x$ is measured without error.

Assumption 5: $Cov(\varepsilon_i, y_i) = 0$
Each response and its corresponding residual are independent of each other.

Assumption 6: $\varepsilon_i \sim N(0, \sigma^2)$
The residuals follow a Normal distribution with mean 0 and variance $\sigma^2$.

Making Inferences about the Slope

- The assumptions described earlier produce a normal sampling distribution for the slope estimate:

  $$ \hat\beta_1 \sim N\left(\beta_1, \sigma^2_{\hat\beta_1}\right), \qquad \sigma_{\hat\beta_1} = \frac{\sigma}{\sqrt{SS_{xx}}}, \qquad s_{\hat\beta_1} = \frac{s}{\sqrt{SS_{xx}}} $$

  where $s^2 = MSE = \dfrac{SSE}{n-2}$.

- And a $(1-\alpha)$ confidence interval on $\beta_1$ is:

  $$ \hat\beta_1 \pm t_{\alpha/2}\, s_{\hat\beta_1} = \hat\beta_1 \pm t_{\alpha/2}\, \frac{s}{\sqrt{SS_{xx}}} $$

  and $t_{\alpha/2}$ is based on $(n-2)$ degrees of freedom.

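- 2-Sided Test

  $H_0: \beta_1 = 0$
  $H_a: \beta_1 \neq 0$

  T.S.: $t_o = \dfrac{\hat\beta_1}{s_{\hat\beta_1}} = \dfrac{\hat\beta_1}{s / \sqrt{SS_{xx}}}$

  R.R.: $|t_o| \geq t_{\alpha/2,\, n-2}$

  p-value: $2\,P(t \geq |t_o|)$

- 1-sided Test

  $H_0: \beta_1 = 0$
  $H_a: \beta_1 > 0$ (or $\beta_1 < 0$)

  T.S.: $t_o = \dfrac{\hat\beta_1}{s_{\hat\beta_1}} = \dfrac{\hat\beta_1}{s / \sqrt{SS_{xx}}}$

  R.R.: $t_o \geq t_{\alpha,\, n-2}$ (or $t_o \leq -t_{\alpha,\, n-2}$)

  p-value: $P(t \geq t_o)$ (or $P(t \leq t_o)$)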
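To make the confidence interval and the test concrete, here is a worked sketch for the seed example. It assumes the rounded sums of squares from that example ($SS_{xx} = 19.268$, $SS_{xy} = 330.895$, $SS_{yy} = 7293.763$, $n = 190$), a level of $\alpha = 0.05$, a null value of zero for the slope, and an approximate tabled critical value $t_{0.025,\,188} \approx 1.97$. None of the intermediate results below appear in the original notes, so treat them as illustrative.

```latex
\begin{align*}
SSE &= SS_{yy} - \frac{SS_{xy}^2}{SS_{xx}}
     = 7293.763 - \frac{330.895^2}{19.268} \approx 1611.2, \\
s^2 &= \frac{SSE}{n-2} = \frac{1611.2}{188} \approx 8.57, \qquad s \approx 2.93, \\
s_{\hat\beta_1} &= \frac{s}{\sqrt{SS_{xx}}} \approx \frac{2.93}{\sqrt{19.268}} \approx 0.667, \\
\hat\beta_1 \pm t_{0.025,\,188}\, s_{\hat\beta_1}
    &\approx 17.173 \pm 1.97 \times 0.667 \approx (15.86,\ 18.49), \\
t_o &= \frac{\hat\beta_1}{s_{\hat\beta_1}} \approx \frac{17.173}{0.667} \approx 25.7 .
\end{align*}
```

Since $t_o \approx 25.7$ is far larger than $1.97$, the two-sided test would reject $H_0: \beta_1 = 0$ at the 5% level, which agrees with the interval excluding zero.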