








































Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Pearson's correlation coefficient and its uses in simple linear regression. It covers the concept of correlation, the formula for calculating pearson's correlation coefficient, and its significance. The document also discusses the assumptions and goals of simple linear regression, as well as the estimation and prediction methods. An example using seed weight and length data is provided.
Typology: Study notes
1 / 48
This page cannot be seen from the preview
Don't miss anything!









































-^ Correlation
is a^
measure
of
the
strength and
direction of
a^
linear
relationship
between
two
variables,
here
labeled
x^
and
y.
-^ In
general
statistical
usage,
correlation
refers
to
the
amount
of
departure
of
the
two
variables
from
independence.
-^ The
parameter
ρyx
is the
population
correlation
coefficient
and
is
sometimes
called
Pearson’s
product
‐moment
correlation
coefficient
.
-^ Sample
correlation
coefficient:
xy
i i
i i^
SS
n y x
y x
r^
−
yy xx
y
i
i
i
i
yx
SS SS
n y y n x x
r^
=
−
×
−
=
2
2
2
2
-^ Note
that
high
correlation
does
not
necessarily
imply
causation.
Correlation
Testing
Correlation
Simple
Linear
Regression
-^ Simplest
linear
regression
model:
2
β β
response
(or dependent) variable
2
1 0
σ
ε ε
β β^
-^ y
is^
response
(or
dependent)
variable
-^ x
is^
explanatory
(or
independent)
variable
-^ β
and 0
β^1
are
the
parameters
to
be
estimated
-^ ε
is random deviation (‘error’ or ‘residual’)
-^ ε
is^
random
deviation
( error
or
residual )
caused
by: ‐^ uncontrolled
factors,
β^1
> 0
⇒
Positive Association
‐^ measurement
errors,
‐^ missing
variables
in
the
model,
rounding of numbers etc
β^1
>^ 0
⇒
Positive
Association
β^1
<^ 0
⇒
Negative
Association
β^1
=^ 0
⇒
No
Association
‐^ rounding
of
numbers
,^ etc
.
β^1
Simple
Linear
Regression
●^ Regression allows
us
to
estimate
the
most
probable
value
of
y^ for
a^ given
value
of^
x. ●^ Expressed statistically as:
(read as
“the expectation of y given x”
)
) | (^
x y E
●^ Expressed
statistically
as:
(read
as
the
expectation
of
y^ given
x^
)
●^ For
the
simple
linear
model
we
have:
●^ y
is^
assumed
to
follow
a^ Normal
distribution
and,
if^ we
have
no
information
on
x
x y E^
1 0 ) | (
β β^
=
) | (^
x y E
x ,^ our
best
estimate
of
y^ corresponds
to
the
mean
of
y ,^
or^
.
y y^
= ˆ=
●^ However,
if^ we
have
some
additional
information from a
correlated variable
information
from
a^ correlated
variable
x ,^ then
we
can
improve
our
estimate,
and
we
use: E^
) | (
β β
for
which
the
distributions
are
much
narrower
x
x y E^
1 0 ) | (
β β^
=
narrower
.
Summary Calculations
Fitting
the
Model:
LS
Summary
Calculations
∑ ∑
−
=^
2
)
)( (
) (
SS
x x
SS
xx
∑ ∑
∑
SS
(^
)^
n x
x
xx^
2
∑
∑
∑ ∑
−
=
−
−
=
(^2) )
(
)
)( (
y y
SS
y y x x
SS
xy^ yy
∑ ∑
∑
n y x
xy
xy^
(^
)^
n y
y
yy^
2
∑
∑
Parameter
Estimates
2
xx
xy
yy^
x x
y y x x
xy xx
2
^^1
β^
∑ ∑
x
y^
^^1
^^0
β
β^
y y n ∑^
)
(^
2 ^
MSE
SSEn
n
y y
s^
i
i i
= − =
− −
=
∑=
2
2
)
( 1
2
Example
morphological
traits
from
190
seeds
obtained
from
a^
line
of
diploid
wheat
Triticum monococcum
were
measured
automatically
with
a^
Single
‐Kernel
Characterization
System.
The
variables
recorded
g^
y
were
diameter
,^ length
,^ weight
,^ moisture
content
and
hardness
of
each
seed.
50 45 40 35 30 Weight
Response (
y )?
25 20
Response
( y
)?
Predictor
( x
)?
275 250 15
300
400 375 350 (^325) Length
50
Example
50 45 40 35
i i
i^
length
weight
ε
β β^
=^
1 0
i i
i^
x
y
ε β β^
=^
1 0
35 30 25 20 Weight
(^190) = n^
(^2) = p
(^08). 626 ∑^
= x i
5445 ∑^
= i k^ = 2 y
275 20
250 15
300
400 375 350 (^325) Length
(^30). 2082 2 ∑^
= i x^
163336 2 ∑^
= i y^03. 18273
∑^
= i yxi
268 19 190 08 626 30 2082
2
2
2
.
/ .
.
n )x ( x
SS
i
i
xx^
=
−
=
−
=^ ∑
∑
763 7293 190 5445
163336
2
2
2
.
/
n )y ( y
SS
i
i
yy^
=
−
=
−
=^ ∑
∑
(^
)^
895 330 (^190) / 5445 08 626 03 18273
=
×
−
=
−
=^ ∑
∑^
∑^
n y x yx
SS
i i ii
(^
)^
(^895). 330 (^190) / 5445 (^08). 626 (^03). 18273
=
×
=
=^ ∑
∑^
∑^
n y x yx
SS
i i ii
xy
173 17 268 19
895 330
.
..
SSSS ˆ
xy^ xx
=
= =β 1
(^
)^
(^
)^
(^931). 27 (^08). 626 (^173). 17 5445 (^1190)
ˆ
1 ˆ^
−=
×
−
= β^ −^1
=^
∑^
∑^
i i^
x y n β^0
Example
proc
gplot
data=Seeds;
plot
weight*length; run;proc
reg
data=Seeds; model
weight
=^
length;
output
out=resdata
p=pred
student=studres;
run;proc
gplot
data=resdata;
plot
studres*pred/vref
0;
plot
studres*pred/vref=
0;
run;proc
univariate
data=resdata
noprint;
var
studres; probplot
studres
/normal(mu=est
sigma=est);
p^
p^
/^
(^
g^
);
histogram
stures
/normal;
run;
Example
Fitted and observed relationship with 95% confidence limits 50
i
i^
x
y^
45 4040 35 ght
30 25 weig
20 15
2.^
length
Model
Assumptions
Assumption 1
0 ) (^
= i E
ε
i i
i^
x
y
ε
β β^
=^
1 0
( i^ = 1…
n )
p The expected mean of the residuals,
ε, is assumed to be zero. i^
Assumption 2
The variance of any residual is equal to a constant value common to all residuals
) (^ i
2 ) (
σ ε^
= i Var
y^
q
(homoscedasticity/homogeneity of variances). Assumption 3
The residuals are independent.
0 ) , (^
= i i Cov
ε ε
p
Assumption 4
x^ are nonstochastic i^
The explanatory variable
x^ is measured without error.
Assumption 5
Each response and its corresponding residual are independent of each other. Assumption 6
ε~ i^
N (0,
(^2) σ 0 ) ) , (^
= i yi Cov
ε
p^
i^
( ,
)
The residuals follow a Normal distribution with mean 0 and variance
(^2) σ .
Making
Inferences
about
the
Slope
The
assumptions
described
earlier
produce
a^
normal
sampling
distribution for
the
slope
1 1
1
β σ β
β
SSE 2
1 1
ˆ ˆ^
β β σ
A d
fid
I t
l
i
M
SE
SSEn s
where
= − =^
2
2
And
a^
‐α
confidence
Interval
on
is: s
t
s t^
(^2) /
^^1
(^2) /
^^1
^
α
α
β
β^
±
and
t α
/^ is^
based
on
( n
degrees
of
freedom.
xx SS (^2) / 1
(^2) / 1
1
α
β α
β
β
α/
Making
Inferences
about
the
Slope
-^
2 ‐
Sided
Test
-^
1 ‐
sided
Test
H a
H a
(or
^^1
^^1 o
s s t S T^
β
β
^^1
^^1 o
s s t S T^
value
, (^2) / ˆ^1
n
o
xx t t P
p
t t R R
s s
− β α
value
, ˆ^1
n
o
xx t t P
p
t t R R
s s
− β α
value
to t P
p^
value
to t P
p^