Regression methods
Linear regression
Y = m X + b
A linear relationship is assumed to exist
between two factors.
This was already discussed in an earlier unit.
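A minimal sketch of fitting Y = m X + b by least squares. The data values are made up for illustration and are not from the course.

```python
import numpy as np

# Hypothetical calibration data (assumed values): y is exactly 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Closed-form least-squares estimates of the slope m and intercept b.
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
# prints: slope m = 2.000, intercept b = 1.000
```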
Regression methods
Multiple linear regression
Y = m1 X1 + m2 X2 + ... + mn Xn + e
This is a linear regression fit that is extended to
several variables.
It is useful when several factors contribute to the
overall observed response.
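A short sketch of a multiple linear regression fit with NumPy. The simulated data, the variable names and the added intercept column are assumptions for illustration; the equation above has no explicit intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Simulated data (assumption): three factors contribute to one response.
X = rng.normal(size=(n, 3))
true_m = np.array([1.5, -2.0, 0.5])
true_b = 4.0
y = X @ true_m + true_b + 0.01 * rng.normal(size=n)   # last term plays the role of e

# Append a column of ones so the intercept is estimated with the slopes.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
m_hat, b_hat = coef[:3], coef[3]

print("estimated slopes:", np.round(m_hat, 3))
print("estimated intercept:", round(float(b_hat), 3))
```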
Multivariate calibration
Typically, a multivariate method implies that
you have multiple X (independent) and
multiple Y (dependent) variables.
We will outline three multivariate
approaches to creating a calibration curve.
Ordinary Least Squares (OLS)
Principal component regression (PCR)
Partial least squares regression (PLS)
While each optimizes the fit of your data
differently, method evaluation, optimization
and the results are often the same.
OLS
With traditional linear and multiple linear regression,
we’re limited to a single Y (dependent) variable.
OLS (also called a general linear model - GLM) can
be seen as an extension of this approach. You have
a Y matrix instead of a Y vector.
Mathematically, the matrix formulations for MLR and
OLS (GLM) are identical - except for allowing for
a Y matrix. Basically a ‘combination’ of MLR and
‘simultaneous equations’.
XLStat will handle either approach - based on the
number of Y variables you give it.
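A small sketch of the Y-matrix idea, using simulated data (an assumption). It fits all Y columns in one least-squares call and checks that the result matches fitting each Y column separately.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 4, 2            # observations, X variables, Y variables (assumed sizes)

X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p, q))
Y = X @ B_true + 0.05 * rng.normal(size=(n, q))

# Design matrix with an intercept column.
A = np.column_stack([np.ones(n), X])

# One call with a Y matrix: each column of B_matrix belongs to one Y variable.
B_matrix, *_ = np.linalg.lstsq(A, Y, rcond=None)

# The same fit done one Y vector at a time.
B_columns = np.column_stack(
    [np.linalg.lstsq(A, Y[:, j], rcond=None)[0] for j in range(q)]
)

same = np.allclose(B_matrix, B_columns)
print("matrix fit equals column-by-column fit:", same)
```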
OLS
One limit with OLS is that you need more
observations than X variables and more X
variables than Y variables.
Results can be erratic if you have variables
that are:
Collinear (ones with a high degree of
linear correlation.)
Invariate (ones that don’t vary much.)
You can try to remove all invariate variables and all
but one collinear variable (in a block) and hope for the
best (XLStat will do this).
For these reasons, OLS is not as commonly used
for multi-Y type problems (compared to PCR
and PLS).
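A rough sketch of screening out invariate and collinear variables before an OLS fit. This is not XLStat's procedure; the function name `screen_columns`, the tolerances and the pairwise-correlation test are assumptions for illustration. A pairwise check will not catch a variable that is a combination of several others.

```python
import numpy as np

def screen_columns(X, var_tol=1e-8, corr_tol=0.999):
    """Return indices of columns kept after dropping invariate and collinear ones."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() <= var_tol:
            continue                      # invariate: (almost) no variation
        collinear = False
        for k in keep:
            r = np.corrcoef(col, X[:, k])[0, 1]
            if abs(r) >= corr_tol:
                collinear = True          # nearly a linear copy of a kept column
                break
        if not collinear:
            keep.append(j)
    return keep

rng = np.random.default_rng(3)
a = rng.normal(size=20)
b = rng.normal(size=20)
# Column 1 is collinear with column 0; column 2 is constant.
X = np.column_stack([a, 2 * a + 1, np.full(20, 7.0), b])

kept = screen_columns(X)
print("columns kept:", kept)
```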
Principal component regression
This is a simple extension of OLS.
It is assumed that each member of your set can
be assigned a quantitative class value.
First, generate a PCA model for your data.
Using the PCA scores, conduct a multiple
linear regression where your Y values are the
quantitative class values.
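The two steps above can be sketched as follows. The data are simulated and the function name `pcr_fit` is hypothetical. As a sanity check, keeping every component should reproduce the ordinary least-squares fit.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR: PCA on centered X, then least squares of y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Rows of Vt are the principal component directions (loadings).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T          # shape (n_vars, k)
    scores = Xc @ loadings                  # shape (n_samples, k)
    # Multiple linear regression of y on the PCA scores.
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)
    coef = loadings @ gamma                 # expressed in the original X variables
    intercept = y_mean - x_mean @ coef
    return coef, intercept

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -0.5, 2.0, 0.0, 0.3]) + 0.1 * rng.normal(size=30)

coef_2, b_2 = pcr_fit(X, y, n_components=2)   # reduced model
coef_5, b_5 = pcr_fit(X, y, n_components=5)   # all components kept

# Reference OLS fit with an intercept column.
A = np.column_stack([X, np.ones(len(y))])
ols, *_ = np.linalg.lstsq(A, y, rcond=None)
print("PCR with all components equals OLS:", np.allclose(coef_5, ols[:5]))
```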
Principal component regression
[Figure: flow diagram. Raw or scaled data -> PCA -> PCA scores plus residual (noise); PCA scores -> multiple linear regression (OLS) -> slopes and intercepts.]
Principal component regression
Advantages of PCR over OLR
- Noise remains in the residual.
- Fewer variables to work with.
- Obtain PCA information as well.
- You can use just the components that relate
to the trend of interest.
Limits of PCR
- It assumes that your data array is valid for
predicting Y values.
It must contain no errors beyond noise.
The first PC(s) may or may not actually be
related to any of the Y components.
Partial least squares regression
PLS modeling relies on a simultaneous fit
of both an independent and dependent
matrix.
The objective is to derive latent variables
that are similar to principal components.
The major difference is the attempt to
minimize the unexplained variance of both arrays.
Called PLS1 for a Y vector and PLS2
where there is a Y matrix.
Partial least squares regression
With PLS, the goal is to extract the latent variables by
using the X array to properly ‘align’ the Y array (or vector)
and then reversing the process.
[Figure: PLS diagram of the X and Y arrays with the vectors labeled w, t, q and u.]
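A compact sketch of PLS1 (single Y vector) in the NIPALS style, using simulated data. The function name `pls1_fit` is hypothetical and this is one common formulation, not necessarily the one XLStat uses. Here w is the X weight vector, t the X scores and q the Y loading; with a single Y column the Y score u is just y itself, so it does not appear.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 regression coefficients from a NIPALS-style loop."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk = X - x_mean
    yc = y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yc                    # direction in X most covariant with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                       # X scores (latent variable)
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        q.append((yc @ t) / tt)          # Y loading
        Xk = Xk - np.outer(t, p)         # deflate X before the next component
        W.append(w)
        P.append(p)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -1.0, 0.5, 2.0, 0.2]) + 0.1 * rng.normal(size=30)

coef_1, b_1 = pls1_fit(X, y, n_components=1)
coef_all, b_all = pls1_fit(X, y, n_components=5)

# With every component kept, PLS1 should reproduce the OLS solution.
A = np.column_stack([X, np.ones(len(y))])
ols, *_ = np.linalg.lstsq(A, y, rcond=None)
print("PLS1 with all components equals OLS:", np.allclose(coef_all, ols[:5], atol=1e-6))
```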
Partial least squares regression
Each factor to be determined should end up
with a different PC set.
It may require a different number of
components to adequately model each
quantitative variable.
The approach ensures that the ‘best fit’ is obtained
for all variables - which is both good and bad.
Considered the best approach when the number of
variables is high and correlated variables are
likely (basically the opposite of OLS).
Validation
All methods require validation.
Again, cross validation is one of the best
approaches (leave-one-out method).
It permits you to determine a prediction
error sum of squares (PRESS) or the
root-mean-square prediction error
(RMSPE).
Tracking the PRESS value will tell you the
optimum number of components to use.
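A sketch of leave-one-out cross validation used to track PRESS against the number of components. PCR is used as the model here only because it is short to write; the simulated two-factor data and the helper names `pcr_coef` and `loo_press` are assumptions.

```python
import numpy as np

def pcr_coef(X, y, k):
    """PCR coefficients and intercept using the first k components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T
    g, *_ = np.linalg.lstsq(Xc @ P, y - y_mean, rcond=None)
    coef = P @ g
    return coef, y_mean - x_mean @ coef

def loo_press(X, y, k):
    """Prediction error sum of squares from leave-one-out cross validation."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                 # leave observation i out
        coef, b0 = pcr_coef(X[mask], y[mask], k)
        pred = X[i] @ coef + b0                  # predict the held-out sample
        press += (y[i] - pred) ** 2
    return press

# Simulated data (assumption): two underlying factors drive 8 X variables.
rng = np.random.default_rng(5)
T = rng.normal(size=(25, 2)) * np.array([3.0, 1.0])
L = np.array([[1.0] * 8, [1.0, -1.0] * 4])
X = T @ L + 0.05 * rng.normal(size=(25, 8))
y = 0.5 * T[:, 0] + 2.0 * T[:, 1] + 0.05 * rng.normal(size=25)

press_by_k = {k: loo_press(X, y, k) for k in range(1, 7)}
rmspe_by_k = {k: np.sqrt(v / len(y)) for k, v in press_by_k.items()}
best_k = min(press_by_k, key=press_by_k.get)

for k in press_by_k:
    print(f"k={k}  PRESS={press_by_k[k]:.4f}  RMSPE={rmspe_by_k[k]:.4f}")
print("number of components with the lowest PRESS:", best_k)
```

In practice the PRESS curve often flattens rather than showing a sharp minimum, so the smallest number of components near the minimum is a common choice.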
PLS model quality
R²Y cum index.
Sum of the coefficients of determination (R²)
between the dependent variables and the first ‘h’
components for the dependent variables.
R²X cum index.
Sum of the coefficients of determination (R²)
between the independent variables and the first ‘h’
components for the independent variables.
These are similar to the Q² cum (h) index - but only
for one of the ‘blocks’ of data.
Note: other programs will either a) call these
different things or b) use different measures.
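A sketch of why summing R² values over components makes sense: when the component scores are mutually orthogonal, the summed squared correlations equal the R² of a regression on those components. PCA scores are used here for brevity (NIPALS PLS scores are orthogonal as well). The data are simulated, and averaging over the X variables for the X-block index is an assumption, not a documented XLStat formula.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, h = 30, 6, 3                       # samples, X variables, components (assumed)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.2 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Orthogonal scores of the first h components.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :h] * s[:h]

# Y block: sum of squared correlations between y and each component.
r2_each = np.array([np.corrcoef(yc, T[:, k])[0, 1] ** 2 for k in range(h)])
r2y_cum = r2_each.sum()

# Cross-check against the R² of regressing y on the h scores together.
g, *_ = np.linalg.lstsq(T, yc, rcond=None)
resid = yc - T @ g
r2_regression = 1.0 - (resid @ resid) / (yc @ yc)

# X block: same sum for every X variable, then averaged (assumed aggregation).
r2x_per_var = np.array(
    [[np.corrcoef(Xc[:, j], T[:, k])[0, 1] ** 2 for k in range(h)] for j in range(p)]
).sum(axis=1)
r2x_cum = r2x_per_var.mean()

print(f"R2Y cum = {r2y_cum:.4f}, regression R2 = {r2_regression:.4f}")
print(f"R2X cum = {r2x_cum:.4f}")
```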
Octane number
Rating Octane of Gasoline using Near IR.
- ASTM method is complex and expensive.
A simple method would be more desirable.
Experimental
- Unleaded gasoline samples were assayed
by the ASTM method.
- NIR spectra (900-1600 nm) were obtained.
- OLR, PCR and PLS models were studied.
- X matrix - spectra at 20 nm intervals.
- ASTM octane number by ‘Research method’
was used as the Y matrix (vector).
Octane Number
NIR spectra
A 915 nm, CH2 stretch
B 1021 nm, CH2/CH combination band
C 1151 nm, aromatic and CH3 stretch
D 1194 nm, CH3 stretch
E 1394 nm, CH combination bands
F 1412 nm, aromatic & CH2 combination bands
G 1435 nm, aromatic & CH2 combination bands
[Figure: NIR spectra with peaks A to G marked.]
Octane Number - OLS
High R² value, and the
RMSE and PRESS
RMSE are similar.
Octane Number
Only 9 variables ended up being used in
building the OLS model.
Octane Number, OLS
[Figure: OLS model plot; axis text not recoverable.]
OLS residual.
[Figure: OLS residual plot; axis text not recoverable.]
Octane Number, PCR
PCR vs. OLS
Note that you get a small improvement in the fit with PCR. There might be a
problem with outliers.
Also, you have a larger number of degrees of freedom since all of the
original variables were used. With OLS, most were discarded.
Octane Number, PCR
[Figure: variance captured by each principal component; axis text not recoverable.]
Almost 90% of the
variance is captured in
the first component.
Over 99% in the first
three.
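A sketch of how the variance captured by each component can be computed from the singular values of the centered data. The data are simulated with one dominant direction (an assumption); the octane data themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated spectra-like data (assumption): one strong underlying factor plus noise.
base = rng.normal(size=(40, 1))
X = 3.0 * base @ rng.normal(size=(1, 10)) + 0.3 * rng.normal(size=(40, 10))

Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first

explained = s ** 2 / np.sum(s ** 2)          # fraction of variance per component
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches 99%.
n_for_99 = int(np.searchsorted(cumulative, 0.99) + 1)

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{k}: {100 * e:6.2f}%   cumulative {100 * c:6.2f}%")
print("components needed for 99%:", n_for_99)
```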
Scores plot - potential outliers
[Figure: scores plot; point labels not recoverable.]
Octane Number, PCR
[Figures: two PCR model plots; axis text not recoverable.]
Octane Number, PLS
[Figure: model quality by number of components (index vs. components).]
OLS
Due to the limited number of samples,
there are too many X points/sample.
By creating replicate copies of the samples
(in triplicate), an initial OLS was conducted
to determine which X points would have
been automatically eliminated.
This left 8 X points/sample.
OLS
Model Quality
Summary.
[Figures: OLS results for Si, Mn, Ni, Cr, Mo, Ti and Fe. For each element, one plot of Pred(% element) against % element and one plot of standardized residuals against % element.]
PCR gives higher R² values, but look at the PRESS RMSE.
OLS results
Fe
[Figures: two Fe plots; axis text not recoverable.]
PLS
[Figure: model quality by number of components (index vs. components).]
The Q² pattern indicates that most of the information is brought out in the first
few components but then noise is brought out to finally find a way to fit the
‘non-spectral’ species.
PLS
PLS gives an even better fit - but not a
huge improvement compared to PCR.
[Figures: PLS plots for Si, Mn, Ni, Cr, Mo and Ti; axis text not recoverable.]
Mo and Ti: fantastic fits, considering that there is NO data to support them.
Fe Results
[Figures: two Fe plots; axis text not recoverable.]
Summary
Although we were able to develop models
that appeared to be able to predict the
amounts of all seven species, there is actually
only ‘real’ information about four of them.
The PCR and PLS models will produce a ‘fit’
regardless of noise, lack of a positive
response, …
Care must be taken to ensure that your data
set contains real information about all of the
components.
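A sketch of the point made in this summary: a model can ‘fit’ a response that has no real relationship to the X data. Here y is pure noise (simulated, an assumption). A least-squares fit on all variables stands in for a PCR or PLS model that keeps every component, since those coincide with OLS in that case. The calibration R² can look respectable while the leave-one-out Q² shows there is little or no predictive ability.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Least-squares fit with intercept, then predict for X_new."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

rng = np.random.default_rng(8)
n, p = 15, 12
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                 # no real information about y in X

ss_total = np.sum((y - y.mean()) ** 2)

# Calibration fit: how well the model reproduces the samples it was built on.
fitted = fit_predict(X, y, X)
r2_fit = 1.0 - np.sum((y - fitted) ** 2) / ss_total

# Leave-one-out predictions: each sample is predicted by a model that never saw it.
press = 0.0
for i in range(n):
    mask = np.arange(n) != i
    pred = fit_predict(X[mask], y[mask], X[i:i + 1])[0]
    press += (y[i] - pred) ** 2
q2_loo = 1.0 - press / ss_total

print(f"calibration R2 = {r2_fit:.3f}")
print(f"leave-one-out Q2 = {q2_loo:.3f}")
```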
One last example
30 different hydrocarbon blends were
assayed by UV/Vis-NIR.
Each blend had known levels of isooctane,
toluene and decane but also contained
other hydrocarbons at unknown levels.
Each blend was measured using two
different UV/Vis-NIR instruments.
Because we have more X variables than
samples, we can’t use OLS.
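A sketch of why OLS breaks down when there are more X variables than samples: the coefficients are no longer unique. The sizes and the random data are assumptions for illustration. Two different coefficient vectors reproduce the same training responses exactly.

```python
import numpy as np

rng = np.random.default_rng(9)
n_samples, n_vars = 30, 100            # assumed sizes, with n_vars > n_samples
X = rng.normal(size=(n_samples, n_vars))
y = rng.normal(size=n_samples)

# The rank of X cannot exceed the number of samples, so X'X is singular.
rank = np.linalg.matrix_rank(X)

# lstsq returns the minimum-norm solution, which fits y exactly here.
coef_a, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any vector in the null space of X can be added without changing the fit.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
null_vec = Vt[-1]                      # X @ null_vec is numerically zero
coef_b = coef_a + 10.0 * null_vec

print("rank of X:", rank, "of", n_vars, "variables")
print("both fit the data:", np.allclose(X @ coef_a, y), np.allclose(X @ coef_b, y))
print("coefficients differ:", not np.allclose(coef_a, coef_b))
```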
The data
[Figure: the spectra; x-axis ticks run from 470 to 1090 in steps of 20.]
[Figures: PLS model plots and model quality by number of components; axis text not recoverable.]
PLS has a harder time with this dataset, basically
because of the variations due to having two
instruments.
PLS insists on using that as a latent variable.
[Figure: plot of first and second latent variables; point labels not recoverable.]
[Figures: further model plots; axis text not recoverable.]
An outlier analysis indicates that over half of the
observations are considered outliers. Not a good
model. PCR offers a better choice this time.
Continuum Regression (CR)
A recent attempt to unify PCR, PLS and OLR into a
single technique.
It is a continuously adjustable technique that uses
PLS as its base and includes PCR and OLR at the
opposite ends of the continuum.
PCR: CR parameter = ∞
PLS: CR parameter = 1
MLR: CR parameter = 0
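A one-component sketch of how a single parameter can move between MLR, PLS and PCR. The slides do not say which formulation of continuum regression is meant; this assumes the ‘continuum power’ variant, in which the singular values of X are raised to a power before a PLS step. The function name `cpr_one_component_fit` and the simulated data are hypothetical, and a large finite power stands in for ∞.

```python
import numpy as np

def cpr_one_component_fit(X, y, power):
    """Fitted y from a one-component continuum power regression (assumed variant)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    s_scaled = (s / s[0]) ** power       # normalized to avoid overflow at large powers
    Xg = (U * s_scaled) @ Vt             # X with its singular values raised to 'power'
    w = Xg.T @ yc                        # PLS-style weight vector on the modified X
    t = Xg @ w                           # the single latent variable
    return t * (t @ yc) / (t @ t) + y.mean()

rng = np.random.default_rng(10)
# Columns on different scales so the singular values are well separated.
X = rng.normal(size=(30, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=30)

Xc, yc = X - X.mean(axis=0), y - y.mean()

# Reference fits computed directly.
A = np.column_stack([np.ones(len(y)), X])
mlr_fit = A @ np.linalg.lstsq(A, y, rcond=None)[0]          # MLR / OLS
t_pls = Xc @ (Xc.T @ yc)
pls_fit = t_pls * (t_pls @ yc) / (t_pls @ t_pls) + y.mean() # PLS1, one component
u1 = np.linalg.svd(Xc, full_matrices=False)[0][:, 0]
pcr_fit = u1 * (u1 @ yc) + y.mean()                         # PCR, one component

print("power 0  matches MLR:", np.allclose(cpr_one_component_fit(X, y, 0.0), mlr_fit))
print("power 1  matches PLS:", np.allclose(cpr_one_component_fit(X, y, 1.0), pls_fit))
print("power 50 matches PCR:", np.allclose(cpr_one_component_fit(X, y, 50.0), pcr_fit))
```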
Continuum regression