Regression Methods: Linear, Multiple Linear, OLS, PCA, and PLS, Study notes of Advanced Physics

These notes cover various regression methods, including linear regression, multiple linear regression, ordinary least squares (OLS), principal component regression (PCR), and partial least squares regression (PLS). They compare the optimization and results of each method and discuss their advantages and limitations. They also include examples using octane number data.

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

gerrard_11
Regression methods
Linear regression
Y = m X + b
A linear relationship is assumed to exist
between two factors.
This was already discussed in an earlier unit.
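As a quick refresher, the slope m and intercept b of Y = m X + b can be estimated by least squares. The sketch below uses made-up calibration numbers (not data from these notes) chosen so that the answer is easy to check by hand.

```python
import numpy as np

# Hypothetical calibration data (illustration only): exactly y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Least-squares estimates of the slope m and intercept b.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(m, b)
```

With real, noisy data the same two formulas apply; the fit simply no longer passes through every point.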
Regression methods
Multiple linear regression
Y = m1 X1 + m2 X2 + ... + mn Xn + e
This is a linear regression fit that is extended to
several variables.
It is useful when several factors contribute to the
overall observed response.
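A minimal sketch of a multiple linear regression fit, assuming NumPy and made-up data. The column of ones that gives the model an intercept is an addition for illustration; the equation in the notes shows only the slopes and the error term e.

```python
import numpy as np

# Hypothetical data: 6 observations of 2 factors (illustration only).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
# Response built exactly as y = 1.5*x1 - 2.0*x2 + 0.5, so the fit is checkable.
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5

# Append a column of ones so the last coefficient acts as the intercept.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
m1, m2, intercept = coef

print(m1, m2, intercept)
```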
Multivariate calibration
Typically, a multivariate method implies that
you have multiple X (independent) and
multiple Y (dependent) variables.
We will outline three multivariate
approaches to creating a calibration curve.
Ordinary Least Squares (OLS)
Principal component regression (PCR)
Partial least squares regression (PLS)
While each optimizes the fit of your data
differently, method evaluation, optimization
and the results are often the same.
OLS
With traditional linear and multiple linear regression,
we’re limited to a single Y (dependent) variable.
OLS (also called a general linear model - GLM) can
be seen as an extension of this approach. You have
a Y matrix instead of a Y vector.
Mathematically, the matrix formulations for MLR and
OLS (GLM) are identical - except for allowing for
a Y matrix. Basically a ‘combination’ of MLR and
‘simultaneous equations’.
XLStat will handle either approach - based on the
number of Y variables you give it.
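The point that a Y matrix is handled by the same matrix formulation can be checked numerically. The sketch below, on synthetic data, fits all Y columns at once and then one column at a time and compares the coefficients. The variable names and the random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 12 observations, 3 X variables, 2 Y variables.
X = rng.normal(size=(12, 3))
A = np.column_stack([X, np.ones(len(X))])          # add an intercept column
B_true = np.array([[1.0, 0.0],
                   [0.5, -1.0],
                   [-2.0, 3.0],
                   [0.1, 0.2]])                    # 4 coefficients x 2 responses
Y = A @ B_true + 0.01 * rng.normal(size=(12, 2))

# One least-squares solve with a Y matrix ...
B_joint, *_ = np.linalg.lstsq(A, Y, rcond=None)

# ... versus a separate solve for each Y column.
B_cols = np.column_stack(
    [np.linalg.lstsq(A, Y[:, j], rcond=None)[0] for j in range(Y.shape[1])]
)

print(np.allclose(B_joint, B_cols))
```

The two sets of coefficients agree, which is why software can switch between the two cases based only on how many Y variables it is given.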
OLS
One limit with OLS is that you need more
observations than X variables and more X
variables than Y variables.
Results can be erratic if you have variables
that are:
Collinear (ones with a high degree of
linear correlation.)
Invariate (ones that don’t vary much.)
You can try to remove all invariate variables and all
but one collinear variable (in a block) and hope for the
best (XLStat will do this.)
For these reasons, OLS is not as commonly used
for multi-Y type problems (compared to PCR
and PLS).
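A rough sketch of the kind of screening described above: drop near-constant (invariate) columns, then keep only the first column of any highly correlated pair. This is not XLStat's procedure; the function name `screen_columns` and both tolerances are assumptions, and pairwise correlation will not catch every form of collinearity.

```python
import numpy as np

def screen_columns(X, var_tol=1e-8, corr_tol=0.999):
    """Return indices of columns kept after dropping near-constant columns
    and any column highly correlated with one that is already kept."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.var() <= var_tol:
            continue  # invariate: carries almost no information
        collinear = any(
            abs(np.corrcoef(col, X[:, k])[0, 1]) >= corr_tol for k in keep
        )
        if not collinear:
            keep.append(j)
    return keep

# Hypothetical X block (illustration only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1 + 1.0                      # perfectly collinear with x1
x3 = np.full(5, 7.0)                     # invariate
x4 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
X = np.column_stack([x1, x2, x3, x4])

kept = screen_columns(X)
print(kept)  # [0, 3]
```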
Principal component regression

This is a simple extension of OLS.

It is assumed that each member of your set can

be assigned a quantitative class value.

First, generate a PCA model for your data.

Using the PCA scores, conduct a multiple

linear regression where your Y values are the

quantitative class values.
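The two steps above (PCA first, then a regression on the scores) can be sketched as follows. This is a minimal NumPy version on synthetic data, not the procedure of any particular package; `pcr_fit` and `pcr_predict` are hypothetical names, and the "quantitative class values" of the notes are the vector `y`.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal component regression: PCA on centered X, then least squares on the scores."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # PCA via SVD: rows of Vt are the principal directions (loadings).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T              # loadings, variables x components
    T = Xc @ P                           # PCA scores
    g, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)  # regression on scores
    coef = P @ g                         # back to the original variables
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def pcr_predict(X, coef, intercept):
    return np.asarray(X, dtype=float) @ coef + intercept

# Synthetic check: with all components kept, PCR reproduces an exact linear y.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 2.0
coef, intercept = pcr_fit(X, y, n_components=5)
print(np.allclose(pcr_predict(X, coef, intercept), y))
```

Using fewer components than variables is where PCR differs from a plain regression: the discarded components stay in the residual.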

Principal component regression

[Figure: PCR workflow. Raw or scaled data go through PCA, giving PCA scores plus a residual (noise); the PCA scores go through multiple linear regression (OLS), giving slopes and intercepts.]

Principal component regression

Advantages of PCR over OLR

  • Noise remains in the residual.
  • Fewer variables to work with.
  • Obtain PCA information as well.
  • You can use just the components that relate to the trend of interest.

Limits of PCR

  • It assumes that your data array is valid for

predicting Y values.

It must contain no errors beyond noise.

The first PC(s) may or may not actually be related to any of the Y components.

Partial least squares regression

PLS modeling relies on a simultaneous fit

of both an independent and dependent

matrix.

The objective is to derive latent variables

that are similar to principal components.

The major difference is the attempt to minimize the unexplained (residual) variance of both arrays.

Called PLS1 for a Y vector and PLS2 where there is a Y matrix.

Partial least squares regression

With PLS, the goal is to extract the latent variables by

using the X array to properly ‘align’ the Y array (or vector)

and then reversing the process.

[Figure: PLS schematic linking the X and Y arrays; labels shown are w, t, u and q.]

Partial least squares regression

Each factor to be determined should end up

with a different PC set.

It may require a different number of

components to adequately model each

quantitative variable.

The approach ensures that the ‘best fit’ is obtained for all variables - which is both good and bad.

Considered the best approach when the number of variables is high and correlated variables are likely (basically the opposite of OLS).
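For a single Y vector (PLS1), the latent variables can be extracted with the classic NIPALS-style deflation loop. The sketch below is a minimal NumPy version on synthetic data; `pls1_fit` is a hypothetical name, and packages such as XLStat may scale the data or organize the computation differently.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by successive deflation; returns coefficients for the original X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                    # weights: direction in X most aligned with y
        w /= np.linalg.norm(w)
        t = Xk @ w                       # X scores (the latent variable)
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        qa = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - qa * t                 # deflate y
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

# Synthetic data (illustration only): 10 samples, 3 X variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=10)

rss = []
for a in (1, 2, 3):
    coef, b0 = pls1_fit(X, y, a)
    rss.append(float(np.sum((y - (X @ coef + b0)) ** 2)))
print(rss)  # residual sum of squares for 1, 2 and 3 latent variables
```

The residual sum of squares on the calibration data can only go down as latent variables are added, and with as many latent variables as X variables the result here coincides with ordinary least squares. That is why the number of components has to be chosen by validation rather than by the calibration fit.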

Validation

All methods require validation.

Again, cross validation is one of the best approaches (leave-one-out method).

It permits you to determine a prediction error sum of squares (PRESS) or the root-mean-square value of prediction error (RMSPE).

Tracking the PRESS value will tell you the

optimum number of components to use.
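A minimal sketch of leave-one-out cross validation, here applied to PCR on synthetic data: each sample is left out in turn, the model is rebuilt, and the squared prediction errors are summed into PRESS. The helper names and the data are assumptions for illustration, and RMSPE is taken as the square root of PRESS divided by the number of samples.

```python
import numpy as np

def pcr_coef(X, y, k):
    """PCR coefficients and intercept using the first k principal components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T
    g, *_ = np.linalg.lstsq(Xc @ P, y - y_mean, rcond=None)
    coef = P @ g
    return coef, y_mean - x_mean @ coef

def loo_press(X, y, k):
    """Leave-one-out prediction error sum of squares for a k-component model."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i                     # leave sample i out
        coef, b0 = pcr_coef(X[mask], y[mask], k)
        press += (y[i] - (X[i] @ coef + b0)) ** 2    # error on the left-out sample
    return press

# Synthetic data: two latent factors drive both X and y.
rng = np.random.default_rng(1)
n, p = 20, 6
scores = rng.normal(size=(n, 2))
X = scores @ rng.normal(size=(2, p)) + 0.05 * rng.normal(size=(n, p))
y = scores @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=n)

press = {k: loo_press(X, y, k) for k in range(1, p + 1)}
rmspe = {k: np.sqrt(v / n) for k, v in press.items()}
best_k = min(press, key=press.get)
print(best_k, rmspe[best_k])
```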

PLS model quality

R²Y cum index.

Sum of the coefficients of determination (R²) between the dependent variables and the first ‘h’ components for the dependent variables.

R²X cum index.

Sum of the coefficients of determination (R²) between the independent variables and the first ‘h’ components for the independent variables.

These are similar to the Q² cum(h) index - but only for one of the ‘blocks’ of data.

Note: other programs will either a) call these

different things or b) use different measures.

Octane number

Rating Octane of Gasoline using Near IR.

  • ASTM method is complex and expensive.

A simple method would be more desirable.

Experimental

  • Unleaded gasoline samples were assayed

by the ASTM method.

  • NIR spectra (900-1600 nm) were obtained.
  • OLR, PCR and PLS models were studied.
  • X matrix - spectra at 20 nm intervals.
  • ASTM octane number by ‘Research method’

was used as the Y matrix (vector).

Octane Number

NIR spectra

A 915 nm, CH2 stretch
B 1021 nm, CH2/CH combination band
C 1151 nm, aromatic and CH3 stretch
D 1194 nm, CH3 stretch
E 1394 nm, CH combination bands
F 1412 nm, aromatic & CH2 combination bands
G 1435 nm, aromatic & CH2 combination bands

[Figure: NIR spectra with bands A to G marked]

Octane Number - OLS

High R² value, and the RMSE and PRESS RMSE are similar.

Octane Number

Only 9 variables ended up being used in

building the OLS model.

Octane Number, OLS

[Figure: OLS fit for octane number]

OLS residual.

[Figure: OLS residual plot]

Octane Number, PCR

[Figure: PCR and OLS fits compared]

Note that you get a small improvement in the fit with PCR. There might be a problem with outliers.

Also, you have a larger number of degrees of freedom since all of the original variables were used. With OLS, most were discarded.

Octane Number, PCR

[Figure: variance captured by each principal component]

Almost 90% of the

variance is captured in

the first component.

Over 99% in the first

three.
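Percentages like these come from the singular values of the centered data matrix. The sketch below shows the calculation on synthetic data; the numbers it prints are not the octane results quoted above.

```python
import numpy as np

# Synthetic data: one dominant factor plus a little noise (illustration only).
rng = np.random.default_rng(2)
factor = rng.normal(size=(15, 1))
X = factor @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(15, 6))

Xc = X - X.mean(axis=0)                      # center each variable
s = np.linalg.svd(Xc, compute_uv=False)      # singular values, largest first
explained = s ** 2 / np.sum(s ** 2)          # fraction of variance per component
cumulative = np.cumsum(explained)            # running total over components

print(np.round(100 * cumulative, 1))
```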

Scores plot - potential outliers

[Figure: PCA scores plot with potential outliers]

Octane Number, PCR

[Figure: PCR fit and residual plots for octane number]

Octane Number, PLS

[Figure: PLS model quality indices against the number of components]

OLS

Due to the limited number of samples,

there are too many X points/sample.

By creating replicate copies of the samples

(in triplicate), an initial OLS was conducted

to determine which X points would have

been automatically eliminated.

This left 8 X points/sample.

OLS

Model Quality

Summary.

[Figures: for each of Si, Mn, Ni, Cr, Mo, Ti and Fe, a plot of Pred(% element) against % element and a plot of standardized residuals against % element.]

PCR gives higher R² values, but look at the PRESS RMSE.

OLS results

Fe

[Figure: fit and residual plots for Fe]

PLS

[Figure: PLS model quality indices against the number of components]

The Q² pattern indicates that most of the information is brought out in the first few components, but then noise is brought out to finally find a way to fit the ‘non-spectral’ species.

PLS

PLS gives an even better fit - but not a

huge improvement compared to PCR.

[Figures: PLS fit and residual plots for Si, Mn, Ni, Cr, Mo and Ti]

Mo and Ti: fantastic fits, considering that there is NO data to support them.

Fe Results

[Figure: PLS fit and residual plots for Fe]

Summary

Although we were able to develop models

that appeared to be able to predict the

amounts of all seven species, there is actually

only ‘real’ information about four of them.

The PCR and PLS models will produce a ‘fit’ regardless of noise, lack of a positive response, ...

Care must be taken to ensure that your data

set contains real information about all of the

components.
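The warning above can be illustrated with a deliberately meaningless data set. In the sketch below, Y is pure random noise with no relation to X. For simplicity it uses an ordinary least-squares fit as a stand-in for PCR or PLS (an assumption of this sketch, not the models used in the notes), then compares the calibration R² with a leave-one-out value computed from PRESS.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 12
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                     # pure noise: no real information about y in X
A = np.column_stack([X, np.ones(n)])       # add an intercept column

def ols_predict(A_train, y_train, A_new):
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_new @ coef

ss_tot = np.sum((y - y.mean()) ** 2)

# Calibration fit: residuals on the same samples used to build the model.
rss = np.sum((y - ols_predict(A, y, A)) ** 2)
r2 = 1.0 - rss / ss_tot

# Leave-one-out: each sample is predicted by a model that never saw it.
press = 0.0
for i in range(n):
    mask = np.arange(n) != i
    pred = ols_predict(A[mask], y[mask], A[i:i + 1])[0]
    press += (y[i] - pred) ** 2
q2 = 1.0 - press / ss_tot

print(round(r2, 3), round(q2, 3))
```

The calibration R² can look respectable even though nothing real is being modeled, while the cross-validated value is always lower and is usually poor in a case like this. A gap of that kind is a hint that the data set may not contain real information about the component.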

One last example

30 different hydrocarbon blends were

assayed by UV/Vis-NIR.

Each blend had known levels of isooctane,

toluene and decane but also contained

other hydrocarbons at unknown levels.

Each blend was measured using two

different UV/Vis-NIR instruments.

Because we have more X variables than

samples, we can’t use OLS.
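Why OLS breaks down here can be seen from the rank of the data matrix. The sketch below uses hypothetical sizes (10 samples, 25 X variables), not the dimensions of the hydrocarbon data set: the centered X cannot have rank above the number of samples minus one, so XᵀX is singular and the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_vars = 10, 25                 # hypothetical sizes: more variables than samples
X = rng.normal(size=(n_samples, n_vars))
y = rng.normal(size=n_samples)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

rank = np.linalg.matrix_rank(Xc)
xtx_rank = np.linalg.matrix_rank(Xc.T @ Xc)
print(rank, xtx_rank, n_vars)              # both ranks fall short of n_vars

# lstsq still returns an answer (the minimum-norm one), and it fits the
# calibration samples exactly even though y here is random noise.
coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
print(np.allclose(Xc @ coef, yc))
```

PCR and PLS avoid the problem by regressing on a small number of components instead of on all of the original variables.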

The data

[Figure: the spectral data, with axis ticks running from 470 to 1090 in steps of 20]

PLS has a harder time with this dataset, basically because of the variations due to having two instruments.

PLS insists on using that as a latent variable.

[Figure: plot of first and second latent variables.]

An outlier analysis indicates that over half of the

observations are considered outliers. Not a good

model. PCR offers a better choice this time.

Continuum Regression (CR)

A recent attempt to unify PCR, PLS and OLR into a

single technique.

It is a continuously adjustable technique that uses

PLS as its base and includes PCR and OLR at the

opposite ends of the continuum.

PCR: CR parameter = ∞

PLS: CR parameter = 1

MLR: CR parameter = 0