Multivariate Regression and Principal Component Regression/Analysis: An Introduction

An introductory tutorial on multivariate regression, a statistical technique for relating several dependent variables to several independent variables. The tutorial lists examples ranging from chemical reactors and economic data to student test scores, and explains how to find a set of regression coefficients that minimizes the error between the measured and the predicted dependent variables. It also warns against extrapolation and notes that sample size and representative sampling matter, although it does not treat those topics.

Uploaded on 07/30/2009

pca2.mcd
Multivariate Regression, leading up to Principal Component Regression/Analysis
-- an introductory tutorial to some of the most important ideas in multivariate regression.
Instructor: Nam Sun Wang
Multivariate Regression.
Let us expand the number of independent variables and dependent variables. Here, we are given a
set of data consisting of a series of m+1 independent variables x<0>, x<1>, ..., x<m>, and l+1
dependent variables y<0>, y<1>, ..., y<l>. An example is how the quality, thickness, and strength of a
paper product (Y) depend on water content, source of fiber, digestion temperature, pH, etc. (X).
Another example is how the yield and composition in a chemical reactor (Y) depend on stirrer speed,
feed flow rate, reactant concentrations, ... (X). The chemical composition (Y) measured with a
chemical sensor may be related to the response of an array sensor (X). The mechanical or chemical
property of a material (Y) may depend on its color spectrum (X). An economic example may be how
the stock price and trading volume (Y) depend on the prevailing interest rate, the company's earnings,
the quarter in the calendar, ... (X). The gross national product (Y) may depend on a country's
population, literacy rate, average age, level of rainfall, ... (X). The probability of death, and thus the
premium of a life insurance policy, may depend on the many attributes of the insured. The salary
and popularity of a football player (Y) may depend on his height, weight, running speed, strength,
running yards gained, passing yards gained, number of touchdowns, number of fumbles, hours of
practice per day, ... (X). The standardized test scores or the grade point average of a student (Y)
may depend on the number of hours spent in school, amount of daily TV time, the household
income, gender, the time of the day the test is taken, and maybe even the number of whip lashes
received since one's birth or the average number of glasses of milk one consumes daily (X).
Furthermore, a student's standardized test scores and grade point average may be closely
correlated. The examples are endless.
What we include as an independent variable need not actually affect the dependent variables in any
way. It is not necessarily a reflection of what we believe to affect the process. If we so desire, we
can throw in everything that may remotely affect the dependent variables. One thing regression tells
us is whether there is indeed any correlation between the two. A word of caution: existence of a
correlation does not imply the existence of an actual connection or the existence of a direct
cause-effect relationship. It is often true that "look and thou shall find." To judge whether a particular
degree of correlation is significant, we need to resort to tools from probability, hypothesis testing,
metrics, reliability, controlled experimentation, etc. In addition, we need to worry about a lot of other
things: including representative samples, using an adequate sample size, defining the domain of validity
to avoid extrapolation, and detecting and rectifying outliers and gross errors -- none of which will be
addressed in this worksheet.

Let us start with some independent data x and some dependent data y.

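Raw data:

$$x^{\langle 0\rangle}_i,\ x^{\langle 1\rangle}_i,\ \ldots,\ x^{\langle m\rangle}_i \quad\text{and}\quad y^{\langle 0\rangle}_i,\ y^{\langle 1\rangle}_i,\ \ldots,\ y^{\langle l\rangle}_i, \qquad i = 0, \ldots, n$$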

Combine these data into an independent matrix X and a dependent matrix Y.

Raw data: X and Y

As before, our task is to find a set of regression coefficients a such that the given model function

f(X,a) passes through the given data (X,Y) as closely as possible. That is to say that we try to

minimize some sort of error between Y and f(X,a).

$$\min_{a}\ \text{error} = Y - f(X, a)$$

There are many measures of this error. One of them is the absolute error, which is mathematically cumbersome to work with because the absolute value function is not differentiable at zero.

$$\min_{a}\ |\text{error}| = |Y - f(X, a)|$$

As before, we try to minimize the sum of squared errors, which is mathematically more tractable.

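$$\min_{a}\ \text{sse} = \sum_{i=0}^{n} \text{error}_i^{2} = \sum_{i=0}^{n} \sum_{k=0}^{l} \left( y_{i,k} - f(X_i, a)_k \right)^{2} = E^{T} E = \left( Y - f(X, a) \right)^{T} \left( Y - f(X, a) \right)$$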
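As a rough illustration of the sum of squared errors, the sketch below computes it for a small set of made-up numbers (the arrays `Y` and `Y_model` are hypothetical, not data from the worksheet). It also checks a detail worth keeping in mind: with more than one dependent variable, E^T E is a matrix, and the scalar sse is its trace.

```python
import numpy as np

def sse(Y, Y_model):
    """Sum of squared errors over all samples i and all dependent variables k."""
    E = np.asarray(Y, dtype=float) - np.asarray(Y_model, dtype=float)
    return float(np.sum(E ** 2))

# Hypothetical toy data: 3 samples (rows), 2 dependent variables (columns).
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y_model = np.array([[1.5, 2.0], [3.0, 3.0], [4.0, 6.0]])

E = Y - Y_model
total = sse(Y, Y_model)

# For a multi-column E, E^T E is a 2x2 matrix; its trace is the scalar sse.
trace_form = float(np.trace(E.T @ E))
```

Here `Y_model` stands in for f(X, a); any model function could supply it.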

Multivariate Linear Regression (MLR).

The simplest model is a linear one where the X matrix is simply the plain given set of independent variables x<0>, x<1>, ..., x<m>. We can also have functions and combinations of x<•> in X (e.g., auto- or cross-terms of two independent variables such as x<0>⋅x<0> and x<0>⋅x<1>; or functions of one or several independent variables such as x<0>/x<1>, sin(x<0>), x<0>⋅exp(x<1>), etc.). At any rate, the linear combination of these terms is expressed as:

$$Y = X \cdot a + E$$
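The point that the model stays linear in the coefficients even when the columns of X are nonlinear functions of the raw variables can be sketched as follows. The function name `design_matrix`, the particular choice of columns, and the coefficient values are illustrative assumptions, not part of the worksheet.

```python
import numpy as np

def design_matrix(x0, x1):
    """Stack basis functions of the raw variables as the columns of X."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([
        x0,                # plain term
        x1,                # plain term
        x0 * x0,           # auto-term
        x0 * x1,           # cross-term
        np.sin(x0),        # function of one variable
        x0 * np.exp(x1),   # function of two variables
    ])

X = design_matrix([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])

# Hypothetical coefficients: only x0, x1 and the cross-term contribute.
a = np.array([1.0, 3.0, 0.0, 1.0, 0.0, 0.0])

# The prediction is a plain matrix product: linear in a, nonlinear in x0 and x1.
Y_model = X @ a
```

Because the unknowns enter only through the product X·a, the same least-squares machinery applies no matter which functions fill the columns.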

Example -- Multivariate Y. If there is more than one dependent variable, we perform regression analysis no differently.

$$y(x) := x \cdot [\,\text{coefficient matrix; entries not recoverable from the source}\,]$$

$$Y_{i,1} := X_{i,0} + X_{i,1} + (\mathrm{rnd}(0.1) - 0.05)$$

$$Y := y(X) + (\mathrm{rnd}(0.1) - 0.05)$$

↑ Toggled off because the noise is not added correctly with this formula.

$$a := (X^{T} X)^{-1} X^{T} Y$$

a = [values not recoverable from the source] ← Linear regression has captured the underlying structure. Compare it to the matrix in y(x).

$$y_{\mathrm{regress}}(x) := x \cdot a$$

Examples:

$$y_{\mathrm{regress}}\big((5 \;\; 0.5 \;\; 0.05)\big) = (6.507 \;\; 5.486) \qquad y\big((5 \;\; 0.5 \;\; 0.05)\big) = (6.5 \;\; 5.5)$$

The following arguments are out of the calibration range. Extrapolation is dangerous.

$$y_{\mathrm{regress}}\big((1 \;\; 1 \;\; 1)\big) = (3.894 \;\; 1.996) \qquad y\big((1 \;\; 1 \;\; 1)\big) = (4 \;\; 2)$$
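A minimal sketch of this example in Python follows. Several things are assumptions rather than facts from the worksheet: the calibration ranges of X, the number of samples, the random seed, and the true coefficient matrix `A_true` (chosen so that the noise-free model gives (6.5, 5.5) at (5, 0.5, 0.05) and (4, 2) at (1, 1, 1), matching the values quoted above). The normal equation is solved with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Assumed calibration ranges: x0 in [-5, 5], x1 in [-0.5, 0.5], x2 in [-0.05, 0.05].
scale = np.array([10.0, 1.0, 0.1])
X = (rng.random((n, 3)) - 0.5) * scale

# Assumed true structure: column 0 is x0 + 3*x1, column 1 is x0 + x1.
A_true = np.array([[1.0, 1.0],
                   [3.0, 1.0],
                   [0.0, 0.0]])
noise = rng.random((n, 2)) * 0.1 - 0.05   # uniform in [-0.05, 0.05]
Y = X @ A_true + noise

# Normal equation a = (X^T X)^-1 X^T Y; both columns of Y are handled at once.
a = np.linalg.solve(X.T @ X, X.T @ Y)

def y_regress(x):
    """Predict all dependent variables for one row vector x."""
    return np.asarray(x, dtype=float) @ a

pred = y_regress([5.0, 0.5, 0.05])
```

Each column of `a` is the same solution one would get by regressing that column of Y on X separately, which is why a multivariate Y needs no special treatment.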

Example -- Linear Combinations of Nonlinear Basis Functions.

Generate the dependent variable which varies nonlinearly with the first two independent variables

plus a small amount of noise.

$$Y := 0$$ ... reset the variable Y for a new assignment.

$$Y_i := X_{i,0} + 3\,X_{i,1} + X_{i,0} \cdot X_{i,1} + 10000\,(X_{i,2})^{2} + (\mathrm{rnd}(0.1) - 0.05)$$

$$y(x) := x_0 + 3\,x_1 + x_0 \cdot x_1 + 10000\,x_2^{2}$$

The least squares solution that ignores the auto- and cross-terms is again given by the same normal equation.

$$a := (X^{T} X)^{-1} X^{T} Y$$

a = [values not recoverable from the source] ← As expected, with just the plain X terms, we fail to recover the underlying structural relationship between X and Y.

The regression equation is:

$$y_{\mathrm{regress}}(x) := x \cdot a$$

Examples:

$$y_{\mathrm{regress}}\big((5 \;\; 0.5 \;\; 0.05)\big) = 9.774 \qquad y\big((5 \;\; 0.5 \;\; 0.05)\big) = 34$$

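Goodness of fit:

$$\mathrm{sse}_{\mathrm{old}} := Y \cdot Y \qquad \mathrm{sse}_{\mathrm{old}} = \ldots \cdot 10^{3}$$

$$\mathrm{sse} := \left( Y - y_{\mathrm{regress}}(X) \right) \cdot \left( Y - y_{\mathrm{regress}}(X) \right) \qquad \mathrm{sse} = 5.318 \cdot 10^{3}$$

$$r2 := \frac{\mathrm{sse}_{\mathrm{old}} - \mathrm{sse}}{\mathrm{sse}_{\mathrm{old}}} \qquad r2 = 8.019\,\%$$

$$r := \sqrt{r2} \qquad r = 28.318\,\%$$

← Not too much variation in Y has been captured.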
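The goodness-of-fit quantities can be sketched as a small helper. The function name `goodness_of_fit` and the toy numbers are hypothetical; the definitions follow the worksheet, where sse_old is the raw (uncentered) sum of squares of Y.

```python
import numpy as np

def goodness_of_fit(Y, Y_fit):
    """Return (sse_old, sse, r2, r) for a single dependent variable."""
    Y = np.asarray(Y, dtype=float)
    Y_fit = np.asarray(Y_fit, dtype=float)
    sse_old = float(Y @ Y)                   # variation in Y before regression
    resid = Y - Y_fit
    sse = float(resid @ resid)               # variation left after regression
    r2 = (sse_old - sse) / sse_old           # fraction of variation captured
    return sse_old, sse, r2, float(np.sqrt(r2))

# Hypothetical toy values, chosen so the arithmetic is easy to follow.
sse_old, sse, r2, r = goodness_of_fit([3.0, 4.0], [3.0, 3.0])

# Consistency check on the worksheet's own numbers: r is the square root of r2.
r_from_worksheet_r2 = float(np.sqrt(0.08019))
```

For a least-squares fit, sse never exceeds sse_old, so r2 lies between 0 and 1; for an arbitrary `Y_fit` it can go negative, in which case the square root is undefined.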

Repeat regression by including the cross terms. Below we expand the independent variable X to

include the cross terms. The resulting expanded matrix is Xx.

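$$Xx := X$$

$$Xx_{i,3} := X_{i,0} \cdot X_{i,0} \qquad Xx_{i,4} := X_{i,1} \cdot X_{i,1} \qquad Xx_{i,5} := X_{i,2} \cdot X_{i,2}$$

$$Xx_{i,6} := X_{i,0} \cdot X_{i,1} \qquad Xx_{i,7} := X_{i,0} \cdot X_{i,2} \qquad Xx_{i,8} := X_{i,1} \cdot X_{i,2}$$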

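The least squares solution, this time including the auto- and cross-terms, is again given by the same normal equation.

$$a := (Xx^{T} Xx)^{-1} Xx^{T} Y$$

$$a^{T} = \left( 0.998 \;\; 2.967 \;\; 0.036 \;\; 3.329 \cdot 10^{-4} \;\; 0.098 \;\; 1 \cdot 10^{4} \;\; 0.99 \;\; 0.081 \;\; 0.528 \right)$$

[Minus signs were lost when the source was extracted, so the signs of individual entries and of the first exponent are uncertain.] The coefficients of x<0>, x<1>, x<2>⋅x<2>, and x<0>⋅x<1> come out close to the values 1, 3, 10000, and 1 used to generate Y.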
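The effect of adding the auto- and cross-terms can be sketched as below. The ranges of X, the sample count, and the seed are assumptions; the generating formula for Y is the one used in this example. The sketch uses `np.linalg.lstsq` rather than forming the inverse in the normal equation, which gives the same least-squares solution by a numerically gentler route.

```python
import numpy as np

def expand(X):
    """Append auto- and cross-terms as columns 3..8, in the order used for Xx."""
    x0, x1, x2 = X[:, 0], X[:, 1], X[:, 2]
    return np.column_stack([X, x0 * x0, x1 * x1, x2 * x2,
                            x0 * x1, x0 * x2, x1 * x2])

rng = np.random.default_rng(1)
n = 200
X = (rng.random((n, 3)) - 0.5) * np.array([10.0, 1.0, 0.1])  # assumed ranges
noise = rng.random(n) * 0.1 - 0.05                            # uniform in [-0.05, 0.05]
Y = X[:, 0] + 3 * X[:, 1] + X[:, 0] * X[:, 1] + 10000 * X[:, 2] ** 2 + noise

Xx = expand(X)

# Fit once with the plain terms only and once with the expanded matrix.
a_plain, *_ = np.linalg.lstsq(X, Y, rcond=None)
a_full, *_ = np.linalg.lstsq(Xx, Y, rcond=None)

sse_plain = float(np.sum((Y - X @ a_plain) ** 2))
sse_full = float(np.sum((Y - Xx @ a_full) ** 2))
```

In this setup `a_full[5]` (the x2⋅x2 term) and `a_full[6]` (the x0⋅x1 term) should land near 10000 and 1, while the plain fit leaves most of the variation unexplained.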

Example -- Linearly correlated X. The independent variables have the same range as before; however, the first two independent variables x<0> and x<1> are almost linearly dependent, with x<0> being about 10 times x<1>.

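$$\mathrm{conde}(X^{T} X) = 1.205 \cdot 10^{4}$$ ... condition number of the last example

$$X := 0$$ ... reset X to prepare for a new assignment.

X_i := ... [assignment built from three uniform random terms of the form (rnd(1) − 0.5); the coefficients and the small noise term are not recoverable from the source]

↑ Without this small noise term, x<0> and x<1> are completely dependent and X^T⋅X is singular.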

$$Y := 0$$

$$Y_i := X_{i,0} + 3\,X_{i,1} + (\mathrm{rnd}(0.1) - 0.05)$$

$$y(x) := x \cdot \left( 1 \;\; 3 \;\; 0 \right)^{T}$$

$$a := (X^{T} X)^{-1} X^{T} Y$$

a = [values not recoverable from the source] ← The recovered structure from linear regression is not what we had put in. (Compare to the vector in y(x).)

$$\mathrm{conde}(X^{T} X) = 8.735 \cdot 10^{7}$$

← The condition number is very large, which means X^T⋅X is almost singular. This provides a warning that linear regression is breaking down.
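The jump in condition number can be reproduced in a rough sketch. The ranges, the size of the small noise term (0.01), the sample count, and the seed are all assumptions. `np.linalg.cond` uses the 2-norm by default, which may not be the same norm as Mathcad's `conde`, so only the order of magnitude is comparable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.random(n) - 0.5            # assumed range [-0.5, 0.5]
x2 = 0.1 * (rng.random(n) - 0.5)    # assumed range [-0.05, 0.05]

def cond_of_xtx(x0):
    """Condition number of X^T X for the given first column."""
    X = np.column_stack([x0, x1, x2])
    return float(np.linalg.cond(X.T @ X))

# Independent first column, spanning roughly [-5, 5].
cond_independent = cond_of_xtx(10.0 * (rng.random(n) - 0.5))

# First column is 10 * x1 plus a small noise term, so the columns nearly coincide.
cond_correlated = cond_of_xtx(10.0 * x1 + 0.01 * (rng.random(n) - 0.5))
```

Shrinking the noise term toward zero drives `cond_correlated` up without bound, which is the numerical face of X^T⋅X becoming singular.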

Regression model (which is not to be trusted):

$$y_{\mathrm{regress}}(x) := x \cdot a$$

Examples. The first example works O.K. because x<0> and x<1> are correlated in the same way as in the calibration data. On the other hand, the output numbers in the second example do not agree at all with the original model because the given x<0> and x<1> are not correlated the same way as the calibration data are. Note that although the input numbers are each within the range of the calibration data, the uncorrelated pattern in the second example is not included in the calibration data. Technically, this, too, is a case of extrapolation.

$$y_{\mathrm{regress}}\big((5 \;\; 0.5 \;\; 0.05)\big) = 6.507 \qquad y\big((5 \;\; 0.5 \;\; 0.05)\big) = 6.5$$ ← O.K.

$$y_{\mathrm{regress}}\big((5 \;\; 0.5 \;\; 0.05)\big) = 13.516 \qquad y\big((5 \;\; 0.5 \;\; 0.05)\big) = 3.5$$ ← totally off -- extrapolation.

[The two arguments appear identical because minus signs were lost when the source was extracted; an argument such as (5 −0.5 0.05) is consistent with y = 3.5.]
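The difference between an input that follows the calibration pattern and one that does not can be sketched as follows. The ranges, the noise level in x0, the sample count, the seed, and the off-pattern point (5, −0.5, 0.05) are assumptions made for illustration; the generating model x0 + 3⋅x1 is the one used in this example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.random(n) - 0.5
x0 = 10.0 * x1 + 0.01 * (rng.random(n) - 0.5)   # nearly 10 * x1 (assumed noise level)
x2 = 0.1 * (rng.random(n) - 0.5)
X = np.column_stack([x0, x1, x2])
Y = x0 + 3.0 * x1 + (rng.random(n) * 0.1 - 0.05)

a, *_ = np.linalg.lstsq(X, Y, rcond=None)

def y_true(x):
    """Noise-free generating model."""
    return x[0] + 3.0 * x[1]

on_pattern = np.array([5.0, 0.5, 0.05])     # x0 = 10 * x1, as in the calibration data
off_pattern = np.array([5.0, -0.5, 0.05])   # each value in range, but pattern unseen

err_on = abs(on_pattern @ a - y_true(on_pattern))
err_off = abs(off_pattern @ a - y_true(off_pattern))
```

The data pin down only the combination 10⋅a0 + a1, so any error in splitting it between a0 and a1 cancels for on-pattern inputs and adds up for off-pattern ones.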