



An introductory tutorial on multivariate regression, a statistical technique used to understand the relationship between multiple independent and dependent variables. The tutorial covers various examples, including chemical reactions, economic data, and student test scores, and explains how to find a set of regression coefficients that minimize the error between the dependent and predicted variables. The document also discusses the importance of sample size, representative samples, and avoiding extrapolation.




Multivariate Regression, leading up to Principal Component Regression/Analysis
-- an introductory tutorial to some of the most important ideas in multivariate regression.
Instructor: Nam Sun Wang
Multivariate Regression.
Let us expand the number of independent variables and dependent variables. Here, we are given a
set of data consisting of a series of m+1 independent variables x<0>, x<1>, ..., x<m> and l+1
dependent variables y<0>, y<1>, ..., y<l>. An example is how the quality, thickness, and strength of a
paper product (Y) depend on water content, source of fiber, digestion temperature, pH, etc. (X).
Another example is how the yield and composition in a chemical reactor (Y) depend on stirrer speed,
feed flow rate, reactant concentrations, ... (X). The chemical composition (Y) measured with a
chemical sensor may be related to the response of an array sensor (X). The mechanical or chemical
property of a material (Y) may depend on its color spectrum (X). An economic example may be how
the stock price and trading volume (Y) depend on the prevailing interest rate, the company's earnings,
the quarter in the calendar, ... (X). The gross national product (Y) may depend on a country's
population, literacy rate, average age, level of rainfall, ... (X). The probability of death, and thus the
premium of a life insurance policy (Y), may depend on the many attributes of the insured (X). The salary
and popularity of a football player (Y) may depend on his height, weight, running speed, strength,
running yards gained, passing yards gained, number of touchdowns, number of fumbles, hours of
practice per day, ... (X). The standardized test scores or the grade point average of a student (Y)
may depend on the number of hours spent in school, amount of daily TV time, the household
income, gender, the time of the day the test is taken, and maybe even the number of whip lashes
received since one's birth or the average number of glasses of milk one consumes daily (X).
Furthermore, a student's standardized test scores and grade point average may be closely
correlated. The examples are endless.
What we include as an independent variable need not actually affect the dependent variables in any
way. It is not necessarily a reflection of what we believe to affect the process. If we so desire, we
can throw in everything that may remotely affect the dependent variables. One thing regression tells
us is whether there is indeed any correlation between the two. A word of caution: existence of a
correlation does not imply the existence of an actual connection or the existence of a direct
cause-effect relationship. It is often true that "look and thou shall find." To judge whether a particular
degree of correlation is significant, we need to resort to tools from probability, hypothesis testing,
metrics, reliability, controlled experimentation, etc. In addition, we need to worry about a lot of other
things: including representative samples, ensuring an adequate sample size, defining the domain of
validity to avoid extrapolation, and detecting and rectifying outliers and gross errors -- none of which
will be addressed in this worksheet.
Let us start with some independent data x and some dependent data y.
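Raw data:
$$x^{\langle 0\rangle}_i,\ x^{\langle 1\rangle}_i,\ \ldots,\ x^{\langle m\rangle}_i \quad \text{and} \quad y^{\langle 0\rangle}_i,\ y^{\langle 1\rangle}_i,\ \ldots,\ y^{\langle l\rangle}_i, \qquad i = 0, \ldots, n$$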
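Combine these data into an independent matrix X and a dependent matrix Y, with one column per
variable and one row per observation i.
Raw data: X and Y
$$X = \begin{bmatrix} x^{\langle 0\rangle} & x^{\langle 1\rangle} & \cdots & x^{\langle m\rangle} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{\langle 0\rangle} & y^{\langle 1\rangle} & \cdots & y^{\langle l\rangle} \end{bmatrix}$$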
As before, our task is to find a set of regression coefficients a such that the given model function
f(X,a) passes through the given data (X,Y) as closely as possible. That is to say that we try to
minimize some sort of error between Y and f(X,a).
$$\min_{a}\ \text{error}, \qquad \text{error} = Y - f(X, a)$$
There are many measures of this error. One of them is the absolute error, which is mathematically
cumbersome to work with because the absolute value function is not differentiable at zero.
$$\min_{a}\ \lvert \text{error} \rvert = \lvert Y - f(X, a) \rvert$$
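As before, we try to minimize the sum of squared errors, which is mathematically more tractable.
$$\min_{a}\ \text{sse} = \sum_{i=0}^{n} \text{error}_i^{\,2} = \sum_{i=0}^{n} \sum_{k=0}^{l} \bigl( y_{i,k} - f(X_i, a)_k \bigr)^2 = E^T E = \bigl( Y - f(X, a) \bigr)^T \bigl( Y - f(X, a) \bigr)$$
Here E = Y - f(X,a) is the matrix of errors. When Y has more than one column, the double sum
equals the sum of the diagonal elements (the trace) of E^T E.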
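The sum of squared errors above can be checked numerically. The following is a minimal sketch in Python with NumPy (not part of the original Mathcad worksheet); the function name `sse` and the small data arrays are hypothetical, chosen only to show that the element-wise double sum and the trace of E^T E give the same number.

```python
import numpy as np

def sse(Y, Y_model):
    """Sum of squared errors between data Y and model values Y_model."""
    E = np.asarray(Y, dtype=float) - np.asarray(Y_model, dtype=float)
    return float(np.sum(E ** 2))

# Hypothetical numbers: 3 observations (rows), 2 dependent variables (columns).
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y_model = np.array([[1.5, 2.0], [2.0, 4.0], [5.0, 7.0]])

E = Y - Y_model
total = sse(Y, Y_model)                # element-wise double sum
trace_form = float(np.trace(E.T @ E))  # same quantity via E^T E
print(total, trace_form)
```

Both expressions sum the squares of every entry of E, so they agree whether Y has one column or several.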
Multivariate Linear Regression (MLR).
The simplest model is a linear one where the X matrix is simply the plain given set of independent
variables x<0>, x<1>, ..., x<m>. We can also have functions and combinations of x<•> in X (e.g., auto-
or cross-terms of two independent variables such as x<0>⋅x<0> and x<0>⋅x<1>; or functions of one or
several independent variables such as x<0>/x<1>, sin(x<0>), x<0>⋅exp(x<1>), etc.). At any rate, the
linear combination of these terms is expressed as:
$$Y = X a + E$$
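The examples that follow all solve for a with the normal equation. As a short reminder of where it comes from, here is a sketch of the derivation in the worksheet's symbols, written for a single column of Y and assuming X^T X is invertible; with several columns in Y the same steps apply column by column.

```latex
\begin{aligned}
\text{sse}(a) &= (Y - X a)^T (Y - X a) \\
\frac{\partial\,\text{sse}}{\partial a} &= -2\, X^T (Y - X a) = 0 \\
X^T X\, a &= X^T Y \\
a &= (X^T X)^{-1} X^T Y
\end{aligned}
```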
Example -- Multivariate Y. If there is more than one dependent variable, we perform regression
analysis no differently.
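[Worksheet definitions of the noise-free model y(x) := x⋅(coefficient matrix) and of the elements Y<i,0> and Y<i,1>, each with its own noise term ( rnd(0.1) − 0.05 ); the matrix entries and exact expressions are not legible in this copy.]
$$Y := y(X) + \bigl( \text{rnd}(0.1) - 0.05 \bigr)$$
↑ Toggled off because the noise is not added correctly with this formula.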
$$a := (X^T X)^{-1} X^T Y$$
a = [values not preserved in this copy]
← Linear regression has captured the underlying structure. Compare it to the matrix in y(x).
$$y_{\text{regress}}(x) := x \cdot a$$
Examples:
$$y_{\text{regress}}\bigl((5\ \ 0.5\ \ 0.05)\bigr) = (6.507\ \ 5.486), \qquad y\bigl((5\ \ 0.5\ \ 0.05)\bigr) = (6.5\ \ 5.5)$$
The following arguments are out of the calibration range. Extrapolation is dangerous.
$$y_{\text{regress}}\bigl((1\ \ 1\ \ 1)\bigr) = (3.894\ \ 1.996), \qquad y\bigl((1\ \ 1\ \ 1)\bigr) = (4\ \ 2)$$
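The multivariate-Y example can be imitated outside Mathcad. The sketch below uses Python with NumPy and is only an illustration: the sample size, the ranges of the three independent variables, and the coefficient matrix `A_true` are assumptions (the matrix was picked so that y((5 0.5 0.05)) = (6.5 5.5) and y((1 1 1)) = (4 2), as in the examples above; the worksheet's actual matrix is not known). Noise is drawn separately for every element of Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 50  # assumed number of observations

# Assumed ranges: x<0> in [-5, 5), x<1> in [-0.5, 0.5), x<2> in [-0.05, 0.05).
X = (rng.random((n_obs, 3)) - 0.5) * np.array([10.0, 1.0, 0.1])

# Hypothetical noise-free model: 3 independent -> 2 dependent variables.
A_true = np.array([[1.0, 1.0],
                   [3.0, 1.0],
                   [0.0, 0.0]])

# Independent uniform noise in [-0.05, 0.05) for each element of Y.
Y = X @ A_true + (rng.random((n_obs, 2)) - 0.5) * 0.1

# Normal equation a = (X^T X)^-1 X^T Y, solved without forming the inverse.
a = np.linalg.solve(X.T @ X, X.T @ Y)

def y_regress(x):
    """Regression prediction x . a for a row vector (or rows) x."""
    return np.asarray(x, dtype=float) @ a

print(a)
print(y_regress([5.0, 0.5, 0.05]))
```

Both columns of Y are fitted in a single solve, which is the sense in which regression with several dependent variables is "no different". Using `np.linalg.solve` instead of an explicit inverse is a design choice of this sketch, not something the worksheet prescribes.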
Example -- Linear Combinations of Nonlinear Basis Functions.
Generate the dependent variable which varies nonlinearly with the first two independent variables
plus a small amount of noise.
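$Y := 0$ ... reset the variable Y for a new assignment.
[Worksheet definitions of Y<i>, a nonlinear expression in the entries of row i of X plus the noise term ( rnd(0.1) − 0.05 ), and of the noise-free model y(x), written in terms of x and x^T; the expressions are not legible in this copy.]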
The least squares solution that ignores the auto- and cross-terms is again given by the same
normal equation.
$$a := (X^T X)^{-1} X^T Y$$
a = [values not preserved in this copy]
As expected, with just the plain X terms, we fail to recover the underlying
structural relationship between X and Y.
The regression equation is:
$$y_{\text{regress}}(x) := x \cdot a$$
Examples:
$$y_{\text{regress}}\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 9.774, \qquad y\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 34$$
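Goodness of fit:
[Worksheet definition and value of sse_old, the sum of squares of Y before regression; the expression and its value (of order 10^3) are not legible in this copy.]
$$\text{sse} := \bigl( Y - y_{\text{regress}}(X) \bigr) \cdot \bigl( Y - y_{\text{regress}}(X) \bigr), \qquad \text{sse} = 5.318 \times 10^{3}$$
$$r2 := \frac{\text{sse}_{\text{old}} - \text{sse}}{\text{sse}_{\text{old}}}, \qquad r2 = 8.019\,\%$$
$$r := \sqrt{r2}, \qquad r = 28.318\,\%$$
← Not too much variation in Y has been captured.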
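The goodness-of-fit ratio can be sketched in Python with NumPy as follows. This is an illustration, not the worksheet's code: the function name `r_squared` is hypothetical, and the baseline sse_old is assumed here to be the sum of squared deviations of Y from its column mean, which is the common convention but cannot be confirmed from this copy of the worksheet.

```python
import numpy as np

def r_squared(Y, Y_fit):
    """Fraction of the baseline sum of squares removed by the fit."""
    Y = np.asarray(Y, dtype=float)
    Y_fit = np.asarray(Y_fit, dtype=float)
    # Assumption: baseline is the deviation from the mean of each column.
    sse_old = np.sum((Y - Y.mean(axis=0)) ** 2)
    sse = np.sum((Y - Y_fit) ** 2)
    return float((sse_old - sse) / sse_old)

# Consistency check of the two numbers quoted in the worksheet: r = sqrt(r2).
r2_doc = 0.08019
r_doc = float(np.sqrt(r2_doc))
print(r_doc)

# Tiny hypothetical data set: a perfect fit and a fit no better than the mean.
Y_demo = np.array([1.0, 2.0, 3.0])
r2_perfect = r_squared(Y_demo, Y_demo)
r2_mean_only = r_squared(Y_demo, np.full(3, 2.0))
```

The square root of the quoted r2 reproduces the quoted r to the printed precision, which supports reading the definition as r := sqrt(r2).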
Repeat regression by including the cross terms. Below we expand the independent variable X to
include the cross terms. The resulting expanded matrix is Xx.
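$$Xx := X$$
$$Xx_{i,3} := X_{i,0}\, X_{i,0}, \qquad Xx_{i,4} := X_{i,1}\, X_{i,1}, \qquad Xx_{i,5} := X_{i,2}\, X_{i,2}$$
$$Xx_{i,6} := X_{i,0}\, X_{i,1}, \qquad Xx_{i,7} := X_{i,0}\, X_{i,2}, \qquad Xx_{i,8} := X_{i,1}\, X_{i,2}$$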
The least squares solution that now includes the auto- and cross-terms is again given by the same
normal equation.
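$$a := (Xx^T\, Xx)^{-1}\, Xx^T\, Y$$
$$a^T = \bigl( 0.998 \ \ 2.967 \ \ 0.036 \ \ 3.329 \times 10^{-4} \ \ 0.098 \ \ 1 \times 10^{-4} \ \ 0.99 \ \ 0.081 \ \ 0.528 \bigr)$$
(Minus signs did not survive in this copy; the exponents are read here as negative, and the signs of the individual coefficients are not known.)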
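The expansion of X with auto- and cross-terms, followed by the same normal equation, can be sketched in Python with NumPy. The helper name `expand_quadratic`, the sample size, the variable ranges, and the noise-free model used to generate y are all assumptions made for illustration; the column order matches the Xx columns 3 to 8 defined above.

```python
import numpy as np

def expand_quadratic(X):
    """Append auto-terms x_j*x_j, then cross-terms x_j*x_k (j < k), to X."""
    X = np.asarray(X, dtype=float)
    m = X.shape[1]
    auto = [X[:, j] * X[:, j] for j in range(m)]
    cross = [X[:, j] * X[:, k] for j in range(m) for k in range(j + 1, m)]
    return np.column_stack([X] + auto + cross)

def fit(M, y):
    """Least squares coefficients from the normal equation."""
    return np.linalg.solve(M.T @ M, M.T @ y)

rng = np.random.default_rng(1)
X = (rng.random((100, 3)) - 0.5) * np.array([10.0, 1.0, 0.1])  # assumed ranges

# Hypothetical nonlinear model in the first two variables, plus small noise.
y = X[:, 0] + 3.0 * X[:, 1] + X[:, 0] ** 2 + X[:, 0] * X[:, 1]
y = y + (rng.random(100) - 0.5) * 0.1

a_plain = fit(X, y)            # plain X terms only
Xx = expand_quadratic(X)       # 3 plain + 3 auto + 3 cross = 9 columns
a_full = fit(Xx, y)

sse_plain = float(np.sum((y - X @ a_plain) ** 2))
sse_full = float(np.sum((y - Xx @ a_full) ** 2))
print(sse_plain, sse_full)
print(a_full)
```

In this sketch the model is still linear in the coefficients a; only the columns of the data matrix are nonlinear functions of x, which is why the same normal equation applies.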
Example -- Linearly correlated X. The independent variables have the same range as before;
however, the first two independent variables x<0> and x<1> are mostly dependent, with x<0> being 10
times x<1>.
$$\text{conde}(X^T X) = 1.205 \times 10^{4}$$
... condition number of the last example
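$X := 0$ ... reset X to prepare for a new assignment.
[Worksheet definition of the rows of X, built from terms of the form ( rnd(1) − 0.5 ) and including a small noise term; the expression is not legible in this copy.]
↑ Without this small noise term, x<0> and x<1> are completely dependent and X^T X is singular.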
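[Worksheet definitions of Y<i>, computed from X<i,0> and X<i,1> plus the noise term ( rnd(0.1) − 0.05 ), and of the noise-free model y(x) := x⋅(coefficient vector); the expressions are not legible in this copy.]
$$a := (X^T X)^{-1} X^T Y$$
a = [values not preserved in this copy]
← The recovered structure from linear regression is not what we had put in.
(Compare to the vector in y(x))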
$$\text{conde}(X^T X) = 8.735 \times 10^{7}$$
← The condition number is very large, which means X^T⋅X is almost singular. This provides a
warning that linear regression is breaking down.
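The jump in condition number can be reproduced with a short Python/NumPy sketch. The sample size, variable ranges, and the size of the small noise term are assumptions. `np.linalg.cond` is used with its default 2-norm, so its values need not match Mathcad's `conde` exactly; the diagnostic purpose is the same.

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs = 100
scale = np.array([10.0, 1.0, 0.1])  # assumed ranges of x<0>, x<1>, x<2>

# Independent columns, as in the earlier examples.
X_indep = (rng.random((n_obs, 3)) - 0.5) * scale

# Correlated case: x<0> is 10 times x<1> plus a small noise term
# (the 1e-3 magnitude is a hypothetical choice).
X_corr = X_indep.copy()
X_corr[:, 0] = 10.0 * X_corr[:, 1] + 1e-3 * (rng.random(n_obs) - 0.5)

cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_corr = np.linalg.cond(X_corr.T @ X_corr)
print(cond_indep, cond_corr)
```

With the noise term set to zero, the first two columns of `X_corr` would be exactly proportional and X^T X would be singular, which is the situation the worksheet's small noise term avoids.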
Regression model (which is not to be trusted):
$$y_{\text{regress}}(x) := x \cdot a$$
Examples. The first example works O.K. because x<0> and x<1> are correlated. On the other
hand, the output numbers in the second example do not agree at all with the original model
because the given x<0> and x<1> are not correlated the same way as the calibration data are.
Note that although the input numbers are each within the range of the calibration data, the
uncorrelated pattern in the second example is not included in the calibration data. Technically,
this, too, is a case of extrapolation.
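$$y_{\text{regress}}\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 6.507, \qquad y\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 6.5$$ ← O.K.
$$y_{\text{regress}}\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 13.516, \qquad y\bigl((5\ \ 0.5\ \ 0.05)\bigr) = 3.5$$ ← totally off -- extrapolation.
(Minus signs did not survive in this copy. The second input has the same magnitudes as the first but a different sign pattern, so that x<0> and x<1> no longer follow the calibration pattern.)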
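The contrast between the two examples can be imitated with the following Python/NumPy sketch. Everything specific in it is an assumption made for illustration: the sample size, the noise magnitudes, the noise-free model `y_true`, and the choice of flipping the sign of x<1> in the off-pattern input. It fits on data where x<0> is close to 10 times x<1>, then predicts at one input that follows that pattern and one that does not.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs = 100

# Calibration data with x<0> nearly equal to 10 * x<1> (hypothetical noise size).
x1 = rng.random(n_obs) - 0.5
x0 = 10.0 * x1 + 1e-3 * (rng.random(n_obs) - 0.5)
x2 = 0.1 * (rng.random(n_obs) - 0.5)
X = np.column_stack([x0, x1, x2])

def y_true(x):
    """Hypothetical noise-free model y = x<0> + 3 x<1>."""
    x = np.asarray(x, dtype=float)
    return x[..., 0] + 3.0 * x[..., 1]

Y = y_true(X) + 0.1 * (rng.random(n_obs) - 0.5)
a = np.linalg.solve(X.T @ X, X.T @ Y)  # normal equation, nearly singular

x_on = np.array([5.0, 0.5, 0.05])    # follows x<0> = 10 * x<1>
x_off = np.array([5.0, -0.5, 0.05])  # each value in range, pattern not calibrated

err_on = abs(x_on @ a - y_true(x_on))
err_off = abs(x_off @ a - y_true(x_off))
print(a)
print(err_on, err_off)
```

The fitted coefficients for x<0> and x<1> are individually poorly determined, yet their combination along the calibrated direction is well determined. That is why the on-pattern prediction stays close to the model while the off-pattern prediction does not.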