








Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Exploratory data analysis (eda) is a statistical approach used when dealing with complex samples consisting of many components. Unlike confirmatory data analysis, which focuses on making one type of measurement, eda aims to learn something beyond simple numbers by collecting as much information as possible and using multivariate data analysis. The importance of eda, its steps, and the tools available for complex data sets.
Typology: Study notes
1 / 14
This page cannot be seen from the preview
Don't miss anything!









Up to now, we’ve dealt with simple statistical problems.
Primary goals were to
detect and quantify a single analyte.
develop relationships between an analyte and a response.
optimize an experiment design and methods used to
measure a response.
confirmatory data analysis.
When we obtain a set of samples and
make one type of measurement.
Many analytical methods are developed
to quantify a single analyte or a limited
number of analytes.
All other factors are held constant or
eliminated.
When we obtain many measurements
from a number of samples and attempt
to learn something about our sample
beyond simple numbers.
‘Real world’ problems are typically much
more complex. A true understanding of
a system may only be possible if many
factors are considered.
Complex sample consists of many components.
Each may contribute to the overall properties
of the sample.
A measurement of any single component or
property is unlikely to tell you much about
what the sample is.
Any type of sample can be either simple or
complex based on the type of information
desired regarding the sample.
Examples
Gasoline
Its overall performance as a fuel is
not based on the amount of any
single component.
Coffee
This material contains hundreds of
components. The flavor can’t be
attributed to any single component.
With current analytical tools, its possible to
detect and quantify most materials in a
complex sample.
Knowing that information, its still impossible
to state what the original sample was or be
able to precisely reproduce it.
Example - perfume reproductions.
Exploratory data analysis attempts to
detect and evaluate underlying trends a
data set.
This is accomplished by collecting as
much information about a problem as
possible and multivariate data analysis.
The introduction of the personal
computer made it possible for routine
evaluation of complex data sets (many
variables and samples.)
Example.
Assume you are doing QA/QC for a fertilizer
company.
You are provided with representative samples at 30
minute intervals. If there is a problem, you must
stop production. If you are wrong - you are fired!
Let’s see what happens to you level of knowledge as
we increase the amount of data.
Time!! % N
7:00 am! 15.
The 9:00 value appears low.
What should you do?
A simple statistical calculation for the first
four samples shows:
!! mean = 14.9, s x
! Your 9:00 sample is -9.6 s.
So you know that the value is significantly
lower (different) than the first four.
You don’t know why!
Your analysis could be bad or something
could be truly wrong in the plant.
Time!! % N! %P
7:00 am! 15.1!! 6.
By evaluating two components in our
sample, we now know more.
Another statistical evaluation shows
that for the first four samples:
mean!! 14.9!! 6.
s x
The 9:00 sample is low by 9.6 s for both
nitrogen and phosphorous.
You can be pretty confident that
something is wrong with the sample.
But what?
Artificial intelligence
Pattern recognition Neural networks
Parametric Non-parametric
Preprocessing Learning Display
Supervised Unsupervised
Preprocessing
! Data transformations such as scaling.
Display
Projection of our data into a limited
number of dimensions.
Unsupervised learning
Methods that require no initial assumptions.
Examples - cluster analysis and PCA.
Supervised learning
Methods that require initial assumptions or
a model. SIMCA and KNN are examples.
Methods assume that nearness in n-
dimensional space reflects similarities in
measured properties.
Each variable is treated as a dimension so
a data set with 10 measured properties
would be considered as existing in 10-
dimensional space.
Since we typically have a large number of
dimensions, we need a ‘standard’ way of
working with our data.
The first step is to convert our data into a
matrix where:
1 2 3 ... NV
1
X
1,
X
1,
X
1,
...
X
1,NV
2
X
2,
.. ....
3
X
3,
.. ....
... ..... ....
NP
X
NP,
.. ...
X
NP,
NV
Cases
A row of data where each value
corresponds to measured properties of a
specific sample
Variables or features
A column of data which corresponds to
one measured property for all samples.
While many of our methods would still
work if the definitions were reversed, its
useful if we have a ‘standard’ matrix.
We typically must initially convert our
data so that all measurements can be
compared.
It would be difficult to directly relate pH
of a solution to the peak area resulting
from its chromatographic analysis.
Qualitative data must also be converted
to a form that we can process.
Category data.
Convert to a numerical form.
!
Examples
hot/cold, day/night, gender
color - convert to RGB index
!
Your goal is to convert descriptive
information into a representative numerical
format.
One of the best approaches is autoscaling.
Use mean-centering and units of
standard deviation.
In essence, you are converting the data
into the ‘reduced variable.’ Actual units
are ‘standard deviation.’
All variables will have the same units
and occur over the same range.
Total variance of each variable = 1.
Total variance of your autoscaled
matrix will be = NV
ik
k
ik
k
s k
=
N - 1
x ik
k
` j !
2
1/
If your variables are already correlated -
in the same units - you can use:
This results in a variance of 1/(NP-1) for
each feature and NV/(NP-1) overall.
l x ik
=
x ik
k
` j
2
! : D
1/
x ik
k
` j
We commonly do a type of autoscaling when we
produce a graph.
Example - ppm Cl
vs. ppm Fe
200 300
8
0
ppm
Cl
ppm
Fe
300 0
300
0
ppm Cl
ppm
Fe
Autoscaling insures that all features are
expressed with the same units and weight.
1
-
1
-
100
-
-100 100
Absorbance
0
2
4
6
8
10
12
14
16
18
20
0 5 10 15 20
Original data
ppm As Absorbance
0 0.
2 0.
4 0.
6 0.
8 0.
10 0.
12 0.
14 0.
16 0.
18 0.
20 0.
Absorbance
0
2
4
6
8
10
-10 - 5 0 5 10
ppm As Absorbance
-10 -0.
-8 -0.
-6 -0.
-4 -0.
-2 -0.
0 0.
2 0.
4 0.
6 0.
8 0.
10 0.
Absorbance
0 0.2 0.4 0.6 0.8 1
ppm As Absorbance
0 0.
0.1 0.
0.2 0.
0.3 0.
0.4 0.
0.5 0.
0.6 0.
0.7 0.
0.8 0.
0.9 0.
1.0 1.
Absorbance
-2.
-1.
-1.
-0.
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.
ppm As Absorbance
-1.0508 -1.
-1.206 -1.
-0.905 -0.
-0.603 -0.
-0.302 -0.
0.000 0.
0.302 0.
0.603 0.
0.905 0.
1.206 1.
1.508 1.
You have several scaling
options that can be
accessed via ‘Variables
Transformation.’
Standardize (n-1) is the
same as autoscaling.
Results are best saved
to a new sheet or
workbook.
Weighting can be used to:
Measure the discriminating ability of a
variable in category separation.
Improve your classification results.
Weʼre essentially trying
to evaluate and/or
improve the resolution
of two features.
What do the weights show?
Paper grade
All weights are large.
All can provide a way to
classify grade.
You might want to consider
only using 4-6 variables with
the largest weights to save
time and money.
Paper GradePaper Grade SourceSource
Variance Fisher Variance Fisher
Na 7.39 6.62 1.49 0.
Al 66.95 22240.00 3.03 0.
Cl 9.96 137.10 1.67 0.
Ca 13.24 11.08 1.98 0.
Ti 17.94 12.78 1.66 0.
Cr 8.67 41.53 1.75 0.
Mn 13.01 15.53 2.37 0.
Zn 4.87 18.24 1.99 0.
Sb 10.19 31.68 1.92 0.
Ta 2.06 1.71 1.25 0.
N grade
N = 40 grade
= 40 N source
N = 9 source
= 9
Paper Source
This would be harder to do
since the weights are smaller.
However, they are still > 1 for
variance weighting, so it can
be done.
Again, it would be best to pick
the 4-6 variables with the
largest weights.
Paper GradePaper Grade SourceSource
Variance Fisher Variance Fisher
Na 7.39 6.62 1.49 0.
Al 66.95 22240.0 3.03 0.
Cl 9.96 137.10 1.67 0.
Ca 13.24 11.08 1.98 0.
Ti 17.94 12.78 1.66 0.
Cr 8.67 41.53 1.75 0.
Mn 13.01 15.53 2.37 0.
Zn 4.87 18.24 1.99 0.
Sb 10.19 31.68 1.92 0.
Ta 2.06 1.71 1.25 0.
N gradegrade
= 40 N sourcesource
= 9
In general, if we treat our data set as a
matrix, we are free to translate it.
This does not alter the significance of
any of the information.
This translation can be some form of
scaling or weighting. X’ = X
a
We can also rotate the matrix by
multiplying by a transform matrix.
This rotation changes the coordinates of
our matrix but not its variance.
Autoscaling and eigenvector rotations
work together to give us the best
possible viewpoint for our dataset.
As an example, lets say that you are
going to purchase your first truck.
From this vantage,
its difficult to make
any sort of choice.
It might not even be
a truck.
Here, we
are too
close.
This is an
ʻautoscaledʼ
view.
Its centered
and full
scale.
Unfortunately,
from this
angle, we
only get a
limited
amount of
information
We get as much information from a single view as
possible. Some information still can’t be seen.
A scaled, rotated view
of our example
The goal is to rotate our matrix so that we
have the maximum amount of variation
present in the minimum number of axes.
Eigenvector rotation
Create a new set of orthogonal axis.
σ
EV
σ
EV
σ
EV
... > σ
EVNV
Data structure is not changed.
diagonalization of either the
correlation or covariance matrix.
on the actual pattern recognition
method is being evaluated.
introduce the various methods.
In this example, our original data is reduced to
one variable after it is scaled and rotated.
Why?
Original Autoscaled Rotated
This example shows
both the original
variables and
the resulting
eigenvectors
EV 1
EV 2
Variable
1
Variable
2
Variable
3
Assume that our data had been
scaled such that the total variance
was NV.
You can then determine how much of
the original information is contained
in each eigenvector by
% variance =
NV
eigenvalue i
Autoscaled arson
related samples,
19 variables
Inherent data reduction
It is often possible to reduce complex
data sets to 2-5 eigenvector / score sets
and still express the majority of the
information.
Display
Reduction of the number of variables
makes it easier to evaluate our data.
Noise reduction
Truly random noise never correlates so
it remains in the residual matrix.
The car exhaust problem from the first
exam.
CO, NOx and HC levels from a set of
‘tailpipe’ tests.
We already know that the three
measurements are correlated.
Now, let’s look that the effect of
autoscaling and an eigenvector rotation.
y = -0.074x +
R
2 = 0.
y = 0.0279x + 0.
R
2 = 0.
0.00 5.00 10.00 15.00 20.00 25.
NOx
HC
Linear (NOx)
Linear (HC)
y = 0.9541x - 1E-
R
2 = 0.
y = -0.9584x - 3E-
R
2 = 0.
-3.
-2.
-1.
-1.500 -1.000 -0.500 0.000 0.500 1.000 1.500 2.000 2.500 3.
NOx
HC
Linear (HC)
Linear (NOx)
Now, intercepts are both zero and |slopes| are the same.
Note that variable labels have been replace to reflect
the fact that these are no longer our original ones.
-0.
-0.
-0.
-10 -5 0 5 10 15 20
F
F
Note that almost all of the data variance
now is on the X axis (F1). Also, F
accounts for more variance than F3.