Exploratory Data Analysis: An Approach to Complex Sample Analysis, Study notes of Advanced Physics

Exploratory data analysis (eda) is a statistical approach used when dealing with complex samples consisting of many components. Unlike confirmatory data analysis, which focuses on making one type of measurement, eda aims to learn something beyond simple numbers by collecting as much information as possible and using multivariate data analysis. The importance of eda, its steps, and the tools available for complex data sets.

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

gerrard_11
gerrard_11 🇬🇧

4.3

(6)

234 documents

1 / 14

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Exploratory data
analysis
Up to now, we’ve dealt with simple statistical problems.
Primary goals were to
detect and quantify a single analyte.
develop relationships between an analyte and a response.
optimize an experiment design and methods used to
measure a response.
confirmatory data analysis.
Confirmatory data analysis
When we obtain a set of samples and
make one type of measurement.
Many analytical methods are developed
to quantify a single analyte or a limited
number of analytes.
All other factors are held constant or
eliminated.
Exploratory data analysis
When we obtain many measurements
from a number of samples and attempt
to learn something about our sample
beyond simple numbers.
‘Real world’ problems are typically much
more complex. A true understanding of
a system may only be possible if many
factors are considered.
Complex samples
Complex sample consists of many components.
Each may contribute to the overall properties
of the sample.
A measurement of any single component or
property is unlikely to tell you much about
what the sample is.
Any type of sample can be either simple or
complex based on the type of information
desired regarding the sample.
Complex samples
Complex samples
With current analytical tools, its possible to
detect and quantify most materials in a
complex sample.
Knowing that information, its still impossible
to state what the original sample was or be
able to precisely reproduce it.
Example - perfume reproductions.
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe

Partial preview of the text

Download Exploratory Data Analysis: An Approach to Complex Sample Analysis and more Study notes Advanced Physics in PDF only on Docsity!

Exploratory data

analysis

Up to now, we’ve dealt with simple statistical problems.

Primary goals were to

detect and quantify a single analyte.

develop relationships between an analyte and a response.

optimize an experiment design and methods used to

measure a response.

confirmatory data analysis.

Confirmatory data analysis

When we obtain a set of samples and

make one type of measurement.

Many analytical methods are developed

to quantify a single analyte or a limited

number of analytes.

All other factors are held constant or

eliminated.

Exploratory data analysis

When we obtain many measurements

from a number of samples and attempt

to learn something about our sample

beyond simple numbers.

‘Real world’ problems are typically much

more complex. A true understanding of

a system may only be possible if many

factors are considered.

Complex samples

Complex sample consists of many components.

Each may contribute to the overall properties

of the sample.

A measurement of any single component or

property is unlikely to tell you much about

what the sample is.

Any type of sample can be either simple or

complex based on the type of information

desired regarding the sample.

Complex samples

Examples

Gasoline

Its overall performance as a fuel is

not based on the amount of any

single component.

Coffee

This material contains hundreds of

components. The flavor can’t be

attributed to any single component.

Complex samples

With current analytical tools, its possible to

detect and quantify most materials in a

complex sample.

Knowing that information, its still impossible

to state what the original sample was or be

able to precisely reproduce it.

Example - perfume reproductions.

When more is better

Exploratory data analysis attempts to

detect and evaluate underlying trends a

data set.

This is accomplished by collecting as

much information about a problem as

possible and multivariate data analysis.

The introduction of the personal

computer made it possible for routine

evaluation of complex data sets (many

variables and samples.)

When more is better

Example.

Assume you are doing QA/QC for a fertilizer

company.

You are provided with representative samples at 30

minute intervals. If there is a problem, you must

stop production. If you are wrong - you are fired!

Let’s see what happens to you level of knowledge as

we increase the amount of data.

When more is better

Time!! % N

7:00 am! 15.

The 9:00 value appears low.

What should you do?

When more is better

A simple statistical calculation for the first

four samples shows:

!! mean = 14.9, s x

! Your 9:00 sample is -9.6 s.

So you know that the value is significantly

lower (different) than the first four.

You don’t know why!

Your analysis could be bad or something

could be truly wrong in the plant.

When more is better

Time!! % N! %P

7:00 am! 15.1!! 6.

By evaluating two components in our

sample, we now know more.

When more is better

Another statistical evaluation shows

that for the first four samples:

!!! % N!! %P

mean!! 14.9!! 6.

s x

The 9:00 sample is low by 9.6 s for both

nitrogen and phosphorous.

You can be pretty confident that

something is wrong with the sample.

But what?

Pattern recognition

Artificial intelligence

Pattern recognition Neural networks

Parametric Non-parametric

Preprocessing Learning Display

Supervised Unsupervised

Pattern recognition

Preprocessing

! Data transformations such as scaling.

Pattern recognition

Display

Projection of our data into a limited

number of dimensions.

Pattern recognition

Unsupervised learning

Methods that require no initial assumptions.

Examples - cluster analysis and PCA.

Pattern recognition

Supervised learning

Methods that require initial assumptions or

a model. SIMCA and KNN are examples.

Data

For most systems, we want an

overdetermined dataset with at

least three samples for each

measured variable.

This is not always possible but the

ratio of samples to variables

should always be greater than one.

Data

Methods assume that nearness in n-

dimensional space reflects similarities in

measured properties.

Each variable is treated as a dimension so

a data set with 10 measured properties

would be considered as existing in 10-

dimensional space.

Since we typically have a large number of

dimensions, we need a ‘standard’ way of

working with our data.

The data matrix

The first step is to convert our data into a

matrix where:

1 2 3 ... NV

1

X

1,

X

1,

X

1,

...

X

1,NV

2

X

2,

.. ....

3

X

3,

.. ....

... ..... ....

NP

X

NP,

.. ...

X

NP,

NV

The data matrix

Cases

A row of data where each value

corresponds to measured properties of a

specific sample

Variables or features

A column of data which corresponds to

one measured property for all samples.

While many of our methods would still

work if the definitions were reversed, its

useful if we have a ‘standard’ matrix.

Pre-processing methods

We typically must initially convert our

data so that all measurements can be

compared.

It would be difficult to directly relate pH

of a solution to the peak area resulting

from its chromatographic analysis.

Qualitative data must also be converted

to a form that we can process.

Initial data evaluation

Category data.

Convert to a numerical form.

!

Examples

hot/cold, day/night, gender

  • convert to 1 and 0

color - convert to RGB index

!

Your goal is to convert descriptive

information into a representative numerical

format.

Initial data evaluation

Missing data

Some samples may be missing one

or more variables.

Its best to avoid this by only using

cases that are complete.

If you must use incomplete data

then you have several filling

options.

One of the best approaches is autoscaling.

Use mean-centering and units of

standard deviation.

In essence, you are converting the data

into the ‘reduced variable.’ Actual units

are ‘standard deviation.’

All variables will have the same units

and occur over the same range.

Total variance of each variable = 1.

Autoscaling

Total variance of your autoscaled

matrix will be = NV

l

x

ik

s

k

x

ik

  • x

k

s k

=

N - 1

x ik

  • x

k

` j !

2

> H

1/

Autoscaling

If your variables are already correlated -

in the same units - you can use:

This results in a variance of 1/(NP-1) for

each feature and NV/(NP-1) overall.

l x ik

=

x ik

  • x

k

` j

2

! : D

1/

x ik

  • x

k

` j

Autoscaling

We commonly do a type of autoscaling when we

produce a graph.

Example - ppm Cl

vs. ppm Fe

200 300

8

0

ppm

Cl

ppm

Fe

300 0

300

0

ppm Cl

ppm

Fe

Autoscaling

Autoscaling insures that all features are

expressed with the same units and weight.

1

-

1

-

100

-

-100 100

Scaling example

Absorbance

0

2

4

6

8

10

12

14

16

18

20

0 5 10 15 20

Original data

ppm As Absorbance

0 0.

2 0.

4 0.

6 0.

8 0.

10 0.

12 0.

14 0.

16 0.

18 0.

20 0.

Mean Centered Scaling example

Absorbance

    • 8
    • 6
    • 4
    • 2

0

2

4

6

8

10

-10 - 5 0 5 10

ppm As Absorbance

-10 -0.

-8 -0.

-6 -0.

-4 -0.

-2 -0.

0 0.

2 0.

4 0.

6 0.

8 0.

10 0.

Range Scaling example

Absorbance

0 0.2 0.4 0.6 0.8 1

ppm As Absorbance

0 0.

0.1 0.

0.2 0.

0.3 0.

0.4 0.

0.5 0.

0.6 0.

0.7 0.

0.8 0.

0.9 0.

1.0 1.

Autoscaling example

Absorbance

-2.

-1.

-1.

-0.

-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.

ppm As Absorbance

-1.0508 -1.

-1.206 -1.

-0.905 -0.

-0.603 -0.

-0.302 -0.

0.000 0.

0.302 0.

0.603 0.

0.905 0.

1.206 1.

1.508 1.

Scaling in XLStat

You have several scaling

options that can be

accessed via ‘Variables

Transformation.’

Standardize (n-1) is the

same as autoscaling.

Results are best saved

to a new sheet or

workbook.

XLStat results

Feature weighting

Weighting can be used to:

Measure the discriminating ability of a

variable in category separation.

Improve your classification results.

Weʼre essentially trying

to evaluate and/or

improve the resolution

of two features.

Example

What do the weights show?

Paper grade

All weights are large.

All can provide a way to

classify grade.

You might want to consider

only using 4-6 variables with

the largest weights to save

time and money.

Paper GradePaper Grade SourceSource

Variance Fisher Variance Fisher

Na 7.39 6.62 1.49 0.

Al 66.95 22240.00 3.03 0.

Cl 9.96 137.10 1.67 0.

Ca 13.24 11.08 1.98 0.

Ti 17.94 12.78 1.66 0.

Cr 8.67 41.53 1.75 0.

Mn 13.01 15.53 2.37 0.

Zn 4.87 18.24 1.99 0.

Sb 10.19 31.68 1.92 0.

Ta 2.06 1.71 1.25 0.

N grade

N = 40 grade

= 40 N source

N = 9 source

= 9

Example

Paper Source

This would be harder to do

since the weights are smaller.

However, they are still > 1 for

variance weighting, so it can

be done.

Again, it would be best to pick

the 4-6 variables with the

largest weights.

Paper GradePaper Grade SourceSource

Variance Fisher Variance Fisher

Na 7.39 6.62 1.49 0.

Al 66.95 22240.0 3.03 0.

Cl 9.96 137.10 1.67 0.

Ca 13.24 11.08 1.98 0.

Ti 17.94 12.78 1.66 0.

Cr 8.67 41.53 1.75 0.

Mn 13.01 15.53 2.37 0.

Zn 4.87 18.24 1.99 0.

Sb 10.19 31.68 1.92 0.

Ta 2.06 1.71 1.25 0.

N gradegrade

= 40 N sourcesource

= 9

Eigenvector rotations

In general, if we treat our data set as a

matrix, we are free to translate it.

This does not alter the significance of

any of the information.

This translation can be some form of

scaling or weighting. X’ = X

a

We can also rotate the matrix by

multiplying by a transform matrix.

X’ = X A
T

Eigenvector rotations

This rotation changes the coordinates of

our matrix but not its variance.

Autoscaling and eigenvector rotations

work together to give us the best

possible viewpoint for our dataset.

As an example, lets say that you are

going to purchase your first truck.

Example

From this vantage,

its difficult to make

any sort of choice.

It might not even be

a truck.

Here, we

are too

close.

Example

This is an

ʻautoscaledʼ

view.

Its centered

and full

scale.

Unfortunately,

from this

angle, we

only get a

limited

amount of

information

Example

We get as much information from a single view as

possible. Some information still can’t be seen.

A scaled, rotated view

of our example

Eigenvector rotations

The goal is to rotate our matrix so that we

have the maximum amount of variation

present in the minimum number of axes.

Eigenvector rotation

Create a new set of orthogonal axis.

σ

EV

σ

EV

σ

EV

... > σ

EVNV

Data structure is not changed.

Eigenvector rotations

  • These rotations are accomplished by

diagonalization of either the

correlation or covariance matrix.

  • Which matrix you use will be based

on the actual pattern recognition

method is being evaluated.

  • We’ll discuss the differences as we

introduce the various methods.

Eigenvector rotations

In this example, our original data is reduced to

one variable after it is scaled and rotated.

Why?

Original Autoscaled Rotated

Eigenvector rotations

This example shows

both the original

variables and

the resulting

eigenvectors

EV 1

EV 2

Variable

1

Variable

2

Variable

3

Eigenvalues

Another term that we typically

obtain is the eigenvalue.

One eigenvalue for each

eigenvector.

It indicates how much of the

original information is contained

in each eigenvector.

Eigenvalues

Assume that our data had been

scaled such that the total variance

was NV.

You can then determine how much of

the original information is contained

in each eigenvector by

% variance =

NV

eigenvalue i

Eigenvalue example

Autoscaled arson

related samples,

19 variables

Advantages of eigenvector rotation

Inherent data reduction

It is often possible to reduce complex

data sets to 2-5 eigenvector / score sets

and still express the majority of the

information.

Display

Reduction of the number of variables

makes it easier to evaluate our data.

Noise reduction

Truly random noise never correlates so

it remains in the residual matrix.

Example

The car exhaust problem from the first

exam.

CO, NOx and HC levels from a set of

‘tailpipe’ tests.

We already know that the three

measurements are correlated.

Now, let’s look that the effect of

autoscaling and an eigenvector rotation.

Original

data

y = -0.074x +

R

2 = 0.

y = 0.0279x + 0.

R

2 = 0.

0.00 5.00 10.00 15.00 20.00 25.

NOx

HC

Linear (NOx)

Linear (HC)

Autoscaled

y = 0.9541x - 1E-

R

2 = 0.

y = -0.9584x - 3E-

R

2 = 0.

-3.

-2.

-1.

-1.500 -1.000 -0.500 0.000 0.500 1.000 1.500 2.000 2.500 3.

NOx

HC

Linear (HC)

Linear (NOx)

Now, intercepts are both zero and |slopes| are the same.

Rotated data

Note that variable labels have been replace to reflect

the fact that these are no longer our original ones.

-0.

-0.

-0.

-10 -5 0 5 10 15 20

F

F

Note that almost all of the data variance

now is on the X axis (F1). Also, F

accounts for more variance than F3.