Chemometric Classification, Lecture Notes- Physics, Study notes of Advanced Physics

Classification methods, Data preprocessing, leave-one-out validation, rule building method, Discriminant analysis, Coffee example, Iris example, K nearest neighbour classification, KNN, K-means

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

gerrard_11
Classification Methods
Up to now we have been concerned with
methods that:
Display complex information.
Detect patterns or trends.
Now we will introduce methods that can
be used to classify samples based on
models that are developed.
Classification problems
Level I
Simple classification into predefined categories.
Level II
Level I + detection of outliers.
Level III
Level II + prediction of an external property.
Level IV
Level II + prediction of more than one property.
Classification Methods
Many methods have been developed, with new ones
being published all the time.
We’ll look at some representative approaches.
Linear Learning Machine
Discriminant Analysis
Classification Trees
K Nearest Neighbor
SIMCA
The available methods and approaches may vary
depending on the package used.
Supported by
XLStat
Classification Methods
All of these methods are considered
supervised learning.
Initial assumptions regarding
membership or properties are made
when developing a model.
An initial evaluation of the data using
exploratory data analysis is useful.
Data sets
Needed to develop and evaluate a classification model.
Training set
Representative samples used to build the model.
The modeling software uses the class information.
Evaluation set
Samples of known class, used to test the model.
The modeling software does not know the classes.
Test set
True unknowns.

Data pre-processing

With any of these methods, you may choose

to do some sort of data preprocessing.

Raw

Is fastest.

Scaled

Gives equal weight to the variables.

PCA

Can be used to reduce noise and remove insignificant variables.
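As an illustration of the "Scaled" option, here is a minimal sketch assuming that scaling means autoscaling: each variable is mean-centered and divided by its standard deviation. The function name `autoscale` and the sample values are made up for this example.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Mean-center each column and divide by its sample standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; leave it centered rather than divide by 0.
    sds = [stdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical data: two variables on very different scales.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = autoscale(raw)
```

After scaling, both columns span the same range, so neither dominates a distance calculation simply because of its units.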

Data pre-processing

With some data sets, you may also want to apply other types of pre-processing.

Example. Spectral or chromatographic traces.

Options may include:

Smoothing, baseline correction, signal

averaging, using the first or second

derivative.
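Two of the options listed above, smoothing and the first derivative, can be sketched in a few lines. This assumes a simple centered moving average and adjacent-point differences; real packages often use other filters (Savitzky-Golay, for example). The function names and the trace values are hypothetical.

```python
def moving_average(signal, window=3):
    """Smooth with a centered moving average (odd window); ends are left as-is."""
    half = window // 2
    out = list(signal)
    for i in range(half, len(signal) - half):
        out[i] = sum(signal[i - half:i + half + 1]) / window
    return out

def first_derivative(signal, step=1.0):
    """Approximate the first derivative by differences of adjacent points."""
    return [(b - a) / step for a, b in zip(signal, signal[1:])]

trace = [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 0.0]  # hypothetical noisy trace
smooth = moving_average(trace, window=3)
slope = first_derivative([0.0, 1.0, 4.0, 9.0])
```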

Creating an evaluation set

The evaluation set is typically a sub-set of

the training set that was omitted when

building a model.

Randomly pick a subset of the data.

Randomly pick members from each class.

Any approach that selectively removes a

portion of the data could cause bias.
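Picking members at random from each class can be sketched as below. The 25 % hold-out fraction, the fixed seed, and the name `stratified_split` are assumptions chosen for illustration only.

```python
import random
from collections import defaultdict

def stratified_split(labels, fraction=0.25, seed=0):
    """Return (training_indices, evaluation_indices), sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    evaluation = []
    for members in by_class.values():
        n_out = max(1, round(len(members) * fraction))
        evaluation.extend(rng.sample(members, n_out))
    held_out = set(evaluation)
    training = [i for i in range(len(labels)) if i not in held_out]
    return training, sorted(evaluation)

labels = ["A"] * 8 + ["B"] * 4  # hypothetical class labels
train_idx, eval_idx = stratified_split(labels)
```

Because the selection is random within each class, every class is represented in the evaluation set without the analyst choosing which samples to remove.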

Leave-one-out validation

A standardized approach for

validation of a model where each

sample serves as an evaluation set.

1. Omit a single sample from the set

2. Build the model

3. Test the omitted sample

4. Repeat the above steps until each sample has

been omitted and tested once.
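The four steps above can be sketched as a loop. The notes do not tie leave-one-out to any particular model, so a nearest-centroid classifier is used here purely as a stand-in; all names and data are hypothetical.

```python
import math

def nearest_centroid(train_x, train_y, sample):
    """Stand-in model: assign the class whose mean vector is closest."""
    best_label, best_dist = None, math.inf
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = math.dist(centroid, sample)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out(xs, ys, classify=nearest_centroid):
    """Fraction of samples classified correctly when each is omitted in turn."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]                  # 1. omit a single sample
        train_y = ys[:i] + ys[i + 1:]
        predicted = classify(train_x, train_y, xs[i])  # 2-3. build model, test sample
        hits += predicted == ys[i]
    return hits / len(xs)                              # 4. every sample tested once

xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
ys = ["A", "A", "A", "B", "B", "B"]
accuracy = leave_one_out(xs, ys)
```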

Your data

While ‘Leave-One-Out’ testing is the best approach, it can be slow for large sets.

An alternative is to leave two or more samples out with each pass.

Samples should be randomly ordered in the matrix.

The same two (or more) samples should never be omitted together more than once.
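One way to meet both requirements (random order, no pair omitted together twice) is to shuffle the sample indices once and cut them into disjoint groups. This is a sketch under that assumption; `leave_n_out_groups` is a hypothetical helper name.

```python
import random

def leave_n_out_groups(n_samples, group_size, seed=0):
    """Shuffle sample indices, then cut them into disjoint omitted groups."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i:i + group_size] for i in range(0, n_samples, group_size)]

groups = leave_n_out_groups(n_samples=10, group_size=2)
```

Each group is omitted in one pass. Since every sample appears in exactly one group, no two samples are ever left out together more than once.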

Rule building methods

Methods in which a set of rules is created to discriminate
between classes.
Linear learning machine
One or more linear vectors are created to
discriminate between classes.
Discriminant analysis
Linear or quadratic equations are used to separate
classes.
Classification trees
A series of rules is used to classify sequentially.

Linear learning machine

The assumption is
that one or more
vectors can be found
that can be used to
discriminate between
our classes.
This can make use
of our ‘raw’ data or
work in PC space.
PC space would be
better as there would
be noise reduction.
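The notes do not say how the discriminating vector is found. One common formulation is perceptron-style error correction, sketched here for two classes coded as +1 and -1. The training rule, the function names, and the data are all assumptions for illustration.

```python
def train_linear_machine(xs, ys, passes=100, rate=1.0):
    """Find weights w (last entry is a bias) with sign(w . x) matching ys (+1/-1)."""
    w = [0.0] * (len(xs[0]) + 1)
    for _ in range(passes):
        errors = 0
        for x, y in zip(xs, ys):
            augmented = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, augmented))
            if y * score <= 0:  # misclassified: nudge the vector toward y
                w = [wi + rate * y * xi for wi, xi in zip(w, augmented)]
                errors += 1
        if errors == 0:  # every training sample is on the correct side
            break
    return w

def classify(w, x):
    """Report which side of the discriminating vector a sample falls on."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score > 0 else -1

xs = [[1.0, 1.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.5]]  # hypothetical samples
ys = [-1, -1, 1, 1]
w = train_linear_machine(xs, ys)
predictions = [classify(w, x) for x in xs]
```

The same code works on raw variables or on PC scores; only the input columns change.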

[Figure: iris data on factor axes F1 (99.01 %) and F2 (0.99 %). Variables shown: Petal width, Petal length, Sepal Width, Sepal Length. Samples are labelled by class (1, 2, 3).]

Coffee example

This consisted of 6 types of

coffee - identified based on MS

data.

To avoid collinearity and null

variable problems, PCA scores

were used (first 5 components).

[Figure: coffee samples labelled by type (C, E, K, R, S, U) on factor axes F1 (56.19 %) and F2 (22.79 %).]

Classification trees

Predicts class membership by sequential
application of rules based on predictor variables.
With DA and LLM, you create a set of math
models that are all applied at once.
With classification trees, the predictor variables
are evaluated as ordinal rules, one at a time.

Classification trees

[Figure: example classification tree with the decision nodes ‘Solid - liquid’, ‘Density > 1’ and ‘Red or green’.]

Iris example (yet again!)

XLStat supports the use of

classification and regression trees.

Classification if the Y variable (class)

is qualitative, regression if the Y

variable is quantitative.

The iris example is a classification

example.

Iris example

If Petal width is between 1 and 8,
then assign to Species 1.
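A rule of this kind is just an interval test applied in sequence. In the sketch below the first rule is the one stated above. The second mirrors the [8, 25[ interval that appears in the tree output and stands for a node that still needs further splits. The data layout and names are hypothetical.

```python
# Each rule: (variable, lower bound inclusive, upper bound exclusive, outcome).
RULES = [
    ("petal_width", 1, 8, "Species 1"),
    ("petal_width", 8, 25, "Species 2 or 3 (needs further splits)"),
]

def classify_by_rules(sample, rules=RULES):
    """Apply the rules one at a time and stop at the first that matches."""
    for variable, low, high, outcome in rules:
        if low <= sample[variable] < high:
            return outcome
    return "unclassified"

result = classify_by_rules({"petal_width": 2})
```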

[Figure: XLStat classification tree for the iris data. Splitting variables shown: Petal width, Petal length, Sepal Width, Sepal Length. Node details recoverable from the figure are listed below; "-" marks a value that is not legible, and truncated percentages are shown as extracted.]

Node  Size  %     Purity(%)  Interval
1     150   100   -          -
2     50    33.   100        [1, 8[
3     100   66.   50         [8, 25[
4     53    35.   -          [8, 16.5[
5     47    31.   -          [16.5, 25[
6     46    30.   -          [30, 50.5[
7     7     4.    -          [50.5, 58[
8     41    27.   100        [30, 47.5[
9     5     3.    80         [47.5, 50.5[
10    2     1.    50         [22, 23.5[
11    3     2     100        [23.5, 31[
12    3     2     -          [60, 62.5[
13    4     2.    75         [62.5, 72[
14    10    6.    80         [45, 50.5[
15    37    24.   100        [50.5, 69[
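The Purity(%) values in the tree output appear consistent with the share of a node's samples that belong to its largest class. That reading is an inference from the numbers, not something the notes state. A minimal check, using the class counts shown for nodes 9 (4 and 1) and 13 (3 and 1):

```python
def node_purity(class_counts):
    """Percentage of a node's samples that belong to its largest class."""
    return 100.0 * max(class_counts) / sum(class_counts)

purity_node_9 = node_purity([4, 1])   # node of size 5
purity_node_13 = node_purity([3, 1])  # node of size 4
```

These give 80 and 75, matching the purities reported for those nodes.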

Confusion matrix for the estimation sample:
from \ to  CaC  CaR  OhC  OhR  Total  % correct
CaC        17   0    0    0    17     100.0%
CaR        0    9    0    1    10     90.0%
OhC        0    1    7    0    8      87.5%
OhR        0    1    0    5    6      83.3%
Total      17   11   7    6    41     92.7%
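The "% correct" column follows directly from the counts: the diagonal entry divided by the row total, and for the overall figure the sum of the diagonal divided by all samples. A short sketch that reproduces the values in the table (the function name is hypothetical):

```python
classes = ["CaC", "CaR", "OhC", "OhR"]
# Rows are the true class ("from"), columns the assigned class ("to").
confusion = [
    [17, 0, 0, 0],
    [0, 9, 0, 1],
    [0, 1, 7, 0],
    [0, 1, 0, 5],
]

def percent_correct(matrix):
    """Per-class and overall percentage of samples on the diagonal."""
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in matrix)
    overall = 100.0 * sum(matrix[i][i] for i in range(len(matrix))) / total
    return per_class, overall

per_class, overall = percent_correct(confusion)
```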

K nearest neighbor classification

A similarity-based classification method.

It attempts to assign categories to unknown

samples based on multivariate proximity to other

samples.

It works best with discrete classification types and

is tolerant of poor data sets.

K - the number of closest neighbors being

compared.

Consider this as the supervised version of HCA.

K nearest neighbor classification

In its simplest form, KNN is conducted by:

First, a training set is collected that

contains examples of each class.

Intersample distances are then calculated:

$$d_{ab} = \sqrt{\sum_{j=1}^{N} (a_j - b_j)^2}$$

where N = number of variables or components used.
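Assuming the intersample distance is the ordinary Euclidean distance over the N variables, the calculation and the sorting step that follows can be sketched as below. The function names and the training samples are hypothetical.

```python
import math

def intersample_distance(a, b):
    """Euclidean distance over the N variables (or components) of a and b."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def nearest_neighbors(training, unknown, k):
    """Return the k (distance, class) pairs closest to the unknown sample."""
    distances = [(intersample_distance(x, unknown), label) for x, label in training]
    return sorted(distances)[:k]

training = [([0.0, 0.0], "A"), ([3.0, 4.0], "B"), ([1.0, 0.0], "A")]
neighbors = nearest_neighbors(training, [0.0, 0.0], k=2)
```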

KNN

The distance matrix is sorted and the

distance of the unknown sample can be

compared to:

1. The K nearest neighbors

2. The nearest class cluster.

Option 2 requires that K = 1.

KNN

When using the distance to a class, you

can use the same link options that were

discussed earlier.

The distance can be based on:

Single link - closest member of class.

Complete link - farthest member of class.

Centroid - center of class cluster.
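The three link options differ only in which point of the class the unknown is measured against. A minimal sketch, with a hypothetical function name and made-up class members:

```python
import math

def class_distance(unknown, members, link="single"):
    """Distance from an unknown sample to a class under a chosen link option."""
    if link == "single":      # closest member of the class
        return min(math.dist(unknown, m) for m in members)
    if link == "complete":    # farthest member of the class
        return max(math.dist(unknown, m) for m in members)
    if link == "centroid":    # center of the class cluster
        centroid = [sum(col) / len(members) for col in zip(*members)]
        return math.dist(unknown, centroid)
    raise ValueError(f"unknown link option: {link}")

red = [[1.0, 0.0], [3.0, 0.0]]  # hypothetical class members
unknown = [0.0, 0.0]
distances = {link: class_distance(unknown, red, link)
             for link in ("single", "complete", "centroid")}
```

For this class the three options give 1.0, 3.0 and 2.0, so the choice of link can change which class looks nearest.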

KNN - single link

In this example, the

unknown is compared

to the 3 closest

known samples.

In this case, the three

closest samples are

all ‘red.’

Single link

K = 3

KNN - centroid link

Centroid link

With this approach,

the distance to the

center of a class

cluster is determined

and compared.

KNN

Ideally, if a test sample falls well within a known class,
its closest neighbors should all be of one class.
Here, all of the
‘blue’ samples
would be closer
to the unknown
than any of the
green.

Mycobacteria - HCA

[Figure: HCA dendrogram of the mycobacteria samples, with leaves labelled by class (42, 43, 44, 45, 46, 47, 49) and an axis scale from 0 to 1000.]

A quick review of ALL of the ways
that this data set was difficult to get
useful information from.

Mycobacteria - k means

Mycobacteria - PCA

[Figure: PCA scores plot of the mycobacteria samples, labelled by class (42, 43, 44, 45, 46, 47, 49).]

Mycobacteria - DA

Getting out the vote

Example - K = 5

Sample Class Distance

1 A 0.

2 B 0.

3 A 0.

4 B 0.

5 C 0.

Here, A and B would tie with two votes each. The tie-breaker is that A averages a smaller distance, so A would be made the winner.
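The voting rule with its tie-breaker can be sketched as follows. The class sequence matches the K = 5 example, but the distances are hypothetical because the values in the notes are truncated; `vote` is an illustrative name.

```python
from collections import defaultdict

def vote(neighbors):
    """Pick the class with most votes; break ties by smaller mean distance."""
    by_class = defaultdict(list)
    for label, distance in neighbors:
        by_class[label].append(distance)
    # Sort key: most votes first, then smallest average distance.
    return min(by_class, key=lambda c: (-len(by_class[c]),
                                        sum(by_class[c]) / len(by_class[c])))

# Hypothetical distances for the five nearest neighbors.
neighbors = [("A", 0.10), ("B", 0.15), ("A", 0.20), ("B", 0.25), ("C", 0.30)]
winner = vote(neighbors)
```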

KNN validation

The optimum number for K can be found

by trial and error, but for a close match it

should make no difference.

The classifying power of your data can be

evaluated by leave one out validation of

your training set.

This should be done before any sort of

real classification begins.

KNN validation

Validation

You can sequentially leave out each of your

samples and test it for ‘votes’ at several K

values.

You end up with a vote matrix that will tell you

the optimum K value for each class.

You will also get a misclassification matrix, which tells you how often each of your knowns is incorrectly classified.
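A minimal sketch of this validation: each known is left out in turn and classified by majority vote at several K values, and the misclassified (true, predicted) pairs are collected per K. The data are hypothetical, with one "A" sample deliberately placed inside class "B"; the function names are illustrative.

```python
import math
from collections import Counter

def knn_predict(train, sample, k):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_misclassifications(data, k_values):
    """For each K, list (true, predicted) pairs that leave-one-out gets wrong."""
    errors = {}
    for k in k_values:
        wrong = []
        for i, (x, label) in enumerate(data):
            predicted = knn_predict(data[:i] + data[i + 1:], x, k)
            if predicted != label:
                wrong.append((label, predicted))
        errors[k] = wrong
    return errors

data = [([0.0, 0.0], "A"), ([0.1, 0.0], "A"), ([0.0, 0.1], "A"),
        ([1.0, 1.0], "B"), ([1.1, 1.0], "B"), ([1.0, 1.1], "B"),
        ([1.05, 1.05], "A")]
errors = loo_misclassifications(data, k_values=[1, 3])
```

With K = 1 the stray sample drags three "B" knowns into class "A". With K = 3 only the stray sample itself is misclassified, which is the kind of comparison used to pick K.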

K nearest neighbor classification

So KNN will always assign a class.
What if you have a material that is not a member of
an existing class?
One option is to set a maximum distance.
Example
If your intraclass distances run about 0.2 for all of
your classes, you might want to omit votes with
distances that exceed 0.2.
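The maximum-distance idea can be sketched by discarding votes beyond the cutoff and reporting no class when none remain. The 0.2 cutoff is the value from the example; the function name and neighbor lists are hypothetical, and ties are not handled in this sketch.

```python
def vote_with_cutoff(neighbors, max_distance=0.2):
    """Count votes only from neighbors within max_distance; None if no votes remain."""
    counts = {}
    for label, distance in neighbors:
        if distance <= max_distance:
            counts[label] = counts.get(label, 0) + 1
    if not counts:
        return None  # no close neighbors: treat as not a member of any class
    return max(counts, key=counts.get)

close = vote_with_cutoff([("A", 0.05), ("A", 0.15), ("B", 0.18)])
far = vote_with_cutoff([("A", 0.45), ("B", 0.60), ("A", 0.90)])
```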

Iris (of course)

  • The Iris data set is included with a demo of the

program Pirouette.

  • We’ll be using the Pirouette demo to show how

to conduct KNN and SIMCA classifications.

  • You can download a copy of the demo from

www.infometrix.com.

The demo is fully functional but only with the

data sets that are provided by Infometrix.

The actual software is pretty easy to use but

too expensive for our use in the course.

Iris example

[Figures: Iris - scores by class; Iris - voting results; Iris - class partitions]

What? NOT the Iris data set?

  • Headspace MS of 4 cola classes.

Two cola brands.

  • Diet and regular.
  • m/e 44 - 149.

May need to preprocess to

eliminate any nonvariant data.

Cola example

Class
1 Brand 1
2 Diet brand 1
3 Brand 2
4 Diet brand 2

PCA scores

SIMCA models

Since the number
of components
used can vary,
each class will be
best described
by its own
hypervolume.

SIMCA models

Limitation of a class hypervolume.

You can limit the size of a hypervolume

by setting a standard deviation cutoff.

This results in better defined classes.

[Figure: class hypervolumes drawn with cutoffs of SD = 3 and SD = 2]

SIMCA models

Once a model has been created for each

class, you are ready to classify unknowns.

For each model/sample combination:

+ The sample is transformed into PC space and compared to see if it is a likely class member.

+ If it is within the ‘hypervolume’ of a single

class, you have a match.

SIMCA classification

The potential still exists for a sample to be
classified as a member of more than one class.
It may also not
be a member of
any known class.

SIMCA classification

SIMCA will give you an estimate as to

the probability of class membership.

Example - two possible classes.

           Probability
Class A    0.
Class B    0.

Here, the sample is more likely to be

a member of Class A.

SIMCA summary

Of the methods covered, SIMCA offers the

most options for developing a classification

model when the classes are well known.

It also requires the most development time as

you must determine the optimum model

conditions for each class.

If used, plan on spending quite a bit of time

working with all of the available options.

SIMCA example - Iris.

Of course we’ll look at the iris dataset again.
Note: We have a
separate model for
each class in the data
set - in this case three.

SIMCA example - Iris.

Pirouette will
provide an
estimate as to the
class hypervolumes
based on the first
three PCs.

SIMCA example - Iris.

SIMCA example - Iris.

It appears that
petal length is
the most useful
for classifying.

SIMCA example - Iris.

These plots show the relative
positions of each sample when
projected into any of the three class
models - two classes at a time - with
color coding based on known class.

Mycobacteria SIMCA

Discriminating

Power is a

measure of which

variables show

the biggest ‘class’

differences.

Mycobacteria SIMCA

Example shows that a different

number of components were

used in developing the

individual SIMCA

hypervolumes.

Mycobacteria SIMCA

Modeling power

indicates the relative

importance of each

variable for

classification.

Loadings, as always, show the relative significance of each variable in constructing each PC.

These are relatively unimportant.

Mycobacteria SIMCA

PC plots are pretty boring since you only have one class. However, they can be used to see if you have any ‘sub-classes.’

Outliers are tested for by plotting each sample's residual (difference between sample and center of hypervolume) vs its Mahalanobis distance from the center of the cluster. This distance is similar to a Euclidean distance but takes into account correlations of the data and is scale invariant.
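To make the contrast with Euclidean distance concrete, here is a two-variable sketch of the Mahalanobis distance, with the 2x2 covariance inverse written out by hand. The cluster values and the function name are hypothetical; a real SIMCA package would work with more dimensions.

```python
import math

def mahalanobis_2d(point, samples):
    """Mahalanobis distance of a 2-D point from the center of a sample cluster."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] . inverse(covariance) . [dx dy]^T, written out for 2x2.
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)

# Hypothetical cluster: wide spread along x, tight along y.
cluster = [[-2.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
along_x = mahalanobis_2d([2.0, 0.0], cluster)
along_y = mahalanobis_2d([0.0, 2.0], cluster)
```

Both test points lie at a Euclidean distance of 2 from the center, yet the point along the tight y direction gets the larger Mahalanobis distance, which is why it is the more likely outlier.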

Mycobacteria SIMCA