Chemometric Classification, Lecture Notes
Classification Methods
Up to now we have been concerned with
methods that:
Display complex information.
Detect patterns or trends.
Now we will introduce methods that can
be used to classify samples based on
models that are developed.
Classification problems
Level I
Simple classification into predefined categories.
Level II
Level I + detection of outliers.
Level III
Level II + prediction of an external property.
Level IV
Level II + prediction of more than one property.
Classification Methods
Many methods have been developed with new ones
being published all of the time.
We’ll look at some representative approaches.
Linear Learning Machine
Discriminant Analysis
Classification Trees
K Nearest Neighbor
SIMCA
The available methods and approaches may vary
based on the package used.
Supported by
XLStat
Classification Methods
All of these methods are considered
supervised learning.
Initial assumptions regarding
membership or properties are made
when developing a model.
An initial evaluation of the data using
exploratory data analysis is useful.
Data sets
Needed to develop and evaluate a classification model.
Training set
Representative samples used to build the model.
The modeling software uses the class information.
Evaluation set
Samples of known class, used to test the model.
The modeling software does not know the classes.
Test set
True unknowns.
Data pre-processing
With any of these methods, you may choose
to do some sort of data preprocessing.
Raw
Is fastest.
Scaled
Gives equal weight to the variables.
PCA
Can be used to reduce noise, insignificant
variables.
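A minimal sketch of the 'Scaled' option, assuming it means autoscaling (mean-center each variable, then divide by its standard deviation). The function name `autoscale` and the data values are made up for illustration and do not come from any particular package.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Mean-center each column and divide by its sample standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Two variables on very different scales (made-up values).
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
scaled = autoscale(data)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither variable dominates a distance calculation simply because of its units.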
Data pre-processing
With some data sets, you may also want to
apply some other types of pre-processing.
Example. Spectral or chromatographic traces.
Options may include:
Smoothing, baseline correction, signal
averaging, using the first or second
derivative.
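Two of these options can be sketched in a few lines. This is an illustration only: it assumes a simple moving-average smooth and a successive-difference derivative, and the trace values are made up. Real packages usually offer more refined filters.

```python
def moving_average(signal, window=3):
    """Smooth by averaging each point with its neighbors (window must be odd).

    The ends are dropped, so the result is shorter than the input.
    """
    half = window // 2
    return [sum(signal[i - half:i + half + 1]) / window
            for i in range(half, len(signal) - half)]

def first_derivative(signal):
    """Approximate the first derivative by successive differences."""
    return [b - a for a, b in zip(signal, signal[1:])]

# A made-up trace with a single sharp peak.
trace = [0.0, 1.0, 5.0, 1.0, 0.0]
smoothed = moving_average(trace)
slope = first_derivative(trace)
```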
Creating an evaluation set
The evaluation set is typically a sub-set of
the training set that was omitted when
building a model.
Randomly pick a subset of the data.
Randomly pick members from each class.
Any approach that selectively removes a
portion of the data could cause bias.
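One way to pick members at random from each class is sketched below. The function name `split_by_class`, the fixed seed, and the labels are assumptions made for the example.

```python
import random

def split_by_class(labels, n_per_class, seed=0):
    """Randomly pick n_per_class sample indices from each class for evaluation."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    evaluation = []
    for members in by_class.values():
        evaluation.extend(rng.sample(members, n_per_class))
    evaluation_set = set(evaluation)
    training = [i for i in range(len(labels)) if i not in evaluation_set]
    return training, sorted(evaluation)

# Ten made-up samples, five in each of two classes.
labels = ["A"] * 5 + ["B"] * 5
training, evaluation = split_by_class(labels, n_per_class=2)
```

Drawing the same number from every class keeps the evaluation set from being dominated by one class, while the random draw avoids selectively removing a portion of the data.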
Leave-one-out validation
A standardized approach for
validation of a model where each
sample serves as an evaluation set.
1. Omit a single sample from the set
2. Build the model
3. Test the omitted sample
4. Repeat the above steps until each sample has
been omitted and tested once.
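The four steps above can be sketched as a loop. The `nearest_neighbor` function is only a stand-in for "build the model and test the omitted sample"; any classifier with the same call signature could be used, and the sample values are made up.

```python
import math

def nearest_neighbor(train_x, train_y, sample):
    """Stand-in model: return the class of the closest training sample."""
    distances = [math.dist(x, sample) for x in train_x]
    return train_y[distances.index(min(distances))]

def leave_one_out(samples, labels, classify):
    """Omit each sample once, build on the rest, and test the omitted one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]   # step 1: omit sample i
        train_y = labels[:i] + labels[i + 1:]
        # steps 2 and 3: build the model and test the omitted sample
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)                 # step 4: every sample tested once

samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = leave_one_out(samples, labels, nearest_neighbor)
```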
While ‘Leave-One-Out’ testing is the best approach, it
can be slow for large sets.
Alternate approaches are to leave two or more
samples out with each pass.
Samples should be randomly listed in the matrix.
The same two (or more) samples should never be
omitted together more than once.
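A minimal sketch of leaving several samples out per pass, assuming the samples are shuffled once and then cut into groups. The name `random_groups` and the fixed seed are illustrative.

```python
import random

def random_groups(n_samples, group_size, seed=0):
    """Shuffle sample indices and cut them into groups; omit one group per pass."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i:i + group_size] for i in range(0, n_samples, group_size)]

# Ten samples, two omitted on each of five passes.
groups = random_groups(10, group_size=2)
```

Because each sample lands in exactly one group, no pair of samples is ever omitted together more than once.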
Rule building methods
Methods where a set of rules is created to discriminate
between classes.
Linear learning machine
One or more linear vectors are created to
discriminate between classes.
Discriminant analysis
Linear or quadratic equations are used to separate
classes.
Classification trees
Series of rules are used to sequentially classify.
Linear learning machine
The assumption is
that one or more
vectors can be found
that can be used to
discriminate between
our classes.
This can make use
of our ‘raw’ data or
work in PC space.
PC space would be
better as there would
be noise reduction.
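The notes do not say how the discriminating vector is found. The sketch below assumes the classic error-correction (perceptron-style) update for two classes coded +1 and -1; the function names and the data are made up for illustration.

```python
def train_linear_machine(samples, targets, passes=1000):
    """Find weights so that the sign of (w . x + bias) matches each target (+1 or -1)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(passes):
        errors = 0
        for x, target in zip(samples, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if score * target <= 0:  # misclassified: nudge the vector toward the sample
                weights = [w + target * xi for w, xi in zip(weights, x)]
                bias += target
                errors += 1
        if errors == 0:  # every training sample is on the correct side
            break
    return weights, bias

def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the vector x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1

# Two made-up, linearly separable classes in two variables.
samples = [(1.0, 1.0), (2.0, 1.5), (4.0, 5.0), (5.0, 4.5)]
targets = [-1, -1, 1, 1]
weights, bias = train_linear_machine(samples, targets)
```

This update only settles on a solution when the classes can actually be separated by a straight line (or plane); for overlapping classes it keeps correcting until the pass limit is reached.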
[Figure: iris example. Variable plot (petal width, petal length, sepal width, sepal length) on axes F1 (99.01 %) and F2 (0.99 %), and a plot of the samples labeled by class (1, 2, 3).]
Coffee example
This consisted of 6 types of
coffee - identified based on MS
data.
To avoid collinearity and null
variable problems, PCA scores
were used (first 5 components).
[Figure: coffee samples plotted on F1 (56.19 %) vs. F2 (22.79 %), labeled by class (C, E, K, R, S, U).]
Classification trees
Predicts class membership by sequential
application of rules based on predictor variables.
With DA and LLM, you create a set of math
models that are all applied at once.
With classification trees, the predictor variables
are evaluated as ordinal rules, one at a time.
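To illustrate how a single rule might be chosen, the sketch below searches one predictor variable for the threshold that gives the purest child nodes. It assumes a Gini impurity criterion, as in CART-style trees; the notes do not say which criterion the software uses, and the petal width values here are made up.

```python
def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try each midpoint between sorted values; return the threshold with the purest children."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_impurity = None, float("inf")
    for (low, _), (high, _) in zip(pairs, pairs[1:]):
        if low == high:
            continue
        threshold = (low + high) / 2
        left = [lab for v, lab in pairs if v < threshold]
        right = [lab for v, lab in pairs if v >= threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Made-up petal widths for three species, three samples each.
petal_width = [2, 3, 2, 13, 15, 14, 21, 23, 19]
species = [1, 1, 1, 2, 2, 2, 3, 3, 3]
threshold, impurity = best_split(petal_width, species)
```

A full tree repeats this search on each child node, one variable and one rule at a time, until the nodes are pure enough or too small to split.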
Classification trees
[Figure: example tree that first splits on solid vs. liquid, then on density > 1 or on color (red or green).]
Iris example (yet again!)
XLStat supports the use of
classification and regression trees.
Classification if the Y variable (class)
is qualitative, regression if the Y
variable is quantitative.
The iris example is a classification
example.
Iris example
If petal width is between 1 and 8,
then assign to Species 1.
[Figure: XLStat classification tree for the iris data. Node 1 (150 samples) is split on petal width: the interval [1, 8[ gives Node 2 (50 samples, 100 % purity) and the interval [8, 25[ gives Node 3 (100 samples, 50 % purity). Nodes 4 through 15 continue the splitting using petal width, petal length, sepal width and sepal length.]
Confusion matrix for the estimation sample:
from \ to   CaC   CaR   OhC   OhR   Total   % correct
CaC          17     0     0     0      17      100.0%
CaR           0     9     0     1      10       90.0%
OhC           0     1     7     0       8       87.5%
OhR           0     1     0     5       6       83.3%
Total        17    11     7     6      41       92.7%
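The '% correct' column can be reproduced from the counts. The sketch below uses the counts from the confusion matrix above; the function name `percent_correct` is just for illustration.

```python
classes = ["CaC", "CaR", "OhC", "OhR"]
# Rows are the true class ("from"), columns the assigned class ("to").
confusion = [
    [17, 0, 0, 0],
    [0, 9, 0, 1],
    [0, 1, 7, 0],
    [0, 1, 0, 5],
]

def percent_correct(matrix):
    """Per-class and overall percent correct from a square confusion matrix."""
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in matrix)
    overall = 100.0 * sum(matrix[i][i] for i in range(len(matrix))) / total
    return per_class, overall

per_class, overall = percent_correct(confusion)
```

The diagonal holds the correctly classified samples (38 of 41 here); everything off the diagonal is a misclassification.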
K nearest neighbor classification
A similarity-based classification method.
It attempts to assign categories to unknown
samples based on multivariate proximity to other
samples.
It works best with discrete classification types and
is tolerant of poor data sets.
K - the number of closest neighbors being
compared.
Consider this as the supervised version of HCA.
K nearest neighbor classification
In its simplest form, KNN is conducted by:
First, a training set is collected that
contains examples of each class.
Intersample distances are then
calculated.
$$d_{ab} = \sqrt{\sum_{j=1}^{N} (a_j - b_j)^2}$$
where
N = # of variables or components used.
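A short sketch of the intersample distance calculation, assuming the Euclidean distance between samples a and b summed over the N variables. The sample values are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between samples a and b over all N variables."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def distance_matrix(samples):
    """Square, symmetric matrix of all intersample distances."""
    return [[euclidean(a, b) for b in samples] for a in samples]

samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = distance_matrix(samples)
```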
KNN
The distance matrix is sorted and the
distance of the unknown sample can be
compared to:
1. The K nearest neighbors
2. The nearest class cluster.
Option 2 requires that K = 1.
KNN
When using the distance to a class, you
can use the same link options that were
discussed earlier.
The distance can be based on:
Single link - closest member of class.
Complete link - farthest member of class.
Centroid - center of class cluster.
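The three link options can be compared on a small made-up class. The function name `class_distance` and the coordinates are assumptions for the example.

```python
import math

def class_distance(unknown, members, link="single"):
    """Distance from an unknown sample to a class under one of three link options."""
    if link == "single":      # closest member of the class
        return min(math.dist(unknown, m) for m in members)
    if link == "complete":    # farthest member of the class
        return max(math.dist(unknown, m) for m in members)
    if link == "centroid":    # center of the class cluster
        centroid = [sum(col) / len(members) for col in zip(*members)]
        return math.dist(unknown, centroid)
    raise ValueError("link must be 'single', 'complete' or 'centroid'")

red = [(0.0, 0.0), (2.0, 0.0)]
unknown = (3.0, 0.0)
single = class_distance(unknown, red, "single")
complete = class_distance(unknown, red, "complete")
centroid = class_distance(unknown, red, "centroid")
```

The unknown would be assigned to whichever class gives the smallest distance under the chosen link.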
KNN - single link
In this example, the
unknown is compared
to the 3 closest
known samples.
In this case, the three
closest samples are
all ‘red.’
Single link
K = 3
KNN - centroid link
Centroid link
With this approach,
the distance to the
center of a class
cluster is determined
and compared.
KNN
Ideally, if a test sample falls well within a known class,
its closest neighbors should all be of one class.
Here, all of the
‘blue’ samples
would be closer
to the unknown
than any of the
green.
Mycobacteria - HCA
[Figure: HCA dendrogram of the mycobacteria samples, labeled by class (42, 43, 44, 45, 46, 47, 49); the distance axis runs from 0 to 1000.]
A quick review of ALL of the ways
that it was difficult to get useful
information from this data set.
Mycobacteria - k means
Mycobacteria - PCA
[Figure: PCA scores plot of the mycobacteria samples, labeled by class (42, 43, 44, 45, 46, 47, 49).]
Mycobacteria - DA
Getting out the vote
Example - K = 5
Sample Class Distance
1 A 0.
2 B 0.
3 A 0.
4 B 0.
5 C 0.
Here, A and B tie with two votes each. The tie-breaker
is that A averages a smaller distance, so A
is made the winner.
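The voting rule with its tie-breaker can be sketched as follows. The distances are made up for illustration, and the function name `knn_vote` is an assumption.

```python
from collections import defaultdict

def knn_vote(neighbors):
    """Pick a class from (class, distance) pairs for the K closest training samples."""
    by_class = defaultdict(list)
    for label, distance in neighbors:
        by_class[label].append(distance)
    # Most votes wins; among tied classes the smaller mean distance wins.
    return min(by_class,
               key=lambda c: (-len(by_class[c]), sum(by_class[c]) / len(by_class[c])))

# K = 5 with hypothetical distances: A and B each get two votes.
neighbors = [("A", 0.10), ("B", 0.15), ("A", 0.20), ("B", 0.25), ("C", 0.30)]
winner = knn_vote(neighbors)
```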
KNN validation
The optimum number for K can be found
by trial and error, but for a close match it
should make no difference.
The classifying power of your data can be
evaluated by leave-one-out validation of
your training set.
This should be done before any sort of
real classification begins.
KNN validation
Validation
You can sequentially leave out each of your
samples and test it for ‘votes’ at several K
values.
You end up with a vote matrix that will tell you
the optimum K value for each class.
You will also get a misclassification matrix;
this tells you how often one of your knowns
is incorrectly classified.
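A simplified sketch of this validation: each known is left out in turn and classified at several K values, and the misclassifications are counted per K. A full vote matrix would also keep the votes per class; here only the error counts are kept. The function names and data are made up.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, sample, k):
    """Majority class among the k training samples closest to sample."""
    nearest = sorted(zip((math.dist(x, sample) for x in train_x), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def misclassifications_by_k(samples, labels, k_values):
    """Leave each sample out once and count the wrong assignments at each K."""
    result = {}
    for k in k_values:
        missed = 0
        for i in range(len(samples)):
            train_x = samples[:i] + samples[i + 1:]
            train_y = labels[:i] + labels[i + 1:]
            if knn_predict(train_x, train_y, samples[i], k) != labels[i]:
                missed += 1
        result[k] = missed
    return result

samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["A", "A", "A", "B", "B", "B"]
errors = misclassifications_by_k(samples, labels, k_values=[1, 3, 5])
```

With only three samples per class, K = 5 forces neighbors from the other class into the vote, which is why the error count jumps for the largest K in this toy set.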
K nearest neighbor classification
So KNN will always assign a class.
What if you have a material that is not a member of
an existing class?
One option is to set a maximum distance.
Example
If your intraclass distances run about 0.2 for all of
your classes, you might want to omit votes with
distances that exceed 0.2.
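A minimal sketch of a maximum-distance rule, assuming that a sample with no votes left after the cutoff is reported as belonging to no known class. The function name `vote_with_cutoff` and the distances are illustrative.

```python
def vote_with_cutoff(neighbors, max_distance):
    """Majority vote over (class, distance) pairs, ignoring neighbors beyond max_distance.

    Returns None when no neighbor is close enough, i.e. no known class fits.
    Ties are broken alphabetically here, purely to keep the sketch deterministic.
    """
    kept = [label for label, distance in neighbors if distance <= max_distance]
    if not kept:
        return None
    return max(sorted(set(kept)), key=kept.count)

close_match = vote_with_cutoff([("A", 0.10), ("A", 0.15), ("B", 0.40)], max_distance=0.2)
no_match = vote_with_cutoff([("A", 0.50), ("B", 0.60)], max_distance=0.2)
```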
Iris (of course)
- The Iris data set is included with a demo of the
program Pirouette.
- We’ll be using the Pirouette demo to show how
to conduct KNN and SIMCA classifications.
- You can download a copy of the demo from
www.infometrix.com.
The demo is fully functional but only with the
data sets that are provided by Infometrix.
The actual software is pretty easy to use but
too expensive for our use in the course.
Iris example
[Figures: Iris - scores by class; Iris - voting results; Iris - class partitions.]
What? NOT the Iris data set?
- Headspace MS of 4 cola classes.
Two cola brands.
- Diet and regular.
- m/e 44 - 149.
May need to preprocess to
eliminate any nonvariant data.
Cola example
Class
1 Brand 1
2 Diet brand 1
3 Brand 2
4 Diet brand 2
PCA scores
SIMCA models
Since the number
of components
used can vary,
each class will be
best described
by its own
hypervolume.
SIMCA models
Limitation of a class hypervolume.
You can limit the size of a hypervolume
by setting a standard deviation cutoff.
This results in better defined classes.
SD = 3 SD = 2
SIMCA models
Once a model has been created for each
class, you are ready to classify unknowns.
For each model/sample combination:
+ The sample is transformed into PC space
and compared to see if it is a likely class
member.
+ If it is within the ‘hypervolume’ of a single
class, you have a match.
SIMCA classification
The potential still exists for a sample to be
classified as a member of more than one class.
It may also not
be a member of
any known class.
SIMCA classification
SIMCA will give you an estimate as to
the probability of class membership.
Example - two possible classes.
          Probability
Class A   0.
Class B   0.
Here, the sample is more likely to be
a member of Class A.
SIMCA summary
Of the methods covered, SIMCA offers the
most options for developing a classification
model when the classes are well known.
It also requires the most development time as
you must determine the optimum model
conditions for each class.
If used, plan on spending quite a bit of time
working with all of the available options.
SIMCA example - Iris.
Of course we’ll look at the iris dataset again.
Note: We have a
separate model for
each class in the data
set - in this case three.
SIMCA example - Iris.
Pirouette will
provide an
estimate as to the
class hypervolumes
based on the first
three PCs.
SIMCA example - Iris.
SIMCA example - Iris.
It appears that
petal length is
the most useful
for classifying.
SIMCA example - Iris.
These plots show the relative
positions of each sample when
projected into any of the three class
models - two classes at a time - with
color coding based on known class.
Mycobacteria SIMCA
Discriminating
Power is a
measure of which
variables show
the biggest ‘class’
differences.
Mycobacteria SIMCA
Example shows that a different
number of components were
used in developing the
individual SIMCA
hypervolumes.
Mycobacteria SIMCA
Modeling power
indicates the relative
importance of each
variable for
classification.
Loadings, as always, show the
relative significance of each
variable in constructing each PC.
They are relatively unimportant.
Mycobacteria SIMCA
PC plots are pretty boring since you only have one class. However, they can
be used to see if you have any ‘sub-classes.’
Outliers are tested for by plotting each sample's residual (difference between
the sample and the center of the hypervolume) vs. its Mahalanobis distance from the
center of the cluster. The Mahalanobis distance is similar to a Euclidean distance
but takes into account correlations in the data and is scale invariant.
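A two-variable illustration of the Mahalanobis distance, written out by hand so that no matrix library is needed. This is a sketch of the general definition, not of Pirouette's calculation, and the cluster values are made up.

```python
import math

def mahalanobis_2d(point, samples):
    """Mahalanobis distance from a 2-variable point to the center of a sample cluster."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Sample variances and covariance of the cluster.
    s_xx = sum((x - mean_x) ** 2 for x, _ in samples) / (n - 1)
    s_yy = sum((y - mean_y) ** 2 for _, y in samples) / (n - 1)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / (n - 1)
    det = s_xx * s_yy - s_xy ** 2
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out by hand.
    d_squared = (s_yy * dx * dx - 2 * s_xy * dx * dy + s_xx * dy * dy) / det
    return math.sqrt(d_squared)

# A strongly correlated cluster centered at (2, 2).
cluster = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.2), (3.0, 2.8), (4.0, 4.0)]
along_trend = mahalanobis_2d((3.0, 3.0), cluster)
off_trend = mahalanobis_2d((3.0, 1.0), cluster)
```

Both test points are the same Euclidean distance from the center, but the one that lies off the correlation trend gets a much larger Mahalanobis distance, which is what makes it useful for flagging outliers.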
Mycobacteria SIMCA