Chemometric Classification, Lecture Notes- Physics, Study notes of Advanced Physics

University of Liverpool Advanced Physics

Classification methods, Data preprocessing, leave-one-out validation, rule building method, Discriminant analysis, Coffee example, Iris example, K nearest neighbour classification, KNN, K-means

Typology: Study notes

2010/2011

Uploaded on 09/10/2011

gerrard_11 🇬🇧


Classification Methods

Up to now we have been concerned with

methods that:

Display complex information.

Detect patterns or trends.

Now we will introduce methods that can

be used to classify samples based on

models that are developed.

Classification problems

Level I

Simple classification into predefined categories.

Level II

Level I + detection of outliers,

Level III

Level II + prediction of an external property.

Level IV

Level II + prediction of more than one property.

Classification Methods

Many methods have been developed with new ones

being published all of the time.

We’ll look at some representative approaches.

Linear Learning Machine

Discriminant Analysis

Classification Trees

K Nearest Neighbor

SIMCA

The available methods and approaches may vary

based on the package used.

Supported by

XLStat

Classification Methods

All of these methods are considered

supervised learning.

Initial assumptions regarding

membership or properties are made

when developing a model.

An initial evaluation of the data using

exploratory data analysis is useful.

Data sets

Needed to develop and evaluate a classification model.

Training set

Representative samples used to build the model.

The modeling software uses the class information.

Evaluation set

Samples of known class, used to test the model.

The modeling software does not know the classes.

Test set

True unknowns.

Data pre-processing

With any of these methods, you may choose

to do some sort of data preprocessing.

Raw

Is fastest.

Scaled

Gives equal weight to the variables.

PCA

Can be used to reduce noise and remove insignificant

variables.
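As a rough illustration of the "Scaled" option, the sketch below autoscales each variable (mean-centers it and divides by its standard deviation) so that every variable carries equal weight. The function name `autoscale` and the small data set are assumptions made for this example; the notes do not say which scaling the software applies.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Mean-center each column and divide by its sample standard deviation.

    Assumes every column has non-zero variance; a constant column
    would cause a division by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical data: two variables on very different scales.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
scaled = autoscale(raw)
```

After scaling, both columns have a mean of zero and a standard deviation of one, so the second variable no longer dominates simply because its numbers are larger.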

Data pre-processing

With some data sets, you may also want to do

some other types of pre-processing.

Example. Spectral or chromatographic traces.

Options may include:

Smoothing, baseline correction, signal

averaging, using the first or second

derivative.
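Two of these options can be sketched in a few lines. The example assumes a trace stored as a plain list of intensities; the moving-average window and the simple difference used for the derivative are illustrative choices, not the specific filters used by any package mentioned in these notes.

```python
def moving_average(signal, window=3):
    """Smooth a trace with a centered moving average (use an odd window)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

def first_derivative(signal):
    """Approximate the first derivative by differences of adjacent points."""
    return [b - a for a, b in zip(signal, signal[1:])]

# Hypothetical noisy trace.
trace = [0.0, 3.0, 0.0, 3.0, 0.0]
smooth_trace = moving_average(trace)
```

Note that the derivative of a trace is unchanged when a constant offset is added to every point, which is one reason derivatives are used to suppress baseline shifts.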

Creating an evaluation set

The evaluation set is typically a sub-set of

the training set that was omitted when

building a model.

Randomly pick a subset of the data.

Randomly pick members from each class.

Any approach that selectively removes a

portion of the data could cause bias.
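One way to pick members at random from each class is sketched below. The function name `split_by_class`, the 25 % hold-out fraction and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_class(labels, fraction=0.25, seed=0):
    """Randomly hold out about `fraction` of the samples from each class.

    Returns (training_indices, evaluation_indices). At least one sample
    is held out per class, so very small classes need care.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    evaluation = []
    for members in by_class.values():
        n_out = max(1, round(len(members) * fraction))
        evaluation.extend(rng.sample(members, n_out))
    held_out = set(evaluation)
    training = [i for i in range(len(labels)) if i not in held_out]
    return training, sorted(evaluation)

# Hypothetical class labels: 8 samples of class A, 4 of class B.
labels = ["A"] * 8 + ["B"] * 4
training, evaluation = split_by_class(labels)
```

Sampling within each class keeps the class proportions of the evaluation set close to those of the full data set.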

Leave-one-out validation

A standardized approach for

validation of a model where each

sample serves as an evaluation set.

1. Omit a single sample from the set

2. Build the model

3. Test the omitted sample

4. Repeat the above steps until each sample has

been omitted and tested once.
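The four steps above can be sketched as a simple loop. A nearest-neighbor rule stands in for "the model" here purely as an assumption for illustration; any classifier with the same call signature could be passed in instead.

```python
import math

def nearest_neighbor(train_x, train_y, sample):
    """Stand-in model: return the class of the closest training sample."""
    distances = [math.dist(x, sample) for x in train_x]
    return train_y[distances.index(min(distances))]

def leave_one_out(samples, labels, classify=nearest_neighbor):
    """Fraction of samples classified correctly when each is omitted once."""
    correct = 0
    for i in range(len(samples)):
        # Steps 1 and 2: omit sample i and build the model on the rest.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Step 3: test the omitted sample.
        if classify(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical data: two well-separated classes.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = leave_one_out(samples, labels)
```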

Your data

While ‘Leave-One-Out’ testing is the best approach, it

can be slow for large sets.

Alternate approaches are to leave two or more

samples out with each pass.

Samples should be randomly listed in the matrix.

The same two (or more) samples should never be

omitted together more than once.

Rule building methods

Methods where a set of rules is created to discriminate

between classes.

Linear learning machine

One or more linear vectors are created to

discriminate between classes.

Discriminant analysis

Linear or quadratic equations are used to separate

classes.

Classification trees

Series of rules are used to sequentially classify.

Linear learning machine

The assumption is

that one or more

vectors can be found

that can be used to

discriminate between

our classes.

This can make use

of our ‘raw’ data or

work in PC space.

PC space would be

better as there would

be noise reduction.
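The notes do not say how the discriminating vector is found. A minimal sketch, assuming a perceptron-style error-correction update and two classes coded as +1 and -1, might look like this; the function names and data are hypothetical.

```python
def train_llm(samples, labels, passes=100):
    """Search for a weight vector (last element is the bias) such that
    the sign of w . x matches the +1/-1 label of each training sample."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(passes):
        errors = 0
        for x, y in zip(samples, labels):
            augmented = list(x) + [1.0]
            score = sum(w * v for w, v in zip(weights, augmented))
            if y * score <= 0:  # sample is on the wrong side of the boundary
                weights = [w + y * v for w, v in zip(weights, augmented)]
                errors += 1
        if errors == 0:  # every training sample is separated
            break
    return weights

def predict(weights, x):
    score = sum(w * v for w, v in zip(weights, list(x) + [1.0]))
    return 1 if score > 0 else -1

# Hypothetical, linearly separable two-class data.
samples = [(1.0, 1.0), (2.0, 1.5), (6.0, 5.0), (7.0, 6.5)]
labels = [-1, -1, 1, 1]
weights = train_llm(samples, labels)
```

The loop only stops early when the classes are linearly separable; otherwise it simply runs for the given number of passes. The same code works on PC scores in place of the raw variables.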

[Figure: iris samples (classes 1, 2 and 3) plotted on F1 (99.01 %) against F2 (0.99 %), with the variables Petal width, Petal length, Sepal Width and Sepal Length.]

Coffee example

This consisted of 6 types of

coffee - identified based on MS

data.

To avoid collinearity and null

variable problems, PCA scores

were used (first 5 components).

[Figure: coffee samples (classes C, E, K, R, S and U) plotted on F1 (56.19 %) against F2 (22.79 %).]

Classification trees

Predicts class membership by sequential

application of rules based on predictor variables.

With DA and LLM, you create a set of math

models that are all applied at once.

With classification trees, the predictor variables

are evaluated as ordinal rules, one at a time.

Classification trees

[Figure: example tree with sequential rules such as "Solid - liquid", "Density > 1" and "Red or green".]

Iris example (yet again!)

XLStat supports the use of

classification and regression trees.

Classification if the Y variable (class)

is qualitative, regression if the Y

variable is quantitative.

The iris example is a classification

example.

Iris example

If Petal width is between 1 and 8,

then assign to Species 1.
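Applying rules one at a time can be written as an ordered chain of tests. In the sketch below only the first rule comes from the notes; the 16.5 and 25 boundaries are assumptions loosely based on node intervals in the XLStat output, and the function name is hypothetical. A real tree would also split on the other variables.

```python
def classify_iris(petal_width):
    """Apply ordered rules one at a time; the first rule that matches wins.

    `petal_width` is in the same units as the tree intervals in the notes.
    """
    if 1 <= petal_width < 8:       # rule stated in the notes
        return 1
    if 8 <= petal_width < 16.5:    # assumed second interval
        return 2
    if 16.5 <= petal_width < 25:   # assumed third interval
        return 3
    return None  # outside the range covered by the rules

species = classify_iris(2)
```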

[Figure: XLStat classification tree for the iris data (node 1: 150 samples, 50 per species; 15 nodes in total), with splits on Petal width, Petal length, Sepal Width and Sepal Length. Node 2 (Petal width in [1, 8[) holds 50 samples at 100 % purity; node 3 (Petal width in [8, 25[) holds the remaining 100 samples at 50 % purity.]

Confusion matrix for the estimation sample:

from \ to   CaC   CaR   OhC   OhR   Total   % correct
CaC          17     0     0     0      17      100.0%
CaR           0     9     0     1      10       90.0%
OhC           0     1     7     0       8       87.5%
OhR           0     1     0     5       6       83.3%
Total        17    11     7     6      41       92.7%
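The "% correct" column can be reproduced from the counts. The sketch below uses the matrix values given above; the function name `percent_correct` is an assumption for the example.

```python
classes = ["CaC", "CaR", "OhC", "OhR"]

# Rows are the true class ("from"), columns the assigned class ("to").
confusion = [
    [17, 0, 0, 0],
    [0, 9, 0, 1],
    [0, 1, 7, 0],
    [0, 1, 0, 5],
]

def percent_correct(matrix):
    """Return (per-class % correct, overall % correct)."""
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]
    total = sum(sum(row) for row in matrix)
    on_diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    return per_class, 100.0 * on_diagonal / total

per_class, overall = percent_correct(confusion)
```

The diagonal holds the correctly classified samples, so the overall figure is 38 of 41 samples.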

K nearest neighbor classification

A similarity-based classification method.

It attempts to assign categories to unknown

samples based on multivariate proximity to other

samples.

It works best with discrete classification types and

is tolerant of poor data sets.

K - the number of closest neighbors being

compared.

Consider this as the supervised version of HCA.

K nearest neighbor classification

In its simplest form, KNN is conducted by:

First, a training set is collected that

contains examples of each class.

Intersample distances are then

calculated:

$$d_{ab} = \sqrt{\sum_{j=1}^{N} (a_j - b_j)^2}$$

where

N = # of variables or components used.
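A minimal sketch of the intersample distance calculation, assuming the usual Euclidean distance over the N variables and samples stored as tuples. The function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between samples a and b over their N variables."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def distance_matrix(samples):
    """Symmetric matrix of all intersample distances."""
    return [[euclidean(a, b) for b in samples] for a in samples]

# Hypothetical samples with N = 2 variables.
samples = [(0, 0), (3, 4), (6, 8)]
distances = distance_matrix(samples)
```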

KNN

The distance matrix is sorted and the

distance of the unknown sample can be

compared to:

1. The K nearest neighbors

2. The nearest class cluster.

Option 2 requires that K = 1.

KNN

When using the distance to a class, you

can use the same link options that were

discussed earlier.

The distance can be based on:

Single link - closest member of class.

Complete link - farthest member of class.

Centroid - center of class cluster.
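The three link options differ only in which point of the class the unknown is compared with. A small sketch, with a hypothetical function name and data:

```python
import math

def class_distance(unknown, members, link="single"):
    """Distance from an unknown sample to a class under a given link option."""
    if link == "single":      # closest member of the class
        return min(math.dist(unknown, m) for m in members)
    if link == "complete":    # farthest member of the class
        return max(math.dist(unknown, m) for m in members)
    if link == "centroid":    # center of the class cluster
        centroid = [sum(col) / len(col) for col in zip(*members)]
        return math.dist(unknown, centroid)
    raise ValueError(f"unknown link option: {link}")

# Hypothetical class with two members and one unknown sample.
members = [(0.0, 0.0), (2.0, 0.0)]
unknown = (3.0, 0.0)
```

The unknown would then be assigned to the class giving the smallest distance under the chosen link.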

KNN - single link

In this example, the

unknown is compared

to the 3 closest

known samples.

In this case, the three

closest samples are

all ‘red.’

Single link

K = 3

KNN - centroid link

Centroid link

With this approach,

the distance to the

center of a class

cluster is determined

and compared.

KNN

Ideally, if a test sample falls well within a known class,

its closest neighbors should all be of one class.

Here, all of the

‘blue’ samples

would be closer

to the unknown

than any of the

green.

Mycobacteria - HCA

[Figure: HCA dendrogram of the Mycobacteria samples (classes 42, 43, 44, 45, 46, 47 and 49), with a distance axis running from 0 to 1000.]

A quick review of ALL of the ways

that this data set was difficult to get

useful information from.

Mycobacteria - k means

Mycobacteria - PCA

[Figure: PCA scores plot of the Mycobacteria samples, with classes 42, 43, 44, 45, 46, 47 and 49.]

Mycobacteria - DA

Getting out the vote

Example - K = 5

Sample Class Distance

1 A 0.

2 B 0.

3 A 0.

4 B 0.

5 C 0.

Here, A and B would tie. The tie-breaker would

be that A averages a smaller distance so

would be made the winner.
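The vote and its tie-breaker can be sketched as follows. The distances in the table above are truncated in these notes, so the data here are hypothetical values chosen only to reproduce an A/B tie at K = 5; the function name is also an assumption.

```python
import math
from collections import defaultdict

def knn_vote(unknown, samples, labels, k=5):
    """Majority vote among the k nearest samples.

    A tie in votes goes to the class with the smaller average distance.
    """
    neighbors = sorted(
        (math.dist(unknown, x), label) for x, label in zip(samples, labels)
    )[:k]
    by_class = defaultdict(list)
    for distance, label in neighbors:
        by_class[label].append(distance)
    return min(
        by_class,
        key=lambda c: (-len(by_class[c]), sum(by_class[c]) / len(by_class[c])),
    )

# Hypothetical one-variable data: A and B each get two of the five votes.
samples = [(0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (9.0,)]
labels = ["A", "B", "A", "B", "C", "C"]
winner = knn_vote((0.0,), samples, labels, k=5)
```

Here A averages a distance of about 0.2 and B about 0.3, so A wins the tie.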

KNN validation

The optimum number for K can be found

by trial and error but for a close match, it

should make no difference.

The classifying power of your data can be

evaluated by leave one out validation of

your training set.

This should be done before any sort of

real classification begins.

KNN validation

Validation

You can sequentially leave out each of your

samples and test it for ‘votes’ at several K

values.

You end up with a vote matrix that will tell you

the optimum K value for each class.

You will also get a misclassification matrix -

this tells you how often one of your knowns

is incorrectly classified.
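A sketch of this validation, assuming a plain majority vote and hypothetical data, is shown below. It counts (true class, assigned class) pairs with each sample left out once, and can be run at several K values.

```python
import math
from collections import Counter

def knn_predict(unknown, samples, labels, k):
    """Majority vote of the k nearest samples (ties go to the nearer class)."""
    nearest = sorted(zip((math.dist(unknown, x) for x in samples), labels))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_misclassification(samples, labels, k):
    """Count (true class, assigned class) pairs, leaving each sample out once."""
    counts = Counter()
    for i, (x, true_class) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        counts[(true_class, knn_predict(x, rest_x, rest_y, k))] += 1
    return counts

# Hypothetical data: two classes with three samples each.
samples = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
results = {k: loo_misclassification(samples, labels, k) for k in (1, 3, 5)}
```

With only three samples per class, K = 5 fails for every sample: once a sample is left out, its own class can supply at most two of the five votes. This is the kind of effect the vote matrix is meant to expose.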

K nearest neighbor classification

So KNN will always assign a class.

What if you have a material that is not a member of

an existing class?

One option is to set a maximum distance.

Example

Your intraclass distances run about 0.2 for all of

your classes, you might want to omit votes with

distances that exceed 0.2.
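A minimal sketch of the maximum-distance option, using the 0.2 cutoff from the example above. The function name and data are hypothetical, and returning `None` for "no class" is a design choice made for this illustration.

```python
import math
from collections import Counter

def knn_with_cutoff(unknown, samples, labels, k=3, max_distance=0.2):
    """Vote among the k nearest samples, ignoring votes beyond max_distance.

    Returns None when no neighbor is close enough to vote.
    """
    nearest = sorted(zip((math.dist(unknown, x) for x in samples), labels))[:k]
    votes = Counter(label for d, label in nearest if d <= max_distance)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical training samples.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B"]
close = knn_with_cutoff((0.05, 0.05), samples, labels)
far = knn_with_cutoff((5.0, 5.0), samples, labels)
```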

Iris (of course)

The Iris data set is included with a demo of the

program Pirouette.

We’ll be using the Pirouette demo to show how

to conduct KNN and SIMCA classifications.

You can download a copy of the demo from

www.infometrix.com.

The demo is fully functional but only with the

data sets that are provided by Infometrix.

The actual software is pretty easy to use but

too expensive for our use in the course.

Iris example

Iris - scores by class Iris - voting results

Iris - class partitions

What? NOT the Iris data set?

Headspace MS of 4 cola classes.

Two cola brands.

Diet and regular.
m/e 44 - 149.

May need to preprocess to

eliminate any nonvariant data.

Cola example

Class

1 Brand 1

2 Diet brand 1

3 Brand 2

4 Diet brand 2

PCA scores

SIMCA models

Since the number

of components

used can vary,

each class will be

best described

by its own

hypervolume.

SIMCA models

Limitation of a class hypervolume.

You can limit the size of a hypervolume

by setting a standard deviation cutoff.

This results in better defined classes.

SD = 3 SD = 2

SIMCA models

Once a model has been created for each

class, you are ready to classify unknowns.

For each model/sample combination:

+ The sample is transformed into PC space

and compared to see if it is a likely class

member.

+ If it is within the ‘hypervolume’ of a single

class, you have a match.

SIMCA classification

The potential still exists for a sample to be

classified as a member of more than one class.

It may also not

be a member of

any known class

SIMCA classification

SIMCA will give you an estimate as to

the probability of class membership.

Example - two possible classes.

Class      Probability
Class A    0.
Class B    0.

Here, the sample is more likely to be

a member of Class A.

SIMCA summary

Of the methods covered, SIMCA offers the

most options for developing a classification

model when the classes are well known.

It also requires the most development time as

you must determine the optimum model

conditions for each class.

If used, plan on spending quite a bit of time

working with all of the available options.

SIMCA example - Iris.

Of course we’ll look at the iris dataset again.

Note: We have a

separate model for

each class in the data

set - in this case three.

SIMCA example - Iris.

Pirouette will

provide an

estimate as to the

class hypervolumes

based on the first

three PCs.

SIMCA example - Iris.

It appears that

petal length is

the most useful

for classifying.

SIMCA example - Iris.

These plots show the relative

positions of each sample when

projected into any of the three class

models - two classes at a time - with

color coding based on known class.

Mycobacteria SIMCA

Discriminating

Power is a

measure of which

variables show

the biggest ‘class’

differences.

Mycobacteria SIMCA

Example shows that a different

number of components were

used in developing the

individual SIMCA

hypervolumes.

Mycobacteria SIMCA

Modeling power

indicates the relative

importance of each

variable for

classification.

Loadings, as always, show the

relative significance of each

variable in constructing each PC.

These are relatively unimportant.

Mycobacteria SIMCA

PC plots are pretty boring since you only have one class. However, they can

be used to see if you have any ‘sub-classes.’

Outliers are tested for by plotting sample residuals (difference between

sample and center of hypervolume) vs its Mahalanobis distance from the

center of the cluster - similar to a Euclidean distance but takes into account

correlations of the data and is scale invariant.
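To make the scale invariance concrete, here is a sketch of the Mahalanobis distance for the two-variable case only, with the 2 x 2 covariance matrix inverted by hand. The function name and data are assumptions; real software works in as many dimensions as the model has components.

```python
from statistics import mean

def mahalanobis_2d(point, cluster):
    """Mahalanobis distance from `point` to the center of a 2-variable cluster."""
    xs, ys = zip(*cluster)
    mx, my = mean(xs), mean(ys)
    n = len(cluster)
    # Sample variances and covariance of the cluster.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in cluster) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] S^-1 [dx dy]^T with the 2 x 2 inverse written out.
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d_squared ** 0.5

# Hypothetical cluster that is twice as spread out in y as in x.
cluster = [(-1.0, 0.0), (1.0, 0.0), (0.0, -2.0), (0.0, 2.0)]
d_x = mahalanobis_2d((1.0, 0.0), cluster)
d_y = mahalanobis_2d((0.0, 2.0), cluster)
```

The two points lie at Euclidean distances of 1 and 2 from the center, yet their Mahalanobis distances are equal because each sits at the same position relative to the spread of its variable.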
