Slides on Genomic Data-Mining Method Classification | BIOE 594, Study notes of Biology

Material Type: Notes; Class: Advanced Special Topics in Bioengineering; Subject: Bioengineering; University: University of Illinois - Chicago; Term: Spring 2003;

Typology: Study notes


Uploaded on 07/23/2009

BioE 594 Computational Functional Genomics
Lecture #16
Genomic data-mining method 4 - classification
Yang Dai
UIC

Prof. Yang Dai, BioE 594 Computational Functional Genomics

Classification tasks for microarray analysis

- Classification of SAMPLES
  - Generate gene expression profiles that can
    - discriminate between different known cell types or conditions, e.g. between tumor and normal tissue;
    - identify different and previously unknown cell types or conditions, e.g. new subclasses of an existing class of tumors.
- Classification of GENES
  - Assign an unknown cDNA sequence to one of a set of known gene classes.
  - Partition a set of genes into new (unknown) functional classes on the basis of their expression patterns across a number of samples.


Classification - a two-step process

- Model construction: describing a set of predetermined classes based on a training set. It is also called learning.
  - Each tuple/sample is assumed to belong to a predefined class.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: for classifying future test data/objects.
  - Estimate the accuracy of the model.
    - The known label of each test example is compared with the classified result from the model.
    - Accuracy rate is the percentage of test cases that are correctly classified by the model.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
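The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-attribute rule learner, the function names `build_model`, `classify` and `accuracy`, and the toy expression levels are all hypothetical and do not come from the lecture.

```python
from collections import Counter

def build_model(training_set):
    """Step 1 (model construction): learn one rule per attribute value.

    training_set is a list of (attribute_value, class_label) pairs.
    The model is a dict of rules (value -> most frequent class) plus a
    default class for values never seen during training.
    """
    by_value = {}
    for value, label in training_set:
        by_value.setdefault(value, Counter())[label] += 1
    rules = {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}
    default = Counter(label for _, label in training_set).most_common(1)[0][0]
    return rules, default

def classify(model, value):
    """Step 2 (model usage): apply the learned rules to one example."""
    rules, default = model
    return rules.get(value, default)

def accuracy(model, test_set):
    """Fraction of test cases whose known label matches the model's output."""
    correct = sum(1 for value, label in test_set if classify(model, value) == label)
    return correct / len(test_set)

# Hypothetical toy data: expression level of one gene -> tissue type.
training = [("high", "tumor"), ("high", "tumor"), ("high", "normal"),
            ("low", "normal"), ("low", "normal")]
test = [("high", "tumor"), ("low", "normal"), ("low", "tumor"), ("high", "tumor")]

model = build_model(training)
acc = accuracy(model, test)
print(acc)
```

Only if `acc` is judged acceptable would `classify` then be applied to tuples whose labels are unknown.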

Evaluating Classification Methods

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Compactness of the model: size of the tree, or the number of rules


Measure for the quality of a particular prediction


Performance assessment

- Resubstitution estimation
  - Error rate on the learning set.
  - Can be severely biased.
- Test set estimation
  - Points in the learning set $L$ are divided into two sets, $L_1$ and $L_2$; the classifier is built using $L_1$ and the error rate is computed for $L_2$.
  - Must ensure that $L_1$ and $L_2$ are from the same population.
  - 2/3 and 1/3 repeated random sampling. This procedure reduces the effective sample size.


Performance assessment

- Out-of-bag estimation
  - The use of cross-validation (or any other process) is intended to provide accurate estimates of the classification error rate. In addition, it is also a procedure for model selection, i.e., to determine parameters involved in the model.
  - These estimates relate only to the experiment that was (cross-)validated.
  - There is a common practice in this area of doing feature selection using all of the data and then using cross-validation only on the model building and classification portion. Because the selected features have then already seen the held-out labels, the resulting error estimate is optimistically biased.
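A minimal Python sketch of cross-validation in which feature selection is repeated inside every fold, so that the estimate relates to the whole experiment. The single-feature selection rule, the nearest-class-mean model, the function names and the simulated noise data are all assumptions made for illustration, not the lecture's procedure.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def class_means(rows, labels, j):
    """Mean of feature j within each class."""
    means = {}
    for c in set(labels):
        values = [r[j] for r, y in zip(rows, labels) if y == c]
        means[c] = sum(values) / len(values)
    return means

def train(rows, labels):
    """Feature selection plus model building, using ONLY the rows passed in."""
    def gap(j):
        m = sorted(class_means(rows, labels, j).values())
        return m[-1] - m[0]
    best = max(range(len(rows[0])), key=gap)   # feature selection step
    means = class_means(rows, labels, best)    # model: one mean per class
    return lambda row: min(means, key=lambda c: abs(row[best] - means[c]))

def cross_validated_error(rows, labels, k=5, seed=0):
    wrong = 0
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        keep = [i for i in range(len(rows)) if i not in held_out]
        # Selection and fitting never see the held-out fold.
        predict = train([rows[i] for i in keep], [labels[i] for i in keep])
        wrong += sum(1 for i in fold if predict(rows[i]) != labels[i])
    return wrong / len(rows)

# Simulated data: 40 samples, 50 pure-noise features, labels unrelated to the features.
rng = random.Random(0)
rows = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(40)]
labels = ["A", "B"] * 20

cv_error = cross_validated_error(rows, labels)
print(cv_error)
```

With labels that carry no signal, an honest estimate is expected to sit near chance. Moving the `max(..., key=gap)` selection outside the loop and running it once on all rows would reproduce the common practice described above.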


Performance assessment - ROC

- An ROC (Receiver Operating Characteristic) curve is a plot of sensitivity (true positive rate) versus one minus specificity (false positive rate). It is used for visualizing, organizing and selecting classifiers based on their performance.
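As a sketch of how the points of such a curve are obtained, the Python below sweeps a cut-off over classifier scores and records one (false positive rate, true positive rate) pair per cut-off. The function name `roc_points`, the convention that a score at or above the cut-off is called positive, and the six toy scores are assumptions made for illustration.

```python
def roc_points(scores, labels, positive=1):
    """Return (false positive rate, true positive rate) pairs, one per cut-off.

    A sample is called positive when its score is >= the cut-off.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical scores for three positive (1) and three negative (0) samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting the second coordinate against the first gives the curve; a classifier that separates the classes well climbs toward the top-left corner before moving right.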


Performance assessment - ROC (a toy example)

Table 2 can be further transformed to Table 3.

[Tables 2 and 3 and the corresponding ROC curve are not reproduced here. Surviving column headings: cut-off measure, specificity, sensitivity; cut-off measure, false positive rate, true positive rate.]


Performance assessment - ROC examples

[Figure panels not reproduced; their captions:]

- Random performance
- Good separation between classes, convex curve
- Reasonable separation, mostly convex
- Fairly poor separation, mostly convex
- Poor separation, large and small concavities


Centroid classification

- Algorithm
  - For each class, calculate the center of mass of the representative points in the class.
  - Calculate the distance between the position of the sample to be classified and each of the centers of mass of the classes, using an appropriate distance measure.
  - Assign the sample to the class whose center of mass is nearest to it.
- Centroid classification extends very naturally to more than two classes.
- It is fast and uses all data. However, it can give completely incorrect results if the data are not linearly separable.
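The three steps of the algorithm can be sketched in Python as follows. Euclidean distance stands in for the "appropriate distance measure", and the function names and the two-gene toy profiles are hypothetical, chosen only to show the method working with three classes.

```python
import math

def centroids(samples, labels):
    """Step 1: center of mass of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def euclidean(a, b):
    """One possible distance measure; others can be passed to classify()."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def classify(sample, cents, distance=euclidean):
    """Steps 2 and 3: measure the distance to every centroid, pick the nearest."""
    return min(cents, key=lambda y: distance(sample, cents[y]))

# Hypothetical two-gene expression profiles for three classes.
samples = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [1.0, 9.0], [2.0, 8.0]]
labels = ["normal", "normal", "tumor", "tumor", "other", "other"]

cents = centroids(samples, labels)
print(classify([2.0, 2.0], cents))
print(classify([7.0, 7.0], cents))
```

Because only one centroid per class is stored, classification cost grows with the number of classes rather than the number of training samples.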


Bayesian Classification

- Probabilistic learning: classification learning can also be seen as computing $P(C = c \mid d)$, i.e., given a data tuple $d$, what is the probability that $d$ is of class $c$? ($C$ is the class attribute.)
- Model
  - Let $A_1$ through $A_k$ be the attributes with discrete values. They are used to predict a discrete class $C$.
  - Given an example with observed attribute values $a_1$ through $a_k$.
  - The prediction is the class $c$ such that $P(C = c \mid A_1 = a_1 \cap \cdots \cap A_k = a_k)$ is maximal.
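As a worked step, offered here as standard background rather than as part of the slide text, Bayes' theorem rewrites the posterior being maximized in terms of quantities that can be estimated from a training set:

```latex
P(C = c \mid A_1 = a_1 \cap \cdots \cap A_k = a_k)
  = \frac{P(A_1 = a_1 \cap \cdots \cap A_k = a_k \mid C = c)\, P(C = c)}
         {P(A_1 = a_1 \cap \cdots \cap A_k = a_k)}
```

The denominator does not depend on $c$, so the predicted class is the one that maximizes the numerator, $P(A_1 = a_1 \cap \cdots \cap A_k = a_k \mid C = c)\, P(C = c)$.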


Computing probabilities

- Now suppose that all attributes are conditionally independent given the class $c$. Formally, we assume

  $$P(A_1 = a_1 \mid A_2 = a_2 \cap \cdots \cap A_k = a_k,\ C = c) = P(A_1 = a_1 \mid C = c)$$

  and so on for $A_2$ through $A_k$.
- We have

  $$P(A_1 = a_1 \cap \cdots \cap A_k = a_k \mid C = c) = P(A_1 = a_1 \mid C = c) \cdot P(A_2 = a_2 \mid C = c) \cdots P(A_k = a_k \mid C = c)$$

- How do we estimate $P(A_1 = a_1 \mid C = c)$?

An example - training dataset

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
30…     high     no       fair           yes
?       medium   no       fair           yes
?       low      yes      fair           yes
?       low      yes      excellent      no
31…     low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
?       medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…     medium   no       excellent      yes
31…     high     yes      fair           yes
?       medium   no       excellent      no

(Age entries shown as ? are missing from the source; entries ending in … are truncated there.)

Class:
- C1: buys_computer = 'yes'
- C2: buys_computer = 'no'

Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
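A possible way to classify the data sample X is a naive Bayes calculation that estimates each probability by a relative frequency in the training dataset. The Python sketch below is an illustration under stated assumptions: the function name `naive_bayes_scores` is hypothetical, relative-frequency estimation is one common choice rather than something the text above prescribes, and every age entry other than `<=30` is collapsed to the placeholder `"other"` because those entries are garbled in the source. That collapse does not affect the result for X, which only needs the `<=30` counts.

```python
from fractions import Fraction

# Rows: (age, income, student, credit_rating, buys_computer).
# ASSUMPTION: age values other than "<=30" are unreadable in the source
# and are stored as the placeholder "other".
DATA = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("other", "high",   "no",  "fair",      "yes"),
    ("other", "medium", "no",  "fair",      "yes"),
    ("other", "low",    "yes", "fair",      "yes"),
    ("other", "low",    "yes", "excellent", "no"),
    ("other", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    ("other", "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("other", "medium", "no",  "excellent", "yes"),
    ("other", "high",   "yes", "fair",      "yes"),
    ("other", "medium", "no",  "excellent", "no"),
]

def naive_bayes_scores(data, x):
    """Return {class: P(C=c) * product over i of P(A_i = x_i | C=c)}.

    Every probability is a relative frequency counted from `data`.
    """
    scores = {}
    for c in sorted({row[-1] for row in data}):
        rows = [row for row in data if row[-1] == c]
        score = Fraction(len(rows), len(data))           # P(C = c)
        for i, value in enumerate(x):
            matches = sum(1 for row in rows if row[i] == value)
            score *= Fraction(matches, len(rows))        # P(A_i = x_i | C = c)
        scores[c] = score
    return scores

X = ("<=30", "medium", "yes", "fair")
scores = naive_bayes_scores(DATA, X)
prediction = max(scores, key=scores.get)

for c, s in scores.items():
    print(c, s, float(s))
print("prediction:", prediction)
```

The scores are unnormalized posteriors: the shared denominator $P(A_1 = a_1 \cap \cdots \cap A_k = a_k)$ is omitted because it does not change which class is largest. Exact fractions are used so the counts behind each estimate stay visible.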