Slides on Genomic Data-Mining Method Classification | BIOE 594, Study notes of Biology

University of Illinois - Chicago Biology

Material Type: Notes; Class: Advanced Special Topics in Bioengineering; Subject: Bioengineering; University: University of Illinois - Chicago; Term: Spring 2003;

Typology: Study notes

Pre 2010

Uploaded on 07/23/2009


BioE 594 Computational Functional Genomics

Lecture #16

Genomic data-mining method 4 - classification

Yang Dai

UIC


Prof. Yang Dai, BioE 594 Computational Functional Genomics

Classification tasks for microarray analysis

- Classification of SAMPLES
  - Generate gene expression profiles that can
    - discriminate between different known cell types or conditions, e.g. between tumor and normal tissue,
    - identify different and previously unknown cell types or conditions, e.g. new subclasses of an existing class of tumors.
- Classification of GENES
  - Assign an unknown cDNA sequence to one of a set of known gene classes.
  - Partition a set of genes into new (unknown) functional classes on the basis of their expression patterns across a number of samples.

Classification – a two-step process

- Model construction: describing a set of predetermined classes based on a training set. It is also called learning.
  - Each tuple/sample is assumed to belong to a predefined class.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future test data/objects.
  - Estimate the accuracy of the model.
    - The known label of each test example is compared with the classified result from the model.
    - The accuracy rate is the percentage of test cases that are correctly classified by the model.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
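The accuracy rate described above can be sketched in a few lines of Python. This is only an illustration: the function name `accuracy_rate` and the sample labels are hypothetical and do not come from the lecture.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test cases whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one test case")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical test set: known labels vs. labels assigned by some model.
known = ["tumor", "normal", "tumor", "tumor", "normal"]
predicted = ["tumor", "normal", "normal", "tumor", "normal"]
print(accuracy_rate(known, predicted))  # 4 of 5 test cases are correct
```

Whether a given rate counts as "acceptable" depends on the application and is not something the code can decide.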

Evaluating Classification Methods

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Compactness of the model: size of the tree, or the number of rules

Measure for the quality of a particular prediction

Performance assessment

- Resubstitution estimation
  - Error rate on the learning set.
  - Can be severely biased.
- Test set estimation
  - Points in the learning set $L$ are divided into two sets, $L_1$ and $L_2$; the classifier is built using $L_1$ and the error rate is computed for $L_2$.
  - Must ensure that $L_1$ and $L_2$ are from the same population.
  - 2/3 and 1/3 repeated random sampling. This procedure reduces the effective sample size.

Performance assessment

- Out-of-bag estimation
- The use of cross-validation (or any other process) is intended to provide accurate estimates of the classification error rate. In addition, it is also a procedure for model selection, i.e., for determining the parameters involved in the model.
- These estimates relate only to the experiment that was (cross-)validated.
- There is a common practice in this area of doing feature selection using all of the data and then using cross-validation only on the model building and classification portion. The resulting error estimate covers only that portion and is optimistically biased, because the features were chosen with knowledge of the held-out samples.
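A minimal sketch of the point about feature selection follows. It assumes a simple filter (largest gap between class means) and a nearest-centroid rule, both chosen for brevity rather than taken from the lecture; all names and the simulated data are hypothetical. The switch `select_inside` decides whether the features are chosen within each training fold or once on all of the data.

```python
import random


def class_means(samples, labels, label):
    """Per-feature mean over the samples carrying `label`."""
    rows = [s for s, l in zip(samples, labels) if l == label]
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_features(samples, labels, n_keep):
    """Indices of the n_keep features with the largest gap between class means."""
    m0 = class_means(samples, labels, 0)
    m1 = class_means(samples, labels, 1)
    gaps = [abs(a - b) for a, b in zip(m0, m1)]
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)[:n_keep]


def centroid_predict(samples, labels, keep, x):
    """Assign x to the class with the nearest centroid, using only `keep` features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        mean = class_means(samples, labels, label)
        dist = sum((x[j] - mean[j]) ** 2 for j in keep)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error(samples, labels, n_folds=5, n_keep=5, select_inside=True, seed=0):
    """k-fold cross-validated error; feature selection inside or outside the folds."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    global_keep = select_features(samples, labels, n_keep)  # uses ALL data
    wrong = 0
    for fold in folds:
        held = set(fold)
        tr_x = [samples[i] for i in order if i not in held]
        tr_y = [labels[i] for i in order if i not in held]
        keep = select_features(tr_x, tr_y, n_keep) if select_inside else global_keep
        wrong += sum(centroid_predict(tr_x, tr_y, keep, samples[i]) != labels[i]
                     for i in fold)
    return wrong / len(samples)


# Hypothetical pure-noise data: 40 samples, 500 "genes", labels unrelated to features.
rng = random.Random(42)
samples = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(40)]
labels = [i % 2 for i in range(40)]

honest = cv_error(samples, labels, select_inside=True)
biased = cv_error(samples, labels, select_inside=False)
print("selection inside folds:", honest)
print("selection on all data: ", biased)
```

Because the labels are unrelated to the features, an honest estimate should sit near 50%. Selecting features on all of the data would be expected to report a noticeably lower error on data like these, even though nothing can actually be learned.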

Performance assessment – ROC

The ROC (Receiver Operating Characteristic) curve is a plot of sensitivity (true positive rate) versus one minus specificity (false positive rate). It is used for visualizing, organizing and selecting classifiers based on their performance.
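As a rough illustration of how the points of an ROC curve arise, the sketch below sweeps a cut-off over a list of classifier scores and records the false positive rate and true positive rate at each cut-off. The function name, the scores, and the labels are hypothetical, and the convention "score at or above the cut-off means positive" is an assumption.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs, one per cut-off.

    A sample is called positive when its score is >= the cut-off.
    labels: 1 for positive, 0 for negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # cut-off above every score: nothing is called positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        true_positive_rate = tp / n_pos    # sensitivity
        false_positive_rate = fp / n_neg   # one minus specificity
        points.append((false_positive_rate, true_positive_rate))
    return points


# Hypothetical scores for four positive and four negative samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the cut-off can only add positives, so the points move from (0, 0) towards (1, 1); a curve that hugs the upper left corner indicates better separation between the classes.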

Performance assessment – ROC (a toy example)

Table 2 can be further transformed to Table 3.

[Table 2 and Table 3 are not reproduced. Column headings: cut-off measure, specificity, sensitivity; cut-off measure, false positive rate, true positive rate.]

[Figure: the corresponding ROC curve.]

Performance assessment – ROC examples

[Figure: example ROC curves, with panels labelled:]
- Random performance
- Good separation between classes, convex curve
- Reasonable separation, mostly convex
- Fairly poor separation, mostly convex
- Poor separation, large and small concavities

Centroid classification

- Algorithm
  - For each class, calculate the center of mass of the representative points in the class.
  - Calculate the distance between the position of the sample to be classified and each of the centers of mass of the classes, using an appropriate distance measure.
  - Assign the sample to the class whose center of mass is nearest to it.
- Centroid classification extends very naturally to more than two classes.
- It is fast and uses all data. However, it can give completely incorrect results if the data are not linearly separable.
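The three steps of the algorithm can be sketched as follows. Euclidean distance is assumed here as the "appropriate distance measure", and the function names and the two-gene example profiles are hypothetical.

```python
import math


def centroids(samples, labels):
    """Center of mass (per-feature mean) of the samples in each class."""
    grouped = {}
    for x, label in zip(samples, labels):
        grouped.setdefault(label, []).append(x)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def classify(sample, centers):
    """Assign the sample to the class whose centroid is nearest (Euclidean)."""
    return min(centers, key=lambda label: math.dist(sample, centers[label]))


# Hypothetical two-gene expression profiles for three classes.
samples = [[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0], [1.0, 9.0], [2.0, 8.0]]
labels = ["A", "A", "B", "B", "C", "C"]

centers = centroids(samples, labels)
print(centers)
print(classify([1.5, 2.0], centers))  # closest to the centroid of class "A"
print(classify([7.0, 9.5], centers))  # closest to the centroid of class "B"
```

Nothing in the code is specific to two classes, which reflects the remark that the method extends naturally to more. Each class is summarized by a single point, which is also why the method can fail when a class is not well represented by its center of mass.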

Bayesian Classification

- Probabilistic learning: classification learning can also be seen as computing $P(C=c \mid d)$, i.e., given a data tuple $d$, what is the probability that $d$ is of class $c$? ($C$ is the class attribute.)
- Model
  - Let $A_1$ through $A_k$ be the attributes with discrete values. They are used to predict a discrete class $C$.
  - Given an example with observed attribute values $a_1$ through $a_k$.
  - The prediction is the class $c$ such that $P(C=c \mid A_1=a_1 \cap \dots \cap A_k=a_k)$ is maximal.

Computing probabilities

Now suppose that all attributes are conditionally independent given the class $c$. Formally, we assume

$$P(A_1=a_1 \mid A_2=a_2 \cap \dots \cap A_k=a_k,\ C=c) = P(A_1=a_1 \mid C=c)$$

and so on for $A_2$ through $A_k$.

We have

$$P(A_1=a_1 \cap \dots \cap A_k=a_k \mid C=c) = P(A_1=a_1 \mid C=c) \cdot P(A_2=a_2 \mid C=c) \cdots P(A_k=a_k \mid C=c)$$

How do we estimate $P(A_1=a_1 \mid C=c)$?

An example - training dataset

age  | income | student | credit_rating | buys_computer
     | high   | no      | fair          | no
     | high   | no      | excellent     | no
30…  | high   | no      | fair          | yes
     | medium | no      | fair          | yes
     | low    | yes     | fair          | yes
     | low    | yes     | excellent     | no
31…  | low    | yes     | excellent     | yes
     | medium | no      | fair          | no
     | low    | yes     | fair          | yes
     | medium | yes     | fair          | yes
     | medium | yes     | excellent     | yes
31…  | medium | no      | excellent     | yes
31…  | high   | yes     | fair          | yes
     | medium | no      | excellent     | no

Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample:
X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)
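A minimal sketch of how the conditional-independence model could be applied to this example is given below. It estimates each probability by a relative frequency in the training data, which is one common choice and not necessarily the estimator used in the lecture. Two assumptions are made loudly: the age column is left out because most of its values cannot be read from the table, and every unreadable student or buys_computer cell is assumed to be "no". The function names are hypothetical.

```python
from collections import Counter, defaultdict

# (income, student, credit_rating, buys_computer); the age column is left out,
# and blank student / buys_computer cells are ASSUMED to be "no".
ROWS = [
    ("high", "no", "fair", "no"),
    ("high", "no", "excellent", "no"),
    ("high", "no", "fair", "yes"),
    ("medium", "no", "fair", "yes"),
    ("low", "yes", "fair", "yes"),
    ("low", "yes", "excellent", "no"),
    ("low", "yes", "excellent", "yes"),
    ("medium", "no", "fair", "no"),
    ("low", "yes", "fair", "yes"),
    ("medium", "yes", "fair", "yes"),
    ("medium", "yes", "excellent", "yes"),
    ("medium", "no", "excellent", "yes"),
    ("high", "yes", "fair", "yes"),
    ("medium", "no", "excellent", "no"),
]


def train(rows):
    """Count classes and (attribute index, value) pairs within each class."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = defaultdict(Counter)
    for *attributes, label in rows:
        for j, value in enumerate(attributes):
            value_counts[label][(j, value)] += 1
    return class_counts, value_counts


def scores(x, class_counts, value_counts):
    """P(C=c) * prod_j P(A_j=a_j | C=c), using relative-frequency estimates."""
    total = sum(class_counts.values())
    result = {}
    for label, n_c in class_counts.items():
        p = n_c / total
        for j, value in enumerate(x):
            p *= value_counts[label][(j, value)] / n_c
        result[label] = p
    return result


class_counts, value_counts = train(ROWS)
x = ("medium", "yes", "fair")  # income, student, credit_rating of sample X
s = scores(x, class_counts, value_counts)
prediction = max(s, key=s.get)
print(s)
print("predicted buys_computer:", prediction)
```

The scores are proportional to $P(C=c \mid X)$, so the class with the larger score is the prediction. A value never seen with a class gives a zero factor; the sketch applies no smoothing to avoid that.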
