data mining good notes, Exercises of Data Compression

University of Delhi Data Compression

unit 2 data mining notes . this is very good notes

Typology: Exercises

2016/2017

Uploaded on 05/09/2017

karan-verma 🇮🇳


Data Mining:

Concepts and Techniques

— Chapter 6 —

February 10, 2014

Data Mining: Concepts and Techniques

SURESH BABU M

ASST PROF

IT DEPT

VJIT


Chapter 6. Classification and Prediction

What is classification? What is prediction?

Issues regarding classification and prediction

Classification by decision tree induction

Bayesian classification

Rule-based classification

Classification by back propagation

Support Vector Machines (SVM)

Associative classification

Lazy learners (or learning from your neighbors)

Other classification methods

Prediction

Accuracy and error measures

Ensemble methods

Model selection

Summary

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is the training set

The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: for classifying future or unknown objects

Estimate the accuracy of the model

The known label of each test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

The test set must be independent of the training set; otherwise over-fitting goes undetected and the accuracy estimate is too optimistic

If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
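
The accuracy estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the `accuracy` helper, the `toy_model` classifier and the four test samples are all hypothetical.

```python
def accuracy(model, test_set):
    """Return the fraction of (features, label) pairs that the model labels correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)


# Hypothetical stand-in for a trained classifier (assumption, not from the slides).
def toy_model(features):
    return "yes" if features["score"] >= 5 else "no"


# Hypothetical held-out test samples, kept separate from any training data.
test_set = [
    ({"score": 7}, "yes"),
    ({"score": 2}, "no"),
    ({"score": 6}, "no"),   # the toy model gets this one wrong
    ({"score": 9}, "yes"),
]

print(accuracy(toy_model, test_set))  # 3 of the 4 samples are classified correctly
```

The important design point is that `test_set` never overlaps the data used to build the model, so the reported rate reflects behaviour on unseen tuples.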

Process (1): Model Construction

Training Data → Classification Algorithms → Classifier (Model)

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | |
Mary | Assistant Prof | | yes
Bill | Professor | | yes
Jim | Associate Prof | | yes
Dave | Assistant Prof | |
Anne | Associate Prof | |

Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
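
As a rough sketch, the learned rule can be written as a small Python function. The function name `tenured`, the fallback to "no" when the rule does not fire, and the sample inputs are assumptions made for illustration; the rule in the slides only states when the answer is "yes".

```python
def tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

    Returning 'no' otherwise is an assumption; the rule itself only covers 'yes'.
    """
    if rank.strip().lower() == "professor" or years > 6:
        return "yes"
    return "no"


# Hypothetical unseen tuples (the years values are illustrative only).
print(tenured("Professor", 2))       # rank condition fires
print(tenured("Assistant Prof", 7))  # years condition fires
print(tenured("Assistant Prof", 3))  # neither condition fires
```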

Supervised vs. Unsupervised Learning

Supervised learning (classification)

Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Issues: Evaluating Classification Methods

Accuracy

classifier accuracy: predicting class label

predictor accuracy: guessing value of predicted attributes

Speed

time to construct the model (training time)

time to use the model (classification/prediction time)

Robustness: handling noise and missing values

Scalability: efficiency in disk-resident databases

Interpretability

understanding and insight provided by the model

Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules


Output: A Decision Tree for “buys_computer”

[Figure: decision tree. The root node tests age?; the internal nodes test student? (branches: no, yes) and credit rating? (branches: fair, excellent); each leaf is labeled yes or no.]

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf

There are no samples left
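
The greedy procedure and its stopping conditions can be sketched as follows. This is a minimal ID3-style illustration under stated assumptions, not the slides' own code: the function names, the dictionary-based tree layout and the four-row toy data set are all hypothetical, and attributes are assumed to be categorical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Information gained by partitioning the rows on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer tree construction."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:   # stop: all samples belong to the same class
        return labels[0]
    if not attrs:               # stop: no attributes left, so use majority voting
        return majority
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attr": best, "default": majority, "children": {}}
    # Only observed values get a branch, so no partition is ever empty;
    # values unseen in training fall back to the node's majority class.
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["children"][value] = build_tree(
            [r for r, _ in subset], [l for _, l in subset], remaining
        )
    return node


def classify(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical toy data in which `student` alone determines the class.
rows = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["student", "credit"])
print(tree["attr"], classify(tree, {"student": "no", "credit": "fair"}))
```

The "no samples left" condition is handled here by the `default` entry rather than by an explicit empty-partition check, which is one of several reasonable ways to implement it.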

Attribute Selection: Information Gain

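Class P: buys_computer = “yes”

Class N: buys_computer = “no”

Expected information needed to classify a tuple in D:

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

Expected information after partitioning D on age, where $p_i$ and $n_i$ are the numbers of yes and no samples in partition i:

$Info_{age}(D) = \sum_{i} \frac{p_i + n_i}{14}\, I(p_i, n_i)$

The term $\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

Hence

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly, Gain(income), Gain(student), and Gain(credit_rating) are computed in the same way.

Training data:

age | income | student | credit_rating | buys_computer
 | high | no | fair | no
 | high | no | excellent | no
31… | high | no | fair | yes
 | medium | no | fair | yes
 | low | yes | fair | yes
 | low | yes | excellent | no
31… | low | yes | excellent | yes
 | medium | no | fair | no
 | low | yes | fair | yes
 | medium | yes | fair | yes
 | medium | yes | excellent | yes
31… | medium | no | excellent | yes
31… | high | yes | fair | yes
 | medium | no | excellent | no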
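
The information-gain arithmetic for age can be checked with a short Python sketch. The helper names `info` and `expected_info` are hypothetical. The (2, 3) counts for the youngest group are stated in the text; the counts for the other two groups are an assumption read off the training data, where the four "31…" rows are all "yes", which leaves 3 yes and 2 no for the remaining group.

```python
import math


def info(*counts):
    """I(c1, c2, ...): entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """Weighted average of I(p_i, n_i) over the partitions of an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


info_d = info(9, 5)  # 9 "yes" and 5 "no" tuples in D

# (yes, no) counts per age group; the last two pairs are inferred (assumption).
age_partitions = [(2, 3), (4, 0), (3, 2)]
info_age = expected_info(age_partitions)
gain_age = info_d - info_age

# Approximately 0.940, 0.694 and 0.247, in line with the 0.246 quoted above up to rounding.
print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))
```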

Computing Information-Gain for Continuous-Value Attributes

Let attribute A be a continuous-valued attribute

Must determine the best split point for A

Sort the values of A in increasing order

Typically, the midpoint between each pair of adjacent values is considered as a possible split point

$(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$

The point with the minimum expected information requirement for A is selected as the split-point for A

Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
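
The split-point search can be sketched as below. This is an illustrative implementation under stated assumptions: the function name `best_split_point` and the four sample values are hypothetical, and ties between equally good split points are resolved in favour of the smallest one.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) minimising the expected information.

    Candidates are the midpoints between adjacent distinct sorted values.
    Returns None when fewer than two distinct values are present.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        a, b = pairs[i][0], pairs[i + 1][0]
        if a == b:
            continue  # identical values give no usable midpoint
        split = (a + b) / 2
        d1 = [label for value, label in pairs if value <= split]  # A <= split-point
        d2 = [label for value, label in pairs if value > split]   # A >  split-point
        expected = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best


# Hypothetical ages and class labels, for illustration only.
split, expected = best_split_point([47, 25, 51, 32], ["yes", "no", "yes", "no"])
print(split, expected)
```

Rescanning the sorted pairs for every candidate keeps the sketch short; a production version would update class counts incrementally while sweeping the sorted values.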

Gini index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the gini index, gini(D), is defined as

$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

where $p_j$ is the relative frequency of class j in D

If a data set D is split on A into two subsets D1 and D2, the gini index $gini_A(D)$ is defined as

$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$

Reduction in impurity:

$\Delta gini(A) = gini(D) - gini_A(D)$

The attribute that provides the smallest $gini_A(D)$ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Gini index (CART, IBM IntelligentMiner)

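Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”

$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

Suppose the attribute income partitions D into 10 in D1: {low, medium} and 4 in D2: {high}

$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2)$

With 7 yes and 3 no in D1 and 2 yes and 2 no in D2 (counted from the training data), this gives

$\frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.443$

The other two groupings of income give 0.458 for {low, high} versus {medium} and 0.450 for {medium, high} versus {low}, so the split on {low, medium} versus {high} is the best since it is the lowest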
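
The Gini computations for this example can be reproduced with a short Python sketch. The helper names `gini` and `gini_split` are hypothetical. The (9, 5) class counts come from the text; the per-partition counts (7, 3) and (2, 2) are an assumption, obtained by counting the training data with the blank class cells read as "no".

```python
def gini(counts):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(partitions):
    """Size-weighted gini index of a split; each partition is a tuple of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)


gini_d = gini((9, 5))  # 9 "yes" and 5 "no" tuples in D

# (yes, no) counts per partition; assumed from the training data as described above.
d1 = (7, 3)  # income in {low, medium}, 10 tuples
d2 = (2, 2)  # income in {high}, 4 tuples
gini_income = gini_split([d1, d2])
reduction = gini_d - gini_income  # reduction in impurity for this split

print(round(gini_d, 3), round(gini_income, 3), round(reduction, 3))
```

Passing class counts instead of raw tuples keeps the helpers independent of how the data set is stored.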

All attributes are assumed continuous-valued

May need other tools, e.g., clustering, to get the possible split values

Can be modified for categorical attributes








