Data Mining: Concepts and Techniques
Chapter 6
February 10, 2014
SURESH BABU M, ASST PROF, IT DEPT, VJIT
Chapter 6. Classification and Prediction

  What is classification? What is prediction?
  Issues regarding classification and prediction
  Classification by decision tree induction
  Bayesian classification
  Rule-based classification
  Classification by back propagation
  Support Vector Machines (SVM)
  Associative classification
  Lazy learners (or learning from your neighbors)
  Other classification methods
  Prediction
  Accuracy and error measures
  Ensemble methods
  Model selection
  Summary

Classification—A Two-Step Process 

Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: for classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the estimate is over-optimistic and over-fitting goes undetected
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
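The accuracy estimate described above can be sketched in a few lines of Python. This is only an illustration: `accuracy_rate`, `toy_model`, the `score` attribute and the four test tuples are hypothetical names and data, not part of the slides.

```python
from typing import Callable, Hashable, Sequence, Tuple

def accuracy_rate(classify: Callable[[dict], Hashable],
                  test_set: Sequence[Tuple[dict, Hashable]]) -> float:
    """Percentage of test tuples whose predicted label equals the known label."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return 100.0 * correct / len(test_set)

# Hypothetical model and held-out test tuples (assumed for illustration only).
def toy_model(t: dict) -> str:
    return "yes" if t["score"] >= 50 else "no"

test_set = [({"score": 80}, "yes"), ({"score": 30}, "no"),
            ({"score": 55}, "no"), ({"score": 10}, "no")]

print(accuracy_rate(toy_model, test_set))  # 3 of the 4 test tuples are classified correctly
```

The test tuples are kept separate from whatever data produced the model, which is the independence requirement stated in the list above.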

Process (1): Model Construction

[Figure: Training Data -> Classification Algorithms -> Classifier (Model)]

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   -       no
Mary   Assistant Prof   -       yes
Bill   Professor        -       yes
Jim    Associate Prof   -       yes
Dave   Assistant Prof   -       no
Anne   Associate Prof   -       no

Classifier (Model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
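As a rough sketch, the learned rule can be written as a small Python function and applied to tuples whose class label is not known. Two things are assumptions here: the slide's rule has no ELSE branch, so returning 'no' otherwise is assumed, and the three unseen tuples (names, ranks, years) are invented for illustration.

```python
def tenured(rank: str, years: int) -> str:
    """Rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.

    Returning 'no' when the rule does not fire is an assumption.
    """
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Hypothetical unseen tuples; these are not rows of the training table.
unseen = [("Alice", "Assistant Prof", 2),
          ("Bob", "Professor", 4),
          ("Carol", "Associate Prof", 8)]

predictions = {name: tenured(rank, years) for name, rank, years in unseen}
print(predictions)
```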

Supervised vs. Unsupervised Learning

Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set

Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Issues: Evaluating Classification Methods

Accuracy
  classifier accuracy: predicting class label
  predictor accuracy: guessing value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules


Output: A Decision Tree for "buys_computer"

[Figure: decision tree with root test "age?", internal tests "student?" (branches no / yes) and "credit rating?" (branches fair / excellent), and leaves labelled "yes" or "no"]

Algorithm for Decision Tree Induction



Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left
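The greedy procedure and its stopping conditions can be sketched as follows. This is a minimal illustration in the style of ID3, not the exact algorithm from the textbook: the function names, the nested-dict tree representation and the four toy tuples are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (bits) needed to classify a tuple with these labels."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning on a categorical attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer induction over categorical attributes."""
    if len(set(labels)) == 1:      # stop: all samples belong to the same class
        return labels[0]
    if not attrs:                  # stop: no attributes left, so use majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {"attr": best, "branches": {}}
    # Only values observed at this node get a branch, so the "no samples left"
    # case never produces an empty partition in this sketch.
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        remaining = [a for a in attrs if a != best]
        tree["branches"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree

# Hypothetical toy data: "shape" separates the classes, "color" does not.
rows = [{"shape": "round", "color": "red"}, {"shape": "round", "color": "green"},
        {"shape": "square", "color": "red"}, {"shape": "square", "color": "green"}]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels, ["shape", "color"])
print(tree)
```

Because the attribute is chosen greedily at each node and never revisited, the result is not guaranteed to be the smallest possible tree.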

Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples)
Class N: buys_computer = "no" (5 tuples)

$Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

$Info_{age}(D) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$

Here $\frac{5}{14} I(2, 3)$ means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$

Similarly, Gain(income), Gain(student), and Gain(credit_rating) are computed from the same training data:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
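The calculation above can be checked with a short script over the 14 training tuples. It is a sketch only: the helper names `info` and `gain` are chosen for this example, and the age bands are written as `<=30`, `31..40` and `>40`, which is an assumption where the extracted labels are truncated. Working from unrounded intermediate values gives Gain(age) of about 0.247, whereas subtracting the rounded figures 0.940 and 0.694 gives 0.246.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
LABELS = [row[-1] for row in DATA]

def info(labels):
    """Info(D): expected information in bits needed to classify a tuple."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())

def gain(attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    idx = COLUMNS.index(attr)
    groups = {}
    for row in DATA:
        groups.setdefault(row[idx], []).append(row[-1])
    info_attr = sum(len(g) / len(DATA) * info(g) for g in groups.values())
    return info(LABELS) - info_attr

print(f"Info(D) = {info(LABELS):.3f}")
for attr in COLUMNS[:-1]:
    print(f"Gain({attr}) = {gain(attr):.3f}")
```

Since age has the highest gain of the four attributes, it is the one selected as the test at the root.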

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute
Must determine the best split point for A
  Sort the values of A in increasing order
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
  The point with the minimum expected information requirement for A is selected as the split-point for A
Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
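A minimal sketch of this split-point search is given below. The function name `best_split_point` and the six (age, label) pairs are hypothetical, and the code assumes the attribute takes at least two distinct values.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (bits) needed to classify a tuple with these labels."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, expected_info) with the minimum expected information.

    Candidates are the midpoints between adjacent distinct sorted values.
    Returns None if there are fewer than two distinct values.
    """
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        split = (lo + hi) / 2
        d1 = [label for v, label in pairs if v <= split]   # tuples with A <= split-point
        d2 = [label for v, label in pairs if v > split]    # tuples with A > split-point
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

# Hypothetical continuous attribute values and class labels.
ages = [22, 25, 31, 38, 45, 52]
labels = ["no", "no", "yes", "yes", "yes", "yes"]
split, expected = best_split_point(ages, labels)
print(split, expected)
```

Minimising the expected information requirement is equivalent to maximising the information gain, because Info(D) is the same for every candidate split.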

Gini index (CART, IBM IntelligentMiner) 

If a data set D contains examples from n classes, the gini index, gini(D), is defined as

$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$

where $p_j$ is the relative frequency of class j in D

If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as

$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$

Reduction in impurity:

$\Delta gini(A) = gini(D) - gini_A(D)$

The attribute that provides the smallest $gini_A(D)$ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

Gini index (CART, IBM IntelligentMiner) 

Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"

$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$

Suppose the attribute income partitions D into 10 in $D_1$: {low, medium} and 4 in $D_2$: {high}

$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$

The other binary groupings score higher: $gini_{income \in \{low, high\}}(D) = 0.458$ and $gini_{income \in \{medium, high\}}(D) = 0.450$. The split {low, medium} / {high} is therefore the best binary split on income, since its gini index is the lowest.

All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
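The gini index of the income split can be recomputed from the income and buys_computer columns of the 14-tuple training table. The sketch below is illustrative: `gini` and `gini_split` are names chosen for this example, and `gini_split` assumes the given subset leaves both partitions non-empty. It evaluates all three binary groupings of income so the values can be checked directly.

```python
from collections import Counter

# (income, buys_computer) pairs, in the row order of the training table.
PAIRS = [("high", "no"), ("high", "no"), ("high", "yes"), ("medium", "yes"),
         ("low", "yes"), ("low", "no"), ("low", "yes"), ("medium", "no"),
         ("low", "yes"), ("medium", "yes"), ("medium", "yes"), ("medium", "yes"),
         ("high", "yes"), ("medium", "no")]

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(pairs, subset):
    """Weighted gini index of the binary split D1 = {value in subset}, D2 = the rest."""
    d1 = [label for value, label in pairs if value in subset]
    d2 = [label for value, label in pairs if value not in subset]
    n = len(pairs)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print("gini(D) =", round(gini([label for _, label in PAIRS]), 3))
for subset in ({"low", "medium"}, {"low", "high"}, {"medium", "high"}):
    print(sorted(subset), round(gini_split(PAIRS, subset), 3))
```

A categorical attribute with v values has 2^(v-1) - 1 distinct binary groupings, which is why only three need to be checked for income.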