Tree Pruning & Classification: Impurity, Class Assignment, & Pruning, Study notes of Systems Engineering

An overview of tree pruning and classification techniques, including impurity measures for regression and classification trees, class assignment rules, and stop-splitting rules. It also discusses the importance of pruning branches to prevent overfitting and the challenges of finding the optimal tree size.

Typology: Study notes

Pre 2010

Uploaded on 08/05/2009

koofers-user-wu1

Tree-Based Models

Kwok-Leung Tsui
Industrial & Systems Engineering
Georgia Institute of Technology

Regression Models

  1. Classical linear model
  2. Generalized linear model
  3. Additive model ($S_j$: smooth nonparametric function)
  4. Generalized additive models

Weather Data: Play or not Play?

Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

Example Tree for "Play?"

Outlook
  sunny: Humidity
    high: No
    normal: Yes
  overcast: Yes
  rain: Windy
    false: Yes
    true: No
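The example tree can be checked against the weather data by writing it as nested conditions. The sketch below assumes the tree splits on Outlook first, then on Humidity for sunny days and on Windy for rainy days, which is how the node and branch labels of the figure read; the names `WEATHER` and `predict_play` are illustrative and not from the notes.

```python
# The 14 weather records: (outlook, temperature, humidity, windy, play)
WEATHER = [
    ("sunny", "hot", "high", False, "No"),
    ("sunny", "hot", "high", True, "No"),
    ("overcast", "hot", "high", False, "Yes"),
    ("rain", "mild", "high", False, "Yes"),
    ("rain", "cool", "normal", False, "Yes"),
    ("rain", "cool", "normal", True, "No"),
    ("overcast", "cool", "normal", True, "Yes"),
    ("sunny", "mild", "high", False, "No"),
    ("sunny", "cool", "normal", False, "Yes"),
    ("rain", "mild", "normal", False, "Yes"),
    ("sunny", "mild", "normal", True, "Yes"),
    ("overcast", "mild", "high", True, "Yes"),
    ("overcast", "hot", "normal", False, "Yes"),
    ("rain", "mild", "high", True, "No"),
]


def predict_play(outlook, humidity, windy):
    """Follow the assumed tree: Outlook at the root, then Humidity or Windy."""
    if outlook == "overcast":
        return "Yes"
    if outlook == "sunny":
        return "No" if humidity == "high" else "Yes"
    # Remaining branch: outlook == "rain"
    return "No" if windy else "Yes"


# Count how many of the 14 records the tree gets wrong.
errors = sum(
    predict_play(outlook, humidity, windy) != play
    for outlook, _temperature, humidity, windy, play in WEATHER
)
print(f"misclassified {errors} of {len(WEATHER)} records")
```

Temperature is never used by this tree, so it is ignored when predicting.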

Tree-Based Models

  • Model:

    $f(x) = \sum_{m=1}^{5} c_m I((x_1, x_2) \in R_m)$

    where $c_m$ is the regression model prediction value corresponding to the region $R_m$
  • Two types
    – Regression trees
    – Classification trees
  • Fundamental issues in tree-based models
    – How to decide the splitting point? (Tree growing)
    – How to control the size of the tree? (Tree pruning)

Regression Trees

  • Partition the space into M regions $R_1, R_2, \ldots, R_M$:

    $f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)$

  • For each of N observations, the input is $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and the output $y_i$ is continuous.
  • The best partition minimizes the sum of squared errors:

    $\sum_{i=1}^{N} (y_i - f(x_i))^2$

    with

    $\hat{c}_m = \text{average}(y_i \mid x_i \in R_m)$
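To make the squared-error criterion concrete, here is a minimal sketch of a single split on one input variable: each candidate threshold defines two regions, each region predicts its average response, and the threshold with the smallest total squared error wins. Searching one split at a time is an assumption made for illustration, and the function names and toy data are hypothetical.

```python
def region_mean(ys):
    """c_m: the average response of the observations in a region."""
    return sum(ys) / len(ys)


def sse(ys):
    """Sum of squared errors of a region around its own mean."""
    c = region_mean(ys)
    return sum((y - c) ** 2 for y in ys)


def best_split(xs, ys):
    """Try the midpoint between each pair of consecutive distinct x values.

    Returns (threshold, total_sse) for the split x <= threshold vs x > threshold
    that minimizes the summed squared error of the two regions.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical toy data with a jump in the response between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
threshold, total = best_split(xs, ys)
print(threshold, round(total, 4))
```

Applying the same search again inside each resulting region would grow the partition further.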

Classification and Regression Trees (Fundamentals)

Notation

  • $x$ – input vector
  • $X$ – input space
  • $J$ – number of classes
  • $C$ – set of classes, e.g. $C = \{1, 2, \ldots, J\}$

Accuracy Estimation

  • Question: how good is a classifier in prediction, i.e., how accurate is a classifier?
  • True misclassification rate: given the learning sample $L = \{(x_n, j_n),\ n = 1, \ldots, N\}$ with $x_n \in X$, $j_n \in C$, construct $d(x)$. Let $(x, j)$, $x \in X$, $j \in C$, be a new sample from the same population as $L$. The true misclassification rate of $d(x)$, $R^*(d)$, is defined as

    $R^*(d) = P(d(x) \neq j)$

How to Estimate $R^*(d)$

  • Three types of estimates of $R^*(d)$
    – Training error $R(d)$
    – Test error $R^{ts}(d)$
    – Cross-validation error $R^{cv}(d)$
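The first two estimates can be sketched with one helper. The sketch assumes the usual definitions: the training error is the misclassification fraction on the learning sample itself, and the test error is the same fraction on cases held out from construction. The classifier `d`, the threshold 2.0, and both samples are hypothetical.

```python
def misclassification_rate(classifier, sample):
    """Fraction of (x, j) cases in `sample` for which classifier(x) != j."""
    return sum(classifier(x) != j for x, j in sample) / len(sample)


# Hypothetical one-dimensional data: (x, class label)
learning_sample = [(0.5, 1), (1.0, 1), (1.5, 1), (2.5, 2), (3.0, 2), (1.8, 2)]
test_sample = [(0.7, 1), (2.2, 2), (1.9, 2), (2.8, 2)]


def d(x):
    """Hypothetical rule standing in for a classifier built from the learning sample."""
    return 1 if x < 2.0 else 2


training_error = misclassification_rate(d, learning_sample)  # R(d)
test_error = misclassification_rate(d, test_sample)          # R^ts(d)
print(training_error, test_error)
```

Both numbers estimate the same quantity, $R^*(d)$; they differ only in which cases are used to count the mistakes.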

How to Estimate $R^*(d)$ (Cont'd)

  • Cross-validation error $R^{cv}(d)$
    – Randomly split $L$ into $V$ subsets of as nearly equal size as possible, $L_1, L_2, \ldots, L_V$. For $v = 1, \ldots, V$, construct $d_v(x)$ using $L - L_v$. The testing error of $d_v(x)$ is

      $R^{ts}(d_v) = \sum_{(x_n, j_n) \in L_v} I(d_v(x_n) \neq j_n) / N_v$

      where $N_v$ is the number of cases in $L_v$.
    – The $V$-fold cross-validation error is defined as

      $R^{cv}(d) = \sum_{v=1}^{V} R^{ts}(d_v) / V$
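The V-fold procedure can be sketched in a few lines. This is a minimal illustration, assuming V is no larger than the number of cases; the majority-class learner is a hypothetical stand-in for whatever method constructs $d_v(x)$, and the function names are not from the notes.

```python
import random


def majority_class_classifier(sample):
    """Hypothetical learner: always predict the most frequent class in `sample`."""
    labels = [j for _x, j in sample]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority


def cross_validation_error(sample, build_classifier, V, seed=0):
    """Average over the V folds of the test error of d_v on the held-out fold L_v."""
    cases = list(sample)
    random.Random(seed).shuffle(cases)        # random split of L
    folds = [cases[v::V] for v in range(V)]   # fold sizes differ by at most one
    fold_errors = []
    for v in range(V):
        held_out = folds[v]                                           # L_v
        training = [c for u in range(V) if u != v for c in folds[u]]  # L - L_v
        d_v = build_classifier(training)
        wrong = sum(d_v(x) != j for x, j in held_out)
        fold_errors.append(wrong / len(held_out))                     # R^ts(d_v)
    return sum(fold_errors) / V                                       # R^cv(d)


# Hypothetical sample: 8 cases of class "A" and 2 of class "B"
sample = [(i, "A") for i in range(8)] + [(8, "B"), (9, "B")]
cv_error = cross_validation_error(sample, majority_class_classifier, V=5)
print(cv_error)
```

Every case is held out exactly once, so each one contributes to the estimate while never being used to build the classifier that predicts it.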

Classification Trees

  • Are tree-structured classifiers.
  • Construction: beginning with the input space, repeatedly split into two descendant subsets. Splits are formed by conditions on the input vector $x = (x_1, x_2, \ldots)$, e.g., a split could be of the form $\{x \in X : x_4 = \ldots\}$

Tree Construction

  • Three issues
    – How to grow, i.e., how to split?
    – Which class should a terminal node be assigned to?
    – When to stop?

Split Selection