Cost-Complexity Pruning Process for CART and QUEST Trees, Study notes of Mathematical Statistics

This document describes the cost-complexity pruning process for CART and QUEST trees, which finds the optimally pruned subtree for a given real number α. The process involves calculating the cost-complexity risk of a tree and finding the smallest optimally pruned subtree. The document also gives an algorithm that generates a sequence of pruned subtrees, and rules for selecting the right-sized subtree.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar


Cost-Complexity Pruning Process

Assuming a CART or QUEST tree has been grown successfully using a learning sample, this document describes the automatic cost-complexity pruning process for both CART and QUEST trees. Materials in this document are based on Classification and Regression Trees by Breiman et al. (1984). Calculations of the risk estimates used throughout this document are given in “Assignment and Risk Estimation” (TREE-assignment-risk.pdf).

Cost-Complexity Risk of a Tree T

Given a tree $T$ and a real number $\alpha$, the cost-complexity risk of $T$ with respect to $\alpha$ is

$$R_\alpha(T) = R(T) + \alpha\,|\tilde{T}|,$$

where $|\tilde{T}|$ is the number of terminal nodes and $R(T)$ is the resubstitution risk estimate of $T$.
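As a small numerical illustration of the definition above, the following sketch evaluates the cost-complexity risk for two trees. The function name and all numbers are hypothetical and chosen only to show how the penalty term works.

```python
def cost_complexity_risk(resub_risk: float, n_terminal: int, alpha: float) -> float:
    """Return R_alpha(T) = R(T) + alpha * (number of terminal nodes of T)."""
    return resub_risk + alpha * n_terminal


# Hypothetical values: a large tree with low resubstitution risk and a
# smaller pruned subtree with higher resubstitution risk.
risk_large = cost_complexity_risk(resub_risk=0.10, n_terminal=8, alpha=0.02)
risk_small = cost_complexity_risk(resub_risk=0.18, n_terminal=3, alpha=0.02)

# With alpha = 0.02 the smaller tree has the lower cost-complexity risk,
# even though its resubstitution risk is higher.
print(risk_large, risk_small)
```

With these numbers, a larger α penalizes the eight terminal nodes more heavily, so the comparison tips further toward the smaller tree as α grows.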

Smallest Optimally Pruned Subtree

Pruned subtree: For any tree $T$, $T'$ is a pruned subtree of $T$ if $T'$ is a tree with the same root node as $T$ and all nodes of $T'$ are also nodes of $T$. Denote $T' \preceq T$ if $T'$ is a pruned subtree of $T$.

Optimally pruned subtree: Given $\alpha$, a pruned subtree $T'$ of $T$ is called an optimally pruned subtree of $T$ with respect to $\alpha$ if

$$R_\alpha(T') = \min_{T'' \preceq T} R_\alpha(T'').$$

The optimally pruned subtree may not be unique.

Smallest optimally pruned subtree: If $T' \preceq T''$ for any optimally pruned subtree $T'' \preceq T_0$ such that $R_\alpha(T') = R_\alpha(T'')$, then $T'$ is the smallest optimally pruned subtree of $T_0$ with respect to $\alpha$, and is denoted by $T_0(\alpha)$.
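The three definitions above can be checked by brute force on a very small tree. The sketch below enumerates every pruned subtree of a hypothetical five-node tree, evaluates the cost-complexity risk of each, and returns the optimal subtree with the fewest terminal nodes. The tree, the node risks, and the function names are illustrative assumptions. The sketch also assumes, following Breiman et al. (1984), that the risk of a tree is the sum of the node risks R(t) over its terminal nodes.

```python
from itertools import product

# Hypothetical toy tree: internal node -> (left child, right child).
CHILDREN = {1: (2, 3), 2: (4, 5)}
# Hypothetical resubstitution risk R(t) of each node.
RISK = {1: 0.40, 2: 0.12, 3: 0.10, 4: 0.05, 5: 0.05}


def pruned_leaf_sets(t):
    """Yield the terminal-node set of every pruned subtree rooted at t."""
    yield frozenset([t])  # collapse t into a terminal node
    if t in CHILDREN:
        left, right = CHILDREN[t]
        for a, b in product(list(pruned_leaf_sets(left)),
                            list(pruned_leaf_sets(right))):
            yield a | b


def r_alpha(leaves, alpha):
    """Cost-complexity risk of the subtree whose terminal nodes are `leaves`."""
    return sum(RISK[t] for t in leaves) + alpha * len(leaves)


def smallest_optimal(alpha, root=1, tol=1e-12):
    """Terminal nodes of the optimal pruned subtree with the fewest leaves."""
    candidates = list(pruned_leaf_sets(root))
    best = min(r_alpha(c, alpha) for c in candidates)
    optimal = [c for c in candidates if r_alpha(c, alpha) <= best + tol]
    return min(optimal, key=len)


for alpha in (0.0, 0.02, 0.10, 0.20):
    print(alpha, sorted(smallest_optimal(alpha)))
```

At α = 0.02 two subtrees tie on cost-complexity risk, and the sketch returns the smaller one, which illustrates why the optimally pruned subtree need not be unique while the smallest one is. Exhaustive enumeration is only feasible for tiny trees.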

Cost-Complexity Pruning Process

Suppose that a tree $T_0$ was grown. The cost-complexity pruning process consists of two steps:

  1. Based on the learning sample, find a sequence of pruned subtrees $\{T_k\}_{k=0}^{K}$ of $T_0$ such that $T_0 \succ T_1 \succ T_2 \succ \dots \succ T_K$, where $T_K$ has only the root node of $T_0$.
  2. Find an “honest” risk estimate $\hat{R}(T_k)$ of each subtree. Select a right sized tree from the sequence of pruned subtrees.

Generate a sequence of smallest optimally pruned subtrees

To generate a sequence of pruned subtrees in step 1, the cost-complexity pruning technique developed by Breiman et al. (1984) is used. In generating the sequence of subtrees, only the learning sample is used. Given any real value $\alpha_{\min}$ ($\alpha_{\min} = 0$ in any SPSS implementation) and an initial tree $T_0$, there exists a sequence of real values $-\infty < \alpha_1 = \alpha_{\min} < \alpha_2 < \dots < \alpha_K < +\infty$ and a sequence of pruned subtrees $T_0 \succ T_1 \succ \dots \succ T_K$, such that the smallest optimally pruned subtree of $T_0$ for a given $\alpha$ is

$$T_0(\alpha) = \begin{cases} T_0, & \alpha < \alpha_1, \\ T_k, & \alpha_k \le \alpha < \alpha_{k+1},\ 0 < k < K, \\ T_K, & \alpha \ge \alpha_K, \end{cases}$$

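where

$$\alpha_{k+1} = \min_{t \in T_k} g_k(t), \qquad T_{k+1} = \{\, t \in T_k : g_k(s) > \alpha_{k+1} \text{ for all ancestors } s \text{ of } t \,\},$$

$$g_k(t) = \begin{cases} \dfrac{R(t) - R(T_{k,t})}{|\tilde{T}_{k,t}| - 1}, & t \in T_k - \tilde{T}_k, \\ +\infty, & t \in \tilde{T}_k, \end{cases}$$

$T_{k,t}$ is the branch of $T_k$ stemming from node $t$, and $R(t)$ is the resubstitution risk estimate of node $t$ based on the learning sample.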
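The recursion above can be sketched in a few lines. The example below is a simplified illustration, not the SPSS implementation: it starts directly from the grown tree and omits the initial step for the starting value of α. It also assumes that the risk of a branch is the sum of the node risks over its terminal nodes, as in Breiman et al. (1984). The toy tree, the risks, and all function names are hypothetical.

```python
import math

# Hypothetical toy tree: internal node -> (left child, right child).
CHILDREN = {1: (2, 3), 2: (4, 5)}
# Hypothetical resubstitution risk R(t) of each node.
RISK = {1: 0.40, 2: 0.12, 3: 0.10, 4: 0.05, 5: 0.05}
PARENT = {child: p for p, kids in CHILDREN.items() for child in kids}


def is_internal(t, nodes):
    """True if t still has its children in the current subtree `nodes`."""
    return t in CHILDREN and CHILDREN[t][0] in nodes


def branch_leaves(t, nodes):
    """Terminal nodes of the branch stemming from t within `nodes`."""
    if not is_internal(t, nodes):
        return [t]
    left, right = CHILDREN[t]
    return branch_leaves(left, nodes) + branch_leaves(right, nodes)


def g(t, nodes):
    """Risk increase per removed terminal node if the branch at t is cut."""
    if not is_internal(t, nodes):
        return math.inf
    leaves = branch_leaves(t, nodes)
    return (RISK[t] - sum(RISK[s] for s in leaves)) / (len(leaves) - 1)


def ancestors(t):
    """Yield the ancestors of t, from its parent up to the root."""
    while t in PARENT:
        t = PARENT[t]
        yield t


def pruning_sequence(nodes):
    """Return [(alpha, node set), ...] for successive subtrees down to the root."""
    sequence = []
    while len(nodes) > 1:
        gk = {t: g(t, nodes) for t in nodes}
        alpha = min(gk.values())
        # Keep a node only if every ancestor has g strictly above alpha.
        nodes = {t for t in nodes if all(gk[s] > alpha for s in ancestors(t))}
        sequence.append((alpha, nodes))
    return sequence


for alpha, subtree in pruning_sequence({1, 2, 3, 4, 5}):
    print(round(alpha, 4), sorted(subtree))
```

On this toy tree the weakest link is node 2, so its children are removed first, and the next step collapses the tree to the root. Real data can produce near-ties in the g values, so a production version would likely compare with a tolerance rather than exact equality.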

Explicit algorithm

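The algorithm can be used to generate a sequence of subtrees of $T_0$ for a given initial value $\alpha = \alpha_{\min}$ and an initial tree $T_0 = \{1, \dots, \#T_0\}$, where $\#T_0$ is the number of nodes in $T_0$. For node $t$, let

$$\mathrm{lt}(t) = \begin{cases} 0, & t \text{ is terminal}, \\ \text{left child of } t, & \text{otherwise}, \end{cases}$$

$$\mathrm{rt}(t) = \begin{cases} 0, & t \text{ is terminal}, \\ \text{right child of } t, & \text{otherwise}, \end{cases}$$

$$\mathrm{pa}(t) = \begin{cases} 0, & t \text{ is root node}, \\ \text{parent of } t, & \text{otherwise}. \end{cases}$$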
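The node-indexed quantities defined above map naturally onto arrays. The sketch below shows one possible encoding for a hypothetical five-node tree, together with two helper functions that such a representation makes easy. The array layout and the helper names are assumptions for illustration and are not taken from the source algorithm.

```python
# Hypothetical tree with nodes numbered 1..5: node 1 is the root with
# children 2 and 3, and node 2 has children 4 and 5. Index 0 is a
# placeholder so that the value 0 can mean "no such node".
lt = [0, 2, 4, 0, 0, 0]  # lt[t]: left child of t, or 0 if t is terminal
rt = [0, 3, 5, 0, 0, 0]  # rt[t]: right child of t, or 0 if t is terminal
pa = [0, 0, 1, 1, 2, 2]  # pa[t]: parent of t, or 0 if t is the root


def terminal_nodes(t):
    """Terminal nodes of the branch stemming from node t, left to right."""
    if lt[t] == 0:
        return [t]
    return terminal_nodes(lt[t]) + terminal_nodes(rt[t])


def ancestors(t):
    """Ancestors of node t, from its parent up to the root."""
    chain = []
    while pa[t] != 0:
        t = pa[t]
        chain.append(t)
    return chain


print(terminal_nodes(1), terminal_nodes(2), ancestors(5))
```

In this encoding, cutting the branch below a node amounts to setting its two child entries to 0, which turns it into a terminal node without moving any other data.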

Selecting the Right Sized Subtree

To select the right sized pruned subtree from the sequence of pruned subtrees $\{T_k\}_{k=0}^{K}$ of $T_0$, an “honest” method is used to estimate the risk $\hat{R}(T_k)$ and its standard error $se(\hat{R}(T_k))$ of each subtree $T_k$. Two methods can be used: the resubstitution estimation method and the test sample estimation method. Resubstitution estimation is used if there is no test sample. Test sample estimation is used if there is a test sample. Select the subtree $T_{k^*}$ as the right sized subtree of $T_0$ based on one of the following rules.

Simple rule

The right sized tree is selected as the $k^* \in \{0, 1, 2, \dots, K\}$ such that

$$\hat{R}(T_{k^*}) = \min_k \hat{R}(T_k).$$

The b-SE rule

For any nonnegative real value $b$ (default $b = 1$), the right sized tree is selected as the largest $k^{**} \in \{0, 1, 2, \dots, K\}$ such that

$$\hat{R}(T_{k^{**}}) \le \hat{R}(T_{k^*}) + b\, se(\hat{R}(T_{k^*})),$$

where $k^*$ is the index selected by the simple rule.
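Both selection rules reduce to a few lines once the risk estimates and standard errors of the subtrees are available. The sketch below assumes they are given as two lists indexed by k. The function names and the numbers are hypothetical, and taking the first minimizer when risks tie is an assumption, since the text does not specify a tie-break for the simple rule.

```python
def select_simple(risks):
    """Index k* with the smallest estimated risk (first minimizer on ties)."""
    return min(range(len(risks)), key=lambda k: risks[k])


def select_b_se(risks, std_errors, b=1.0):
    """Largest index whose risk is within b standard errors of the minimum."""
    k_star = select_simple(risks)
    threshold = risks[k_star] + b * std_errors[k_star]
    return max(k for k in range(len(risks)) if risks[k] <= threshold)


# Hypothetical risk estimates and standard errors for subtrees T_0, ..., T_5.
risks = [0.30, 0.26, 0.25, 0.27, 0.35, 0.50]
std_errors = [0.03, 0.03, 0.03, 0.03, 0.03, 0.03]

print(select_simple(risks))                   # index chosen by the simple rule
print(select_b_se(risks, std_errors, b=1.0))  # index chosen by the 1-SE rule
```

Because the subtrees shrink as k increases, taking the largest qualifying index favors a smaller tree whose estimated risk is still within b standard errors of the minimum. Setting b to 0 makes the threshold equal to the minimum risk itself.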

References

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. 1984. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California.