







This document describes how a class or a value is assigned to a node and to a case, and three methods of risk estimation: the resubstitution method, the test sample method, and the cross validation method. The information applies to the tree growing algorithms CART, CHAID, exhaustive CHAID, and QUEST. The material is based on Classification and Regression Trees by Breiman et al. (1984). It is assumed that a CART, CHAID, exhaustive CHAID, or QUEST tree has been grown successfully using a learning sample.
$Y$    The dependent variable, or target variable. It can be either categorical (nominal or ordinal) or continuous. If $Y$ is categorical with $J$ classes, its class takes values in $C = \{1, \ldots, J\}$.
$\hbar = \{x_n, y_n\}_{n=1}^{N}$    The learning sample, where $x_n$ and $y_n$ are the predictor vector and dependent variable for case $n$.
$\hbar(t)$    The learning samples that fall in node $t$.
$f_n$    The frequency weight associated with case $n$. A non-integral positive value is rounded to its nearest integer.
$w_n$    The case weight associated with case $n$.
$\pi(j),\ j = 1, \ldots, J$    Prior probability of $Y = j$.
$C(i \mid j)$    The cost of misclassifying a class $j$ case as a class $i$ case; $C(j \mid j) = 0$.
Once the tree is grown, an assignment (also called action or decision) is given to each node based on the learning sample. To predict the dependent variable value for an incoming case, we first find in which terminal node it falls, then use the assignment of that terminal node for prediction.
Assignment of a Node
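For any node $t$, let $d_t$ be the assignment given to node $t$,

$$d_t = \begin{cases} j^*(t), & Y \text{ is categorical} \\ \bar{y}(t), & Y \text{ is continuous,} \end{cases}$$

$$\bar{y}(t) = \frac{1}{N_w(t)} \sum_{n \in \hbar(t)} w_n f_n y_n,$$

$$j^*(t) = \arg\min_i \sum_j C(i \mid j)\, p(j \mid t),$$

where

$$p(j \mid t) = \frac{p(j, t)}{\sum_j p(j, t)}, \qquad p(j, t) = \pi(j)\, \frac{N_{w,j}(t)}{N_{w,j}},$$

$$N_{w,j} = \sum_{n \in \hbar} w_n f_n I(y_n = j), \qquad N_w(t) = \sum_{n \in \hbar(t)} w_n f_n, \qquad N_{w,j}(t) = \sum_{n \in \hbar(t)} w_n f_n I(y_n = j).$$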
If more than one class $j$ achieves the minimum, choose $j^*(t)$ to be the smallest such $j$ for which $N_{f,j}(t) = \sum_{n \in \hbar(t)} f_n I(y_n = j)$ is greater than 0, or the absolute smallest if $N_{f,j}(t)$ is zero for all of them.
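The node assignment rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular product: it assumes the class assignment minimizes the expected misclassification cost $\sum_j C(i \mid j)\, p(j \mid t)$ with $p(j, t) = \pi(j) N_{w,j}(t) / N_{w,j}$, and the function names, the `(y, f, w)` tuple layout, and the `1e-12` tolerance used to detect ties are all choices made for this example.

```python
from collections import defaultdict

def node_mean(node_cases):
    """Weighted mean of y in a node: sum(w*f*y) / sum(w*f) over cases (y, f, w)."""
    num = sum(w * f * y for y, f, w in node_cases)
    den = sum(w * f for _, f, w in node_cases)
    return num / den

def node_class(node_cases, class_totals, priors, cost, tol=1e-12):
    """Pick the class with minimum expected cost in a node.

    node_cases   : list of (y, f, w) for learning cases in the node (assumed non-empty)
    class_totals : dict j -> N_{w,j} over the whole learning sample (assumed > 0)
    priors       : dict j -> pi(j)
    cost         : dict (i, j) -> C(i|j) for i != j; C(j|j) is taken as 0
    tol          : hypothetical tolerance for treating expected costs as tied
    """
    classes = sorted(priors)
    n_w = defaultdict(float)  # N_{w,j}(t): weighted class counts in the node
    n_f = defaultdict(float)  # N_{f,j}(t): frequency-only class counts in the node
    for y, f, w in node_cases:
        n_w[y] += w * f
        n_f[y] += f
    # p(j, t) = pi(j) * N_{w,j}(t) / N_{w,j}, then normalize to get p(j | t)
    p_joint = {j: priors[j] * n_w[j] / class_totals[j] for j in classes}
    total = sum(p_joint.values())
    p_cond = {j: p_joint[j] / total for j in classes}
    # Expected cost of assigning class i to the node
    exp_cost = {
        i: sum((0.0 if i == j else cost[(i, j)]) * p_cond[j] for j in classes)
        for i in classes
    }
    best = min(exp_cost.values())
    tied = [i for i in classes if exp_cost[i] - best <= tol]
    # Tie-break: smallest tied class present in the node, else smallest tied class
    for i in tied:
        if n_f[i] > 0:
            return i, p_cond
    return tied[0], p_cond

# Example: three classes, all expected costs tie at 0.5, class 1 is absent from the node
priors = {1: 0.5, 2: 0.25, 3: 0.25}
class_totals = {1: 4.0, 2: 2.0, 3: 2.0}
cost = {(1, 2): 0.5, (1, 3): 0.5, (2, 1): 1.0, (2, 3): 1.0, (3, 1): 1.0, (3, 2): 1.0}
node_cases = [(2, 1, 1.0), (3, 1, 1.0)]
j_star, p_cond = node_class(node_cases, class_totals, priors, cost)
print(j_star)  # 2
```

In the example all three classes have the same expected cost, so the tie-break skips class 1 (no cases in the node) and returns class 2.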
Assignment of a Case
For a case with predictor vector $x$, the assignment or prediction $d_T(x)$ for this case by the tree $T$ is

$$d_T(x) = \begin{cases} j^*(t(x)), & Y \text{ is categorical} \\ \bar{y}(t(x)), & Y \text{ is continuous,} \end{cases}$$

where $t(x)$ is the terminal node the case falls in.
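Routing a case to its terminal node and reading off that node's assignment might look like the sketch below. The dictionary layout (keys `route`, `children`, `assignment`) is hypothetical and chosen only for illustration; `route` returns the index of the child a case goes to, so both binary and multiway splits fit the same shape.

```python
def predict(node, x):
    """Follow splits from the root until a terminal node, then return its assignment."""
    while "children" in node:
        node = node["children"][node["route"](x)]
    return node["assignment"]

# Hypothetical two-level tree: split on x[0], then on x[1]
tree = {
    "route": lambda x: 0 if x[0] <= 2.5 else 1,
    "children": [
        {"assignment": 1},
        {
            "route": lambda x: 0 if x[1] == "a" else 1,
            "children": [{"assignment": 2}, {"assignment": 1}],
        },
    ],
}

print(predict(tree, (1.0, "b")))  # 1
print(predict(tree, (3.0, "a")))  # 2
```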
Note that case weights are not involved in risk estimation, though they are involved in the tree growing process and in class assignment.
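$$s_j^2 = \frac{1}{N_{f,j}} \sum_{n \in D} f_n \left( L(y_n, d_T(x_n)) - \bar{L}_j \right)^2 I(y_n = j) = \frac{1}{N_{f,j}} \sum_{n \in D} f_n L(y_n, d_T(x_n))^2\, I(y_n = j) - \bar{L}_j^2,$$

$$s^2 = \frac{1}{N_f} \sum_{n \in D} f_n \left( L(y_n, d_T(x_n)) - \bar{L} \right)^2 = \frac{1}{N_f} \sum_{n \in D} f_n L(y_n, d_T(x_n))^2 - \bar{L}^2.$$

Here $N_f = \sum_{n \in D} f_n$ and $N_{f,j} = \sum_{n \in D} f_n I(y_n = j)$, and $\bar{L}_j$ and $\bar{L}$ are the frequency-weighted mean losses within class $j$ and over all of $D$, respectively.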
Putting everything together we get
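$$R(T \mid D) = \begin{cases}
\dfrac{1}{N_f} \displaystyle\sum_{t \in \tilde{T}} \sum_j N_{f,j}(t)\, C(j^*(t) \mid j), & Y \text{ categorical, } M_1 \\[2ex]
\displaystyle\sum_j \dfrac{\pi(j)}{N_{f,j}} \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j), & Y \text{ categorical, } M_2 \\[2ex]
\dfrac{1}{N_f} \displaystyle\sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^2, & Y \text{ continuous,}
\end{cases}$$

$$\operatorname{Var}(R(T \mid D)) = \begin{cases}
\dfrac{1}{N_f^2} \left[ \displaystyle\sum_{t \in \tilde{T}} \sum_j N_{f,j}(t)\, C(j^*(t) \mid j)^2 - N_f\, R(T \mid D)^2 \right], & \text{cat, } M_1 \\[2ex]
\displaystyle\sum_j \left( \dfrac{\pi(j)}{N_{f,j}} \right)^2 \left[ \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j)^2 - \dfrac{1}{N_{f,j}} \left( \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j) \right)^2 \right], & \text{cat, } M_2 \\[2ex]
\dfrac{1}{N_f^2} \left[ \displaystyle\sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^4 - N_f\, R(T \mid D)^2 \right], & \text{con,}
\end{cases}$$

where

$$N_{f,j}(t) = \sum_{n \in D(t)} f_n I(y_n = j).$$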
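A rough Python sketch of these estimates is given below. It assumes each case in $D$ arrives with its loss already computed (the cost $C(j^*(t) \mid j)$ of its terminal node's class for a categorical target, or the squared error $(y_n - \bar{y}(t))^2$ for a continuous one). The function name, the `(y, f, loss)` tuple layout, and the choice to skip classes that have no cases in $D$ are assumptions of this example. Passing `priors` selects the prior-weighted categorical form; omitting it selects the form normalized by $N_f$.

```python
import math
from collections import defaultdict

def risk_and_variance(cases, priors=None):
    """Estimate R(T|D) and Var(R(T|D)) from per-case losses.

    cases  : iterable of (y, f, loss), where f is the frequency weight
    priors : dict j -> pi(j) for the prior-weighted categorical form, else None
    """
    cases = list(cases)
    if priors is None:
        n_f = sum(f for _, f, _ in cases)
        s1 = sum(f * loss for _, f, loss in cases)       # sum of f * L
        s2 = sum(f * loss ** 2 for _, f, loss in cases)  # sum of f * L^2
        risk = s1 / n_f
        var = (s2 - n_f * risk ** 2) / n_f ** 2
        return risk, var
    n_fj = defaultdict(float)  # N_{f,j}
    s1 = defaultdict(float)    # per-class sum of f * L
    s2 = defaultdict(float)    # per-class sum of f * L^2
    for y, f, loss in cases:
        n_fj[y] += f
        s1[y] += f * loss
        s2[y] += f * loss ** 2
    # Only classes observed in D contribute (an assumption of this sketch)
    risk = sum(priors[j] * s1[j] / n_fj[j] for j in n_fj)
    var = sum(
        (priors[j] / n_fj[j]) ** 2 * (s2[j] - s1[j] ** 2 / n_fj[j]) for j in n_fj
    )
    return risk, var

# Example with unit costs: class 1 has 1 error in 4 cases, class 2 has 1 error in 2
cases = [(1, 1, 0.0)] * 3 + [(1, 1, 1.0), (2, 1, 0.0), (2, 1, 1.0)]
risk_a, var_a = risk_and_variance(cases)
risk_b, var_b = risk_and_variance(cases, priors={1: 0.5, 2: 0.5})
se_b = math.sqrt(var_b)  # estimated standard error
print(risk_a, risk_b)
```

With equal priors the class 2 error rate counts for more than it does in the pooled estimate, so `risk_b` (0.375) exceeds `risk_a` (1/3).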
The estimated standard error of $R(T \mid D)$ is given by $\operatorname{se}(R(T \mid D)) = \sqrt{\operatorname{Var}(R(T \mid D))}$.
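The tree risk can also be written as a sum over terminal nodes,

$$R(T \mid D) = \sum_{t \in \tilde{T}} R(t \mid D),$$

with $R(t \mid D)$ being the contribution from node $t$ to the tree risk such that

$$R(t \mid D) = \begin{cases}
\dfrac{1}{N_f} \displaystyle\sum_j N_{f,j}(t)\, C(j^*(t) \mid j), & Y \text{ categorical, } M_1 \\[2ex]
\displaystyle\sum_j \dfrac{\pi(j)\, N_{f,j}(t)}{N_{f,j}}\, C(j^*(t) \mid j), & Y \text{ categorical, } M_2 \\[2ex]
\dfrac{1}{N_f} \displaystyle\sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^2, & Y \text{ continuous.}
\end{cases}$$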
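As a quick consistency check on the two categorical forms of the node contribution (a worked step added for illustration, with $N_f = \sum_{n \in D} f_n$ and $N_{f,j} = \sum_{n \in D} f_n I(y_n = j)$ assumed): if the priors happen to equal the class proportions in $D$, that is $\pi(j) = N_{f,j} / N_f$, the prior-weighted contribution $\sum_j \pi(j) N_{f,j}(t) C(j^*(t) \mid j) / N_{f,j}$ collapses to the contribution normalized by $N_f$.

```latex
\begin{align*}
\sum_j \frac{\pi(j)\, N_{f,j}(t)}{N_{f,j}}\, C(j^*(t) \mid j)
  &= \sum_j \frac{N_{f,j}}{N_f} \cdot \frac{N_{f,j}(t)}{N_{f,j}}\, C(j^*(t) \mid j) \\
  &= \frac{1}{N_f} \sum_j N_{f,j}(t)\, C(j^*(t) \mid j).
\end{align*}
```

The same substitution applied to the tree risk gives the corresponding identity after summing over terminal nodes. The two variance expressions do not coincide in general, since the prior-weighted form centers the losses within each class.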
Resubstitution Estimate of the Risk of a tree T
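The resubstitution risk estimation method uses the same set of data (learning sample $\hbar$) that is used to grow the tree $T$ to calculate its risk, i.e.

$$R(T) = R(T \mid \hbar) = \sum_{t \in \tilde{T}} R(t), \qquad R(t) = R(t \mid \hbar),$$

$$\operatorname{Var}(R(T)) = \operatorname{Var}(R(T \mid \hbar)).$$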
Test Sample Estimate of the Risk
The idea of test sample risk estimation is that the whole data set is divided into two mutually exclusive subsets $\hbar$ and $\hbar'$. $\hbar$ is used as a learning sample to grow a tree $T$, and $\hbar'$ is used as a test sample to check the accuracy of the tree. The test sample estimate is

$$R^{ts}(T) = R(T \mid \hbar'), \qquad \operatorname{Var}(R^{ts}(T)) = \operatorname{Var}(R(T \mid \hbar')).$$
Cross Validation Estimate of the Risk of a tree T
Cross validation estimation is provided only when a tree is grown using the automatic tree growing process. Let $T$ be a tree which has been grown using all data from the whole data set $\hbar_0$. Let $V \ge 2$ be a positive integer.
…, V.
same set of user specified stopping rules which was applied to grow T.
$j_v^*(t)$ or $\bar{y}_v(t)$ for node $t$ of $T_v$ is done, consider $\hbar'_v$ as a test sample and calculate its test sample risk estimate $R^{ts}(T_v)$.