Assignment and Risk Estimation - Mathematics and Statistics - Study Notes, Exercises of Mathematical Statistics

This file covers assignment and risk estimation: assignment of a node, assignment of a case, the loss function, risk estimation of a tree T, and the resubstitution estimate of the risk.

Typology: Exercises

2011/2012

Uploaded on 10/31/2012

sangawar

Assignment and Risk Estimation

This document discusses how a class or a value is assigned to a node and to a case, and three methods of risk estimation: the resubstitution method, the test sample method, and the cross-validation method. The information is applicable to the tree growing algorithms CART, CHAID, exhaustive CHAID, and QUEST. The material in this document is based on Classification and Regression Trees by Breiman et al. (1984). It is assumed that a CART, CHAID, exhaustive CHAID, or QUEST tree has been grown successfully using a learning sample.

Notations

$Y$  The dependent variable, or target variable. It can be either categorical (nominal or ordinal) or continuous.

If $Y$ is categorical with $J$ classes, its class takes values in $C = \{1, \ldots, J\}$.

$\hbar = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$  The learning sample, where $\mathbf{x}_n$ and $y_n$ are the predictor vector and dependent variable for case $n$.

$\hbar(t)$  The learning samples that fall in node $t$.

$f_n$  The frequency weight associated with case $n$. A non-integral positive value is rounded to its nearest integer.

$w_n$  The case weight associated with case $n$.

$\pi(j),\ j = 1, \ldots, J$  Prior probability of $Y = j$.

$C(i \mid j)$  The cost of misclassifying a class $j$ case as a class $i$ case, with $C(j \mid j) = 0$.

Assignment

Once the tree is grown, an assignment (also called action or decision) is given to each node based on the learning sample. To predict the dependent variable value for an incoming case, we first find in which terminal node it falls, then use the assignment of that terminal node for prediction.

Assignment of a Node

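For any node $t$, let $d_t$ be the assignment given to node $t$,

$$d_t = \begin{cases} j^*(t) & Y \text{ is categorical} \\ \bar{y}(t) & Y \text{ is continuous,} \end{cases}$$

$$j^*(t) = \arg\min_i \sum_j C(i \mid j)\, p(j \mid t),$$

$$\bar{y}(t) = \frac{1}{N_w(t)} \sum_{n \in \hbar(t)} w_n f_n y_n,$$

where

$$p(j \mid t) = \frac{p(j, t)}{\sum_j p(j, t)}, \qquad p(j, t) = \pi(j)\, \frac{N_{w,j}(t)}{N_{w,j}},$$

$$N_w = \sum_{n \in \hbar} w_n f_n, \qquad N_{w,j} = \sum_{n \in \hbar} w_n f_n I(y_n = j),$$

$$N_w(t) = \sum_{n \in \hbar(t)} w_n f_n, \qquad N_{w,j}(t) = \sum_{n \in \hbar(t)} w_n f_n I(y_n = j).$$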

If there is more than one class $j$ that achieves the minimum, choose $j^*(t)$ to be the smallest such $j$ for which $N_{f,j}(t) = \sum_{n \in \hbar(t)} f_n I(y_n = j)$ is greater than 0, or the absolute smallest if $N_{f,j}(t)$ is zero for all of them.

For CHAID and exhaustive CHAID, use $\pi(j) = N_{w,j} / N_w$ in the equation.
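The node assignment rule can be illustrated with a short Python sketch. It is not taken from the source: the function names, the `(y, w, f)` tuple layout, and the floating-point tolerance used to detect ties are all assumptions made for illustration.

```python
from collections import defaultdict


def node_assignment_categorical(node_cases, class_totals, priors, cost, classes):
    """Sketch of j*(t) = argmin_i sum_j C(i|j) p(j|t) for one node t.

    node_cases   : list of (y, w, f) tuples for the cases in node t (assumed layout)
    class_totals : dict j -> N_{w,j} over the whole learning sample (must be > 0)
    priors       : dict j -> pi(j)
    cost         : dict (i, j) -> C(i|j), with C(j|j) = 0
    classes      : class labels in increasing order
    """
    n_wj_t = defaultdict(float)  # N_{w,j}(t)
    n_fj_t = defaultdict(float)  # N_{f,j}(t), used only for tie-breaking
    for y, w, f in node_cases:
        n_wj_t[y] += w * f
        n_fj_t[y] += f

    # p(j,t) = pi(j) * N_{w,j}(t) / N_{w,j}, then normalise to p(j|t)
    p_jt = {j: priors[j] * n_wj_t[j] / class_totals[j] for j in classes}
    total = sum(p_jt.values())
    p_j_given_t = {j: p_jt[j] / total for j in classes}

    expected_cost = {
        i: sum(cost[(i, j)] * p_j_given_t[j] for j in classes) for i in classes
    }
    best = min(expected_cost.values())
    # Tolerance for "achieves the minimum" is an assumption of this sketch.
    tied = [i for i in classes if abs(expected_cost[i] - best) <= 1e-12]
    # Prefer the smallest tied class with N_{f,j}(t) > 0, else the smallest tied class.
    populated = [i for i in tied if n_fj_t[i] > 0]
    return (populated or tied)[0]


def node_assignment_continuous(node_cases):
    """Sketch of y_bar(t): the mean of y weighted by w_n * f_n."""
    n_w_t = sum(w * f for _, w, f in node_cases)
    return sum(w * f * y for y, w, f in node_cases) / n_w_t
```

For CHAID and exhaustive CHAID, the same function could be called with `priors[j]` set to $N_{w,j} / N_w$.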

Assignment of a case

For a case with predictor vector $\mathbf{x}$, the assignment or prediction $d_T(\mathbf{x})$ for this case by the tree $T$ is

$$d_T(\mathbf{x}) = \begin{cases} j^*(t(\mathbf{x})) & Y \text{ is categorical} \\ \bar{y}(t(\mathbf{x})) & Y \text{ is continuous,} \end{cases}$$

where $t(\mathbf{x})$ is the terminal node the case falls in.
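As a rough illustration of assigning a case, the sketch below routes a predictor vector to its terminal node and returns that node's assignment. The `Node` class and the use of binary splits are assumptions for illustration only; CHAID trees, for example, can have more than two children per node.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence


@dataclass
class Node:
    """Hypothetical tree node; not the data structure of any particular library."""
    assignment: Any  # d_t: a class label or a node mean
    split: Optional[Callable[[Sequence[float]], bool]] = None  # True -> go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def assign_case(root: Node, x: Sequence[float]) -> Any:
    """Return d_T(x): the assignment of the terminal node t(x) that x falls in."""
    node = root
    while node.split is not None:  # internal node: keep descending
        node = node.left if node.split(x) else node.right
    return node.assignment


# Usage: a one-split tree on the first predictor.
tree = Node(assignment=1, split=lambda x: x[0] <= 2.0, left=Node(1), right=Node(2))
prediction = assign_case(tree, [3.0])
```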

Risk estimation

Note that the case weight is not involved in risk estimation, though it is involved in the tree growing process and in class assignment.

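$$s_j^2 = \frac{1}{N_{f,j}} \sum_{n \in D} f_n L^2(y_n, d_T(\mathbf{x}_n))\, I(y_n = j) - \bar{L}_j^2 = \frac{1}{N_{f,j}} \sum_{n \in D} f_n \left( L(y_n, d_T(\mathbf{x}_n)) - \bar{L}_j \right)^2 I(y_n = j),$$

$$s^2 = \frac{1}{N_f} \sum_{n \in D} f_n L^2(y_n, d_T(\mathbf{x}_n)) - \bar{L}^2 = \frac{1}{N_f} \sum_{n \in D} f_n \left( L(y_n, d_T(\mathbf{x}_n)) - \bar{L} \right)^2.$$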

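Putting everything together we get

$$R(T \mid D) = \begin{cases}
\dfrac{1}{N_f} \displaystyle\sum_{t \in \tilde{T}} \sum_j N_{f,j}(t)\, C(j^*(t) \mid j) & Y \text{ categorical, M1} \\[2ex]
\displaystyle\sum_j \dfrac{\pi(j)}{N_{f,j}} \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j) & Y \text{ categorical, M2} \\[2ex]
\dfrac{1}{N_f} \displaystyle\sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^2 & Y \text{ continuous,}
\end{cases}$$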

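$$\mathrm{Var}(R(T \mid D)) = \begin{cases}
\dfrac{1}{N_f^2} \left[ \displaystyle\sum_{t \in \tilde{T}} \sum_j N_{f,j}(t)\, C(j^*(t) \mid j)^2 - N_f\, R(T \mid D)^2 \right] & Y \text{ categorical, M1} \\[2ex]
\displaystyle\sum_j \dfrac{\pi(j)^2}{N_{f,j}^2} \left[ \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j)^2 - \dfrac{1}{N_{f,j}} \left( \sum_{t \in \tilde{T}} N_{f,j}(t)\, C(j^*(t) \mid j) \right)^2 \right] & Y \text{ categorical, M2} \\[2ex]
\dfrac{1}{N_f^2} \left[ \displaystyle\sum_{t \in \tilde{T}} \sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^4 - N_f\, R(T \mid D)^2 \right] & Y \text{ continuous,}
\end{cases}$$

where

$$N_{f,j}(t) = \sum_{n \in D(t)} f_n I(y_n = j).$$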

The estimated standard error of $R(T \mid D)$ is given by $\mathrm{se}(R(T \mid D)) = \sqrt{\mathrm{Var}(R(T \mid D))}$.
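A minimal Python sketch of the standard error computation follows. It assumes the case where the risk is the frequency-weighted mean of per-case losses, $R = \frac{1}{N_f}\sum_n f_n L_n$, with variance $\frac{1}{N_f^2}\left[\sum_n f_n L_n^2 - N_f R^2\right]$; the function name and argument layout are hypothetical, and the prior-weighted case is not covered.

```python
import math


def risk_and_se(losses, freqs):
    """Return (risk, standard error) from per-case losses L_n and frequency weights f_n.

    Assumes the frequency-weighted mean-loss form of the risk; this is a sketch,
    not the implementation used by any particular package.
    """
    n_f = sum(freqs)  # N_f
    risk = sum(f * l for l, f in zip(losses, freqs)) / n_f
    var = (sum(f * l * l for l, f in zip(losses, freqs)) - n_f * risk ** 2) / n_f ** 2
    # Guard against tiny negative values caused by rounding.
    return risk, math.sqrt(max(var, 0.0))


# Usage: unit misclassification losses for four cases, two of them misclassified.
risk, se = risk_and_se([0.0, 1.0, 1.0, 0.0], [1, 1, 1, 1])
```

For a continuous target, the per-case loss passed in would be the squared error $(y_n - \bar{y}(t))^2$ of the terminal node the case falls in.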

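Risk estimation of a tree is often written as $R(T \mid D) = \sum_{t \in \tilde{T}} R(t \mid D)$, with $R(t \mid D)$ being the contribution from node $t$ to the tree risk such that

$$R(t \mid D) = \begin{cases}
\dfrac{1}{N_f} \displaystyle\sum_j N_{f,j}(t)\, C(j^*(t) \mid j) & Y \text{ categorical, M1} \\[2ex]
\displaystyle\sum_j \dfrac{\pi(j)}{N_{f,j}}\, N_{f,j}(t)\, C(j^*(t) \mid j) & Y \text{ categorical, M2} \\[2ex]
\dfrac{1}{N_f} \displaystyle\sum_{n \in D(t)} f_n \left( y_n - \bar{y}(t) \right)^2 & Y \text{ continuous.}
\end{cases}$$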
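The decomposition of the tree risk into node contributions can be sketched in Python for a categorical target. Two hypothetical helpers are shown: one for the form that divides by $N_f$ and one for the form that weights each class by $\pi(j) / N_{f,j}$. The function names and the dictionary layout are assumptions made for illustration.

```python
def node_risk_freq(n_fj_t, j_star, cost, n_f):
    """Node contribution (1/N_f) * sum_j N_{f,j}(t) * C(j*(t)|j)."""
    return sum(n * cost[(j_star, j)] for j, n in n_fj_t.items()) / n_f


def node_risk_prior(n_fj_t, j_star, cost, priors, n_fj):
    """Node contribution sum_j pi(j)/N_{f,j} * N_{f,j}(t) * C(j*(t)|j)."""
    return sum(
        priors[j] / n_fj[j] * n * cost[(j_star, j)] for j, n in n_fj_t.items()
    )


# Usage: two terminal nodes, unit misclassification costs (illustrative numbers).
unit_cost = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (2, 2): 0.0}
nodes = [({1: 3, 2: 1}, 1), ({1: 1, 2: 3}, 2)]  # (N_{f,j}(t) per class, j*(t))
tree_risk = sum(node_risk_freq(counts, j, unit_cost, n_f=8) for counts, j in nodes)
```

Summing the node contributions over all terminal nodes gives the tree risk, which is the reason for writing the estimate node by node.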

Resubstitution Estimate of the Risk of a tree T

The resubstitution risk estimation method uses the same set of data (learning sample $\hbar$) that is used to grow the tree $T$ to calculate its risk, i.e.

$$R(T) = R(T \mid \hbar) = \sum_{t \in \tilde{T}} R(t), \qquad R(t) = R(t \mid \hbar),$$

$$\mathrm{Var}(R(T)) = \mathrm{Var}(R(T \mid \hbar)).$$

Test Sample Estimate of the Risk

The idea of test sample risk estimation is that the whole data set is divided into two mutually exclusive subsets $\hbar$ and $\hbar'$. $\hbar$ is used as a learning sample to grow a tree $T$, and $\hbar'$ is used as a test sample to check the accuracy of the tree. The test sample estimate is

$$R^{ts}(T) = R(T \mid \hbar'), \qquad \mathrm{Var}(R^{ts}(T)) = \mathrm{Var}(R(T \mid \hbar')).$$

Cross Validation Estimate of the Risk of a tree T

Cross validation estimation is provided only when a tree is grown using the automatic tree growing process. Let $T$ be a tree which has been grown using all data from the whole data set $\hbar_0$. Let $V \ge 2$ be a positive integer.

  1. Divide $\hbar_0$ into $V$ mutually exclusive subsets $\hbar'_v$, $v = 1, \ldots, V$. Let $\hbar_v$ be $\hbar_0 - \hbar'_v$, $v = 1, \ldots, V$.

  2. For each $v$, consider $\hbar_v$ as a learning sample and grow a tree $T_v$ on $\hbar_v$ by using the same set of user-specified stopping rules which was applied to grow $T$.

  3. After $T_v$ is grown and the assignment $j^*_v(t)$ or $\bar{y}_v(t)$ for node $t$ of $T_v$ is done, consider $\hbar'_v$ as a test sample and calculate its test sample risk estimate $R^{ts}(T_v)$.
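The three steps above can be sketched in Python as follows. The callables `grow_tree` and `test_risk` are hypothetical placeholders for the tree growing procedure and the test sample risk estimate, and the interleaved way of forming the $V$ subsets is an assumption, since the text does not say how the subsets are drawn. The sketch stops at the per-fold risks because the way they are combined is not covered by the text above.

```python
def cross_validation_fold_risks(cases, V, grow_tree, test_risk):
    """Return the test sample risk estimate R^ts(T_v) for each fold v.

    cases     : list of cases making up the whole data set
    V         : number of folds, V >= 2
    grow_tree : hypothetical callable, learning sample -> tree
    test_risk : hypothetical callable, (tree, test sample) -> risk estimate
    """
    if V < 2:
        raise ValueError("V must be at least 2")
    # Step 1: V mutually exclusive subsets (interleaved split is an assumption).
    test_folds = [cases[v::V] for v in range(V)]
    risks = []
    for v in range(V):
        # Step 1 (continued): the learning sample is everything outside fold v.
        learning = [c for u, fold in enumerate(test_folds) if u != v for c in fold]
        # Step 2: grow T_v on the learning sample with the same rules as T.
        tree_v = grow_tree(learning)
        # Step 3: evaluate T_v on the held-out fold.
        risks.append(test_risk(tree_v, test_folds[v]))
    return risks
```

Passing the growing procedure in as a callable keeps the sketch independent of whether the tree is CART, CHAID, exhaustive CHAID, or QUEST.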