Gain Summary - Mathematics and Statistics - Study Notes, Summaries of Mathematical Statistics

In this study material file, you will learn about: Gain Summary, Types, Average Oriented Gain Summary, Target Class, Average Profit Value Gain, Node-by-Node, Cumulative, Percentile Gain

Typology: Summaries

2011/2012

Uploaded on 10/31/2012

sangawar
sangawar 🇮🇳

Gain Summary
The Gain Summary summarizes a tree by displaying descriptive statistics for each terminal
node. This allows users to recognize the relative contribution of each terminal node and
identify the subsets of terminal nodes that are most useful. This document applies to all
tree growing algorithms: CART, CHAID, exhaustive CHAID and QUEST.
Note that case weights are not involved in gain summary calculations, although they are involved in
the tree growing process and class assignment.
Types of Gain Summaries
Depending on the type of dependent variable, different statistics are given in the gain
summary.
Average Oriented Gain Summary (Y continuous)
Statistics related to the node mean of Y are given. Through this summary, users may identify
the terminal nodes that give the largest (or smallest) average of the dependent variable.
Target Class Gain Summary (Y categorical)
Statistics related to a dependent variable class of interest (the target class) are given. Users may
identify the terminal nodes that have a large relative contribution to the target class.
Average Profit Value Gain Summary (Y categorical)
Statistics related to average profits are given. Users may be interested in identifying the
terminal nodes that have relatively large average profit values.
Node-by-Node, Cumulative, Percentile Gain Summary
To assist users in identifying the interesting terminal nodes and in understanding the result of
a tree, three different ways (node-by-node, cumulative and percentile) of looking at the gain
summaries mentioned above are provided.
Notations
Y The dependent variable, or target variable. It can be either categorical
(nominal or ordinal) or continuous.
If Y is categorical with J classes, its class takes values in C = {1, …, J}.
D Data set used to calculate gain statistics. It can be either learning sample data
set or test sample data set.
D(t) Cases in D that fall in node t.
$y_n$ The dependent variable value for case n.
$f_n$ The frequency weight associated with case n. A non-integral positive value is rounded to its nearest integer.
$N_f$ The number of cases in D, $N_f = \sum_{n \in D} f_n$.
$N_f(t)$ The number of cases in D(t), $N_f(t) = \sum_{n \in D(t)} f_n$.
$N_{f,j}$ The number of class j cases in D, $N_{f,j} = \sum_{n \in D} f_n I(y_n = j)$, where $I(\cdot)$ is the indicator function.
$N_{f,j}(t)$ The number of class j cases in D(t), $N_{f,j}(t) = \sum_{n \in D(t)} f_n I(y_n = j)$.
$\bar{y}(t)$ The mean of the dependent variable in D(t), $\bar{y}(t) = \frac{1}{N_f(t)} \sum_{n \in D(t)} f_n y_n$.
$j''$ The target class of interest; it can be any value in {1, …, J}. The target class $j''$ is user-specified; if it is not specified, the default target class is $j'' = 1$.
$r(j)$, $e(j)$ Respectively, the revenue and the expense associated with class j.
$pv(j)$ The profit value associated with class j, $pv(j) = r(j) - e(j)$.
$j(\tilde{t})$ The class assignment given by terminal node $\tilde{t}$.
$\pi(j)$ The prior probability of Y = j, j = 1, …, J.
M1 For categorical Y, denotes the empirical prior situation. CHAID and exhaustive CHAID are always considered to have an empirical prior.
M2 For categorical Y, denotes the non-empirical prior situation.
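As a rough illustration of the counting notation, the following Python sketch computes $N_f$, $N_f(t)$, $N_{f,j}$, $N_{f,j}(t)$ and $\bar{y}(t)$ from a list of cases. The tuple layout `(node, y, f)`, the function names and the rule that ties in rounding go upward are assumptions made for this example; the source only says that non-integral weights are rounded to the nearest integer.

```python
import math
from collections import defaultdict

def round_weight(f):
    # Nearest integer; ties are rounded up (an assumption, the source does not specify).
    return math.floor(f + 0.5)

def frequency_counts(cases):
    """cases: list of (node, y, f) with y a class label. Returns N_f, N_f(t), N_{f,j}, N_{f,j}(t)."""
    n_f = 0
    n_f_t = defaultdict(int)
    n_fj = defaultdict(int)
    n_fj_t = defaultdict(int)
    for node, y, f in cases:
        w = round_weight(f)
        n_f += w                  # N_f
        n_f_t[node] += w          # N_f(t)
        n_fj[y] += w              # N_{f,j}
        n_fj_t[(node, y)] += w    # N_{f,j}(t)
    return n_f, dict(n_f_t), dict(n_fj), dict(n_fj_t)

def node_mean(cases, node):
    """Frequency-weighted mean of a continuous y over the cases in `node`."""
    in_node = [(y, round_weight(f)) for t, y, f in cases if t == node]
    return sum(w * y for y, w in in_node) / sum(w for _, w in in_node)

# Hypothetical data: two terminal nodes "A" and "B", classes 1 and 2.
categorical_cases = [("A", 1, 2), ("A", 2, 1), ("B", 1, 1.6), ("B", 2, 3)]
n_f, n_f_t, n_fj, n_fj_t = frequency_counts(categorical_cases)

# Hypothetical continuous data for the node mean.
continuous_cases = [("A", 2.0, 1), ("A", 5.0, 2), ("B", 1.0, 1)]
mean_a = node_mean(continuous_cases, "A")
```

Case weights do not appear anywhere in the sketch, in line with the note that only frequency weights enter the gain summary calculations.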

Gain Summary: Node by Node

The node-by-node gain summary includes statistics for each node that are defined below.

Terminal Node

The identity of a terminal node. It is denoted by $\tilde{t}$.

Size: n

Total number of cases in the terminal node. It is denoted by $N_f(\tilde{t})$.

Size: %

Percentage of cases in the node. It is denoted by $p_f(\tilde{t}) \cdot 100\%$, where $p_f(\tilde{t})$ is given by […]

Mean (for average oriented gain summary only)

The respective mean $\bar{y}(\tilde{t})$ of the continuous dependent variable Y at the node:

$$s(\tilde{t}) = \bar{y}(\tilde{t}).$$

ROI (Return on Investment, for average profit value gain summary only)

ROI for a node is calculated as average profit divided by average expense:

$$ROI(\tilde{t}) = \frac{s(\tilde{t})}{s_0(\tilde{t})},$$

where $s_0(\tilde{t})$ is the average expense for node $\tilde{t}$, calculated using the equation for $s(\tilde{t})$ with pv(j) replaced by e(j).

Index (%)

For target class gain summary, it is the ratio (in %) of the score for the node to the proportion of class $j''$ cases in the sample. It is denoted by $is(\tilde{t}) \cdot 100\%$, where $is(\tilde{t})$ is

$$is(\tilde{t}) = \begin{cases} \dfrac{s(\tilde{t})}{N_{f,j''} / N_f} & \text{M1} \\[2ex] \dfrac{s(\tilde{t})}{\pi(j'')} & \text{M2} \end{cases}$$

For average profit value gain summary, it is the ratio (in %) of the score for the node to the average profit value for the sample:

$$is(\tilde{t}) = \begin{cases} \dfrac{s(\tilde{t})}{\sum_{j} N_{f,j}\, pv(j) / N_f} & \text{M1} \\[2ex] \dfrac{s(\tilde{t})}{\sum_{j} \pi(j)\, pv(j)} & \text{M2} \end{cases}$$

For average oriented gain summary, it is the ratio (in %) of the gain score for the node to the gain score $s(t = 1)$ for the root node t = 1:

$$is(\tilde{t}) = \frac{s(\tilde{t})}{s(t = 1)}.$$

Notice that if the denominator is 0, the index is not available.
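The target class index can be sketched in a few lines of Python. The function name and argument names below are hypothetical; the node score is passed in as a number rather than computed, and passing a prior probability is taken to mean the non-empirical prior situation (M2). A return value of `None` stands for "index not available".

```python
def target_class_index(score, n_f_target, n_f, prior=None):
    """Index (%) of a node for the target class gain summary.

    score      : s(t) for the node (assumed to be computed elsewhere)
    n_f_target : N_{f,j''}, number of target class cases in the sample
    n_f        : N_f, number of cases in the sample
    prior      : pi(j''); if given, the M2 (non-empirical prior) form is used
    """
    if prior is None:
        # M1: proportion of target class cases in the sample
        denominator = n_f_target / n_f if n_f else 0
    else:
        # M2: prior probability of the target class
        denominator = prior
    if denominator == 0:
        return None  # index not available
    return 100.0 * score / denominator

# Hypothetical numbers: node score 0.5, 20 target cases out of 100.
index_m1 = target_class_index(0.5, 20, 100)
index_m2 = target_class_index(0.5, 20, 100, prior=0.25)
index_na = target_class_index(0.5, 0, 100)
```

An index above 100% indicates a node whose score exceeds the sample-wide proportion of the target class.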

Gain Summary: Cumulative

In the cumulative gain summary, all nodes are first sorted with respect to the values of the score $s(\tilde{t})$. To simplify the formulas, we assume that the nodes in the collection $\{\tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_{|\tilde{T}|}\}$ are already sorted in either descending or ascending order, according to the user's request.

Terminal Node

The identity of a terminal node. It is denoted by $\tilde{t}_s$.

Cumulative Size: n, Cumulative Size: %, Cumulative gain: n, Cumulative gain: %

These statistics are simply defined as the cumulative sum of the corresponding node-by-node items up to the terminal node of interest. Let $a(\tilde{t}_i)$ be a node-by-node statistic; then its cumulative counterpart up to node $\tilde{t}_s$ is

$$\oplus a(\tilde{t}_s) = \sum_{i=1}^{s} a(\tilde{t}_i).$$

These four cumulative statistics are denoted respectively by $\oplus N_f(\tilde{t}_s)$, $\oplus p_f(\tilde{t}_s)$, $\oplus N_{f,j''}(\tilde{t}_s)$ and $\oplus p_f(\tilde{t}_s \mid j'')$.

Cumulative Score

For cumulative response, it is the ratio of the number of target class $j''$ cases up to the node to the total number of cases up to the node. For cumulative average profit, it is the average profit value up to the node. For cumulative mean, it is the mean of all $y_n$'s up to the node $\tilde{t}_s$. In all cases, the same formula is used; however, readers should use the appropriate formulas for $s(\tilde{t})$ and $p_f(\tilde{t})$ in the calculations. This cumulative score is denoted by $\oplus s(\tilde{t}_s)$:

$$\oplus s(\tilde{t}_s) = \begin{cases} \dfrac{\sum_{i=1}^{s} s(\tilde{t}_i)\, N_f(\tilde{t}_i)}{\sum_{i=1}^{s} N_f(\tilde{t}_i)} & \text{M1, or Y continuous} \\[2ex] \dfrac{\sum_{i=1}^{s} s(\tilde{t}_i)\, p_f(\tilde{t}_i)}{\sum_{i=1}^{s} p_f(\tilde{t}_i)} & \text{M2} \end{cases}$$
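The cumulative score is a running weighted average of the node scores, which the following Python sketch makes explicit. The function and variable names are hypothetical. The `sizes` argument is meant to hold $N_f(\tilde{t}_i)$ for M1 or continuous Y and $p_f(\tilde{t}_i)$ for M2, and the nodes are assumed to be sorted by score already.

```python
from itertools import accumulate

def cumulative_scores(scores, sizes):
    """Return the cumulative score for s = 1, ..., number of nodes.

    scores : node scores s(t_i), already sorted as requested by the user
    sizes  : N_f(t_i) (M1 or Y continuous) or p_f(t_i) (M2), in the same order
    """
    running_weighted = accumulate(s * n for s, n in zip(scores, sizes))
    running_size = accumulate(sizes)
    return [w / n for w, n in zip(running_weighted, running_size)]

# Hypothetical tree with three terminal nodes, sorted by descending score.
node_scores = [0.8, 0.5, 0.2]
node_sizes = [10, 20, 70]
cum_scores = cumulative_scores(node_scores, node_sizes)
```

With these numbers the last cumulative score equals the overall weighted average of the node scores, since the final sum runs over every terminal node.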

Cumulative ROI (for average profit value gain summary only)

Cumulative ROI up to a node is […]

[…] specified (default q = 10). For fixed q, the number of percentiles to be studied is 100/q. The p-th percentile to be studied is the pq%-tile, and its size is $N_{f.pq\%} = N_f \cdot pq\%$, p = 1, …, 100/q. For any pq%-tile, let $s_p$ and $s'_p$ be the two smallest integers in $\{1, \ldots, |\tilde{T}|\}$ such that

$$N_{f.pq} \in \left( \oplus N_f(\tilde{t}_{s_p - 1}),\ \oplus N_f(\tilde{t}_{s_p}) \right], \qquad N_{f.pq} \in \left[ \oplus N_f(\tilde{t}_{s'_p - 1}),\ \oplus N_f(\tilde{t}_{s'_p}) \right),$$

where $\oplus N_f(\tilde{t}_0) = 0$.

Terminal Nodes

The identity of all terminal nodes that belong to the p-th increment. Node $\tilde{t}_s$ belongs to the p-th increment if $s \in [s'_{p-1}, s_p]$.

Percentile (%)

Percentile being studied. The p-th percentile is the pq%-tile.

Percentile: n

Total number of cases in the percentile, $N_{f.pq} = [N_{f.pq\%}]$, where [x] denotes the integer nearest to x.

Gains: n (for target class percentile gain summary only)

Total number of class $j''$ cases in the pq%-tile. It is denoted by $\Diamond N_{f,j''}(p)$:

$$\Diamond N_{f,j''}(p) = \oplus N_{f,j''}(\tilde{t}_{s_p - 1}) + \frac{N_{f.pq} - \oplus N_f(\tilde{t}_{s_p - 1})}{N_f(\tilde{t}_{s_p})} \cdot N_{f,j''}(\tilde{t}_{s_p}),$$

where $\oplus N_{f,j''}(\tilde{t}_0)$ is defined to be 0.
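The percentile gain count takes all target class cases from the nodes that lie fully inside the percentile and a proportional share from the node that straddles its upper boundary. The Python sketch below follows that reading. The function name, the tie rule in rounding (ties go up) and the requirement that q divides 100 and that every percentile contains at least one case are assumptions for this example.

```python
import math
from itertools import accumulate

def percentile_gain_counts(sizes, target_counts, q=10):
    """Gains: n for each pq%-tile, p = 1, ..., 100/q.

    sizes         : N_f(t_i) for the sorted terminal nodes
    target_counts : N_{f,j''}(t_i) in the same order
    q             : percentile increment in percent (assumed to divide 100)
    """
    n_f = sum(sizes)
    cum_size = [0] + list(accumulate(sizes))            # cum_size[s] is the cumulative size up to node s
    cum_target = [0] + list(accumulate(target_counts))  # cumulative target class count, 0 at s = 0
    gains = []
    for p in range(1, 100 // q + 1):
        n_pq = math.floor(n_f * p * q / 100 + 0.5)       # percentile size rounded to the nearest integer
        # s_p: smallest s with cum_size[s - 1] < n_pq <= cum_size[s]
        s_p = next(s for s in range(1, len(cum_size))
                   if cum_size[s - 1] < n_pq <= cum_size[s])
        share = (n_pq - cum_size[s_p - 1]) / sizes[s_p - 1]
        gains.append(cum_target[s_p - 1] + share * target_counts[s_p - 1])
    return gains

# Hypothetical sorted nodes: sizes 10, 20, 70 with 8, 10, 14 target class cases.
gain_counts = percentile_gain_counts([10, 20, 70], [8, 10, 14], q=25)
```

In this example the 25%-tile ends inside the second node, so it receives all 8 target cases from the first node plus 15/20 of the 10 target cases in the second.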

Gains: % (for target class percentile gain summary only)

Percentage of class $j''$ cases in the sample that belong to the pq%-tile. It is denoted by $\Diamond p_{f,j''}(p) \cdot 100\%$, where $\Diamond p_{f,j''}(p)$ is

$$\Diamond p_{f,j''}(p) = \frac{\Diamond N_{f,j''}(p)}{N_{f,j''}}.$$

Percentile score

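For target class percentile gain summary, it is an estimate of the ratio of the number of class $j''$ cases in the pq%-tile to the total number of cases in the percentile. For average profit value percentile gain summary, it is an estimate of the average profit value in the pq%-tile. For average oriented percentile gain summary, it is an estimate of the ratio of gain score for all nodes in the percentile. In all charts, the same formula is used:

$$\Diamond s(p) = \begin{cases} \dfrac{\oplus N_f(\tilde{t}_{s_p - 1}) \cdot \oplus s(\tilde{t}_{s_p - 1}) + \left( N_{f.pq} - \oplus N_f(\tilde{t}_{s_p - 1}) \right) s(\tilde{t}_{s_p})}{N_{f.pq}} & \text{M1, or Y continuous} \\[2ex] \dfrac{\oplus p_f(\tilde{t}_{s_p - 1}) \cdot \oplus s(\tilde{t}_{s_p - 1}) + \left( p_{f.pq} - \oplus p_f(\tilde{t}_{s_p - 1}) \right) s(\tilde{t}_{s_p})}{p_{f.pq}} & \text{M2} \end{cases}$$

where

$$p_{f.pq} = \oplus p_f(\tilde{t}_{s_p - 1}) + \frac{N_{f.pq} - \oplus N_f(\tilde{t}_{s_p - 1})}{N_f(\tilde{t}_{s_p})} \cdot p_f(\tilde{t}_{s_p}).$$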
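For the empirical prior case (M1, or continuous Y), the percentile score can be sketched as follows. The sketch uses the fact that, under M1, the product of the cumulative size and the cumulative score up to a node equals the running sum of $s(\tilde{t}_i) N_f(\tilde{t}_i)$. The function name, the tie rule in rounding and the requirement that q divides 100 and every percentile is non-empty are assumptions for this example; the M2 form is not covered.

```python
import math
from itertools import accumulate

def percentile_scores_m1(scores, sizes, q=10):
    """Percentile score for each pq%-tile under M1 (or Y continuous).

    scores : node scores s(t_i) for the sorted terminal nodes
    sizes  : N_f(t_i) in the same order
    q      : percentile increment in percent (assumed to divide 100)
    """
    n_f = sum(sizes)
    cum_size = [0] + list(accumulate(sizes))
    # Running sum of s(t_i) * N_f(t_i): cumulative size times cumulative score.
    cum_weighted = [0.0] + list(accumulate(s * n for s, n in zip(scores, sizes)))
    result = []
    for p in range(1, 100 // q + 1):
        n_pq = math.floor(n_f * p * q / 100 + 0.5)   # percentile size, nearest integer
        # s_p: smallest s with cum_size[s - 1] < n_pq <= cum_size[s]
        s_p = next(s for s in range(1, len(cum_size))
                   if cum_size[s - 1] < n_pq <= cum_size[s])
        partial = (n_pq - cum_size[s_p - 1]) * scores[s_p - 1]
        result.append((cum_weighted[s_p - 1] + partial) / n_pq)
    return result

# Hypothetical sorted nodes: scores 0.8, 0.5, 0.2 with sizes 10, 20, 70.
pct_scores = percentile_scores_m1([0.8, 0.5, 0.2], [10, 20, 70], q=25)
```

The score of the 100%-tile coincides with the cumulative score of the last node, because the whole sample is then included.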

Percentile ROI (for average profit value gain summary only)

The definition of percentile ROI is

$$\Diamond ROI(p) = \frac{\Diamond s(p)}{\Diamond s_0(p)},$$

where $\Diamond s_0(p)$ is the percentile expense, calculated through the equation for $\Diamond s(p)$ with $pv(\tilde{t})$ replaced by $e(\tilde{t})$.

Percentile Index (in %)

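For target class percentile gain summary, it is the ratio (in %) of the percentile gain score for the pq%-tile to the proportion of class $j''$ cases in the sample. It is denoted by $\Diamond is(p) \cdot 100\%$, where $\Diamond is(p)$ is

$$\Diamond is(p) = \begin{cases} \dfrac{\Diamond s(p)}{N_{f,j''} / N_f} & \text{M1} \\[2ex] \dfrac{\Diamond s(p)}{\pi(j'')} & \text{M2} \end{cases}$$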