CHAID Algorithms - Mathematics and Statistics - Study Notes, Study notes of Mathematical Statistics

Main discussion in this file is about CHAID, Exhaustive CHAID Algorithms, Merging, Splitting, Stopping, The p-Value Calculations, Continuous dependent variable, Expected Cell Frequencies

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
CHAID and Exhaustive CHAID Algorithms

This document describes the tree-growing process of the CHAID and Exhaustive CHAID algorithms. The CHAID algorithm was originally proposed by Kass (1980), and Exhaustive CHAID by Biggs et al. (1991). Both algorithms allow multiple splits of a node, that is, a node may have more than two child nodes. Both consist of three steps: merging, splitting and stopping. A tree is grown by repeatedly applying these three steps to each node, starting from the root node.

Notations

$Y$: The dependent variable, or target variable. It can be ordinal categorical, nominal categorical or continuous. If $Y$ is categorical with $J$ classes, its class takes values in $C = \{1, \ldots, J\}$.

$X_m,\ m = 1, \ldots, M$: The set of all predictor variables. A predictor can be ordinal categorical, nominal categorical or continuous.

$\{\mathbf{x}_n, y_n\}_{n=1}^{N}$: The whole learning sample.

$w_n$: The case weight associated with case $n$.

$f_n$: The frequency weight associated with case $n$. A non-integral positive value is rounded to its nearest integer.

The CHAID Algorithm

The following algorithm only accepts nominal or ordinal categorical predictors. When predictors are continuous, they are transformed into ordinal predictors before using the following algorithm.

Merging

For each predictor variable X, merge non-significant categories. Each final category of X will result in one child node if X is used to split the node. The merging step also calculates the adjusted p-value that is to be used in the splitting step.

  1. If X has only 1 category, stop and set the adjusted p-value to 1.
  2. If X has 2 categories, go to step 8.
  3. Otherwise, find the allowable pair of categories of X that is least significantly different (i.e., most similar). For an ordinal predictor an allowable pair is two adjacent categories; for a nominal predictor it is any two categories. The most similar pair is the pair whose test statistic gives the largest p-value with respect to the dependent variable Y. How to calculate the p-value in various situations is described in later sections.
  4. For the pair having the largest p-value, check whether its p-value is larger than a user-specified alpha-level $\alpha_{\text{merge}}$ (alpha_merge). If it is, this pair is merged into a single compound category, and a new set of categories of X is formed. If it is not, go to step 7.
  5. (Optional) If the newly formed compound category consists of three or more original categories, find the best binary split within the compound category, i.e., the one whose p-value is the smallest. Perform this binary split if its p-value is not larger than an alpha-level $\alpha_{\text{split-merge}}$ (alpha_split-merge).
  6. Go to step 2.
  7. (Optional) Any category having too few observations (as compared with a user-specified minimum segment size) is merged with the most similar other category, as measured by the largest of the p-values.
  8. The adjusted p-value is computed for the merged categories by applying Bonferroni adjustments that are discussed later.
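The core loop of the merging step (steps 2 to 4 and 6, leaving out the optional steps 5 and 7 and the Bonferroni adjustment) can be sketched as follows. This is only an illustration: the function name `merge_categories`, the callable `pair_p_value` and the toy p-value function `toy_p` are assumptions made for the example, not part of the source.

```python
from itertools import combinations


def merge_categories(categories, pair_p_value, alpha_merge=0.05, ordinal=False):
    """Greedy merging of predictor categories (sketch of steps 2-4 and 6).

    categories   : original category labels (in their natural order if ordinal)
    pair_p_value : hypothetical callable(group_a, group_b) -> p-value of the
                   test comparing two (possibly compound) categories on Y
    Returns a list of tuples; each tuple is one final (compound) category.
    """
    groups = [(c,) for c in categories]
    while len(groups) > 2:
        if ordinal:
            # only adjacent categories are allowable pairs
            pairs = [(i, i + 1) for i in range(len(groups) - 1)]
        else:
            # any two categories are an allowable pair
            pairs = list(combinations(range(len(groups)), 2))
        scored = [(pair_p_value(groups[i], groups[j]), i, j) for i, j in pairs]
        best_p, i, j = max(scored, key=lambda t: t[0])  # most similar pair
        if best_p <= alpha_merge:
            break  # no remaining pair is similar enough to merge
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.insert(i, merged)  # i < j, so the position of group i is kept
    return groups


# Toy stand-in for a real test: categories with close means get a large "p-value".
means = {"a": 0.10, "b": 0.12, "c": 0.50, "d": 0.52}


def toy_p(g1, g2):
    avg = lambda g: sum(means[c] for c in g) / len(g)
    return max(0.0, 1.0 - 5.0 * abs(avg(g1) - avg(g2)))


print(merge_categories(["a", "b", "c", "d"], toy_p))
```

In a real implementation `pair_p_value` would be replaced by the F test or chi-squared test described in the later sections.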

Splitting

The “best” split for each predictor is found in the merging step. The splitting step selects which predictor is used to split the node. Selection is accomplished by comparing the adjusted p-value associated with each predictor. The adjusted p-value is obtained in the merging step.

  1. Select the predictor that has the smallest adjusted p-value (i.e., the most significant one).
  2. If this adjusted p-value is less than or equal to a user-specified alpha-level $\alpha_{\text{split}}$ (alpha_split), split the node using this predictor. Otherwise, do not split; the node is considered a terminal node.
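The two splitting rules above amount to a minimum search followed by a threshold test. A minimal sketch, assuming the adjusted p-values from the merging step are held in a dictionary keyed by predictor name (the name `choose_split` and this data layout are illustrative choices):

```python
def choose_split(adjusted_p_values, alpha_split=0.05):
    """Return the predictor to split on, or None if the node is terminal.

    adjusted_p_values : dict mapping predictor name -> adjusted p-value
    """
    if not adjusted_p_values:
        return None
    best = min(adjusted_p_values, key=adjusted_p_values.get)
    # split only if the most significant predictor passes the alpha_split test
    return best if adjusted_p_values[best] <= alpha_split else None


print(choose_split({"age": 0.03, "region": 0.001, "sex": 0.4}))
```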

Stopping

The stopping step checks if the tree growing process should be stopped according to the following stopping rules.

  1. If a node becomes pure; that is, all cases in a node have identical values of the dependent variable, the node will not be split.
  2. If all cases in a node have identical values for each predictor, the node will not be split.
  3. If the current tree depth reaches the user-specified maximum tree depth, the tree growing process will stop.
  4. If the size of a node is less than the user-specified minimum node size, the node will not be split.
  5. If the split of a node results in a child node whose size is less than the user-specified minimum child node size, child nodes that have too few cases (as compared with this minimum) will merge with the most similar child node, as measured by the largest of the p-values. However, if the resulting number of child nodes is 1, the node will not be split.
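Rules 1 to 4 can be checked before any test statistic is computed. The sketch below is an illustration under simplifying assumptions: node size is taken to be the plain number of cases (weights are ignored), rule 5 is not covered because it depends on the candidate split, and the function name and argument layout are invented for the example.

```python
def should_stop(y_values, predictor_rows, depth, max_depth, min_node_size):
    """Return the reason a node must not be split, or None (rules 1-4 only).

    y_values       : dependent-variable values of the cases in the node
    predictor_rows : one tuple of predictor values per case
    """
    if len(set(y_values)) <= 1:
        return "pure node"                      # rule 1
    if len({tuple(row) for row in predictor_rows}) <= 1:
        return "identical predictors"           # rule 2
    if depth >= max_depth:
        return "maximum depth reached"          # rule 3
    if len(y_values) < min_node_size:
        return "node too small"                 # rule 4
    return None


print(should_stop([1, 2, 1, 2], [(1,), (2,), (1,), (3,)], 1, 3, 2))
```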

Continuous dependent variable

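If the dependent variable Y is continuous, perform an ANOVA F test of whether the means of Y are the same across the categories of X. The test calculates the F-statistic and derives the p-value from it as

$$F = \frac{\sum_{i=1}^{I}\sum_{n\in D} w_n f_n I(x_n = i)\,(\bar{y}_i - \bar{y})^2 \,/\, (I-1)}{\sum_{i=1}^{I}\sum_{n\in D} w_n f_n I(x_n = i)\,(y_n - \bar{y}_i)^2 \,/\, (N_f - I)}$$

$$p = \Pr\big(F(I-1,\, N_f - I) > F\big),$$

where

$$\bar{y}_i = \frac{\sum_{n\in D} w_n f_n y_n I(x_n = i)}{\sum_{n\in D} w_n f_n I(x_n = i)}, \qquad \bar{y} = \frac{\sum_{n\in D} w_n f_n y_n}{\sum_{n\in D} w_n f_n}, \qquad N_f = \sum_{n\in D} f_n,$$

and $F(I-1,\, N_f - I)$ is a random variable following an F-distribution with degrees of freedom $I-1$ and $N_f - I$. Here $I$ is the number of categories of X, $I(\cdot)$ is the indicator function, and $D$ is the set of cases in the current node.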
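The weighted F-statistic can be computed directly from its definition. The sketch below returns the statistic together with its two degrees of freedom, $I - 1$ and $N_f - I$; turning it into a p-value additionally needs the F-distribution tail probability from a statistics library (for example `scipy.stats.f.sf`), which is left out so the example has no dependencies. The function name `anova_f` and the list-based inputs are assumptions.

```python
def anova_f(x, y, w=None, f=None):
    """Weighted one-way ANOVA F statistic for a continuous dependent variable.

    x : predictor category of each case
    y : dependent-variable value of each case
    w : case weights (default 1), f : frequency weights (default 1)
    Returns (F, I - 1, N_f - I).
    """
    n = len(y)
    w = w or [1.0] * n
    f = f or [1] * n
    wf = [wi * fi for wi, fi in zip(w, f)]
    cats = sorted(set(x))
    num_cats = len(cats)                      # I in the formula
    n_f = sum(f)                              # N_f: sum of frequency weights
    grand_mean = sum(a * yi for a, yi in zip(wf, y)) / sum(wf)
    between = within = 0.0
    for c in cats:
        idx = [k for k in range(n) if x[k] == c]
        sw = sum(wf[k] for k in idx)
        mean_c = sum(wf[k] * y[k] for k in idx) / sw
        between += sw * (mean_c - grand_mean) ** 2
        within += sum(wf[k] * (y[k] - mean_c) ** 2 for k in idx)
    f_stat = (between / (num_cats - 1)) / (within / (n_f - num_cats))
    return f_stat, num_cats - 1, n_f - num_cats


print(anova_f(["a", "a", "a", "b", "b", "b"], [1, 2, 3, 4, 5, 6]))
```

With unit case weights, a frequency weight of 2 gives the same result as listing the case twice, which is a quick sanity check on the role of $f_n$.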

Nominal dependent variable

If the dependent variable Y is nominal categorical, the null hypothesis of independence of X and Y is tested. To do the test, a contingency (or count) table is formed using the classes of Y as columns and the categories of the predictor X as rows. The expected cell frequencies under the null hypothesis are estimated. The observed and expected cell frequencies are used to calculate the Pearson chi-squared statistic or the likelihood ratio statistic. The p-value is computed from either one of these two statistics.

The Pearson chi-squared statistic and the likelihood ratio statistic are, respectively,

$$X^2 = \sum_{j=1}^{J}\sum_{i=1}^{I}\frac{(n_{ij} - \hat{m}_{ij})^2}{\hat{m}_{ij}}$$

$$G^2 = 2\sum_{j=1}^{J}\sum_{i=1}^{I} n_{ij}\ln\!\left(n_{ij}/\hat{m}_{ij}\right)$$

where $n_{ij} = \sum_{n\in D} f_n I(x_n = i \wedge y_n = j)$ is the observed cell frequency and $\hat{m}_{ij}$ is the estimated expected cell frequency for cell $(x_n = i,\ y_n = j)$ under the independence model, estimated as described in the following sections. The corresponding p-value is given by $p = \Pr(\chi^2_d > X^2)$ for the Pearson chi-squared test or $p = \Pr(\chi^2_d > G^2)$ for the likelihood ratio test, where $\chi^2_d$ follows a chi-squared distribution with $d = (J-1)(I-1)$ degrees of freedom.
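Both statistics and the chi-squared tail probability can be illustrated in a few lines. This is a sketch, not a production routine: `chi2_sf` evaluates $\Pr(\chi^2_d > x)$ with a plain power series for the regularized incomplete gamma function, which is adequate for moderate values but less accurate than a library routine in the far tail. The function names and the convention of skipping cells with $n_{ij} = 0$ in $G^2$ are assumptions of this example.

```python
import math


def pearson_and_lr(observed, expected):
    """Return (X^2, G^2) for an I x J table of observed and expected counts."""
    x2 = g2 = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for n, m in zip(obs_row, exp_row):
            x2 += (n - m) ** 2 / m
            if n > 0:  # assumed convention: a cell with n = 0 contributes 0 to G^2
                g2 += 2.0 * n * math.log(n / m)
    return x2, g2


def chi2_sf(x, d):
    """Pr(chi^2 with d degrees of freedom > x), via a series expansion."""
    if x <= 0:
        return 1.0
    a, z = d / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    # regularized lower incomplete gamma P(a, z)
    p_lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - p_lower)


observed = [[10, 20], [20, 10]]
expected = [[15, 15], [15, 15]]          # independence model for this table
x2, g2 = pearson_and_lr(observed, expected)
d = (2 - 1) * (2 - 1)
print(x2, chi2_sf(x2, d), g2, chi2_sf(g2, d))
```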

Estimation of Expected Cell Frequencies without Case Weights

$$\hat{m}_{ij} = \frac{n_{i.}\, n_{.j}}{n_{..}}$$

where

$$n_{i.} = \sum_{j=1}^{J} n_{ij}, \qquad n_{.j} = \sum_{i=1}^{I} n_{ij}, \qquad n_{..} = \sum_{j=1}^{J}\sum_{i=1}^{I} n_{ij}.$$
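Without case weights the expected frequency of a cell is its row total times its column total divided by the grand total. A minimal sketch, with the table given as a list of rows (the function name is an illustrative choice):

```python
def expected_independence(n):
    """Expected cell frequencies under independence for an I x J count table."""
    row_totals = [sum(r) for r in n]            # n_i.
    col_totals = [sum(c) for c in zip(*n)]      # n_.j
    total = sum(row_totals)                     # n_..
    return [[ri * cj / total for cj in col_totals] for ri in row_totals]


print(expected_independence([[10, 20], [20, 10]]))
```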

Estimation of Expected Cell Frequencies with Case Weights

If case weights are specified, the expected cell frequency under the null hypothesis of independence is of the form

$$m_{ij} = \bar{w}_{ij}^{-1}\,\alpha_i\,\beta_j$$

where $\alpha_i$ and $\beta_j$ are parameters to be estimated, and

$$\bar{w}_{ij} = \frac{w_{ij}}{n_{ij}}, \qquad w_{ij} = \sum_{n\in D} w_n f_n I(x_n = i \wedge y_n = j).$$

The parameter estimates $\hat{\alpha}_i$, $\hat{\beta}_j$, and hence $\hat{m}_{ij}$, result from the following iterative procedure.

  1. $k = 0$, $\alpha_i^{(0)} = \beta_j^{(0)} = 1$, $m_{ij}^{(0)} = \bar{w}_{ij}^{-1}$.
  2. $\alpha_i^{(k+1)} = \dfrac{n_{i.}}{\sum_j \bar{w}_{ij}^{-1}\beta_j^{(k)}} = \alpha_i^{(k)}\,\dfrac{n_{i.}}{\sum_j m_{ij}^{(k)}}$.
  3. $\beta_j^{(k+1)} = \dfrac{n_{.j}}{\sum_i \bar{w}_{ij}^{-1}\alpha_i^{(k+1)}}$.
  4. $m_{ij}^{(k+1)} = \bar{w}_{ij}^{-1}\,\alpha_i^{(k+1)}\,\beta_j^{(k+1)}$.
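The updates for $\alpha_i$, $\beta_j$ and $m_{ij}$ can be sketched as a short loop. Two assumptions are made here and are not taken from the steps listed above: the loop stops when the largest change in any $m_{ij}$ falls below a tolerance `eps`, and every cell is assumed to have $n_{ij} > 0$ and $w_{ij} > 0$ so that $\bar{w}_{ij}$ is defined. The function name and the `max_iter` safeguard are also illustrative.

```python
def expected_with_case_weights(n, w, eps=1e-8, max_iter=1000):
    """Iteratively estimate m_ij = alpha_i * beta_j / wbar_ij.

    n : I x J table of observed frequencies n_ij (all assumed > 0)
    w : I x J table of weighted frequencies w_ij (all assumed > 0)
    """
    I, J = len(n), len(n[0])
    inv_wbar = [[n[i][j] / w[i][j] for j in range(J)] for i in range(I)]  # 1 / wbar_ij
    row = [sum(n[i]) for i in range(I)]                                   # n_i.
    col = [sum(n[i][j] for i in range(I)) for j in range(J)]              # n_.j
    beta = [1.0] * J
    m = [r[:] for r in inv_wbar]                                          # m^(0)
    for _ in range(max_iter):
        alpha = [row[i] / sum(inv_wbar[i][j] * beta[j] for j in range(J))
                 for i in range(I)]
        beta = [col[j] / sum(inv_wbar[i][j] * alpha[i] for i in range(I))
                for j in range(J)]
        m_new = [[inv_wbar[i][j] * alpha[i] * beta[j] for j in range(J)]
                 for i in range(I)]
        change = max(abs(m_new[i][j] - m[i][j]) for i in range(I) for j in range(J))
        m = m_new
        if change < eps:   # assumed stopping rule
            break
    return m


# With w_ij = n_ij (all case weights equal to 1) the result reduces to
# the unweighted estimate n_i. * n_.j / n_..
print(expected_with_case_weights([[10, 20], [20, 10]], [[10, 20], [20, 10]]))
```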

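[The text is interrupted here. The steps that follow are numbered 2 to 7 and involve an additional parameter $\gamma_i$ together with the quantities $s_j$ and $\bar{s}$, whose definitions and step 1 are not part of the surviving text.]

  2. $\alpha_i^{(k+1)} = \dfrac{n_{i.}}{\sum_j \bar{w}_{ij}^{-1}\beta_j^{(k)}\big(\gamma_i^{(k)}\big)^{(s_j - \bar{s})}} = \alpha_i^{(k)}\,\dfrac{n_{i.}}{\sum_j m_{ij}^{(k)}}$.
  3. $\beta_j^{(k+1)} = \dfrac{n_{.j}}{\sum_i \bar{w}_{ij}^{-1}\alpha_i^{(k+1)}\big(\gamma_i^{(k)}\big)^{(s_j - \bar{s})}}$.
  4. $m_{ij}^{*} = \bar{w}_{ij}^{-1}\,\alpha_i^{(k+1)}\,\beta_j^{(k+1)}\big(\gamma_i^{(k)}\big)^{(s_j - \bar{s})}$, $\quad G_i = 1 + \dfrac{\sum_j (s_j - \bar{s})\,(n_{ij} - m_{ij}^{*})}{\sum_j (s_j - \bar{s})^2\, m_{ij}^{*}}$.
  5. $\gamma_i^{(k+1)} = \begin{cases}\gamma_i^{(k)}\, G_i & G_i > 0\\ \gamma_i^{(k)} & \text{otherwise}\end{cases}$.
  6. $m_{ij}^{(k+1)} = \bar{w}_{ij}^{-1}\,\alpha_i^{(k+1)}\,\beta_j^{(k+1)}\big(\gamma_i^{(k+1)}\big)^{(s_j - \bar{s})}$.
  7. If $\max_{i,j}\big|m_{ij}^{(k+1)} - m_{ij}^{(k)}\big| < \varepsilon$, stop and output $\alpha_i^{(k+1)}$, $\beta_j^{(k+1)}$, $\gamma_i^{(k+1)}$ and $m_{ij}^{(k+1)}$ as the final estimates $\hat{\hat{\alpha}}_i$, $\hat{\hat{\beta}}_j$, $\hat{\hat{\gamma}}_i$, $\hat{\hat{m}}_{ij}$. Otherwise, set $k = k + 1$ and go to step 2.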

The Bonferroni Adjustments

The adjusted p-value is calculated as the p-value times a Bonferroni multiplier. The Bonferroni multiplier adjusts for multiple tests.

CHAID

Suppose that a predictor variable originally has $I$ categories, and it is reduced to $r$ categories after the merging step. The Bonferroni multiplier $B$ is the number of possible ways that $I$ categories can be merged into $r$ categories. For $r = I$, $B = 1$. For $2 \le r < I$, use the following equation.

$$B = \begin{cases}
\dbinom{I-1}{r-1} & \text{Ordinal predictor}\\[2ex]
\displaystyle\sum_{v=0}^{r-1} (-1)^v \frac{(r-v)^I}{v!\,(r-v)!} & \text{Nominal predictor}\\[2ex]
\dbinom{I-2}{r-2} + r\dbinom{I-2}{r-1} & \text{Ordinal with a missing category}
\end{cases}$$
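As a hedged illustration, the three CHAID multipliers (binomial coefficients for ordinal predictors, an alternating sum for nominal predictors, and a sum of two binomial terms for ordinal predictors with a missing category) can be evaluated with exact integer arithmetic. The function name and the `kind` labels are assumptions of this sketch; the nominal case uses `fractions.Fraction` so the alternating sum stays exact.

```python
from fractions import Fraction
from math import comb, factorial


def bonferroni_multiplier(I, r, kind="nominal"):
    """Number of ways to merge I original categories into r categories."""
    if r == I:
        return 1
    if not 2 <= r < I:
        raise ValueError("r must satisfy 2 <= r <= I")
    if kind == "ordinal":
        return comb(I - 1, r - 1)
    if kind == "nominal":
        total = sum(Fraction((-1) ** v * (r - v) ** I,
                             factorial(v) * factorial(r - v))
                    for v in range(r))
        return int(total)
    if kind == "ordinal_missing":
        return comb(I - 2, r - 2) + r * comb(I - 2, r - 1)
    raise ValueError("unknown predictor kind")


# A nominal predictor with 4 categories merged down to 2:
p_value = 0.004
print(bonferroni_multiplier(4, 2, "nominal"), p_value * bonferroni_multiplier(4, 2, "nominal"))
```

Because the product of the p-value and the multiplier can exceed 1, implementations commonly cap the adjusted p-value at 1; the text above does not state this, so the sketch leaves the product as is.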

Exhaustive CHAID

Exhaustive CHAID merges two categories iteratively until only two categories are left. The Bonferroni multiplier $B$ is the sum of the number of possible ways of merging two categories at each iteration.

$$B = \begin{cases}
\dfrac{I(I-1)}{2} & \text{Ordinal predictor}\\[2ex]
\dfrac{I(I^2-1)}{6} & \text{Nominal predictor}\\[2ex]
\dfrac{I(I-1)}{2} & \text{Ordinal with a missing category}
\end{cases}$$

Missing Values

If the dependent variable of a case is missing, it will not be used in the analysis. If all predictor variables of a case are missing, this case is ignored. If the case weight is missing, zero, or negative, the case is ignored. If the frequency weight is missing, zero, or negative, the case is ignored.

Otherwise, missing values will be treated as a predictor category. For ordinal predictors, the algorithm first generates the “best” set of categories using all non-missing information from the data. Next, the algorithm identifies the category that is most similar to the missing category. Finally, the algorithm decides whether to merge the missing category with its most similar category or to keep the missing category as a separate category. Two p-values are calculated, one for the set of categories formed by merging the missing category with its most similar category, and the other for the set of categories formed by adding the missing category as a separate category. Take the action that gives the smaller p-value.

For nominal predictors, the missing category is treated the same as other categories in the analysis.
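The case-exclusion rules stated at the start of this section (missing dependent variable, all predictors missing, or a missing, zero or negative case or frequency weight) can be sketched as a simple filter. The dictionary layout of a case and the use of `None` to mark a missing value are assumptions made for this example.

```python
def usable_cases(cases):
    """Keep only the cases that enter the analysis.

    Each case is a dict with keys 'y' (dependent value), 'x' (list of
    predictor values), 'w' (case weight) and 'f' (frequency weight);
    None marks a missing value.
    """
    kept = []
    for c in cases:
        if c["y"] is None:
            continue                      # dependent variable missing
        if all(v is None for v in c["x"]):
            continue                      # every predictor missing
        if c["w"] is None or c["w"] <= 0:
            continue                      # invalid case weight
        if c["f"] is None or c["f"] <= 0:
            continue                      # invalid frequency weight
        kept.append(c)
    return kept


cases = [
    {"y": 1, "x": ["a", None], "w": 1.0, "f": 1},    # kept: one predictor present
    {"y": None, "x": ["a", "b"], "w": 1.0, "f": 1},  # dropped: y missing
    {"y": 0, "x": [None, None], "w": 1.0, "f": 1},   # dropped: all predictors missing
    {"y": 0, "x": ["b", "c"], "w": 0.0, "f": 1},     # dropped: zero case weight
    {"y": 1, "x": ["b", "c"], "w": 2.0, "f": -1},    # dropped: negative frequency weight
]
print(len(usable_cases(cases)))
```

A case that passes the filter but still has some predictor missing, such as the first one, keeps its missing value as a predictor category in the way described above.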

References

Biggs, D., de Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62.

Goodman, L. A. (1979). Simple Models for the Analysis of Association in Cross-Classifications Having Ordered Categories. Journal of the American Statistical Association, 74, 537-552.

Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127.