More on Discrimination and Classification: More on CART and Extensions

General Regression Model

y_j = f(x_j1, ..., x_jp) + ε_j,   j = 1, ..., n.

In the GLM:

y_j = θ_0 + θ_1 x_j1 + ... + θ_p x_jp + ε_j.

This equation holds for every observation in the population, so the GLM is a "tree" with one node and one leaf. CART uses regression modelling to choose each split, but in the end it makes the same estimate for every observation in a terminal node. E.g., in the heart attack example, all observations in the same terminal node are classified as high risk or as low risk; i.e., the model is ŷ_j = θ̂_0 within each terminal node.

This leads to TR, "Treed Regression." TR is a compromise between the GLM and CART: it allows for regression fits at the terminal nodes, which enables less complex, more interpretable trees to be fit.

E.g., say the most important split variable is x_k. Then if x_k < c we predict ŷ = α̂_l + β̂_l x_l, and if x_k ≥ c we predict ŷ = α̂_r + β̂_r x_r. This can be written as a linear model of the form

ŷ = α̂_l z_l + β̂_l z_l x_l + α̂_r z_r + β̂_r z_r x_r,

where z_l = 1 if x_k < c (and 0 otherwise) and z_r = 1 − z_l. It is not a "true" linear model, because the split point c, and hence the indicator variables, is estimated from the data. One can have more than one split and then fit separate regressions; the number of indicator variables equals the number of terminal nodes.

How do we choose between models of this type? Consider an alternative hypothesis in which the parent node is split into

L = {i : x_ik < c}   and   R = {i : x_ik ≥ c},

with separate regressions fitted in L and R; the null hypothesis is a single regression over the parent node. One can then test whether further splits are warranted. In the example, the results show that living area is the variable appearing in the final regression at each terminal node.
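As an illustration (not part of the original notes), here is a minimal sketch in Python of fitting and predicting from a two-leaf treed regression. It assumes the split variable x_k, the split point c, and the leaf regressor are already chosen, and for simplicity it uses the same regressor in both leaves; in general each terminal node may use a different predictor, and c would itself be estimated from the data.

```python
import numpy as np

def fit_treed_regression(x_split, x_leaf, y, c):
    """Fit separate simple linear regressions y ~ x_leaf in the two
    leaves defined by the split x_split < c versus x_split >= c."""
    left = x_split < c
    fits = {}
    for name, mask in (("left", left), ("right", ~left)):
        # Design matrix [1, x_leaf] restricted to the observations in this leaf.
        X = np.column_stack([np.ones(mask.sum()), x_leaf[mask]])
        alpha, beta = np.linalg.lstsq(X, y[mask], rcond=None)[0]
        fits[name] = (alpha, beta)
    return fits

def predict_treed_regression(fits, x_split, x_leaf, c):
    alpha_l, beta_l = fits["left"]
    alpha_r, beta_r = fits["right"]
    z_l = (x_split < c).astype(float)   # indicator for the left leaf
    z_r = 1.0 - z_l                     # indicator for the right leaf
    # The indicator-variable form from the notes:
    # yhat = alpha_l*z_l + beta_l*z_l*x_l + alpha_r*z_r + beta_r*z_r*x_r
    return alpha_l * z_l + beta_l * z_l * x_leaf + alpha_r * z_r + beta_r * z_r * x_leaf

# Hypothetical usage on simulated data (split at c = 0).
rng = np.random.default_rng(0)
x_k = rng.uniform(-1, 1, 200)   # split variable
x = rng.uniform(-1, 1, 200)     # leaf regressor
y = np.where(x_k < 0, 1 + 2 * x, -1 + 0.5 * x) + rng.normal(0, 0.1, 200)
fits = fit_treed_regression(x_k, x, y, c=0.0)
yhat = predict_treed_regression(fits, x_k, x, c=0.0)
```

The prediction function is simply the linear-model form ŷ = α̂_l z_l + β̂_l z_l x_l + α̂_r z_r + β̂_r z_r x_r written out in code.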
Bumping (Bootstrap Umbrella of Model Parameters)

Recall that bagging produces an accurate classifier (it reduces error rates), but it leads to a predictor that is hard to interpret. Bumping is an attempt to reduce the error rate of a given procedure (CART, subset regression, ...) while retaining an interpretable predictor.

Setup:

Training sample z = (z_1, ..., z_n) from a distribution F. We have a model for the data that depends on parameters θ. From the training sample, we minimize a target criterion:

θ̂ = argmin_θ R(z, θ).

The proposal is to estimate θ using bootstrap samples z*1, ..., z*B, estimating θ̂ from each bootstrap sample,

θ̂*b = argmin_θ R(z*b, θ),

and then choosing as θ̂ the candidate that gives the smallest value of R(z, θ) on the original sample:

θ̂_B = θ̂*b̂,   where b̂ = argmin_b R(z, θ̂*b).

The original sample z is included with the bootstrap samples. E.g.,

R(z, θ) = Σ_j Q[y_j, η(x_j, θ)],

where Q is a loss function, typically (y − η)² in regression and I(y ≠ η) in classification.

Comments:

If, using the original training sample, we find the global minimizer of R(z, θ) (e.g., linear least squares regression), then bumping will return this global minimizer. Why? The bootstrap samples include the original sample, and the global minimizer cannot be improved upon. However, adaptive procedures like subset regression and classification trees often find only local minimizers, so there is potential for bumping to find a better local minimum.

Data sets:

1) p = 5 predictors, each Uniform[−1, 1]. The response is y = 1 if x_1 > 0 and x_2 > 0, and y = 0 otherwise.
2) "Glass" data: determine which type of glass (6 types) from 9 chemical measurements on 214 observations.
3) Breast cancer data: 699 cases (458 benign and 241 malignant), 9 variables (cell characteristics).

Error Rate Comparison

  Data        CART    Bumped   Bagged
  Simulated   .049    .029     .025
  Glass       .035    .030     .025
  Cancer      .049    .047     .026

Results:

1) Bagging is the best: it gives the lowest error rate on all three data sets.
2) Bumping reduces the error rate relative to CART in each case.
3) So if only the error rate is important, bagging is better; if the structure of the fitted model is important, bumping is better.
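To make the procedure concrete, here is a minimal sketch of bumping for classification trees (not part of the original notes). It assumes scikit-learn's DecisionTreeClassifier as the base procedure and 0-1 loss as the criterion R(z, θ); the choices of B = 20 bootstrap samples, max_depth = 3, and n = 200 simulated observations are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bump(X, y, n_boot=20, random_state=0, **tree_kwargs):
    """Fit a tree on each bootstrap sample (plus the original sample) and
    return the fitted tree with the smallest misclassification rate
    R(z, theta) evaluated on the ORIGINAL training sample."""
    rng = np.random.default_rng(random_state)
    n = len(y)

    # Include the original sample z among the candidates.
    samples = [np.arange(n)] + [rng.integers(0, n, size=n) for _ in range(n_boot)]

    candidates = []
    for idx in samples:
        tree = DecisionTreeClassifier(**tree_kwargs).fit(X[idx], y[idx])
        train_err = np.mean(tree.predict(X) != y)   # R(z, theta_hat*b)
        candidates.append((train_err, tree))

    # theta_hat_B = theta_hat*b with b = argmin_b R(z, theta_hat*b)
    return min(candidates, key=lambda t: t[0])[1]

# Hypothetical usage on the simulated data set described above:
# p = 5 uniform predictors, y = 1 iff x_1 > 0 and x_2 > 0.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 5))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)
bumped_tree = bump(X, y, n_boot=20, max_depth=3)
```

Because the original sample is included among the candidates, the selected tree can never have a larger training criterion than the ordinary CART fit, which is the point made in the Comments above.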