ENCS5341 Machine Learning and Data Science
Ensemble Methods
Yazan Abu Farha - Birzeit University
Some slides are taken from Carlos Guestrin

Introduction
Simple (weak) classifiers are good!
[Figure: examples of simple classifiers, e.g. logistic regression with simple features, shallow decision trees, and decision stumps (Income > $100K?)]
Low variance. Learning is fast!
But high bias...
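Not part of the original slides, but the sketch below illustrates this trade-off, assuming scikit-learn and an arbitrary synthetic dataset: a depth-1 decision stump (a weak classifier) trains fast and is stable but underfits, while an unrestricted tree fits the training set almost perfectly yet generalizes worse.

```python
# Illustrative sketch only; dataset, sizes, and seeds are arbitrary assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)   # weak learner: high bias, low variance
deep = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr) # flexible learner: low bias, high variance

print("stump: train %.2f  test %.2f" % (stump.score(X_tr, y_tr), stump.score(X_te, y_te)))
print("deep : train %.2f  test %.2f" % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
```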
Ensemble Learning

Which models should we combine?
• If a model underfits (high bias, low variance), combine it with other low variance models.
  • They need to be different (experts on different parts of the data).
  • Bias reduction can be done with Boosting.
• If a model overfits (low bias, high variance), combine it with other low bias models.
  • They need to be different (each model's mistakes must be different).
  • Variance reduction can be done with Bagging.

For example, shallow trees have high bias and low variance, whereas deep trees have low bias and high variance.
• We can combine multiple shallow trees via boosting (e.g., AdaBoost).
• We can combine multiple deep trees via bagging (e.g., Random Forest).

Boosting

Boosting Formulation
• Train a single simple classifier (for example, a decision stump).
• Combine different classifiers to get a strong classifier using an additive model.

Aside: Learning a decision stump on weighted data
Increase the weights of harder/misclassified points.

Credit   Income   y       Weight
A        $130K    Safe    0.5
B        $80K     Risky   1.5
C        $110K    Risky   1.2
A        $110K    Safe    0.8
A        $90K     Safe    0.6
B        $120K    Safe    0.7
C        $30K     Risky   3
C        $60K     Risky   2
B        $95K     Safe    0.8
A        $60K     Safe    0.7
A        $98K     Safe    0.9

[Figure: decision stump splitting on Income (> $100K vs < $100K), with weighted class counts in each branch]
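A minimal sketch of learning a decision stump on weighted data (not from the slides). It assumes scikit-learn, which accepts per-example weights through sample_weight, and uses only the Income column from the table above for simplicity.

```python
# Sketch: fit a depth-1 tree (decision stump) on weighted examples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

income = np.array([[130], [80], [110], [110], [90], [120], [30], [60], [95], [60], [98]])  # in $K
y      = np.array([ 1,    -1,   -1,    1,     1,    1,    -1,   -1,   1,    1,    1])      # Safe=+1, Risky=-1
w      = np.array([0.5,   1.5,  1.2,   0.8,   0.6,  0.7,   3,    2,    0.8,  0.7,  0.9])   # example weights

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(income, y, sample_weight=w)   # the split is chosen to minimize *weighted* impurity
print(stump.predict([[105]]))           # e.g., prediction for an income of $105K
```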
Boosting = greedy learning of ensembles from data:
1. Start from the training data, learn a classifier, and use it to predict.
2. Give higher weight to the points where this classifier is wrong.
3. On the weighted data, learn the next classifier and its coefficient, and predict with the combined (additive) model.
4. Repeat.
Boosting: Toy Example
• Each data point x_t has a class label y_t = +1 or -1 and an initial weight w_t = 1.
• It is a sequential procedure: after training a weak classifier, the weights are updated as
  w_t ← w_t exp{-y_t F_t}.
• This sets a new problem for which the previous weak classifier performs at chance again, and the same steps are repeated in every round.

AdaBoost Algorithm
• Given (x_1, y_1), ..., (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}
• Initialise weights D_1(i) = 1/m
• Iterate t = 1, ..., T:
  • Train a weak learner using distribution D_t
  • Get a weak classifier h_t : X → R
  • Update: D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t,
    where Z_t is a normalization factor (chosen so that D_{t+1} will be a distribution), and
    α_t = (1/2) ln((1 - ε_t) / ε_t) > 0
• Output the final classifier: H(x) = sign(Σ_{t=1}^T α_t h_t(x))
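To make the update rules above concrete, here is a minimal AdaBoost sketch (not from the slides). It assumes scikit-learn decision stumps as the weak learners, binary labels in {-1, +1}, and hypothetical helper names.

```python
# Sketch of AdaBoost with decision stumps as weak learners.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    m = len(y)
    D = np.full(m, 1.0 / m)              # D_1(i) = 1/m
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.sum(D[pred != y])            # weighted error ε_t (D sums to 1)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)  # guard against division by zero / log(0)
        alpha = 0.5 * np.log((1 - eps) / eps) # α_t = ½ ln((1 - ε_t) / ε_t)
        D = D * np.exp(-alpha * y * pred)     # D_{t+1}(i) ∝ D_t(i) exp(-α_t y_i h_t(x_i))
        D /= D.sum()                          # normalize by Z_t
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # H(x) = sign(Σ_t α_t h_t(x))
    F = sum(a * h.predict(X) for h, a in zip(stumps, alphas))
    return np.sign(F)
```

Usage would look like: stumps, alphas = adaboost_fit(X_train, y_train); y_hat = adaboost_predict(stumps, alphas, X_test).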
Bagging
• Obtain different models by training the same model on different training samples.
• Reduce overfitting by averaging out the individual predictions.
• In practice: take M bootstrap samples of your data and train a model on each bootstrap.
• The final prediction is obtained by averaging the predictions of the base models:
  • Soft voting (or majority voting) for classification.
  • Mean value for regression.
• Can also produce an uncertainty estimate by combining the class probabilities of the individual models.

Simple Majority Voting
[Figure: base classifiers vote on test examples; the majority class is the ensemble prediction]
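A minimal sketch of bagging with majority voting (not from the slides). It assumes scikit-learn, deep decision trees as the base model, and integer class labels 0..K-1; the function names are illustrative.

```python
# Sketch of bagging: train the same model on M bootstrap samples, then majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=25, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(y)
    models = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)  # bootstrap: draw n points with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])  # shape: (M, n_test)
    # Majority vote per test example (assumes labels are non-negative integers);
    # soft voting would average predict_proba outputs instead.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

scikit-learn also provides this pattern ready-made in sklearn.ensemble.BaggingClassifier, and soft voting simply averages the per-model class probabilities before taking the argmax.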
Random Forests
• Main idea: build a large number of un-pruned decision trees and combine their predictions.
• The trees have to be different from each other. This is achieved by using two sources of randomness:
  • Bagging: randomizing the training set.
  • Randomized node optimization (RNO): randomizing the set of features to select from in each node.
• The final prediction is obtained by either majority voting or averaging the predictions of all trees.

Sources of randomness in decision forests
• Bagging: each tree t is trained on a randomly sampled subset of the full training set.
• Randomized node optimization (RNO): instead of searching over the full set of all possible node test parameters, each node optimizes over a randomly sampled subset of features. The size of this subset acts as a randomness control parameter: using all features gives no randomness and maximum tree correlation, while sampling very few features gives maximum randomness and minimum tree correlation.

Random Forest Application
• Body part classification
[Figure: depth image with body part labels such as neck, left shoulder, right elbow, right hand]
[J. Shotton et al. Real-Time Human Pose Recognition in Parts from a Single Depth Image. CVPR 2011]

Body part classification
• Input: labeled training data (depth images of the human body).
• Tree: separate the data based on class label.
[Figure: labeled depth-image training data]
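Returning to the construction itself, here is a minimal sketch of the two sources of randomness described in the Random Forests section (not from the slides). It assumes scikit-learn, where the size of the per-node feature subset is exposed as max_features; the dataset is arbitrary.

```python
# Sketch: a random forest = bagging over trees + random feature subsets at each node.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of un-pruned trees
    bootstrap=True,        # source of randomness 1: bagging (bootstrap sample per tree)
    max_features="sqrt",   # source of randomness 2: RNO (random feature subset at each node)
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))        # forest prediction (class with highest averaged probability)
print(forest.predict_proba(X[:5]))  # averaged class probabilities of the individual trees
```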