Adaptive Basis Function Models in Machine Learning for Signals, Lecture notes of Computer Science

This lecture explores adaptive basis function models in machine learning, focusing on decision trees and boosting algorithms. It delves into the concepts of overfitting and regularization, explaining techniques like early stopping, weight decay, and bayesian priors. The lecture also covers the bias-variance trade-off and its implications for model performance. It concludes with a brief introduction to deep neural networks as adaptive basis function models.

Typology: Lecture notes

2023/2024

Uploaded on 12/11/2024

ritvik-avancha
ritvik-avancha 🇺🇸

1 document

1 / 29

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
CS 545 Machine Learning for Signals
Lecture 14: Adaptive Basis Function
Models
Minje Kim, Ph.D.
Associate Professor
Siebel School of Computer Science
https:/ /minjekim.com
minje@illinois.edu
C OM PU T ER S CI EN C E G RA IN G ER E NG I NE ER I NG
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d

Partial preview of the text

Download Adaptive Basis Function Models in Machine Learning for Signals and more Lecture notes Computer Science in PDF only on Docsity!

CS 545 Machine Learning for Signals

Lecture 14: Adaptive Basis Function

Models

M inje K im , P h.D.

S i eb e l S c ho o l o f C o m pu te r S ci e n c e^ A s s oc i a te Pr of es s o r

h tt ps : / /m i nj e k i m. c om m i nj e @i l l i n oi s. ed u

C O M P U T E R S C I E N C E G R A I N G E R E N G I N E E R I N G

Adaptive Basis Function Models

  • Really, nothing but weighted sum of features

○ Let’s go against the argument about the Kernel methods

○ Basis functions Feature transformation function

-^ In adaptive basis function models we explicitly learn this function from data Instead of using kernels

○ Adaptive basis functions? First you assume that there are M such basis functions

The basis function is parameterized and learned from data

  • The entire parameter set

2

Decision Trees

  • CART for classification

○ Fraction of the positive examples can be used to construct the posterior

MLaPP Figure 1.1, 16.2 4

4,0 shape

color size < 10 4,0 0,

ellipse (^) ot her

blue (^) red ot her yes (^) no Number of training examples per class

Decision Trees

  • Training

○ We need to decide Which feature to threshold

Where to threshold Based on the cost minimization

○ Optimization

5

Sample index Features Threshold Set of possible threshold

Decision Trees

  • Classification cost

○ Classification cost

Misclassification rate

Entropy

7

Decision Trees

  • Iris example

○ Overfitting?

MLaPP Figure 16.4, 16.5(b) 8

Decision Trees

  • Pros and cons of decision trees

○ Pros Easy to interpret (e.g. for medical diagnosis)

Can handle discrete input Robust to monotone transformation and scaling (e.g. log) Comes with feature selection Works well with large datasets Easy to handle missing variables

○ Cons There are other outperforming models

  • (^) UnstableThe greedy construction algorithm is not very optimal—small changes in the top node propagate down to the leaf nodes High variance

10

Overfitting and Regularization

  • Polynomial curve fitting

○ Minimizing an error function

Therefore,

○ Question: what’s the right order of polynomial, The more the better? M?

PRML Figure 1.2 11

Overfitting and Regularization

  • Preventing overfitting—early stopping

○ Early stopping Check on the simulated test error and stops earlier than the convergence (of the training error)

○ N- Divide the training set into N exclusive subsetsfold Cross validation: simulate the testing environment using training data

N different train Each pair is used to train a classifier and to evaluate it-validation pairs Average the N results The average shows the performance of your choice MLaPP Figure 16.5(b) 13

1 st^ fold Features^ Train TrainFrames Test Train Test Train Test Train Train

2 nd^ fold 3 rd^ fold

Overfitting and Regularization

  • Preventing overfitting—more training samples

○ Weights become larger if the model overfits

PRML Figure 1.6 14

M =0 M =1 M =3 M =

𝑤 0 ∗^ 0.19 0.82 0.31 0.

𝑤 1 ∗^ - 1.27 7.99 232.

𝑤 2 ∗^ - 25.43 - 5321.

𝑤 3 ∗^ 17.37 48568.

𝑤 4 ∗^ - 231639.

𝑤 5 ∗^ 640042.

𝑤 6 ∗^ - 1061800.

𝑤 7 ∗^ 1042400.

𝑤 8 ∗^ - 557682.

𝑤 9 ∗^ 125201.

○ A big training dataset solves the problem

Overfitting and Regularization

  • Preventing overfitting—weight decay

○ Regularization can decay the weights

PRML Figure 1.8 16

ln 𝜆 = −∞ ln 𝜆 = − 18 ln 𝜆 = 0 𝑤 0 ∗^ 0.35 0.35 0. 𝑤 1 ∗^ 232.37 4.74 - 0. 𝑤 2 ∗^ - 5321.83 - 0.77 - 0. 𝑤 3 ∗^ 48568.31 - 31.97 - 0. 𝑤 4 ∗^ - 231639.30 - 3.89 - 0. 𝑤 5 ∗^ 640042.26 55.28 - 0. 𝑤 6 ∗^ - 1061800.52 41.32 - 0. 𝑤 7 ∗^ 1042400.18 - 45.95 - 0. 𝑤 8 ∗^ - 557682.99 - 91.53 0. 𝑤 9 ∗^ 125201.43 72.68 0.

○ Regularization lets us use a complex model without worrying about overfitting

Overfitting and Regularization

  • Preventing overfitting—another takes

○ Recall SVM objective function can be seen as a combination of the hinge loss and regularization

○ Bayesian priors can work as a Maximum likelihood regularizer

Prior MAP Or to minimize PRML Figure 1.16 17

The Bias-Variance Trade-Off

  • Averaged model predictions can reduce variance

○ 100 models from 100 different

PRML Figure 3.5 19

Original function

Estimated functions

Average of the estimated functions

Too much regularization: low variance, high bias

Model averaging doesn’t help remove bias

Too little regularization: high variance, low bias

Model averaging helps remove variance

The Bias-Variance Trade-Off

  • Bootstrap aggregation (or bagging); random forests

○ In theory, if you have multiple training datasets, Train multiple complex models and average the results→ low variance and low bias

○ In practice, you don’t have multiple training datasets

○ Bootstrapping Subsample from one training dataset with replacement

○ Train m - th model from m - th bootstrap dataset

○ The M models construct a committee ( bagging )

○ Variance We hope

○ Random forests : subsample from dataset; subset of variables

20