Machine Learning: Understanding Tuning and Cross-Validation in Statistical Methods, Cheat Sheet of Machine Learning

This cheat sheet gives an overview of machine learning methods, focusing on the concepts of tuning and cross-validation. It covers various machine learning techniques, including LASSO, Ridge Regression, Elastic Net, Kernel-Based Regression, and Regression Trees. It explains how these methods use cross-validation for parameter tuning and discusses their applications in supervised learning and regression problems.

Typology: Cheat Sheet

2018/2019

Uploaded on 07/05/2022


Machine Learning Cheat Sheet

Cameron Taylor∗

November 14, 2019

Introduction

This cheat sheet introduces the basics of machine learning and how it relates to traditional econometrics. It is accessible with an intermediate background in statistics and econometrics. It is meant to show people that machine learning does not have to be hard and mysterious! Descriptions are favored over illustrations, though examples and illustrations could be added in future iterations.

Concepts

Machine learning methods are statistical methods, and if you are already quite familiar with statistical methods then much of machine learning is not mysterious. The major concept that machine learning methods require, and that traditional statistical or econometric methods do not involve, is the idea of tuning and cross-validation. In machine learning methods, it is often the case that there is some parameter λ such that the objective function (of the data and parameters) being minimized to estimate the parameters is also a function of λ. Thus, one needs to select λ before estimating the parameters. This is usually done by cross-validation, explained below.

Method Details

Cross-Validation

I cover the most popular form of cross-validation: K-fold cross-validation. It is used to find the λ that will be used in estimation. Split the data randomly into K roughly equally sized bins $B_k$. For each $k = 1, \dots, K$ and each value of the tuning parameter λ over the grid considered, compute the validation error

$$E_k(\lambda) = \sum_{i \in B_k} \left( y_i - \hat{f}_{\lambda}^{-k}(x_i) \right)^2 \qquad (1)$$

where $\hat{f}_{\lambda}^{-k}(x_i)$ is the estimate at tuning parameter λ, fitted leaving out the data in $B_k$. Then the total CV objective function is

$$CV(\lambda) = \frac{1}{N} \sum_{k=1}^{K} E_k(\lambda) \qquad (2)$$

and minimizing this over the grid is how we choose λ. When K = N this is called leave-one-out cross-validation. In practice most people choose K around 10 for computational reasons.

∗ Stanford Graduate School of Business. Email: [email protected]
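As a minimal sketch of equations (1) and (2), the following Python code selects λ for a ridge regression by K-fold cross-validation. The choice of ridge as the estimator, the omission of an intercept, the simulated data and the grid of λ values are all assumptions made for illustration; they are not part of the original cheat sheet.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (no intercept, for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv(X, y, lambdas, K=10, seed=0):
    """Return CV(lambda) for every lambda on the grid, following (1)-(2)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # Random split into K roughly equally sized bins B_k.
    bins = np.array_split(rng.permutation(n), K)
    cv = np.zeros(len(lambdas))
    for m, lam in enumerate(lambdas):
        for B_k in bins:
            train = np.setdiff1d(np.arange(n), B_k)
            beta = ridge_fit(X[train], y[train], lam)        # fit leaving out B_k
            cv[m] += np.sum((y[B_k] - X[B_k] @ beta) ** 2)   # E_k(lambda)
    return cv / n

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(size=200)

lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
cv = kfold_cv(X, y, lambdas)
best_lambda = lambdas[np.argmin(cv)]
print(best_lambda)
```

The selected λ would then be used to re-estimate the model on the full sample.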

Supervised Learning and Regression

These are problems concerning estimating E[Y |X].

  • OLS: Standard, many tools and resources for this.
  • LASSO: Start from OLS but penalize the total absolute size of the coefficients, which in practice pushes many of them to exactly 0. The estimator is

$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

where p is the total number of variables, not counting the constant. This is the L1 penalty. λ is chosen by cross-validation; 10-fold cross-validation is the usual choice.

  • Ridge: Start from OLS but penalize the squared size of the coefficients, which shrinks them toward 0 without setting them exactly to 0. The estimator is

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where p is the total number of variables, not counting the constant. This is the L2 penalty.
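To make the L1 and L2 penalties concrete, here is a small sketch that fits both estimators on simulated data. It assumes centered data (so no constant is needed), uses cyclic coordinate descent for LASSO, which is one common algorithm and not necessarily the one the author has in mind, and uses the closed form for ridge. The data and the value λ = 200 are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for sum (y - X b)^2 + lam * sum |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes the contribution of variable j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # The threshold is lam / 2 because the squared loss has no 1/2 factor.
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_ss[j]
    return beta

def ridge(X, y, lam):
    """Closed form for sum (y - X b)^2 + lam * sum b_j^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated, centered data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + rng.normal(size=200)
y = y - y.mean()

b_lasso = lasso_cd(X, y, lam=200.0)
b_ridge = ridge(X, y, lam=200.0)
print(np.round(b_lasso, 3))
print(np.round(b_ridge, 3))
```

With a penalty this large, the LASSO coefficients on the irrelevant variables are set exactly to 0, while ridge shrinks every coefficient but leaves all of them non-zero.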

  • LASSO vs. Ridge: The difference is hard vs. soft thresholding (Statistical Learning Book

where $h_m : \mathbb{R}^p \to \mathbb{R}$ are transformations. When we choose the $h_m$ in a specific way we call the result splines. In particular, splines have knots at a vector of points ξ in the feature space that allow us to interpolate continuously between the different regions defined by the knots. A common choice is the natural cubic spline, which with K knots has the basis functions

$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X), \quad k = 1, \dots, K - 2$$

where

$$d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}.$$

The natural cubic spline is nice because it mitigates boundary bias (bias at the boundaries of the feature space or X’s) by forcing the estimated function to be linear beyond the boundary knots. Note that these procedures have many tuning parameters: the degree of the splines, the number of knots and where to place the knots. Multivariate Adaptive Regression Splines address this problem by greedily selecting the knots in a fashion similar to regression trees.
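The natural cubic spline basis functions above can be computed directly. The sketch below is for a single feature; the function name `natural_cubic_basis` and the example knots are assumptions made for illustration.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Columns N_1, ..., N_K of the natural cubic spline basis for a 1-D feature."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(knots, dtype=float)
    K = len(xi)

    def d(k):
        # d_k(X) with a 0-based knot index k; xi[K - 1] is the last knot.
        num = np.maximum(x - xi[k], 0.0) ** 3 - np.maximum(x - xi[K - 1], 0.0) ** 3
        return num / (xi[K - 1] - xi[k])

    cols = [np.ones_like(x), x]          # N_1(X) = 1 and N_2(X) = X
    d_last = d(K - 2)                    # d_{K-1} in the 1-based notation of the text
    for k in range(K - 2):               # k = 1, ..., K-2 in 1-based notation
        cols.append(d(k) - d_last)       # N_{k+2}(X)
    return np.column_stack(cols)

# Example with four hypothetical knots.
knots = [0.0, 1.0, 2.0, 3.0]
x = np.linspace(-1.0, 5.0, 13)
B = natural_cubic_basis(x, knots)
print(B.shape)
```

Regressing y on the columns of `B` by OLS gives the spline fit. Because every basis function is linear to the right of the last knot and to the left of the first, so is any fitted function.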

  • Additive Models: The generalized additive model is given by

$$\hat{f}^{additive}(X) = \sum_{j=1}^{p} f_j(X_j) \qquad (8)$$

and we aim to minimize the sum of squared errors. This model can be fit with the backfitting algorithm which consists of iteratively applying a smoother (e.g. kernel regression, local linear, etc.) to each dimension j and updating. Thus we do, in some senses, parameterize this model by picking smoothers to estimate each separate fj.

  • Regression Tree: Regression trees consist of splitting the feature space into regions and predicting a constant for the dependent variable in each region. The regression tree adaptively selects both the variable/feature j and a split point s in that variable to minimize the sum of squared errors, predicting at the mean in each region. Cross-validation is done on the number of splits J we allow. Pruning: To make sure to pick up important relationships, it is often helpful to grow a “big” tree first and then prune the tree by removing splits to optimize the cross-validation criterion. Bump Hunting/PRIM: For classification, one can adapt regression-tree-like methods to find modes/maximums of the dependent variable in the feature space.
  • Random Forest: Random forests add smoothness to regression trees by introducing randomness in two ways. Consider applying a regression tree multiple times on bootstrapped samples of the data, where at each split we consider only a random subset of the independent variables to split on. The resulting random forest estimate is the average over these bootstrapped regression trees.

  • Boosting: The general idea behind boosting is to apply a simple, even naive, method iteratively to improve the fit. At each iteration the method is applied to the residuals from the previous iterations. At its most basic level boosting fits an additive model. The number of boosting iterations M is a tuning parameter here. In particular, at any iteration m the following is fitted

$$\min_{\gamma_m} \sum_{i} L\left( y_i, \hat{G}_{m-1}(X_i) + g(X_i; \gamma_m) \right) \qquad (9)$$

where L is the loss function, $\hat{G}_0(x) \equiv 0$ and g(·) is the simple method used at each boosting step (e.g. a single-split tree). One can weight the resulting $g_m$ by a learning rate in the overall additive prediction. Tuning usually involves: the number of boosts/iterations to apply, some CV on the simple method (e.g. how many splits in the tree), and the learning rate. AdaBoost: A particularly popular example of a boosting algorithm for classification. It fits a classification method and then weights each classifier (in the overall additive method) based on the computed error of that iterative step. It uses exponential loss. Many different loss functions can be considered for boosting. The difference between AdaBoost and Gradient Boosting is that in AdaBoost a single weak learner is applied sequentially, and the only thing that varies among iterates is the weighting of the different data points based on previous error. The final prediction function is a weighted sum of all the weak learners (as in Gradient Boosting). The weights are varied to adapt to “more difficult cases”.
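A minimal sketch of the boosting iteration (9), assuming squared-error loss, a single feature, a single-split tree (a "stump") as the simple method g(·), and a constant learning rate. The function names, the simulated data, and the values M = 100 and learning rate 0.1 are hypothetical choices for illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Best single split of a 1-D feature for squared error on the residuals r."""
    best = (np.inf, None, 0.0, 0.0)
    for s in np.unique(x)[:-1]:              # candidate split points
        left = x <= s
        c_left, c_right = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - c_left) ** 2) + np.sum((r[~left] - c_right) ** 2)
        if sse < best[0]:
            best = (sse, s, c_left, c_right)
    return best[1:]                          # (split point, left mean, right mean)

def predict_stump(stump, x):
    s, c_left, c_right = stump
    return np.where(x <= s, c_left, c_right)

def boost(x, y, M=100, learning_rate=0.1):
    """Start from G_0 = 0 and repeatedly fit a stump to the current residuals."""
    G = np.zeros_like(y)
    stumps = []
    for _ in range(M):
        stump = fit_stump(x, y - G)          # with squared loss, (9) is a fit to residuals
        G = G + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return stumps, G

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

stumps, G = boost(x, y)
print(np.mean((y - G) ** 2))
```

In practice M, the learning rate and the complexity of the simple method would be chosen by cross-validation rather than fixed in advance.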

  • Neural Nets: Neural nets are large parametric models. They consist of an input layer, hidden layers with nodes in each layer, and an output layer. Each node is a linear function of terms that are non-linear transformations g(·) of nodes in the previous layer. So if, for example, g(·) is linear, then the network collapses to a linear model, just with very many parameters. Little asymptotic theory is available; the value lies in the ability to estimate models with thousands of parameters. To estimate, regularize the linear parameters for each node, and pick the regularization penalty through CV.

Supervised Learning and Classification

These are problems concerning estimating P (Y |X) or alternatively classifying Y into buckets given X.

then we pick the hyperplane that “puts the biggest distance between the classes”. This is useful for out-of-sample predictions and generalizing the model. The method is called support vectors because the parameter estimates forming the hyperplane depend on only a few points (vectors) that are pivotal for forming the hyperplane. In particular

$$\hat{\beta} = \sum_{i \in S} \hat{\alpha}_i x_i$$

where S is the set of support vectors. In general, we can write the problem as

$$\min_{\alpha} \sum_{i=1}^{N} \left[ 1 - y_i \left( \alpha_0 + \sum_{j=1}^{N} \alpha_j x_j^T x_i \right) \right]_+ + \lambda \sum_{i,j} \alpha_i x_i^T x_j \alpha_j \qquad (11)$$

where λ is a regularization parameter and $a_+ = 1\{a > 0\} \cdot a$. We can make the classifier more flexible by replacing $x_j^T x_i$ in the estimation problem with kernel functions $K(x_j, x_i)$ to allow for non-linear interactions.
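The objective (11) is easy to evaluate once the inner products are collected in a matrix K with entries $x_j^T x_i$ (or kernel values). The sketch below evaluates it and runs a plain subgradient descent, keeping the best iterate found. Subgradient descent, the step size, the labels coded as ±1 and the simulated data are assumptions for illustration; dedicated SVM solvers use more refined algorithms.

```python
import numpy as np

def svm_objective(alpha0, alpha, K, y, lam):
    """Hinge loss plus penalty, as in (11); K[i, j] is the inner product or kernel value."""
    f = alpha0 + K @ alpha
    hinge = np.maximum(1.0 - y * f, 0.0)
    return hinge.sum() + lam * alpha @ K @ alpha

def svm_subgradient_descent(K, y, lam, step=1e-3, n_iter=2000):
    """Plain subgradient descent on (11); returns the best (objective, alpha0, alpha) seen."""
    n = len(y)
    alpha0, alpha = 0.0, np.zeros(n)
    best = (svm_objective(alpha0, alpha, K, y, lam), alpha0, alpha.copy())
    for _ in range(n_iter):
        active = y * (alpha0 + K @ alpha) < 1.0          # points with positive hinge loss
        g_alpha = -K[:, active] @ y[active] + 2.0 * lam * K @ alpha
        g_0 = -np.sum(y[active])
        alpha = alpha - step * g_alpha
        alpha0 = alpha0 - step * g_0
        obj = svm_objective(alpha0, alpha, K, y, lam)
        if obj < best[0]:
            best = (obj, alpha0, alpha.copy())
    return best

# Two simulated classes with labels +1 and -1 (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(20, 2)), rng.normal(-2.0, 1.0, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
lam = 0.1

K = X @ X.T                      # linear kernel; swap in another kernel matrix for flexibility
obj, alpha0, alpha = svm_subgradient_descent(K, y, lam)
predictions = np.sign(alpha0 + K @ alpha)
print(obj)
```

Replacing `K = X @ X.T` with a matrix of kernel evaluations is all that is needed to move from the linear classifier to the kernelized one.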

Unsupervised Learning

These are problems concerning grouping X’s with no dependent variable.

  • Principal Components: Provides a sequence of best linear approximations to the data by essentially fitting a q-dimensional hyperplane to explain the variance in X. A modern alternative for non-negative data is matrix factorization.
  • k-means clustering: The goal is to categorize each data point $x_i$ into one of K clusters, $C(i) \in \{1, \dots, K\}$. Suppose that the X’s are quantitative. Define the dissimilarity measure between two points $x_i$ and $x_j$ as

$$d(x_i, x_j) = \|x_i - x_j\|^2$$

Then to estimate, consider the within-cluster scatter for a given dissimilarity measure

$$W(C) = \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)$$

In this case the objective simplifies nicely and can be optimized by a simple zig-zag algorithm: we pick the means of the clusters and then categorize points by nearest mean, iterating the process. We can generalize d(·, ·) to avoid issues with outliers and large quantitative values; the estimation algorithm still works well. Note that we cannot do CV on K here; we must choose it. One way to choose it is to plot $\log(W_K)$ as a function of K and look for a “gap”. We can also do hierarchical clustering to avoid having to pick K altogether.
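The zig-zag algorithm for k-means can be sketched in a few lines: alternate between assigning each point to its nearest mean and recomputing the means. The random initialization from data points, the stopping rule and the simulated data are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest mean and updating the means."""
    rng = np.random.default_rng(seed)
    # Initialize the means at K distinct, randomly chosen data points.
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every mean.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each mean; keep the old one if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated simulated groups (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)), rng.normal(10.0, 0.5, size=(50, 2))])
labels, centers = kmeans(X, K=2)
print(centers)
```

The result generally depends on the starting values, so in practice the algorithm is often run from several random starts and the solution with the smallest within-cluster scatter is kept.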

  • Multidimensional Scaling: Given data or even just a dissimilarity index $d_{ij}$, look for a low-dimensional representation of the data $z = (z_1, \dots, z_n)$ by minimizing the stress function

$$S_M(z) = \sum_{i \neq j} \left( d_{ij} - \|z_i - z_j\| \right)^2$$

(alternative stress functions exist). The idea is that the low-dimensional representation will still preserve the distances between the points. The method has recently been extended to non-linear setups.

  • Google PageRank: Consider pages that link to each other. Let $L_{ij} = 1$ if page j links to page i, and let $c_j = \sum_i L_{ij}$ be the total number of links going out of page j. Then PageRank is defined recursively by

$$p_i = (1 - d) + d \sum_{j=1}^{N} \frac{L_{ij}}{c_j} p_j \qquad (12)$$

where the idea is that a page is important if important pages link to it. In matrix form this can be written as

$$p = (1 - d)e + d L D_c^{-1} p = \left[ (1 - d) e e^T / N + d L D_c^{-1} \right] p = A p$$

where e is a vector of ones, $D_c = \mathrm{diag}(c)$, and the second equality uses the normalization that the average PageRank is 1 (so $e^T p = N$). Then A has largest real eigenvalue 1, so we can use the power method to find the fixed point. The reason this works is that A defines an irreducible aperiodic Markov chain.
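A short sketch of the power method for PageRank, following the matrix form above. It assumes every page has at least one outgoing link (so that all $c_j > 0$) and uses d = 0.85, a conventional value that the text does not specify. The three-page link structure is a made-up example.

```python
import numpy as np

def pagerank(L, d=0.85, tol=1e-10, max_iter=1000):
    """Power method for p = A p, normalized so that the average PageRank is 1."""
    N = L.shape[0]
    c = L.sum(axis=0)                       # c_j: number of links going out of page j
    if np.any(c == 0):
        raise ValueError("this sketch assumes every page has an outgoing link")
    # A = (1 - d) e e^T / N + d L D_c^{-1}; L / c divides column j by c_j.
    A = (1.0 - d) * np.ones((N, N)) / N + d * L / c
    p = np.ones(N)
    for _ in range(max_iter):
        p_new = A @ p
        p_new *= N / p_new.sum()            # keep the average PageRank equal to 1
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return p_new

# Hypothetical web of three pages: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# L[i, j] = 1 if page j links to page i.
L = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
p = pagerank(L)
print(np.round(p, 3))
```

Page 2 receives links from both other pages and so ends up with a higher PageRank than page 1, which only receives half of page 0's vote.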

  • Generative Adversarial Networks: Suppose that we have real observations with empirical distribution $\hat{F}_X(\cdot)$ in X. Then consider some noise distribution $F_Z(\cdot)$ (e.g. multivariate normal) in Z. We require a generator $g_\theta : Z \to X$; for example, this can be a neural network or another flexible model. Finally we require a discriminator/critic to tell fake and real data apart; this can also be a neural network or other flexible classifier. The goal of a GAN is to find a θ such that $g_\theta(Z) \sim \hat{F}_X$ according to the discriminator/critic. We can generally write this as a min-max problem, where the inner maximization fits the classifier and the outer minimization chooses the generator so that it is very difficult to distinguish between the two data generating processes. We can then estimate by alternating between updating the classifier and using gradient descent to update the generator. In general, often simplify the critic/discriminator step to some measure between probability