





An overview of machine learning methods, focusing on the concepts of tuning and cross-validation. It covers various machine learning techniques, including LASSO, Ridge Regression, Elastic Net, Kernel-Based Regression, and Regression Trees. It explains how these methods use cross-validation for parameter tuning and discusses their applications in supervised learning and regression problems.






This cheat sheet introduces the basics of machine learning and how it relates to traditional econometrics. It is accessible to readers with an intermediate background in statistics and econometrics. It is meant to show people that machine learning does not have to be hard and mysterious! Descriptions are favored over illustrations, though examples and illustrations could be added in future iterations.
Machine learning methods are statistical methods, and if you know statistical methods well, much of machine learning is not mysterious. The major concept needed for machine learning methods that is not part of traditional statistical or econometric methods is the idea of tuning and cross-validation. In machine learning methods, there is often some parameter $\lambda$ such that the objective function (of the data and parameters) minimized to estimate the parameters is also a function of $\lambda$. Thus, one needs to select $\lambda$ before estimating the parameters. This is usually done by cross-validation, explained below.
∗Stanford Graduate School of Business. Email: [email protected]

I cover the most popular form of cross-validation: K-fold cross-validation. It serves as an input into finding the $\lambda$ that will be used in estimation. Split the data randomly into $K$ roughly equally sized bins $B_k$. For each $k = 1, \ldots, K$ and each value $\lambda$ of the tuning parameter over the grid considered, compute the validation error

$$E_k(\lambda) = \sum_{i \in B_k} \left(y_i - \hat{f}_\lambda^{-k}(x_i)\right)^2 \tag{1}$$

where $\hat{f}_\lambda^{-k}(x_i)$ is the estimate at tuning parameter $\lambda$ leaving out the data in $B_k$. Then the total CV objective function is

$$CV(\lambda) = \frac{1}{N} \sum_{k=1}^{K} E_k(\lambda) \tag{2}$$

and minimizing this is how we choose $\lambda$. When $K = N$ this is called leave-one-out cross-validation. In practice most people choose $K$ around 10 for computational reasons.
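As a minimal sketch of the procedure in equations (1) and (2), here is K-fold cross-validation over a small grid of tuning values. It assumes a one-regressor ridge fit with no intercept, chosen only because it has a closed form; the names `ridge_slope` and `cv_error` and the simulated data are illustrative assumptions, not part of the original.

```python
import random

def ridge_slope(xs, ys, lam):
    """Closed-form ridge slope for one regressor, no intercept (assumed model)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, K=5, seed=0):
    """CV(lam): validation errors E_k(lam) summed over the K bins, divided by N."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random assignment to bins
    folds = [idx[k::K] for k in range(K)]     # K roughly equally sized bins B_k
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b = ridge_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        total += sum((ys[i] - b * xs[i]) ** 2 for i in fold)   # E_k(lam)
    return total / n

# Hypothetical data: y = 2x plus noise
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
```

Setting `K=len(xs)` in `cv_error` gives leave-one-out cross-validation.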
These are problems concerning estimating $E[Y \mid X]$.
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

where $p$ is the total number of variables, not counting the constant. This is the L1 penalty. $\lambda$ is chosen by cross-validation; 10-fold cross-validation is usually preferred.
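One common way to minimize the lasso objective is cyclic coordinate descent with soft-thresholding. The sketch below is a minimal version under my own assumptions: no intercept, plain Python lists, and a fixed number of sweeps (`n_iter`). Because the squared-error term here carries no factor of 1/2, the threshold is $\lambda/2$. The function names and the small orthogonal design are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - X_i beta)^2 + lam * sum_j |beta_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the residual that leaves coordinate j out
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2) / z
    return beta

# Hypothetical orthogonal design: y depends mostly on the first column
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]
beta_lasso = lasso_cd(X, y, lam=2.0)   # second coefficient is set exactly to zero
beta_ols = lasso_cd(X, y, lam=0.0)     # no penalty: least squares
```

The L1 penalty both shrinks the first coefficient and zeroes out the second, which is the variable-selection behavior that distinguishes lasso from ridge.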
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

where $p$ is the total number of variables, not counting the constant. This is the L2 penalty.
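Unlike lasso, the ridge problem has a closed-form solution, $(X^T X + \lambda I)^{-1} X^T y$. The sketch below solves that linear system with Gauss-Jordan elimination in plain Python. It assumes no intercept (or data that are already centered); the function name `ridge` and the example data are hypothetical.

```python
def ridge(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y by Gauss-Jordan elimination."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X + lam*I | X'y]
    A = [
        [sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(p)]
        + [sum(X[i][a] * y[i] for i in range(n))]
        for a in range(p)
    ]
    for col in range(p):
        # partial pivoting: bring the largest remaining entry into the pivot row
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[j][p] / A[j][j] for j in range(p)]

# Hypothetical orthogonal design
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.1, 2.9, -2.9, -3.1]
beta_ridge = ridge(X, y, lam=4.0)   # both coefficients shrink, neither is exactly zero
```

On this design the L2 penalty halves both least-squares coefficients without setting either to zero.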
where $h_m : \mathbb{R}^p \to \mathbb{R}$ are transformations. When we choose the $h_m$ in a specific way we call the method splines. In particular, splines have knots at a vector of points $\xi$ in the feature space that allow us to interpolate continuously between the different regions defined by the knots. A common spline choice is the natural cubic spline, which with $K$ knots has basis functions:

$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X)$$

where

$$d_k(X) = \frac{(X - \xi_k)_+^3 - (X - \xi_K)_+^3}{\xi_K - \xi_k}.$$

The natural cubic spline is nice because it mitigates boundary bias (bias at the boundaries of the feature space, i.e. of the X's) by forcing the estimated function to be linear beyond the boundary knots. Note that these procedures have many tuning parameters: the degree of the splines, the number of knots, and where to place the knots. Multivariate Adaptive Regression Splines address this problem by greedily selecting the knots in a fashion similar to regression trees.
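To see the "linear beyond the boundary knots" property concretely, the sketch below evaluates the natural cubic spline basis at a scalar point. It assumes a single feature and sorted knots; the function name `natural_cubic_basis` and the knot values are illustrative.

```python
def natural_cubic_basis(x, knots):
    """Evaluate N_1, ..., N_K at scalar x for sorted knots xi_1 < ... < xi_K."""
    K = len(knots)

    def pos3(u):
        return max(u, 0.0) ** 3               # (u)_+^3

    def d(k):                                 # k is a 1-based knot index
        return (pos3(x - knots[k - 1]) - pos3(x - knots[K - 1])) / (knots[K - 1] - knots[k - 1])

    basis = [1.0, x]                          # N_1 and N_2
    for k in range(1, K - 1):                 # k = 1..K-2 gives N_3..N_K
        basis.append(d(k) - d(K - 1))
    return basis

knots = [0.0, 1.0, 2.0, 3.0]                  # hypothetical knots
left = natural_cubic_basis(-1.0, knots)       # below the first knot
b4, b5, b6 = (natural_cubic_basis(v, knots) for v in (4.0, 5.0, 6.0))
```

Below the first knot the cubic terms vanish, and above the last knot each basis function changes by a constant amount per unit step, so any linear combination is linear in both tails.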
$$\hat{f}^{\text{additive}}(X) = \sum_{j=1}^{p} f_j(X_j) \tag{8}$$
and we aim to minimize the sum of squared errors. This model can be fit with the backfitting algorithm, which consists of iteratively applying a smoother (e.g. kernel regression, local linear regression, etc.) to each dimension $j$ and updating. Thus we do, in some sense, parameterize this model by picking smoothers to estimate each separate $f_j$.
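A minimal sketch of backfitting, assuming a Gaussian-kernel (Nadaraya-Watson) smoother with a fixed bandwidth `h` for every dimension. The intercept `alpha` and the centering of each component are my additions for identifiability; they do not appear in equation (8). All names and the simulated data are hypothetical.

```python
import math
import random

def kernel_smooth(x, r, h=0.5):
    """Nadaraya-Watson fit of r on x, evaluated at the sample points."""
    fitted = []
    for x0 in x:
        w = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return fitted

def backfit(X, y, n_iter=20, h=0.5):
    n, p = len(X), len(X[0])
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(p)]         # f[j][i] holds f_j(X_ij)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove the intercept and all other components
            r = [y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                 for i in range(n)]
            fj = kernel_smooth([row[j] for row in X], r, h)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # center each component
    return alpha, f

# Hypothetical additive data: y = sin(x1) + x2^2 + noise
rng = random.Random(0)
X = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(60)]
y = [math.sin(a) + b * b + rng.gauss(0, 0.1) for a, b in X]
alpha, f = backfit(X, y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(len(y))]
```

Swapping `kernel_smooth` for another one-dimensional smoother changes the estimate of each $f_j$ without changing the outer loop.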
samples of the data and where, at each step, we consider only a random subset of the independent variables to split on. Then the resulting random forest estimate is the average over these bootstrapped regression trees.
$$\min_{\gamma_m} \sum_{i} L\left(y_i, \hat{G}_{m-1}(X_i) + g(X_i; \gamma_m)\right) \tag{9}$$

where $L$ is the loss function, $\hat{G}_0(x) \equiv 0$ and $g(\cdot)$ is the single method applied at each boosting step (e.g. a single-split tree). One can weight each resulting $g_m$ by a step-$m$ learning rate in the overall additive prediction. Tuning usually involves: the number of boosts/iterations to apply, some CV on the simple method (e.g. how many splits in the tree), and the learning rate.

AdaBoost: A particularly popular example of a boosting algorithm for classification. This fits a classification method and then weights each classifier (in the overall additive method) based on the computed error of that iterative step. It uses exponential loss. Many different loss functions can be considered for boosting. The difference between AdaBoost and Gradient Boosting is that in AdaBoost a single weak learner is applied sequentially, and the only thing that varies across iterations is the weighting of the different data points for prediction, based on previous error. The final prediction function is a weighted sum of all the weak learners (as in Gradient Boosting). The weights are varied to adapt to “more difficult cases”.
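Returning to the generic boosting step (9): with squared-error loss (my assumption here; the text leaves the loss general), minimizing over the step's parameters amounts to fitting the simple method to the current residuals. The sketch below does this with single-split trees on one feature and a fixed learning rate `rate`. The names `fit_stump` and `boost` and the step-function data are hypothetical.

```python
def fit_stump(x, r):
    """Best single-split tree for residuals r on one feature x (squared loss)."""
    best = None
    for s in sorted(set(x))[:-1]:             # candidate split points
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - cl) ** 2 for ri in left) + sum((ri - cr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, cl, cr)
    _, s, cl, cr = best
    return lambda v: cl if v <= s else cr

def boost(x, y, n_rounds=50, rate=0.1):
    stumps = []
    pred = [0.0] * len(x)                     # G_0(x) = 0
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        g = fit_stump(x, resid)               # step (9) under squared loss
        stumps.append(g)
        pred = [pi + rate * g(xi) for pi, xi in zip(pred, x)]
    return lambda v: sum(rate * g(v) for g in stumps)

# Hypothetical data: a step function at x = 1
x = [i / 10 for i in range(20)]
y = [0.0 if v < 1.0 else 1.0 for v in x]
model = boost(x, y)
```

With `rate=0.1` each round closes a tenth of the remaining gap, which is why the number of rounds and the learning rate are tuned together.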
These are problems concerning estimating $P(Y \mid X)$ or, alternatively, classifying $Y$ into buckets given $X$.
then we pick the hyperplane that “puts biggest distance between the classes”. This is useful for out-of-sample predictions and generalizing the model. The method is called support vectors because the parameter estimates forming the hyperplane depend on only a few points (vectors) that are pivotal for forming the hyperplane. In particular $\hat{\beta} = \sum_{i \in S} \hat{\alpha}_i x_i$ where $S$ is the set of support vectors. In general, we can write the problem as

$$\min_{\alpha} \sum_{i=1}^{N} \left[1 - y_i\left(\alpha_0 + \sum_{j=1}^{N} \alpha_j x_j^T x_i\right)\right]_+ + \lambda \sum_{i,j} \alpha_i x_i^T x_j \alpha_j \tag{11}$$

where $\lambda$ is a regularization parameter and $a_+ = \mathbf{1}\{a > 0\} \cdot a$. We can make the classifier more flexible by replacing $x_j^T x_i$ in the estimation problem with kernel functions $K(x_j, x_i)$ to allow for non-linear interactions.
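To make objective (11) concrete, the sketch below only evaluates it for given coefficients; it does not solve the minimization. It assumes labels coded as -1 and +1, and the `kernel` argument shows where a kernel function replaces the inner product. The function names, the RBF kernel choice and the two-point data set are illustrative assumptions.

```python
import math

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_objective(alpha0, alpha, X, y, lam, kernel=linear_kernel):
    """Hinge loss plus lam times the quadratic penalty, as in (11)."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    hinge = 0.0
    for i in range(n):
        f_i = alpha0 + sum(alpha[j] * K[j][i] for j in range(n))
        hinge += max(0.0, 1.0 - y[i] * f_i)   # a_+ = 1{a > 0} * a
    penalty = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
    return hinge + lam * penalty

# Hypothetical two-point problem, labels in {-1, +1}
X = [[1.0], [-1.0]]
y = [1, -1]
obj_fit = svm_objective(0.0, [0.5, -0.5], X, y, lam=0.1)    # both margins equal 1
obj_zero = svm_objective(0.0, [0.0, 0.0], X, y, lam=0.1)    # hinge loss of 1 per point
obj_rbf = svm_objective(0.0, [0.5, -0.5], X, y, lam=0.1, kernel=rbf_kernel)
```

With the linear kernel the first coefficient vector classifies both points with margin exactly 1, so only the penalty term remains.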
These are problems concerning grouping X’s with no dependent variable.
$$d(x_i, x_j) = \|x_i - x_j\|^2$$

Then to estimate, consider the within-cluster scatter for a given dissimilarity measure:

$$\sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} d(x_i, x_j)$$
In this case the problem simplifies nicely and has a nice zig-zag algorithm to optimize it: we pick the means of the clusters and then categorize the points, iterating the process. We can generalize $d(\cdot, \cdot)$ to avoid issues with outliers and large quantitative values; the estimation algorithm still works well. Note that we cannot do CV on $K$ here; we must choose it. One way to choose it is to plot $\log(W_K)$, where $W_K$ is the within-cluster scatter with $K$ clusters, as a function of $K$ and look for a “gap”. We can also do hierarchical clustering to avoid having to pick $K$ altogether.
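The zig-zag iteration described above can be sketched as follows, assuming squared Euclidean dissimilarity and user-supplied starting centers. The function name `kmeans`, the handling of empty clusters and the toy points are my own choices.

```python
def kmeans(points, centers, n_iter=20):
    """Alternate between assigning points to centers and recomputing means."""
    for _ in range(n_iter):
        # assignment step: nearest center in squared Euclidean distance
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # update step: each center becomes the mean of its cluster
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[k])   # keep the center of an empty cluster
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

The result depends on the starting centers, so in practice one would typically rerun from several starts and keep the solution with the smallest within-cluster scatter.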
$$S_M(z) = \sum_{i \neq j} \left(d_{ij} - \|z_i - z_j\|\right)^2$$

with alternatives. The idea is that the low-dimensional representation will still preserve the distances between the points. This has recently been extended to non-linear setups.
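As a small illustration, the sketch below evaluates the stress function for two candidate one-dimensional representations of three collinear points in the plane. It only evaluates the criterion and does not minimize it; the function name `stress` and the example points are hypothetical.

```python
import math

def stress(d, z):
    """Sum over ordered pairs i != j of (d_ij - ||z_i - z_j||)^2."""
    n = len(z)
    return sum(
        (d[i][j] - math.dist(z[i], z[j])) ** 2
        for i in range(n) for j in range(n) if i != j
    )

# Three collinear points in the plane and their pairwise distances
x = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = [[math.dist(a, b) for b in x] for a in x]

z_good = [[0.0], [5.0], [10.0]]   # preserves every pairwise distance
z_bad = [[0.0], [5.0], [5.0]]     # collapses the last two points
s_good = stress(d, z_good)
s_bad = stress(d, z_bad)
```

Because the points lie on a line, a one-dimensional representation can reproduce every distance, while the collapsed representation is penalized for the two distances it distorts.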
$c_j = \sum_i L_{ij}$ be total links. Then page rank is defined recursively by

$$p_i = (1 - d) + d \sum_{j=1}^{N} \frac{L_{ij}}{c_j} p_j \tag{12}$$
where the idea is that a page is important if important pages link to it. In matrix form this can be written as
$$p = (1 - d)e + dLD_c^{-1}p = \left[(1 - d)ee^T/N + dLD_c^{-1}\right]p = Ap$$

where the second equality uses the normalization that the average page rank is 1 (so $e^T p = N$). Then $A$ has largest real eigenvalue 1, so we can use the power method to find the fixed point. The reason this works is that $A$ defines an irreducible aperiodic Markov chain.
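A minimal sketch of the fixed-point iteration for equation (12). It assumes the usual convention that `L[i][j] = 1` when page `j` links to page `i`, that every page has at least one outgoing link, and a damping value `d = 0.85`; none of these are stated in the text above. The function name and the three-page web are hypothetical.

```python
def pagerank(L, d=0.85, n_iter=100):
    """Iterate p <- (1 - d) + d * L D_c^{-1} p, starting from average rank 1."""
    n = len(L)
    c = [sum(L[i][j] for i in range(n)) for j in range(n)]   # c_j: outgoing links of j
    p = [1.0] * n
    for _ in range(n_iter):
        p = [(1 - d) + d * sum(L[i][j] / c[j] * p[j] for j in range(n))
             for i in range(n)]
    return p

# Hypothetical web: page 0 -> 2, page 1 -> 2, page 2 -> 0
L = [
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
p = pagerank(L)
```

Each iteration keeps the ranks summing to the number of pages, so the average rank stays at 1, and page 1, which nobody links to, ends up with the minimum rank `1 - d`.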