General Optimization Linear Algorithms, Study notes of Machine Learning

2021/2022

Uploaded on 07/05/2022


MACHINE LEARNING CHEATSHEET

Summary of machine learning algorithm descriptions, advantages and use cases. Inspired by the very good book and articles of MachineLearningMastery, with added math, and the ML Pros & Cons of HackingNote. Design inspired by The Probability Cheatsheet of W. Chen. Written by Rémi Canard.

General

Definition
We want to learn a target function f that maps input variables X to an output variable Y, with an error e:
$$Y = f(X) + e$$

Linear, Nonlinear
Different algorithms make different assumptions about the shape and structure of f, hence the need to test several methods. Any algorithm can be either:

  • Parametric (or Linear): simplifies the mapping to a known linear combination form and learns its coefficients.
  • Nonparametric (or Nonlinear): free to learn any functional form from the training data, while maintaining some ability to generalize.

Linear algorithms are usually simpler, faster and require less data, while nonlinear algorithms are more flexible, more powerful and often more performant.

Supervised, Unsupervised
Supervised learning methods learn to predict Y from X given that the data is labeled. Unsupervised learning methods learn to find the inherent structure of unlabeled data.

Bias-Variance trade-off
In supervised learning, the prediction error e is composed of the bias, the variance and the irreducible part. Bias refers to the simplifying assumptions made to learn the target function easily. Variance refers to the sensitivity of the model to changes in the training data. The goal of parameterization is to achieve a trade-off with low bias (underlying pattern not too simplified) and low variance (not sensitive to the specificities of the training data).

Underfitting, Overfitting
In statistics, fit refers to how well the target function is approximated. Underfitting refers to poor inductive learning from the training data and poor generalization. Overfitting refers to learning the detail and noise of the training data, which leads to poor generalization. It can be limited by using resampling and setting aside a validation dataset.
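As a rough illustration of underfitting and overfitting, the sketch below compares training and validation error for three toy models on made-up data. The data, the 40/20 split and the helper names (`fit_mean`, `fit_line`, `fit_memorizer`) are assumptions chosen for this example only.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical noisy data: y = 2x + 1 + e, with e ~ N(0, 1)
xs = [random.uniform(0, 10) for _ in range(60)]
data = [(x, 2 * x + 1 + random.gauss(0, 1)) for x in xs]
train, valid = data[:40], data[40:]  # hold out a validation set


def mse(model, rows):
    """Mean squared error of a model (a function x -> prediction)."""
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)


def fit_mean(rows):
    """High-bias model: always predicts the mean of the training outputs."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Simple linear regression via the closed-form slope and intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: mean_y + slope * (x - mean_x)


def fit_memorizer(rows):
    """High-variance model: returns the output of the closest training point."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, valid))
    print(f"{name:10s} train MSE={results[name][0]:.2f}  validation MSE={results[name][1]:.2f}")
```

The constant model is too simple and has a high error on both sets, while the memorizer has zero training error but a worse validation error than its training error suggests. Comparing the two columns is what the validation dataset is for.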

Optimization

Almost every machine learning method has an optimization algorithm at its core.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
Procedure:
→ Initialization: $\theta = 0$ (coefficients set to 0 or to random values)
→ Calculate cost: $J(\theta) = evaluate(f(coefficients))$
→ Gradient of cost: $\frac{\partial}{\partial \theta_j} J(\theta)$ (we know the uphill direction)
→ Update coefficients: $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (we go downhill)
The update process is repeated until convergence (minimum found).

Batch Gradient Descent sums or averages the cost over all the observations before each update. Stochastic Gradient Descent applies the parameter update for each observation.

Tips:

  • Change the learning rate $\alpha$ ("size of jump" at each iteration)
  • Plot cost vs. time to assess learning rate performance
  • Rescale the input variables
  • Reduce the number of passes through the training set with SGD
  • Average over 10 or more updates to observe the learning trend while using SGD

Ordinary Least Squares
OLS is used to find the estimator $\hat{\beta}$ that minimizes the sum of squared residuals:
$$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = \lVert y - X\beta \rVert^2$$
Using linear algebra, the solution is $\hat{\beta} = (X^T X)^{-1} X^T y$.

Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function:
$$\mathcal{L}(\theta \mid x) = f_\theta(x)$$
where $f_\theta$ is the density function of the data distribution.
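The gradient descent procedure and the OLS closed form can be compared on a small example. This is a minimal sketch assuming a single input variable, an MSE cost and made-up data; the function names, the learning rate of 0.3 and the iteration count are illustrative choices, not recommendations.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: y = 3x + 2 + small noise, with x already scaled to [0, 1]
xs = [random.random() for _ in range(100)]
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]
n = len(xs)


def cost(theta0, theta1):
    """MSE cost J(theta) for the model f(x) = theta0 + theta1 * x."""
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / n


def batch_gradient_descent(alpha=0.3, n_iters=2000):
    theta0, theta1 = 0.0, 0.0  # initialization
    for _ in range(n_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2 * sum(errors) / n                             # dJ/dtheta0
        grad1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n  # dJ/dtheta1
        theta0 -= alpha * grad0                                 # go downhill
        theta1 -= alpha * grad1
    return theta0, theta1


def ols():
    """Closed form of (X^T X)^-1 X^T y for one input variable plus intercept."""
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - beta1 * mean_x, beta1


gd_theta = batch_gradient_descent()
ols_beta = ols()
print("gradient descent:", gd_theta, "cost:", cost(*gd_theta))
print("OLS closed form: ", ols_beta, "cost:", cost(*ols_beta))
```

Both approaches minimize the same cost, so with enough iterations and a suitable learning rate the gradient descent coefficients approach the OLS ones. A learning rate that is too large makes the updates diverge, which is why plotting the cost over iterations is useful.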

Linear Algorithms

All linear algorithms assume a linear relationship between the input variables X and the output variable Y.

Linear Regression
Representation: An LR model is represented by a linear equation:
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
$\beta_0$ is usually called the intercept or bias coefficient. The dimension of the hyperplane of the regression is its complexity.

Learning: Learning an LR model means estimating the coefficients from the training data. Common methods include Gradient Descent and Ordinary Least Squares.

Variations: There are extensions of LR training called regularization methods, which aim to reduce the complexity of the model:

  • Lasso Regression: OLS is modified to also minimize the sum of the absolute values of the coefficients (L1 regularization)
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$
  • Ridge Regression: OLS is modified to also minimize the sum of the squared coefficients (L2 regularization)
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$

where $\lambda \ge 0$ is a tuning parameter to be determined.

Data preparation:
  • Transform data to obtain a linear relationship (e.g. log transform for an exponential relationship)
  • Remove noise such as outliers
  • Rescale inputs using standardization or normalization

Advantages:
  • Good regression baseline considering simplicity
  • Lasso/Ridge can be used to avoid overfitting
  • Lasso/Ridge permit feature selection in case of collinearity

Usecase examples:
  • Product sales prediction according to prices or promotions
  • Call-center waiting-time prediction according to the number of complaints and the number of working agents

Logistic Regression
It is the go-to for binary classification.

Representation: Logistic regression is a linear method, but predictions are transformed using the logistic function (or sigmoid) $\phi$, which is S-shaped and maps any real-valued number into (0, 1). The representation is an equation with binary output:
$$y = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$
which actually models the probability of the default class:
$$p(X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}$$
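The mapping from a linear combination to a probability, and then to a binary output, can be sketched as follows. The coefficients are hand-picked for illustration, and the function names and the 0.5 cutoff are assumptions of this example.

```python
import math


def predict_proba(x, betas):
    """p(X) = e^z / (1 + e^z), with z = beta_0 + beta_1*x_1 + ... + beta_p*x_p."""
    z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return 1.0 / (1.0 + math.exp(-z))  # same value as e^z / (1 + e^z)


def predict_class(x, betas, cutoff=0.5):
    """Turn the probability into a binary output using a cutoff."""
    return 1 if predict_proba(x, betas) >= cutoff else 0


# Hypothetical coefficients, chosen by hand for illustration only
betas = [-1.0, 2.0, 0.5]
print(predict_proba([0.0, 2.0], betas))  # z = 0, so the probability is 0.5
print(predict_class([3.0, 1.0], betas))  # z = 5.5, probability close to 1
```

Moving the cutoff away from 0.5 trades precision against recall without retraining the model. Note that `math.exp(-z)` can overflow for very large negative z, which this sketch does not guard against.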

Learning: The logistic regression coefficients are learned using maximum-likelihood estimation, to predict values close to 1 for the default class and close to 0 for the other class.

Data preparation:

  • Probability transformation to binary for classification
  • Remove noise such as outliers

Advantages:
  • Good classification baseline considering simplicity
  • Possibility to change the cutoff for the precision/recall trade-off
  • Robust to noise/overfitting with L1/L2 regularization
  • Probability output can be used for ranking

Usecase examples:
  • Customer scoring with probability of purchase
  • Classification of loan defaults according to profile

Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.

Representation: The LDA representation consists of statistical properties calculated for each class: the means and the covariance matrix:
$$\mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i \quad \text{and} \quad \sigma^2 = \frac{1}{n - K} \sum_{i=1}^{n} (x_i - \mu_k)^2$$
where $n_k$ is the number of instances in class k (the first sum runs over those instances), n is the total number of instances, K is the number of classes, and in the second sum $\mu_k$ is the mean of the class of $x_i$.

LDA assumes Gaussian data and that attributes have the same variance $\sigma^2$. Predictions are made using Bayes' theorem:
$$P(Y = k \mid X = x) = \frac{P(k) \times P(x \mid k)}{\sum_{l=1}^{K} P(l) \times P(x \mid l)}$$
to obtain a discriminant function (latent variable) for each class k, estimating $P(x \mid k)$ with a Gaussian distribution:
$$D_k(x) = x \times \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \ln(P(k))$$
The class with the largest discriminant value is the output class.

Variations:
  • Quadratic DA: each class uses its own variance estimate
  • Regularized DA: regularization is introduced into the variance estimate

Data preparation:
  • Review and modify univariate distributions to be Gaussian
  • Standardize data to $\mu = 0$, $\sigma = 1$ to have the same variance
  • Remove noise such as outliers

Advantages:
  • Can be used for dimensionality reduction by keeping the latent variables as new variables
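For a single input variable, the LDA statistics and the discriminant function can be sketched in a few lines. The toy data and the function names (`fit_lda`, `discriminant`, `predict`) are assumptions made for this example, and the class priors P(k) are estimated here as class frequencies.

```python
import math
from collections import defaultdict


def fit_lda(xs, labels):
    """Estimate class means, the pooled variance and the class priors."""
    groups = defaultdict(list)
    for x, k in zip(xs, labels):
        groups[k].append(x)
    n, n_classes = len(xs), len(groups)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    # Pooled variance: squared deviations from each instance's own class mean
    variance = sum((x - means[k]) ** 2 for x, k in zip(xs, labels)) / (n - n_classes)
    priors = {k: len(v) / n for k, v in groups.items()}
    return means, variance, priors


def discriminant(x, k, means, variance, priors):
    """D_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + ln(P(k))."""
    return x * means[k] / variance - means[k] ** 2 / (2 * variance) + math.log(priors[k])


def predict(x, means, variance, priors):
    """Return the class with the largest discriminant value."""
    return max(means, key=lambda k: discriminant(x, k, means, variance, priors))


# Hypothetical one-dimensional toy data with two classes
xs = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 7.5]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
model = fit_lda(xs, labels)
print(predict(2.2, *model), predict(6.8, *model))
```

With equal priors and a shared variance, the decision boundary sits halfway between the two class means, here at 4.25.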

For regression the output can be the mean, while for classification the output can be the most common class. Various distances can be used, for example:

  • Euclidean Distance, good for similar types of variables
    $$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
  • Manhattan Distance, good for different types of variables
    $$d(a, b) = \sum_{i=1}^{n} |a_i - b_i|$$

The best value of k must be found by testing, and the algorithm is sensitive to the curse of dimensionality (COD).

Data preparation:
  • Rescale inputs using standardization or normalization
  • Address missing data for distance calculations
  • Dimensionality reduction or feature selection for COD

Advantages:
  • Effective if the training data is large
  • No learning phase
  • Robust to noisy data, no need to filter outliers

Usecase examples:
  • Recommending products based on similar customers
  • Anomaly detection in customer behavior

Support Vector Machines
SVM is a go-to for high performance with little tuning.

Representation: In SVM, a hyperplane is selected to separate the points in the input variable space by their class, with the largest margin. The closest data points (defining the margin) are called the support vectors. But real data cannot be perfectly separated, which is why a parameter C defines the amount of margin violation allowed. The lower C, the more sensitive SVM is to the training data.

The prediction function is the signed distance of the new input x to the separating hyperplane defined by w:
$$f(x) = \langle w, x \rangle + \rho = w^T x + \rho$$
with $\rho$ the bias. For a linear kernel, with $x_i$ the support vectors, this gives:
$$f(x) = \sum_{i=1}^{n} a_i \times (x \cdot x_i) + \rho$$

Learning: The hyperplane is learned by transforming the problem using linear algebra, and minimizing:
$$\frac{1}{n} \sum_{i=1}^{n} \max\big(0,\ 1 - y_i (w \cdot x_i + \rho)\big) + \lambda \lVert w \rVert^2$$
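To make the objective concrete, the sketch below evaluates the hinge loss plus penalty and minimizes it with plain subgradient descent on a tiny made-up dataset. Subgradient descent is a technique swapped in here for simplicity; it is only one possible solver and not necessarily what SVM libraries use. The function names, the step size and the value of λ are assumptions of this example.

```python
def decision(w, rho, x):
    """f(x) = <w, x> + rho: positive on one side of the hyperplane, negative on the other."""
    return sum(wj * xj for wj, xj in zip(w, x)) + rho


def objective(w, rho, X, y, lam):
    """Average hinge loss plus the L2 penalty lambda * ||w||^2."""
    hinge = sum(max(0.0, 1 - yi * decision(w, rho, xi)) for xi, yi in zip(X, y))
    return hinge / len(X) + lam * sum(wj ** 2 for wj in w)


def train(X, y, lam=0.01, alpha=0.01, n_iters=5000):
    """Minimize the objective with plain subgradient descent (one possible solver)."""
    n, p = len(X), len(X[0])
    w, rho = [0.0] * p, 0.0
    for _ in range(n_iters):
        grad_w = [2 * lam * wj for wj in w]  # gradient of the penalty term
        grad_rho = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, rho, xi) < 1:  # margin violated: hinge term is active
                for j in range(p):
                    grad_w[j] -= yi * xi[j] / n
                grad_rho -= yi / n
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        rho -= alpha * grad_rho
    return w, rho


# Hypothetical, linearly separable toy data with labels in {-1, +1}
X = [[1.0, 1.0], [1.5, 0.5], [2.0, 1.5], [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]]
y = [-1, -1, -1, 1, 1, 1]
w, rho = train(X, y)
print("w =", w, "rho =", rho, "objective =", objective(w, rho, X, y, lam=0.01))
```

The sign of `decision(w, rho, x)` gives the predicted class. Only points that violate the margin contribute to the update, which mirrors the role of the support vectors.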

Variations: SVM is implemented using various kernels, which define the similarity measure between new data and the support vectors:

  • Linear (dot-product): $K(x, x_i) = (x \times x_i)$
  • Polynomial: $K(x, x_i) = 1 + (x \times x_i)^d$
  • Radial: $K(x, x_i) = e^{-\gamma \sum (x - x_i)^2}$

Data preparation:
  • SVM assumes numeric inputs, and may require dummy transformation of categorical features

Advantages:
  • Allows nonlinear separation with nonlinear kernels
  • Works well in high-dimensional space
  • Robust to multicollinearity and overfitting

Usecase examples:
  • Face detection from images
  • Target audience classification from tweets

Ensemble Algorithms

Ensemble methods combine multiple, simpler algorithms to obtain better performance.

Bagging and Random Forest
Random Forest is part of a bigger type of ensemble methods called Bootstrap Aggregation or Bagging. Bagging can reduce the variance of high-variance models. It uses the Bootstrap statistical procedure: estimate a quantity from a sample by creating many random subsamples with replacement, and computing the mean of each subsample.

Representation: For bagged decision trees, the steps would be:
  • Create many subsamples of the training dataset
  • Train a CART model on each sample
  • Given a new dataset, calculate the average prediction

However, combining models works best if the submodels are uncorrelated or, at most, weakly correlated. Random Forest is a tweaked version of bagged decision trees that reduces tree correlation.

Learning: During learning, each sub-tree can only access a random sample of features when selecting the split points. The size of the feature sample at each split is a parameter m. A good default is $\sqrt{p}$ for classification and $p/3$ for regression.

The OOB estimate is the performance of each model on its Out-Of-Bag (not selected) samples. It is a reliable estimate of test error.

Bagged methods can provide feature importance, by calculating and averaging the drop in the error function for individual variables (depending on the samples where a variable is selected or not).
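The Bootstrap procedure and the notion of Out-Of-Bag samples can be sketched without any tree model. The example below only resamples a made-up sample and estimates its mean; it is not a Random Forest, and the sample size and number of subsamples are arbitrary choices.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 200 observations
sample = [random.gauss(50, 10) for _ in range(200)]
n = len(sample)

subsample_means = []
oob_fractions = []
for _ in range(500):
    # Draw n indices with replacement: one bootstrap subsample
    indices = [random.randrange(n) for _ in range(n)]
    subsample_means.append(mean(sample[i] for i in indices))
    # Observations never drawn are "out-of-bag" for this subsample
    oob_fractions.append(1 - len(set(indices)) / n)

print("sample mean:            ", mean(sample))
print("mean of subsample means:", mean(subsample_means))
print("average OOB fraction:   ", mean(oob_fractions))  # close to 1/e, about 0.37
```

Roughly a third of the observations are left out of each subsample, which is what makes the OOB estimate possible: each model can be evaluated on data it never saw.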

Advantages: In addition to the advantages of the CART algorithm:

  • Robust to overfitting and missing variables
  • Can be parallelized for distributed computing
  • Performance as good as SVM but easier to interpret

Usecase examples:
  • Predictive machine maintenance
  • Optimizing line decision for credit cards

Boosting and AdaBoost
AdaBoost was the first successful boosting algorithm developed for binary classification.

Representation: A boosted classifier is of the form
$$F_T(x) = \sum_{t=1}^{T} f_t(x)$$
where each $f_t$ is a weak learner correcting the errors of the previous one. AdaBoost is commonly used with decision trees with one level (decision stumps). Predictions are made using the weighted average of the weak classifiers.

Learning: Each training set instance is initially weighted $w(x_i) = \frac{1}{n}$. One decision stump is prepared using the weighted samples, and a misclassification rate is calculated:
$$\epsilon = \frac{\sum_{i=1}^{n} (w_i \times p_{error,i})}{\sum_{i=1}^{n} w_i}$$
which is the weighted sum of the misclassification rates, where $w_i$ is the weight of training instance i and $p_{error,i}$ its prediction error (1 or 0). A stage value is computed from the misclassification rate:
$$stage = \ln\left(\frac{1 - \epsilon}{\epsilon}\right)$$
This stage value is used to update the instance weights:
$$w = w \times e^{stage \times p_{error}}$$
Incorrectly predicted instances are given more weight. Weak models are added sequentially using the training weights, until no improvement can be made or the number of rounds has been reached.

Data preparation:

  • Outliers should be removed for AdaBoost

Advantages:
  • High performance with no tuning (only number of rounds)

Interesting Resources

  • Machine Learning Mastery website: https://machinelearningmastery.com/
  • Scikit-learn website, for Python implementation: http://scikit-learn.org/
  • W. Chen probability cheatsheet: https://github.com/wzchen/probability_cheatsheet
  • HackingNote, for interesting, condensed insights: https://www.hackingnote.com/
  • Seattle Data Guy blog, for business oriented articles: https://www.theseattledataguy.com/
  • Explained visually, making hard ideas intuitive: http://setosa.io/ev/
  • This Machine Learning Cheatsheet: https://github.com/remicnrd/ml_cheatsheet