# Machine Learning complete cheatsheet, Exercises for Mathematics

Machine learning cheatsheet containing popular algorithms

https://github.com/soulmachine/machine-learning-cheat-sheet

soulmachine@gmail.com

Machine Learning Cheat Sheet

Classical equations, diagrams and tricks in machine learning

February 12, 2015


Preface

This cheat sheet contains many classical equations and diagrams on machine learning, which will help you quickly recall knowledge and ideas in machine learning.

This cheat sheet has three significant advantages:

1. Strongly typed. Compared to programming languages, mathematical formulas are weakly typed. For example, X can be a set, a random variable, or a matrix. This makes the meaning of formulas hard to understand. In this cheat sheet, I try my best to standardize the symbols used, see section §.

2. More parentheses. In machine learning, authors are prone to omit parentheses, brackets and braces, which usually causes ambiguity in mathematical formulas. In this cheat sheet, I use parentheses (brackets and braces) wherever they are needed, to make formulas easy to understand.

3. Fewer thinking jumps. In many books, authors are prone to omit steps that are trivial in their opinion, but this often makes readers get lost in the middle of a derivation.

At Tsinghua University, May 2013 soulmachine


Contents

Notation

1 Introduction
1.1 Types of machine learning
1.2 Three elements of a machine learning model
1.2.1 Representation
1.2.2 Evaluation
1.2.3 Optimization
1.3 Some basic concepts
1.3.1 Parametric vs non-parametric models
1.3.2 A simple non-parametric classifier: K-nearest neighbours
1.3.3 Overfitting
1.3.4 Cross validation
1.3.5 Model selection

2 Probability
2.1 Frequentists vs. Bayesians
2.2 A brief review of probability theory
2.2.1 Basic concepts
2.2.2 Multivariate random variables
2.2.3 Bayes rule
2.2.4 Independence and conditional independence
2.2.5 Quantiles
2.2.6 Mean and variance
2.3 Some common discrete distributions
2.3.1 The Bernoulli and binomial distributions
2.3.2 The multinoulli and multinomial distributions
2.3.3 The Poisson distribution
2.3.4 The empirical distribution
2.4 Some common continuous distributions
2.4.1 Gaussian (normal) distribution
2.4.2 Student’s t-distribution
2.4.3 The Laplace distribution
2.4.4 The gamma distribution
2.4.5 The beta distribution
2.4.6 Pareto distribution
2.5 Joint probability distributions
2.5.1 Covariance and correlation
2.5.2 Multivariate Gaussian distribution
2.5.3 Multivariate Student’s t-distribution
2.5.4 Dirichlet distribution
2.6 Transformations of random variables
2.6.1 Linear transformations
2.6.2 General transformations
2.6.3 Central limit theorem
2.7 Monte Carlo approximation
2.8 Information theory
2.8.1 Entropy
2.8.2 KL divergence
2.8.3 Mutual information

3 Generative models for discrete data
3.1 Generative classifier
3.2 Bayesian concept learning
3.2.1 Likelihood
3.2.2 Prior
3.2.3 Posterior
3.2.4 Posterior predictive distribution
3.3 The beta-binomial model
3.3.1 Likelihood
3.3.2 Prior
3.3.3 Posterior
3.3.4 Posterior predictive distribution
3.4 The Dirichlet-multinomial model
3.4.1 Likelihood
3.4.2 Prior
3.4.3 Posterior
3.4.4 Posterior predictive distribution
3.5 Naive Bayes classifiers
3.5.1 Optimization
3.5.2 Using the model for prediction
3.5.3 The log-sum-exp trick
3.5.4 Feature selection using mutual information
3.5.5 Classifying documents using bag of words

4 Gaussian Models
4.1 Basics
4.1.1 MLE for a MVN
4.1.2 Maximum entropy derivation of the Gaussian *
4.2 Gaussian discriminant analysis
4.2.1 Quadratic discriminant analysis (QDA)
4.2.2 Linear discriminant analysis (LDA)
4.2.3 Two-class LDA
4.2.4 MLE for discriminant analysis
4.2.5 Strategies for preventing overfitting
4.2.6 Regularized LDA *
4.2.7 Diagonal LDA
4.2.8 Nearest shrunken centroids classifier *
4.3 Inference in jointly Gaussian distributions
4.3.1 Statement of the result
4.3.2 Examples
4.4 Linear Gaussian systems
4.4.1 Statement of the result
4.5 Digression: The Wishart distribution *
4.6 Inferring the parameters of an MVN
4.6.1 Posterior distribution of µ
4.6.2 Posterior distribution of Σ *
4.6.3 Posterior distribution of µ and Σ *
4.6.4 Sensor fusion with unknown precisions *

5 Bayesian statistics
5.1 Introduction
5.2 Summarizing posterior distributions
5.2.1 MAP estimation
5.2.2 Credible intervals
5.2.3 Inference for a difference in proportions
5.3 Bayesian model selection
5.3.1 Bayesian Occam’s razor
5.3.2 Computing the marginal likelihood (evidence)
5.3.3 Bayes factors
5.4 Priors
5.4.1 Uninformative priors
5.4.2 Robust priors
5.4.3 Mixtures of conjugate priors
5.5 Hierarchical Bayes
5.6 Empirical Bayes
5.7 Bayesian decision theory
5.7.1 Bayes estimators for common loss functions
5.7.2 The false positive vs false negative tradeoff

6 Frequentist statistics
6.1 Sampling distribution of an estimator
6.1.1 Bootstrap
6.1.2 Large sample theory for the MLE *
6.2 Frequentist decision theory
6.3 Desirable properties of estimators
6.4 Empirical risk minimization
6.4.1 Regularized risk minimization
6.4.2 Structural risk minimization
6.4.3 Estimating the risk using cross validation
6.4.4 Upper bounding the risk using statistical learning theory *
6.4.5 Surrogate loss functions
6.5 Pathologies of frequentist statistics *

7 Linear Regression
7.1 Introduction
7.2 Representation
7.3 MLE
7.3.1 OLS
7.3.2 SGD
7.4 Ridge regression (MAP)
7.4.1 Basic idea
7.4.2 Numerically stable computation *
7.4.3 Connection with PCA *
7.4.4 Regularization effects of big data
7.5 Bayesian linear regression

8 Logistic Regression
8.1 Representation
8.2 Optimization
8.2.1 MLE
8.2.2 MAP
8.3 Multinomial logistic regression
8.3.1 Representation
8.3.2 MLE
8.3.3 MAP
8.4 Bayesian logistic regression
8.4.1 Laplace approximation
8.4.2 Derivation of the BIC
8.4.3 Gaussian approximation for logistic regression
8.4.4 Approximating the posterior predictive
8.4.5 Residual analysis (outlier detection) *
8.5 Online learning and stochastic optimization
8.5.1 The perceptron algorithm
8.6 Generative vs discriminative classifiers
8.6.1 Pros and cons of each approach
8.6.2 Dealing with missing data
8.6.3 Fisher’s linear discriminant analysis (FLDA) *

9 Generalized linear models and the exponential family
9.1 The exponential family
9.1.1 Definition
9.1.2 Examples
9.1.3 Log partition function
9.1.4 MLE for the exponential family
9.1.5 Bayes for the exponential family
9.1.6 Maximum entropy derivation of the exponential family *
9.2 Generalized linear models (GLMs)
9.2.1 Basics
9.3 Probit regression
9.4 Multi-task learning

10 Directed graphical models (Bayes nets)
10.1 Introduction
10.1.1 Chain rule
10.1.2 Conditional independence
10.1.3 Graphical models
10.1.4 Directed graphical model
10.2 Examples
10.2.1 Naive Bayes classifiers
10.2.2 Markov and hidden Markov models
10.3 Inference
10.4 Learning
10.4.1 Learning from complete data
10.4.2 Learning with missing and/or latent variables
10.5 Conditional independence properties of DGMs
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)
10.5.2 Other Markov properties of DGMs
10.5.3 Markov blanket and full conditionals
10.5.4 Multinoulli Learning
10.6 Influence (decision) diagrams *

11 Mixture models and the EM algorithm
11.1 Latent variable models
11.2 Mixture models
11.2.1 Mixtures of Gaussians
11.2.2 Mixtures of multinoullis
11.2.3 Using mixture models for clustering
11.2.4 Mixtures of experts
11.3 Parameter estimation for mixture models
11.3.1 Unidentifiability
11.3.2 Computing a MAP estimate is non-convex
11.4 The EM algorithm
11.4.1 Introduction
11.4.2 Basic idea
11.4.3 EM for GMMs
11.4.4 EM for K-means
11.4.5 EM for mixture of experts
11.4.6 EM for DGMs with hidden variables
11.4.7 EM for the Student distribution *
11.4.8 EM for probit regression *
11.4.9 Derivation of the Q function
11.4.10 Convergence of the EM Algorithm *
11.4.11 Generalization of EM Algorithm *
11.4.12 Online EM
11.4.13 Other EM variants *
11.5 Model selection for latent variable models
11.5.1 Model selection for probabilistic models
11.5.2 Model selection for non-probabilistic methods
11.6 Fitting models with missing data
11.6.1 EM for the MLE of an MVN with missing data

12 Latent linear models
12.1 Factor analysis
12.1.1 FA is a low rank parameterization of an MVN
12.1.2 Inference of the latent factors
12.1.3 Unidentifiability
12.1.4 Mixtures of factor analysers
12.1.5 EM for factor analysis models
12.1.6 Fitting FA models with missing data
12.2 Principal components analysis (PCA)
12.2.1 Classical PCA
12.2.2 Singular value decomposition (SVD)
12.2.3 Probabilistic PCA
12.2.4 EM algorithm for PCA
12.3 Choosing the number of latent dimensions
12.3.1 Model selection for FA/PPCA
12.3.2 Model selection for PCA
12.4 PCA for categorical data
12.5 PCA for paired and multi-view data
12.5.1 Supervised PCA (latent factor regression)
12.5.2 Discriminative supervised PCA
12.5.3 Canonical correlation analysis
12.6 Independent Component Analysis (ICA)
12.6.1 Maximum likelihood estimation
12.6.2 The FastICA algorithm
12.6.3 Using EM
12.6.4 Other estimation principles *

13 Sparse linear models

14 Kernels
14.1 Introduction
14.2 Kernel functions
14.2.1 RBF kernels
14.2.2 TF-IDF kernels
14.2.3 Mercer (positive definite) kernels
14.2.4 Linear kernels
14.2.5 Matern kernels
14.2.6 String kernels
14.2.7 Pyramid match kernels
14.2.8 Kernels derived from probabilistic generative models
14.3 Using kernels inside GLMs
14.3.1 Kernel machines
14.3.2 L1VMs, RVMs, and other sparse vector machines
14.4 The kernel trick
14.4.1 Kernelized KNN
14.4.2 Kernelized K-medoids clustering
14.4.3 Kernelized ridge regression
14.4.4 Kernel PCA
14.5 Support vector machines (SVMs)
14.5.1 SVMs for classification
14.5.2 SVMs for regression
14.5.3 Choosing C
14.5.4 A probabilistic interpretation of SVMs
14.5.5 Summary of key points
14.6 Comparison of discriminative kernel methods
14.7 Kernels for building generative models

15 Gaussian processes
15.1 Introduction
15.2 GPs for regression
15.3 GPs meet GLMs
15.4 Connection with other methods
15.5 GP latent variable model
15.6 Approximation methods for large datasets

16 Adaptive basis function models
16.1 AdaBoost
16.1.1 Representation
16.1.2 Evaluation
16.1.3 Optimization
16.1.4 The upper bound of the training error of AdaBoost

17 Hidden Markov Model
17.1 Introduction
17.2 Markov models

18 State space models

19 Undirected graphical models (Markov random fields)

20 Exact inference for graphical models

21 Variational inference

22 More variational inference

23 Monte Carlo inference

24 Markov chain Monte Carlo (MCMC) inference
24.1 Introduction
24.2 Metropolis Hastings algorithm
24.3 Gibbs sampling
24.4 Speed and accuracy of MCMC
24.5 Auxiliary variable MCMC *

25 Clustering

26 Graphical model structure learning

27 Latent variable models for discrete data
27.1 Introduction
27.2 Distributed state LVMs for discrete data

28 Deep learning

A Optimization methods
A.1 Convexity
A.2 Gradient descent
A.2.1 Stochastic gradient descent
A.2.2 Batch gradient descent
A.2.3 Line search
A.2.4 Momentum term
A.3 Lagrange duality
A.3.1 Primal form
A.3.2 Dual form
A.4 Newton’s method
A.5 Quasi-Newton method
A.5.1 DFP
A.5.2 BFGS
A.5.3 Broyden

Glossary

List of Contributors

Wei Zhang PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R.CHINA, e-mail: zh3feng@gmail.com, has written chapters of Naive Bayes and SVM.

Fei Pan Master at Beijing University of Technology, Beijing, P.R.CHINA, e-mail: example@gmail.com, has written chapters of KMeans, AdaBoost.

Yong Li PhD candidate at the Institute of Automation of the Chinese Academy of Sciences (CASIA), Beijing, P.R.CHINA, e-mail: liyong3forever@gmail.com, has written chapters of Logistic Regression.

Jiankou Li PhD candidate at the Institute of Software, Chinese Academy of Sciences (ISCAS), Beijing, P.R.CHINA, e-mail: lijiankoucoco@163.com, has written chapters of BayesNet.


Notation

Introduction

It is very difficult to come up with a single, consistent notation to cover the wide variety of data, models and algorithms that we discuss. Furthermore, conventions differ between machine learning and statistics, and between different books and papers. Nevertheless, we have tried to be as consistent as possible. Below we summarize most of the notation used in this book, although individual sections may introduce new notation. Note also that the same symbol may have different meanings depending on the context, although we try to avoid this where possible.

General math notation

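| Symbol | Meaning |
|---|---|
| $\lfloor x \rfloor$ | Floor of $x$, i.e., round down to nearest integer |
| $\lceil x \rceil$ | Ceiling of $x$, i.e., round up to nearest integer |
| $x * y$ | Convolution of $x$ and $y$ |
| $x \odot y$ | Hadamard (elementwise) product of $x$ and $y$ |
| $a \wedge b$ | Logical AND |
| $a \vee b$ | Logical OR |
| $\neg a$ | Logical NOT |
| $I(x)$ | Indicator function, $I(x) = 1$ if $x$ is true, else $I(x) = 0$ |
| $\infty$ | Infinity |
| $\to$ | Tends towards, e.g., $n \to \infty$ |
| $\propto$ | Proportional to, so $y = ax$ can be written as $y \propto x$ |
| $\lvert x \rvert$ | Absolute value |
| $\lvert \mathcal{S} \rvert$ | Size (cardinality) of a set |
| $n!$ | Factorial function |
| $\nabla$ | Vector of first derivatives |
| $\nabla^2$ | Hessian matrix of second derivatives |
| $\triangleq$ | Defined as |
| $O(\cdot)$ | Big-O: roughly means order of magnitude |
| $\mathbb{R}$ | The real numbers |
| $1:n$ | Range (Matlab convention): $1:n = 1, 2, \ldots, n$ |
| $\approx$ | Approximately equal to |
| $\arg\max_x f(x)$ | Argmax: the value $x$ that maximizes $f$ |
| $B(a,b)$ | Beta function, $B(a,b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ |
| $B(\boldsymbol{\alpha})$ | Multivariate beta function, $\dfrac{\prod_k \Gamma(\alpha_k)}{\Gamma(\sum_k \alpha_k)}$ |
| $\binom{n}{k}$ | $n$ choose $k$, equal to $n!/(k!(n-k)!)$ |
| $\delta(x)$ | Dirac delta function, $\delta(x) = \infty$ if $x = 0$, else $\delta(x) = 0$ |
| $\exp(x)$ | Exponential function $e^x$ |
| $\Gamma(x)$ | Gamma function, $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u}\,du$ |
| $\Psi(x)$ | Digamma function, $\Psi(x) = \dfrac{d}{dx}\log\Gamma(x)$ |
| $\mathcal{X}$ | A set from which values are drawn (e.g., $\mathcal{X} = \mathbb{R}^D$) |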

Linear algebra notation

We use boldface lower-case to denote vectors, such as $\mathbf{x}$, and boldface upper-case to denote matrices, such as $\mathbf{X}$. We denote entries in a matrix by non-bold upper case letters, such as $X_{ij}$.

Vectors are assumed to be column vectors, unless noted otherwise. We use $(x_1, \cdots, x_D)$ to denote a column vector created by stacking $D$ scalars. If we write $\mathbf{X} = (\mathbf{x}_1, \cdots, \mathbf{x}_n)$, where the left hand side is a matrix, we mean to stack the $\mathbf{x}_i$ along the columns, creating a matrix.

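| Symbol | Meaning |
|---|---|
| $\mathbf{X} \succ 0$ | $\mathbf{X}$ is a positive definite matrix |
| $\mathrm{tr}(\mathbf{X})$ | Trace of a matrix |
| $\det(\mathbf{X})$ | Determinant of matrix $\mathbf{X}$ |
| $\lvert \mathbf{X} \rvert$ | Determinant of matrix $\mathbf{X}$ |
| $\mathbf{X}^{-1}$ | Inverse of a matrix |
| $\mathbf{X}^{\dagger}$ | Pseudo-inverse of a matrix |
| $\mathbf{X}^T$ | Transpose of a matrix |
| $\mathbf{x}^T$ | Transpose of a vector |
| $\mathrm{diag}(\mathbf{x})$ | Diagonal matrix made from vector $\mathbf{x}$ |
| $\mathrm{diag}(\mathbf{X})$ | Diagonal vector extracted from matrix $\mathbf{X}$ |
| $\mathbf{I}$ or $\mathbf{I}_d$ | Identity matrix of size $d \times d$ (ones on diagonal, zeros off) |
| $\mathbf{1}$ or $\mathbf{1}_d$ | Vector of ones (of length $d$) |
| $\mathbf{0}$ or $\mathbf{0}_d$ | Vector of zeros (of length $d$) |
| $\lVert \mathbf{x} \rVert = \lVert \mathbf{x} \rVert_2$ | Euclidean or $\ell_2$ norm, $\sqrt{\sum_{j=1}^{d} x_j^2}$ |
| $\lVert \mathbf{x} \rVert_1$ | $\ell_1$ norm, $\sum_{j=1}^{d} \lvert x_j \rvert$ |
| $\mathbf{X}_{:,j}$ | $j$'th column of matrix |
| $\mathbf{X}_{i,:}$ | Transpose of $i$'th row of matrix (a column vector) |
| $X_{i,j}$ | Element $(i,j)$ of matrix $\mathbf{X}$ |
| $\mathbf{x} \otimes \mathbf{y}$ | Tensor product of $\mathbf{x}$ and $\mathbf{y}$ |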

Probability notation

We denote random and fixed scalars by lower case, random and fixed vectors by bold lower case, and random and fixed matrices by bold upper case. Occasionally we use non-bold upper case to denote scalar random variables. Also, we use p() for both discrete and continuous random variables.

| Symbol | Meaning |
|---|---|
| $X, Y$ | Random variable |
| $P()$ | Probability of a random event |
| $F()$ | Cumulative distribution function (CDF), also called distribution function |
| $p(x)$ | Probability mass function (PMF) |
| $f(x)$ | Probability density function (PDF) |
| $F(x,y)$ | Joint CDF |
| $p(x,y)$ | Joint PMF |
| $f(x,y)$ | Joint PDF |
| $p(X \mid Y)$ | Conditional PMF, also called conditional probability |
| $f_{X \mid Y}(x \mid y)$ | Conditional PDF |
| $X \perp Y$ | $X$ is independent of $Y$ |
| $X \not\perp Y$ | $X$ is not independent of $Y$ |
| $X \perp Y \mid Z$ | $X$ is conditionally independent of $Y$ given $Z$ |
| $X \not\perp Y \mid Z$ | $X$ is not conditionally independent of $Y$ given $Z$ |
| $X \sim p$ | $X$ is distributed according to distribution $p$ |
| $\boldsymbol{\alpha}$ | Parameters of a Beta or Dirichlet distribution |
| $\mathrm{cov}[X]$ | Covariance of $X$ |
| $E[X]$ | Expected value of $X$ |
| $E_q[X]$ | Expected value of $X$ wrt distribution $q$ |
| $H(X)$ or $H(p)$ | Entropy of distribution $p(X)$ |
| $I(X;Y)$ | Mutual information between $X$ and $Y$ |
| $\mathrm{KL}(p \parallel q)$ | KL divergence from distribution $p$ to $q$ |
| $\ell(\theta)$ | Log-likelihood function |
| $L(\theta, a)$ | Loss function for taking action $a$ when true state of nature is $\theta$ |
| $\lambda$ | Precision (inverse variance), $\lambda = 1/\sigma^2$ |
| $\Lambda$ | Precision matrix, $\Lambda = \Sigma^{-1}$ |
| $\mathrm{mode}[X]$ | Most probable value of $X$ |
| $\mu$ | Mean of a scalar distribution |
| $\boldsymbol{\mu}$ | Mean of a multivariate distribution |
| $\Phi$ | cdf of standard normal |
| $\phi$ | pdf of standard normal |
| $\boldsymbol{\pi}$ | Multinomial parameter vector, stationary distribution of Markov chain |
| $\rho$ | Correlation coefficient |
| $\mathrm{sigm}(x)$ | Sigmoid (logistic) function, $\dfrac{1}{1 + e^{-x}}$ |
| $\sigma^2$ | Variance |
| $\Sigma$ | Covariance matrix |
| $\mathrm{var}[x]$ | Variance of $x$ |
| $\nu$ | Degrees of freedom parameter |
| $Z$ | Normalization constant of a probability distribution |

Machine learning/statistics notation

In general, we use upper case letters to denote constants, such as C, D, K, N, T, etc. We use lower case letters as dummy indexes of the appropriate range, such as c = 1 : C to index classes, i = 1 : N to index data cases, j = 1 : D to index input features, k = 1 : K to index states or clusters, t = 1 : T to index time, etc.

We use $\mathbf{x}$ to represent an observed data vector. In a supervised problem, we use $y$ or $\mathbf{y}$ to represent the desired output label. We use $\mathbf{z}$ to represent a hidden variable. Sometimes we also use $q$ to represent a hidden discrete variable.

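| Symbol | Meaning |
|---|---|
| $C$ | Number of classes |
| $D$ | Dimensionality of data vector (number of features) |
| $N$ | Number of data cases |
| $N_c$ | Number of examples of class $c$, $N_c = \sum_{i=1}^{N} I(y_i = c)$ |
| $R$ | Number of outputs (response variables) |
| $\mathcal{D}$ | Training data, $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1:N\}$ |
| $\mathcal{D}_{\mathrm{test}}$ | Test data |
| $\mathcal{X}$ | Input space |
| $\mathcal{Y}$ | Output space |
| $K$ | Number of states or dimensions of a variable (often latent) |
| $k(x,y)$ | Kernel function |
| $\mathbf{K}$ | Kernel matrix |
| $\mathcal{H}$ | Hypothesis space |
| $L$ | Loss function |
| $J(\theta)$ | Cost function |
| $f(\mathbf{x})$ | Decision function |
| $P(y \mid \mathbf{x})$ | Conditional probability distribution (the classifier of Section 1.2.1) |
| $\lambda$ | Strength of $\ell_2$ or $\ell_1$ regularizer |
| $\phi(\mathbf{x})$ | Basis function expansion of feature vector $\mathbf{x}$ |
| $\Phi$ | Basis function expansion of design matrix $\mathbf{X}$ |
| $q()$ | Approximate or proposal distribution |
| $Q(\theta, \theta_{\mathrm{old}})$ | Auxiliary function in EM |
| $T$ | Length of a sequence |
| $T(\mathcal{D})$ | Test statistic for data |
| $\mathbf{T}$ | Transition matrix of Markov chain |
| $\theta$ | Parameter vector |
| $\theta^{(s)}$ | $s$'th sample of parameter vector |
| $\hat{\theta}$ | Estimate (usually MLE or MAP) of $\theta$ |
| $\hat{\theta}_{\mathrm{MLE}}$ | Maximum likelihood estimate of $\theta$ |
| $\hat{\theta}_{\mathrm{MAP}}$ | MAP estimate of $\theta$ |
| $\bar{\theta}$ | Estimate (usually posterior mean) of $\theta$ |
| $\mathbf{w}$ | Vector of regression weights (called $\beta$ in statistics) |
| $b$ | Intercept (called $\beta_0$ in statistics) |
| $\mathbf{W}$ | Matrix of regression weights |
| $x_{ij}$ | Component (i.e., feature) $j$ of data case $i$, for $i = 1:N$, $j = 1:D$ |
| $\mathbf{x}_i$ | Training case, $i = 1:N$ |
| $\mathbf{X}$ | Design matrix of size $N \times D$ |
| $\bar{\mathbf{x}}$ | Empirical mean, $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$ |
| $\tilde{\mathbf{x}}$ | Future test case |
| $\mathbf{x}_*$ | Feature test case |
| $\mathbf{y}$ | Vector of all training labels, $\mathbf{y} = (y_1, \ldots, y_N)$ |
| $z_{ij}$ | Latent component $j$ for case $i$ |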

Chapter 1 Introduction

1.1 Types of machine learning



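$$
\text{Supervised learning}
\begin{cases}
\text{Classification} \\
\text{Regression}
\end{cases}
$$

$$
\text{Unsupervised learning}
\begin{cases}
\text{Discovering clusters} \\
\text{Discovering latent factors} \\
\text{Discovering graph structure} \\
\text{Matrix completion}
\end{cases}
$$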

1.2 Three elements of a machine learning model

Model = Representation + Evaluation + Optimization¹

¹ Domingos, P. A few useful things to know about machine learning. Commun. ACM 55(10):78–87 (2012).

1.2.1 Representation

In supervised learning, a model must be represented as a conditional probability distribution $P(y \mid x)$ (usually called a classifier) or a decision function $f(x)$. The set of classifiers (or decision functions) is called the hypothesis space of the model. Choosing a representation for a model is tantamount to choosing the hypothesis space that it can possibly learn.

1.2.2 Evaluation

In the hypothesis space, an evaluation function (also called objective function or risk function) is needed to distinguish good classifiers (or decision functions) from bad ones.

1.2.2.1 Loss function and risk function

Definition 1.1. In order to measure how well a function fits the training data, a loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is defined. For a training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.

The following are some common loss functions:

1. 0-1 loss function

$$
L(Y, f(X)) = I(Y \neq f(X)) =
\begin{cases}
1, & Y \neq f(X) \\
0, & Y = f(X)
\end{cases}
$$

2. Quadratic loss function: $L(Y, f(X)) = (Y - f(X))^2$
3. Absolute loss function: $L(Y, f(X)) = \lvert Y - f(X) \rvert$
4. Logarithmic loss function: $L(Y, P(Y \mid X)) = -\log P(Y \mid X)$
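The four losses can be written down directly. The following is a minimal Python sketch; the function names are illustrative choices rather than anything defined in this cheat sheet, and the logarithmic loss takes the probability that the model assigns to the true label as its argument.

```python
import math

def zero_one_loss(y, y_pred):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 0 if y == y_pred else 1

def quadratic_loss(y, y_pred):
    """Quadratic loss: squared difference between target and prediction."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss: absolute difference between target and prediction."""
    return abs(y - y_pred)

def log_loss(p_true_label):
    """Logarithmic loss: negative log of the probability given to the true label."""
    return -math.log(p_true_label)
```

For instance, a classifier that assigns probability 0.5 to the correct label incurs a logarithmic loss of log 2, while a fully confident correct prediction incurs zero loss.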

Definition 1.2. The risk of function $f$ is defined as the expected loss of $f$:

$$
R_{\mathrm{exp}}(f) = E[L(Y, f(X))] = \int L(y, f(x))\, P(x, y)\, dx\, dy \tag{1.1}
$$

which is also called expected loss or risk function.

Definition 1.3. The risk function $R_{\mathrm{exp}}(f)$ can be estimated from the training data as

$$
R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{1.2}
$$

which is also called empirical loss or empirical risk.
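As a small worked instance of the empirical risk, the sketch below averages a loss over a training set. The toy data, the decision function f(x) = 2x and the helper name `empirical_risk` are all hypothetical and chosen only for illustration.

```python
def empirical_risk(loss, f, xs, ys):
    """Average loss of decision function f over the N training pairs (x_i, y_i)."""
    n = len(xs)
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / n

# Hypothetical toy data and a hypothetical decision function f(x) = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 7.0, 8.0]
f = lambda x: 2.0 * x
squared = lambda y, y_pred: (y - y_pred) ** 2

# Only the third point is mispredicted (7 vs 6), so the risk is 1/4.
risk = empirical_risk(squared, f, xs, ys)
print(risk)  # 0.25
```

Passing the loss as an argument keeps the average independent of the particular loss, so any two-argument loss of the same form can be plugged in.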

You can define your own loss function, but if you’re a novice, you’re probably better off using one from the literature. There are conditions that loss functions should meet²:

1. They should approximate the actual loss you’re trying to minimize. The standard loss function for classification is the zero-one loss (misclassification rate), and the losses used for training classifiers are approximations of it.

2. The loss function should work with your intended optimization algorithm. That’s why the zero-one loss is not used directly: it doesn’t work with gradient-based optimization methods, since it doesn’t have a well-defined gradient (or even a subgradient, as the hinge loss for SVMs has). The main algorithm that optimizes the zero-one loss directly is the old perceptron algorithm (chapter §??).

² http://t.cn/zTrDxLO
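The point about gradients can be checked numerically. The sketch below assumes a one-dimensional linear classifier with score w·x and a label in {−1, +1}, and compares a finite-difference derivative of the zero-one loss with that of the hinge loss. The setup and all names are illustrative assumptions, not part of the original text.

```python
def zero_one(w, x, y):
    """Zero-one loss of the sign classifier: 1 if misclassified, else 0."""
    return 0.0 if y * w * x > 0 else 1.0

def hinge(w, x, y):
    """Hinge loss max(0, 1 - y * w * x), a surrogate for the zero-one loss."""
    return max(0.0, 1.0 - y * w * x)

def finite_diff(loss, w, x, y, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

x, y = 2.0, 1.0   # hypothetical training point with label +1
w = -0.5          # current weight misclassifies it: y * w * x = -1

g01 = finite_diff(zero_one, w, x, y)   # flat: no direction to move in
gh = finite_diff(hinge, w, x, y)       # about -y * x = -2: says "increase w"
print(g01, gh)
```

The zero-one loss is piecewise constant, so its derivative is zero wherever it exists and gives a gradient method nothing to follow, whereas the hinge loss has a nonzero slope on the misclassified side.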


1.2.2.2 ERM and SRM

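Definition 1.4. ERM (Empirical risk minimization)

$$
\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f) = \min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) \tag{1.3}
$$

Definition 1.5. Structural risk

$$
R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f) \tag{1.4}
$$

Here $J(f)$ is a penalty on the complexity of $f$ and $\lambda \ge 0$ sets its strength.

Definition 1.6. SRM (Structural risk minimization)

$$
\min_{f \in \mathcal{F}} R_{\mathrm{srm}}(f) = \min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f) \tag{1.5}
$$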
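To see how the penalty term changes which function is preferred, the sketch below evaluates the structural risk of a one-dimensional linear model f(x) = wx + b. It assumes the quadratic loss and the penalty J(f) = w², which is one common choice and not something this section prescribes. The data and names are hypothetical.

```python
def structural_risk(w, b, xs, ys, lam):
    """Empirical risk under quadratic loss plus lam * J(f), with J(f) = w**2 (assumed)."""
    n = len(xs)
    data_term = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
    penalty = w ** 2
    return data_term + lam * penalty

# Hypothetical data lying exactly on the line y = 2x.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 2.0, 4.0]

exact_fit = structural_risk(2.0, 0.0, xs, ys, lam=0.1)   # zero data term, penalty 0.4
shrunk_fit = structural_risk(1.9, 0.0, xs, ys, lam=0.1)  # small data term, smaller penalty
print(exact_fit, shrunk_fit)
```

With λ = 0 the exact fit w = 2 has zero risk and wins. With λ = 0.1 the slightly smaller weight w = 1.9 has lower structural risk, which is the trade-off between fit and complexity that SRM encodes.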

1.2.3 Optimization

Finally, we need a training algorithm (also called learning algorithm) to search among the classifiers in the hypothesis space for the highest-scoring one. The choice of optimization technique is key to the efficiency of the model.

1.3 Some basic concepts

1.3.1 Parametric vs non-parametric models

1.3.2 A simple non-parametric classifier: K-nearest neighbours

1.3.2.1 Representation

$$y = f(x) = \arg\max_{c} \sum_{x_i \in N_k(x)} \mathbb{I}(y_i = c) \tag{1.6}$$

where $N_k(x)$ is the set of k points that are closest to point x.

A k-d tree is usually used to accelerate the search for the k nearest points.
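The majority-vote rule can be sketched in a few lines of Python. This version is a brute-force search with Euclidean distance rather than a k-d tree, and the function name, the distance choice and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, x, k):
    """Return the most common label among the k training points closest to x."""
    # Brute force: sort all training indices by Euclidean distance to x.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    neighbours = order[:k]
    votes = Counter(train_y[i] for i in neighbours)
    # On a tie, Counter returns the label that was counted first.
    return votes.most_common(1)[0][0]

# Hypothetical two-cluster data.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["a", "a", "a", "b", "b", "b"]
label_near_origin = knn_predict(train_x, train_y, (0.15, 0.15), k=3)
label_near_one = knn_predict(train_x, train_y, (0.95, 1.0), k=3)
```

Each query costs a full pass over the training set here, which is the cost a k-d tree is meant to reduce.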

1.3.2.2 Evaluation

No training is needed.

1.3.2.3 Optimization

No training is needed.

1.3.3 Overfitting

1.3.4 Cross validation

Definition 1.7. Cross validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set [3].

Common types of cross-validation:

1. K-fold cross-validation. In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The process is repeated k times, with each subsample used exactly once as the validation data.

2. 2-fold cross-validation. Also called simple cross-validation or the holdout method. This is the simplest variation of k-fold cross-validation, with k = 2.

3. Leave-one-out cross-validation (LOOCV). Here k = M, the number of original samples.
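The partitioning behind these variants can be sketched as follows. This is a minimal index splitter, not a full evaluation loop; the function name, the seed parameter and the interleaved fold assignment are assumptions of this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when k does not divide n_samples.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
loocv_splits = list(k_fold_indices(n_samples=4, k=4))  # leave-one-out: k = number of samples
```

Setting k = 2 gives the 2-fold variant, and setting k to the number of samples gives leave-one-out, where every validation set holds a single example.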

1.3.5 Model selection

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one? A natural approach is to compute the misclassification rate on the training set for each method.

[3] http://en.wikipedia.org/wiki/Cross-validation_(statistics)

Chapter 2 Probability

2.1 Frequentists vs. Bayesians

What is probability? Consider a statement such as "the probability that a coin will land heads is 0.5". There are at least two different interpretations of it. One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of events. For example, the above statement means that, if we flip the coin many times, we expect it to land heads about half the time.

The other interpretation is called the Bayesian interpretation of probability. In this view, probability is used to quantify our uncertainty about something; hence it is fundamentally related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above statement means we believe the coin is equally likely to land heads or tails on the next toss.

One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our uncertainty about this event. To give another machine learning oriented example, we might have observed a blip on our radar screen, and want to compute the probability distribution over the location of the corresponding target (be it a bird, plane, or missile). In all these cases, the idea of repeated trials does not make sense, but the Bayesian interpretation is valid and indeed quite natural. We shall therefore adopt the Bayesian interpretation in this book. Fortunately, the basic rules of probability theory are the same, no matter which interpretation is adopted.

2.2 A brief review of probability theory

2.2.1 Basic concepts

We denote a random event by defining a random variable X.

Discrete random variable: X can take on any value from a finite or countably infinite set.

Continuous random variable: the value of X is real-valued.

2.2.1.1 CDF

$$F(x) \triangleq P(X \le x) = \begin{cases} \sum_{u \le x} p(u), & \text{discrete} \\ \int_{-\infty}^{x} f(u)\,du, & \text{continuous} \end{cases} \tag{2.1}$$

2.2.1.2 PMF and PDF

For a discrete random variable, we denote the probability of the event that X = x by P(X = x), or just p(x) for short. Here p(x) is called a probability mass function or PMF. A probability mass function is a function that gives the probability that a discrete random variable is exactly equal to some value [4]. It satisfies the properties $0 \le p(x) \le 1$ and $\sum_{x \in \mathcal{X}} p(x) = 1$.

For a continuous variable, in the equation $F(x) = \int_{-\infty}^{x} f(u)\,du$, the function f(x) is called a probability density function or PDF. A probability density function is a function that describes the relative likelihood for this random variable to take on a given value [5]. It satisfies the properties $f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

2.2.2 Multivariate random variables

2.2.2.1 Joint CDF

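We denote the joint CDF by

$$F(x, y) \triangleq P(X \le x \cap Y \le y) = P(X \le x, Y \le y) = \begin{cases} \sum_{u \le x,\, v \le y} p(u, v), & \text{discrete} \\ \int_{-\infty}^{x}\int_{-\infty}^{y} f(u, v)\,dv\,du, & \text{continuous} \end{cases} \tag{2.2}$$

Product rule:

$$p(X, Y) = P(X|Y)\,P(Y) \tag{2.3}$$

Chain rule: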

[4] http://en.wikipedia.org/wiki/Probability_mass_function
[5] http://en.wikipedia.org/wiki/Probability_density_function


$$p(X_{1:N}) = p(X_1)\,p(X_2|X_1)\,p(X_3|X_2, X_1) \cdots p(X_N|X_{1:N-1}) \tag{2.4}$$

2.2.2.2 Marginal distribution

Marginal CDF:

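$$F_X(x) \triangleq F(x, +\infty) = \begin{cases} \sum_{x_i \le x} P(X = x_i) = \sum_{x_i \le x} \sum_{j=1}^{+\infty} P(X = x_i, Y = y_j), & \text{discrete} \\ \int_{-\infty}^{x} f_X(u)\,du = \int_{-\infty}^{x}\int_{-\infty}^{+\infty} f(u, v)\,dv\,du, & \text{continuous} \end{cases} \tag{2.5}$$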

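$$F_Y(y) \triangleq F(+\infty, y) = \begin{cases} \sum_{y_j \le y} p(Y = y_j) = \sum_{i=1}^{+\infty} \sum_{y_j \le y} P(X = x_i, Y = y_j), & \text{discrete} \\ \int_{-\infty}^{y} f_Y(v)\,dv = \int_{-\infty}^{+\infty}\int_{-\infty}^{y} f(u, v)\,dv\,du, & \text{continuous} \end{cases} \tag{2.6}$$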

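Marginal PMF and PDF:

$$\begin{cases} P(X = x_i) = \sum_{j=1}^{+\infty} P(X = x_i, Y = y_j), & \text{discrete} \\ f_X(x) = \int_{-\infty}^{+\infty} f(x, y)\,dy, & \text{continuous} \end{cases} \tag{2.7}$$

$$\begin{cases} p(Y = y_j) = \sum_{i=1}^{+\infty} P(X = x_i, Y = y_j), & \text{discrete} \\ f_Y(y) = \int_{-\infty}^{+\infty} f(x, y)\,dx, & \text{continuous} \end{cases} \tag{2.8}$$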

2.2.2.3 Conditional distribution

Conditional PMF:

$$p(X = x_i|Y = y_j) = \frac{p(X = x_i, Y = y_j)}{p(Y = y_j)} \quad \text{if } p(Y = y_j) > 0 \tag{2.9}$$

The pmf p(X|Y) is called conditional probability.

Conditional PDF:

$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)} \tag{2.10}$$

2.2.3 Bayes rule

$$p(Y = y|X = x) = \frac{p(X = x, Y = y)}{p(X = x)} = \frac{p(X = x|Y = y)\,p(Y = y)}{\sum_{y'} p(X = x|Y = y')\,p(Y = y')} \tag{2.11}$$
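A small numerical example may help in reading Bayes rule. The sketch below computes the posterior for a discrete Y from a prior and a likelihood; the function name and every number in it (prevalence, sensitivity, false-positive rate) are hypothetical values chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Bayes rule for discrete Y: prior[y] = p(Y=y), likelihood[y] = p(X=x | Y=y)."""
    joint = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(joint.values())   # p(X = x), the denominator of Bayes rule
    return {y: joint[y] / evidence for y in joint}

# Hypothetical numbers: a condition with 1% prevalence and a test with
# 90% sensitivity and a 5% false-positive rate; x is "the test is positive".
prior = {"ill": 0.01, "healthy": 0.99}
likelihood = {"ill": 0.90, "healthy": 0.05}
post = posterior(prior, likelihood)
```

Under these assumed numbers the posterior probability of "ill" is about 0.15, far below the 0.90 sensitivity, because the prior is so small.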

2.2.4 Independence and conditional independence

We say X and Y are unconditionally independent or marginally independent, denoted $X \perp Y$, if we can represent the joint as the product of the two marginals, i.e.,

$$X \perp Y \iff P(X, Y) = P(X)\,P(Y) \tag{2.12}$$

We say X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of conditional marginals:

$$X \perp Y \,|\, Z \iff P(X, Y|Z) = P(X|Z)\,P(Y|Z) \tag{2.13}$$

2.2.5 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse; let us denote this by $F^{-1}$. If F is the cdf of X, then $F^{-1}(\alpha)$ is the value $x_\alpha$ such that $P(X \le x_\alpha) = \alpha$; this is called the α quantile of F. The value $F^{-1}(0.5)$ is the median of the distribution, with half of the probability mass on the left, and half on the right. The values $F^{-1}(0.25)$ and $F^{-1}(0.75)$ are the lower and upper quartiles.
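As a quick check of these definitions, the sketch below evaluates the inverse cdf of a standard normal using `statistics.NormalDist` from the Python standard library. The choice of the standard normal is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

median = std_normal.inv_cdf(0.5)            # F^{-1}(0.5)
lower_quartile = std_normal.inv_cdf(0.25)   # F^{-1}(0.25)
upper_quartile = std_normal.inv_cdf(0.75)   # F^{-1}(0.75)

# Applying the cdf to a quantile recovers the probability level.
recovered_level = std_normal.cdf(upper_quartile)
```

For a symmetric distribution such as this one, the median equals the mean and the two quartiles are mirror images of each other.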

2.2.6 Mean and variance

The most familiar property of a distribution is its mean, or expected value, denoted by µ. For discrete rvs, it is defined as $E[X] \triangleq \sum_{x \in \mathcal{X}} x\,p(x)$, and for continuous rvs, it is defined as $E[X] \triangleq \int_{\mathcal{X}} x\,p(x)\,dx$. If this integral is not finite, the mean is not defined (we will see some examples of this later).

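The variance is a measure of the spread of a distribution, denoted by $\sigma^2$. This is defined as follows:

$$\begin{aligned} \mathrm{var}[X] &= E[(X-\mu)^2] && \text{(2.14)} \\ &= \int (x-\mu)^2\,p(x)\,dx \\ &= \int x^2\,p(x)\,dx + \mu^2 \int p(x)\,dx - 2\mu \int x\,p(x)\,dx \\ &= E[X^2] - \mu^2 && \text{(2.15)} \end{aligned}$$

from which we derive the useful result

$$E[X^2] = \sigma^2 + \mu^2 \tag{2.16}$$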

The standard deviation is defined as

$$\mathrm{std}[X] \triangleq \sqrt{\mathrm{var}[X]} \tag{2.17}$$

This is useful since it has the same units as X itself.
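The identities for the mean, variance and standard deviation can be checked numerically on a small discrete distribution. The sketch below uses a fair six-sided die, which is an example chosen here and not one taken from the text.

```python
import math

def mean(pmf):
    """E[X] for a pmf given as a dict {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """E[(X - mu)^2], computed directly from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}   # fair six-sided die (illustrative)
mu = mean(die)
var = variance(die)
second_moment = sum(x ** 2 * p for x, p in die.items())   # E[X^2]
std = math.sqrt(var)
```

Here E[X²] agrees with σ² + µ² up to floating-point error, as the derivation of the variance predicts.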

2.3 Some common discrete distributions

In this section, we review some commonly used parametric distributions defined on discrete state spaces, both finite and countably infinite.

2.3.1 The Bernoulli and binomial distributions

Definition 2.1. Now suppose we toss a coin only once. Let $X \in \{0, 1\}$ be a binary random variable, with probability of success or heads of θ. We say that X has a Bernoulli distribution. This is written as $X \sim \mathrm{Ber}(\theta)$, where the pmf is defined as

$$\mathrm{Ber}(x|\theta) \triangleq \theta^{\mathbb{I}(x=1)}(1-\theta)^{\mathbb{I}(x=0)} \tag{2.18}$$

Definition 2.2. Suppose we toss a coin n times. Let $X \in \{0, 1, \cdots, n\}$ be the number of heads. If the probability of heads is θ, then we say X has a binomial distribution, written as $X \sim \mathrm{Bin}(n, \theta)$. The pmf is given by

$$\mathrm{Bin}(k|n, \theta) \triangleq \binom{n}{k}\theta^k(1-\theta)^{n-k} \tag{2.19}$$
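The binomial pmf is easy to evaluate directly. The sketch below uses `math.comb` for the binomial coefficient and checks two basic properties numerically; the values n = 10 and θ = 0.3 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, theta):
    """Probability of k heads in n independent tosses with head probability theta."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

n, theta = 10, 0.3
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]
total = sum(pmf)                                   # should be 1
mean = sum(k * p for k, p in enumerate(pmf))       # should be n * theta
bernoulli = [binomial_pmf(k, 1, theta) for k in (0, 1)]   # n = 1 gives Ber(theta)
```

With n = 1 the two probabilities reduce to 1 − θ and θ, which is the Bernoulli pmf.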

2.3.2 The multinoulli and multinomial distributions

Definition 2.3. The Bernoulli distribution can be used to model the outcome of one coin toss. To model the outcome of tossing a K-sided die, let $\boldsymbol{x} = (\mathbb{I}(x=1), \cdots, \mathbb{I}(x=K)) \in \{0, 1\}^K$ be a random vector (this is called dummy encoding or one-hot encoding); then we say X has a multinoulli distribution (or categorical distribution), written as $X \sim \mathrm{Cat}(\theta)$. The pmf is given by:

$$p(\boldsymbol{x}) \triangleq \prod_{k=1}^{K} \theta_k^{\mathbb{I}(x_k=1)} \tag{2.20}$$

Definition 2.4. Suppose we toss a K-sided die n times. Let $\boldsymbol{x} = (x_1, x_2, \cdots, x_K) \in \{0, 1, \cdots, n\}^K$ be a random vector, where $x_j$ is the number of times side j of the die occurs; then we say X has a multinomial distribution, written as $X \sim \mathrm{Mu}(n, \theta)$. The pmf is given by

$$p(\boldsymbol{x}) \triangleq \binom{n}{x_1 \cdots x_K} \prod_{k=1}^{K} \theta_k^{x_k} \tag{2.21}$$

where

$$\binom{n}{x_1 \cdots x_K} \triangleq \frac{n!}{x_1!\,x_2! \cdots x_K!}$$

The Bernoulli distribution is just a special case of the binomial distribution with n = 1, and likewise the multinoulli distribution is the special case of the multinomial distribution with n = 1. See Table 2.1 for a summary.

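Table 2.1: Summary of the multinomial and related distributions.

| Name | K | n | X |
|---|---|---|---|
| Bernoulli | 1 | 1 | $x \in \{0, 1\}$ |
| Binomial | 1 | - | $x \in \{0, 1, \cdots, n\}$ |
| Multinoulli | - | 1 | $\boldsymbol{x} \in \{0, 1\}^K$, $\sum_{k=1}^{K} x_k = 1$ |
| Multinomial | - | - | $\boldsymbol{x} \in \{0, 1, \cdots, n\}^K$, $\sum_{k=1}^{K} x_k = n$ |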

2.3.3 The Poisson distribution

Definition 2.5. We say that $X \in \{0, 1, 2, \cdots\}$ has a Poisson distribution with parameter λ > 0, written as $X \sim \mathrm{Poi}(\lambda)$, if its pmf is

$$p(x|\lambda) = e^{-\lambda}\frac{\lambda^x}{x!} \tag{2.22}$$

The first term is just the normalization constant, required to ensure the distribution sums to 1.

The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents.
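The role of the factor e^{−λ} as a normalization constant can be checked numerically. The sketch below sums the Poisson pmf over a truncated range; λ = 3 and the cut-off at 60 terms are arbitrary choices for illustration, made so that the neglected tail is negligible.

```python
import math

def poisson_pmf(x, lam):
    """Probability of observing the count x under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0
pmf = [poisson_pmf(x, lam) for x in range(60)]   # truncated support
total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
```

Both the mean and the variance come out equal to λ, a characteristic property of the Poisson distribution.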

2.3.4 The empirical distribution

The empirical distribution function [6], or empirical cdf, is the cumulative distribution function associated with the empirical measure of the sample. Let $\mathcal{D} = \{x_1, x_2, \cdots, x_N\}$ be a sample set; the empirical cdf is defined as

$$F_N(x) \triangleq \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(x_i \le x) \tag{2.23}$$
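The empirical cdf is a step function that jumps by 1/N at each sample point. A minimal sketch, with a hypothetical four-point sample:

```python
def empirical_cdf(sample):
    """Return a function giving the fraction of sample points that are <= x."""
    n = len(sample)

    def cdf(x):
        return sum(1 for xi in sample if xi <= x) / n

    return cdf

sample = [2.0, 5.0, 3.0, 8.0]   # illustrative data
F = empirical_cdf(sample)
values = [F(1.0), F(3.0), F(5.5), F(8.0)]
```

This version rescans the sample on every call; for repeated queries one would typically sort the sample once and use binary search instead.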

[6] http://en.wikipedia.org/wiki/Empirical_distribution_function

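Table 2.2: Summary of Bernoulli, binomial, multinoulli and multinomial distributions.

| Name | Written as | X | $p(x)$ (or $p(\boldsymbol{x})$) | E[X] | var[X] |
|---|---|---|---|---|---|
| Bernoulli | $X \sim \mathrm{Ber}(\theta)$ | $x \in \{0, 1\}$ | $\theta^{\mathbb{I}(x=1)}(1-\theta)^{\mathbb{I}(x=0)}$ | $\theta$ | $\theta(1-\theta)$ |
| Binomial | $X \sim \mathrm{Bin}(n, \theta)$ | $x \in \{0, 1, \cdots, n\}$ | $\binom{n}{k}\theta^k(1-\theta)^{n-k}$ | $n\theta$ | $n\theta(1-\theta)$ |
| Multinoulli | $X \sim \mathrm{Cat}(\theta)$ | $\boldsymbol{x} \in \{0, 1\}^K$, $\sum_{k=1}^{K} x_k = 1$ | $\prod_{k=1}^{K} \theta_k^{\mathbb{I}(x_k=1)}$ | - | - |
| Multinomial | $X \sim \mathrm{Mu}(n, \theta)$ | $\boldsymbol{x} \in \{0, 1, \cdots, n\}^K$, $\sum_{k=1}^{K} x_k = n$ | $\binom{n}{x_1 \cdots x_K}\prod_{k=1}^{K} \theta_k^{x_k}$ | - | - |
| Poisson | $X \sim \mathrm{Poi}(\lambda)$ | $x \in \{0, 1, 2, \cdots\}$ | $e^{-\lambda}\frac{\lambda^x}{x!}$ | $\lambda$ | $\lambda$ |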

2.4 Some common continuous distributions

In this section we present some commonly used univariate (one-dimensional) continuous probability distributions.

2.4.1 Gaussian (normal) distribution

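Table 2.3: Summary of Gaussian distribution.

| Written as | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|
| $X \sim \mathcal{N}(\mu, \sigma^2)$ | $\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$ | $\mu$ | $\mu$ | $\sigma^2$ |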

If $X \sim \mathcal{N}(0, 1)$, we say X follows a standard normal distribution.

The Gaussian distribution is the most widely used distribution in statistics. There are several reasons for this.

1. First, it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.

2. Second, the central limit theorem (Section TODO) tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good choice for modeling residual errors or noise.

3. Third, the Gaussian distribution makes the least number of assumptions (has maximum entropy), subject to the constraint of having a specified mean and variance, as we show in Section TODO; this makes it a good default choice in many cases.

4. Finally, it has a simple mathematical form, which results in easy to implement, but often highly effective, methods, as we will see.

See (Jaynes 2003, ch 7) for a more extensive discussion of why Gaussians are so widely used.

2.4.2 Student’s t-distribution

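Table 2.4: Summary of Student’s t-distribution.

| Written as | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|
| $X \sim \mathcal{T}(\mu, \sigma^2, \nu)$ | $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\sigma\,\Gamma\left(\frac{\nu}{2}\right)}\left[1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}$ | $\mu$ | $\mu$ | $\frac{\nu\sigma^2}{\nu-2}$ |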

where $\Gamma(x)$ is the gamma function:

$$\Gamma(x) \triangleq \int_{0}^{\infty} t^{x-1}e^{-t}\,dt \tag{2.24}$$

µ is the mean, $\sigma^2 > 0$ is the scale parameter, and ν > 0 is called the degrees of freedom. See Figure 2.1 for some plots.

The variance is only defined if ν > 2. The mean is only defined if ν > 1.

As an illustration of the robustness of the Student distribution, consider Figure 2.2. We see that the Gaussian is affected a lot by the outliers, whereas the Student distribution hardly changes. This is because the Student has heavier tails, at least for small ν (see Figure 2.1).

If ν = 1, this distribution is known as the Cauchy or Lorentz distribution. This is notable for having such heavy tails that the integral that defines the mean does not converge.

To ensure finite variance, we require ν > 2. It is common to use ν = 4, which gives good performance in a range of problems (Lange et al. 1989). For ν ≫ 5, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
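The claims about heavy tails and the Gaussian limit can be checked numerically. The sketch below assumes the standard location-scale form of the Student density, with the term ((x − µ)/σ)² and the exponent −(ν + 1)/2; the function names and the evaluation points are choices made for this example.

```python
import math

def student_t_pdf(x, mu, sigma, nu):
    """Student t density in its standard location-scale form (assumed here)."""
    z = (x - mu) / sigma
    norm = math.gamma((nu + 1) / 2) / (
        math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma
    )
    return norm * (1 + z * z / nu) ** (-(nu + 1) / 2)

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def cauchy_pdf(x):
    """Standard Cauchy density, the nu = 1 special case."""
    return 1 / (math.pi * (1 + x * x))

# Five scale units from the centre, the nu = 1 Student puts far more density
# than the standard Gaussian does.
tail_ratio = student_t_pdf(5.0, 0.0, 1.0, 1) / gaussian_pdf(5.0, 0.0, 1.0)

# With many degrees of freedom the two densities nearly coincide.
gap_large_nu = abs(student_t_pdf(1.0, 0.0, 1.0, 200) - gaussian_pdf(1.0, 0.0, 1.0))
```

The large tail ratio is what makes the Student distribution tolerant of outliers: an observation far from the centre is much less surprising under it than under a Gaussian.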

Fig. 2.1: (a) The pdfs for a $\mathcal{N}(0, 1)$, $\mathcal{T}(0, 1, 1)$ and $\mathrm{Lap}(0, 1/\sqrt{2})$. The mean is 0 and the variance is 1 for both the Gaussian and Laplace. The mean and variance of the Student are undefined when ν = 1. (b) Log of these pdfs. Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex...). Nevertheless, both are unimodal.

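Table 2.5: Summary of Laplace distribution.

| Written as | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|
| $X \sim \mathrm{Lap}(\mu, b)$ | $\frac{1}{2b}\exp\left(-\frac{\lvert x-\mu \rvert}{b}\right)$ | $\mu$ | $\mu$ | $2b^2$ |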

Fig. 2.2: Illustration of the effect of outliers on fitting Gaussian, Student and Laplace distributions. (a) No outliers (the Gaussian and Student curves are on top of each other). (b) With outliers. We see that the Gaussian is more affected by outliers than the Student and Laplace distributions.

2.4.3 The Laplace distribution

Here µ is a location parameter and b > 0 is a scale parameter. See Figure 2.1 for a plot.

Its robustness to outliers is illustrated in Figure 2.2. It also puts more probability density at 0 than the Gaussian. This property is a useful way to encourage sparsity in a model, as we will see in Section TODO.


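Table 2.6: Summary of gamma distribution

| Written as | X | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|---|
| $X \sim \mathrm{Ga}(a, b)$ | $x \in \mathbb{R}^+$ | $\frac{b^a}{\Gamma(a)}x^{a-1}e^{-xb}$ | $\frac{a}{b}$ | $\frac{a-1}{b}$ | $\frac{a}{b^2}$ |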

2.4.4 The gamma distribution

Here a > 0 is called the shape parameter and b > 0 is called the rate parameter. See Figure 2.3 for some plots.

Fig. 2.3: (a) Some Ga(a, b = 1) distributions. If a ≤ 1, the mode is at 0, otherwise it is > 0. As we increase the rate b, we reduce the horizontal scale, thus squeezing everything leftwards and upwards. (b) An empirical pdf of some rainfall data, with a fitted Gamma distribution superimposed.

2.4.5 The beta distribution

Here B(a, b) is the beta function,

$$B(a, b) \triangleq \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \tag{2.25}$$

See Figure 2.4 for plots of some beta distributions. We require a, b > 0 to ensure the distribution is integrable (i.e., to ensure B(a, b) exists). If a = b = 1, we get the uniform distribution. If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1; if a and b are both greater than 1, the distribution is unimodal.

Fig. 2.4: Some beta distributions.

2.4.6 Pareto distribution

The Pareto distribution is used to model the distribution of quantities that exhibit long tails, also called heavy tails.

As k → ∞, the distribution approaches δ(x − m). See Figure 2.5(a) for some plots. If we plot the distribution on a log-log scale, it forms a straight line, of the form log p(x) = a log x + c for some constants a and c. See Figure 2.5(b) for an illustration (this is known as a power law).
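The straight line on a log-log scale follows from taking logs of the Pareto density $k m^k x^{-(k+1)}$ given in Table 2.8: log p(x) = −(k + 1) log x + (log k + k log m), so a = −(k + 1) and c = log k + k log m. The sketch below checks this numerically; the parameter values and the two evaluation points are arbitrary choices for illustration.

```python
import math

def pareto_pdf(x, k, m):
    """Pareto density k * m**k * x**-(k+1) on x >= m, and 0 below m."""
    return k * m ** k * x ** (-(k + 1)) if x >= m else 0.0

k, m = 2.0, 1.0
x1, x2 = 2.0, 50.0

# Slope and intercept of log p(x) against log x, from two points on the curve.
slope = (math.log(pareto_pdf(x2, k, m)) - math.log(pareto_pdf(x1, k, m))) / (
    math.log(x2) - math.log(x1)
)
intercept = math.log(pareto_pdf(x1, k, m)) - slope * math.log(x1)
```

Any two points with x ≥ m give the same slope, which is what "power law" means here.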


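Table 2.7: Summary of Beta distribution

| Name | Written as | X | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|---|---|
| Beta distribution | $X \sim \mathrm{Beta}(a, b)$ | $x \in [0, 1]$ | $\frac{1}{B(a, b)}x^{a-1}(1-x)^{b-1}$ | $\frac{a}{a+b}$ | $\frac{a-1}{a+b-2}$ | $\frac{ab}{(a+b)^2(a+b+1)}$ |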

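Table 2.8: Summary of Pareto distribution

| Name | Written as | X | f(x) | E[X] | mode | var[X] |
|---|---|---|---|---|---|---|
| Pareto distribution | $X \sim \mathrm{Pareto}(k, m)$ | $x \ge m$ | $k m^k x^{-(k+1)}\,\mathbb{I}(x \ge m)$ | $\frac{km}{k-1}$ if $k > 1$ | $m$ | $\frac{m^2 k}{(k-1)^2(k-2)}$ if $k > 2$ |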

Fig. 2.5: (a) The Pareto distribution Pareto(x|m, k) for m = 1. (b) The pdf on a log-log scale.

2.5 Joint probability distributions

Given a multivariate random variable or random vector [7] $X \in \mathbb{R}^D$, the joint probability distribution [8] is a probability distribution that gives the probability that each of $X_1, X_2, \cdots, X_D$ falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables).

2.5.1 Covariance and correlation

Definition 2.6. The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related. Covariance is defined as

$$\mathrm{cov}[X, Y] \triangleq E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \tag{2.26}$$
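The two expressions for the covariance can be compared on a small sample. The sketch below treats the paired data as an equally weighted empirical distribution (so it divides by N, not N − 1); the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """E[(X - E[X])(Y - E[Y])] under the empirical distribution of the pairs."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Illustrative data with an exact linear relation y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_definition = covariance(xs, ys)
cov_shortcut = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)
```

Both routes give the same value, as the identity E[XY] − E[X]E[Y] requires.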

Definition 2.7. If X is a D-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive semi-definite matrix:

[7] http://en.wikipedia.org/wiki/Multivariate_random_variable
[8] http://en.wikipedia.org/wiki/Joint_probability_distribution