Introduction to Machine Learning Class Notes, Study notes of Machine Learning

Class notes on various topics related to machine learning, including MLE and MAP, nonparametric models, linear regression, logistic regression, naive Bayes classifier, and neural networks and deep learning. The notes provide definitions, explanations, and examples of each topic, making them useful study material for university students in machine learning courses. The notes were authored by Huy Nguyen, a PhD student in the Human-Computer Interaction Institute at Carnegie Mellon University.

Typology: Study notes

2022/2023

Uploaded on 05/11/2023

Introduction to Machine Learning Class Notes
Huy Nguyen
PhD Student, Human-Computer Interaction Institute
Carnegie Mellon University



Preface

These are the class notes I took for CMU’s 10701: Introduction to Machine Learning in Fall 2018. The goal of this document is to serve as a quick review of key points from each topic covered in the course. A more comprehensive note collection for beginners is available at UPenn’s CIS520: Machine Learning. In this document, each chapter typically covers one machine learning methodology and contains the following:

  • Definition - definition of important concepts.
  • Diving in the Math - mathematical proof for a statement / formula.
  • Algorithm - the steps to perform a common routine / subroutine.

Intertwined with these components is transitional text (which I find easier to review than bullet points), so the document as a whole ends up looking like a mini textbook. While there are already plenty of ML textbooks out there, I am still interested in writing up something that stays closest to the content taught by Professors Ziv Bar-Joseph and Pradeep Ravikumar. I would also like to take this opportunity to thank the two professors for their guidance.

Chapter 1

MLE and MAP

1.1 MLE

Definition 1: (Likelihood function and MLE) Given n data points $x_1, x_2, \dots, x_n$, we can define the likelihood of the data given the model θ (usually a collection of parameters) as follows.

$$\hat{P}(\text{dataset} \mid \theta) = \prod_{k=1}^{n} \hat{P}(x_k \mid \theta). \tag{1.1}$$

The maximum likelihood estimate (MLE) of θ is

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \hat{P}(\text{dataset} \mid \theta). \tag{1.2}$$

To determine the values for the parameters in θ, we maximize the probability of generating the observed samples. For example, let θ be the model of a coin flip (so θ = {P (Head) = q}), then the best assignment (MLE) for θ in this case is

$$\hat{\theta} = \hat{q} = \frac{\#\,\text{heads}}{\#\,\text{samples}}.$$

Diving in the Math 1 - MLE for binary variable
For a binary random variable A with P (A = 1) = q, we show that $\hat{q} = \frac{\#\,1}{\#\,\text{samples}}$. Assume we observe n samples $x_1, x_2, \dots, x_n$ with $n_1$ heads and $n_2$ tails. Then, the likelihood function is

$$P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = q^{n_1}(1-q)^{n_2}.$$

We now find $\hat{q}$ that maximizes this likelihood function, i.e., $\hat{q} = \arg\max_q \, q^{n_1}(1-q)^{n_2}$. To do so, we set the derivative to 0:

$$\frac{\partial}{\partial q}\, q^{n_1}(1-q)^{n_2} = n_1 q^{n_1-1}(1-q)^{n_2} - q^{n_1} n_2 (1-q)^{n_2-1}.$$

Setting this expression to 0 is equivalent to $q^{n_1-1}(1-q)^{n_2-1}\big(n_1(1-q) - q n_2\big) = 0$, which yields

$$n_1(1-q) - q n_2 = 0 \;\Leftrightarrow\; q = \frac{n_1}{n_1 + n_2}.$$
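The closed form for the Bernoulli MLE can be checked numerically. The sketch below is only an illustration: the sample list and the helper names `bernoulli_mle` and `log_likelihood` are made up for this example, and the grid search is a brute-force sanity check rather than part of the derivation.

```python
import math

def bernoulli_mle(samples):
    """MLE of q = P(A = 1) from 0/1 samples: (# of 1s) / (# of samples)."""
    return sum(samples) / len(samples)

def log_likelihood(q, samples):
    """log of q^{n1} (1 - q)^{n2}, where n1 counts the 1s and n2 the 0s."""
    n1 = sum(samples)
    n2 = len(samples) - n1
    return n1 * math.log(q) + n2 * math.log(1 - q)

# Hypothetical data: 7 heads (1) and 3 tails (0).
samples = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
q_hat = bernoulli_mle(samples)

# Brute-force check: search a fine grid in (0, 1) for the best q.
grid = [i / 1000 for i in range(1, 1000)]
q_grid = max(grid, key=lambda q: log_likelihood(q, samples))

print(q_hat, q_grid)
```

Maximizing the log of the likelihood gives the same maximizer as the likelihood itself, since the logarithm is increasing; it is used here only to avoid multiplying many small numbers.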

If we only have very few samples, MLE may not yield accurate results, so it is useful to take into account prior knowledge. When the number of samples gets large, the effect of prior knowledge will diminish. Similar to MLE, we have the following algorithm for MAP.

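Algorithm 2: (Finding the MAP) Given n data points $x_1, x_2, \dots, x_n$, a model θ represented by an expression for P (X | θ), and the prior knowledge P (θ), perform the following steps:

  1. Compute the log of the posterior, up to an additive constant that does not depend on θ:

$$L = \log\Big(P(\theta) \cdot \prod_{i=1}^{n} P(x_i \mid \theta)\Big) = \log P(\theta) + \sum_{i=1}^{n} \log P(x_i \mid \theta).$$

  2. For each parameter γ in θ, find the solution(s) to the equation $\frac{\partial L}{\partial \gamma} = 0$.
  3. The solution $\hat{\gamma}$ that satisfies $\left.\frac{\partial^2 L}{\partial \gamma^2}\right|_{\gamma = \hat{\gamma}} \le 0$ is the MAP of γ.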
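As an illustration of these steps, consider the coin-flip model again with a Beta(a, b) prior on q. The Beta prior is an assumption made for this sketch and is not specified in the notes above; the function names and counts are likewise hypothetical. With this prior, the stationarity condition has a closed-form solution.

```python
def bernoulli_mle(n1, n2):
    """MLE of q from n1 heads and n2 tails."""
    return n1 / (n1 + n2)

def bernoulli_map(n1, n2, a=2.0, b=2.0):
    """MAP of q under an assumed Beta(a, b) prior with a, b > 1.

    Up to a constant, the log-posterior is
        (a - 1) log q + (b - 1) log(1 - q) + n1 log q + n2 log(1 - q),
    and setting its derivative to zero gives
        q = (n1 + a - 1) / (n1 + n2 + a + b - 2).
    """
    return (n1 + a - 1) / (n1 + n2 + a + b - 2)

# Same 70% head ratio at growing sample sizes (hypothetical counts).
for n1, n2 in [(7, 3), (70, 30), (7000, 3000)]:
    print(n1 + n2, bernoulli_mle(n1, n2), bernoulli_map(n1, n2, a=5, b=5))
```

With few samples the prior pulls the estimate towards 0.5; as the sample size grows, the MAP estimate approaches the MLE, which matches the remark that the effect of prior knowledge diminishes with more data.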

Chapter 2

Nonparametric models: KNN and kernel regression

2.1 Bayes decision rule

Classification is the task of predicting a (discrete) output label given the input data. The performance of any classification algorithm depends on two factors: (1) whether the parameters are estimated correctly, and (2) whether the underlying assumptions hold. When the true distributions are known, the optimal classifier is given by the Bayes decision rule.

Algorithm 3: (Bayes decision rule) If we know the conditional probability P (x | y) and class prior P (y), use (1.4) to compute

$$P(y = i \mid x) = \frac{P(x \mid y = i)\,P(y = i)}{P(x)} \propto P(x \mid y = i)\,P(y = i) = q_i(x) \tag{2.1}$$

and use $q_i(x)$ to select the appropriate class. Choose class 0 if $q_0(x) > q_1(x)$ and 1 otherwise. In general, choose the class $\hat{c} = \arg\max_c \{q_c(x)\}$.
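The rule can be sketched in a few lines of Python. The one-dimensional Gaussian class-conditional densities, their parameters, and the function names below are assumptions chosen purely for illustration; the rule itself only needs some P (x | y) and P (y).

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def bayes_decision(x, class_conditionals, priors):
    """Return (argmax_c q_c(x), posteriors), where q_c(x) = P(x | y=c) P(y=c)."""
    q = [pdf(x) * prior for pdf, prior in zip(class_conditionals, priors)]
    total = sum(q)  # equals P(x), the normalizer in (2.1)
    posteriors = [qc / total for qc in q]
    # On a tie the larger class index wins, matching "class 0 if q_0 > q_1 and 1 otherwise".
    best = max(range(len(q)), key=lambda c: (q[c], c))
    return best, posteriors

# Hypothetical two-class problem: class 0 ~ N(0, 1), class 1 ~ N(2, 1), equal priors.
conditionals = [lambda x: gaussian_pdf(x, 0.0, 1.0),
                lambda x: gaussian_pdf(x, 2.0, 1.0)]
priors = [0.5, 0.5]

for x in (-1.0, 0.9, 1.1, 3.0):
    label, post = bayes_decision(x, conditionals, priors)
    print(x, label, [round(p, 3) for p in post])
```

Because the normalizer P (x) is the same for every class, comparing the unnormalized scores $q_c(x)$ is enough to pick the label; the posteriors are computed here only for inspection.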

Because our decision is probabilistic, there is still a chance of error. The Bayes error rate (risk) of the data distribution is the probability that an instance is misclassified by the Bayes decision rule. For binary classification, the risk for sample x is

R(x) = min{P (y = 0 | x), P (y = 1 | x)}. (2.2)

In other words, if P (y = 0 | x) > P (y = 1 | x), then we would pick the label 0, and the risk is the probability that the actual label is 1, which is P (y = 1 | x). We can also compute the expected risk - the risk for the entire range of values of x:

$$\begin{aligned}
E[R(x)] &= \int_x R(x)\,P(x)\,dx \\
&= \int_x \min\{P(y = 0 \mid x),\, P(y = 1 \mid x)\}\,P(x)\,dx \\
&= P(y = 0)\int_{L_1} P(x \mid y = 0)\,dx + P(y = 1)\int_{L_0} P(x \mid y = 1)\,dx,
\end{aligned}$$

where Li is the region over which the decision rule outputs label i. The risk value we computed assumes that both errors (assigning instances of class 1 to 0 and vice versa) are equally harmful. In general, we can set the weight penalty Li,j (x) for assigning

Diving in the Math 2 - Computing probabilities for KNN
Let z be the new point we want to classify. Let V be the volume of the m-dimensional ball R around z containing the K nearest neighbors of z (where m is the number of features). Also assume that the distribution in R is uniform. Consider the probability P that a data point chosen at random is in R. On one hand, because there are K points in R out of a total of N points, $P = \frac{K}{N}$. On the other hand, let P (x) = q be the density at a point x ∈ R (q is constant because R has uniform distribution). Then $P = \int_{x \in R} P(x)\,dx = qV$. Hence we see that the marginal probability of z is

$$P(z) = q = \frac{P}{V} = \frac{K}{NV}.$$

Similarly, the conditional probability of z given a class i is

$$P(z \mid y = i) = \frac{K_i}{N_i V},$$

where $N_i$ is the number of training points of class i and $K_i$ is the number of those points among the K nearest neighbors. Finally, we compute the prior of class i:

$$P(y = i) = \frac{N_i}{N}.$$

Using Bayes formula:

$$P(y = i \mid z) = \frac{P(z \mid y = i)\,P(y = i)}{P(z)} = \frac{K_i}{K}.$$

Using the Bayes decision rule we will choose the class with the highest probability, which corresponds to the class with the highest $K_i$ - the number of samples of class i among the K nearest neighbors.
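The result P (y = i | z) = K_i / K translates directly into code. The following is a minimal sketch assuming Euclidean distance and integer class labels 0, 1, ...; the toy data and the function names `knn_posterior` and `knn_classify` are hypothetical.

```python
import numpy as np

def knn_posterior(X_train, y_train, z, k, n_classes):
    """Estimate P(y = i | z) as K_i / K from the K nearest neighbors of z."""
    dists = np.linalg.norm(X_train - z, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the K nearest neighbors
    counts = np.bincount(y_train[nearest], minlength=n_classes)  # K_i for each class
    return counts / k

def knn_classify(X_train, y_train, z, k, n_classes):
    """Bayes decision rule on the KNN posterior: pick the class with the largest K_i."""
    return int(np.argmax(knn_posterior(X_train, y_train, z, k, n_classes)))

# Hypothetical toy data: two clusters in 2-D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
z = np.array([0.3, 0.3])

print(knn_posterior(X_train, y_train, z, k=5, n_classes=2))
print(knn_classify(X_train, y_train, z, k=5, n_classes=2))
```

Note that the volume V and the class sizes N_i cancel out of the posterior, so the classifier never needs to compute them.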

2.4 Local Kernel Regression

Kernel regression is similar to KNN but used for regression. In particular, it focuses on a specific region of the input, as opposed to the global space (like linear regression does). For example, we can output the local average:

$$\hat{f}(x) = \frac{\sum_{i=1}^{n} y_i \cdot I(\|x_i - x\| \le h)}{\sum_{i=1}^{n} I(\|x_i - x\| \le h)},$$

which can also be expressed in a form similar to linear regression:

$$\hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{I(\|x_i - x\| \le h)}{\sum_{j=1}^{n} I(\|x_j - x\| \le h)}.$$

Note that the $w_i$’s here represent a hard boundary: if $x_i$ is within distance h of x then $w_i$ takes the same positive value (one over the number of such points), else $w_i = 0$. In the general case, w can be expressed using kernel functions.

Algorithm 5: (Nadaraya-Watson Kernel Regression) Given n data points $\{(x_i, y_i)\}_{i=1}^{n}$, we can output the value at a new point x as

$$\hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{K\!\left(\frac{\|x - x_i\|}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{\|x - x_j\|}{h}\right)},$$

where K is a kernel function. Some typical kernel functions include:

  • Boxcar kernel: $K(t) = I(t \le 1)$.
  • Gaussian kernel: $K(t) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{t^2}{2}\right)$.

The distance h in this case is called the kernel bandwidth. The choice of h should depend on the number of training data points (which determines the variance) and the smoothness of the function (which determines the bias).

  • Large bandwidth averages more data points and so reduces noise (lower variance).
  • Small bandwidth fits more accurately (lower bias).

In general this is the bias-variance tradeoff. Bias represents how accurate the result is (lower bias = more accurate). Variance represents how sensitive the algorithm is to changes in the input (lower variance = less sensitive). Here a large bandwidth (h = 200) yields low variance and high bias, while a small bandwidth (h = 1) yields high variance and low bias. In this case, h = 50 seems like the best middle ground.
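To experiment with the effect of the bandwidth, Nadaraya-Watson regression can be sketched as follows for one-dimensional inputs. The noisy sine data, the bandwidth values, and the function names are assumptions of this example and are unrelated to the h = 1, 50, 200 setting discussed above.

```python
import numpy as np

def boxcar_kernel(t):
    """K(t) = I(t <= 1)."""
    return (t <= 1.0).astype(float)

def gaussian_kernel(t):
    """K(t) = exp(-t^2 / 2) / sqrt(2 pi)."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_train, y_train, x, h, kernel=gaussian_kernel):
    """Predict f(x) = sum_i w_i y_i with w_i proportional to K(|x - x_i| / h)."""
    k = kernel(np.abs(x - x_train) / h)
    total = k.sum()
    if total == 0.0:  # no training point gets nonzero weight (possible with the boxcar kernel)
        return float("nan")
    return float(np.dot(k / total, y_train))

# Hypothetical noisy samples of sin(x) on [0, 2 pi].
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 2.0 * np.pi, 200)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(x_train.size)

x0 = np.pi / 2  # the noiseless target here is sin(x0) = 1
for h in (0.01, 0.3, 5.0):
    print(h, nadaraya_watson(x_train, y_train, x0, h))
```

A very small h makes the prediction follow the nearest noisy samples (high variance), while a very large h pushes every prediction towards the global mean of the targets (high bias).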

3.2 Multivariate and general linear regression

If we have several inputs, this becomes a multivariate regression problem:

$$y = w_0 + w_1 x_1 + \dots + w_k x_k + \epsilon.$$

However, not all functions can be approximated using the input values directly. In some cases we would like to use polynomial or other terms based on the input data. As long as the model is linear in the coefficients, the equation is still a linear regression problem. For instance,

$$y = w_0 x_1 + w_1 x_1^2 + \dots + w_k x_k^2 + \epsilon.$$

Typical non-linear basis functions include:

  • Polynomial: $\phi_j(x) = x^j$,
  • Gaussian: $\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right)$,
  • Sigmoid: $\phi_j(x) = \frac{1}{1 + \exp(-s_j x)}$.

Using this new notation, we formulate the general linear regression problem:

$$y = \sum_{j} w_j \phi_j(x),$$

where $\phi_j(x)$ can either be $x_j$ for multivariate regression or one of the non-linear bases we defined. Now assume the general case where we have n data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})$, and each data point has k features (recall that feature j of $x^{(i)}$ is denoted $x_j^{(i)}$). Again using LSE to find the optimal solution, by defining

$$\Phi = \begin{pmatrix}
\phi_0(x^{(1)}) & \phi_1(x^{(1)}) & \dots & \phi_k(x^{(1)}) \\
\phi_0(x^{(2)}) & \phi_1(x^{(2)}) & \dots & \phi_k(x^{(2)}) \\
\vdots & \vdots & & \vdots \\
\phi_0(x^{(n)}) & \phi_1(x^{(n)}) & \dots & \phi_k(x^{(n)})
\end{pmatrix}
= \begin{pmatrix}
\phi(x^{(1)})^T \\
\phi(x^{(2)})^T \\
\vdots \\
\phi(x^{(n)})^T
\end{pmatrix}, \tag{3.4}$$

we then get

$$w = (\Phi^T \Phi)^{-1} \Phi^T y. \tag{3.5}$$

Diving in the Math 4 - LSE for general linear regression problem
Our goal is to minimize the following loss function:

$$J(w) = \sum_{i} \Big(y^{(i)} - \sum_{j} w_j \phi_j(x^{(i)})\Big)^2 = \sum_{i} \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2,$$

where w and $\phi(x^{(i)})$ are vectors of dimension k + 1 and $y^{(i)}$ is a scalar. Setting the derivative w.r.t. w to 0:

$$\frac{\partial}{\partial w} \sum_{i} \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2 = -2 \sum_{i} \big(y^{(i)} - w^T \phi(x^{(i)})\big)\,\phi(x^{(i)})^T = 0,$$

which yields

$$\sum_{i} y^{(i)} \phi(x^{(i)})^T = w^T \sum_{i} \phi(x^{(i)})\,\phi(x^{(i)})^T.$$

Hence, defining Φ as in (3.4) would give us

$$(\Phi^T \Phi)\,w = \Phi^T y \;\Rightarrow\; w = (\Phi^T \Phi)^{-1} \Phi^T y.$$

To sum up, we have the following algorithm for the general linear regression problem.

Algorithm 6: (General linear regression algorithm) Input: Given n input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)}$ is 1 × m and $y^{(i)}$ is scalar, as well as m basis functions $\{\phi_j\}_{j=1}^{m}$, we find

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2$$

by the following procedure:

  1. Compute Φ as in (3.4).
  2. Output $\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y$.
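The two steps of this procedure can be sketched with NumPy. The polynomial basis, the noiseless quadratic data, and the helper names `design_matrix` and `fit_least_squares` are assumptions made for illustration only.

```python
import numpy as np

def design_matrix(x, basis_functions):
    """Build Phi as in (3.4): row i holds the basis functions evaluated at data point i."""
    return np.column_stack([phi(x) for phi in basis_functions])

def fit_least_squares(Phi, y):
    """Solve the normal equations (Phi^T Phi) w = Phi^T y for w."""
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Hypothetical example: polynomial basis phi_j(x) = x^j for j = 0, 1, 2.
basis = [lambda x, j=j: x ** j for j in range(3)]
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x ** 2   # noiseless targets generated from known weights

Phi = design_matrix(x, basis)
w_hat = fit_least_squares(Phi, y)
print(w_hat)
```

Because the targets are generated without noise from a function inside the span of the basis, the recovered weights should match the generating coefficients up to rounding error.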

3.3 Regularized least squares

3.3.1 Definition

In the previous section we saw that a linear regression problem involves solving $(\Phi^T \Phi)\,w = \Phi^T y$ for w. If $\Phi^T \Phi$ is invertible, we get $w = (\Phi^T \Phi)^{-1} \Phi^T y$ as in (3.5). Now what if $\Phi^T \Phi$ is not invertible?

Recall that full-rank square matrices are invertible, and that

$$\operatorname{rank}(\Phi^T \Phi) = \text{the number of non-zero eigenvalues of } \Phi^T \Phi \le \min(n, k), \text{ since } \Phi \text{ is } n \times k.$$

In other words, $\Phi^T \Phi$ (a k × k matrix) is not invertible if n < k, i.e., there are more features than data points. More specifically, we have n equations and k > n unknowns - this is an underdetermined system of linear equations with many feasible solutions. In that case, the solution needs to be further constrained.

One way, for example, is Ridge Regression - using the L2 norm as a penalty to bias the solution towards “small” values of w (so that small changes in input don’t translate to large changes in output):

$$\begin{aligned}
\hat{w}_{Ridge} &= \arg\min_{w} \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \|w\|_2^2 \\
&= \arg\min_{w}\, (\Phi w - y)^T (\Phi w - y) + \lambda \|w\|_2^2, \quad \lambda \ge 0 \\
&= (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y.
\end{aligned} \tag{3.6}$$
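The closed form (3.6) stays solvable even when there are more features than data points. The sketch below assumes random data with n = 3 and k = 5 purely for illustration; `fit_ridge` is a hypothetical helper name.

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed form (3.6): w = (Phi^T Phi + lam I)^{-1} Phi^T y, for lam > 0."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

# Hypothetical underdetermined problem: n = 3 data points, k = 5 features.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((3, 5))
y = rng.standard_normal(3)

# The rank of the 5 x 5 matrix Phi^T Phi is at most 3, so it is singular.
print(np.linalg.matrix_rank(Phi.T @ Phi))

for lam in (0.01, 1.0, 100.0):
    w = fit_ridge(Phi, y, lam)
    print(lam, np.linalg.norm(w))   # the norm of w shrinks as lam grows
```

Adding λI with λ > 0 shifts every eigenvalue of the positive semidefinite matrix ΦᵀΦ up by λ, which is why the regularized system has a unique solution.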

We could also use Lasso Regression (L1 penalty)

$$\hat{w}_{Lasso} = \arg\min_{w} \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \|w\|_1, \tag{3.7}$$

which biases towards many parameter values being zero - in other words, many inputs become irrelevant to prediction in high-dimensional settings. There is no closed form solution for (3.7), but it can be optimized by sub-gradient descent.
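Sub-gradient descent on the Lasso objective (3.7) can be sketched as follows. The diminishing step-size schedule, the step-size heuristic, the synthetic data, and the function names are all assumptions of this example; practical solvers typically use other methods such as coordinate descent.

```python
import numpy as np

def lasso_objective(X, y, w, lam):
    """sum_i (y_i - x_i w)^2 + lam * ||w||_1, as in (3.7)."""
    r = y - X @ w
    return float(r @ r + lam * np.abs(w).sum())

def lasso_subgradient_descent(X, y, lam, n_iters=5000):
    """Minimize (3.7) by sub-gradient descent with step size step0 / sqrt(t)."""
    # Heuristic initial step (an assumption of this sketch): 1 / (2 * sigma_max(X)^2).
    step0 = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(X.shape[1])
    best_w, best_obj = w.copy(), lasso_objective(X, y, w, lam)
    for t in range(1, n_iters + 1):
        # np.sign(0) = 0 is a valid sub-gradient of |w_j| at w_j = 0.
        g = -2.0 * X.T @ (y - X @ w) + lam * np.sign(w)
        w = w - (step0 / np.sqrt(t)) * g
        obj = lasso_objective(X, y, w, lam)
        if obj < best_obj:  # sub-gradient steps are not monotone, so keep the best iterate
            best_w, best_obj = w.copy(), obj
    return best_w, best_obj

# Hypothetical sparse ground truth: only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = np.array([2.0, 0.0, 0.0, -3.0, 0.0])
y = X @ w_true + 0.1 * rng.standard_normal(50)

lam = 5.0
w_hat, best_obj = lasso_subgradient_descent(X, y, lam)
print(np.round(w_hat, 3), best_obj)
```

Plain sub-gradient steps rarely land on exact zeros, so the irrelevant coefficients come out small rather than exactly zero; the relevant ones are shrunk slightly towards zero by the penalty.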

Diving in the Math 5 - Ridge regression and MCAP
Since we are given $P(w) \propto \exp(-w^T w / 2\tau^2)$, let $P(w) = \exp(-c \|w\|_2^2)$, where c is some constant, then $-\log P(w) = c \|w\|_2^2$, so (3.10) is equivalent to finding

$$\inf_{w} \{L(w) + c \|w\|_2^2\} = \inf_{w} \{L(w)\} \text{ such that } \|w\|_2^2 \le L(c),$$

where L(c) is a bijective function of c. So adding $c \|w\|_2^2$ is the same as the ridge regression constraint $\|w\|_2^2 \le L$ for some constant L.

Similarly, we can encode the Lasso bias by letting $w_i \sim \text{Laplace}(0, t)$ (iid) and $P(w_i) \propto \exp(-|w_i| / t)$, which would yield

$$\begin{aligned}
\hat{w}_{MCAP} &= \arg\max_{w} \underbrace{\log P(\{y_i\}_{i=1}^{n} \mid w, \sigma^2, \{x_i\}_{i=1}^{n})}_{\text{conditional log likelihood}} + \underbrace{\log P(w)}_{\text{log prior}} \\
&= \arg\min_{w} \sum_{i=1}^{n} (x_i w - y_i)^2 + \lambda \|w\|_1 = \hat{w}_{Lasso},
\end{aligned} \tag{3.11}$$

where λ is a constant that depends on $\sigma^2$ and t, and the last equality follows from (3.7). In other words, the prior belief that w is Laplace-distributed with mean 0 biases the solution towards “sparse” w.

Chapter 4

Logistic Regression

4.1 Definition

We know that regression is for predicting real-valued output Y, while classification is for predicting (finite) discrete-valued Y. But is there a way to connect regression to classification? Can we predict the “probability” of a class label? The answer is generally yes, but we have to keep in mind the constraint that the probability value should lie in [0, 1].

Definition 4: (Logistic Regression) Assume the following functional form for P (Y | X):

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i} w_i X_i)\big)},$$

$$P(Y = 0 \mid X) = \frac{1}{1 + \exp\!\big(w_0 + \sum_{i} w_i X_i\big)}.$$

In essence, logistic regression means applying the logistic function $\sigma(z) = \frac{1}{1 + \exp(-z)}$ to a linear function of the data. However, note that it is still a linear classifier.

Diving in the Math 6 - Logistic Regression as linear classifier
Note that P (Y = 1 | X) can be rewritten as

$$P(Y = 1 \mid X) = \frac{\exp\!\big(w_0 + \sum_{i} w_i X_i\big)}{1 + \exp\!\big(w_0 + \sum_{i} w_i X_i\big)}.$$

We would assign label 1 if P (Y = 1 | X) > P (Y = 0 | X), which is equivalent to

$$\exp\!\Big(w_0 + \sum_{i} w_i X_i\Big) > 1 \;\Leftrightarrow\; w_0 + \sum_{i} w_i X_i > 0.$$

Similarly, we would assign label 0 if P (Y = 1 | X) < P (Y = 0 | X), which is equivalent to

$$\exp\!\Big(w_0 + \sum_{i} w_i X_i\Big) < 1 \;\Leftrightarrow\; w_0 + \sum_{i} w_i X_i < 0.$$

In other words, the decision boundary is the hyperplane $w_0 + \sum_{i} w_i X_i = 0$, which is linear in X.
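The equivalence between comparing probabilities and checking the sign of the linear score can be illustrated with a short sketch. The parameter values and function names below are hypothetical, and a score of exactly 0 is arbitrarily mapped to label 0 here.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w0, w, x):
    """P(Y = 1 | X = x) for logistic regression with bias w0 and weights w."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

def predict_label(w0, w, x):
    """Assign label 1 exactly when the linear score w0 + sum_i w_i x_i is positive."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

w0, w = -1.0, [2.0, -0.5]  # hypothetical parameters
for x in ([1.0, 0.0], [0.0, 1.0], [0.5, 0.0]):
    p1 = predict_proba(w0, w, x)
    # The last two columns agree: thresholding the probability at 1/2
    # is the same as checking the sign of the linear score.
    print(x, round(p1, 3), predict_label(w0, w, x), int(p1 > 1 - p1))
```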

Diving in the Math 7 - Log likelihood of logistic regression is concave
For convenience we denote $x_0^{(i)} = 1$, so that $w_0 + \sum_{j=1}^{d} w_j x_j^{(i)} = w^T x^{(i)}$. We first note the following lemmas:

  1. If f is convex then −f is concave and vice versa.
  2. A linear combination of n convex (concave) functions f 1 , f 2 ,... , fn with nonnegative coefficients is convex (concave).
  3. A twice differentiable function is convex if and only if its second derivative is nonnegative. Using this property, we can see that f (x) = log(1 + exp x) is convex.
  4. If f and g are both convex, twice differentiable and g is non-decreasing, then g ◦ f is convex.

Now we rewrite l(w) as follows:

$$\begin{aligned}
l(w) &= \sum_{i=1}^{n} \Big[ y^{(i)} w^T x^{(i)} - \log\big(1 + \exp(w^T x^{(i)})\big) \Big] \\
&= \sum_{i=1}^{n} y^{(i)} w^T x^{(i)} - \sum_{i=1}^{n} \log\big(1 + \exp(w^T x^{(i)})\big) \\
&= \sum_{i=1}^{n} y^{(i)} f_i(w) - \sum_{i=1}^{n} g(f_i(w)),
\end{aligned}$$

where $f_i(w) = w^T x^{(i)}$ and $g(z) = \log(1 + \exp z)$. $f_i(w)$ is of the form $Aw + b$ where $A = (x^{(i)})^T$ and b = 0, which means it’s affine (i.e., both concave and convex). We also know that g(z) is convex, and it’s easy to see g is non-decreasing. This means $g(f_i(w))$ is convex, or equivalently, $-g(f_i(w))$ is concave. To sum up, we can express l(w) as

$$l(w) = \underbrace{\sum_{i=1}^{n} y^{(i)} f_i(w)}_{\text{concave}} + \underbrace{\sum_{i=1}^{n} -g(f_i(w))}_{\text{concave}},$$

hence l(w) is concave.
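The concavity claim can also be checked numerically. The sketch below relies on the standard fact (not derived in these notes) that the Hessian of l(w) equals $-X^T S X$, where S is diagonal with entries $p_i(1 - p_i)$ and $p_i = \sigma(w^T x^{(i)})$; the random data and function names are assumptions of this example.

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i w^T x_i - log(1 + exp(w^T x_i)) ]."""
    z = X @ w
    # np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way.
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def hessian(w, X):
    """Hessian of l(w): -X^T diag(p_i (1 - p_i)) X, with p_i = sigma(w^T x_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -(X * (p * (1.0 - p))[:, None]).T @ X

# Hypothetical data: n = 100 points, d = 3 features, plus the constant feature x_0 = 1.
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, d))])
y = rng.integers(0, 2, size=n).astype(float)   # arbitrary 0/1 labels

# Check 1: every eigenvalue of the Hessian is non-positive (up to rounding).
w = rng.standard_normal(d + 1)
eigvals = np.linalg.eigvalsh(hessian(w, X))
print(eigvals.max() <= 1e-10)

# Check 2: midpoint concavity, l((a + b) / 2) >= (l(a) + l(b)) / 2.
a, b = rng.standard_normal(d + 1), rng.standard_normal(d + 1)
mid = log_likelihood(0.5 * (a + b), X, y)
print(mid >= 0.5 * (log_likelihood(a, X, y) + log_likelihood(b, X, y)))
```

These checks only test particular points, so they illustrate the result rather than prove it; the proof is the lemma-based argument above.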

As such, it can be optimized by the gradient ascent algorithm.

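Algorithm 7: (Gradient ascent algorithm)
Initialize: Pick w at random.
Gradient:

$$\nabla_w E(w) = \left[ \frac{\partial E(w)}{\partial w_0},\; \frac{\partial E(w)}{\partial w_1},\; \dots,\; \frac{\partial E(w)}{\partial w_d} \right]$$

Update:

$$\Delta w = \eta \nabla_w E(w), \qquad w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \frac{\partial E(w)}{\partial w_i},$$

where η > 0 is the learning rate.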

In this case our likelihood function is specified in (4.4), so we have the following steps for training logistic regression:

Algorithm 8: (Gradient ascent algorithm for logistic regression)
Initialize: Pick w at random and a learning rate η.
Update:

  • Set an ε > 0 and denote

$$\hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) = \frac{\exp\!\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}{1 + \exp\!\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}$$

  • Iterate until $|w_0^{(t+1)} - w_0^{(t)}| < \epsilon$:

$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_{i=1}^{n} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big]$$

  • For k = 1, ..., d, iterate until $|w_k^{(t+1)} - w_k^{(t)}| < \epsilon$:

$$w_k^{(t+1)} \leftarrow w_k^{(t)} + \eta \sum_{i=1}^{n} x_k^{(i)} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big]$$
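A vectorized sketch of this training loop is shown below. It makes a few simplifying assumptions that differ from the listing above: the weights start from zeros instead of a random point, all coordinates are updated simultaneously with a single stopping test, and a maximum iteration count is added as a safeguard. The synthetic data, the learning rate, and the function name are hypothetical.

```python
import numpy as np

def fit_logistic_gradient_ascent(X, y, eta=0.01, eps=1e-6, max_iters=10000):
    """Maximize the logistic log likelihood by gradient ascent; returns [w_0, ..., w_d]."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend x_0 = 1 so that w[0] plays the role of w_0
    w = np.zeros(d + 1)                     # assumption: start from zeros, not a random w
    for _ in range(max_iters):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))   # P(y = 1 | x, w) for every sample
        w_new = w + eta * (Xb.T @ (y - p))    # simultaneous update of w_0, ..., w_d
        if np.max(np.abs(w_new - w)) < eps:   # stop once every coordinate moves less than eps
            return w_new
        w = w_new
    return w

# Hypothetical data drawn from a logistic model with known weights.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 2))
w_true = np.array([-0.5, 2.0, -1.0])          # [w_0, w_1, w_2]
p_true = 1.0 / (1.0 + np.exp(-(w_true[0] + X @ w_true[1:])))
y = (rng.random(n) < p_true).astype(float)

w_hat = fit_logistic_gradient_ascent(X, y)
pred = (w_hat[0] + X @ w_hat[1:] > 0).astype(float)
print(np.round(w_hat, 2), (pred == y).mean())
```

The learning rate has to be small enough for the updates to be stable; since the gradient is a sum over all n samples, a value that works for one dataset size may need to be reduced for a larger one.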