





























































Class notes on various topics related to machine learning, including MLE and MAP, nonparametric models, linear regression, logistic regression, naive Bayes classifier, and neural networks and deep learning. The notes provide definitions, explanations, and examples of each topic, making them useful study material for university students in machine learning courses. Authored by Huy Nguyen, a PhD student in the Human-Computer Interaction Institute at Carnegie Mellon University.






























































These are the class notes I took for CMU’s 10701: Introduction to Machine Learning in Fall 2018. The goal of this document is to serve as a quick review of key points from each topic covered in the course. A more comprehensive note collection for beginners is available at UPenn’s CIS520: Machine Learning. In this document, each chapter typically covers one machine learning methodology and contains the following:
Intertwined with these components is transitional text (as I find it easier to review than bullet points), so the document as a whole ends up looking like a mini textbook. While there are already plenty of ML textbooks out there, I am still interested in writing up something that stays closest to the content taught by Professors Ziv Bar-Joseph and Pradeep Ravikumar. I would also like to take this opportunity to thank the two professors for their guidance.
Definition 1: (Likelihood function and MLE) Given n data points $x_1, x_2, \ldots, x_n$, we can define the likelihood of the data given the model $\theta$ (usually a collection of parameters) as follows.
$$\hat{P}(\text{dataset} \mid \theta) = \prod_{k=1}^{n} \hat{P}(x_k \mid \theta). \tag{1.1}$$
The maximum likelihood estimate (MLE) of $\theta$ is
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \hat{P}(\text{dataset} \mid \theta). \tag{1.2}$$
To determine the values for the parameters in $\theta$, we maximize the probability of generating the observed samples. For example, let $\theta$ be the model of a coin flip (so $\theta = \{P(\text{Head}) = q\}$); then the best assignment (MLE) for $\theta$ in this case is $\hat{\theta} = \{\hat{q}\}$ with
$$\hat{q} = \frac{\#\text{ heads}}{\#\text{ samples}}.$$
Diving in the Math 1 - MLE for binary variable
For a binary random variable A with $P(A = 1) = q$, we show that $\hat{q} = \frac{\#\,1}{\#\text{ samples}}$. Assume we observe n samples $x_1, x_2, \ldots, x_n$ with $n_1$ heads and $n_2$ tails. Then, the likelihood function is
$$P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) = q^{n_1}(1 - q)^{n_2}.$$
We now find $\hat{q}$ that maximizes this likelihood function, i.e., $\hat{q} = \arg\max_q q^{n_1}(1 - q)^{n_2}$.
To do so, we set the derivative to 0:
$$0 = \frac{\partial}{\partial q}\, q^{n_1}(1 - q)^{n_2} = n_1 q^{n_1 - 1}(1 - q)^{n_2} - q^{n_1} n_2 (1 - q)^{n_2 - 1},$$
which is equivalent to $q^{n_1 - 1}(1 - q)^{n_2 - 1}\big(n_1(1 - q) - q n_2\big) = 0$, which yields
$$n_1(1 - q) - q n_2 = 0 \iff q = \frac{n_1}{n_1 + n_2}.$$
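The closed form derived above can be checked numerically. The sketch below is only an illustration: the sample list and the helper names `bernoulli_mle` and `likelihood` are made up for this example. It compares $n_1 / (n_1 + n_2)$ with a brute-force search for the maximizer of $q^{n_1}(1 - q)^{n_2}$ over a grid.

```python
def bernoulli_mle(samples):
    """Closed-form MLE q_hat = n1 / (n1 + n2) for a list of 0/1 samples."""
    n1 = sum(samples)            # number of ones (heads)
    n2 = len(samples) - n1       # number of zeros (tails)
    return n1 / (n1 + n2)


def likelihood(q, samples):
    """Likelihood q^n1 * (1 - q)^n2 of the samples under P(A = 1) = q."""
    n1 = sum(samples)
    n2 = len(samples) - n1
    return q ** n1 * (1 - q) ** n2


# Hypothetical data: 5 heads and 3 tails.
samples = [1, 0, 1, 1, 0, 1, 1, 0]
q_hat = bernoulli_mle(samples)

# Brute-force check: the grid point with the highest likelihood.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda q: likelihood(q, samples))

print(q_hat, best_on_grid)
```

Both values should agree up to the grid resolution, which is what the derivation predicts.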
If we only have very few samples, MLE may not yield accurate results, so it is useful to take into account prior knowledge. When the number of samples gets large, the effect of prior knowledge will diminish. Similar to MLE, we have the following algorithm for MAP.
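Algorithm 2: (Finding the MAP) Given n data points $x_1, x_2, \ldots, x_n$, a model $\theta$ represented by an expression for $P(X \mid \theta)$, and the prior knowledge $P(\theta)$, perform the following steps:
Write the log of the prior times the likelihood:
$$L = \log\Big(P(\theta) \cdot \prod_{i=1}^{n} P(x_i \mid \theta)\Big) = \log P(\theta) + \sum_{i=1}^{n} \log P(x_i \mid \theta).$$
For each parameter $\gamma$ in $\theta$, solve $\frac{\partial L}{\partial \gamma} = 0$; a solution $\hat{\gamma}$ that also satisfies
$$\frac{\partial^2 L}{\partial \hat{\gamma}^2} \le 0$$
is the MAP of $\gamma$.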
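To make the role of the prior concrete, here is a minimal sketch for the coin-flip model. The Beta($\alpha$, $\beta$) prior is an assumption of this example, not something the notes specify at this point. With $P(q) \propto q^{\alpha - 1}(1 - q)^{\beta - 1}$, setting the derivative of $L$ to zero gives $\hat{q}_{MAP} = \frac{n_1 + \alpha - 1}{n_1 + n_2 + \alpha + \beta - 2}$. The function name `bernoulli_map` and the counts are hypothetical.

```python
def bernoulli_map(n1, n2, alpha, beta):
    """MAP estimate of q under an assumed Beta(alpha, beta) prior.

    Maximizes log P(q) + n1*log(q) + n2*log(1 - q), which gives
    (n1 + alpha - 1) / (n1 + n2 + alpha + beta - 2).
    """
    return (n1 + alpha - 1) / (n1 + n2 + alpha + beta - 2)


# Assumed prior belief: the coin is roughly fair (Beta(5, 5)).
alpha, beta = 5, 5

# Few samples: 3 heads, 1 tail. The MLE is 0.75, the MAP is pulled towards 0.5.
map_small = bernoulli_map(3, 1, alpha, beta)

# Many samples with the same ratio: the effect of the prior diminishes.
map_large = bernoulli_map(300, 100, alpha, beta)

# A flat prior (Beta(1, 1)) recovers the MLE.
map_flat = bernoulli_map(3, 1, 1, 1)

print(map_small, map_large, map_flat)
```

The two calls with the same head ratio illustrate the earlier remark that the effect of prior knowledge diminishes as the number of samples grows.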
Classification is the task of predicting a (discrete) output label given the input data. The performance of any classification algorithm depends on two factors: (1) whether the parameters are correct, and (2) whether the underlying assumptions hold. The optimal algorithm is called the Bayes decision rule.
Algorithm 3: (Bayes decision rule) If we know the conditional probability $P(x \mid y)$ and class prior $P(y)$, use (1.4) to compute
$$P(y = i \mid x) = \frac{P(x \mid y = i)P(y = i)}{P(x)} \propto P(x \mid y = i)P(y = i) = q_i(x) \tag{2.1}$$
and use $q_i(x)$ to select the appropriate class. Choose class 0 if $q_0(x) > q_1(x)$ and 1 otherwise. In general choose the class $\hat{c} = \arg\max_c \{q_c(x)\}$.
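The rule can be sketched in a few lines once the class-conditional densities are known. In the example below the two 1-D Gaussian class conditionals, the equal priors, and the names `gaussian_pdf` and `bayes_decision` are all assumptions made for illustration. In practice these quantities are rarely known exactly.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def bayes_decision(x, class_conditionals, priors):
    """Return (c_hat, scores) where scores[i] = q_i(x) = P(x | y=i) * P(y=i).

    Ties are broken towards the lowest class index, an arbitrary choice
    made for this sketch.
    """
    scores = [p(x) * prior for p, prior in zip(class_conditionals, priors)]
    c_hat = max(range(len(scores)), key=scores.__getitem__)
    return c_hat, scores


# Assumed setup: class 0 ~ N(0, 1), class 1 ~ N(2, 1), equal priors.
class_conditionals = [
    lambda x: gaussian_pdf(x, 0.0, 1.0),
    lambda x: gaussian_pdf(x, 2.0, 1.0),
]
priors = [0.5, 0.5]

label_left, _ = bayes_decision(0.5, class_conditionals, priors)
label_right, _ = bayes_decision(1.5, class_conditionals, priors)
print(label_left, label_right)
```

With equal priors and equal variances the decision boundary sits halfway between the two means, so points on either side of x = 1 receive different labels.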
Because our decision is probabilistic, there is still a chance of error. The Bayes error rate (risk) of the data distribution is the probability that an instance is misclassified by the Bayes decision rule. For binary classification, the risk for sample x is
$$R(x) = \min\{P(y = 0 \mid x), P(y = 1 \mid x)\}. \tag{2.2}$$
In other words, if $P(y = 0 \mid x) > P(y = 1 \mid x)$, then we would pick the label 0, and the risk is the probability that the actual label is 1, which is $P(y = 1 \mid x)$. We can also compute the expected risk - the risk for the entire range of values of x:
$$E[R(x)] = \int_x R(x)P(x)\,dx = \int_x \min\{P(y = 0 \mid x), P(y = 1 \mid x)\}\,P(x)\,dx = P(y = 0)\int_{L_1} P(x \mid y = 0)\,dx + P(y = 1)\int_{L_0} P(x \mid y = 1)\,dx,$$
where $L_i$ is the region over which the decision rule outputs label i. The risk value we computed assumes that both errors (assigning instances of class 1 to 0 and vice versa) are equally harmful. In general, we can set the weight penalty $L_{i,j}(x)$ for assigning
Diving in the Math 2 - Computing probabilities for KNN
Let z be the new point we want to classify. Let V be the volume of the m-dimensional ball R around z containing the K nearest neighbors of z (where m is the number of features). Also assume that the distribution in R is uniform. Consider the probability P that a data point chosen at random is in R. On one hand, because there are K points in R out of a total of N points, $P = \frac{K}{N}$. On the other hand, let $P(x) = q$ be the density at a point $x \in R$ (q is constant because the distribution in R is uniform). Then $P = \int_{x \in R} P(x)\,dx = qV$. Hence we see that the marginal probability of z is
$$P(z) = q = \frac{K}{NV}.$$
Similarly, with $N_i$ the number of training points of class i and $K_i$ the number of them among the K nearest neighbors, the conditional probability of z given a class i is
$$P(z \mid y = i) = \frac{K_i}{N_i V}.$$
Finally, we compute the prior of class i:
$$P(y = i) = \frac{N_i}{N}.$$
Using Bayes formula:
$$P(y = i \mid z) = \frac{P(z \mid y = i)P(y = i)}{P(z)} = \frac{K_i}{K}.$$
Using the Bayes decision rule we will choose the class with the highest probability, which corresponds to the class with the highest $K_i$ - the number of the K nearest neighbors that belong to class i.
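The result $P(y = i \mid z) = K_i / K$ translates directly into code: find the K nearest neighbors and count labels. This is a minimal sketch with made-up 2-D points. The function name `knn_posterior` is hypothetical and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_posterior(z, points, labels, k):
    """Estimate P(y = i | z) as K_i / K from the k nearest neighbors of z."""
    # Indices of the training points sorted by Euclidean distance to z.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], z))
    counts = Counter(labels[i] for i in order[:k])   # K_i for each class i
    return {c: n / k for c, n in counts.items()}


# Hypothetical training set: two small clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
z = (1, 1)

posterior_k3 = knn_posterior(z, points, labels, k=3)
posterior_k5 = knn_posterior(z, points, labels, k=5)

# Bayes decision rule: pick the class with the highest K_i.
prediction = max(posterior_k5, key=posterior_k5.get)
print(posterior_k3, posterior_k5, prediction)
```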
2.4 Local Kernel Regression
Kernel regression is similar to KNN but used for regression. In particular, it focuses on a specific region of the input, as opposed to the global space (like linear regression does). For example, we can output the local average:
$$\hat{f}(x) = \frac{\sum_{i=1}^{n} y_i \cdot I(\|x_i - x\| \le h)}{\sum_{i=1}^{n} I(\|x_i - x\| \le h)},$$
which can also be expressed in a form similar to linear regression:
$$\hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{I(\|x_i - x\| \le h)}{\sum_{j=1}^{n} I(\|x_j - x\| \le h)}.$$
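Note that the $w_i$’s here represent a hard boundary: if $x_i$ is within distance h of x then $w_i$ takes the same positive value for every such point (one over the number of points within distance h), else $w_i = 0$. In the general case, w can be expressed using kernel functions.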
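Algorithm 5: (Nadaraya-Watson Kernel Regression) Given n data points $\{(x_i, y_i)\}_{i=1}^{n}$, we can output the value at a new point x as
$$\hat{f}(x) = \sum_{i=1}^{n} w_i y_i, \qquad w_i = \frac{K\left(\frac{\|x - x_i\|}{h}\right)}{\sum_{j=1}^{n} K\left(\frac{\|x - x_j\|}{h}\right)},$$
where K is a kernel function. Some typical kernel functions include:
$$K(t) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{t^2}{2}\right) \quad \text{(Gaussian kernel)}.$$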
The distance h in this case is called the kernel bandwidth. The choice of h should depend on the number of training data points (which determines variance) and the smoothness of the function (which determines bias).
In general this is the bias-variance tradeoff. Bias represents how accurate the result is (lower bias = more accurate). Variance represents how sensitive the algorithm is to changes in the input (lower variance = less sensitive). Here a large bandwidth (h = 200) yields low variance and high bias, while a small bandwidth (h = 1) yields high variance and low bias. In this case, h = 50 seems like the best middle ground.
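The estimator and the effect of the bandwidth can be sketched as follows. The 1-D data and the bandwidth values are made up for illustration (they are not the h = 1, 50, 200 setting discussed above), and `gaussian_kernel` and `nadaraya_watson` are hypothetical helper names.

```python
import math


def gaussian_kernel(t):
    """Gaussian kernel K(t) = exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)


def nadaraya_watson(x, xs, ys, h, kernel=gaussian_kernel):
    """Kernel regression estimate at x: a weighted average of the y_i."""
    raw = [kernel(abs(x - x_i) / h) for x_i in xs]   # unnormalized weights
    total = sum(raw)
    return sum(w * y_i for w, y_i in zip(raw, ys)) / total


# Hypothetical 1-D data sampled from y = x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

# Small bandwidth: the estimate follows the nearest training point (low bias,
# high variance). Large bandwidth: it approaches the global mean of the y_i
# (high bias, low variance).
narrow = nadaraya_watson(2.0, xs, ys, h=0.1)
wide = nadaraya_watson(2.0, xs, ys, h=100.0)
print(narrow, wide)
```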
3.2 Multivariate and general linear regression
If we have several inputs, this becomes a multivariate regression problem:
$$y = w_0 + w_1 x_1 + \ldots + w_k x_k + \epsilon.$$
However, not all functions can be approximated using the input values directly. In some cases we would like to use polynomial or other terms based on the input data. As long as the model is linear in the coefficients, the equation is still a linear regression problem. For instance,
$$y = w_0 x_1 + w_1 x_1^2 + \ldots + w_k x_k^2 + \epsilon.$$
Typical non-linear basis functions include:
$$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right) \quad \text{(Gaussian basis)},$$
Using this new notation, we formulate the general linear regression problem:
$$y = \sum_j w_j \phi_j(x),$$
where $\phi_j(x)$ can either be $x_j$ for multivariate regression or one of the non-linear bases we defined. Now assume the general case where we have n data points $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(n)}, y^{(n)})$, and each data point has k features (recall that feature j of $x^{(i)}$ is denoted $x_j^{(i)}$). Again using LSE to find the optimal solution, by defining
$$\Phi = \begin{pmatrix} \phi_0(x^{(1)}) & \phi_1(x^{(1)}) & \ldots & \phi_k(x^{(1)}) \\ \phi_0(x^{(2)}) & \phi_1(x^{(2)}) & \ldots & \phi_k(x^{(2)}) \\ \vdots & \vdots & & \vdots \\ \phi_0(x^{(n)}) & \phi_1(x^{(n)}) & \ldots & \phi_k(x^{(n)}) \end{pmatrix} = \begin{pmatrix} \phi(x^{(1)})^T \\ \phi(x^{(2)})^T \\ \vdots \\ \phi(x^{(n)})^T \end{pmatrix}, \tag{3.4}$$
we then get
$$w = (\Phi^T \Phi)^{-1} \Phi^T y. \tag{3.5}$$
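Diving in the Math 4 - LSE for general linear regression problem
Our goal is to minimize the following loss function:
$$J(w) = \sum_i \Big(y^{(i)} - \sum_j w_j \phi_j(x^{(i)})\Big)^2 = \sum_i \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2,$$
where $w$ and $\phi(x^{(i)})$ are vectors of dimension k + 1 and $y^{(i)}$ is a scalar. Setting the derivative w.r.t. w to 0:
$$0 = \frac{\partial}{\partial w} \sum_i \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2 = -2 \sum_i \big(y^{(i)} - w^T \phi(x^{(i)})\big)\phi(x^{(i)})^T,$$
which yields
$$\sum_i y^{(i)} \phi(x^{(i)})^T = w^T \sum_i \phi(x^{(i)})\phi(x^{(i)})^T.$$
With $\Phi$ defined as in (3.4), we have $\sum_i \phi(x^{(i)})\phi(x^{(i)})^T = \Phi^T \Phi$ and $\sum_i y^{(i)} \phi(x^{(i)}) = \Phi^T y$, so transposing both sides gives
$$(\Phi^T \Phi)w = \Phi^T y \;\Rightarrow\; w = (\Phi^T \Phi)^{-1} \Phi^T y.$$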
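As a sanity check of the normal equations, the sketch below builds $\Phi$ for a polynomial basis and solves $(\Phi^T \Phi)w = \Phi^T y$ with a small Gaussian-elimination routine. It uses only the standard library, and the helper names `solve` and `fit_linear`, the basis, and the noiseless data are assumptions of this example. A numerical library would normally be used instead of a hand-written solver.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):                        # back substitution
        tail = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - tail) / M[r][r]
    return w


def fit_linear(xs, ys, basis):
    """Least squares fit: solve (Phi^T Phi) w = Phi^T y."""
    Phi = [[phi(x) for phi in basis] for x in xs]       # n rows, one per point
    k = len(basis)
    gram = [[sum(row[a] * row[b] for row in Phi) for b in range(k)] for a in range(k)]
    rhs = [sum(row[a] * y for row, y in zip(Phi, ys)) for a in range(k)]
    return solve(gram, rhs)


# Assumed basis: phi_0(x) = 1, phi_1(x) = x, phi_2(x) = x^2.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]                # noiseless y = 1 + 2x + 3x^2

w = fit_linear(xs, ys, basis)
print(w)
```

Because the data are generated without noise from a function inside the span of the basis, the recovered weights should match the generating coefficients up to floating-point error.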
To sum up, we have the following algorithm for the general linear regression problem.
Algorithm 6: (General linear regression algorithm) Input: Given n input data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ where $x^{(i)}$ is 1 × m and $y^{(i)}$ is scalar, as well as m basis functions $\{\phi_j\}_{j=1}^{m}$, we find
$$\hat{w} = \arg\min_w \sum_{i=1}^{n} \big(y^{(i)} - w^T \phi(x^{(i)})\big)^2$$
by the following procedure:
3.3 Regularized least squares
In the previous section we saw that a linear regression problem involves solving $(\Phi^T \Phi)w = \Phi^T y$ for w. If $\Phi^T \Phi$ is invertible, we would get $w = (\Phi^T \Phi)^{-1} \Phi^T y$ as in (3.5). Now what if $\Phi^T \Phi$ is not invertible?
Recall that a square matrix is invertible exactly when it has full rank, and that
$$\operatorname{rank}(\Phi^T \Phi) = \text{the number of non-zero eigenvalues of } \Phi^T \Phi \le \min(n, k) \text{ since } \Phi \text{ is } n \times k.$$
In other words, $\Phi^T \Phi$ (a k × k matrix) is not invertible if n < k, i.e., there are more features than data points. More specifically, we have n equations and k > n unknowns - this is an underdetermined system of linear equations with many feasible solutions. In that case, the solution needs to be further constrained.
One way, for example, is Ridge Regression - using the L2 norm as a penalty to bias the solution to “small” values of w (so that small changes in input don’t translate to large changes in output):
$$\hat{w}_{Ridge} = \arg\min_w \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \|w\|_2^2 = \arg\min_w (\Phi w - y)^T(\Phi w - y) + \lambda \|w\|_2^2, \quad \lambda \ge 0$$
$$= (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y. \tag{3.6}$$
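The following sketch illustrates (3.6) in the smallest underdetermined case: one data point (n = 1) and two features (k = 2), so $\Phi^T \Phi$ is singular and only the $\lambda I$ term makes the system solvable. The function name `ridge_two_features` and the numbers are made up for this example, and the 2 × 2 system is solved with Cramer's rule to stay within the standard library.

```python
def ridge_two_features(Phi, y, lam):
    """Ridge solution (Phi^T Phi + lam*I)^{-1} Phi^T y for exactly two features."""
    # Entries of A = Phi^T Phi + lam * I (symmetric 2 x 2).
    a00 = sum(r[0] * r[0] for r in Phi) + lam
    a01 = sum(r[0] * r[1] for r in Phi)
    a11 = sum(r[1] * r[1] for r in Phi) + lam
    # Entries of b = Phi^T y.
    b0 = sum(r[0] * t for r, t in zip(Phi, y))
    b1 = sum(r[1] * t for r, t in zip(Phi, y))
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("Phi^T Phi + lam*I is singular; increase lam")
    # Cramer's rule for the 2 x 2 system A w = b.
    return [(a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det]


# One equation, two unknowns: w_1 + 2*w_2 = 3 has infinitely many solutions.
Phi = [[1.0, 2.0]]
y = [3.0]

w_ridge = ridge_two_features(Phi, y, lam=1.0)
print(w_ridge)
```

With `lam=0.0` the same call raises `ValueError`, which mirrors the case where $\Phi^T \Phi$ alone is not invertible.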
We could also use Lasso Regression (L1 penalty)
$$\hat{w}_{Lasso} = \arg\min_w \sum_{i=1}^{n} (y_i - x_i w)^2 + \lambda \|w\|_1, \tag{3.7}$$
which biases towards many parameter values being zero - in other words, many inputs become irrelevant to prediction in high-dimensional settings. There is no closed form solution for (3.7), but it can be optimized by sub-gradient descent.
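A minimal sketch of sub-gradient descent on the objective in (3.7) follows. The step-size schedule $\eta_t = \eta_0 / \sqrt{t}$, the choice of 0 as the sub-gradient of $|w_j|$ at $w_j = 0$, the tiny data set, and the name `lasso_subgradient` are all assumptions of this example. Plain sub-gradient descent only drives irrelevant weights close to zero rather than exactly to zero.

```python
import math


def sign(v):
    """Sign of v, with sign(0) = 0 (a valid sub-gradient of |v| at 0)."""
    return (v > 0) - (v < 0)


def lasso_subgradient(X, y, lam, steps=5000, eta0=0.1):
    """Minimize sum_i (x_i w - y_i)^2 + lam * ||w||_1 by sub-gradient descent."""
    k = len(X[0])
    w = [0.0] * k
    for t in range(1, steps + 1):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(X, y)]
        grad = [2 * sum(r * row[j] for r, row in zip(residuals, X)) + lam * sign(w[j])
                for j in range(k)]
        eta = eta0 / math.sqrt(t)                  # diminishing step size
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w


# Hypothetical orthogonal design: feature 0 matters, feature 1 barely does.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]

w_lasso = lasso_subgradient(X, y, lam=1.0)
print(w_lasso)
```

For this design the exact minimizer can be worked out by hand (soft-thresholding each coordinate by $\lambda / 2$), giving $w = (2.5, 0)$, so the printed weights should be close to those values.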
Diving in the Math 5 - Ridge regression and MCAP
Since we are given $P(w) \propto \exp(-w^T w / 2\tau^2)$, let $P(w) = \exp(-c\|w\|_2^2)$, where c is some constant; then $-\log P(w) = c\|w\|_2^2$, so (3.10) is equivalent to finding
$$\inf_w \{L(w) + c\|w\|_2^2\} = \inf_w \{L(w)\} \text{ such that } \|w\|_2^2 \le L(c),$$
where $L(c)$ is a bijective function of c. So adding $c\|w\|_2^2$ is the same as the ridge regression constraint $\|w\|_2^2 \le L$ for some constant L.
Similarly, we can encode the Lasso bias by letting $w_i \sim \text{Laplace}(0, t)$ (iid) and $P(w_i) \propto \exp(-|w_i| / t)$, which would yield
$$\hat{w}_{MCAP} = \arg\max_w \underbrace{\log P(\{y_i\}_{i=1}^{n} \mid w, \sigma^2, \{x_i\}_{i=1}^{n})}_{\text{Conditional log likelihood}} + \underbrace{\log P(w)}_{\text{log prior}} = \arg\min_w \sum_{i=1}^{n} (x_i w - y_i)^2 + \lambda \|w\|_1 = \hat{w}_{Lasso}, \tag{3.11}$$
where $\lambda$ is a constant that depends on $\sigma^2$ and t, and the last equality follows from (3.7). In other words, the prior belief that w is Laplace with mean 0 biases the solution towards “sparse” w.
We know that regression is for predicting real-valued output Y, while classification is for predicting (finite) discrete-valued Y. But is there a way to connect regression to classification? Can we predict the “probability” of a class label? The answer is generally yes, but we have to keep in mind the constraint that the probability value should lie in [0, 1].
Definition 4: (Logistic Regression) Assume the following functional form for $P(Y \mid X)$:
$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\big(-(w_0 + \sum_i w_i X_i)\big)}, \qquad P(Y = 0 \mid X) = \frac{1}{1 + \exp\big(w_0 + \sum_i w_i X_i\big)}.$$
In essence, logistic regression means applying the logistic function $\sigma(z) = \frac{1}{1 + \exp(-z)}$ to a linear function of the data. However, note that it is still a linear classifier.
Diving in the Math 6 - Logistic Regression as linear classifier
Note that $P(Y = 1 \mid X)$ can be rewritten as
$$P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_i w_i X_i)}{1 + \exp(w_0 + \sum_i w_i X_i)}.$$
We would assign label 1 if $P(Y = 1 \mid X) > P(Y = 0 \mid X)$, which is equivalent to
$$\exp\Big(w_0 + \sum_i w_i X_i\Big) > 1 \iff w_0 + \sum_i w_i X_i > 0.$$
Similarly, we would assign label 0 if $P(Y = 1 \mid X) < P(Y = 0 \mid X)$, which is equivalent to
$$\exp\Big(w_0 + \sum_i w_i X_i\Big) < 1 \iff w_0 + \sum_i w_i X_i < 0.$$
In other words, the decision boundary is the hyperplane $w_0 + \sum_i w_i X_i = 0$, which is linear in X.
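The equivalence between thresholding the probability at 0.5 and checking the sign of the linear score can be verified directly. The weights and test points below are arbitrary values chosen for illustration, and `sigmoid`, `score` and `predict_label` are hypothetical helper names.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w0, w):
    """Linear score w_0 + sum_i w_i * x_i."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def predict_label(x, w0, w):
    """Label 1 if the linear score is positive, else 0."""
    return 1 if score(x, w0, w) > 0 else 0


# Arbitrary weights and inputs (assumptions for this example).
w0, w = -1.0, [2.0, -1.0]
inputs = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (3.0, 1.0)]

probabilities = [sigmoid(score(x, w0, w)) for x in inputs]
labels = [predict_label(x, w0, w) for x in inputs]

# Thresholding P(Y = 1 | X) at 0.5 gives the same labels as the sign test.
labels_from_probability = [1 if p > 0.5 else 0 for p in probabilities]
print(labels, labels_from_probability)
```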
Diving in the Math 7 - Log likelihood of logistic regression is concave
For convenience we denote $x_0^{(i)} = 1$, so that $w_0 + \sum_{j=1}^{d} w_j x_j^{(i)} = w^T x^{(i)}$. We first note the following lemmas:
Now we rewrite $l(w)$ as follows:
$$l(w) = \sum_{i=1}^{n} \Big[ y^{(i)} w^T x^{(i)} - \log\big(1 + \exp(w^T x^{(i)})\big) \Big] = \sum_{i=1}^{n} y^{(i)} w^T x^{(i)} - \sum_{i=1}^{n} \log\big(1 + \exp(w^T x^{(i)})\big) = \sum_{i=1}^{n} y^{(i)} f_i(w) - \sum_{i=1}^{n} g(f_i(w)),$$
where $f_i(w) = w^T x^{(i)}$ and $g(z) = \log(1 + \exp z)$. $f_i(w)$ is of the form $Aw + b$ where $A = x^{(i)T}$ and $b = 0$, which means it’s affine (i.e., both concave and convex). We also know that $g(z)$ is convex, and it’s easy to see g is non-decreasing. This means $g(f_i(w))$ is convex, or equivalently, $-g(f_i(w))$ is concave. To sum up, we can express $l(w)$ as
$$l(w) = \underbrace{\sum_{i=1}^{n} y^{(i)} f_i(w)}_{\text{concave}} + \underbrace{\sum_{i=1}^{n} -g(f_i(w))}_{\text{concave}},$$
and a sum of concave functions is concave, hence $l(w)$ is concave.
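Concavity can also be spot-checked numerically: for a concave function, the value at the midpoint of two points is at least the average of the values at those points. The sketch below uses a made-up data set and two arbitrary weight vectors. Passing this check is only a sanity test, not a proof.

```python
import math


def log_likelihood(w, X, y):
    """l(w) = sum_i [ y_i * w^T x_i - log(1 + exp(w^T x_i)) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, x_i))
        total += y_i * z - math.log1p(math.exp(z))
    return total


# Hypothetical data; the first column is the constant feature x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

# Two arbitrary weight vectors and their midpoint.
a = [-1.0, 0.5]
b = [2.0, -1.5]
mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]

# For a concave l: l(mid) - (l(a) + l(b)) / 2 >= 0.
gap = log_likelihood(mid, X, y) - 0.5 * (log_likelihood(a, X, y) + log_likelihood(b, X, y))
print(gap)
```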
As such, it can be optimized by the gradient ascent algorithm.
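Algorithm 7: (Gradient ascent algorithm)
Initialize: Pick w at random.
Gradient:
$$\nabla_w E(w) = \left[ \frac{\partial E(w)}{\partial w_0}, \frac{\partial E(w)}{\partial w_1}, \ldots, \frac{\partial E(w)}{\partial w_d} \right]$$
Update:
$$\Delta w = \eta \nabla_w E(w), \qquad w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta \frac{\partial E(w)}{\partial w_i},$$
where $\eta > 0$ is the learning rate.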
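A generic version of the update can be sketched as follows. The objective $E(w) = -(w_0 - 1)^2 - (w_1 + 2)^2$, the fixed number of steps, and the names `gradient_ascent` and `grad_E` are assumptions chosen so that the maximizer $(1, -2)$ is known in advance.

```python
import random


def gradient_ascent(grad, w_init, eta=0.1, steps=200):
    """Repeat w <- w + eta * grad(w) for a fixed number of steps."""
    w = list(w_init)
    for _ in range(steps):
        g = grad(w)
        w = [wi + eta * gi for wi, gi in zip(w, g)]
    return w


def grad_E(w):
    """Gradient of the assumed concave objective E(w) = -(w0 - 1)^2 - (w1 + 2)^2."""
    return [-2 * (w[0] - 1.0), -2 * (w[1] + 2.0)]


# Initialize: pick w at random (seeded so the run is reproducible).
rng = random.Random(0)
w_init = [rng.uniform(-1, 1), rng.uniform(-1, 1)]

w_star = gradient_ascent(grad_E, w_init)
print(w_star)
```

Because the objective is concave, any starting point leads to the same maximizer as long as the learning rate is small enough.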
In this case our likelihood function is specified in (4.4), so we have the following steps for training logistic regression:
Algorithm 8: (Gradient ascent algorithm for logistic regression)
Initialize: Pick w at random and a learning rate $\eta$.
Update:
$$\hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) = \frac{\exp\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}{1 + \exp\big(w_0^{(t)} + \sum_{j=1}^{d} w_j^{(t)} x_j^{(i)}\big)}$$
$$w_0^{(t+1)} \leftarrow w_0^{(t)} + \eta \sum_{i=1}^{n} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big]$$
$$w_k^{(t+1)} \leftarrow w_k^{(t)} + \eta \sum_{i=1}^{n} x_k^{(i)} \Big[ y^{(i)} - \hat{P}(y^{(i)} = 1 \mid x^{(i)}, w^{(t)}) \Big]$$
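The updates above can be sketched in plain Python. The data set, the learning rate, the number of steps and the helper names are assumptions of this example. The bias $w_0$ is handled by prepending a constant feature $x_0 = 1$ to every input, so a single update rule covers both $w_0$ and $w_k$.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def log_likelihood(w, X, y):
    """Conditional log likelihood sum_i [ y_i * w^T x_i - log(1 + exp(w^T x_i)) ]."""
    return sum(y_i * dot(w, x_i) - math.log1p(math.exp(dot(w, x_i)))
               for x_i, y_i in zip(X, y))


def train_logistic(X, y, eta=0.05, steps=2000, seed=0):
    """Gradient ascent: w_k <- w_k + eta * sum_i x_k^(i) * (y^(i) - P_hat_i)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.01, 0.01) for _ in range(len(X[0]))]   # random init
    for _ in range(steps):
        errors = [y_i - sigmoid(dot(w, x_i)) for x_i, y_i in zip(X, y)]
        w = [w_k + eta * sum(e * x_i[k] for e, x_i in zip(errors, X))
             for k, w_k in enumerate(w)]
    return w


# Hypothetical 1-D data, not linearly separable; first column is x_0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

w = train_logistic(X, y)
print(w, log_likelihood(w, X, y))
```

The learning rate is kept small here so that each step increases the log likelihood on this data set. A larger value may overshoot, and a stopping criterion based on the change in likelihood could replace the fixed step count.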