CS 229 Supervised Learning Cheatsheet (2025/2026), Exams of Machine Learning

Johns Hopkins University (JHU), Machine Learning

This study guide for the CS 229 Supervised Learning exam (2025/2026) collects, in 20 pages, the formulas, algorithms, and core concepts behind the mathematical foundations of machine learning (pp. 1, 20). Key topics include:

Regression & Classification: formulas for Linear and Logistic Regression, including MSE loss, normal equations, and log-likelihood (pp. 1-2).
Advanced Models: Support Vector Machines (SVMs), Naive Bayes, Decision Trees, and Generalized Linear Models (GLMs) (pp. 3-5).
Ensembles & Regularization: Bagging, AdaBoost, and Gradient Boosting, alongside L1 (Lasso) and L2 (Ridge) regularization (pp. 7-8).
Model Evaluation: the Bias-Variance decomposition and the Precision, Recall, and F1-score metrics (pp. 10-11).

Typology: Exams

2025/2026

Available from 03/28/2026

BrainBank254 🇺🇸

CS 229 Supervised Learning Cheatsheet Updated For 2025/2026

Linear Regression

1. Hypothesis:

$h_\theta(x) = \theta^T x$

2. MSE Loss:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$

3. Normal Equation (MLE):

$\hat{\theta} = (X^TX)^{-1}X^Ty$

4. Regularized Normal Equation:

$\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$

5. Probabilistic model (Gaussian noise):


$p(y \mid x;\theta) = \mathcal{N}(y \mid \theta^T x, \sigma^2)$

Logistic Regression

Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)$
Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big) x^{(i)}$

$g(\mu) = \eta, \quad \mu = \mathbb{E}[y \mid x]$

Logistic regression = Bernoulli + logit link
Linear regression = Gaussian + identity link

Perceptron

Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$

SVM (Support Vector Machines)
Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1$
Soft-margin primal: $\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1-\xi_i, \; \xi_i \geq 0$
Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)})$
Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$

Naive Bayes

Posterior:

$IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$

Gini index: $G(S) = 1-\sum_c p_c^2$

Ensembles

Bagging prediction: $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
AdaBoost weight update: $w_i^{(t+1)} = w_i^{(t)} \exp\big(\alpha_t \mathbb{1}\{h_t(x^{(i)}) \neq y^{(i)}\}\big)$
AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
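The two AdaBoost formulas above can be traced through a single boosting round. The sketch below is only an illustration under stated assumptions: the function name `adaboost_round` and the toy data are hypothetical, labels are taken to be +1/-1, $\epsilon_t$ is taken to be the weighted error of the weak learner, and the final renormalization step (weights summing to 1) is a common convention that the sheet does not state.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round following the sheet's two formulas.

    weights: current sample weights, assumed to sum to 1
    predictions, labels: sequences of +1 / -1
    Returns (alpha_t, new_weights); new_weights are renormalized to sum to 1,
    which is an added convention, not something the sheet specifies.
    """
    miss = [1 if p != y else 0 for p, y in zip(predictions, labels)]
    # Weighted error of the weak learner (epsilon_t).
    eps = sum(w * m for w, m in zip(weights, miss))
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # w_i * exp(alpha * 1{h(x_i) != y_i}): only misclassified points grow.
    unnormalized = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(unnormalized)
    return alpha, [w / total for w in unnormalized]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # only the last point is misclassified
alpha, new_weights = adaboost_round(weights, predictions, labels)
print(alpha, new_weights)
```

With one of four equally weighted points misclassified, $\epsilon_t = 0.25$, so $\alpha_t = \frac{1}{2}\ln 3$ and the misclassified point ends up with the largest weight.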

Gradient boosting residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$

Regularization

Ridge regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|^2$
Lasso regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
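Setting the gradient of the ridge objective above to zero gives $(X^TX + \lambda I)\theta = X^Ty$, which has a closed-form solution (the lasso objective has none in general). The following is a minimal NumPy sketch of that closed form, not a reference implementation: `ridge_fit`, the synthetic data, and the choice to penalize every coefficient (no separate unpenalized intercept) are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (hypothetical): 50 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=50)

theta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers least squares
theta_ridge = ridge_fit(X, y, lam=10.0)  # lam > 0 shrinks the coefficients

print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge))
```

The ridge solution has a smaller Euclidean norm than the unregularized one, which is the shrinkage effect the $\|\theta\|^2$ penalty is meant to produce.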

$p(x \mid y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$

LDA discriminant function: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$
QDA discriminant function: $\delta_k(x) = -\tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$

Model Evaluation

Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$

Precision: $\frac{TP}{TP+FP}$
Recall: $\frac{TP}{TP+FN}$
F1-score: $F_1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
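The three metrics above follow directly from confusion-matrix counts. The short sketch below computes them for binary labels; the function name `precision_recall_f1`, the 0/1 label encoding, the toy labels, and the convention of returning 0.0 when a denominator is zero are all assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Returning 0.0 for an empty denominator is a convention chosen here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # TP = 3, FP = 1, FN = 1, so all three are 0.75
```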
ROC AUC: $\int_0^1 TPR(FPR^{-1}(x))\, dx$

Tricks & Notes

Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
Stochastic Gradient Descent:

2. Linear Regression

Model: $h_\theta(x) = \theta^T x$
Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

3. Logistic Regression (Classification)

Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
Model: $h_\theta(x) = g(\theta^T x)$

Negative log-likelihood cost: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$

4. Generalized Linear Models (GLM)

Predict $E[y \mid x]$ using link function $g$: $g(E[y \mid x]) = \theta^T x$

5. Perceptron Algorithm

Initialize $\theta = 0$.
For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update $\theta := \theta + y^{(i)} x^{(i)}$.
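The perceptron steps above translate almost line for line into code. This is a minimal sketch under a few assumptions: labels are +1/-1, the bias is handled by a constant 1.0 feature in each input, the toy data is linearly separable, and the `max_epochs` cap and early stop on a mistake-free pass are additions that the sheet does not mention. The name `perceptron_train` is hypothetical.

```python
def perceptron_train(points, labels, max_epochs=100):
    """Perceptron with the update theta := theta + y * x on each mistake."""
    theta = [0.0] * len(points[0])  # initialize theta = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            margin = y * sum(t * xi for t, xi in zip(theta, x))
            if margin <= 0:  # misclassified or exactly on the boundary
                theta = [t + y * xi for t, xi in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: stop early
            break
    return theta

# The first coordinate is a constant 1.0 that plays the role of a bias term.
points = [(1.0, 2.0, 1.0), (1.0, 3.0, 2.0), (1.0, -1.0, -2.0), (1.0, -2.0, -1.0)]
labels = [1, 1, -1, -1]
theta = perceptron_train(points, labels)
print(theta)
```

On this toy set the very first point triggers the only update, after which every point has a positive margin and the loop stops on the second pass.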

$H(Y) = - \sum_{c} P(y=c)\log P(y=c)$

Information gain: $IG(X_j) = H(Y) - H(Y \mid X_j)$

9. Bias-Variance Tradeoff

Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$

10. Regularization

L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|^2$
L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$

1. Linear Regression (Closed Form MLE)

$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Ty$
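Because the factor $\frac{1}{2\sigma^2}$ does not change the minimizer, the MLE above is the ordinary least-squares solution, and batch gradient descent on the MSE cost should converge to the same vector. The sketch below checks this numerically on synthetic data; the data, the step size `alpha = 0.1`, and the iteration count are assumptions chosen for the illustration.

```python
import numpy as np

# Synthetic data (hypothetical): intercept column plus two Gaussian features.
rng = np.random.default_rng(42)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
true_theta = np.array([1.0, 3.0, -2.0])
y = X @ true_theta + 0.05 * rng.normal(size=m)

# Closed form: solve the normal equations X^T X theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(theta) = 1/(2m) * sum (theta^T x - y)^2.
theta_gd = np.zeros(3)
alpha = 0.1
for _ in range(2000):
    gradient = X.T @ (X @ theta_gd - y) / m
    theta_gd -= alpha * gradient

print(theta_closed)
print(theta_gd)
```

The closed form needs a single linear solve, which is cheap for a handful of features; gradient descent is the usual fallback when $X^TX$ is too large to factor.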

2. Logistic Regression Log-Likelihood

$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)$

Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}$

Hessian: $H = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^T$

3. SVM Primal (Soft Margin)

$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$

where entropy is: $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
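Entropy and information gain as defined above are straightforward to compute for a categorical attribute. The sketch below is illustrative only: the helper names, the toy attributes `outlook` and `windy`, and the use of base-2 logarithms (entropy in bits) are assumptions, since the sheet leaves the log base unspecified.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), measured in bits (base 2 assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(S, A) = H(S) - sum_v |S_v| / |S| * H(S_v)."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)  # S_v: labels where A takes value v
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
outlook = ["sun", "sun", "rain", "rain"]  # splits the classes perfectly
windy = ["t", "f", "t", "f"]              # carries no information

print(information_gain(outlook, labels))
print(information_gain(windy, labels))
```

A decision-tree learner would split on `outlook` here, since it removes the full bit of label uncertainty while `windy` removes none.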

7. Bias-Variance Decomposition

$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$

8. Regularized Logistic Regression

$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$
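The regularized logistic cost above differentiates to $\nabla J(\theta) = \frac{1}{m}X^T(h - y) + \frac{\lambda}{m}\theta$, where $h$ collects the predictions $h_\theta(x^{(i)})$. The sketch below evaluates the cost and this gradient and runs a few gradient-descent steps. Several details are assumptions made for illustration: the function names, the synthetic data, the step size, the small `eps` guarding `log(0)`, and penalizing every component of $\theta$ exactly as the formula is written (many presentations exclude an intercept term).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y, lam):
    """Regularized logistic cost J(theta) and its gradient."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    eps = 1e-12  # numerical guard so log never sees exactly 0
    nll = -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    cost = nll + lam / (2 * m) * (theta @ theta)
    grad = X.T @ (h - y) / m + lam / m * theta
    return cost, grad

# Synthetic, hypothetical data: label is 1 when the two features sum above 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

theta = np.zeros(2)
initial_cost, _ = cost_and_grad(theta, X, y, lam=1.0)
for _ in range(500):
    _, grad = cost_and_grad(theta, X, y, lam=1.0)
    theta -= 0.5 * grad  # plain gradient descent with a fixed step
final_cost, _ = cost_and_grad(theta, X, y, lam=1.0)

print(initial_cost, final_cost)
```

At $\theta = 0$ every prediction is 0.5, so the starting cost is $\ln 2$; the penalty keeps $\theta$ finite even though this toy data is linearly separable.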

9. Gaussian Discriminant Analysis (Class Posterior)

$p(y=k \mid x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$

10. Gradient Boosting Update Rule

Residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$
Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
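For squared loss $L = \frac{1}{2}(y - F)^2$ the residual formula above reduces to $r_m^{(i)} = y^{(i)} - F_{m-1}(x^{(i)})$, so each round fits a weak learner to the current errors. The sketch below shows the update rule with one-split regression stumps on 1-D inputs. It is a toy under stated assumptions: the loss, the stump learner, the constant initial model $F_0 = \bar{y}$, the data, and the names `fit_stump` and `gradient_boost` are all illustrative choices rather than anything fixed by the sheet.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump: one threshold, one value per side."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = x <= threshold
        left_value, right_value = r[left].mean(), r[~left].mean()
        sse = ((r[left] - left_value) ** 2).sum() + ((r[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, with h_m fit to the squared-loss residuals."""
    f0 = y.mean()              # constant initial model (an assumption)
    F = np.full_like(y, f0)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - F      # negative gradient of 1/2 * (y - F)^2 w.r.t. F
        h = fit_stump(x, residuals)
        F = F + nu * h(x)
        stumps.append(h)
    return lambda q: f0 + nu * sum(h(q) for h in stumps)

x = np.linspace(0.0, 6.0, 40)
y = np.sin(x)
model = gradient_boost(x, y)

mse_constant = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
print(mse_constant, mse_boosted)
```

The shrinkage factor $\nu$ trades speed for stability: smaller values need more rounds but make each stump's contribution less abrupt.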
