CS 229 Supervised Learning Cheatsheet (2025/2026), Exams of Machine Learning

This 20-page study guide for the CS 229 Supervised Learning exam (2025/2026) collects the formulas, algorithms, and core concepts behind the mathematical foundations of machine learning (pp. 1, 20). Key topics:

  • Regression & Classification: formulas for Linear and Logistic Regression, including MSE loss, the normal equations, and the log-likelihood (pp. 1-2).
  • Advanced Models: Support Vector Machines (SVMs), Naive Bayes, Decision Trees, and Generalized Linear Models (GLMs) (pp. 3-5).
  • Ensembles & Regularization: Bagging, AdaBoost, and Gradient Boosting, alongside L1 (Lasso) and L2 (Ridge) regularization (pp. 7-8).
  • Model Evaluation: the Bias-Variance decomposition and the Precision, Recall, and F1-score metrics (pp. 10-11).

Typology: Exams

2025/2026

Available from 03/28/2026

BrainBank254
CS 229 Supervised Learning Cheatsheet Updated For
2025/2026

Linear Regression

  1. Hypothesis: $h_\theta(x) = \theta^T x$
  2. MSE Loss: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$
  3. Normal Equation (MLE): $\hat{\theta} = (X^TX)^{-1}X^Ty$
  4. Regularized Normal Equation: $\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$
  5. Probabilistic model (Gaussian noise): $p(y|x;\theta) = \mathcal{N}(y|\theta^Tx,\sigma^2)$

Logistic Regression

  1. Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
  2. Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
  3. Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)$
  4. Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big) x^{(i)}$
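
The log-likelihood above can be maximized by plain gradient ascent. The following is a minimal sketch rather than a reference implementation: it assumes the standard gradient $\sum_i (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}$, and the function names, toy data, learning rate, and step count are all illustrative choices that do not come from the sheet.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = hypothesis(theta, x_i)
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return total

def gradient(theta, X, y):
    """grad l(theta) = sum_i (y_i - h_i) x_i."""
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = y_i - hypothesis(theta, x_i)
        for j, v in enumerate(x_i):
            grad[j] += err * v
    return grad

def fit(X, y, lr=0.05, steps=2000):
    """Gradient ascent: theta := theta + lr * grad l(theta)."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t + lr * g_j for t, g_j in zip(theta, g)]
    return theta

# Made-up, non-separable toy data; the leading 1.0 is an intercept feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]
theta = fit(X, y)
```

Because the log-likelihood is concave, a sufficiently small step size moves `theta` uphill at every iteration; the value 0.05 was picked for this toy data only.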

Generalized Linear Models (GLM)

Link function: $g(\mu) = \eta, \quad \mu = \mathbb{E}[y|x]$, with linear predictor $\eta = \theta^T x$

  1. Logistic regression = Bernoulli + logit link
  2. Linear regression = Gaussian + identity link

Perceptron
  1. Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$

SVM (Support Vector Machines)

  1. Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1$
  2. Soft-margin primal: $\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1-\xi_i, \ \xi_i \geq 0$
  3. Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)}) \quad \text{s.t. } 0 \leq \alpha_i \leq C, \ \sum_i \alpha_i y^{(i)} = 0$
  4. Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
  5. Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$

Naive Bayes

  1. Posterior:

Decision Trees

  1. Information gain: $IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$
  2. Gini index: $G(S) = 1-\sum_c p_c^2$

Ensembles
  1. Bagging prediction: $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
  2. AdaBoost weight update (labels in $\{-1,+1\}$): $w_i^{(t+1)} \propto w_i^{(t)} \exp\big(-\alpha_t \, y^{(i)} h_t(x^{(i)})\big)$, which after normalization equals $w_i^{(t)} \exp\big(2\alpha_t \mathbb{1}\{h_t(x^{(i)}) \neq y^{(i)}\}\big)$
  3. AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  4. Gradient boosting residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
  5. Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$

Regularization
  1. Ridge regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|_2^2$
  2. Lasso regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
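
To make the two penalized objectives concrete, here is a small sketch that evaluates both costs and minimizes the ridge cost by gradient descent. It is illustrative only: the helper names, toy data, learning rate, and step count are assumptions, and the penalty is applied to every component of $\theta$ (intercept included), exactly as the formulas above are written.

```python
def predict(theta, x):
    """theta^T x for one example."""
    return sum(t * v for t, v in zip(theta, x))

def squared_error(theta, X, y):
    return sum((y_i - predict(theta, x_i)) ** 2 for x_i, y_i in zip(X, y))

def ridge_cost(theta, X, y, lam):
    """(1/2m) * squared error + (lam/2m) * ||theta||_2^2."""
    m = len(y)
    return squared_error(theta, X, y) / (2 * m) + lam / (2 * m) * sum(t * t for t in theta)

def lasso_cost(theta, X, y, lam):
    """(1/2m) * squared error + (lam/m) * ||theta||_1."""
    m = len(y)
    return squared_error(theta, X, y) / (2 * m) + lam / m * sum(abs(t) for t in theta)

def ridge_gradient_descent(X, y, lam, lr=0.1, steps=2000):
    """Minimize ridge_cost with batch gradient descent."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        grad = [lam / m * t for t in theta]      # gradient of the L2 penalty
        for x_i, y_i in zip(X, y):
            err = predict(theta, x_i) - y_i      # gradient of the squared-error term
            for j in range(n):
                grad[j] += err * x_i[j] / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data generated from y = 1 + 2x (first column is the intercept feature).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta_ols = [1.0, 2.0]                           # exact unregularized fit for this data
theta_ridge = ridge_gradient_descent(X, y, lam=1.0)
```

On this data the ridge solution is pulled toward zero relative to `theta_ols`, and it agrees with the closed form $(X^TX + \lambda I)^{-1}X^Ty$, which here works out to $(36/39, 74/39)$.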

Gaussian Discriminant Analysis

Class-conditional density: $p(x|y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$

  1. LDA discriminant function (the decision boundary between classes $k$ and $l$ is where $\delta_k(x) = \delta_l(x)$): $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$
  2. QDA discriminant function: $\delta_k(x) = - \tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$

Model Evaluation
  1. Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$
  2. Precision: $\frac{TP}{TP+FP}$
  3. Recall: $\frac{TP}{TP+FN}$
  4. F1-score: $F1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
  5. ROC AUC: $\int_0^1 TPR(FPR^{-1}(x)) \, dx$

Tricks & Notes

  1. Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
  2. Stochastic Gradient Descent:
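
As a hedged illustration of the Newton-Raphson update in this list, the sketch below applies it to logistic regression with an intercept and a single feature, so that $H$ is 2x2 and can be inverted by hand. The gradient $\sum_i (y^{(i)} - p_i)x^{(i)}$ and Hessian $-\sum_i p_i(1-p_i)x^{(i)}{x^{(i)}}^T$ are the standard logistic-regression expressions; the function names, toy data, and iteration count are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, xs, ys):
    """Gradient of the log-likelihood w.r.t. (intercept, slope)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta[0] + theta[1] * x)
        g0 += y - p
        g1 += (y - p) * x
    return g0, g1

def newton_step(theta, xs, ys):
    """theta := theta - H^{-1} grad, with H the 2x2 Hessian of the log-likelihood."""
    g0, g1 = gradient(theta, xs, ys)
    h00 = h01 = h11 = 0.0
    for x in xs:
        p = sigmoid(theta[0] + theta[1] * x)
        w = p * (1.0 - p)
        h00 -= w            # Hessian entries carry a minus sign
        h01 -= w * x
        h11 -= w * x * x
    det = h00 * h11 - h01 * h01
    # Explicit inverse of a symmetric 2x2 matrix applied to the gradient.
    d0 = (h11 * g0 - h01 * g1) / det
    d1 = (h00 * g1 - h01 * g0) / det
    return [theta[0] - d0, theta[1] - d1]

# Made-up, non-separable toy data.
xs = [0.5, 1.5, 2.0, 2.5, 3.0, 4.0]
ys = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
for _ in range(15):
    theta = newton_step(theta, xs, ys)
```

Newton's method typically needs only a handful of iterations here, at the price of forming and inverting the Hessian on every step.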

2. Linear Regression

  • Model: $h_\theta(x) = \theta^T x$
  • Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
  • Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

3. Logistic Regression (Classification)

  • Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
  • Model: $h_\theta(x) = g(\theta^T x)$
  • Negative log-likelihood (cross-entropy) cost: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$

4. Generalized Linear Models (GLM)

  • Predict $E[y|x]$ using link function $g$: $g(E[y|x]) = \theta^T x$

5. Perceptron Algorithm

  • Initialize $\theta = 0$.
  • For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update: $\theta := \theta + y^{(i)} x^{(i)}$
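
The two perceptron steps above translate almost line for line into code. This is a minimal sketch under stated assumptions: labels are in {-1, +1}, a constant 1 feature stands in for the bias, and the `max_epochs` cap, function name, and toy data are illustrative additions.

```python
def train_perceptron(X, y, max_epochs=100):
    """Perceptron: start at theta = 0 and update on every mistake."""
    theta = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            margin = y_i * sum(t * v for t, v in zip(theta, x_i))
            if margin <= 0:                       # misclassified or on the boundary
                theta = [t + y_i * v for t, v in zip(theta, x_i)]
                mistakes += 1
        if mistakes == 0:                         # a full clean pass: stop
            break
    return theta

# Linearly separable toy data; the first feature is a constant 1 (bias).
X = [[1.0, 2.0, 1.0], [1.0, 3.0, 2.0], [1.0, -1.0, -1.0], [1.0, -2.0, -3.0]]
y = [1, 1, -1, -1]
theta = train_perceptron(X, y)
margins = [y_i * sum(t * v for t, v in zip(theta, x_i)) for x_i, y_i in zip(X, y)]
```

On separable data like this the loop stops once every margin is strictly positive; on non-separable data it simply runs until `max_epochs`.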

  • Entropy: $H(Y) = - \sum_{c} P(y=c)\log P(y=c)$
  • Information gain: $IG(X_j) = H(Y) - H(Y|X_j)$

9. Bias-Variance Tradeoff

  • Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$

10. Regularization

  • L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|_2^2$
  • L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$

1. Linear Regression (Closed Form MLE)

$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Ty$
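
The closed form can be checked numerically by solving the normal equations $X^TX\theta = X^Ty$ directly. The sketch below uses a small hand-written Gaussian elimination so that it needs no external libraries; the helper names and the toy data are assumptions for illustration, and in practice a linear-algebra library would normally be used instead.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(n)] for a in range(n)]
    Xty = [sum(row[a] * y_i for row, y_i in zip(X, y)) for a in range(n)]
    return solve(XtX, Xty)

# Noisy toy data around y = 1 + 2x (first column is the intercept feature).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
theta = normal_equation(X, y)

# At the optimum the residuals are orthogonal to every column of X.
residuals = [y_i - sum(t * v for t, v in zip(theta, x_i)) for x_i, y_i in zip(X, y)]
orthogonality = [sum(r * x_i[j] for r, x_i in zip(residuals, X)) for j in range(2)]
```

Solving the linear system rather than forming $(X^TX)^{-1}$ explicitly gives the same $\hat{\theta}$ and is usually the better-conditioned choice.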

2. Logistic Regression Log-Likelihood

$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)$

Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}$

Hessian: $H = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^T$

3. SVM Primal (Soft Margin)

$\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1-\xi_i, \ \xi_i \geq 0$

Information gain: $IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where entropy is: $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
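
A short worked example helps fix the two definitions above. The sketch assumes base-2 logarithms (entropy in bits), which the formula leaves unspecified, and the attribute names and labels are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the class frequencies in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v) for one attribute A."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for a, y in zip(values, labels) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]   # splits the classes perfectly
coin = ["heads", "tails", "heads", "tails"]    # carries no information about the label

ig_outlook = information_gain(outlook, labels)
ig_coin = information_gain(coin, labels)
```

With two balanced classes $H(S)$ is 1 bit, so the perfectly separating attribute gains the full 1 bit and the uninformative one gains 0; a decision tree would split on `outlook`.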

7. Bias-Variance Decomposition

$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$

8. Regularized Logistic Regression

$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$

9. Gaussian Discriminant Analysis (Class Posterior)

$p(y=k|x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$

10. Gradient Boosting Update Rule

Residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$

Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
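
The gradient boosting update rule can be sketched for one concrete case. The code below assumes squared-error loss $L = \frac{1}{2}(y - F)^2$, for which the pseudo-residual reduces to $y^{(i)} - F_{m-1}(x^{(i)})$, and uses one-dimensional regression stumps as the weak learners $h_m$. The loss, the stump learner, the function names, the data, and the hyperparameters are all illustrative assumptions, not choices taken from the sheet.

```python
def fit_stump(xs, targets):
    """Best single-threshold regression stump under squared error (needs >= 2 distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, with h_m fitted to the pseudo-residuals."""
    f0 = sum(ys) / len(ys)                        # F_0: constant prediction
    learners = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # -dL/dF for squared error
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + nu * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + nu * sum(h(x) for h in learners)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Made-up step-shaped toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

mean_y = sum(ys) / len(ys)
baseline_mse = mse(lambda x: mean_y, xs, ys)
boosted = gradient_boost(xs, ys)
boosted_mse = mse(boosted, xs, ys)
```

The shrinkage factor `nu` trades training speed for smoother fitting: smaller values need more rounds to reach the same training error.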