CS 229 Supervised Learning Cheatsheet ML Notes 2025-2026, Exams of Computer Science

Comprehensive CS 229 supervised learning cheatsheet featuring structured machine learning notes and formulas. Covers linear regression, logistic regression, gradient descent, regularization, bias-variance tradeoff, support vector machines, kernel methods, probabilistic models, evaluation metrics, and optimization techniques. Designed for students preparing for machine learning exams, coursework, and technical interviews in 2025–2026.

Typology: Exams

2025/2026

Available from 06/11/2026

JoeWinterfell

CS 229 Supervised Learning Cheatsheet Updated 2025–2026 | Machine Learning Study Notes & Key Formulas

Linear Regression

  1. Hypothesis: $h_\theta(x) = \theta^T x$
  2. MSE Loss: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$
  3. Normal Equation (MLE): $\hat{\theta} = (X^T X)^{-1} X^T y$
  4. Regularized Normal Equation: $\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$
  5. Probabilistic model (Gaussian noise):

$p(y \mid x; \theta) = \mathcal{N}(y \mid \theta^T x, \sigma^2)$

Logistic Regression

  1. Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
  2. Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
  3. Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)$
  4. Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big) x^{(i)}$
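The sigmoid, hypothesis, and log-likelihood in this list can be tied together with a small fitting loop. The following is a minimal sketch, assuming batch gradient ascent with the standard gradient $\sum_i (y^{(i)} - h_\theta(x^{(i)})) x^{(i)}$; the toy data, the step size `lr`, the step count, and all function names are illustrative assumptions rather than part of the notes.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta, x):
    """Hypothesis h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def log_likelihood(theta, X, y):
    """Sum over examples of y*log(h) + (1 - y)*log(1 - h)."""
    total = 0.0
    for x, yi in zip(X, y):
        h = predict(theta, x)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return total


def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - h_theta(x_i)) * x_i."""
    g = [0.0] * len(theta)
    for x, yi in zip(X, y):
        err = yi - predict(theta, x)
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g


def fit(X, y, lr=0.05, steps=2000):
    """Batch gradient ascent; lr and steps are hypothetical settings for this toy data."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t + lr * gj for t, gj in zip(theta, g)]
    return theta


# Toy data: the first feature is the constant 1 (intercept); labels are in {0, 1}.
# The classes overlap on purpose so the maximum-likelihood estimate stays finite.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 1, 0, 1, 1]
theta = fit(X, y)
print("theta:", theta)
print("log-likelihood:", log_likelihood(theta, X, y))
```

Because the log-likelihood is concave, a sufficiently small step size increases it at every iteration; on linearly separable data the parameters would grow without bound, which is one motivation for regularization.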

Generalized Linear Models (GLM)

Link function: $g(\mu) = \eta, \quad \mu = \mathbb{E}[y \mid x]$, where $\eta = \theta^T x$ is the linear predictor.

  1. Logistic regression = Bernoulli + logit link
  2. Linear regression = Gaussian + identity link

Perceptron

  3. Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$

SVM (Support Vector Machines)
  4. Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1$
  5. Soft-margin primal:

$\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1-\xi_i, \quad \xi_i \geq 0$
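For fixed $(\theta, b)$, the smallest feasible slack in the soft-margin primal is $\xi_i = \max(0, 1 - y^{(i)}(\theta^T x^{(i)} + b))$, the hinge loss. The sketch below evaluates the objective this way; it does not solve the optimization problem, and the function name, data, and parameter values are illustrative assumptions.

```python
def soft_margin_objective(theta, b, X, y, C):
    """Return (objective, slacks) for 0.5*||theta||^2 + C * sum_i xi_i.

    Each slack is set to its smallest feasible value,
    xi_i = max(0, 1 - y_i * (theta^T x_i + b)), with labels y_i in {-1, +1}.
    """
    norm_sq = sum(t * t for t in theta)
    slacks = []
    for x, yi in zip(X, y):
        margin = yi * (sum(t * xj for t, xj in zip(theta, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    return 0.5 * norm_sq + C * sum(slacks), slacks


# Hypothetical data: the third point lies on the wrong side of the boundary.
X = [[2.0, 0.0], [0.0, 0.0], [1.5, 0.0]]
y = [1, -1, -1]
value, slacks = soft_margin_objective([2.0, 0.0], -2.0, X, y, C=1.0)
print(value, slacks)
```

Points with margin at least 1 contribute zero slack, so only margin violations are charged, each in proportion to `C`.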

  1. Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)})$
  2. Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
  3. Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$

Naive Bayes
  4. Posterior:

Information gain: $IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$

  1. Gini index: $G(S) = 1 - \sum_c p_c^2$

Ensembles

  2. Bagging prediction: $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
  3. AdaBoost weight update (labels and predictions in $\{-1,+1\}$): $w_i^{(t+1)} \propto w_i^{(t)} \exp\big(-\alpha_t y^{(i)} h_t(x^{(i)})\big)$, normalized so the weights sum to 1
  4. AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, where $\epsilon_t$ is the weighted error of $h_t$
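One round of AdaBoost reweighting can be sketched as follows. This assumes labels and predictions in $\{-1, +1\}$ and the common multiplicative form $w_i \exp(-\alpha_t y^{(i)} h_t(x^{(i)}))$ followed by normalization, paired with $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$; the function name and the toy predictions are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Returns (alpha, new_weights). Assumes the weights sum to 1 and that the
    weighted error is strictly between 0 and 1.
    """
    eps = sum(w for w, yt, yp in zip(weights, y_true, y_pred) if yt != yp)
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified points (y * h = -1) are scaled by e^{alpha},
    # correctly classified points (y * h = +1) by e^{-alpha}.
    unnorm = [w * math.exp(-alpha * yt * yp)
              for w, yt, yp in zip(weights, y_true, y_pred)]
    z = sum(unnorm)
    return alpha, [u / z for u in unnorm]


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # only the last example is misclassified
alpha, new_w = adaboost_round(weights, y_true, y_pred)
print(alpha, new_w)
```

With this pairing of update and $\alpha_t$, the misclassified examples carry exactly half of the total weight after normalization, so the next weak learner cannot do well by repeating the previous one.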
  1. Gradient boosting residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
  2. Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$

Regularization

  3. Ridge regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|^2$
  4. Lasso regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
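Setting the gradient of the ridge cost to zero gives $(X^T X + \lambda I)\theta = X^T y$, which is the regularized normal equation from the Linear Regression list. The sketch below solves that system with a small Gaussian elimination routine and checks that the gradient of $J$ vanishes at the solution. The helper names and data are assumptions, and, as in the formula, the intercept is penalized along with the other coefficients.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """theta = (X^T X + lam * I)^{-1} X^T y, computed by solving the linear system."""
    n = len(X[0])
    A = [[sum(row[j] * row[k] for row in X) + (lam if j == k else 0.0)
          for k in range(n)] for j in range(n)]
    b = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(n)]
    return solve(A, b)


def ridge_gradient(theta, X, y, lam):
    """Gradient of J = 1/(2m) * sum (y - theta^T x)^2 + lam/(2m) * ||theta||^2."""
    m = len(X)
    g = [lam * t / m for t in theta]
    for row, yi in zip(X, y):
        err = sum(t * xj for t, xj in zip(theta, row)) - yi
        for j, xj in enumerate(row):
            g[j] += err * xj / m
    return g


# Hypothetical data generated from y = 1 + 2x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta_ols = ridge_fit(X, y, lam=0.0)    # plain normal equation
theta_ridge = ridge_fit(X, y, lam=1.0)  # coefficients shrunk toward zero
print(theta_ols, theta_ridge)
```

Solving the linear system directly is generally preferable to forming the matrix inverse explicitly. Lasso has no comparable closed form because the $\ell_1$ penalty is not differentiable at zero.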

$p(x \mid y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$

  1. LDA discriminant function: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$ (predict the class with the largest $\delta_k(x)$; the boundary between classes $k$ and $l$ is where $\delta_k(x) = \delta_l(x)$)
  2. QDA discriminant function: $\delta_k(x) = - \tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$

Model Evaluation

  3. Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$
  1. Precision: $\frac{TP}{TP+FP}$
  2. Recall: $\frac{TP}{TP+FN}$
  3. F1-score: $F1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
  4. ROC AUC = $\int_0^1 TPR(FPR^{-1}(x))\,dx$

Tricks & Notes

  5. Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
  6. Stochastic Gradient Descent:

◆ 2. Linear Regression

  • Model: $h_\theta(x) = \theta^T x$
  • Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
  • Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

◆ 3. Logistic Regression (Classification)

  • Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
  • Model: $h_\theta(x) = g(\theta^T x)$
  • Negative log-likelihood (cross-entropy) cost: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$

◆ 4. Generalized Linear Models (GLM)

  • Predict $E[y \mid x]$ using link function $g$: $g(E[y \mid x]) = \theta^T x$

◆ 5. Perceptron Algorithm

  • Initialize $\theta = 0$.
  • For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update $\theta := \theta + y^{(i)} x^{(i)}$
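The perceptron steps above translate almost line for line into code. This is a minimal sketch, assuming labels in $\{-1, +1\}$, a constant first feature standing in for the bias term, and a hypothetical `max_epochs` cap so the loop also terminates on data that is not linearly separable.

```python
def perceptron(X, y, max_epochs=100):
    """Apply theta := theta + y_i * x_i whenever y_i * (theta^T x_i) <= 0."""
    theta = [0.0] * len(X[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, yi in zip(X, y):
            if yi * sum(t * xj for t, xj in zip(theta, x)) <= 0:
                theta = [t + yi * xj for t, xj in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: every point is classified correctly
            break
    return theta


# Hypothetical linearly separable data; the leading 1.0 is the bias feature.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
theta = perceptron(X, y)
print(theta)
```

On separable data the loop stops after finitely many updates; otherwise it runs until `max_epochs` and returns the last iterate.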

  • Entropy: $H(Y) = - \sum_{c} P(y=c)\log P(y=c)$
  • Information gain: $IG(X_j) = H(Y) - H(Y \mid X_j)$

◆ 9. Bias-Variance Tradeoff

  • Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$

◆ 10. Regularization

  • L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|^2$
  • L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$

1. Linear Regression (Closed Form MLE)

$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^T X)^{-1} X^T y$
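The closed form and the gradient descent update for linear regression minimize the same squared-error cost, so they should agree. The sketch below checks this for a single feature plus intercept, where $(X^T X)^{-1} X^T y$ can be written out with an explicit 2x2 inverse. The data, learning rate, step count, and function names are illustrative assumptions.

```python
def normal_equation_2d(xs, ys):
    """Closed-form least squares for h(x) = t0 + t1 * x.

    With design rows [1, x], X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy];
    the solution applies the explicit inverse of that 2x2 matrix.
    """
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx
    t0 = (sxx * sy - sx * sxy) / det
    t1 = (m * sxy - sx * sy) / det
    return t0, t1


def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Batch update theta_j := theta_j - alpha * (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(steps):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1


# Hypothetical noisy samples of roughly y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
closed = normal_equation_2d(xs, ys)
iterative = gradient_descent(xs, ys)
print(closed, iterative)
```

The closed form costs one linear solve, while gradient descent needs a step size small enough for the feature scale; for large feature counts the iterative route is usually the cheaper one.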

2. Logistic Regression Log-Likelihood

$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^T x^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^T x^{(i)})) \Big)$

Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^T x^{(i)}) \big) x^{(i)}$

Hessian: $H = - \sum_{i=1}^m \sigma(\theta^T x^{(i)}) \big(1 - \sigma(\theta^T x^{(i)})\big) x^{(i)} {x^{(i)}}^T$

3. SVM Primal (Soft Margin)

$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$

where entropy is: $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
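The two formulas above can be checked on a tiny example. This sketch assumes base-2 logarithms (entropy in bits), since the notes do not fix the base; the attribute names and data are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), measured in bits."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(values, labels):
    """IG(S, A) = H(S) - sum_v |S_v| / |S| * H(S_v) for one categorical attribute."""
    n = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
outlook = ["sun", "sun", "rain", "rain"]  # determines the label exactly
windy = ["t", "f", "t", "f"]              # independent of the label

print(information_gain(outlook, labels))
print(information_gain(windy, labels))
```

An attribute that splits the data into pure subsets recovers the full entropy of the labels as gain, while an attribute unrelated to the labels gains nothing; a decision tree picks the split with the largest gain.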

7. Bias-Variance Decomposition

$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$

8. Regularized Logistic Regression

$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$

9. Gaussian Discriminant Analysis (Class Posterior)

$p(y=k \mid x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$

10. Gradient Boosting Update Rule

Residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$

Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
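For squared loss $L(y, F) = \frac{1}{2}(y - F)^2$, the negative gradient in the residual formula reduces to $y^{(i)} - F_{m-1}(x^{(i)})$, so each round fits a weak learner to the current errors. The sketch below assumes that loss, 1-D inputs, and a hand-written regression stump as the weak learner $h_m$; the helper names, data, number of rounds, and the value of $\nu$ are all illustrative choices.

```python
def fit_stump(xs, rs):
    """Least-squares regression stump on 1-D inputs (needs at least two distinct x values)."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, rs) if x <= thr]
        right = [r for x, r in zip(xs, rs) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def gradient_boost(xs, ys, rounds=50, nu=0.1):
    """Build F_m = F_{m-1} + nu * h_m, with h_m fit to the residuals y - F_{m-1}(x)."""
    f0 = sum(ys) / len(ys)  # constant initial model F_0
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of squared loss
        h = fit_stump(xs, residuals)
        stumps.append(h)
        preds = [p + nu * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + nu * sum(h(x) for h in stumps)


def mse(model, xs, ys):
    """Mean squared error of a callable model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical step-shaped data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
print(mse(lambda x: mean_y, xs, ys), mse(model, xs, ys))
```

The shrinkage factor $\nu$ trades training speed for smoothness: smaller values need more rounds but take more cautious steps. Other losses change only the line that computes `residuals`.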