CS 229 Supervised Learning Cheatsheet Stanford Machine Learning 2025-2026, Exams of Machine Learning

This CS 229 Supervised Learning cheatsheet provides a concise reference to the most important machine learning formulas and concepts, including linear regression, logistic regression, generalized linear models (GLMs), perceptrons, support vector machines (SVMs), maximum likelihood estimation (MLE), MAP estimation, gradients, Hessians, and optimization techniques. Ideal for Stanford CS 229 students, machine learning practitioners, and data science professionals preparing for exams, interviews, and coursework in 2025–2026.

Typology: Exams

2025/2026

Available from 06/12/2026

loreen-qui
CS 229 Supervised Learning Complete Study Guide 2025–2026 | Machine Learning Cheatsheet Covering Linear Regression, Logistic Regression, GLMs, Perceptron, SVMs, MAP Estimation, Optimization, and Core Formulas

Linear Regression

  1. Hypothesis: $h_\theta(x) = \theta^T x$
  2. MSE Loss: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$
  3. Normal Equation (MLE): $\hat{\theta} = (X^TX)^{-1}X^Ty$
  4. Regularized Normal Equation: $\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$
  5. Probabilistic model (Gaussian noise):

$p(y \mid x; \theta) = \mathcal{N}(y \mid \theta^T x, \sigma^2)$

Logistic Regression

  1. Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
  2. Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
  3. Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)$
  4. Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big) x^{(i)}$
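
The log-likelihood and its gradient can be checked numerically with a short NumPy sketch. This is only an illustration under stated assumptions: the toy data, the learning rate `lr`, the iteration count `n_iters`, and the function names are all made up for the example, and plain batch gradient ascent is just one of several ways to maximize $\ell(\theta)$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [y_i log h_i + (1 - y_i) log(1 - h_i)]."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # guard against log(0); an implementation detail, not part of the formula
    return np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

def gradient(theta, X, y):
    """grad ell(theta) = sum_i (y_i - h_i) x_i, written as X^T (y - h)."""
    return X.T @ (y - sigmoid(X @ theta))

def fit_logistic(X, y, lr=0.1, n_iters=500):
    """Batch gradient ascent on the log-likelihood (hypothetical hyperparameters)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta += lr * gradient(theta, X, y) / len(y)  # averaged gradient step
    return theta

# Toy data (assumed): intercept column plus one feature, labels in {0, 1},
# deliberately not linearly separable so the maximizer is finite.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.5],
              [1.0, 1.0], [1.0, -0.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
theta = fit_logistic(X, y)
```

Because the log-likelihood is concave, a sufficiently small step size increases it at every iteration; dividing the gradient by the number of examples only rescales the step.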

Generalized Linear Models (GLMs)

$g(\mu) = \eta, \quad \mu = \mathbb{E}[y \mid x]$, where $g$ is the link function and $\eta = \theta^T x$ is the linear predictor

  1. Logistic regression = Bernoulli + logit link
  2. Linear regression = Gaussian + identity link

Perceptron
  1. Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$

SVM (Support Vector Machines)

  1. Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1$
  2. Soft-margin primal: $\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1-\xi_i, \ \xi_i \geq 0$
  3. Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)}) \quad \text{s.t. } 0 \leq \alpha_i \leq C, \ \sum_i \alpha_i y^{(i)} = 0$
  4. Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
  5. Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$

Naive Bayes
  1. Posterior:

Decision Trees

  1. Information gain: $IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$
  2. Gini index: $G(S) = 1 - \sum_c p_c^2$

Ensembles
  1. Bagging prediction: $\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
  2. AdaBoost weight update: $w_i^{(t+1)} = w_i^{(t)} \exp\big(2\alpha_t \, 1\{h_t(x^{(i)}) \neq y^{(i)}\}\big)$, followed by renormalizing the weights to sum to 1 (equivalent to $w_i^{(t)}\exp\big(-\alpha_t y^{(i)} h_t(x^{(i)})\big)$ for labels in $\{-1,+1\}$)
  3. AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
  4. Gradient boosting residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
  5. Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$

Regularization
  1. Ridge regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|^2$
  2. Lasso regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
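
Setting the gradient of the ridge objective to zero gives $(X^TX + \lambda I)\theta = X^Ty$, the regularized normal equation. The NumPy sketch below solves that system on synthetic data; the data, the value `lam=1.0`, and the function names are assumptions made for illustration, and, like the formula above, it penalizes every coefficient (including any intercept).

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y with a linear solve instead of an explicit inverse."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * ||y - X theta||^2 + (lam/2m) * ||theta||^2."""
    m = len(y)
    residual = y - X @ theta
    return (residual @ residual) / (2 * m) + lam / (2 * m) * (theta @ theta)

# Synthetic data (assumed): 50 examples, 3 features, small Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

theta_ridge = ridge_fit(X, y, lam=1.0)  # shrunk toward zero
theta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 recovers the ordinary normal equation
```

The lasso objective has no comparable closed form in general, which is why it is usually minimized iteratively.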

Gaussian Discriminant Analysis

  1. Class-conditional density: $p(x \mid y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$
  2. LDA discriminant function: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$
  3. QDA discriminant function: $\delta_k(x) = - \tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$

Predict $\arg\max_k \delta_k(x)$; the decision boundary between classes $k$ and $l$ is the set where $\delta_k(x) = \delta_l(x)$.

Model Evaluation
  1. Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$
  2. Precision: $\frac{TP}{TP+FP}$
  3. Recall: $\frac{TP}{TP+FN}$
  4. F1-score: $F1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
  5. ROC AUC: $\int_0^1 TPR(FPR^{-1}(x))\, dx$

Tricks & Notes

  1. Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
  2. Stochastic Gradient Descent:
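
As a hedged illustration of stochastic gradient descent, the sketch below applies one update per training example to the squared-error loss of linear regression, $\theta := \theta - \alpha\,(\theta^T x^{(i)} - y^{(i)})\,x^{(i)}$. The choice of loss, the step size `lr`, the epoch count, the shuffling scheme, and the toy data are all assumptions for the example rather than anything fixed by the cheatsheet.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, n_epochs=500, seed=0):
    """SGD on the per-example loss 0.5 * (theta^T x_i - y_i)^2 (hypothetical hyperparameters)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):   # visit examples in a fresh random order each epoch
            error = X[i] @ theta - y[i]     # prediction error on one example
            theta -= lr * error * X[i]      # gradient of the per-example loss is error * x_i
    return theta

# Toy data (assumed): a noiseless line y = 1 + 2x with an intercept column
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd_linear_regression(X, y)
```

On noiseless data like this the iterates approach the exact coefficients; with noisy data a fixed step size leaves the iterates fluctuating around the minimizer, which is why the step size is often decayed in practice.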

◆ 2. Linear Regression

  Model: $h_\theta(x) = \theta^T x$
  Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
  Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

◆ 3. Logistic Regression (Classification)

  Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
  Model: $h_\theta(x) = g(\theta^T x)$
  Negative log-likelihood cost: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$

◆ 4. Generalized Linear Models (GLM)

  Predict $E[y \mid x]$ using link function $g$: $g(E[y \mid x]) = \theta^T x$

◆ 5. Perceptron Algorithm

  Initialize $\theta = 0$.
  For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update $\theta := \theta + y^{(i)} x^{(i)}$
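
The perceptron steps above can be sketched in a few lines of NumPy. The sketch assumes labels in $\{-1, +1\}$ (which the mistake condition $y^{(i)}\theta^T x^{(i)} \leq 0$ implies), folds the bias into a constant first feature, and uses a made-up, linearly separable toy set; `max_epochs` is a hypothetical safety cap, not part of the algorithm as stated.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: start at theta = 0 and add y_i * x_i whenever example i is misclassified."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (theta @ x_i) <= 0:   # wrong side of (or exactly on) the boundary
                theta += y_i * x_i
                mistakes += 1
        if mistakes == 0:                  # a full clean pass means every example is separated
            break
    return theta

# Toy data (assumed): first column is a constant 1 acting as the bias feature
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
theta = perceptron(X, y)
```

On linearly separable data the loop stops after finitely many updates; on non-separable data it never reaches a clean pass, so the epoch cap is what ends it.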

  Entropy: $H(Y) = - \sum_{c} P(y=c)\log P(y=c)$
  Information gain: $IG(X_j) = H(Y) - H(Y \mid X_j)$

◆ 9. Bias-Variance Tradeoff

  Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$

◆ 10. Regularization

  L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|^2$
  L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$

1. Linear Regression (Closed Form MLE)

$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Ty$
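
A minimal NumPy sketch of the closed-form solution follows. The factor $\frac{1}{2\sigma^2}$ does not change the minimizer, so the code only needs $X$ and $y$. The toy data and the function name are assumptions for illustration, and the system is solved with `np.linalg.solve` rather than by forming $(X^TX)^{-1}$ explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

def normal_equation(X, y):
    """theta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data (assumed): points lying exactly on y = 1 + 2x, with an intercept column
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta_hat = normal_equation(X, y)  # intercept and slope of the least-squares line
```

This requires $X^TX$ to be invertible, i.e. the columns of $X$ must be linearly independent.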

2. Logistic Regression Log-Likelihood

$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)$

Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}$

Hessian: $H = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^T$

3. SVM Primal (Soft Margin)

Information gain: $IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where entropy is $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
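
The two formulas above translate directly into code. The sketch below is illustrative only: it assumes base-2 logarithms (so entropy is measured in bits; the formula itself leaves the base open), and the label and attribute arrays are invented for the example.

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the classes present in labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """IG(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), where S_v holds the rows with A = v."""
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    gain = entropy(labels)
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data (assumed): one attribute separates the classes perfectly, the other not at all
labels = np.array([1, 1, 0, 0])
perfect = np.array(["a", "a", "b", "b"])
useless = np.array(["a", "b", "a", "b"])

gain_perfect = information_gain(labels, perfect)
gain_useless = information_gain(labels, useless)
```

A split that yields pure subsets recovers the full entropy of the parent set, while a split whose subsets have the same class mix as the parent gains nothing.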

7. Bias-Variance Decomposition

$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$

8. Regularized Logistic Regression

$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$

9. Gaussian Discriminant Analysis (Class Posterior)

$p(y=k \mid x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$

10. Gradient Boosting Update Rule

Residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$

Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
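
To make the gradient boosting update concrete, the sketch below assumes the squared-error loss $L(y, F) = \frac{1}{2}(y - F)^2$, for which the negative gradient is simply the ordinary residual $y - F$. The weak learner $h_m$ is taken to be a one-split regression stump on a single feature; the loss, the stump, the helper names, the number of rounds, and the toy data are all assumptions for illustration.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump on a 1-D feature: (threshold, left_value, right_value)."""
    best = None
    for t in np.unique(x)[:-1]:                      # candidate splits: x <= t versus x > t
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    """F_m = F_{m-1} + nu * h_m, with each h_m fit to the current residuals."""
    f0 = y.mean()                                    # constant initial model F_0
    F = np.full(len(y), f0, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - F                            # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residuals)
        F = F + nu * predict_stump(stump, x)         # shrunken update with learning rate nu
        stumps.append(stump)
    return f0, stumps, F

# Toy data (assumed): one period of a sine wave sampled at 20 points
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
f0, stumps, F = gradient_boost(x, y)
```

Predictions at new points follow the same recursion: start from `f0` and add `nu * predict_stump(stump, x_new)` for each stored stump. With a different loss, only the `residuals` line changes.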