Prepare for your exams
Get points
Guidelines and tips
Sell on Docsity
Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Docsity AI

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search for your university

Find the specific documents for your university's exams

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

CS 229 Supervised Learning Cheatsheet ML Notes 2025-2026, Exams of Computer Science

University of Michigan (UM) - Ann Arbor Computer Science

Comprehensive CS 229 supervised learning cheatsheet featuring structured machine learning notes and formulas. Covers linear regression, logistic regression, gradient descent, regularization, bias-variance tradeoff, support vector machines, kernel methods, probabilistic models, evaluation metrics, and optimization techniques. Designed for students preparing for machine learning exams, coursework, and technical interviews in 2025–2026.

Typology: Exams

2025/2026

Available from 06/11/2026

JoeWinterfell 🇺🇸

121 documents

1 / 20

This page cannot be seen from the preview

Don't miss anything!

1 | P a g e

CS 229 Supervised Learning Cheatsheet Updated 2025–

2026 | Machine Learning Study Notes & Key Formulas

Linear Regression

1. Hypothesis:

hθ(x)=θTxh_\theta(x) = \theta^T xhθ(x)=θTx

2. MSE Loss:

J(θ)=12m∑i=1m(y(i)−θTx(i))2J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T

x^{(i)})^2J(θ)=2m1i=1∑m(y(i)−θTx(i))2

3. Normal Equation (MLE):

θ^=(XTX)−1XTy\hat{\theta} = (X^TX)^{-1}X^Tyθ^=(XTX)−1XTy

4. Regularized Normal Equation:

θ^=(XTX+λI)−1XTy\hat{\theta} = (X^TX + \lambda I)^{-1}X^Tyθ^=(XTX+λI)−1XTy

5. Probabilistic model (Gaussian noise):

Discover Exams of Computer Science University of Michigan (UM) - Ann Arbor

Partial preview of the text

Download CS 229 Supervised Learning Cheatsheet ML Notes 2025-2026 and more Exams Computer Science in PDF only on Docsity!

CS 229 Supervised Learning Cheatsheet Updated 2025–

2026 | Machine Learning Study Notes & Key Formulas

Linear Regression

Hypothesis: h θ (x)= θ Txh_\theta(x) = \theta^T xh θ (x)= θ Tx
MSE Loss: J( θ )=12m∑i=1m(y(i)− θ Tx(i))2J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2J( θ )=2m1i=1∑m(y(i)− θ Tx(i))
Normal Equation (MLE): θ ^=(XTX)−1XTy\hat{\theta} = (X^TX)^{-1}X^Ty θ ^=(XTX)−1XTy
Regularized Normal Equation: θ ^=(XTX+ λ I)−1XTy\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty θ ^=(XTX+ λ I)−1XTy
Probabilistic model (Gaussian noise):

p(y∣x; θ )=N(y∣θTx, σ 2)p(y|x;\theta) = \mathcal{N}(y|\theta^Tx,\sigma^2)p(y∣x; θ )=N(y∣θTx, σ 2) Logistic Regression

Sigmoid: σ (z)=11+e−z\sigma(z) = \frac{1}{1+e^{-z}} σ (z)=1+e−z
Hypothesis: h θ (x)= σ ( θ Tx)h_\theta(x) = \sigma(\theta^T x)h θ (x)= σ ( θ Tx)
Log-likelihood: 𝓁( θ )=∑i= 1 m(y(i)log h θ (x(i))+( 1 −y(i))log ( 1 −h θ (x(i))))\ell(\theta) = \sum_{i= 1 }^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)𝓁( θ )=i=1∑m (y(i)logh θ (x(i))+(1−y(i))log(1−h θ (x(i))))
Gradient:

g( μ )= η , μ =E[y∣x]g(\mu) = \eta, \quad \mu = \mathbb{E}[y|x]g( μ )= η , μ =E[y∣x]

Logistic regression = Bernoulli + logit link
Linear regression = Gaussian + identity link Perceptron
Update rule: θ := θ +y(i)x(i)if y(i)( θ Tx(i))≤ 0 \theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0 θ := θ +y(i)x(i)if y(i)( θ Tx(i))≤ 0 SVM (Support Vector Machines)
Hard-margin primal: min θ,b12∥θ∥2s.t. y(i)( θ Tx(i)+b)≥ 1 \min_{\theta,b} \frac{1}{2}|\theta|^2 \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1 θ ,bmin21∥θ∥2s.t. y(i)( θ Tx(i)+b)≥ 1
Soft-margin primal:

min θ,b, ξ 12 ∥θ∥2+C∑ ξ i,y(i)( θ Tx(i)+b)≥1− ξ i\min_{\theta,b,\xi} \frac{1}{2}|\theta|^2 + C\sum \xi_i, \quad y^{(i)}(\theta^Tx^{(i)}+b)\geq 1-\xi_i θ ,b, ξ min21∥θ∥2+C∑ ξ i ,y(i)( θ Tx(i)+b)≥1− ξ i

Dual: max α∑ α i−12∑ α i α jy(i)y(j)K(x(i),x(j))\max_\alpha \sum \alpha_i - \frac{1}{2}\sum \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)}) α max∑ α i−21∑ α i α jy(i)y(j)K(x(i),x(j))
Kernel trick: K(x,z)= ϕ (x)T ϕ (z)K(x,z) = \phi(x)^T \phi(z)K(x,z)= ϕ (x)T ϕ (z)
Decision function: f(x)=sign(∑ α iy(i)K(x(i),x)+b)f(x) = \text{sign}\Big(\sum \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)f(x)=sign(∑ α iy(i)K(x(i),x)+b) Naive Bayes
Posterior:

IG(S,A)=H(S)−∑v∣Sv∣∣S∣H(Sv)IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)IG(S,A)=H(S)−v∑∣S∣∣Sv∣H(Sv)

Gini index: G(S)=1−∑cpc2G(S) = 1 - \sum_c p_c^2G(S)=1−c∑pc Ensembles
Bagging prediction: f^bag(x)=1B∑b=1Bfb(x)\hat{f}{bag}(x) = \frac{1}{B}\sum{b=1}^B f_b(x)f^bag(x)=B b=1∑Bfb(x)
AdaBoost weight update: wi(t+1)=wi(t)exp ( α t1{ht(x(i))≠y(i)})w_i^{(t+1)} = w_i^{(t)} \exp\big(\alpha_t 1 {h_t(x^{(i)}) \neq y^{(i)}}\big)wi(t+1)=wi(t)exp( α t1{ht(x(i)) =y(i)})
AdaBoost α t\alpha_t α t: α t=12ln 1− ϵ t ϵ t\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} α t=21ln ϵ t1− ϵ t

Gradient boosting residuals: rm(i)=−[𝜕L(y(i),F(x(i)))𝜕F(x(i))]F=Fm−1r^{(i)}m = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]{F=F_{m-1}}rm(i)=−[𝜕F(x(i))𝜕L(y(i),F(x(i))) ]F=Fm−
Gradient boosting update: Fm(x)=Fm−1(x)+ ν hm(x)F_m(x) = F_{m-1}(x) + \nu h_m(x)Fm(x)=Fm−1(x)+ ν hm(x) Regularization
Ridge regression: J( θ )=12m∑(y(i)− θ Tx(i))2+ λ 2m∥θ∥2J(\theta) = \frac{1}{2m}\sum (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}|\theta|^2J( θ )=2m1∑(y(i)− θ Tx(i))2+2m λ ∥θ∥ 2
Lasso regression: J( θ )=12m∑(y(i)− θ Tx(i))2+ λ m∥θ∥1J(\theta) = \frac{1}{2m}\sum (y^{(i)} - \theta^T x^{(i)})^

\frac{\lambda}{m}|\theta|_1J( θ )=2m1∑(y(i)− θ Tx(i))2+m λ ∥θ∥ 1

p(x∣y=k)=1(2 π )n/2∣Σ∣1/2exp (−12(x− μ k)T Σ −1(x− μ k))p(x|y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x- \mu_k)\Big)p(x∣y=k)=(2 π )n/2∣Σ∣1/21exp(−21(x− μ k)T Σ −1(x− μ k))

LDA decision boundary: δ k(x)=xT Σ − 1 μ k−12 μ kT Σ − 1 μ k+ln πk\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k δ k(x)=xT Σ − 1 μ k−21 μ kT Σ − 1 μ k+ln π k
QDA decision boundary: δ k(x)=−12ln∣ Σk∣−12(x− μ k)T Σ k−1(x− μ k)+ln πk\delta_k(x) = - \tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k δ k(x)=−21ln∣Σk∣−21(x− μ k )T Σ k−1(x− μ k)+ln π k Model Evaluation
Bias-Variance decomposition: E[(y−f^(x))2]=Bias2[f^(x)]+Var[f^(x)]+ σ 2E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2E[(y−f^(x))2]=Bias2[f^(x)]+Var[f^(x)]+ σ 2

Precision: TPTP+FP\frac{TP}{TP+FP}TP+FPTP
Recall: TPTP+FN\frac{TP}{TP+FN}TP+FNTP
F1-score: F1=2⋅Precision⋅RecallPrecision+RecallF1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}F1=Precision+Recall2⋅Precision⋅Recall
ROC AUC = ∫01TPR(FPR−1(x))dx\int_0^1 TPR(FPR^{-1}(x)) dx∫01TPR(FPR−1(x))dx Tricks & Notes
Newton-Raphson update: θ (t+1)= θ (t)−H−1∇𝓁( θ (t))\theta^{(t+ 1 )} = \theta^{(t)} - H^{- 1 }\nabla \ell(\theta^{(t)}) θ (t+1)= θ (t)−H−1∇𝓁( θ (t))
Stochastic Gradient Descent:

◆ 2. Linear Regression  Model: hθ(x)=θTxh_\theta(x) = \theta^T xhθ(x)=θTx  Cost function (MSE): J(θ)=12m∑i=1m(hθ(x(i))−y(i))2J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2J(θ)=2m1i=1∑m(hθ(x(i))−y(i))  Gradient descent update: θj:=θj−α1m∑i=1m(hθ(x(i))−y(i))xj(i)\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}θj:=θj−αm1i=1∑m(hθ(x(i))−y(i))xj(i) ◆ 3. Logistic Regression (Classification)  Sigmoid function: g(z)=11+e−zg(z) = \frac{1}{1+e^{-z}}g(z)=1+e−z  Model: hθ(x)=g(θTx)h_\theta(x) = g(\theta^T x)hθ(x)=g(θTx)

 Log-likelihood cost: J(θ)=−1m∑i=1m(y(i)log hθ(x(i))+(1−y(i))log (1−hθ(x(i))))J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)J(θ)=−m1i=1∑m (y(i)loghθ(x(i))+(1−y(i))log(1−hθ(x(i)))) ◆ 4. Generalized Linear Models (GLM)  Predict E[y∣x]E[y|x]E[y∣x] using link function ggg: g(E[y∣x])=θTxg(E[y|x]) = \theta^T xg(E[y∣x])=θTx ◆ 5. Perceptron Algorithm  Initialize θ=0\theta = 0θ=0.  For each (x(i),y(i))(x^{(i)}, y^{(i)})(x(i),y(i)): If y(i)(θTx(i))≤0y^{(i)} (\theta^T x^{(i)}) \leq 0y(i)(θTx(i))≤0, update: θ:=θ+y(i)x(i)\theta := \theta + y^{(i)} x^{(i)}θ:=θ+y(i)x(i)

H(Y)=−∑cP(y=c)log P(y=c)H(Y) = - \sum_{c} P(y=c)\log P(y=c)H(Y)=−c∑P(y=c)logP(y=c)  Information gain: IG(Xj)=H(Y)−H(Y∣Xj)IG(X_j) = H(Y) - H(Y|X_j)IG(Xj)=H(Y)−H(Y∣Xj) ◆ 9. Bias-Variance Tradeoff  Expected error decomposition: E[(y−f^(x))2]=Bias2+Variance+σ2E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2E[(y−f^(x))2]=Bias2+Variance+σ ◆ 10. Regularization  L2 (Ridge): J(θ)=Loss+λ∥θ∥2J(\theta) = \text{Loss} + \lambda |\theta|^2J(θ)=Loss+λ∥θ∥ 2  L1 (Lasso): J(θ)=Loss+λ∥θ∥1J(\theta) = \text{Loss} + \lambda |\theta|_1J(θ)=Loss+λ∥θ∥ 1

1. Linear Regression (Closed Form MLE)

θ^MLE=arg min θ12σ2∑i=1m(y(i)−θTx(i))2=(XTX)−1XTy\hat{\theta}{MLE} = \arg\min\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Tyθ^MLE=argθmin 2σ21i=1∑m(y(i)−θTx(i))2=(XTX)−1XTy

2. Logistic Regression Log-Likelihood ℓ(θ)=∑i=1m(y(i)log σ(θTx(i))+(1−y(i))log (1−σ(θTx(i))))\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)ℓ(θ)=i=1∑m (y(i)logσ(θTx(i))+(1−y(i))log(1−σ(θTx(i)))) Gradient: ∇θℓ(θ)=∑i=1m(y(i)−σ(θTx(i)))x(i)\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}∇θℓ(θ)=i=1∑m(y(i)−σ(θTx(i)))x(i) Hessian: H=−∑i=1mσ(θTx(i))(1−σ(θTx(i)))x(i)x(i)TH = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^TH=−i=1∑mσ(θTx(i))(1−σ(θTx(i)))x(i)x(i)T 3. SVM Primal (Soft Margin)

IG(S,A)=H(S)−∑v∈Values(A)∣Sv∣∣S∣H(Sv)IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)IG(S,A)=H(S)−v∈Values(A)∑∣S∣∣Sv∣H(Sv) where entropy is: H(S)=−∑c∈Classespclog pcH(S) = - \sum_{c \in \text{Classes}} p_c \log p_cH(S)=−c∈Classes∑pc logpc

7. Bias-Variance Decomposition E[(y−f^(x))2]=(E[f^(x)]−f(x))2+E[(f^(x)−E[f^(x)])2]+σ2\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2E[(y−f^(x))2]=(E[f^(x)]−f(x))2+E[(f^(x)−E[f^(x)])2]+σ 8. Regularized Logistic Regression J(θ)=−1m∑i=1m(y(i)log hθ(x(i))+(1−y(i))log (1−hθ(x(i))))+λ2m∥θ∥2J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}|\theta|^2J(θ)=−m1i=1∑m(y(i)loghθ(x(i))+(1−y(i))log(1−hθ (x(i))))+2mλ∥θ∥ 2

9. Gaussian Discriminant Analysis (Class Posterior) p(y=k∣x)=1(2π)n/2∣Σ∣1/2exp (−12(x−μk)TΣ−1(x−μk)) ϕk∑j=1K1(2π)n/2∣Σ∣1/2exp (−12(x−μj)TΣ− 1(x−μj)) ϕjp(y=k|x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x- \mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) , \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) , \phi_j}p(y=k∣x)=∑j=1K(2π)n/2∣Σ∣1/21exp(−21(x−μj)TΣ−1(x−μj))ϕj(2π)n/2∣Σ∣1/21exp(−21(x−μk )TΣ−1(x−μk))ϕk 10. Gradient Boosting Update Rule Residuals: rm(i)=−[∂L(y(i),F(x(i)))∂F(x(i))]F(x)=Fm−1(x)r^{(i)}m = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]{F(x)=F_{m-1}(x)}rm(i)=−[∂F(x(i))∂L(y(i),F(x(i)))]F(x)=Fm−1(x) Model update: Fm(x)=Fm−1(x)+ν⋅hm(x)F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)Fm(x)=Fm−1(x)+ν⋅hm(x)

CS 229 Supervised Learning Cheatsheet ML Notes 2025-2026, Exams of Computer Science

Related documents

Partial preview of the text

Download CS 229 Supervised Learning Cheatsheet ML Notes 2025-2026 and more Exams Computer Science in PDF only on Docsity!

CS 229 Supervised Learning Cheatsheet Updated 2025–

2026 | Machine Learning Study Notes & Key Formulas