CS 229 Supervised Learning Cheatsheet Stanford Machine Learning 2025-2026, Exams of Machine Learning

Chamberlain College of Nursing Machine Learning

This CS 229 Supervised Learning cheatsheet provides a concise reference to the most important machine learning formulas and concepts, including linear regression, logistic regression, generalized linear models (GLMs), perceptrons, support vector machines (SVMs), maximum likelihood estimation (MLE), MAP estimation, gradients, Hessians, and optimization techniques. Ideal for Stanford CS 229 students, machine learning practitioners, and data science professionals preparing for exams, interviews, and coursework in 2025–2026.

Typology: Exams

2025/2026

Available from 06/12/2026

loreen-qui 🇺🇸


CS 229 Supervised Learning Complete Study Guide 2025–2026 | Machine Learning Cheatsheet Covering Linear Regression, Logistic Regression, GLMs, Perceptron, SVMs, MAP Estimation, Optimization, and Core Formulas

Linear Regression

1. Hypothesis: $h_\theta(x) = \theta^T x$

2. MSE Loss: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m \big(y^{(i)} - \theta^T x^{(i)}\big)^2$

3. Normal Equation (MLE): $\hat{\theta} = (X^T X)^{-1} X^T y$

4. Regularized Normal Equation: $\hat{\theta} = (X^T X + \lambda I)^{-1} X^T y$

5. Probabilistic model (Gaussian noise):


$p(y \mid x;\theta) = \mathcal{N}(y \mid \theta^T x, \sigma^2)$

Logistic Regression

Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big)$
Gradient:

$g(\mu) = \eta, \quad \mu = \mathbb{E}[y \mid x]$

Logistic regression = Bernoulli + logit link
Linear regression = Gaussian + identity link

Perceptron

Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$

SVM (Support Vector Machines)

Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1$
Soft-margin primal: $\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^T x^{(i)}+b)\geq 1-\xi_i, \; \xi_i \geq 0$
Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)}) \quad \text{s.t. } \sum_i \alpha_i y^{(i)} = 0, \; 0 \leq \alpha_i \leq C$ (soft margin; for the hard margin the bound is only $\alpha_i \geq 0$)
Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$

Naive Bayes
Posterior:

Information gain: $IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$

Gini index: $G(S) = 1 - \sum_c p_c^2$

Ensembles

Bagging prediction: $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
AdaBoost weight update (labels and weak-learner outputs in $\{-1,+1\}$): $w_i^{(t+1)} \propto w_i^{(t)} \exp\big(-\alpha_t\, y^{(i)} h_t(x^{(i)})\big)$, renormalized so the weights sum to 1. Written with an indicator, this is $w_i^{(t)} \exp\big(2\alpha_t 1\{h_t(x^{(i)}) \neq y^{(i)}\}\big)$ up to normalization.
AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
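The two AdaBoost formulas above can be tried on a tiny one-dimensional problem. The following is a minimal sketch, not a reference implementation: the decision-stump weak learner, the function names (`stump_predict`, `fit_stump`, `adaboost`, `predict`) and the toy data are all assumptions made for illustration. It uses the exponential form of the update, $w \leftarrow w \exp(-\alpha_t y h_t(x))$ followed by normalization, which is the form that pairs with $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump on one scalar feature; returns +1 or -1."""
    return polarity * (1 if x > threshold else -1)

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    points = sorted(set(xs))
    # Candidate thresholds: one below all points, then midpoints between neighbours.
    thresholds = [points[0] - 1.0] + [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps; labels must be +1 or -1."""
    m = len(xs)
    weights = [1.0 / m] * m
    ensemble = []
    for _ in range(rounds):
        eps, threshold, polarity = fit_stump(xs, ys, weights)
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # keep the logarithm finite
        alpha = 0.5 * math.log((1 - eps) / eps)     # alpha_t from the cheatsheet
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points (y * h = -1) are scaled up, correct ones down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (hypothetical): no single stump separates it.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data the best single stump misclassifies three points, while the weighted vote of three stumps fits all ten training labels.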

Gradient boosting residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$

Regularization

Ridge regression: $J(\theta) = \frac{1}{2m}\sum_i (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|^2$
Lasso regression: $J(\theta) = \frac{1}{2m}\sum_i (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
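Setting the gradient of the ridge objective above to zero gives $(X^T X + \lambda I)\theta = X^T y$, the regularized normal equation. The sketch below checks this numerically. It assumes NumPy is available, and the function names and the synthetic data are hypothetical. Lasso is left out because its L1 penalty is not differentiable at zero and has no comparable closed form in general.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * ||y - X theta||^2 + lam/(2m) * ||theta||^2."""
    m = len(y)
    residual = y - X @ theta
    return (residual @ residual) / (2 * m) + lam / (2 * m) * (theta @ theta)

def ridge_gradient(theta, X, y, lam):
    """Gradient of ridge_cost with respect to theta."""
    m = len(y)
    return X.T @ (X @ theta - y) / m + (lam / m) * theta

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam I) theta = X^T y without forming an explicit inverse."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Synthetic data (assumption): 50 samples, 3 features, small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

theta_ridge = ridge_closed_form(X, y, lam=1.0)
theta_ols = ridge_closed_form(X, y, lam=0.0)
print("gradient at ridge solution:", ridge_gradient(theta_ridge, X, y, 1.0))
print("norms (ridge, OLS):", np.linalg.norm(theta_ridge), np.linalg.norm(theta_ols))
```

The gradient at the closed-form solution should vanish up to rounding, and the ridge solution should have a norm no larger than the unregularized one.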

$p(x \mid y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$

LDA discriminant function: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$
QDA discriminant function: $\delta_k(x) = - \tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$
In both cases the predicted class is $\arg\max_k \delta_k(x)$, and the decision boundary between classes $k$ and $l$ is the set where $\delta_k(x) = \delta_l(x)$.

Model Evaluation

Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$

Precision: $\frac{TP}{TP+FP}$
Recall: $\frac{TP}{TP+FN}$
F1-score: $F_1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
ROC AUC: $\int_0^1 TPR(FPR^{-1}(x))\,dx$

Tricks & Notes

Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
Stochastic Gradient Descent:
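For the Newton-Raphson update listed above, a minimal sketch on logistic regression may help. It uses the standard log-likelihood gradient $\sum_i (y^{(i)} - \sigma(\theta^T x^{(i)}))x^{(i)}$ and Hessian $-\sum_i \sigma(\theta^T x^{(i)})(1-\sigma(\theta^T x^{(i)}))x^{(i)}{x^{(i)}}^T$. It assumes NumPy, labels in $\{0,1\}$, and data that is not linearly separable (otherwise the maximum-likelihood estimate does not exist). The function names and the ten-point dataset are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=25):
    """Maximize the logistic log-likelihood with Newton-Raphson steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        H = -(X * (p * (1 - p))[:, None]).T @ X     # Hessian (negative definite here)
        theta = theta - np.linalg.solve(H, grad)    # theta - H^{-1} grad
    return theta

# Toy data (assumption): one feature plus an intercept column, with overlapping classes.
x = np.array([-3.0, -2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

theta_hat = newton_logistic(X, y)
print("theta_hat:", theta_hat)
print("gradient at theta_hat:", X.T @ (y - sigmoid(X @ theta_hat)))
```

Because the toy data is symmetric under flipping both the feature sign and the label, the fitted intercept should come out at essentially zero, which gives a quick sanity check alongside the vanishing gradient.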

◆ 2. Linear Regression
• Model: $h_\theta(x) = \theta^T x$
• Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
• Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$

◆ 3. Logistic Regression (Classification)
• Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
• Model: $h_\theta(x) = g(\theta^T x)$
• Negative log-likelihood cost (cross-entropy): $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$

◆ 4. Generalized Linear Models (GLM)
• Predict $E[y \mid x]$ using link function $g$: $g(E[y \mid x]) = \theta^T x$

◆ 5. Perceptron Algorithm
• Initialize $\theta = 0$.
• For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update $\theta := \theta + y^{(i)} x^{(i)}$
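The perceptron steps above translate almost line for line into code. This is a minimal sketch assuming labels in $\{-1,+1\}$ and a constant 1 prepended to each input as a bias feature. The stopping rule (stop after an epoch without mistakes, or after `max_epochs`) and the toy points are assumptions, since the steps above do not say how many passes to make.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def perceptron(points, labels, max_epochs=100):
    """Run the perceptron update until an epoch makes no mistakes."""
    theta = [0.0] * len(points[0])            # initialize theta = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * dot(theta, x) <= 0:        # misclassified or on the boundary
                theta = [t + y * xi for t, xi in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:                     # converged on separable data
            break
    return theta

# Toy separable data (hypothetical); the first coordinate is the bias feature.
points = [(1, 2, 2), (1, 3, 1), (1, -1, -2), (1, -2, -1)]
labels = [1, 1, -1, -1]
theta = perceptron(points, labels)
print(theta)
```

If the data is not linearly separable the loop never reaches a mistake-free epoch, which is why the sketch caps the number of passes.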

• Entropy: $H(Y) = - \sum_{c} P(y=c)\log P(y=c)$
• Information gain: $IG(X_j) = H(Y) - H(Y \mid X_j)$

◆ 9. Bias-Variance Tradeoff
• Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$

◆ 10. Regularization
• L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|^2$
• L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$

1. Linear Regression (Closed Form MLE)

$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Ty$
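As a quick check of the closed form, the sketch below solves the normal equation on noiseless data generated from a known line. NumPy is assumed, `fit_normal_equation` is a hypothetical name, and the data is made up. The code solves the linear system $(X^TX)\theta = X^Ty$ instead of forming $(X^TX)^{-1}$, which gives the same result when $X^TX$ is invertible and is usually the numerically safer choice.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Least-squares / Gaussian MLE estimate: solve (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noiseless toy data (assumption) from y = 1 + 2x, with an intercept column.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_hat = fit_normal_equation(X, y)
print(theta_hat)
```

With noiseless data the estimate should recover the intercept 1 and slope 2 up to rounding; `np.linalg.lstsq` is the usual alternative when $X^TX$ is singular or badly conditioned.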

2. Logistic Regression Log-Likelihood

$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)$

Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}$

Hessian: $H = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^T$

3. SVM Primal (Soft Margin)

$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$

where entropy is: $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
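The information-gain formula can be computed directly from label counts. This sketch uses only the standard library. The base-2 logarithm (entropy in bits) is an assumption, since the formula above does not fix the base, and the function names and toy attribute values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), with p_c the class frequencies in `labels`."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A): entropy of S minus the size-weighted entropy of each subset S_v."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # attribute splits classes perfectly
print(information_gain(["a", "b", "a", "b"], labels))  # attribute carries no information
```

An attribute that separates the classes perfectly recovers the full entropy of the labels, while one that leaves every subset as mixed as the original set gains nothing.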

7. Bias-Variance Decomposition

$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$

8. Regularized Logistic Regression

$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$
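The regularized logistic cost above has gradient $\frac{1}{m}X^T(h_\theta(X) - y) + \frac{\lambda}{m}\theta$, which is enough to minimize it with plain gradient descent. The sketch below assumes NumPy and labels in $\{0,1\}$. It penalizes every component of $\theta$, as the formula is written, although many treatments leave the intercept unpenalized. The step size, iteration count, function names and toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    """Mean cross-entropy plus lam/(2m) * ||theta||^2."""
    m = len(y)
    h = sigmoid(X @ theta)
    nll = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return nll + lam / (2 * m) * (theta @ theta)

def gradient(theta, X, y, lam):
    """Gradient of `cost` with respect to theta."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m + (lam / m) * theta

# Toy data (hypothetical): one feature plus an intercept column.
x = np.array([-3.0, -2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

theta_gd = np.zeros(2)
for _ in range(500):
    theta_gd -= 0.1 * gradient(theta_gd, X, y, lam=1.0)   # fixed-step gradient descent

print("cost at zero:", cost(np.zeros(2), X, y, 1.0))
print("cost after descent:", cost(theta_gd, X, y, 1.0))
```

Comparing `gradient` against a finite-difference estimate of `cost` is a cheap way to catch sign or scaling mistakes before trusting the optimizer.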

9. Gaussian Discriminant Analysis (Class Posterior)

$p(y=k \mid x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$

10. Gradient Boosting Update Rule

Residuals: $r_m^{(i)} = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$

Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
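For the squared loss $L(y, F) = \frac{1}{2}(y - F)^2$, the negative gradient in the residual formula is simply $y^{(i)} - F_{m-1}(x^{(i)})$, so each round fits a weak learner to the current residuals. The sketch below illustrates the two update lines under that assumption, using regression stumps on a single scalar feature as the weak learner $h_m$. The loss, the weak learner, the function names and the toy data are all choices made for illustration, and the input needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one constant per side."""
    points = sorted(set(xs))
    best = None
    for a, b in zip(points, points[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def gradient_boost(xs, ys, rounds=50, nu=0.1):
    """Return F_M, built as F_0 + nu * sum of stumps fit to successive residuals."""
    f0 = sum(ys) / len(ys)                    # F_0: best constant under squared loss
    preds = [f0] * len(xs)
    learners = []
    for _ in range(rounds):
        # Negative gradient of 1/2 (y - F)^2 at F = F_{m-1}.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        learners.append(h)
        preds = [p + nu * h(x) for p, x in zip(preds, xs)]   # F_m = F_{m-1} + nu * h_m
    return lambda x: f0 + nu * sum(h(x) for h in learners)

# Toy step-function data (hypothetical).
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, 1, 5, 5, 5]
model = gradient_boost(xs, ys)
print([round(model(x), 3) for x in xs])
```

With shrinkage $\nu = 0.1$ each round removes about a tenth of the remaining residual on this data, so the fit approaches the targets gradually instead of in one step.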
