CS 229 Supervised Learning Cheatsheet (Updated for 2025/2026)
Linear Regression
- Hypothesis: $h_\theta(x) = \theta^T x$
- MSE Loss: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$
- Normal Equation (MLE): $\hat{\theta} = (X^TX)^{-1}X^Ty$
- Regularized Normal Equation: $\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty$
- Probabilistic model (Gaussian noise): $p(y|x;\theta) = \mathcal{N}(y|\theta^Tx,\sigma^2)$
Logistic Regression
- Sigmoid: $\sigma(z) = \frac{1}{1+e^{-z}}$
- Hypothesis: $h_\theta(x) = \sigma(\theta^T x)$
- Log-likelihood: $\ell(\theta) = \sum_{i=1}^m \Big(y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log (1-h_\theta(x^{(i)}))\Big)$
- Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - h_\theta(x^{(i)})\big) x^{(i)}$
Generalized Linear Models (GLM)
- Link function: $g(\mu) = \eta, \quad \mu = \mathbb{E}[y|x]$
- Logistic regression = Bernoulli + logit link
- Linear regression = Gaussian + identity link
Perceptron
- Update rule: $\theta := \theta + y^{(i)}x^{(i)} \quad \text{if } y^{(i)}(\theta^T x^{(i)}) \leq 0$
SVM (Support Vector Machines)
- Hard-margin primal: $\min_{\theta,b} \frac{1}{2}\|\theta\|^2 \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1$
- Soft-margin primal: $\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_i \xi_i \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1-\xi_i, \ \xi_i \geq 0$
- Dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} K(x^{(i)},x^{(j)})$
- Kernel trick: $K(x,z) = \phi(x)^T \phi(z)$
- Decision function: $f(x) = \text{sign}\Big(\sum_i \alpha_i y^{(i)} K(x^{(i)},x) + b\Big)$
Naive Bayes
- Posterior: $p(y|x) \propto p(y) \prod_{j=1}^n p(x_j|y)$
Decision Trees
- Information gain: $IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$
- Gini index: $G(S) = 1-\sum_c p_c^2$
Ensembles
- Bagging prediction: $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B f_b(x)$
- AdaBoost weight update: $w_i^{(t+1)} = w_i^{(t)} \exp\big(\alpha_t 1\{h_t(x^{(i)}) \neq y^{(i)}\}\big)$
- AdaBoost $\alpha_t$: $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
- Gradient boosting residuals: $r^{(i)}_m = - \left[\frac{\partial L(y^{(i)},F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F=F_{m-1}}$
- Gradient boosting update: $F_m(x) = F_{m-1}(x) + \nu h_m(x)$
Regularization
- Ridge regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{2m}\|\theta\|^2$
- Lasso regression: $J(\theta) = \frac{1}{2m}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \frac{\lambda}{m}\|\theta\|_1$
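With the $\frac{1}{2m}$ and $\frac{\lambda}{2m}$ scaling used in the ridge cost above, setting the gradient to zero gives $(X^TX + \lambda I)\theta = X^Ty$. The NumPy sketch below checks this numerically on synthetic data. The function names, the random data, and the choice `lam=1.0` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) = 1/(2m) * ||y - X theta||^2 + lam/(2m) * ||theta||^2."""
    m = len(y)
    residual = y - X @ theta
    return (residual @ residual + lam * theta @ theta) / (2 * m)

def ridge_gradient(theta, X, y, lam):
    """Gradient of the ridge cost with respect to theta."""
    m = len(y)
    return (X.T @ (X @ theta - y) + lam * theta) / m

def fit_ridge(X, y, lam):
    """Closed-form minimizer: solve (X^T X + lam I) theta = X^T y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Hypothetical synthetic data (assumption: no intercept column, all weights penalized).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

theta_ridge = fit_ridge(X, y, lam=1.0)
grad_at_solution = ridge_gradient(theta_ridge, X, y, lam=1.0)  # should be ~0
```

The lasso cost has no comparable closed form in general because the $\ell_1$ term is not differentiable at zero, so it is usually minimized iteratively (for example by coordinate descent).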
Gaussian Discriminant Analysis (GDA)
- Class-conditional density: $p(x|y=k) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big)$
- LDA discriminant function: $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln \pi_k$
- QDA discriminant function: $\delta_k(x) = -\tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k) + \ln \pi_k$
- In both cases, predict the class with the largest $\delta_k(x)$; the decision boundary between classes $k$ and $l$ is where $\delta_k(x) = \delta_l(x)$.
Model Evaluation
- Bias-Variance decomposition: $E[(y-\hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$
- Precision: $\frac{TP}{TP+FP}$
- Recall: $\frac{TP}{TP+FN}$
- F1-score: $F1 = \frac{2 \cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}}$
- ROC AUC = $\int_0^1 TPR(FPR^{-1}(x))\, dx$
Tricks & Notes
- Newton-Raphson update: $\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla \ell(\theta^{(t)})$
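- Stochastic Gradient Descent: $\theta := \theta - \alpha \nabla_\theta J^{(i)}(\theta)$, where $J^{(i)}$ is the loss on a single example $(x^{(i)}, y^{(i)})$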
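As a rough illustration of the Newton-Raphson update listed above, the sketch below applies it to the logistic regression log-likelihood, using the gradient $\sum_i (y^{(i)} - \sigma(\theta^Tx^{(i)}))x^{(i)}$ and Hessian $-\sum_i \sigma(1-\sigma)\,x^{(i)}{x^{(i)}}^T$. The function names, the tiny dataset, and the fixed iteration count are assumptions made for the example. The data is deliberately not linearly separable, since otherwise the maximum-likelihood estimate would not be finite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=25):
    """Maximize the logistic log-likelihood with Newton-Raphson steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        H = -(X * (p * (1 - p))[:, None]).T @ X     # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(H, grad)    # theta - H^{-1} grad
    return theta

# Hypothetical 1-D data with an intercept column; labels overlap around x = 2..3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta_newton = newton_logistic(X, y)
final_gradient = X.T @ (y - sigmoid(X @ theta_newton))  # ~0 at the optimum
```

Solving the linear system with `np.linalg.solve` avoids forming $H^{-1}$ explicitly, which is usually the more stable choice.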
2. Linear Regression
- Model: $h_\theta(x) = \theta^T x$
- Cost function (MSE): $J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$
- Gradient descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$
3. Logistic Regression (Classification)
- Sigmoid function: $g(z) = \frac{1}{1+e^{-z}}$
- Model: $h_\theta(x) = g(\theta^T x)$
- Log-likelihood cost: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \Big( y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big)$
4. Generalized Linear Models (GLM)
- Predict $E[y|x]$ using link function $g$: $g(E[y|x]) = \theta^T x$
5. Perceptron Algorithm
- Initialize $\theta = 0$.
- For each $(x^{(i)}, y^{(i)})$: if $y^{(i)} (\theta^T x^{(i)}) \leq 0$, update: $\theta := \theta + y^{(i)} x^{(i)}$
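The two perceptron steps above can be sketched in plain Python as follows. The function name, the `max_epochs` cap, and the toy dataset are assumptions for illustration. Labels are taken to be in $\{-1, +1\}$, and a constant first feature stands in for a bias term.

```python
def perceptron_train(examples, n_features, max_epochs=100):
    """Repeat the perceptron update until an epoch makes no mistakes (or max_epochs)."""
    theta = [0.0] * n_features                      # initialize theta = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:                       # labels y are in {-1, +1}
            margin = y * sum(t * xi for t, xi in zip(theta, x))
            if margin <= 0:                         # misclassified or on the boundary
                theta = [t + y * xi for t, xi in zip(theta, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return theta

# Hypothetical linearly separable data; the first feature is a constant 1.
examples = [
    ([1.0, 2.0, 1.0], 1),
    ([1.0, 3.0, 2.0], 1),
    ([1.0, -1.0, -2.0], -1),
    ([1.0, -2.0, -1.0], -1),
]
theta = perceptron_train(examples, n_features=3)
```

On separable data like this the loop stops once every example has a strictly positive margin; on non-separable data it simply runs until `max_epochs`.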
8. Decision Trees
- Entropy: $H(Y) = - \sum_{c} P(y=c)\log P(y=c)$
- Information gain: $IG(X_j) = H(Y) - H(Y|X_j)$
9. Bias-Variance Tradeoff
- Expected error decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$
10. Regularization
- L2 (Ridge): $J(\theta) = \text{Loss} + \lambda \|\theta\|^2$
- L1 (Lasso): $J(\theta) = \text{Loss} + \lambda \|\theta\|_1$
1. Linear Regression (Closed Form MLE)
$\hat{\theta}_{MLE} = \arg\min_\theta \frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 = (X^TX)^{-1}X^Ty$
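A minimal NumPy sketch of this closed form is shown below, assuming $X^TX$ is invertible. The function name and the noiseless toy data are illustrative assumptions. The code solves the normal equations $(X^TX)\theta = X^Ty$ rather than computing the inverse explicitly.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical noiseless data y = 1 + 2x, with a column of ones for the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_hat = fit_normal_equation(X, y)  # recovers intercept 1 and slope 2 up to rounding
```

When $X^TX$ is singular or badly conditioned, a least-squares routine such as `np.linalg.lstsq` is a common alternative.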
2. Logistic Regression Log-Likelihood
$\ell(\theta) = \sum_{i=1}^m \Big( y^{(i)} \log \sigma(\theta^Tx^{(i)}) + (1-y^{(i)}) \log (1-\sigma(\theta^Tx^{(i)})) \Big)$
Gradient: $\nabla_\theta \ell(\theta) = \sum_{i=1}^m \big(y^{(i)} - \sigma(\theta^Tx^{(i)}) \big) x^{(i)}$
Hessian: $H = - \sum_{i=1}^m \sigma(\theta^Tx^{(i)}) \big(1 - \sigma(\theta^Tx^{(i)})\big) x^{(i)} {x^{(i)}}^T$
3. SVM Primal (Soft Margin)
$\min_{\theta,b,\xi} \frac{1}{2}\|\theta\|^2 + C\sum_{i=1}^m \xi_i \quad \text{s.t. } y^{(i)}(\theta^Tx^{(i)}+b)\geq 1-\xi_i, \ \xi_i \geq 0$
6. Information Gain (Decision Trees)
$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$
where entropy is: $H(S) = - \sum_{c \in \text{Classes}} p_c \log p_c$
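To make the entropy and information-gain formulas concrete, here is a small standard-library sketch. The function names and toy labels are assumptions. The notes do not state a base for the logarithm, so base 2 (bits) is assumed here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c), measured in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, attribute_values):
    """IG(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    total = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)   # group labels by attribute value
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: one attribute separates the classes, the other does not.
labels = ["yes", "yes", "no", "no"]
perfect_split = ["a", "a", "b", "b"]
useless_split = ["a", "b", "a", "b"]

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
```

An attribute that splits the set into pure subsets recovers the full entropy of the labels as gain, while an attribute whose subsets have the same class mix as the parent gains nothing.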
7. Bias-Variance Decomposition
$\mathbb{E}\Big[ (y - \hat{f}(x))^2 \Big] = \Big( \mathbb{E}[\hat{f}(x)] - f(x) \Big)^2 + \mathbb{E}\Big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\Big] + \sigma^2$
8. Regularized Logistic Regression
$J(\theta) = - \frac{1}{m}\sum_{i=1}^m \Big(y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \Big) + \frac{\lambda}{2m}\|\theta\|^2$
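The regularized logistic regression cost can be evaluated directly, as in the plain-Python sketch below. The function names and the four-example dataset are assumptions. Following the formula as written, the penalty covers every component of $\theta$, including any intercept weight, which is a choice many implementations make differently.

```python
from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """J(theta) = -(1/m) * sum[y log h + (1 - y) log(1 - h)] + lam/(2m) * ||theta||^2."""
    m = len(y)
    log_likelihood = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, x_i)))
        log_likelihood += y_i * log(h) + (1 - y_i) * log(1 - h)
    penalty = lam / (2 * m) * sum(t * t for t in theta)
    return -log_likelihood / m + penalty

# Hypothetical data; the first feature is a constant 1.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.5]]
y = [1, 0, 1, 0]

cost_at_zero = regularized_logistic_cost([0.0, 0.0], X, y, lam=1.0)
cost_with_penalty = regularized_logistic_cost([1.0, 1.0], X, y, lam=1.0)
cost_without_penalty = regularized_logistic_cost([1.0, 1.0], X, y, lam=0.0)
```

At $\theta = 0$ every prediction is $0.5$, so the cost equals $\log 2$ regardless of $\lambda$. For very large $|\theta^Tx|$ the call `log(1 - h)` can hit `log(0)`, so a practical version would clip `h` or work with logits.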
9. Gaussian Discriminant Analysis (Class Posterior)
$p(y=k|x) = \frac{\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\Big) \, \phi_k}{\sum_{j=1}^K \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1}(x-\mu_j)\Big) \, \phi_j}$
10. Gradient Boosting Update Rule
Residuals: $r^{(i)}_m = - \left[\frac{\partial L(y^{(i)}, F(x^{(i)}))}{\partial F(x^{(i)})}\right]_{F(x)=F_{m-1}(x)}$
Model update: $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$
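As one concrete instance of the residual and update rules above, the sketch below assumes the squared loss $L(y, F) = \frac{1}{2}(y - F)^2$, for which the negative gradient is simply $y^{(i)} - F_{m-1}(x^{(i)})$. The weak learner is a hypothetical one-split regression stump on 1-D inputs. The function names, `n_rounds=50`, `nu=0.1`, and the data are all illustrative assumptions.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # needs at least two distinct x values
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, nu=0.1):
    """Gradient boosting for L(y, F) = (y - F)^2 / 2, whose negative gradient is y - F."""
    f0 = sum(ys) / len(ys)                          # constant initial model F_0
    learners = []
    predictions = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]            # r_m^(i)
        h = fit_stump(xs, residuals)                                    # h_m fits the residuals
        learners.append(h)
        predictions = [p + nu * h(x) for p, x in zip(predictions, xs)]  # F_m = F_{m-1} + nu * h_m
    return lambda x: f0 + nu * sum(h(x) for h in learners)

# Hypothetical 1-D data with a jump between x = 3 and x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
```

With a different loss only the `residuals` line changes, since it is the one place where the negative gradient of $L$ enters.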