
Introduction to Machine Learning at Carnegie Mellon University, Lecture notes of Machine Learning

Lecture notes from a course on Introduction to Machine Learning at Carnegie Mellon University. The lecture covers PAC learning, oracles and sampling, generative vs. discriminative models, generalization and inductive bias, VC dimension, and generalization and overfitting, and includes Q&A sessions in which the instructor answers student questions. The document can serve as study notes or a summary for students taking a machine learning course or a related field.

Typology: Lecture notes

2020/2021


PAC Learning + Oracles, Sampling, Generative vs. Discriminative

10-601 Introduction to Machine Learning
Matt Gormley
Lecture 16, Oct. 24, 2018
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Q&A

Q: Why do we shuffle the examples in SGD?
A: This is how we do sampling without replacement.
1. Theoretically, we can show that sampling without replacement is not significantly worse than sampling with replacement (Shamir, 2016).
2. Practically, sampling without replacement tends to work better.

Q: What is "bias"?
A: That depends. The word "bias" shows up all over machine learning! Watch out...
1. The additive term in a linear model (i.e., b in w^T x + b).
2. Inductive bias is the principle by which a learning algorithm generalizes to unseen examples.
3. Bias of a model in a societal sense may refer to racial, socio-economic, or gender biases that exist in the predictions of your model.
4. The difference between the expected predictions of your model and the ground truth (as in the "bias-variance tradeoff").
(See your TA's excellent post here: https://piazza.com/class/jkmt7l4of093k5?cid=383)

Generalization and Inductive Bias

Chalkboard:
– Setting: binary classification with binary feature vectors
– Instance space vs. hypothesis space
– Counting: # of instances, # of leaves in a full decision tree, # of full decision trees, # of labelings of training examples
– Algorithm: keep all full decision trees consistent with the training data and do a majority vote to classify
– Case study: training size is all, all-but-one, all-but-two, all-but-three, ...

VC DIMENSION

What if H is infinite?
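The counting step on the chalkboard above can be made concrete. A minimal sketch, assuming M binary features and N training examples; `count_spaces` is an illustrative name, not from the slides:

```python
def count_spaces(M, N):
    """For M binary features and N training examples, return
    (# instances, # leaves in a full decision tree,
     # full decision trees, # labelings of the training examples)."""
    num_instances = 2 ** M          # each instance is one of 2^M bit vectors
    num_leaves = 2 ** M             # a full tree splits on every feature along each path
    num_full_trees = 2 ** (2 ** M)  # each leaf labeled + or - independently
    num_labelings = 2 ** N          # each training example labeled + or -
    return num_instances, num_leaves, num_full_trees, num_labelings

print(count_spaces(3, 5))  # (8, 8, 256, 32)
```

The doubly exponential count of full trees is the point of the chalkboard exercise: with no inductive bias, the hypothesis space is vastly larger than any training set can pin down.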
E.g., linear separators in R^d
E.g., intervals [a, b] on the real line
E.g., thresholds w on the real line
(Slides from Nina Balcan; figures of +/- labeled points omitted)

Shattering, VC-dimension

E.g., H = linear separators in R^2: VCdim(H) >= 3.

VCdim(H) < 4:
Case 1: one point inside the triangle formed by the others. Cannot label the inside point as positive and the outside points as negative.
Case 2: all points on the boundary (convex hull). Cannot label two diagonally as positive and the other two as negative.
Fact: VCdim of linear separators in R^d is d + 1.

If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d + 1 points that can be shattered.
E.g., H = thresholds on the real line: VCdim(H) = 1.
E.g., H = intervals on the real line: VCdim(H) = 2.

SLT-style Corollaries

Corollary 3 (Realizable, Infinite |H|). For some δ > 0, with probability at least (1 − δ), for any hypothesis h in H consistent with the data (i.e., with R̂(h) = 0),

    R(h) ≤ O( (1/N) [ VC(H) ln(N / VC(H)) + ln(1/δ) ] )    (1)

Corollary 4 (Agnostic, Infinite |H|). For some δ > 0, with probability at least (1 − δ), for all hypotheses h in H,

    R(h) ≤ R̂(h) + O( sqrt( (1/N) [ VC(H) ln(N / VC(H)) + ln(1/δ) ] ) )    (2)

Generalization and Overfitting

Whiteboard:
– Empirical Risk Minimization
– Structural Risk Minimization
– Motivation for Regularization

Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error? (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error? (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting?
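Question 1 can be explored numerically with the realizable-case bound of Corollary 3. A rough sketch only: it assumes the unspecified constant hidden inside O(·) is 1, and uses VC(H) = d + 1 = 3 for linear separators in R^2 (the fact from the slides); `realizable_bound` is an illustrative name:

```python
import math

def realizable_bound(vc_dim, n, delta, c=1.0):
    """Bound (1) up to an unknown constant c (assumption: c = 1 here):
    R(h) <= (c/N) [ VC(H) ln(N / VC(H)) + ln(1/delta) ]."""
    return (c / n) * (vc_dim * math.log(n / vc_dim) + math.log(1.0 / delta))

# The bound shrinks roughly like (VC(H) ln N) / N as the sample size grows.
for n in [100, 1000, 10000]:
    print(n, realizable_bound(vc_dim=3, n=n, delta=0.05))
```

The takeaway matches the slide: for a fixed confidence δ, more data or a smaller VC dimension gives a tighter guarantee on the true risk of any consistent hypothesis.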
(Structural Risk Minimization)

Classification and Regression: The Big Picture

Whiteboard:
– Decision rules / models (probabilistic generative, probabilistic discriminative, perceptron, SVM, regression)
– Objective functions (likelihood, conditional likelihood, hinge loss, mean squared error)
– Regularization (L1, L2, priors for MAP)
– Update rules (SGD, perceptron)
– Nonlinear features (preprocessing, kernel trick)

ML Big Picture

Learning Paradigms: What data is available and when? What form of prediction?
• supervised learning
• unsupervised learning
• semi-supervised learning
• reinforcement learning
• active learning
• imitation learning
• domain adaptation
• online learning
• density estimation
• recommender systems
• feature learning
• manifold learning
• dimensionality reduction
• ensemble learning
• distant supervision
• hyperparameter optimization

Problem Formulation: What is the structure of our output prediction?
• boolean: Binary Classification
• categorical: Multiclass Classification
• ordinal: Ordinal Classification
• real: Regression
• ordering: Ranking
• multiple discrete: Structured Prediction
• multiple continuous: (e.g., dynamical systems)
• both discrete & continuous: (e.g., mixed graphical models)

Theoretical Foundations: What principles guide learning?
• probabilistic
• information theoretic
• evolutionary search
• ML as optimization

Facets of Building ML Systems: How to build systems that are robust, efficient, adaptive, effective?
1. Data prep
2. Model selection
3. Training (optimization / search)
4. Hyperparameter tuning on validation data
5. (Blind) assessment on test data

Big Ideas in ML: Which are the ideas driving development of the field?
• inductive bias
• generalization / overfitting
• bias-variance decomposition
• generative vs. discriminative
• deep nets, graphical models
• PAC learning
• distant rewards

Application Areas (key challenges?):
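The whiteboard pieces above (an objective function, a regularizer, and an update rule) compose into one minimal sketch: SGD on the L2-regularized conditional likelihood, i.e., logistic loss. Everything here is illustrative; the function name, 2-D inputs, and hyperparameters are assumptions, not from the lecture:

```python
import math
import random

def sgd_logistic_l2(data, lam=0.01, lr=0.5, epochs=200, seed=0):
    """Minimal sketch: SGD for L2-regularized logistic regression.
    data is a list of ((x1, x2), y) pairs with y in {0, 1}."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # sampling without replacement, as in the Q&A
        for (x, y) in data:
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(y = 1 | x)
            g = p - y                        # gradient of the log loss wrt z
            w[0] -= lr * (g * x[0] + lam * w[0])  # L2 term shrinks the weights
            w[1] -= lr * (g * x[1] + lam * w[1])
            b -= lr * g  # the bias: the additive term b in w^T x + b
    return w, b
```

Swapping the loss gives the other models on the list (hinge loss for SVM, squared error for regression); swapping the L2 penalty for L1 changes the regularizer; the update rule stays SGD throughout.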
NLP, Speech, Computer Vision, Robotics, Medicine, Search

PROBABILISTIC LEARNING

Oracles and Sampling

Whiteboard:
– Sampling from common probability distributions
  • Bernoulli
  • Categorical
  • Uniform
  • Gaussian
– Pretending to be an Oracle (Regression)
  • Case 1: Deterministic outputs
  • Case 2: Probabilistic outputs
– Probabilistic Interpretation of Linear Regression
  • Adding Gaussian noise to a linear function
  • Sampling from the noise model
– Pretending to be an Oracle (Classification)
  • Case 1: Deterministic labels
  • Case 2: Probabilistic outputs (Logistic Regression)
  • Case 3: Probabilistic outputs (Gaussian Naïve Bayes)

In-Class Exercise

1. With your neighbor, write a function which returns samples from a Categorical.
   – Assume access to the rand() function.
   – The function signature should be: categorical_sample(theta), where theta is the array of parameters.
   – Make your implementation as efficient as possible!
2. What is the expected runtime of your function?

Generative vs. Discriminative

Whiteboard:
– Generative vs. discriminative models
  • Chain rule of probability
  • Maximum (conditional) likelihood estimation for discriminative models
  • Maximum likelihood estimation for generative models
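One possible solution to the in-class exercise above, assuming rand() corresponds to Python's `random.random()` (a uniform draw in [0, 1)). The single-pass inverse-CDF scan answers part 2: the expected runtime is O(K) in the number of categories K (it can be brought to O(log K) with a precomputed cumulative array and binary search, at O(K) setup cost):

```python
import random

def categorical_sample(theta):
    """Draw an index k with probability theta[k], using one uniform draw
    and a cumulative scan (assumes theta sums to 1)."""
    u = random.random()          # stands in for the rand() in the exercise
    cumulative = 0.0
    for k, p in enumerate(theta):
        cumulative += p
        if u < cumulative:
            return k
    return len(theta) - 1        # guard against floating-point round-off

# Empirical check: sample frequencies should approach theta.
counts = [0, 0, 0]
for _ in range(10000):
    counts[categorical_sample([0.2, 0.5, 0.3])] += 1
```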