COMS W4771 MACHINE LEARNING LECTURE NOTES COMPLETE 2024

Introduction to Machine Learning

Machine Learning: the study of computational mechanisms that "learn" from data in order to make predictions and decisions
o Statistical, data-driven models of computation
o Real domains (vision, speech, behavior): no E = mc^2
    Noisy, complex, nonlinear
    Have many variables
    Non-deterministic
    Incomplete, approximate models
o Need: statistical models driven by data and sensors
o Bottom-up: use data to form a model
o Why? Complex data are everywhere: audio, video, the internet
o Intelligence = Learning = Prediction
o Statistician Breiman: industry-style, algorithmic learning, very efficient

Machine Learning Tasks
o Supervised: algorithms where the answers (labels) are known in advance and the goal is to make forecasts for future data (a known relationship/function). Learning happens by comparing the results with the expected values.
    Classification: f(x) > 0 or f(x) < 0 (discrete labels)
    Regression: f(x) = y (real-valued outputs)
o Unsupervised: algorithms that do not know the labels/clusters/relevant features in advance. Explore what we see and figure out what information we have.
    Modeling/Structuring p(x): represent the data, help organize the data
    Clustering: separate the data by common characteristics; find what the groups are and their shared features
    Feature Selection: extract the most relevant features
    Detection p(x) < t: flag points whose probability falls below a certain threshold

Machine Learning Applications
o Interdisciplinary (CS, Math, Stats, Physics, OR, Psych)
o Data-driven approach to AI
o Many domains are too hard to handle manually
o Examples (any type of large data set): speech recognition, computer vision, time series prediction, genomics, NLP and parsing, text and information retrieval, medical applications, behavior/games
o Machine learning is a subset of Artificial Intelligence, using a statistical approach based on data

Example 1: Image classification
o Goal: automatically recognize bird species in new photos
Example 2: Matchmaking
o Goal: predict how likely any pair of students will go on a date if introduced to each other
Example 3: Machine Translation
o Goal: automatically translate any English sentence into French
Example 4: Personalized Medicine
o Goal: prescribe a personalized treatment for any patient that delivers the best possible health outcome for that patient

Basic Setting
o Data: labeled examples (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) ∈ X × Y
o Prediction problems
o Goal: learn a prediction function that provides correct labels for inputs that may be encountered in the future (e.g. new unlabeled examples)
o Feed the collection of labeled examples into a learning algorithm, which produces a learned predictor
    Assumption: past data and future data are similar

Some Basic Issues
o How should we represent the input objects?
o What types of prediction functions should we consider?
o How should the data be used to select a predictor?
o How can we evaluate whether learning was successful?

Special case: binary classification
o Y = {0, 1}: is it a [category] or not?
o Why is this hard?
    We only have labels for previous data, which together comprise a minuscule fraction of the input space X
    The relationship between an input x ∈ X and its correct label y ∈ Y may be complicated, possibly ambiguous

ML Example: Digit Recognition
o Automate zip code reading in the post office
o Learn from labeled training images
o Predict labels on test images
o X is the input space, Y is the output space
o Y has K classes, so Y = {1, 2, ..., K}
o From labeled examples, construct a classifier that, given X, predicts Y
o Note: it is possible to see both (x, 1) and (x, 2) for the same input x

How do we say how good a classifier is?
o Prediction accuracy: the higher Pr[f(X) = Y], the better
o Prediction error: the probability that the forecast differs from the true value, err(f) = Pr[f(X) ≠ Y]
o Key assumption: the data are iid; this is the connection between what we've seen in the past and what we expect to see in the future

More notation
o Let Z be a random variable
o Indicator function: 1{A} = 1 if A is true, 0 if A is false
o Expectation: given Z with distribution Q and a real-valued function h,
    E[h(Z)] = expected value of h(Z) = Σ_z Pr[Z = z] · h(z)
    Multiply each value by its probability
o Suppose A and B are random variables
o What kind of object is C = E[A | B]?
o Answer: a random variable
    h(b) = E[A | B = b] is a deterministic function of b
    The distribution of C is given by Pr[C = h(b)] = Pr[B = b]
o What is the expectation of C = E[A | B]?
o Answer: E[C] = E[A]
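As a quick numerical sanity check of the last two answers (not part of the lecture notes), the identity E[C] = E[A] for C = E[A | B] can be verified on a small made-up joint distribution in R; the variable names and probabilities below are hypothetical, chosen only for illustration.

# A small, hypothetical joint distribution of (A, B) over a few discrete values
joint <- expand.grid(a = c(0, 1, 2), b = c(1, 2))
joint$p <- c(0.10, 0.20, 0.10, 0.25, 0.15, 0.20)   # probabilities, sum to 1
# E[A] computed directly from the definition: sum over (a, b) of a * Pr[A = a, B = b]
E_A <- sum(joint$a * joint$p)
# h(b) = E[A | B = b] is a deterministic function of b
by_b <- split(joint, joint$b)
h    <- sapply(by_b, function(s) sum(s$a * s$p) / sum(s$p))
# C = E[A | B] is a random variable taking value h(b) with probability Pr[B = b]
Pr_B <- sapply(by_b, function(s) sum(s$p))
E_C  <- sum(h * Pr_B)
c(E_A, E_C)   # the two values agree, illustrating E[E[A | B]] = E[A]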
What is the optimal classifier?
o Suppose (X, Y) ~ P; for any classifier f, its prediction error is
    Pr[f(X) ≠ Y] = E[ E[ 1{f(X) ≠ Y} | X ] ]
o This is minimized when f(x) = argmax_y Pr[Y = y | X = x]
o A classifier f with the property above for all x is called the Bayes classifier; it has the smallest prediction error among all classifiers
    We try to find a function f where input x gives a particular value y
    It divides the input space X into different regions according to how it predicts; the boundaries between these regions are called the decision boundaries
    In effect, we have found a threshold
o Bayes rule: Pr[A | B] = Pr[A] · Pr[B | A] / Pr[B]; since Pr[X = x] does not depend on y, the Bayes classifier is
    f(x) = argmax_y Pr[Y = y] · Pr[X = x | Y = y]

Gaussian class conditional densities
o Binary classification setting
o The prior probability of a class is the probability that it occurs, based on its history (relative frequency)
o There is a conditional density for each class
o Then use the Bayes classifier approach
    Y can only take the values 0 or 1; weight each case by the probability that the label is 0 or 1
    Use the density functions, multiply them by the prior probabilities, and output 1 if that case is more likely, 0 if not
o The resulting decision boundary can be a linear or quadratic separator, or even more complicated

Machine Learning Lecture 2

Classifiers Via Generative Models

Plug-in Classifiers

Bayes classifier
o Has the smallest prediction error among all possible classifiers
o Cannot be constructed without knowing Pr[Y = y | X = x] for all (x, y) ∈ X × Y
o All we have are labeled examples drawn from the distribution
o E.g. a set of (x, y) pairs: {(2, 1), (7, -1), (3, 1)}

Plug-in classifiers
o Using the labeled examples, form an approximation to Pr[Y = y | X = x], then plug it into the formula for the Bayes classifier
o The goal is to predict labels y given examples x
o Use generative statistical models to estimate P and then form the approximation to Pr[Y = y | X = x]

Process of building plug-in classifiers with generative statistical models:
o Use the training data (labeled examples) to obtain approximations of each component in the Bayes classifier formula
    f(x) = argmax_y Pr[Y = y] · Pr[X = x | Y = y]
o Plug the approximations into the formula to form the classifier f_hat

Example: Gaussian Class Conditional Densities
o X = R^d, Y = {1, 2, ..., K}
o Class priors: the MLE estimate of π_y is
    π_hat_y = (1/n) Σ_i 1{y_i = y}
    This is the number of training examples with that particular label y divided by the total number of examples
o Class conditional density N(μ_y, Σ_y): the MLE estimates of (μ_y, Σ_y) are the per-class sample mean and sample covariance,
    μ_hat_y = (1/n_y) Σ_{i : y_i = y} x_i,  Σ_hat_y = (1/n_y) Σ_{i : y_i = y} (x_i − μ_hat_y)(x_i − μ_hat_y)^T,  where n_y = Σ_i 1{y_i = y}
o Plug-in classifier:
    f_hat(x) = argmax_y π_hat_y · N(x; μ_hat_y, Σ_hat_y)
    (a short R sketch of this recipe appears at the end of this subsection)

Advantages of these classifiers:
o Very simple recipe: estimate the class priors and class conditional distributions
o Surprisingly effective
Disadvantages:
o The modeling assumptions are hard to justify
o Effort spent modeling P away from the decision boundary between classes is not necessary for good classification
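A minimal R sketch of the plug-in recipe above, under stated assumptions: X is an n x d numeric matrix of inputs, y is a length-n label vector, and the mvtnorm package supplies the multivariate normal density. None of these names come from the lecture notes; the iris data at the end is only a convenient stand-in.

# Plug-in classifier with Gaussian class conditional densities (a sketch)
library(mvtnorm)   # for dmvnorm, the multivariate normal density

fit_plugin <- function(X, y) {
  classes <- sort(unique(as.character(y)))
  models <- lapply(classes, function(k) {
    Xk <- X[y == k, , drop = FALSE]
    list(prior = nrow(Xk) / nrow(X),                      # MLE of class prior pi_y
         mu    = colMeans(Xk),                            # MLE of class mean
         Sigma = cov(Xk) * (nrow(Xk) - 1) / nrow(Xk))     # MLE covariance (1/n_y scaling)
  })
  list(classes = classes, models = models)
}

predict_plugin <- function(fit, Xnew) {
  # Score each class by prior * N(x; mu, Sigma) and predict the argmax
  scores <- sapply(fit$models, function(m)
    m$prior * dmvnorm(Xnew, mean = m$mu, sigma = m$Sigma))
  fit$classes[max.col(scores)]
}

# Example usage on the built-in iris data (4 features, 3 classes)
fit  <- fit_plugin(as.matrix(iris[, 1:4]), iris$Species)
yhat <- predict_plugin(fit, as.matrix(iris[, 1:4]))
mean(yhat != iris$Species)   # training error rate of the plug-in classifier

With a covariance matrix shared across classes this gives a linear separator; with per-class covariances, as here, the decision boundaries are quadratic, matching the linear-or-quadratic remark above (MASS::qda implements essentially this estimator).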
Evaluating Classifiers

Separate the data into training and test data sets
o E.g. for time series, use older data for training and newer data for testing
o Otherwise, the best way to split into two sets is a random selection of observations
    E.g. 70% of observations used as training data and 30% as test data
o The training data set is also split into two data sets: training and validation
    Basically train and test repeatedly with these two smaller data sets; once the model fits the validation data, it can be evaluated on the test data set
o Find the test error
    Can use the proportion of wrong predictions: # wrong / total #

Definitions:
o Training data: let S be the set of observations (x_i, y_i) used to construct a classifier f_hat
    Can we tell if f_hat is any good?
o True error: err(f_hat) = Pr[f_hat(X) ≠ Y], where (X, Y) ~ P. We don't know P, so this cannot be computed
o Training error: err(f_hat, S), the error rate on S. This often underestimates the true error

Test Error
o General method:
    Training data S
    Test data T
o Only use S to build the classifier f_hat; the test error is a good estimate of the true error of f_hat
o Assuming T is an iid sample from P, the test error is an unbiased estimator of the true error
o Usually the training error err(f_hat, S) is lower than the test error err(f_hat, T), since the training data was used to build the model
o The true error? Somewhere between err(f_hat, S) and err(f_hat, T), since err(f_hat, S) is the best case

Decision Trees
o Suppose the data S at a leaf l is split by a rule h into S_L and S_R, where p_L := |S_L| / |S| and p_R := |S_R| / |S|
o The reduction in uncertainty from using rule h at leaf l is
    u(S) − (p_L · u(S_L) + p_R · u(S_R))
Example:
o Entropy: u(S) = Σ_z Pr[Z = z] · log2(1 / Pr[Z = z])
o Total of 30 observations, 14 with label A and 16 with label B
o Calculating the entropy of the parent node:
    u(S) = −(14/30) log2(14/30) − (16/30) log2(16/30) = 0.996
o Suppose a rule splits S into S_L with 17 observations (13 A, 4 B) and S_R with 13 observations (1 A, 12 B):
    u(S_L) = −(13/17) log2(13/17) − (4/17) log2(4/17) = 0.787
    u(S_R) = −(1/13) log2(1/13) − (12/13) log2(12/13) = 0.391
o Weighted entropy after the split: (17/30) · 0.787 + (13/30) · 0.391 = 0.615
o Information gain = 0.996 − 0.615 = 0.38
    (a short R check of these numbers appears at the end of this section)

Stopping Criterion
o When the tree reaches a pre-specified size. Involves setting additional "tuning parameters"; use hold-out or cross-validation.
o When every leaf is pure
    Serious danger of overfitting spurious structure due to sampling

Overfitting
o Training error goes to zero as the number of nodes in the tree increases
o True error decreases initially, but eventually increases due to overfitting

Preventing overfitting
o Split the training data S into two parts S' and S''
    Use the first part S' to grow the tree until all leaves are pure
    Use the second part S'' to choose a good pruning of the tree
o Pruning algorithm
    Loop: replace any tree node by a leaf node if doing so improves the error on S''
    Can be done with dynamic programming (bottom-up traversal of the tree)
o The independence of S' and S'' makes it unlikely for spurious structures in each to perfectly align.
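A short R check of the information-gain arithmetic above, not from the original notes. The parent counts (14 A, 16 B) and the child sizes (17 and 13) are given in the example; the child compositions (13 A / 4 B and 1 A / 12 B) are inferred from the stated entropies, so treat them as a reconstruction. The printed values reproduce 0.996, 0.787, 0.391, 0.615, and 0.38 up to rounding.

# Entropy (in bits) of a node, given its vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]               # drop empty classes to avoid 0 * log(0)
  sum(p * log2(1 / p))
}

u_S  <- entropy(c(14, 16))    # parent node: 14 A, 16 B
u_SL <- entropy(c(13, 4))     # left child:  13 A,  4 B
u_SR <- entropy(c(1, 12))     # right child:  1 A, 12 B
weighted <- (17/30) * u_SL + (13/30) * u_SR   # expected entropy after the split
gain <- u_S - weighted                        # information gain
round(c(u_S, u_SL, u_SR, weighted, gain), 3)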
ID3 (basic decision tree), Quinlan 1986
o No backtracking: local minima (cannot go back to revise a decision)
o Hypothesis space is complete: the target function is in it
o Outputs a single hypothesis
o Statistically based search choices: robust to noisy data
o Inductive bias prefers the shortest tree: Occam's razor

C4.5 (improvement of ID3)
o Prunes trees after creation
o Handles continuous and discrete features
o Handles attributes with differing costs
o Handles data with missing values

CART (Classification and Regression Trees), very similar to C4.5
o Also handles numerical target variables, via regression
o Does not compute rule sets
o Grows large trees and prunes to minimize the error rate using cross-validation
o Binary trees, Gini impurity

Note about HW
o LaTeX preferred; otherwise the write-up must be very clear

Note on R
trial <- read.csv("trial.csv")        # open and read a file
write.csv(trial, file = "hope.csv")   # write into a file
table(myyankees$bats)                 # tabulate the variable bats
myyankees$bats[5] <- 1                # assign 1 to the fifth element of bats

Chapter 3 Lab: Linear Regression
o library(MASS)
o library(ISLR)   # import the libraries
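For the Chapter 3 lab note above, here is a minimal sketch of the kind of commands the ISLR linear regression lab typically walks through. It assumes the Boston housing data from MASS and the variables medv and lstat, which are the textbook's usual running example rather than anything specified in these notes.

library(MASS)   # provides the Boston housing data set
library(ISLR)   # companion package for the ISLR textbook

fit <- lm(medv ~ lstat, data = Boston)   # simple linear regression of medv on lstat
summary(fit)                             # coefficients, standard errors, R^2
confint(fit)                             # confidence intervals for the coefficients
predict(fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")
plot(Boston$lstat, Boston$medv)          # scatterplot of the data
abline(fit, col = "red")                 # overlay the fitted regression line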