CS181 Lecture 23: Wrap-up
David C. Parkes

Today
• Review
• Applications

Examples: Classification
• Decision trees
• Perceptron algorithm
• Neural networks
• Support vector machines
• Instance-based methods
• Probabilistic methods
  – e.g., Naïve Bayes, graphical models

Inductive Bias
• Inductive bias: the assumptions made by a learning algorithm that allow it to generalize beyond the training data
• "a basis for choosing one generalization over another"

Two Kinds of Inductive Bias
• Restriction bias: H ⊂ F
  – e.g., perceptrons vs. general neural networks; naïve Bayes models
  – we saw that monotone formulas are PAC-learnable in O(m), but general DNF is not known to be PAC-learnable
• Preference bias:
  – prefer a simpler h over a more complex h
  – intuition: if we find a good h ∈ H in a small class H, it is less likely to be good just by chance

Achieving Preference Bias
• Preference bias may be achieved by:
  – Regularization
    • Score(h) = f1(Complexity(h)) + f2(Error(h))
    • e.g., penalize ½‖w‖² in neural networks
  – Search bias
    • prefer simple hypotheses
    • e.g., ID3 in decision trees
  – Bayesian methods
    • adopt a prior that prefers simpler probabilistic models over more complex models
    • P(M,θ | D) ∝ P(D | M,θ)P(M,θ) = P(D | M,θ)P(θ|M)P(M)

Selecting an Inductive Bias through Cross-validation
• Often we have a parameter to set or a model to select (a sketch of the full protocol follows the next two slides)
• Examples:
  – DTs: threshold for chi-squared pruning
  – NNs: number of hidden nodes
  – SVMs: kernel selection
  – Instance-based: k in the nearest-neighbor method

Reporting Error
• Split the data into a training set A, a validation set B, and a test set C
• Select the parameter on the validation set: min_λ Error(B | h_λ(A)) = 15%

Reporting Error: Test Set!
• Do not report the validation error; with λ* = arg min_λ Error(B | h_λ(A)),
• Report: Error(C | h_{λ*}(A))
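A minimal sketch of this train/validation/test protocol, with the tunable parameter ("λ") being k in k-nearest-neighbor; the data, split sizes, and helper names are illustrative, not from the lecture.

    # Select k on a validation set, then report error on a held-out test set.
    import numpy as np

    def knn_error(k, X_tr, y_tr, X_ev, y_ev):
        """Fraction of evaluation points misclassified by k-NN fit on (X_tr, y_tr)."""
        mistakes = 0
        for x, y in zip(X_ev, y_ev):
            nearest = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
            vote = np.sign(y_tr[nearest].sum())  # majority vote over labels in {-1, +1}
            mistakes += (vote != y)
        return mistakes / len(y_ev)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=300))  # noisy linear concept

    # A = train, B = validation, C = test.
    X_A, y_A = X[:150], y[:150]
    X_B, y_B = X[150:225], y[150:225]
    X_C, y_C = X[225:], y[225:]

    # Choose k ("lambda") by minimizing validation error on B...
    best_k = min([1, 3, 5, 9], key=lambda k: knn_error(k, X_A, y_A, X_B, y_B))
    # ...then report the chosen hypothesis's error on the untouched test set C.
    print("chosen k:", best_k, "test error:", knn_error(best_k, X_A, y_A, X_C, y_C))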
Perceptrons, Logistic Regression and Neural Networks

Perceptron Fails for XOR
[Figure: XOR data at (±1, ±1) in the (x1, x2) plane, labeled −, +, +, −]
• Linear decision boundaries cannot classify XOR
• Strong restriction bias!
[Figure: one layer (Duda)]

Neural Networks
• A complete hidden layer can represent any continuous function from the inputs to the outputs with a finite number of hidden units
• Sigmoid units. Backpropagation to train!
[Figure: sample training patterns and the learned input-to-hidden weights (Duda)]
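A minimal sketch (not the lecture's code) of a one-hidden-layer sigmoid network trained by backpropagation on XOR, the problem the perceptron above cannot solve; the 4 hidden units, learning rate, and step count are arbitrary illustrative choices.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR labels

    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
    lr = 1.0

    for _ in range(5000):
        h = sigmoid(X @ W1 + b1)             # forward pass
        out = sigmoid(h @ W2 + b2)
        d_out = (out - y) * out * (1 - out)  # backprop of squared error through the sigmoids
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out               # gradient-descent updates
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h
        b1 -= lr * d_h.sum(axis=0)

    print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]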
DTs vs NNs
• Decision Trees
  – discrete attributes (can be made continuous)
  – few attributes determine the decision on any path
  – fast learning
  – models easy to interpret
  – fixed features
• Neural Networks
  – continuous attributes
  – all attributes contribute
  – slow learning
  – not easy to interpret
  – finds new non-linear features

Max-Margin Approaches
• Boosting
• Support Vector Machines
• Underlying theory: max-margin classifiers promote simple hypothesis classes with low VC dimension ⇒ avoid over-fitting

Boosting
• Run a series of rounds k = 1, …, K, each with a re-weighted training set D_k
• Learn a different hypothesis h_k in each round, and associate h_k with a weight α_k
• Upon termination, use the weighted majority of {(h_k, α_k)} as the classifier
[Figure: cumulative distribution of the margins of the training data after 5, 100, and 1000 rounds, with test and baseline curves]

Support Vector Machines
• Find the max-margin decision boundary in a higher-dimensional space R^F
• "Kernel trick":
  – K(x, x′) = φ(x)·φ(x′), where φ: R^M → R^F; compute directly in the R^M space
  – the algorithm accesses R^F only through K(·,·)

Naïve Bayes
• Model: class Y is the parent of features X1, X2, …, Xm
• arg max_y P(y|x) = arg max_y P(x|y)P(y)
• Simple. Fast training/evaluation. Usefulness depends on the inductive bias.

Example: Text Categorization
• Categorize documents in the Reuters corpus: 90 topics (crude, trade, corn, …); 9603 training / 3299 test documents
• Pre-process the data: stem words (puppies -> puppy); drop stop-words (and, or, etc.)
• Each document is a feature vector in a 10,000+ dimensional feature space (lots of features are relevant)
• SVMs: polynomial and RBF kernels; Naïve Bayes; k-nearest neighbor; DTs (ID3 + validation-set pruning)
Joachims '98

Precision-Recall Curve
• Precision = TP / (TP + FP)
  – fraction of the documents classified as relevant that actually are
• Recall = TP / (TP + FN)
  – fraction of the relevant documents that are retrieved
• (a short sketch of both metrics follows below)
[Figure: precision-recall curve with the P = R "breakeven point"; ROC curve]
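A short sketch of the two metrics, assuming binary labels and predictions in {0, 1} with 1 meaning "relevant"; the function name and toy arrays are mine.

    import numpy as np

    def precision_recall(y_true, y_pred):
        tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
        fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
        fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
        precision = tp / (tp + fp)  # fraction of retrieved documents that are relevant
        recall = tp / (tp + fn)     # fraction of relevant documents that are retrieved
        return precision, recall

    y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
    y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])
    print(precision_recall(y_true, y_pred))  # (0.75, 0.75)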
Unsupervised Learning
• x = (x1, …, xm) ∈ X
• Data: D = {x^(1), …, x^(n)}
• Possible goals:
  – Dimensionality reduction: project into 2-3 dimensions
  – Clustering: discover groups of similar examples
  – Density estimation: determine the distribution of the data within the input space
• Probabilistic methods are popular

Uses of Probabilistic Methods
• Prediction/Diagnosis:
  – P(D | S3 = T, S5 = F)
• Temporal reasoning:
  – P(O_{T+1} | o1, …, oT); P(X_t | o1, …, oT)
• Decision making with uncertainty:
  – max_a Q(s,a)
• Classification:
  – arg max_y P(Y = y | x1, x3) = arg max_y P(y, x1, x3) = arg max_y Σ_{x2} P(y, x1, x2, x3)
• Learning:
  – e.g., via maximum likelihood (+EM): arg max_θ P(D | θ)

Graphical Models
• Bayesian networks
• Hidden Markov models

Basic Problem: Inference
• The core requirement is to compute P(x1, …, xk) given some observed variables:
  P(x1, …, xk) = Σ_{x_{k+1}} … Σ_{x_m} Π_{j=1}^{m} P(xj | Pa[Xj])
• This is a sum-of-products expression
• Solve using variable elimination
• Tractable in a polytree; intractable in general
  – the cost depends on the size of the maximal intermediate term
  – the elimination order is crucial

Polytrees
• A polytree has no undirected cycles
[Figure: two networks over A, B, C, D, E; one a polytree, one not]

Approximate Inference
• Sample-based approximations
  – MCMC (e.g., Metropolis-Hastings; Gibbs sampling): define a Markov chain that converges to the distribution on the random variables
• Deterministic approximations
  – e.g., variational methods
  – approximate the posterior with a factored distribution

Task 1: Inference
• Forward-backward algorithm
• Compute α and β by one pass forward and one pass back along the HMM
• α(x_t) = P(o1, …, o_t, x_t); β(x_t) = P(o_{t+1}, …, o_T | x_t)
• P(o1, …, o_T, x_t) = α(x_t)β(x_t)
• P(x_t | o1, …, o_T) ∝ α(x_t)β(x_t)

Task 2: Most Likely Explanation
• Given: an observation sequence o1, …, o_T
• Goal: compute the most likely sequence of states, max_x P(x1, …, x_T | o1, …, o_T)
• Viterbi algorithm:
  – compute max_x P(x1, …, x_T, o1, …, o_T)
  – max-product rather than sum-product
  – a single forward/backward pass

Task 3: Learning Parameters
• Given an observation sequence (o1, …, o_T) and a target number of hidden states, we can learn the parameters of the HMM
• Use EM, with forward-backward for the "E" step and then an "M" step via the resulting statistics

Notable Successes
• TD-value learning was implemented by Arthur Samuel in 1959! It used a tiny amount of memory
  – the program learned to play checkers much better than Samuel himself
• Gerry Tesauro used similar ideas to produce a world-champion-level backgammon player (TD-Gammon)
• Helicopter control (Abbeel, Ng et al.): aerobatic helicopter flight
  – learn a transition model by observing a pilot fly (e.g., for 5 minutes)
  – learn a reward model from the expert: "inverse reinforcement learning"
    • a 24-feature reward function, depending on the error state, squared inputs, and change in inputs
  – "program" new tricks by changing the reward function

Bellman Equations
• π*(s) = arg max_a [ R(s,a) + γ Σ_{s′} P(s′|s,a) V*(s′) ]
• V*(s) = max_a [ R(s,a) + γ Σ_{s′} P(s′|s,a) V*(s′) ]
• Value iteration (V → V → V → …) treats the second equation as a fixpoint equation (a sketch follows the next slide)
• Policy iteration (π → V → π → V → π → …) treats the first equation as a fixpoint equation

PI vs VI
• PI typically takes fewer iterations than VI
• But each PI iteration is more costly
  – it must evaluate V(·) for the new policy
  – O(N³) vs. O(MNL)
• Overall: PI is usually faster in practice
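A minimal sketch of value iteration as a fixpoint computation on the second Bellman equation, for a made-up tabular MDP; the arrays R and P and all constants are illustrative, not from the lecture.

    import numpy as np

    n_states, n_actions, gamma = 3, 2, 0.9
    rng = np.random.default_rng(2)
    R = rng.uniform(0, 1, size=(n_states, n_actions))                  # R(s, a)
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s, a)

    V = np.zeros(n_states)
    for _ in range(1000):
        # Bellman backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * P @ V              # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < 1e-8:   # stop at an (approximate) fixpoint
            break
        V = V_new

    policy = Q.argmax(axis=1)              # greedy policy: pi*(s) = argmax_a Q(s,a)
    print("V* ~", np.round(V, 3), " pi* =", policy)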
Reinforcement Learning
• Select an action using some action-selection process
• If it leads to a reward, reinforce taking that action in the future
• If it leads to a punishment, avoid taking that action in the future

Hybrid RL Approaches
• TD value: also learn a model, and choose the action a that maximizes R(s,a) + γ Σ_{s′} P(s′|s,a) V(s′); we need to explore as well
  – TD update: V(s) ← V(s) + α( r + γ V(s′) − V(s) )
• Dyna-Q: also learn a model, and pick random (s,a) pairs for updates
• Both enable faster learning in the world than a Q-learner, but there is a clear tradeoff with space

Scaling RL: Generalization
• What if the state space is very large?
  – e.g., the state consists of a number of variables
  – curse of dimensionality
• We need to generalize from states we have seen to states we haven't seen
• Combine RL with supervised learning

POMDPs
• A set S = {1, …, n} of states
• A set A = {1, …, m} of actions
• A set O = {1, …, l} of observations
• Reward model R(s,a)
• Transition model P(s′|s,a)
• Observation model P(o|s)
• Initial model P_0(s)

Probably Approximately Correct
• Consider a supervised learning problem with attributes X1, …, Xm and class Y.
• A hypothesis space H is PAC-learnable if there exists an algorithm such that
  – for every deterministic domain with true model f ∈ H,
  – for every distribution P over examples, and
  – for every ε, δ > 0,
  the algorithm, with probability > 1 − δ, returns a hypothesis h ∈ H with error_P(h) < ε, in time polynomial in 1/ε, 1/δ and m.

"For every distribution P over examples"
• How can we expect an algorithm to work for every distribution?
• Couldn't a distribution make most of your training sets unlucky?
• Because the same distribution is used for training and for future instances, the training set will be reflective of future instances with high probability

"Must learn in polynomial time"
• This requires
  1. a polynomial number of training examples, and
  2. polynomial run time given a training set
• Neither implies the other
• But usually the number of training examples required is the limiting factor
• How many examples are required to learn a hypothesis space?

Some Results
• |H| = 3^m for conjunctive formulas
  – sample complexity bound: n ≥ (1/ε)( m ln 3 + ln(1/δ) )
  – PAC-learnable (a fast, consistent learner is available)
• |H| = 2^(2^m) for DNF
  – sample complexity exponential in m
  – not known to be PAC-learnable
• (a sketch evaluating these bounds follows the next two slides)

Infinite Hypothesis Spaces
• e.g., neural networks, perceptrons, etc.
• Use the VC dimension; many infinite hypothesis spaces have finite VC dimension
• Upper bound: O( (1/ε)( VC(H) log(1/ε) + log(1/δ) ) ) examples suffice
• Lower bound: Ω( VC(H)/ε ) examples are necessary
• Both bounds are linear in VC(H)!

VC Dimension
• The VC dimension of a hypothesis space H is the size of the largest set shattered by H
• Can H shatter some set of three points? No: the labeling +, −, + cannot be realized
• Therefore VC(H) = 2 (and co-locating points doesn't help!)
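To see why ln|H| drives the sample complexity, here is a small sketch evaluating the finite-|H| bound quoted above, n ≥ (1/ε)(ln|H| + ln(1/δ)), for conjunctions versus DNF; the function name and the choices of m, ε, δ are mine.

    import math

    def sample_bound(ln_H, eps, delta):
        """Examples sufficient so that, w.p. >= 1 - delta, a consistent h has error < eps."""
        return math.ceil((1.0 / eps) * (ln_H + math.log(1.0 / delta)))

    m = 20  # number of boolean attributes
    # Conjunctive formulas: |H| = 3^m, so ln|H| = m ln 3 -- polynomial in m.
    print("conjunctions:", sample_bound(m * math.log(3), eps=0.05, delta=0.01))
    # DNF: |H| = 2^(2^m), so ln|H| = 2^m ln 2 -- exponential in m.
    print("DNF:", sample_bound(2**m * math.log(2), eps=0.05, delta=0.01))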
Extended Applications

Internet-Scale Technology
• Mahout
  – built on Apache Hadoop (open-source map-reduce)
    • many ML algorithms take a single pass over the data; others, e.g. EM, can be run as a chain of map-reduce passes
    • very parallelizable
  – scalable ML libraries (e.g., kNN)
  – they tell you what they're doing
• Google Prediction API / BigQuery
  – no visibility into which algorithms are used
  – no visibility into how one is chosen, or what the reported "accuracy" means
  – the billing model is unclear for the future

Key Idea: Train to Agree
• Simultaneous training of alignments in both directions
• P(x,z; θ1) = p(e)p(z,f | e; θ1)
• P(x,z; θ2) = p(f)p(z,e | f; θ2)
• x = (e,f); z = alignment
• Objective: max L(x; θ1) + L(x; θ2) + (likelihood that the alignments agree on x)
Liang, Taskar and Klein (2006)

Watson Q&A
• 20 researchers + software engineers; 4+ years
• Scaled to a 3-second response time via Apache Hadoop and 2500+ processors
• Needs good precision and recall
• An ensemble method is important
• http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf
• http://www.youtube.com/watch?v=lI-M7O_bRNg
Ferrucci et al. '10

MySong
• HMMs to generate chords to harmonize music
• http://research.microsoft.com/en-us/um/people/dan/mysong/
Simon et al. '08

Learning Structure of Hierarchies
• Fully Bayesian approach to learning the structure of graphical models
• Potentially unbounded number of hidden variables (Adams et al. '10)
• Applied to structure identification in images
• http://www.cs.toronto.edu/~rpa/papers/adams-wallach-ghahramani-2010a.pdf
[Figure (Adams et al. '10)]
Efficient Training for BMs
• General Boltzmann Machines
• Use a (mean-field) variational approximation to estimate the data-dependent statistics
• Use MCMC for the unconditional statistics
• Application to object recognition (7.1% error on NORB; LeCun '04)
  – 25 test / 25 training objects; 3D images with stereoscopic views; 5 classes (car, truck, plane, animal, human)
  – 96×96 grayscale; a 3-layer, 4000-node DBN with 68 million parameters
(Salakhutdinov et al. '11)
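A minimal sketch (my illustration under my own assumptions, not the paper's code) of the mean-field fixed-point update such a scheme uses for the data-dependent statistics: each hidden unit's marginal is repeatedly set to μ_i ← σ(Σ_j W_ij μ_j + b_i) with the visible units clamped to the data.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mean_field(W, b, visible, n_hidden, n_iters=50):
        """Approximate marginals mu over hidden units, visible units clamped.

        W: symmetric weight matrix over all units (visible first, hidden last)
        b: biases over all units
        visible: observed values in {0, 1} for the visible units
        """
        n_visible = len(visible)
        mu = np.concatenate([visible.astype(float), 0.5 * np.ones(n_hidden)])
        for _ in range(n_iters):
            mu_new = sigmoid(W @ mu + b)
            mu[n_visible:] = mu_new[n_visible:]  # only hidden units are updated
        return mu[n_visible:]

    rng = np.random.default_rng(3)
    n_v, n_h = 4, 3
    W = rng.normal(scale=0.1, size=(n_v + n_h, n_v + n_h))
    W = (W + W.T) / 2            # Boltzmann machine weights are symmetric
    np.fill_diagonal(W, 0.0)     # no self-connections
    b = rng.normal(scale=0.1, size=n_v + n_h)
    print(mean_field(W, b, visible=np.array([1, 0, 1, 1]), n_hidden=n_h))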