Machine Learning: Unsupervised Learning, Statistical Perspective, and Probability Models, Lecture notes of Probability and Statistics

A set of lecture notes from Tony Jebara's course on Machine Learning at Columbia University. The notes cover topics such as unsupervised learning, statistical perspective, probability models, and linear classification. The notes also include examples and explanations of concepts such as maximum likelihood, conditioning, marginalizing, Bayes rule, expectations, dependence/independence, and least squares. The document could be useful as study notes or lecture notes for a university student studying machine learning.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023

shekhar_hin
Tony Jebara, Columbia University
Machine Learning 4771
Instructor: Tony Jebara

Topic 7
• Unsupervised Learning
• Statistical Perspective
• Probability Models
• Discrete & Continuous: Gaussian, Bernoulli, Multinomial
• Maximum Likelihood → Logistic Regression
• Conditioning, Marginalizing, Bayes Rule, Expectations
• Classification, Regression, Detection
• Dependence/Independence
• Maximum Likelihood → Naïve Bayes

Statistical Perspective
• Example of Projectile Cannon (45 degree problem)
  x = input target distance
  y = output cannon angle
  $$x = \frac{v_0^2}{g}\,\sin(2y) + \text{noise}$$
  [Figure: output angle plotted against input distance]
• What does least squares do?
• Conditional statistical models address this problem…

Probability Models
• Instead of deterministic functions, output is a probability
• Previously: our output was a scalar, $\hat{y} = f(x) = \theta^T x + b$
• Now: our output is a probability $p(y)$, e.g. a probability bump
• $p(y)$ subsumes or is a superset of $\hat{y}$
• Why is this representation for our answer more general?
  → A deterministic answer $\hat{y}$ with complete confidence is like putting a probability $p(y)$ where all the mass is at $\hat{y}$!
  $$\hat{y} \;\Leftrightarrow\; p(y) = \delta(y - \hat{y})$$

Probability Models
• Can extend probability model to 2 bumps:
• Each mean can be a linear regression fn.
  $$p(y \mid \Theta) = \tfrac{1}{2}\,N(y \mid \mu_1) + \tfrac{1}{2}\,N(y \mid \mu_2)$$
  $$p(y \mid x, \Theta) = \tfrac{1}{2}\,N(y \mid f_1(x)) + \tfrac{1}{2}\,N(y \mid f_2(x)) = \tfrac{1}{2}\,N(y \mid \theta_1^T x + b_1) + \tfrac{1}{2}\,N(y \mid \theta_2^T x + b_2)$$
• Therefore the (conditional) log-likelihood to maximize is:
  $$l(\Theta) = \sum_{i=1}^{N} \log\left( \tfrac{1}{2}\,N(y_i \mid \theta_1^T x_i + b_1) + \tfrac{1}{2}\,N(y_i \mid \theta_2^T x_i + b_2) \right)$$
• Maximize $l(\Theta)$ using gradient ascent
• Nicely handles the “cannon firing” data

Probability Models
• Now classification: can also go beyond deterministic!
• Previously: wanted output to be binary
• Now: our output is a probability $p(y)$, e.g. a probability table:

    p(y):   y=0    y=1
            0.73   0.27

• This subsumes or is a superset again…
• Consider probability over binary events (coin flips!): e.g. Bernoulli distribution (i.e. 1x2 probability table) with parameter α
  $$p(y \mid \alpha) = \alpha^{y}\,(1-\alpha)^{1-y}, \qquad \alpha \in [0,1]$$
• Linear classification can be done by setting α equal to f(x):
  $$p(y \mid x) = f(x)^{y}\,(1 - f(x))^{1-y}, \qquad f(x) \in [0,1]$$

Probability Models
• Now linear classification is:
• Log-likelihood is (negative of cost function):
• But, need a squashing function since f(x) must lie in [0,1]
• Use sigmoid or logistic again…
• Called logistic regression → new loss function
• Do gradient descent, similar to logistic output neural net!
• Can also handle multi-layer f(x) and do backprop again!
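The logistic regression recipe in the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's implementation: the 1-D synthetic data, the learning rate, the step count, and the names `sigmoid`, `log_likelihood`, and `fit` are all hypothetical. It runs gradient ascent on the log-likelihood, which is equivalent to gradient descent on its negative (the cost function).

```python
import math
import random


def sigmoid(z):
    """Logistic squashing function; maps any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(theta, b, xs, ys):
    """Sum over i of y_i*log f(x_i) + (1 - y_i)*log(1 - f(x_i))."""
    eps = 1e-12  # guards against log(0); an implementation convenience
    total = 0.0
    for x, y in zip(xs, ys):
        f = sigmoid(theta * x + b)
        total += y * math.log(f + eps) + (1 - y) * math.log(1.0 - f + eps)
    return total


def fit(xs, ys, lr=0.1, steps=500):
    """Gradient ascent on the mean log-likelihood (lr and steps are assumed)."""
    theta, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # The gradient of the Bernoulli log-likelihood with a sigmoid output
        # has the simple form (y - f(x)) * x for theta and (y - f(x)) for b.
        residuals = [y - sigmoid(theta * x + b) for x, y in zip(xs, ys)]
        g_theta = sum(r * x for r, x in zip(residuals, xs)) / n
        g_b = sum(residuals) / n
        theta += lr * g_theta
        b += lr * g_b
    return theta, b


# Hypothetical data: class 0 centered at -1, class 1 centered at +1.
random.seed(0)
xs = [random.gauss(-1.0, 1.0) for _ in range(100)]
xs += [random.gauss(1.0, 1.0) for _ in range(100)]
ys = [0] * 100 + [1] * 100

ll_before = log_likelihood(0.0, 0.0, xs, ys)
theta, b = fit(xs, ys)
ll_after = log_likelihood(theta, b, xs, ys)
print("log-likelihood before:", ll_before, "after:", ll_after)
print("theta:", theta, "b:", b)
```

Because the log-likelihood is concave in (θ, b) for this model, a small enough step size should raise it on every step; a multi-layer f(x) would use the same objective with backprop supplying the gradients.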
Linear classification:
  $$p(y \mid x) = f(x)^{y}\,(1 - f(x))^{1-y}, \qquad f(x) \equiv \alpha \in [0,1]$$
Log-likelihood:
  $$\sum_{i=1}^{N} \log p(y_i \mid x_i) = \sum_{i=1}^{N} \log\left[ f(x_i)^{y_i}\,(1 - f(x_i))^{1-y_i} \right]$$
  $$= \sum_{i=1}^{N} y_i \log f(x_i) + (1 - y_i)\log(1 - f(x_i))$$
  $$= \sum_{i \in \text{class1}} \log f(x_i) + \sum_{i \in \text{class0}} \log(1 - f(x_i))$$
Squashing function:
  $$f(x) = \text{sigmoid}(\theta^T x + b) \in [0,1]$$

Properties of PDFs
• Marginalizing: integrate/sum out a variable leaves a marginal distribution over the remaining ones…
  $$\sum_{y} p(x,y) = p(x)$$
• Conditioning: if a variable ‘y’ is ‘given’ we get a conditional distribution over the remaining ones…
  $$p(x \mid y) = \frac{p(x,y)}{p(y)}$$
• Bayes Rule: mathematically just redo conditioning but has a deeper meaning (1764)… if we have X being data and θ being a model
  $$\underbrace{p(\theta \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}}$$

Properties of PDFs
• Expectation: can use pdf p(x) to compute averages and expected values for quantities, denoted by:
• Properties:
• Mean: expected value for x
  $$E_{p(x)}\{x\} = \int_{-\infty}^{\infty} p(x)\,x\,dx$$
• Variance: expected value of (x − mean)², how much x varies
• Example: speeding ticket. Expected cost of speeding?

    Fine = $0    Fine = $20
    0.8          0.2
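The speeding-ticket question above can be worked out directly from the table: weight each fine by its probability and add. The short sketch below does that, and also computes the variance as the expected squared deviation from the mean. The variable names are illustrative only.

```python
# Probability table from the example: no ticket (x=0) or ticket (x=1).
p = {0: 0.8, 1: 0.2}
# Cost attached to each outcome, in dollars.
fine = {0: 0.0, 1: 20.0}

# Expected cost: sum over outcomes of p(x) * f(x).
expected_fine = sum(p[x] * fine[x] for x in p)

# Variance of the cost: expected value of (f(x) - mean)^2.
variance = sum(p[x] * (fine[x] - expected_fine) ** 2 for x in p)

print("expected fine:", expected_fine)
print("variance of fine:", variance)
```

By hand this is 0.8 · 0 + 0.2 · 20 = 4 dollars for the expected fine, and 0.8 · (0 − 4)² + 0.2 · (20 − 4)² = 64 for the variance.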
    f(x=0) = 0, f(x=1) = 20
    p(x=0) = 0.8, p(x=1) = 0.2

Expectation of a function f(x):
  $$E_{p(x)}\{f(x)\} = \int_{x} p(x)\,f(x)\,dx \quad \text{or} \quad = \sum_{x} p(x)\,f(x)$$
Variance:
  $$\text{Var}\{x\} = E\{(x - E\{x\})^2\} = E\{x^2 - 2xE\{x\} + E\{x\}^2\}$$
  $$= E\{x^2\} - 2E\{x\}E\{x\} + E\{x\}^2 = E\{x^2\} - E\{x\}^2$$
Properties:
  $$E\{c\,f(x)\} = c\,E\{f(x)\}$$
  $$E\{f(x) + c\} = E\{f(x)\} + c$$
  $$E\{E\{f(x)\}\} = E\{f(x)\}$$

Properties of PDFs
• Covariance: how strongly x and y vary together
  $$\text{Cov}\{x,y\} = E\{(x - E\{x\})(y - E\{y\})\} = E\{xy\} - E\{x\}E\{y\}$$
• Conditional Expectation:
  $$E\{y \mid x\} = \int_{y} p(y \mid x)\,y\,dy$$
  $$E\{E\{y \mid x\}\} = \int_{x} p(x) \int_{y} p(y \mid x)\,y\,dy\,dx = E\{y\}$$
• Sample Expectation: If we don’t have pdf p(x,y) can approximate expectations using samples of data
  $$E_{p(x)}\{f(x)\} \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)$$
• Sample Mean:
  $$E\{x\} \approx \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
• Sample Var:
  $$E\{(x - E\{x\})^2\} \approx \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
• Sample Cov:
  $$E\{(x - E\{x\})(y - E\{y\})\} \approx \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

The IID Assumption
• Bayes rule says likelihood is probability of data given model
  $$\underbrace{p(\theta \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{evidence}}}$$
• The likelihood of $X = \{x_1, \ldots, x_N\}$ under IID assumptions is:
  $$p(X \mid \Theta) = p(x_1, \ldots, x_N \mid \Theta) = \prod_{i=1}^{N} p_i(x_i \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$$
• Learn joint distribution $p(x \mid \Theta)$ by maximum likelihood:
  $$\Theta^* = \arg\max_{\Theta} \prod_{i=1}^{N} p(x_i \mid \Theta) = \arg\max_{\Theta} \sum_{i=1}^{N} \log p(x_i \mid \Theta)$$
• Learn conditional $p(y \mid x, \Theta)$ by max conditional likelihood:
  $$\Theta^* = \arg\max_{\Theta} \prod_{i=1}^{N} p(y_i \mid x_i, \Theta) = \arg\max_{\Theta} \sum_{i=1}^{N} \log p(y_i \mid x_i, \Theta)$$

Uses of PDFs
• Classification: have p(x,y) and given x. Asked for discrete y output, give most probable one
• Regression: have p(x,y) and given x. Asked for a scalar y output, give most probable or expected one
• Anomaly Detection: if have p(x,y) and given both x,y.
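  Asked if it is similar → threshold

Classification:
  $$p(x,y) \to p(y \mid x) \to \hat{y} = \arg\max_{m}\, p(y = m \mid x)$$
Regression:
  $$\{(x_i, y_i)\} \to p(x,y) \to p(y \mid x), \qquad \hat{y} = \begin{cases} \arg\max_{y}\, p(y \mid x) \\ E_{p(y \mid x)}\{y\} \end{cases}$$
Anomaly detection:
  $$p(x,y) \ge \text{threshold} \to \{\text{normal}, \text{anomaly}\}$$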
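The three uses above can be illustrated with a small discrete joint table. The sketch below is only an illustration under stated assumptions: the table values, the threshold of 0.08, and the function names `conditional`, `classify`, `regress`, and `is_anomaly` are all hypothetical, and a real model would typically learn p(x,y) from data rather than hard-code it.

```python
# Hypothetical joint table p(x, y) with x in {0, 1} and y in {0, 1, 2}.
joint = {
    (0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05,
    (1, 0): 0.05, (1, 1): 0.15, (1, 2): 0.35,
}


def conditional(joint, x):
    """Condition on x: p(y | x) = p(x, y) / p(x), with p(x) from marginalizing."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


def classify(joint, x):
    """Classification: return the most probable discrete y given x."""
    cond = conditional(joint, x)
    return max(cond, key=cond.get)


def regress(joint, x):
    """Regression: return the expected value of y under p(y | x)."""
    cond = conditional(joint, x)
    return sum(y * p for y, p in cond.items())


def is_anomaly(joint, x, y, threshold=0.08):
    """Anomaly detection: flag (x, y) when p(x, y) falls below the threshold."""
    return joint.get((x, y), 0.0) < threshold


print("classify(x=1):", classify(joint, 1))
print("regress(x=1):", regress(joint, 1))
print("is_anomaly(1, 0):", is_anomaly(joint, 1, 0))
print("is_anomaly(0, 0):", is_anomaly(joint, 0, 0))
```

For x = 1 the conditional puts most of its mass on y = 2, so classification returns 2, while the expected value 0.85 / 0.55 ≈ 1.55 falls between the discrete outcomes; the pair (1, 0) has joint probability 0.05, below the assumed threshold, so it is flagged as an anomaly.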