Machine Learning 4771
Instructor: Tony Jebara, Columbia University

Topic 7
• Unsupervised Learning
• Statistical Perspective
• Probability Models
• Discrete & Continuous: Gaussian, Bernoulli, Multinomial
• Maximum Likelihood Logistic Regression
• Conditioning, Marginalizing, Bayes Rule, Expectations
• Classification, Regression, Detection
• Dependence/Independence
• Maximum Likelihood Naïve Bayes

Statistical Perspective
• Example: projectile cannon (the 45-degree problem)
  x = input target distance, y = output cannon angle
  x = (v_0^2 / g) sin(2y) + noise
• What does least squares do? (For a given distance there are two valid firing angles, one below and one above 45 degrees; least squares averages them and fires at neither.)
• Conditional statistical models address this problem.
[Figure: cannon-firing data, input distance (horizontal axis) vs. output angle (vertical axis)]

Probability Models
• Instead of a deterministic function, the output is a probability.
• Previously: the output was a scalar, ŷ = f(x) = θ^T x + b
• Now: the output is a probability p(y), e.g. a probability bump.
• p(y) subsumes, i.e. is a superset of, the scalar answer ŷ.
• Why is this representation of our answer more general?
  A deterministic answer with complete confidence is like putting a probability
  where all the mass is at ŷ:
  ŷ ⇔ p(y) = δ(y − ŷ)

Probability Models
• Can extend the probability model to 2 bumps:
  p(y | Θ) = ½ N(y | μ_1) + ½ N(y | μ_2)
• Each mean can be a linear regression function:
  p(y | x, Θ) = ½ N(y | f_1(x)) + ½ N(y | f_2(x))
              = ½ N(y | θ_1^T x + b_1) + ½ N(y | θ_2^T x + b_2)
• Therefore the (conditional) log-likelihood to maximize is:
  l(Θ) = Σ_{i=1}^N log( ½ N(y_i | θ_1^T x_i + b_1) + ½ N(y_i | θ_2^T x_i + b_2) )
• Maximize l(Θ) using gradient ascent.
• Nicely handles the "cannon firing" data.

Probability Models
• Now classification: we can also go beyond deterministic outputs!
• Previously: we wanted the output to be binary.
• Now: the output is a probability, e.g. a probability table:
  p(y):  y=0 → 0.73,  y=1 → 0.27
• This subsumes, i.e. is a superset of, the binary answer again.
• Consider a probability over binary events (coin flips!), e.g. the Bernoulli
  distribution (a 1×2 probability table) with parameter α:
  p(y | α) = α^y (1 − α)^{1−y},  α ∈ [0,1]
• Linear classification can be done by setting α equal to f(x):
  p(y | x) = f(x)^y (1 − f(x))^{1−y},  f(x) ∈ [0,1]

Probability Models
• Now linear classification is:
  p(y | x) = f(x)^y (1 − f(x))^{1−y},  f(x) ≡ α ∈ [0,1]
• The log-likelihood (the negative of the cost function) is:
  Σ_{i=1}^N log p(y_i | x_i) = Σ_{i=1}^N log[ f(x_i)^{y_i} (1 − f(x_i))^{1−y_i} ]
                             = Σ_{i=1}^N [ y_i log f(x_i) + (1 − y_i) log(1 − f(x_i)) ]
                             = Σ_{i ∈ class1} log f(x_i) + Σ_{i ∈ class0} log(1 − f(x_i))
• But we need a squashing function so that f(x) lies in [0,1]; use the sigmoid
  (logistic) function again:
  f(x) = sigmoid(θ^T x + b) ∈ [0,1]
• This is called logistic regression: a new loss function.
• Do gradient descent on the cost (equivalently, gradient ascent on the
  log-likelihood), similar to a neural network with a logistic output unit!
• Can also handle a multi-layer f(x) and do backpropagation again!
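As a concrete illustration of the slide above, here is a minimal sketch (not from the lecture) of maximum-likelihood logistic regression trained by gradient ascent on Σ_i y_i log f(x_i) + (1 − y_i) log(1 − f(x_i)) with f(x) = sigmoid(θ^T x + b). It assumes NumPy; the toy data and names such as `fit_logistic` are made up for illustration.

```python
# Minimal sketch: maximum-likelihood logistic regression by gradient ascent.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, b, X, y):
    # l(theta) = sum_i y_i log f(x_i) + (1 - y_i) log(1 - f(x_i))
    f = sigmoid(X @ theta + b)
    eps = 1e-12                              # avoid log(0)
    return np.sum(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

def fit_logistic(X, y, lr=0.1, n_steps=500):
    # Gradient ascent: d l / d theta = sum_i (y_i - f(x_i)) x_i,
    #                  d l / d b     = sum_i (y_i - f(x_i)).
    N, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        f = sigmoid(X @ theta + b)
        theta += lr * X.T @ (y - f) / N
        b += lr * np.sum(y - f) / N
    return theta, b

# Toy data: two Gaussian blobs labelled 0 and 1 (purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

theta, b = fit_logistic(X, y)
print("log-likelihood:", log_likelihood(theta, b, X, y))
print("p(y=1 | x=[2,2]):", sigmoid(np.array([2.0, 2.0]) @ theta + b))
```

The 1/N scaling of the gradient only rescales the step size; ascending the unnormalized sum works just as well with a smaller learning rate.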
Properties of PDFs
• Marginalizing: integrating/summing out a variable leaves a marginal
  distribution over the remaining ones:
  Σ_y p(x, y) = p(x)
• Conditioning: if a variable y is 'given', we get a conditional distribution
  over the remaining ones:
  p(x | y) = p(x, y) / p(y)
• Bayes Rule: mathematically just conditioning again, but with a deeper
  meaning (1764). If X is the data and θ is the model:
  p(θ | X) = p(X | θ) p(θ) / p(X)
  (posterior = likelihood × prior / evidence)

Properties of PDFs
• Expectation: use the pdf p(x) to compute averages and expected values of
  quantities, denoted by:
  E_{p(x)}{f(x)} = ∫ p(x) f(x) dx   or   Σ_x p(x) f(x)
• Example: speeding ticket. The fine is $0 with probability 0.8 and $20 with
  probability 0.2, i.e. f(x=0)=0, f(x=1)=20, p(x=0)=0.8, p(x=1)=0.2, so the
  expected cost of speeding is 0.8·0 + 0.2·20 = $4.
• Mean: the expected value of x,  E_{p(x)}{x} = ∫ p(x) x dx
• Variance: the expected value of (x − mean)^2, i.e. how much x varies:
  Var{x} = E{(x − E{x})^2} = E{x^2 − 2x E{x} + E{x}^2}
         = E{x^2} − 2E{x}E{x} + E{x}^2 = E{x^2} − E{x}^2
• Properties:
  E{c f(x)} = c E{f(x)},   E{f(x) + c} = E{f(x)} + c,   E{E{f(x)}} = E{f(x)}

Properties of PDFs
• Covariance: how strongly x and y vary together:
  Cov{x, y} = E{(x − E{x})(y − E{y})} = E{xy} − E{x}E{y}
• Conditional Expectation:
  E{y | x} = ∫ p(y | x) y dy,   E{E{y | x}} = ∫∫ p(x) p(y | x) y dy dx = E{y}
• Sample Expectation: if we don't have the pdf p(x, y), we can approximate
  expectations using samples of data:
  E_{p(x)}{f(x)} ≈ (1/N) Σ_{i=1}^N f(x_i)
• Sample Mean:  E{x} ≈ x̄ = (1/N) Σ_{i=1}^N x_i
• Sample Var:   E{(x − E{x})^2} ≈ (1/N) Σ_{i=1}^N (x_i − x̄)^2
• Sample Cov:   E{(x − E{x})(y − E{y})} ≈ (1/N) Σ_{i=1}^N (x_i − x̄)(y_i − ȳ)

The IID Assumption
• Bayes rule says the likelihood is the probability of the data given the model:
  p(θ | X) = p(X | θ) p(θ) / p(X)   (posterior = likelihood × prior / evidence)
• The likelihood of X = {x_1, ..., x_N} under IID assumptions is:
  p(X | Θ) = p(x_1, ..., x_N | Θ) = Π_{i=1}^N p_i(x_i | Θ) = Π_{i=1}^N p(x_i | Θ)
• Learn the joint distribution p(x | Θ) by maximum likelihood:
  Θ* = arg max_Θ Π_{i=1}^N p(x_i | Θ) = arg max_Θ Σ_{i=1}^N log p(x_i | Θ)
• Learn the conditional p(y | x, Θ) by maximum conditional likelihood:
  Θ* = arg max_Θ Π_{i=1}^N p(y_i | x_i, Θ) = arg max_Θ Σ_{i=1}^N log p(y_i | x_i, Θ)

Uses of PDFs
• Classification: we have p(x, y) and are given x. Asked for a discrete y
  output, give the most probable one:
  p(x, y) → p(y | x) → ŷ = arg max_m p(y = m | x)
• Regression: we have p(x, y) and are given x. Asked for a scalar y output,
  give the most probable or the expected one:
  {(x_i, y_i)} → p(x, y) → p(y | x),   ŷ = arg max_y p(y | x)  or  ŷ = E_{p(y|x)}{y}
• Anomaly Detection: we have p(x, y) and are given both x and y. Asked whether
  the pair is typical, threshold the joint probability:
  p(x, y) ≥ threshold → {normal, anomaly}
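To make these three uses concrete, here is a small sketch (not part of the slides) that fits a joint model p(x, y) = p(y) p(x | y) with class-conditional 1-D Gaussians by maximum likelihood (the sample mean and variance from the earlier slide), then classifies via Bayes rule and flags anomalies by thresholding p(x, y). The toy data, the 1e-3 threshold, and helper names such as `joint` and `classify` are made up for illustration.

```python
# Sketch: fit p(x,y) = p(y) p(x|y) with Gaussians, then classify and detect anomalies.
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Toy 1-D data for two classes y in {0, 1}.
rng = np.random.default_rng(1)
x0 = rng.normal(0.0, 1.0, size=200)   # samples with y = 0
x1 = rng.normal(3.0, 1.0, size=200)   # samples with y = 1

# Maximum-likelihood estimates are the sample expectations from the slides:
# mean = (1/N) sum x_i,  var = (1/N) sum (x_i - mean)^2.
mu0, var0 = x0.mean(), x0.var()
mu1, var1 = x1.mean(), x1.var()
prior1 = len(x1) / (len(x0) + len(x1))      # p(y = 1)
prior0 = 1.0 - prior1

def joint(x, y):
    # p(x, y) = p(y) p(x | y)
    if y == 0:
        return prior0 * gaussian_pdf(x, mu0, var0)
    return prior1 * gaussian_pdf(x, mu1, var1)

def classify(x):
    # Classification: y_hat = argmax_m p(y = m | x); the p(x) denominator cancels.
    return int(joint(x, 1) > joint(x, 0))

def is_anomaly(x, y, threshold=1e-3):
    # Detection: flag (x, y) as anomalous when p(x, y) falls below a threshold.
    return joint(x, y) < threshold

print(classify(0.2), classify(2.8))   # expected: 0 1
print(is_anomaly(10.0, 0))            # far from both classes -> True
```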