Introduction to Bayesian Learning

Machine Learning (CS 5350/CS 6350) 27 Mar 2007

What is Bayesian learning?

1. A formal model of uncertainty

2. A method for expressing prior beliefs

3. A methodology for making inference about data

4. A paradigm for decision making

The central difference on the learning side between Bayesian and non-Bayesian learning (aka frequentist learning or learning theory) is that Bayesians treat parameters as true unknowns, i.e., as random variables.

Let’s take an example from statistics.

Let’s say we have a coin that may be biased. It has probability π ∈ [0, 1] of coming up heads. Suppose we flip it once and it comes up tails. How do we infer π?

Frequentist answer: π = 0, because this is the maximum likelihood solution.

Sort-of frequentist answer: π = 1/3, because I’ll “smooth” and compute π = (# heads + 1)/(total flips + 2).

These answers are derived under the assumption that we want to find the π that maximizes the likelihood of the data, p(D|π). Several people have complained that conditioning on π is weird, and it is! Only random variables should be conditioned on, and in frequentist land, a parameter is definitely not a random variable.

Let’s say that we know π ∈ {0, 0.25, 0.5, 0.75, 1}. Still, the maximum likelihood (ML) solution would give 0.
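
To make the two frequentist estimates concrete, here is a minimal Python sketch. The function names `ml_estimate`, `smoothed_estimate`, and `grid_ml_estimate` are illustrative and not part of the lecture, and the code assumes at least one flip has been observed.

```python
def ml_estimate(heads, tails):
    """Maximum likelihood estimate of pi: the observed fraction of heads."""
    return heads / (heads + tails)


def smoothed_estimate(heads, tails):
    """Add-one smoothed estimate: (# heads + 1) / (total flips + 2)."""
    return (heads + 1) / (heads + tails + 2)


def grid_ml_estimate(heads, tails, grid):
    """ML estimate when pi is restricted to a finite set of candidates."""
    return max(grid, key=lambda pi: pi**heads * (1 - pi)**tails)


grid = [0.0, 0.25, 0.5, 0.75, 1.0]

# One flip that came up tails: heads = 0, tails = 1.
print(ml_estimate(0, 1))
print(smoothed_estimate(0, 1))
print(grid_ml_estimate(0, 1, grid))
```

With a single tail, both the unrestricted and the grid-restricted ML estimates are 0, while the smoothed estimate is 1/3.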

The Bayesian solution is quite different. We don’t actually try to “find” a single value of π, but rather compute a distribution over possible π. This comes from a simple application of Bayes’ rule:

\[
p(\pi \mid D) = \frac{p(\pi)\, p(D \mid \pi)}{p(D)} = \frac{p(\pi)\, p(D \mid \pi)}{\sum_{\pi'} p(\pi')\, p(D \mid \pi')}
\]

Here, p(π) is called the prior, p(D|π) is the likelihood, and p(D) is the marginal (or evidence, or partition function).
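
When π ranges over a finite set, Bayes’ rule reduces to a multiply-and-normalize step. The following is a minimal sketch of that step; the function name `posterior` and the two-hypothesis numbers in the usage line are hypothetical and chosen only for illustration.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of hypotheses.

    prior[i] is p(pi_i) and likelihood[i] is p(D | pi_i).
    Returns the list of p(pi_i | D).
    """
    # Numerator of Bayes' rule for each hypothesis.
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    # The marginal p(D) is the sum of the numerators.
    evidence = sum(unnormalized)
    if evidence == 0:
        raise ValueError("data has zero probability under every hypothesis")
    return [u / evidence for u in unnormalized]


# Hypothetical example: two hypotheses with equal prior weight.
print(posterior([0.5, 0.5], [0.2, 0.6]))
```

The marginal never has to be modeled separately in this setting: it is just the constant that makes the posterior sum to one.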

In our coin flipping example, our likelihood is just π^h (1 − π)^t, where h and t are the counts of heads and tails.
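
As a small sketch, the coin likelihood can be written as a one-line function. The name `coin_likelihood` is an assumption for illustration. Note that this is the probability of one particular sequence of flips with the given counts, so there is no binomial coefficient.

```python
def coin_likelihood(pi, heads, tails):
    """p(D | pi) for a specific flip sequence with the given counts."""
    return pi**heads * (1 - pi)**tails


# Likelihood of a single tail under each candidate value of pi.
for pi in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(pi, coin_likelihood(pi, heads=0, tails=1))
```

For a single tail the likelihood is simply 1 − π, which is largest at π = 0.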

From a Bayesian point of view, the frequentist effectively puts a uniform prior over π and “approximates” the posterior by a point distribution centered at its maximum. (We will soon see how to justify smoothing in a similar manner.)

But this entails two weird approximations: maybe we don’t want a uniform prior, and maybe we don’t want to collapse the posterior to a single point.
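
The claim that maximum likelihood amounts to a uniform prior plus a point approximation can be checked numerically on the five-value grid. This is a sketch under that assumption; the variable names are illustrative.

```python
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
heads, tails = 0, 1  # one flip, tails

# Uniform prior: every candidate value gets the same weight.
uniform_prior = [1 / len(grid)] * len(grid)
likelihood = [pi**heads * (1 - pi)**tails for pi in grid]

# Bayes' rule: multiply, then normalize by the evidence.
unnormalized = [p * l for p, l in zip(uniform_prior, likelihood)]
evidence = sum(unnormalized)
post = [u / evidence for u in unnormalized]

# Point approximation: keep only the most probable value.
map_estimate = grid[post.index(max(post))]
ml_estimate = grid[likelihood.index(max(likelihood))]

print(post)
print(map_estimate, ml_estimate)
```

Under a uniform prior the posterior is proportional to the likelihood, so its maximum sits at the ML solution. The full posterior, however, still spreads weight over the other values of π, and the point approximation discards that information.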

Let’s say that a priori, we believe the five values of π have probability 0.1, 0.2, 0.4, 0.2, 0.1, respectively. This basically means that we expect the coin is likely to not be severely biased. This is a valid prior because its entries are nonnegative and sum to one over the range of π.
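
A quick sanity check of this prior, as a sketch: the validity test and the prior mean below are illustrative additions, and `math.isclose` is used because the decimal weights are not exactly representable in floating point.

```python
import math

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]

# A valid discrete prior is nonnegative and sums to one.
is_valid = all(p >= 0 for p in prior) and math.isclose(sum(prior), 1.0)

# The prior mean of pi: symmetric weights put it at a fair coin.
prior_mean = sum(pi * p for pi, p in zip(grid, prior))

print(is_valid, prior_mean)
```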

Now, let’s revisit the case where we flip once and it comes up tails. This gives us the following unnormalized posterior:
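
As a sketch of that computation, each prior weight is multiplied by the likelihood of a single tail, 1 − π, and the results are then divided by their sum to normalize. The variable names are illustrative.

```python
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]

# One flip, tails: the likelihood of each candidate is (1 - pi).
likelihood = [1 - pi for pi in grid]

# Unnormalized posterior: prior times likelihood.
unnormalized = [p * l for p, l in zip(prior, likelihood)]

# Normalize by the evidence p(D), the sum of the unnormalized values.
evidence = sum(unnormalized)
normalized = [u / evidence for u in unnormalized]

for pi, u, n in zip(grid, unnormalized, normalized):
    print(f"pi={pi:<5} unnormalized={u:.3f} normalized={n:.3f}")
```

Under these assumptions the unnormalized values are 0.1, 0.15, 0.2, 0.05, and 0, which sum to 0.5, so the normalized posterior is 0.2, 0.3, 0.4, 0.1, and 0. Unlike the ML solution, the posterior keeps most of its mass away from π = 0 and only rules out π = 1.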
