Introduction to Bayesian Learning - Lecture Notes | CS 5350, Study notes of Computer Science

Material Type: Notes; Professor: Daume; Class: Machine Learning; Subject: Computer Science; University: University of Utah; Term: Spring 2007;

Machine Learning (CS 5350/CS 6350) 27 Mar 2007

Introduction to Bayesian learning

What is Bayesian learning?

  1. A formal model of uncertainty
  2. A method for expressing prior beliefs
  3. A methodology for making inference about data
  4. A paradigm for decision making

The central difference on the learning side between Bayesian and non-Bayesian learning (aka frequentist learning or learning theory) is that Bayesians treat parameters as true unknowns, i.e., as random variables.

Let’s take an example from statistics.

Let’s say we have a coin that may be biased. It has probability π ∈ [0, 1] of coming up heads. Suppose we flip it once and it comes up tails. How do we infer π?

Frequentist answer: π = 0, because this is the maximum likelihood solution.

Sort-of frequentist answer: π = 1/3, because I’ll “smooth” and compute π = (# heads + 1)/(total flips + 2).

These are derived because we assume that we want to find π which maximizes the likelihood of the data, p(D | π). Several people have complained that conditioning on π is weird and it is! Only random variables should be conditioned on, and in frequentist land, a parameter is definitely not a random variable.

Let’s say that we know π ∈ {0, 0.25, 0.5, 0.75, 1}. Still, the ML solution would give 0.
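The two point estimates above are easy to check numerically. The following is a minimal sketch; the function names `mle`, `add_one`, and `likelihood` are illustrative and not from the notes.

```python
def mle(heads, tails):
    """Maximum likelihood estimate of the heads probability."""
    return heads / (heads + tails)


def add_one(heads, tails):
    """'Add one' smoothed estimate: (# heads + 1) / (total flips + 2)."""
    return (heads + 1) / (heads + tails + 2)


def likelihood(pi, heads, tails):
    """Likelihood of the observed counts for a given heads probability."""
    return pi ** heads * (1 - pi) ** tails


# One flip, which came up tails.
heads, tails = 0, 1
print("ML estimate:      ", mle(heads, tails))
print("smoothed estimate:", add_one(heads, tails))

# Restricting pi to five candidate values does not change the ML answer.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_on_grid = max(grid, key=lambda pi: likelihood(pi, heads, tails))
print("ML on the grid:   ", best_on_grid)
```

With a single tails observation, the likelihood is 1 − π, which is largest at π = 0 whether π ranges over [0, 1] or over the five-value grid.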

The Bayesian solution is quite different. We don’t actually try to “find” a single value of π, but rather compute a distribution over possible π. This comes from a simple application of Bayes rule:

$$
p(\pi \mid D) = \frac{p(\pi)\, p(D \mid \pi)}{p(D)} = \frac{p(\pi)\, p(D \mid \pi)}{\sum_{\pi'} p(\pi')\, p(D \mid \pi')}
$$

Here, p(π) is called the prior, p(D | π) is the likelihood and p(D) is the marginal (or evidence or partition function).

In our coin flipping example, our likelihood is just π^h (1 − π)^t, where h and t are the counts of heads and tails.

If we think about the frequentist perspective, what happens is that they effectively put a uniform prior over π and “approximate” the posterior by a point distribution centered at the maximum. (We will soon see how to justify smoothing in a similar manner.)

But this entails two weird approximations: maybe we don’t want a uniform prior and maybe we don’t want to make this approximation.

Let’s say that a priori, we believe the five values of π have probability 0.1, 0.2, 0.4, 0.2, 0.1, respectively. This basically means that we expect the coin is likely to not be severely biased. This is a valid prior because it sums to one over the range of π.

Now, let’s revisit the case where we flip once and it comes up tails. This gives us the following unnormalized posterior:

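$$
\begin{aligned}
p(\pi = 0 \mid D) &\propto 0.1 \times 0^0 \times 1^1 = 0.1 \\
p(\pi = 0.25 \mid D) &\propto 0.2 \times 0.25^0 \times 0.75^1 = 0.15 \\
p(\pi = 0.5 \mid D) &\propto 0.4 \times 0.5^0 \times 0.5^1 = 0.2 \\
p(\pi = 0.75 \mid D) &\propto 0.2 \times 0.75^0 \times 0.25^1 = 0.05 \\
p(\pi = 1 \mid D) &\propto 0.1 \times 1^0 \times 0^1 = 0
\end{aligned}
$$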

After normalizing, we get: 0.2, 0.3, 0.4, 0.1, 0 as the posterior. Note that the posterior at π = 0.5 hasn’t changed, but the probability of π > 0.5 has significantly decreased (and we know that π = 1 is impossible).
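The calculation above can be reproduced in a few lines. This is a sketch only; the helper name `posterior` and the variable names are assumptions made for illustration.

```python
grid = [0.0, 0.25, 0.5, 0.75, 1.0]   # candidate values of pi
prior = [0.1, 0.2, 0.4, 0.2, 0.1]    # prior probability of each candidate


def posterior(prior, grid, heads, tails):
    """Apply Bayes rule on a discrete grid of candidate pi values."""
    # Unnormalized posterior: prior times likelihood pi^h (1 - pi)^t.
    unnorm = [p * pi ** heads * (1 - pi) ** tails for p, pi in zip(prior, grid)]
    evidence = sum(unnorm)            # p(D), the normalizing constant
    return [u / evidence for u in unnorm]


# One flip, which came up tails.
post1 = posterior(prior, grid, heads=0, tails=1)
print([round(p, 4) for p in post1])
```

The evidence p(D) here is 0.5, so each unnormalized value is simply doubled.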

Suppose we flip again and get another tails. This gives:

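$$
\begin{aligned}
p(\pi = 0 \mid D) &\propto 0.2 \times 0^0 \times 1^1 = 0.2 \\
p(\pi = 0.25 \mid D) &\propto 0.3 \times 0.25^0 \times 0.75^1 = 0.225 \\
p(\pi = 0.5 \mid D) &\propto 0.4 \times 0.5^0 \times 0.5^1 = 0.2 \\
p(\pi = 0.75 \mid D) &\propto 0.1 \times 0.75^0 \times 0.25^1 = 0.025 \\
p(\pi = 1 \mid D) &\propto 0.0 \times 1^0 \times 0^1 = 0
\end{aligned}
$$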

Here we have used a technique known as posterior updating. We take the posterior from the first example and treat it as the prior for the second example. After normalizing (dividing by the sum, 0.65), we get approximately: 0.31, 0.35, 0.31, 0.04, 0. Now, we are more sure that π should be 0.25, but only by a little. We can repeat this process indefinitely. If we observe an infinite number of flips, we will converge to the true value (this is known as consistency).
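Posterior updating one flip at a time gives the same answer as conditioning on all the flips at once, because the likelihood factorizes over flips. The sketch below checks this for the two tails observed so far; `normalize` and `update` are illustrative names.

```python
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
prior = [0.1, 0.2, 0.4, 0.2, 0.1]


def normalize(weights):
    """Scale nonnegative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def update(belief, flip):
    """One step of posterior updating; flip is 'H' or 'T'."""
    like = [pi if flip == "H" else 1 - pi for pi in grid]
    return normalize([b * l for b, l in zip(belief, like)])


# Sequential: yesterday's posterior is today's prior.
sequential = prior
for flip in "TT":
    sequential = update(sequential, flip)

# Batch: original prior times the likelihood of both tails at once.
batch = normalize([p * (1 - pi) ** 2 for p, pi in zip(prior, grid)])

print([round(p, 4) for p in sequential])
print([round(p, 4) for p in batch])
```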

Now, let’s say that we don’t want to confine π to one of five values but want to allow it to range continuously. That is, we need a probability distribution p with domain [0, 1]. One could cook up many such distributions (with a bit of integration to ensure proper normalization). However, there is a standard such distribution known as the beta distribution:

$$
\mathrm{Bet}(\pi \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \pi^{a-1} (1 - \pi)^{b-1}
$$

Here, a and b are parameters of the prior, or hyperparameters of the model. Ignore the fraction term for a second (it serves to normalize the beta). Nine beta distributions are shown below with a ∈ {0, 1, 5} in the columns and b ∈ {0, 1, 5} in the rows:

[Figure not recovered in this copy: 3 × 3 grid of beta densities.]

The next figure shows results after eleven flips with a uniform (a = b = 1) Beta prior. The flips are: THHTTTTHTTT (the stars indicate whether it was tails or heads). You can see that over time, the distribution tends toward π < 0.5 and becomes more and more peaked.

[Figure not recovered in this copy: posterior densities after each flip.]
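Since the plots are not available here, the same sequence can be traced numerically. Multiplying the beta density by the likelihood π^h (1 − π)^t gives something proportional to π^(a+h−1) (1 − π)^(b+t−1), which is again a beta, Bet(π | a + h, b + t). The sketch below applies that update one flip at a time; the function name `beta_update` is an assumption for illustration.

```python
def beta_update(a, b, flips):
    """Track the beta hyperparameters (a, b) after each flip in a string of H/T."""
    history = []
    for flip in flips:
        if flip == "H":
            a += 1      # a head adds one to the first hyperparameter
        else:
            b += 1      # a tail adds one to the second hyperparameter
        history.append((a, b))
    return history


history = beta_update(1, 1, "THHTTTTHTTT")
for flip, (a_n, b_n) in zip("THHTTTTHTTT", history):
    mean = a_n / (a_n + b_n)
    print(f"{flip}: Bet({a_n}, {b_n}), posterior mean {mean:.3f}")
```

The sequence contains 3 heads and 8 tails, so the final posterior is Bet(4, 9), with mean 4/13 (about 0.31), consistent with the drift toward π < 0.5.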

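It’s easy to verify that the mean of Bet(π | a, b) is exactly a/(a + b), while the value of π that maximizes it (for a, b > 1) is (a − 1)/(a + b − 2). This (somewhat) justifies smoothing: after h heads and t tails, a Bet(a, b) prior yields a Bet(a + h, b + t) posterior, whose maximum is at (h + a − 1)/(h + t + a + b − 2). To obtain “add one” smoothing, we pretend that we start with a beta prior with a = b = 2. To get “add λ” smoothing, we set a = b = λ + 1. Then we do “maximum likelihood.” Technically, this is called maximum a posteriori or MAP, since we’re choosing a value that maximizes the posterior, rather than one that maximizes the likelihood. (With the uniform prior a = b = 1, MAP coincides with maximum likelihood, and the “add one” estimate (h + 1)/(h + t + 2) is instead the posterior mean.)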
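As a numerical sanity check on where the beta density peaks, the sketch below evaluates the density exactly as defined earlier (exponents a − 1 and b − 1) on a fine grid and compares the grid maximizer with the closed-form mode (a − 1)/(a + b − 2), valid for a, b > 1. The example posterior Bet(2, 3) is what a Bet(2, 2) prior gives after one tails; the function names are illustrative.

```python
import math


def beta_pdf(pi, a, b):
    """Beta density with exponents a - 1 and b - 1, normalized via log-gamma."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * pi ** (a - 1) * (1 - pi) ** (b - 1)


def beta_mode(a, b):
    """Closed-form maximizer of the beta density, valid for a, b > 1."""
    return (a - 1) / (a + b - 2)


def beta_mean(a, b):
    """Mean of the beta distribution."""
    return a / (a + b)


# Bet(2, 2) prior, then 0 heads and 1 tail.
a_post, b_post = 2 + 0, 2 + 1

grid = [i / 1000 for i in range(1001)]
numeric_mode = max(grid, key=lambda pi: beta_pdf(pi, a_post, b_post))

print("numeric mode:    ", numeric_mode)
print("closed-form mode:", beta_mode(a_post, b_post))
print("posterior mean:  ", beta_mean(a_post, b_post))
```

The mode lands at 1/3, the same value as the “add one” estimate (0 + 1)/(1 + 2) for this data.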

Now, back to the Γ(·) function. This thing appears all the time in normalization terms, and is defined by:

$$
\Gamma(z) = \int_0^\infty dt\; t^{z-1} \exp[-t]
$$

This integral has no closed form solution. However, it can be computed by standard techniques, available in MATLAB and many other languages. It has the nice property that it extends the factorial function to the real line: if n is a positive integer, then Γ(n) = (n − 1)!. Because Γ(·) grows so quickly, we often compute log Γ(·) instead. The functions in MATLAB are gamma and gammaln.
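The notes point to MATLAB; as an illustration in Python, the standard library offers the analogous `math.gamma` and `math.lgamma`. The sketch below checks the factorial property and shows why the log version is preferred for large arguments.

```python
import math

# Gamma(n) = (n - 1)! for positive integers n.
for n in range(1, 8):
    print(n, math.gamma(n), math.factorial(n - 1))

# Gamma overflows double precision quickly, but its logarithm does not.
big = 200.0
try:
    math.gamma(big)
    overflowed = False
except OverflowError:
    overflowed = True
print("gamma(200) overflowed:", overflowed)
print("lgamma(200) =", math.lgamma(big))

# Ratios of gamma functions, as in the beta normalizer, are computed in log space.
a, b = 150.0, 120.0
log_beta_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
print("log of Gamma(a+b) / (Gamma(a) Gamma(b)) =", log_beta_norm)
```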

There are two other distributions you’ll need to know about (other than the standard Normal, Multinomial, Binomial, etc.). These are the gamma and the Dirichlet.

We’ll do Dirichlet first, since it’s basically an extension of the beta. Remember that a multinomial is just like a more complicated binomial. Instead of having a coin with a single parameter π, we have a die with a parameter vector θ_1, ..., θ_K, all positive and summing to one. We would like a prior on this. The Dirichlet is a multivariate version of the beta. It is parameterized by a vector α_1, ..., α_K, all positive but not necessarily summing to one:

$$
\mathrm{Dir}(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}
$$

In the two parameter case, this is exactly a beta distribution. We also get the same posterior updating. If the prior was Dir(θ | α), then after observing x_1 rolls of a 1, x_2 rolls of a 2 and so on, the posterior hyperparameters become (α_1 + x_1, ..., α_K + x_K) = α + x. Again, we can think of smoothing as MAP inference with a Dirichlet prior.
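The Dirichlet update is just elementwise addition of counts to hyperparameters. The sketch below uses made-up die-roll counts (hypothetical data, not from the notes) and also reports the posterior mean, α_k / Σ_j α_j, a standard property of the Dirichlet.

```python
def dirichlet_update(alpha, counts):
    """Posterior hyperparameters: alpha + x, elementwise."""
    return [a + x for a, x in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet: each alpha_k divided by the sum of all alphas."""
    total = sum(alpha)
    return [a / total for a in alpha]


alpha = [1.0] * 6                 # uniform prior over a six-sided die
counts = [3, 0, 2, 1, 0, 4]       # hypothetical counts of rolls of 1..6

alpha_post = dirichlet_update(alpha, counts)
print("posterior hyperparameters:", alpha_post)
print("posterior mean:", [round(m, 3) for m in dirichlet_mean(alpha_post)])
```

Faces that were never rolled still receive nonzero probability under the posterior mean, which is the effect smoothing is meant to achieve.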

Finally, we need a gamma distribution. This is a distribution over positive reals. This will be useful as a prior for the inverse variance of a normal distribution (i.e., p(1/σ^2)), but for now, just think of it as some distribution over (0, ∞):

$$
\mathrm{Gam}(\lambda \mid a, b) = \frac{b^a}{\Gamma(a)}\, \lambda^{-a-1} \exp[-\lambda / b]
$$

Note that several definitions of Gam actually exist: sometimes people use −a instead of a, which puts the b^a in the denominator and replaces λ^(−a−1) with λ^(a−1). That second form, λ^(a−1) exp[−λ/b] / (Γ(a) b^a), is the common shape-scale parameterization. Examples of gamma priors are shown below:

[Figure not recovered in this copy; only axis tick labels survived extraction.]
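In place of the plots, the sketch below evaluates a gamma density numerically. It assumes the alternative form mentioned above, with b^a in the denominator and λ^(a−1) in the numerator (the shape-scale parameterization), and checks with a simple midpoint rule that it integrates to one and has mean a·b. The hyperparameter values are arbitrary choices for illustration.

```python
import math


def gam_pdf(lam, a, b):
    """Shape-scale gamma density: lam^(a-1) exp(-lam/b) / (Gamma(a) b^a), lam > 0."""
    log_p = (a - 1) * math.log(lam) - lam / b - math.lgamma(a) - a * math.log(b)
    return math.exp(log_p)


def midpoint_integrate(f, lo, hi, steps):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


a, b = 2.0, 1.5   # arbitrary illustrative hyperparameters

# The upper limit 100 stands in for infinity; the tail beyond it is negligible here.
total_mass = midpoint_integrate(lambda x: gam_pdf(x, a, b), 0.0, 100.0, 100_000)
mean = midpoint_integrate(lambda x: x * gam_pdf(x, a, b), 0.0, 100.0, 100_000)

print("total mass:", round(total_mass, 6))
print("mean:      ", round(mean, 6), "(a * b =", a * b, ")")
```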

Now, suppose we have a posterior p(θ | D) (here, θ is just an arbitrary symbol for “parameters”). What do we do with it? Well, at the end of the day, sometimes the posterior is of interest in its own right. Often, however, we probably want to make predictions. That is, we may want to predict how many of 100 coin flips will land tails. In general, if there’s a quantity f (θ) that we want to predict, we want to compute:

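$$
E_{\theta \sim p(\cdot \mid D)}\big[ f(\theta) \big] = \int_\Theta d\theta\; p(\theta \mid D)\, f(\theta) = \frac{1}{p(D)} \int_\Theta d\theta\; p(\theta)\, p(D \mid \theta)\, f(\theta)
$$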

(If Θ is a discrete space, replace the integrals by sums.)
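For the discrete case, the expectation is a weighted sum. As a sketch, take the five-value coin from earlier with its posterior after one tails (0.2, 0.3, 0.4, 0.1, 0) and let f(π) = 100 (1 − π), the expected number of tails in 100 further flips for a coin with heads probability π. The function name `posterior_expectation` is illustrative.

```python
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
post = [0.2, 0.3, 0.4, 0.1, 0.0]    # posterior after observing one tails


def posterior_expectation(f, post, grid):
    """Sum over the grid of p(theta | D) * f(theta)."""
    return sum(p * f(theta) for p, theta in zip(post, grid))


def expected_tails_in_100(pi):
    """Expected number of tails in 100 flips given heads probability pi."""
    return 100 * (1 - pi)


expected = posterior_expectation(expected_tails_in_100, post, grid)
print("expected tails in 100 flips:", expected)
```

Note that this averages predictions over every candidate π instead of plugging in a single best guess.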

For instance, take the coin flipping example. Suppose we have a Bet(1, 1) prior, then observe 9 heads and 19 tails. This gives us a Bet(10, 20) posterior. Let’s say for simplicity that we want to know the probability that the next two flips will come up tails. In this case (writing π for θ), we have f (π) = (1 − π)^2. Thus, we want to compute: