
Lecture 15: Learning probabilistic models

Roger Grosse and Nitish Srivastava

1 Overview

In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions. One of the cost functions we discussed was cross-entropy, which encourages the network to learn to predict a probability distribution over the targets. This was our first glimpse into probabilistic modeling. But probabilistic modeling is so important that we’re going to spend almost the whole second half of the course on it. This lecture introduces some of the key principles.

Actually, there aren’t any major new ideas in this lecture. You’ve already seen maximum likelihood estimation in the context of neural probabilistic language models (Coursera Lecture D, in-class Lecture 7, and Assignment 1). Coursera Lecture J introduced the full Bayesian approach and the maximum a-posteriori (MAP) approximation. All we’re doing here is stating these principles in slightly more general terms, and working through lots of examples in order to gain a better intuition. Once you’ve gotten more practice with these techniques, it’s a good idea to go back and revisit those lectures.

This lecture and the next one aren’t about neural nets. Instead, they’ll introduce the principles of probabilistic modeling in as simple a setting as possible. Then, starting next week, we’re going to apply these principles in the context of neural nets, and this will result in some very powerful models.

1.1 Learning goals

  • Know some terminology for probabilistic models: likelihood, prior distribution, posterior distribution, posterior predictive distribution, i.i.d. assumption, sufficient statistics, conjugate prior.
  • Be able to learn the parameters of a probabilistic model using maximum likelihood, the full Bayesian method, and the maximum a-posteriori approximation.
  • Understand how these methods are related to each other. Understand why they tend to agree in the large data regime, but can often make very different predictions in the small data regime.

2 Maximum likelihood

The first method we’ll cover for fitting probabilistic models is maximum likelihood. In addition to being a useful method in its own right, it will also be a stepping stone towards Bayesian modeling. Actually, you’ve already done maximum likelihood learning in the context of the language model from Assignment 1. All we’re doing now is presenting the more general framework.

Let’s begin with a simple example: we have flipped a particular coin 100 times, and it landed heads $N_H = 55$ times and tails $N_T = 45$ times. We want to know the probability that it will come up heads if we flip it again. We formulate the probabilistic model:

The behavior of the coin is summarized with a parameter θ, the probability that a flip lands heads (H). The flips $\mathcal{D} = \{x^{(1)}, \ldots, x^{(100)}\}$ are independent Bernoulli random variables with parameter θ.

(In general, we will use $\mathcal{D}$ as a shorthand for all the observed data.) We say that the individual flips are independent and identically distributed (i.i.d.); they are independent because one outcome does not influence any of the other outcomes, and they are identically distributed because they all follow the same distribution (i.e. a Bernoulli distribution with parameter θ).

We now define the likelihood function L(θ), which is the probability of the observed data, as a function of θ. In the coin example, the likelihood is the probability of the particular sequence of H’s and T’s being generated:

$$L(\theta) = p(\mathcal{D}) = \theta^{N_H} (1 - \theta)^{N_T}.$$

Note that L is a function of the model parameters (in this case, θ), not the observed data. This likelihood function will generally take on extremely small values; for instance, $L(0.5) = 0.5^{100} \approx 7.9 \times 10^{-31}$. Therefore, in practice we almost always work with the log-likelihood function,

$$\ell(\theta) = \log L(\theta) = N_H \log \theta + N_T \log(1 - \theta).$$
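To make the scale difference concrete, here is a minimal sketch that evaluates both functions for the coin example. The function names are illustrative choices, and the second dataset (100 times as many flips) is a made-up one, added only to show why the raw likelihood becomes unusable in floating point.

```python
import math

def likelihood(theta, n_heads, n_tails):
    """L(theta): probability of one particular sequence with these counts."""
    return theta ** n_heads * (1.0 - theta) ** n_tails

def log_likelihood(theta, n_heads, n_tails):
    """l(theta) = N_H log(theta) + N_T log(1 - theta), for 0 < theta < 1."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)

# The coin example from the text: 55 heads and 45 tails.
print(likelihood(0.5, 55, 45))       # tiny, on the order of 1e-31
print(log_likelihood(0.5, 55, 45))   # about -69.31

# Hypothetical larger dataset: the likelihood underflows to 0.0 in double
# precision, while the log-likelihood remains an ordinary number.
print(likelihood(0.5, 5500, 4500))
print(log_likelihood(0.5, 5500, 4500))
```

Once the likelihood has underflowed to zero, every value of θ looks equally bad, so comparisons between parameter values are only possible on the log scale.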

For our coin example, $\ell(0.5) = \log 0.5^{100} = 100 \log 0.5 = -69.31$. This is a much easier value to work with.

In general, we would expect good choices of θ to assign high likelihood to the observed data. This suggests the maximum likelihood criterion: choose the parameter θ which maximizes $\ell(\theta)$. If we’re lucky, we can do this analytically by computing the derivative and setting it to zero. (More precisely, we find critical points by setting the derivative to zero. We check which of the critical points, or boundary points, has the largest value.) Let’s try

Therefore, $\sum_{i=1}^N (x^{(i)} - \mu) = 0$, and solving for μ, we get $\mu = \frac{1}{N} \sum_{i=1}^N x^{(i)}$. The maximum likelihood estimate of the mean of a normal distribution is simply the mean of the observed values, or the empirical mean. Plugging in our temperature data, we get $\hat{\mu}_{\rm ML} = -5.97$.

Example 2. In the last example, we pulled the standard deviation σ = 5 out of a hat. Really we’d like to learn it from data as well. Let’s add it as a parameter to the model. The likelihood function is the same as before, except now it’s a function of both μ and σ, rather than just μ. To maximize a function of two variables, we find critical points by setting the partial derivatives to 0. In this case,

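$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^N \left( x^{(i)} - \mu \right) = 0$$

$$\begin{aligned}
\frac{\partial \ell}{\partial \sigma} &= \frac{\partial}{\partial \sigma} \left[ \sum_{i=1}^N \left( -\frac{1}{2} \log 2\pi - \log \sigma - \frac{1}{2\sigma^2} (x^{(i)} - \mu)^2 \right) \right] \\
&= \sum_{i=1}^N \left( -\frac{\partial}{\partial \sigma} \frac{1}{2} \log 2\pi - \frac{\partial}{\partial \sigma} \log \sigma - \frac{\partial}{\partial \sigma} \frac{1}{2\sigma^2} (x^{(i)} - \mu)^2 \right) \\
&= \sum_{i=1}^N \left( 0 - \frac{1}{\sigma} + \frac{1}{\sigma^3} (x^{(i)} - \mu)^2 \right) \\
&= -\frac{N}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^N (x^{(i)} - \mu)^2 = 0
\end{aligned}$$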

From the first equality, we find that $\hat{\mu}_{\rm ML} = \frac{1}{N} \sum_{i=1}^N x^{(i)}$ is the empirical mean, just as before. From the second equality, we find $\hat{\sigma}_{\rm ML} = \sqrt{\frac{1}{N} \sum_{i=1}^N (x^{(i)} - \mu)^2}$. In other words, $\hat{\sigma}_{\rm ML}$ is simply the empirical standard deviation. In the case of the Toronto temperatures, we get $\hat{\mu}_{\rm ML} = -5.97$ (as before) and $\hat{\sigma}_{\rm ML} = 4.55$.
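The closed-form estimates from Examples 1 and 2 can be sketched in a few lines of NumPy. The temperature readings below are made-up placeholders, not the Toronto data used in these notes, and the function names are illustrative.

```python
import numpy as np

def gaussian_mle(x):
    """Maximum likelihood estimates (mu_hat, sigma_hat) for i.i.d. Gaussian data."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    # Divide by N (not N - 1): this is the ML estimate, not the unbiased one.
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
    return mu_hat, sigma_hat

def gaussian_log_likelihood(x, mu, sigma):
    """Sum over data points of the log of the Gaussian density."""
    x = np.asarray(x, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# Hypothetical daily temperatures in degrees Celsius (placeholder values).
temps = np.array([-2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4])
mu_hat, sigma_hat = gaussian_mle(temps)
print(mu_hat, sigma_hat)
print(gaussian_log_likelihood(temps, mu_hat, sigma_hat))
```

Note that `sigma_hat` matches `np.std(temps)` with its default `ddof=0`; nudging either estimate away from these values can only lower the log-likelihood.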

Example 3. We’ve just seen two examples where we could obtain the exact maximum likelihood solution analytically. Unfortunately, this situation is the exception rather than the rule. Let’s consider how to compute the maximum likelihood estimate of the parameters of the gamma distribution, whose PDF is defined as

$$p(x) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx},$$

where Γ(a) is the gamma function, which is a generalization of the factorial function to continuous values.^1 The model parameters are a and b, both of which must take positive values. The log-likelihood, therefore, is

$$\begin{aligned}
\ell(a, b) &= \sum_{i=1}^N \left[ a \log b - \log \Gamma(a) + (a - 1) \log x^{(i)} - b x^{(i)} \right] \\
&= N a \log b - N \log \Gamma(a) + (a - 1) \sum_{i=1}^N \log x^{(i)} - b \sum_{i=1}^N x^{(i)}.
\end{aligned}$$

^1 The definition is $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x} \, dx$, but we’re never going to use the definition in this class.

Most scientific computing environments provide a function which computes log Γ(a). In SciPy, for instance, it is scipy.special.gammaln. To maximize the log-likelihood, we’re going to use gradient ascent, which is just like gradient descent, except we move uphill rather than downhill. To derive the update rules, we need the partial derivatives:

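$$\frac{\partial \ell}{\partial a} = N \log b - N \frac{d}{da} \log \Gamma(a) + \sum_{i=1}^N \log x^{(i)}$$

$$\frac{\partial \ell}{\partial b} = N \frac{a}{b} - \sum_{i=1}^N x^{(i)}.$$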

Our implementation of gradient ascent, therefore, consists of computing these derivatives, and then updating $a \leftarrow a + \alpha \frac{\partial \ell}{\partial a}$ and $b \leftarrow b + \alpha \frac{\partial \ell}{\partial b}$, where α is the learning rate. Most scientific computing environments provide a function to compute $\frac{d}{da} \log \Gamma(a)$; for instance, it is scipy.special.digamma in SciPy.
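A minimal sketch of this procedure follows. Several details are assumptions made for illustration rather than part of the notes: the data are synthetic (drawn with a = 3, b = 2), the step is divided by N so that the learning rate does not depend on the dataset size, the parameters are clipped to stay positive, and the starting point, learning rate, and step count are arbitrary choices that happen to work for this dataset.

```python
import numpy as np
from scipy.special import digamma, gammaln

def gamma_log_likelihood(a, b, n, sum_x, sum_log_x):
    """Log-likelihood of Gamma(a, b) data (b is the rate parameter)."""
    return n * a * np.log(b) - n * gammaln(a) + (a - 1.0) * sum_log_x - b * sum_x

def gamma_gradients(a, b, n, sum_x, sum_log_x):
    """Partial derivatives of the log-likelihood with respect to a and b."""
    d_a = n * np.log(b) - n * digamma(a) + sum_log_x
    d_b = n * a / b - sum_x
    return d_a, d_b

def fit_gamma(x, a=1.0, b=1.0, lr=0.1, n_steps=10000):
    """Gradient ascent on the gamma log-likelihood."""
    x = np.asarray(x, dtype=float)
    # Only these summaries of the data are needed inside the loop.
    n, sum_x, sum_log_x = x.size, x.sum(), np.log(x).sum()
    for _ in range(n_steps):
        d_a, d_b = gamma_gradients(a, b, n, sum_x, sum_log_x)
        # Assumption: scale the step by 1/N and clip to keep a, b > 0.
        a = max(a + lr * d_a / n, 1e-6)
        b = max(b + lr * d_b / n, 1e-6)
    return a, b

# Synthetic data; NumPy parameterizes the gamma by shape and scale = 1 / rate.
rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=5000)
a_hat, b_hat = fit_gamma(x)
print(a_hat, b_hat)
```

With enough data the estimates should land near the values used to generate the sample. Convergence is slow along one direction of the parameter space, which is why the sketch uses many small steps.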

Here are some observations about these examples:

  • In each of these examples, the log-likelihood function $\ell$ decomposed as a sum of terms, one for each training example. This results from our independence assumption. Because different observations are independent, the likelihood decomposes as a product over training examples, so the log-likelihood decomposes as a sum.
  • The derivatives worked out nicely because we were dealing with log-likelihoods. Try taking derivatives of the likelihood function L(θ), and you’ll see that they’re much messier.
  • All of the log-likelihood functions we looked at wound up being expressible in terms of certain sufficient statistics of the dataset, such as $\sum_{i=1}^N x^{(i)}$, $\sum_{i=1}^N [x^{(i)}]^2$, or $\sum_{i=1}^N \log x^{(i)}$. When we’re fitting the maximum likelihood solution, we can forget the data itself and just remember the sufficient statistics. This doesn’t happen for all of our models; for instance, it didn’t happen when we fit the neural language model.

3 Bayesian parameter estimation

In the maximum likelihood approach, the observations (i.e. the xi’s) were treated as random variables, but the model parameters were not. In the Bayesian approach, we treat the parameters as random variables as well. We define a model for the joint distribution p(θ, D) over parameters θ and data D. (In our coin example, θ would be the probability of H, and D would be the sequence of 100 flips that we observed.) Then we can perform the usual operations on this joint distribution, such as marginalization and conditioning. In order to define this joint distribution, we need two things:

  • A distribution p(θ), known as the prior distribution. It’s called the prior because it’s supposed to encode your “prior beliefs,” i.e. everything you believed about the parameters before looking at the data. In practice, we normally choose priors to be computationally convenient, rather than based on any sort of statistical principle. More on this later.
  • The likelihood p(D | θ), the probability of the observations given the parameters, just like in maximum likelihood.

Bayesians are primarily interested in computing two things:

  • The posterior distribution p(θ | D). This corresponds to our beliefs about the parameters after observing the data. In general, the posterior distribution can be computed using Bayes’ Rule:

$$p(\theta \mid \mathcal{D}) = \frac{p(\theta) \, p(\mathcal{D} \mid \theta)}{\int p(\theta') \, p(\mathcal{D} \mid \theta') \, d\theta'}$$

However, we don’t normally compute the denominator directly. Instead we work with unnormalized distributions as long as possible, and normalize only when we need to. Bayes’ Rule can therefore be written in a more succinct form, using the symbol ∝ to denote “proportional to”: p(θ | D) ∝ p(θ)p(D | θ).

  • The posterior predictive distribution $p(\mathcal{D}' \mid \mathcal{D})$, which is the distribution over future observables given past observations. For instance, given that we’ve observed 55 H’s and 45 T’s, what’s the probability that the next flip will land H? We can compute the posterior predictive distribution by computing the posterior over θ and then marginalizing out θ:

$$p(\mathcal{D}' \mid \mathcal{D}) = \int p(\theta \mid \mathcal{D}) \, p(\mathcal{D}' \mid \theta) \, d\theta.$$
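For a one-parameter model, both quantities can be approximated by brute force on a grid, which may help make the integrals concrete. The sketch below is only an illustration under stated assumptions: it uses the coin data (55 heads, 45 tails), assumes a uniform prior p(θ) = 1, and replaces each integral with a midpoint Riemann sum; the function names are hypothetical.

```python
import numpy as np

def grid_posterior(n_heads, n_tails, log_prior, n_grid=10000):
    """Approximate p(theta | D) on a grid of midpoints covering (0, 1)."""
    d_theta = 1.0 / n_grid
    theta = (np.arange(n_grid) + 0.5) * d_theta
    # Unnormalized log posterior: log prior + log likelihood.
    log_unnorm = (log_prior(theta) + n_heads * np.log(theta)
                  + n_tails * np.log1p(-theta))
    # Subtract the maximum before exponentiating to avoid underflow.
    unnorm = np.exp(log_unnorm - log_unnorm.max())
    # Normalize only at the end (the denominator of Bayes' Rule).
    posterior = unnorm / (unnorm.sum() * d_theta)
    return theta, posterior, d_theta

def uniform_log_prior(theta):
    """Assumed prior for this sketch: p(theta) = 1 on (0, 1)."""
    return np.zeros_like(theta)

theta, posterior, d_theta = grid_posterior(55, 45, uniform_log_prior)

# Posterior predictive probability that the next flip lands heads:
# integrate p(theta | D) * p(H | theta) = p(theta | D) * theta over theta.
prob_next_heads = np.sum(posterior * theta) * d_theta
print(prob_next_heads)
```

This grid approach only works because there is a single parameter; the number of grid points needed grows exponentially with the number of parameters.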

Figure 1: The PDF of the beta distribution for various values of the parameters a and b. Observe that the distribution becomes more peaked as a and b become large, and the peak is near a/(a + b).

3.1 The full Bayesian approach

Let’s figure out the posterior distribution and posterior predictive distribution for our coin example. We’ve already specified the likelihood, so it remains to specify the prior. One option is to use an uninformative prior, which assumes as little as possible about the problem. In the case of the coin, this might correspond to the uniform distribution p(θ) = 1. (There is no single recipe for choosing an uninformative prior; statisticians have a few different recipes which often, but not always, agree with each other.)

Alternatively, we can draw upon our lifetime of experience flipping coins. Most coins tend to be fair, i.e. they come up heads around 50% of the time. So perhaps our prior should make θ = 0.5 more likely. There are a lot of distributions which can do this, but a particularly useful one is the beta distribution, parameterized by a, b > 0, and defined as:

$$p(\theta; a, b) = \frac{\Gamma(a + b)}{\Gamma(a) \Gamma(b)} \theta^{a-1} (1 - \theta)^{b-1}.$$

This distribution is visualized in Figure 1. Why did we choose the beta distribution, of all things? Once we work through the example, we’ll see that it’s actually pretty convenient. Observe that the first term (with all the Γ’s) is just a normalizing constant, so it doesn’t depend on θ. In most of our computations, we’ll only need to work with unnormalized distributions (i.e. ones which don’t necessarily integrate to 1), so we can drop the ugly normalizing constant and write

$$p(\theta; a, b) \propto \theta^{a-1} (1 - \theta)^{b-1}.$$

Figure 2: Plots of the prior, likelihood, and posterior for the coin flip example, with the prior Beta(2, 2). (Left) Small data setting, NH = 2, NT = 0. (Right) Large data setting, NH = 55, NT = 45. In this case, the data overwhelm the prior, so the posterior is determined by the likelihood. Note: for visualization purposes, the likelihood function is normalized to integrate to 1, since otherwise it would be too small to see.
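The two settings in Figure 2 can be summarized numerically. This sketch assumes the standard conjugate update for a beta prior with Bernoulli data, in which the posterior is Beta(a + N_H, b + N_T); the helper names are illustrative, and the 95% interval is an extra summary added here rather than something taken from the figure.

```python
from scipy.stats import beta

def beta_posterior(a, b, n_heads, n_tails):
    """Conjugate update: add the observed counts to the prior parameters."""
    return a + n_heads, b + n_tails

def summarize(a, b):
    """Mean, standard deviation, and central 95% interval of Beta(a, b)."""
    dist = beta(a, b)
    lower, upper = dist.ppf([0.025, 0.975])
    return dist.mean(), dist.std(), (lower, upper)

# Prior Beta(2, 2) in both cases, as in Figure 2.
small = summarize(*beta_posterior(2, 2, n_heads=2, n_tails=0))
large = summarize(*beta_posterior(2, 2, n_heads=55, n_tails=45))
print("small data:", small)
print("large data:", large)
```

In the small data setting the posterior stays broad and is pulled well away from θ = 1, whereas with 100 flips it concentrates near the observed frequency of heads.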

OK, back to an earlier question. Where did our choice of prior come from? The key thing to notice is Eqn 3, where the posterior wound up belonging to the same family of distributions as the prior. Why did this happen? Let’s compare the formulas for the beta distribution and the likelihood:

$$p(\theta) = \mathrm{Beta}(\theta; a, b) \propto \theta^{a-1} (1 - \theta)^{b-1}$$

$$p(\mathcal{D} \mid \theta) \propto \theta^{N_H} (1 - \theta)^{N_T}$$

In other words, the prior was chosen to have the same functional form as the likelihood.^3 Since we multiply these expressions together to get the (unnormalized) posterior, the posterior will also have this functional form. A prior chosen in this way is called a conjugate prior. In this case, the parameters of the prior distribution simply got added to the observed counts, so they are sometimes referred to as pseudo-counts.

Let’s look at some more examples.

Example 4. Let’s return to our problem of estimating the mean temperature in Toronto, where our model assumes a Gaussian with unknown mean μ and known standard deviation σ = 5. The first task is to choose a conjugate prior. In order to do this, let’s look at the PDF of a single data point:

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

If we look at this as a function of μ (rather than x), we see that it’s still a Gaussian! This should lead us to conjecture that the conjugate prior for a Gaussian is a Gaussian. Let’s try it and see if it works.

^3 The ∝ notation obscures the fact that the normalizing constants in these two expressions may be completely different, since p(θ) is a distribution over parameters, while p(D | θ) is a distribution over observed data. In this example, the latter normalizing constant happens to be 1, but that won’t always be the case.

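Our prior distribution will be a Gaussian distribution with mean $\mu_{\rm pri}$ and standard deviation $\sigma_{\rm pri}$. The posterior is then given by:

$$\begin{aligned}
p(\mu \mid \mathcal{D}) &\propto p(\mu) \, p(\mathcal{D} \mid \mu) \\
&= \left[ \frac{1}{\sqrt{2\pi} \, \sigma_{\rm pri}} \exp\left( -\frac{(\mu - \mu_{\rm pri})^2}{2\sigma_{\rm pri}^2} \right) \right] \left[ \prod_{i=1}^N \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left( -\frac{(x^{(i)} - \mu)^2}{2\sigma^2} \right) \right] \\
&\propto \exp\left( -\frac{(\mu - \mu_{\rm pri})^2}{2\sigma_{\rm pri}^2} - \frac{1}{2\sigma^2} \sum_{i=1}^N (x^{(i)} - \mu)^2 \right) \\
&\propto \exp\left( -\frac{\mu^2}{2\sigma_{\rm pri}^2} + \frac{\mu_{\rm pri} \mu}{\sigma_{\rm pri}^2} - \frac{\mu_{\rm pri}^2}{2\sigma_{\rm pri}^2} - \frac{1}{2\sigma^2} \sum_{i=1}^N [x^{(i)}]^2 + \frac{1}{\sigma^2} \sum_{i=1}^N x^{(i)} \mu - \frac{N}{2\sigma^2} \mu^2 \right) \\
&= \exp\left( -\frac{\mu_{\rm pri}^2}{2\sigma_{\rm pri}^2} - \frac{1}{2\sigma^2} \sum_{i=1}^N [x^{(i)}]^2 + \left[ \frac{\mu_{\rm pri}}{\sigma_{\rm pri}^2} + \frac{\sum_{i=1}^N x^{(i)}}{\sigma^2} \right] \mu - \frac{1}{2} \left[ \frac{1}{\sigma_{\rm pri}^2} + \frac{N}{\sigma^2} \right] \mu^2 \right) \\
&\propto \exp\left( -\frac{(\mu - \mu_{\rm post})^2}{2\sigma_{\rm post}^2} \right)
\end{aligned}$$

where

$$\sigma_{\rm post} = \frac{1}{\sqrt{\frac{1}{\sigma_{\rm pri}^2} + \frac{N}{\sigma^2}}}$$

$$\mu_{\rm post} = \frac{\frac{1}{\sigma_{\rm pri}^2} \mu_{\rm pri} + \frac{N}{\sigma^2} \cdot \frac{1}{N} \sum_{i=1}^N x^{(i)}}{\frac{1}{\sigma_{\rm pri}^2} + \frac{N}{\sigma^2}}$$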

The last step uses a technique called completing the square. You’ve probably done this before in a probability theory class. So the posterior distribution is a Gaussian with mean μpost and standard deviation σpost.
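The posterior formulas translate directly into code. In the sketch below, the temperature readings and the prior parameters are made-up placeholders rather than the values used in these notes. The predictive standard deviation uses the standard result for this model (the posterior variance plus the observation variance), which is stated here as a known fact rather than derived in this excerpt.

```python
import numpy as np

def gaussian_posterior(x, sigma, mu_pri, sigma_pri):
    """Posterior over the mean mu of a Gaussian with known std sigma,
    under a Gaussian prior with mean mu_pri and std sigma_pri."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # This is 1 / sigma_post^2: the prior term plus N copies of the data term.
    precision = 1.0 / sigma_pri ** 2 + n / sigma ** 2
    sigma_post = 1.0 / np.sqrt(precision)
    mu_post = (mu_pri / sigma_pri ** 2 + x.sum() / sigma ** 2) / precision
    return mu_post, sigma_post

# Hypothetical data and prior, for illustration only.
temps = np.array([-2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4])
sigma = 5.0
mu_post, sigma_post = gaussian_posterior(temps, sigma, mu_pri=0.0, sigma_pri=10.0)

# Standard result: the predictive distribution for a new observation is
# Gaussian with mean mu_post and variance sigma_post^2 + sigma^2.
sigma_pred = np.sqrt(sigma_post ** 2 + sigma ** 2)
print(mu_post, sigma_post, sigma_pred)
```

The posterior mean is a weighted average of the prior mean and the empirical mean, with weights $1/\sigma_{\rm pri}^2$ and $N/\sigma^2$, so the data dominate as N grows.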

The formulas are rather complicated, so let’s break them apart. First look how σpost changes if we vary the prior or the data.

  • As we increase the number of observations N , the denominator gets larger, so σpost gets smaller. This should be intuitive: as we observe more data, the posterior gets more peaked, which corresponds to the posterior standard deviation decreasing.

Figure 3: The prior, posterior, and posterior predictive distributions for the Toronto temperatures example.

  • When N is very large, the mean of the predictive distribution is close to the maximum likelihood mean, and the standard deviation is very close to σ. In other words, it makes almost the same predictions as the maximum likelihood estimate.

The prior, posterior, and posterior predictive distributions are all shown in Figure 3.

For both the coin and Gaussian examples, the posterior predictive distribution had the same parametric form as the model. (I.e., it was a Bernoulli distribution for the coin model, and a Gaussian distribution for the Gaussian model.) This does not happen in general; often the posterior predictive distribution doesn’t have a convenient form, which is part of what makes the full Bayesian approach difficult to apply.

3.2 The difficulty of the full Bayesian approach

We’ve seen two different ways to learn the parameters of a probabilistic model. Maximum likelihood is based on optimization, while the full Bayesian approach is based on computing integrals. In either case, for some of the commonly used distributions, we can derive a closed-form solution. However, for many important models (such as multilayer neural nets), there’s no closed-form solution.

As we saw in Example 3, if we can’t find a closed form, we can still maximize the log-likelihood using gradient ascent. But for the Bayesian approach, we need to compute an integral in order to marginalize out the model parameters. If we only have a few parameters, we can do this using numerical quadrature methods. Unfortunately, the cost of these methods grows exponentially in the number of variables being integrated out. If we’re trying to fit a neural net with thousands (or even millions) of parameters, this is completely impractical. There are other methods for integration which perform well in high dimensional spaces; we’ll discuss one such set of techniques, called Markov chain Monte Carlo, later in the course. However, integration still tends to be a much more difficult problem than optimization, so if possible we would like to formulate our learning algorithms in terms of optimization. Let’s now look at the maximum a-posteriori (MAP) approximation, a way of converting the integration problem into an optimization problem.

3.3 Maximum a-posteriori (MAP) approximation

We worked through two examples of the full Bayesian approach: Bernoulli and Gaussian models. In both cases, we saw that as more data is observed, the posterior distribution becomes more and more peaked around a single value. This suggests that maybe we can get away with summarizing the posterior with a single point estimate. The maximum a-posteriori (MAP) approximation chooses the parameters which are most likely under the posterior, i.e.

$$\begin{aligned}
\hat{\theta}_{\rm MAP} &= \arg\max_\theta \, p(\theta \mid \mathcal{D}) \\
&= \arg\max_\theta \, p(\theta, \mathcal{D}) \\
&= \arg\max_\theta \, p(\theta) \, p(\mathcal{D} \mid \theta) \\
&= \arg\max_\theta \, \left[ \log p(\theta) + \log p(\mathcal{D} \mid \theta) \right]
\end{aligned}$$

Observe that maximizing log p(D | θ) is equivalent to maximum likelihood estimation, so the only difference between MAP and ML is the addition of the prior term log p(θ). The prior is therefore somewhat analogous to a regularizer. In fact, if p(θ) is a Gaussian distribution centered at 0, you get L2 regularization!
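To see where the L2 penalty comes from, here is a short worked derivation. It assumes, for illustration, that θ is a parameter vector with an independent zero-mean Gaussian prior of standard deviation $\sigma_{\rm pri}$ on each component; the symbol λ for the resulting penalty strength is introduced here and is not part of the notes.

$$\begin{aligned}
\log p(\theta) &= -\frac{1}{2\sigma_{\rm pri}^2} \|\theta\|^2 + \text{const} \\
\hat{\theta}_{\rm MAP} &= \arg\max_\theta \, \left[ \log p(\mathcal{D} \mid \theta) - \frac{1}{2\sigma_{\rm pri}^2} \|\theta\|^2 \right] \\
&= \arg\min_\theta \, \left[ -\log p(\mathcal{D} \mid \theta) + \frac{\lambda}{2} \|\theta\|^2 \right], \qquad \lambda = \frac{1}{\sigma_{\rm pri}^2}
\end{aligned}$$

Under this assumption, a narrow prior (small $\sigma_{\rm pri}$) corresponds to a strong penalty, and a broad prior corresponds to a weak one.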

Example 5. Let’s return to our coin flip example. The joint probability is given by:

$$\begin{aligned}
\log p(\theta, \mathcal{D}) &= \log p(\theta) + \log p(\mathcal{D} \mid \theta) \\
&= \text{const} + (a - 1) \log \theta + (b - 1) \log(1 - \theta) + N_H \log \theta + N_T \log(1 - \theta) \\
&= \text{const} + (N_H + a - 1) \log \theta + (N_T + b - 1) \log(1 - \theta)
\end{aligned}$$

(Here, const is a shorthand for terms which don’t depend on θ.) Let’s maximize this by finding a critical point:

$$\frac{d}{d\theta} \log p(\theta, \mathcal{D}) = \frac{N_H + a - 1}{\theta} - \frac{N_T + b - 1}{1 - \theta}$$

Setting this to zero, we get

$$\hat{\theta}_{\rm MAP} = \frac{N_H + a - 1}{N_H + N_T + a + b - 2}$$
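To compare the three approaches on the coin example, the sketch below computes the probability each one assigns to heads on the next flip. The ML and MAP expressions are the ones derived in these notes. The full Bayesian value uses the mean of the Beta(N_H + a, N_T + b) posterior, a standard result that is assumed here rather than derived in this excerpt. The function name is illustrative.

```python
def coin_estimates(n_heads, n_tails, a, b):
    """Probability of heads on the next flip under ML, MAP, and full Bayes,
    for a Beta(a, b) prior. ML assumes at least one flip was observed."""
    n = n_heads + n_tails
    theta_ml = n_heads / n
    theta_map = (n_heads + a - 1) / (n + a + b - 2)
    # Assumed standard result: the posterior predictive probability of heads
    # equals the posterior mean of theta.
    theta_fb = (n_heads + a) / (n + a + b)
    return theta_ml, theta_map, theta_fb

# Small data setting (2 heads, 0 tails) versus the 100-flip dataset,
# both with the Beta(2, 2) prior.
print(coin_estimates(2, 0, a=2, b=2))
print(coin_estimates(55, 45, a=2, b=2))
```

With only two flips the three answers differ substantially, and ML predicts that tails is impossible. With 100 flips they nearly coincide, since the pseudo-counts are small relative to the observed counts.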

Figure 4: Comparison of the predictions made by the ML, FB, and MAP methods about future temperatures. (Left) After observing one training case. (Right) After observing 7 training cases, i.e. one week.

On one hand, we typically use a lot more data than we did in these toy examples. In typical neural net applications, we’d have thousands or millions of training cases. On the other hand, we’d also have a lot more parameters: typically thousands or millions. Depending on the precise dataset and model architecture, there might or might not be a big difference between the methods.

3.5 Can the full Bayesian method overfit?

We motivated the Bayesian approach as a way to prevent overfitting. It’s sometimes claimed that you can’t overfit if you use the full Bayesian approach. Is this true? In a sense, it is. If your prior and likelihood model are both accurate, then Bayesian inference will average the predictions over all parameter values that are consistent with the data. Either there’s enough data to accurately pinpoint the correct values, or the predictions will be averaged over a broad posterior which probably includes values close to the true one. However, in the presence of model misspecification, the full Bayesian approach can still overfit. This term is unfortunate because it makes it sound like misspecification only happens when we do something wrong. But pretty much all the models we use in machine learning are vast oversimplifications of reality, so we can’t rely on the theoretical guarantees of the Bayesian approach (which rely on the model being correctly specified). We can see this in our Toronto temperatures example. Figure 5 shows the posterior predictive distribution given the first week of March, as well as a histogram of temperature values for the rest of the month. A lot of the temperature values are outside the range predicted by the model! There are at least two problems here, both of which result from the erroneous i.i.d. assumption:

  • The data are not identically distributed: the observed data are for the start of the month, and temperatures may be higher later in the month.
  • The data are not independent: temperatures in subsequent days are correlated, so treating each observation as a new independent sample results in a more confident posterior distribution than is actually justified.

Figure 5: The full Bayesian posterior predictive distribution given the temperatures for the first week, and a histogram of temperatures for the remainder of the month. Observe that the predictions are poor because of model misspecification.

Unfortunately, the data are rarely independent in practice, and there are often systematic differences between the datasets we train on and the settings where we’ll need to apply the learned models in practice. Therefore, overfitting remains a real possibility even with the full Bayesian approach.

4 Summary

We’ve introduced three different methods for learning probabilistic models:

  • Maximum likelihood (ML), where we choose the parameters which maximize the likelihood:

$$\hat{\theta}_{\rm ML} = \arg\max_\theta \, \ell(\theta) = \arg\max_\theta \, \log p(\mathcal{D} \mid \theta)$$

Sometimes we can compute the optimum analytically by setting partial derivatives to 0. Otherwise, we need to optimize it using an iterative method such as SGD.