Bayesian Approach to Model Fitting in Neural Networks and Machine Learning, Slides of Computer Networks

An introduction to the bayesian framework for model fitting in neural networks and machine learning. The bayesian approach assumes a prior distribution for all parameters and combines it with the likelihood term to obtain a posterior distribution. The example of coin tossing is used to illustrate the concept. The document also discusses the advantages of using a distribution over parameter values and the use of bayes theorem and maximum likelihood learning.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

sajid
sajid 🇮🇳

4.6

(7)

128 documents

1 / 13

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Introduction to Neural Networks
and Machine Learning
Lecture 10: The Bayesian way to fit models
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd

Partial preview of the text

Download Bayesian Approach to Model Fitting in Neural Networks and Machine Learning and more Slides Computer Networks in PDF only on Docsity!

Introduction to Neural Networks

and Machine Learning

Lecture 10: The Bayesian way to fit models

The Bayesian framework

  • The Bayesian framework assumes that we always

have a prior distribution for everything.

  • The prior may be very vague.
  • When we see some data, we combine our prior

distribution with a likelihood term to get a posterior

distribution.

  • The likelihood term takes into account how probable the

observed data is given the parameters of the model.

  • It favors parameter settings that make the data likely.
  • It fights the prior
  • With enough data the likelihood terms always win.

Some problems with picking the parameters that are most likely to generate the data

  • What if we only tossed the coin once and we

got 1 head?

  • Is p=1 a sensible answer?
    • Surely p=0.5 is a much better answer.
  • Is it reasonable to give a single answer?
  • If we don’t have much data, we are unsure about

p.

  • Our computations of probabilities will work much

better if we take this uncertainty into account.

Using a distribution over parameter values

  • Start with a prior distribution over p. In this case we used a uniform distribution.
  • Multiply the prior probability of each parameter value by the probability of observing a head given that value.
  • Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution.

probability density

p

area=

area=

0 1

1

1

2

probability density

probability density

Lets do it another 98 times

  • After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0. (assuming a uniform prior).

probability density

p

area=

0 1

1

2

Bayes Theorem

W

p W p D W

p D

p W p D W p W D

p D p W D p D W p W p D W

Prior probability of weight vector W

Posterior probability of weight vector W given training data D

Probability of observed data given W

joint probability

conditional probability

Why we maximize sums of log probs

p ( D | W ) p ( d | W ) c

= ∏ c

  • We want to maximize the product of the probabilities of the outputs on all the different training cases - Assume the output errors on different training cases, c, are independent.
  • Because the log function is monotonic, it does not change where the maxima are. So we can maximize sums of log probabilities

log p ( D | W ) log p ( d | W ) c

= ∑ c

A even cheaper trick

  • Suppose we completely ignore the prior over

weight vectors

  • This is equivalent to giving all possible weight vectors the same prior probability density.
  • Then all we have to do is to maximize:
  • This is called maximum likelihood learning. It is

very widely used for fitting models in statistics.

log p ( D | W ) log p ( D | W ) c

= ∑ c

Supervised Maximum Likelihood Learning

  • Finding a set of weights, W, that minimizes the

squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases.

  • We implicitly assume that zero-mean Gaussian

noise is added to the model’s actual output.

  • We do not need to know the variance of the noise

because we are assuming it’s the same in all cases.

So it just scales the squared error.