Approximate Inference and Learning in Bayesian Networks, Study notes of Network Theory

Harvard University Network Theory

Methods for approximate inference in Bayesian networks using sample-based and deterministic approximations. Also covers generalizations of Bayesian networks with noisy-OR nodes and continuous random variables, as well as learning the parameters and structure of Bayesian networks, with examples and applications in hidden Markov models and speech recognition.

Typology: Study notes

2010/2011

Uploaded on 10/25/2011

CS181 Lecture 14 — Approximate Inference,

Generalizations to Bayes Nets, and Learning

David C. Parkes

March 3, 2011

In this lecture we will briefly survey methods for approximate inference using sample-based and “deterministic” (e.g., variational Bayes) approximations. In addition, we will consider generalizations to BNs (including noisy-OR nodes, and BNs with continuous random variables), and methods for learning the parameters and also the structure of BNs.

1 Approximate Inference

Exact inference, using variable elimination, is not tractable for general Bayes Nets. For this reason, two main kinds of approximate inference are adopted.

• Stochastic approximations: exact with enough computation, but will not scale to large networks.
• Deterministic approximations: will never be exact, but will always be fast.

In fact it is this latter class of deterministic approximations that has been especially effective in getting BNs to scale to the real world. For example, this is the technique used by Microsoft for the Xbox 360 Live application TrueSkill.

Another inference method that is sometimes used, and that can be both fast and effective, is loopy belief propagation. In this approach, the sum and product messages of variable elimination are sent as though the BN were a polytree. The algorithm can be useful when the message passing converges, although it will sometimes continue forever. Some theory is available in support of why loopy propagation is effective, although it started as a heuristic approach for making progress.

Many of today’s applications of Bayes Net methods are only possible because of deterministic approximation methods such as variational Bayes and expectation propagation (EP). Gibbs sampling can also be effective and finds wide application. But don’t despair of the exact methods of variable elimination and the corresponding message passing algorithms. We find they can be used to great effect for nicely structured Bayes Nets, such as those we see in application to Hidden Markov
Models and used widely for speech recognition and other natural language understanding tasks. It is just that for “dirty” real-world applications, we often need something more heuristic and a lot faster.

2 Stochastic Approximations

This is nicely explained in R&N. The technical nitty-gritty, for example the derivation on pp. 537-538 of why Gibbs sampling works, is beyond the scope of this course. But you should be sure to understand the basic ideas. The basic approaches for using sample-based methods to estimate P(Q | e) for evidence E = e in a Bayes Net are as follows:

• Rejection sampling. Sample x ∈ X from the prior joint distribution of the Bayes Net by going in topological order. Reject any sample in which the evidence E = e is not satisfied. For the remaining samples, use a count of when Q = q vs. when Q ≠ q to determine an estimate of P(Q = q | e). Pro: very simple method. Con: the fraction of samples that survive rejection shrinks exponentially as the number of evidence variables grows!
• Importance sampling. Rather than sample and reject, the idea here is to sample only the random variables that are not evidence, with the sample of each variable conditioned on the values of its parents (only). The problem is that dependence on evidence variables that occur later in the order is ignored when sampling earlier variables. To address this, each complete x ∈ X so constructed is weighted by the probability of each evidence variable given its parents. For example, if the value sampled for some variable is very unlikely given the evidence then a low weight is assigned. Pro: every sample counts, so this is more efficient than, and dominates, rejection sampling. Con: this method doesn’t scale when there is a lot of evidence late in the topological (typically, causal) order of a BN, where earlier samples are then not very likely. The basic problem is that many samples will be given a very low weight.
• Gibbs sampling. This is a Markov Chain Monte Carlo (MCMC) method. Rather than sample a completely new x vector each time, the idea is to instead re-sample just a single (non-evidence) variable each time, holding all the other values fixed. In particular, the evidence variables are fixed throughout and the rest of the variables are initialized in some arbitrary way. Gibbs sampling then proceeds by sampling each of the non-evidence variables repeatedly at random, sampling each variable given the current values of all other variables. In particular, the Markov blanket of a variable is used to make this tractable: it is possible to use Bayes’ rule to enable the value of a single variable to be sampled given the values of its Markov blanket (including the values of
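The three sampling schemes described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's reference code: the four-node sprinkler network, its CPT numbers, and all function names are assumptions introduced for the example, and the importance sampler shown is the likelihood-weighting variant in which evidence variables are fixed and each sample is weighted by the probability of the evidence given its parents.

```python
import random

# Hypothetical sprinkler network; the CPT numbers are illustrative assumptions.
# Each entry maps a variable to (parents, {parent values: P(variable = True)}).
NET = {
    "Cloudy":    ((), {(): 0.5}),
    "Sprinkler": (("Cloudy",), {(True,): 0.1, (False,): 0.5}),
    "Rain":      (("Cloudy",), {(True,): 0.8, (False,): 0.2}),
    "WetGrass":  (("Sprinkler", "Rain"), {(True, True): 0.99, (True, False): 0.9,
                                           (False, True): 0.9, (False, False): 0.0}),
}
ORDER = list(NET)  # the insertion order above is a topological order


def p_true(var, x):
    """P(var = True | parents), reading the parent values from assignment x."""
    parents, cpt = NET[var]
    return cpt[tuple(x[p] for p in parents)]


def prob(var, value, x):
    p = p_true(var, x)
    return p if value else 1.0 - p


def rejection_sampling(query, evidence, n, rng):
    kept = hits = 0
    for _ in range(n):
        x = {}
        for var in ORDER:  # sample from the prior in topological order
            x[var] = rng.random() < p_true(var, x)
        if all(x[v] == val for v, val in evidence.items()):
            kept += 1
            hits += x[query]
    return hits / kept if kept else float("nan")


def likelihood_weighting(query, evidence, n, rng):
    num = den = 0.0
    for _ in range(n):
        x, w = {}, 1.0
        for var in ORDER:
            if var in evidence:  # fix evidence, weight by its probability
                x[var] = evidence[var]
                w *= prob(var, x[var], x)
            else:
                x[var] = rng.random() < p_true(var, x)
        den += w
        num += w * x[query]
    return num / den if den else float("nan")


def gibbs_sampling(query, evidence, n, rng, burn_in=1000):
    x = dict(evidence)
    hidden = [v for v in ORDER if v not in evidence]
    for v in hidden:  # arbitrary initialization of non-evidence variables
        x[v] = rng.random() < 0.5
    hits = 0
    for t in range(n + burn_in):
        var = rng.choice(hidden)
        score = {}
        for val in (True, False):
            # P(var | all others) is proportional to the CPT entries that
            # mention var: its own CPT and the CPTs of its children.
            x[var] = val
            s = prob(var, val, x)
            for child in ORDER:
                if var in NET[child][0]:
                    s *= prob(child, x[child], x)
            score[val] = s
        x[var] = rng.random() < score[True] / (score[True] + score[False])
        if t >= burn_in:
            hits += x[query]
    return hits / n
```

For the query P(Rain = true | Sprinkler = true, WetGrass = true) on this assumed network, exact enumeration gives 0.0891 / 0.2781, roughly 0.32, and all three estimators should approach that value as the number of samples grows.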

This problem of insufficient data arises in the Hepar II system for liver disease diagnosis (Onisko et al.), where the researchers found it useful to reformulate a Bayes Net for liver diagnosis to allow for fewer parameters. Specifically, in addition to moving from a single diagnosis to a multiple diagnosis model (see the Figure in the slides from class), which provides for a network in which nodes tend to have fewer parents, the authors also leveraged the idea of a Noisy-OR representation for CPTs. Noisy-OR represents a CPT with H parents, which requires 2^H numbers when all random variables are binary, as a parameterized CPT that requires only H numbers. It is normal to ask an expert (or a panel of experts) whether or not a particular node can be usefully modeled as a Noisy-OR node, in which case this saving in parameters can be realized, and a predictive model learned with less data. The lecture slides present an example in which cold, flu and malaria all cause fever. Whereas logical OR would insist that fever holds whenever cold ∨ flu ∨ malaria holds, Noisy-OR makes the following alternative assumption:

Each cause leads to the symptom independently, but the causal effect of each cause may be inhibited.

For example, the probabilistic information about cold would be represented as

$$q_{\text{cold}} = P(\lnot \text{fever} \mid \text{cold}, \lnot \text{flu}, \lnot \text{malaria}) = 0.6,$$

to indicate that the probability the fever effect of a cold is inhibited is 0.6 and so a patient is still quite likely not to have a fever. Based on similar numbers $q_{\text{flu}}$ and $q_{\text{malaria}}$ (see the R&N handout), the full CPT for different truth values of the three parents is constructed as follows:

for each possible set of truth value assignments to the parents, the probability the patient does NOT have fever is defined to be the product of the probabilities $q_k$ for each parent $X_k$ that is true.

For example, if

$$q_{\text{malaria}} = P(\lnot \text{fever} \mid \text{malaria}, \lnot \text{cold}, \lnot \text{flu}) = 0.1,$$

then

$$P(\lnot \text{fever} \mid \text{cold}, \text{malaria}, \lnot \text{flu}) = q_{\text{cold}} \times q_{\text{malaria}} = 0.6 \times 0.1 = 0.06$$

and

$$P(\text{fever} \mid \text{cold}, \text{malaria}, \lnot \text{flu}) = 1 - q_{\text{cold}} \times q_{\text{malaria}} = 1 - 0.06 = 0.94.$$

The Noisy-OR model requires that the only way the patient does not have a fever is when both the fever due to cold and the fever due to malaria are inhibited. Formulaically, we have

$$P(X_j = 1 \mid \mathrm{Pa}[X_j]) = 1 - \prod_{k \,:\, X_k = \text{true},\; k \in \mathrm{Pa}[X_j]} q_k \tag{2}$$

In Hepar II it was possible to identify 27 of the 62 nodes as Noisy-OR nodes and the reformulation of the Bayes Net required only 1488 instead of 3714 parameters to define. This provided as much as a 5% reduction in the classification error.
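As a quick check of the Noisy-OR construction in Eq. (2), the following Python sketch expands H inhibition probabilities into the full 2^H-row CPT. The values q_cold = 0.6 and q_malaria = 0.1 are the ones used in the text; q_flu = 0.2 is an assumed value standing in for the number in the R&N handout, and the function name is hypothetical.

```python
from itertools import product
from math import prod


def noisy_or_cpt(q):
    """Expand inhibition probabilities {parent: q_k} into a full CPT.

    Returns {tuple of parent truth values: P(effect = True)}, with parents
    ordered as in the dictionary q.
    """
    names = list(q)
    cpt = {}
    for values in product([False, True], repeat=len(names)):
        # P(effect = False) is the product of q_k over the parents that are true;
        # the empty product is 1, so no active cause means no effect.
        p_not = prod(q[name] for name, v in zip(names, values) if v)
        cpt[values] = 1.0 - p_not
    return cpt


# q_flu = 0.2 is an assumption for illustration only.
fever = noisy_or_cpt({"cold": 0.6, "flu": 0.2, "malaria": 0.1})
print(fever[(True, False, True)])  # cold and malaria true, flu false
```

Three numbers define all eight rows here, which is the saving in parameters that makes the Noisy-OR reformulation attractive when data is scarce.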

5 Bayes Nets with Continuous and Discrete Variables

Just to slightly complement the treatment in R&N, note that in regard to the discrete Buy random variable, the idea is that: (a) the consumer has a hard threshold for whether or not she will buy the fruit, and she buys if the cost $c$ of fruit is less than her threshold, denoted threshold; (b) the threshold of a consumer is modeled as a Gaussian with mean $\mu_b$ and variance $\sigma_b^2$; (c) based on this, the probability that a consumer will buy the fruit given cost $c$ depends on whether or not the cost is below the threshold, and the parameters $\mu_b$ and $\sigma_b$ induce a probability $P(\text{Buy} = \text{true} \mid c)$ for different possible costs $c$. This induced distribution, which is itself a probit distribution, is in turn also parameterized by $\mu_b$ and $\sigma_b$ since it is induced by the Gaussian distribution on threshold; (d) specifically, we have

$$P(\text{Buy} = \text{true} \mid c) = P(\text{threshold} \ge c) = 1 - \int_{z=-\infty}^{c} N(Z = z \mid \mu_b, \sigma_b^2)\, dz \tag{3}$$

$$= 1 - \Phi\!\left(\frac{c - \mu_b}{\sigma_b}\right) = \Phi\!\left(\frac{\mu_b - c}{\sigma_b}\right)$$

where $\Phi(\cdot)$ is the cumulative distribution function of the unit Gaussian (mean 0, variance 1).

Note that the overall model for this farm subsidy example has 11 parameters: $P(\text{subsidy})$ has 1; $N(h \mid \mu_h, \sigma_h^2)$ for harvest has 2; the linear Gaussian for cost is $N(c \mid a_t h + b_t, \sigma_t^2)$ or $N(c \mid a_f h + b_f, \sigma_f^2)$, depending on whether the subsidy is true or false, and has 6 parameters; and the probit distribution for the buy decision has 2 parameters.
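To make the probit model concrete, here is a small Python sketch that evaluates P(Buy = true | c) as the unit Gaussian CDF applied to (μ_b − c)/σ_b. The function names and the values μ_b = 5 and σ_b = 1 are assumptions chosen for illustration; the CDF is written with the standard library's error function.

```python
from math import erf, sqrt


def std_normal_cdf(x):
    """CDF of the unit Gaussian, expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def buy_probability(cost, mu_b, sigma_b):
    """P(Buy = true | cost) when the buyer's threshold is N(mu_b, sigma_b^2)."""
    return std_normal_cdf((mu_b - cost) / sigma_b)


# Hypothetical parameters: mean threshold 5, standard deviation 1.
for cost in (3.0, 4.0, 5.0, 6.0, 7.0):
    print(cost, round(buy_probability(cost, mu_b=5.0, sigma_b=1.0), 4))
```

The probability is exactly one half when the cost equals the mean threshold and falls smoothly as the cost rises, which is the soft-threshold behavior the probit distribution is meant to capture.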

5.1 Full Bayes Reasoning in Bayes Nets

Given that continuous random variables can be handled with BNs, it is possible to capture distributions on parameters in a Bayes Net. This is done by making the parameters themselves (continuous) random variables. For example, the parameter $\theta_2^T$, for whether or not $X_2$ is true given that its parent $C$ is true, can itself be modeled as a random variable with a Beta distribution, $\theta_2^T \sim \mathrm{Beta}(\theta_2^T \mid \alpha_2^T, \beta_2^T)$. Note that there are now (fixed) parameters $\alpha_2^T$ and $\beta_2^T$ to learn, which induce a distribution on $\theta_2^T$. These $(\alpha, \beta)$ parameters of the Beta distribution may themselves have been learned by combining a prior on the distribution on $\theta_2^T$ (itself expressed as a Beta distribution) with observed counts of when $X_2$ is true and false when its parent $C$ is true.
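The step of combining a Beta prior with observed counts has a simple closed form, because the Beta distribution is conjugate to the Bernoulli likelihood: the counts are added to the prior's parameters. A minimal Python sketch follows; the function names, the uniform Beta(1, 1) prior, and the counts are assumptions for illustration.

```python
def beta_posterior(alpha, beta, n_true, n_false):
    """Update a Beta(alpha, beta) prior with counts of true/false outcomes."""
    return alpha + n_true, beta + n_false


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: among the cases where C is true, X2 was true 7 times
# and false 3 times; the prior is the uniform Beta(1, 1).
alpha_post, beta_post = beta_posterior(1.0, 1.0, n_true=7, n_false=3)
print(alpha_post, beta_post, beta_mean(alpha_post, beta_post))
```

The posterior mean sits between the prior mean and the empirical frequency, and moves toward the empirical frequency as more counts arrive.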

expected values of the latent variables and thus expected values of the sufficient statistics, while in the M step the parameter values are chosen so as to maximize the expected likelihood, which is equivalent to maximizing the likelihood of the expected sufficient statistics. The M step in Bayesian networks is very easy: just use the above formula with the expected sufficient statistics replacing the real ones. It turns out that the E step involves a series of calls to a Bayesian network inference algorithm. For each data instance $x_i$ with observations (evidence) $e_i$, and for each value $x$ of unobserved (or latent) variable $X_j \in \{X_{\ell+1}, \ldots, X_m\}$, and value $\mu$ of $\mathrm{Pa}[X_j]$, we need to compute $P(X_j = x, \mathrm{Pa}[X_j] = \mu \mid e_i)$. This is a standard Bayesian network computation. We can then compute the expected sufficient statistics:

$$E[N_{j,x\mu}] = \sum_{i=1}^{n} P(X_j = x, \mathrm{Pa}[X_j] = \mu \mid e_i). \tag{6}$$

We have seen, in a nutshell, how to learn the parameters of a Bayesian network from data, including missing data.
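The E and M steps can be illustrated on the smallest possible network, a binary parent C with a binary child X, where C is sometimes unobserved. This Python sketch rests on several assumptions: the network, the data encoding (None marks a missing value), the starting parameters, and the function names are all hypothetical, and the M step assumes the maximum likelihood formula is the usual normalized count, applied to expected counts.

```python
from math import log


def joint(c, x, p_c, p_x):
    """P(C = c, X = x), where p_c = P(C = True) and p_x[c] = P(X = True | C = c)."""
    pc = p_c if c else 1.0 - p_c
    px = p_x[c] if x else 1.0 - p_x[c]
    return pc * px


def e_step(data, p_c, p_x):
    """Expected counts of C = c and of (C = c, X = True), as in Eq. (6)."""
    n_c = {True: 0.0, False: 0.0}
    n_cx = {True: 0.0, False: 0.0}
    for c, x in data:
        if c is None:  # latent: posterior over C given the observed X
            w = {v: joint(v, x, p_c, p_x) for v in (True, False)}
            z = w[True] + w[False]
            post = {v: w[v] / z for v in w}
        else:          # observed: all the mass on the observed value
            post = {c: 1.0, (not c): 0.0}
        for v in (True, False):
            n_c[v] += post[v]
            n_cx[v] += post[v] * x
    return n_c, n_cx


def m_step(n_c, n_cx):
    """Normalized (expected) counts; 0.5 is an arbitrary fallback for empty rows."""
    p_c = n_c[True] / (n_c[True] + n_c[False])
    p_x = {v: (n_cx[v] / n_c[v] if n_c[v] > 0 else 0.5) for v in (True, False)}
    return p_c, p_x


def log_likelihood(data, p_c, p_x):
    """Log-likelihood of the observed part of the data."""
    total = 0.0
    for c, x in data:
        values = (True, False) if c is None else (c,)
        total += log(sum(joint(v, x, p_c, p_x) for v in values))
    return total


def em(data, p_c=0.6, p_x=None, iters=25):
    p_x = dict(p_x or {True: 0.7, False: 0.3})
    history = []
    for _ in range(iters):
        history.append(log_likelihood(data, p_c, p_x))
        p_c, p_x = m_step(*e_step(data, p_c, p_x))
    return p_c, p_x, history
```

With complete data the first iteration already lands on the maximum likelihood counts; with missing values the observed-data log-likelihood recorded in history should never decrease from one iteration to the next.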

6.2 Learning the Structure of a Bayesian Network

Learning the structure of a Bayesian Network is a different story. We will consider only the complete data case: learning a Bayesian network structure with missing data is quite difficult, though it can be done.

With complete data, we make the following observation. For any structure, we can quickly compute (via maximum log likelihood) the best possible parameters for that structure, by counting sufficient statistics. We can then quickly compute the likelihood of the data for that structure. This determines a score for the structure. The problem then becomes one of searching for the structure with the highest score. In practice, greedy search algorithms are often used. At each step, they take an existing structure and modify it slightly, perhaps by adding, deleting or reversing an edge.

One important issue in structure search is overfitting. With a maximally connected Bayesian network, one can represent any probability distribution, but obviously that will have difficulty generalizing (not to mention be exponential in size and impossible to work with). In general one wants a network with as small a number of edges as possible, or, more precisely, where nodes have as small a number of parents as possible so that the representation size of CPTs is small. One common technique to achieve the right effect is to add a penalty term to the score of a network, to penalize those networks that are too complex.
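A score-based greedy search of this kind can be sketched in Python as follows. Several choices here are assumptions rather than anything prescribed in the lecture: the penalty is a BIC-style term (half the log of the sample size per free parameter), all variables are binary, the data rows are dictionaries, only edge additions and deletions are tried (edge reversal is omitted for brevity), and all names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log


def family_score(data, child, parents):
    """Penalized maximum log-likelihood of one node given a parent set."""
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(n * log(n / marg[pa]) for (pa, _), n in joint.items())
    n_params = 2 ** len(parents)  # one free parameter per parent configuration
    return ll - 0.5 * log(len(data)) * n_params


def creates_cycle(parents, child, new_parent):
    """True if adding the edge new_parent -> child would close a directed cycle."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:  # child is an ancestor of new_parent
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def greedy_search(data, variables):
    """Hill-climb over structures, applying the best single-edge change each step."""
    parents = {v: set() for v in variables}
    score = {v: family_score(data, v, ()) for v in variables}
    while True:
        best = None  # (gain, child, candidate parent set)
        for a, b in permutations(variables, 2):  # candidate edge a -> b
            if a in parents[b]:
                cand = parents[b] - {a}          # try deleting the edge
            elif not creates_cycle(parents, b, a):
                cand = parents[b] | {a}          # try adding the edge
            else:
                continue
            gain = family_score(data, b, tuple(sorted(cand))) - score[b]
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, b, cand)
        if best is None:
            return parents
        gain, child, cand = best
        parents[child] = cand
        score[child] += gain
```

Because the score decomposes into one term per node, each candidate change only requires rescoring the node whose parent set changes. On data where B always copies A and C is independent of both, the search should link A and B and leave C without edges, since an edge to C buys no likelihood but still pays the penalty.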
