



These notes cover methods for approximate inference in Bayesian networks using sample-based and deterministic approximations. They also cover generalizations to Bayesian networks with noisy-OR nodes and continuous random variables, as well as learning the parameters and structure of Bayesian networks, with examples and applications in hidden Markov models and speech recognition.




In this lecture we will briefly survey methods for approximate inference using sample-based and “deterministic” (e.g., variational Bayes) approximations. In addition, we will consider generalizations to BNs (including noisy-OR nodes, and BNs with continuous random variables), and methods for learning the parameters and also the structure of BNs.
Exact inference using variable elimination is not tractable for general Bayes Nets. For this reason, two main kinds of approximate inference are adopted: sample-based methods and deterministic approximations.
In fact, it is this latter class of deterministic approximations that has been especially effective in getting BNs to scale to the real world. For example, this is the technique used by Microsoft for the Xbox 360 Live application TrueSkill.
Another inference method that is sometimes used is loopy belief propagation, a fast, approximate method that can be effective. In this approach, the sum and product messages of variable elimination are sent as though the BN were a polytree. The algorithm can be useful when the message passing converges, although it will sometimes continue forever. Some theory is available to explain why loopy propagation is effective, although it started as a heuristic approach for making progress.
Many of today’s applications of Bayes Net methods are only possible because of deterministic approximation methods such as variational Bayes and EP. Gibbs sampling can also be effective and finds wide application. But don’t despair of the exact methods of variable elimination and the corresponding message passing algorithms. They can be used to great effect for nicely structured Bayes Nets, such as those we see in application to Hidden Markov Models, which are used widely for speech recognition and other natural language understanding tasks. It is just that for “dirty” real-world applications, we often need something more heuristic and a lot faster.
This is nicely explained in R&N. The technical nitty-gritty, for example the derivation on pp. 537-538 of why Gibbs sampling works, is beyond the scope of this course. But you should be sure to understand the basic ideas. The basic approaches for using sample-based methods to estimate P(Q | e) for evidence e ∈ E in a Bayes Net are as follows:
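One standard sample-based estimator covered in R&N is rejection sampling: draw samples from the network in topological order, discard those that contradict the evidence, and use the remaining samples to estimate P(Q | e). The sketch below is only an illustration under stated assumptions. The two-node network Cloudy → Rain, its probabilities, and the function names are hypothetical and do not come from the lecture.

```python
import random

# Hypothetical two-node network Cloudy -> Rain; all numbers are illustrative.
P_CLOUDY = 0.5
P_RAIN_GIVEN = {True: 0.8, False: 0.1}  # P(Rain = true | Cloudy)


def prior_sample(rng):
    """Sample (cloudy, rain) from the network in topological order."""
    cloudy = rng.random() < P_CLOUDY
    rain = rng.random() < P_RAIN_GIVEN[cloudy]
    return cloudy, rain


def rejection_sample(n, rng):
    """Estimate P(Cloudy = true | Rain = true) from n prior samples."""
    kept = 0
    hits = 0
    for _ in range(n):
        cloudy, rain = prior_sample(rng)
        if not rain:
            continue  # sample contradicts the evidence, so reject it
        kept += 1
        hits += cloudy
    return hits / kept if kept else float("nan")


def exact_posterior():
    """Exact P(Cloudy = true | Rain = true) by Bayes' rule, for comparison."""
    num = P_CLOUDY * P_RAIN_GIVEN[True]
    den = num + (1 - P_CLOUDY) * P_RAIN_GIVEN[False]
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    print("estimate:", rejection_sample(100_000, rng))
    print("exact:   ", exact_posterior())
```

The weakness of this approach is visible in the `continue` branch: when the evidence is unlikely, most samples are thrown away, which is one motivation for alternatives such as Gibbs sampling.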
This problem of insufficient data arises in the Hepar II system for liver disease diagnosis (Onisko et al.), where the researchers found it useful to reformulate a Bayes Net for liver diagnosis to allow for fewer parameters. Specifically, in addition to moving from a single-diagnosis to a multiple-diagnosis model (see the Figure in the slides from class), which provides for a network in which nodes tend to have fewer parents, the authors also leveraged the idea of a Noisy-OR representation for CPTs. Noisy-OR represents a CPT with H parents, which requires 2^H numbers when all random variables are binary, as a parameterized CPT that requires only H numbers. It is normal to ask an expert (or a panel of experts) whether or not a particular node can be usefully modeled as a Noisy-OR node, in which case this saving in parameters can be realized and a predictive model learned with less data.
The lecture slides present an example in which cold, flu and malaria all cause fever. Whereas logical OR would insist on fever whenever cold ∨ flu ∨ malaria holds, Noisy-OR makes the following alternative assumption:
Each cause leads to the symptom independently, but the causal effect of each cause may be inhibited.
For example, the probabilistic information about cold would be represented as
$$q_{\text{cold}} = P(\neg\text{fever} \mid \text{cold}, \neg\text{flu}, \neg\text{malaria}) = 0.6,$$
to indicate that the probability that the fever effect of a cold is inhibited is 0.6, and so a patient with only a cold is still quite likely not to have a fever. Based on similar numbers $q_{\text{flu}}$ and $q_{\text{malaria}}$ (see the R&N handout), the full CPT for the different truth values of the three parents is constructed as follows:
For each possible set of truth value assignments to the parents, the probability that the patient does NOT have fever is defined to be the product of the probabilities $q_j$ for each parent $X_j$ that is true. For example, if
$$q_{\text{malaria}} = P(\neg\text{fever} \mid \text{malaria}, \neg\text{cold}, \neg\text{flu}) = 0.1,$$
then
$$P(\neg\text{fever} \mid \text{cold}, \text{malaria}, \neg\text{flu}) = q_{\text{cold}} \times q_{\text{malaria}} = 0.6 \times 0.1 = 0.06$$
and
$$P(\text{fever} \mid \text{cold}, \text{malaria}, \neg\text{flu}) = 1 - q_{\text{cold}} \times q_{\text{malaria}} = 1 - 0.06 = 0.94.$$
Under the Noisy-OR model, the only way the patient does not have a fever is when both the fever due to cold and the fever due to malaria are inhibited. As a formula, we have
$$P(X_j = 1 \mid \mathrm{Pa}[X_j]) = 1 - \prod_{k \,:\, X_k = \text{true},\; k \in \mathrm{Pa}[X_j]} q_k \qquad (2)$$
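The Noisy-OR construction can be sketched in a few lines: H inhibition probabilities are expanded into all 2^H rows of the CPT. This is a minimal illustration, not the Hepar II implementation. The values for cold and malaria are the ones used above, while `q_flu = 0.2` is an assumption taken from the R&N example that the notes refer to; the function names are hypothetical.

```python
from itertools import product

# Inhibition probabilities q_k. The cold and malaria values appear in the
# notes; the flu value is an assumption borrowed from the R&N example.
Q = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}


def noisy_or(q, active):
    """P(effect = true | parents), where `active` holds the parents that are true."""
    p_inhibited = 1.0
    for parent in active:
        p_inhibited *= q[parent]  # every active cause must be inhibited
    return 1.0 - p_inhibited


def full_cpt(q):
    """Expand the H parameters into the 2^H rows of the full CPT."""
    parents = list(q)
    table = {}
    for values in product([False, True], repeat=len(parents)):
        active = {p for p, v in zip(parents, values) if v}
        table[values] = noisy_or(q, active)
    return table


if __name__ == "__main__":
    for row, p_fever in full_cpt(Q).items():
        print(dict(zip(Q, row)), round(p_fever, 3))
```

With no parent true the product is empty, so P(fever) is 0. This reflects the assumption that the listed causes are the only ones.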
In Hepar II it was possible to identify 27 of the 62 nodes as Noisy-OR nodes and the reformulation of the Bayes Net required only 1488 instead of 3714 parameters to define. This provided as much as a 5% reduction in the classification error.
Just to slightly complement the treatment in R&N, note the following in regard to the discrete Buy random variable:
(a) The consumer has a hard threshold for whether or not she will buy the fruit: she buys if the cost c of the fruit is less than her threshold.
(b) The threshold of a consumer is modeled as a Gaussian with mean $\mu_b$ and variance $\sigma_b^2$.
(c) Based on this, the probability that a consumer will buy the fruit given cost c depends on whether or not the cost is below the threshold, and the parameters $\mu_b$ and $\sigma_b$ induce a probability P(Buy = true | c) for different possible costs c. This induced distribution, which is itself a probit distribution, is in turn also parameterized by $\mu_b$ and $\sigma_b$, since it is induced by the Gaussian distribution on the threshold.
(d) Specifically, we have
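$$\begin{aligned} P(\text{Buy} = \text{true} \mid c) = P(\text{threshold} \ge c) &= 1 - \int_{z=-\infty}^{c} N(Z = z \mid \mu_b, \sigma_b^2)\,dz \qquad (3) \\ &= 1 - \Phi\!\left(\frac{c - \mu_b}{\sigma_b}\right) = \Phi\!\left(\frac{\mu_b - c}{\sigma_b}\right) \end{aligned}$$
where $\Phi\!\left(\frac{c - \mu_b}{\sigma_b}\right)$ is the cumulative distribution function of the unit Gaussian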
(mean 0, variance 1). Note that the overall farm subsidy model has 11 parameters: P(subsidy) has 1; the Gaussian $N(h \mid \mu_h, \sigma_h^2)$ for harvest has 2; the linear Gaussians for cost, $N(c \mid a_t h + b_t, \sigma_t^2)$ and $N(c \mid a_f h + b_f, \sigma_f^2)$ depending on whether the subsidy is true or false, have 6 in total; and the probit distribution for the buy decision has 2.
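The probit form of P(Buy = true | c) can be checked numerically. The following sketch evaluates Φ through the error function in Python's standard library. The values chosen for μ_b and σ_b are hypothetical and serve only to show the shape of the curve.

```python
import math


def std_normal_cdf(x):
    """Phi(x), the CDF of the unit Gaussian, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_buy(cost, mu_b, sigma_b):
    """P(Buy = true | c) = Phi((mu_b - c) / sigma_b)."""
    return std_normal_cdf((mu_b - cost) / sigma_b)


if __name__ == "__main__":
    mu_b, sigma_b = 6.0, 1.0  # hypothetical threshold mean and standard deviation
    for c in (4.0, 6.0, 8.0):
        print(f"cost={c}: P(buy)={p_buy(c, mu_b, sigma_b):.4f}")
```

The probability is exactly 0.5 when the cost equals the mean threshold and falls smoothly as the cost rises. A smaller σ_b makes the transition sharper, approaching a hard threshold.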
Given that continuous random variables can be handled with BNs, it is possible to capture distributions on parameters in a Bayes Net. This is done by making the parameters themselves (continuous) random variables. For example, the parameter $\theta_2^T$, for whether or not $X_2$ is true given that its parent C is true, is itself modeled as a random variable with a Beta distribution, $\theta_2^T \sim \mathrm{Beta}(\theta_2^T \mid \alpha_2^T, \beta_2^T)$. Note that there are now (fixed) parameters $\alpha_2^T$ and $\beta_2^T$ to learn, which induce a distribution on $\theta_2^T$. These $(\alpha, \beta)$ parameters of the Beta distribution may themselves have been learned by combining a prior on $\theta_2^T$ (itself expressed as a Beta distribution) with observed counts of when $X_2$ is true and false when its parent C is true.
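Combining a Beta prior with observed counts amounts to simple addition, because the Beta distribution is conjugate to the Bernoulli likelihood. The sketch below illustrates this update. The prior Beta(1, 1) and the counts are hypothetical numbers, and the function names are not from the lecture.

```python
def beta_posterior(alpha_prior, beta_prior, n_true, n_false):
    """Posterior Beta parameters after observing n_true and n_false outcomes."""
    return alpha_prior + n_true, beta_prior + n_false


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta), i.e. the expected value of theta."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    # Hypothetical example: uniform prior Beta(1, 1), then among the cases
    # where C is true, X2 was observed true 7 times and false 3 times.
    alpha, beta = beta_posterior(1.0, 1.0, n_true=7, n_false=3)
    print("posterior:", (alpha, beta))
    print("expected theta:", beta_mean(alpha, beta))
```

The prior parameters act like pseudo-counts: a larger α + β in the prior means more data is needed to move the estimate.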
In the E step, we compute the expected values of the latent variables and thus expected values of the sufficient statistics, while in the M step the parameter values are chosen so as to maximize the expected likelihood, which is equivalent to maximizing the likelihood of the expected sufficient statistics. The M step in Bayesian networks is very easy: just use the above formula with the expected sufficient statistics replacing the real ones. It turns out that the E step involves a series of calls to a Bayesian network inference algorithm. For each data instance $x_i$ with observations (evidence) $e_i$, for each value $x$ of an unobserved (or latent) variable $X_j \in \{X_{\ell+1}, \ldots, X_m\}$, and for each value $\mu$ of $\mathrm{Pa}[X_j]$, we need to compute $P(X_j = x, \mathrm{Pa}[X_j] = \mu \mid e_i)$. This is a standard Bayesian network computation. We can then compute the expected sufficient statistics:
$$E[N_{j,x\mu}] = \sum_{i=1}^{n} P(X_j = x, \mathrm{Pa}[X_j] = \mu \mid e_i). \qquad (6)$$
We have seen, in a nutshell, how to learn the parameters of a Bayesian network from data, including missing data.
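The E and M steps described above can be sketched for the smallest interesting case. The example assumes a hypothetical two-node network C → X with binary variables, where X is always observed and C is sometimes missing. The M step is assumed to be the usual relative-frequency estimate applied to expected counts. The data and starting parameters are made up for illustration.

```python
def e_step(data, p_c, p_x_given_c):
    """Expected counts N[(c, x)] under the current parameters.

    data: list of (c, x) pairs; c is True, False, or None when unobserved.
    p_c: P(C = true). p_x_given_c: dict mapping c to P(X = true | C = c).
    """
    counts = {(c, x): 0.0 for c in (True, False) for x in (True, False)}
    for c_obs, x in data:
        if c_obs is not None:
            counts[(c_obs, x)] += 1.0  # fully observed: an ordinary count
            continue
        # Inference call: posterior over the latent C given evidence X = x.
        joint = {}
        for c in (True, False):
            prior = p_c if c else 1.0 - p_c
            like = p_x_given_c[c] if x else 1.0 - p_x_given_c[c]
            joint[c] = prior * like
        z = joint[True] + joint[False]
        for c in (True, False):
            counts[(c, x)] += joint[c] / z  # fractional (expected) count
    return counts


def m_step(counts):
    """Relative-frequency estimates computed from the expected counts."""
    n_c = {c: counts[(c, True)] + counts[(c, False)] for c in (True, False)}
    p_c = n_c[True] / (n_c[True] + n_c[False])
    p_x_given_c = {c: counts[(c, True)] / n_c[c] for c in (True, False)}
    return p_c, p_x_given_c


def run_em(data, p_c, p_x_given_c, iterations):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(iterations):
        p_c, p_x_given_c = m_step(e_step(data, p_c, p_x_given_c))
    return p_c, p_x_given_c


if __name__ == "__main__":
    data = [(True, True), (True, True), (False, False), (None, True),
            (None, False), (True, False), (False, False), (None, True)]
    print(run_em(data, 0.5, {True: 0.6, False: 0.4}, iterations=20))
```

Each instance contributes a total weight of one to the expected counts, split across the values of C according to its posterior, so the expected counts always sum to the number of data instances.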
Learning the structure of a Bayesian Network is a different story. We will consider only the complete data case: learning a Bayesian network structure with missing data is quite difficult, though it can be done.
With complete data, we make the following observation. For any structure, we can quickly compute (via maximum log likelihood) the best possible parameters for that structure, by counting sufficient statistics. We can then quickly compute the likelihood of the data for that structure. This determines a score for the structure. The problem then becomes one of searching for the structure with the highest score. In practice, greedy search algorithms are often used. At each step, they take an existing structure and modify it slightly, perhaps by adding, deleting or reversing an edge.
One important issue in structure search is overfitting. With a maximally connected Bayesian network, one can represent any probability distribution, but obviously that will have difficulty generalizing (not to mention be exponential in size and impossible to work with). In general one wants a network with as small a number of edges as possible, or, more precisely, where nodes have as small a number of parents as possible so that the representation size of CPTs is small. One common technique to achieve the right effect is to add a penalty term to the score of a network, to penalize those networks that are too complex.
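The scoring idea can be sketched for complete binary data. The code below computes the maximum log-likelihood of a structure by counting and subtracts a complexity penalty. The specific penalty used here, half the log of the sample size per CPT parameter (a BIC-style term), is an assumption chosen for illustration, since the notes do not name one. The data sets and function names are likewise hypothetical. Only the scoring function is shown; a greedy search would call it on each neighbouring structure.

```python
import math
from collections import Counter


def log_likelihood(data, parents):
    """Maximum log-likelihood of complete data for a given structure.

    data: list of dicts mapping variable name to bool.
    parents: dict mapping each variable to a tuple of its parents.
    """
    total = 0.0
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_val, _), n in joint.items():
            total += n * math.log(n / marg[pa_val])  # ML estimate is n / marg
    return total


def num_params(parents):
    """One free parameter per parent configuration of each binary node."""
    return sum(2 ** len(pa) for pa in parents.values())


def score(data, parents):
    """Log-likelihood minus an assumed BIC-style complexity penalty."""
    penalty = 0.5 * math.log(len(data)) * num_params(parents)
    return log_likelihood(data, parents) - penalty


if __name__ == "__main__":
    correlated = ([{"X": True, "Y": True}] * 40 + [{"X": False, "Y": False}] * 40
                  + [{"X": True, "Y": False}] * 10 + [{"X": False, "Y": True}] * 10)
    empty = {"X": (), "Y": ()}
    edge = {"X": (), "Y": ("X",)}
    print("no edge:", score(correlated, empty))
    print("X -> Y: ", score(correlated, edge))
```

Without the penalty, adding an edge can never lower the likelihood, so the search would always drift toward the maximally connected network. The penalty makes an extra edge pay for itself.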