Learning Bayesian Networks
Richard E. Neapolitan
Northeastern Illinois University
Chicago, Illinois
In memory of my dad, a difficult but loving father, who raised me well.

  • Preface
  • I Basics
  • 1 Introduction to Bayesian Networks
    • 1.1 Basics of Probability Theory
      • 1.1.1 Probability Functions and Spaces
      • 1.1.2 Conditional Probability and Independence
      • 1.1.3 Bayes’ Theorem
      • 1.1.4 Random Variables and Joint Probability Distributions
    • 1.2 Bayesian Inference
      • 1.2.1 Random Variables and Probabilities in Bayesian Applications
      • 1.2.2 A Definition of Random Variables and Joint Probability Distributions for Bayesian Inference
      • 1.2.3 A Classical Example of Bayesian Inference
    • 1.3 Large Instances / Bayesian Networks
      • 1.3.1 The Difficulties Inherent in Large Instances
      • 1.3.2 The Markov Condition
      • 1.3.3 Bayesian Networks
      • 1.3.4 A Large Bayesian Network
    • 1.4 Creating Bayesian Networks Using Causal Edges
      • 1.4.1 Ascertaining Causal Influences Using Manipulation
      • 1.4.2 Causation and the Markov Condition
  • 2 More DAG/Probability Relationships
    • 2.1 Entailed Conditional Independencies
      • 2.1.1 Examples of Entailed Conditional Independencies
      • 2.1.2 d-Separation
      • 2.1.3 Finding d-Separations
    • 2.2 Markov Equivalence
    • 2.3 Entailing Dependencies with a DAG
      • 2.3.1 Faithfulness
      • 2.3.2 Embedded Faithfulness
    • 2.4 Minimality
    • 2.5 Markov Blankets and Boundaries
    • 2.6 More on Causal DAGs
      • 2.6.1 The Causal Minimality Assumption
      • 2.6.2 The Causal Faithfulness Assumption
      • 2.6.3 The Causal Embedded Faithfulness Assumption
  • II Inference
  • 3 Inference: Discrete Variables
    • 3.1 Examples of Inference
    • 3.2 Pearl’s Message-Passing Algorithm
      • 3.2.1 Inference in Trees
      • 3.2.2 Inference in Singly-Connected Networks
      • 3.2.3 Inference in Multiply-Connected Networks
      • 3.2.4 Complexity of the Algorithm
    • 3.3 The Noisy OR-Gate Model
      • 3.3.1 The Model
      • 3.3.2 Doing Inference With the Model
      • 3.3.3 Further Models
    • 3.4 Other Algorithms that Employ the DAG
    • 3.5 The SPI Algorithm
      • 3.5.1 The Optimal Factoring Problem
      • 3.5.2 Application to Probabilistic Inference
    • 3.6 Complexity of Inference
    • 3.7 Relationship to Human Reasoning
      • 3.7.1 The Causal Network Model
      • 3.7.2 Studies Testing the Causal Network Model
  • 4 More Inference Algorithms
    • 4.1 Continuous Variable Inference
      • 4.1.1 The Normal Distribution
      • 4.1.2 An Example Concerning Continuous Variables
      • 4.1.3 An Algorithm for Continuous Variables
    • 4.2 Approximate Inference
      • 4.2.1 A Brief Review of Sampling
      • 4.2.2 Logic Sampling
      • 4.2.3 Likelihood Weighting
    • 4.3 Abductive Inference
      • 4.3.1 Abductive Inference in Bayesian Networks
      • 4.3.2 A Best-First Search Algorithm for Abductive Inference
  • 5 Influence Diagrams
    • 5.1 Decision Trees
      • 5.1.1 Simple Examples
      • 5.1.2 Probabilities, Time, and Risk Attitudes
      • 5.1.3 Solving Decision Trees
      • 5.1.4 More Examples
    • 5.2 Influence Diagrams
      • 5.2.1 Representing with Influence Diagrams
      • 5.2.2 Solving Influence Diagrams
    • 5.3 Dynamic Networks
      • 5.3.1 Dynamic Bayesian Networks
      • 5.3.2 Dynamic Influence Diagrams
  • III Learning
  • 6 Parameter Learning: Binary Variables
    • 6.1 Learning a Single Parameter
      • 6.1.1 Probability Distributions of Relative Frequencies
      • 6.1.2 Learning a Relative Frequency
    • 6.2 More on the Beta Density Function
      • 6.2.1 Non-integral Values of a and b
      • 6.2.2 Assessing the Values of a and b
      • 6.2.3 Why the Beta Density Function?
    • 6.3 Computing a Probability Interval
    • 6.4 Learning Parameters in a Bayesian Network
      • 6.4.1 Urn Examples
      • 6.4.2 Augmented Bayesian Networks
      • 6.4.3 Learning Using an Augmented Bayesian Network
      • 6.4.4 A Problem with Updating; Using an Equivalent Sample Size
    • 6.5 Learning with Missing Data Items
      • 6.5.1 Data Items Missing at Random
      • 6.5.2 Data Items Missing Not at Random
    • 6.6 Variances in Computed Relative Frequencies
      • 6.6.1 A Simple Variance Determination
      • 6.6.2 The Variance and Equivalent Sample Size
      • 6.6.3 Computing Variances in Larger Networks
      • 6.6.4 When Do Variances Become Large?
  • 7 More Parameter Learning
    • 7.1 Multinomial Variables
      • 7.1.1 Learning a Single Parameter
      • 7.1.2 More on the Dirichlet Density Function
      • 7.1.3 Computing Probability Intervals and Regions
      • 7.1.4 Learning Parameters in a Bayesian Network
      • 7.1.5 Learning with Missing Data Items
      • 7.1.6 Variances in Computed Relative Frequencies
    • 7.2 Continuous Variables
      • 7.2.1 Normally Distributed Variable
      • 7.2.2 Multivariate Normally Distributed Variables
      • 7.2.3 Gaussian Bayesian Networks
  • 8 Bayesian Structure Learning
    • 8.1 Learning Structure: Discrete Variables
      • 8.1.1 Schema for Learning Structure
      • 8.1.2 Procedure for Learning Structure
      • 8.1.3 Learning From a Mixture of Observational and Experimental Data
      • 8.1.4 Complexity of Structure Learning
    • 8.2 Model Averaging
    • 8.3 Learning Structure with Missing Data
      • 8.3.1 Monte Carlo Methods
      • 8.3.2 Large-Sample Approximations
    • 8.4 Probabilistic Model Selection
      • 8.4.1 Probabilistic Models
      • 8.4.2 The Model Selection Problem
      • 8.4.3 Using the Bayesian Scoring Criterion for Model Selection
    • 8.5 Hidden Variable DAG Models
      • 8.5.1 Models Containing More Conditional Independencies than DAG Models
      • 8.5.2 Models Containing the Same Conditional Independencies as DAG Models
      • 8.5.3 Dimension of Hidden Variable DAG Models
      • 8.5.4 Number of Models and Hidden Variables
      • 8.5.5 Efficient Model Scoring
    • 8.6 Learning Structure: Continuous Variables
      • 8.6.1 The Density Function of D
      • 8.6.2 The Density function of D Given a DAG pattern
    • 8.7 Learning Dynamic Bayesian Networks
  • 9 Approximate Bayesian Structure Learning
    • 9.1 Approximate Model Selection
      • 9.1.1 Algorithms that Search over DAGs
      • 9.1.2 Algorithms that Search over DAG Patterns
      • 9.1.3 An Algorithm Assuming Missing Data or Hidden Variables
    • 9.2 Approximate Model Averaging
      • 9.2.1 A Model Averaging Example
      • 9.2.2 Approximate Model Averaging Using MCMC
  • 10 Constraint-Based Learning
    • 10.1 Algorithms Assuming Faithfulness
      • 10.1.1 Simple Examples
      • 10.1.2 Algorithms for Determining DAG patterns
      • 10.1.3 Determining if a Set Admits a Faithful DAG Representation
      • 10.1.4 Application to Probability
    • 10.2 Assuming Only Embedded Faithfulness
      • 10.2.1 Inducing Chains
      • 10.2.2 A Basic Algorithm
      • 10.2.3 Application to Probability
      • 10.2.4 Application to Learning Causal Influences
    • 10.3 Obtaining the d-separations
      • 10.3.1 Discrete Bayesian Networks
      • 10.3.2 Gaussian Bayesian Networks
    • 10.4 Relationship to Human Reasoning
      • 10.4.1 Background Theory
      • 10.4.2 A Statistical Notion of Causality
  • 11 More Structure Learning
    • 11.1 Comparing the Methods
      • 11.1.1 A Simple Example
      • 11.1.2 Learning College Attendance Influences
      • 11.1.3 Conclusions
    • 11.2 Data Compression Scoring Criteria
    • 11.3 Parallel Learning of Bayesian Networks
    • 11.4 Examples
      • 11.4.1 Structure Learning
      • 11.4.2 Inferring Causal Relationships
  • IV Applications
  • 12 Applications
    • 12.1 Applications Based on Bayesian Networks
    • 12.2 Beyond Bayesian networks
  • Bibliography
  • Index


networks and influence diagrams. Chapters 6-10 address learning. Chapters 6 and 7 concern parameter learning. Since the notation for these learning algorithms is somewhat arduous, I introduce the algorithms by discussing binary variables in Chapter 6. I then generalize to multinomial variables in Chapter 7. Furthermore, in Chapter 7 I discuss learning parameters when the variables are continuous. Chapters 8, 9, and 10 concern structure learning. Chapter 8 shows the Bayesian method for learning structure in the cases of both discrete and continuous variables, while Chapter 9 discusses the constraint-based method for learning structure. Chapter 10 compares the Bayesian and constraint-based methods, and it presents several real-world examples of learning Bayesian networks. The text ends by referencing applications of Bayesian networks in Chapter 11.

This is a text on learning Bayesian networks; it is not a text on artificial intelligence, expert systems, or decision analysis. However, since these are fields in which Bayesian networks find application, they emerge frequently throughout the text. Indeed, I have used the manuscript for this text in my course on expert systems at Northeastern Illinois University. In one semester, I have found that I can cover the core of the following chapters: 1, 2, 3, 5, 6, 7, 8, and 9.

I would like to thank those researchers who have provided valuable corrections, comments, and dialog concerning the material in this text. They include Bruce D’Ambrosio, David Maxwell Chickering, Gregory Cooper, Tom Dean, Carl Entemann, John Erickson, Finn Jensen, Clark Glymour, Piotr Gmytrasiewicz, David Heckerman, Xia Jiang, James Kenevan, Henry Kyburg, Kathryn Blackmond Laskey, Don Labudde, David Madigan, Christopher Meek, Paul-André Monney, Scott Morris, Peter Norvig, Judea Pearl, Richard Scheines, Marco Valtorta, Alex Wolpert, and Sandy Zabell. I thank Sue Coyle for helping me draw the cartoon containing the robots.

Part I

Basics

Chapter 1

Introduction to Bayesian Networks

Consider the situation where one feature of an entity has a direct influence on another feature of that entity. For example, the presence or absence of a disease in a human being has a direct influence on whether a test for that disease turns out positive or negative. For decades, Bayes’ theorem has been used to perform probabilistic inference in this situation. In the current example, we would use that theorem to compute the conditional probability of an individual having a disease when a test for the disease came back positive. Consider next the situation where several features are related through inference chains. For example, whether or not an individual has a history of smoking has a direct influence both on whether or not that individual has bronchitis and on whether or not that individual has lung cancer. In turn, the presence or absence of each of these diseases has a direct influence on whether or not the individual experiences fatigue. Also, the presence or absence of lung cancer has a direct influence on whether or not a chest X-ray is positive. In this situation, we would want to do probabilistic inference involving features that are not related via a direct influence. We would want to determine, for example, the conditional probabilities both of bronchitis and of lung cancer when it is known an individual smokes, is fatigued, and has a positive chest X-ray. Yet bronchitis has no direct influence (indeed no influence at all) on whether a chest X-ray is positive. Therefore, these conditional probabilities cannot be computed using a simple application of Bayes’ theorem. There is a straightforward algorithm for computing them, but the probability values it requires are not ordinarily accessible; furthermore, the algorithm has exponential space and time complexity. Bayesian networks were developed to address these difficulties.
By exploiting conditional independencies entailed by influence chains, we are able to represent a large instance in a Bayesian network using little space, and we are often able to perform probabilistic inference among the features in an acceptable amount of time. In addition, the graphical nature of Bayesian networks gives us a much better intuitive grasp of the relationships among the features.

Figure 1.1: A Bayesian network. The nodes are H, B, L, F, and C; the figure annotates them with $P(h_1)$; $P(b_1|h_1)$ and $P(b_1|h_2)$; $P(l_1|h_1)$ and $P(l_1|h_2)$; $P(f_1|b_1,l_1)$, $P(f_1|b_1,l_2)$, $P(f_1|b_2,l_1)$, and $P(f_1|b_2,l_2)$; and $P(c_1|l_1)$ and $P(c_1|l_2)$.

Figure 1.1 shows a Bayesian network representing the probabilistic relationships among the features just discussed. The values of the features in that network represent the following:

Feature   Value    When the Feature Takes this Value
H         $h_1$    There is a history of smoking
          $h_2$    There is no history of smoking
B         $b_1$    Bronchitis is present
          $b_2$    Bronchitis is absent
L         $l_1$    Lung cancer is present
          $l_2$    Lung cancer is absent
F         $f_1$    Fatigue is present
          $f_2$    Fatigue is absent
C         $c_1$    Chest X-ray is positive
          $c_2$    Chest X-ray is negative

This Bayesian network is discussed in Example 1.32 in Section 1.3.3 after we provide the theory of Bayesian networks. Presently, we only use it to illustrate the nature and use of Bayesian networks. First, in this Bayesian network (called a causal network) the edges represent direct influences. For example, there is an edge from H to L because a history of smoking has a direct influence on the presence of lung cancer, and there is an edge from L to C because the presence of lung cancer has a direct influence on the result of a chest X-ray. There is no


1.1.1 Probability Functions and Spaces

In 1933 A.N. Kolmogorov developed the set-theoretic definition of probability, which serves as a mathematical foundation for all applications of probability. We start by providing that definition.

Probability theory has to do with experiments that have a set of distinct outcomes. Examples of such experiments include drawing the top card from a deck of 52 cards with the 52 outcomes being the 52 different faces of the cards; flipping a two-sided coin with the two outcomes being ‘heads’ and ‘tails’; picking a person from a population and determining whether the person is a smoker with the two outcomes being ‘smoker’ and ‘non-smoker’; picking a person from a population and determining whether the person has lung cancer with the two outcomes being ‘having lung cancer’ and ‘not having lung cancer’; after identifying 5 levels of serum calcium, picking a person from a population and determining the individual’s serum calcium level with the 5 outcomes being each of the 5 levels; picking a person from a population and determining the individual’s serum calcium level with the infinite number of outcomes being the continuum of possible calcium levels. The last two experiments illustrate two points. First, the experiment is not well-defined until we identify a set of outcomes. The same act (picking a person and measuring that person’s serum calcium level) can be associated with many different experiments, depending on what we consider a distinct outcome. Second, the set of outcomes can be infinite.

Once an experiment is well-defined, the collection of all outcomes is called the sample space. Mathematically, a sample space is a set and the outcomes are the elements of the set. To keep this review simple, we restrict ourselves to finite sample spaces in what follows. (You should consult a mathematical probability text such as [Ash, 1970] for a discussion of infinite sample spaces.) In the case of a finite sample space, every subset of the sample space is called an event.
A subset containing exactly one element is called an elementary event. Once a sample space is identified, a probability function is defined as follows:

Definition 1.1 Suppose we have a sample space $\Omega$ containing $n$ distinct elements. That is, $\Omega = \{e_1, e_2, \ldots, e_n\}$. A function that assigns a real number $P(E)$ to each event $E \subseteq \Omega$ is called a probability function on the set of subsets of $\Omega$ if it satisfies the following conditions:

  1. $0 \le P(\{e_i\}) \le 1$ for $1 \le i \le n$.
  2. $P(\{e_1\}) + P(\{e_2\}) + \cdots + P(\{e_n\}) = 1$.
  3. For each event $E = \{e_{i_1}, e_{i_2}, \ldots, e_{i_k}\}$ that is not an elementary event,

$$P(E) = P(\{e_{i_1}\}) + P(\{e_{i_2}\}) + \cdots + P(\{e_{i_k}\}).$$

The pair $(\Omega, P)$ is called a probability space.
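Definition 1.1 can be checked mechanically for a small sample space. The Python sketch below is illustrative only: the helper names `make_probability_function` and `prob`, and the three-outcome sample space, are assumptions made for this example and do not come from the text. It stores the probabilities of the elementary events and obtains $P(E)$ for any other event by summing over its elements, as Condition 3 requires.

```python
from fractions import Fraction


def make_probability_function(elementary):
    """Build P from a dict mapping each outcome e_i to P({e_i}).

    Checks Conditions 1 and 2 of Definition 1.1; Condition 3 holds
    by construction, since P(E) is computed as a sum.
    """
    if not all(0 <= p <= 1 for p in elementary.values()):
        raise ValueError("each P({e_i}) must lie in [0, 1]")
    if sum(elementary.values()) != 1:
        raise ValueError("the P({e_i}) must sum to 1")

    def prob(event):
        # An event is any subset of the sample space.
        event = set(event)
        if not event <= set(elementary):
            raise ValueError("event is not a subset of the sample space")
        return sum((elementary[e] for e in event), Fraction(0))

    return prob


# Hypothetical three-outcome sample space, chosen only for illustration.
P = make_probability_function(
    {"e1": Fraction(1, 2), "e2": Fraction(1, 3), "e3": Fraction(1, 6)}
)
print(P({"e1"}))        # 1/2
print(P({"e1", "e3"}))  # 2/3
print(P(set()))         # 0
```

Exact fractions are used instead of floating-point numbers so that the sum-to-one check in Condition 2 is not disturbed by rounding.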


We often just say P is a probability function on Ω rather than saying on the set of subsets of Ω. Intuition for probability functions comes from considering games of chance as the following example illustrates.

Example 1.1 Let the experiment be drawing the top card from a deck of 52 cards. Then $\Omega$ contains the faces of the 52 cards, and using the principle of indifference, we assign $P(\{e\}) = 1/52$ for each $e \in \Omega$. Therefore, if we let $kh$ and $ks$ stand for the king of hearts and king of spades respectively, $P(\{kh\}) = 1/52$, $P(\{ks\}) = 1/52$, and $P(\{kh, ks\}) = P(\{kh\}) + P(\{ks\}) = 1/26$.

The principle of indifference (a term popularized by J.M. Keynes in 1921) says elementary events are to be considered equiprobable if we have no reason to expect or prefer one over the other. According to this principle, when there are $n$ elementary events the probability of each of them is the ratio $1/n$. This is the way we often assign probabilities in games of chance, and a probability so assigned is called a ratio. The following example shows a probability that cannot be computed using the principle of indifference.

Example 1.2 Suppose we toss a thumbtack and consider as outcomes the two ways it could land. It could land on its head, which we will call ‘heads’, or it could land with the edge of the head and the end of the point touching the ground, which we will call ‘tails’. Due to the lack of symmetry in a thumbtack, we would not assign a probability of $1/2$ to each of these events. So how can we compute the probability? This experiment can be repeated many times. In 1919 Richard von Mises developed the relative frequency approach to probability which says that, if an experiment can be repeated many times, the probability of any one of the outcomes is the limit, as the number of trials approaches infinity, of the ratio of the number of occurrences of that outcome to the total number of trials. For example, if $m$ is the number of trials,

$$P(\{\mathit{heads}\}) = \lim_{m \to \infty} \frac{\#\mathit{heads}}{m}.$$

So, if we tossed the thumbtack 10,000 times and it landed heads 3373 times, we would estimate the probability of heads to be about .3373.
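The relative frequency approach in Example 1.2 can be imitated with a small simulation. The Python sketch below is only an illustration: the value `P_HEADS = 0.34` is a hypothetical long-run frequency chosen for the simulation, not a measured property of any thumbtack, and the function names are invented for this example. It prints the relative frequency of heads after increasing numbers of simulated tosses.

```python
import random


def relative_frequency(outcomes, target):
    """Fraction of trials in `outcomes` whose result equals `target`."""
    return sum(1 for o in outcomes if o == target) / len(outcomes)


def toss_thumbtack(rng, p_heads):
    """Simulate one toss; `p_heads` is an assumed value, not a measured one."""
    return "heads" if rng.random() < p_heads else "tails"


rng = random.Random(0)   # fixed seed so the run is repeatable
P_HEADS = 0.34           # hypothetical long-run frequency for the simulation

tosses = [toss_thumbtack(rng, P_HEADS) for _ in range(10_000)]
for m in (10, 100, 1_000, 10_000):
    # Relative frequency of heads among the first m tosses.
    print(m, relative_frequency(tosses[:m], "heads"))

# The count reported in Example 1.2 gives the estimate directly.
print(3373 / 10_000)  # 0.3373
```

With few tosses the relative frequency can be far from the assumed value; it typically settles closer to it as the number of tosses grows, which is the behavior the limit in the example describes.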

Probabilities obtained using the approach in the previous example are called relative frequencies. According to this approach, the probability obtained is not a property of any one of the trials, but rather it is a property of the entire sequence of trials. How are these probabilities related to ratios? Intuitively, we would expect if, for example, we repeatedly shuffled a deck of cards and drew the top card, the ace of spades would come up about one out of every 52 times. In 1946 J. E. Kerrich conducted many such experiments using games of chance in which the principle of indifference seemed to apply (e.g. drawing a card from a deck). His results indicated that the relative frequency does appear to approach a limit and that limit is the ratio.


patients with these exact same symptoms, to the actual relative frequency with which they have lung cancer.

It is straightforward to prove the following theorem concerning probability spaces.

Theorem 1.1 Let $(\Omega, P)$ be a probability space. Then

  1. $P(\Omega) = 1$.
  2. $0 \le P(E) \le 1$ for every $E \subseteq \Omega$.
  3. For $E$ and $F \subseteq \Omega$ such that $E \cap F = \emptyset$,

$$P(E \cup F) = P(E) + P(F).$$

Proof. The proof is left as an exercise.

The conditions in this theorem were labeled the axioms of probability theory by A.N. Kolmogorov in 1933. When Condition (3) is replaced by infinitely countable additivity, these conditions are used to define a probability space in mathematical probability texts.

Example 1.5 Suppose we draw the top card from a deck of cards. Denote by Queen the set containing the 4 queens and by King the set containing the 4 kings. Then

$$P(\mathit{Queen} \cup \mathit{King}) = P(\mathit{Queen}) + P(\mathit{King}) = 1/13 + 1/13 = 2/13$$

because $\mathit{Queen} \cap \mathit{King} = \emptyset$. Next denote by Spade the set containing the 13 spades. The sets Queen and Spade are not disjoint; so their probabilities are not additive. However, it is not hard to prove that, in general,

$$P(E \cup F) = P(E) + P(F) - P(E \cap F).$$

So

$$P(\mathit{Queen} \cup \mathit{Spade}) = P(\mathit{Queen}) + P(\mathit{Spade}) - P(\mathit{Queen} \cap \mathit{Spade}) = \frac{1}{13} + \frac{1}{4} - \frac{1}{52} = \frac{4}{13}.$$
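The arithmetic in Example 1.5 can be reproduced by counting cards. The following Python sketch assumes a standard 52-card deck with every card equally likely; the `(rank, suit)` encoding and the helper name `prob` are choices made for this illustration rather than anything defined in the text.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = set(product(RANKS, SUITS))  # 52 equiprobable (rank, suit) outcomes


def prob(event):
    """P(E) under the principle of indifference: |E| / |sample space|."""
    return Fraction(len(event), len(DECK))


queen = {card for card in DECK if card[0] == "Q"}
king = {card for card in DECK if card[0] == "K"}
spade = {card for card in DECK if card[1] == "spades"}

# Disjoint events: the probabilities simply add.
print(prob(queen | king))                               # 2/13
# Overlapping events: subtract the intersection once.
print(prob(queen) + prob(spade) - prob(queen & spade))  # 4/13
# Counting the union directly gives the same value.
print(prob(queen | spade))                              # 4/13
```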

1.1.2 Conditional Probability and Independence

We have yet to discuss one of the most important concepts in probability theory, namely conditional probability. We do that next.

Definition 1.2 Let $E$ and $F$ be events such that $P(F) \neq 0$. Then the conditional probability of $E$ given $F$, denoted $P(E|F)$, is given by

$$P(E|F) = \frac{P(E \cap F)}{P(F)}.$$

The initial intuition for conditional probability comes from considering probabilities that are ratios. In the case of ratios, $P(E|F)$, as defined above, is the fraction of items in $F$ that are also in $E$. We show this as follows. Let $n$ be the number of items in the sample space, $n_F$ be the number of items in $F$, and $n_{EF}$ be the number of items in $E \cap F$. Then

$$\frac{P(E \cap F)}{P(F)} = \frac{n_{EF}/n}{n_F/n} = \frac{n_{EF}}{n_F},$$

which is the fraction of items in $F$ that are also in $E$. As for its meaning, $P(E|F)$ is the probability of $E$ occurring given that we know $F$ has occurred.

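Example 1.6 Again consider drawing the top card from a deck of cards, let Queen be the set of the 4 queens, RoyalCard be the set of the 12 royal cards, and Spade be the set of the 13 spades. Then

$$P(\mathit{Queen}) = \frac{1}{13}$$

$$P(\mathit{Queen}|\mathit{RoyalCard}) = \frac{P(\mathit{Queen} \cap \mathit{RoyalCard})}{P(\mathit{RoyalCard})} = \frac{1/13}{3/13} = \frac{1}{3}$$

$$P(\mathit{Queen}|\mathit{Spade}) = \frac{P(\mathit{Queen} \cap \mathit{Spade})}{P(\mathit{Spade})} = \frac{1/52}{1/4} = \frac{1}{13}.$$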
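As a quick check of Definition 1.2 on the card deck, the Python sketch below counts cards rather than assuming the probabilities. The helper names `prob` and `cond_prob` and the `(rank, suit)` encoding of cards are illustrative assumptions, not part of the text; the royal cards are taken to be the jacks, queens, and kings.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = set(product(RANKS, SUITS))  # 52 equiprobable (rank, suit) outcomes


def prob(event):
    """P(E) as a ratio: number of cards in E over number of cards in the deck."""
    return Fraction(len(event), len(DECK))


def cond_prob(e, f):
    """P(E|F) = P(E ∩ F) / P(F); the definition requires P(F) != 0."""
    if prob(f) == 0:
        raise ValueError("P(F) must be nonzero")
    return prob(e & f) / prob(f)


queen = {card for card in DECK if card[0] == "Q"}
royal_card = {card for card in DECK if card[0] in {"J", "Q", "K"}}
spade = {card for card in DECK if card[1] == "spades"}

print(prob(queen))                   # 1/13
print(cond_prob(queen, royal_card))  # 1/3
print(cond_prob(queen, spade))       # 1/13
```

Because every card is equally likely, `cond_prob(e, f)` reduces to `len(e & f) / len(f)`, the fraction of items in F that are also in E.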

Notice in the previous example that $P(\mathit{Queen}|\mathit{Spade}) = P(\mathit{Queen})$. This means that finding out the card is a spade does not make it more or less probable that it is a queen. That is, the knowledge of whether it is a spade is irrelevant to whether it is a queen. We say that the two events are independent in this case, which is formalized in the following definition.

Definition 1.3 Two events $E$ and $F$ are independent if one of the following holds:

  1. $P(E|F) = P(E)$ and $P(E) \neq 0$, $P(F) \neq 0$.
  2. $P(E) = 0$ or $P(F) = 0$.
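Definition 1.3 can be tested directly on events from the card deck. The Python sketch below is a hedged illustration: the function names and the `(rank, suit)` card encoding are assumptions for this example. It implements the two conditions of the definition and, for comparison, the product condition $P(E \cap F) = P(E)P(F)$, which is an equivalent way to characterize independence.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = set(product(RANKS, SUITS))  # 52 equiprobable (rank, suit) outcomes


def prob(event):
    """P(E) as a ratio of card counts."""
    return Fraction(len(event), len(DECK))


def independent_by_definition(e, f):
    """Definition 1.3: Condition 2 first, then Condition 1."""
    if prob(e) == 0 or prob(f) == 0:
        return True
    return prob(e & f) / prob(f) == prob(e)  # P(E|F) = P(E)


def independent_by_product(e, f):
    """Equivalent test: P(E ∩ F) = P(E) P(F)."""
    return prob(e & f) == prob(e) * prob(f)


queen = {card for card in DECK if card[0] == "Q"}
royal_card = {card for card in DECK if card[0] in {"J", "Q", "K"}}
spade = {card for card in DECK if card[1] == "spades"}

for e, f in [(queen, spade), (spade, queen), (queen, royal_card), (set(), queen)]:
    print(independent_by_definition(e, f), independent_by_product(e, f))
# True True
# True True
# False False
# True True
```

The first two rows swap the roles of the two events and give the same answer, and the last row shows that an event of probability zero is independent of any other event under Condition 2.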

Notice that the definition states that the two events are independent even though it is based on the conditional probability of $E$ given $F$. The reason is that independence is symmetric. That is, if $P(E) \neq 0$ and $P(F) \neq 0$, then $P(E|F) = P(E)$ if and only if $P(F|E) = P(F)$. It is straightforward to prove that $E$ and $F$ are independent if and only if $P(E \cap F) = P(E)P(F)$. The following example illustrates an extension of the notion of independence.

Example 1.7 Let $E = \{kh, ks, qh\}$, $F = \{kh, kc, qh\}$, $G = \{kh, ks, kc, kd\}$, where $kh$ means the king of hearts, $ks$ means the king of spades, etc. Then

$$P(E) = \frac{3}{52}$$

$$P(E|F) = \frac{2}{3}$$