




























































































Artificial Intelligence
Type: Study notes





























































































PREFACE
networks and influence diagrams. Chapters 6-10 address learning. Chapters 6 and 7 concern parameter learning. Since the notation for these learning algorithms is somewhat arduous, I introduce the algorithms by discussing binary variables in Chapter 6. I then generalize to multinomial variables in Chapter 7. Furthermore, in Chapter 7 I discuss learning parameters when the variables are continuous. Chapters 8, 9, and 10 concern structure learning. Chapter 8 shows the Bayesian method for learning structure in the cases of both discrete and continuous variables, while Chapter 9 discusses the constraint-based method for learning structure. Chapter 10 compares the Bayesian and constraint-based methods, and it presents several real-world examples of learning Bayesian networks. The text ends by referencing applications of Bayesian networks in Chapter 11.

This is a text on learning Bayesian networks; it is not a text on artificial intelligence, expert systems, or decision analysis. However, since these are fields in which Bayesian networks find application, they emerge frequently throughout the text. Indeed, I have used the manuscript for this text in my course on expert systems at Northeastern Illinois University. In one semester, I have found that I can cover the core of the following chapters: 1, 2, 3, 5, 6, 7, 8, and 9.

I would like to thank those researchers who have provided valuable corrections, comments, and dialog concerning the material in this text. They include Bruce D’Ambrosio, David Maxwell Chickering, Gregory Cooper, Tom Dean, Carl Entemann, John Erickson, Finn Jensen, Clark Glymour, Piotr Gmytrasiewicz, David Heckerman, Xia Jiang, James Kenevan, Henry Kyburg, Kathryn Blackmond Laskey, Don Labudde, David Madigan, Christopher Meek, Paul-André Monney, Scott Morris, Peter Norvig, Judea Pearl, Richard Scheines, Marco Valtorta, Alex Wolpert, and Sandy Zabell. I thank Sue Coyle for helping me draw the cartoon containing the robots.
Consider the situation where one feature of an entity has a direct influence on another feature of that entity. For example, the presence or absence of a disease in a human being has a direct influence on whether a test for that disease turns out positive or negative. For decades, Bayes’ theorem has been used to perform probabilistic inference in this situation. In the current example, we would use that theorem to compute the conditional probability of an individual having a disease when a test for the disease came back positive.

Consider next the situation where several features are related through inference chains. For example, whether or not an individual has a history of smoking has a direct influence both on whether or not that individual has bronchitis and on whether or not that individual has lung cancer. In turn, the presence or absence of each of these diseases has a direct influence on whether or not the individual experiences fatigue. Also, the presence or absence of lung cancer has a direct influence on whether or not a chest X-ray is positive. In this situation, we would want to do probabilistic inference involving features that are not related via a direct influence. We would want to determine, for example, the conditional probabilities both of bronchitis and of lung cancer when it is known that an individual smokes, is fatigued, and has a positive chest X-ray. Yet bronchitis has no direct influence (indeed no influence at all) on whether a chest X-ray is positive. Therefore, these conditional probabilities cannot be computed using a simple application of Bayes’ theorem. There is a straightforward algorithm for computing them, but the probability values it requires are not ordinarily accessible; furthermore, the algorithm has exponential space and time complexity. Bayesian networks were developed to address these difficulties.
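The single disease-and-test case described above can be sketched numerically. The following is only an illustration: the function name `posterior` and all three input values (prevalence, sensitivity, false-positive rate) are hypothetical and do not come from the text. The denominator is the probability of a positive test, obtained by summing over the two cases (disease present, disease absent).

```python
from fractions import Fraction


def posterior(prior, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes' theorem.

    prior               -- P(disease)
    sensitivity         -- P(positive | disease)
    false_positive_rate -- P(positive | no disease)
    """
    # P(positive): sum over the cases "disease" and "no disease".
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# Hypothetical values chosen only for illustration.
p = posterior(Fraction(1, 1000), Fraction(99, 100), Fraction(5, 100))
print(p, float(p))
```

Exact fractions are used so that the result can be checked by hand; with floats the same function works unchanged.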
By exploiting conditional independencies entailed by influence chains, we are able to represent a large instance in a Bayesian network using little space, and we are often able to perform probabilistic inference among the features in an acceptable amount of time. In addition, the graphical nature of Bayesian networks gives us a much better intuitive grasp of the relationships among the features.

Figure 1.1: A Bayesian network.

Figure 1.1 shows a Bayesian network representing the probabilistic relationships among the features just discussed. The values of the features in that network represent the following:
Feature   Value   When the Feature Takes this Value
H         h1      There is a history of smoking
          h2      There is no history of smoking
B         b1      Bronchitis is present
          b2      Bronchitis is absent
L         l1      Lung cancer is present
          l2      Lung cancer is absent
F         f1      Fatigue is present
          f2      Fatigue is absent
C         c1      Chest X-ray is positive
          c2      Chest X-ray is negative
This Bayesian network is discussed in Example 1.32 in Section 1.3.3 after we provide the theory of Bayesian networks. Presently, we only use it to illustrate the nature and use of Bayesian networks. First, in this Bayesian network (called a causal network) the edges represent direct influences. For example, there is an edge from H to L because a history of smoking has a direct influence on the presence of lung cancer, and there is an edge from L to C because the presence of lung cancer has a direct influence on the result of a chest X-ray. There is no
In 1933 A.N. Kolmogorov developed the set-theoretic definition of probability, which serves as a mathematical foundation for all applications of probability. We start by providing that definition.

Probability theory has to do with experiments that have a set of distinct outcomes. Examples of such experiments include drawing the top card from a deck of 52 cards with the 52 outcomes being the 52 different faces of the cards; flipping a two-sided coin with the two outcomes being ‘heads’ and ‘tails’; picking a person from a population and determining whether the person is a smoker with the two outcomes being ‘smoker’ and ‘non-smoker’; picking a person from a population and determining whether the person has lung cancer with the two outcomes being ‘having lung cancer’ and ‘not having lung cancer’; after identifying 5 levels of serum calcium, picking a person from a population and determining the individual’s serum calcium level with the 5 outcomes being each of the 5 levels; picking a person from a population and determining the individual’s serum calcium level with the infinite number of outcomes being the continuum of possible calcium levels.

The last two experiments illustrate two points. First, the experiment is not well-defined until we identify a set of outcomes. The same act (picking a person and measuring that person’s serum calcium level) can be associated with many different experiments, depending on what we consider a distinct outcome. Second, the set of outcomes can be infinite. Once an experiment is well-defined, the collection of all outcomes is called the sample space. Mathematically, a sample space is a set and the outcomes are the elements of the set. To keep this review simple, we restrict ourselves to finite sample spaces in what follows. (You should consult a mathematical probability text such as [Ash, 1970] for a discussion of infinite sample spaces.) In the case of a finite sample space, every subset of the sample space is called an event.
A subset containing exactly one element is called an elementary event. Once a sample space is identified, a probability function is defined as follows:
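Definition 1.1 Suppose we have a sample space Ω containing n distinct elements. That is, $\Omega = \{e_1, e_2, \ldots, e_n\}$. A function that assigns a real number P(E) to each event E ⊆ Ω is called a probability function on the set of subsets of Ω if it satisfies the following conditions:
1. $0 \le P(\{e_i\}) \le 1$ for $1 \le i \le n$.
2. $P(\{e_1\}) + P(\{e_2\}) + \cdots + P(\{e_n\}) = 1$.
3. For each event $E = \{e_{i_1}, e_{i_2}, \ldots, e_{i_k}\}$ that is not an elementary event,
$$P(E) = P(\{e_{i_1}\}) + P(\{e_{i_2}\}) + \cdots + P(\{e_{i_k}\}).$$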
The pair (Ω, P ) is called a probability space.
We often just say P is a probability function on Ω rather than saying on the set of subsets of Ω. Intuition for probability functions comes from considering games of chance as the following example illustrates.
Example 1.1 Let the experiment be drawing the top card from a deck of 52 cards. Then Ω contains the faces of the 52 cards, and using the principle of indifference, we assign P({e}) = 1/52 for each e ∈ Ω. Therefore, if we let kh and ks stand for the king of hearts and king of spades respectively, P({kh}) = 1/52, P({ks}) = 1/52, and P({kh, ks}) = P({kh}) + P({ks}) = 1/26.
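As a rough illustration of this example, the sketch below builds the 52-card sample space and a probability function on its subsets. The names `OMEGA`, `elementary_prob`, and `prob` are hypothetical choices for this sketch. The probability of an event is computed by summing the probabilities of its elementary events, which is how the example obtains P({kh, ks}).

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "spades", "clubs", "diamonds"]

# Sample space: one outcome per card face.
OMEGA = frozenset(product(RANKS, SUITS))

# Principle of indifference: every elementary event gets probability 1/n.
elementary_prob = {e: Fraction(1, len(OMEGA)) for e in OMEGA}


def prob(event):
    """P(E) as the sum of the probabilities of the outcomes in E."""
    if not set(event) <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return sum((elementary_prob[e] for e in event), Fraction(0))


kh = ("K", "hearts")
ks = ("K", "spades")

print(prob({kh}))       # probability of the king of hearts
print(prob({kh, ks}))   # probability of drawing one of the two kings
print(prob(OMEGA))      # the elementary probabilities sum to 1
```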
The principle of indifference (a term popularized by J.M. Keynes in 1921) says elementary events are to be considered equiprobable if we have no reason to expect or prefer one over the other. According to this principle, when there are n elementary events the probability of each of them is the ratio 1/n. This is the way we often assign probabilities in games of chance, and a probability so assigned is called a ratio. The following example shows a probability that cannot be computed using the principle of indifference.
Example 1.2 Suppose we toss a thumbtack and consider as outcomes the two ways it could land. It could land on its head, which we will call ‘heads’, or it could land with the edge of the head and the end of the point touching the ground, which we will call ‘tails’. Due to the lack of symmetry in a thumbtack, we would not assign a probability of 1/2 to each of these events. So how can we compute the probability? This experiment can be repeated many times. In 1919 Richard von Mises developed the relative frequency approach to probability which says that, if an experiment can be repeated many times, the probability of any one of the outcomes is the limit, as the number of trials approaches infinity, of the ratio of the number of occurrences of that outcome to the total number of trials. For example, if m is the number of trials,
$$P(\{\text{heads}\}) = \lim_{m \to \infty} \frac{\#\text{heads}}{m}.$$
So, if we tossed the thumbtack 10,000 times and it landed heads 3373 times, we would estimate the probability of heads to be about 0.3373.
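The relative frequency idea can be illustrated with a small simulation. This is only a sketch: a real thumbtack's behavior is not known in advance, so the value `p_heads = 0.34` below is an assumed chance of heads used to drive the simulated tosses, and the function name `relative_frequency` is hypothetical.

```python
import random


def relative_frequency(p_heads, trials, seed=0):
    """Simulate `trials` tosses and return (#heads) / (#trials).

    p_heads is an ASSUMED chance of heads used only to generate the
    simulated tosses; it is not a measured property of any thumbtack.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    heads = sum(1 for _ in range(trials) if rng.random() < p_heads)
    return heads / trials


# The ratio tends to settle down as the number of trials grows.
for m in (100, 10_000, 1_000_000):
    print(m, relative_frequency(0.34, m))
```

A finite run only ever gives an estimate; the definition itself refers to the limit over the whole sequence of trials.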
Probabilities obtained using the approach in the previous example are called relative frequencies. According to this approach, the probability obtained is not a property of any one of the trials, but rather it is a property of the entire sequence of trials. How are these probabilities related to ratios? Intuitively, we would expect if, for example, we repeatedly shuffled a deck of cards and drew the top card, the ace of spades would come up about one out of every 52 times. In 1946 J. E. Kerrich conducted many such experiments using games of chance in which the principle of indifference seemed to apply (e.g. drawing a card from a deck). His results indicated that the relative frequency does appear to approach a limit and that limit is the ratio.
patients with these exact same symptoms, to the actual relative frequency with which they have lung cancer.
It is straightforward to prove the following theorem concerning probability spaces.
Theorem 1.1 Let (Ω, P) be a probability space. Then
1. P(Ω) = 1.
2. 0 ≤ P(E) ≤ 1 for every E ⊆ Ω.
3. For E and F ⊆ Ω such that E ∩ F = ∅,
P(E ∪ F) = P(E) + P(F).
Proof. The proof is left as an exercise.
The conditions in this theorem were labeled the axioms of probability theory by A.N. Kolmogorov in 1933. When Condition (3) is replaced by infinitely countable additivity, these conditions are used to define a probability space in mathematical probability texts.
Example 1.5 Suppose we draw the top card from a deck of cards. Denote by Queen the set containing the 4 queens and by King the set containing the 4 kings. Then
P(Queen ∪ King) = P(Queen) + P(King) = 1/13 + 1/13 = 2/13
because Queen ∩ King = ∅. Next denote by Spade the set containing the 13 spades. The sets Queen and Spade are not disjoint; so their probabilities are not additive. However, it is not hard to prove that, in general,
P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
So
P(Queen ∪ Spade) = P(Queen) + P(Spade) − P(Queen ∩ Spade)
= 1/13 + 1/4 − 1/52 = 4/13.
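The two cases in this example (disjoint events and overlapping events) can be checked by direct counting. The sketch below assumes a standard 52-card deck with equiprobable cards; the names `DECK`, `prob`, `queen`, `king`, and `spade` are hypothetical.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "spades", "clubs", "diamonds"]
DECK = set(product(RANKS, SUITS))


def prob(event):
    """P(E) for equiprobable cards: |E| / |deck|."""
    return Fraction(len(event), len(DECK))


queen = {card for card in DECK if card[0] == "Q"}
king = {card for card in DECK if card[0] == "K"}
spade = {card for card in DECK if card[1] == "spades"}

# Disjoint events: probabilities simply add.
print(prob(queen | king), prob(queen) + prob(king))

# Overlapping events: subtract the probability of the intersection,
# otherwise the queen of spades is counted twice.
print(prob(queen | spade), prob(queen) + prob(spade) - prob(queen & spade))
```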
We have yet to discuss one of the most important concepts in probability theory, namely conditional probability. We do that next.
Definition 1.2 Let E and F be events such that P(F) ≠ 0. Then the conditional probability of E given F, denoted P(E|F), is given by
$$P(E \mid F) = \frac{P(E \cap F)}{P(F)}.$$
The initial intuition for conditional probability comes from considering probabilities that are ratios. In the case of ratios, P(E|F), as defined above, is the fraction of items in F that are also in E. We show this as follows. Let $n$ be the number of items in the sample space, $n_F$ be the number of items in F, and $n_{EF}$ be the number of items in E ∩ F. Then
$$\frac{P(E \cap F)}{P(F)} = \frac{n_{EF}/n}{n_F/n} = \frac{n_{EF}}{n_F},$$
which is the fraction of items in F that are also in E. As far as meaning, P(E|F) means the probability of E occurring given that we know F has occurred.
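Example 1.6 Again consider drawing the top card from a deck of cards, let Queen be the set of the 4 queens, RoyalCard be the set of the 12 royal cards, and Spade be the set of the 13 spades. Then
$$P(\text{Queen}) = \frac{1}{13}$$
$$P(\text{Queen} \mid \text{RoyalCard}) = \frac{P(\text{Queen} \cap \text{RoyalCard})}{P(\text{RoyalCard})} = \frac{1/13}{3/13} = \frac{1}{3}$$
$$P(\text{Queen} \mid \text{Spade}) = \frac{P(\text{Queen} \cap \text{Spade})}{P(\text{Spade})} = \frac{1/52}{1/4} = \frac{1}{13}.$$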
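The conditional probabilities in this example can be reproduced by counting. The sketch below assumes a standard deck with equiprobable cards and takes the royal cards to be the jacks, queens, and kings; the names `prob`, `cond_prob`, `queen`, `royal_card`, and `spade` are hypothetical.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "spades", "clubs", "diamonds"]
DECK = set(product(RANKS, SUITS))


def prob(event):
    """P(E) for equiprobable cards: |E| / |deck|."""
    return Fraction(len(event), len(DECK))


def cond_prob(e, f):
    """P(E | F) = P(E and F) / P(F); undefined when P(F) = 0."""
    if prob(f) == 0:
        raise ValueError("P(F) must be nonzero")
    return prob(e & f) / prob(f)


queen = {card for card in DECK if card[0] == "Q"}
royal_card = {card for card in DECK if card[0] in ("J", "Q", "K")}
spade = {card for card in DECK if card[1] == "spades"}

print(prob(queen))                   # unconditional probability of a queen
print(cond_prob(queen, royal_card))  # learning "royal card" raises it
print(cond_prob(queen, spade))       # learning "spade" leaves it unchanged
```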
Notice in the previous example that P (Queen|Spade) = P (Queen). This means that finding out the card is a spade does not make it more or less probable that it is a queen. That is, the knowledge of whether it is a spade is irrelevant to whether it is a queen. We say that the two events are independent in this case, which is formalized in the following definition.
Definition 1.3 Two events E and F are independent if one of the following holds:
1. P(E|F) = P(E) and P(E) ≠ 0, P(F) ≠ 0.
2. P(E) = 0 or P(F) = 0.
Notice that the definition states that the two events are independent even though it is based on the conditional probability of E given F. The reason is that independence is symmetric. That is, if P(E) ≠ 0 and P(F) ≠ 0, then P(E|F) = P(E) if and only if P(F|E) = P(F). It is straightforward to prove that E and F are independent if and only if P(E ∩ F) = P(E)P(F). The following example illustrates an extension of the notion of independence.
Example 1.7 Let E = {kh, ks, qh}, F = {kh, kc, qh}, G = {kh, ks, kc, kd}, where kh means the king of hearts, ks means the king of spades, etc. Then
P (E) =