Conditional Probability - Discrete Mathematics and Probability Theory - Lecture Notes, Study notes of Discrete Structures and Graph Theory

Key points in these lecture notes: Conditional Probability, False Negatives, Simple Probability, Sample Points, Scaled Probabilities, Same Probability Space, Medical Testing Example, Four Disjoint Subsets, Independent Events, Mutual Independence

Typology: Study notes

2012/2013

Uploaded on 04/27/2013

ascharya
ascharya 🇮🇳

CS 70, Fall 2004, Lecture 16

Conditional Probability

A pharmaceutical company is marketing a new test for a certain medical condition. According to clinical trials, the test has the following properties:

  1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called “false negatives”).
  2. When applied to a healthy person, the test comes up negative in 80% of cases, and positive in 20% (these are called “false positives”).

Suppose that the incidence of the condition in the US population is 5%. When a random person is tested and the test comes up positive, what is the probability that the person actually has the condition? (Note that this is presumably not the same as the simple probability that a random person has the condition, which is just $\frac{1}{20}$.)

This is an example of a conditional probability: we are interested in the probability that a person has the condition (event A ) given that he/she tests positive (event B ). Let’s write this as Pr[ A | B ].

How should we compute $\Pr[A \mid B]$? Well, since event $B$ is guaranteed to happen, we need to look not at the whole sample space $\Omega$, but at the smaller sample space consisting only of the sample points in $B$. What should the probabilities of these sample points be? If they all simply inherit their probabilities from $\Omega$, then the sum of these probabilities will be $\sum_{\omega \in B} \Pr[\omega] = \Pr[B]$, which in general is less than 1. So we need to scale the probability of each sample point by $\frac{1}{\Pr[B]}$. I.e., for each sample point $\omega \in B$, the new probability becomes

$$\Pr[\omega \mid B] = \frac{\Pr[\omega]}{\Pr[B]}.$$

Now it is clear how to compute Pr[ A | B ]: namely, we just sum up these scaled probabilities over all sample points that lie in both A and B :

$$\Pr[A \mid B] = \sum_{\omega \in A \cap B} \Pr[\omega \mid B] = \sum_{\omega \in A \cap B} \frac{\Pr[\omega]}{\Pr[B]} = \frac{\Pr[A \cap B]}{\Pr[B]}.$$

Definition 16.1 (conditional probability) : For events A , B in the same probability space, such that Pr[ B ] > 0, the conditional probability of A given B is

$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
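The definition can be checked mechanically on a small finite space. The following is only a sketch: the helper name `conditional_probability` and the fair-die example are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction

def conditional_probability(space, event_a, event_b):
    """Pr[A | B] on a finite space given as {sample_point: probability}."""
    # Pr[B]: total probability of the sample points in B.
    pr_b = sum(p for w, p in space.items() if w in event_b)
    if pr_b == 0:
        raise ValueError("Pr[B] must be positive")
    # Pr[A ∩ B]: total probability of the points lying in both A and B.
    pr_ab = sum(p for w, p in space.items() if w in event_a and w in event_b)
    return pr_ab / pr_b

# Hypothetical example: one roll of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
even = {2, 4, 6}
at_least_four = {4, 5, 6}
result = conditional_probability(die, even, at_least_four)
print(result)  # 2/3
```

Exact fractions are used so that the result can be compared directly with a hand computation.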

Let’s go back to our medical testing example. The sample space here consists of all people in the US — denote their number by N (so N ≈ 250 million). The population consists of four disjoint subsets:


TP: the true positives (90% of $\frac{N}{20}$, i.e., $\frac{9N}{200}$ of them);

FP: the false positives (20% of $\frac{19N}{20}$, i.e., $\frac{19N}{100}$ of them);

TN: the true negatives (80% of $\frac{19N}{20}$, i.e., $\frac{76N}{100}$ of them);

FN: the false negatives (10% of $\frac{N}{20}$, i.e., $\frac{N}{200}$ of them).

Now let A be the event that a person chosen at random is affected, and B the event that he/she tests positive. Note that B is the union of the disjoint sets T P and FP , so

$$|B| = |TP| + |FP| = \frac{9N}{200} + \frac{19N}{100} = \frac{47N}{200}.$$

Thus we have $\Pr[A] = \frac{1}{20}$ and $\Pr[B] = \frac{47}{200}$.

Now when we condition on the event $B$, we focus in on the smaller sample space consisting only of those $\frac{47N}{200}$ individuals who test positive. To compute $\Pr[A \mid B]$, we need to figure out $\Pr[A \cap B]$ (the part of $A$ that lies in $B$). But $A \cap B$ is just the set of people who are both affected and test positive, i.e., $A \cap B = TP$. So we have

$$\Pr[A \cap B] = \frac{|TP|}{N} = \frac{9}{200}.$$

Finally, we conclude from Definition 16.1 that

$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{9/200}{47/200} = \frac{9}{47} \approx 0.19.$$

This seems bad: if a person tests positive, there’s only about a 19% chance that he/she actually has the condition! This sounds worse than the original claims made by the pharmaceutical company, but in fact it’s just another view of the same data.

[Incidentally, note that $\Pr[B \mid A] = \frac{9/200}{1/20} = \frac{9}{10}$; so $\Pr[A \mid B]$ and $\Pr[B \mid A]$ can be very different. Of course, $\Pr[B \mid A]$ is just the probability that a person tests positive given that he/she has the condition, which we knew from the start was 90%.]
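The arithmetic of the medical testing example can be reproduced with exact fractions. This is a minimal sketch; the variable names are illustrative, and the three input rates are the ones stated in the example (5% incidence, 90% positive rate on affected people, 20% positive rate on healthy people).

```python
from fractions import Fraction

incidence = Fraction(5, 100)        # Pr[A]: person is affected
sensitivity = Fraction(90, 100)     # Pr[positive | affected]
false_pos_rate = Fraction(20, 100)  # Pr[positive | healthy]

# Fractions of the population in the four disjoint subsets.
tp = incidence * sensitivity                 # true positives
fp = (1 - incidence) * false_pos_rate        # false positives
tn = (1 - incidence) * (1 - false_pos_rate)  # true negatives
fn = incidence * (1 - sensitivity)           # false negatives

pr_b = tp + fp             # Pr[B]: test comes up positive
pr_a_given_b = tp / pr_b   # Pr[A | B]
pr_b_given_a = tp / incidence  # Pr[B | A]

print(pr_b, pr_a_given_b, pr_b_given_a)  # 47/200 9/47 9/10
```

Changing `incidence` shows how strongly $\Pr[A \mid B]$ depends on how rare the condition is, even when the test's error rates stay fixed.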

To complete the picture, what’s the (unconditional) probability that the test gives a correct result (positive or negative) when applied to a random person? Call this event C. Then

$$\Pr[C] = \frac{|TP| + |TN|}{N} = \frac{9}{200} + \frac{76}{100} = \frac{161}{200} \approx 0.8.$$

So the test is about 80% effective overall, a more impressive statistic.

But how impressive is it? Suppose we ignore the test and just pronounce everybody to be healthy. Then we would be correct on 95% of the population (the healthy ones), and wrong on the affected 5%. I.e., this trivial test is 95% effective! So we might ask if it is worth running the test at all. What do you think?

Here are a couple more examples of conditional probabilities, based on some of our sample spaces from the previous lecture.

  1. Balls and bins. Suppose we toss $m = 3$ balls into $n = 3$ bins; this is a uniform sample space with $3^3 = 27$ points. We already know that the probability the first bin is empty is $(1 - \frac{1}{3})^3 = (\frac{2}{3})^3 = \frac{8}{27}$. What is the probability of this event given that the second bin is empty? Call these events $A$, $B$


Theorem 16.2: If events $A_1, \ldots, A_n$ are mutually independent, then

$$\Pr[A_1 \cap \cdots \cap A_n] = \Pr[A_1] \times \Pr[A_2] \times \cdots \times \Pr[A_n].$$

We won’t prove this theorem here because it is a special case of the more general Theorem 16.3 (the product rule), which we will prove below (check this!). Note that it is possible to construct three events $A$, $B$, $C$ such that each pair is independent but the triple $A$, $B$, $C$ is not mutually independent.
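One standard construction of such a triple (an illustrative assumption here, since the notes do not give one at this point) uses two fair coin tosses: $A$ = first toss is heads, $B$ = second toss is heads, $C$ = the two tosses agree. The sketch below checks by enumeration that each pair multiplies correctly while the triple does not.

```python
from fractions import Fraction
from itertools import product

# Uniform sample space of two fair coin tosses: HH, HT, TH, TT.
space = list(product("HT", repeat=2))

def pr(event):
    """Probability of an event given as a predicate on sample points."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

def A(w): return w[0] == "H"   # first toss is heads
def B(w): return w[1] == "H"   # second toss is heads
def C(w): return w[0] == w[1]  # both tosses agree

def both(e, f):
    return lambda w: e(w) and f(w)

pairwise_independent = all(
    pr(both(e, f)) == pr(e) * pr(f) for e, f in [(A, B), (A, C), (B, C)]
)
pr_triple = pr(lambda w: A(w) and B(w) and C(w))
product_of_three = pr(A) * pr(B) * pr(C)

print(pairwise_independent, pr_triple, product_of_three)  # True 1/4 1/8
```

Since $\Pr[A \cap B \cap C] = \frac{1}{4}$ while the product of the three probabilities is $\frac{1}{8}$, the events are not mutually independent.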

Combinations of events

In most applications of probability in Computer Science, we are interested in things like $\Pr[\bigcup_{i=1}^n A_i]$ and $\Pr[\bigcap_{i=1}^n A_i]$, where the $A_i$ are simple events (i.e., we know, or can easily compute, the $\Pr[A_i]$). The intersection $\bigcap_i A_i$ corresponds to the logical AND of the events $A_i$, while the union $\bigcup_i A_i$ corresponds to their logical OR. As an example, if $A_i$ denotes the event that a failure of type $i$ happens in a certain system, then $\bigcup_i A_i$ is the event that the system fails.

In general, computing the probabilities of such combinations can be very difficult. In this section, we discuss some situations where it can be done.

Intersections of events

From the definition of conditional probability, we immediately have the following product rule (sometimes also called the chain rule) for computing the probability of an intersection of events.

Theorem 16.3: [Product Rule] For events $A$, $B$, we have

$$\Pr[A \cap B] = \Pr[A] \Pr[B \mid A].$$

More generally, for events $A_1, \ldots, A_n$,

$$\Pr\Big[\bigcap_{i=1}^{n} A_i\Big] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \Pr[A_3 \mid A_1 \cap A_2] \times \cdots \times \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big].$$

Proof : The first assertion follows directly from the definition of Pr[ B | A ] (and is in fact a special case of the second assertion with n = 2).

To prove the second assertion, we will use simple induction on n (the number of events). The base case is n = 1, and corresponds to the statement that Pr[ A ] = Pr[ A ], which is trivially true. For the inductive step, let n > 1 and assume (the inductive hypothesis) that

$$\Pr\Big[\bigcap_{i=1}^{n-1} A_i\Big] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_{n-1} \,\Big|\, \bigcap_{i=1}^{n-2} A_i\Big].$$

Now we can apply the definition of conditional probability to the two events $A_n$ and $\bigcap_{i=1}^{n-1} A_i$ to deduce that

$$\begin{aligned}
\Pr\Big[\bigcap_{i=1}^{n} A_i\Big] &= \Pr\Big[A_n \cap \Big(\bigcap_{i=1}^{n-1} A_i\Big)\Big] = \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big] \times \Pr\Big[\bigcap_{i=1}^{n-1} A_i\Big] \\
&= \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big] \times \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_{n-1} \,\Big|\, \bigcap_{i=1}^{n-2} A_i\Big],
\end{aligned}$$

where in the last line we have used the inductive hypothesis. This completes the proof by induction. □

Note that Theorems 16.1 and 16.2 are special cases of the product rule for independent events.


Sequences of trials

Many experiments can be viewed as sequences of simpler experiments, or trials. In these cases, it is more natural to define the probability space in terms of conditional probabilities, using the product rule. As an illustration, consider the sample space $\Omega$ of $n$ tosses of a biased coin discussed in the previous lecture. We can write $\Omega$ as the product $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$, where $\Omega_i = \{H, T\}$ is the sample space of the $i$th coin toss. [1]

How should we define probabilities in $\Omega$? Each sample point in $\Omega$ is an $n$-tuple $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$, where $\omega_i \in \Omega_i$ is the outcome of the $i$th toss. Using the product rule, we must have

$$\Pr[\omega] = \Pr[\omega_1] \times \Pr[\omega_2 \mid \omega_1] \times \cdots \times \Pr[\omega_n \mid \omega_1, \omega_2, \ldots, \omega_{n-1}]. \qquad (1)$$

So if we define all the conditional probabilities $\Pr[\omega_i \mid \omega_1, \omega_2, \ldots, \omega_{i-1}]$, we will in fact have defined the entire probability space!

Now the key fact in this example is that the coin tosses are supposed to be independent, so each conditional probability is just the same as the corresponding unconditional probability for a coin. Thus equation (1) becomes $\Pr[\omega] = \Pr[\omega_1] \times \Pr[\omega_2] \times \cdots \times \Pr[\omega_n]$.

Thus we see that the probability of any sample point $\omega$ is $\Pr[\omega] = p^r (1-p)^{n-r}$, where $r$ is the number of Heads in $\omega$. Of course, this is exactly how we defined this sample space in the previous lecture. However, the point is that now we have a rational basis for our definition: it is an inevitable consequence of the fact that the coin tosses are independent.
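The construction of the coin-tossing space from per-toss probabilities can be sketched in a few lines. The function name `coin_space` and the bias $p = \frac{1}{3}$ are assumptions chosen for illustration; the code multiplies the per-toss probabilities, as independence permits, and the resulting values can be compared against $p^r (1-p)^{n-r}$.

```python
from fractions import Fraction
from itertools import product

def coin_space(n, p):
    """Probability of every n-tuple of tosses, built as a product of
    per-toss probabilities (valid because the tosses are independent)."""
    space = {}
    for omega in product("HT", repeat=n):
        prob = Fraction(1)
        for toss in omega:
            prob *= p if toss == "H" else 1 - p
        space[omega] = prob
    return space

p = Fraction(1, 3)  # hypothetical heads probability
n = 3
space = coin_space(n, p)

total = sum(space.values())     # probabilities of all sample points
hht = space[("H", "H", "T")]    # two heads, one tail

print(total, hht)  # 1 2/27
```

The total of 1 confirms that defining the conditional probabilities of each toss really does define a valid probability space.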

Here are some more examples.

  1. Coin tosses. Toss a fair coin three times. Let $A$ be the event that all three tosses are heads. Then $A = A_1 \cap A_2 \cap A_3$, where $A_i$ is the event that the $i$th toss comes up heads. We have

$$\begin{aligned}
\Pr[A] &= \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \Pr[A_3 \mid A_1 \cap A_2] \\
&= \Pr[A_1] \times \Pr[A_2] \times \Pr[A_3] = \tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{8}.
\end{aligned}$$

The second line here follows from the fact that the tosses are mutually independent. Of course, we already know that $\Pr[A] = \frac{1}{8}$ from our definition of the probability space in the previous lecture. The above is really a check that the space behaves as we expect. [2] If the coin is biased with heads probability $p$, we get, again using independence,

$$\Pr[A] = \Pr[A_1] \Pr[A_2] \Pr[A_3] = p^3.$$

And more generally, the probability of any sequence of $n$ tosses containing $r$ heads and $n - r$ tails is $p^r (1-p)^{n-r}$. This is in fact the reason we defined the probability space this way in the previous lecture: we defined the sample point probabilities so that the coin tosses would behave independently.

  1. Balls and bins. The sample space here is the product $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_m$, where $\Omega_i = \{1, 2, \ldots, n\}$ is the set of $n$ bins that could be chosen by the $i$th ball. (Recall that there are $m$ balls and $n$ bins.) Since

[1] Recall that, for sets $A$, $B$, the Cartesian product $A \times B$ is the set consisting of all (ordered) pairs $(a, b)$ with $a \in A$ and $b \in B$.

[2] Strictly speaking, we should really also have checked from our original definition of the probability space that $\Pr[A_1]$, $\Pr[A_2 \mid A_1]$ and $\Pr[A_3 \mid A_1 \cap A_2]$ are all equal to $\frac{1}{2}$.


Let’s use this view to compute the probability of a flush in a different way. Clearly this is $4 \times \Pr[A]$, where $A$ is the event of a Hearts flush. And we can write $A = \bigcap_{i=1}^{5} A_i$, where $A_i$ is the event that the $i$th card we pick is a Heart. So we have

$$\Pr[A] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_5 \,\Big|\, \bigcap_{i=1}^{4} A_i\Big].$$

Clearly $\Pr[A_1] = \frac{13}{52} = \frac{1}{4}$. What about $\Pr[A_2 \mid A_1]$? Well, since we are conditioning on $A_1$ (the first card is a Heart), there are only 51 remaining possibilities for the second card, 12 of which are Hearts. So $\Pr[A_2 \mid A_1] = \frac{12}{51}$. Similarly, $\Pr[A_3 \mid A_1 \cap A_2] = \frac{11}{50}$, and so on. So we get

$$4 \times \Pr[A] = 4 \times \frac{13}{52} \times \frac{12}{51} \times \frac{11}{50} \times \frac{10}{49} \times \frac{9}{48},$$

which is exactly the same fraction we computed in the previous lecture. So now we have two methods of computing probabilities in many of our sample spaces. It is useful to keep these different methods around, both as a check on your answers and because in some cases one of the methods is easier to use than the other.
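The agreement between the two methods can be checked numerically. In this sketch, the product-rule computation follows the text above, while the counting formula $4\binom{13}{5} / \binom{52}{5}$ is an assumption about what the previous lecture computed (it is the usual count of five-card hands in a single suit out of all five-card hands).

```python
from fractions import Fraction
from math import comb

def hearts_flush_by_product_rule():
    """Pr[all five cards are Hearts], one conditional factor per card."""
    prob = Fraction(1)
    for i in range(5):
        # After i Hearts have been drawn, 13 - i Hearts remain
        # among 52 - i cards.
        prob *= Fraction(13 - i, 52 - i)
    return prob

flush_product = 4 * hearts_flush_by_product_rule()

# Assumed counting method: hands in one suit over all 5-card hands.
flush_counting = Fraction(4 * comb(13, 5), comb(52, 5))

print(flush_product, flush_counting)  # 33/16660 33/16660
```

Both routes give the same exact fraction, roughly 0.002, which illustrates the point about keeping two methods around as a cross-check.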

  1. Monty Hall. Recall that we defined the probability of a sample point by multiplying the probabilities of the sequence of choices it corresponds to; thus, e.g.,

$$\Pr[(1, 1, 2)] = \frac{1}{3} \times \frac{1}{3} \times \frac{1}{2} = \frac{1}{18}.$$

The reason we defined it this way is that we knew (from our model of the problem) the probabilities for each choice conditional on the previous one. Thus, e.g., the $\frac{1}{2}$ in the above product is the probability that Monty opens door 2 conditional on the prize door being door 1 and the contestant initially choosing door 1. Once again, we used these conditional probabilities to define the probabilities of our sample points.

Unions of events

You are in Las Vegas, and you spy a new game with the following rules. You pick a number between 1 and 6. Then three dice are thrown. You win if and only if your number comes up on at least one of the dice.

The casino claims that your odds of winning are 50%, using the following argument. Let $A$ be the event that you win. We can write $A = A_1 \cup A_2 \cup A_3$, where $A_i$ is the event that your number comes up on die $i$. Clearly $\Pr[A_i] = \frac{1}{6}$ for each $i$. Therefore,

$$\Pr[A] = \Pr[A_1 \cup A_2 \cup A_3] = \Pr[A_1] + \Pr[A_2] + \Pr[A_3] = 3 \times \frac{1}{6} = \frac{1}{2}.$$

Is this calculation correct? Well, suppose instead that the casino rolled six dice, and again you win iff your number comes up at least once. Then the analogous calculation would say that you win with probability $6 \times \frac{1}{6} = 1$, i.e., certainly! The situation becomes even more ridiculous when the number of dice gets bigger than 6.

The problem is that the events Ai are not disjoint : i.e., there are some sample points that lie in more than one of the Ai. (We could get really lucky and our number could come up on two of the dice, or all three.) So if we add up the Pr[ Ai ] we are counting some sample points more than once.

Fortunately, there is a formula for this, known as the Principle of Inclusion/Exclusion:


Theorem 16.4: [Inclusion/Exclusion] For events $A_1, \ldots, A_n$ in some probability space, we have

$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] = \sum_{i=1}^{n} \Pr[A_i] - \sum_{\{i,j\}} \Pr[A_i \cap A_j] + \sum_{\{i,j,k\}} \Pr[A_i \cap A_j \cap A_k] - \cdots \pm \Pr\Big[\bigcap_{i=1}^{n} A_i\Big].$$

[In the above summations, $\{i, j\}$ denotes all unordered pairs with $i \neq j$, $\{i, j, k\}$ denotes all unordered triples of distinct elements, and so on.]

I.e., to compute $\Pr[\bigcup_i A_i]$, we start by summing the event probabilities $\Pr[A_i]$, then we subtract the probabilities of all pairwise intersections, then we add back in the probabilities of all three-way intersections, and so on.

We won’t prove this formula here; but you might like to verify it for the special case $n = 3$ by drawing a Venn diagram and checking that every sample point in $A_1 \cup A_2 \cup A_3$ is counted exactly once by the formula. You might also like to prove the formula for general $n$ by induction (in similar fashion to the proof of Theorem 16.3).

Taking the formula on faith, what is the probability we get lucky in the new game in Vegas?

$$\Pr[A_1 \cup A_2 \cup A_3] = \Pr[A_1] + \Pr[A_2] + \Pr[A_3] - \Pr[A_1 \cap A_2] - \Pr[A_1 \cap A_3] - \Pr[A_2 \cap A_3] + \Pr[A_1 \cap A_2 \cap A_3].$$

Now the nice thing here is that the events $A_i$ are mutually independent (the outcome of any die does not depend on that of the others), so $\Pr[A_i \cap A_j] = \Pr[A_i] \Pr[A_j] = (\frac{1}{6})^2 = \frac{1}{36}$, and similarly $\Pr[A_1 \cap A_2 \cap A_3] = (\frac{1}{6})^3 = \frac{1}{216}$. So we get

$$\Pr[A_1 \cup A_2 \cup A_3] = \Big(3 \times \frac{1}{6}\Big) - \Big(3 \times \frac{1}{36}\Big) + \frac{1}{216} = \frac{91}{216} \approx 0.42.$$

So your odds are quite a bit worse than the casino is claiming!
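The Vegas game is small enough to check by brute force. The sketch below enumerates all $6^3 = 216$ equally likely rolls, counts those containing the chosen number, and compares the result with the Inclusion/Exclusion expression for $n = 3$ and with the complement $1 - (\frac{5}{6})^3$. The variable names are illustrative, and the choice of number 1 is arbitrary (by symmetry any face gives the same answer).

```python
from fractions import Fraction
from itertools import product

my_number = 1  # arbitrary choice; every face behaves the same way

# All 216 equally likely outcomes of three dice.
rolls = list(product(range(1, 7), repeat=3))
wins = sum(1 for roll in rolls if my_number in roll)
exact = Fraction(wins, len(rolls))

# Inclusion/Exclusion with Pr[A_i] = 1/6 and mutually independent dice.
p = Fraction(1, 6)
incl_excl = 3 * p - 3 * p**2 + p**3

# Cross-check: you lose only if all three dice miss your number.
complement = 1 - (1 - p) ** 3

print(exact, incl_excl, complement)  # 91/216 91/216 91/216
```

All three computations agree on $\frac{91}{216}$, a little over 42%, against the casino's claimed 50%.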

When $n$ is large (i.e., we are interested in the union of many events), the Inclusion/Exclusion formula is essentially useless because it involves computing the probability of the intersection of every non-empty subset of the events: and there are $2^n - 1$ of these! Sometimes we can just look at the first few terms of it and forget the rest: note that successive terms actually give us an overestimate and then an underestimate of the answer, and these estimates both get better as we go along.

However, in many situations we can get a long way by just looking at the first term:

  1. Disjoint events. If the events $A_i$ are all disjoint (i.e., no pair of them contain a common sample point; such events are also called mutually exclusive), then

$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] = \sum_{i=1}^{n} \Pr[A_i].$$

[Note that we have already used this fact several times in our examples, e.g., in claiming that the probability of a flush is four times the probability of a Hearts flush — clearly flushes in different suits are disjoint events.]

  1. Union bound. Always, it is the case that

$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] \le \sum_{i=1}^{n} \Pr[A_i].$$

This merely says that adding up the $\Pr[A_i]$ can only overestimate the probability of the union. Crude as it may seem, in the next lecture we’ll see how to use the union bound effectively in a Computer Science example.
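To get a feel for how loose the union bound can be, the sketch below revisits the dice game with $k$ dice instead of three. The function names are hypothetical, and the exact value $1 - (\frac{5}{6})^k$ relies on the dice being independent, as in the game above.

```python
from fractions import Fraction

def exact_win_probability(k):
    """Pr[your number shows on at least one of k independent dice]."""
    return 1 - Fraction(5, 6) ** k

def union_bound(k):
    """Sum of Pr[A_i] over the k dice; an upper bound on the union."""
    return k * Fraction(1, 6)

for k in (1, 3, 6, 10):
    print(k, float(exact_win_probability(k)), float(union_bound(k)))
```

For $k = 1$ the bound is exact; as $k$ grows the bound drifts further above the true probability and exceeds 1 once $k > 6$, at which point it carries no information. The bound is most useful when the individual $\Pr[A_i]$ are small enough that their sum stays well below 1.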
