




Key points in these lecture notes: Conditional Probability, False Negatives, Simple Probability, Sample Points, Scaled Probabilities, Same Probability Space, Medical Testing Example, Four Disjoint Subsets, Independent Events, Mutual Independence





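A pharmaceutical company is marketing a new test for a certain medical condition. According to clinical trials, the test has the following properties:
1. When applied to an affected person, the test comes up positive in 90% of cases and negative in 10% (the false negatives).
2. When applied to a healthy person, the test comes up negative in 80% of cases and positive in 20% (the false positives).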
Suppose that the incidence of the condition in the US population is 5%. When a random person is tested and the test comes up positive, what is the probability that the person actually has the condition? (Note that this is presumably not the same as the simple probability that a random person has the condition, which is just $\frac{1}{20}$.)
This is an example of a conditional probability: we are interested in the probability that a person has the condition (event A ) given that he/she tests positive (event B ). Let’s write this as Pr[ A | B ].
How should we compute $\Pr[A \mid B]$? Well, since event B is guaranteed to happen, we need to look not at the whole sample space Ω, but at the smaller sample space consisting only of the sample points in B. What should the probabilities of these sample points be? If they all simply inherit their probabilities from Ω, then the sum of these probabilities will be $\sum_{\omega \in B} \Pr[\omega] = \Pr[B]$, which in general is less than 1. So we need to scale the probability of each sample point by $\frac{1}{\Pr[B]}$. That is, for each sample point $\omega \in B$, the new probability becomes
$$\Pr[\omega \mid B] = \frac{\Pr[\omega]}{\Pr[B]}.$$
Now it is clear how to compute Pr[ A | B ]: namely, we just sum up these scaled probabilities over all sample points that lie in both A and B :
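$$\Pr[A \mid B] = \sum_{\omega \in A \cap B} \Pr[\omega \mid B] = \sum_{\omega \in A \cap B} \frac{\Pr[\omega]}{\Pr[B]} = \frac{\Pr[A \cap B]}{\Pr[B]}.$$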
Definition 16.1 (conditional probability) : For events A , B in the same probability space, such that Pr[ B ] > 0, the conditional probability of A given B is
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}.$$
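As a minimal sketch of Definition 16.1, assume a finite sample space stored as a Python dict that maps each sample point to its probability. The function name `conditional_probability` and the fair-die example are illustrative choices, not part of the notes.

```python
from fractions import Fraction

def conditional_probability(space, a, b):
    """Return Pr[A | B] = Pr[A ∩ B] / Pr[B] for events a and b (sets of sample points)."""
    pr_b = sum(p for w, p in space.items() if w in b)
    if pr_b == 0:
        raise ValueError("Pr[B] must be positive")
    pr_a_and_b = sum(p for w, p in space.items() if w in a and w in b)
    return pr_a_and_b / pr_b

# Illustrative space (an assumption for this sketch): one roll of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
even = {2, 4, 6}
at_least_four = {4, 5, 6}

print(conditional_probability(die, even, at_least_four))  # 2/3
```

Dividing by Pr[B] is exactly the rescaling step described above: the points of B keep their relative weights but now sum to 1.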
Let’s go back to our medical testing example. The sample space here consists of all people in the US — denote their number by N (so N ≈ 250 million). The population consists of four disjoint subsets:
CS 70, Fall 2004, Lecture 16 1
TP: the true positives (90% of $\frac{N}{20}$ = $\frac{9N}{200}$ of them);
FP: the false positives (20% of $\frac{19N}{20}$ = $\frac{19N}{100}$ of them);
TN: the true negatives (80% of $\frac{19N}{20}$ = $\frac{76N}{100}$ of them);
FN: the false negatives (10% of $\frac{N}{20}$ = $\frac{N}{200}$ of them).
Now let A be the event that a person chosen at random is affected, and B the event that he/she tests positive. Note that B is the union of the disjoint sets TP and FP, so
$$|B| = |TP| + |FP| = \frac{9N}{200} + \frac{19N}{100} = \frac{47N}{200}.$$
Thus we have $\Pr[A] = \frac{1}{20}$ and $\Pr[B] = \frac{47}{200}$.
Now when we condition on the event B, we focus in on the smaller sample space consisting only of those $\frac{47N}{200}$ individuals who test positive. To compute $\Pr[A \mid B]$, we need to figure out $\Pr[A \cap B]$ (the part of A that lies in B). But $A \cap B$ is just the set of people who are both affected and test positive, i.e., $A \cap B = TP$. So we have
$$\Pr[A \cap B] = \frac{|TP|}{N} = \frac{9}{200}.$$
Finally, we conclude from Definition 16.1 that
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{9/200}{47/200} = \frac{9}{47} \approx 0.19.$$
This seems bad: if a person tests positive, there’s only about a 19% chance that he/she actually has the condition! This sounds worse than the original claims made by the pharmaceutical company, but in fact it’s just another view of the same data.
[Incidentally, note that $\Pr[B \mid A] = \frac{9/200}{1/20} = \frac{9}{10}$; so $\Pr[A \mid B]$ and $\Pr[B \mid A]$ can be very different. Of course, $\Pr[B \mid A]$ is just the probability that a person tests positive given that he/she has the condition, which we knew from the start was 90%.]
To complete the picture, what’s the (unconditional) probability that the test gives a correct result (positive or negative) when applied to a random person? Call this event C. Then
$$\Pr[C] = \frac{|TP| + |TN|}{N} = \frac{9}{200} + \frac{76}{100} = \frac{161}{200} \approx 0.8.$$
So the test is about 80% effective overall, a more impressive statistic.
But how impressive is it? Suppose we ignore the test and just pronounce everybody to be healthy. Then we would be correct on 95% of the population (the healthy ones), and wrong on the affected 5%. I.e., this trivial test is 95% effective! So we might ask if it is worth running the test at all. What do you think?
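The arithmetic in the medical testing example can be checked with exact fractions. This is a sketch only: the variable names are made up for illustration, and the three input figures (5%, 90%, 20%) are the ones used in the text.

```python
from fractions import Fraction

# Figures taken from the example in the text.
pr_affected = Fraction(5, 100)          # Pr[A]
pos_given_affected = Fraction(90, 100)  # Pr[B | A]
pos_given_healthy = Fraction(20, 100)   # false positive rate

# Fractions of the population in each of the four disjoint subsets.
tp = pr_affected * pos_given_affected
fn = pr_affected * (1 - pos_given_affected)
fp = (1 - pr_affected) * pos_given_healthy
tn = (1 - pr_affected) * (1 - pos_given_healthy)

pr_positive = tp + fp                     # Pr[B]
pr_affected_given_pos = tp / pr_positive  # Pr[A | B]
pr_correct = tp + tn                      # Pr[C]

print(pr_positive, pr_affected_given_pos, pr_correct)  # 47/200 9/47 161/200
```

Here `pr_correct` comes out to 161/200, which is below the 95% achieved by declaring everybody healthy.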
Here are a couple more examples of conditional probabilities, based on some of our sample spaces from the previous lecture.
Theorem 16.2 : If events A 1 ,... , An are mutually independent, then
Pr[ A 1 ∩... ∩ An ] = Pr[ A 1 ] × Pr[ A 2 ] × · · · × Pr[ An ].
We won’t prove this theorem here because it is a special case of the more general Theorem 16.3, which we will prove below (check this!). Note that it is possible to construct three events A, B, C such that each pair is independent but the triple A, B, C is not mutually independent.
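One standard construction of such a triple, offered here as an illustration rather than as the example the notes have in mind, uses two fair coin tosses: A is "first toss is heads", B is "second toss is heads", and C is "both tosses agree". The sketch below checks it by enumerating the four equally likely sample points.

```python
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, each outcome with probability 1/4.
omega = list(product("HT", repeat=2))

def pr(event):
    """Probability of an event given as a predicate on sample points."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def both(e, f):
    """Intersection of two events."""
    return lambda w: e(w) and f(w)

A = lambda w: w[0] == "H"   # first toss is heads
B = lambda w: w[1] == "H"   # second toss is heads
C = lambda w: w[0] == w[1]  # both tosses agree

pairwise = all(pr(both(e, f)) == pr(e) * pr(f) for e, f in [(A, B), (A, C), (B, C)])
triple = pr(lambda w: A(w) and B(w) and C(w))

print(pairwise)                       # True
print(triple, pr(A) * pr(B) * pr(C))  # 1/4 1/8
```

Every pair satisfies the product formula, but the triple intersection has probability 1/4 rather than 1/8, so the three events are not mutually independent.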
In most applications of probability in Computer Science, we are interested in things like $\Pr[\bigcup_{i=1}^{n} A_i]$ and $\Pr[\bigcap_{i=1}^{n} A_i]$, where the $A_i$ are simple events (i.e., we know, or can easily compute, the $\Pr[A_i]$). The intersection $\bigcap_i A_i$ corresponds to the logical AND of the events $A_i$, while the union $\bigcup_i A_i$ corresponds to their logical OR. As an example, if $A_i$ denotes the event that a failure of type i happens in a certain system, then $\bigcup_i A_i$ is the event that the system fails.
In general, computing the probabilities of such combinations can be very difficult. In this section, we discuss some situations where it can be done.
From the definition of conditional probability, we immediately have the following product rule (sometimes also called the chain rule) for computing the probability of an intersection of events.
Theorem 16.3 : [Product Rule] For events A , B, we have
Pr[ A ∩ B ] = Pr[ A ] Pr[ B | A ].
More generally, for events $A_1, \ldots, A_n$,
$$\Pr\Big[\bigcap_{i=1}^{n} A_i\Big] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \Pr[A_3 \mid A_1 \cap A_2] \times \cdots \times \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big].$$
Proof : The first assertion follows directly from the definition of Pr[ B | A ] (and is in fact a special case of the second assertion with n = 2).
To prove the second assertion, we will use simple induction on n (the number of events). The base case is n = 1, and corresponds to the statement that Pr[ A ] = Pr[ A ], which is trivially true. For the inductive step, let n > 1 and assume (the inductive hypothesis) that
$$\Pr\Big[\bigcap_{i=1}^{n-1} A_i\Big] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_{n-1} \,\Big|\, \bigcap_{i=1}^{n-2} A_i\Big].$$
Now we can apply the definition of conditional probability to the two events $A_n$ and $\bigcap_{i=1}^{n-1} A_i$ to deduce that
$$\begin{aligned}
\Pr\Big[\bigcap_{i=1}^{n} A_i\Big] &= \Pr\Big[A_n \cap \Big(\bigcap_{i=1}^{n-1} A_i\Big)\Big] = \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big] \times \Pr\Big[\bigcap_{i=1}^{n-1} A_i\Big] \\
&= \Pr\Big[A_n \,\Big|\, \bigcap_{i=1}^{n-1} A_i\Big] \times \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_{n-1} \,\Big|\, \bigcap_{i=1}^{n-2} A_i\Big],
\end{aligned}$$
where in the last line we have used the inductive hypothesis. This completes the proof by induction. □
Note that Theorems 16.1 and 16.2 are special cases of the product rule for independent events.
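The product rule holds in any probability space, not just for independent events. The sketch below checks it for n = 3 on an arbitrary, made-up distribution over 3-bit tuples; the weights, the seed, and the events "bit i equals 1" are assumptions chosen only for illustration.

```python
from fractions import Fraction
from itertools import product
import random

random.seed(0)  # fixed seed so the made-up weights are reproducible
points = list(product([0, 1], repeat=3))
weights = [random.randint(1, 10) for _ in points]  # all positive
total = sum(weights)
space = {w: Fraction(k, total) for w, k in zip(points, weights)}

def pr(event):
    """Probability of an event given as a predicate on sample points."""
    return sum((p for w, p in space.items() if event(w)), Fraction(0))

def pr_given(event, cond):
    """Pr[event | cond], straight from the definition of conditional probability."""
    return pr(lambda w: event(w) and cond(w)) / pr(cond)

a1 = lambda w: w[0] == 1
a2 = lambda w: w[1] == 1
a3 = lambda w: w[2] == 1

lhs = pr(lambda w: a1(w) and a2(w) and a3(w))
rhs = pr(a1) * pr_given(a2, a1) * pr_given(a3, lambda w: a1(w) and a2(w))

print(lhs == rhs)  # True
```

Because every weight is positive, each conditioning event has positive probability, and the right-hand side telescopes exactly as in the inductive proof.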
Many experiments can be viewed as sequences of simpler experiments, or trials. In these cases, it is more natural to define the probability space in terms of conditional probabilities, using the product rule. As an illustration, consider the sample space Ω of n tosses of a biased coin discussed in the previous lecture. We can write Ω as the product $\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_n$, where $\Omega_i = \{H, T\}$ is the sample space of the i-th coin toss.
How should we define probabilities in Ω? Each sample point in Ω is an n-tuple $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$, where $\omega_i \in \Omega_i$ is the outcome of the i-th toss. Using the product rule, we must have
$$\Pr[\omega] = \Pr[\omega_1] \times \Pr[\omega_2 \mid \omega_1] \times \cdots \times \Pr[\omega_n \mid \omega_1, \omega_2, \ldots, \omega_{n-1}]. \qquad (1)$$
So if we define all the conditional probabilities $\Pr[\omega_i \mid \omega_1, \omega_2, \ldots, \omega_{i-1}]$, we will in fact have defined the entire probability space!
Now the key fact in this example is that the coin tosses are supposed to be independent, so each conditional probability is just the same as the corresponding unconditional probability for a coin. Thus equation (1) becomes
$$\Pr[\omega] = \Pr[\omega_1] \times \Pr[\omega_2] \times \cdots \times \Pr[\omega_n].$$
Thus we see that the probability of any sample point ω is $\Pr[\omega] = p^r (1-p)^{n-r}$, where r is the number of Heads in ω. Of course, this is exactly how we defined this sample space in the previous lecture. However, the point is that now we have a rational basis for our definition: it is an inevitable consequence of the fact that the coin tosses are independent.
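A small sketch of this construction, assuming an illustrative heads probability p = 3/10 and n = 4 tosses (neither value comes from the notes): each sample point gets the product of its per-toss probabilities, and the resulting values match the closed form and sum to 1.

```python
from fractions import Fraction
from itertools import product

def point_probability(omega, p):
    """Multiply the per-toss probabilities, as independence allows."""
    prob = Fraction(1)
    for toss in omega:
        prob *= p if toss == "H" else 1 - p
    return prob

p, n = Fraction(3, 10), 4  # illustrative values
points = list(product("HT", repeat=n))

# Each point's probability matches p^r (1 - p)^(n - r), r = number of heads.
matches_formula = all(
    point_probability(w, p) == p ** w.count("H") * (1 - p) ** (n - w.count("H"))
    for w in points
)
total = sum(point_probability(w, p) for w in points)

print(matches_formula, total)  # True 1
```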
Here are some more examples.
Coin tosses. Toss a fair coin three times, and let A be the event that all three tosses are heads. Then $A = A_1 \cap A_2 \cap A_3$, where $A_i$ is the event that the i-th toss comes up heads, and
$$\begin{aligned}
\Pr[A] &= \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \Pr[A_3 \mid A_1 \cap A_2] \\
&= \Pr[A_1] \times \Pr[A_2] \times \Pr[A_3] = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}.
\end{aligned}$$
The second line here follows from the fact that the tosses are mutually independent. Of course, we already know that $\Pr[A] = \frac{1}{8}$ from our definition of the probability space in the previous lecture. The above is really a check that the space behaves as we expect. If the coin is biased with heads probability p, we get, again using independence,
$$\Pr[A] = \Pr[A_1] \Pr[A_2] \Pr[A_3] = p^3.$$
And more generally, the probability of any sequence of n tosses containing r heads and n − r tails is $p^r (1-p)^{n-r}$. This is in fact the reason we defined the probability space this way in the previous lecture: we defined the sample point probabilities so that the coin tosses would behave independently.
Let’s use this view to compute the probability of a flush in a different way. Clearly this is $4 \times \Pr[A]$, where A is the event that the hand is a Hearts flush. And we can write $A = \bigcap_{i=1}^{5} A_i$, where $A_i$ is the event that the i-th card we pick is a Heart. So we have
$$\Pr[A] = \Pr[A_1] \times \Pr[A_2 \mid A_1] \times \cdots \times \Pr\Big[A_5 \,\Big|\, \bigcap_{i=1}^{4} A_i\Big].$$
Clearly $\Pr[A_1] = \frac{13}{52} = \frac{1}{4}$. What about $\Pr[A_2 \mid A_1]$? Well, since we are conditioning on $A_1$ (the first card is a Heart), there are only 51 remaining possibilities for the second card, 12 of which are Hearts. So $\Pr[A_2 \mid A_1] = \frac{12}{51}$. Similarly, $\Pr[A_3 \mid A_1 \cap A_2] = \frac{11}{50}$, and so on. So we get
$$4 \times \Pr[A] = 4 \times \frac{13}{52} \times \frac{12}{51} \times \frac{11}{50} \times \frac{10}{49} \times \frac{9}{48},$$
which is exactly the same fraction we computed in the previous lecture. So now we have two methods of computing probabilities in many of our sample spaces. It is useful to keep these different methods around, both as a check on your answers and because in some cases one of the methods is easier to use than the other.
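The agreement between the two methods can be checked directly. This sketch assumes the counting method from the previous lecture is the usual one (4 suits times the number of 5-card subsets of a 13-card suit, divided by the number of 5-card hands); that lecture is not reproduced here, so treat the counting formula as an assumption.

```python
from fractions import Fraction
from math import comb

# Product rule: the i-th card drawn is a Heart given that all earlier ones were.
hearts_flush = Fraction(1)
for i in range(5):
    hearts_flush *= Fraction(13 - i, 52 - i)
by_product_rule = 4 * hearts_flush

# Counting (assumed form): 4 suits, choose 5 of 13 cards, out of all 5-card hands.
by_counting = Fraction(4 * comb(13, 5), comb(52, 5))

print(by_product_rule, by_product_rule == by_counting)  # 33/16660 True
```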
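In the Monty Hall sample space, a sample point records the prize door, the contestant's initial choice, and the door Monty opens, and its probability was defined as a product of the probabilities of these choices. For example,
$$\Pr[(1, 1, 2)] = \frac{1}{3} \times \frac{1}{3} \times \frac{1}{2} = \frac{1}{18}.$$
The reason we defined it this way is that we knew (from our model of the problem) the probabilities for each choice conditional on the previous one. Thus, e.g., the $\frac{1}{2}$ in the above product is the probability that Monty opens door 2 conditional on the prize door being door 1 and the contestant initially choosing door 1. Once again, we used these conditional probabilities to define the probabilities of our sample points.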
You are in Las Vegas, and you spy a new game with the following rules. You pick a number between 1 and 6. Then three dice are thrown. You win if and only if your number comes up on at least one of the dice.
The casino claims that your odds of winning are 50%, using the following argument. Let A be the event that you win. We can write $A = A_1 \cup A_2 \cup A_3$, where $A_i$ is the event that your number comes up on die i. Clearly $\Pr[A_i] = \frac{1}{6}$ for each i. Therefore,
$$\Pr[A] = \Pr[A_1 \cup A_2 \cup A_3] = \Pr[A_1] + \Pr[A_2] + \Pr[A_3] = 3 \times \frac{1}{6} = \frac{1}{2}.$$
Is this calculation correct? Well, suppose instead that the casino rolled six dice, and again you win iff your number comes up at least once. Then the analogous calculation would say that you win with probability $6 \times \frac{1}{6} = 1$, i.e., certainly! The situation becomes even more ridiculous when the number of dice gets bigger than 6.
The problem is that the events Ai are not disjoint : i.e., there are some sample points that lie in more than one of the Ai. (We could get really lucky and our number could come up on two of the dice, or all three.) So if we add up the Pr[ Ai ] we are counting some sample points more than once.
Fortunately, there is a formula for this, known as the Principle of Inclusion/Exclusion:
Theorem 16.4 : [Inclusion/Exclusion] For events $A_1, \ldots, A_n$ in some probability space, we have
$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] = \sum_{i=1}^{n} \Pr[A_i] - \sum_{\{i,j\}} \Pr[A_i \cap A_j] + \sum_{\{i,j,k\}} \Pr[A_i \cap A_j \cap A_k] - \cdots \pm \Pr\Big[\bigcap_{i=1}^{n} A_i\Big].$$
[In the above summations, $\{i, j\}$ denotes all unordered pairs with $i \neq j$, $\{i, j, k\}$ denotes all unordered triples of distinct elements, and so on.]
I.e., to compute $\Pr[\bigcup_i A_i]$, we start by summing the event probabilities $\Pr[A_i]$, then we subtract the probabilities of all pairwise intersections, then we add back in the probabilities of all three-way intersections, and so on.
We won’t prove this formula here; but you might like to verify it for the special case n = 3 by drawing a Venn diagram and checking that every sample point in $A_1 \cup A_2 \cup A_3$ is counted exactly once by the formula. You might also like to prove the formula for general n by induction (in similar fashion to the proof of Theorem 16.3).
Taking the formula on faith, what is the probability we get lucky in the new game in Vegas?
Pr[ A 1 ∪ A 2 ∪ A 3 ] = Pr[ A 1 ] + Pr[ A 2 ] + Pr[ A 3 ] − Pr[ A 1 ∩ A 2 ] − Pr[ A 1 ∩ A 3 ] − Pr[ A 2 ∩ A 3 ] + Pr[ A 1 ∩ A 2 ∩ A 3 ].
Now the nice thing here is that the events $A_i$ are mutually independent (the outcome of any die does not depend on that of the others), so $\Pr[A_i \cap A_j] = \Pr[A_i] \Pr[A_j] = (\frac{1}{6})^2 = \frac{1}{36}$, and similarly $\Pr[A_1 \cap A_2 \cap A_3] = (\frac{1}{6})^3 = \frac{1}{216}$. So we get
$$\Pr[A_1 \cup A_2 \cup A_3] = 3 \times \frac{1}{6} - 3 \times \frac{1}{36} + \frac{1}{216} = \frac{91}{216} \approx 0.42.$$
So your odds are quite a bit worse than the casino is claiming!
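The Inclusion/Exclusion answer for the Vegas game can be cross-checked by brute force over all 216 equally likely outcomes of three dice. In this sketch the chosen number (1) is arbitrary; by symmetry any number from 1 to 6 gives the same result.

```python
from fractions import Fraction
from itertools import product

my_number = 1  # arbitrary choice for the sketch
rolls = list(product(range(1, 7), repeat=3))
wins = sum(1 for roll in rolls if my_number in roll)
by_enumeration = Fraction(wins, len(rolls))

# Inclusion/Exclusion with three mutually independent events of probability 1/6.
p = Fraction(1, 6)
by_inclusion_exclusion = 3 * p - 3 * p**2 + p**3

print(by_enumeration, by_inclusion_exclusion)  # 91/216 91/216
```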
When n is large (i.e., we are interested in the union of many events), the Inclusion/Exclusion formula is essentially useless because it involves computing the probability of the intersection of every non-empty subset of the events: and there are $2^n - 1$ of these! Sometimes we can just look at the first few terms of it and forget the rest: note that successive terms actually give us an overestimate and then an underestimate of the answer, and these estimates both get better as we go along.
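To see the alternating over- and underestimates concretely, here is a sketch for the six-dice version of the game. It relies on the independence of the dice, so each k-way intersection has probability (1/6)^k, and it compares against the exact value obtained from the complement, 1 − (5/6)^6, which is a standard identity not derived in these notes.

```python
from fractions import Fraction
from math import comb

n, p = 6, Fraction(1, 6)  # six dice, as in the casino's argument
exact = 1 - (1 - p) ** n  # complement of "the number never appears"

partial_sums = []
running = Fraction(0)
for k in range(1, n + 1):
    # By independence, each k-way intersection has probability p**k,
    # and there are comb(n, k) of them.
    running += (-1) ** (k + 1) * comb(n, k) * p**k
    partial_sums.append(running)

for k, s in enumerate(partial_sums, start=1):
    label = "over" if s > exact else "under" if s < exact else "exact"
    print(k, float(s), label)
```

In this instance the partial sums alternate above and below the exact value, and the full sum of all six terms equals it.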
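However, in many situations we can get a long way by just looking at the first term.
Disjoint events. If the events $A_i$ are all disjoint (i.e., no two of them contain a common sample point), then
$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] = \sum_{i=1}^{n} \Pr[A_i].$$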
[Note that we have already used this fact several times in our examples, e.g., in claiming that the probability of a flush is four times the probability of a Hearts flush — clearly flushes in different suits are disjoint events.]
Union bound. For arbitrary events $A_i$, disjoint or not, it is always the case that
$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] \le \sum_{i=1}^{n} \Pr[A_i].$$
This merely says that adding up the $\Pr[A_i]$ can only overestimate the probability of the union. Crude as it may seem, in the next lecture we’ll see how to use the union bound effectively in a Computer Science example.