




























































































An Introduction to Bayesian Thinking





























































































This book was written as a companion for the course Bayesian Statistics from the Statistics with R specialization available on Coursera. Our goal in developing the course was to provide an introduction to Bayesian inference in decision making without requiring calculus, with the book providing more details and background on Bayesian inference. In writing this, we hope that it may be used on its own as an open-access introduction to Bayesian inference using R for anyone interested in learning about Bayesian statistics. Materials and examples from the course are discussed more extensively, and extra examples and exercises are provided. While learners are not expected to have any background in calculus or linear algebra, for those who do have this background and are interested in diving deeper, we have included optional sub-sections in each chapter to provide additional mathematical details and some derivations of key results.
This book is written using the R package bookdown; any interested learners are welcome to download the source code from GitHub to see the code that was used to create all of the examples and figures within the book. Learners should have a current version of R (3.5.0 at the time of this version of the book) and will need to install RStudio in order to use any of the Shiny apps.
Those who are interested in running all of the code in the book, or in building the book locally, should install all of the following packages from CRAN:
# R packages used to create the book
library(statsr)
library(BAS)
library(ggplot2)
library(dplyr)
library(BayesFactor)
library(knitr)
library(rjags)
library(coda)
library(latex2exp)
library(foreign)
library(BHH2)
Bayesian statistics mostly involves conditional probability, which is the probability of an event A given an event B, and it can be calculated using Bayes’ rule. The concept of conditional probability is widely used in medical testing, in which false positives and false negatives may occur. A false positive can be defined as a positive outcome on a medical test when the patient does not actually have the disease they are being tested for. In other words, the false positive rate is the probability of testing positive given no disease. Similarly, a false negative can be defined as a negative outcome on a medical test when the patient does have the disease. In other words, the false negative rate is the probability of testing negative given disease. Both indicators are critical for any medical decision.
To apply Bayes’ rule, we set up a prior and then calculate posterior probabilities based on the prior and the likelihood. That is to say, the prior probabilities are updated through an iterative process of data collection.
This section introduces how Bayes’ rule is applied to calculate conditional probabilities, and demonstrates several real-life examples. Finally, we compare the Bayesian and frequentist definitions of probability.
Consider Table 1.1. It shows the results of a poll among 1,738 adult Americans. This table allows us to calculate probabilities.
Table 1.1: Results from a 2015 Gallup poll on the use of online dating sites by age group

|                                | 18-29 | 30-49 | 50-64 | 65+ | Total |
|--------------------------------|-------|-------|-------|-----|-------|
| Used online dating site        | 60    | 86    | 58    | 21  | 225   |
| Did not use online dating site | 255   | 426   | 450   | 382 | 1513  |
| Total                          | 315   | 512   | 508   | 403 | 1738  |

For instance, the probability of an adult American using an online dating site can be calculated as

$$
P(\text{using an online dating site}) = \frac{\text{Number that indicated they used an online dating site}}{\text{Total number of people in the poll}} = \frac{225}{1738} \approx 0.13.
$$
This is the overall probability of using an online dating site. Say, we are now interested in the probability of using an online dating site if one falls in the age group 30-49. Similar to the above, we have
$$
P(\text{using an online dating site} \mid \text{in age group 30-49}) = \frac{\text{Number in age group 30-49 that indicated they used an online dating site}}{\text{Total number in age group 30-49}} = \frac{86}{512} \approx 0.17.
$$

Here, the pipe symbol ‘|’ means conditional on. This is a conditional probability, as one can consider it the probability of using an online dating site conditional on being in age group 30-49.
We can rewrite this conditional probability in terms of ‘regular’ probabilities by dividing both numerator and the denominator by the total number of people in the poll. That is,
$$
\begin{aligned}
&P(\text{using an online dating site} \mid \text{in age group 30-49}) \\
&\quad= \frac{\text{Number in age group 30-49 that indicated they used an online dating site}}{\text{Total number in age group 30-49}} \\
&\quad= \frac{\dfrac{\text{Number in age group 30-49 that indicated they used an online dating site}}{\text{Total number of people in the poll}}}{\dfrac{\text{Total number in age group 30-49}}{\text{Total number of people in the poll}}} \\
&\quad= \frac{P(\text{using an online dating site \& falling in age group 30-49})}{P(\text{falling in age group 30-49})}.
\end{aligned}
$$
It turns out this relationship holds true for any conditional probability and is known as Bayes’ rule:
Definition 1.1 (Bayes’ Rule). The conditional probability of the event 𝐴 conditional on the event 𝐵 is given by

$$
P(A \mid B) = \frac{P(A \,\&\, B)}{P(B)}.
$$
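As a quick numerical check, the conditional probability for age group 30-49 can be computed from the counts in Table 1.1 in two ways: directly from the counts, and as a ratio of a joint and a marginal probability. The sketch below is only illustrative; the object names (`poll`, `p_used_given_3049`, and so on) are our own and do not come from the book's source code.

```r
# Counts from Table 1.1 (rows: used / did not use; columns: age groups)
poll <- matrix(
  c(60, 86, 58, 21,
    255, 426, 450, 382),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("used", "did_not_use"),
                  c("18-29", "30-49", "50-64", "65+"))
)

n_total <- sum(poll)                          # 1738 people in the poll

# Overall (marginal) probability of using an online dating site
p_used <- sum(poll["used", ]) / n_total

# Route 1: conditional probability directly from the counts
direct <- poll["used", "30-49"] / sum(poll[, "30-49"])

# Route 2: joint probability divided by the marginal probability
p_used_and_3049 <- poll["used", "30-49"] / n_total
p_3049 <- sum(poll[, "30-49"]) / n_total
p_used_given_3049 <- p_used_and_3049 / p_3049

round(c(p_used = p_used, direct = direct, ratio = p_used_given_3049), 3)
```

Both routes give the same value because the total number of people in the poll cancels in the ratio.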
the early 1980s) has HIV if ELISA tests positive. For this, we need the following information. ELISA’s true positive rate (one minus the false negative rate), also referred to as sensitivity, recall, or probability of detection, is estimated as
𝑃 (ELISA is positive ∣ Person tested has HIV) = 93% = 0.93.
Its true negative rate (one minus the false positive rate), also referred to as specificity, is estimated as
𝑃 (ELISA is negative ∣ Person tested has no HIV) = 99% = 0.99.
Also relevant to our question is the prevalence of HIV in the overall population, which is estimated to be 1.48 out of every 1000 American adults. We therefore assume
𝑃 (Person tested has HIV) = 1.48/1000 = 0.00148. (1.1)
Note that the above numbers are estimates. For our purposes, however, we will treat them as if they were exact.
Our goal is to compute the probability of HIV if ELISA is positive, that is, 𝑃 (Person tested has HIV ∣ ELISA is positive). In none of the above numbers did we condition on the outcome of ELISA. Fortunately, Bayes’ rule allows us to use the above numbers to compute the probability we seek. Bayes’ rule states that
$$
P(\text{Person tested has HIV} \mid \text{ELISA is positive}) = \frac{P(\text{Person tested has HIV \& ELISA is positive})}{P(\text{ELISA is positive})}. \tag{1.2}
$$
This can be derived as follows. For someone to test positive and be HIV positive, that person first needs to be HIV positive and then secondly test positive. The probability of the first thing happening is 𝑃 (HIV positive) = 0.00148. The probability of then testing positive is 𝑃 (ELISA is positive ∣ Person tested has HIV) = 0.93, the true positive rate. This yields for the numerator

$$
\begin{aligned}
P(\text{Person tested has HIV \& ELISA is positive})
&= P(\text{Person tested has HIV})\,P(\text{ELISA is positive} \mid \text{Person tested has HIV}) \\
&= 0.00148 \cdot 0.93 = 0.0013764.
\end{aligned}
\tag{1.3}
$$
The first step in the above equation is implied by Bayes’ rule: By multiplying the left- and right-hand side of Bayes’ rule as presented in Section 1.1.1 by 𝑃 (𝐵), we obtain 𝑃 (𝐴 ∣ 𝐵)𝑃 (𝐵) = 𝑃 (𝐴 & 𝐵).
The denominator in (1.2) can be expanded as
$$
\begin{aligned}
P(\text{ELISA is positive})
&= P(\text{Person tested has HIV \& ELISA is positive}) + P(\text{Person tested has no HIV \& ELISA is positive}) \\
&= 0.0013764 + 0.0099852 = 0.0113616,
\end{aligned}
$$
where we used (1.3) and
$$
\begin{aligned}
P(\text{Person tested has no HIV \& ELISA is positive})
&= P(\text{Person tested has no HIV})\,P(\text{ELISA is positive} \mid \text{Person tested has no HIV}) \\
&= \bigl(1 - P(\text{Person tested has HIV})\bigr) \cdot \bigl(1 - P(\text{ELISA is negative} \mid \text{Person tested has no HIV})\bigr) \\
&= (1 - 0.00148) \cdot (1 - 0.99) = 0.0099852.
\end{aligned}
$$
Putting this all together and inserting into (1.2) reveals
$$
P(\text{Person tested has HIV} \mid \text{ELISA is positive}) = \frac{0.0013764}{0.0113616} \approx 0.12. \tag{1.4}
$$
So even when the ELISA returns positive, the probability of having HIV is only 12%. An important reason why this number is so low is the low prevalence of HIV. Before testing, one’s probability of HIV was 0.148%, so the positive test changes that probability dramatically, but it is still below 50%. That is, it is more likely that one is HIV negative rather than positive after one positive ELISA test.
Questions like the one we just answered (What is the probability of a disease if a test returns positive?) are crucial for making medical diagnoses. As we saw, the true positive and true negative rates of a test do not tell the full story on their own; a disease’s prevalence also plays a role. Bayes’ rule is a tool to synthesize such numbers into a more useful probability of having a disease after a test result.
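The arithmetic above can be reproduced in a few lines of R. This is a minimal sketch using the three estimates quoted in the text; the variable names are our own choice.

```r
# Estimates quoted in the text (treated as exact)
prevalence  <- 1.48 / 1000   # P(Person tested has HIV)
sensitivity <- 0.93          # P(ELISA is positive | has HIV)
specificity <- 0.99          # P(ELISA is negative | has no HIV)

# Numerator of Bayes' rule: P(has HIV & ELISA is positive)
p_hiv_and_pos <- prevalence * sensitivity

# P(has no HIV & ELISA is positive), using the false positive rate
p_nohiv_and_pos <- (1 - prevalence) * (1 - specificity)

# Denominator: total probability of a positive ELISA
p_pos <- p_hiv_and_pos + p_nohiv_and_pos

# Posterior probability of HIV given one positive ELISA
posterior <- p_hiv_and_pos / p_pos
round(posterior, 2)
```

Changing `prevalence` to a larger value is an easy way to see how strongly the result depends on how common the disease is in the population being tested.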
Example 1.2. What is the probability that someone who tests positive does not actually have HIV?
We found in (1.4) that someone who tests positive has a 0.12 probability of having HIV. That implies that the same person has a 1−0.12 = 0.88 probability of not having HIV, despite testing positive.
Example 1.3. If the individual is at a higher risk for having HIV than a randomly sampled person from the population considered, how, if at all, would you expect 𝑃 (Person tested has HIV ∣ ELISA is positive) to change?
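$$
\begin{aligned}
&P(\text{Person tested has HIV} \mid \text{Second ELISA is also positive}) \\
&\quad= \frac{P(\text{Person tested has HIV})\,P(\text{Second ELISA is positive} \mid \text{Person tested has HIV})}{P(\text{Second ELISA is also positive})} \\
&\quad= \frac{P(\text{Person tested has HIV})\,P(\text{Second ELISA is positive} \mid \text{Has HIV})}{P(\text{Person tested has HIV})\,P(\text{Second ELISA is positive} \mid \text{Has HIV}) + P(\text{Person tested has no HIV})\,P(\text{Second ELISA is positive} \mid \text{Has no HIV})} \\
&\quad= \frac{0.12 \cdot 0.93}{0.12 \cdot 0.93 + (1 - 0.12) \cdot (1 - 0.99)} \approx 0.93.
\end{aligned}
\tag{1.5}
$$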
Since we are considering the same ELISA test, we used the same true positive and true negative rates as in Section 1.1.2. We see that two positive tests make it much more probable for someone to have HIV than when only one test comes up positive.
This process of using Bayes’ rule to update a probability based on an event affecting it is called Bayes’ updating. More generally, what one tries to update can be considered ‘prior’ information, sometimes simply called the prior. The event providing information about this can also be data. Then, updating this prior using Bayes’ rule gives the information conditional on the data, also known as the posterior, as in the information after having seen the data. Going from the prior to the posterior is Bayes’ updating.
The probability of HIV after one positive ELISA, 0.12, was the posterior in the previous section as it was an update of the overall prevalence of HIV, (1.1). However, in this section we answered a question where we used this posterior information as the prior. This process of using a posterior as prior in a new problem is natural in the Bayesian framework of updating knowledge based on the data.
Example 1.5. What is the probability that one actually has HIV after testing positive 3 times on the ELISA? Again, assume that all three ELISAs are independent.
Analogous to what we did in this section, we can use Bayes’ updating for this. However, now the prior is the probability of HIV after two positive ELISAs, that is 𝑃 (Person tested has HIV) = 0.93. Analogous to (1.5), the answer follows as
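$$
\begin{aligned}
&P(\text{Person tested has HIV} \mid \text{Third ELISA is also positive}) \\
&\quad= \frac{P(\text{Person tested has HIV})\,P(\text{Third ELISA is positive} \mid \text{Person tested has HIV})}{P(\text{Third ELISA is also positive})} \\
&\quad= \frac{P(\text{Person tested has HIV})\,P(\text{Third ELISA is positive} \mid \text{Has HIV})}{P(\text{Person tested has HIV})\,P(\text{Third ELISA is positive} \mid \text{Has HIV}) + P(\text{Person tested has no HIV})\,P(\text{Third ELISA is positive} \mid \text{Has no HIV})} \\
&\quad= \frac{0.93 \cdot 0.93}{0.93 \cdot 0.93 + (1 - 0.93) \cdot (1 - 0.99)} \approx 0.999.
\end{aligned}
$$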
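Bayes’ updating after repeated positive tests can be sketched as a small function applied in a loop, where each posterior becomes the prior for the next test. The function name `bayes_update` is hypothetical, and the sketch assumes, as the text does, that the tests are independent and share the same true positive and true negative rates. Because it carries unrounded values from one step to the next, its results may differ slightly from hand calculations that round the prior to 0.12 and 0.93.

```r
# One step of Bayes' updating after a positive test result
# (hypothetical helper; defaults are the ELISA rates quoted in the text)
bayes_update <- function(prior, sensitivity = 0.93, specificity = 0.99) {
  numerator <- prior * sensitivity
  numerator / (numerator + (1 - prior) * (1 - specificity))
}

p <- 1.48 / 1000          # prior before any test: the prevalence
posteriors <- numeric(3)  # posterior after 1, 2 and 3 positive ELISAs

for (i in 1:3) {
  p <- bayes_update(p)    # the posterior becomes the prior for the next test
  posteriors[i] <- p
}

round(posteriors, 3)
```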
The frequentist definition of probability is based on observation of a large number of trials. Let 𝑃 (𝐸) denote the probability that an event 𝐸 occurs, and suppose we observe 𝑛_𝐸 successes out of 𝑛 trials. Then we have

$$
P(E) = \lim_{n \to \infty} \frac{n_E}{n}.
$$
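The limiting relative frequency can be illustrated by simulation. In the sketch below, the true probability `p_true = 0.3` is an arbitrary value chosen only for illustration; the running proportion of successes is tracked as the number of simulated trials grows.

```r
set.seed(42)                 # for reproducibility

p_true <- 0.3                # hypothetical true probability of the event E
n <- 100000                  # number of simulated trials

# Each trial is a success (1) with probability p_true, otherwise 0
outcomes <- rbinom(n, size = 1, prob = p_true)

# Relative frequency n_E / n after each trial
running_freq <- cumsum(outcomes) / seq_len(n)

# Relative frequency after 10, 100, 1,000, 10,000 and 100,000 trials
running_freq[c(10, 100, 1000, 10000, 100000)]
```

With few trials the relative frequency can be far from `p_true`; it settles down as the number of trials increases.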
On the other hand, the Bayesian definition of probability 𝑃 (𝐸) reflects our prior beliefs, so 𝑃 (𝐸) can be any probability distribution, provided that it is consistent with all of our beliefs. (For example, we cannot believe that the probability of a coin landing heads is 0.7 and that the probability of getting tails is 0.8, because they are inconsistent.)
The two definitions result in different methods of inference. Using the frequentist approach, we describe the confidence level as the proportion of random samples from the same population that produce confidence intervals containing the true population parameter. For example, if we generated 100 random samples from the population, and the confidence intervals from 95 of those samples contained the true parameter, then the confidence level would be 95%. Note that each interval either contains the true parameter or does not, so the confidence level is NOT the probability that a given interval includes the true population parameter.
Example 1.6. Based on a 2015 Pew Research poll on 1,500 adults: “We are 95% confident that 60% to 64% of Americans think the federal government does not do enough for middle class people.”
The correct interpretation is: 95% of random samples of 1,500 adults will produce confidence intervals that contain the true proportion of Americans who think the federal government does not do enough for middle class people.
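This interpretation can be checked by simulation. The sketch below draws many samples of 1,500 adults from a population whose true proportion is assumed, purely for illustration, to be 0.62 (the true value is unknown in practice), builds a normal-approximation 95% interval from each sample, and records how often the interval contains the true proportion. The interval formula is a standard textbook choice and is not necessarily the one used by the pollster.

```r
set.seed(1)                  # for reproducibility

true_p    <- 0.62            # assumed true proportion (hypothetical)
n         <- 1500            # sample size, as in the poll
n_samples <- 2000            # number of simulated polls
z         <- qnorm(0.975)    # critical value for a 95% interval

covered <- replicate(n_samples, {
  p_hat <- rbinom(1, size = n, prob = true_p) / n   # sample proportion
  se    <- sqrt(p_hat * (1 - p_hat) / n)            # estimated standard error
  (p_hat - z * se <= true_p) && (true_p <= p_hat + z * se)
})

# Proportion of intervals that contain the true proportion
coverage <- mean(covered)
coverage
```

Each individual interval either contains 0.62 or it does not; the 95% describes the long-run proportion of intervals that do.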
Here are two common misconceptions:
size. If the treatment and control are equally effective, then the probability that a pregnancy comes from the treatment group (𝑝) should be 0.5. If RU-486 is more effective, then the probability that a pregnancy comes from the treatment group (𝑝) should be less than 0.5.
Therefore, we can form the hypotheses as below:

$$
H_0: p = 0.5 \quad \text{(the treatment and control are equally effective)}, \qquad H_A: p < 0.5 \quad \text{(the treatment is more effective)}.
$$

A p-value is needed to make an inference decision with the frequentist approach. The definition of p-value is the probability of observing something at least as extreme as the data, given that the null hypothesis (𝐻₀) is true. “More extreme” means in the direction of the alternative hypothesis (𝐻𝐴).
Since 𝐻₀ states that the probability of success (pregnancy) is 0.5, we can calculate the p-value from 20 independent Bernoulli trials where the probability of success is 0.5. The outcome of this experiment is 4 successes in 20 trials, so the p-value is the probability of obtaining 4 or fewer successes in the 20 Bernoulli trials.
This probability can be calculated exactly from a binomial distribution with 𝑛 = 20 trials and success probability 𝑝 = 0.5. Letting 𝑘 denote the number of successes, the p-value is 𝑃 (𝑘 ≤ 4) = 𝑃 (𝑘 = 0) + 𝑃 (𝑘 = 1) + 𝑃 (𝑘 = 2) + 𝑃 (𝑘 = 3) + 𝑃 (𝑘 = 4), which can be computed in R:
sum(dbinom(0:4, size = 20, prob = 0.5))
According to R, the probability of getting 4 or fewer successes in 20 trials is 0.0059. Therefore, given that pregnancy is equally likely in the two groups, the chance of observing 4 or fewer pregnancies in the treatment group is 0.0059. With such a small probability, we reject the null hypothesis and conclude that the data provide convincing evidence for the treatment being more effective than the control.
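The same p-value can be obtained in more than one way in base R, which is a useful cross-check. The sketch below compares the explicit sum of binomial probabilities with the cumulative distribution function `pbinom()` and with the exact one-sided test from `binom.test()`; the variable names are our own.

```r
# Sum of P(k = 0), ..., P(k = 4) under H0: p = 0.5
p_sum <- sum(dbinom(0:4, size = 20, prob = 0.5))

# The same lower-tail probability from the binomial CDF
p_cdf <- pbinom(4, size = 20, prob = 0.5)

# Exact binomial test with the one-sided alternative p < 0.5
p_test <- binom.test(x = 4, n = 20, p = 0.5, alternative = "less")$p.value

round(c(sum = p_sum, cdf = p_cdf, test = p_test), 4)
```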
This section uses the same example, but this time we make the inference for the proportion from a Bayesian approach. Recall that we still consider only the 20 total pregnancies, 4 of which come from the treatment group. The question we would like to answer is how likely it is for 4 pregnancies to occur in the treatment group. Also remember that if the treatment and control are equally effective, and the sample sizes for the two groups are the same, then the probability (𝑝) that a pregnancy comes from the treatment group is 0.5.

Table 1.2: Prior, likelihood, and posterior probabilities for each of the 9 models

| Model (p)                   | 0.1000 | 0.2000 | 0.3000 | 0.4000 | 0.5000 | 0.6000 | 0.70 | 0.80 | 0.90 |
|-----------------------------|--------|--------|--------|--------|--------|--------|------|------|------|
| Prior P(model)              | 0.0600 | 0.0600 | 0.0600 | 0.0600 | 0.5200 | 0.0600 | 0.06 | 0.06 | 0.06 |
| Likelihood P(data\|model)   | 0.0898 | 0.2182 | 0.1304 | 0.0350 | 0.0046 | 0.0003 | 0.00 | 0.00 | 0.00 |
| P(data\|model) x P(model)   | 0.0054 | 0.0131 | 0.0078 | 0.0021 | 0.0024 | 0.0000 | 0.00 | 0.00 | 0.00 |
| Posterior P(model\|data)    | 0.1748 | 0.4248 | 0.2539 | 0.0681 | 0.0780 | 0.0005 | 0.00 | 0.00 | 0.00 |
Within the Bayesian framework, we need to make some assumptions on the models which generated the data. First, 𝑝 is a probability, so it can take on any value between 0 and 1. However, let’s simplify by using discrete cases: assume 𝑝, the chance that a pregnancy comes from the treatment group, can take on nine values, from 10%, 20%, 30%, up to 90%. For example, 𝑝 = 20% means that among 10 pregnancies, it is expected that 2 of them will occur in the treatment group. Note that we consider all nine models, in contrast to the frequentist paradigm, where we consider only one model.
Table 1.2 specifies the prior probabilities that we want to assign to each model. There is no unique correct prior, but any prior probability should reflect our beliefs prior to the experiment. The prior probabilities should incorporate the information from all relevant research before we perform the current experiment.
This prior incorporates two beliefs: the probability of 𝑝 = 0.5 is highest, and the benefit of the treatment is symmetric. The second belief means that the treatment is equally likely to be better or worse than the standard treatment. Now it is natural to ask how this prior was chosen; the specification will be discussed in detail later in the course.
Next, let’s calculate the likelihood – the probability of observed data for each model considered. In mathematical terms, we have
𝑃 (data|model) = 𝑃 (𝑘 = 4|𝑛 = 20, 𝑝)
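The likelihood can be computed as a binomial probability with 4 successes in 20 trials, with 𝑝 equal to the assumed value in each model. The values are listed in Table 1.2.
After setting up the prior and computing the likelihood, we are ready to calculate the posterior using Bayes’ rule, that is,

$$
P(\text{model} \mid \text{data}) = \frac{P(\text{model})\,P(\text{data} \mid \text{model})}{P(\text{data})},
$$

where 𝑃 (data) is the sum of 𝑃 (data|model) × 𝑃 (model) over the nine models.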
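The rows of Table 1.2 can be reproduced with a few vectorized operations. This is a minimal sketch, not the book's own code; the object names are ours, and the prior values are the ones listed in the table.

```r
# The nine candidate models for p
p <- seq(0.1, 0.9, by = 0.1)

# Prior from Table 1.2: most weight on p = 0.5, the rest spread evenly
prior <- c(rep(0.06, 4), 0.52, rep(0.06, 4))

# Likelihood of the data (k = 4 successes in n = 20 trials) under each model
likelihood <- dbinom(4, size = 20, prob = p)

# Numerator of Bayes' rule, then normalize so the posterior sums to 1
numerator <- prior * likelihood
posterior <- numerator / sum(numerator)

round(data.frame(p, prior, likelihood, numerator, posterior), 4)
```

Dividing by `sum(numerator)` plays the role of the denominator in Bayes’ rule: it is the total probability of the data across all nine models.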
[Figure 1.2: More data: sample size 𝑛 = 40 and number of successes 𝑘 = 8. Three panels (Prior, Likelihood, Posterior), each plotted over the candidate models 𝑝.]
finally put these two together to obtain the posterior distribution. The posterior also has a peak at 𝑝 = 0.20, but the peak is taller, as shown in Figure 1.2. In other words, there is more mass on that model, and less on the others.
[Figure 1.3: More data: sample size 𝑛 = 200 and number of successes 𝑘 = 40. Three panels (Prior, Likelihood, Posterior), each plotted over the candidate models 𝑝.]
To illustrate the effect of the sample size even further, we are going to keep increasing our sample size, but still maintain the 20% ratio between the number of successes and the sample size. So let’s consider a sample with 200 observations and 40 successes. Once again, we are going to use the same prior. The likelihood is again centered at 20%, and almost all of the probability mass in the posterior is at 𝑝 = 0.20. The other models do not have zero probability mass, but their posterior probabilities are very close to zero.
Figure 1.3 demonstrates that as more data are collected, the likelihood ends up dominating the prior. This is why, while a good prior helps, a bad prior can be overcome with a large sample. However, it’s important to note that this will only work as long as we do not place a zero probability mass on any of the models in the prior.
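The effect of the sample size, and the caveat about zero prior mass, can both be seen numerically. The sketch below reuses the nine-model prior from Table 1.2 with a hypothetical helper `posterior_for()`, and then repeats the largest-sample calculation with a modified prior that places zero mass on 𝑝 = 0.2; that modified prior is our own illustration and does not appear in the text.

```r
p <- seq(0.1, 0.9, by = 0.1)
prior <- c(rep(0.06, 4), 0.52, rep(0.06, 4))

# Posterior over the nine models for k successes in n trials
# (hypothetical helper function)
posterior_for <- function(k, n, prior, models = p) {
  unnormalized <- prior * dbinom(k, size = n, prob = models)
  unnormalized / sum(unnormalized)
}

# Same 20% success ratio, increasing sample size
post_20  <- posterior_for(4, 20, prior)
post_40  <- posterior_for(8, 40, prior)
post_200 <- posterior_for(40, 200, prior)
round(rbind(n20 = post_20, n40 = post_40, n200 = post_200), 4)

# Illustrative prior with zero mass on p = 0.2, renormalized to sum to 1
prior_zero <- prior
prior_zero[2] <- 0
prior_zero <- prior_zero / sum(prior_zero)

# No amount of data can move posterior mass onto a model with zero prior
post_zero <- posterior_for(40, 200, prior_zero)
round(post_zero, 4)
```

Because the posterior is proportional to the prior times the likelihood, a prior probability of exactly zero forces the posterior probability to zero regardless of the data.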
1.3 Frequentist vs. Bayesian Inference
In this section, we will solve a simple inference problem using both frequentist and Bayesian approaches. Then we will compare the decisions based on the two methods, to see whether we get the same answer or not. If we do not, we will discuss why that happens.
Example 1.9. We have a population of M&M’s, and in this population the percentage of yellow M&M’s is either 10% or 20%. You have been hired as a statistical consultant to decide whether the true percentage of yellow M&M’s is 10% or 20%.
Payoffs/losses: You are being asked to make a decision, and there are associated payoff/losses that you should consider. If you make the correct decision, your boss gives you a bonus. On the other hand, if you make the wrong decision, you lose your job.
Data: You can “buy” a random sample from the population – You pay $200 for each M&M, and you must buy in $1,000 increments (5 M&Ms at a time). You have a total of $4,000 to spend, i.e., you may buy 5, 10, 15, or 20 M&Ms.
Remark: Remember that the cost of making a wrong decision is high, so you want to be fairly confident of your decision. At the same time, though, data collection is also costly, so you don’t want to pay for a sample larger than you need. If you believe that you could actually make a correct decision using a smaller sample size, you might choose to do so and save money and resources.
Let’s start with the frequentist inference.
Note that the p-value is the probability of the observed or a more extreme outcome, given that the null hypothesis is true.
Therefore, we fail to reject 𝐻₀ and conclude that the data do not provide convincing evidence that the proportion of yellow M&M’s is greater than 10%. This means that if we had to pick between 10% and 20% for the proportion of yellow M&M’s, even though this hypothesis testing procedure does not actually confirm the null hypothesis, we would likely stick with 10% since we couldn’t find evidence that the proportion of yellow M&M’s is greater than 10%.
The Bayesian inference works differently as below.