Statistical Inference: Hypothesis Testing and Confidence Intervals, Study Guides, Projects, Research of Statistics

Oklahoma State University (OSU) - Oklahoma City Statistics

This document delves into statistical inference, focusing on hypothesis testing and confidence intervals. It uses case studies to illustrate the application of binomial and normal distributions in real-world scenarios, such as analyzing death rates and estimating population proportions. It covers key concepts like p-values, null hypotheses, and the interpretation of confidence intervals, making it a valuable resource for students learning statistical methods.

Typology: Study Guides, Projects, Research

2022/2023

Uploaded on 04/23/2025

bobby-35 🇺🇸


Case Study #1: Postponement of Death Theory

Introduction

If an event of particular significance is upcoming in one’s life, is it within an elderly person’s ability to actually postpone death until the event has passed? For instance, each birthday can hold greater and greater significance for the aged. We often have special celebrations for those who are fortunate to live 80, 85, 90, even 100 years old. Is it possible that the anticipation of such celebrations can be incorporated into a “will to live” for those that are very old? This “postponement of death theory” is the context for our first case study. Several data sets will be explored in this first case, but consider the following scenario for our first inquiry:

Case Question 1A

During the last week of May 2014, there were 18 published obituaries in the Lufkin/Nacogdoches newspapers for persons who were over 70 years old when they died. If there is any true significance attached to the postponement of death theory, then one would expect an increased rate of death shortly after a significant date, such as a birthday. Of the 18 obituaries, 6 people died within three months after having a birthday. Is the fraction observed in this sample suggestive of the postponement theory?

Population and Sample

During the first few class meetings of the semester we discussed that the words population and sample are key in statistical studies. For Case Question 1A,

• The population of interest is taken to be all people in the Lufkin/Nacogdoches area that were over 70 years old when they died.
• The sample consists of 18 published obituaries in the two main local newspapers for one week in late May. All 18 people in the sample were over 70 years old when they died.

Recall, to make sound statistical decisions based on collected data, our sample needs to be as representative of the population of interest as possible. Discuss and think on the following questions:

• Is it believable that the collected sample is a random sample?
• Do you think the sample collected is representative of the population of interest listed above?
• What difficulties might we face in collecting a random sample? Could the population be sampled in a different way that might be better than what was done in Case Question 1A?

You may very well have doubts about the representative nature of the collected sample. Despite this, we often are faced with having to work with samples such as this one under the umbrella of disclaimers. We issue such a disclaimer, or assumption, now: the following analysis will



assume the 18 published obituaries are representative of the targeted population. If this assumption could be proven to be grossly untrue, then any statistical significance that we may attach to the results could be in question.

It is important to question the representative nature of samples. It is also important, before data collection, to consider the best possible way that we could reasonably sample our population. Despite such considerations, statisticians often have to work with less than perfect samples. This is just a realistic feature of data analysis. So, caution is advised when working with samples that could be argued to be unrepresentative. In the current case, we may have some doubts, but they may not be so grave as to nullify all that follows.

Random Variable, Parameter and Statistic of Interest

Go back and carefully read how the data is presented in Case Question 1A. What we know about the 18 people that died is whether or not they died within three months after having a birthday. Six people passed away during the three months which followed a birthday. The other 12 did not. The “information” that each person contributes to the sample is essentially either a “yes” or a “no”:

• Yes (Success): The person did die during the three months which followed a birthday.
• No (Failure): The person did NOT die during the three months which followed a birthday.

Data such as this is said to have come from a Bernoulli trial. Bernoulli trials are investigations in which the resulting data has only two possible outcomes. We call the two possible outcomes “success” and “failure” even though there is no positive or negative connotation attached to those two words. Bernoulli trials result in “1’s” and “0’s”. The ones are attached to the outcome “success” and the zeroes are attached to the “failures”. Label the “success” as the feature you are interested in studying. In this case, a success will be associated with death during the three months following a birthday.
Data recorded as yes/no, up/down, left/right, on/off, in/out, etc. are examples of Bernoulli trials. We have 18 Bernoulli trials in Case Question 1A. Imagine that each person in the sample was assigned either a “1” or a “0” based on whether or not they were a successful Bernoulli trial. This type of assignment is an example of a random variable.

• A random variable is an assignment, one that obeys the laws of what in mathematics we call a “function”, in which each experimental result is assigned a meaningful number.

For instance, one person from the list of obituaries died on May 29, 2014 and they had just had a birthday on May 13. The person is assigned a “1”, a success. Random variables are often given notation such as X, Y or Z. Sometimes if the same random variable is observed multiple times, we use subscripts in our notation, such as $X_1, X_2, X_3$, etc. For us, let’s denote our random variable of interest as X, where X is either a “1” or a “0” based on whether the person from the list of obituaries was assigned a success or a failure when considering them as a Bernoulli trial. We have 18 obituaries, so our random variables will be denoted $X_1, X_2, \ldots, X_{18}$. Specifically, for $i = 1, 2, \ldots, 18$, $X_i = 1$ if the $i$th person died during the three months which followed a birthday, and $X_i = 0$ otherwise.

So, just what conclusion are we trying to reach about p in Case Question 1A? Well, we are trying to see if the postponement of death theory is substantiated by our data! That’s our main inquiry. We should believe this theory is substantiated ONLY if our data indicates it. We should place the burden of proof on the data and begin our investigation with the hypothesis that the postponement of death theory is not suggested by the data. If an investigator is trying to establish the relevance of the theory, he or she should NOT start off assuming the theory is true. That would be terrible science. Instead, we should test the theory by taking data and then letting the data tell us if the weight of evidence is so large that the theory is believable.

Now, if the postponement theory isn’t suggested by the data, then it would make sense that the deaths occur randomly throughout the year. Since we are looking at the three months after a birthday, and three months is one quarter (or 25%) of a year, it makes sense that if the postponement theory isn’t correct for our population, then $p = 0.25$. That is, if the postponement theory isn’t correct for our population, the fraction of deaths in the three months following a birthday is 25%. This is called the null hypothesis in our statistical investigation. We denote it like this:

$H_0: p = 0.25$ (the population proportion is 25%)

On the other hand, if the postponement theory is substantiated, then this would mean that it would be more likely for a death to occur right after a birthday. Specifically, if the postponement theory is relevant for our population, then we would expect more than 25% of deaths to occur in the three months following a birthday. Proponents of the postponement theory want us to believe this. The burden of proof is on them to show that the data suggests that their hypothesis is indeed correct. We will call the hypothesis that we are trying to substantiate the alternative hypothesis.
We denote it like this:

$H_a: p > 0.25$ (the population proportion is greater than 25%)

Test Statistic and Sampling Distribution

We’ve previously noted that we want to use the statistic $\hat{p}$ in order to make a statistical inference about p. We want to use $\hat{p}$ in order to test $H_0: p = 0.25$ vs. $H_a: p > 0.25$. Now, from the information given in Case Question 1A, we can easily calculate that $\hat{p} = 6/18$. Of course, to the nearest percentage point, this means that $\hat{p} \approx 33\%$. The question is this: Is the fact that $\hat{p} > 0.25$ sufficient evidence to reject $H_0$ and claim that the postponement theory is substantiated? Should we reject $H_0$ and decide to believe $H_a$ on the basis of the statement that $\hat{p} > 0.25$?

Well, it is true that $\hat{p} > 0.25$. But 33% may not seem much larger than 25%. Additionally, $\hat{p}$ was calculated on the basis of only 18 deaths. So, we have SOME evidence for $H_a$ on the basis that $\hat{p} > 0.25$. But is this evidence convincing enough to reject $H_0$ and claim that the postponement theory is substantiated? If so, we would call the evidence presented by the sample statistically significant.

We should immediately state that just because evidence is statistically significant does not mean that the results are biologically significant or environmentally significant or psychologically significant, etc. We will discuss this point more later. Suffice it to say that the statistical significance or non-significance of a result should always be cross-examined with the significance from other perspectives. But this is a course in statistics, so we are learning what it means for results to be statistically significant.

In order to know whether or not $\hat{p} = 6/18$ is a statistically significant result that is indicative of $H_a$, we must ask ourselves two very important questions:

• What other values of $\hat{p}$ could we have obtained in other potential samples?
• How likely are these values of $\hat{p}$ if the null hypothesis is true?

In particular, we need to know just how likely it is to observe $\hat{p} = 6/18$ if the postponement theory is not true for our population. If we could calculate that $\hat{p} = 6/18$ is a very likely occurrence under the assumption that the postponement theory is not true, then our sample has indicated evidence to retain $H_0$. That is, we wouldn’t reject it based on the observed data, and the results gathered from the sample are not statistically significant in regards to the postponement theory.

However, what would it mean if we were to calculate that $\hat{p} = 6/18$ is very unlikely if $H_0$ is presumed correct? It would mean that $\hat{p} = 6/18$ would be a very rare event to observe if, in fact, $H_0$ is legitimate. This would make us wonder why, if $H_0$ is correct, we saw a rare occurrence. This rare occurrence would cast doubt on the truth of $H_0$ and would ultimately lead to $H_0$’s rejection in favor of $H_a$.

The above paragraphs are motivation for what statisticians call a p-value.
The concept of p-value is one of the most important ideas in all of statistical science. We will repeatedly use this concept and calculate p-values. Your understanding of them will round into form as we progress through more and more case studies. For now, reflect on the logic of the above paragraphs and the following definition.

p-value: The chance of observing the value of the statistic from your sample (or one more extreme) if, in fact, the null hypothesis is true.

Again, this definition and associated calculation will be reinforced all throughout the course. Learn the definition now as your instructor will surely utilize the concept of p-value repeatedly.
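One way to get a feel for this definition is to imitate "other potential samples" on a computer. The sketch below is only an illustration, not part of the course materials: it assumes 18 independent trials that each succeed with chance 0.25 (the null hypothesis of Case Question 1A), and it assumes "more extreme" means six or more successes. The function name, the number of repetitions and the seed are arbitrary choices.

```python
import random

def simulate_p_value(n=18, p0=0.25, observed=6, reps=20_000, seed=2014):
    """Estimate the chance of `observed` or more successes among n trials
    when every trial succeeds with probability p0 (the null hypothesis)."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    at_least_as_extreme = 0
    for _ in range(reps):
        # One simulated sample: count the successes among n Bernoulli trials.
        successes = sum(1 for _ in range(n) if rng.random() < p0)
        if successes >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

estimate = simulate_p_value()
print(f"Simulated p-value: {estimate:.3f}")
```

The printed value is only an estimate and moves slightly with the seed and the number of repetitions. An exact calculation needs the probabilities of each possible number of successes.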

18 Bernoulli trials as independent when considering that the date of one person’s death doesn’t generally alter the date of someone else’s death. Now, there are unfortunate situations where there are simultaneous deaths of several people due to an accident of some sort, but there was no indication of this among the 18 obituaries in the sample.

A set of Bernoulli trials is said to have come from a binomial experiment if

• There are n independent Bernoulli trials, all of which have the same likelihood of resulting in success.
• The experimenter has interest in the total number of successes among the n trials.

The data collected from the obituaries can reasonably be assumed to have come from a binomial experiment:

• We have independent Bernoulli trials. The time of each death that determines whether the trial is classified as a success or failure can reasonably be assumed to be independent from person to person. If the null hypothesis is true, then the chance of each trial resulting in a success can be claimed to be $p = 0.25$. This likelihood is reasonable to apply to each trial.
• We are interested in the total number of successes since our statistic, $\hat{p}$, utilizes this total.

Let $Y = X_1 + X_2 + \cdots + X_{18}$. Then, from the above argument we state that Y is a binomial random variable with parameters $n = 18$ and $p = 0.25$. A binomial random variable counts the total number of successes in a binomial experiment.

The probabilities associated with binomial random variables can be obtained from the formula for the binomial mass function. All discrete random variables have mass functions, and it is the job of a mass function to provide the probabilities associated with all possible outcomes of the random variable. The formula for the binomial mass function is

$$P(Y = y) = \frac{n!}{y!\,(n-y)!}\, p^{y} (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n.$$

This formula gives the chance of exactly y successes among the n trials. So, if Y is a binomial random variable with parameters n and p, then we denote the “probability of exactly y successes” by the notation $P(Y = y)$.
Here, the capital P denotes “probability”, the capital Y is the random variable of interest and the lowercase y is the particular numerical outcome of the random variable that is relevant in the calculation. The exclamation point (!) indicates the operation of “factorial”, which is simply the successive product of all integers from the value preceding the symbol down to 1. For instance, $4! = 4 \cdot 3 \cdot 2 \cdot 1 = 24$.

Applying the binomial mass function to our case study, suppose Y is a binomial random variable with parameters $n = 18$ and $p = 0.25$. Then, the chance of exactly 6 successes among the 18 trials is

$$P(Y = 6) = \frac{18!}{6!\,12!}\,(0.25)^{6}(0.75)^{12} \approx 0.1436.$$
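The hand calculation can be checked with a few lines of Python. This is a minimal sketch rather than anything prescribed by the course; the function name `binomial_pmf` is made up for illustration, and `math.comb` (Python 3.8 or later) supplies the "n choose y" count.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Chance of exactly y successes among n independent trials,
    each with success probability p (the binomial mass function)."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# Case Question 1A under the null hypothesis: n = 18 trials, p = 0.25.
prob_six = binomial_pmf(6, 18, 0.25)
print(round(prob_six, 6))
```

The result should agree, to six decimal places, with the value .143564 quoted for Excel in this case study.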

Since the binomial mass function is quite prevalent in probability and statistics, it is a good idea to get used to making calculations by hand using the formula. However, after a bit of practice, the popular software Excel (as well as other computer programs) can make the calculations more expeditious. The binomial mass function in Excel can be invoked by clicking in any particular cell in the spreadsheet and then typing the following command:

=BINOM.DIST(y, n, p, FALSE)

For instance, clicking in a cell in Excel and typing =BINOM.DIST(6, 18, 0.25, FALSE) and then hitting the “Enter” key will produce the value .143564 in the cell. Replacing the word “FALSE” with the word “TRUE” will calculate the cumulative probability $P(Y \le y)$, rather than the individual probability $P(Y = y)$. Try it for our case study and you’ll see that $P(Y \le 6) \approx 0.8610$.

Looking back at the binomial mass function we can see that it is made up of three parts:

• $\frac{n!}{y!\,(n-y)!}$ counts the number of ways in which we could observe y successes among n trials. Often, we use the symbol $\binom{n}{y}$ to denote $\frac{n!}{y!\,(n-y)!}$, and this symbol is read “n choose y”. If you have n distinguishable objects, and you want to choose y of them to put in a set, then the number of ways to do this is “n choose y”.
• The second piece of the binomial mass function is $p^{y}$, and it incorporates the fact that we need exactly y successes, each of which has probability p.
• The final piece of the binomial mass function is $(1-p)^{n-y}$, and it incorporates the fact that if we need exactly y successes, then this means that there are $n - y$ failures. Also, since the chance of a success is p, then the chance of failure is $1 - p$.

This last point introduces the important complement rule of probability. For our relevant binomial random variable, notice that the list of possible outcomes for Y is $0, 1, 2, \ldots, 18$. We will denote this complete list of possibilities as S and write $S = \{0, 1, 2, \ldots, 18\}$. A calculation such as $P(Y \le 6)$ only involves a subset of S; namely, $E = \{0, 1, \ldots, 6\}$. The set E is an example of what is called an event. The complement rule of probability states that for any event E, we have that $P(E) = 1 - P(\text{not } E)$.
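For readers without Excel, the cumulative probability and the complement rule can be sketched in Python. The helper names below are illustrative assumptions, not anything defined in this case study; the cumulative function simply adds the mass function from 0 up to y.

```python
from math import comb

def binomial_pmf(y, n, p):
    """P(Y = y) for a binomial random variable with parameters n and p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def binomial_cdf(y, n, p):
    """P(Y <= y): add the mass function over 0, 1, ..., y."""
    return sum(binomial_pmf(k, n, p) for k in range(y + 1))

n, p = 18, 0.25

at_most_six = binomial_cdf(6, n, p)   # cumulative probability P(Y <= 6)
more_than_six = 1 - at_most_six       # complement rule: P(Y >= 7)

# Direct check of the complement rule: add P(Y = 7) through P(Y = 18).
direct = sum(binomial_pmf(k, n, p) for k in range(7, n + 1))

print(f"P(Y <= 6) = {at_most_six:.6f}")
print(f"P(Y >= 7) by complement = {more_than_six:.6f}, directly = {direct:.6f}")
```

The two routes to P(Y >= 7) should agree up to floating-point rounding, which is the complement rule at work: the outcomes 0 through 6 and the outcomes 7 through 18 together make up all of the possibilities.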

Make sure you realize that the ellipsis, “…”, means that we could have three events, ten events, one thousand events, etc. It doesn’t matter. As long as all the events in question are pairwise mutually exclusive, Axiom 3 applies. But it is ONLY good for calculating probabilities that involve the notion of “one set or another or another, etc.”. If you need to calculate a probability that does not fit this pattern, then we may need other probability rules that will emerge later.

Case Question 1A Concluding Decision

The question in Case Study 1A can be answered by using the development presented to this point. The work above has outlined the statistical procedure known as a hypothesis test for a proportion. We have two competing hypotheses and the data will let us know which is most plausible. As a recap, our two hypotheses are

$H_0: p = 0.25$ (the population proportion is 25%)
$H_a: p > 0.25$ (the population proportion is greater than 25%)

The focal question is regarding the level of evidence for the postponement of death theory. Since this theory is under scrutiny, we should not presume it is true. We should assume it is false until the data suggests otherwise, if at all. This is why $H_0$ and $H_a$ are set up in the manner that they are. What the experimenter is hoping to establish is generally placed in the alternative hypothesis.

The parameter being tested is p, the population proportion. It was estimated by the sample proportion $\hat{p} = 6/18$. In order to know whether or not this result is suggestive of rejecting $H_0$, we needed to obtain a sampling distribution for our statistic. The sampling distribution describes the behavior of $\hat{p}$ in repeated samples. Even though we are unlikely to observe these theoretical “other” samples, knowledge of the sampling distribution provides a context for our observed value of $\hat{p}$. When we assume $H_0$ is true, the distribution of a statistic used in a hypothesis test is more specifically called the null distribution. The relevant null distribution for Case Study 1A is the binomial distribution with $n = 18$ and $p = 0.25$.
Notice that large values of $Y$, and in turn large values of $\hat{p}$, are indicative of rejecting $H_0$ and deciding $H_a$ is the most relevant hypothesis based on the evidence in the data. Since large values indicate rejection, the p-value associated with our hypothesis test is the chance that we observe $\hat{p} = 6/18$ or a value larger. We calculate this chance using the null distribution. If you take a moment to reflect on the definition of p-value, you will recall the phrase “or more extreme” appears. Here, “more extreme” is interpreted as “larger than”. In general, “more extreme” is interpreted in light of the alternative hypothesis (notice the “>” sign in $H_a$).

So, our p-value is $P(\hat{p} \ge 6/18)$. Of course, the only way that $\hat{p}$ can be greater than or equal to 6/18 is if $Y \ge 6$. We make this calculation using the null distribution, which is binomial with $n = 18$ and $p = 0.25$. We know from previous work that this probability is $P(Y \ge 6) \approx 0.2825$.

Now, we are at the point of decision. The p-value is a percentile. Specifically, the p-value we obtained is the upper 28th percentile of the null distribution. This means that if $H_0$ is true, we can expect to see our sample proportion or one larger 28% of the time. Does this 28% seem overly rare to you? Probably not. While we all inherently have different personal interpretations of the word “rare”, statisticians that are involved in statistical inference problems generally don’t label p-values as rare until they fall in the lower or upper 5%-10% of the null distribution. This will be discussed more in other case studies. Since our p-value of 28% isn’t particularly rare, we don’t have overwhelming evidence to reject the null hypothesis.

You can think of a p-value as a barometer of sorts. The lower the p-value, the more evidence exists in the data to reject $H_0$ and instead conclude that $H_a$ is most statistically reasonable. If the p-value isn’t low, then you have observed data that is commonplace if in fact $H_0$ is true. This is what has happened to us. We didn’t observe a low p-value, so we don’t have sufficient evidence to reject $H_0$. So, since we don’t reject $H_0$, we will retain it as the most reasonable conclusion. Therefore, we retain the null hypothesis that $p = 0.25$. That is, the claim that the population proportion is 25% can’t be refuted. It appears we do NOT have sufficient data-driven evidence to believe the postponement of death theory in our population.

For a visual look at our null distribution and p-value, look at the following chart made in Excel.

[Chart: null distribution of Y. Horizontal axis: Possible Values of Y (0 to 18). Vertical axis: Probability.]
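The null distribution and the p-value for Case Question 1A can also be tabulated outside of Excel. The following is a rough sketch that assumes the binomial null distribution with 18 trials and success probability 0.25; the text "bar chart" is a stand-in for the Excel chart, and the variable names are illustrative.

```python
from math import comb

n, p0 = 18, 0.25   # null distribution parameters for Case Question 1A
observed = 6       # observed number of successes

# Null distribution: P(Y = y) for every possible count y = 0, 1, ..., 18.
null_dist = [comb(n, y) * p0**y * (1 - p0)**(n - y) for y in range(n + 1)]

# Upper-tailed p-value: chance of 6 or more successes if the null is true.
p_value = sum(null_dist[observed:])

for y, prob in enumerate(null_dist):
    marker = "*" if y >= observed else " "   # outcomes counted in the p-value
    print(f"{y:2d} {prob:.4f} {marker} {'#' * round(prob * 100)}")

print(f"p-value = P(Y >= {observed}) = {p_value:.4f}")
```

The starred rows are the outcomes that are at least as extreme as the observed count. Their probabilities add up to the p-value of roughly 28% discussed above.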

randomly during the year, we would expect 25% of people to pass away in the three months prior to their birthday. Notice that 60 of 747 is only 8%.

Case Question 1B

A total of 747 obituaries were collected from a Salt Lake City newspaper. These obituaries were scattered throughout the period of one year. Among these 747 obituaries, only 60 people died in the three months prior to their birthday. Is this decrease from what would have been expected if the deaths were randomly occurring throughout the year statistically significant? That is, do the Salt Lake City obituaries provide statistical evidence for the postponement of death theory?

Population and Sample

On the surface, Case Question 1B is quite similar to Case Question 1A. The data is collected in a slightly different way, but the issue still boils down to whether or not the disproportionate (8% vs. 25%) fraction of deaths in a particular time interval is substantial enough to lend credence to the postponement theory.

The population of interest here could reasonably be taken to be the greater Salt Lake City area. Similar to the Lufkin/Nacogdoches data, all of the people featured in the obituaries were within one region of the United States. Extrapolation to all of Utah, or all of the United States, would seem to be a stretch since Salt Lake City newspapers don’t tend to contain birth and death announcements from around the country.

The sample is different from the Lufkin/Nacogdoches data in several ways. First, there is no minimum age used to define the deaths. In the Lufkin/Nacogdoches data, all 18 of the deaths pertain to people over 70 years old when they died. This isn’t the case in the Salt Lake City data. Secondly, the data is taken over a longer time interval: all of the people in the Lufkin/Nacogdoches data set died within one week of each other.
Third, and possibly most central to the mathematics in this section, the Salt Lake City sample contains 747 people, whereas the Lufkin/Nacogdoches sample was small, with only 18 people.

As with Case Question 1A, we should ponder the assumption that the 747 deaths that make up the sample constitute a random sample of deaths across a year from the greater Salt Lake City area. Some of the same issues that arose in assuming a representative sample in the Lufkin/Nacogdoches area may be relevant in the Salt Lake City data as well. Going forward, we will assume the 747 obituaries represent a random sample of all deaths during one year in the greater Salt Lake City area.

Random Variable, Parameter and Statistic of Interest

Like Case Question 1A, each person in the sample can be represented by a Bernoulli trial. For $i = 1, 2, \ldots, 747$ define $X_i = 1$ if the $i$th person is a success and $X_i = 0$ if the $i$th person is a failure.

Here, a success represents a person dying within the three months prior to their birthday. A failure corresponds to the person dying at some other time during the year. In corresponding fashion, our relevant parameter and statistic are $p$ and $\hat{p}$ as defined below.

Let $p$ = the proportion of all people in the greater Salt Lake City area that die within the three months prior to their birthday.

Let $\hat{p}$ = the proportion of people in our collected sample that died within the three months prior to their birthday.

If the deaths occur in random fashion throughout the year, then $p = 0.25$. The data provided in Case Question 1B tells us that among the 747 Bernoulli trials, 60 of them were successes. Namely, we have observed that $\hat{p} = 60/747 \approx 0.08$.

Hypotheses To Be Tested

At this point, we have defined a population of interest and identified the collected sample. Additionally, we have a random variable of interest and, similar to Case Question 1A, we will use a sample proportion to estimate a population proportion. Our statistical inference procedure now begins with assuming that the postponement of death theory is not true and forces the data to present sufficient evidence to the contrary. Proponents of the postponement theory will point to the fact that $\hat{p} \approx 0.08$ in their argument. Is this sufficient evidence? We should not presume it is. Instead, we should test to see if this small value of $\hat{p}$ is statistically significant. If the postponement of death theory is correct, then based on the way that the data are collected and the way that p is defined, we would expect $p < 0.25$. Therefore, our null and alternative hypotheses for Case Question 1B are

$H_0: p = 0.25$ (the population proportion is 25%)
$H_a: p < 0.25$ (the population proportion is less than 25%).

Make sure you realize that although we are testing the same “postponement theory”, the data has been collected in a different way in Case Question 1B. This necessitates the alternative hypothesis being lower tailed (notice the “less than” symbol), whereas $H_a$ was upper tailed in Case Question 1A.

Sampling Distribution

In order to know whether or not $\hat{p} = 60/747$ is a statistically significant result that is indicative of $H_a$, we must ask ourselves two very important questions:

it. Dealing with values this large should be done with extreme caution. This computational concern will give us a chance to examine the features of the binomial distribution when the value of n is large. When calculating the p-value for the hypothesis test in Case 1A, we were able to create a graph of the sampling distribution for. Recall, a sampling distribution is a description or list of the possible outcomes of a statistic along with the likelihoods of these outcomes. The graph of the sampling distribution included all of the possible values of Y which were. The graph of the sampling distribution for must include all of the integers from 0 to 747. Excel is able to generate this plot and it is printed below. Notice that the horizontal axis only includes the values of Y from 125 to 250. This is because the probabilities for all other values of Y are so small that they are effectively zero. So, to focus on the shape of the plot, these values were excluded. Just think of the plot extending to the left and the right of what you actually see, but hugging the horizontal axis very, very near zero out towards 0 in the left tail and out towards 747 in the right tail. 125 145 165 185 205 225 245 0

The Normal Approximation to the Binomial Distribution

The shape of this sampling distribution is unmistakable: it looks like a bell-shaped curve. Make sure you understand, however, that the plot above consists of just 748 isolated points. That is, the sampling distribution is graphically represented by the 748 points (drawn by Excel as diamonds). Because there are so many points in the graph, it has the appearance of a smooth curve. This fact can help us resolve the question in Case 1B. The exact sampling distribution of Y is binomial with a very large value of n. From the picture above, it appears as though this exact sampling distribution could be accurately approximated by a smooth curve. What we will do is replace the exact sampling distribution with this approximate smooth curve; our computational difficulties discussed earlier will then be completely resolved. Finally, we will be in a position to easily calculate the (approximate) p-value for our hypothesis test and resolve the question from Case 1B.

Recall that random variables that have a finite or countably infinite number of possible outcomes are called discrete. Random variables that have an uncountably infinite number of possible outcomes are called continuous. The binomial is a discrete random variable, but when n is large, it can be approximated by the continuous random variable known as the normal random variable.

When calculating probabilities associated with the binomial random variable (or any other discrete random variable, for that matter), we can just plug appropriate values into a mass function. Recall that all discrete random variables have mass functions, and it is the job of a mass function to provide the probabilities associated with all possible outcomes of the random variable. The formula for the binomial mass function is

$$P(Y = y) = \binom{n}{y} p^{y} (1 - p)^{n - y}, \qquad y = 0, 1, \ldots, n.$$

This formula gives the chance of exactly y successes among the n trials.

Probability associated with continuous random variables must come from what is known as a density function. All continuous random variables have density functions. It is the job of a density function to provide all the probabilities associated with outcomes of the continuous random variable. However, density functions achieve this goal in a different manner than mass functions do for discrete random variables. Probability associated with continuous random variables is calculated as area under the density curve, rather than by "plugging in" to the density function. The chart below compares and contrasts how to find $P(a \le X \le b)$ for the two types of random variables.
| Type of Random Variable | Relevant Function | How to Find $P(a \le X \le b)$ |
|---|---|---|
| Discrete | Mass Function | Plug in all values between (and including) a and b into the mass function and add up the results. |
| Continuous | Density Function | Find the area under the graph of the density function over the interval $[a, b]$. |

Characteristics of Normal Random Variables
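To make the two rows of the chart concrete, and as a preview of the normal curve discussed in this section, the following Python sketch finds the same kind of probability both ways. It is an illustration under stated assumptions, not the text's Excel workflow: the normal curve is given mean np and standard deviation sqrt(np(1 - p)) (the usual choice for matching a binomial, which is not spelled out in this excerpt), the interval endpoints 175 and 200 are arbitrary, the area is widened by 0.5 on each side (a continuity correction), and the helper name `binom_pmf` is hypothetical.

```python
import math
from statistics import NormalDist

def binom_pmf(y: int, n: int, p: float) -> float:
    """Mass function: probability of exactly y successes in n trials."""
    log_comb = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_comb + y * math.log(p) + (n - y) * math.log(1 - p))

n, p = 747, 0.25
a, b = 175, 200  # arbitrary interval chosen only for illustration

# Discrete row: plug every value from a to b into the mass function and add.
exact = sum(binom_pmf(y, n, p) for y in range(a, b + 1))

# Continuous row: area under a density curve over an interval.
mu = n * p                           # assumed mean of the matching normal curve
sigma = math.sqrt(n * p * (1 - p))   # assumed standard deviation
curve = NormalDist(mu, sigma)
# cdf(x) is the area to the left of x, so a difference of two cdf values
# is the area between them. The +/- 0.5 is the continuity correction.
area = curve.cdf(b + 0.5) - curve.cdf(a - 0.5)

print(f"sum of mass function values: {exact:.4f}")
print(f"area under normal density:   {area:.4f}")
```

The two numbers should agree closely, which is the sense in which a smooth curve can stand in for the 748 isolated points of the exact distribution.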

Despite there being a different normal curve for each value of $\mu$ and for each positive value of $\sigma$, the normal curve with $\mu = 0$ and $\sigma = 1$ is really the only one that matters for calculations and statistical inference. This is because any probability (area under the curve) calculation that is required for an arbitrary normal curve can be converted to a problem with the same answer that uses this standard normal curve. The standard normal random variable is the normal random variable with $\mu = 0$ and $\sigma = 1$. This "conversion" process is called standardizing a normal random variable and is summarized below.

Standardization Theorem for a Single Normal Random Variable: If X is a normal random variable with mean $\mu$ and standard deviation $\sigma$, then

$$Z = \frac{X - \mu}{\sigma}$$

is a standard normal random variable.

Calculations With Normal Random Variables

We know that for a continuous random variable X, finding $P(a \le X \le b)$ requires that we find the area under the graph of the density function over the interval $[a, b]$. Finding areas under density curves is generally a calculus problem. For normal random variables, these calculus problems have been solved by others and placed in tables for our use. Alternatively, modern computer software such as Excel can calculate these areas for us. It is important to understand the general philosophy that "standardizing" a random variable involves subtracting the mean and dividing the result by the standard deviation. To solidify this philosophy, we will briefly describe how to make calculations with normal random variables using the standardization theorem. Once this philosophy is understood, the quickest way to make the calculations is in Excel. Going straight to the Excel code without an understanding of the standardization process is dangerous and should be avoided. Standardization is a concept that will emerge again in Case Studies 2A, 2B and 3A.

As an example of how to perform calculations with a normal random variable, consider that the height of a mature pine tree in the East Texas region could be modeled with a normal random variable having mean $\mu = 90$ feet and standard deviation $\sigma = 8$ feet. What is the meaning of the phrase "could be modeled with"? What we are thinking about here is the entire population of pine trees (say, of one particular species, like loblolly) in East Texas. We have no hope of measuring the heights of all the mature pine trees in East Texas. But if we could, we might surmise that the average height would be 90 feet and that the heights would "crowd" around the value 90 in such a way that values near 90 would be quite common. Then, as we moved away from 90 feet, the values would become less and less frequent, yet in a symmetric way. By this we mean that a tree over 100 feet tall could be imagined to be just as likely to be observed as a tree under 80 feet tall. Finally, since it is rare to see observations more than three standard deviations away from the mean when dealing with normal random variables, we might imagine that mature trees above 114 feet or below 66 feet are quite rare around East Texas. All of these facts could be combined into a model: the normal curve with mean 90 and standard deviation 8.

Next, suppose that with this model in place, we were asked for the chance that a mature pine tree in East Texas grows to exceed 105 feet. How could this calculation be made? The answer is that we need to calculate an area under the density curve. The area we need to obtain is shaded and shown below.

[Figure: normal density curve with mean 90 and standard deviation 8; the area to the right of 105 is shaded.]
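The features of this model can also be checked numerically. The following is a minimal Python sketch, offered as an illustration rather than the Excel route mentioned above; it relies on `statistics.NormalDist` from the Python standard library, and the variable names are hypothetical. It computes the area to the right of 105 feet directly and again after standardizing (subtracting the mean and dividing by the standard deviation), and it checks the symmetry and three-standard-deviation remarks made about the model.

```python
from statistics import NormalDist

mu, sigma = 90.0, 8.0              # model for mature pine tree heights (feet)
heights = NormalDist(mu, sigma)    # normal curve with mean 90, sd 8
standard = NormalDist(0.0, 1.0)    # standard normal curve

# Area to the right of 105 feet under the mean-90, sd-8 density curve.
p_direct = 1.0 - heights.cdf(105)

# Same area after standardizing: subtract the mean, divide by the sd.
z = (105 - mu) / sigma
p_standardized = 1.0 - standard.cdf(z)

# Symmetry: a tree over 100 feet is as likely as a tree under 80 feet.
p_over_100 = 1.0 - heights.cdf(100)
p_under_80 = heights.cdf(80)

# Chance of landing more than three standard deviations from the mean,
# that is, below 66 feet or above 114 feet.
p_beyond_3sd = heights.cdf(66) + (1.0 - heights.cdf(114))

print(f"z = {z}")
print(f"P(X > 105), direct:       {p_direct:.4f}")
print(f"P(Z > z), standardized:   {p_standardized:.4f}")
print(f"P(X > 100) = {p_over_100:.4f}, P(X < 80) = {p_under_80:.4f}")
print(f"P(X < 66 or X > 114) = {p_beyond_3sd:.4f}")
```

Under this model the direct and standardized calculations give the same area, roughly 3%, which illustrates why the standard normal curve is the only one needed for calculations.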
Let the continuous random variable X represent the height of a mature pine tree in the East Texas region. Then, we can write $X \sim N(\mu = 90, \sigma = 8)$ to represent the fact that our model for X is normal with mean $\mu = 90$ and standard deviation $\sigma = 8$. We need to calculate $P(X > 105)$. This is done by using the standardization theorem. Once the random variable X is standardized, we denote the resulting random variable by Z. The capital letter Z will be reserved notation just for standard