Lecture Notes on Binomial and Normal Random Variables | STA 100, Study notes of Data Analysis & Statistical Methods

Material Type: Notes; Class: Statistical Methods; Subject: Statistics; University: SUNY Institute of Technology at Utica-Rome; Term: Unknown 1989;

Uploaded on 08/09/2009

STA100 Binomial and Normal Random Variables Lecture 12
Recall: We have started working with random variables, which we think of as experiments with
numerical outcomes. For example, as I mentioned in a previous lecture, I planted 30 trees in my
backyard a few weeks ago. If I wait till early July and count the number of plants which are still alive I
obtain a number. This is a discrete situation (random variable) because possible outcomes are whole
numbers (0, 1… 30). When we count things we use discrete models. Other examples of experiments
which are discrete would be: count how many phone calls are received in Utica between noon and
2PM tomorrow; draw a liter of water from a swamp and count the number of insect larvae in your
sample; take a minute right now to see how many heart beats you count in the next 60 seconds.
These situations are different from experiments which involve length or time. We call these random
variables continuous because they involve physical quantities which can assume numbers on a
continuum. For example, I could consider the height of each plant. Admittedly, my ruler doesn’t allow
me to measure infinitely finely, but the concept of height is a continuous one. This lecture
attempts to remind you of some discrete ideas and begin to introduce continuous ones as well.
Recall: If a population or experiment has only two types of outcomes (e.g. coin toss: HEADS/TAILS)
which are traditionally called success or failure, and if these outcomes are independent from trial to
trial, then the probability of obtaining r successes on n trials is
$$P(r) = C_{n,r}\, p^{r} (1 - p)^{n-r}$$
Here $p$ is the constant probability of success and $1 - p = q$ is the constant probability of failure.
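As a quick check of the formula, here is a minimal Python sketch. The function name `binomial_pmf` is my own label and not part of the course materials, and `math.comb(n, r)` plays the role of $C_{n,r}$ (it needs Python 3.8 or later). The coin-toss numbers are only an illustration.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent trials,
    each with success probability p (hypothetical helper name)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Illustration: exactly 2 heads in 4 tosses of a fair coin.
# There are 6 such sequences out of 16 equally likely ones.
prob_two_heads = binomial_pmf(2, 4, 0.5)
print(prob_two_heads)

# The probabilities over all possible r should add up to 1.
total = sum(binomial_pmf(r, 4, 0.5) for r in range(5))
print(total)
```

Changing `n` and `p` lets the same function handle any two-outcome experiment with independent trials.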
Example: Suppose that a population is composed of 20% smokers and 80% nonsmokers. You form a
random sample of 15 individuals. What is the probability that, in your sample,
1. Exactly 4 will be smokers?
2. At least 4 will be smokers?
3. At least 6 will be nonsmokers?
4. What is the expected number of smokers in your sample?
5. What is the standard deviation of this random variable?
Sometimes we compute probabilities using the formula and sometimes we find it easier to use
software or a table. I’ve made the table below using software, but you can use the table in the
appendix:
 Smokers r    P(r)
     0        0.035
     1        0.132
     2        0.231
     3        0.250
     4        0.188
     5        0.103
     6        0.043
     7        0.014
     8        0.003
     9        0.001
    10        0.000
    11        0.000
    12        0.000
    13        0.000
    14        0.000
    15        0.000
So, the probability that exactly 4 will be smokers is 0.188. To get the probability of at least 4
consider the shaded region and add up the individual probabilities:
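 Smokers r    P(r)
     0        0.035
     1        0.132
     2        0.231
     3        0.250
     4        0.188   (shaded)
     5        0.103   (shaded)
     6        0.043   (shaded)
     7        0.014   (shaded)
     8        0.003   (shaded)
     9        0.001   (shaded)
    10        0.000   (shaded)
    11        0.000   (shaded)
    12        0.000   (shaded)
    13        0.000   (shaded)
    14        0.000   (shaded)
    15        0.000   (shaded)
So, $P(\text{at least 4 smokers}) = 0.188 + 0.103 + 0.043 + 0.014 + 0.003 + 0.001 = 0.352$.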
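If you would rather let software build the table, the following Python sketch reproduces it for the smoker example (n = 15, p = 0.2). This is only one possible way to do it; the variable names are my own, and the notes themselves do not say which software was used.

```python
from math import comb

n, p = 15, 0.2  # sample size and proportion of smokers from the example

# P(r) for r = 0, 1, ..., 15
table = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

for r, prob in enumerate(table):
    print(f"{r:2d}  {prob:.3f}")

exactly_4 = table[4]
at_least_4 = sum(table[4:])  # the shaded region: r = 4, 5, ..., 15
print(f"exactly 4:  {exactly_4:.3f}")
print(f"at least 4: {at_least_4:.3f}")
```

Summing a slice of the list is the software version of adding up the shaded entries by hand.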

The next one is a little tricky. Notice that exactly 6 nonsmokers is the same as 9 smokers (there are 15 people, 6+9=15) and exactly 7 nonsmokers is the same as 8 smokers (there are 15 people, 7+8=15), so “At least 6 will be nonsmokers” is the same as 9 or 8 or … or 0 smokers.

In the table, the shaded region is now r = 0 through 9: 0.035 + 0.132 + 0.231 + 0.250 + 0.188 + 0.103 + 0.043 + 0.014 + 0.003 + 0.001.

This gives us $P(\text{at least 6 nonsmokers}) = P(\text{at most 9 smokers}) \approx 1$.

The expected value we talked about in the previous lecture is “easy” to calculate,

$$\mu = \sum r \cdot P(r)$$

I’ll assume you are using Excel for this:

   r      P(r)     r·P(r)
   0      0.035    0
   1      0.132    0.132
   2      0.231    0.462
   3      0.250    0.750
   4      0.188    0.750
   5      0.103    0.516
   6      0.043    0.258
   7      0.014    0.097
   8      0.003    0.028
   9      0.001    0.006
  10      0.000    0.001
  11      0.000    0.000
  12      0.000    1.15E-05
  13      0.000    7.16E-07
  14      0.000    2.75E-08
  15      0.000    4.92E-10
  sums    1        3

If you’ve noticed that 20% of 15 is equal to 3 you’ve got a nice result. This gives us a fast way to calculate expected values, but only for the binomial case.
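The same check can be sketched in Python instead of Excel. The code below adds up r times P(r) over the whole table and compares the result with the shortcut n times p, which applies only to the binomial case. The variable names are my own.

```python
from math import comb

n, p = 15, 0.2  # the smoker example

# P(r) for r = 0, 1, ..., 15
table = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

# Expected value from the definition: sum of r * P(r)
mean_from_table = sum(r * prob for r, prob in enumerate(table))

# Binomial shortcut: n * p (here 20% of 15)
mean_shortcut = n * p

print(mean_from_table, mean_shortcut)
```

Both numbers should agree up to floating-point rounding.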

As far as spread, we can define the standard deviation for a discrete probability distribution in a similar way to expected value:

$$\sigma = \sqrt{\sum (r - \mu)^2 \, P(r)}$$

A little scary. Luckily we have a quick formula, but this is only good for the binomial random variable:

$$\sigma = \sqrt{n p (1 - p)}$$
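To see that the two routes agree, here is a short Python sketch for the smoker example. It computes the standard deviation from the general definition for a discrete distribution, $\sqrt{\sum (r-\mu)^2 P(r)}$, and from the binomial shortcut $\sqrt{np(1-p)}$. The variable names are my own.

```python
from math import comb, sqrt

n, p = 15, 0.2  # the smoker example

# P(r) for r = 0, 1, ..., 15
table = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
mu = n * p  # expected number of smokers

# Definition: square root of the sum of (r - mu)^2 * P(r)
sd_from_table = sqrt(sum((r - mu) ** 2 * prob for r, prob in enumerate(table)))

# Binomial shortcut: square root of n * p * (1 - p)
sd_shortcut = sqrt(n * p * (1 - p))

print(round(sd_from_table, 3), round(sd_shortcut, 3))
```

For this example n·p·(1 − p) = 15 × 0.2 × 0.8 = 2.4, so both values are the square root of 2.4.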

One last example and the third “class presentation” for this week: Suppose you will roll a fair die 10 times and you consider a “4” to be a success. Calculate the probabilities of obtaining 0, 1… 10 successes. Set these up in a table and show how to calculate the standard deviation using the formula $\sigma = \sqrt{\sum (r - \mu)^2 \, P(r)}$. Also show that this gives the same result as when multiplying, $\sigma = \sqrt{n p (1 - p)}$.

Continuous Distributions: The Normal Random Variable

One of the most important continuous distributions in probability and statistics is the normal distribution. As we will see later, the distribution is found to describe a wide variety of natural phenomena and also models many sampling situations. For example, as we will see in the next lecture, a binomial random variable (discrete) may be fairly well approximated by the normal distribution (continuous) when $np > 5$ and $n(1 - p) > 5$.

Important Properties Of The Normal Distribution:

• It is symmetrical about its mean

• It is "bell shaped"

• It is defined on the whole real line.

• It has a wild formula $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
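For anyone curious what the formula looks like in practice, here is a minimal Python sketch of it. The function name `normal_pdf` and its default arguments (mean 0, standard deviation 1) are my own choices for illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x, for mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / sqrt(2 * pi * sigma**2)

# The curve is highest at the mean ...
peak = normal_pdf(0.0)
print(peak)

# ... and symmetric about it: the heights one unit either side match.
left, right = normal_pdf(-1.0), normal_pdf(1.0)
print(left, right)
```

Note that the function returns the height of the curve, not a probability; probabilities come from areas under the curve.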

Some of us have to work with this formula all the time. In STA100 we see it once and run away quickly. Luckily for us the probabilities we need from this distribution are tabulated. Before looking at the tables, remember the empirical rule. If a set of data is approximately normally distributed we will see

• Approximately 68% of the data within one standard deviation of the mean.

• Approximately 95% of the data within two standard deviations of the mean.

• Approximately 99.7% of the data within three standard deviations of the mean.
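The three percentages in the empirical rule can be checked with a few lines of Python. This sketch relies on a standard fact that is not derived in these notes: for a normal distribution, the proportion within k standard deviations of the mean equals erf(k/√2), where `erf` is the error function in Python's `math` module. The function name `within_k_sd` is my own.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```

The printed values should round to roughly 68%, 95% and 99.7%.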

The graph below shows us what a typical normal distribution looks like and how we might use it.

[Figure: Binomial Probabilities n=50, p=0.5 with Normal Distribution Superimposed; horizontal axis from 0 to 50; legend: normal, binomial]

The second to last (penultimate) “student presentation” for this week. This may take a few minutes, but will give us some data to play with. Find 30 people willing to give you 1 minute of their time (at work, in a crowded theater, wherever you’d like). For each of these 30 people time their number of

As a quick example, suppose we have a population whose histogram looks as shown:

From the figure we can see that the smallest data points in the population are near zero and the largest are near 2 (I’m of course looking along the horizontal axis for this). Also, since the graph gets higher to the right we have more data points near 2 than near 0. From this figure, if we want the probability that a data point falls between 0 and 1.5 (or, put another way, the proportion of data between 0 and 1.5) we must find the area under the curve between 0 and 1.5. Recalling that the area of a triangle is $A = \frac{1}{2} \cdot \text{base} \cdot \text{height}$ we see that:

• The total area under the curve is $\text{total area} = \frac{1}{2}(2)(1) = 1$, which is what we want (100% of the data lie between 0 and 2).

• The area between 0 and 1.5 is $P(0 < X < 1.5) = \frac{1}{2}(1.5)(0.75) = 0.5625$, meaning that a little more than half the population lies between 0 and 1.5.
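The triangle calculation can also be checked numerically. The sketch below assumes the curve is the straight line f(x) = x/2 on the interval from 0 to 2; that formula is my inference from the base of 2, the height of 1 at x = 2 and the height of 0.75 at x = 1.5, and it is not stated in the notes. The helper `area_under` is a plain midpoint-rule approximation with a hypothetical name.

```python
def density(x):
    """Assumed density for the triangular histogram: f(x) = x/2 on [0, 2]."""
    return x / 2 if 0 <= x <= 2 else 0.0

def area_under(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = area_under(density, 0, 2)        # should be close to 1
prob_below_1_5 = area_under(density, 0, 1.5)  # should be close to 0.5625
triangle_formula = 0.5 * 1.5 * density(1.5)   # (1/2) * base * height

print(total_area, prob_below_1_5, triangle_formula)
```

The numerical approach is overkill for a triangle, but the same `area_under` idea works for curves where no simple area formula is available.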

When a curve is more complicated than a triangle it becomes challenging to obtain areas/probabilities. Luckily, someone has done the work and placed all the areas under the normal curve we could reasonably want in a table. If you look inside the front cover of your book you should see the famous “z-table”. The pictures which accompany the table show you how to obtain areas.

[Figure: Simple Probability Density Function; horizontal axis from 0 to 2]

For example, if you have a data set which follows a standard normal curve (bell shaped, centered at x=0 and with standard deviation=1) you can find out what proportion of data sit below 2.25. The table lists “z values” as X.XX, meaning it considers z values in the form of units, tenths and hundredths. Our number 2.25 is equal to 2.2 + 0.05, so look down the first column till you see a 2.2 (about 2/3 down the second page of the table). Now look across the columns till you find the area in the column labeled 0.05. You should find an area of 0.9878.

Try this again for a z value of -1.38 and obtain an area of 0.0838. We are almost done. If you want to know what proportion of the area under a standard normal distribution lies between 1.45 and 2.20 you can set up a little table:

z2 = 2.20    A2 = 0.9861
z1 = 1.45    A1 = 0.9265
             A  = 0.0596

We would write $P(1.45 < Z < 2.20) = A_2 - A_1 = 0.0596$.
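The table lookups above can be double-checked in Python. This sketch assumes the z-table lists the area to the left of z, which matches the values quoted in the text, and it uses the standard identity Φ(z) = ½(1 + erf(z/√2)). The function name `phi` is my own.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(phi(2.25), 4))   # compare with the table entry for z = 2.25
print(round(phi(-1.38), 4))  # compare with the table entry for z = -1.38

# Area between two z values: larger area minus smaller area.
area_between = phi(2.20) - phi(1.45)
print(round(area_between, 4))
```

Small differences in the last decimal place can occur because the printed table rounds each area before you subtract.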

At this point we can say that if our data is mound shaped and centered at the origin with a spread or standard deviation of 1 then we can find areas/proportions/probabilities. Suppose you instead have IQ data. A common IQ test has a mean or center of 100 (meaning 100 is the average IQ) and a spread of 15. If these scores are normally distributed then the histogram would look like this:

Suppose you would like to know what proportion of the population has an IQ between 90 and 120. The Z-table can help us, but first we have to rephrase our question. It turns out that we can use our Z-

[Figure: Histogram of IQ Scores, μ=100 and σ=15; horizontal axis from 20 to 180]