Understanding the Concept of Sampling Distributions and Mean in Statistics, Study notes of Data Analysis & Statistical Methods

An explanation of the concept of sampling distributions and mean in statistics. It includes examples using a deck of playing cards and the central limit theorem. The document also covers calculating means and standard deviations for sample proportions.

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

koofers-user-x4s
koofers-user-x4s 🇺🇸

10 documents

1 / 8

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
STA100
Sampling Distributions
Lecture 14
Sampling Distributions Text Sections 7.1 and 7.2
If you think of almost any interesting question you would like to answer about almost any real world
situation you are immediately drawn to the notion of uncertainty. You might wonder whether taking
niacin will raise your HDL cholesterol. Until you try you won’t really know, but you can consult the
literature and find that, on average, “NIASPAN increases HDL cholesterol an average of 14% for men
and 20% (http://www.niaspan.com/About_Niaspan/index.asp). But do they really know this? Did
they give Niaspan to all women in America and report the number? Did they give it to a sample? How
large a sample? Did they use 10 people? Did they use 1000 people? How many should you use to
get an estimate for the real average increase in HDL? Why is the average interesting?
To make some of these ideas a little bit more concrete, you need to get out a deck of playing cards.
Please take a second and do this. As Euclid told King Ptolemy "there is no royal road to geometry"
(i.e. no easy way to learn) and as you’ve undoubtedly found out by now there is none to statistics,
either.
Now that you have your deck of 52 playing cards, think of the “average” card. Let’s agree that we will
count an Ace as 1, 2 as 2, 10 as 10, Jack as 11, Queen as 12, and King as 13. The deck will be
our population. The mean of this population is . The table
below shows how to calculate the mean for our grouped data:
x
f
x*f
x*(f/n)
1
4
4
0.076923
2
4
8
0.153846
3
4
12
0.230769
4
4
16
0.307692
5
4
20
0.384615
6
4
24
0.461538
7
4
28
0.538462
8
4
32
0.615385
9
4
36
0.692308
10
4
40
0.769231
11
4
44
0.846154
12
4
48
0.923077
13
4
52
1
sum
364
7
If you don’t like the table, just think about where your population distribution would “balance”
In this simple situation we know the average, so we are in a position to experiment a little. Then,
when we want to estimate the average blood pressure for “smokers” we will know how to work.
pf3
pf4
pf5
pf8

Partial preview of the text

Download Understanding the Concept of Sampling Distributions and Mean in Statistics and more Study notes Data Analysis & Statistical Methods in PDF only on Docsity!

STA100 Sampling Distributions Lecture 14

Sampling Distributions Text Sections 7.1 and 7.

If you think of almost any interesting question you would like to answer about almost any real world situation you are immediately drawn to the notion of uncertainty. You might wonder whether taking niacin will raise your HDL cholesterol. Until you try you won’t really know, but you can consult the literature and find that, on average, “NIASPAN increases HDL cholesterol an average of 14% for men and 20%” (http://www.niaspan.com/About_Niaspan/index.asp). But do they really know this? Did they give Niaspan to all women in America and report the number? Did they give it to a sample? How large a sample? Did they use 10 people? Did they use 1000 people? How many should you use to get an estimate for the real average increase in HDL? Why is the average interesting?

To make some of these ideas a little bit more concrete, you need to get out a deck of playing cards. Please take a second and do this. As Euclid told King Ptolemy "there is no royal road to geometry" (i.e. no easy way to learn) and as you’ve undoubtedly found out by now there is none to statistics, either.

Now that you have your deck of 52 playing cards, think of the “average” card. Let’s agree that we will count an Ace as 1, 2 as 2, … 10 as 10, Jack as 11, Queen as 12, and King as 13. The deck will be

our population. The mean of this population is. The table

below shows how to calculate the mean for our grouped data:

x f xf x(f/n) 1 4 4 0. 2 4 8 0. 3 4 12 0. 4 4 16 0. 5 4 20 0. 6 4 24 0. 7 4 28 0. 8 4 32 0. 9 4 36 0. 10 4 40 0. 11 4 44 0. 12 4 48 0. 13 4 52 1 sum 364 7

If you don’t like the table, just think about where your population distribution would “balance”

In this simple situation we know the average, so we are in a position to experiment a little. Then, when we want to estimate the average blood pressure for “smokers” we will know how to work.

First Experiment

Make sure your deck is well shuffled. This means shuffling 6 or 7 times. This simulates drawing from

a population “randomly”. In order to get a feel for why we sample randomly, suppose you want to

know the average height for men at SUNYIT. Since you are at the gym anyway, why not just ask

those basketball players to be your sample?

Once your deck is well shuffled, deal out 3 cards and note the numerical values (1 through 13). Place

your cards back in the deck and shuffle again 6 or 7 times. Deal out another 3 cards and calculate the

average. Repeat this 10 times and fill in the table below.

trial First Card Second Card Third Card Sum Average 1 2 3 4 5 6 7 8 9

I did this experiment myself and obtained the following:

trial First Card Second Card

Third Card

Sum Average

Here’s the same histogram for a sample size of 10. Compare the two histograms. What is the same? What is different?

Here’s the framework for most of the rest of the course: We will want to know the mean, or the

standard deviation, or the proportion of a population but, lacking the time and money to obtain data

from everyone in the population we will have to work with a sample. Thus while we want the

population mean or the population standard deviation or the population proportion we will

have to work with the sample mean or the sample standard deviation or the sample proportion

instead and we will have to figure out how to do this intelligently.

(^02 3 4 5 6 7 8 9 10 11 )

1

2

3

3.5 x 10

5

Sampling Distribution of Sample Proportions : Suppose you would like to know what the

proportion of lime Skittles is (the green ones). This is exactly the same (mathematically) as polling to determine what proportion of low income families in your school district receive free school lunches.

Here’s an experiment you can perform and share with us for a class presentation.

1. Buy one of the large bags of “snack pack” skittles (a large bag containing many smaller sized _bags like you would pass out to kids on Halloween).

  1. Open each little bag and count the number of green skittles and also the total number of_ _skittles.
  2. Complete a table like the following and present a histogram of sample proportions._

Bag Number Number of Green Skittles Total Number in Bag Sample Proportion 1 2 3 4 5 6 7 8

etc

40 or so

Now we need some numerical results. Let’s concentrate on sample means first. We can show the following:

  1. If you are conducting a simple random sample from a population then when we look at the histogram of sample means we will find that it is centered right over the population mean. That is, the mean of the “sampling distribution of the sample means” is the same as the population mean itself. We say that we have an “unbiased estimator”. Notationally, we use the symbol for the mean (that’s the mu part) of the sample means (that’s the xbar part). Note that for homework, you may wish to write “xbar” rather than learn to format the symbol. Choose whichever you prefer. Our first result then is
  2. If you are conducting a simple random sample from a finite population (like a deck of cards) then when we look at the histogram of sample means we will find that it is not as spread out as the original population. In fact, we already know that we prefer large samples to small (we somehow feel that we have “more information”). You can show that the spread of the sampling

The Central Limit Theorem

Here’s a little numerical experiment to show (again) why the Central Limit Theorem is the “600 pound gorilla” of statistics. Suppose you have a simple population which is uniformly distributed between 0 and 1. If you’ve played with

the rand() function in your spreadsheet you see that it kicks out numbers between 0 and 1 for you. You might want to check this with a histogram. I’ve taken a population which is uniformly distributed and sampled with sample sizes 1 through 6 and presented histograms of sample means. In each graph I superimpose the normal distribution. Notice by

the time sample sizes (your “n” values) get even modestly large (say 5 or so) the distribution of sample means starts to look fairly mound shaped and symmetrical.

0 0.5 1

0

1

Sample Size =

0 0.5 1

0

1

2

3

Sample Size =

0 0.5 1

0

1

2

3

Sample Size =

0 0.5 1

0

1

2

3

Sample Size =

0 0.5 1

0

1

2

3

4

Sample Size =

0 0.5 1

0

1

2

3

4

Sample Size =

So, what can we do with this? Right now we just make predictions about sampling procedures. In the next lecture we see how to sample intelligently and say useful things about our samples. For now, try to answer the example questions. Assume in each case that both the population and the distribution of sample means are normally distributed.

  1. Scores of students on the ACT college entrance exam in a recent year had the normal distribution with mean and standard deviation. What is the probability that a single student randomly chosen from all those taking the test scores 21 or higher? What is the probability that the mean score for 36 students randomly selected from all those who took the test nationally is 21 or higher?

a. Since our population is normally distributed, we can get the probability that a data point will lie above 21 easily. If we know we have a normal distribution, and if we know the mean and the standard deviation, we can look up probabilities or areas as usual. Convert to z and use the table. Obtain

Now our table likes to see z values rounded to the nearest hundredth, so look up the area for z=0.41 and get A=0.6591. Our table gives areas to the left or below. We want an area to the right so get prob=1 - 0.6591 = 0.3409.

b. Let’s think this part through. If our population has a mean of 18.6 and a standard deviation of 5.9, then when we sample from this population with a sample size of we will have a sampling distribution of the means with center and standard deviation of sample means of These numbers are what we’ll use to convert to z and look up in our table.

Look up the area for z=2.44 and get A=0.9927. Our table gives areas to the left or below. We want an area to the right so get prob=1 - 0.9927= 0.0073.

So what’s the point of all this? Evidently it’s not at all uncommon for an individual to score 21 or above (it happens roughly 34% of the time). For a collection of 36 randomly selected individuals to have a same mean at or above 21 is relatively rare (occurring less than 1% of the time).

And here’s another possible presentation: A bottling company uses a filling machine to fill plastic bottles with a popular cola. The bottles are supposed to contain 300 milliliters (ml). In fact, the contents vary according to a normal distribution with mean μ = 304 ml and a standard deviation of 2 ml. What is the probability that an individual bottle contains less than 301 ml? What is the probability that the mean contents of the bottles in a six pack is less than 301 ml? What is the probability that the mean contents of the bottles in a case (24 bottles) is less than 301 ml?