














An introduction to sampling from populations, discussing the importance of statistical inference, random sampling, bias, and variability. Through examples and case studies, learners will understand how to recognize sources of bias and the importance of representative samples.
Typology: Study Guides, Projects, Research















Ryan Miller
Today I’ve brought with me a bag containing 100 pieces of candy. It is your job to correctly determine the weight of the bag. With your group, you will:
The group whose estimate is closest to the bag’s weight will be given the entire bag to consume or distribute as they see fit.
Every statistical analysis begins with a question, for example: How much does the bag of candy weigh?
- The best approach is to weigh the entire bag
- But what if your access to the bag is limited?
- In our example, the 100 pieces of candy in the bag represent a population: all of the cases we want to learn about
- I didn’t allow you access to the entire population, but rather a sample: a subset of cases from the population
We denote the size of a sample using n, for example n = 5.
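One way to turn a sample into an estimate of the bag's weight is to multiply the sample mean by the number of pieces in the bag. The sketch below illustrates that idea under stated assumptions: the candy weights are made-up random numbers in grams, not measurements of the real bag, and the sample is drawn as a simple random sample of n = 5.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: weights (grams) of the 100 pieces of candy.
# These values are invented for illustration only.
population = [random.uniform(5.0, 15.0) for _ in range(100)]

n = 5
sample = random.sample(population, n)  # simple random sample, no replacement

sample_mean = sum(sample) / n
estimated_bag_weight = len(population) * sample_mean  # scale up to 100 pieces
true_bag_weight = sum(population)  # normally unknown; known here only because we simulated it

print(f"estimate from n = {n}: {estimated_bag_weight:.1f} g")
print(f"true bag weight:      {true_bag_weight:.1f} g")
```

Changing the seed gives a different sample and therefore a different estimate, even though the bag itself has not changed.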
In a study on hand washing, researchers in several cities across the United States pretended to comb their hair in public restrooms while observing whether or not people washed their hands after going to the bathroom. They found that 85% of the 6,000 individuals they observed washed their hands.
What is the population? What is the sample?
- We could say the population is all people in the US that use public restrooms
- But people are likely to behave differently when someone else is in the restroom with them
- It would be wise to restrict the population to people in the US using a restroom with another occupant
Statisticians use different notation to distinguish population parameters (things we want to know) from estimates (things derived from a sample). For a few common measures, this notation is summarized below:
Statistic            Population Parameter   Estimate (from sample)
Mean                 μ                      x̄
Standard Deviation   σ                      s
Proportion           p                      p̂
Correlation          ρ                      r

For example, μ is the mean of the population, while x̄ is the mean of the cases that ended up in the sample.
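As a rough illustration of the sample-side column of this notation, the sketch below computes x̄, s, p̂, and r for a tiny sample. The data values and variable names are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import statistics

# Hypothetical sample of n = 5 cases (invented values, not real data).
x = [2.0, 4.0, 6.0, 8.0, 10.0]            # a numeric measurement
y = [1.0, 3.0, 2.0, 5.0, 4.0]             # a second numeric measurement
washed = [True, True, False, True, True]  # a yes/no outcome

x_bar = statistics.mean(x)         # sample mean, estimates mu
s = statistics.stdev(x)            # sample SD (n - 1 denominator), estimates sigma
p_hat = sum(washed) / len(washed)  # sample proportion, estimates p

# Pearson correlation r between x and y, estimates rho
y_bar = statistics.mean(y)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
syy = sum((b - y_bar) ** 2 for b in y)
r = sxy / (sxx * syy) ** 0.5

print(x_bar, round(s, 3), p_hat, round(r, 3))
```

The population parameters μ, σ, p, and ρ would be computed the same way, but over every case in the population rather than the five cases sampled.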
- Any given sample, regardless of how it was collected, only contains a subset of cases from the population
- This introduces variability when trying to use the sample to estimate a population parameter
- Just by random chance, some samples will yield more accurate estimates than other samples, even if an ideal sampling protocol is used
- Next week we’ll approach the goal of trying to understand this variability; today we’ll continue learning about sampling
To summarize, there are two reasons why an estimate might not accurately represent a population parameter: bias and variability.
- Variability decreases with larger sample sizes
- Bias is not improved by a larger sample
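A small simulation can make the difference between bias and variability concrete. The sketch below is only an illustration under stated assumptions: the population is 10,000 made-up values, and the "biased" protocol is a hypothetical one that can only reach cases above the population median. For each sample size it repeats the sampling 200 times and reports the center and spread of the resulting sample means.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values (invented for illustration).
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)

# Hypothetical biased sampling frame: only cases above the median can be selected.
median = statistics.median(population)
biased_frame = [v for v in population if v > median]

def simulate(frame, n, reps=200):
    """Draw `reps` samples of size n from `frame`; return (mean, SD) of the sample means."""
    means = [statistics.mean(random.sample(frame, n)) for _ in range(reps)]
    return statistics.mean(means), statistics.stdev(means)

results = {}
for n in (10, 100, 1000):
    results[n] = {
        "random": simulate(population, n),
        "biased": simulate(biased_frame, n),
    }
    for kind, (center, spread) in results[n].items():
        print(f"n={n:4d} {kind:6s} center={center:6.2f} spread={spread:5.2f} (mu={mu:.2f})")
```

In both protocols the spread of the estimates shrinks as n grows, but the estimates from the biased frame stay centered away from μ no matter how large the sample is.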
- Since 1916, the Literary Digest magazine had correctly predicted the winner of 5 straight presidential elections
- Prior to the 1936 election, the Literary Digest sampled 2.4 million people and predicted a landslide victory for Landon: 57% - 43%
- In the actual election, Roosevelt won by a landslide: 62% - 38%
How could the Digest have been so far off?
- Take a minute to discuss this with your group
- Consider whether the inaccurate estimate could be due to bias or variability
Selection Bias
- The Literary Digest sent 10 million questionnaires to addresses gathered from telephone books and club memberships
- This disproportionately screened out the poor; only 1 in 4 households owned a telephone at the time, and club members tended to be upper class
- Selection bias resulted in a non-representative sample
Non-response Bias
- Of the 10 million questionnaires, only 2.4 million were returned
- Responders tend to be different from non-responders
- The 2.4 million respondents likely weren’t even representative of the 10 million people polled
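The two biases described above can stack on top of each other, which the following sketch tries to illustrate. It is not a reconstruction of the 1936 poll: apart from the 1-in-4 telephone ownership figure, every rate in the code (support by group, response rates, population size) is an assumption invented for the illustration.

```python
import random

random.seed(1936)  # fixed seed so the sketch is reproducible

N = 100_000  # hypothetical number of voters
voters = []
for _ in range(N):
    has_phone = random.random() < 0.25         # 1 in 4 households owned a telephone
    p_roosevelt = 0.40 if has_phone else 0.70  # ASSUMED support rates by group
    for_roosevelt = random.random() < p_roosevelt
    voters.append((has_phone, for_roosevelt))

def share(group):
    """Proportion of a group that supports Roosevelt."""
    return sum(v[1] for v in group) / len(group)

# Selection bias: the questionnaire only reaches people with a telephone.
mailed = [v for v in voters if v[0]]

# Non-response bias: ASSUME Roosevelt supporters return the form less often.
returned = [v for v in mailed if random.random() < (0.15 if v[1] else 0.30)]

true_share = share(voters)
mailed_share = share(mailed)
returned_share = share(returned)

print(f"all voters:      {true_share:.3f}")
print(f"mailed (phones): {mailed_share:.3f}")
print(f"returned forms:  {returned_share:.3f}")
```

Under these assumed rates, each stage pushes the estimate further from the population value, and collecting more returned forms would not pull it back.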
That was 1936. Surely today we understand the importance of representative samples... right?
With your group, take a look at the NY Times article: https://www.nytimes.com/interactive/2017/07/25/sports/football/nfl-cte.html
Article Link: “I’m a brain scientist and I let my son play football”
The study population in the most recent CTE paper represents a biased sample, as stated by the authors themselves. This means only the brains of self-selecting people who displayed neurological symptoms while living were studied. This is important because this sample was not a reflection of the general football population. The study was based on 202 brains out of the millions of people who have played football all of which are former NFL players.
So, when you hear 99 percent of football players had CTE, that doesn’t mean that almost every football player will get CTE, and it doesn’t mean your child has a 99-percent chance of developing CTE if he or she plays football. It means 99 percent of a specifically selected study sample had some degree of CTE; not 99 percent of the general football population. This is an important distinction.
When collecting data, it is crucial to be aware of potential sources of bias. Some examples include:
This isn’t a complete list; there are countless reasons for data not being representative of the population of interest.
With your group, discuss whether each of the following is a sample or a population. If the data are a sample, describe the target population and whether the sample is biased.