Stat 104
Section B
Anna Peterson
1
Stat 104
- Office Hours: Snedecor 2211 MWF 10:00-11:
- Lecture: MWF 8:40-9:40 MoleBio 1420
- Laboratory: TF 1:20-3:20 Carver 0305
- Required Text: Just the Essentials of Elementary Statistics,
10th Edition, Robert Johnson & Patricia Kuby, Thomson:
Brooks/Cole
2
Prerequisites
• Make sure you can do basic algebra.
- There will be a pre-test passed out.
- Understand summation notation
- Order of operations
• Make sure you can use a calculator.
- Bring your calculator to class and lab.
3
How can you do well in this class?
- Attend all lectures and pay attention.
- Attend all labs and participate.
- Complete all assignments.
- Go over answers to assignments.
- READ and STUDY the textbook.
- Come to office hours with questions.
- Form study groups with fellow students.
4
Course Information and Policies
- Exams: In class exams will be given during the lab period:
Bring a Calculator, pencil/pen, formula paper, and tables
handed out in class.
- Exams will be given during the 2 hour lab period on Fridays.
- The exam is closed notes and book. One 8x11 sheet of paper, typed/written on one side may be used for the first exam. Two 8x11 sheets of paper, typed/written on one side each for the second exam.
- Final Exam: August 7th
- Three 8x11 sheets of paper, typed/written on one side each may be used on the final exam. One from each of the two previous exams and one for the new material.
5
Course Information and Policies
Lab: Two two-hour laboratory sessions are scheduled each week. Bring
your book, class notes, tables and a calculator to the lab.
- Class participation points are given for presenting a homework
problem solution during lab. Each student is only required to
present one solution. You will receive 10 points for presenting.
- There are 11 labs this summer. I will count your top 8 lab scores
towards your lab grade. Each lab is worth 5 points.
- Extra presentations and better lab attendance will influence
boundary grades at the end of the semester.
6
Course Information and Policies
- Homework: Individual practice is an important part of
learning. For this reason homework problems will be
assigned throughout the semester.
- I encourage everyone to attempt the problems
- Please note that homework will not be collected
- Attach your completed homework to a quiz to receive a minimum of 5 points on a quiz.
7
Course Information and Policies
Quizzes:
- Quizzes will be administered during lab/class after homework is due.
- The quizzes will be open note, open book.
- You will have 10 minutes to complete the quiz.
- The quizzes will be composed of questions based on the homework.
- Each quiz is worth 10 points. Up to 8 quizzes will be
given. Only your top five scores will be used to determine your total points. (Max 50 points)
- Attach your completed homework to the quiz and
receive a minimum of 5 points on the quiz.
8
Course Information and Policies
- Project: A project will be assigned during the
semester. This project is intended to expose students to the collection and statistical analysis of data to solve real world problems. Students will
work in groups. Specific details will be given later in the semester.
- Grading: Letter grades including plus/minus will be
given based on performance on exams, class participation, and the project.
9
Tentative Schedule/Grading
- Exam I 100 points
- Exam II 100 points
- Quizzes 50 points (10 pts each)
- Project 50 points
- Class Participation 50 points (10 for the
presentation and 40 for Labs)
- Final Exam 150 points
- Total Points 500 points
10
What Is Statistics?
- Wikipedia: A mathematical science pertaining
to the collection, analysis, interpretation or
explanation, and presentation of data. It is
applicable to a wide variety of academic
disciplines, from the natural and social sciences
to the humanities, government and business.
11
What is Statistics?
- Statistics is the science of collecting,
describing and interpreting data allowing for
data-based decision making.
- "I like to think of statistics as the science of
learning from data..."
(Jon Kettenring, ASA President 1997)
12
What is Statistics?
In Business and Industry, statistics can be used to quantify unknowns in order to optimize resources,
e.g.
- Predict the demand for products and services.
- Check the quality of items manufactured in a facility.
In Agriculture, statistics can be used to:
- Predict the crop yields.
- Estimate minimum fertilizer needed.
13
What Is Statistics?
- Statistics is about …variation.
- The world is full of data.
- Data exhibit variation.
- Recognizing, displaying and quantifying variation in data can help us make sense of the world.
- Try to explain variation.
14
We distinguish between descriptive and
inferential Statistics:
Descriptive Statistics: The collection, presentation and description of data in the form of graphs, tables and numerical summaries such as averages, variances etc.
- look for patterns
- summarize and present data
- quick information
- compare several groups, i.e. one can easily look for differences and similarities
15
We distinguish between descriptive and
inferential Statistics:
Inferential Statistics: Deals with the interpretation of data as well as drawing conclusions and making generalizations based on data for a larger group of subjects.
- making data-based decisions
- generalizing information obtained from descriptive analysis to a larger group of individuals
16
Example: Before movies are released they are previewed
by a selected audience. Assume 200 people are asked to provide an overall rating for a movie. Results:
- 24% very satisfied
- 26% satisfied
- 33% in between
- 12% dissatisfied
- 5% very dissatisfied
24% of the 200 previewers were very satisfied with the movie
{this is a descriptive statement based on a sample of 200
previewers.}
24% of all people going to see the movie will be very satisfied
{this is an inferential statement for the entire population of
individuals.}
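The descriptive side of this example can be sketched in a few lines of Python. This is only an illustration: the names `ratings_pct`, `counts` and `p_hat` are choices made here, not part of the original notes.

```python
# Reported percentages for the 200 previewers (taken from the example above).
n = 200
ratings_pct = {
    "very satisfied": 24,
    "satisfied": 26,
    "in between": 33,
    "dissatisfied": 12,
    "very dissatisfied": 5,
}

# Descriptive statistics: counts and a proportion that describe this sample only.
counts = {label: n * pct // 100 for label, pct in ratings_pct.items()}
p_hat = counts["very satisfied"] / n  # sample proportion "very satisfied"

print(counts)
print(p_hat)
```

Using `p_hat` as a guess for all people who will see the movie is the inferential step; the code describes the sample and cannot by itself guarantee that the same value holds for the population.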
17
18
Statistics is the science of:
- Collecting
- Describing (displaying)
- Interpreting Data
19
We collect data to answer a specific question of
interest.
- Does nitrogen improve corn yield?
- What seed is best?
- What is the relationship between rainfall and
yield?
- Does this new drug cure the disease? Is it
safe?
- What do voters think about a candidate or an
issue?
[Figure: plot with axes labeled yield and nitrogen]
20
Does nitrogen improve corn yield?
We have a question that we would like to answer. Are we interested in all corn, just one brand of corn, or only corn grown in Iowa?
The group that is to be studied is called the population and each
element of the population is called an individual.
We now decide that we are specifically interested in all corn types grown in Iowa. Is it feasible to collect data from every single corn field in the state of Iowa? No. Not enough time or money.
We look for a reasonable subset of the population called a sample.
Perhaps one farm from each county in Iowa. Population: All farms in Iowa. Individual: A single farm.
21
Population – all items of interest.
Example: All farms in Iowa.
Parameter – numerical value summarizing all the data of the entire population.
Example: population mean yield of corn.
Sample – a few items from the population.
Example: 10 farms in Iowa.
Statistic – numerical value summarizing the sample.
Example: sample mean yield of corn.
22
Once we have data collected from our sample we can look at the statistics.
Statistics are the numbers summarizing the data in a sample, e.g. x̄, s, p̂. Sometimes referred to as point estimates.
We hope these statistics are good estimates of the parameters.
Parameters are the numbers summarizing the population, e.g. μ, σ, p.
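The difference between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a made-up population of 5,000 farm yields (the numbers 180 and 20 are hypothetical and do not come from the notes) and compares the population values with those from a sample of 10 farms.

```python
import random
import statistics

random.seed(104)  # fixed seed so the sketch is reproducible

# Hypothetical population: corn yields (bushels/acre) for 5,000 made-up farms.
population = [random.gauss(180, 20) for _ in range(5000)]

# Parameters summarize the whole population (usually unknown in practice).
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

# Statistics summarize a sample; here, a simple random sample of 10 farms.
sample = random.sample(population, 10)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)

print(f"parameter mu    = {mu:.1f}, statistic x-bar = {x_bar:.1f}")
print(f"parameter sigma = {sigma:.1f}, statistic s     = {s:.1f}")
```

In a real study only the sample values would be available, which is why the statistics serve as point estimates of the parameters.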
What do we need to measure to answer the question of interest?
Variables are the characteristics of the individuals within the population. What would be the variable for the proposed question? (Does nitrogen improve corn yield?) yield/acre
Is this variable qualitative or quantitative? quantitative
Data
- Information
- Context is important
- Who are we collecting data on?
- Cases: Rows in a data table.
- What data are we collecting?
- Variables: Columns in a data table.
23
24
Gender  Age  Job         Happiness (1-5; 5 = very happy)  Number of Children  Average Salary/hour
F       25   Accountant  3                                0                   25.
M       30   Sales       4                                2                   10.
F       19   Student     2                                0                   7.
F       24   Marketing   3                                1                   20.
F       56   Teacher     5                                4                   15.
M       32   Librarian   5                                0                   20.
M       34   Accountant  3                                0                   25
M       45   Realtor     2                                2                   15.
M       18   Student     5                                1                   7.
F       62   Bus driver  4                                2                   12.
F       34   Cashier     2                                3                   6.
Data
- Who?
- What?
- Age, Number of Children
- Happiness, Job title, Gender,
25
What?
- Variables: characteristic of interest
about each individual element of a
population or sample
- Categorical v. Numeric
- Nominal v. Ordinal
- Continuous v. Discrete
26
27
Variable
Quantitative (Numerical):
quantifies an element of a population
Examples: age, height, shoe size
Qualitative (Categorical):
describes or categorizes an element of a population
Examples: Gender, hair color, eye color
28
Qualitative (Categorical)
Nominal: names an element
Examples: Gender, hair color, type of vehicle you own, favorite color
Ordinal: incorporates an ordered position, or ranking
Examples: level of satisfaction with a product, heat setting on a microwave (low, med, high)
29
Quantitative (Numerical)
Continuous: assumes an uncountable number of values
Examples: Height, weight, distance (measurements)
Discrete: assumes a countable number of values
Examples: Age, number of siblings, dozens of eggs (things you can count)
30
Variable
- Qualitative (Categorical): describes or categorizes an element of a population
  - Nominal: names an element
  - Ordinal: incorporates an ordered position, or ranking
- Quantitative (Numerical): quantifies an element of a population
  - Discrete: assumes a countable number of values
  - Continuous: assumes an uncountable number of values
31
Gender  Age  Job         Happiness (1-5; 5 = very happy)  Number of Children  Average Salary/hour
F       25   Accountant  3                                0                   25.
M       30   Sales       4                                2                   10.
F       19   Student     2                                0                   7.
F       24   Marketing   3                                1                   20.
F       56   Teacher     5                                4                   15.
M       32   Librarian   5                                0                   20.
M       34   Accountant  3                                0                   25
M       45   Realtor     2                                2                   15.
M       18   Student     5                                1                   7.
F       62   Bus driver  4                                2                   12.
F       34   Cashier     2                                3                   6.
32
Variable
- Qualitative (Categorical): describes or categorizes an element of a population
  - Nominal: Gender, Job
  - Ordinal: Happiness (ordered by amount of happiness)
- Quantitative (Numerical): quantifies an element of a population
  - Discrete: Age, Number of Children
  - Continuous: Average Wage
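The type of a variable decides which summaries make sense. The sketch below encodes the example data table in Python (the tuple layout and variable names are illustrative choices) and computes one reasonable summary for each type. The salary column is left out because its values are cut off in these notes.

```python
import statistics
from collections import Counter

# Rows of the example data table: (gender, age, job, happiness, children).
rows = [
    ("F", 25, "Accountant", 3, 0),
    ("M", 30, "Sales", 4, 2),
    ("F", 19, "Student", 2, 0),
    ("F", 24, "Marketing", 3, 1),
    ("F", 56, "Teacher", 5, 4),
    ("M", 32, "Librarian", 5, 0),
    ("M", 34, "Accountant", 3, 0),
    ("M", 45, "Realtor", 2, 2),
    ("M", 18, "Student", 5, 1),
    ("F", 62, "Bus driver", 4, 2),
    ("F", 34, "Cashier", 2, 3),
]

# Nominal variable: only counts per category make sense.
gender_counts = Counter(row[0] for row in rows)

# Ordinal variable: categories can be ranked, so a median is meaningful.
median_happiness = statistics.median(row[3] for row in rows)

# Discrete quantitative variables: arithmetic such as a mean is meaningful.
mean_age = statistics.mean(row[1] for row in rows)
mean_children = statistics.mean(row[4] for row in rows)

print(gender_counts, median_happiness, mean_age, mean_children)
```

Averaging a nominal variable such as Job would have no meaning, which is one practical reason to classify variables before summarizing them.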
33
Example:
Gallup News Service conducted a survey of 1012 adults aged 18 years or older, August 29-September 5, 2000. The respondents were asked, “Has anyone in your household been the victim of a crime in the past 12 months?” Of the 1012 adults surveyed, 24% said they or someone in the household had experienced some type of crime during the preceding year. Gallup News Service concluded that 24% of all households had been victimized by crime during the past year.
a) Identify the research objective.
To determine the proportion of households in the US that have been a victim of a crime in the past 12 months.
b) Identify the sample.
1012 adults aged 18 years or older.
c) List the descriptive statistics.
p̂: 24% of respondents stated that they or someone in the household experienced some type of crime.
d) What is the corresponding parameter?
p: The proportion of households that experienced some type of crime in the past 12 months.
e) State the conclusions made in the study.
24% of all households in the US have been victimized by crime in the past 12 months.
Notice that the conclusions are made (inferred) toward the entire population. We hope the statistic (p̂) is a good point estimate of the parameter (p).
Data Collection
- Sampling studies (surveys)
- Experiments
34
Sample Surveys
- Idea 1: Examine a part of the whole.
- Easier to obtain
- Easier to work with
35
Population – all items
of interest.
Sample – a
few items from
the population.
Properties of a Sample (part of a whole)
- Would like the sample to be representative of the
population.
- Should look like a smaller version of the population
- This may not be possible, but at least we would like a
sample that is not biased.
- A biased sample is one that over (or under) represents a
certain portion of the population
- Telephone Surveys (how biased?)
36
Sample Surveys
- Idea 2: Random selection
- Selecting items from the population should be
done at random so as to reduce the chance of getting a biased sample.
- Random selection is a KEY idea in data collection
37
Sample Surveys
- Idea 3: It’s the sample size!
- What fraction of the population is sampled is not important.
- The size of the sample is the important thing
- i.e. 1000 items in a sample tells you about as much
about a population of size 1,000,000 as it does about a population of size 100,000.
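One way to check this claim numerically is with the standard error of a sample proportion, including the finite population correction. That formula is a standard result that is not stated in these notes, and the value p = 0.5 is an assumption made only for illustration.

```python
import math

def se_proportion(p, n, population_size):
    """Standard error of a sample proportion under simple random sampling
    without replacement, including the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc

p = 0.5   # assumed population proportion (illustrative)
n = 1000  # sample size from the slide

se_small = se_proportion(p, n, 100_000)
se_large = se_proportion(p, n, 1_000_000)
print(round(se_small, 4), round(se_large, 4))
```

The two standard errors are nearly identical, which is the sense in which the sample size matters and the sampled fraction does not.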
38
What about a census?
- Would a census (complete enumeration) of
the population be a better way to go?
- Difficult to do.
- Populations are often dynamic.
- Can be more complex.
- More expensive.
- Example: the Decennial US Census (next one in
39
Example
- Population: All students at ISU.
- Question: Have you posted a video on YouTube?
- Population parameter: Proportion of all ISU students who would answer yes.
- Sample: 400 ISU students.
- Sample statistic: the proportion of the 400 students in the sample who say yes.
40
How should we select the 400?
Two weak approaches…(don’t do these!)
- Voluntary Response…
- Put an ad in the ISU Daily with the question and ask
students to drop off their answers.
- Problems?
- Convenience Sampling…
- Go to computer labs across campus and ask the first
400 students you meet.
- Ask every 100th person entering a football game about
their favorite sport.
41
Sampling studies
- Single-stage: the elements of the sampling frame are treated equally and there is no subdividing or partitioning of the frame
- Simple random sample, Systematic
- Multiple-stage: the elements are subdivided and the sample is chosen in more than one stage
- Stratified, Cluster
42
Simple Random Sample (SRS)
- We want a representative sample but will settle
for one that is not biased.
- Representative: sample resembles smaller version of
the population
- Unbiased: no group is under (or over) represented
- SRS for example – Each combination of 400 ISU
students has the same chance of being the sample selected.
43
SRS
- Sampling Frame: a list, or set, of the elements
belonging to the population from which the
sample will be drawn
- From example…Frame: A list of all students at ISU (the Registrar has such a list)
- Use random numbers to select 400 students at random from this list using unique ID numbers
44
45
Problem: From a population of 400 individuals, we wish to select 10 individuals for our sample. Assign students a number from 0 to 399.
Similar Random Number Table can be found in Appendix B, Table 1 p
So we would sample individuals 162, 091, 170, 196, 130, 216, 336, 235, 027, 011.
46
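A random number table can be replaced by a software random number generator. The following sketch uses Python's `random` module as a stand-in for the table; the seed value is arbitrary, and the labels it produces will differ from the ones read off the textbook table.

```python
import random

# Label the 400 individuals 0 to 399, as in the problem statement.
frame = list(range(400))

# Draw 10 distinct labels; every set of 10 is equally likely to be chosen.
rng = random.Random(2024)  # seed is arbitrary, used only for reproducibility
srs = rng.sample(frame, 10)

# Print as three-digit labels, matching the style of a random number table.
print([f"{label:03d}" for label in srs])
```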
Simple Random Sample
- If one were to do this more than once
- Different random numbers will give different samples of
400 students.
- We have introduced variability by sampling! (Remember
statistics is about variability!!)
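The variability introduced by sampling can be seen by repeating the draw. The sketch below assumes a hypothetical population of 25,000 students in which 30% would answer yes; both numbers are invented for illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population of 25,000 students; 30% would answer "yes" (1).
population = [1] * 7500 + [0] * 17500

# Take five separate simple random samples of 400 and record each p-hat.
p_hats = []
for _ in range(5):
    sample = rng.sample(population, 400)
    p_hats.append(sum(sample) / 400)

print(p_hats)
```

Each run of the loop uses the same population and the same sample size, so any differences among the printed proportions come from the random selection alone.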
47
48
We can obtain a random sample by sampling with or without replacement.
Sampling without replacement
Once an individual is selected to be in the sample, it cannot be selected again. For instance, if we are using a deck of cards as the population, if I draw a card and set it aside before selecting the next card, this is sampling without replacement.
Sampling with replacement
Once an individual is selected to be in the sample, the appropriate measurements are taken and then the individual is placed back into the population before selecting the next individual. Here it is possible for an individual to be selected more than one time. For example, if we are using a deck of cards as the population, if I draw a card and record its suit and then place it back in the deck before the next card is selected, this is sampling with replacement.
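The deck-of-cards example maps directly onto two functions in Python's `random` module: `sample` draws without replacement and `choices` draws with replacement. The deck construction below is an illustrative sketch.

```python
import random

rng = random.Random(52)

# A standard 52-card deck as the population.
suits = ["hearts", "diamonds", "clubs", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(rank, suit) for suit in suits for rank in ranks]

# Without replacement: each card can appear at most once.
without_replacement = rng.sample(deck, 5)

# With replacement: each draw is made from the full deck, so repeats can occur.
with_replacement = rng.choices(deck, k=5)

print(without_replacement)
print(with_replacement)
```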
Other Sampling Plans
- Systematic: Select in a systematic way from the sampling frame, i.e. choose every kth individual.
- From example: Select every 60th student on the list from the Registrar.
- Caution: the list should be in random order and the starting point should be selected at random.
- Single-stage sampling plan
49
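A systematic sample with a random starting point might be sketched as follows. The frame of 24,000 student IDs is hypothetical (the actual size of the Registrar's list is not given in the notes), and `systematic_sample` is a helper name invented for this illustration.

```python
import random

def systematic_sample(frame, k, rng):
    """Pick a random start among the first k positions, then take every kth element."""
    start = rng.randrange(k)
    return frame[start::k]

rng = random.Random(60)
# Hypothetical frame of 24,000 student IDs.
frame = list(range(24_000))
sample = systematic_sample(frame, 60, rng)
print(len(sample), sample[:3])
```

The sketch does not shuffle the frame; per the caution above, the list itself should already be in random order.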
Other Sampling Plans
- Stratified Sample
- Multi-stage sampling plan
- Separate into strata
- SRS of each stratum
- Can do comparisons across strata
- Reduce error by grouping into strata
- From example: Divide ISU students into
colleges and select a SRS from each college.
50
51
Stratified Sample
A stratified sample is obtained by separating the population into non-overlapping groups called strata and then obtaining a simple random sample from each stratum. The individuals within each stratum should be homogeneous in some way. (strata: Geographical Regions)
Stage 1: Divide into alike strata
Stage 2: SRS from each stratum
[Figure: population divided into six boxes, each labeled Stratum]
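The two stages of a stratified sample can be sketched in Python. The college names, stratum sizes and the helper `stratified_sample` are all hypothetical, and taking the same number from every stratum is only one possible allocation.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of the same size from every stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(3)
# Stage 1: the population is already divided into strata (made-up colleges).
strata = {
    "Engineering": [f"E{i}" for i in range(300)],
    "Business": [f"B{i}" for i in range(200)],
    "Design": [f"D{i}" for i in range(100)],
}
# Stage 2: an SRS of 20 students from each stratum.
sample = stratified_sample(strata, 20, rng)
print({name: len(chosen) for name, chosen in sample.items()})
```

Because every stratum contributes to the sample, comparisons across strata are possible.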
52
Cluster Sampling
A cluster sample is obtained by dividing the population into clusters, each cluster having a mix of items representing the population. A few clusters are then selected at random and a
census is taken of the chosen clusters.
Stage 1: SRS
Stage 2: Census
Stratified Vs. Cluster
- Stratified
- take an SRS from each stratum
- individuals within a stratum are alike (“like stratum”)
- Cluster
- randomly select a few clusters and
sample every item in each chosen cluster
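For contrast with stratified sampling, a cluster sample can be sketched as below. The clusters (50 residence-hall floors of 30 students each) and the helper `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Stage 1: SRS of cluster names. Stage 2: keep every item in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random(11)
# Hypothetical clusters: 50 residence-hall floors with 30 students each.
clusters = {f"floor {i}": [f"s{i}-{j}" for j in range(30)] for i in range(50)}
sample = cluster_sample(clusters, 4, rng)
print({name: len(members) for name, members in sample.items()})
```

Only 4 of the 50 clusters appear in the sample, but each chosen cluster is included in full. A stratified sample does the reverse: every stratum appears, but only part of each.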
53
54
Example:
Identify the type of sampling used.
- A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.
Convenience (a voluntary response sample)
- A farmer divides his orchard into 50 subsections, randomly selects 4 and samples all of the trees within the 4 subsections in order to approximate the yield of his orchard.
Cluster
- A school official divides the student population into five classes: freshman, sophomore, junior, senior, graduate student. The official takes a random sample from each class and asks the members’ opinion regarding student services.
Stratified
55
Nonsampling errors are errors that result from the survey process. They are due to the nonresponse of individuals selected to be in the survey, to inaccurate responses, to poorly worded questions, to bias in the selection of individuals to be given the survey and so on. There is a multitude of causes, and we want to eliminate them if possible.
Sampling error is the error that results from using sampling to estimate information regarding a population. This type of error occurs because a sample gives incomplete information about the population.
56
Example:
The following surveys are flawed. Determine whether the sampling method or the survey itself is flawed. For flawed surveys, identify the cause of the error and suggest a remedy.
- A magazine is conducting a study on the effects of infidelity in a marriage. The editors randomly select 400 women whose husbands were unfaithful and ask, “Do you believe a marriage can survive when the husband destroys the trust that must exist between husband and wife?”
Poorly worded. “Do you believe that a marriage can be maintained after an extramarital relation?” (More neutral question) Also potential bias based on asking women with unfaithful husbands.
- A college vice president wants to conduct a study regarding student achievement of undergraduate students. He selects the first 50 students who enter the building on a given day and administers his survey.
Flawed sampling method (convenience sample). Should use an SRS.
57
- A polling organization is going to conduct a study to estimate the percentage of households that speak a foreign language as the primary language. It mails a questionnaire to 1023 randomly selected households throughout the United States and asks the head of household if a foreign language is the primary language spoken in the home. Of the 1023 households selected, 12 responded.
Nonresponse and flawed survey. Hard to reply if primary language is not English. Do personal interviews or phone calls.
Observational Studies
- Observational studies are those in which the researcher is a passive observer
- Simply observing what happens
- A sample survey is an observational study.
- There are other observational studies that are not
surveys.
- Can’t make cause and effect inference based on
observations
58
Tanning and Skin Cancer
- 1,500 people.
- Some had skin cancer and some did not have skin
cancer.
- Asked all participants whether they used tanning
beds.
59
Diet and Blood Pressure
- Enroll 100 individuals in the study.
- Give each a diet diary. Everything eaten each day is recorded. From the diary entries the amount of sodium in the diet is calculated.
- Measure blood pressure.
60
Differences
- Retrospective – look at past records and historical data.
- Tanning and Skin Cancer
- Prospective – identify subjects and collect data as events unfold.
- Diet and Blood Pressure
61
Experiments
- Intentionally apply a treatment to individuals
(referred to as experimental units)
- Attempts to isolate the effects of the treatment on a
response variable
- Terminology…
- Explanatory variable – Factor.
- Response variable.
- Subjects – Participants – Experimental Units.
- Treatments.
62
- (Designed) Experiment: a controlled study in which
one or more treatments are applied to experimental units. The experimenter then observes the effect of varying these treatments on a response variable.
- Experimental unit: a person, object, or some other
well-defined item upon which a treatment is applied.
- Predictor (explanatory) variables are the factors that
affect the response variable. Also referred to as independent variables.
- Treatment: a condition applied to the experimental unit. (levels of the factors)
63
- Response variable is a quantitative or qualitative variable that represents the variable of interest. Also referred to as dependent variables.
- Extraneous variables are neither response nor predictor variables. These are variables that may affect the outcome of the experiment, but are not controlled by the experimenter.
64
Experiments
- The experimenter must actively and deliberately manipulate the factor(s) to establish the method of treatment.
- Interested in “What might happen if I change this factor?”
- Experimental units are assigned at random to the treatments.
65
Controlling Cholesterol
- Does a higher dose of a new drug lower cholesterol
more?
- 30 participants.
- Factor – drug dose.
- Treatments: 10 mg or 20 mg.
- 15 subjects randomly assigned to each treatment.
- Response – change in cholesterol.
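The random assignment step in this design can be sketched in Python. The participant labels 1 to 30 and the seed are illustrative; shuffling and splitting is one common way to assign units at random, not necessarily the method used in the course.

```python
import random

rng = random.Random(30)

# 30 participants, labeled 1 to 30 (labels are illustrative).
participants = list(range(1, 31))

# Shuffle, then split: the first 15 get 10 mg, the remaining 15 get 20 mg.
shuffled = participants[:]
rng.shuffle(shuffled)
groups = {"10 mg": sorted(shuffled[:15]), "20 mg": sorted(shuffled[15:])}

for treatment, members in groups.items():
    print(treatment, members)
```

Every participant lands in exactly one treatment group, and chance alone decides which one.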
66
3 Experimental Principles
- Control
- Outside variables
- Control group
- Random assignment
- Replication
- within an experiment.
- repeating an entire experiment.
67
1a. Control (outside variables)
- Control outside variables that may affect the
response.
- Have subjects of the same age, gender, general health,
ethnic group.
- By controlling outside variables you
- prevent those variables from causing variation in the response.
- limit the extent to which you can generalize results
68
1b. Control Group
- Have a group that receives 0 mg.
- The 0 mg pill is called a placebo (no
active ingredient).
- The control group allows the experimenter to establish whether the drug is effective at all in reducing cholesterol by providing a means to see what the natural change in average cholesterol is during the experiment.
69
2. Random Assignment
- Tends to spread the effects of uncontrolled
outside variables evenly across the treatment
groups.
- Reduces the chance that an uncontrolled
outside variable will bias the results.
- Does NOT prevent uncontrolled variables from
changing, but does lessen the impact of these
changes due to the even spread
70
3. Replication
- Within an experiment.
- Have several experimental units in each treatment group.
- Able to assess the natural variation in the
response for units treated the same way.
71
“Replication”
- Repeating the entire experiment.
- This is especially important if the subjects in an
experiment are not randomly selected from a population.
- Are the results of the entire experiment repeatable?
72
Diagram of an experiment
[Figure: Subjects -> random assignment -> Group 1 (several subjects) -> Treatment 1, and Group 2 (several subjects) -> Treatment 2; the two results are then compared]
Notice that we have both random assignment AND replication within the experiment (several subjects within each group)
73
Example: A school psychologist wants to test the effectiveness of a new method for teaching reading. She selects five hundred first grade students in District 203 and randomly divides them into two groups. Group 1 is taught by means of the new method, while Group 2 is taught via traditional methods. The same teacher is assigned to teach both groups. At the end of the year, an achievement test is administered and the results of the two groups compared.
- What is the response variable in this experiment? Achievement test score
- What is the treatment? How many levels does the treatment have? Method of teaching, 2 levels
- Are any of the predictor variables controlled? grade, teacher, school district
- How does the researcher’s design handle students from different socioeconomic levels? hold constant
- Identify the experimental units. 500 students
74
Example:
An experiment is run to study the effect of cooking time on the number of
unpopped kernels in bags of microwave popcorn. A package containing
eight bags of the same brand of microwave popcorn is used. The first bag,
selected at random, is popped for 2 minutes on high. The bag is opened
and the number of unpopped kernels counted. The next bag, again
selected at random, is popped for 2 minutes and 15 seconds, the bag is
opened and the number of unpopped kernels counted. Subsequent bags
are selected at random and popped for 15 seconds longer than the
previous bag and the number of unpopped kernels is counted. The same
microwave, set on high, is used for all bags.
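The run schedule described in this example can be written out explicitly. The sketch below lists the eight cooking times and pairs them with bags chosen in random order; the bag labels 1 to 8 and the seed are illustrative.

```python
import random

rng = random.Random(8)

# Cooking times in seconds: 2:00 for the first bag, then 15 seconds more each time.
times = [120 + 15 * i for i in range(8)]

# Bags are picked at random, so shuffle the bag labels before pairing with times.
bags = list(range(1, 9))
rng.shuffle(bags)
schedule = list(zip(bags, times))

for bag, seconds in schedule:
    print(f"bag {bag}: {seconds // 60}:{seconds % 60:02d} on high")
```

Here cooking time is the explanatory variable and the number of unpopped kernels is the response. Note that the cooking time increases with the run order, so only the choice of bag is randomized.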
75