






An introduction to statistics, focusing on sampling methods, types of experiments, and numerical and categorical summary statistics. It covers concepts such as simple random sampling, one-sample and multi-sample experiments, quantitative and qualitative data, sample mean, median, mode, range, and sample proportion.







1. Introduction:
Definition: Statistics is the science of (or a collection of techniques for) doing the following with a set of data for a special purpose:
Collecting (sampling, census)
Classifying (descriptive statistics)
Analyzing (e.g., regression analysis)
Generalizing (statistical inference)
Each of these activities is based on probability.
We will emphasize making inferences about population parameters based on data from random
samples.
Some new terms need to be defined:
Population: a set of well-defined units (objects or outcomes) about which information is sought.
Sample: A subset of the population, containing objects or outcomes that are actually observed.
Random sample: A sample selected according to some rules of probability
Simple Random Sample (SRS): A sample of size n, selected in such a way that every sample of
size n (from the population of size N) has an equal chance of being the selected sample. As a
result of this property every element in the population has an equal chance (n/N) of being in
the random sample.
SRS Selected with replacement: Some population units may appear more than once. This
is used in theoretical studies.
SRS selected without replacement: Any population unit may appear in the sample at most
once. This method is used in real life problems.
Although the two selection methods are different, the difference becomes negligible when
the population size (N) is extremely large, relative to the sample size (n).
In this course, whenever we talk about a sample we mean an SRS selected with
replacement.
An SRS selected with replacement gives independent observations, i.e., knowing the
value of any one element in the sample does not help in predicting the values of the other
elements.
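The two selection methods can be sketched in a few lines of Python. This is only an illustration: the population of 10 labelled units, the sample size, and the seed are assumptions made for the example and do not come from the course material.

```python
import random

# Hypothetical population of N = 10 labelled units (not from the notes).
population = list(range(1, 11))
n = 4

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# SRS without replacement: each unit can appear at most once.
srs_without = rng.sample(population, n)

# SRS with replacement: units are drawn independently, so repeats are possible.
srs_with = rng.choices(population, k=n)

print("without replacement:", srs_without)
print("with replacement:   ", srs_with)
```

With N much larger than n, repeats in the second sample become rare, which is why the difference between the two methods becomes negligible.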
Other sampling methods (such as stratified random sampling, cluster sampling, multi-stage
sampling, etc.) are used in real-life problems because they are usually more efficient. In such
cases the formulas given in this course need to be modified. Such sampling methods will not be
covered in this course. [You may take STA4222 Sampling and Survey Design if you are
interested.]
Types of Experiments:
One-sample experiment: When there is one population of interest, we select one sample (of n
elements) and make inferences about the population parameter(s).
Multi-sample Experiment: When we are interested in comparing two or more populations, we
select a random sample from each population and make inferences about the parameters of
these populations.
Types of Data
Quantitative (Numerical) Data obtained as a result of some measurement or counting process
of population (or sample) elements. The results can be used in arithmetic operations.
Qualitative (Categorical) Data obtained as a result of observations on some characteristic of
the population (or sample) element. An element either belongs to a category or does not belong
to that category. Such data cannot be used in arithmetic operations.
1.2 Numerical Summaries of quantitative sample data:
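Sample Data will be denoted by $X_1, X_2, \ldots, X_n$.
Sample Mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}$. It is used as a measure of location or the
“center” of the data.
Sample Variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$
Sample Standard Deviation: $S = \sqrt{S^2} = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)}$. It is used as a measure of
dispersion (scatter) of the data.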
Sample Median: The middle value of the ordered sample data. Sample median is preferred
to the sample mean when the distribution of the population data is not symmetric.
Sample mode: The most frequently observed value in the sample. This gives a quick and easy
way of locating the center but is not much used.
Sample range: Difference between the largest and the smallest observed values. It gives a
quick way of getting some information about the dispersion (scatter) of the data.
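As a rough illustration of these summaries, the sketch below computes each of them with Python's standard `statistics` module. The six data values are hypothetical, chosen only so the arithmetic is easy to follow. The sketch assumes the usual divisor n - 1 for the sample variance, and it also checks the shortcut form (sum of squares minus n times the squared mean) against the definition.

```python
import math
import statistics

# Hypothetical sample (illustrative values, not data from the notes).
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(x)

mean = statistics.mean(x)        # sample mean
var = statistics.variance(x)     # sample variance, divisor n - 1
sd = statistics.stdev(x)         # sample standard deviation
median = statistics.median(x)    # middle value of the ordered data
mode = statistics.mode(x)        # most frequently observed value
data_range = max(x) - min(x)     # largest minus smallest value

# Shortcut form of the variance: (sum of squares - n * mean^2) / (n - 1)
var_shortcut = (sum(v * v for v in x) - n * mean ** 2) / (n - 1)

print(mean, median, mode, data_range)
print(var, var_shortcut, sd)
print(math.isclose(var, var_shortcut))
```

For this sample the mean is 5, the median is 4.5, the mode is 4, the range is 6, and both forms of the variance give 4.8.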
Two important results:
1. If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and for some
constants a and b, $Y_i = a + bX_i$, then $\bar{Y} = a + b\bar{X}$.
2. If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and for some
constants a and b, $Y_i = a + bX_i$, then
$S_Y^2 = b^2 S_X^2$ and $S_Y = |b|\, S_X$.
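Both results can be checked numerically. The sketch below uses hypothetical data and hypothetical constants a and b (with b negative, so the absolute value in the second result matters). It is a numerical check under those assumptions, not a proof.

```python
import math
import statistics

# Hypothetical sample and constants (not from the notes).
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
a, b = 32.0, -1.8

# Linear transformation Y_i = a + b * X_i
y = [a + b * xi for xi in x]

mean_ok = math.isclose(statistics.mean(y), a + b * statistics.mean(x))
var_ok = math.isclose(statistics.variance(y), b ** 2 * statistics.variance(x))
sd_ok = math.isclose(statistics.stdev(y), abs(b) * statistics.stdev(x))

print(mean_ok, var_ok, sd_ok)
```

Note that the constant a shifts the mean but has no effect on the variance or the standard deviation.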
Sample statistics and Population parameters:
A sample statistic is a function of sample data. Some examples are:
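Sample mean = $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Sample standard deviation = $S = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)}$
Sample proportion = $\hat{p} = Y/n = \frac{1}{n}\sum_{i=1}^{n} X_i$, where $Y = \sum_{i=1}^{n} X_i$ and each $X_i$ is 1 if the i-th sample element belongs to the category of interest and 0 otherwise.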
Note that a sample statistic is a random variable and its values change from sample to sample.
A population parameter is a function of population data. Some examples are:
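Population mean = $\mu_X = \frac{1}{N}\sum_{i=1}^{N} X_i$
and
Population standard deviation = $\sigma_X = \sqrt{\sigma_X^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (X_i - \mu_X)^2}$
Population variance = $\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu_X)^2$
Population proportion = $p = \frac{1}{N}\sum_{i=1}^{N} X_i$, where each $X_i$ is 1 if the i-th population element belongs to the category of interest and 0 otherwise.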
Note that a population parameter is a fixed number. Its value does not change from sample to
sample.
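The contrast between a fixed parameter and a varying statistic can be seen in a small simulation. The population of ten values below is hypothetical, as are the sample size and the seed. The sketch draws five samples with replacement and prints their means next to the single population mean.

```python
import random
import statistics

# Hypothetical population of N = 10 values (not from the notes).
population = [18, 21, 22, 22, 25, 27, 30, 31, 35, 40]

# Population parameters: fixed numbers computed from the whole population.
mu = statistics.mean(population)       # population mean
sigma = statistics.pstdev(population)  # population SD, divisor N

# Sample statistics: recomputed for every new sample, so they vary.
rng = random.Random(1)
sample_means = []
for _ in range(5):
    sample = rng.choices(population, k=4)  # SRS with replacement, n = 4
    sample_means.append(statistics.mean(sample))

print("population mean:", mu)
print("sample means:   ", sample_means)
```

The population mean stays at 27.1 however many samples are drawn, while the sample means scatter around it.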
We will make statistical inferences about unknown population parameters based on one or more
sample statistics. [An exception will be made in Chapter 4, Probability, where we will assume that we
know the parameter values and study the behavior of some sample statistics.]
Example [Part of problem 12 of Section 1.2, modified]
A random sample of 16 students “measured” the circumference of a tennis ball “by eye” giving the
following results:
Use your calculator to find the mean, median and the standard deviation without using any formula.
Interpret what we have found.
The following is an output from Minitab:
Variable N Mean StDev Minimum Q1 Median Q3 Maximum
C1 16 22.744 2.872 18.000 20.500 23.500 25.000 26.
Here are a few questions you should ask yourselves before starting the interpretation:
statistics?
Note: To answer this question we need to use what is called the empirical rule of probability
[to be used more, later] stated below:
Empirical rule: If the distribution of a population is mound-shaped, then
a) Approximately 68% of all population measurements are within one standard deviation of
the population mean.
b) Approximately 95% of all population measurements are within two standard deviations of
the population mean.
c) Almost all (about 99.7%) of the population measurements are within three standard deviations of
the population mean.
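One way to get a feel for the empirical rule is to simulate a mound-shaped population and count. The sketch below assumes a normal population with mean 50 and standard deviation 10. Both values, the population size, and the seed are arbitrary choices for illustration.

```python
import random
import statistics

# Simulated mound-shaped (normal) population; all settings are assumptions.
rng = random.Random(42)
data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

m = statistics.mean(data)
s = statistics.pstdev(data)  # treat the simulated values as the population


def share_within(k):
    """Fraction of measurements within k standard deviations of the mean."""
    return sum(1 for v in data if abs(v - m) <= k * s) / len(data)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.3f}")
```

The printed shares should come out close to 0.68, 0.95, and 0.997. For a population that is far from mound-shaped, the rule need not hold.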
Homework: Solve problems 1, 5, 7 to 14. Give reasons in each case.
When the number of observations is large we first need to tabulate the data, using equal intervals
whenever possible.
Table – 1. Distribution of STA6125 Students by Age
Ages               Number of Students
17.5 up to 22.5     3
22.5 up to 27.5    32
27.5 up to 32.5    14
32.5 up to 37.5     4
37.5 up to 42.5     3
42.5 up to 47.5     1
47.5 up to 52.5     2
52.5 up to 57.5     0
57.5 up to 62.5     0
62.5 up to 67.5     0
67.5 up to 72.5     1
Total              60
Now we can draw a histogram in a similar way, with boxes centered at the center of each interval (20,
25, 30, 35, 40, 45, 50, 55, 60, 65 and 70) and the heights of the boxes are proportional to the
frequencies (number of observations in each interval).
[Histogram. Horizontal axis: Ages of Students (20 to 70). Vertical axis: Number of Students (0 to 35).]
Figure - 3. Distribution of STA6125 Students by Age
Observe the following when drawing a histogram:
your life easier.
5 or 10, etc. so that there won’t be too many or too few intervals. The usual number of intervals
is 5 to 15.
observations (frequency) in the interval. Sometimes relative frequencies (= number of
observations in each interval divided by total number of observations) or percentage (= relative
frequency times 100) are used for the height of the intervals.
divided by the length of the interval) and that will be the height of the rectangles (thus the area
of each interval is proportional to the frequency).
Note that in a histogram we lose some information. For example, looking at Figure – 3, we cannot tell
how many students there are at each age. This is the price we have to pay for summarizing the data.
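The different choices for the height of the boxes can be computed directly from the frequencies in Table – 1. The sketch below takes the interval bounds and counts from that table. The variable names and the text-based drawing are illustrative choices only.

```python
# Interval lower bounds and frequencies taken from Table 1.
width = 5.0
lower_bounds = [17.5 + width * i for i in range(11)]
counts = [3, 32, 14, 4, 3, 1, 2, 0, 0, 0, 1]

total = sum(counts)                              # total number of students
midpoints = [lo + width / 2 for lo in lower_bounds]
relative = [c / total for c in counts]           # relative frequencies
percent = [100 * r for r in relative]            # percentages
per_unit = [c / width for c in counts]           # frequency / interval length

# Crude text histogram: one '#' per student, one row per interval.
for mid, c, pct in zip(midpoints, counts, percent):
    print(f"{mid:5.1f} | {'#' * c} ({c}, {pct:.1f}%)")
```

Because all intervals here have the same length, every choice of height gives a histogram of the same shape. Dividing by the interval length only changes the picture when the intervals are unequal.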
d) Box-and-Whisker Plot (or Boxplot):
This is a graph that has a box with whiskers. The box starts at the first quartile (note 1) and extends to the
third quartile (note 2), with a horizontal line drawn at the median. One of the whiskers goes from the top of
the box to the largest observation that is not an outlier (note 3), and the other one goes from the smallest
observation that is not an outlier to the bottom of the box. That’s all.
[Figure: Box plot of Age Data. Vertical axis: Ages (20 to 70).]
1 The first quartile (also called the 25th percentile) of a set of ordered data is a number below which at most 25% of the
observations lie. It is denoted by $Q_1$.
2 The third quartile (also called the 75th percentile) of a set of ordered data is a number below which at most 75% of the
observations lie. It is denoted by $Q_3$. The difference between these two quartiles is called the interquartile range, IQR =
$Q_3 - Q_1$. Can you guess what $Q_2$ represents?
3 An observation is called an outlier if it lies more than 1.5×IQR below $Q_1$ or more than 1.5×IQR above $Q_3$. An extreme outlier lies more than 3×IQR away
from the top or the bottom of the box.
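The quantities behind a boxplot can be sketched as follows. The eleven ages are hypothetical, not the class data. Quartile conventions also differ between software packages, so the `inclusive` method of Python's `statistics.quantiles` used here is an assumption and may give slightly different quartiles than Minitab.

```python
import statistics

# Hypothetical ages (not the class data from Table 1).
ages = [19, 21, 22, 23, 24, 25, 26, 28, 30, 45, 70]

# Quartiles; other conventions can give slightly different values.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Fences for outliers (1.5 x IQR) and extreme outliers (3 x IQR).
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
low_extreme, high_extreme = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [a for a in ages if a < low_fence or a > high_fence]
extreme = [a for a in ages if a < low_extreme or a > high_extreme]

# Whiskers end at the most extreme observations that are not outliers.
inside = [a for a in ages if low_fence <= a <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q2, q3, iqr)
print(outliers, extreme, whisker_low, whisker_high)
```

For these values the box runs from 22.5 to 29 with the median at 25. The ages 45 and 70 are outliers, and 70 is also an extreme outlier, so the upper whisker stops at 30.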
[Figure: Boxplot of Age Data. Vertical axis: Ages (20 to 36). Ages > 36 are deleted.]