Statistics: Sampling, Types of Experiments, and Summary Statistics, Assignments of Statistics

An introduction to statistics, focusing on sampling methods, types of experiments, and numerical and categorical summary statistics. It covers concepts such as simple random sampling, one-sample and multi-sample experiments, quantitative and qualitative data, sample mean, median, mode, range, and sample proportion.

Typology: Assignments

Pre 2010

Uploaded on 09/17/2009

koofers-user-vx5

Chapter 1
Sampling and Descriptive Statistics
1. Introduction:
Definition: Statistics is the science of (or a collection of techniques for)
Collecting (sampling, census)
Classifying (descriptive statistics)
Analyzing (e.g., regression analysis)
Generalizing (statistical inference)
A set of data for a special purpose
Each of these activities is based on probability.
We will emphasize making inferences about population parameters based on data from random
samples.
Some new terms need to be defined:
Population: a set of well-defined units (objects or outcomes) about which information is sought.
Sample: A subset of the population, containing objects or outcomes that are actually observed.
Random sample: A sample selected according to some rules of probability
Simple Random Sample (SRS): A sample of size n, selected in such a way that every sample of
size n (from the population of size N) has an equal chance of being the selected sample. As a
result of this property every element in the population has an equal chance (n/N) of being in
the random sample.
SRS Selected with replacement: Some population units may appear more than once. This
is used in theoretical studies.
SRS selected without replacement: Any population unit may appear in the sample at most
once. This method is used in real life problems.
Although the two selection methods are different, the difference becomes negligible when
the population size (N) is extremely large, relative to the sample size (n).
In this course whenever we talk about a sample we mean a SRS selected with
replacement.
A SRS selected with replacement gives independent observations, i.e., knowing the
value of any one element in the sample does not help in predicting the values of the other
elements.
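The two selection methods can be contrasted with a minimal Python sketch. The population of ten labelled units, the sample size, and the seed are hypothetical choices made only for illustration; they do not come from the notes.

```python
import random

# Hypothetical population of N = 10 labelled units (an assumption, not course data).
population = list(range(1, 11))
n = 4  # sample size

rng = random.Random(2009)  # fixed seed so the sketch is repeatable

# SRS with replacement: every draw is made from the full population,
# so a unit may appear more than once and the draws are independent.
with_replacement = rng.choices(population, k=n)

# SRS without replacement: a unit can appear in the sample at most once.
without_replacement = rng.sample(population, k=n)

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

When N is very large relative to n, repeated units become rare under sampling with replacement, which is why the two methods then behave almost identically.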
Other sampling methods (such as stratified random sampling, cluster sampling, multi-stage
sampling, etc.) are used in real-life problems because they are usually more efficient. In such a
case the formulas given in this course need to be modified. Such sampling methods will not be
covered in this course. [You may take STA4222 Sampling and Survey Design if you are
interested.]
Types of Experiments:
One-sample experiment: When there is one population of interest, we select one sample (of n
elements) and make inferences about the population parameter(s).
Multi-sample Experiment: When we are interested in comparing two or more populations, we
select a random sample from each population and make inferences about the parameters of
these populations.
STA3032 Chapter 1, Page 1 of 12

Types of Data

Quantitative (Numerical) Data obtained as a result of some measurement or counting process

of population (or sample) elements. The results can be used in arithmetic operations.

Qualitative (Categorical) Data obtained as a result of observations on some characteristic of

the population (or sample) element. An element either belongs to a category or does not belong

to that category. Such data cannot be used in arithmetic operations.

1.2 Numerical Summaries of quantitative sample data:

Sample data will be denoted by $X_1, X_2, \ldots, X_n$.

Sample Mean: $\bar{X} = \dfrac{X_1 + X_2 + \cdots + X_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$. It is used as a measure of location or the “center” of the data.

Sample Variance: $S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \dfrac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$

Sample Standard Deviation: $S = \sqrt{S^2} = \sqrt{\dfrac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}}$. It is used as a measure of dispersion (scatter) of the data.

Sample Median: The middle value of the ordered sample data. Sample median is preferred

to the sample mean when the distribution of the population data is not symmetric.

Sample mode: The most frequently observed value in the sample. This gives a quick and easy

way of locating the center but is not much used.

Sample range: Difference between the largest and the smallest observed values. It gives a

quick way of getting some information about the dispersion (scatter) of the data.
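As a hedged illustration of the summaries defined above, the sketch below computes each one for a small made-up sample. The data values are hypothetical (they are not the tennis-ball measurements), and the variable names are chosen only for this example.

```python
import statistics

# Hypothetical sample of n = 7 measurements (an assumption for illustration).
x = [2, 4, 4, 5, 7, 9, 11]
n = len(x)

# Sample mean: sum of the observations divided by n.
mean = sum(x) / n

# Sample variance, definitional form: squared deviations divided by n - 1.
var_def = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Sample variance, computational form: (sum of squares - n * mean^2) / (n - 1).
var_short = (sum(xi ** 2 for xi in x) - n * mean ** 2) / (n - 1)

sd = var_def ** 0.5               # sample standard deviation
median = statistics.median(x)     # middle value of the ordered data
mode = statistics.mode(x)         # most frequently observed value
sample_range = max(x) - min(x)    # largest minus smallest observation

print(mean, var_def, var_short, sd, median, mode, sample_range)
```

Both variance forms give the same number here, which is a quick check that the computational formula agrees with the definition.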

Two important results:

1. If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and for some constants a and b, $Y_i = a + bX_i$, then $\bar{Y} = a + b\bar{X}$.

2. If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and for some constants a and b, $Y_i = a + bX_i$, then $S_Y^2 = b^2 S_X^2$ and $S_Y = |b|\, S_X$.
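The two results can be checked numerically. The sketch below is only a spot check on hypothetical data, with hypothetical constants a and b (b is taken negative to show why the absolute value is needed); it is not a proof.

```python
import math
import statistics

# Hypothetical sample and constants (assumptions for illustration only).
x = [2, 4, 4, 5, 7, 9, 11]
a, b = 32.0, -1.8

# Linear transformation Y_i = a + b * X_i.
y = [a + b * xi for xi in x]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
var_x, var_y = statistics.variance(x), statistics.variance(y)  # divisor n - 1
sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)

# Result 1: the mean transforms the same way as the data.
print(math.isclose(mean_y, a + b * mean_x))

# Result 2: the variance is scaled by b^2 and the standard deviation by |b|;
# the constant a shifts the data but does not change its spread.
print(math.isclose(var_y, b ** 2 * var_x))
print(math.isclose(sd_y, abs(b) * sd_x))
```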

Sample statistics and Population parameters:

A sample statistic is a function of sample data. Some examples are:

Sample mean = $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$

Sample standard deviation = $S = \sqrt{S^2} = \sqrt{\dfrac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}}$

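Sample proportion = $\hat{p} = \dfrac{Y}{n} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$, where $Y = \sum_{i=1}^{n} X_i$ and each $X_i$ is 1 if the i-th element belongs to the category of interest and 0 otherwise.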

Note that a sample statistic is a random variable and its values change from sample to sample.

A population parameter is a function of population data. Some examples are:

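Population mean = $\mu_X = \dfrac{1}{N}\sum_{i=1}^{N} X_i$

and

Population standard deviation = $\sigma_X = \sqrt{\sigma_X^2} = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu_X)^2}{N}} = \sqrt{\dfrac{\sum_{i=1}^{N} X_i^2 - N\mu_X^2}{N}}$

Population variance = $\sigma_X^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu_X)^2}{N}$

Population proportion = $p = \dfrac{1}{N}\sum_{i=1}^{N} X_i$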

Note that a population parameter is a fixed number. Its value does not change from sample to

sample.

We will make statistical inferences about unknown population parameters based on one or more

sample statistics. [An exception will be made in Chapter 4, Probability, where we will assume that we

know the parameter values and study the behavior of some sample statistics.]

Example [Part of problem 12 of Section 1.2, modified]

A random sample of 16 students “measured” the circumference of a tennis ball “by eye” giving the

following results:

Use your calculator to find the mean, median and the standard deviation without using any formula.

Interpret what we have found.

The following is an output from Minitab:

Variable   N    Mean  StDev  Minimum      Q1  Median      Q3  Maximum

C1        16  22.744  2.872   18.000  20.500  23.500  25.000  26.

Here are a few questions you should ask yourselves before starting the interpretation:

  1. Are the mean and the standard deviation (StDev) in the output population parameters or sample statistics? That is, do we have $\mu = 22.744$ or $\bar{X} = 22.744$? Similarly, is 2.872 $\sigma$ or $S$?
  2. What do the numbers 22.744 and 2.872 tell us?
  3. What is the proportion of observations below 23.5 in the sample?
  4. So, what is 23.5 called?
  5. What do $Q_1 = 20.5$ and $Q_3 = 25.0$ tell us?
  6. What is the sample range?
  7. Can we say anything about the shape of the population distribution?

Note: To answer this question we need to use what is called the empirical rule of probability

[to be used more, later] stated below:

Empirical rule: If the distribution of a population is mound-shaped, then

a) Approximately 68% of all population measurements are within one standard deviation of

the population mean.

b) Approximately 95% of all population measurements are within two standard deviations of

the population mean.

c) Almost all (99.7%) of all population measurements are within three standard deviations of

the population mean.
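As a rough sketch of how the empirical rule is applied, the code below builds the one-, two- and three-standard-deviation intervals from the Minitab output above. Using the sample values 22.744 and 2.872 in place of the unknown population mean and standard deviation is an assumption made for illustration, and the rule applies only if the population is mound-shaped.

```python
# Sample mean and standard deviation taken from the Minitab output.
# Treating them as stand-ins for the population values is an assumption.
mean, sd = 22.744, 2.872

# Approximate coverage stated by the empirical rule for k = 1, 2, 3.
coverage = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}

# Interval for each k: mean - k*sd to mean + k*sd.
intervals = {k: (mean - k * sd, mean + k * sd) for k in coverage}

for k, (low, high) in intervals.items():
    print(f"within {k} SD: {low:.3f} to {high:.3f} ({coverage[k]})")
```

Comparing the share of sample observations that actually fall in each interval with these percentages is one informal way to judge whether a mound-shaped population is plausible.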

Homework: Solve problems 1, 5, 7 to 14. Give reasons in each case.

When the number of observations is large we first need to tabulate the data, using equal intervals

whenever possible.

Table – 1. Distribution of STA6125 Students by Age

Ages Number of Students

17.5 up to 22.5 3

22.5 up to 27.5 32

27.5 up to 32.5 14

32.5 up to 37.5 4

37.5 up to 42.5 3

42.5 up to 47.5 1

47.5 up to 52.5 2

52.5 up to 57.5 0

57.5 up to 62.5 0

62.5 up to 67.5 0

67.5 up to 72.5 1

Total 60

Now we can draw a histogram in a similar way, with boxes centered at the center of each interval (20,

25, 30, 35, 40, 45, 50, 55, 60, 65 and 70) and the heights of the boxes are proportional to the

frequencies (number of observations in each interval).

[Figure: histogram with horizontal axis "Ages of Students" (ticks at 20, 30, 40, 50, 60, 70) and vertical axis "Number of Students" (0 to 35).]

Figure - 3. Distribution of STA6125 Students by Age

Observe the following when drawing a histogram:

  1. Try to get intervals of equal length when tabulating the data. This gives a better picture and makes your life easier.

  2. The length of the intervals is a subjective choice. Choose a nice round number such as 1, 2, 5, or 10 so that there won’t be too many or too few intervals. The usual number of intervals is 5 to 15.

  3. When intervals are of equal length, the heights of the boxes are equal to the number of observations (frequency) in the interval. Sometimes relative frequencies (= number of observations in each interval divided by the total number of observations) or percentages (= relative frequency times 100) are used for the heights of the boxes.

  4. When intervals are not of equal length, we need to find the density (= relative frequency divided by the length of the interval), and that will be the height of the rectangle (thus the area of each rectangle is proportional to the frequency).

  5. Note that adjacent boxes must touch each other (except when the frequency is zero).
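The three possible box heights (frequency, relative frequency, density) can be computed directly from Table 1. The sketch below copies the intervals and counts from that table; the variable names are chosen only for this illustration.

```python
# Interval endpoints and frequencies copied from Table 1.
edges = [17.5 + 5 * i for i in range(12)]      # 17.5, 22.5, ..., 72.5
counts = [3, 32, 14, 4, 3, 1, 2, 0, 0, 0, 1]   # number of students per interval
total = sum(counts)

rows = []
for low, high, count in zip(edges, edges[1:], counts):
    width = high - low            # length of the interval
    rel_freq = count / total      # relative frequency
    density = rel_freq / width    # box height when intervals have unequal length
    rows.append((low, high, count, rel_freq, density))

for low, high, count, rel_freq, density in rows:
    print(f"{low:4.1f} up to {high:4.1f}: "
          f"freq={count:2d}  rel={rel_freq:.3f}  density={density:.4f}")
```

With density as the height, the areas of the boxes add up to 1, so the picture stays comparable even if some intervals are later merged into wider ones.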

Note that in a histogram we lose some information. For example, looking at Figure – 3, we cannot tell

how many students there are at each age. This is the price we have to pay for summarizing the data.

d) Box-and-Whisker Plot (or Boxplot):

This is a graph that has a box with whiskers. The box starts at the first quartile (see footnote 1) and extends to the third quartile (see footnote 2), with a horizontal line drawn at the median. One whisker goes from the top of the box to the largest observation that is not an outlier (see footnote 3), and the other goes from the smallest observation that is not an outlier to the bottom of the box. That’s all.

[Figure: Box plot of Age Data, vertical axis "Ages" (20 to 70).]

1 The first quartile (also called the 25th percentile) of a set of ordered data is a number below which at most 25% of the observations lie. It is denoted by $Q_1$.

2 The third quartile (also called the 75th percentile) of a set of ordered data is a number below which at most 75% of the observations lie. It is denoted by $Q_3$. The difference between these two quartiles is called the interquartile range, IQR = $Q_3 - Q_1$. Can you guess what $Q_2$ represents?

3 An observation is called an outlier if it is more than 1.5×IQR below $Q_1$ or more than 1.5×IQR above $Q_3$. An extreme outlier is more than 3×IQR away from the top or the bottom of the box.
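The outlier rule can be sketched in a few lines. The quartiles below are taken from the Minitab output for the tennis-ball example (not from the age data, whose quartiles are not listed in these notes); the `classify` helper and the test values 33.0 and 40.0 are hypothetical, introduced only for illustration.

```python
# Quartiles from the Minitab output of the tennis-ball example.
q1, q3 = 20.5, 25.0
iqr = q3 - q1  # interquartile range

# Fences for outliers (1.5 x IQR) and extreme outliers (3 x IQR).
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_lower, extreme_upper = q1 - 3 * iqr, q3 + 3 * iqr


def classify(value):
    """Label a single observation using the fences above (hypothetical helper)."""
    if value < extreme_lower or value > extreme_upper:
        return "extreme outlier"
    if value < lower_fence or value > upper_fence:
        return "outlier"
    return "not an outlier"


print(iqr, lower_fence, upper_fence)
print(classify(18.0))   # the sample minimum reported by Minitab
print(classify(33.0))   # hypothetical value beyond the upper fence
print(classify(40.0))   # hypothetical value beyond the extreme fence
```

The whiskers of the boxplot stop at the most extreme observations that are still inside the 1.5×IQR fences; anything beyond them is plotted individually.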

[Figure: Boxplot of Age Data, vertical axis "Ages" (20 to 36).]

Ages > 36 are deleted.