





An in-depth exploration of sampling design, focusing on methods, strata, clustering information, and inclusion probabilities. It covers various sampling methods, including random sampling with and without replacement, and unequal probabilities. The document also discusses the estimation of population totals, ratios, and means, as well as the calculation of standard errors and design effects.






This document describes the algorithms used in the complex sampling estimation procedure
CSDESCRIPTIVES. The data do not have to be sorted.
Complex sample data must contain both the values of the variables to be analyzed and the
information on the current sampling design. Sampling design includes the sampling method,
strata and clustering information, and inclusion probabilities for all units at every sampling
stage. The overall sampling weight must be specified for each observation.
The sampling design specification for CSDESCRIPTIVES may include up to three stages
of sampling. Any of the following general sampling methods may be assumed in the first
stage: random sampling with replacement, random sampling without replacement with equal
probabilities, and random sampling without replacement with unequal probabilities. The first
two sampling methods can also be specified for the second and third sampling stages.
The following notation is used throughout this chapter unless otherwise stated:
$H$: Number of strata.
$n_h$: Sampled number of primary sampling units (PSU) per stratum.
$f_h$: Sampling rate per stratum.
$m_{hi}$: Number of elements in the $i$th sampled unit in stratum $h$, $i = 1, \ldots, n_h$.
$y_{hij}$: Value of variable $y$ for the $j$th element in the $i$th sampled unit in stratum $h$.
$w_{hij}$: Overall sampling weight for the $j$th element in the $i$th sampled unit in stratum $h$.
$n$: Total number of elements in the sample.
$N$: Total number of elements in the population.
$Y$: Population total sum for variable $y$.
Weights
Overall weights specified for each ultimate element are processed as given. They can be
obtained as a product of weights for corresponding units computed in each sampling stage.
When sampling without replacement in a given stage, substituting $w_{hi} = 1/\pi_{hi}$ for unit $i$ in
stratum $h$ results in the application of the estimator for the population totals due to Horvitz
and Thompson (1952). The corresponding variance estimator (2) or (3) will also be unbiased.
Here $\pi_{hi}$ is the probability that unit $i$ from stratum $h$ is selected in the given stage.
If sampling with replacement in a given stage, substituting $w_{hi} = 1/(n_h p_{hi})$ yields the
estimator for the population totals due to Hansen and Hurwitz (1943). Repeatedly selected
units should be replicated in the data. The corresponding variance estimator (1) will be
unbiased. Here $p_{hi}$ is the probability of selecting unit $i$ in a single draw from stratum $h$ in the
given stage.
Weights obtained in each sampling stage need to be multiplied when processing multi-stage
samples. The resulting overall weights for the elements in the final stage are used in all
expressions and formulas below.
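As a minimal sketch of how the stage weights combine, the following Python fragment forms a Horvitz-Thompson weight (1/π) for a stage sampled without replacement, a Hansen-Hurwitz weight (1/(n p)) for a stage sampled with replacement, and multiplies them into an overall weight. The function names and the two-stage numbers are hypothetical and are not part of CSDESCRIPTIVES.

```python
from math import prod

def horvitz_thompson_weight(inclusion_prob):
    """Stage weight for sampling without replacement: w = 1 / pi."""
    return 1.0 / inclusion_prob

def hansen_hurwitz_weight(n_draws, draw_prob):
    """Stage weight for sampling with replacement: w = 1 / (n * p)."""
    return 1.0 / (n_draws * draw_prob)

def overall_weight(stage_weights):
    """Overall weight of a final-stage element: product of its stage weights."""
    return prod(stage_weights)

# Hypothetical two-stage design: the PSU has inclusion probability 0.1;
# within it, 5 draws are made with single-draw probability 0.02.
w = overall_weight([
    horvitz_thompson_weight(0.1),
    hansen_hurwitz_weight(5, 0.02),
])
print(w)
```

In practice the overall weight is supplied with the data; the sketch only illustrates where such a weight can come from.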
Z expressions
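$$z_{hij} = w_{hij}\, y_{hij}$$
$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}$$
$$\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}$$
$$s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left( z_{hi} - \bar{z}_h \right)^2$$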
For multi-stage samples, the index $h$ denotes a stratum in the given stage, and $i$ stands for
a unit from stratum $h$ in the same stage. The index $j$ runs over all final stage elements
contained in unit $hi$.
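The z quantities for one stratum can be sketched in Python as follows. The sketch assumes the weighted values z = w·y are summed within each sampled unit, averaged over the units of the stratum, and that the between-unit variance uses the usual n_h − 1 divisor, so it needs at least two units in the stratum. The data layout (a list of units, each a list of (weight, value) pairs) and the function name are illustrative assumptions.

```python
def z_summaries(units):
    """units: sampled units of one stratum, each a list of (weight, value) pairs.

    Returns (unit totals z_hi, their mean, their variance with divisor n_h - 1).
    """
    z_hi = [sum(w * y for w, y in unit) for unit in units]
    n_h = len(z_hi)
    z_bar = sum(z_hi) / n_h
    s2 = sum((z - z_bar) ** 2 for z in z_hi) / (n_h - 1)
    return z_hi, z_bar, s2

# Hypothetical stratum with three sampled units.
stratum = [
    [(2.0, 1.0), (2.0, 3.0)],
    [(4.0, 2.0)],
    [(1.0, 5.0), (1.0, 7.0)],
]
z_hi, z_bar, s2 = z_summaries(stratum)
print(z_hi, z_bar, s2)
```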
Variable Total
An estimate for the population total of variable y in a single-stage sample is the weighted
sum over all the strata and all the clusters:
$$\hat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$
Alternatively, we compute the weighted sum over all the elements in the sample:
$$\hat{Y} = \sum_{i=1}^{n} w_i\, y_i$$
The latter expression is more general because it also applies to multi-stage samples.
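The equivalence of the two forms of the total can be checked with a short Python sketch. The nested layout (strata containing units containing (weight, value) pairs) and the function names are assumptions made for illustration only.

```python
def total_from_elements(weights, values):
    """Weighted sum over all sampled elements."""
    return sum(w * y for w, y in zip(weights, values))

def total_from_strata(strata):
    """strata: list of strata; each stratum is a list of units;
    each unit is a list of (weight, value) pairs."""
    return sum(w * y for stratum in strata for unit in stratum for w, y in unit)

# Hypothetical sample: two strata with two and one sampled units.
strata = [
    [[(2.0, 1.0), (2.0, 3.0)], [(4.0, 2.0)]],
    [[(1.0, 5.0), (1.0, 7.0)]],
]
flat = [pair for stratum in strata for unit in stratum for pair in unit]
weights = [w for w, _ in flat]
values = [y for _, y in flat]

print(total_from_strata(strata), total_from_elements(weights, values))
```

Both calls return the same number because the flat sum simply ignores the grouping of elements into strata and units.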
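For a two-stage sample, the variance of the total is estimated by
$$\hat{V}_2(\hat{Y}) = \hat{V}_1(\hat{Y}) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{k=1}^{K_{hi}} \pi_{hi}\, U_{hik}$$
where
$\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit $i$ in stratum $h$.
In the case of simple random sampling, the inclusion probability is equal to the sampling
rate $f_h$ for stratum $h$.
$K_{hi}$ is the number of second stage strata in the primary sampling unit $i$ within the first
stage stratum $h$.
$U_{hik}$ is a variance contribution from the second stage stratum $k$ from the primary
sampling unit $hi$. It depends on the second stage sampling method. The corresponding
formula (1) or (2) applies.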
Three-stage sample
When the sample is obtained in three stages where sampling in the first stage is done without
replacement and simple random sampling is applied in the second stage, we use the
following estimate for the variance of the total for variable y :
$$\hat{V}(\hat{Y}) = \hat{V}_2(\hat{Y}) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{k=1}^{K_{hi}} \sum_{j=1}^{n_{hik}} \sum_{l=1}^{L_{hikj}} \pi_{hi}\, f_{hik}\, U_{hikjl}$$
where
$f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum
$hik$.
$L_{hikj}$ is the number of third stage strata in the secondary sampling unit $hikj$.
$U_{hikjl}$ is a variance contribution from the third stage stratum $l$ contained in the
secondary sampling unit $hikj$. It depends on the third stage sampling method.
The corresponding formula (1) or (2) applies.
Population Size Estimation
An estimate for the population size corresponds to the estimate for the variable total; it is
the sum of the sampling weights. We have the following estimate for single-stage samples:
$$\hat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$
More generally,
$$\hat{N} = \sum_{i=1}^{n} w_i$$
The variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with
$w_{hij}$ in the corresponding variance estimator formula for $\hat{V}(\hat{Y})$.
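A small Python sketch of the population size estimate follows. It also shows the substitution used for the variance: setting every value of y to 1 turns each weighted value z = w·y into the weight itself. The function names and weights are hypothetical.

```python
def estimate_population_size(weights):
    """Population size estimate: the sum of the overall sampling weights."""
    return sum(weights)

def size_z_values(weights):
    """z values for the variance of the size estimate: y is replaced by 1, so z = w."""
    return [w * 1.0 for w in weights]

# Hypothetical overall weights for five sampled elements.
weights = [2.0, 2.0, 4.0, 1.0, 1.0]
n_hat = estimate_population_size(weights)
z = size_z_values(weights)
print(n_hat, z)
```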
Ratio and Mean Estimation
Let $R = Y / X$ be the ratio of the totals for variables $y$ and $x$. It is estimated by
$$\hat{R} = \hat{Y} / \hat{X}$$
where $\hat{Y}$ and $\hat{X}$ are the estimates for the corresponding variable totals.
The variance of $\hat{R}$ is approximated using the Taylor linearization formula following
Woodruff (1971). The estimate for the approximate variance of the ratio estimate, $\hat{V}(\hat{R})$, is
obtained by replacing $z_{hij}$ with
$$z_{hij} = w_{hij} \left( y_{hij} - \hat{R}\, x_{hij} \right) / \hat{X}$$
in the corresponding variance estimator $\hat{V}(\hat{Y})$.
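The ratio estimate and its linearized values can be sketched in Python as below, assuming the linearized value for each element is w·(y − R̂·x)/X̂. The function names and the data are illustrative. A useful sanity check is that the linearized values sum to zero, since Ŷ − R̂·X̂ = 0 by construction.

```python
def weighted_total(weights, values):
    return sum(w * v for w, v in zip(weights, values))

def ratio_estimate(weights, y, x):
    """Ratio of the estimated totals of y and x."""
    return weighted_total(weights, y) / weighted_total(weights, x)

def ratio_linearized_values(weights, y, x):
    """Linearized values w * (y - R * x) / X fed to the variance estimator."""
    r_hat = ratio_estimate(weights, y, x)
    x_hat = weighted_total(weights, x)
    return [w * (yi - r_hat * xi) / x_hat for w, yi, xi in zip(weights, y, x)]

# Hypothetical sample of three elements.
weights = [2.0, 3.0, 5.0]
y = [4.0, 6.0, 10.0]
x = [2.0, 2.0, 4.0]

r_hat = ratio_estimate(weights, y, x)
z = ratio_linearized_values(weights, y, x)
print(r_hat, sum(z))
```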
Mean Estimation
The mean $\bar{Y}$ for the variable $y$ is estimated by
$$\hat{\bar{Y}} = \hat{Y} / \hat{N}$$
where $\hat{Y}$ is the estimate for the total of $y$ and $\hat{N}$ is the population size estimate.
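The variance of the mean is estimated using the ratio formulas, as the mean is a ratio of
$\hat{Y}$ and $\hat{N}$. Accordingly, $\hat{V}(\hat{\bar{Y}})$ is obtained by substituting $z_{hij}$ with
$$z_{hij} = w_{hij} \left( y_{hij} - \hat{Y} / \hat{N} \right) / \hat{N}$$
in the corresponding variance estimator $\hat{V}(\hat{Y})$.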
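Treating the mean as the ratio of the estimated total to the estimated population size gives the following Python sketch. It assumes the linearized value for each element is w·(y − Ŷ/N̂)/N̂; names and data are hypothetical.

```python
def mean_estimate(weights, y):
    """Estimated mean: weighted total of y divided by the sum of weights."""
    y_hat = sum(w * yi for w, yi in zip(weights, y))
    n_hat = sum(weights)
    return y_hat / n_hat

def mean_linearized_values(weights, y):
    """Linearized values w * (y - mean) / N fed to the variance estimator."""
    n_hat = sum(weights)
    y_bar = mean_estimate(weights, y)
    return [w * (yi - y_bar) / n_hat for w, yi in zip(weights, y)]

# Hypothetical sample of three elements.
weights = [2.0, 3.0, 5.0]
y = [4.0, 6.0, 10.0]

y_bar = mean_estimate(weights, y)
z = mean_linearized_values(weights, y)
print(y_bar, sum(z))
```

As with the ratio, the linearized values sum to zero up to rounding.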
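For a subpopulation $d$, the corresponding linearized values for the ratio and the mean are
$$z_{hij}(d) = \delta_{hij}(d)\, w_{hij} \left( y_{hij} - \hat{R}_d\, x_{hij} \right) / \hat{X}_d$$
$$z_{hij}(d) = \delta_{hij}(d)\, w_{hij} \left( y_{hij} - \hat{\bar{Y}}_d \right) / \hat{N}_d$$
where $\delta_{hij}(d)$ equals 1 if element $hij$ belongs to subpopulation $d$ and 0 otherwise, and the
subscript $d$ marks the estimates computed for that subpopulation.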
Standard Errors
Let $Z$ denote any of the population or subpopulation quantities defined above: variable total,
population size, ratio or mean. Then the standard error of an estimator $\hat{Z}$ is the square root
of its estimated variance:
$$\mathrm{StdError}(\hat{Z}) = \sqrt{\hat{V}(\hat{Z})}$$
Coefficient of variation
The coefficient of variation of the estimator $\hat{Z}$ is the ratio of its standard error and its value:
$$\mathrm{CV}(\hat{Z}) = \mathrm{StdError}(\hat{Z}) / \hat{Z}$$
The coefficient of variation is undefined when $\hat{Z} = 0$.
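The standard error and the coefficient of variation follow directly from an estimate and its estimated variance. The sketch below is illustrative; returning `None` for the undefined case is a choice made here, not something prescribed by the procedure.

```python
import math

def std_error(variance):
    """Standard error: square root of the estimated variance."""
    return math.sqrt(variance)

def coefficient_of_variation(estimate, variance):
    """Standard error divided by the estimate; None when the estimate is 0."""
    if estimate == 0:
        return None
    return std_error(variance) / estimate

# Hypothetical estimate of a total and its estimated variance.
se = std_error(25.0)
cv = coefficient_of_variation(200.0, 25.0)
print(se, cv)
```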
t Tests
Testing the hypothesis $H_0: Z = \theta_0$ that a population quantity $Z$ equals $\theta_0$ is
performed using the t test statistic:
$$t(\hat{Z}) = \frac{\hat{Z} - \theta_0}{\mathrm{StdError}(\hat{Z})}$$
The p-value for the two-sided test is given by the probability
$$P\left( |T| > |t(\hat{Z})| \right)$$
where $T$ is a random variable from the t distribution with $df$ degrees of freedom.
Degrees of freedom
The number of the degrees of freedom df for the t distribution is calculated as the
difference between the number of primary sampling units and the number of strata in the first
stage of sampling.
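A minimal Python sketch of the test statistic and the degrees of freedom is given below. It takes the first-stage design as a list with the number of sampled primary sampling units in each stratum; the function names and numbers are hypothetical. The p-value itself requires the t distribution, which is not in the Python standard library, so it is left to a statistics package.

```python
def degrees_of_freedom(psus_per_stratum):
    """Number of first-stage PSUs minus number of first-stage strata."""
    return sum(psus_per_stratum) - len(psus_per_stratum)

def t_statistic(estimate, std_error, hypothesized=0.0):
    """(estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error

# Hypothetical design: three strata with 4, 5 and 3 sampled PSUs.
df = degrees_of_freedom([4, 5, 3])
t = t_statistic(7.6, 0.3, hypothesized=7.0)
print(df, t)
```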
Confidence Limits
A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$. The confidence
bounds are defined as
$$\hat{Z} \pm \mathrm{StdError}(\hat{Z})\, t_{df}(1 - \alpha/2)$$
where $\mathrm{StdError}(\hat{Z})$ is the estimated standard error of $\hat{Z}$, and $t_{df}(1 - \alpha/2)$ is the
$100(1 - \alpha/2)$ percentile of the t distribution with $df$ degrees of freedom.
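The confidence bounds can be sketched as follows. To stay within the Python standard library, the t percentile is passed in as an argument rather than computed; the value 2.262 used below is the tabulated 97.5th percentile of the t distribution with 9 degrees of freedom. The function name and the numbers are illustrative.

```python
def confidence_bounds(estimate, std_error, t_critical):
    """Lower and upper bounds: estimate -/+ std_error * t percentile."""
    half_width = std_error * t_critical
    return estimate - half_width, estimate + half_width

# Hypothetical 95% interval with df = 9 (t percentile taken from a table).
lower, upper = confidence_bounds(7.6, 0.3, 2.262)
print(lower, upper)
```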
Design Effects
The design effect Deff is estimated by
$$\mathrm{Deff} = \hat{V}(\hat{Y}) / \hat{V}_{srs}(\tilde{Y}_{srs})$$
$\hat{V}(\hat{Y})$ is the estimate of variance of $\hat{Y}$ under the appropriate sampling design, while
$\hat{V}_{srs}(\tilde{Y}_{srs})$ is the estimate of variance of $\tilde{Y}_{srs}$ under the simple random sampling assumption
as follows:
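$$\hat{V}_{srs}(\tilde{Y}_{srs}) = \left( 1 - \frac{n}{\hat{N}} \right) \frac{\hat{N}}{n - 1} \sum_{i=1}^{n} w_i \left( y_i - \hat{\bar{Y}} \right)^2$$
Deff is undefined when $n = 1$.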
Whereas the design effect is not relevant for estimates of the population size, we do compute
design effects for ratios and means in addition to the totals. The values of variable $y$ in $\hat{V}_{srs}(\tilde{Y}_{srs})$
are then replaced by the linearized values as follows:
$$\left( y_i - \hat{R}\, x_i \right) / \hat{X}$$
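The design effect computation can be sketched in Python as below. The simple random sampling variance used here is an assumption of this sketch: it takes the common textbook form fpc · N̂/(n − 1) · Σ w_i (y_i − mean)², with fpc = 1 − n/N̂ for sampling without replacement, and may differ in detail from the form the procedure uses. The design-based variance is taken as a given input, and all names and numbers are hypothetical.

```python
def srs_variance_of_total(weights, values):
    """ASSUMED form: (1 - n/N) * N / (n - 1) * sum(w * (y - mean)^2)."""
    n = len(weights)
    n_hat = sum(weights)
    y_bar = sum(w * y for w, y in zip(weights, values)) / n_hat
    fpc = 1.0 - n / n_hat
    ss = sum(w * (y - y_bar) ** 2 for w, y in zip(weights, values))
    return fpc * n_hat / (n - 1) * ss

def design_effect(design_variance, weights, values):
    """Design-based variance over SRS variance; None when undefined."""
    if len(weights) <= 1:
        return None
    v_srs = srs_variance_of_total(weights, values)
    if v_srs == 0:
        return None
    return design_variance / v_srs

# Hypothetical sample and a hypothetical design-based variance of the total.
weights = [2.0, 3.0, 5.0]
values = [4.0, 6.0, 10.0]
v_srs = srs_variance_of_total(weights, values)
deff = design_effect(327.6, weights, values)
print(v_srs, deff)
```

For a ratio or a mean, the same function can be called with the linearized values in place of `values`.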