











Formulas for estimating population totals and variances in complex samples using the Hansen-Hurwitz estimator, variance estimation for simple and complex sampling designs, and estimation of cell and table population sizes. The notes also cover row and column population proportions, standard errors, and Pearson statistic adjustments.












This document describes the algorithms used in the complex sampling estimation procedure
CSTABULATE.
Complex sample data must contain both the values of the variables to be analyzed and the
information on the current sampling design. The sampling design includes the sampling
method, strata and clustering information, inclusion probabilities and the overall sampling
weights.
The sampling design specification for CSTABULATE may include up to three stages of
sampling. Any of the following general sampling methods may be assumed in the first stage:
random sampling with replacement, random sampling without replacement with equal
probabilities, and random sampling without replacement with unequal probabilities. The first
two sampling methods can also be specified for the second and the third sampling stages.
The following notation is used throughout this chapter unless otherwise stated:
$H$: Number of strata.
$n_h$: Sampled number of primary sampling units (PSU) per stratum.
$f_h$: Sampling rate per stratum.
$m_{hi}$: Number of elements in the $i$th sampled unit in stratum $h$, $i = 1, \ldots, n_h$.
$y_{hij}$: Value of variable $y$ for the $j$th element in the $i$th sampled unit in stratum $h$.
$w_{hij}$: Overall sampling weight for the $j$th element in the $i$th sampled unit in stratum $h$.
$n$: Total number of elements in the sample.
$N$: Total number of elements in the population.
$Y$: Population total sum for variable $y$.
Weights
Overall weights specified for each ultimate element are processed as given. They can be
obtained as a product of weights for corresponding units computed in each sampling stage.
When sampling without replacement in a given stage, substituting $w_{hi} = 1/\pi_{hi}$ for unit $i$ in
stratum $h$ results in the application of the estimator for the population totals due to Horvitz
and Thompson (1952). The corresponding variance estimator (2) or (3) will also be unbiased.
$\pi_{hi}$ is the probability of unit $i$ from stratum $h$ being selected in the given stage.
If sampling with replacement in a given stage, substituting $w_{hi} = 1/(n_h p_{hi})$ yields the
estimator for the population totals due to Hansen and Hurwitz (1943). Repeatedly selected
units should be replicated in the data. The corresponding variance estimator (1) will be
unbiased. $p_{hi}$ is the probability of selecting unit $i$ in a single draw from stratum $h$ in the
given stage.
Weights obtained in each sampling stage need to be multiplied when processing multi-stage
samples. The resulting overall weights for the elements in the final stage are used in all
expressions and formulas below.
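The weight construction described above can be sketched as follows. This is a minimal illustration, not the procedure's own code: the function names and the two-stage design in the usage lines are hypothetical, and the stage weights follow the two substitutions named in the text (1/pi for sampling without replacement, 1/(n_h p) for sampling with replacement).

```python
from math import prod

def ht_weight(pi_hi):
    """Stage weight under sampling without replacement: w = 1 / pi."""
    return 1.0 / pi_hi

def hh_weight(n_h, p_hi):
    """Stage weight under sampling with replacement: w = 1 / (n_h * p)."""
    return 1.0 / (n_h * p_hi)

def overall_weight(stage_weights):
    """Overall weight of a final-stage element: product of its stage weights."""
    return prod(stage_weights)

# Hypothetical two-stage design: the PSU is drawn with replacement
# (n_h = 4 draws, single-draw probability 0.05); the element is then drawn
# without replacement with inclusion probability 0.25 inside that PSU.
w = overall_weight([hh_weight(4, 0.05), ht_weight(0.25)])
print(w)
```

A PSU selected more than once under with-replacement sampling would appear once per selection in the data, each copy carrying the same weight.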
Z expressions
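$$z_{hij} = w_{hij}\, y_{hij}, \qquad z'_{hij} = w_{hij}\, y'_{hij}$$
$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}, \qquad z'_{hi} = \sum_{j=1}^{m_{hi}} z'_{hij}$$
$$\bar z_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}, \qquad \bar z'_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z'_{hi}$$
$$S_h^2(y, y') = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left(z_{hi} - \bar z_h\right)\left(z'_{hi} - \bar z'_h\right)$$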
For multi-stage samples, the index h denotes a stratum in the given stage, and i stands for
unit from h in the same stage. The index j runs over all final stage elements contained in
unit hi.
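As a rough illustration of the Z expressions for one stratum, the sketch below forms the unit totals z_hi (sum of w*y over the elements of unit hi), their mean over the sampled units, and the between-unit variance of those totals. The 1/(n_h - 1) normalisation and the function name are assumptions of this example.

```python
def stratum_z_summaries(psus):
    """Summaries for one stratum.

    psus: list of sampled units; each unit is a list of (w, y) pairs.
    Returns (z_hi, z_bar_h, s2_h). The 1/(n_h - 1) normalisation of s2_h
    is an assumption of this sketch.
    """
    z_hi = [sum(w * y for w, y in psu) for psu in psus]
    n_h = len(z_hi)
    z_bar = sum(z_hi) / n_h
    if n_h < 2:
        # A single sampled unit carries no between-unit information.
        return z_hi, z_bar, 0.0
    s2 = sum((z - z_bar) ** 2 for z in z_hi) / (n_h - 1)
    return z_hi, z_bar, s2

# Hypothetical stratum with three sampled units.
example_stratum = [[(2, 1), (2, 3)], [(2, 2), (2, 2)], [(2, 5)]]
z_hi, z_bar, s2 = stratum_z_summaries(example_stratum)
```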
An estimate for the population total of variable y in a single-stage sample is the weighted
sum over all the strata and all the clusters:
$$\hat Y = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$
Alternatively, we compute the weighted sum over all the elements in the sample:
$$\hat Y = \sum_{i=1}^{n} w_i\, y_i$$
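The two ways of forming the total (a nested sum over strata, units and elements, or a flat sum over all sampled elements) give the same number. A minimal sketch with hypothetical data and function names:

```python
def total_nested(strata):
    """Weighted total over strata -> sampled units -> elements.

    strata: list of strata; each stratum is a list of units;
    each unit is a list of (w, y) pairs.
    """
    return sum(w * y for psus in strata for psu in psus for w, y in psu)

def total_flat(elements):
    """Weighted total over a flat list of (w, y) pairs."""
    return sum(w * y for w, y in elements)

# Hypothetical sample: two strata with three and two sampled units.
strata = [
    [[(2, 1), (2, 3)], [(2, 2), (2, 2)], [(2, 5)]],
    [[(3, 1)], [(3, 2)]],
]
flat = [pair for psus in strata for psu in psus for pair in psu]
y_hat = total_nested(strata)
```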
For each stratum $h$ containing a single element, the covariance contribution $U_h(\hat Y, \hat Y')$ is
always set to zero.
Two-stage sample
When the sample is obtained in two stages and sampling without replacement is applied in
the first stage, we use the following estimate for the covariance of the total for variables
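$y$ and $y'$:
$$\hat C_2(\hat Y, \hat Y') = \hat C_1(\hat Y, \hat Y') + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{k=1}^{K_{hi}} \pi_{hi}\, U_{hik}(\hat Y, \hat Y')$$
where
$\hat C_1(\hat Y, \hat Y')$ is the covariance estimate obtained from the first stage alone.
$\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit $i$ in stratum $h$. In
the case of simple random sampling, the inclusion probability is equal to the sampling rate
$f_h$ for stratum $h$.
$K_{hi}$ is the number of second stage strata in the primary sampling unit $i$ within the first
stage stratum $h$.
$U_{hik}(\hat Y, \hat Y')$ is a covariance contribution from the second stage stratum $k$ from the
primary sampling unit $hi$. It depends on the second stage sampling method. The
corresponding formula (1) or (2) applies.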
Three-stage sample
When the sample is obtained in three stages where sampling in the first stage is done without
replacement and simple random sampling is applied in the second stage, we use the
following estimate for the covariance of the total for variables $y$ and $y'$:
$$\hat C_3(\hat Y, \hat Y') = \hat C_2(\hat Y, \hat Y') + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{k=1}^{K_{hi}} \sum_{j=1}^{n_{hik}} \sum_{l=1}^{L_{hikj}} \pi_{hi}\, f_{hik}\, U_{hikjl}(\hat Y, \hat Y')$$
where
$f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum $hik$.
$L_{hikj}$ is the number of the third stage strata in the secondary sampling unit $hikj$.
$U_{hikjl}(\hat Y, \hat Y')$ is a covariance contribution from the third stage stratum $l$ contained in the
secondary sampling unit $hikj$. It depends on the third stage sampling method. The
corresponding formula (1) or (2) applies.
Variable total variance
The variance of the total for variable $y$ in a complex sample is estimated by
$$\hat V(\hat Y) = \hat C(\hat Y, \hat Y)$$
with $\hat C(\hat Y, \hat Y)$ defined above.
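For orientation, here is a minimal sketch of the variance of a total in the simplest setting: a single-stage design with with-replacement sampling of units inside each stratum. The stratum contribution used, n_h/(n_h - 1) times the sum of squared deviations of the unit totals z_hi from their stratum mean, is the usual with-replacement estimator; it is an assumption of this example, since the stratum contribution formulas are not reproduced in this section. Strata with a single sampled unit contribute zero, as stated earlier. The function name and data are hypothetical.

```python
def var_total_wr(strata):
    """Variance estimate of a weighted total, single stage, with replacement.

    strata: list of strata; each stratum is a list of sampled units;
    each unit is a list of (w, y) pairs.
    """
    total_var = 0.0
    for psus in strata:
        z = [sum(w * y for w, y in psu) for psu in psus]  # unit totals z_hi
        n_h = len(z)
        if n_h < 2:
            continue  # single-unit stratum: contribution set to zero
        z_bar = sum(z) / n_h
        total_var += n_h / (n_h - 1) * sum((v - z_bar) ** 2 for v in z)
    return total_var

# Hypothetical sample: two strata with three and two sampled units.
strata = [
    [[(2, 1), (2, 3)], [(2, 2), (2, 2)], [(2, 5)]],
    [[(3, 1)], [(3, 2)]],
]
v_hat = var_total_wr(strata)
```

Without-replacement designs would add a finite population correction or joint inclusion probabilities, which this sketch deliberately leaves out.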
An estimate for the population size corresponds to the estimate for the variable total; it is
the sum of the sampling weights. We have the following estimate for single-stage samples:
$$\hat N = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}.$$
More generally,
$$\hat N = \sum_{i=1}^{n} w_i.$$
The variance of $\hat N$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with $w_{hij}$
in the corresponding variance estimator formula for $\hat V(\hat Y)$.
Let the population be classified according to the values of a single categorical row variable and
possibly one or more categorical variables in the layer. Categories for the row variable are
enumerated by $r = 1, \ldots, R$ and categories for the layer variables are given by $l = 1, \ldots, L$.
Each combination of the values $(r, l)$ defines a domain and a cell in the one-way table
$(r, l)$, $r = 1, \ldots, R$. For each cell $(r, l)$ we define a corresponding indicator variable:
$$\delta_{hij}(r, l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r, l) \\ 0 & \text{otherwise} \end{cases}$$
Sizes
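To estimate a cell population size or a table population size, we replace $y_i$ with $\delta_i(r, l)$ in
the formula for the population total and obtain the following expressions:
$$\hat N(r, l) = \sum_{i=1}^{n} w_i\, \delta_i(r, l), \qquad \hat N(+, l) = \sum_{r=1}^{R} \hat N(r, l)$$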
Let the population be cross-classified according to the values of a categorical row variable, a
categorical column variable and possibly one or more categorical variables in the layer.
Categories for the row variable are enumerated by r = 1 , …, R , while categories for the
column variable are denoted by c = 1 , …, C and categories for the layer variables are given
by l = 1 , …, L. Each combination of values ( r , c , l ) defines a domain and a cell in the
two-way table ( r , c , l ), where r = 1 , …, R and c = 1 , …, C. For each cell ( r , c , l ) we
define a corresponding indicator variable:
$$\delta_{hij}(r, c, l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r, c, l) \\ 0 & \text{otherwise} \end{cases}$$
We will also use the following indicator notation:
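$$\delta_i(r, +, l) = \sum_{c=1}^{C} \delta_i(r, c, l)$$
$$\delta_i(+, c, l) = \sum_{r=1}^{R} \delta_i(r, c, l)$$
$$\delta_i(+, +, l) = \sum_{r=1}^{R} \sum_{c=1}^{C} \delta_i(r, c, l)$$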
Sizes
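To estimate various domain sizes, we substitute $y_i$ with the corresponding indicator $\delta_i$ in the formula
for the whole population as follows:
$$\hat N(r, c, l) = \sum_{i=1}^{n} w_i\, \delta_i(r, c, l)$$
$$\hat N(r, +, l) = \sum_{i=1}^{n} w_i\, \delta_i(r, +, l)$$
$$\hat N(+, c, l) = \sum_{i=1}^{n} w_i\, \delta_i(+, c, l)$$
$$\hat N(+, +, l) = \sum_{i=1}^{n} w_i\, \delta_i(+, +, l)$$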
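The domain size estimates amount to summing the weights of the sampled elements that fall in each cell, row, column, or the whole table. A minimal sketch for a single layer, with a hypothetical function name and data:

```python
def domain_sizes(elements):
    """Estimated cell, row, column and table sizes for one layer.

    elements: iterable of (w, r, c) triples, where w is the overall weight
    and (r, c) are the row and column categories of the element.
    """
    cell, row, col = {}, {}, {}
    table = 0.0
    for w, r, c in elements:
        cell[(r, c)] = cell.get((r, c), 0.0) + w  # indicator of cell (r, c)
        row[r] = row.get(r, 0.0) + w              # row margin
        col[c] = col.get(c, 0.0) + w              # column margin
        table += w                                # table total
    return cell, row, col, table

# Hypothetical weighted sample of five elements in a 2 x 2 table.
sample = [(10, 1, 1), (20, 1, 2), (30, 2, 1), (40, 2, 2), (5, 1, 1)]
cell, row, col, table = domain_sizes(sample)
```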
Similarly, in order to estimate the variance of the above estimators, we substitute $y_{hij}$ with $\delta_{hij}$
in the corresponding formula for the whole population. The following substitutions of $z_{hij}$ in
the formulas for $\hat V(\hat Y)$ are used for estimating variances of $\hat N(r, c, l)$, $\hat N(r, +, l)$,
$\hat N(+, c, l)$ and $\hat N(+, +, l)$, respectively:
$$z_{hij}(r, c, l) = w_{hij}\, \delta_{hij}(r, c, l)$$
$$z_{hij}(r, +, l) = w_{hij}\, \delta_{hij}(r, +, l)$$
$$z_{hij}(+, c, l) = w_{hij}\, \delta_{hij}(+, c, l)$$
$$z_{hij}(+, +, l) = w_{hij}\, \delta_{hij}(+, +, l)$$
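For the table, marginal column and marginal row proportions, the substitutions are:
$$z_{hij}(r, c, l) = \frac{w_{hij}}{\hat N(+, +, l)} \left[ \delta_{hij}(r, c, l) - \delta_{hij}(+, +, l)\, \hat P_{tab}(r, c, l) \right]$$
$$z_{hij}(+, c, l) = \frac{w_{hij}}{\hat N(+, +, l)} \left[ \delta_{hij}(+, c, l) - \delta_{hij}(+, +, l)\, \hat P_{mcol}(+, c, l) \right]$$
$$z_{hij}(r, +, l) = \frac{w_{hij}}{\hat N(+, +, l)} \left[ \delta_{hij}(r, +, l) - \delta_{hij}(+, +, l)\, \hat P_{mrow}(r, +, l) \right]$$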
Let $Z$ denote any of the domain quantities defined above: cell population sizes or
proportions. Then the standard error of an estimator $\hat Z$ is the square root of its estimated
variance:
$$SE(\hat Z) = \sqrt{\hat V(\hat Z)}$$
Coefficient of variation
The coefficient of variation of the estimator $\hat Z$ is the ratio of its standard error and its value:
$$CV(\hat Z) = \frac{SE(\hat Z)}{\hat Z}$$
The coefficient of variation is undefined when $\hat Z = 0$.
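Both quantities are one-liners once a variance estimate is available. A small sketch with hypothetical function names; returning None for a zero estimate is a choice made for this example to signal the undefined case.

```python
import math

def standard_error(variance):
    """Standard error: square root of the estimated variance."""
    return math.sqrt(variance)

def coefficient_of_variation(estimate, variance):
    """Standard error divided by the estimate; None when the estimate is 0."""
    if estimate == 0:
        return None  # undefined in this case
    return math.sqrt(variance) / estimate

se = standard_error(4.0)
cv = coefficient_of_variation(10.0, 4.0)
```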
Sizes
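A confidence interval is constructed for any domain size $\hat N_d$ defined earlier. The confidence bounds are then given by
$$\hat N_d \pm SE(\hat N_d)\, t_{\nu,\,1-\alpha/2}$$
where $SE(\hat N_d)$ is the estimated standard error of $\hat N_d$ and $t_{\nu,\,1-\alpha/2}$ is the
$100(1-\alpha/2)$ percentile of the $t$ distribution with $\nu$ degrees of freedom.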
Proportions
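For any domain proportion $\hat P_d$, we use the logistic transformation $f(p) = \ln(p / (1 - p))$
and obtain the following confidence bounds for the transformed proportion:
$$\ln\left(\frac{\hat P_d}{1 - \hat P_d}\right) \pm \frac{SE(\hat P_d)}{\hat P_d (1 - \hat P_d)}\, t_{\nu,\,1-\alpha/2}.$$
These bounds are transformed back to the original metric using the logistic inverse
$$f^{-1}(y) = \frac{\exp(y)}{1 + \exp(y)}.$$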
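A minimal sketch of confidence bounds for a proportion on the logistic scale. The critical value of the t distribution is passed in by the caller, and the standard error on the transformed scale is taken as SE(p) / (p (1 - p)), the first-order delta-method scaling; both choices, and the function names, are assumptions of this example.

```python
import math

def logit(p):
    """Logistic transformation f(p) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def inv_logit(y):
    """Inverse transformation exp(y) / (1 + exp(y))."""
    return math.exp(y) / (1.0 + math.exp(y))

def proportion_ci(p, se, t_crit):
    """Confidence bounds for a proportion strictly between 0 and 1.

    p: estimated proportion, se: its standard error,
    t_crit: t critical value supplied by the caller (assumed input).
    """
    half_width = t_crit * se / (p * (1.0 - p))  # half-width on the logit scale
    centre = logit(p)
    return inv_logit(centre - half_width), inv_logit(centre + half_width)

lower, upper = proportion_ci(0.3, 0.05, 2.0)
```

Because the bounds are mapped back through the inverse transformation, they always stay inside the interval (0, 1) and are not symmetric around the estimate.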
Degrees of freedom
The degrees of freedom $\nu$ used for computing confidence limits is calculated as the difference
between the number of primary sampling units and the number of strata in the first stage of
sampling. We shall also refer to this quantity as the sample design degrees of freedom.
Sizes
The design effect $Deff$ for a two-way table cell population size is estimated by
$$Deff = \frac{\hat V(\hat N(r, c, l))}{\hat V_{srs}(\hat N(r, c, l))}$$
Let the population be cross-classified according to the values of a categorical row variable, a
categorical column variable and possibly one or more categorical variables in the layer.
Categories for the row variable are enumerated by r = 1 , …, R , while categories for the
column variable are denoted by c = 1 , …, C. When the layer variables are given we assume
that their categories coincide with the strata in the first sampling stage. In the following we
omit reference to the layers as the formulas apply for each stratum separately when needed.
We use a contrast matrix $C$ defined as follows. Let $A_R$ be the contrast matrix given by
$$A_R = \left[\, I_{R-1},\; -\mathbf{1}_{R-1} \,\right]$$
where $I_{R-1}$ is an identity matrix of size $R - 1$ and $\mathbf{1}_{R-1}$ is a vector with $R - 1$ elements equal to 1.
Define $C$ to be an $RC \times (R - 1)(C - 1)$ matrix defined by the following Kronecker product:
$$C = A_R' \otimes A_C'$$
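The construction of the contrast matrix can be sketched in plain Python. The example assumes the usual form of A_R, an identity block of size R - 1 followed by a column of -1, and builds C as the Kronecker product of the transposes of A_R and A_C, which gives the stated RC x (R - 1)(C - 1) shape. Function names are hypothetical.

```python
def contrast(R):
    """(R - 1) x R contrast matrix: identity block followed by a column of -1."""
    rows = []
    for i in range(R - 1):
        row = [0] * R
        row[i] = 1
        row[R - 1] = -1
        rows.append(row)
    return rows

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def independence_contrasts(R, C):
    """RC x (R - 1)(C - 1) matrix: kron(transpose(A_R), transpose(A_C))."""
    return kron(transpose(contrast(R)), transpose(contrast(C)))

Cmat = independence_contrasts(3, 4)
```

Each column of the result annihilates any table of the additive form u_r + v_c (cells ordered row by row), which is what independence looks like on the log scale.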
Adjusted Pearson statistic test of independence
We provide an adjusted Pearson statistic test. The Pearson statistic is computed according to
the following standard formula:
$$X^2 = n \sum_{r=1}^{R} \sum_{c=1}^{C} \frac{\left(\hat P(r, c) - \hat P(r, +)\, \hat P(+, c)\right)^2}{\hat P(r, +)\, \hat P(+, c)}$$
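A short sketch of the unadjusted Pearson statistic computed from estimated table proportions: n times the sum over cells of the squared difference between the cell proportion and the product of its row and column margins, divided by that product. The function name and the example table are hypothetical, and all margins are assumed to be positive.

```python
def pearson_independence(props, n):
    """Unadjusted Pearson statistic for an R x C table of proportions.

    props: list of rows of estimated cell proportions summing to 1,
    n: sample size. All row and column margins are assumed positive.
    """
    row = [sum(r) for r in props]
    col = [sum(c) for c in zip(*props)]
    x2 = 0.0
    for r, prow in enumerate(props):
        for c, p in enumerate(prow):
            expected = row[r] * col[c]  # proportion expected under independence
            x2 += (p - expected) ** 2 / expected
    return n * x2

x2 = pearson_independence([[0.3, 0.2], [0.1, 0.4]], 100)
```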
Since under the null hypothesis the asymptotic distribution of $X^2$ is generally not a chi-square
distribution, we perform an adjustment using the following $\hat\Delta$ matrix:
$$\hat\Delta = n \left( C' D_{\hat P}^{-1} \hat M D_{\hat P}^{-1} C \right)^{-1} \left( C' D_{\hat P}^{-1} \hat V(\hat P) D_{\hat P}^{-1} C \right).$$
$\hat P$ is a vector and $D_{\hat P}$ is a diagonal matrix of size $RC$ containing elements $\hat P(r, c)$.
$\hat M = [D_{\hat P} - \hat P \hat P']$ is a multinomial covariance matrix estimating the asymptotic covariance
of $\hat P$ under the simple random sampling design, while $\hat V(\hat P)$ estimates the covariance matrix of
$\hat P$ under the complex sampling design.
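We use the F-based variant of Rao and Scott's (1984) second-order adjustment
$$F_{X^2} = \frac{X^2}{\mathrm{tr}(\hat\Delta)}$$
with
$$d = \frac{\mathrm{tr}(\hat\Delta)^2}{\mathrm{tr}(\hat\Delta^2)}.$$
The asymptotic distribution of $F_{X^2}$ is approximated by the $F(d, d\nu)$ distribution, where $\nu$ is the sample design degrees of freedom.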
Properties of this test are given in a review of simulation studies by Rao and Thomas (2003).
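Given an adjustment matrix, the F-based second-order adjustment needs only two traces: the statistic is divided by the trace of the matrix, and the degrees-of-freedom factor d is the squared trace divided by the trace of the squared matrix. The sketch below takes the matrix as an input supplied by the caller; the function names are hypothetical.

```python
def trace(A):
    return sum(A[i][i] for i in range(len(A)))

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def second_order_adjustment(statistic, delta):
    """Return (adjusted statistic, d) for a square adjustment matrix delta.

    adjusted = statistic / tr(delta)
    d        = tr(delta)^2 / tr(delta^2)
    """
    tr1 = trace(delta)
    tr2 = trace(matmul(delta, delta))
    return statistic / tr1, tr1 ** 2 / tr2

# Hypothetical 2 x 2 adjustment matrix.
f_stat, d = second_order_adjustment(6.0, [[2.0, 0.0], [0.0, 1.0]])
```

When the matrix is an identity of size k, the factor d equals k and the adjusted statistic is simply the original statistic divided by k.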
Adjusted likelihood ratio test of independence
The likelihood ratio test statistic is given by
$$G^2 = 2n \sum_{r=1}^{R} \sum_{c=1}^{C} \hat P(r, c)\, \ln\left(\frac{\hat P(r, c)}{\hat P(r, +)\, \hat P(+, c)}\right)$$
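A sketch of the unadjusted likelihood ratio statistic, taken here as 2n times the sum over cells of the cell proportion times the log of its ratio to the product of the row and column margins. Treating empty cells as contributing zero is a convention assumed for this example; the function name and the table are hypothetical.

```python
import math

def likelihood_ratio_independence(props, n):
    """Unadjusted likelihood ratio statistic for an R x C table of proportions.

    props: list of rows of estimated cell proportions summing to 1,
    n: sample size. Cells with proportion 0 are skipped (assumed convention).
    """
    row = [sum(r) for r in props]
    col = [sum(c) for c in zip(*props)]
    g2 = 0.0
    for r, prow in enumerate(props):
        for c, p in enumerate(prow):
            if p > 0:
                g2 += p * math.log(p / (row[r] * col[c]))
    return 2.0 * n * g2

g2 = likelihood_ratio_independence([[0.3, 0.2], [0.1, 0.4]], 100)
```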
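The adjusted likelihood ratio statistic is computed in an analogous manner to the Pearson
adjustment, where $\hat\Delta$ is the same as before and
$$F_{G^2} = \frac{G^2}{\mathrm{tr}(\hat\Delta)}$$
with
$$d = \frac{\mathrm{tr}(\hat\Delta)^2}{\mathrm{tr}(\hat\Delta^2)}.$$
Again, the asymptotic distribution of the adjusted statistic $F_{G^2}$ is approximated by the
$F(d, d\nu)$ distribution.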
Residuals
Under the independence hypothesis, the expected table proportion estimates are given by
$\hat E(r, c) = \hat P(r, +)\, \hat P(+, c)$ and residuals are defined as $\hat R(r, c) = \hat P(r, c) - \hat E(r, c)$ for
$r = 1, \ldots, R$ and $c = 1, \ldots, C$.
Standardized residuals are computed by
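The risk differences are estimated by
$$\hat D_1 = \frac{\hat N_{11}}{\hat N_{1+}} - \frac{\hat N_{21}}{\hat N_{2+}} \quad \text{and} \quad \hat D_2 = \frac{\hat N_{12}}{\hat N_{1+}} - \frac{\hat N_{22}}{\hat N_{2+}}.$$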
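The following substitutions of $z_{hij}$ in the formulas for $\hat V(\hat Y)$ are used for estimating
variances of the odds ratio $\widehat{OR}$, the risk ratio $\widehat{RR}_1$ and the risk difference $\hat D_1$, respectively:
$$z_{hij} = w_{hij}\, \widehat{OR} \left[ \frac{\delta_{hij}(1,1)}{\hat N_{11}} - \frac{\delta_{hij}(1,2)}{\hat N_{12}} - \frac{\delta_{hij}(2,1)}{\hat N_{21}} + \frac{\delta_{hij}(2,2)}{\hat N_{22}} \right]$$
$$z_{hij} = w_{hij}\, \widehat{RR}_1 \left[ \frac{\hat N_{12}\, \delta_{hij}(1,1)}{\hat N_{11} \hat N_{1+}} - \frac{\delta_{hij}(1,2)}{\hat N_{1+}} - \frac{\hat N_{22}\, \delta_{hij}(2,1)}{\hat N_{21} \hat N_{2+}} + \frac{\delta_{hij}(2,2)}{\hat N_{2+}} \right]$$
$$z_{hij} = w_{hij} \left[ \frac{\hat N_{12}\, \delta_{hij}(1,1) - \hat N_{11}\, \delta_{hij}(1,2)}{\hat N_{1+}^2} - \frac{\hat N_{22}\, \delta_{hij}(2,1) - \hat N_{21}\, \delta_{hij}(2,2)}{\hat N_{2+}^2} \right]$$
The estimations of variance for $\widehat{RR}_2$ and $\hat D_2$ are performed using similar substitutions.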
Confidence limits
Confidence bounds are computed for the odds ratio, risk ratio and risk difference in every table.
For the odds ratio or risk ratio $\hat R$ we use the logarithm transformation and obtain the
confidence bounds
$$\ln(\hat R) \pm \frac{SE(\hat R)}{\hat R}\, t_{\nu,\,1-\alpha/2}.$$
These bounds are transformed back to the original metric using the exponential function.
No transformations are used when estimating confidence bounds for a risk difference $\hat D$:
$$\hat D \pm SE(\hat D)\, t_{\nu,\,1-\alpha/2}.$$
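A minimal sketch of both kinds of bounds. For a ratio, the interval is built on the log scale and exponentiated; the standard error on the log scale is taken as SE(R) / R, the first-order delta-method scaling, which is an assumption of this example. The t critical value is supplied by the caller and the function names are hypothetical.

```python
import math

def ratio_ci(r, se, t_crit):
    """Bounds for an odds ratio or risk ratio r > 0 via the log transformation."""
    half_width = t_crit * se / r  # half-width on the log scale
    return math.exp(math.log(r) - half_width), math.exp(math.log(r) + half_width)

def difference_ci(d, se, t_crit):
    """Bounds for a risk difference; no transformation is applied."""
    return d - t_crit * se, d + t_crit * se

ratio_lower, ratio_upper = ratio_ci(2.0, 0.5, 2.0)
diff_lower, diff_upper = difference_ci(0.1, 0.04, 2.0)
```

The ratio bounds are symmetric around the estimate on the multiplicative scale, so their product equals the squared estimate.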
Let the population be classified according to the values of a categorical row variable and
possibly one or more categorical variables in the layer. Categories for the row variable are
enumerated by r = 1 , …, R. When the layer variables are given we assume that their
categories coincide with the strata in the first sampling stage. In the following we omit
references to the layers as the formulas apply for each stratum separately when needed.
We study proportions $P(r) = N(r) / N(+)$. The test of homogeneity consists in testing the
null hypothesis $H_0: P(r) = 1/R$ for $r = 1, \ldots, R - 1$.
Adjusted Pearson statistic test
We perform an adjusted Pearson statistic test for testing the homogeneity. The Pearson test
statistic is computed according to the following standard formula:
$$X^2 = nR \sum_{r=1}^{R} \left( \hat P(r) - \frac{1}{R} \right)^2.$$
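A short sketch of the unadjusted Pearson statistic for homogeneity, computed as n times R times the sum of squared deviations of the estimated proportions from the hypothesised value 1/R. The function name and the example proportions are hypothetical.

```python
def pearson_homogeneity(props, n):
    """Unadjusted Pearson statistic for the hypothesis P(r) = 1/R.

    props: list of R estimated proportions summing to 1, n: sample size.
    """
    R = len(props)
    return n * R * sum((p - 1.0 / R) ** 2 for p in props)

x2_hom = pearson_homogeneity([0.5, 0.3, 0.2], 90)
```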
Since the asymptotic distribution of $X^2$ is generally not the chi-square distribution, we
apply an adjustment using the $\hat\Delta$ matrix given by:
$$\hat\Delta = n\, \hat M(\hat P_0)^{-1}\, \hat V(\hat P_0).$$
$\hat V(\hat P_0)$ is the estimated covariance matrix under the complex sample design, while $\hat M(\hat P_0)$
is an estimated asymptotic covariance matrix under the simple random sampling given by
$$\hat M(\hat P_0) = \mathrm{diag}(\hat P_0) - \hat P_0 \hat P_0',$$
where $\hat P_0$ is a vector and $\mathrm{diag}(\hat P_0)$ is a diagonal matrix of size $R - 1$ containing elements
$\hat P(r)$, $r = 1, \ldots, R - 1$.
We use the F-based variant of Rao and Scott's (1984) second-order adjustment
$$F_{X^2} = \frac{X^2}{\mathrm{tr}(\hat\Delta)}$$
with
$$d = \frac{\mathrm{tr}(\hat\Delta)^2}{\mathrm{tr}(\hat\Delta^2)}.$$
Rao, J. N. K., and Thomas, D. R. (2003), “Analysis of categorical response data
from complex surveys: an appraisal and update”, in Analysis of Survey Data, ed.
R. Chambers and C. Skinner. New York: John Wiley & Sons.
Särndal, C. E., Swensson, B., and Wretman, J. H. (1992), Model Assisted Survey
Sampling, New York: Springer-Verlag.
Sen, A. R. (1953), “On the estimate of the variance in sampling with varying
probabilities”, Journal of the Indian Society of Agricultural Statistics, volume 5,
pages 55-77.
Thomas, D. R., and Rao, J. N. K. (1987), “Small-sample comparisons of level and
power for simple goodness-of-fit statistics under cluster sampling”, Journal of the
American Statistical Association, volume 82, pages 630-636.
Woodruff, R. S. (1971), “A Simple Method for Approximating the Variance of a
Complicated Estimate”, Journal of the American Statistical Association, volume
66, pages 411-414.
Yates, F., and Grundy, P. M. (1953), “Selection without replacement from within
strata with probability proportional to size”, Journal of the Royal Statistical
Society Series B, volume 15, pages 253-261.