Sampling Design and Variance Estimation: A Comprehensive Guide, Study notes of Mathematical Statistics

An in-depth exploration of sampling design, focusing on methods, strata, clustering information, and inclusion probabilities. It covers various sampling methods, including random sampling with and without replacement, and unequal probabilities. The document also discusses the estimation of population totals, ratios, and means, as well as the calculation of standard errors and design effects.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar

CSDESCRIPTIVES

This document describes the algorithms used in the complex sampling estimation procedure CSDESCRIPTIVES. The data do not have to be sorted.

Complex sample data must contain both the values of the variables to be analyzed and the information on the current sampling design. The sampling design includes the sampling method, strata and clustering information, and inclusion probabilities for all units at every sampling stage. The overall sampling weight must be specified for each observation.

The sampling design specification for CSDESCRIPTIVES may include up to three stages of sampling. Any of the following general sampling methods may be assumed in the first stage: random sampling with replacement, random sampling without replacement with equal probabilities, and random sampling without replacement with unequal probabilities. The first two sampling methods can also be specified for the second and third sampling stages.

Notation

The following notation is used throughout this chapter unless otherwise stated:

$H$  Number of strata.

$n_h$  Sampled number of primary sampling units (PSU) per stratum.

$f_h$  Sampling rate per stratum.

$m_{hi}$  Number of elements in the $i$th sampled unit in stratum $h$, $i = 1, \ldots, n_h$.

$y_{hij}$  Value of variable $y$ for the $j$th element in the $i$th sampled unit in stratum $h$.

$w_{hij}$  Overall sampling weight for the $j$th element in the $i$th sampled unit in stratum $h$.

$n$  Total number of elements in the sample.

$N$  Total number of elements in the population.

$Y$  Population total sum for variable $y$.

Weights

Overall weights specified for each ultimate element are processed as given. They can be obtained as a product of weights for corresponding units computed in each sampling stage.

When sampling without replacement in a given stage, substituting $w_{hi} = 1/\pi_{hi}$ for unit $i$ in stratum $h$ results in the application of the estimator for the population totals due to Horvitz and Thompson (1952). The corresponding variance estimator (2) or (3) will also be unbiased. $\pi_{hi}$ is the probability of unit $i$ from stratum $h$ being selected in the given stage.

When sampling with replacement in a given stage, substituting $w_{hi} = 1/(n_h p_{hi})$ yields the estimator for the population totals due to Hansen and Hurwitz (1943). Repeatedly selected units should be replicated in the data. The corresponding variance estimator (1) will be unbiased. $p_{hi}$ is the probability of selecting unit $i$ in a single draw from stratum $h$ in the given stage.

Weights obtained in each sampling stage need to be multiplied when processing multi-stage samples. The resulting overall weights for the elements in the final stage are used in all expressions and formulas below.
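The stage weights described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two-stage example values are hypothetical and are not part of CSDESCRIPTIVES.

```python
from math import prod


def horvitz_thompson_weight(inclusion_prob):
    """Stage weight 1 / pi_hi for a stage sampled without replacement."""
    if not 0.0 < inclusion_prob <= 1.0:
        raise ValueError("inclusion probability must lie in (0, 1]")
    return 1.0 / inclusion_prob


def hansen_hurwitz_weight(n_h, draw_prob):
    """Stage weight 1 / (n_h * p_hi) for a stage sampled with replacement."""
    if n_h <= 0 or not 0.0 < draw_prob <= 1.0:
        raise ValueError("need n_h > 0 and a draw probability in (0, 1]")
    return 1.0 / (n_h * draw_prob)


def overall_weight(stage_weights):
    """Overall element weight: the product of the weights from every stage."""
    return prod(stage_weights)


# Hypothetical two-stage design: the PSU is drawn without replacement with
# pi = 0.25, then the element is drawn with replacement (n_h = 5, p = 0.1).
w = overall_weight([
    horvitz_thompson_weight(0.25),
    hansen_hurwitz_weight(5, 0.1),
])
```

In this hypothetical case the first-stage weight is 4 and the second-stage weight is 2, so the overall weight attached to the element is their product.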

Z expressions

$$z_{hij} = w_{hij}\, y_{hij}$$

$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}$$

$$\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}$$

$$S_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} \left(z_{hi} - \bar{z}_h\right)^2$$

For multi-stage samples, the index $h$ denotes a stratum in the given stage, and $i$ stands for a unit from $h$ in the same stage. The index $j$ runs over all final stage elements contained in unit $hi$.
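A minimal Python sketch of these quantities follows, assuming the data are held as a hypothetical nested structure (stratum, then sampled unit, then a list of weight and value pairs). Returning `None` for the variance term when a stratum has a single unit is a choice made for this sketch only.

```python
def z_summaries(strata):
    """Compute z_hi, the stratum mean of z_hi, and S_h^2 for every stratum.

    strata maps a stratum label h to a list of sampled units; each unit is a
    list of (w_hij, y_hij) pairs for its elements.
    """
    summaries = {}
    for h, units in strata.items():
        # z_hi: weighted sum of y over the elements of unit i
        z_hi = [sum(w * y for w, y in unit) for unit in units]
        n_h = len(z_hi)
        z_bar = sum(z_hi) / n_h
        # S_h^2 needs at least two units; None marks it as unavailable here
        # (an assumption of this sketch, not documented behaviour).
        if n_h > 1:
            s2 = sum((z - z_bar) ** 2 for z in z_hi) / (n_h - 1)
        else:
            s2 = None
        summaries[h] = {"z_hi": z_hi, "z_bar": z_bar, "S2": s2}
    return summaries


# Hypothetical stratum "A" with two sampled units
example = {"A": [[(2.0, 3.0), (2.0, 5.0)], [(2.0, 4.0)]]}
result = z_summaries(example)
```

For the example stratum, the unit totals are 16 and 8, their mean is 12, and the between-unit variance term is 32.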

Variable Total

An estimate for the population total of variable $y$ in a single-stage sample is the weighted sum over all the strata and all the clusters:

$$\hat{Y} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

Alternatively, we compute the weighted sum over all the elements in the sample:

$$\hat{Y} = \sum_{i=1}^{n} w_i\, y_i$$

The latter expression is more general because it also applies to multi-stage samples.
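The equivalence of the two expressions for a single-stage sample can be checked with a short Python sketch. The data layout and function names below are assumptions made for illustration.

```python
def total_nested(strata):
    """Triple sum over strata h, sampled units i, and elements j."""
    return sum(
        w * y
        for units in strata.values()
        for unit in units
        for w, y in unit
    )


def total_flat(weights, values):
    """Single sum over all n sampled elements."""
    return sum(w * y for w, y in zip(weights, values))


# Hypothetical sample: stratum -> sampled units -> (weight, value) pairs
strata = {
    "A": [[(2.0, 3.0), (2.0, 5.0)], [(2.0, 4.0)]],
    "B": [[(5.0, 1.0)]],
}

# Flatten the same data into element-level lists
flat = [pair for units in strata.values() for unit in units for pair in unit]
weights = [w for w, _ in flat]
values = [y for _, y in flat]
```

Both `total_nested(strata)` and `total_flat(weights, values)` add up the same weighted values, so they return the same total. Only the flat form carries over unchanged to multi-stage data, since it needs nothing but the overall weights.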

$$\hat{V}(\hat{Y}) = \hat{V}_2(\hat{Y}) = \hat{V}_1(\hat{Y}) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} \hat{U}_{hik}$$

where

  • $\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit $i$ in stratum $h$. In the case of simple random sampling, the inclusion probability is equal to the sampling rate $f_h$ for stratum $h$.

  • $K_{hi}$ is the number of second stage strata in the primary sampling unit $i$ within the first stage stratum $h$.

  • $\hat{U}_{hik}$ is a variance contribution from the second stage stratum $k$ from the primary sampling unit $hi$. It depends on the second stage sampling method. The corresponding formula (1) or (2) applies.
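As a rough sketch, the way the second stage contributions enter the variance can be written in Python as below. It assumes the combination takes the form of the first stage variance plus the sum, over primary sampling units, of the inclusion probability times the summed second stage contributions. The function name and the numbers are hypothetical, and the first stage variance and the contributions are taken as already computed.

```python
def two_stage_variance(v1, psus):
    """Combine a first stage variance with second stage contributions.

    v1   : first stage variance estimate of the total (computed elsewhere)
    psus : list of (pi_hi, [U_hik, ...]) pairs, one per sampled PSU across
           all first stage strata
    Assumed form: v1 + sum_hi pi_hi * sum_k U_hik.
    """
    return v1 + sum(pi_hi * sum(contribs) for pi_hi, contribs in psus)


# Hypothetical values: two PSUs with inclusion probabilities 0.5 and 0.25
v2 = two_stage_variance(100.0, [(0.5, [4.0, 6.0]), (0.25, [8.0])])
```

With these inputs the second stage adds 0.5 × 10 + 0.25 × 8 = 7 to the first stage variance of 100.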

Three-stage sample

When the sample is obtained in three stages where sampling in the first stage is done without replacement and simple random sampling is applied in the second stage, we use the following estimate for the variance of the total for variable $y$:

$$\hat{V}(\hat{Y}) = \hat{V}_2(\hat{Y}) + \sum_{h=1}^{H} \sum_{i=1}^{n_h} \pi_{hi} \sum_{k=1}^{K_{hi}} f_{hik} \sum_{j=1}^{n_{hik}} \sum_{l=1}^{L_{hikj}} \hat{U}_{hikjl}$$

where

  • $f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum $hik$.

  • $L_{hikj}$ is the number of third stage strata in the secondary sampling unit $hikj$.

  • $\hat{U}_{hikjl}$ is a variance contribution from the third stage stratum $l$ contained in the secondary sampling unit $hikj$. It depends on the third stage sampling method. The corresponding formula (1) or (2) applies.

Population Size Estimation

An estimate for the population size corresponds to the estimate for the variable total; it is the sum of the sampling weights. We have the following estimate for single-stage samples:

$$\hat{N} = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}$$

More generally,

$$\hat{N} = \sum_{i=1}^{n} w_i$$

The variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with $w_{hij}$ in the corresponding variance estimator formula for $\hat{V}(\hat{Y})$.

Ratio and Mean Estimation

Let $R = Y / X$ be the ratio of the totals for variables $y$ and $x$. It is estimated by

$$\hat{R} = \hat{Y} / \hat{X}$$

where $\hat{Y}$ and $\hat{X}$ are the estimates for the corresponding variable totals.

The variance of $\hat{R}$ is approximated using the Taylor linearization formula following Woodruff (1971). The estimate for the approximate variance of the ratio estimate $\hat{V}(\hat{R})$ is obtained by replacing $z_{hij}$ with

$$z_{hij} = w_{hij} \left(y_{hij} - \hat{R}\, x_{hij}\right) / \hat{X}$$

in the corresponding variance estimator $\hat{V}(\hat{Y})$.
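The linearization step can be illustrated with a small Python sketch that computes the ratio estimate and the substituted values at the element level. Function names and data are hypothetical. The resulting values would then be fed into whichever variance estimator applies to the design.

```python
def ratio_estimate(w, y, x):
    """Return (R_hat, X_hat) where R_hat = Y_hat / X_hat."""
    y_hat = sum(wi * yi for wi, yi in zip(w, y))
    x_hat = sum(wi * xi for wi, xi in zip(w, x))
    if x_hat == 0:
        raise ZeroDivisionError("estimated total of x is zero")
    return y_hat / x_hat, x_hat


def linearized_ratio_values(w, y, x):
    """Element-level values w * (y - R_hat * x) / X_hat that replace z."""
    r_hat, x_hat = ratio_estimate(w, y, x)
    return [wi * (yi - r_hat * xi) / x_hat for wi, yi, xi in zip(w, y, x)]


# Hypothetical element-level data
w = [2.0, 2.0, 4.0]
y = [3.0, 5.0, 4.0]
x = [1.0, 2.0, 2.0]

r_hat, x_hat = ratio_estimate(w, y, x)
z = linearized_ratio_values(w, y, x)
```

One useful sanity check is that the linearized values sum to zero over the whole sample, because the weighted residuals $y - \hat{R} x$ add up to $\hat{Y} - \hat{R} \hat{X} = 0$. Passing a vector of ones as `x` gives the corresponding values for a mean, since a mean is the ratio of a total to the estimated population size.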

Mean Estimation

The mean $\bar{Y}$ for the variable $y$ is estimated by

$$\hat{\bar{Y}} = \hat{Y} / \hat{N}$$

where $\hat{Y}$ is the estimate for the total of $y$ and $\hat{N}$ is the population size estimate.

The variance of the mean is estimated using the ratio formulas, as the mean is a ratio of $\hat{Y}$ and $\hat{N}$. Accordingly, $\hat{V}(\hat{\bar{Y}})$ is obtained by substituting $z_{hij}$ with

$$z_{hij} = w_{hij} \left(y_{hij} - \hat{\bar{Y}}\right) / \hat{N}$$

in the corresponding variance estimator $\hat{V}(\hat{Y})$.

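$$z_{hij} = d_{hij}\, w_{hij} \left(y_{hij} - \hat{R}_d\, x_{hij}\right) / \hat{X}_d$$

  • Domain mean

$$z_{hij} = d_{hij}\, w_{hij} \left(y_{hij} - \hat{\bar{Y}}_d\right) / \hat{N}_d$$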

Standard Errors

Let $Z$ denote any of the population or subpopulation quantities defined above: variable total, population size, ratio or mean. Then the standard error of an estimator $\hat{Z}$ is the square root of its estimated variance:

$$\mathrm{StdError}(\hat{Z}) = \sqrt{\hat{V}(\hat{Z})}$$

Coefficient of variation

The coefficient of variation of the estimator $\hat{Z}$ is the ratio of its standard error and its value:

$$\mathrm{CV}(\hat{Z}) = \frac{\mathrm{StdError}(\hat{Z})}{\hat{Z}}$$

The coefficient of variation is undefined when $\hat{Z} = 0$.
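Both definitions translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and returning `None` for the undefined case is a choice made here, not documented behaviour.

```python
import math


def std_error(variance):
    """Standard error: square root of the estimated variance."""
    if variance < 0:
        raise ValueError("estimated variance must be non-negative")
    return math.sqrt(variance)


def coefficient_of_variation(estimate, variance):
    """Ratio of the standard error to the estimate; None when undefined."""
    if estimate == 0:
        return None  # undefined when the estimate is zero
    return std_error(variance) / estimate


# Hypothetical estimate of 200 with estimated variance 25
se = std_error(25.0)
cv = coefficient_of_variation(200.0, 25.0)
```

Here the standard error is 5 and the coefficient of variation is 0.025, that is, 2.5 percent of the estimate.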

t Tests

Testing the hypothesis that a population quantity $Z$ equals $\theta_0$, i.e. $H_0 : Z = \theta_0$, is performed using the $t$ test statistic:

$$t(\hat{Z}) = \frac{\hat{Z} - \theta_0}{\mathrm{StdError}(\hat{Z})}$$

The $p$-value for the two-sided test is given by the probability

$$P\left(|T| > |t(\hat{Z})|\right)$$

where $T$ is a random variable from the $t$ distribution with $df$ degrees of freedom.

Degrees of freedom

The number of the degrees of freedom $df$ for the $t$ distribution is calculated as the difference between the number of primary sampling units and the number of strata in the first stage of sampling.
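The test statistic and the degrees of freedom can be sketched in Python as follows, with hypothetical function names and inputs. The $p$-value itself requires the $t$ distribution function, which the standard library does not provide, so it is left out of the sketch. With SciPy installed, `2 * scipy.stats.t.sf(abs(t), df)` would give the two-sided value.

```python
def t_statistic(estimate, theta0, std_err):
    """t statistic for H0: Z = theta0."""
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - theta0) / std_err


def degrees_of_freedom(psus_per_stratum):
    """Number of first stage PSUs minus number of first stage strata."""
    return sum(psus_per_stratum) - len(psus_per_stratum)


# Hypothetical design: three first stage strata with 4, 3 and 5 sampled PSUs
df = degrees_of_freedom([4, 3, 5])

# Hypothetical estimate 52 tested against theta0 = 50 with standard error 0.8
t = t_statistic(52.0, 50.0, 0.8)
```

In this example there are 12 PSUs in 3 strata, so the reference distribution has 9 degrees of freedom.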

Confidence Limits

A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$. The confidence bounds are defined as

$$\hat{Z} \pm \mathrm{StdError}(\hat{Z})\; t_{df}(1 - \alpha/2)$$

where $\mathrm{StdError}(\hat{Z})$ is the estimated standard error of $\hat{Z}$, and $t_{df}(1 - \alpha/2)$ is the $100(1 - \alpha/2)$ percentile of the $t$ distribution with $df$ degrees of freedom.
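A short Python sketch of the bounds is given below. It assumes the $t$ percentile is supplied by the caller, for example from a printed table or from `scipy.stats.t.ppf(1 - alpha / 2, df)` if SciPy is available. The function name and the inputs are hypothetical.

```python
def confidence_limits(estimate, std_err, t_quantile):
    """Lower and upper bounds: estimate -/+ std_err * t_quantile."""
    half_width = std_err * t_quantile
    return estimate - half_width, estimate + half_width


# Hypothetical 95% interval with df = 9; 2.262 is the tabulated 97.5th
# percentile of the t distribution with 9 degrees of freedom.
lower, upper = confidence_limits(52.0, 0.8, 2.262)
```

The half width here is 0.8 × 2.262, so the interval runs from about 50.19 to about 53.81.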

Design Effects

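The design effect Deff is estimated by

$$\mathrm{Deff} = \frac{\hat{V}(\hat{Y})}{\hat{V}_{srs}(\hat{Y}_{srs})}$$

$\hat{V}(\hat{Y})$ is the estimate of variance of $\hat{Y}$ under the appropriate sampling design, while $\hat{V}_{srs}(\hat{Y}_{srs})$ is the estimate of variance of $\hat{Y}_{srs}$ under the simple random sampling assumption as follows:

$$\hat{V}_{srs}(\hat{Y}_{srs}) = \left(1 - \frac{n}{\hat{N}}\right) \frac{\hat{N}}{n - 1} \sum_{i=1}^{n} w_i \left(y_i - \frac{\hat{Y}}{\hat{N}}\right)^2$$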

Deff is undefined when $n / \hat{N} = 1$.
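The design effect calculation can be sketched in Python as below. The sketch assumes the simple random sampling variance takes the standard finite population form, $(1 - n/\hat{N})\, \hat{N}/(n - 1)\, \sum_i w_i (y_i - \hat{Y}/\hat{N})^2$. The function names, the data, and the design-based variance of 900 are hypothetical.

```python
def srs_variance(w, y):
    """Variance of the total under a simple random sampling assumption.

    Assumed form: (1 - n/N_hat) * N_hat/(n - 1) * sum w_i (y_i - Y_hat/N_hat)^2
    """
    n = len(w)
    if n < 2:
        raise ValueError("need at least two sampled elements")
    n_hat = sum(w)                                # estimated population size
    y_hat = sum(wi * yi for wi, yi in zip(w, y))  # estimated total
    mean = y_hat / n_hat
    fpc = 1.0 - n / n_hat                         # finite population correction
    spread = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
    return fpc * n_hat / (n - 1) * spread


def design_effect(design_variance, w, y):
    """Ratio of the design-based variance to the SRS variance."""
    v_srs = srs_variance(w, y)
    if v_srs == 0:
        return None  # undefined, e.g. when n equals the estimated N
    return design_variance / v_srs


# Hypothetical sample of 4 elements, each representing 10 population units
w = [10.0, 10.0, 10.0, 10.0]
y = [1.0, 2.0, 3.0, 4.0]
v_srs = srs_variance(w, y)
deff = design_effect(900.0, w, y)
```

With equal weights the assumed form reduces to the textbook expression $N^2 (1 - f) s^2 / n$, which gives 600 for these data, so a design-based variance of 900 corresponds to a design effect of 1.5.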

Whereas the design effect is not relevant for estimates of the population size, we do compute the design effects for ratios and means in addition to the totals. The values of variable $y$ in $\hat{V}_{srs}$ are then replaced by the linearized values as follows:

  • Ratio estimation

$$\left(y_i - \hat{R}\, x_i\right) / \hat{X}$$