Statistical Estimation of Population Totals and Variances in Complex Samples, Study notes of Mathematical Statistics

Formulas for estimating population totals and variances in complex samples using the Hansen-Hurwitz estimator, variance estimation for simple and complex sampling designs, and estimation of cell and table population sizes. It also covers row and column population proportions, standard errors, and Pearson statistic adjustments.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar


CSTABULATE

This document describes the algorithms used in the complex sampling estimation procedure CSTABULATE.

Complex sample data must contain both the values of the variables to be analyzed and the information on the current sampling design. The sampling design includes the sampling method, strata and clustering information, inclusion probabilities, and the overall sampling weights.

The sampling design specification for CSTABULATE may include up to three stages of sampling. Any of the following general sampling methods may be assumed in the first stage: random sampling with replacement, random sampling without replacement with equal probabilities, and random sampling without replacement with unequal probabilities. The first two sampling methods can also be specified for the second and third sampling stages.

Notation

The following notation is used throughout this chapter unless otherwise stated:

$H$  Number of strata.
$n_h$  Sampled number of primary sampling units (PSU) per stratum.
$f_h$  Sampling rate per stratum.
$m_{hi}$  Number of elements in the $i$th sampled unit in stratum $h$, $i = 1, \ldots, n_h$.
$y_{hij}$  Value of variable $y$ for the $j$th element in the $i$th sampled unit in stratum $h$.
$w_{hij}$  Overall sampling weight for the $j$th element in the $i$th sampled unit in stratum $h$.
$n$  Total number of elements in the sample.
$N$  Total number of elements in the population.
$Y$  Population total sum for variable $y$.

Weights

Overall weights specified for each ultimate element are processed as given. They can be obtained as a product of weights for corresponding units computed in each sampling stage.

When sampling without replacement in a given stage, substituting $w_{hi} = 1/\pi_{hi}$ for unit $i$ in stratum $h$ results in the application of the estimator for the population totals due to Horvitz and Thompson (1952). The corresponding variance estimator (2) or (3) will also be unbiased. $\pi_{hi}$ is the probability of unit $i$ from stratum $h$ being selected in the given stage.

If sampling with replacement in a given stage, substituting $w_{hi} = 1/(n_h p_{hi})$ yields the estimator for the population totals due to Hansen and Hurwitz (1943). Repeatedly selected units should be replicated in the data. The corresponding variance estimator (1) will be unbiased. $p_{hi}$ is the probability of selecting unit $i$ in a single draw from stratum $h$ in the given stage.

Weights obtained in each sampling stage need to be multiplied when processing multi-stage samples. The resulting overall weights for the elements in the final stage are used in all expressions and formulas below.
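As a rough illustration of the weighting rules in this section, the following Python sketch computes a stage weight under each sampling method and multiplies stage weights into an overall weight. The function names and the numbers in the two-stage design are hypothetical and are not part of CSTABULATE.

```python
from math import prod

def ht_weight(pi_hi):
    """Stage weight without replacement: inverse inclusion probability 1 / pi_hi."""
    return 1.0 / pi_hi

def hh_weight(n_h, p_hi):
    """Stage weight with replacement: 1 / (n_h * p_hi)."""
    return 1.0 / (n_h * p_hi)

def overall_weight(stage_weights):
    """Overall weight of a final-stage element: product of its stage weights."""
    return prod(stage_weights)

# Hypothetical two-stage design (illustrative numbers only):
# stage 1 without replacement, inclusion probability 0.25;
# stage 2 with replacement, 5 draws, single-draw probability 0.1.
w = overall_weight([ht_weight(0.25), hh_weight(5, 0.1)])
```

The overall weight is what the remaining formulas treat as $w_{hij}$; the per-stage weights are not needed again once multiplied.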

Z expressions

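$$z_{hij} = w_{hij}\, y_{hij}, \qquad z'_{hij} = w_{hij}\, y'_{hij}$$

$$z_{hi} = \sum_{j=1}^{m_{hi}} z_{hij}, \qquad z'_{hi} = \sum_{j=1}^{m_{hi}} z'_{hij}$$

$$\bar{z}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z_{hi}, \qquad \bar{z}'_h = \frac{1}{n_h}\sum_{i=1}^{n_h} z'_{hi}$$

$$S_h^2(yy') = \frac{1}{n_h - 1}\sum_{i=1}^{n_h}\left(z_{hi} - \bar{z}_h\right)\left(z'_{hi} - \bar{z}'_h\right)$$

For multi-stage samples, the index $h$ denotes a stratum in the given stage, and $i$ stands for a unit from $h$ in the same stage. The index $j$ runs over all final stage elements contained in unit $hi$.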
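The Z expressions can be sketched in Python for a single stratum as follows. This is a minimal illustration, assuming the usual $n_h - 1$ divisor for the between-unit covariance; the data layout (a list of units, each a list of `(weight, (y, y'))` elements) and the function names are hypothetical.

```python
def psu_totals(stratum, var):
    """z_hi = sum over elements j of w_hij * y_hij, one value per sampled unit i."""
    return [sum(w * ys[var] for w, ys in psu) for psu in stratum]

def stratum_s2(stratum, var_a=0, var_b=0):
    """Between-unit covariance of weighted unit totals for two variables."""
    za = psu_totals(stratum, var_a)
    zb = psu_totals(stratum, var_b)
    n_h = len(stratum)
    if n_h < 2:
        raise ValueError("at least two sampled units are needed in the stratum")
    za_bar = sum(za) / n_h
    zb_bar = sum(zb) / n_h
    return sum((a - za_bar) * (b - zb_bar) for a, b in zip(za, zb)) / (n_h - 1)

# Hypothetical stratum with three sampled units; each element is (weight, (y, y')).
stratum = [
    [(2.0, (1.0, 3.0)), (2.0, (2.0, 1.0))],
    [(2.0, (4.0, 0.0))],
    [(2.0, (0.0, 2.0)), (2.0, (5.0, 5.0))],
]
s2_yy = stratum_s2(stratum, 0, 0)
s2_yy_prime = stratum_s2(stratum, 0, 1)
```

Passing the same variable index twice gives the variance term; passing two different indices gives the covariance term.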

Variable Total

An estimate for the population total of variable $y$ in a single-stage sample is the weighted sum over all the strata and all the clusters:

$$\hat{Y} = \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}\, y_{hij}$$

Alternatively, we compute the weighted sum over all the elements in the sample:

$$\hat{Y} = \sum_{i=1}^{n} w_i\, y_i$$
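The two ways of writing the total give the same number, which the following Python sketch checks on made-up data. The nested-list layout and function names are assumptions for illustration only.

```python
def total_nested(strata):
    """Weighted sum over strata h, sampled units i, and elements j."""
    return sum(w * y for stratum in strata for psu in stratum for w, y in psu)

def total_flat(weights, values):
    """Weighted sum over all sample elements, ignoring the design structure."""
    return sum(w * y for w, y in zip(weights, values))

# Hypothetical sample: 2 strata, 2 units each, elements stored as (weight, y).
strata = [
    [[(10.0, 1.0), (10.0, 0.0)], [(10.0, 3.0)]],
    [[(5.0, 2.0)], [(5.0, 4.0), (5.0, 1.0)]],
]
weights = [w for s in strata for p in s for w, _ in p]
values = [y for s in strata for p in s for _, y in p]

y_hat = total_nested(strata)
# Setting y = 1 for every element turns the total into the sum of the weights.
n_hat = total_flat(weights, [1.0] * len(weights))
```

The design structure matters for variance estimation, not for the point estimate, which is why the flat form is enough here.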

For each stratum $h$ containing a single element, the covariance contribution $U_h(\hat{Y}\hat{Y}')$ is always set to zero.

Two-stage sample

When the sample is obtained in two stages and sampling without replacement is applied in the first stage, we use the following estimate for the covariance of the total for variables $y$ and $y'$:

$$C(\hat{Y}\hat{Y}') = C_2(\hat{Y}\hat{Y}') = C_1(\hat{Y}\hat{Y}') + \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{k=1}^{K_{hi}} \pi_{hi}\, U_{hik}(\hat{Y}\hat{Y}')$$

where

  • $\pi_{hi}$ is the first stage inclusion probability for the primary sampling unit $i$ in stratum $h$. In the case of simple random sampling, the inclusion probability is equal to the sampling rate $f_h$ for stratum $h$.

  • $K_{hi}$ is the number of second stage strata in the primary sampling unit $i$ within the first stage stratum $h$.

  • $U_{hik}(\hat{Y}\hat{Y}')$ is a covariance contribution from the second stage stratum $k$ from the primary sampling unit $hi$. It depends on the second stage sampling method. The corresponding formula (1) or (2) applies.

Three-stage sample

When the sample is obtained in three stages where sampling in the first stage is done without replacement and simple random sampling is applied in the second stage, we use the following estimate for the covariance of the total for variables $y$ and $y'$:

$$C(\hat{Y}\hat{Y}') = C_2(\hat{Y}\hat{Y}') + \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{k=1}^{K_{hi}}\sum_{j=1}^{n_{hik}}\sum_{l=1}^{L_{hikj}} \pi_{hi}\, f_{hik}\, U_{hikjl}(\hat{Y}\hat{Y}')$$

where

  • $f_{hik}$ is the sampling rate for the secondary sampling units in the second stage stratum $hik$.

  • $L_{hikj}$ is the number of the third stage strata in the secondary sampling unit $hikj$.

  • $U_{hikjl}(\hat{Y}\hat{Y}')$ is a covariance contribution from the third stage stratum $l$ contained in the secondary sampling unit $hikj$. It depends on the third stage sampling method. The corresponding formula (1) or (2) applies.

Variable total variance

The variance of the total for variable $y$ in a complex sample is estimated by

$$\hat{V}(\hat{Y}) = C(\hat{Y}\hat{Y})$$

with $C(\hat{Y}\hat{Y})$ defined above.

Population Size Estimation

An estimate for the population size corresponds to the estimate for the variable total; it is the sum of the sampling weights. We have the following estimate for the single-stage samples:

$$\hat{N} = \sum_{h=1}^{H}\sum_{i=1}^{n_h}\sum_{j=1}^{m_{hi}} w_{hij}$$

More generally,

$$\hat{N} = \sum_{i=1}^{n} w_i$$

The variance of $\hat{N}$ is obtained by replacing $y_{hij}$ with 1; that is, by replacing $z_{hij}$ with $w_{hij}$ in the corresponding variance estimator formula for $\hat{V}(\hat{Y})$.

Cell Estimates: One-Way Tables

Let the population be classified according to the values of a single categorical row variable and possibly one or more categorical variables in the layer. Categories for the row variable are enumerated by $r = 1, \ldots, R$ and categories for the layer variables are given by $l = 1, \ldots, L$. Each combination of the values $(r, l)$ defines a domain and a cell in the one-way table $(r, l)$, $r = 1, \ldots, R$. For each cell $(r, l)$ we define a corresponding indicator variable:

$$\delta_{hij}(r,l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r,l) \\ 0 & \text{otherwise} \end{cases}$$

Sizes

To estimate a cell population size or a table population size, we replace $y_i$ with $\delta_i(r,l)$ in the formula for the population total and obtain the following expressions:

Cell Estimates: Two-Way Tables

Let the population be cross-classified according to the values of a categorical row variable, a categorical column variable and possibly one or more categorical variables in the layer. Categories for the row variable are enumerated by $r = 1, \ldots, R$, while categories for the column variable are denoted by $c = 1, \ldots, C$ and categories for the layer variables are given by $l = 1, \ldots, L$. Each combination of values $(r, c, l)$ defines a domain and a cell in the two-way table $(r, c, l)$, where $r = 1, \ldots, R$ and $c = 1, \ldots, C$. For each cell $(r, c, l)$ we define a corresponding indicator variable:

$$\delta_{hij}(r,c,l) = \begin{cases} 1 & \text{if the sample unit } hij \text{ is in the cell } (r,c,l) \\ 0 & \text{otherwise} \end{cases}$$

We will also use the following indicator notation:

  • Row indicator: $\delta_i(r,+,l) = \sum_{c=1}^{C} \delta_i(r,c,l)$

  • Column indicator: $\delta_i(+,c,l) = \sum_{r=1}^{R} \delta_i(r,c,l)$

  • Table indicator: $\delta_i(+,+,l) = \sum_{r=1}^{R}\sum_{c=1}^{C} \delta_i(r,c,l)$

Sizes

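To estimate various domain sizes, we replace $y_i$ with the corresponding indicator $\delta_i$ in the formula for the whole population as follows:

  • Cell population size: $\hat{N}(r,c,l) = \sum_{i=1}^{n} w_i\, \delta_i(r,c,l)$

  • Row population size: $\hat{N}(r,+,l) = \sum_{i=1}^{n} w_i\, \delta_i(r,+,l)$

  • Column population size: $\hat{N}(+,c,l) = \sum_{i=1}^{n} w_i\, \delta_i(+,c,l)$

  • Table population size: $\hat{N}(+,+,l) = \sum_{i=1}^{n} w_i\, \delta_i(+,+,l)$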
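The domain size estimates are weighted counts, so they can be accumulated in a single pass over the sample. The Python sketch below does this for one layer; the record layout `(weight, row, column)` and the function name are hypothetical.

```python
from collections import defaultdict

def domain_sizes(records):
    """Accumulate cell, row, column, and table size estimates for one layer.

    records: iterable of (weight, row_category, column_category).
    """
    cell = defaultdict(float)
    row = defaultdict(float)
    col = defaultdict(float)
    table = 0.0
    for w, r, c in records:
        cell[(r, c)] += w   # indicator of cell (r, c) is 1 for this element
        row[r] += w         # row indicator sums the cell indicator over columns
        col[c] += w         # column indicator sums the cell indicator over rows
        table += w          # table indicator is 1 for every element in the layer
    return dict(cell), dict(row), dict(col), table

# Hypothetical weighted sample for a 2 x 2 table.
records = [(2.0, 1, 1), (3.0, 1, 2), (1.5, 2, 1), (2.5, 2, 2), (1.0, 1, 1)]
cell, row, col, table = domain_sizes(records)
```

Row and column sizes equal the sums of the corresponding cell sizes, which is a convenient consistency check on real data.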

Similarly, in order to estimate the variance of the above estimators, we replace $y_{hij}$ with $\delta_{hij}$ in the corresponding formula for the whole population. The following substitutions of $z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ are used for estimating variances of:

  • Cell population size: $z_{hij}(r,c,l) = w_{hij}\, \delta_{hij}(r,c,l)$

  • Row population size: $z_{hij}(r,+,l) = w_{hij}\, \delta_{hij}(r,+,l)$

  • Column population size: $z_{hij}(+,c,l) = w_{hij}\, \delta_{hij}(+,c,l)$

  • Table population size: $z_{hij}(+,+,l) = w_{hij}\, \delta_{hij}(+,+,l)$

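  • Table population proportion: $z_{hij}(r,c,l) = w_{hij}\left[\delta_{hij}(r,c,l) - \delta_{hij}(+,+,l)\, \hat{P}_{tab}(r,c,l)\right] / \hat{N}(+,+,l)$

  • Marginal column population proportion: $z_{hij}(+,c,l) = w_{hij}\left[\delta_{hij}(+,c,l) - \delta_{hij}(+,+,l)\, \hat{P}_{mcol}(+,c,l)\right] / \hat{N}(+,+,l)$

  • Marginal row population proportion: $z_{hij}(r,+,l) = w_{hij}\left[\delta_{hij}(r,+,l) - \delta_{hij}(+,+,l)\, \hat{P}_{mrow}(r,+,l)\right] / \hat{N}(+,+,l)$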
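The substitution for a table proportion is a linearization of the ratio of a cell size to the table size. The Python sketch below computes the substituted values for one cell, assuming the form $z = w\,(\delta_{cell} - \hat{P}\,\delta_{table}) / \hat{N}_{table}$; the record layout and function name are hypothetical. A useful property is that the substituted values sum to zero over the sample.

```python
def proportion_linearized(records, cell):
    """Return the table proportion of `cell` and the linearized z values.

    records: list of (weight, row_category, column_category) in one layer,
    so the table indicator equals 1 for every record.
    """
    table = sum(w for w, _, _ in records)
    n_cell = sum(w for w, r, c in records if (r, c) == cell)
    p_hat = n_cell / table
    z = [
        w * ((1.0 if (r, c) == cell else 0.0) - p_hat) / table
        for w, r, c in records
    ]
    return p_hat, z

# Hypothetical weighted sample for a 2 x 2 table.
sample = [(2.0, 1, 1), (3.0, 1, 2), (1.5, 2, 1), (2.5, 2, 2), (1.0, 1, 1)]
p_11, z_11 = proportion_linearized(sample, (1, 1))
```

The z values would then be fed into whichever variance formula applies to the design, in place of $w_{hij} y_{hij}$.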

Standard Errors

Let $Z$ denote any of the domain quantities defined above: cell population sizes or proportions. Then the standard error of an estimator $\hat{Z}$ is the square root of its estimated variance:

$$SE(\hat{Z}) = \sqrt{\hat{V}(\hat{Z})}$$

Coefficient of variation

The coefficient of variation of the estimator $\hat{Z}$ is the ratio of its standard error to its value:

$$CV(\hat{Z}) = \frac{SE(\hat{Z})}{\hat{Z}}$$

The coefficient of variation is undefined when $\hat{Z} = 0$.
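A short Python sketch of the two definitions, with the undefined case made explicit. Returning `None` for a zero estimate is a design choice of this illustration, not something the procedure specifies.

```python
import math

def standard_error(variance):
    """Square root of the estimated variance."""
    return math.sqrt(variance)

def coefficient_of_variation(estimate, variance):
    """Standard error divided by the estimate; None when the estimate is zero."""
    if estimate == 0:
        return None
    return standard_error(variance) / estimate

cv = coefficient_of_variation(200.0, 400.0)
```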

Confidence Limits

Sizes

A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for any domain size $N_d$ defined earlier. The confidence bounds are then given by

$$\hat{N}_d \pm SE(\hat{N}_d)\, t_\nu(1 - \alpha/2)$$

where $SE(\hat{N}_d)$ is the estimated standard error of $\hat{N}_d$, and $t_\nu(1 - \alpha/2)$ is the $100(1 - \alpha/2)$ percentile of the $t$ distribution with $\nu$ degrees of freedom.

Proportions

For any domain proportion $P_d$, we use the logistic transformation $f(p) = \ln(p/(1 - p))$ and obtain the following $1 - \alpha$ level confidence bounds for the transformed estimate:

$$\ln\left(\frac{\hat{P}_d}{1 - \hat{P}_d}\right) \pm \frac{SE(\hat{P}_d)}{\hat{P}_d\,(1 - \hat{P}_d)}\, t_\nu(1 - \alpha/2)$$

These bounds are transformed back to the original metric using the logistic inverse

$$f^{-1}(y) = \frac{\exp(y)}{1 + \exp(y)}$$
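Both kinds of interval can be sketched in a few lines of Python. To keep the example free of external dependencies, the critical value `t_crit` (the $100(1 - \alpha/2)$ percentile of the $t$ distribution with $\nu$ degrees of freedom) is passed in by the caller; the function names are hypothetical, and the proportion interval assumes $0 < \hat{P}_d < 1$.

```python
import math

def size_ci(n_hat, se, t_crit):
    """Symmetric interval for a domain size: estimate -/+ SE * t."""
    return n_hat - se * t_crit, n_hat + se * t_crit

def inverse_logit(y):
    """Logistic inverse exp(y) / (1 + exp(y))."""
    return math.exp(y) / (1.0 + math.exp(y))

def proportion_ci(p_hat, se, t_crit):
    """Interval built on the logit scale, then mapped back to (0, 1)."""
    logit = math.log(p_hat / (1.0 - p_hat))
    half_width = t_crit * se / (p_hat * (1.0 - p_hat))
    return inverse_logit(logit - half_width), inverse_logit(logit + half_width)

# Illustrative values only; t_crit = 2.0 is a placeholder, not a real percentile.
lo_n, hi_n = size_ci(100.0, 10.0, 2.0)
lo_p, hi_p = proportion_ci(0.5, 0.05, 2.0)
```

Working on the logit scale keeps the back-transformed bounds inside the unit interval, which the untransformed interval does not guarantee for proportions near 0 or 1.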

Degrees of freedom

The degrees of freedom $\nu$ for the $t$ distributions above is calculated as the difference between the number of primary sampling units and the number of strata in the first stage of sampling. We shall also refer to this quantity as the sample design degrees of freedom.

Design Effects

Sizes

The design effect $Deff$ for a two-way table cell population size is estimated by

$$Deff = \frac{\hat{V}(\hat{N}(r,c,l))}{\hat{V}_{srs}(\hat{N}(r,c,l))}$$

Tests of Independence for Two-Way Tables

Let the population be cross-classified according to the values of a categorical row variable, a categorical column variable and possibly one or more categorical variables in the layer. Categories for the row variable are enumerated by $r = 1, \ldots, R$, while categories for the column variable are denoted by $c = 1, \ldots, C$. When the layer variables are given we assume that their categories coincide with the strata in the first sampling stage. In the following we omit reference to the layers as the formulas apply for each stratum separately when needed.

We use a contrast matrix $C$ defined as follows. Let $A_R$ be the contrast matrix given by

$$A_R = \left[\, I_{R-1} \mid -\mathbf{1}_{R-1} \,\right]$$

where $I_{R-1}$ is an identity matrix of size $R - 1$ and $\mathbf{1}_{R-1}$ is a vector with $R - 1$ elements equal to 1. Define $C$ to be an $RC \times (R - 1)(C - 1)$ matrix defined by the following Kronecker product:

$$C = A_R' \otimes A_C'$$
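The construction can be checked with a small pure-Python sketch. It assumes that the last block of each $A$ matrix is a column of $-1$ (so that each row is a contrast) and that the Kronecker product is taken over the transposed factors, which is what yields the stated $RC \times (R-1)(C-1)$ dimension. All function names are hypothetical.

```python
def contrast(k):
    """(k-1) x k matrix [I | -1]: identity block followed by a column of -1."""
    return [
        [1.0 if j == i else 0.0 for j in range(k - 1)] + [-1.0]
        for i in range(k - 1)
    ]

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(column) for column in zip(*m)]

def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def contrast_matrix(n_rows, n_cols):
    """RC x (R-1)(C-1) matrix built from the transposed contrast blocks."""
    return kron(transpose(contrast(n_rows)), transpose(contrast(n_cols)))

c_2x3 = contrast_matrix(2, 3)
c_2x2 = contrast_matrix(2, 2)
```

Every column of the result sums to zero, which is the defining property of a contrast and a quick sanity check on the sign convention.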

Adjusted Pearson statistic test of independence

We provide an adjusted Pearson statistic test. The Pearson statistic is computed according to the following standard formula:

$$X^2 = n \sum_{r=1}^{R}\sum_{c=1}^{C} \frac{\left(\hat{P}(r,c) - \hat{P}(r,+)\,\hat{P}(+,c)\right)^2}{\hat{P}(r,+)\,\hat{P}(+,c)}$$

Since under the null hypothesis the asymptotic distribution of $X^2$ is generally not a chi-square distribution, we perform an adjustment using the following $\hat{\Delta}$ matrix:

$$\hat{\Delta} = n \left(C' D_{\hat{P}}^{-1} \hat{M} D_{\hat{P}}^{-1} C\right)^{-1} \left(C' D_{\hat{P}}^{-1} \hat{V}(\hat{P}) D_{\hat{P}}^{-1} C\right)$$

$\hat{P}$ is a vector and $D_{\hat{P}}$ is a diagonal matrix of size $RC$ containing elements $\hat{P}(r,c)$. $\hat{M} = [D_{\hat{P}} - \hat{P}\hat{P}']$ is a multinomial covariance matrix estimating the asymptotic covariance of $\hat{P}$ under the simple random sampling design, while $\hat{V}(\hat{P})$ estimates the covariance matrix of $\hat{P}$ under the complex sampling design.

We use the F-based variant of Rao and Scott's (1984) second-order adjustment

$$F(X^2) = \frac{X^2}{\mathrm{tr}(\hat{\Delta})} \quad \text{with} \quad d = \frac{\mathrm{tr}(\hat{\Delta})^2}{\mathrm{tr}(\hat{\Delta}^2)}$$

The asymptotic distribution of $F(X^2)$ is approximated by the $F(d, d\nu)$ distribution where $\nu$ is the number of sample design degrees of freedom.

Properties of this test are given in a review of simulation studies by Rao and Thomas (2003).
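The following Python sketch shows the Pearson statistic and the F-based second-order adjustment as separate steps. It takes the adjustment matrix as an input rather than building it from covariance estimates, and the function names, the example table, and the example matrix are all hypothetical.

```python
def pearson_x2(p, n):
    """Pearson statistic from an R x C table of estimated proportions."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    return n * sum(
        (p[r][c] - row[r] * col[c]) ** 2 / (row[r] * col[c])
        for r in range(len(row))
        for c in range(len(col))
    )

def rao_scott_f(stat, delta):
    """Return (F, d): F = stat / tr(delta), d = tr(delta)^2 / tr(delta^2)."""
    k = len(delta)
    tr = sum(delta[i][i] for i in range(k))
    # tr(delta^2) = sum_i sum_j delta[i][j] * delta[j][i]
    tr_sq = sum(delta[i][j] * delta[j][i] for i in range(k) for j in range(k))
    return stat / tr, tr * tr / tr_sq

# Illustrative 2 x 2 table of proportions and a made-up 1 x 1 adjustment matrix.
p_table = [[0.4, 0.1], [0.1, 0.4]]
x2 = pearson_x2(p_table, 100)
f_stat, d = rao_scott_f(x2, [[2.0]])
```

The adjusted statistic would then be compared with an $F(d, d\nu)$ reference distribution, with $\nu$ the sample design degrees of freedom.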

Adjusted likelihood ratio test of independence

The likelihood ratio test statistic is given by

$$G^2 = 2n \sum_{r=1}^{R}\sum_{c=1}^{C} \hat{P}(r,c)\, \ln\left(\frac{\hat{P}(r,c)}{\hat{P}(r,+)\,\hat{P}(+,c)}\right)$$

The adjusted likelihood ratio statistic is computed in an analogous manner to the Pearson adjustment, where $\hat{\Delta}$ is the same as before and

$$F(G^2) = \frac{G^2}{\mathrm{tr}(\hat{\Delta})} \quad \text{with} \quad d = \frac{\mathrm{tr}(\hat{\Delta})^2}{\mathrm{tr}(\hat{\Delta}^2)}$$

Again, the asymptotic distribution of the adjusted statistic $F(G^2)$ is approximated by the $F(d, d\nu)$ distribution where $\nu$ is the number of sample design degrees of freedom.
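A minimal Python sketch of the likelihood ratio statistic, assuming the conventional leading factor $2n$ and treating a cell with zero estimated proportion as contributing nothing to the sum. The function name and the example table are hypothetical.

```python
import math

def likelihood_ratio_g2(p, n):
    """Likelihood ratio statistic from an R x C table of estimated proportions."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    total = 0.0
    for r, p_row in enumerate(p):
        for c, p_rc in enumerate(p_row):
            if p_rc > 0:  # convention: 0 * ln(0) is taken as 0
                total += p_rc * math.log(p_rc / (row[r] * col[c]))
    return 2.0 * n * total

g2 = likelihood_ratio_g2([[0.4, 0.1], [0.1, 0.4]], 100)
```

The adjustment step is the same division by the trace of the adjustment matrix that is applied to the Pearson statistic.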

Residuals

Under the independence hypothesis, the expected table proportion estimates are given by $\hat{E}(r,c) = \hat{P}(r,+)\,\hat{P}(+,c)$ and residuals are defined as $R(r,c) = \hat{P}(r,c) - \hat{E}(r,c)$ for $r = 1, \ldots, R$ and $c = 1, \ldots, C$.

Standardized residuals are computed by

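The risk differences are estimated by

$$\hat{D}_1 = \frac{\hat{N}_{11}}{\hat{N}_{1+}} - \frac{\hat{N}_{21}}{\hat{N}_{2+}} \quad \text{and} \quad \hat{D}_2 = \frac{\hat{N}_{12}}{\hat{N}_{1+}} - \frac{\hat{N}_{22}}{\hat{N}_{2+}}$$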

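The following substitutions of $z_{hij}$ in the formulas for $\hat{V}(\hat{Y})$ are used for estimating variances:

  • Odds ratio $\widehat{OR}$:

$$z_{hij} = w_{hij}\left[\frac{\delta_{hij}(1,1)}{\hat{N}_{11}} - \frac{\delta_{hij}(1,2)}{\hat{N}_{12}} - \frac{\delta_{hij}(2,1)}{\hat{N}_{21}} + \frac{\delta_{hij}(2,2)}{\hat{N}_{22}}\right] \times \widehat{OR}$$

  • Risk ratio $\widehat{RR}_1$:

$$z_{hij} = w_{hij}\left[\frac{\hat{N}_{12}}{\hat{N}_{11}\hat{N}_{1+}}\,\delta_{hij}(1,1) - \frac{\delta_{hij}(1,2)}{\hat{N}_{1+}} - \frac{\hat{N}_{22}}{\hat{N}_{21}\hat{N}_{2+}}\,\delta_{hij}(2,1) + \frac{\delta_{hij}(2,2)}{\hat{N}_{2+}}\right] \times \widehat{RR}_1$$

  • Risk difference $\hat{D}_1$:

$$z_{hij} = w_{hij}\left[\frac{\hat{N}_{12}\,\delta_{hij}(1,1) - \hat{N}_{11}\,\delta_{hij}(1,2)}{\hat{N}_{1+}^2} - \frac{\hat{N}_{22}\,\delta_{hij}(2,1) - \hat{N}_{21}\,\delta_{hij}(2,2)}{\hat{N}_{2+}^2}\right]$$

The estimations of variance for $\widehat{RR}_2$ and $\hat{D}_2$ are performed using similar substitutions.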

Confidence limits

A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for the odds ratio, risk ratio and risk difference in every table.

For the odds ratio or risk ratio $R$ we use the logarithm transformation and obtain the confidence bounds

$$\ln(\hat{R}) \pm \frac{SE(\hat{R})}{\hat{R}}\, t_\nu(1 - \alpha/2)$$

These bounds are transformed back to the original metric using the exponential function.

No transformations are used when estimating confidence bounds for a risk difference $D$:

$$\hat{D} \pm SE(\hat{D})\, t_\nu(1 - \alpha/2)$$
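The two interval constructions differ only in the scale on which the bounds are formed, as the Python sketch below illustrates. The critical value `t_crit` is supplied by the caller, and the function names and numbers are hypothetical.

```python
import math

def ratio_ci(r_hat, se, t_crit):
    """Interval for an odds ratio or risk ratio, built on the log scale."""
    half_width = t_crit * se / r_hat
    centre = math.log(r_hat)
    return math.exp(centre - half_width), math.exp(centre + half_width)

def difference_ci(d_hat, se, t_crit):
    """Untransformed interval for a risk difference."""
    return d_hat - se * t_crit, d_hat + se * t_crit

# Illustrative values only; t_crit = 2.0 is a placeholder, not a real percentile.
lo_r, hi_r = ratio_ci(2.0, 0.5, 2.0)
lo_d, hi_d = difference_ci(0.25, 0.125, 2.0)
```

On the original scale the ratio interval is asymmetric around the estimate but symmetric in the multiplicative sense: the product of the bounds equals the squared estimate.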

Tests of Homogeneity for One-Way Tables

Let the population be classified according to the values of a categorical row variable and possibly one or more categorical variables in the layer. Categories for the row variable are enumerated by $r = 1, \ldots, R$. When the layer variables are given we assume that their categories coincide with the strata in the first sampling stage. In the following we omit references to the layers as the formulas apply for each stratum separately when needed.

We study proportions $P(r) = N(r)/N(+)$. The test of homogeneity consists in testing the null hypothesis $H_0: P(r) = 1/R$ for $r = 1, \ldots, R - 1$.

Adjusted Pearson statistic test

We perform an adjusted Pearson statistic test for testing the homogeneity. The Pearson test statistic is computed according to the following standard formula:

$$X^2 = nR \sum_{r=1}^{R} \left(\hat{P}(r) - \frac{1}{R}\right)^2$$
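For equal hypothesized proportions $1/R$, the Pearson statistic reduces to $nR$ times the sum of squared deviations from $1/R$. A minimal Python sketch, with a hypothetical function name and made-up proportions:

```python
def homogeneity_x2(p, n):
    """Pearson statistic for H0: every category has proportion 1/R."""
    n_categories = len(p)
    expected = 1.0 / n_categories
    return n * n_categories * sum((p_r - expected) ** 2 for p_r in p)

x2_homogeneity = homogeneity_x2([0.5, 0.3, 0.2], 90)
```

This is the general Pearson form, the sum of squared deviations divided by the expected proportion, with the common expected value $1/R$ factored out of the sum.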

Since the asymptotic distribution of $X^2$ is generally not the chi-square distribution, we apply an adjustment using the $\hat{\Delta}$ matrix given by:

$$\hat{\Delta} = n\, M(\hat{P}_0)^{-1}\, \hat{V}(\hat{P}_0)$$

$\hat{V}(\hat{P}_0)$ is the estimated covariance matrix under the complex sample design, while $M(\hat{P}_0)$ is an estimated asymptotic covariance matrix under the simple random sampling given by

$$M(\hat{P}_0) = \mathrm{diag}(\hat{P}_0) - \hat{P}_0 \hat{P}_0'$$

where $\hat{P}_0$ is a vector and $\mathrm{diag}(\hat{P}_0)$ is a diagonal matrix of size $R - 1$ containing elements $\hat{P}(r)$, $r = 1, \ldots, R - 1$.

We use the F-based variant of Rao and Scott's (1984) second-order adjustment

$$F(X^2) = \frac{X^2}{\mathrm{tr}(\hat{\Delta})} \quad \text{with} \quad d = \frac{\mathrm{tr}(\hat{\Delta})^2}{\mathrm{tr}(\hat{\Delta}^2)}$$

References

Rao, J. N. K., and Thomas, D. R. (2003), "Analysis of categorical response data from complex surveys: an appraisal and update", in Analysis of Survey Data, ed. R. Chambers and C. Skinner. New York: John Wiley & Sons.

Särndal, C. E., Swensson, B., and Wretman, J. H. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Sen, A. R. (1953), "On the estimate of the variance in sampling with varying probabilities", Journal of the Indian Society of Agricultural Statistics, volume 5, pages 55-77.

Thomas, D. R., and Rao, J. N. K. (1987), "Small-sample comparisons of level and power for simple goodness-of-fit statistics under cluster sampling", Journal of the American Statistical Association, volume 82, pages 630-636.

Woodruff, R. S. (1971), "A Simple Method for Approximating the Variance of a Complicated Estimate", Journal of the American Statistical Association, volume 66, pages 411-414.

Yates, F., and Grundy, P. M. (1953), "Selection without replacement from within strata with probability proportional to size", Journal of the Royal Statistical Society Series B, volume 15, pages 253-261.