






Sampling algorithms that select units from a population with probability proportional to size (PPS), including systematic sampling, sequential sampling, Sampford's method, and Murthy's method. These algorithms give larger units a higher chance of being selected into the sample. The document covers the algorithms, inclusion probabilities, sampling weights, and special requirements of each method.







This document describes the algorithm used by CSSELECT to draw samples according to complex designs. The data file does not have to be sorted. Population units can appear more than once in the data file and they do not have to be in a consecutive block of cases.
The following notation is used throughout this chapter unless otherwise stated:
N Population size
n Sample size
f Sampling fraction
$h_i$ Hit count of the i-th population unit (i = 1,...,N)
$M_i$ Size measure of the i-th population unit (i = 1,...,N)
$M = \sum_{i=1}^{N} M_i$ Total size of the population
$p_i = M_i / M$ Relative size of the i-th population unit (i = 1,...,N)
Stratification partitions the sampling frame into disjoint sets. Sampling is carried out independently within each stratum. Therefore, without loss of generality, the algorithm described in this document only considers sampling from one population.
In the first stage of selection, the sampling frame is partitioned by the stratification variables specified in stage 1. In the second stage, the sampling frame is stratified by first-stage strata and cluster variables as well as strata variables specified in stage 2. If sampling with replacement is used in the first stage, the first-stage duplication index is also one of the stratification variables. Stratification of the third stage continues in a like manner.
Population Size
Sampling units in a population are identified by all unique level combinations of cluster variables within a stratum. Therefore, the population size N of a stratum is equal to the number of unique level combinations of the cluster variables within a stratum. When a sampling unit is selected, all cases having the same sampling unit identifier are included in the sample. If no cluster variable is defined, each case is a sampling unit.
Sample Size
CSSELECT uses a fixed sample size approach in selecting samples. If the sample size n is supplied by the user, it should satisfy 0 ≤ n ≤ N for any without-replacement design and n ≥ 0 for any with-replacement design.
If a sampling fraction f is specified, it should satisfy 0 < f ≤ 1 for any without-replacement design and f > 0 for any with-replacement design. The actual sample size is determined by the formula n = round(f*N). When the option RATEMINSIZE is specified, a sample size less than RATEMINSIZE is raised to RATEMINSIZE. Likewise, when RATEMAXSIZE is specified, a sample size exceeding RATEMAXSIZE is lowered to RATEMAXSIZE.
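The sample size rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the argument names `rate_min_size` and `rate_max_size` are hypothetical stand-ins for the RATEMINSIZE and RATEMAXSIZE options, and rounding halves upward is an assumption, since the text does not say how round() treats ties.

```python
import math

def sample_size_from_rate(f, N, rate_min_size=None, rate_max_size=None):
    """Sketch of n = round(f*N) followed by the optional size limits."""
    # Assumption: ties are rounded upward; the source only says "round".
    n = math.floor(f * N + 0.5)
    # A size below the minimum is raised to the minimum.
    if rate_min_size is not None and n < rate_min_size:
        n = rate_min_size
    # A size above the maximum is lowered to the maximum.
    if rate_max_size is not None and n > rate_max_size:
        n = rate_max_size
    return n
```

For example, a rate of 0.01 applied to 120 units rounds to 1, which a minimum size of 5 would then raise to 5.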
This algorithm selects n distinct units out of N population units with equal probability; see Fan, Muller & Rezucha (1962) for more information.
Algorithm
a) Generate a uniform (0,1) random number U.
b) If (n − i)/(N − k + 1) > U, the k-th population unit is selected; set i = i + 1.
c) If i = n, terminate. Otherwise, go to step 3.
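The selection rule in steps a) to c) can be sketched as a single pass over the population. The sketch below assumes that k runs over the units 1,...,N in file order and that i starts at 0; the function name is hypothetical and Python's `random` module stands in for whatever generator CSSELECT uses.

```python
import random

def selection_sample(N, n, rng=None):
    """Sequential equal-probability selection of n distinct units out of N."""
    rng = rng or random.Random()
    selected = []
    i = 0  # number of units selected so far
    for k in range(1, N + 1):
        if i == n:
            break  # all n units have been selected
        U = rng.random()  # uniform on [0, 1)
        # Select unit k with probability (n - i) / (N - k + 1).
        if (n - i) / (N - k + 1) > U:
            selected.append(k)
            i += 1
    return selected
```

Because the ratio reaches 1 as soon as the number of remaining units equals the number still needed, the pass always ends with exactly n units.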
This algorithm selects n units out of N population units with equal probability and with replacement.
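A minimal sketch of equal-probability sampling with replacement, recording the result as hit counts $h_i$. The steps used by CSSELECT are not reproduced here, so this is only one straightforward way to realise such a design; the function name is hypothetical.

```python
import random

def equal_probability_wr(N, n, rng=None):
    """Draw n units with replacement, each draw uniform over the N units."""
    rng = rng or random.Random()
    hits = [0] * N  # hits[i] is the hit count of unit i+1
    for _ in range(n):
        hits[rng.randrange(N)] += 1  # every unit has probability 1/N per draw
    return hits
```

Units with a positive hit count form the sample, and a unit may appear more than once.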
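Algorithm (Case 1)
Let $S = \sum_{k=1}^{N-n} p_k$, and let i (1 ≤ i ≤ n) denote the number of units that remain to be selected from the first (N − n + i) population units. The size measures of these units are normed as

$$p_j^* = \begin{cases} \dfrac{p_j}{S + i\,p_{N-n+1}}, & 1 \le j \le N-n+1, \\[2ex] \dfrac{p_{N-n+1}}{S + i\,p_{N-n+1}}, & N-n+1 < j \le N-n+i, \end{cases}$$

and

$$P_j = \frac{p_j^*}{p_{j+1}^* + \cdots + p_{N-n+i}^*}.$$

The first unit $j_1$ is selected from the first (N − n + 1) population units with probability proportional to

$$a_{1,1} = i\,p_1^*, \qquad a_{1,j} = i\,p_j^* \prod_{k=1}^{j-1} \left[1 - (i-1)P_k\right], \quad j = 2,\dots,N-n+1.$$

For m = 2,...,i, the m-th unit $j_m$ is selected from units $j_{m-1}+1,\dots,N-n+m$ with probability proportional to the following revised probabilities:

$$a_{m,\,j_{m-1}+1} = (i-m+1)\,p^*_{j_{m-1}+1}, \qquad a_{m,j} = (i-m+1)\,p_j^* \prod_{k=j_{m-1}+1}^{j-1} \left[1 - (i-m)P_k\right], \quad j = j_{m-1}+2,\dots,N-n+m.$$

At the end of the algorithm, the last (n − i) units and the units with indices $j_1,\dots,j_i$ are selected.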
Joint Inclusion Probabilities (Case 1)
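The joint inclusion probability of unit i and unit j in the population (1 ≤ i < j ≤ N) is given by

$$\pi_{ij} = \sum_{r=1}^{n} \theta_r K_{ij}^{(r)},$$

where $\theta_r$ is the probability that the last (n − r) units are selected at step 3 and

$$K_{ij}^{(r)} = \begin{cases} 1 & \text{if } N-n+r < i \le N-1, \\[1ex] \dfrac{r\,p_{N-n+1}}{S + r\,p_{N-n+1}} & \text{if } N-n < i \le N-n+r \text{ and } j > N-n+r, \\[2ex] \dfrac{r\,p_i}{S + r\,p_{N-n+1}} & \text{if } 0 < i \le N-n \text{ and } j > N-n+r, \\[2ex] \pi_{ij}^{(r)} & \text{if } j \le N-n+r. \end{cases}$$

The $\pi_{ij}^{(r)}$ are the conditional joint inclusion probabilities given that the last (n − r) units are selected at step 3. They can be computed by the following formula:

$$\pi_{ij}^{(r)} = r(r-1)\,\bigl(1 - P_1^{(r)}\bigr)\cdots\bigl(1 - P_{i-1}^{(r)}\bigr)\,P_i^{(r)}\,p_j^{(r)},$$

where

$$p_k^{(r)} = \begin{cases} \dfrac{p_k}{S + r\,p_{N-n+1}} & \text{if } k \le N-n+1, \\[2ex] \dfrac{p_{N-n+1}}{S + r\,p_{N-n+1}} & \text{if } N-n+1 < k \le N-n+r, \end{cases}$$

and

$$P_k^{(r)} = \frac{p_k^{(r)}}{p_{k+1}^{(r)} + \cdots + p_{N-n+r}^{(r)}}.$$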
Note: There is a typo in (3.5) of Vijayan (1967) and (3.3) of Fox (1989). The factor (1/2) should not be there. See also Golmant (1990) and Watts (1991) for other corrections.
Algorithm (Case 2)
This algorithm assumes that the population units are sorted by $M_i$ with the order $M_1 \le M_2 \le \dots \le M_N$ and the additional assumption $M_{N-n+1} = M_N$.
$$P_j = \frac{M_j}{M_{j+1} + \cdots + M_N}, \quad j = 1,\dots,N-1$$
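$$h_i = \#\{\,U_j : M^{i-1} < U_j \le M^{i},\ j = 1,\dots,n\,\},$$

where $M^{i} = \sum_{k=1}^{i} M_k$ is the cumulated size of the first i units (the superscript is an index, not a power) and $M^{0} = 0$.
At the end of the algorithm, population units with hit count $h_i > 0$ are selected.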
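The hit-count formula can be illustrated with a short sketch. It assumes that the $U_j$ are n independent uniform random numbers on (0, M) and that all size measures are positive; both assumptions, like the function name, are not taken from the text above.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_wr_hits(sizes, n, rng=None):
    """Hit counts h_i = #{U_j : M^(i-1) < U_j <= M^i} for n uniform draws."""
    rng = rng or random.Random()
    cum = list(accumulate(sizes))  # cumulated sizes M^1, ..., M^N
    M = cum[-1]                    # total size
    hits = [0] * len(sizes)
    for _ in range(n):
        U = rng.uniform(0, M)      # assumed: U_j uniform on (0, M)
        # First index whose cumulated size is >= U, clamped for safety.
        i = min(bisect_left(cum, U), len(sizes) - 1)
        hits[i] += 1
    return hits
```

A unit of size $M_i$ receives each draw with probability $M_i/M$, so larger units collect more hits on average.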
This algorithm selects n units out of N population units with probability proportional to size.
If the size of the i-th unit $M_i$ is greater than the selection interval, the i-th unit can be sampled more than once. To ensure that no unit is sampled more than once, the condition $\max_i M_i \le M/n$ is needed.
Algorithm
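The cumulated sizes are $M^{i} = \sum_{k=1}^{i} M_k$, with $M^{0} = 0$. Writing $s_1,\dots,s_n$ for the selection points, the hit counts are

$$h_i = \#\{\,s_j : M^{i-1} < s_j \le M^{i},\ j = 1,\dots,n\,\}, \quad i = 1,\dots,N.$$

At the end of the algorithm, population units with hit counts $h_i > 0$ are selected.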
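A minimal sketch of systematic selection with probability proportional to size. It assumes a selection interval of M/n, a uniform random start within the first interval, and selection points spaced one interval apart; these details and the function name are assumptions for illustration, not a transcription of the CSSELECT steps.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_systematic_hits(sizes, n, rng=None):
    """Hit counts from n equally spaced selection points over the cumulated sizes."""
    rng = rng or random.Random()
    cum = list(accumulate(sizes))    # cumulated sizes M^1, ..., M^N
    M = cum[-1]
    interval = M / n                 # assumed selection interval
    start = rng.random() * interval  # assumed random start in [0, interval)
    hits = [0] * len(sizes)
    for j in range(n):
        point = start + j * interval
        # Unit i is hit when M^(i-1) < point <= M^i.
        i = min(bisect_left(cum, point), len(sizes) - 1)
        hits[i] += 1
    return hits
```

With sizes [1, 1, 10] and n = 3 the interval is 4, so the third unit (size 10) is always hit at least twice, which is the situation the size condition rules out.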
This algorithm selects n units from N population units sequentially proportional to size with minimum replacement. This method is proposed by Chromy (1979).
By requiring $\max_i M_i \le M/n$, we can ensure that the maximum number of hits is equal to 1.
Algorithm
The starting unit receives the label 1. Then assign labels sequentially to the remaining units. If the end of the list is encountered, loop back to the beginning of the list until all N units are labeled. These labels are the index i in the subsequent steps.
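Compute the cumulated sizes

$$M^{i} = \sum_{k=1}^{i} M_k, \quad i = 1,\dots,N,$$

and the integer part $I_i$ and fractional part $F_i$ of the cumulated expected number of hits:

$$I_i = \lfloor n M^{i} / M \rfloor, \qquad F_i = n M^{i} / M - I_i, \quad i = 1,\dots,N.$$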
If $T_{i-1} = I_{i-1}$:
a) Set $T_i = I_i$.
b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$ with probability $(F_i - F_{i-1})/(1 - F_{i-1})$.
c) Set i = i + 1.
d) If i > N, terminate. Otherwise go to step 6.
If $T_{i-1} = I_{i-1} + 1$:
a) Set $T_i = I_i$.
b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$.
c) If $F_{i-1} \ge F_i$, set $T_i = T_i + 1$ with probability $F_i / F_{i-1}$.
d) Set i = i + 1.
e) If i > N, terminate. Otherwise go to step 6.
At the end of the algorithm, the number of hits of each unit can be computed by the formula $h_i = T_i - T_{i-1}$, i = 1,...,N. Units with $h_i > 0$ are selected.
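The sequential steps can be sketched as follows. Several details are assumptions rather than statements from the text: the starting unit is drawn with probability proportional to size (as in Chromy, 1979), $T_0 = I_0 = 0$ and $F_0 = 0$, and $I_i$ and $F_i$ are the integer and fractional parts of $n M^i / M$. Exact fractions are used so that the final cumulated total is exactly n; the function name is hypothetical.

```python
import math
import random
from fractions import Fraction

def chromy_sequential_hits(sizes, n, rng=None):
    """Sequential PPS selection with minimum replacement; returns hit counts."""
    rng = rng or random.Random()
    N = len(sizes)
    M = sum(Fraction(s) for s in sizes)
    # Assumed: the starting unit (label 1) is drawn proportional to size.
    start = rng.choices(range(N), weights=[float(s) for s in sizes])[0]
    order = [(start + t) % N for t in range(N)]  # circular labeling
    hits = [0] * N
    cum = Fraction(0)
    T_prev, I_prev, F_prev = 0, 0, Fraction(0)   # assumed T_0, I_0, F_0
    for idx in order:
        cum += Fraction(sizes[idx])
        E = n * cum / M      # cumulated expected number of hits
        I = math.floor(E)    # integer part I_i
        F = E - I            # fractional part F_i
        T = I
        if T_prev == I_prev:
            if F > F_prev and rng.random() < float((F - F_prev) / (1 - F_prev)):
                T += 1
        else:                # T_prev == I_prev + 1, which implies F_prev > 0
            if F > F_prev or rng.random() < float(F / F_prev):
                T += 1
        hits[idx] = T - T_prev  # h_i = T_i - T_(i-1)
        T_prev, I_prev, F_prev = T, I, F
    return hits
```

Since the last unit has $F_N = 0$ and $I_N = n$, the hit counts always add up to n.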
Sampford’s (1967) method selects n units out of N population units without replacement and with probabilities proportional to size.
Brewer’s (1963) method is a special case of Sampford’s method when n=2.
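One common way to realise Sampford's design is the rejective procedure sketched below: the first unit is drawn with probability $p_i$, the remaining n − 1 units are drawn with replacement with probability proportional to $p_i/(1 - n p_i)$, and the whole draw is repeated until all n units are distinct. This is a sketch under the assumption that $n p_i < 1$ for every unit; it is not necessarily how CSSELECT implements the method, and the function name and `max_tries` guard are hypothetical.

```python
import random

def sampford_sample(p, n, rng=None, max_tries=100000):
    """Rejective sketch of Sampford's method; p holds the relative sizes p_i."""
    rng = rng or random.Random()
    if any(n * pi >= 1 for pi in p):
        raise ValueError("this sketch requires n * p_i < 1 for every unit")
    N = len(p)
    later_weights = [pi / (1 - n * pi) for pi in p]
    for _ in range(max_tries):
        first = rng.choices(range(N), weights=p)[0]
        rest = rng.choices(range(N), weights=later_weights, k=n - 1)
        draw = [first] + rest
        if len(set(draw)) == n:  # accept only draws with n distinct units
            return sorted(draw)
    raise RuntimeError("no sample of distinct units found")
```

The returned indices are 0-based positions in `p`.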
Murthy’s (1957) method selects two units out of N population units with probabilities proportional to size without replacement.
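The inclusion probability of the i-th population unit is

$$\pi_i = p_i \left[\, 1 + \sum_{k=1}^{N} \frac{p_k}{1 - p_k} - \frac{p_i}{1 - p_i} \,\right].$$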
Algorithm
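The first unit is selected with probability $p_i$. Given that the i-th unit is selected first, the second unit is selected from the remaining units with probability $p_k / (1 - p_i)$, $k \ne i$.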
Joint Inclusion Probabilities
The joint inclusion probability of population units i and j is given by

$$\pi_{ij} = \frac{p_i\,p_j\,(2 - p_i - p_j)}{(1 - p_i)(1 - p_j)}.$$
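The two-draw selection and the probability formulas can be checked against each other with a small sketch. It assumes the first draw is made with probability $p_i$ and the second with probability $p_k/(1 - p_i)$ among the remaining units; the function names are hypothetical.

```python
import random

def murthy_select_two(p, rng=None):
    """Select two distinct units: first with p_i, second with p_k / (1 - p_i)."""
    rng = rng or random.Random()
    N = len(p)
    first = rng.choices(range(N), weights=p)[0]
    rest = [k for k in range(N) if k != first]
    # Weights p_k over the remaining units normalise to p_k / (1 - p_first).
    second = rng.choices(rest, weights=[p[k] for k in rest])[0]
    return first, second

def murthy_inclusion(p):
    """pi_i = p_i * [1 + sum_k p_k/(1-p_k) - p_i/(1-p_i)]."""
    total = sum(pk / (1 - pk) for pk in p)
    return [pi * (1 + total - pi / (1 - pi)) for pi in p]

def murthy_joint(p, i, j):
    """pi_ij = p_i p_j (2 - p_i - p_j) / ((1 - p_i)(1 - p_j))."""
    return p[i] * p[j] * (2 - p[i] - p[j]) / ((1 - p[i]) * (1 - p[j]))
```

For a sample of size two the inclusion probabilities sum to 2, and summing the joint probabilities over all $j \ne i$ returns $\pi_i$, which gives a quick consistency check of the two formulas.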
STAGEPOPSIZE saves the population sizes of each stratum in a given stage.
STAGESAMPSIZE saves the actual sample sizes of each stratum in a given stage. See the "Sample Size" section for details on sample size calculations.
STAGESAMPRATE saves the actual sampling rate of each stratum in a given stage. It is computed by dividing the actual sample size by the population size. Because of rounding and the application of RATEMINSIZE and RATEMAXSIZE to the sample size, the resulting STAGESAMPRATE may differ from the sampling rate specified by the user.
Stage inclusion probabilities depend on the selection method. The formulae are given in the individual sections of each selection method.
The stage sampling weight is equal to the inverse of the stage inclusion probability.
The final sampling weight is the product of the previous weight (if specified) and all the stage weights.
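As a small illustration of how the weights combine, the sketch below takes the stage inclusion probabilities of one unit and an optional previous weight. The function names and the default previous weight of 1 are assumptions for the example.

```python
def stage_weight(stage_inclusion_probability):
    """Stage weight: inverse of the stage inclusion probability."""
    return 1.0 / stage_inclusion_probability

def final_weight(stage_inclusion_probabilities, previous_weight=1.0):
    """Product of the previous weight and the weights of all stages."""
    weight = previous_weight
    for prob in stage_inclusion_probabilities:
        weight *= stage_weight(prob)
    return weight
```

A unit selected with probability 0.5 in stage 1 and 0.25 in stage 2 gets stage weights 2 and 4, and a final weight of 8 when no previous weight is given.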
STAGEHITS is the number of times a unit is selected in a given stage. When a without-replacement (WOR) method is used, the value is always 0 or 1. When a with-replacement (WR) method is used, it can be any nonnegative integer.
This is the number of times an ultimate sampling unit is selected. It is equal to STAGEHITS of the last specified stage.
STAGEINDEX is an index variable used to differentiate duplicated sampling units resulting from sampling with replacement. It ranges from one to the number of hits of a selected unit.
Brewer, K.R.W. (1963). A Model of Systematic Sampling with Unequal Probabilities. Australian Journal of Statistics, 5, 93-105.
Chromy, J.R. (1979). Sequential Sample Selection Methods. Proceedings of the American Statistical Association, Survey Research Methods Section, 401-406.
Fan, C.T., Muller, M.E., and Rezucha, I. (1962). Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers. Journal of the American Statistical Association, 57, 387-402.
Fox, D.R. (1989). Computer Selection of Size-Biased Samples. The American Statistician, 43(3), 168-171.
Golmant, J. (1990). Correction: Computer Selection of Size-Biased Samples. The American Statistician, 44(2), 194.
Hanurav, T.V. (1967). Optimum Utilization of Auxiliary Information: πps Sampling of Two Units from a Stratum. Journal of the Royal Statistical Society, Series B, 29, 374-391.