Probability Proportional to Size (PPS) Sampling Algorithms, Study notes of Mathematical Statistics

These notes describe sampling algorithms, including several that select units from a population with probability proportional to size (PPS): systematic sampling, sequential sampling, Sampford's method, and Murthy's method. PPS designs give larger units a higher chance of being selected in the sample. The notes cover the algorithms, inclusion probabilities, sampling weights, and special requirements of each method.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar
CSSELECT

This document describes the algorithm used by CSSELECT to draw samples according to complex designs. The data file does not have to be sorted. Population units can appear more than once in the data file and they do not have to be in a consecutive block of cases.

Notation

The following notation is used throughout this chapter unless otherwise stated:

$N$ Population size

$n$ Sample size

$f$ Sampling fraction

$h_i$ Hit count of the i-th population unit ($i = 1, \ldots, N$)

$M_i$ Size measure of the i-th population unit ($i = 1, \ldots, N$)

$M$ Total size, $M = \sum_{i=1}^{N} M_i$

$p_i$ Relative size of the i-th population unit, $p_i = M_i / M$ ($i = 1, \ldots, N$)

Stratification

Stratification partitions the sampling frame into disjoint sets. Sampling is carried out independently within each stratum. Therefore, without loss of generality, the algorithm described in this document only considers sampling from one population.

In the first stage of selection, the sampling frame is partitioned by the stratification variables specified in stage 1. In the second stage, the sampling frame is stratified by first-stage strata and cluster variables as well as strata variables specified in stage 2. If sampling with replacement is used in the first stage, the first-stage duplication index is also one of the stratification variables. Stratification of the third stage continues in a like manner.

Population Size

Sampling units in a population are identified by all unique level combinations of cluster variables within a stratum. Therefore, the population size N of a stratum is equal to the number of unique level combinations of the cluster variables within a stratum. When a sampling unit is selected, all cases having the same sampling unit identifier are included in the sample. If no cluster variable is defined, each case is a sampling unit.

Sample Size

CSSELECT uses a fixed sample size approach in selecting samples. If the sample size n is supplied by the user, it should satisfy 0 ≤ n ≤ N for any without-replacement design and n ≥ 0 for any with-replacement design.

If a sampling fraction f is specified, it should satisfy 0 < f ≤ 1 for any without-replacement design and f > 0 for any with-replacement design. The actual sample size is determined by the formula n = round(f*N). When the option RATEMINSIZE is specified, a sample size less than RATEMINSIZE is raised to RATEMINSIZE. Likewise, a sample size exceeding RATEMAXSIZE is lowered to RATEMAXSIZE.
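The sample size rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size` and its keyword arguments are hypothetical, and the tie-breaking behaviour of round() is an assumption (round half up), since the text does not specify it.

```python
import math

def sample_size(N, f, rate_min_size=None, rate_max_size=None):
    """Sample size implied by sampling fraction f in a stratum of N units."""
    # Assumption: round() breaks ties upward; the source does not say.
    n = math.floor(f * N + 0.5)
    # RATEMINSIZE raises sample sizes that are too small.
    if rate_min_size is not None and n < rate_min_size:
        n = rate_min_size
    # RATEMAXSIZE lowers sample sizes that are too large.
    if rate_max_size is not None and n > rate_max_size:
        n = rate_max_size
    return n
```

For example, `sample_size(200, 0.01, rate_min_size=5)` raises the computed size of 2 to 5.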

Simple Random Sampling

This algorithm selects n distinct units out of N population units with equal probability; see Fan, Muller & Rezucha (1962) for more information.

  • Inclusion probability of i-th unit = n/N
  • Sampling weight of i-th unit = N/n

Algorithm

  1. If f is supplied, compute n = round(f*N).
  2. Set k=0, i=0 and start data scan.
  3. Get a population unit and set k=k+1. If no population units are left, terminate.
  4. Test whether the k-th unit should go into the sample:
     a) Generate a uniform (0,1) random number U.
     b) If (n − i)/(N − k + 1) > U, the k-th population unit is selected; set i=i+1.
     c) If i=n, terminate. Otherwise, go to step 3.
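The sequential scan above translates almost directly into code. The following is a minimal sketch, assuming the population is simply the labels 1 to N; the function name `srs_sequential` and the `rng` parameter are illustrative choices, not part of CSSELECT.

```python
import random

def srs_sequential(N, n, rng=random):
    """Select n distinct labels from 1..N in a single pass."""
    selected = []
    i = 0  # number of units selected so far
    for k in range(1, N + 1):
        if i == n:
            break
        U = rng.random()  # uniform on [0, 1)
        # Select the k-th unit with probability (n - i) / (N - k + 1).
        if (n - i) / (N - k + 1) > U:
            selected.append(k)
            i += 1
    return selected
```

Because the selection probability reaches 1 once the number of units still needed equals the number of units still unscanned, the scan always returns exactly n units.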

Unrestricted Random Sampling

This algorithm selects n units out of N population units with equal probability and with replacement.

  • Inclusion probability of i-th unit = $1 - (1 - 1/N)^n$
  • Sampling weight of i-th unit = N/n (for use with the Hansen-Hurwitz (1943) estimator)
  • Expected number of hits of i-th unit = n/N

Algorithm

  1. Set i=0 and initialize all hit counts to zero.
  2. Generate an integer k between 1 and N uniformly.
  3. Increase hit count of k-th population unit by 1.
  4. Set i=i+1.
  5. If i=n, then terminate. Otherwise go to step 2.
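As a rough illustration of the steps above, the sketch below draws n labels with replacement and accumulates hit counts. The names `unrestricted_random_sample` and `inclusion_probability` are hypothetical helpers introduced here for the example.

```python
import random

def unrestricted_random_sample(N, n, rng=random):
    """Return hit counts h_1..h_N after n equal-probability draws with replacement."""
    hits = [0] * N
    for _ in range(n):
        k = rng.randint(1, N)  # integer uniform on 1..N, endpoints included
        hits[k - 1] += 1
    return hits

def inclusion_probability(N, n):
    """Probability that a given unit is hit at least once: 1 - (1 - 1/N)^n."""
    return 1 - (1 - 1 / N) ** n
```

The hit counts always sum to n, while the number of distinct units selected can be smaller than n.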

  3. If the integer selected in step 2 is $i$, then the last $(n-i)$ population units are selected.
  4. Define a new set of probabilities for the first $(N-n+i)$ population units:

$$p_j^* = \begin{cases} \dfrac{p_j}{S + i\,p_{N-n+1}}, & 1 \le j \le N-n+1 \\[2ex] \dfrac{p_{N-n+1}}{S + i\,p_{N-n+1}}, & N-n+1 < j \le N-n+i \end{cases}$$

  5. Define

$$P_j = \frac{M_j}{M_{j+1} + \cdots + M_{N-n+i}}, \quad j = 1, \ldots, N-n+i-1$$

  6. Set m=1 and select one unit from the first $(N-n+1)$ population units with probability proportional to

$$a_1 = i\,p_1^*, \qquad a_j = i\,p_j^* \prod_{k=1}^{j-1} \left[1 - (i-1)P_k\right], \quad j = 2, \ldots, N-n+1$$

  7. Denote the index of the selected unit by $j_m$.
  8. Set m=m+1 and select one unit from the $(j_{m-1}+1)$-th to $(N-n+m)$-th population units with the following revised probabilities:

$$a_{j_{m-1}+1} = (i-m+1)\,p^*_{j_{m-1}+1}, \qquad a_j = (i-m+1)\,p_j^* \prod_{k=j_{m-1}+1}^{j-1} \left[1 - (i-m)P_k\right], \quad j = j_{m-1}+2, \ldots, N-n+m$$

  9. Denote the selected unit in step 8 by $j_m$.
  10. If m=i, terminate. Otherwise, go to step 8.

At the end of the algorithm, the last $(n-i)$ units and the units with indices $j_1, \ldots, j_i$ are selected.

Joint Inclusion Probabilities (Case 1)

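The joint inclusion probability of unit i and unit j in the population ($1 \le i < j \le N$) is given by

$$\pi_{ij} = \sum_{r=1}^{n} \theta_r K_{ij}^{(r)}$$

where

$$K_{ij}^{(r)} = \begin{cases} 1, & \text{if } N-n+r < i \le N-1, \\[1ex] \dfrac{r\,p_{N-n+1}}{S + r\,p_{N-n+1}}, & \text{if } N-n < i \le N-n+r \text{ and } j > N-n+r, \\[2ex] \dfrac{r\,p_i}{S + r\,p_{N-n+1}}, & \text{if } 0 < i \le N-n \text{ and } j > N-n+r, \\[2ex] \pi_{ij}^{(r)}, & \text{if } j \le N-n+r. \end{cases}$$

The $\pi_{ij}^{(r)}$ are the conditional joint inclusion probabilities given that the last $(n-r)$ units are selected at step 3. They can be computed by the following formula:

$$\pi_{ij}^{(r)} = r(r-1)\left(1 - P_1^{(r)}\right) \cdots \left(1 - P_{i-1}^{(r)}\right) P_i^{(r)}\, p_j^{(r)}$$

where

$$p_k^{(r)} = \begin{cases} \dfrac{p_k}{S + r\,p_{N-n+1}}, & \text{if } k \le N-n+1, \\[2ex] \dfrac{p_{N-n+1}}{S + r\,p_{N-n+1}}, & \text{if } N-n+1 < k \le N-n+r, \end{cases}$$

and

$$P_k^{(r)} = \frac{p_k^{(r)}}{p_{k+1}^{(r)} + \cdots + p_{N-n+r}^{(r)}}$$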

Note: There is a typo in (3.5) of Vijayan (1967) and (3.3) of Fox (1989). The factor (1/2) should not be there. See also Golmant (1990) and Watts (1991) for other corrections.

Algorithm (Case 2)

This algorithm assumes that the population units are sorted by $M_i$ in the order $M_1 \le M_2 \le \cdots \le M_N$, with the additional assumption $M_{N-n+1} = M_N$.

  1. Define the probabilities

$$P_j = \frac{M_j}{M_{j+1} + \cdots + M_N}, \quad j = 1, \ldots, N-1$$

  2. Select one unit from the first $(N-n+1)$ population units with probability proportional to
Compute hit counts of the i-th population unit:

$$h_i = \#\{U_j : M^{i-1} < U_j \le M^{i},\ j = 1, \ldots, n\},$$

where $M^0 = 0$ and $M^i = \sum_{k=1}^{i} M_k$ is the cumulated size of the first $i$ units (the superscript is an index, not a power).

At the end of the algorithm, population units with hit count $h_i > 0$ are selected.

PPS Systematic Sampling

This algorithm selects n units out of N population units with probability proportional to size. If the size of the i-th unit $M_i$ is greater than the selection interval, the i-th unit is sampled more than once.

  • Inclusion probability of i-th unit = $np_i$
  • Sampling weight of i-th unit = $1/(np_i)$
  • Expected number of hits of i-th unit = $np_i$. In order to have no duplicates in the sample, the condition $\max_i M_i \le M/n$ is needed.

Algorithm

  1. Compute cumulated sizes $M^i = \sum_{k=1}^{i} M_k$, with $M^0 = 0$.
  2. Compute the selection interval $I = M/n$.
  3. Generate a random number $S$ from uniform(0, I).
  4. Generate the sequence $\{S_j : S_j = S + (j-1)I,\ j = 1, \ldots, n\}$.
  5. Compute hit counts of the i-th population unit, $h_i = \#\{S_j : M^{i-1} < S_j \le M^{i},\ j = 1, \ldots, n\}$, $i = 1, \ldots, N$.

At the end of the algorithm, population units with hit counts $h_i > 0$ are selected.
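The five steps above can be sketched as follows. This is an illustrative Python version, not the CSSELECT implementation: the function name `pps_systematic` is hypothetical, sizes are assumed positive, and the guard on the last unit is an added safeguard against floating-point overshoot.

```python
import random
from itertools import accumulate

def pps_systematic(sizes, n, rng=random):
    """Return hit counts h_1..h_N for PPS systematic sampling."""
    cum = list(accumulate(sizes))  # cumulated sizes M^1..M^N
    M = cum[-1]                    # total size
    I = M / n                      # selection interval
    S = rng.uniform(0, I)          # random start
    hits = [0] * len(sizes)
    i = 0
    for j in range(n):
        s_j = S + j * I            # j-th selection point
        # Advance to the first unit whose cumulated size reaches s_j,
        # i.e. the unit with M^{i-1} < s_j <= M^i.
        while i < len(sizes) - 1 and s_j > cum[i]:
            i += 1
        hits[i] += 1
    return hits
```

With sizes `[10, 1, 1]` and n = 3 the selection interval is 4, so the first unit (size 10) receives at least two hits, which shows why the condition on the largest size is needed to avoid duplicates.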

PPS Sequential Sampling (Chromy)

This algorithm selects n units from N population units sequentially, with probability proportional to size and with minimum replacement. The method was proposed by Chromy (1979).

  • Inclusion probability of i-th unit = $np_i$
  • Sampling weight of i-th unit = $1/(np_i)$
  • Maximum number of hits of i-th unit = $\mathrm{trunc}(np_i) + 1$
  • By applying the restriction $\max_i M_i \le M/n$, we can ensure that the maximum number of hits is equal to 1.

Algorithm

  1. Select one unit from the population with probability proportional to its size $M_i$. The selected unit receives label 1. Then assign labels sequentially to the remaining units. If the end of the list is encountered, loop back to the beginning of the list until all N units are labeled. These labels are the index i in the subsequent steps.
  2. Compute the integer part of the cumulated expected hit counts, $I_i = \mathrm{trunc}(nM^i/M)$, where $M^i = \sum_{k=1}^{i} M_k$, $i = 1, \ldots, N$.
  3. Compute the fractional part of the cumulated expected hit counts, $F_i = nM^i/M - I_i$, $i = 1, \ldots, N$.
  4. Define $I_0 = 0$, $F_0 = 0$ and $T_0 = 0$.
  5. Set i=1.
  6. If $T_{i-1} = I_{i-1}$, go to step 8.
  7. If $T_{i-1} = I_{i-1} + 1$, go to step 9.
  8. Determine accumulated hits at the i-th step (case 1).
     a) Set $T_i = I_i$.
     b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$ with probability $(F_i - F_{i-1})/(1 - F_{i-1})$.
     c) Set i=i+1.
     d) If i > N, terminate. Otherwise, go to step 6.
  9. Determine accumulated hits at the i-th step (case 2).
     a) Set $T_i = I_i$.
     b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$.
     c) If $F_{i-1} \ge F_i$, set $T_i = T_i + 1$ with probability $F_i / F_{i-1}$.
     d) Set i=i+1.
     e) If i > N, terminate. Otherwise, go to step 6.

At the end of the algorithm, the number of hits of each unit can be computed by the formula $h_i = T_i - T_{i-1}$, $i = 1, \ldots, N$. Units with $h_i > 0$ are selected.
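A compact sketch of the sequential procedure is given below. It is an illustration under stated assumptions rather than the CSSELECT code: the function name `chromy_sequential` is hypothetical, sizes are assumed to be positive integers so that the integer and fractional parts of the cumulated expected hits $nM^i/M$ can be computed exactly, and the circular relabeling of step 1 is done with a rotated index list.

```python
import random
from fractions import Fraction
from itertools import accumulate

def chromy_sequential(sizes, n, rng=random):
    """Return hit counts per unit (in original order) for positive integer sizes."""
    N = len(sizes)
    M = sum(sizes)
    # Step 1: random start proportional to size, then circular relabeling.
    start = rng.choices(range(N), weights=sizes, k=1)[0]
    order = [(start + t) % N for t in range(N)]
    # Steps 2-4: integer and fractional parts of n * M^i / M, with I_0 = F_0 = 0.
    cum = list(accumulate(sizes[u] for u in order))
    I = [0] + [(n * c) // M for c in cum]
    F = [Fraction(0)] + [Fraction((n * c) % M, M) for c in cum]
    T = [0] * (N + 1)  # accumulated hits, T_0 = 0
    for i in range(1, N + 1):
        T[i] = I[i]
        if T[i - 1] == I[i - 1]:
            # Case 1: add a hit with probability (F_i - F_{i-1}) / (1 - F_{i-1}).
            if F[i] > F[i - 1]:
                p = (F[i] - F[i - 1]) / (1 - F[i - 1])
                if rng.random() < float(p):
                    T[i] += 1
        else:
            # Case 2: T_{i-1} = I_{i-1} + 1, which implies F_{i-1} > 0.
            if F[i] > F[i - 1]:
                T[i] += 1
            elif rng.random() < float(F[i] / F[i - 1]):
                T[i] += 1
    hits = [0] * N
    for i, u in enumerate(order, start=1):
        hits[u] = T[i] - T[i - 1]  # h_i = T_i - T_{i-1}
    return hits
```

Since the final fractional part is zero, the accumulated hits at the last unit equal n, so the hit counts always sum to the requested sample size.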

PPS Sampford’s Method

Sampford’s (1967) method selects n units out of N population units without replacement and with probabilities proportional to size.

  • Inclusion probability of i-th unit = $np_i$

PPS Brewer’s Method (n=2)

Brewer’s (1963) method is a special case of Sampford’s method when n=2.

PPS Murthy’s Method (n=2)

Murthy’s (1957) method selects two units out of N population units with probabilities proportional to size without replacement.

  • Inclusion probability of i-th unit:

$$\pi_i = p_i \left[ 1 + \sum_{k=1}^{N} \frac{p_k}{1 - p_k} - \frac{p_i}{1 - p_i} \right]$$

  • Sampling weight of i-th unit = inverse of inclusion probability

Algorithm

  1. Select the first unit from the population with probabilities $p_k$, $k = 1, \ldots, N$.
  2. If the first selected unit has index i, then select the second unit with probabilities $p_k / (1 - p_i)$, $k \ne i$.

Joint Inclusion Probabilities

The joint inclusion probability of population units i and j is given by

$$\pi_{ij} = \frac{p_i\, p_j\, (2 - p_i - p_j)}{(1 - p_i)(1 - p_j)}$$
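The formulas above are easy to check numerically. The following sketch, with hypothetical function names and 0-based indices, computes the first-order and joint inclusion probabilities from the relative sizes and performs the two draws; it assumes every relative size is strictly less than 1.

```python
import random

def murthy_inclusion_probs(p):
    """First-order inclusion probabilities for n = 2."""
    total = sum(pk / (1 - pk) for pk in p)
    return [pi * (1 + total - pi / (1 - pi)) for pi in p]

def murthy_joint_prob(p, i, j):
    """Joint inclusion probability of units i and j (i != j)."""
    return p[i] * p[j] * (2 - p[i] - p[j]) / ((1 - p[i]) * (1 - p[j]))

def murthy_draw(p, rng=random):
    """Draw two distinct unit indices following the two-step algorithm."""
    first = rng.choices(range(len(p)), weights=p, k=1)[0]
    rest = [k for k in range(len(p)) if k != first]
    # Weights p_k are proportional to p_k / (1 - p_first) for k != first.
    second = rng.choices(rest, weights=[p[k] for k in rest], k=1)[0]
    return first, second
```

Two consistency checks follow from the fixed sample size of 2: the inclusion probabilities sum to 2, and for each unit i the joint probabilities over all other units j sum to the unit's own inclusion probability.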

Saved Variables

STAGEPOPSIZE

STAGEPOPSIZE saves the population sizes of each stratum in a given stage.

STAGESAMPSIZE

STAGESAMPSIZE saves the actual sample sizes of each stratum in a given stage. See the "Sample Size" section for details on sample size calculations.

STAGESAMPRATE

STAGESAMPRATE saves the actual sampling rate of each stratum in a given stage. It is computed by dividing the actual sample size by the population size. Due to the use of rounding and application of RATEMINSIZE and RATEMAXSIZE on sample size, the resulting STAGESAMPRATE may be different from sampling rate specified by the user.

STAGEINCLPROB

Stage inclusion probabilities depend on the selection method. The formulae are given in the individual sections of each selection method.

STAGEWEIGHT

It is equal to the inverse of stage inclusion probabilities.

SAMPLEWEIGHT

It is the product of previous weight (if specified) and all the stage weights.

STAGEHITS

This is the number of times a unit is selected in a given stage. When a without-replacement (WOR) method is used, the value is always 0 or 1. When a with-replacement (WR) method is used, it can be any nonnegative integer.

SAMPLEHITS

This is the number of times an ultimate sampling unit is selected. It is equal to STAGEHITS of the last specified stage.

STAGEINDEX

It is an index variable used to differentiate duplicated sampling units resulting from sampling with replacement. STAGEINDEX ranges from one to the number of hits of a selected unit.

References

Brewer, K.W.R. (1963). A Model of Systematic Sampling with Unequal Probabilities. Australian Journal of Statistics, 5, 93-105.

Chromy, J.R. (1979). Sequential Sample Selection Methods. Proceedings of the American Statistical Association, Survey Research Methods Section, 401-406.

Fan, C.T., Muller, M.E., and Rezucha, I. (1962). Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers. Journal of the American Statistical Association, 57, 387-402.

Fox, D.R. (1989). Computer Selection of Size-Biased Samples. The American Statistician, 43(3), 168-171.

Golmant, J. (1990). Correction: Computer Selection of Size-Biased Samples. The American Statistician, 44(2), 194.

Hanurav, T.V. (1967). Optimum Utilization of Auxiliary Information: πps Sampling of Two Units from a Stratum. Journal of the Royal Statistical Society, Series B, 29, 374-391.