Probability Proportional to Size (PPS) Sampling Algorithms, Study notes of Mathematical Statistics

Alliance University Mathematical Statistics

Describes various sampling algorithms that select units from a population with probability proportional to size (PPS), including systematic sampling, sequential sampling, Sampford’s method, and Murthy’s method. These algorithms give larger units a higher chance of being selected in the sample. Includes detailed explanations of the algorithms, inclusion probabilities, sampling weights, and special requirements.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar 🇮🇳


CSSELECT

This document describes the algorithm used by CSSELECT to draw samples according to complex designs. The data file does not have to be sorted. Population units can appear more than once in the data file and they do not have to be in a consecutive block of cases.

Notation

The following notation is used throughout this chapter unless otherwise stated:

N: Population size

n: Sample size

f: Sampling fraction

$h_i$: Hit count of the i-th population unit (i=1,...,N)

$M_i$: Size measure of the i-th population unit (i=1,...,N)

M: Total size, $M = \sum_{i=1}^{N} M_i$

$p_i$: $p_i = M_i / M$ is the relative size of the i-th population unit (i=1,...,N)

Stratification

Stratification partitions the sampling frame into disjoint sets. Sampling is carried out independently within each stratum. Therefore, without loss of generality, the algorithm described in this document only considers sampling from one population.

In the first stage of selection, the sampling frame is partitioned by the stratification variables specified in stage 1. In the second stage, the sampling frame is stratified by first-stage strata and cluster variables as well as strata variables specified in stage 2. If sampling with replacement is used in the first stage, the first-stage duplication index is also one of the stratification variables. Stratification of the third stage continues in a like manner.

Population Size

Sampling units in a population are identified by all unique level combinations of cluster variables within a stratum. Therefore, the population size N of a stratum is equal to the number of unique level combinations of the cluster variables within a stratum. When a sampling unit is selected, all cases having the same sampling unit identifier are included in the sample. If no cluster variable is defined, each case is a sampling unit.

Sample Size

CSSELECT uses a fixed sample size approach in selecting samples. If the sample size n is supplied by the user, it should satisfy 0 ≤ n ≤ N for any without-replacement design and n ≥ 0 for any with-replacement design.

If a sampling fraction f is specified, it should satisfy 0 < f ≤ 1 for any without-replacement design and f > 0 for any with-replacement design. The actual sample size is determined by the formula n = round(f*N). When the option RATEMINSIZE is specified, a sample size less than RATEMINSIZE is raised to RATEMINSIZE. Likewise, when RATEMAXSIZE is specified, a sample size exceeding RATEMAXSIZE is lowered to RATEMAXSIZE.
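The sample size rule above can be sketched in a few lines of Python. This is only an illustration: the function name and parameter names are hypothetical, and the text does not say how round() treats exact halves, so rounding half up is an assumption here.

```python
import math

def sample_size_from_rate(f, N, rate_min_size=None, rate_max_size=None):
    """Turn a sampling fraction f into a sample size for a stratum of size N.

    rate_min_size / rate_max_size stand in for the RATEMINSIZE / RATEMAXSIZE
    options; None means the option was not specified.
    """
    # Assumption: round half up. The text only says n = round(f*N).
    n = math.floor(f * N + 0.5)
    if rate_min_size is not None and n < rate_min_size:
        n = rate_min_size  # raise a too-small sample size
    if rate_max_size is not None and n > rate_max_size:
        n = rate_max_size  # lower a too-large sample size
    return n
```

Because of the rounding and clamping, the realized rate n/N can differ from the requested f, which is why the saved sampling rate is reported separately from the requested one.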

Simple Random Sampling

This algorithm selects n distinct units out of N population units with equal probability; see Fan, Muller & Rezucha (1962) for more information.

Inclusion probability of i-th unit = n/N
Sampling weight of i-th unit = N/n

Algorithm

1. If f is supplied, compute n=round(f*N).
2. Set k=0, i=0 and start data scan.
3. Get a population unit and set k=k+1. If no more population units are left, terminate.
4. Test if the k-th unit should go into the sample.

a) Generate a uniform (0,1) random number U.

b) If (n−i)/(N−k+1) > U, the k-th population unit is selected; set i=i+1.

c) If i=n, terminate. Otherwise, go to step 3.
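The steps above can be sketched in Python as a single pass over the population. The function name and the 0-based indices in the return value are illustrative choices, not part of CSSELECT.

```python
import random

def srs_sequential(N, n, rng=None):
    """Select n distinct units out of N in one sequential scan.

    Returns the 0-based positions of the selected units in scan order.
    """
    rng = rng or random.Random()
    selected = []
    i = 0  # number of units selected so far
    for k in range(1, N + 1):  # k is the 1-based position in the data scan
        if i == n:
            break  # sample is complete
        u = rng.random()  # uniform random number in [0, 1)
        # Select the k-th unit with probability (n - i) / (N - k + 1).
        if (n - i) / (N - k + 1) > u:
            selected.append(k - 1)
            i += 1
    return selected
```

Once the number of units still needed equals the number of units left, the ratio is 1 and every remaining unit is taken, so the scan always ends with exactly n selections.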

Unrestricted Random Sampling

This algorithm selects n units out of N population units with equal probability and with replacement.

Inclusion probability of i-th unit = $1-(1-1/N)^n$
Sampling weight of i-th unit = N/n. (For use with the Hansen-Hurwitz (1943) estimator)
Expected number of hits of i-th unit = n/N

Algorithm

1. Set i=0 and initialize all hit counts to zero.
2. Generate an integer k between 1 and N uniformly.
3. Increase hit count of k-th population unit by 1.
4. Set i=i+1.
5. If i=n, then terminate. Otherwise go to step 2.
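A minimal Python sketch of this with-replacement draw follows, together with the inclusion probability stated for the method. The function names are illustrative.

```python
import random

def unrestricted_sample(N, n, rng=None):
    """Draw n units with replacement; return the hit count of every unit."""
    rng = rng or random.Random()
    hits = [0] * N  # all hit counts start at zero
    for _ in range(n):
        k = rng.randrange(N)  # uniform integer in 0..N-1
        hits[k] += 1
    return hits

def inclusion_probability(N, n):
    """Probability that a given unit is hit at least once in n draws."""
    return 1 - (1 - 1 / N) ** n
```

The hit counts always sum to n, while the number of distinct selected units can be smaller than n.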

Algorithm (Case 1)

3. If the integer selected in step 2 is i, then the last (n-i) population units are selected.
4. Define a new set of probabilities for the first (N-n+i) population units:

$$p_j^{*} = \begin{cases} \dfrac{p_j}{S + i\,p_{N-n+1}}, & 1 \le j \le N-n+1, \\[2ex] \dfrac{p_{N-n+1}}{S + i\,p_{N-n+1}}, & N-n+1 < j \le N-n+i. \end{cases}$$

5. Define

$$P_j = \frac{M_j}{M_{j+1} + \dots + M_{N-n+i}}, \quad j = 1,\dots,N-n+i-1.$$

6. Set m=1 and select one unit from the first (N-n+1) population units with probability proportional to

$$a_1 = i\,p_1^{*}, \qquad a_j = i\,p_j^{*} \prod_{k=1}^{j-1} \left[1-(i-1)P_k\right], \quad j = 2,\dots,N-n+1.$$

7. Denote the index of the selected unit by $j_m$.
8. Set m=m+1 and select one unit from the $(j_{m-1}+1)$-th to the (N-n+m)-th population units with the following revised probabilities:

$$a_{j_{m-1}+1} = (i-m+1)\,p_{j_{m-1}+1}^{*}, \qquad a_j = (i-m+1)\,p_j^{*} \prod_{k=j_{m-1}+1}^{j-1} \left[1-(i-m)P_k\right], \quad j = j_{m-1}+2,\dots,N-n+m.$$

9. Denote the selected unit in step 8 by $j_m$.
10. If m=i, terminate. Otherwise, go to step 8.

At the end of the algorithm, the last (n-i) units and the units with indices $j_1,\dots,j_i$ are selected.

Joint Inclusion Probabilities (Case 1)

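The joint inclusion probability of unit i and unit j in the population ($1 \le i < j \le N$) is given by

$$\pi_{ij} = \sum_{r=1}^{n} \theta_r K_{ij}^{(r)}$$

where

$$K_{ij}^{(r)} = \begin{cases} 1 & \text{if } N-1 \ge i > N-n+r, \\[1ex] \dfrac{r\,p_{N-n+1}}{S + r\,p_{N-n+1}} & \text{if } N-n+r \ge i > N-n \text{ and } j > N-n+r, \\[2ex] \dfrac{r\,p_i}{S + r\,p_{N-n+1}} & \text{if } N-n \ge i > 0 \text{ and } j > N-n+r, \\[2ex] \pi_{ij}^{(r)} & \text{if } j \le N-n+r. \end{cases}$$

The $\pi_{ij}^{(r)}$ are the conditional joint inclusion probabilities given that the last (n-r) units are selected at step 3. They can be computed by the following formula:

$$\pi_{ij}^{(r)} = r(r-1)\,(1-P_1^{(r)}) \cdots (1-P_{i-1}^{(r)})\,P_i^{(r)}\,p_j^{(r)}$$

where

$$p_k^{(r)} = \begin{cases} \dfrac{p_{N-n+1}}{S + r\,p_{N-n+1}} & \text{if } N-n+1 < k \le N-n+r, \\[2ex] \dfrac{p_k}{S + r\,p_{N-n+1}} & \text{if } k \le N-n+1 \end{cases}$$

and

$$P_k^{(r)} = \frac{p_k^{(r)}}{p_{k+1}^{(r)} + \dots + p_{N-n+r}^{(r)}}.$$

Note: There is a typo in (3.5) of Vijayan (1967) and (3.3) of Fox (1989). The factor (1/2) should not be there. See also Golmant (1990) and Watts (1991) for other corrections.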

Algorithm (Case 2)

This algorithm assumes that the population units are sorted by $M_i$ in the order $M_1 \le M_2 \le \dots \le M_N$, with the additional assumption $M_{N-n+1} = M_N$.

Define the probabilities

$$P_j = \frac{M_j}{M_{j+1} + \dots + M_N}, \quad j = 1,\dots,N-1.$$

Select one unit from the first (N-n+1) population units with probability proportional to

Compute hit counts of i-th population unit

$$h_i = \#\{U_j : \bar{M}_{i-1} < U_j \le \bar{M}_i,\ j = 1,\dots,n\},$$

where $\bar{M}_0 = 0$ and $\bar{M}_i = \sum_{k=1}^{i} M_k$.

At the end of the algorithm, population units with hit count $h_i > 0$ are selected.

PPS Systematic Sampling

This algorithm selects n units out of N population units with probability proportional to size.

If the size of the i-th unit $M_i$ is greater than the selection interval, the i-th unit is sampled more than once.

Inclusion probability of i-th unit = $n p_i$
Sampling weight of i-th unit = $1/(n p_i)$

Expected number of hits of i-th unit = $n p_i$. In order to have no duplicates in the sample, the condition $\max_i M_i \le M/n$ is needed.

Algorithm

1. Compute cumulated sizes $\bar{M}_i = \sum_{k=1}^{i} M_k$, with $\bar{M}_0 = 0$.
2. Compute the selection interval I=M/n.
3. Generate a random number S from uniform(0,I).
4. Generate the sequence $\{S_j : S_j = S + (j-1)I,\ j = 1,\dots,n\}$.
5. Compute hit counts of i-th population unit $h_i = \#\{S_j : \bar{M}_{i-1} < S_j \le \bar{M}_i,\ j = 1,\dots,n\}$, i=1,...,N.

At the end of the algorithm, population units with hit counts $h_i > 0$ are selected.
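The systematic selection can be sketched in Python as follows. The function name is illustrative, and drawing the start from (0, I] rather than an open interval is a small implementation choice so that a start of exactly zero cannot occur.

```python
import random
from itertools import accumulate

def pps_systematic(sizes, n, rng=None):
    """Return the hit count of each unit under PPS systematic sampling."""
    rng = rng or random.Random()
    N = len(sizes)
    cum = [0.0] + list(accumulate(sizes))  # cum[i] = M_1 + ... + M_i, cum[0] = 0
    interval = cum[-1] / n  # selection interval I = M / n
    start = interval * (1.0 - rng.random())  # random start S in (0, I]
    hits = [0] * N
    i = 1  # 1-based index of the unit whose cumulated range is being checked
    for j in range(n):
        s_j = start + j * interval  # j-th selection point
        # Advance to the unit with cum[i-1] < s_j <= cum[i]; the bound on i
        # guards against floating-point overshoot at the very end.
        while i < N and s_j > cum[i]:
            i += 1
        hits[i - 1] += 1
    return hits
```

For example, with sizes [10, 1, 1] and n = 3 the interval is 4, so the first unit (size 10) is always hit at least twice, which illustrates why the condition on the largest size is needed to avoid duplicates.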

PPS Sequential Sampling (Chromy)

This algorithm selects n units from N population units sequentially proportional to size with minimum replacement. This method is proposed by Chromy (1979).

Inclusion probability of i-th unit = $n p_i$
Sampling weight of i-th unit = $1/(n p_i)$

Maximum number of hits of i-th unit = trunc($n p_i$) + 1

By applying the restriction $\max_i M_i \le M/n$, we can ensure that the maximum number of hits is equal to 1.

Algorithm

1. Select one unit from the population with probability proportional to its size $M_i$. The selected unit receives the label 1. Then assign labels sequentially to the remaining units. If the end of the list is encountered, loop back to the beginning of the list until all N units are labeled. These labels are the index i in the subsequent steps.
2. Compute the integer part of the expected hit counts $I_i = \mathrm{trunc}(\bar{M}_i)$, where $\bar{M}_i = \sum_{k=1}^{i} M_k$, i=1,...,N.
3. Compute the fractional part of the expected hit counts $F_i = \bar{M}_i - I_i$, i=1,...,N.
4. Define $I_0 = 0$, $F_0 = 0$ and $T_0 = 0$.
5. Set i=1.
6. If $T_{i-1} = I_{i-1}$, go to step 8.
7. If $T_{i-1} = I_{i-1} + 1$, go to step 9.
8. Determine accumulated hits at i-th step (case 1).

a) Set $T_i = I_i$.

b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$ with probability $(F_i - F_{i-1})/(1 - F_{i-1})$.

c) Set i=i+1.

d) If i > N, terminate. Otherwise go to step 6.

9. Determine accumulated hits at i-th step (case 2).

a) Set $T_i = I_i$.

b) If $F_i > F_{i-1}$, set $T_i = T_i + 1$.

c) If $F_{i-1} \ge F_i$, set $T_i = T_i + 1$ with probability $F_i / F_{i-1}$.

d) Set i=i+1.

e) If i > N, terminate. Otherwise go to step 6.

At the end of the algorithm, the number of hits of each unit can be computed by the formula $h_i = T_i - T_{i-1}$, i=1,...,N. Units with $h_i > 0$ are selected.
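The sequential procedure can be sketched in Python as below. Two assumptions are made that the text does not state explicitly: the cumulated quantities in steps 2 and 3 are taken to be cumulated expected hit counts, that is, sizes rescaled by n/M so that they sum to n, and sizes are assumed to be positive integers so that exact rational arithmetic can be used. The function name is illustrative.

```python
import random
from fractions import Fraction

def chromy_sequential(sizes, n, rng=None):
    """Return the hit count of each unit (sizes: positive integers)."""
    rng = rng or random.Random()
    N = len(sizes)
    total = sum(sizes)
    # Step 1: random start drawn proportional to size, then circular labels.
    start = rng.choices(range(N), weights=sizes)[0]
    order = [(start + t) % N for t in range(N)]
    hits = [0] * N
    cum = Fraction(0)  # cumulated expected hits (assumed rescaling by n/M)
    I_prev, F_prev, T_prev = 0, Fraction(0), 0  # I_0, F_0, T_0
    for idx in order:
        cum += Fraction(n * sizes[idx], total)
        I_i = cum.numerator // cum.denominator  # integer part
        F_i = cum - I_i                         # fractional part
        T_i = I_i
        if T_prev == I_prev:
            # Case 1: previous accumulated hits were rounded down.
            if F_i > F_prev and rng.random() < (F_i - F_prev) / (1 - F_prev):
                T_i += 1
        else:
            # Case 2: previous accumulated hits were rounded up (F_prev > 0).
            if F_i > F_prev or rng.random() < F_i / F_prev:
                T_i += 1
        hits[idx] = T_i - T_prev  # h_i = T_i - T_{i-1}
        I_prev, F_prev, T_prev = I_i, F_i, T_i
    return hits
```

Exact fractions are used so that the last cumulated value is exactly n; with floating-point sums it could fall just below n and truncate to n-1.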

PPS Sampford’s Method

Sampford’s (1967) method selects n units out of N population units without replacement and with probabilities proportional to size.

Inclusion probability of i-th unit = npi

PPS Brewer’s Method (n=2)

Brewer’s (1963) method is a special case of Sampford’s method when n=2.

PPS Murthy’s Method (n=2)

Murthy’s (1957) method selects two units out of N population units with probabilities proportional to size without replacement.

Inclusion probability of i-th unit:

$$\pi_i = p_i \left( 1 + \sum_{k=1}^{N} \frac{p_k}{1-p_k} - \frac{p_i}{1-p_i} \right)$$

Sampling weight of i-th unit = inverse of inclusion probability

Algorithm

1. Select the first unit from the population with probabilities $p_k$, k=1,...,N.
2. If the first selected unit has index i, then select the second unit with probabilities $p_k/(1-p_i)$, k ≠ i.

Joint Inclusion Probabilities

The joint inclusion probability of population units i and j is given by

$$\pi_{ij} = \frac{p_i\,p_j\,(2 - p_i - p_j)}{(1-p_i)(1-p_j)}.$$
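The two draws and the resulting probabilities can be sketched in Python as follows. The function names are illustrative, and p is assumed to be a list of relative sizes that sum to 1.

```python
import random

def murthy_select(p, rng=None):
    """Draw two distinct units: the first with probabilities p_k, the second
    with probabilities p_k / (1 - p_i) among the remaining units."""
    rng = rng or random.Random()
    N = len(p)
    first = rng.choices(range(N), weights=p)[0]
    rest = [k for k in range(N) if k != first]
    # choices() normalizes the weights, which is the division by (1 - p_i).
    second = rng.choices(rest, weights=[p[k] for k in rest])[0]
    return first, second

def inclusion_prob(p, i):
    """pi_i = p_i * (1 + sum_k p_k/(1-p_k) - p_i/(1-p_i))."""
    s = sum(pk / (1 - pk) for pk in p)
    return p[i] * (1 + s - p[i] / (1 - p[i]))

def joint_inclusion_prob(p, i, j):
    """pi_ij = p_i p_j (2 - p_i - p_j) / ((1 - p_i)(1 - p_j))."""
    return p[i] * p[j] * (2 - p[i] - p[j]) / ((1 - p[i]) * (1 - p[j]))
```

As a consistency check, the inclusion probabilities sum to 2 (the sample size), and for each unit i the joint probabilities over all j ≠ i sum to the inclusion probability of i.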

Saved Variables

STAGEPOPSIZE

STAGEPOPSIZE saves the population sizes of each stratum in a given stage.

STAGESAMPSIZE

STAGESAMPSIZE saves the actual sample sizes of each stratum in a given stage. See the "Sample Size" section for details on sample size calculations.

STAGESAMPRATE

STAGESAMPRATE saves the actual sampling rate of each stratum in a given stage. It is computed by dividing the actual sample size by the population size. Due to the use of rounding and application of RATEMINSIZE and RATEMAXSIZE on sample size, the resulting STAGESAMPRATE may be different from sampling rate specified by the user.

STAGEINCLPROB

Stage inclusion probabilities depend on the selection method. The formulae are given in the individual sections of each selection method.

STAGEWEIGHT

It is equal to the inverse of stage inclusion probabilities.

SAMPLEWEIGHT

It is the product of previous weight (if specified) and all the stage weights.

STAGEHITS

This is the number of times a unit is selected in a given stage. When a WOR method is used the value is always 0 or 1. When a WR method is used it can be any nonnegative integer.

SAMPLEHITS

This is the number of times an ultimate sampling unit is selected. It is equal to STAGEHITS of the last specified stage.

STAGEINDEX

It is an index variable used to differentiate duplicated sampling units resulting from sampling with replacement. STAGEINDEX ranges from one to the number of hits of a selected unit.

References

Brewer, K.W.R. (1963). A Model of Systematic Sampling with Unequal Probabilities. Australian Journal of Statistics, 5, 93-105.

Chromy, J.R. (1979). Sequential Sample Selection Methods. Proceedings of the American Statistical Association, Survey Research Methods Section, 401-406.

Fan, C.T., Muller, M.E., and Rezucha, I. (1962). Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers. Journal of the American Statistical Association, 57, 387-402.

Fox, D.R. (1989). Computer Selection of Size-Biased Samples. The American Statistician, 43(3), 168-171.

Golmant, J. (1990). Correction: Computer Selection of Size-Biased Samples. The American Statistician, 44(2), 194.

Hanurav, T.V. (1967). Optimum Utilization of Auxiliary Information: πps Sampling of Two Units from a Stratum. Journal of the Royal Statistical Society, Series B, 29, 374-391.
