





The use of reweighted estimators for approximating selection probabilities in snowball sampling, a popular approach for sampling from hidden populations. The study compares the efficiency of different reweighting techniques in terms of bias correction and presents simulation results indicating the superiority of the suggested methods over traditional estimators with equal sample weights.






Snowball sampling, where existing study subjects recruit further subjects from among their acquaintances, is a popular approach when sampling from hidden populations. Since people with many in-links are more likely to be selected, there will be a selection bias in the samples obtained. In order to eliminate this bias, the sample data must be weighted. However, the exact selection probabilities are unknown for snowball samples and need to be approximated in an appropriate way. This paper proposes different ways of approximating the selection probabilities and develops weighting techniques using the inverse of the selection probabilities. Some numerical examples for small graphs and simulations on larger networks are provided to compare the efficiency of the weighting techniques. The simulation results indicate that the suggested re-weighted estimators should be preferred to traditional estimators with equal sample weights for the initial snowball sampling waves.
Standard sampling and estimation techniques require the sample selection to be done with known probabilities. However, for many populations of interest it is impractical or impossible to construct a sampling frame needed for the calculation of these probabilities. This could be due to the difficulty of locating members of the target population. These populations are referred to as hidden populations and are characterized by their lack of sampling frames and in some cases, also their strong privacy concerns. Examples of such populations are drug users, commercial prostitutes, illegal immigrants and the homeless. Kalton and Anderson (1986) discuss difficulties in statistical inference for rare or hidden populations and review different sampling procedures and their limitations.
One approach to sampling members of hard-to-reach populations while still obtaining unbiased estimates of population characteristics is through snowball sampling, where the initially sampled individuals will lead you to the other members of the hidden population, which in turn lead to other members and so on. Biernacki & Waldorf (1981) review problems and techniques of snowball sampling and applications of its use can be found in Welch (1975) and Snow et al. (1981). Using this approach to include elements from the hidden population will lead to a sample selection bias. This occurs since those with many contacts are more likely to be included in the sample. Frank (1977, 1979) and Thompson and Frank (2000) consider statistical problems for snowball sampling and Thompson (2006) treats a special case of snowball sampling called walk sampling, where only one further vertex is selected at each sampling stage.
Since snowball samples are selected with unequal probabilities, the sample mean can no longer be used as a basis for making unbiased estimation about population characteristics. The degrees of the sampled elements will, at each sampling wave, determine the sample sizes obtained and if these degrees are correlated to the outcome of the study variable, we may get large biases in our estimations. To obtain an unbiased estimate, it is necessary to weight the sample data in some way. Typically these weights are inversely related to probabilities of selection.
In this paper, the probabilities of selection at each sampling wave are approximated in different ways and weighting techniques are applied to the sample data. The weighting techniques proposed are all based on the link information in the obtained sample, and substitute the equal sample weights in traditional estimators. Some simulations on larger networks are performed. The simulations consider two special cases and compare the efficiency of the proposed weighting techniques in terms of bias correction when making estimations using snowball samples of five waves.
The snowballing process is as follows. First, a few members of the population are identified; these form the initial sample, also referred to as the starting seeds. A convenient way of finding the initial seeds is by site sampling. For instance, homeless people could be initially sampled at a shelter. The next step is then to ask each of the gathered seeds to identify other members of the population. Those who are not in the initial sample but are mentioned by at least one individual in the initial sample are part of the first wave of the snowball sample. Those who are neither members of the initial sample nor the first wave but are mentioned by at least one member of the first wave are said to belong to the second wave of the snowball sample, and so on. A wave is final if it mentions no individuals who have not been mentioned earlier, or when a predefined wave number or sample size is reached.
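The wave construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `snowball_waves`, the adjacency-dictionary representation and the six-vertex example graph are hypothetical and not taken from the paper, and the only stopping rules implemented are "no new individuals" and an optional maximum number of waves.

```python
def snowball_waves(adjacency, seeds, max_waves=None):
    """Return a list of waves; waves[0] is the initial sample (the seeds).

    adjacency maps each vertex to the list of vertices it names.
    """
    seen = set(seeds)
    waves = [sorted(seen)]
    # len(waves) - 1 is the number of waves collected after the initial sample.
    while max_waves is None or len(waves) - 1 < max_waves:
        current = waves[-1]
        # Everyone named by the current wave who has not been mentioned before.
        new = {v for u in current for v in adjacency[u]} - seen
        if not new:
            break  # final wave reached: no new individuals are mentioned
        seen |= new
        waves.append(sorted(new))
    return waves


# Hypothetical six-vertex graph, given as "who names whom".
adjacency = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2, 4],
    4: [3, 5],
    5: [4],
}

for k, wave in enumerate(snowball_waves(adjacency, seeds=[0])):
    print("initial sample" if k == 0 else f"wave {k}", wave)
```

Passing `max_waves=2` stops the process after the second wave even though further members could still be reached.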
Some notation is introduced. Let $U$ be a population with a known or unknown number of elements $N$. If the population is represented by a graph, the elements are the vertices and the contacts are the edges between the vertices. Each element is characterized by a real-valued property $y_i$ which is unknown but observable if element $i$ is sampled. Assuming that the population consists of drug-users, $y_i$ may be quantities like the average amount of money spent on drugs per week or an indicator variable which equals 1 if the subject has a permanent residence. We are interested in the average quantity
$$\bar{y}_U = \frac{\sum_U y_i}{N}$$
based on approximations of the sample selection probabilities by using the observed information about the relations between the elements in the sample. Using this information is rational since the probabilities of vertices in the population being included in the sample are correlated with their corresponding degrees. In other words, all weighting techniques presented here are based on the observed degrees in the sample. Practically, these re-weightings can quite easily be implemented on social networks by asking each sampled unit to name their outgoing relations, even if we stop the sampling after that specific unit. By doing this we obtain the degree of each sample element and can approximate the sample selection probabilities needed for re-weighting. In order to obtain $\sum_S \omega_i = 1$, all proposed weights are normalized according to
$$\omega_i = \frac{\vartheta_i}{\sum_S \vartheta_i},$$
where $\vartheta_i$ is defined for each of the four re-weightings (RW1-RW4) in the following subsections.
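The normalization can be illustrated with a short Python sketch. It assumes that the re-weighted estimator of the population mean is the weighted sum of the sampled values, with the normalized weights replacing the equal weights $1/n$ of the sample mean; the estimator itself is not shown in this excerpt, so that form is an assumption. The function names and the example values are hypothetical.

```python
def normalize_weights(theta):
    """Scale raw weights theta_i so that the normalized weights sum to 1."""
    total = sum(theta)
    if total <= 0:
        raise ValueError("raw weights must have a positive sum")
    return [t / total for t in theta]


def weighted_mean(y, theta):
    """Assumed re-weighted estimator: sum over the sample of omega_i * y_i."""
    omega = normalize_weights(theta)
    return sum(w * y_i for w, y_i in zip(omega, y))


# Hypothetical sample of three units with a binary study variable.
y = [1, 0, 0]
theta = [0.5, 0.25, 0.25]   # raw weights, e.g. inverse degrees 1/2, 1/4, 1/4

print(normalize_weights(theta))
print(weighted_mean(y, theta))
# With equal raw weights the estimator reduces to the ordinary sample mean.
print(weighted_mean(y, [1, 1, 1]))
```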
The first proposed technique, RW1, assumes that the degrees approximate the inclusion probabilities of each vertex $i$ in the population up to a constant of proportionality, i.e. $P(i \in S) \propto d_i$ for $i = 1, \dots, N$.
Thus, the selected sample elements are weighted inversely proportional to these probabilities and we have that
$$\vartheta_i = \frac{1}{d_i} \quad \text{for } i = 1, \dots, n.$$
Intuitively, these weights seem like a good and straightforward option. However, the degrees of the vertices in the initial sample will not affect their probabilities of selection. Taking this into consideration, RW2-RW4 are developed.
The second re-weighting technique, RW2, is similar to RW1 but with the difference that we change the weights of the starting elements for the reason mentioned above. The selection probability of an initial seed is arbitrarily set proportional to 2. The weight of the initial vertex is then proportional to $1/2$, and the weights for the remaining $(n - 1)$ draws are proportional to $1/d_i$. Thus we have that
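$$P(i \in S) \propto \begin{cases} 2 & \text{if vertex } i \text{ is a seed} \\ d_i & \text{if vertex } i \text{ is not a seed,} \end{cases}$$
and
$$\vartheta_i = \begin{cases} 1/2 & \text{if vertex } i \text{ is a seed} \\ 1/d_i & \text{if vertex } i \text{ is not a seed.} \end{cases}$$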
The third re-weighting method, RW3, approximates the sample selection probabilities by the degrees, but with an unknown constant, $c$, added to the inclusion probabilities of the vertices in the sample. Thus we have that
$$P(i \in S) \propto d_i + c \quad \text{for } i = 1, \dots, N,$$
and
$$\vartheta_i = \frac{1}{d_i + c} \quad \text{for } i = 1, \dots, n.$$
Throughout this paper, we will use the constant value c = 0.5. Note also that when c = 0, RW3 coincides with RW1.
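The raw weights of the first three techniques can be written compactly in Python. This is a sketch only: the function names and the `is_seed` flag list are hypothetical, and the weights are taken to be the inverses of the approximated selection probabilities ($1/d_i$ for RW1, $1/2$ for seeds and $1/d_i$ otherwise for RW2, and $1/(d_i + c)$ for RW3). The weights are returned unnormalized.

```python
def rw1_raw_weights(degrees):
    """RW1: theta_i = 1 / d_i for every sampled vertex."""
    return [1.0 / d for d in degrees]


def rw2_raw_weights(degrees, is_seed, seed_value=2.0):
    """RW2: theta_i = 1 / seed_value for seeds, 1 / d_i for all other vertices."""
    return [1.0 / seed_value if seed else 1.0 / d
            for d, seed in zip(degrees, is_seed)]


def rw3_raw_weights(degrees, c=0.5):
    """RW3: theta_i = 1 / (d_i + c); c = 0.5 is the value used in the paper."""
    return [1.0 / (d + c) for d in degrees]


# Hypothetical sample: one seed of degree 4 followed by three recruited vertices.
degrees = [4, 2, 2, 5]
is_seed = [True, False, False, False]

print(rw1_raw_weights(degrees))
print(rw2_raw_weights(degrees, is_seed))
print(rw3_raw_weights(degrees))
```

Keeping `c` and `seed_value` as parameters makes it easy to check the special case noted above, where RW3 with `c=0` returns the same weights as RW1.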
The last re-weighting technique presented here, RW4, is somewhat different from the previous ones. Here, we use the observed mean degree of the sample to approximate the inclusion probabilities. Assume that the inclusion probability of the initial vertex is $1/N$ and that the inclusion probability for each of the remaining $(n - 1)$ draws is equal to $d_i / \sum_U d_i$, where $d_i$ is the degree of vertex $i$. Thus, the sample selection probability of vertex $i$ is approximately proportional to
$$\frac{1}{N} + (n - 1)\frac{d_i}{\sum_U d_i},$$
and the weights are inversely proportional to this quantity. After multiplying by $\sum_U d_i/(n - 1)$, and estimating the population mean degree ($\sum_U d_i/N$) by the sample mean degree ($\bar{d} = \sum_{i=1}^{n} d_i/n$), we have that
$$\frac{1}{\vartheta_i} = d_i + \frac{\bar{d}}{n - 1}, \quad \text{for } i = 1, \dots, n.$$
In this expression all terms can be calculated from the sample, implying that no information about the population is needed.
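A minimal Python sketch of RW4 follows. It assumes the weights take the form $1/\vartheta_i = d_i + \bar{d}/(n - 1)$, which is what the steps described above give when $1/N + (n - 1)\, d_i / \sum_U d_i$ is multiplied by $\sum_U d_i/(n - 1)$ and the population mean degree is replaced by the sample mean degree $\bar{d}$. The function name and the example degrees are hypothetical.

```python
def rw4_raw_weights(degrees):
    """RW4 (assumed form): theta_i = 1 / (d_i + mean_degree / (n - 1)).

    Only sample quantities are used: the observed degrees, their mean,
    and the sample size n.
    """
    n = len(degrees)
    if n < 2:
        raise ValueError("RW4 needs at least two sampled vertices (n - 1 > 0)")
    mean_degree = sum(degrees) / n
    return [1.0 / (d + mean_degree / (n - 1)) for d in degrees]


# Hypothetical sample of two vertices with degrees 2 and 4:
# mean degree 3, n - 1 = 1, so 1/theta is 2 + 3 = 5 and 4 + 3 = 7.
print(rw4_raw_weights([2, 4]))
```

As the sample grows, the correction term $\bar{d}/(n - 1)$ shrinks and the weights approach the inverse degrees $1/d_i$.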
In this section, simulations are performed to evaluate how the different re-weightings (RW1-RW4) work for larger networks. When the degrees $d_i$ of all vertices $i \in U$ are determined, the networks can be generated by the algorithm presented in Shafie (2009), where the creation of a simple graph with only one type of undirected edges is described. It is only possible to construct such networks when a number of conditions are satisfied, one of which is that no degree may be larger than half of the sum of degrees, or larger than $(N - 1)$. In addition to earlier assumptions in this paper, we here assume that the vertices consist of two separate groups, denoted A and B. The degrees of the vertices in the two groups, $d_A$ and $d_B$, are kept fixed and the snowballing from this network starts with one initial seed chosen randomly from the graph population. This initial seed is denoted by $S_0$. We consider the case when sampling is done in waves, that is, the sample sizes obtained at each wave depend on the degrees of the vertices selected at the previous wave. Thus, as our sampling procedure proceeds to the subsequent waves, the
For the second simulation case, the network size remains the same ($N = 100$), but we drop the assumption about equally sized groups and instead assume that group A elements constitute only 20% of the population. Further, we assume a larger divergence between the degrees of those in group A and those in group B. The degrees are set to $d_A = 10$ and $d_B = 2$. As for the first simulation case, we use the traditional and the re-weighted estimators for estimating $\pi_A$. The results are plotted in Figure 5. The results are consistent with those for Case 1: $P_{A,RW4}$ gives the smallest biases for the first three snowballing waves, and for the subsequent waves the traditional estimator with equal sample weights should be preferred.
Assume that we are interested in the estimation of another population proportion, denoted $Q$. For instance, assume that the population consists of drug-users grouped by gender (A or B) and let $Q$ be the proportion of heroin users in the population of drug-users. The distribution of this variable over the graph could for instance be such that the majority of heroin users are in group B, consistent with the second simulation case given here. The expected value of $\hat{Q}$ is then a linear function of the estimated $P_A$:
$$E(\hat{Q}) = P_A Q_A + (1 - P_A) Q_B.$$
As seen, if the selection bias is ignored in the estimation of $P_A$, it will carry over to other estimates of population characteristics.
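A small numerical illustration shows how a bias in the estimated $P_A$ carries over to $\hat{Q}$. Only the group proportion of 20% comes from the second simulation case; the within-group proportions `q_a` and `q_b` and the biased estimate of $P_A$ are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def expected_q(p_a, q_a, q_b):
    """E(Q_hat) = P_A * Q_A + (1 - P_A) * Q_B."""
    return p_a * q_a + (1 - p_a) * q_b


true_p_a = 0.2        # group A share in the second simulation case
biased_p_a = 0.5      # hypothetical estimate that over-represents the high-degree group A
q_a, q_b = 0.1, 0.6   # hypothetical proportions of heroin users in groups A and B

q_true = expected_q(true_p_a, q_a, q_b)
q_biased = expected_q(biased_p_a, q_a, q_b)

# Because E(Q_hat) is linear in P_A, the induced bias equals
# (biased_p_a - true_p_a) * (q_a - q_b).
print(round(q_true, 3), round(q_biased, 3), round(q_biased - q_true, 3))
```

The induced bias grows with both the error in $P_A$ and the difference between $Q_A$ and $Q_B$, so it vanishes only when the study variable is distributed equally across the two groups.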
In this paper, four different ways of approximating the selection probabilities of snowball samples are presented using information about the degree of vertices in the obtained sample. These probabilities are then used to weight the sample data to eliminate the selection bias evident in the sampling procedure, i.e. the fact that people with many ingoing links are more likely to be sampled. These weights are inversely related to the probabilities of selection. The weighting techniques are applied on snowball samples (performed in waves) from small graphs of only six vertices with varying degrees, but also on larger simulated two-group population networks with fixed degrees in each respective group.
The results show that all re-weightings are to be preferred to equal sample weights, but only for the initial waves, where the selection bias is most visible. As the sampling fraction increases, the bias of the traditional estimator with equal weights decreases while the opposite occurs for the proposed estimators. General conclusions about the re-weighting techniques cannot be made since their performance is highly dependent on the graph size and structure. For the simulations made in this paper ($N = 100$), the fourth re-weighting, RW4, using the observed mean degrees of the samples obtained to estimate the inclusion probabilities, was shown to be preferable when estimating the group proportion of two-group populations. In this paper, the mean degree of the graph was estimated using a straight mean of the observed degrees of the sampled elements.
To further evaluate the proposed weighting techniques and in order to get somewhat general results, larger simulations need to be performed. Also, one should con-