Simple Random Sampling and Confidence Intervals in Statistics, Lab Reports of Statistics

This document introduces simple random sampling in statistics, where each subset of units from a population has an equal chance of being selected. It also discusses the use of histograms to understand the sampling distribution and the calculation of confidence intervals for population means and totals. It gives a rule of thumb for determining when the normal approximation is reasonable and discusses the limitations of simple random sampling, leading to the introduction of stratified random sampling.

Typology: Lab Reports

Pre 2010

Uploaded on 08/18/2009

koofers-user-kpf
Last updated April 8, 2008

Chapter 2: Environmental Sampling

This chapter discusses means of obtaining data for environmental studies. Either the data will come from a planned experiment in the lab or from sampling done in the field. This chapter discusses several methodologies for obtaining data in a scientifically valid way via sampling.

One of the key points to understand is that a valid sampling plan is needed in order to obtain useful data. If the scientist simply goes out into the field and picks sites to sample with no plan ahead of time, then biases and other problems can lead to poor or worthless data.

Example: Estimate the number of trees in a forest with a particular disease. How can we do this? One idea is to divide the forest into plots of, say, 1 acre each and then obtain a random sample of these plots. Count the number of diseased trees in each sampled acre. From this sample, we can use statistical principles to estimate the number of trees in the forest with the disease.

Some of the most well-known sampling designs used in practice and discussed here are as follows:

  • Simple Random Sampling
  • Stratified Random Sampling
  • Systematic Sampling
  • Double Sampling
  • Multistage Sampling

2.1 Introduction

First, we introduce some terminology and basic ideas.

Census: This occurs when one samples the entire population of interest.

The United States government tries to do this every 10 years. However, in practical problems, a true census is almost never possible.

In most practical problems, instead of obtaining a census, a sample is obtained by observing the population of interest, hopefully without disturbing the population. The sample will generally be a very tiny fraction of the whole population.

One must of course determine the population of interest – this is not always an easy problem. Also, the variable(s) of interest need to be decided upon.

Element: an object on which a measurement is taken.

Sampling Units: non-overlapping (usually) collections of elements from the population.

In some situations, it is easy to determine the sampling units (households, hospitals, etc.) and in others there may not be well-defined sampling units (acre plots in a forest for example).

Example. Suppose we want to determine the concentration of a chemical in the soil at a site of interest. One way to do this is to subdivide the region into a grid. The sampling units then consist of the points making up the grid. The obvious question then becomes – how to determine grid size. One can think of the actual chemical concentration in the soil at the site varying over continuous spatial coordinates. Any grid that is used will provide a discrete approximation to the true soil contamination. Therefore, the finer the grid, the better the approximation to the truth.

Frame: A list of the sampling units.

Sample: A collection of sampling units from the frame.

Notation:

  • N: number of units in the population
  • n: sample size (number of units sampled)
  • y: variable of interest

Two Types of Errors.

  • Sampling Errors – these result from the fact that we generally do not sample the entire population. For example, the sample mean will not equal the population mean. This statistical error is fine and expected. Statistical theory can be used to ascertain the degree of this error by way of standard error estimates.
  • Non-Sampling Errors – this is a catchall phrase that corresponds to all errors other than sampling errors such as non-response and clerical errors. Sampling errors cannot be avoided (unless a census is taken). However, every effort should be made to avoid non-sampling errors by properly training those who do the sampling and carefully entering the data into a database etc.

2.2 Simple Random Sampling (SRS)

One of the simplest sampling designs available is the simple random sample.

Furthermore, using counting techniques, it also follows that

$$\operatorname{var}(\bar{y}) = \frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right).$$

The factor (1 − n/N) is called the finite population correction factor, which is approximately equal to 1 when n is a tiny fraction of N. The square root of the variance of $\bar{y}$ is the standard error of the sample mean. This is usually estimated by

$$\widehat{SE}(\bar{y}) = \frac{s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}.$$

Example: Consider two populations of sizes $N_1 = 1{,}000{,}000$ and $N_2 = 1000$. Suppose the variance of a variable y is the same for both populations. Which will give a more accurate estimate of the population mean: an SRS of size 1000 from the first population or an SRS of size 30 from the second population? In the first case, 1000 out of a million is 1/1000th of the population. In the second case, 30/1000 is 3% of the population. Surprisingly, the sample from the larger population is more accurate: the standard error is driven mainly by the sample size n, and the finite population correction is close to 1 in both cases (√0.999 versus √0.97).
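The comparison in this example can be checked numerically. The following is a minimal sketch that simply evaluates the standard error formula above for the two designs; the function name `se_mean_srs` and the choice σ = 1 are illustrative assumptions, not part of the original notes.

```python
import math

def se_mean_srs(sigma, n, N):
    """Standard error of the sample mean under SRS without replacement."""
    return sigma / math.sqrt(n) * math.sqrt(1 - n / N)

sigma = 1.0  # assumed common standard deviation (arbitrary units)

se_large_pop = se_mean_srs(sigma, n=1000, N=1_000_000)  # 0.1% of the population
se_small_pop = se_mean_srs(sigma, n=30, N=1000)         # 3% of the population

print(f"n = 1000 from N = 1,000,000: SE = {se_large_pop:.4f} * sigma")
print(f"n = 30   from N = 1,000:     SE = {se_small_pop:.4f} * sigma")
```

The first standard error is several times smaller than the second, even though the sampled fraction is thirty times smaller.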

Confidence Intervals. A (1 − α)100% confidence interval for the population mean can be formed using the following formula:

$$\bar{y} \pm t_{\alpha/2,\,n-1}\,\widehat{SE}(\bar{y}) = \bar{y} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}},$$

where $t_{\alpha/2,\,n-1}$ is the α/2 critical value of the t-distribution on n − 1 degrees of freedom. This confidence interval is justified by applying a finite population version of the central limit theorem to the sample mean obtained from random sampling.
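As a rough illustration of the interval above, the sketch below computes it for a small made-up sample. The data, the population size N = 1000, and the helper name `srs_mean_ci` are all hypothetical; the t critical value is passed in from a t-table because the Python standard library has no t quantile function.

```python
import math
import statistics

def srs_mean_ci(sample, N, crit):
    """Confidence interval for the population mean from an SRS of a finite population.

    crit is the t critical value t_{alpha/2, n-1}, looked up by the caller.
    """
    n = len(sample)
    ybar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
    se = s / math.sqrt(n) * math.sqrt(1 - n / N)
    return ybar - crit * se, ybar + crit * se

# Hypothetical data: diseased-tree counts on n = 8 plots out of N = 1000 plots.
counts = [3, 0, 5, 2, 4, 1, 6, 3]
T_CRIT = 2.365  # t critical value for 95% confidence on 7 degrees of freedom

low, high = srs_mean_ci(counts, N=1000, crit=T_CRIT)
print(f"95% CI for the mean: ({low:.3f}, {high:.3f})")
```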

2.4 Estimating a Population Total

Often, interest lies in estimating the population total, call it $T_y$. For instance, in the diseased tree example, one may be interested in knowing how many trees have the disease. If the sampling unit is a square acre and the forest has N = 1000 acres, then $T_y = N\mu = 1000\mu$. Since μ is estimated by $\bar{y}$, we can estimate the population total by

$$t_y = N\bar{y} \tag{1}$$

and the variance of this estimator is

$$\operatorname{var}(t_y) = \operatorname{var}(N\bar{y}) = N^2\operatorname{var}(\bar{y}) = N^2\left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}.$$

Confidence Interval for Population Total. A (1 − α)100% confidence interval for the population total Ty is given by

$$t_y \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}\sqrt{N(N - n)}.$$

Sample Size Requirements.

When using a confidence interval to estimate μ or Ty, the total, we may require that our estimate lies within d units from the true population parameter. How large a sample size is required so that the half-width of the confidence interval is d? The following two formulas give the (approximate) sample size required for the population mean and total:

For the mean μ:

$$n \ge \frac{N\sigma^2 z_{\alpha/2}^2}{\sigma^2 z_{\alpha/2}^2 + N d^2}$$

and for the total $T_y$:

$$n \ge \frac{N^2\sigma^2 z_{\alpha/2}^2}{N\sigma^2 z_{\alpha/2}^2 + d^2},$$

where $z_{\alpha/2}$ is the standard normal critical value (for instance, if α = 0.05, then $z_{0.025} = 1.96$). These two formulas are easily derived by solving algebraically for n in the confidence interval formulas, with the t critical value replaced by $z_{\alpha/2}$.

Note that these formulas require that we plug a value in for σ^2 which is unknown in practice. To overcome this problem, one can use an estimate of σ^2 from a previous study or a pilot study. Alternatively, one can use a reasonable range of values for the variable of interest to get an estimate of σ^2 : σ ≈ Range/6.
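The two sample size formulas can be sketched in code as follows. The function names and the planning values (N = 3000 units, a guessed σ of 3.2, and a target half-width of 1000 for the total) are hypothetical and chosen only for illustration. Results are rounded up, since n must be a whole number.

```python
import math
from statistics import NormalDist

def z_crit(alpha):
    """Upper alpha/2 critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def n_for_mean(N, sigma, d, alpha=0.05):
    """Sample size so that the CI for the mean has half-width about d."""
    a = sigma ** 2 * z_crit(alpha) ** 2
    return math.ceil(N * a / (a + N * d ** 2))

def n_for_total(N, sigma, d, alpha=0.05):
    """Sample size so that the CI for the total has half-width about d."""
    a = sigma ** 2 * z_crit(alpha) ** 2
    return math.ceil(N ** 2 * a / (N * a + d ** 2))

N = 3000            # hypothetical number of sampling units
sigma_guess = 3.2   # hypothetical guess at sigma, e.g. from a pilot study

n_total = n_for_total(N, sigma_guess, d=1000)
# A half-width of d for the total corresponds to d / N for the mean.
n_mean = n_for_mean(N, sigma_guess, d=1000 / N)
print(n_total, n_mean)
```

Both calls describe the same precision requirement, so they return the same sample size, which is a quick consistency check on the two formulas.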

Example. Suppose a study is done to estimate the number of ash trees in a state forest consisting of N = 3000 acres. A sample of n = 100 one-acre plots is selected at random and the number of ash trees in each selected acre is counted. Suppose the average number of trees per acre was found to be $\bar{y} = 5.6$ with standard deviation s = 3.2. Find a 95% confidence interval for the total number of ash trees in the state forest.

The estimated total is $t_y = N\bar{y} = 3000(5.6) = 16800$ ash trees in the forest. Using the large-sample critical value 1.96, the 95% confidence interval is

$$16800 \pm 1.96\,\frac{3.2}{\sqrt{100}}\sqrt{3000(3000 - 100)} = 16800 \pm 1849.97.$$
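The arithmetic in this example can be reproduced with a few lines of code. This is only a sketch of the total formula given earlier; the function name `srs_total_ci` is an illustrative assumption, and the critical value 1.96 is the one used in the example.

```python
import math

def srs_total_ci(ybar, s, n, N, crit):
    """Estimated population total and CI half-width under SRS."""
    t_hat = N * ybar
    half_width = crit * (s / math.sqrt(n)) * math.sqrt(N * (N - n))
    return t_hat, half_width

# Values from the ash tree example.
t_hat, half_width = srs_total_ci(ybar=5.6, s=3.2, n=100, N=3000, crit=1.96)
print(f"{t_hat:.0f} +/- {half_width:.2f}")  # 16800 +/- 1849.97
```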

A Note of Caution. The confidence interval formulas given above for the mean and total will be approximately valid if the sampling distributions of the sample mean and total are approximately normal. However, the approximate normality may not hold if the sample size is too small and/or if the distribution of the variable is strongly skewed. To illustrate the problem, consider the following example. Suppose a survey is to be conducted to estimate the total number of students in Ohio public schools suffering from asthma. Let us take each county as a sampling unit. Then N = 88 for the eighty-eight counties in Ohio.

For the sake of illustration, suppose we know the number of students in each county suffering from asthma and that the data is given in the following table:

No.  County  Students with asthma
1    Adams   359
2    Allen   1296
3    Ashlan  520

Figure 1: Histogram of Number of Students with Asthma (x-axis: Number of Students; y-axis: Frequency). Actual distribution of student totals per county. Note that the distribution is very strongly skewed to the right.

To illustrate the sampling distribution of the estimated total $t_y = N\bar{y}$, 10,000 samples of size n were obtained and, for each sample, the total was estimated. The histograms show the sampling distribution of $t_y$ for sample sizes of n = 5, 25, and 50 in Figure 2, Figure 3, and Figure 4 respectively. The long vertical line denotes the true total of T = 131,260.

Clearly the sampling distribution of $t_y$, the estimated total, is not nearly normal for n = 5. We see a bimodal distribution, which results from the presence of lightly populated and heavily populated counties.
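A simulation of this kind can be sketched as follows. Because the full county table is not reproduced here, the population below is a made-up right-skewed set of 88 values drawn from a lognormal distribution, so the numbers will not match the figures. The function name, the seeds, the 2,000 replications (instead of 10,000), and the tabulated t critical values are all assumptions of this sketch. It also records how often the nominal 95% interval for the total covers the true total.

```python
import math
import random
import statistics

def simulate_total_estimates(population, n, crit, reps, seed=0):
    """Repeatedly draw an SRS of size n and estimate the population total."""
    rng = random.Random(seed)
    N = len(population)
    true_total = sum(population)
    estimates, covered = [], 0
    for _ in range(reps):
        sample = rng.sample(population, n)  # SRS without replacement
        t_hat = N * statistics.mean(sample)
        s = statistics.stdev(sample)
        half = crit * (s / math.sqrt(n)) * math.sqrt(N * (N - n))
        estimates.append(t_hat)
        covered += (t_hat - half <= true_total <= t_hat + half)
    return estimates, covered / reps

# Hypothetical right-skewed population standing in for the 88 county counts.
pop_rng = random.Random(1)
population = [round(pop_rng.lognormvariate(6.5, 1.0)) for _ in range(88)]
true_total = sum(population)

T_CRIT = {5: 2.776, 25: 2.064, 50: 2.010}  # t(0.025, n - 1) from a t-table

results = {}
for n, crit in T_CRIT.items():
    est, coverage = simulate_total_estimates(population, n, crit, reps=2000)
    results[n] = (statistics.mean(est), coverage)
    print(f"n = {n:2d}: mean estimate = {results[n][0]:.0f}, "
          f"true total = {true_total}, coverage = {coverage:.3f}")
```

A histogram of `est` for each n would play the role of Figures 2 to 4. The mean of the estimates should sit close to the true total for every n, since $t_y$ is unbiased, while the coverage for small n may fall below 95% when the population is strongly skewed.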

Cochran (1977) gives the following rule of thumb for populations with positive skewness: the normal approximation will be reasonable provided the sample size n satisfies

$$n \ge 25\,G_1^2,$$

where $G_1$ is the population skewness,

$$G_1 = \frac{\sum_{i=1}^{N}(y_i - \mu)^3}{N\sigma^3}.$$

For this particular example, we find

$$25\,G_1^2 = 357,$$

which is much bigger than the entire number of sampling units (counties)!
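The rule of thumb is easy to evaluate when the whole population is known. The sketch below uses a tiny made-up population rather than the county data, and it assumes σ is the population standard deviation computed with divisor N; the function names are illustrative.

```python
import math

def population_skewness(values):
    """G1 = sum((y_i - mu)^3) / (N * sigma^3), with sigma computed using divisor N."""
    N = len(values)
    mu = sum(values) / N
    sigma = math.sqrt(sum((y - mu) ** 2 for y in values) / N)
    return sum((y - mu) ** 3 for y in values) / (N * sigma ** 3)

def cochran_min_n(values):
    """Smallest n suggested by the rule n >= 25 * G1^2 (not rounded)."""
    return 25 * population_skewness(values) ** 2

toy_population = [1, 1, 1, 1, 10]  # hypothetical, strongly right-skewed
g1 = population_skewness(toy_population)
min_n = cochran_min_n(toy_population)
print(g1, min_n)
```

For this toy population the suggested minimum sample size exceeds the population size, which mirrors the situation in the county example.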

In order to get an idea of how well the 95% confidence interval procedure works for this data, we performed the sampling 10,000 times for various sample sizes and

Figure 2: Histogram of Total Estimate, n = 5 (x-axis: Total Estimate; y-axis: Frequency). SRS of n = 5 for estimating the total number of students. The thick vertical line marks the true total.

Figure 3: Histogram of Total Estimate, n = 25 (x-axis: Total Estimate; y-axis: Frequency). SRS of n = 25 for estimating the total number of students. The thick vertical line marks the true total.

If we obtain a sample of size n from a population of size N , and each unit in the population either has or does not have a particular attribute of interest (e.g. disease or no disease), then the number of items in the sample that have the attribute is a random variable having a hypergeometric distribution. If N is considerably larger than n, then the hypergeometric distribution is approximated by the binomial distribution. We omit the details of these two probability distributions.

The data for experiments such as these looks like $y_1, y_2, \ldots, y_n$, where

$$y_i = \begin{cases} 1 & \text{if the } i\text{th unit has the attribute} \\ 0 & \text{if the } i\text{th unit does not have the attribute.} \end{cases}$$

The population proportion is denoted by p and is given by

$$p = \frac{1}{N}\sum_{i=1}^{N} y_i.$$

We can estimate p using the sample proportion $\hat{p}$ given by

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

Note that in statistics, it is common to denote the estimator of a parameter such as p by ˆp (“p”-hat). This goes for other parameters as well.

Using simple random sampling, one can show that

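$$\operatorname{var}(\hat{p}) = \left(\frac{N - n}{N - 1}\right)\frac{p(1 - p)}{n}.$$

An unbiased estimator of this variance is given by

$$\widehat{\operatorname{var}}(\hat{p}) = \left(\frac{N - n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}.$$

An approximate (1 − α)100% confidence interval for the population proportion is given by

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{(N - n)\,\hat{p}(1 - \hat{p})}{N(n - 1)}}.$$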

This confidence interval is justified by assuming that the sample proportion behaves like a normal random variable which follows from the central limit theorem. The approximation is better when the true value of p is near 1/2. If p is close to zero or one, the distribution of ˆp tends to be skewed quite strongly unless the sample size is very large.
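A minimal sketch of this interval, using made-up survey numbers (30 units with the attribute in a sample of n = 100 from N = 1000) and an illustrative function name:

```python
import math

def srs_proportion_ci(successes, n, N, z=1.96):
    """Sample proportion and CI half-width under SRS from a finite population."""
    p_hat = successes / n
    var_hat = (N - n) / N * p_hat * (1 - p_hat) / (n - 1)  # unbiased variance estimate
    return p_hat, z * math.sqrt(var_hat)

# Hypothetical survey result.
p_hat, half_width = srs_proportion_ci(successes=30, n=100, N=1000)
print(f"{p_hat:.3f} +/- {half_width:.4f}")
```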

The sample size required to estimate p with confidence level (1 − α) with half-width d is given by

$$n \ge \frac{z_{\alpha/2}^2\,p(1 - p)\,N}{z_{\alpha/2}^2\,p(1 - p) + d^2(N - 1)}.$$

Note that this formula requires knowing p, which is what we are trying to estimate! There are a couple of ways around this problem. (1) Plug in p = 1/2 in the formula. Because p(1 − p) is largest at p = 1/2, this guarantees a sample size at least as large as necessary. (2) Use a guess for p, perhaps based on a previous study.
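Both workarounds can be compared with a short sketch of the sample size formula. The planning values (N = 1000, d = 0.05, and a guessed p of 0.1) and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def n_for_proportion(N, d, p=0.5, alpha=0.05):
    """Sample size so that the CI for a proportion has half-width about d."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    a = z ** 2 * p * (1 - p)
    return math.ceil(a * N / (a + d ** 2 * (N - 1)))

n_conservative = n_for_proportion(N=1000, d=0.05)         # option (1): p = 1/2
n_with_guess = n_for_proportion(N=1000, d=0.05, p=0.1)    # option (2): guessed p
print(n_conservative, n_with_guess)
```

The conservative choice p = 1/2 asks for noticeably more units than the guess p = 0.1, which is the price of not relying on prior information.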

2.7 Stratified Random Sampling.

Data is often expensive and time consuming to collect. Statistical ideas can be used to determine efficient sampling plans that will provide the same level of accuracy for estimating parameters with smaller sample sizes. The simple random sample works just fine, but we can often do better in terms of efficiency. There are numerous sampling designs that do a better job than simple random sampling. In this section we look at perhaps the most popular alternative to simple random sampling: Stratified Random Sampling.

The idea is to partition the population into K different strata. Often the units within a stratum will be more homogeneous. For stratified random sampling, one simply obtains a simple random sample in each stratum. Of course, the problem arises as to how many observations to allocate to each stratum. Another issue is how to define the strata in the first place.

There are three advantages to stratifying:

  1. Parameter estimation can be more precise with stratification.
  2. Sometimes stratifying reduces sampling cost, particularly if the strata are based on geographical considerations.
  3. We can obtain separate estimates of parameters in each of the strata, which may be of interest in and of themselves.

Examples.

  • Estimate the mean PCB level in a particular species of fish. We could stratify the population of fish based on sex and also on the lakes in which the fish live.
  • Estimate the proportion of farms in Ohio that use a particular pesticide. We could stratify on the basis of the size of the farm (small, medium, large) and/or on geographical location etc.

These two examples illustrate a couple of points about stratification. Sometimes the units fall naturally into different strata and sometimes they do not.

Notation. Let $N_i$ denote the size of the ith stratum for i = 1, 2, …, K, where K is the number of strata. Then the overall population size is

$$N = \sum_{i=1}^{K} N_i.$$

The average number of plants per acre using the two-strata sampling is estimated to be

$$\bar{y}_s = \frac{N_1\bar{y}_1}{N} + \frac{N_2\bar{y}_2}{N} = \frac{86(63.36)}{158} + \frac{72(192.83)}{158} = 122.36.$$

The standard error of this estimate is given by

$$\widehat{SE}(\bar{y}_s) = \sqrt{\left(\frac{N_1}{N}\right)^2\frac{s_1^2}{n_1}\left(1 - \frac{n_1}{N_1}\right) + \left(\frac{N_2}{N}\right)^2\frac{s_2^2}{n_2}\left(1 - \frac{n_2}{N_2}\right)}$$

$$= \sqrt{\left(\frac{86}{158}\right)^2\frac{(32.738)^2}{14}\left(1 - \frac{14}{86}\right) + \left(\frac{72}{158}\right)^2\frac{(80.782)^2}{12}\left(1 - \frac{12}{72}\right)} = 10.635.$$

Thus, with 95% confidence, we estimate that the average number of honeysuckle per acre in the forest is

122.36 ± 2(10.635) = 122.36 ± 21.270 plants.

It is interesting to note what would have happened if we had ignored the stratification and simply treated this as a simple random sample of size $n = n_1 + n_2 = 14 + 12 = 26$. The sample mean of all n = 26 acres is $\bar{y} = 123.12$, which is very close to the estimated mean found using the stratification formulas. The standard deviation for the n = 26 measurements is s = 88.100. The standard error of the mean using the simple random sampling formula is

$$\widehat{SE}(\bar{y}) = \frac{s}{\sqrt{n}}\sqrt{1 - \frac{n}{N}} = \frac{88.100}{\sqrt{26}}\sqrt{1 - \frac{26}{158}} = 15.792.$$

Thus, using a stratified sampling plan led to a much smaller standard error of the mean (10.635 compared to 15.792) than if we had just treated the data as a simple random sample. That is, the stratified design leads to a much more precise estimator of the mean. In addition, the stratification design allows us to obtain separate estimates of honeysuckle abundance in new and old growth parts of the forest.
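The stratified and simple random sampling calculations in this example can be reproduced with the sketch below. It uses only the summary statistics quoted in the example; the function names are illustrative assumptions.

```python
import math

def stratified_mean(N_h, ybar_h):
    """Stratified estimate of the population mean: sum of (N_h / N) * ybar_h."""
    N = sum(N_h)
    return sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N

def stratified_se(N_h, n_h, s_h):
    """Estimated standard error of the stratified mean."""
    N = sum(N_h)
    var = sum((Nh / N) ** 2 * sh ** 2 / nh * (1 - nh / Nh)
              for Nh, nh, sh in zip(N_h, n_h, s_h))
    return math.sqrt(var)

# Summary statistics from the honeysuckle example (two strata).
N_h = [86, 72]
n_h = [14, 12]
ybar_h = [63.36, 192.83]
s_h = [32.738, 80.782]

est = stratified_mean(N_h, ybar_h)
se_strat = stratified_se(N_h, n_h, s_h)

# Same data treated as one SRS of size 26 from N = 158, with s = 88.100.
se_srs = 88.100 / math.sqrt(26) * math.sqrt(1 - 26 / 158)

print(f"stratified mean = {est:.2f}, SE = {se_strat:.3f}; SRS SE = {se_srs:.3f}")
```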

2.8 Post-Stratification

Sometimes the stratum to which a unit belongs is unknown until after the data is collected. For example, values such as age or sex could be used to form strata, but these values may not be known until individual units are sampled. The idea of post-stratification is to take a simple random sample first and then sort the observations into strata afterwards. Once this is done, the data can be treated as if it were a stratified random sample. One difference, however, is that in a post-stratification setting, the sample sizes in each stratum are not fixed ahead of time but are instead random quantities. This will cause a slight increase in the variability of the estimated mean (or total).

Allocation in Stratified Random Sampling

If a stratified sample of size n is to be obtained, the question arises as to how to allocate the sample to the different strata. In deciding the allocation, three factors need to be considered:

  1. Total number of elements in each stratum.
  2. Variability in each stratum, and
  3. The cost of obtaining an observation from each stratum.

Intuitively, we would expect to allocate larger sample sizes to larger strata and/or strata with high variability. Surveys are often restricted by cost, so the cost may need to be considered. In some situations, the cost of sampling units in different strata could vary for various reasons (distance, terrain, etc.). The optimal allocation of the total sample n is to choose the sample size $n_i$ for the ith stratum so that

$$n_i \propto \frac{N_i\sigma_i}{\sqrt{c_i}},$$

where $c_i$ is the cost of sampling a single unit from the ith stratum. Therefore, the ith stratum will be allocated a larger sample size if its relative size or variance is big or its cost is low. If the costs are the same per stratum, then the optimal allocation is given by

$$n_i \propto N_i\sigma_i,$$

which is known as Neyman allocation.

A simple allocation formula is to use proportional allocation where the sample size allocated to each stratum is proportional to the size of the stratum. This will be nearly optimal if the cost and variance at each stratum are nearly equal.
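The three allocation rules differ only in the weights they use, which the sketch below makes explicit. The stratum sizes and standard deviations are borrowed from the honeysuckle example purely as planning values, and the per-unit costs are invented; the function name `allocate` is illustrative. The function returns fractional sample sizes, which would be rounded in practice.

```python
import math

def allocate(n, N_h, sigma_h=None, cost_h=None):
    """Split a total sample size n across strata with n_i proportional to
    N_i * sigma_i / sqrt(c_i). Omitting sigma_h and cost_h gives proportional
    allocation; omitting only cost_h gives Neyman allocation."""
    K = len(N_h)
    sigma_h = sigma_h or [1.0] * K
    cost_h = cost_h or [1.0] * K
    weights = [Nh * sh / math.sqrt(ch)
               for Nh, sh, ch in zip(N_h, sigma_h, cost_h)]
    total = sum(weights)
    return [n * w / total for w in weights]

N_h = [86, 72]
sigma_guess = [32.738, 80.782]   # planning values, assumed known in advance

proportional = allocate(26, N_h)
neyman = allocate(26, N_h, sigma_h=sigma_guess)
costly = allocate(26, N_h, sigma_h=sigma_guess, cost_h=[1.0, 4.0])  # made-up costs

print([round(x, 2) for x in proportional])
print([round(x, 2) for x in neyman])
print([round(x, 2) for x in costly])
```

Neyman allocation shifts sample toward the more variable second stratum, and making that stratum four times as expensive per unit pulls some of the sample back.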

Stratification for Estimating Proportions.

A population proportion can be thought of as a population mean where the variable of interest takes only the values zero or one. Stratification can be used to estimate a proportion, just as it can be used to estimate a mean. The formula for the stratified estimate of a population proportion is given by

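$$\hat{p}_s = \frac{1}{N}\sum_{i=1}^{K} N_i\,\hat{p}_i,$$

and the estimated variance of this estimator is given by

$$\widehat{\operatorname{var}}(\hat{p}_s) = \frac{1}{N^2}\sum_{i=1}^{K}\frac{N_i(N_i - n_i)\,\hat{p}_i(1 - \hat{p}_i)}{n_i - 1}.$$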
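These two formulas can be sketched in a few lines. The two-stratum data below are invented for illustration, and the function name is an assumption.

```python
def stratified_proportion(N_h, n_h, successes_h):
    """Stratified estimate of a proportion and its estimated variance."""
    N = sum(N_h)
    p_h = [a / n for a, n in zip(successes_h, n_h)]
    p_s = sum(Nh * p for Nh, p in zip(N_h, p_h)) / N
    var_s = sum(Nh * (Nh - nh) * p * (1 - p) / (nh - 1)
                for Nh, nh, p in zip(N_h, n_h, p_h)) / N ** 2
    return p_s, var_s

# Hypothetical strata: sizes 100 and 200, samples of 10 and 20,
# with 3 and 10 sampled units having the attribute.
p_s, var_s = stratified_proportion([100, 200], [10, 20], [3, 10])
print(f"p_s = {p_s:.4f}, SE = {var_s ** 0.5:.4f}")
```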

2.9 Systematic Sampling.

Another sampling design that is often easy to implement is a systematic sample. The idea is to randomly choose a unit from the first k elements of the frame and then sample every kth unit thereafter. This is called a one-in-k systematic sample. A systematic sample is typically spread more evenly over the population of interest.
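A one-in-k systematic sample is simple to draw from a frame held in a list. In this sketch the frame is just the numbers 1 to 100, and the function name and seed are illustrative assumptions.

```python
import random

def systematic_sample(frame, k, rng=random):
    """One-in-k systematic sample: random start among the first k units,
    then every kth unit thereafter."""
    start = rng.randrange(k)  # 0-based index of the randomly chosen first unit
    return frame[start::k]

frame = list(range(1, 101))   # hypothetical frame of 100 units
sample = systematic_sample(frame, k=10, rng=random.Random(42))
print(sample)
```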

The idea of cluster sampling is to obtain a simple random sample of primary units (clusters) and then to sample every unit within each selected cluster.

For example, suppose a survey of schools in the state is to be conducted to study the prevalence of lead paint. One could obtain a simple random sample of schools throughout the state. But this could lead to high costs due to a lot of travel. Instead, one could treat school districts as clusters and obtain a simple random sample of school districts. Once an investigator is in a particular school district, she could sample every school in the district.

A rule of thumb for determining appropriate clusters is that the number of elements in a cluster should be small (e.g. schools per district) relative to the population size and the number of clusters should be large. Note that one of the difficulties in sampling is obtaining a frame. Cluster sampling often makes this task much easier since it is often easy to compile a list of the primary sampling units (e.g. school districts).

Cluster sampling is often less efficient than simple random sampling because units within a cluster often tend to be similar. Thus, if we sample every unit within a cluster, we are in a sense obtaining redundant information. However, if the cost of sampling an entire cluster is not too high, then cluster sampling becomes appealing for the sake of convenience. Note that we can increase the efficiency of cluster sampling by increasing the variability within clusters. That is, when deciding on how to form clusters, say over a spatial region, one could choose clusters that are long and thin as opposed to square or circular so that there will be more variability within each cluster.

Estimation and standard error formulas for cluster sampling can be found in most textbooks on sampling (e.g. Scheaffer, Mendenhall, and Ott 1996).

Notation.

  • N = the number of clusters
  • n = number of clusters selected in a simple random sample
  • $m_i$ = number of elements in cluster i
  • $M = \sum_{i=1}^{N} m_i$ = total number of elements in the population
  • $y_i$ = the total of all observations in the ith cluster

The population mean μ is estimated by

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}.$$

This estimator is a special case of a ratio estimator which we shall introduce a bit later. The estimated variance of $\bar{y}$ is given by

$$\widehat{\operatorname{var}}(\bar{y}) = \frac{N - n}{N n \bar{M}^2}\,s_r^2,$$

where

$$s_r^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}\,m_i)^2}{n - 1},$$

and $\bar{M} = M/N$, the average size of a cluster for the population. Note that often in practice M and hence $\bar{M}$ are unknown, in which case $\bar{M}$ can be estimated by

$$\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i.$$
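The cluster estimator of the mean and its estimated variance can be sketched as follows. The four cluster totals and sizes are made up, N = 20 clusters is an assumed population, and the function name is illustrative. When the average cluster size is not supplied, it is estimated by the sample average $\bar{m}$, as described above.

```python
def cluster_mean(y_tot, m, N, M_bar=None):
    """Cluster-sample estimate of the population mean and its estimated variance.

    y_tot: totals for the sampled clusters; m: sizes of the sampled clusters;
    N: number of clusters in the population; M_bar: average cluster size
    (estimated by the sample average if not supplied).
    """
    n = len(y_tot)
    ybar = sum(y_tot) / sum(m)
    if M_bar is None:
        M_bar = sum(m) / n
    s2_r = sum((y - ybar * mi) ** 2 for y, mi in zip(y_tot, m)) / (n - 1)
    var_hat = (N - n) / (N * n * M_bar ** 2) * s2_r
    return ybar, var_hat

# Hypothetical sample of n = 4 clusters from N = 20.
y_tot = [20, 30, 25, 45]
m = [4, 6, 5, 10]
ybar, var_hat = cluster_mean(y_tot, m, N=20)
print(f"ybar = {ybar:.2f}, SE = {var_hat ** 0.5:.4f}")
```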

Estimating the Population Total in Cluster Sampling. An estimate of the population total in cluster sampling can be obtained in much the same way it was obtained in simple random sampling:

$$t_y = M\bar{y}.$$

The estimated variance of $t_y$ is simply $M^2\,\widehat{\operatorname{var}}(\bar{y})$. What is wrong with using this estimator of the population total? The problem is that it requires that we know M, which is often unknown.

Alternatively, if we do not know M, we could estimate the population total using $N\bar{y}_t$, where

$$\bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i$$

is the average of the cluster totals for the sampled clusters. The estimated variance of $N\bar{y}_t$ is

$$\widehat{\operatorname{var}}(N\bar{y}_t) = \frac{N(N - n)\,s_t^2}{n},$$

where

$$s_t^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y}_t)^2}{n - 1}.$$

$N\bar{y}_t$ is an unbiased estimator of the population total, but because it does not use the information on the cluster sizes (e.g. the $m_i$'s), the variance of $N\bar{y}_t$ tends to be bigger than the variance of $t_y$.
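The two estimators of the total can be compared on a small made-up sample. In this sketch the cluster totals and sizes, N = 20, and M = 130 are all hypothetical, and the function names are illustrative.

```python
def cluster_total_unbiased(y_tot, N):
    """N * (average cluster total) and its estimated variance; M is not needed."""
    n = len(y_tot)
    ybar_t = sum(y_tot) / n
    s2_t = sum((y - ybar_t) ** 2 for y in y_tot) / (n - 1)
    return N * ybar_t, N * (N - n) * s2_t / n

def cluster_total_ratio(y_tot, m, N, M):
    """M * ybar and its estimated variance M^2 * var(ybar); requires M."""
    n = len(y_tot)
    ybar = sum(y_tot) / sum(m)
    s2_r = sum((y - ybar * mi) ** 2 for y, mi in zip(y_tot, m)) / (n - 1)
    M_bar = M / N
    var_ybar = (N - n) / (N * n * M_bar ** 2) * s2_r
    return M * ybar, M ** 2 * var_ybar

# Hypothetical sample of n = 4 clusters from N = 20 clusters, M = 130 elements.
y_tot = [20, 30, 25, 45]
m = [4, 6, 5, 10]

t_unb, v_unb = cluster_total_unbiased(y_tot, N=20)
t_ratio, v_ratio = cluster_total_ratio(y_tot, m, N=20, M=130)
print(f"N * ybar_t = {t_unb:.0f} (var {v_unb:.1f})")
print(f"M * ybar   = {t_ratio:.0f} (var {v_ratio:.1f})")
```

In this toy sample the cluster totals track the cluster sizes closely, so the estimator that uses the sizes has a much smaller estimated variance.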

Example. Roberts et al. (2004) used a cluster sampling approach to estimate the number of additional deaths in Iraq that resulted from the Iraq war that started in 2003. From this article, it was widely reported that the number of Iraqis killed from the war (so far) is 100,000. Their estimate of Iraqi deaths due to the war was 98,000 (not including Falluja, which had a very high number of deaths). A 95% confidence interval for this total was given as (8000, 194000). Thirty-three clusters were sampled based on Governorates and 30 households were interviewed in each cluster. The 33 clusters were sampled using a systematic sampling approach. Additional details can be found in the article.

value is denoted by $x_{2(2)}$, and so on.

$$\begin{array}{cccccl}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} & :\ x_{1(1)} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} & :\ x_{2(2)} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35} & :\ x_{3(3)} \\
x_{41} & x_{42} & x_{43} & x_{44} & x_{45} & :\ x_{4(4)} \\
x_{51} & x_{52} & x_{53} & x_{54} & x_{55} & :\ x_{5(5)}
\end{array}$$

An unbiased estimator of the mean is given by the ranked set mean estimator:

$$\bar{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} x_{i(i)}.$$

It can be shown that the ranked set sample mean is more efficient than the simple random sample mean, i.e. the variance of $\bar{\bar{x}}$ is less than the variance of the sample mean from an ordinary simple random sample. In fact, the increased efficiency of ranked set sampling can be quite substantial. Of course, if errors are likely when ranking the observations in each row above, then the efficiency of the ranked set sampling will decrease.
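The efficiency claim can be explored with a small simulation. This is only a sketch under strong assumptions: the population is taken to be normal with mean 10 and standard deviation 2, ranking within each row is perfect (done here by sorting the actual values, whereas in practice ranking is by judgment or a cheap proxy), and both estimators use n = 5 measured units. All names and settings are illustrative.

```python
import random
import statistics

def ranked_set_mean(draw, n, rng):
    """Ranked set mean: from row i of an n-by-n layout, keep the ith smallest value."""
    picks = []
    for i in range(n):
        row = sorted(draw(rng) for _ in range(n))  # perfect ranking assumed
        picks.append(row[i])
    return statistics.mean(picks)

def srs_mean(draw, n, rng):
    """Ordinary sample mean of n independent draws."""
    return statistics.mean([draw(rng) for _ in range(n)])

def draw(rng):
    """One unit from a hypothetical normal population (mean 10, sd 2)."""
    return rng.gauss(10.0, 2.0)

rng = random.Random(0)
rss_means = [ranked_set_mean(draw, 5, rng) for _ in range(2000)]
srs_means = [srs_mean(draw, 5, rng) for _ in range(2000)]

var_rss = statistics.variance(rss_means)
var_srs = statistics.variance(srs_means)
print(f"variance of ranked set mean: {var_rss:.3f}")
print(f"variance of SRS mean:        {var_srs:.3f}")
```

Both estimators measure five units per replicate, yet under these assumptions the ranked set mean varies considerably less across replicates.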

2.11 Ratio Estimation.

It is quite common that we will obtain auxiliary information on the units in our sample. In such cases, it makes good sense to use the information in this auxiliary information to improve the estimates of the parameters of interest, particularly if the auxiliary information provides information on the variable of interest.

Suppose x is the variable of interest and for each unit, there is another (auxiliary) variable u available. If u is correlated with x, then measurements on u provide information on x. Typically in practice, measurements on the variable u will be easier and/or less expensive to obtain, and we can then use this information to get a more precise estimator for the mean or total of x. For instance, suppose we want to estimate the mean number of European corn borer egg masses on corn stalks. It is time consuming to inspect each and every leaf of the plant for corn borers. We could do this on a sample of plants. However, it is relatively easy to count the number of leaves on each given stalk of corn. It seems plausible that the number of egg masses on a plant will be correlated with the number of leaves on the plant.

A common use of ratio estimation is in situations where u is an earlier measurement taken on the population and x represents the current measurement. In these situations, we can use information from the previous measurements to help in the estimation of the current mean or total.

Suppose we obtain a sample of pairs $(u_1, x_1), \ldots, (u_n, x_n)$. We can compute the means of the two variables, $\bar{x}$ and $\bar{u}$, and form their ratio:

$$r = \frac{\bar{x}}{\bar{u}}.$$

Letting $\mu_x$ and $\mu_u$ denote the population means of x and u respectively, we would expect that

$$\frac{\mu_x}{\mu_u} \approx \frac{\bar{x}}{\bar{u}},$$

in which case $\mu_x \approx r\mu_u$.

Using this relationship, we can define the ratio estimator of the mean $\mu_x$ as

$$\bar{x}_{ratio} = r\mu_u,$$

and if N is the total population size, then the ratio estimator of the total τ is

$$t_x = rN\mu_u.$$

What is the intuition behind the ratio estimator? If the estimated ratio remains fairly constant regardless of the sample obtained, then there will be little variability in the estimated ratio and hence little variability in the estimated mean using the ratio estimator for the mean (or total).

Another way of thinking of the ratio estimator is as follows: suppose one obtains a sample and estimates $\mu_x$ using $\bar{x}$ and, for this particular sample, $\bar{x}$ underestimates the true mean $\mu_x$. Then the corresponding mean of u will also tend to underestimate $\mu_u$ for this sample if x and u are positively correlated. In other words, $\mu_u/\bar{u}$ will be greater than one. The ratio estimator of $\mu_x$ is

$$\bar{x}_{ratio} = r\mu_u = \bar{x}\left(\frac{\mu_u}{\bar{u}}\right).$$

From this relationship, we see that the ratio estimator takes the usual estimator $\bar{x}$ and scales it upwards by a factor of $\mu_u/\bar{u}$, which will help correct the under-estimation of $\bar{x}$.

There is a problem with the ratio estimator: it is biased. In other words, the ratio estimator of $\mu_x$ does not come out to $\mu_x$ on average. Because $r\bar{u} = \bar{x}$ and $E[\bar{u}] = \mu_u$, one can show that

$$E[\bar{x}_{ratio}] = \mu_x - \operatorname{cov}(r, \bar{u}).$$

However, the variability of the ratio estimator often tends to be smaller than the variability of the usual estimator $\bar{x}$, indicating that it may still be preferable.

An estimate of the variance of the ratio estimator $\bar{x}_{ratio}$ is given by the following formula:

$$\widehat{\operatorname{var}}(\bar{x}_{ratio}) = \left(1 - \frac{n}{N}\right)\frac{\sum_{i=1}^{n}(x_i - r u_i)^2}{n(n - 1)}. \tag{2}$$
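The ratio estimators of the mean and total, together with the variance estimate in equation (2), can be sketched as follows. The four (u, x) pairs, the known auxiliary mean $\mu_u = 2.6$, and the population size N = 100 are invented for illustration, and the function name is an assumption.

```python
def ratio_estimate(x, u, mu_u, N):
    """Ratio estimates of the mean and total of x, using the known mean mu_u
    of the auxiliary variable u, plus the estimated variance of the mean."""
    n = len(x)
    r = sum(x) / sum(u)        # same as xbar / ubar
    xbar_ratio = r * mu_u
    total = N * xbar_ratio
    var_hat = (1 - n / N) * sum((xi - r * ui) ** 2
                                for xi, ui in zip(x, u)) / (n * (n - 1))
    return r, xbar_ratio, total, var_hat

# Hypothetical sample of n = 4 pairs from a population of N = 100 units.
u = [1, 2, 3, 4]            # cheap auxiliary measurement (e.g. leaf counts)
x = [2.1, 3.9, 6.2, 7.8]    # variable of interest

r, xbar_ratio, total, var_hat = ratio_estimate(x, u, mu_u=2.6, N=100)
print(f"r = {r:.3f}, mean = {xbar_ratio:.3f}, total = {total:.1f}, "
      f"SE = {var_hat ** 0.5:.4f}")
```

Here the sample mean of u (2.5) is below the known population mean (2.6), so the ratio estimate of the mean is scaled up from the plain sample mean of 5.0.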

By the central limit theorem applied to the ratio estimator, $\bar{x}_{ratio}$ follows an approximate normal distribution for large sample sizes. In order to guarantee a good approximation, a rule of thumb in practice is to have n ≥ 30 and the coefficient of variation $\sigma_x/\mu_x < 0.10$. If the coefficient of variation is large, then the variability of the ratio estimator tends to be large as well.