2.3 Simple Random Sampling
• Simple random sampling without replacement (srswor) of size n is the probability
sampling design for which a fixed number of n units are selected from a population of N
units without replacement such that every possible sample of n units has equal probability
of being selected. A resulting sample is called a simple random sample or srs.
• Note: I will use SRS to denote a simple random sample and SR as an abbreviation of
‘simple random’.
• Some necessary combinatorial notation:
– (n factorial) n! = n × (n − 1) × (n − 2) × · · · × 2 × 1. This is the number of
unique arrangements or orderings (or permutations) of n distinct items. For example:
6! = 6 × 5 × 4 × 3 × 2 × 1 = 720.
– (N choose n) $\binom{N}{n} = \dfrac{N(N-1)\cdots(N-n+1)}{n!} = \dfrac{N!}{n!\,(N-n)!}$. This is the
number of combinations of n items selected from N distinct items (the order of
selection doesn't matter). For example, $\binom{5}{2} = \dfrac{5 \times 4}{2!} = 10$.
• There are $\binom{N}{n}$ possible SRSs of size n selected from a population of size N.
• For any SRS of size n from a population of size N, we have $P(S) = 1\big/\binom{N}{n}$.
• Unless otherwise specified, we will assume sampling is without replacement.
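The counting facts above can be checked numerically. The following is a minimal Python sketch (my choice of language, not part of the course's R/SAS code); the values N = 5 and n = 2 are illustrative and match the small population used in the demonstration that follows.

```python
from itertools import combinations
from math import comb, factorial

N, n = 5, 2  # illustrative population and sample sizes (assumption)

# Closed form: N! / (n! (N - n)!)
num_samples = factorial(N) // (factorial(n) * factorial(N - n))
assert num_samples == comb(N, n)

# Enumerate every possible SRS of unit labels 1..N to confirm the count.
all_samples = list(combinations(range(1, N + 1), n))

# Under SRS without replacement each sample has probability 1 / C(N, n).
prob_each = 1 / num_samples

print(factorial(6), num_samples, len(all_samples), prob_each)
```

Enumerating the samples with `itertools.combinations` is only practical for small N, but it makes the "every sample equally likely" definition concrete.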
2.3.1 Estimation of $\bar{y}_U$ and $t$
• A natural estimator for the population mean $\bar{y}_U$ is the sample mean $\bar{y}$. Because $\bar{y}$ is an
estimate of an individual unit's y-value, multiplying it by the population size N gives
an estimate $\hat{t}$ of the population total t. That is:
\[ \hat{\bar{y}}_U = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad\qquad \hat{t} = \frac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y} \tag{10} \]
• $\hat{\bar{y}}_U$ and $\hat{t}$ are design unbiased. That is, the average values of $\bar{y}$ and $N\bar{y}$ taken over all
possible SRSs equal $\bar{y}_U$ and $t$, respectively.
Demonstration of Unbiasedness: Suppose we have a population consisting of five y-values:
Unit i   1  2  3  4  5
y_i      0  2  3  4  7
which has the following parameters:
$N = 5$, $t = 16$, $\bar{y}_U = 3.2$, $S^2 = 6.7$, $S \approx 2.5884$.
Suppose a SRS of size n = 2 is selected. Then $P(S) = 1\big/\binom{5}{2} = 1/10$ for each of the 10 possible
SRSs.
All Possible Samples and Statistics from Example Population
Sample  Units  y-values  $\sum y_i$  $\hat{\bar{y}}_U = \bar{y}$  $\hat{t} = N\bar{y}$  $\hat{S}^2 = s^2$  $\hat{S} = s$
S1      1,2    0,2        2          1                            5                     2                  1.4142
S2      1,3    0,3        3          1.5                          7.5                   4.5                2.1213
S3      1,4    0,4        4          2                            10                    8                  2.8284
S4      1,5    0,7        7          3.5                          17.5                  24.5               4.9497
S5      2,3    2,3        5          2.5                          12.5                  0.5                0.7071
S6      2,4    2,4        6          3                            15                    2                  1.4142
S7      2,5    2,7        9          4.5                          22.5                  12.5               3.5355
S8      3,4    3,4        7          3.5                          17.5                  0.5                0.7071
S9      3,5    3,7        10         5                            25                    8                  2.8284
S10     4,5    4,7        11         5.5                          27.5                  4.5                2.1213
Column sum                           32                           160                   67                 22.6274
Expected value                       32/10 = 3.2                  160/10 = 16           67/10 = 6.7        22.6274/10 = 2.26274
E(estimator)                         $= \bar{y}_U$                $= t$                 $= S^2$            $\neq S$
The averages for estimators $\hat{\bar{y}}_U = \bar{y}$, $\hat{t} = N\bar{y}$, and $\hat{S}^2 = s^2$ equal the parameters that they
are estimating. This implies that $\bar{y}$, $N\bar{y}$, and $s^2$ are unbiased estimators of $\bar{y}_U$, $t$, and $S^2$.
Notation: $E(\hat{\bar{y}}_U) = \bar{y}_U$, $E(\hat{t}) = t$, $E(\hat{S}^2) = S^2$ or $E(\bar{y}) = \bar{y}_U$, $E(N\bar{y}) = t$, $E(s^2) = S^2$.
The average for estimator $\hat{S} = s$ does not equal the parameter $S$. This implies that $s$ is a
biased estimator of $S$. Notation: $E(\hat{S}) \neq S$ or $E(s) \neq S$.
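The demonstration can be reproduced by brute force. This is a small Python sketch (not the course's R code) that enumerates all 10 SRSs of size 2 from the five-unit population and averages each statistic; the variable names are my own.

```python
from itertools import combinations
from statistics import mean, stdev, variance

y = [0, 2, 3, 4, 7]   # population y-values from the demonstration
N, n = len(y), 2

t = sum(y)            # population total
ybar_U = t / N        # population mean
S2 = variance(y)      # population variance S^2 (divisor N - 1)
S = S2 ** 0.5

samples = list(combinations(y, n))          # all 10 equally likely SRSs
ybars = [mean(s) for s in samples]          # sample means
t_hats = [N * yb for yb in ybars]           # estimated totals N * ybar
s2s = [variance(s) for s in samples]        # sample variances s^2
ss = [stdev(s) for s in samples]            # sample standard deviations s

# Design expectation = plain average over all equally likely samples.
E_ybar, E_that, E_s2, E_s = mean(ybars), mean(t_hats), mean(s2s), mean(ss)

print(E_ybar, ybar_U)     # equal: ybar is unbiased for ybar_U
print(E_that, t)          # equal: N * ybar is unbiased for t
print(E_s2, S2)           # equal: s^2 is unbiased for S^2
print(round(E_s, 5), round(S, 5))  # differ: s is biased for S
```

Because every sample has the same probability 1/10, the expected value of each estimator is just the unweighted average of its column in the table.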
- The next problem is to study the variances of $\hat{\bar{y}}_U = \bar{y}$ and $\hat{t} = N\bar{y}$.
- Warning: In an introductory statistics course, you were told that the variance of the sample
mean is $V(\bar{Y}) = S^2/n$ $(= \sigma^2/n)$ and that its standard deviation is $S/\sqrt{n}$ $(= \sigma/\sqrt{n})$. This is
appropriate if a sample is taken from an infinite or extremely large population.
- However, we are dealing with finite populations that often are not considered extremely
large. In such cases, we have to adjust our variance formulas by $\dfrac{N-n}{N}$, which is known
as the finite population correction (f.p.c.).
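- Texts may rewrite the f.p.c. $\dfrac{N-n}{N}$ as either $1 - \dfrac{n}{N}$ or $1 - f$, where $f = n/N$ is the
fraction of the population that was sampled. By definition:
\[ V(\hat{\bar{y}}_U) = V(\bar{y}) = \left(\frac{N-n}{N}\right)\frac{S^2}{n} \qquad\qquad V(\hat{t}) = N^2\,V(\bar{y}) = N(N-n)\,\frac{S^2}{n} \tag{11} \]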
- Because $S^2$ is unknown, we use $s^2$ to get unbiased estimators of the variances in (11):
\[ \hat{V}(\hat{\bar{y}}_U) = \hat{V}(\bar{y}) = \left(\frac{N-n}{N}\right)\frac{s^2}{n} \qquad\qquad \hat{V}(\hat{t}) = N^2\,\hat{V}(\bar{y}) = N(N-n)\,\frac{s^2}{n} \tag{12} \]
- Taking a square root of a variance in (11) yields the standard deviation of the estimator.
- Taking a square root of an estimated variance in (12) yields the standard error of the
estimate.
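As a check on the variance formulas with the finite population correction, the sketch below (Python, with hypothetical helper names `var_ybar` and `var_that`) compares the formula values for the five-unit example population against the variance of $\bar{y}$ computed directly over all 10 possible samples.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

y = [0, 2, 3, 4, 7]   # example population from the unbiasedness demonstration
N, n = len(y), 2
S2 = variance(y)      # population variance S^2 (divisor N - 1)

def var_ybar(S2, N, n):
    """V(ybar) under SRS without replacement: ((N - n)/N) * S^2 / n."""
    return (N - n) / N * S2 / n

def var_that(S2, N, n):
    """V(t-hat) = N^2 * V(ybar) = N (N - n) S^2 / n."""
    return N * (N - n) * S2 / n

# Brute force: variance of ybar over all equally likely samples.
ybars = [mean(s) for s in combinations(y, n)]
brute_force = pvariance(ybars)

print(var_ybar(S2, N, n), brute_force)   # both 2.01 up to rounding
print(var_that(S2, N, n))                # 25 * 2.01 = 50.25
print(S2 / n)                            # 3.35: what S^2/n would give without the f.p.c.
```

Here the sampling fraction is f = 2/5, so ignoring the f.p.c. would overstate the variance of $\bar{y}$ considerably (3.35 versus 2.01).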
2.3.2 SRS With Replacement
• Consider a sampling procedure in which a sampling unit is randomly selected from the
population, its y-value recorded, and is then returned to the population. This process of
randomly selecting units with replacement after each stage is repeated n times. Thus, a
sampling unit may be sampled multiple times. A sample of n units selected by such a
procedure is called a simple random sample with replacement.
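• The estimators for SRS with replacement are: $\hat{\bar{y}}_U = \bar{y}$ and $\hat{V}(\hat{\bar{y}}_U) = \hat{V}(\bar{y}) = \dfrac{s^2}{n}$.
• Suppose we have two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of some parameter $\theta$.
$\hat{\theta}_1$ is less efficient than $\hat{\theta}_2$ for estimating $\theta$ if $V(\hat{\theta}_1) > V(\hat{\theta}_2)$.
$\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$ for estimating $\theta$ if $V(\hat{\theta}_1) < V(\hat{\theta}_2)$.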
• For most situations, the estimator for a SRS with replacement is less efficient than the
estimator for a SRS without replacement.
• There will be circumstances (such as sampling proportional to size) where we will consider
sampling with replacement. Unless otherwise stated, we assume that sampling is done
without replacement.
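The efficiency comparison can be illustrated on the five-unit example population. In this Python sketch I assume, as the with-replacement procedure implies, that each of the $N^n$ ordered sequences of draws is equally likely; the sketch then compares the true variance of $\bar{y}$ under both designs.

```python
from itertools import combinations, product
from statistics import mean, pvariance

y = [0, 2, 3, 4, 7]   # example population from Section 2.3.1
N, n = len(y), 2

# Without replacement: 10 equally likely unordered samples.
wor_means = [mean(s) for s in combinations(y, n)]
# With replacement: 25 equally likely ordered sequences of draws.
wr_means = [mean(s) for s in product(y, repeat=n)]

v_wor = pvariance(wor_means)   # true V(ybar), without replacement
v_wr = pvariance(wr_means)     # true V(ybar), with replacement

print(mean(wor_means), mean(wr_means))  # both 3.2: ybar is unbiased either way
print(v_wor, v_wr)                      # 2.01 versus 2.68
```

For this population the with-replacement estimator has the larger variance, so it is the less efficient of the two in the sense defined above.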
2.4 Two-Sided Confidence Intervals for $\bar{y}_U$ and $t$
• In an introductory statistics course, you were given the confidence interval formulas
\[ \bar{y} \pm z^* \frac{s}{\sqrt{n}} \qquad\text{and}\qquad \bar{y} \pm t^* \frac{s}{\sqrt{n}} \]
These formulas are applicable if a sample is taken from an infinite or extremely
large population. But when we are dealing with finite populations, we adjust our variance
formulas by the finite population correction.
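• In the finite population version of the Central Limit Theorem, we assume the estimators
$\hat{\bar{y}}_U = \bar{y}$ and $\hat{t} = N\bar{y}$ have sampling distributions that are approximately normal. That is,
\[ \hat{\bar{y}}_U \mathrel{\dot\sim} N\!\left(\bar{y}_U,\ \left(\frac{N-n}{N}\right)\frac{S^2}{n}\right) \qquad\text{and}\qquad \hat{t} \mathrel{\dot\sim} N\!\left(t,\ N(N-n)\,\frac{S^2}{n}\right) \]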
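• For large samples, approximate 100(1 − α)% confidence intervals for $\bar{y}_U$ ($\mu$) and $t$ ($\tau$) are
\[ \text{For } \bar{y}_U:\quad \bar{y} \pm z^* \sqrt{\left(\frac{N-n}{N}\right)\frac{s^2}{n}} \qquad\qquad \text{For } t:\quad N\bar{y} \pm z^* \sqrt{N(N-n)\,\frac{s^2}{n}} \tag{14} \]
or, equivalently,
\[ \bar{y} \pm z^* s \sqrt{\left(\frac{N-n}{N}\right)\Big/ n} \qquad\qquad N\bar{y} \pm z^* s \sqrt{N(N-n)/n} \tag{15} \]
where $z^*$ is the upper α/2 critical value from the standard normal distribution. Or, in
standard error (s.e.) notation,
\[ \hat{\bar{y}}_U \pm z^*\,\text{s.e.}(\hat{\bar{y}}_U) \qquad\qquad \hat{t} \pm z^*\,\text{s.e.}(\hat{t}) \]
For 90%, 95%, and 99%, $z^*$ = 1.645, 1.96, and 2.576, respectively.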
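- For smaller samples, approximate 100(1 − α)% confidence intervals for $\bar{y}_U$ and $t$ are
\[ \text{For } \bar{y}_U:\quad \bar{y} \pm t^* \sqrt{\left(\frac{N-n}{N}\right)\frac{s^2}{n}} \qquad\qquad \text{For } t:\quad N\bar{y} \pm t^* \sqrt{N(N-n)\,\frac{s^2}{n}} \tag{16} \]
or, equivalently,
\[ \bar{y} \pm t^* s \sqrt{\left(\frac{N-n}{N}\right)\Big/ n} \qquad\qquad N\bar{y} \pm t^* s \sqrt{N(N-n)/n} \tag{17} \]
where $t^*$ is the upper α/2 critical value from the t(n − 1) distribution.
- The quantity being added and subtracted from $\hat{\bar{y}}_U = \bar{y}$ or $\hat{t} = N\bar{y}$ in the confidence
interval is known as the margin of error.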
Example: Use the small population data again. For n = 2, $t^* \approx 6.314$ for a nominal 90%
confidence level.
All Possible Samples and Confidence Intervals from Example Population
Sample  y-values  $\sum y_i$  $\hat{\bar{y}}_U = \bar{y}$  $\hat{t} = N\bar{y}$  $\hat{S}^2 = s^2$  $\hat{S} = s$  $\hat{V}(\hat{\bar{y}}_U)$  $\hat{V}(\hat{t})$  90% CI for t
1       0,2        2          1                            5                     2                  1.4142         0.6                         15                  (-19.45, 29.45)
2       0,3        3          1.5                          7.5                   4.5                2.1213         1.35                        33.75               (-29.18, 44.18)
3       0,4        4          2                            10                    8                  2.8284         2.4                         60                  (-38.91, 58.91)
4       0,7        7          3.5                          17.5                  24.5               4.9497         7.35                        183.75              (-68.09, 103.09)
5       2,3        5          2.5                          12.5                  0.5                0.7071         0.15                        3.75                (0.27, 24.73)
6       2,4        6          3                            15                    2                  1.4142         0.6                         15                  (-9.45, 39.45)
7       2,7        9          4.5                          22.5                  12.5               3.5355         3.75                        93.75               (-38.63, 83.63)
8       3,4        7          3.5                          17.5                  0.5                0.7071         0.15                        3.75                (5.27, 29.73)
9       3,7        10         5                            25                    8                  2.8284         2.4                         60                  (-23.91, 73.91)
10      4,7        11         5.5                          27.5                  4.5                2.1213         1.35                        33.75               (-9.18, 64.18)
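The intervals in the table follow directly from formula (16) for $t$. The Python sketch below recomputes them; `ci_total` is a hypothetical helper name, and the critical value 6.314 is taken from the example text rather than computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

y = [0, 2, 3, 4, 7]    # example population
N, n = len(y), 2
t_star = 6.314         # upper 0.05 critical value of t(1), as quoted in the example

def ci_total(sample, N, t_star):
    """Two-sided t-based interval for the population total t under SRS."""
    n = len(sample)
    t_hat = N * mean(sample)
    v_hat = N * (N - n) * variance(sample) / n   # estimated variance of t-hat
    margin = t_star * sqrt(v_hat)                # margin of error
    return t_hat - margin, t_hat + margin

samples = list(combinations(y, n))
intervals = [ci_total(s, N, t_star) for s in samples]

# Count how many of the 10 intervals contain the true total.
covered = sum(lo <= sum(y) <= hi for lo, hi in intervals)

for s, (lo, hi) in zip(samples, intervals):
    print(s, f"({lo:.2f}, {hi:.2f})")
print("intervals containing t:", covered, "of", len(intervals))
```

With only n = 2 observations the intervals are very wide, which is why several of them extend well below zero even though no y-value is negative.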
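2.4.1 One-Sided Confidence Intervals for $\bar{y}_U$ and $t$
- Occasionally, a researcher may want a one-sided confidence interval. There are two types
of one-sided confidence intervals: upper and lower.
- Approximate upper and lower 100(1 − α)% confidence intervals for $\bar{y}_U$ and $t$ are:
upper, for $\bar{y}_U$:  $\left( \bar{y} - t^* s \sqrt{\left(\frac{N-n}{N}\right)\big/ n}\ ,\ \infty \right)$
upper, for $t$:  $\left( N\bar{y} - t^* s \sqrt{N(N-n)/n}\ ,\ \infty \right)$
lower, for $\bar{y}_U$:  $\left( -\infty\ ,\ \bar{y} + t^* s \sqrt{\left(\frac{N-n}{N}\right)\big/ n} \right)$
lower, for $t$:  $\left( -\infty\ ,\ N\bar{y} + t^* s \sqrt{N(N-n)/n} \right)$
where $t^*$ is the upper α critical value from the t(n − 1) distribution.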
- If the y-values cannot be negative, replace −∞ with 0 in the lower confidence interval
formulas. If the y-values cannot be positive, replace ∞ with 0 in the upper confidence
interval formulas.
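As a worked case of the one-sided formulas, the sketch below uses the ten Figure 1 sample counts that appear in the SAS data step of Section 2.4.3, with N = 400. It is written in Python rather than R; the critical value 1.8331 (upper 0.05 point of t(9)) is taken from a standard t table and is an assumption of the sketch, not a value printed in these notes.

```python
from math import sqrt
from statistics import mean, variance

counts = [33, 33, 30, 34, 39, 27, 32, 36, 35, 42]  # Figure 1 SRS (from the SAS datalines)
N, n = 400, len(counts)
t_star = 1.8331   # upper 0.05 critical value of t(9); table value (assumption)

ybar = mean(counts)
s2 = variance(counts)
se_mean = sqrt((N - n) / N * s2 / n)   # standard error of ybar with the f.p.c.
se_total = N * se_mean                 # standard error of t-hat = N * ybar

# One-sided 95% bounds for the intervals of the form (bound, infinity).
bound_mean = ybar - t_star * se_mean
bound_total = N * ybar - t_star * se_total
# One-sided 95% bounds for the intervals of the form (-infinity, bound).
upper_mean = ybar + t_star * se_mean
upper_total = N * ybar + t_star * se_total

print(round(ybar, 1), round(s2, 2), round(se_mean, 6), round(se_total, 4))
print(round(bound_mean, 3), round(bound_total, 2))
print(round(upper_mean, 3), round(upper_total, 2))
```

The standard errors agree with the SURVEYMEANS output shown later (1.336569 and 534.627596), and the finite one-sided bounds agree with the R output to within the rounding of the critical value.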
SRS taken from Figure 1 ($n = 10$, $t = 13354$, $\bar{y}_U = 33.385$, $\bar{y} = 34.1$, $s^2 = 18.32$)
SRS Example using Rathbun and Cressie (1994) Data
- To illustrate the application of simple random sampling to estimation of the population total t, consider the abundance data in Figure 2. The abundance counts correspond to the census data studied by Rathbun and Cressie (1994).
- This 200 × 200 m study region is located in an old-growth forest in Thomas County, Georgia. The data represent the number of longleaf pine trees located in each quadrat. The coordinates of the 584 tree locations are given in Cressie (1991).
- I have gridded the region into a 20 × 20 grid of 10 × 10 m quadrats. The total abundance is $t = 584$ and the mean abundance per quadrat is $\bar{y}_U = 584/400 = 1.46$. The population variance is $S^2 = 3.853$.
- There is only a weak spatial correlation of tree counts within the study region.
- The longleaf pine census data will be used to compare estimation properties of various sampling designs.
- Note the two relatively large values (14 and 16).
Figure 2
Longleaf Pine Data (Rathbun and Cressie 1994)
 1  1  1  1  1  2  1  0  0  0  4  5  0  1  0  1  2  1  0  1
 3  2  1  0  1  0  0  0  1  2  2  2  0  2  2  2  0  2  0  1
 7  4  1  1  1  1  0  0  0  2  2  0  4  3  2  4  2  1  2  2
 0  1  2  0  0  0  0  0  4  6  5  1  5  0  0  0  2  1  2  0
 1  1  0  2  3  2  0  0  2  1  3  1  4  1  1  1  2  2  1  1
 2  0  0  0  4  3  3  0  1 16  5  0  1  3  8  0  0  1  3  3
 0  0  1 14  3  3  1  2  0  8  0  2  0  3  9  0  4  2  1  0
 0  0  5  1  8  7  6  6  6  1  0  4  0  0  1  2  2  0  1  2
 0  0  2  2  3  2  2  3  1  1  1  3  0  0  2  2  0  3  4  0
 0  0  0  0  1  0  3  1  1  1  2  0  2  0  2  0  2  1  1  0
 1  8  7  7  8  0  5  0  1  0  1  2  0  0  2  4  2  2  2  4
 0  9  1  0  0  1  1  1  0  0  0  1  2  4  0  2  1  3  3  1
 0  0  0  1  0  2  4  3  1  2  2  0  0  1  1  2  2  0  2  4
 0  1  0  0  1  2  0  2  3  5  2  0  0  2  1  1  2  0  1  3
 1  0  0  1  1  0  0  0  2  2  2  1  1  1  0  0  2  0  0  0
 0  2  0  2  2  0  1  1  0  2  0  0  1  0  0  1  1  1  5  3
 0  0  0  3  2  1  0  0  0  0  0  2  1  0  1  1  1  3  1  2
 1  0  0  1  0  3  0  1  0  0  2  1  2  0  0  0  1  1  1  0
 0  0  0  0  0  0  0  1  1  1  0  1  0  3  0  2  0  1  1  0
 2  0  0  0  0  0  0  0  1  2  0  1  3  0  0  1  0  1  2  4
REFERENCES (for Figure 2 data)
Cressie, Noel (1991) Statistics for Spatial Data. Wiley, New York.
Rathbun, S.L. and Cressie, N. (1994) A space-time survival point process for a longleaf pine forest in southern Georgia. Journal of the American Statistical Association, 89, 1164-1174.
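The population parameters quoted for the longleaf pine data can be recomputed from the Figure 2 counts, and a single SRS can be drawn from them. The following Python sketch assumes the counts are read left to right in rows of 20; the sample size n = 20, the seed 446, and the helper name `srs_total` are my own illustrative choices, so the printed estimate is just one possible outcome.

```python
import random
from math import sqrt
from statistics import mean, variance

# Quadrat counts from Figure 2, 20 values per row (row order as printed).
GRID = """
1 1 1 1 1 2 1 0 0 0 4 5 0 1 0 1 2 1 0 1
3 2 1 0 1 0 0 0 1 2 2 2 0 2 2 2 0 2 0 1
7 4 1 1 1 1 0 0 0 2 2 0 4 3 2 4 2 1 2 2
0 1 2 0 0 0 0 0 4 6 5 1 5 0 0 0 2 1 2 0
1 1 0 2 3 2 0 0 2 1 3 1 4 1 1 1 2 2 1 1
2 0 0 0 4 3 3 0 1 16 5 0 1 3 8 0 0 1 3 3
0 0 1 14 3 3 1 2 0 8 0 2 0 3 9 0 4 2 1 0
0 0 5 1 8 7 6 6 6 1 0 4 0 0 1 2 2 0 1 2
0 0 2 2 3 2 2 3 1 1 1 3 0 0 2 2 0 3 4 0
0 0 0 0 1 0 3 1 1 1 2 0 2 0 2 0 2 1 1 0
1 8 7 7 8 0 5 0 1 0 1 2 0 0 2 4 2 2 2 4
0 9 1 0 0 1 1 1 0 0 0 1 2 4 0 2 1 3 3 1
0 0 0 1 0 2 4 3 1 2 2 0 0 1 1 2 2 0 2 4
0 1 0 0 1 2 0 2 3 5 2 0 0 2 1 1 2 0 1 3
1 0 0 1 1 0 0 0 2 2 2 1 1 1 0 0 2 0 0 0
0 2 0 2 2 0 1 1 0 2 0 0 1 0 0 1 1 1 5 3
0 0 0 3 2 1 0 0 0 0 0 2 1 0 1 1 1 3 1 2
1 0 0 1 0 3 0 1 0 0 2 1 2 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 1 1 0 1 0 3 0 2 0 1 1 0
2 0 0 0 0 0 0 0 1 2 0 1 3 0 0 1 0 1 2 4
"""
counts = [int(v) for v in GRID.split()]

N = len(counts)          # number of quadrats
t = sum(counts)          # population total
ybar_U = t / N           # population mean per quadrat
S2 = variance(counts)    # population variance (divisor N - 1)

def srs_total(sample, N):
    """Estimated total and its standard error for one SRS without replacement."""
    n = len(sample)
    t_hat = N * mean(sample)
    se = sqrt(N * (N - n) * variance(sample) / n)
    return t_hat, se

rng = random.Random(446)           # fixed seed so the draw is repeatable (arbitrary choice)
sample = rng.sample(counts, 20)    # SRS of n = 20 quadrats, without replacement
t_hat, se = srs_total(sample, N)

print(N, t, ybar_U, round(S2, 3))
print(round(t_hat, 1), round(se, 1))
```

Re-running with different seeds shows how much the estimate of t moves from sample to sample, which is what the standard error is meant to summarize.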
2.4.2 Using the R Survey Package for a SRS
R Code and Output for Figure 1 SRS Analysis
"count" "fpc" <- This is the contents of the data file fig1.txt
33 400 <- The first column contains the recorded responses
33 400 <- The second column is the population size N
R Code
source("c:/courses/st446/rcode/confintt.r")   # course script; confint.t() used below is assumed to be defined here
# t-based confidence intervals for SRS in Figure 1
library(survey)
srsdat <- read.table("c:/courses/st446/rcode/fig1.txt", header=TRUE)
srsdat
# SRS design: no clustering (id=~1); the fpc column holds the population size N
srs_design <- svydesign(id=~1, fpc=~fpc, data=srsdat)
srs_design
# Estimated population total with two-sided and one-sided intervals
esttotal <- svytotal(~count,srs_design)
print(esttotal,digits=15)
confint.t(esttotal,degf(srs_design),level=.95)
confint.t(esttotal,degf(srs_design),level=.95,tails='lower')
confint.t(esttotal,degf(srs_design),level=.95,tails='upper')
# Estimated population mean with two-sided and one-sided intervals
estmean <- svymean(~count,srs_design)
print(estmean,digits=15)
confint.t(estmean,degf(srs_design),level=.95)
confint.t(estmean,degf(srs_design),level=.95,tails='lower')
confint.t(estmean,degf(srs_design),level=.95,tails='upper')
R output for t-based confidence interval for SRS
> srsdat
count fpc
Independent Sampling design
total SE
count 13640 534.
mean( count ) = 13640.
SE( count ) = 534.
Two-Tailed CI for count where alpha = 0.05 with 9 df
mean( count ) = 13640.
SE( count ) = 534.
One-Tailed (Lower) CI for count where alpha = 0.05 with 9 df
5 % upper
12659.96724 infinity
mean( count ) = 13640.
SE( count ) = 534.
One-Tailed (upper) CI for count where alpha = 0.05 with 9 df
lower 95 %
-infinity 14620.
mean SE
count 34.1 1.
mean( count ) = 34.
SE( count ) = 1.
Two-Tailed CI for count where alpha = 0.05 with 9 df
mean( count ) = 34.
SE( count ) = 1.
One-Tailed (Lower) CI for count where alpha = 0.05 with 9 df
5 % upper
31.64992 infinity
mean( count ) = 34.
SE( count ) = 1.
One-Tailed (upper) CI for count where alpha = 0.05 with 9 df
lower 95 %
-infinity 36.
mean( count ) = 620.
SE( count ) = 289.
Two-Tailed CI for count where alpha = 0.05 with 19 df
mean( count ) = 620.
SE( count ) = 289.
One-Tailed (Lower) CI for count where alpha = 0.05 with 19 df
5 % upper
120.10415 infinity
mean( count ) = 620.
SE( count ) = 289.
One-Tailed (upper) CI for count where alpha = 0.05 with 19 df
lower 95 %
-infinity 1119.
mean SE
count 1.55 0.
mean( count ) = 1.
SE( count ) = 0.
Two-Tailed CI for count where alpha = 0.05 with 19 df
mean( count ) = 1.
SE( count ) = 0.
One-Tailed (Lower) CI for count where alpha = 0.05 with 19 df
5 % upper
0.30026 infinity
mean( count ) = 1.
SE( count ) = 0.
One-Tailed (upper) CI for count where alpha = 0.05 with 19 df
lower 95 %
-infinity 2.
2.4.3 Using SAS PROC Surveymeans for a SRS
DM 'LOG;CLEAR;OUT;CLEAR';   *** I recommend putting these two lines of code;
OPTIONS NODATE NONUMBER;    *** at the beginning of every SAS program ;

data SRS_Fig1;
  wgt = 400/10;   * wgt = N/n ;
  input count @@;
datalines;
33 33 30 34 39 27 32 36 35 42
;
proc surveymeans data=SRS_Fig1 total=400 mean clm sum clsum;
  var count;
  weight wgt;
  title1 'Simple Random Sample -- Example 1';
  title2 'Estimating the population mean and total from the data in Figure 1';
run;
===========================================================================
Simple Random Sample -- Example 1
Estimating the population mean and total from the data in Figure 1
The SURVEYMEANS Procedure
Data Summary
Number of Observations      10
Sum of Weights             400
Statistics
                          Std Error
Variable        Mean        of Mean        95% CL for Mean
count      34.100000       1.336569     31.0764709    37.
Variable         Sum        Std Dev         95% CL for Sum
count          13640     534.627596     12430.5884    14849.
2.5 Attribute Proportion Estimation
• Suppose we are interested in an attribute (characteristic) associated with the sampling
units. The population proportion p is the proportion of population units having that
attribute.
• Statistically, the goal is to estimate proportion p.
• Examples: the proportion of females (or males) in an animal population, the proportion of
consumers who own motorcycles, the proportion of married couples with at least 1 child...
• Statistically, we use an indicator function that assigns a $y_i$ value to unit i as follows:
\[ y_i = \begin{cases} 1 & \text{if unit } i \text{ possesses the attribute} \\ 0 & \text{otherwise} \end{cases} \]
Then $t = \sum_{i=1}^{N} y_i$ and $\bar{y}_U = \dfrac{1}{N}\sum_{i=1}^{N} y_i = p$. The population proportion p can be
expressed as a population mean $\bar{y}_U$. Therefore, we will, under certain conditions, be able
to apply the SRS methods for estimating $\bar{y}_U$.
• By taking a SRS of size n, we can estimate p with the sample proportion $\hat{p}$ of units that
possess that attribute: $\hat{p} = \dfrac{\sum_{i=1}^{n} y_i}{n} = \bar{y}$. The sample proportion $\hat{p}$ is unbiased for p.
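• For a finite population of 0 and 1 values, the population variance is
\[ S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - p)^2 = \frac{N}{N-1}\,p(1-p), \]
because $y_i^2 = y_i$ for 0/1 values, so that $\sum_{i=1}^{N} (y_i - p)^2 = Np - Np^2 = Np(1-p)$.
• Therefore, the variance of $\hat{p}$ is
\[ V(\hat{p}) = \left(\frac{N-n}{N}\right)\frac{S^2}{n} = \left(\frac{N-n}{N}\right)\left(\frac{N}{N-1}\right)\frac{p(1-p)}{n} \tag{18} \]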
• Because $S^2$ is unknown, we estimate it with $s^2 = \dfrac{n}{n-1}\,\hat{p}(1-\hat{p})$. Substitution provides
the unbiased estimator of $V(\hat{p})$:
\[ \hat{V}(\hat{p}) = \left(\frac{N-n}{N}\right)\frac{s^2}{n} \tag{19} \]
• The square root of $V(\hat{p})$ in (18) is the standard deviation of the estimator $\hat{p}$.
• The square root of $\hat{V}(\hat{p})$ in (19) is the standard error of $\hat{p}$.
• The effects of omitting the finite population correction (f.p.c.) from the formulas for large
and small samples apply here as they did earlier.
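The proportion formulas can be tried on the Figure 3 example that follows, where the SAS output reports 18 "present" quadrats in a SRS of n = 25 from N = 400. This Python sketch uses a hypothetical helper `prop_estimate`; the critical value 1.7109 (upper 0.05 point of t(24)) comes from a standard t table and is an assumption of the sketch.

```python
from math import sqrt

def prop_estimate(successes, n, N):
    """Sample proportion and its standard error under SRS without replacement."""
    p_hat = successes / n
    s2 = n / (n - 1) * p_hat * (1 - p_hat)   # s^2 for 0/1 data
    v_hat = (N - n) / N * s2 / n             # estimated variance of p-hat, with the f.p.c.
    return p_hat, sqrt(v_hat)

# 18 of the 25 sampled quadrats contain at least one tree (from the SAS output).
p_hat, se = prop_estimate(successes=18, n=25, N=400)

t_star = 1.7109   # upper 0.05 critical value of t(24); table value (assumption)
ci_90 = (p_hat - t_star * se, p_hat + t_star * se)

# For comparison: the true standard deviation of p-hat, using p = 249/400.
p = 249 / 400
sd_true = sqrt((400 - 25) / 400 * (400 / 399) * p * (1 - p) / 25)

print(p_hat, round(se, 6))
print(tuple(round(b, 4) for b in ci_90))
print(round(sd_true, 4))
```

The standard error 0.088741 and the lower 90% limit of about 0.5682 match the SURVEYMEANS output for the "present" level.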
Figure 3: The Presence/Absence of Longleaf Pine
Rathbun/Cressie data ($t = 249$, $N = 400$, $p = 0.6225$)
1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1
1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 0 1
1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1
0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 0
1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 0 0 1 1 1
0 0 1 1 1 1 1 1 0 1 0 1 0 1 1 0 1 1 1 0
0 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 1
0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0
0 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 0
1 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1
0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1
0 0 0 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1
0 1 0 0 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1
1 0 0 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0
0 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 1 1 1
0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1
1 0 0 1 0 1 0 1 0 0 1 1 1 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0
1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1
A simple random sample of size n = 25
R Code and Output for Figure 3 Example
source("c:/courses/st446/rcode/confintt.r")
# t-based confidence intervals for SRS in Figure 3
library(survey)
srsdat <- read.table("c:/courses/st446/rcode/fig3.txt", header=TRUE)
srsdat
srs_design <- svydesign(id=~1, fpc=~fpc, data=srsdat)
# The mean of the 0/1 presence indicator is the estimated proportion
estmean <- svymean(~presence,srs_design)
print(estmean,digits=15)
confint.t(estmean,degf(srs_design),level=.90)
confint.t(estmean,degf(srs_design),level=.90,tails='lower')
confint.t(estmean,degf(srs_design),level=.90,tails='upper')
R output for t-based confidence interval for SRS
> srsdat
presence fpc
mean SE
presence 0.72 0.
mean( presence ) = 0.
SE( presence ) = 0.
Two-Tailed CI for presence where alpha = 0.1 with 24 df
mean( presence ) = 0.
SE( presence ) = 0.
One-Tailed (Lower) CI for presence where alpha = 0.1 with 24 df
10 % upper
0.60305 infinity
mean( presence ) = 0.
SE( presence ) = 0.
One-Tailed (upper) CI for presence where alpha = 0.1 with 24 df
lower 90 %
-infinity 0.
SAS Code and Output for Figure 3 Example
DM 'LOG;CLEAR;OUT;CLEAR';
OPTIONS NODATE NONUMBER LS=72 PS=54;
DATA SRS_Fig3;
INPUT ind @@;
DATALINES;
DATA SRS_Fig3; set SRS_Fig3;
IF ind = 0 then pa = 'absent ';
IF ind = 1 then pa = 'present';
PROC SURVEYMEANS DATA=SRS_Fig3 TOTAL = 400 ALPHA = .10;
VAR pa;
TITLE 'Simple Random Sample -- Figure 3';
TITLE2 'Estimating population proportion p';
RUN;
Simple Random Sample -- Figure 3
Estimating population proportion p
The SURVEYMEANS Procedure
Data Summary
Number of Observations 25
Class Level Information
Class
Variable Levels Values
pa 2 absent present
Statistics
Std Error
Variable Level N Mean of Mean 90% CL for Mean
pa absent 7 0.280000 0.088741 0.12817428 0.
present 18 0.720000 0.088741 0.56817428 0.