




Main points of this file are CSGLM, Model Specification, Estimation method, Predicted values and residual, Estimation algorithm, Variance estimates, Standard Errors, Degrees of freedom





CSGLM is a procedure for regression analysis as well as analysis of variance and covariance
based on complex samples.
Complex sample data must contain both the values of the variables to be analyzed and the
information on the current sampling design. Sampling design includes the sampling method,
strata and clustering information, inclusion probabilities and the overall sampling weights.
Sampling design specification for CSGLM may include up to three stages of sampling. Any
of the following general sampling methods may be assumed in the first stage: random
sampling with replacement, random sampling without replacement and with equal probabilities,
and random sampling without replacement and with unequal probabilities. The first two sampling
methods can also be specified for the second and the third sampling stages.
n    Total number of elements in the sample.
p    Number of regression parameters in the model.
Y    Dependent variable vector containing values $y_i$, $i = 1, \ldots, n$.
X    $n \times p$ design matrix. The rows correspond to the observations and the columns to the model parameters. The $i$th row is $x_i'$, $i = 1, \ldots, n$.
W    Diagonal matrix with sampling weights $w_i$, $i = 1, \ldots, n$, on the diagonal.
B    Vector of $p$ unknown population parameters.
N    Total number of elements in the population.
Weights
Overall weights specified for each ultimate element are processed as given. See “Complex
Samples: Covariance Matrix of Total” (cs_covariance.pdf) for more information on weights
and variance estimation methods.
Let the linear model be specified by the equation $Y = X\beta + E$, where $Y$ is a vector of
observed dependent variable values, $X$ is the linear model design matrix, $\beta$ is a vector of
model parameters and $E$ is a vector of random errors with zero mean. Each column of the
design matrix corresponds to a parameter in the model equation. Each parameter corresponds
to one of the intercept, factor main effects, factor interaction effects, factor nested effects,
covariate effects and factors by covariates interaction effects. For every factor effect level
occurring in data there is a separate parameter. This results in an over-parametrized model.
Assuming that the entire finite population has been observed, we can obtain the least squares
parameter estimates for the linear model by solving the following normal equations

$$X_N' X_N B = X_N' Y_N$$

where $X_N$ and $Y_N$ denote the design matrix and dependent variable for all elements in the
given population. A solution vector for this system, estimating the model parameters $\beta$, is
denoted by $B$. In our analyses we take the established design-based approach concerned
with estimating the finite population parameters $B$ developed by Kish and Frankel (1974),
Fuller (1975), Shah, Holt and Folsom (1977) and others. See Särndal et al. (1992) for an
overview.
Estimates for the population matrices $X_N' X_N$ and $X_N' Y_N$ are given by $X'WX$ and
$X'WY$ respectively. We solve the following set of weighted normal equations

$$X'WX \hat{B} = X'WY$$

where $W$ is a diagonal matrix with sampling weights $w_i$, $i = 1, \ldots, n$, on the diagonal. A
solution for $B$ is then given by the equation

$$\hat{B} = (X'WX)^{-} X'WY$$

where $(X'WX)^{-}$ is a generalized g2 inverse of $X'WX$.
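A minimal sketch of the weighted normal equations in plain Python, assuming a full-rank design so that the generalized inverse reduces to an ordinary inverse. The data and the helper names `weighted_normal_equations` and `solve_2x2` are hypothetical and only illustrate $\hat{B} = (X'WX)^{-} X'WY$; an over-parametrized model would need a true generalized inverse instead.

```python
def weighted_normal_equations(X, y, w):
    """Return (X'WX, X'WY) for design rows X, responses y and sampling weights w."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * xi[a] * yi for xi, yi, wi in zip(X, y, w)) for a in range(p)]
    return xtwx, xtwy


def solve_2x2(A, b):
    """Solve a nonsingular 2 x 2 system A z = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("X'WX is singular; a generalized inverse is required")
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]


# Hypothetical sample: intercept plus one covariate, four sampled elements.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]        # constructed so that y = 1 + 2x exactly
w = [10.0, 20.0, 20.0, 50.0]    # overall sampling weights

xtwx, xtwy = weighted_normal_equations(X, y, w)
B_hat = solve_2x2(xtwx, xtwy)
print(B_hat)
```

Because the hypothetical responses lie exactly on a line, the weighted solution recovers that line whatever the weights are; with noisy data the weights would change the estimate.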
Predicted values and residuals
Predicted values for each observation are given by $\hat{y}_i = x_i' \hat{B}$, where $x_i'$ is the $i$th row of the
design matrix $X$. The vector of residuals $r$ is defined by $r_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$.
The residual sum of squares $r'Wr$ is computed directly by the following:

$$r'Wr = \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2$$
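The predicted values, residuals and weighted residual sum of squares can be sketched as follows. The data and the parameter vector `B_hat` are hypothetical and taken as given (it is not claimed to be the weighted least squares solution for these data); the function names are illustrative only.

```python
def predicted_and_residuals(X, y, B_hat):
    """Return predicted values x_i' B_hat and residuals y_i - y_hat_i."""
    y_hat = [sum(xij * bj for xij, bj in zip(xi, B_hat)) for xi in X]
    r = [yi - yh for yi, yh in zip(y, y_hat)]
    return y_hat, r


def weighted_rss(r, w):
    """Weighted residual sum of squares r'Wr with W = diag(w)."""
    return sum(wi * ri * ri for ri, wi in zip(r, w))


# Hypothetical inputs: intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.5, 2.5, 5.0]
w = [2.0, 4.0, 4.0]
B_hat = [1.0, 2.0]   # assumed parameter estimates, for illustration only

y_hat, r = predicted_and_residuals(X, y, B_hat)
rss = weighted_rss(r, w)
print(y_hat, r, rss)
```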
Variance estimates
Let $\hat{V}(d_T)$ be the sample design-based covariance matrix of the total $d_T$, computed by the methods
described in “Complex Samples: Covariance Matrix of Total” (cs_covariance.pdf). Then the
covariance matrix of $\hat{B}$ is estimated by

$$\hat{V}(\hat{B}) = (X'WX)^{-} \hat{V}(d_T) (X'WX)^{-}.$$
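For orientation, the total $d_T$ in this kind of sandwich estimator is usually built by Taylor linearization, as in Binder (1983). The definitions of $d_i$ and $d_T$ written out below are an assumption based on that standard approach, not a definition confirmed by this text.

```latex
% Assumed definitions (standard Taylor linearization of the weighted normal equations)
d_i = x_i \,\bigl(y_i - x_i' \hat{B}\bigr), \qquad
d_T = \sum_{i=1}^{n} w_i \, d_i ,
\qquad
\hat{V}(\hat{B}) = (X'WX)^{-} \, \hat{V}(d_T) \, (X'WX)^{-}
```

Under this assumption each $d_i$ is a $p$-vector, so $\hat{V}(d_T)$ is a $p \times p$ matrix and the product above has the same dimension as $(X'WX)^{-}$.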
Note: If any diagonal element of $\hat{V}(d_T)$ happens to be non-positive due to the use of the
Yates-Grundy-Sen estimator, all elements in the corresponding row and column are set to zero.
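The adjustment in the note can be sketched in a few lines. The function name `zero_nonpositive_variances` and the example matrix are hypothetical; the sketch only shows the row and column zeroing rule, not how the covariance matrix itself is obtained.

```python
def zero_nonpositive_variances(V):
    """Zero every row and column of V whose diagonal element is non-positive."""
    k = len(V)
    bad = {j for j in range(k) if V[j][j] <= 0}
    return [[0.0 if (a in bad or b in bad) else V[a][b] for b in range(k)]
            for a in range(k)]


# Hypothetical covariance estimate with a negative second diagonal element.
V = [[4.0, 1.0, 0.5],
     [1.0, -0.2, 0.3],
     [0.5, 0.3, 2.0]]

V_adjusted = zero_nonpositive_variances(V)
print(V_adjusted)
```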
Subpopulation estimates
When analyses are requested for a given subpopulation $S$, we redefine $(x_i', y_i)'$ as follows:

$$(x_i', y_i)' = \begin{cases} (x_i', y_i)' & \text{if the $i$th element is in } S \\ (0', 0)' & \text{otherwise} \end{cases}$$
When computing point estimates, this substitution is equivalent to including only the
subpopulation elements in the calculations. This is in contrast to computing the variance
estimates where all elements in the sample need to be included.
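The equivalence for point estimates can be checked numerically. The sketch below assumes that elements outside the subpopulation have their design row and response set to zero, and compares the resulting cross-products $X'WX$ and $X'WY$ with those computed from the subpopulation elements alone. All data and function names are hypothetical.

```python
def cross_products(X, y, w):
    """Return (X'WX, X'WY) for design rows X, responses y and weights w."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * xi[a] * yi for xi, yi, wi in zip(X, y, w)) for a in range(p)]
    return xtwx, xtwy


def restrict_to_subpopulation(X, y, in_S):
    """Zero the design row and response of every element outside S."""
    X_s = [list(xi) if s else [0.0] * len(xi) for xi, s in zip(X, in_S)]
    y_s = [yi if s else 0.0 for yi, s in zip(y, in_S)]
    return X_s, y_s


# Hypothetical sample of four elements; the second one is outside S.
X = [[1.0, 2.0], [1.0, 5.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 8.0, 4.0, 6.0]
w = [10.0, 10.0, 20.0, 20.0]
in_S = [True, False, True, True]

X_s, y_s = restrict_to_subpopulation(X, y, in_S)
zeroed = cross_products(X_s, y_s, w)          # all n elements, zeros outside S

subset = cross_products(                      # only the elements in S
    [xi for xi, s in zip(X, in_S) if s],
    [yi for yi, s in zip(y, in_S) if s],
    [wi for wi, s in zip(w, in_S) if s],
)
print(zeroed == subset)
```

The zeroed version keeps all sampled elements in place, which is what the variance computation needs even though the point estimates are unchanged.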
Standard errors
Let $\hat{B}_i$ denote a non-redundant parameter estimate. Its standard error is the square root of its
estimated variance:

$$SE(\hat{B}_i) = \sqrt{\hat{V}(\hat{B}_i)}$$

Standard error is undefined for redundant parameters.
Degrees of freedom
The degrees of freedom used for computing the statistics below are calculated as the difference between the number of primary sampling units
and the number of strata in the first stage of sampling. This quantity is denoted by $\nu$ in the formulas below.
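The degrees of freedom rule can be sketched as a simple count. The example assumes that each sampled element carries a hypothetical `(stratum, psu)` pair identifying its first-stage stratum and primary sampling unit, with PSU labels only unique within a stratum.

```python
def design_degrees_of_freedom(first_stage):
    """Number of distinct first-stage PSUs minus number of distinct strata.

    first_stage: iterable of (stratum, psu) pairs, one per sampled element.
    """
    psus = set(first_stage)                      # distinct (stratum, psu) pairs
    strata = {stratum for stratum, _ in psus}    # distinct strata
    return len(psus) - len(strata)


# Hypothetical sample: 2 strata, with 2 and 3 sampled PSUs respectively.
first_stage = [("north", 1), ("north", 1), ("north", 2),
               ("south", 1), ("south", 2), ("south", 3), ("south", 3)]

nu = design_degrees_of_freedom(first_stage)
print(nu)
```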
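A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for each non-redundant
model parameter $B_i$. Confidence bounds are given by

$$\hat{B}_i \pm SE(\hat{B}_i)\, t_{\nu,\, 1-\alpha/2}$$

where $SE(\hat{B}_i)$ is the estimated standard error of $\hat{B}_i$ and $t_{\nu,\, 1-\alpha/2}$ is the
$100(1 - \alpha/2)$th percentile of the $t$ distribution with $\nu$ degrees of freedom.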
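A worked illustration with hypothetical numbers (not taken from any data set): suppose $\hat{B}_i = 1.5$, $SE(\hat{B}_i) = 0.4$, $\nu = 10$ and $\alpha = 0.05$. The 97.5th percentile of the $t$ distribution with 10 degrees of freedom is approximately 2.228, so the bounds work out as follows.

```latex
\hat{B}_i \pm SE(\hat{B}_i)\, t_{\nu,\,1-\alpha/2}
  = 1.5 \pm 0.4 \times 2.228
  = 1.5 \pm 0.8912
  \quad\Longrightarrow\quad (0.6088,\; 2.3912)
```

With few primary sampling units per stratum, $\nu$ is small and the interval is noticeably wider than the normal-based interval $1.5 \pm 0.4 \times 1.96$.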
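Testing the hypothesis $H_0: B_i = 0$ for each non-redundant model parameter $B_i$ is performed
using the $t$ test statistic:

$$t(\hat{B}_i) = \frac{\hat{B}_i}{SE(\hat{B}_i)}.$$

The $p$-value for the two-sided test is given by the probability $P(|T| > |t(\hat{B}_i)|)$, where $T$ is
a random variable from the $t$ distribution with $\nu$ degrees of freedom.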
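A sketch of the $t$ statistic and its two-sided $p$-value using only the Python standard library. The tail probability is obtained by numerically integrating the $t$ density with Simpson's rule, which is a stand-in chosen for this illustration and not necessarily how any particular software computes it. The inputs and the helper names `t_pdf` and `two_sided_p_value` are hypothetical.

```python
import math


def t_pdf(x, nu):
    """Density of the t distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)


def two_sided_p_value(t_stat, nu, steps=2000):
    """P(|T| > |t_stat|) via composite Simpson integration on [0, |t_stat|].

    steps must be even for Simpson's rule.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    central = total * h / 3          # P(0 < T < |t_stat|)
    return max(0.0, 1.0 - 2.0 * central)


# Hypothetical estimate, standard error and design degrees of freedom.
B_i, se_i, nu = 1.5, 0.4, 10
t_stat = B_i / se_i
p_value = two_sided_p_value(t_stat, nu)
print(t_stat, p_value)
```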
The design effect $Deff(\hat{B}_i)$ for a non-redundant parameter estimate $\hat{B}_i$ is given by

$$Deff(\hat{B}_i) = \frac{\hat{V}(\hat{B}_i)}{\hat{V}_{srs}(\hat{B}_i)}.$$

Design effect is undefined for redundant parameters.
$\hat{V}(\hat{B}_i)$ is the estimate of variance of $\hat{B}_i$ under the appropriate sampling design, while
$\hat{V}_{srs}(\hat{B}_i)$ is the estimate of variance of $\hat{B}_i$ under the simple random sampling assumption.
The latter is computed as the $i$th diagonal element of the following matrix:

$$\hat{V}_{srs}(\hat{B}_i) = \left[ (X'WX)^{-} \hat{V}_{srs}(d_T) (X'WX)^{-} \right]_{ii}$$
where

$$\hat{V}_{srs}(d_T) = \frac{n}{n-1} \sum_{i=1}^{n} w_i d_i d_i'$$

with $d_i$ as specified earlier.
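Given the two covariance matrices, the design effects reduce to a ratio of diagonal elements. The sketch below assumes both matrices are already available and that the indices of redundant parameters are known; the matrices and the function name `design_effects` are hypothetical.

```python
def design_effects(V, V_srs, redundant=()):
    """Ratio of design-based to SRS variance for each parameter.

    Returns None for redundant parameters, where the design effect is undefined.
    """
    deff = []
    for i in range(len(V)):
        if i in redundant:
            deff.append(None)
        else:
            deff.append(V[i][i] / V_srs[i][i])
    return deff


# Hypothetical covariance estimates; the third parameter is redundant.
V = [[0.09, 0.01, 0.0],
     [0.01, 0.04, 0.0],
     [0.0, 0.0, 0.0]]
V_srs = [[0.06, 0.0, 0.0],
         [0.0, 0.05, 0.0],
         [0.0, 0.0, 0.0]]

deff = design_effects(V, V_srs, redundant={2})
print(deff)
```

A value above 1 indicates that the complex design inflates the variance relative to simple random sampling, and a value below 1 indicates a gain in precision.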
Each row $l_i'$ of matrix $L$ is also tested separately. The estimate for the $i$th row is given by
$l_i' \hat{B}$ and its standard error by $\sqrt{l_i' \hat{V}(\hat{B}) l_i}$.
See “Complex Samples: Model Testing” (cs_modeltesting.pdf) for additional tests and p-
value adjustments.
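The estimate and standard error for a single row of $L$ can be sketched as below. The parameter estimates, covariance matrix and row vector are hypothetical, and `linear_combination` is an illustrative name.

```python
import math


def linear_combination(l, B_hat, V):
    """Return l'B_hat and its standard error sqrt(l' V l)."""
    k = len(l)
    estimate = sum(lj * bj for lj, bj in zip(l, B_hat))
    variance = sum(l[a] * V[a][b] * l[b] for a in range(k) for b in range(k))
    return estimate, math.sqrt(variance)


# Hypothetical estimates and covariance matrix for three parameters.
B_hat = [2.0, 1.0, -0.5]
V = [[0.25, 0.0, 0.0],
     [0.0, 0.16, 0.04],
     [0.0, 0.04, 0.09]]
l = [0.0, 1.0, -1.0]     # difference between the second and third parameter

estimate, se = linear_combination(l, B_hat, V)
print(estimate, se)
```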
Custom tests
Custom hypothesis tests are conducted only when $L$ is such that $LB$ is estimable. This
condition is verified using the following equality:

$$L = L (X'WX)^{-} X'WX.$$
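A small sketch of the estimability check for an over-parametrized one-factor model with columns (intercept, level 1, level 2). The matrix `A` stands for $X'WX$ with hypothetical weight totals 2 and 3, and `G` is one generalized inverse of it, built by hand by inverting the leading $2 \times 2$ block and padding with zeros. That construction is an assumption made for this example. Exact fractions avoid any tolerance choice.

```python
from fractions import Fraction as F


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def is_estimable(L, G, A):
    """Check the equality L = L G A, where G is a generalized inverse of A."""
    return matmul(matmul(L, G), A) == L


# X'WX for columns (intercept, level 1, level 2); hypothetical weight totals 2 and 3.
A = [[F(5), F(2), F(3)],
     [F(2), F(2), F(0)],
     [F(3), F(0), F(3)]]

# Hand-built generalized inverse: inverse of the leading 2 x 2 block, zeros elsewhere.
G = [[F(1, 3), F(-1, 3), F(0)],
     [F(-1, 3), F(5, 6), F(0)],
     [F(0), F(0), F(0)]]

level1_mean = [[1, 1, 0]]      # intercept + level 1 effect
level1_effect = [[0, 1, 0]]    # level 1 effect alone
difference = [[0, 1, -1]]      # level 1 effect minus level 2 effect

print(is_estimable(level1_mean, G, A),
      is_estimable(level1_effect, G, A),
      is_estimable(difference, G, A))
```

The level effect on its own fails the check because it cannot be separated from the intercept in this parametrization, while the cell mean and the difference between levels pass.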
Default tests of model effects
For each effect specified in the model, Type III test L matrix is constructed such that LB is
estimable. It involves parameters only for the given effect and the containing effects and it
does not depend on the order of effects specified in the model. If such a matrix cannot be
constructed, the effect is not testable. Matrix K is always set to 0 when computing the test
statistics for model effects.
Hypothesis for the corrected model is that all the parameters except for the intercept are zero.
Estimated marginal means (EMMEANS) are based on the estimated cell means. For a given
fixed set of factors, or their interactions, we estimate marginal means as the mean value
averaged over all cells generated by the rest of the factors in the model. Covariates may be
fixed at any specified value. If not specified, the value for each covariate is set to its overall
mean estimate.
When missing cells are present in the data, EMMEANS may not be estimable. In such
circumstance, we provide a modified estimate proposed by Searle, Speed and Milliken
(1980) that ignores the non-estimable cells.
Each marginal estimate is finally constructed in the form $l' \hat{B}$ such that $l' B$ is estimable.
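As a small worked example, consider a hypothetical main-effects model with two factors of two levels each and no covariates, with parameter vector ordered as intercept, two effects for the first factor and two effects for the second. The symbols $\mu$, $a_j$ and $b_k$ are introduced only for this illustration.

```latex
% Hypothetical parameter vector: B = (\mu, a_1, a_2, b_1, b_2)'
% Cell mean for (first factor = j, second factor = k): \mu + a_j + b_k
% Marginal mean for level 1 of the first factor, averaged over both levels of the second:
l' = \bigl(1,\; 1,\; 0,\; \tfrac{1}{2},\; \tfrac{1}{2}\bigr),
\qquad
l'B = \mu + a_1 + \tfrac{1}{2}\,(b_1 + b_2)
```

Each cell mean is estimable in this model, so their average $l'B$ is estimable as well.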
Comparing EMMEANS
For a given factor in the model, a vector of EMMEANS is created for all levels of the factor.
This vector can be expressed in the form $\hat{M} = L\hat{B}$, where each row of the $L$ matrix is generated
as described above. Variance is then computed by the following formula:

$$\hat{V}(\hat{M}) = L \hat{V}(\hat{B}) L'.$$
A set of contrasts for the factor is created according to the selected contrast type. Let this set
of contrasts define the matrix $C$ used for testing the following hypothesis: $H_0: CM = 0$.
The Wald $\chi^2$ statistic is used for testing the given set of contrasts for the factor as follows:

$$\chi^2 = (C\hat{M})' \left( C \hat{V}(\hat{M}) C' \right)^{-} (C\hat{M})$$

The asymptotic distribution of the $\chi^2$ test statistic is chi-square with $r_I$ degrees of freedom,
where $r_I = \operatorname{rank}(C \hat{V}(\hat{M}) C')$.
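A sketch of the Wald statistic in the simplest case, assuming a single contrast row so that the generalized inverse reduces to division by a scalar variance. The estimated marginal means, their covariance matrix and the function name `wald_single_contrast` are hypothetical. Several contrast rows would require a matrix generalized inverse instead.

```python
def wald_single_contrast(c, M_hat, V_M):
    """Return (contrast estimate, Wald chi-square, degrees of freedom) for one contrast.

    With a single row c', the middle matrix c' V_M c is a scalar, so its rank
    is 1 when the variance is positive and 0 otherwise.
    """
    k = len(c)
    estimate = sum(cj * mj for cj, mj in zip(c, M_hat))
    variance = sum(c[a] * V_M[a][b] * c[b] for a in range(k) for b in range(k))
    if variance <= 0:
        return estimate, None, 0
    return estimate, estimate ** 2 / variance, 1


# Hypothetical estimated marginal means for a two-level factor.
M_hat = [10.0, 12.0]
V_M = [[1.0, 0.25],
       [0.25, 1.5]]
c = [1.0, -1.0]          # simple difference between the two levels

estimate, chi2, df = wald_single_contrast(c, M_hat, V_M)
print(estimate, chi2, df)
```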
Each row $c_i'$ of matrix $C$ is also tested separately. The estimate for the $i$th row is given by
$c_i' \hat{M}$ and its standard error by $\sqrt{c_i' \hat{V}(\hat{M}) c_i}$.
See “Complex Samples: Model Testing” (cs_modeltesting.pdf) for additional tests and p-value
adjustments. Substitute the following formula for the simple random sampling
covariance:

$$\hat{V}_{srs}(\hat{M}) = L \hat{V}_{srs}(\hat{B}) L'.$$
Binder, D. A. (1983), “On the variances of asymptotically normal estimators from complex
surveys”, International Statistical Review, 51, 279-292.
Fuller, W. A. (1975), “Regression analysis for sample survey”, Sankhya, Series C 37, 117-
Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons.
Kish, L. (1995), “Methods for Design Effects”, Journal of Official Statistics, 11, 119-127.
Kish, L. and Frankel, M. R. (1974), “Inference from complex samples”, Journal of the Royal
Statistical Society B, 36, 1-37.
Särndal, C. E., Swensson, B., and Wretman, J. H. (1992), Model Assisted Survey Sampling,
New York: Springer-Verlag.
Searle, S. R., Speed, F. M., and Milliken, G. A. (1980), “Population marginal means in the
linear model: an alternative to least squares means”, The American Statistician, 34, 216-221.
Shah, B. V., Holt, M. M., and Folsom, R. E. (1977), “Inference about regression models
from sample survey data”, Bulletin of the International Statistical Institute XLVII, 3, 43-