Csglm - Mathematics and Statistics - Study Notes, Study notes of Mathematical Statistics

Main points of this file are CSGLM, Model Specification, Estimation method, Predicted values and residual, Estimation algorithm, Variance estimates, Standard Errors, Degrees of freedom

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar

CSGLM
Introduction
CSGLM is a procedure for regression analysis as well as analysis of variance and covariance
based on complex samples.
Complex sample data must contain both the values of the variables to be analyzed and the
information on the current sampling design. Sampling design includes the sampling method,
strata and clustering information, inclusion probabilities and the overall sampling weights.
Sampling design specification for CSGLM may include up to three stages of sampling. Any
of the following general sampling methods may be assumed in the first stage: random
sampling with replacement; random sampling without replacement with equal probabilities;
and random sampling without replacement with unequal probabilities. The first two sampling
methods can also be specified for the second and the third sampling stages.
Notations
$n$  Total number of elements in the sample.
$p$  Number of regression parameters in the model.
$Y$  Dependent variable vector containing values $y_i$, $i = 1, \ldots, n$.
$X$  $n \times p$ design matrix. The rows correspond to the observations and the columns to the model parameters. The $i$th row is $x_i'$, $i = 1, \ldots, n$.
$W$  Diagonal matrix with sampling weights $w_i$, $i = 1, \ldots, n$, on the diagonal.
$B$  Vector of $p$ unknown population parameters.
$N$  Total number of elements in the population.
Weights
Overall weights specified for each ultimate element are processed as given. See “Complex
Samples: Covariance Matrix of Total” (cs_covariance.pdf) for more information on weights
and variance estimation methods.

Model Specification

Let the linear model be specified by the equation $Y = X\beta + E$, where $Y$ is a vector of observed dependent variable values, $X$ is the linear model design matrix, $\beta$ is a vector of model parameters and $E$ is a vector of random errors with zero mean. Each column of the design matrix corresponds to a parameter in the model equation. Each parameter corresponds to one of the following: the intercept, factor main effects, factor interaction effects, factor nested effects, covariate effects and factor-by-covariate interaction effects. For every factor effect level occurring in the data there is a separate parameter. This results in an over-parametrized model.
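To make the over-parametrization concrete, the following minimal sketch builds a design matrix with an intercept and one indicator column per observed factor level. The function name `overparametrized_design` and the sample data are hypothetical and not part of CSGLM. The point is only that the indicator columns always sum to the intercept column, so the columns are linearly dependent.

```python
def overparametrized_design(factor_values, covariate=None):
    """Build rows [1, I(level_1), ..., I(level_k), covariate] for a single factor."""
    levels = sorted(set(factor_values))  # only levels that occur in the data
    names = ["Intercept"] + [f"factor={lv}" for lv in levels]
    rows = []
    for i, value in enumerate(factor_values):
        # one indicator column for every level, with no reference level dropped
        row = [1.0] + [1.0 if value == lv else 0.0 for lv in levels]
        if covariate is not None:
            row.append(float(covariate[i]))
        rows.append(row)
    if covariate is not None:
        names.append("covariate")
    return names, rows


names, X = overparametrized_design(["a", "b", "a", "c"])
# In every row the indicator columns add up to the intercept column,
# which is the linear dependence behind the over-parametrization.
dependent = all(sum(row[1:]) == row[0] for row in X)
```

Because of this dependence, the cross-product matrix built from such a design is singular, which is why a generalized inverse is needed when solving for the parameters.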

Estimation method

Assuming that the entire finite population has been observed, we can obtain the least squares parameter estimates for the linear model by solving the following normal equations:

$$X_N' X_N B = X_N' Y_N$$

where $X_N$ and $Y_N$ denote the design matrix and the dependent variable vector for all elements in the given population. A solution vector for this system, estimating the model parameters $\beta$, is denoted by $B$. In our analyses we take the established design-based approach concerned with estimating the finite population parameters $B$, developed by Kish and Frankel (1974), Fuller (1975), Shah, Holt and Folsom (1977) and others. See Särndal et al. (1992) for an overview.

Estimates for the population matrices $X_N' X_N$ and $X_N' Y_N$ are given by $X'WX$ and $X'WY$ respectively. We solve the following set of weighted normal equations:

$$X'WX \hat{B} = X'WY$$

where $W$ is a diagonal matrix with sampling weights $w_i$, $i = 1, \ldots, n$, on the diagonal. A solution for $\hat{B}$ is then given by the equation

$$\hat{B} = (X'WX)^{-} X'WY$$

where $(X'WX)^{-}$ is a generalized g2 inverse of $X'WX$.
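The weighted normal equations can be illustrated with a small standard-library sketch. It assumes a full-rank design, so an ordinary Gauss-Jordan solve stands in for the generalized inverse. The helper names and the data are hypothetical, and this is not the algorithm CSGLM itself uses.

```python
def weighted_cross_products(X, y, w):
    """Return X'WX and X'WY for a row-list X, responses y and sampling weights w."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[a] * xi[b] for xi, wi in zip(X, w)) for b in range(p)]
            for a in range(p)]
    xtwy = [sum(wi * xi[a] * yi for xi, yi, wi in zip(X, y, w)) for a in range(p)]
    return xtwx, xtwy


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting; assumes A is nonsingular."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        d = M[col][col]
        M[col] = [v / d for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


# Intercept plus one covariate; y lies exactly on the line 1 + 2x,
# so the weighted solution is (1, 2) whatever the weights are.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 2.0, 1.0, 2.0]

xtwx, xtwy = weighted_cross_products(X, y, w)
B_hat = solve(xtwx, xtwy)
```

For an over-parametrized model $X'WX$ is singular and the plain solve above would fail, which is where the g2 inverse comes in.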

Predicted values and residuals

Predicted values for each observation are given by $\hat{y}_i = x_i' \hat{B}$, where $x_i'$ is the $i$th row of the design matrix $X$. The vector of residuals $r$ is defined by $r_i = y_i - \hat{y}_i$, $i = 1, \ldots, n$.
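As an illustration of these definitions, the sketch below computes predicted values, residuals and the weighted residual sum of squares $r'Wr = \sum_i w_i r_i^2$ straight from its definition. The parameter vector and the data are made up for the example.

```python
def predict(X, B_hat):
    """Predicted value for each row: the inner product x_i' B_hat."""
    return [sum(x * b for x, b in zip(row, B_hat)) for row in X]


def residuals(y, y_hat):
    """r_i = y_i - y_hat_i."""
    return [yi - fi for yi, fi in zip(y, y_hat)]


def weighted_rss(r, w):
    """r'Wr with W diagonal, i.e. the sum of w_i * r_i^2."""
    return sum(wi * ri * ri for ri, wi in zip(r, w))


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.5, 5.0]
w = [2.0, 1.0, 3.0]
B_hat = [1.0, 2.0]  # hypothetical estimates, not fitted here

y_hat = predict(X, B_hat)
r = residuals(y, y_hat)
rss = weighted_rss(r, w)
```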

The residual sum of squares $r'Wr$ is computed directly by the following:

Let $\hat{V}(\hat{d}_T)$ be the sample design-based covariance matrix of $\hat{d}_T$, computed by the methods described in “Complex Samples: Covariance Matrix of Total” (cs_covariance.pdf). Then the covariance matrix of $\hat{B}$ is estimated by

$$\hat{V}(\hat{B}) = (X'WX)^{-}\, \hat{V}(\hat{d}_T)\, (X'WX)^{-}$$

Note: If any diagonal element of $\hat{V}(\hat{d}_T)$ happens to be non-positive due to the use of the Yates-Grundy-Sen estimator, all elements in the corresponding row and column are set to zero.
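The sandwich form and the zeroing rule from the note can be sketched as follows. The generalized inverse and the covariance matrix of the total are taken as given inputs, because the methods that produce them are described elsewhere. All names and numbers here are illustrative assumptions.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def zero_nonpositive_diagonal(V):
    """Zero every row and column whose diagonal element is non-positive."""
    bad = {i for i in range(len(V)) if V[i][i] <= 0}
    n = len(V)
    return [[0.0 if (i in bad or j in bad) else V[i][j] for j in range(n)]
            for i in range(n)]


def sandwich(A_ginv, V_dT):
    """(X'WX)^- V(d_T) (X'WX)^- after applying the zeroing rule to V(d_T)."""
    cleaned = zero_nonpositive_diagonal(V_dT)
    return matmul(matmul(A_ginv, cleaned), A_ginv)


A_ginv = [[2.0, 0.0], [0.0, 1.0]]   # hypothetical generalized inverse of X'WX
V_dT = [[1.0, 0.5], [0.5, -0.2]]    # second diagonal element is negative

V_B = sandwich(A_ginv, V_dT)
```

In this example the second row and column are zeroed before the product is formed, so only the first parameter keeps a positive variance.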

Subpopulation estimates

When analyses are requested for a given subpopulation $S$, we redefine $(x_i', y_i)'$ as follows:

$$(x_i', y_i)' = \begin{cases} (x_i', y_i)' & \text{if the } i\text{th element is in } S \\ 0 & \text{otherwise} \end{cases}$$

When computing point estimates, this substitution is equivalent to including only the subpopulation elements in the calculations. This is in contrast to computing the variance estimates, where all elements in the sample need to be included.
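The equivalence for point estimates can be checked numerically. The sketch below assumes that the redefinition sets $x_i$ and $y_i$ to zero for elements outside $S$, and it uses an intercept-only model so that the weighted normal equations reduce to a single ratio. The data and names are hypothetical.

```python
def intercept_only_estimate(x, y, w):
    """Solve X'WX B = X'WY for a single column x: sum(w x y) / sum(w x^2)."""
    num = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    den = sum(wi * xi * xi for xi, wi in zip(x, w))
    return num / den


x = [1.0, 1.0, 1.0, 1.0]
y = [4.0, 6.0, 10.0, 20.0]
w = [1.0, 3.0, 2.0, 2.0]
in_S = [True, True, False, True]

# Approach 1: keep all n elements, zeroing x_i and y_i outside S (assumed substitution).
x_zeroed = [xi if s else 0.0 for xi, s in zip(x, in_S)]
y_zeroed = [yi if s else 0.0 for yi, s in zip(y, in_S)]
b_zeroed = intercept_only_estimate(x_zeroed, y_zeroed, w)

# Approach 2: drop the elements outside S altogether.
keep = [i for i, s in enumerate(in_S) if s]
b_subset = intercept_only_estimate([x[i] for i in keep],
                                   [y[i] for i in keep],
                                   [w[i] for i in keep])
```

Both approaches give the same point estimate. The zeroed version keeps every sampled element in place, which is what the variance computation requires.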

Standard Errors

Let $\hat{B}_i$ denote a non-redundant parameter estimate. Its standard error is the square root of its estimated variance:

$$SE(\hat{B}_i) = \sqrt{\hat{V}(\hat{B}_i)}$$

The standard error is undefined for redundant parameters.

Degrees of freedom

The number of degrees of freedom $\nu$ used for computing the confidence intervals and test statistics below is calculated as the difference between the number of primary sampling units and the number of strata in the first stage of sampling. We shall also refer to this quantity as the sample design degrees of freedom. Alternatively, $\nu$ may be specified by the user.
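A small sketch of this count, assuming each sampled element carries a first-stage stratum label and a primary sampling unit (PSU) label that is unique within its stratum. The function name and the labels are hypothetical.

```python
def design_degrees_of_freedom(first_stage):
    """first_stage: one (stratum, psu) pair per sampled element."""
    psus = set(first_stage)                  # distinct PSUs, identified within stratum
    strata = {stratum for stratum, _ in psus}
    return len(psus) - len(strata)


elements = [
    ("north", 1), ("north", 1), ("north", 2),
    ("south", 1), ("south", 2), ("south", 3),
]
nu = design_degrees_of_freedom(elements)  # 5 PSUs minus 2 strata
```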

Confidence Intervals

A level $1 - \alpha$ confidence interval is constructed for a given $0 \le \alpha \le 1$ for each non-redundant model parameter $B_i$. Confidence bounds are given by

$$\hat{B}_i \pm SE(\hat{B}_i)\, t_{\nu}^{(1-\alpha/2)}$$

where $SE(\hat{B}_i)$ is the estimated standard error of $\hat{B}_i$, and $t_{\nu}^{(1-\alpha/2)}$ is the $100(1 - \alpha/2)$th percentile of the $t$ distribution with $\nu$ degrees of freedom.
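The bounds are simple to compute once the percentile is known. Python's standard library has no $t$ quantile function, so this sketch takes the critical value as an input. The value 2.228 used below is the tabulated 97.5th percentile for $\nu = 10$, and the estimate and standard error are invented for the example.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper), or None when the standard error is undefined."""
    if std_error is None:  # redundant parameter: no interval
        return None
    half_width = std_error * t_critical
    return (estimate - half_width, estimate + half_width)


# 95% interval with nu = 10 design degrees of freedom.
lower, upper = confidence_interval(2.0, 0.5, 2.228)
redundant = confidence_interval(0.0, None, 2.228)
```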

t Tests

Testing the hypothesis $H_0: B_i = 0$ for each non-redundant model parameter $B_i$ is performed using the $t$ test statistic:

$$t(\hat{B}_i) = \frac{\hat{B}_i}{SE(\hat{B}_i)}$$

The $p$-value for the two-sided test is given by the probability $P(|T| > |t(\hat{B}_i)|)$, where $T$ is a random variable from the $t$ distribution with $\nu$ degrees of freedom.
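The following sketch computes the statistic and approximates the two-sided $p$-value by integrating the $t$ density numerically with Simpson's rule. This is a stand-in chosen so that the example needs only the standard library, and it is not how a statistical package would evaluate the distribution function. The function names are hypothetical.

```python
import math


def t_pdf(x, nu):
    """Density of the t distribution (math.gamma overflows for very large nu)."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)


def two_sided_p_value(t_stat, nu, steps=2000):
    """P(|T| > |t|) = 1 - 2 * integral of the density over [0, |t|].

    Uses composite Simpson's rule; steps must be even.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, nu) + t_pdf(upper, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


t_stat = 2.0 / 0.5                      # estimate divided by its standard error
p_value = two_sided_p_value(t_stat, nu=10)

# Sanity check with a known case: for nu = 1 (Cauchy), P(|T| > 1) = 0.5.
p_cauchy = two_sided_p_value(1.0, nu=1)
```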

Design Effects

The design effect $Deff(\hat{B}_i)$ for a non-redundant parameter estimate $\hat{B}_i$ is given by

$$Deff(\hat{B}_i) = \frac{\hat{V}(\hat{B}_i)}{\hat{V}_{srs}(\hat{B}_i)}$$

The design effect is undefined for redundant parameters.

$\hat{V}(\hat{B}_i)$ is the estimate of the variance of $\hat{B}_i$ under the appropriate sampling design, while $\hat{V}_{srs}(\hat{B}_i)$ is the estimate of the variance of $\hat{B}_i$ under the simple random sampling assumption. The latter is computed as the $i$th diagonal element of the following matrix:

$$\hat{V}_{srs}(\hat{B}_i) = \left[ (X'WX)^{-}\, \hat{V}_{srs}(\hat{d}_T)\, (X'WX)^{-} \right]_{ii}$$

where

$$\hat{V}_{srs}(\hat{d}_T) = \left(1 - \frac{n}{N}\right) \frac{N}{n-1} \sum_{i=1}^{n} w_i d_i d_i'$$

with $d_i$ as specified earlier.
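Once both variance estimates are available, the design effect is a per-parameter ratio. The sketch below takes the two sets of variances as given inputs and returns `None` for redundant parameters. The numbers are invented for illustration.

```python
def design_effects(var_design, var_srs, redundant):
    """Ratio of design-based to SRS variance; None where the parameter is redundant."""
    return [None if is_redundant else v_design / v_srs
            for v_design, v_srs, is_redundant in zip(var_design, var_srs, redundant)]


var_design = [0.5, 0.125, 0.0]     # diagonal of the design-based covariance matrix
var_srs = [0.25, 0.25, 0.0]        # diagonal under the simple random sampling assumption
redundant = [False, False, True]   # third parameter is redundant

deff = design_effects(var_design, var_srs, redundant)
```

A value above 1 means the complex design is less precise than simple random sampling for that parameter, and a value below 1 means it is more precise.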

Each row $l_i'$ of matrix $L$ is also tested separately. The estimate for the $i$th row is given by $l_i'\hat{B}$ and its standard error by $\sqrt{l_i'\, \hat{V}(\hat{B})\, l_i}$.

See “Complex Samples: Model Testing” (cs_modeltesting.pdf) for additional tests and p-value adjustments.

Custom tests

Custom hypothesis tests are conducted only when $L$ is such that $LB$ is estimable. This condition is verified using the following equality:

$$L = L\,(X'WX)^{-}(X'WX)$$
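The estimability check can be sketched with a simple reflexive generalized inverse: invert the block of non-redundant rows and columns and leave zeros elsewhere. This construction is an assumption made for illustration and is not necessarily the g2 inverse CSGLM computes. All function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def invert(A):
    """Gauss-Jordan inverse of a nonsingular square matrix."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def g2_inverse(A, tol=1e-10):
    """Generalized inverse of a symmetric positive semidefinite matrix A."""
    n = len(A)
    S = [list(map(float, row)) for row in A]
    kept = []
    for k in range(n):
        if S[k][k] <= tol:
            continue  # column k depends on earlier kept columns: redundant
        kept.append(k)
        for i in range(k + 1, n):
            f = S[i][k] / S[k][k]
            for j in range(k + 1, n):
                S[i][j] -= f * S[k][j]
    inner = invert([[A[i][j] for j in kept] for i in kept])
    G = [[0.0] * n for _ in range(n)]
    for a, i in enumerate(kept):
        for b, j in enumerate(kept):
            G[i][j] = inner[a][b]
    return G


def is_estimable(L, A, tol=1e-8):
    """Check L == L (X'WX)^- (X'WX) element by element."""
    LH = matmul(L, matmul(g2_inverse(A), A))
    return all(abs(LH[i][j] - L[i][j]) <= tol
               for i in range(len(L)) for j in range(len(L[0])))


# X'WX for an intercept plus two level indicators, two unit-weight elements per level.
A = [[4.0, 2.0, 2.0], [2.0, 2.0, 0.0], [2.0, 0.0, 2.0]]
difference_ok = is_estimable([[0.0, 1.0, -1.0]], A)   # level 1 minus level 2
single_level_ok = is_estimable([[0.0, 1.0, 0.0]], A)  # one level effect on its own
cell_mean_ok = is_estimable([[1.0, 1.0, 0.0]], A)     # intercept plus level 1
```

In this example the difference between levels and the cell mean pass the check, while a single level effect on its own does not.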

Default tests of model effects

For each effect specified in the model, a Type III test matrix $L$ is constructed such that $LB$ is estimable. It involves parameters only for the given effect and the effects containing it, and it does not depend on the order of effects specified in the model. If such a matrix cannot be constructed, the effect is not testable. Matrix $K$ is always set to 0 when computing the test statistics for model effects.

The hypothesis for the corrected model is that all the parameters except for the intercept are zero.

Estimated marginal means

Estimated marginal means (EMMEANS) are based on the estimated cell means. For a given fixed set of factors, or their interactions, we estimate marginal means as the mean value averaged over all cells generated by the rest of the factors in the model. Covariates may be fixed at any specified value. If not specified, the value for each covariate is set to its overall mean estimate.

When missing cells are present in the data, EMMEANS may not be estimable. In such circumstances, we provide a modified estimate proposed by Searle, Speed and Milliken (1980) that ignores the non-estimable cells.

Each marginal estimate is finally constructed in the form $l'\hat{B}$ such that $l'B$ is estimable.

Comparing EMMEANS

For a given factor in the model, a vector of EMMEANS is created for all levels of the factor. This vector can be expressed in the form $\hat{M} = L\hat{B}$, where each row of the matrix $L$ is generated as described above. The variance is then computed by the following formula:

$$\hat{V}(\hat{M}) = L\, \hat{V}(\hat{B})\, L'$$

A set of contrasts for the factor is created according to the selected contrast type. Let this set of contrasts define the matrix $C$ used for testing the hypothesis $H_0: CM = 0$. The Wald $\mathrm{X}^2$ statistic is used for testing the given set of contrasts for the factor as follows:

$$\mathrm{X}^2 = (C\hat{M})' \left( C\, \hat{V}(\hat{M})\, C' \right)^{-} (C\hat{M})$$

The asymptotic distribution of the $\mathrm{X}^2$ test statistic is chi-square with $r_C$ degrees of freedom, where $r_C = \mathrm{rank}\left( C\, \hat{V}(\hat{M})\, C' \right)$.

Each row $c_i'$ of matrix $C$ is also tested separately. The estimate for the $i$th row is given by $c_i'\hat{M}$ and its standard error by $\sqrt{c_i'\, \hat{V}(\hat{M})\, c_i}$.

See “Complex Samples: Model Testing” (cs_modeltesting.pdf) for additional tests and p-value adjustments. Substitute the following formula for the simple random sampling covariance: $\hat{V}_{srs}(\hat{M}) = L\, \hat{V}_{srs}(\hat{B})\, L'$.
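A worked sketch of the EMMEANS comparison for a two-level factor is given below, using made-up estimates. With a single contrast the generalized inverse in the Wald statistic reduces to a reciprocal, which keeps the example short. The symbol `M_hat` for the EMMEANS vector and all the numbers are assumptions made for illustration.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


B_hat = [[10.0], [2.0]]               # hypothetical estimates: intercept, level effect
V_B = [[1.0, -0.5], [-0.5, 1.0]]      # hypothetical covariance matrix of B_hat
L = [[1.0, 1.0], [1.0, 0.0]]          # one row per marginal mean

M_hat = matmul(L, B_hat)                       # vector of EMMEANS, L * B_hat
V_M = matmul(matmul(L, V_B), transpose(L))     # L * V(B_hat) * L'

C = [[1.0, -1.0]]                              # simple contrast: level 1 minus level 2
estimate = matmul(C, M_hat)[0][0]
variance = matmul(matmul(C, V_M), transpose(C))[0][0]
std_error = math.sqrt(variance)
wald = estimate * estimate / variance          # one contrast, so r_C = 1
```

Here the contrast equals the level effect itself, so its estimate and variance match the second element of `B_hat` and the second diagonal element of `V_B`.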

References

Binder, D. A. (1983), “On the variances of asymptotically normal estimators from complex surveys”, International Statistical Review, 51, 279-292.

Fuller, W. A. (1975), “Regression analysis for sample survey”, Sankhya, Series C 37, 117-

Kish, L. (1965), Survey Sampling, New York: John Wiley & Sons.

Kish, L. (1995), “Methods for Design Effects”, Journal of Official Statistics, 11, 119-127.

Kish, L. and Frankel, M. R. (1974), “Inference from complex samples”, Journal of the Royal Statistical Society B, 36, 1-37.

Särndal, C. E., Swensson, B., and Wretman, J. H. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.

Searle, S. R., Speed, F. M., and Milliken, G. A. (1980), “Population marginal means in the linear model: an alternative to least square means”, The American Statistician, 34, 216-221.

Shah, B. V., Holt, M. M., and Folsom, R. E. (1977), “Inference about regression models from sample survey data”, Bulletin of the International Statistical Institute XLVII, 3, 43-