








Main points of this file are Custom Tables, Conditions and assumptions, Statistics, Small sample validity of the test, Test statistics for multiple response sets, Use of case weights, Hypothesis









This document describes the algorithms used in the Custom Tables procedure.
A note on weights and multiple response sets
Case weights are always based on Counts, not Responses, even when one of the variables is a multiple response variable.
The following notation is used for the computation of Pearson’s chi-square:
$R$  Number of rows in the sub-table.
$C$  Number of columns in the sub-table.
$f_{ij}$  Sum of case weights in cell (i,j).
$r_i$  Marginal case weights total in i-th row.
$c_j$  Marginal case weights total in j-th column.
$W$  Marginal case weights total in the sub-table.
$E_{ij}$  Expected cell counts.
$\chi^2_p$  Pearson's chi-square statistic.
$p_{ij}$  Population proportion for cell (i,j).
$p_{i\cdot}$  Marginal population proportion for i-th row.
$p_{\cdot j}$  Marginal population proportion for j-th column.
$df$  Degrees of freedom.
$p$  p-value of the chi-square test.
$\alpha$  Significance level supplied by the user.
Hypothesis:
$H_0: p_{ij} = p_{i\cdot}\, p_{\cdot j}$ for $i = 1, \ldots, R$ and $j = 1, \ldots, C$, vs. not $H_0$.
Multiple response set in rows and categorical variable in columns
$$\chi^2_p = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{\left(1 - \frac{r_i}{W}\right) E_{ij}}$$
Under the null hypothesis, the statistic has an approximate chi-square distribution with $df = R(C - 1)$ degrees of freedom.
Multiple response sets in rows and columns
$$\chi^2_p = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(f_{ij} - E_{ij})^2}{\left(1 - \frac{r_i}{W}\right)\left(1 - \frac{c_j}{W}\right) E_{ij}}$$
Under the null hypothesis, the statistic has an approximate chi-square distribution with $df = RC$ degrees of freedom.
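The two multiple-response statistics above can be sketched in a few lines of Python. This is an illustrative sketch, not the procedure's actual code: the function name `mr_chi_square` and the flag `both_mr` are hypothetical, and the expected counts are assumed to be the usual independence estimate $E_{ij} = r_i c_j / W$, which is not spelled out in this part of the text.

```python
def mr_chi_square(f, r, c, W, both_mr=False):
    """Modified Pearson chi-square for a sub-table with a multiple response set.

    f : R x C nested list of cell weight sums f_ij
    r : row marginal totals r_i, c : column marginal totals c_j
    W : total case weights in the sub-table
    both_mr=False -> multiple response set in rows, categorical columns
    both_mr=True  -> multiple response sets in rows and columns
    Assumption: E_ij = r_i * c_j / W.
    """
    R, C = len(r), len(c)
    stat = 0.0
    for i in range(R):
        for j in range(C):
            e_ij = r[i] * c[j] / W
            denom = (1.0 - r[i] / W) * e_ij
            if both_mr:
                denom *= 1.0 - c[j] / W
            stat += (f[i][j] - e_ij) ** 2 / denom
    df = R * C if both_mr else R * (C - 1)
    return stat, df


# Two response items in rows, a two-category variable in columns.
f = [[30, 10], [20, 40]]
stat_rows, df_rows = mr_chi_square(f, r=[40, 60], c=[50, 50], W=100)
stat_both, df_both = mr_chi_square(f, r=[40, 60], c=[50, 50], W=100, both_mr=True)
```

Each cell contributes the ordinary Pearson term divided by the extra factor(s) in the denominator, so the statistic for two multiple response sets is never smaller than the rows-only version on the same counts.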
P-value
$$p = 1 - F(\chi^2_p;\, df),$$
where $F(x;\, df)$ is the cumulative distribution function of the chi-square distribution with $df$ degrees of freedom.
The chi-square test is significant if $p < \alpha$.
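As a rough illustration of the p-value step, the upper tail $1 - F(x; df)$ can be evaluated for integer degrees of freedom with the standard closed-form series, using only the Python standard library. The function name `chi2_sf` is hypothetical, and a statistical library routine would normally be preferred in practice.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability 1 - F(x; df) of the chi-square distribution.

    Uses the finite-series forms valid for integer df >= 1.
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: 2*(1 - Phi(sqrt(x))) + 2*phi(sqrt(x)) * sum_r x^(r-1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * pdf * total


alpha = 0.05
p_value = chi2_sf(33.3333, 2)
significant = p_value < alpha
```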
Use of case weights:
The case weights (or frequency weights) are supposed to be integers representing the number of replications of each case. In chi-square tests, only the aggregated cell counts $f_{ij}$ are checked: if they are not integers, they are rounded to the nearest integer before computations.
Small sample validity of the test
Pearson's chi-square is a large-sample test; it may not be valid when the sample size is small. A rule of thumb is to check that more than 80% of cells have expected cell counts larger than 5 and that all expected cell counts are larger than 1.
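The rule of thumb can be expressed as a small check on the table of expected counts. This is only a sketch; the function name `small_sample_ok` is hypothetical and the thresholds are the ones quoted in the text.

```python
def small_sample_ok(expected):
    """Return True if the rule of thumb for Pearson's chi-square is satisfied.

    expected : nested list of expected cell counts E_ij
    Rule: more than 80% of cells exceed 5, and every cell exceeds 1.
    """
    cells = [e for row in expected for e in row]
    share_above_5 = sum(1 for e in cells if e > 5) / len(cells)
    return share_above_5 > 0.8 and all(e > 1 for e in cells)
```

For example, a 2 x 3 table with five cells above 5 and one cell equal to 4 passes (5/6 of the cells exceed 5), while a 2 x 2 table with one cell equal to 4 does not (only 75% exceed 5).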
Test statistics for multiple response sets
The formulas above use a variation of the Pearson chi-square test statistics developed for a combination of categorical variable and a multiple response set as initially suggested by Agresti and Liu (1999). Formulas and properties of this test can be found in a comparative study by Bilder et al. (2000). An extension of this approach when both variables are multiple response sets is given in the paper by Thomas and Decady (2004). It contains a study of the test properties as well as additional references.
Agresti, A. and Liu, I.-M. (1999), “Modeling responses to a categorical variable allowing arbitrarily many category choices”, Biometrics, 55, 936-943.
Bilder, C.R., Loughin, T.M. and Nettleton, D. (2000), “Multiple marginal independence testing for pick any/c variables”, Communications in Statistics: Simulation, 29, 1285-1316.
Thomas, D.R. and Decady, Y.J. (2004), “Testing for association using multiple response survey data: approximate procedures based on Rao-Scott Approach”, International Journal of Testing, 4, 43-59.
The following notation is used for the computation of Column Proportions Tests:
R Number of rows in the sub-table.
C Number of columns in the sub-table.
$A_i$  i-th category of the row variable.
off, cell statistics requested must include cell counts or simple column percents; an unweighted analysis will be performed.
sum of case weights in a category is less than 2 (i.e., $c_j < 2$). If fewer than two proportions are left after discarding proportions, the test will not be performed.
Table layout:
         B_1     B_2     ...    B_C
A_1      p_11    p_12    ...    p_1C
A_2      p_21    p_22    ...    p_2C
...      ...     ...     ...    ...
A_R      p_R1    p_R2    ...    p_RC
Hypothesis:
Without loss of generality, we will only look at the i-th row of the table. Let $C^*$ be the number of categories in the i-th row where the proportion is greater than zero and less than one, and where the sum of case weights in the corresponding column is at least 2. In the i-th row, $C^*(C^* - 1)/2$ comparisons will be made among $p_{i1}, p_{i2}, \ldots, p_{iC^*}$. The (j,k)th hypothesis will be
$H_{0jk}: p_{ij} = p_{ik}$ vs. $H_{1jk}: p_{ij} \neq p_{ik}$.
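The eligibility rule and the resulting set of pairwise hypotheses for one row can be sketched as follows. The function name `comparisons_in_row` and its argument names are hypothetical; the sketch assumes the row's proportions and the column weight totals are already available.

```python
from itertools import combinations


def comparisons_in_row(p_row, col_weights):
    """List the (j, k) column pairs compared in one row.

    p_row       : column proportions in the row
    col_weights : sum of case weights in each column
    A column is eligible if 0 < p < 1 and its weight total is at least 2.
    """
    eligible = [
        j for j, (p, c) in enumerate(zip(p_row, col_weights))
        if 0.0 < p < 1.0 and c >= 2
    ]
    return list(combinations(eligible, 2))


# Column 1 has proportion 0 and column 2 has too little weight,
# so only columns 0 and 3 are compared.
pairs = comparisons_in_row([0.2, 0.0, 0.5, 0.7], [10, 10, 1.5, 30])
```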
Aggregated Statistics:
Column proportions tests are based on the aggregated proportions ($\hat{p}_{ij}$) and cell counts for each column ($c_j$). Column proportions are computed using the unrounded cell counts,
$$\hat{p}_{ij} = \frac{f_{ij}}{c_j},$$
which are equal to the proportions actually displayed in CTABLE.
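Statistics for the (j,k)th comparison in the i-th row:
Pooled proportion, with $\tilde{c}_j$ denoting the rounded sum of case weights in the j-th column:
$$\hat{p}_{ijk} = \frac{\tilde{c}_j \hat{p}_{ij} + \tilde{c}_k \hat{p}_{ik}}{\tilde{c}_j + \tilde{c}_k}$$
z statistic with a categorical variable in the columns:
$$z = \frac{\hat{p}_{ij} - \hat{p}_{ik}}{\sqrt{\hat{p}_{ijk}\,(1 - \hat{p}_{ijk})\left(\dfrac{1}{\tilde{c}_j} + \dfrac{1}{\tilde{c}_k}\right)}}$$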
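When a multiple response set defines the columns, there may exist cases that belong to both the j-th and k-th columns. Let $\tilde{c}_{jk}$ be the rounded sum of weights for such cases.
z statistic with a multiple response set in the columns:
$$z = \frac{\hat{p}_{ij} - \hat{p}_{ik}}{\sqrt{\hat{p}_{ijk}\,(1 - \hat{p}_{ijk})\left(\dfrac{1}{\tilde{c}_j} + \dfrac{1}{\tilde{c}_k} - \dfrac{2\,\tilde{c}_{jk}}{\tilde{c}_j \tilde{c}_k}\right)}}$$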
p-value: $p = 2\,[1 - \Phi(|z|)]$,
where $\Phi(z)$ is the CDF of the standard normal distribution.
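A minimal sketch of the column proportions z test follows, covering both the categorical case and the multiple response case through an optional overlap weight. The function name `column_proportion_z` and its argument names are hypothetical; the column weights are assumed to be already rounded, and the overlap term defaults to zero for a categorical column variable.

```python
import math


def column_proportion_z(p_j, p_k, c_j, c_k, c_jk=0):
    """z statistic and two-sided p-value for comparing two column proportions.

    p_j, p_k : column proportions in the same row
    c_j, c_k : rounded sums of case weights in the two columns
    c_jk     : rounded weight of cases belonging to both columns
               (0 when the columns come from a categorical variable)
    """
    pooled = (c_j * p_j + c_k * p_k) / (c_j + c_k)
    scale = 1.0 / c_j + 1.0 / c_k - 2.0 * c_jk / (c_j * c_k)
    z = (p_j - p_k) / math.sqrt(pooled * (1.0 - pooled) * scale)
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


z_cat, p_cat = column_proportion_z(0.5, 0.4, 100, 100)
z_mr, p_mr = column_proportion_z(0.5, 0.4, 100, 100, c_jk=20)
```

With overlapping cases the variance of the difference shrinks, so the same pair of proportions gives a larger |z| in the multiple response case than in the categorical case.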
Alternatively, the statistic can be constructed as a chi-square statistic, $\chi^2 = z^2$.
$k$  Number of categories with case weights greater than or equal to 2.
$\mu_i$  Population mean of the i-th category, i=1,...,k.
$x_{ij}$  j-th observation in i-th group.
$w_{ij}$  Case weight of the j-th observation in i-th group.
$w_i$  Sum of case weights in category i, i=1,...,k.
$\tilde{w}_i$  Rounded sum of case weights in category i, i=1,...,k.
$\bar{x}_i$  Mean of category i, i=1,...,k.
$s_i$  Standard deviation of category i, i=1,...,k.
$s_{ij}$  Pooled standard deviation from i-th and j-th group.
$s_w$  Pooled standard deviation of all categories.
$W$  Total case weights. Sum of rounded $w_i$'s.
$p_B$  p-value adjusted by using Bonferroni method.
$\alpha$  Significance level supplied by the user.
All Pairwise Comparisons
Hypotheses:
$H_{0ij}: \mu_i = \mu_j$ vs. $H_{1ij}: \mu_i \neq \mu_j$, for all $i > j$.
Total number of hypotheses: $k(k-1)/2$, where $k = \sum_i I(w_i \geq 2)$ is the number of categories whose sum of case weights is at least 2.
Aggregated statistics:
The statistics in pairwise comparisons are computed from aggregated category means ($\bar{x}_i$), sample variances ($s_i^2$) and sample sizes ($w_i$), i=1,...,k. Various quantities used in the comparisons are shown below.
$$W = \sum_{i=1}^{k} \mathrm{round}(w_i)\, I(w_i \geq 2)$$
Mean of i-th category:
$$\bar{x}_i = \frac{\sum_{j=1}^{n_i} w_{ij}\, x_{ij}}{w_i}$$
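The aggregated quantities can be sketched as below. The function name `category_summary` is hypothetical, every category is assumed to have positive total weight, and ties in rounding are resolved half-up, which is an assumption because the text only says "rounded".

```python
import math


def category_summary(values, weights):
    """Weighted means, k and W from raw observations.

    values, weights : one list of observations / case weights per category
    Returns (means, k, W), where k counts categories with w_i >= 2 and
    W sums the rounded w_i over those categories.
    Assumption: round half up (tie handling is not specified in the text).
    """
    means, k, W = [], 0, 0
    for xs, ws in zip(values, weights):
        w_i = sum(ws)
        means.append(sum(w * x for w, x in zip(ws, xs)) / w_i)
        if w_i >= 2:
            k += 1
            W += math.floor(w_i + 0.5)
    return means, k, W


means, k, W = category_summary(
    values=[[1, 2, 3], [4, 6], [10]],
    weights=[[1, 1, 2], [1.2, 1.2], [1]],
)
```

In this example the third category has a total weight of 1, so it contributes to neither k nor W.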
T-statistic for comparing levels of a multiple response set
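$$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_{ij}\sqrt{\dfrac{1}{\tilde{w}_i} + \dfrac{1}{\tilde{w}_j} - \dfrac{2\,\tilde{w}_{ij}}{\tilde{w}_i \tilde{w}_j}}}$$
P-value: $p = 2\,[1 - F(|t_{ij}|;\, \tilde{w}_i + \tilde{w}_j - \tilde{w}_{ij} - 2)]$, where $F(t;\, \nu)$ is the CDF of the t distribution with $\nu$ degrees of freedom.
A comparison is significant if $p < \alpha$ (or $p_B < \alpha$, if Bonferroni adjustment is used).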
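The t statistic for levels of a multiple response set, together with its degrees of freedom, might be computed as in this sketch. The function name `mr_t_statistic` is hypothetical. The pooled standard deviation is taken as an input because its definition is not given in this part of the text, and the overlap weight is assumed to be the rounded weight of cases belonging to both categories.

```python
import math


def mr_t_statistic(mean_i, mean_j, s_ij, w_i, w_j, w_ij):
    """t statistic and degrees of freedom for two levels of a multiple response set.

    mean_i, mean_j : category means
    s_ij           : pooled standard deviation of the two categories (supplied)
    w_i, w_j       : rounded sums of case weights in the two categories
    w_ij           : rounded weight of cases in both categories (assumed meaning)
    """
    scale = 1.0 / w_i + 1.0 / w_j - 2.0 * w_ij / (w_i * w_j)
    t = (mean_i - mean_j) / (s_ij * math.sqrt(scale))
    df = w_i + w_j - w_ij - 2
    return t, df


t_stat, df = mr_t_statistic(5.0, 6.0, s_ij=2.0, w_i=50, w_j=50, w_ij=10)
```

The two-sided p-value then needs the t distribution CDF; if SciPy is available, `2 * scipy.stats.t.sf(abs(t_stat), df)` is one way to obtain it.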
Statistics for (i,j)th comparisons with variance pooled from all categories
Assume $w_i \geq 2$ and $w_j \geq 2$.
Within groups variance pooled from all the categories:
$$s_w^2 = \frac{\sum_{i=1}^{k} I(w_i \geq 2)\,(\tilde{w}_i - 1)\, s_i^2}{W - k}$$
T-statistic for levels of a categorical variable:
$$t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{s_w\sqrt{\dfrac{1}{\tilde{w}_i} + \dfrac{1}{\tilde{w}_j}}}$$
P-value: $p = 2\,[1 - F(|t_{ij}|;\, W - k)]$.
A comparison is significant if $p < \alpha$ (or $p_B < \alpha$, if Bonferroni adjustment is used).
This test is available for categories defined by categorical variable only.
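The pooled-variance comparison can be sketched as follows. The function name `pooled_variance_t` is hypothetical, the category weights are rounded half-up inside the function (the tie rule is an assumption), and categories with a weight total below 2 are left out of the pooling, as the indicator in the formula suggests.

```python
import math


def pooled_variance_t(means, sds, weights, i, j):
    """t statistic for categories i and j using variance pooled over all categories.

    means, sds : per-category means and standard deviations
    weights    : per-category sums of case weights (unrounded)
    Returns (t, df, s2_w) with df = W - k.
    Assumption: weights are rounded half up before use.
    """
    rounded = [math.floor(w + 0.5) for w in weights]
    keep = [g for g, w in enumerate(weights) if w >= 2]
    k = len(keep)
    W = sum(rounded[g] for g in keep)
    s2_w = sum((rounded[g] - 1) * sds[g] ** 2 for g in keep) / (W - k)
    scale = 1.0 / rounded[i] + 1.0 / rounded[j]
    t = (means[i] - means[j]) / math.sqrt(s2_w * scale)
    return t, W - k, s2_w


t_stat, df, s2_w = pooled_variance_t(
    means=[5.0, 6.0, 8.0], sds=[2.0, 3.0, 4.0], weights=[10, 20, 30], i=0, j=1
)
```

Because the variance is pooled over every eligible category, the degrees of freedom (57 in this example) are usually larger than in a two-category comparison.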
Bonferroni adjustment
If the Bonferroni adjustment for multiple comparisons is requested, the p-value $p$ will be adjusted by
$$p_B = \min\left(\frac{p\,k(k-1)}{2},\; 1\right)$$
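The Bonferroni adjustment multiplies each p-value by the number of pairwise hypotheses and caps the result at 1. A short sketch, with the hypothetical function name `bonferroni_adjust`:

```python
def bonferroni_adjust(p, k):
    """Bonferroni-adjusted p-value for all pairwise comparisons among k categories."""
    n_comparisons = k * (k - 1) / 2
    return min(p * n_comparisons, 1.0)


p_b_small = bonferroni_adjust(0.01, 4)  # 6 comparisons among 4 categories
p_b_large = bonferroni_adjust(0.30, 4)  # capped at 1
```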
Possible computational problems:
From the formulas, we can see that a comparison can be made as long as either $s_{ij}^2$ or $s_w^2$ is nonzero. If the variances for both compared categories are zero, the first test cannot be conducted. If the variances for all categories with cell count greater than or equal to two are zero, $s_w^2$ becomes zero and the second test cannot be conducted either.
Use of case weights:
The case weights (or frequency weights) are supposed to be integers representing the number of replications of each case. If the sums of case weights in each group ($w_i$, i=1,...,k) are not integers, they will be rounded to the nearest integers before calculations. Consequently, the total weight $W$ will become the sum of the rounded $w_i$'s.