Plum - Mathematics and Statistics - Study Notes, Study notes of Mathematical Statistics

This document covers the following main points: PLUM, Data Aggregation, Data Assumptions, Model, Log-likelihood Function, Derivatives of the Log-likelihood Function, Expectation of the Second Derivative.

Typology: Study notes

2011/2012

Uploaded on 10/31/2012

sangawar

PLUM

The purpose of the PLUM procedure is to model the dependence of an ordinal categorical response variable on a set of discrete and/or continuous independent variables. Since the choice and the number of response categories can be quite arbitrary, it is essential to model the dependence such that the choice of the response categories does not affect the conclusion of the inference. That is, the final conclusion should be the same if any two or more adjacent categories of the old scale are combined. Such considerations lead to modeling the dependence of the response on the independent variables by means of the cumulative response probability.

Notations

$Y$: The ordinal response variable, which takes integer values from 1 to $J$, $J \ge 2$.

$J$: The number of categories of the ordinal response.

$m$: The number of subpopulations.

$X^A$: $m \times p^A$ matrix with vector-element $x_i^A$, the observed values at the $i$th subpopulation, determined by the independent variables specified in the command.

$X$: $m \times p$ matrix with vector-element $x_i$, the observed values of the location model's independent variables at the $i$th subpopulation.

$Z$: $m \times q$ matrix with vector-element $z_i$, the observed values of the scale model's independent variables at the $i$th subpopulation.

$f_{ijs}$: The frequency weight for the $s$-th observation which belongs to the cell corresponding to $Y = j$ at subpopulation $i$.

$n_{ij}$: The sum of frequency weights of the observations that belong to the cell corresponding to $Y = j$ at subpopulation $i$.

$r_{ij}$: The cumulative total up to and including $Y = j$ at subpopulation $i$.

$n_i$: The marginal frequency of subpopulation $i$.

$n$: The sum of all frequency weights.

$\gamma_{ij}$: The cumulative response probability up to and including $Y = j$ at subpopulation $i$.

$\pi_{ij}$: The cell response probability corresponding to $Y = j$ at subpopulation $i$.

$\theta$: $(J-1) \times 1$ vector of threshold parameters in the location part of the model.

$\beta$: $p \times 1$ vector of location parameters in the location part of the model.

$\tau$: $q \times 1$ vector of scale parameters in the scale part of the model.

$B = (\theta^T, \beta^T, \tau^T)^T$: The $\{(J-1)+p+q\} \times 1$ vector of unknown parameters in the general model.

$\hat{B} = (\hat{\theta}^T, \hat{\beta}^T, \hat{\tau}^T)^T$: The $\{(J-1)+p+q\} \times 1$ vector of maximum likelihood estimates of the parameters in the general model.

$\tilde{B} = (\tilde{\theta}^T, \tilde{\beta}^T)^T$: The $\{(J-1)+p\} \times 1$ vector of maximum likelihood estimates of the parameters in the location-only model.

$\hat{\gamma}_{ij}$: The cumulative response probability estimate based on the maximum likelihood estimate $\hat{B}$ in the general model.

$\tilde{\gamma}_{ij}$: The cumulative response probability estimate based on the maximum likelihood estimate $\tilde{B}$ in the location-only model.

$\hat{\pi}_{ij}$: The cell response probability estimate based on the maximum likelihood estimate $\hat{B}$ in the general model.

$\tilde{\pi}_{ij}$: The cell response probability estimate based on the maximum likelihood estimate $\tilde{B}$ in the location-only model.

$\hat{e}$: Number of non-redundant parameters in the general model. If all parameters are non-redundant, $\hat{e} = (J-1) + p + q$.

$\tilde{e}$: Number of non-redundant parameters in the location-only model. If all parameters are non-redundant, $\tilde{e} = (J-1) + p$.

Data Aggregation

Observations with negative or missing frequency weights are discarded. Observations are aggregated by the definition of subpopulations. Subpopulations are defined by the cross-classifications of the set of independent variables specified in the command. Let $n_i$ be the marginal count of subpopulation $i$,

$$n_i = \sum_{j=1}^{J} n_{ij}.$$

If there is no observation for the cell of $Y = j$ at subpopulation $i$, it is assumed that $n_{ij} = 0$, provided that $n_i \neq 0$. A non-negative scalar $\delta \in [0, 1)$ may be added to any zero cell (i.e., a cell with $n_{ij} = 0$) if its marginal count $n_i$ is nonzero. The value of $\delta$ is zero by default.
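The aggregation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `aggregate`, the record layout `(x, y, weight)`, and the choice to leave subpopulations with zero marginal count untouched are hypothetical and not taken from the source.

```python
from collections import defaultdict


def aggregate(records, J, delta=0.0):
    """Aggregate weighted observations into cell counts n_ij.

    records: iterable of (x, y, weight), where x is a hashable tuple of
             predictor values (it defines the subpopulation) and y is in 1..J.
    delta:   value added to zero cells of subpopulations with n_i != 0.
    """
    cells = defaultdict(lambda: [0.0] * J)
    for x, y, w in records:
        # Discard observations with missing (None/NaN) or negative weights.
        if w is None or w != w or w < 0:
            continue
        cells[x][y - 1] += w

    table = {}
    for x, counts in cells.items():
        n_i = sum(counts)  # marginal count before any delta is added
        if n_i != 0 and delta > 0:
            counts = [delta if c == 0 else c for c in counts]
        table[x] = counts
    return table


records = [((0,), 1, 2.0), ((0,), 3, 1.0),
           ((1,), 2, -1.0), ((1,), 2, 4.0), ((1,), 1, None)]
table = aggregate(records, J=3, delta=0.5)
```

With `delta=0.5`, the empty middle cell of subpopulation `(0,)` and the two empty cells of subpopulation `(1,)` receive 0.5, while the negative and missing weights are dropped.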

Data Assumptions

Let $(n_{i1}, \ldots, n_{iJ})^T$ be the $J \times 1$ vector of counts for the categories of $Y$ at subpopulation $i$. It is assumed that each $(n_{i1}, \ldots, n_{iJ})^T$ is independently multinomial distributed with probability vector $(\pi_{i1}, \ldots, \pi_{iJ})^T$ of dimension $J \times 1$ and fixed total $n_i$.

Model

Let $\gamma_{ij} = \mathrm{Prob}(Y \le j \mid x_i)$ be the cumulative response probability for $Y$, i.e.,

$$\gamma_{ij} = \sum_{l=1}^{j} \pi_{il}$$

for $j = 1, \ldots, J-1$. Notice that $\gamma_{iJ} = 1$, hence only the first $J-1$ $\gamma$'s are needed in the model.
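The derivative formulas later in these notes use a linear predictor $\eta_{ij}$ whose definition does not appear in this excerpt. A form consistent with the listed derivatives $P_{ijk} = \partial\eta_{ij}/\partial B_k$, and with the location-only expression given under the equal-slopes test, would be the following. It is a reconstruction under that assumption, with $\theta$ the thresholds, $\beta$ the location parameters and $\tau$ the scale parameters, and is not quoted from the source.

```latex
\operatorname{link}(\gamma_{ij}) \;=\; \eta_{ij}
  \;=\; \frac{\theta_j - \beta^{T} x_i}{\exp(\tau^{T} z_i)},
  \qquad j = 1, \ldots, J-1 .
```

Setting $\tau = 0$ gives the location-only predictor $\eta_{ij} = \theta_j - \beta^T x_i$.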

Log-likelihood Function

The log-likelihood function is

$$l = \sum_{i=1}^{m} \sum_{j=1}^{J-1} \left[ r_{ij}\,\phi_{ij} - r_{i,j+1}\, g(\phi_{ij}) \right]$$

in which $r_{ij}$ is the cumulative total

$$r_{ij} = \sum_{k=1}^{j} n_{ik},$$

the argument $\phi_{ij}$ is given by

$$\phi_{ij} = \log\left( \frac{\gamma_{ij}}{\gamma_{i,j+1} - \gamma_{ij}} \right),$$

and the function $g(\phi)$ is

$$g(\phi_{ij}) = \log\left(1 + \exp(\phi_{ij})\right) = \log\left( \frac{\gamma_{i,j+1}}{\gamma_{i,j+1} - \gamma_{ij}} \right).$$

Notice that a constant term $c = \sum_{i=1}^{m} \log\{ n_i! / (n_{i1}! \cdots n_{iJ}!) \}$, which is independent of the unknown parameters, has been excluded here. Thus, $l$ is in fact the kernel of the true log-likelihood function. Further details of the log-likelihood function can be found at the end of this chapter.
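As a quick sanity check of the cumulative form, the sketch below evaluates the kernel for one subpopulation using $r_{ij}$, $\phi_{ij}$ and $g(\phi_{ij})$ and compares it with the familiar multinomial kernel $\sum_j n_{ij} \log \pi_{ij}$. The function names and the example counts are hypothetical.

```python
import math


def kernel_cumulative(counts, probs):
    """Kernel for one subpopulation in the (r, phi, g) form."""
    J = len(counts)
    r, gamma = [], []
    r_run = g_run = 0.0
    for n, p in zip(counts, probs):
        r_run += n
        g_run += p
        r.append(r_run)      # cumulative totals r_i1, ..., r_iJ
        gamma.append(g_run)  # cumulative probabilities gamma_i1, ..., gamma_iJ
    gamma[-1] = 1.0          # gamma_iJ = 1 by definition

    total = 0.0
    for j in range(J - 1):
        diff = gamma[j + 1] - gamma[j]
        phi = math.log(gamma[j] / diff)
        g = math.log(1.0 + math.exp(phi))
        total += r[j] * phi - r[j + 1] * g
    return total


def kernel_multinomial(counts, probs):
    """Usual multinomial kernel: sum of n_ij * log(pi_ij)."""
    return sum(n * math.log(p) for n, p in zip(counts, probs))


counts = [3, 5, 2]
probs = [0.2, 0.5, 0.3]
a = kernel_cumulative(counts, probs)
b = kernel_multinomial(counts, probs)
```

The two values agree up to floating-point error, which is why the cumulative form can be used as the kernel of the multinomial log-likelihood.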

Derivatives of the Log-likelihood Function

Details of derivatives can be found at the end of this chapter.

First Derivative

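The first derivative of $l$ with respect to $B_k$, $k = 1, \ldots, (J-1)+p+q$, is

$$\frac{\partial l}{\partial B_k} = \sum_{i=1}^{m} \sum_{j=1}^{J-1} \frac{\partial l_i}{\partial \phi_{ij}}\, U_{ij}\, Q_{ijk}$$

in which

$$\frac{\partial l_i}{\partial \phi_{ij}} = r_{ij} - r_{i,j+1}\, \frac{\gamma_{ij}}{\gamma_{i,j+1}},$$

$$U_{ij} = \frac{\gamma_{i,j+1}}{\gamma_{ij}\,(\gamma_{i,j+1} - \gamma_{ij})}$$

and

$$Q_{ijk} = \frac{\partial \gamma_{ij}}{\partial \eta_{ij}}\, P_{ijk} - \frac{\gamma_{ij}}{\gamma_{i,j+1}}\, \frac{\partial \gamma_{i,j+1}}{\partial \eta_{i,j+1}}\, P_{i,j+1,k}$$

in which

$$P_{ijk} = \frac{\partial \eta_{ij}}{\partial B_k} = \begin{cases} \delta_{jk} \exp(-\tau^T z_i) & \text{if } 1 \le k \le J-1 \\ -x_{i[k-(J-1)]} \exp(-\tau^T z_i) & \text{if } J \le k \le (J-1)+p \\ -z_{i[k-\{(J-1)+p\}]}\, \eta_{ij} & \text{if } J+p \le k \le (J-1)+p+q \end{cases}$$

$\delta_{jk} = 1$ if $j = k$, 0 otherwise, and $P_{iJk} = 0$. For $i = 1, \ldots, m$, $j = 1, \ldots, J-1$,

$$\frac{\partial \gamma_{ij}}{\partial \eta_{ij}} = \begin{cases} \gamma_{ij}(1 - \gamma_{ij}) & \text{Logit link} \\ -(1 - \gamma_{ij}) \log(1 - \gamma_{ij}) & \text{Complementary log-log link} \\ -\gamma_{ij} \log(\gamma_{ij}) & \text{Negative log-log link} \\ \varphi(\Phi^{-1}(\gamma_{ij})) & \text{Probit link} \\ \cos^2(\pi(\gamma_{ij} - 0.5)) / \pi & \text{Cauchit link} \end{cases}$$

and $\partial \gamma_{iJ} / \partial \eta_{iJ} = 0$.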
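The link-specific derivatives can be checked numerically. The sketch below assumes the standard definitions of the five links (for example $\eta = \log(-\log(1-\gamma))$ for the complementary log-log link and $\eta = \tan(\pi(\gamma - 0.5))$ for the Cauchit link), which this excerpt does not spell out. The string labels and function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def inverse_link(eta, link):
    """Cumulative probability gamma as a function of eta (assumed definitions)."""
    if link == "logit":
        return 1.0 / (1.0 + math.exp(-eta))
    if link == "cloglog":
        return 1.0 - math.exp(-math.exp(eta))
    if link == "nloglog":
        return math.exp(-math.exp(-eta))
    if link == "probit":
        return _STD_NORMAL.cdf(eta)
    if link == "cauchit":
        return 0.5 + math.atan(eta) / math.pi
    raise ValueError(f"unknown link: {link}")


def dgamma_deta(gamma, link):
    """Derivative of gamma with respect to eta, written in terms of gamma."""
    if link == "logit":
        return gamma * (1.0 - gamma)
    if link == "cloglog":
        return -(1.0 - gamma) * math.log(1.0 - gamma)
    if link == "nloglog":
        return -gamma * math.log(gamma)
    if link == "probit":
        return _STD_NORMAL.pdf(_STD_NORMAL.inv_cdf(gamma))
    if link == "cauchit":
        return math.cos(math.pi * (gamma - 0.5)) ** 2 / math.pi
    raise ValueError(f"unknown link: {link}")


def max_abs_error(eta=0.3, h=1e-6):
    """Largest gap between the analytic and central-difference derivatives."""
    worst = 0.0
    for link in ("logit", "cloglog", "nloglog", "probit", "cauchit"):
        numeric = (inverse_link(eta + h, link) - inverse_link(eta - h, link)) / (2 * h)
        analytic = dgamma_deta(inverse_link(eta, link), link)
        worst = max(worst, abs(numeric - analytic))
    return worst
```

Calling `max_abs_error()` compares each closed-form derivative with a central finite difference at a single point, which is a cheap way to catch sign errors in the complementary and negative log-log cases.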

Second Derivative

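The second derivative is

$$\frac{\partial^2 l}{\partial B_s \partial B_k} = \sum_{i=1}^{m} \sum_{j=1}^{J-1} \left\{ \frac{\partial^2 l_i}{\partial B_s \partial \phi_{ij}}\, U_{ij}\, Q_{ijk} + \frac{\partial l_i}{\partial \phi_{ij}}\, \frac{\partial U_{ij}}{\partial B_s}\, Q_{ijk} + \frac{\partial l_i}{\partial \phi_{ij}}\, U_{ij}\, \frac{\partial Q_{ijk}}{\partial B_s} \right\}$$

for $s, k = 1, \ldots, (J-1)+p+q$. The first term of the equation is

$$\frac{\partial^2 l_i}{\partial B_s \partial \phi_{ij}}\, U_{ij}\, Q_{ijk} = -\frac{r_{i,j+1}}{\gamma_{i,j+1}}\, U_{ij}\, Q_{ijs}\, Q_{ijk}.$$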

The second term is

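Expectation of the Second Derivative

Since $E(r_{ij}) = n_i \gamma_{ij}$, the expectation of $\partial l_i / \partial \phi_{ij}$ is zero, so only the first term of the second derivative contributes to the expectation:

$$E\left( \frac{\partial^2 l}{\partial B_s \partial B_k} \right) = \sum_{i=1}^{m} \sum_{j=1}^{J-1} E\left( \frac{\partial^2 l_i}{\partial B_s \partial \phi_{ij}} \right) U_{ij}\, Q_{ijk} = -\sum_{i=1}^{m} \sum_{j=1}^{J-1} \frac{E(r_{i,j+1})}{\gamma_{i,j+1}}\, U_{ij}\, Q_{ijs}\, Q_{ijk} = -\sum_{i=1}^{m} n_i \sum_{j=1}^{J-1} U_{ij}\, Q_{ijs}\, Q_{ijk}.$$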

Parameter Estimation

Maximum Likelihood Estimate

To obtain the maximum likelihood estimate of $B$, a Fisher scoring iterative estimation method or a Newton-Raphson iterative estimation method can be used. Let $B^{(t)}$ be the parameter vector at iteration $t$ and $\partial l / \partial B^{(t)}$ be a vector of the first derivatives of $l$ evaluated at $B = B^{(t)}$. Moreover, let $A^{(t)}$ be a $\{(J-1)+p+q\} \times \{(J-1)+p+q\}$ matrix such that

$$\left[ A^{(t)} \right]_{sk} = \begin{cases} -\left. \dfrac{\partial^2 l}{\partial B_s \partial B_k} \right|_{B = B^{(t)}} & \text{Newton-Raphson approach} \\[2ex] -\left. E\left( \dfrac{\partial^2 l}{\partial B_s \partial B_k} \right) \right|_{B = B^{(t)}} & \text{Fisher scoring approach} \end{cases}$$

For a location-only model, the corresponding formulas use the first $(J-1)+p$ elements of $\partial l / \partial B^{(t)}$ and the upper $\{(J-1)+p\} \times \{(J-1)+p\}$ submatrix of $A^{(t)}$. The parameter vector $B$ at iteration $t+1$ is updated by $B^{(t+1)}$ where

$$A^{(t)} B^{(t+1)} = A^{(t)} B^{(t)} + \xi\, \frac{\partial l}{\partial B^{(t)}}$$

and $\xi > 0$ is a stepping scalar such that $l(B^{(t+1)}) - l(B^{(t)}) \ge 0$.

Stepping

Use the step-halving method if $l(B^{(t+1)}) - l(B^{(t)}) < 0$. Let $V$ be the maximum number of steps in step-halving; the set of values of $\xi$ is $\{1/2^v : v = 0, \ldots, V-1\}$.
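A minimal sketch of the step-halving rule follows. When $A^{(t)}$ is invertible, the update equation is equivalent to $B^{(t+1)} = B^{(t)} + \xi d$ with $d$ solving $A^{(t)} d = \partial l / \partial B^{(t)}$, so the sketch takes the direction `d` as given. The function name, the toy log-likelihood, and the fallback when every step fails are assumptions; the source does not say what happens in that case.

```python
def step_halving_update(loglik, B, direction, V=5):
    """Try xi = 1, 1/2, ..., 1/2**(V-1) until the log-likelihood does not drop.

    loglik:    callable returning l(B) for a list of parameters
    direction: list d solving A d = dl/dB at the current B
    Returns (B_new, xi). If no step works, returns (B, None), which is a
    hypothetical fallback rather than documented behaviour.
    """
    l_old = loglik(B)
    for v in range(V):
        xi = 1.0 / 2 ** v
        B_new = [b + xi * d for b, d in zip(B, direction)]
        if loglik(B_new) - l_old >= 0:
            return B_new, xi
    return B, None


def toy_loglik(B):
    # Concave toy function with its maximum at B = [1.0].
    return -(B[0] - 1.0) ** 2


# A full step from 0 with direction 3 overshoots the maximum, so it is halved.
B_next, xi_used = step_halving_update(toy_loglik, [0.0], [3.0])
```

Here the full step lands at 3.0, which lowers the toy log-likelihood, and the first halving lands at 1.5, which raises it.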

Starting Values of the Parameters

Location-Only Model

If a location-only model is specified, set $B^{(0)} = \left( \theta^{(0)T}, 0^T \right)^T$ where

$$\theta_j^{(0)} = \operatorname{link}\left( \frac{\sum_{i=1}^{m} \sum_{k=1}^{j} n_{ik}}{\sum_{i=1}^{m} n_i} \right)$$

for $j = 1, \ldots, J-1$.
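The starting thresholds are simply the link function applied to the pooled cumulative proportions. A small sketch, assuming a logit link and a hypothetical table `counts[i][j]` holding $n_{ij}$:

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def starting_thresholds(counts, link=logit):
    """theta_j^(0) = link(pooled cumulative proportion up to category j)."""
    J = len(counts[0])
    n = sum(sum(row) for row in counts)  # sum of all marginal counts n_i
    thetas = []
    cumulative = 0.0
    for j in range(J - 1):
        cumulative += sum(row[j] for row in counts)  # add column j over all i
        thetas.append(link(cumulative / n))
    return thetas


counts = [[2, 3, 5],
          [3, 2, 5]]
theta0 = starting_thresholds(counts)
```

For this table the pooled cumulative proportions are 0.25 and 0.5, so the logit starting values are $\log(1/3)$ and 0. The location parameters start at zero, as stated above.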

General Model

If a general model is specified, first ignore the scale part (i.e., assume that $\tau = 0$ and treat the model as if it is a location-only model) and use $B^{(0)} = \left( \theta^{(0)T}, 0^T \right)^T$ as the starting value to obtain the maximum likelihood estimate $\tilde{B} = \left( \tilde{\theta}^T, \tilde{\beta}^T \right)^T$. After $\tilde{B}$ is obtained, find the maximum likelihood estimate $\hat{B} = \left( \hat{\theta}^T, \hat{\beta}^T, \hat{\tau}^T \right)^T$ of the general model by starting at $\left( \tilde{\theta}^T, \tilde{\beta}^T, 0^T \right)^T$.

The above practice is essentially the same as taking $B^{(0)} = \left( \theta^{(0)T}, 0^T, 0^T \right)^T$. The advantage is that the maximum likelihood estimate $\tilde{B}$ can be obtained in the process of finding $\hat{B}$.

Ordinal Adjustments for the Threshold Parameters

If the monotonicity property $\theta_1 \le \cdots \le \theta_{J-1}$ is not preserved at the end of any iteration, an ad hoc adjustment will be made before the next iteration starts. If $\theta_j^{(t)} > \theta_{j+1}^{(t)}$ for some $j$, then both $\theta_j^{(t)}$ and $\theta_{j+1}^{(t)}$ are set to $\left( \theta_j^{(t)} + \theta_{j+1}^{(t)} \right) / 2$ before the next iteration. This value is then compared with $\theta_{j+2}^{(t)}$ and so on.
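One way to read the adjustment is as a single forward pass over the thresholds, sketched below. The function name is hypothetical, and the single-pass reading is an assumption: as the second example shows, one pass can leave an earlier pair out of order, and the text does not say whether the procedure repeats.

```python
def adjust_thresholds(theta):
    """Forward pass: average any adjacent pair that violates theta_j <= theta_{j+1}."""
    theta = list(theta)
    for j in range(len(theta) - 1):
        if theta[j] > theta[j + 1]:
            average = (theta[j] + theta[j + 1]) / 2.0
            theta[j] = theta[j + 1] = average
            # The averaged value is compared with theta[j + 2] on the next loop step.
    return theta


fixed = adjust_thresholds([0.0, 2.0, 1.0, 3.0])
chained = adjust_thresholds([1.0, 0.0, 0.25])
```

In the first call only the middle pair is out of order and both values become 1.5. In the second call the averaged value 0.5 is itself larger than 0.25, so the last two entries are averaged again to 0.375 while the first entry stays at 0.5.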

Convergence Criteria

Given two convergence criteria $\varepsilon_k > 0$ and $\varepsilon_p > 0$, the iteration is considered to have converged if one of the following criteria is satisfied:

  1. $\left| l(B^{(t+1)}) - l(B^{(t)}) \right| < \varepsilon_k$.
  2. $\max_i \left| B_i^{(t+1)} - B_i^{(t)} \right| < \varepsilon_p$.
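The two criteria translate directly into a small predicate. The function name and the default tolerances below are hypothetical; the source only requires both tolerances to be positive.

```python
def has_converged(l_new, l_old, B_new, B_old, eps_k=1e-5, eps_p=1e-6):
    """True if either the log-likelihood or the parameter criterion is met."""
    loglik_ok = abs(l_new - l_old) < eps_k
    params_ok = max(abs(a - b) for a, b in zip(B_new, B_old)) < eps_p
    return loglik_ok or params_ok


small_loglik_change = has_converged(-10.0, -10.000001, [1.0, 2.0], [1.5, 2.0])
neither = has_converged(-10.0, -12.0, [1.0], [1.5])
```

Because the criteria are joined by "or", a tiny change in the log-likelihood is enough to stop even when a parameter is still moving, as in the first example.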

Location-Only Model versus Intercept-Only Model

The following statistic is available when a location-only model is specified. The Model Chi-square statistic is given by

$$-2\, l(B^{(0)}) + 2\, l(\tilde{B}).$$

Under the null hypothesis $H_0: \beta = 0$, the Model Chi-square is asymptotically chi-squared distributed with $\tilde{e} - (J-1)$ degrees of freedom.

General Model versus Location-Only Model

The following statistic is available when a general model is specified. The Model Chi-square statistic is given by

$$-2\, l(\tilde{B}) + 2\, l(\hat{B}).$$

Under the null hypothesis $H_0: \tau = 0$, the Model Chi-square is asymptotically chi-squared distributed with $\hat{e} - \tilde{e}$ degrees of freedom.

Likelihood Ratio Test for Equal Slopes Assumption

For a location-only model, a likelihood ratio test of parallel lines in the location is performed. If the regression lines are not parallel, the location can be specified as

$$\eta_{ij} = \theta_j - \beta_j^T x_i$$

for $j = 1, \ldots, J-1$. That is, the location parameters $\beta_j$ (or slopes) vary with the levels of the response. The parameter vector for the above "non-parallel" location-only model is $B = (\theta^T, \beta_1^T, \ldots, \beta_{J-1}^T)^T$, which is of dimension $\{(J-1)+(J-1)p\} \times 1$. The first derivative $\partial l / \partial B$ of the log-likelihood is the same as in the "parallel" model, except that $P_{ijk} = \partial \eta_{ij} / \partial B_k$ is replaced by the following:

$$P_{ijk} = \frac{\partial \eta_{ij}}{\partial B_k} = \begin{cases} \delta_{jk} & \text{if } 1 \le k \le J-1 \\ -x_{i[k-\{(J-1)+sp\}]} & \text{if } (J-1)+sp+1 \le k \le (J-1)+sp+p \text{ with } s = j-1 \\ 0 & \text{otherwise} \end{cases}$$

Here the slope blocks are indexed by $s = 0, \ldots, J-2$, and block $s$ holds $\beta_{s+1}$, so $\eta_{ij}$ depends only on block $s = j-1$. Similarly, the expected value of the second derivative is the same as in the parallel model, except that $P_{ijk}$ is replaced by the above equation. To test the null hypothesis of parallelism $H_0: \beta_1 = \cdots = \beta_{J-1}$, find the maximum likelihood estimate $\tilde{B}$ of the parallel location-only model and the maximum likelihood estimate $\tilde{B}_{\mathrm{np}}$ of the non-parallel model. The Model Chi-square statistic is given by

$$-2\, l(\tilde{B}) + 2\, l(\tilde{B}_{\mathrm{np}}).$$

Under the null hypothesis, the Model Chi-square statistic is asymptotically chi-squared distributed with $(J-2)p$ degrees of freedom, the difference between the $(J-1)+(J-1)p$ parameters of the non-parallel model and the $(J-1)+p$ parameters of the parallel model.

Pseudo R Squares

Cox and Snell’s R Square

The Cox and Snell's $R^2$ for a general model is

$$R^2_{CS} = 1 - \left( \frac{L(B^{(0)})}{L(\hat{B})} \right)^{2/n}.$$

Replace $\hat{B}$ by $\tilde{B}$ for a location-only model.

Nagelkerke’s R Square

The Nagelkerke's $R^2$ is

$$R^2_N = \frac{R^2_{CS}}{1 - L(B^{(0)})^{2/n}}.$$

McFadden’s R Square

The McFadden's $R^2$ for a general model is

$$R^2_M = 1 - \frac{l(\hat{B})}{l(B^{(0)})}.$$

Replace $\hat{B}$ by $\tilde{B}$ for a location-only model.
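The three pseudo R-square measures can be computed from two log-likelihood values and $n$. The sketch below assumes that $L = \exp(l)$, so that $(L(B^{(0)})/L(\hat{B}))^{2/n} = \exp\{2(l_0 - l_1)/n\}$, and that both log-likelihoods are computed on the same footing (the excerpt does not say whether the multinomial constant is included, which matters for the Nagelkerke and McFadden values). The function name and the example numbers are hypothetical.

```python
import math


def pseudo_r_squares(l0, l1, n):
    """l0: log-likelihood at B^(0); l1: log-likelihood at the fitted estimate;
    n: sum of all frequency weights."""
    cox_snell = 1.0 - math.exp(2.0 * (l0 - l1) / n)
    nagelkerke = cox_snell / (1.0 - math.exp(2.0 * l0 / n))
    mcfadden = 1.0 - l1 / l0
    return cox_snell, nagelkerke, mcfadden


cs, nk, mf = pseudo_r_squares(l0=-100.0, l1=-80.0, n=50)
```

Working on the log scale avoids underflow, since the likelihoods themselves are usually far too small to represent directly.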

Predicted Cell Counts & Cumulative Totals

Predicted Cell Counts

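The estimated cell response probability based on the maximum likelihood estimate for the general model is

$$\hat{\pi}_{ij} = \begin{cases} \hat{\gamma}_{i1} & j = 1 \\ \hat{\gamma}_{ij} - \hat{\gamma}_{i,j-1} & 2 \le j \le J-1 \\ 1 - \hat{\gamma}_{i,J-1} & j = J \end{cases}$$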
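Recovering the cell probabilities from the cumulative probabilities is a differencing step. A short sketch with a hypothetical function name, taking the $J-1$ estimated cumulative probabilities of one subpopulation:

```python
def cell_probabilities(gamma):
    """gamma: [gamma_i1, ..., gamma_i,J-1]; returns [pi_i1, ..., pi_iJ]."""
    pi = [gamma[0]]                       # j = 1
    for j in range(1, len(gamma)):
        pi.append(gamma[j] - gamma[j - 1])  # 2 <= j <= J-1
    pi.append(1.0 - gamma[-1])            # j = J
    return pi


probs = cell_probabilities([0.2, 0.7])
```

The returned probabilities sum to one by construction, and they are non-negative whenever the cumulative probabilities are non-decreasing.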

Covariance and Correlation Matrices

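The estimate of the covariance matrix of $\hat{B}$ is

$$\mathrm{Cov}(\hat{B}) = \begin{cases} \left[ -\left. \dfrac{\partial^2 l}{\partial B\, \partial B^T} \right|_{B = \hat{B}} \right]^{-1} & \text{Newton-Raphson method} \\[2ex] \left[ -\left. E\left( \dfrac{\partial^2 l}{\partial B\, \partial B^T} \right) \right|_{B = \hat{B}} \right]^{-1} & \text{Fisher scoring method} \end{cases}$$

Let $\hat{\sigma}$ be the $\{(J-1)+p+q\} \times 1$ vector of the square roots of the diagonal elements in $\mathrm{Cov}(\hat{B})$. The estimate of the correlation matrix of $\hat{B}$ is

$$\mathrm{Cor}(\hat{B}) = \mathrm{Diag}(\hat{\sigma})^{-1}\, \mathrm{Cov}(\hat{B})\, \mathrm{Diag}(\hat{\sigma})^{-1}.$$

Replace $\hat{B}$ by $\tilde{B}$ and $\hat{\sigma}$ by $\tilde{\sigma}$ (a $\{(J-1)+p\} \times 1$ vector) for a location-only model.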
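Pre- and post-multiplying by the inverse diagonal matrix amounts to dividing each covariance entry by the two corresponding standard deviations. A sketch with a hypothetical function name and a made-up 2 x 2 covariance matrix:

```python
import math


def correlation_from_covariance(cov):
    """Scale entry (s, k) of cov by 1 / (sigma_s * sigma_k)."""
    sigma = [math.sqrt(cov[k][k]) for k in range(len(cov))]
    return [[cov[s][k] / (sigma[s] * sigma[k]) for k in range(len(cov))]
            for s in range(len(cov))]


cov = [[4.0, 1.0],
       [1.0, 9.0]]
cor = correlation_from_covariance(cov)
```

With standard deviations 2 and 3, the off-diagonal correlation is 1/6 and the diagonal entries are 1.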

Parameter Statistics

An estimate of the standard deviation of $\hat{B}_k$ is $\hat{\sigma}_k$. The Wald statistic for $\hat{B}_k$ is

$$\mathrm{Wald}_k = \left( \frac{\hat{B}_k}{\hat{\sigma}_k} \right)^2.$$

Under the null hypothesis $H_0: B_k = 0$, $\mathrm{Wald}_k$ is asymptotically chi-squared distributed with 1 degree of freedom. Based on the asymptotic normality of the parameter estimate, a $100(1-\alpha)\%$ Wald confidence interval for $B_k$ is

$$\hat{B}_k \pm z_{1-\alpha/2}\, \hat{\sigma}_k$$

where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution. Replace $\hat{B}_k$ by $\tilde{B}_k$ and $\hat{\sigma}_k$ by $\tilde{\sigma}_k$ for a location-only model.
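The Wald statistic and confidence interval for a single parameter can be computed with the standard library's normal quantile function. The function name and the example estimate are hypothetical.

```python
from statistics import NormalDist


def wald_summary(estimate, std_err, alpha=0.05):
    """Return the Wald statistic and the 100(1 - alpha)% Wald interval."""
    wald = (estimate / std_err) ** 2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal percentile
    return wald, (estimate - z * std_err, estimate + z * std_err)


wald, (lower, upper) = wald_summary(estimate=1.0, std_err=0.5)
```

For an estimate of 1.0 with standard error 0.5 the Wald statistic is 4, and the 95% interval is the estimate plus or minus roughly 1.96 standard errors.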

Linear Hypothesis Testing

For a general model, let $L$ be a matrix of coefficients for the linear hypotheses

$$H_0: LB = c$$

where $c$ is a $k \times 1$ vector of constants. The Wald statistic for $H_0$ is

$$\mathrm{Wald}(L, c) = (L\hat{B} - c)^T \left\{ L\, \mathrm{Cov}(\hat{B})\, L^T \right\}^{-1} (L\hat{B} - c).$$

Under the null hypothesis, $\mathrm{Wald}(L, c)$ is asymptotically chi-squared distributed with $l$ degrees of freedom, where $l$ is the rank of $L$. Replace $\hat{B}$ by $\tilde{B}$ for a location-only model.
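The quadratic form can be evaluated without forming an explicit inverse by solving $\{L\,\mathrm{Cov}(\hat{B})\,L^T\}\, y = L\hat{B} - c$ and taking the inner product with $L\hat{B} - c$. The sketch below uses plain lists and a small Gaussian elimination, and assumes $L$ has full row rank. All names and numbers are hypothetical.

```python
def solve(M, d):
    """Solve M y = d by Gaussian elimination with partial pivoting."""
    k = len(d)
    aug = [list(row) + [d[r]] for r, row in enumerate(M)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for cc in range(col, k + 1):
                aug[r][cc] -= factor * aug[col][cc]
    y = [0.0] * k
    for r in reversed(range(k)):
        rest = sum(aug[r][cc] * y[cc] for cc in range(r + 1, k))
        y[r] = (aug[r][k] - rest) / aug[r][r]
    return y


def wald_linear(L, B, cov, c):
    """Wald(L, c) = (L B - c)^T {L Cov L^T}^{-1} (L B - c)."""
    k, p = len(L), len(B)
    d = [sum(L[r][a] * B[a] for a in range(p)) - c[r] for r in range(k)]
    LC = [[sum(L[r][a] * cov[a][b] for a in range(p)) for b in range(p)]
          for r in range(k)]
    M = [[sum(LC[r][b] * L[s][b] for b in range(p)) for s in range(k)]
         for r in range(k)]
    y = solve(M, d)
    return sum(di * yi for di, yi in zip(d, y))


cov = [[4.0, 1.0], [1.0, 9.0]]
# Test that the two parameters are equal: L = [1, -1], c = [0].
w_equal = wald_linear([[1.0, -1.0]], [1.0, 2.0], cov, [0.0])
```

For the single contrast above, $L\hat{B} - c = -1$ and $L\,\mathrm{Cov}\,L^T = 4 - 1 - 1 + 9 = 11$, so the statistic is $1/11$ with one degree of freedom.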
