Validity, Types of Validity - Basic Statistics for Behavioral Sciences - Lecture Notes, Study notes of Statistics for Psychologists

Validity, Types of Validity, Content Validity, Criterion Related Validity, Concurrent Validity, Predictive Validity, Construct Validity, Attenuation Theory, Dichotomous Prediction are learning points of this lecture.

Typology: Study notes

2011/2012

Uploaded on 11/21/2012

ashakiran
Ch. 5. Validity
I. Definition: The extent to which a test measures what it is supposed to measure.
II. Types of validity
A. Content validity: assessing the content of a test using non-mathematical methods
to check if the behaviors sampled by the test are a representative sample of the
trait being measured.
B. Criterion-related validity: correlation between test scores (X) and criterion scores
(Y).
1. Concurrent validity: correlation between test scores (X) and criterion
scores (Y) currently available.
2. Predictive validity: correlation between test scores (X) and future criterion
scores (Y).
C. Construct validity: the validity of a test for a theoretical concept or trait the test is
designed to measure.
III. Topics concerning the criterion-related validity
A. Attenuation theory
1. The validity of a test is attenuated by the imperfect reliability of the test and of the criterion.
2. The correction for attenuation (Spearman, 1904) estimates the validity that would be
obtained if both tests were perfectly reliable:

$$\rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}$$

where $\rho_{T_X T_Y}$: validity of test X with criterion Y assuming no measurement error in both tests, $\rho_{XY}$: observed validity of test X for criterion Y, $\rho_{XX'}$: reliability of test X, and $\rho_{YY'}$: reliability of test Y.

3. It is possible to correct for attenuation in only one test.

$$\rho_{X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{YY'}}}$$

for the criterion scores (Y) with no measurement error, or

$$\rho_{T_X Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}}}$$

for the test scores (X) with no measurement error.
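The attenuation formulas can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `disattenuate` and the numeric values are hypothetical, and a reliability left at 1.0 means no correction is applied for that measure.

```python
import math

def disattenuate(r_xy, r_xx=1.0, r_yy=1.0):
    """Spearman's correction: divide the observed validity by the square
    root of the product of the reliabilities being corrected for.
    Leaving a reliability at 1.0 skips the correction for that measure."""
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed validity .40, reliabilities .80 (X) and .50 (Y)
both = disattenuate(0.40, r_xx=0.80, r_yy=0.50)    # error removed from X and Y
criterion_only = disattenuate(0.40, r_yy=0.50)     # error removed from Y only
print(round(both, 3), round(criterion_only, 3))
```

A corrected coefficient above 1 is possible when the reliability estimates are too low, so the result is best read as an estimate rather than a bound.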

B. Confidence interval for a criterion (Y)

$$\hat{Y}_i = r_{XY}\,\frac{s_Y}{s_X}\,(X_i - \bar{X}) + \bar{Y}$$

If the regression assumptions are met, $CI = \hat{Y} \pm z_{\alpha/2}\, s_{Y.X}$, where $s_{Y.X} = s_Y\sqrt{1 - r_{XY}^2}$.
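As a rough sketch of the prediction and its interval, the following Python function applies the two formulas directly. The function name and all numeric inputs are hypothetical; `statistics.NormalDist` from the standard library supplies the normal quantile.

```python
import math
from statistics import NormalDist

def predict_with_interval(x, mean_x, sd_x, mean_y, sd_y, r_xy, alpha=0.05):
    """Return the predicted criterion score and its (1 - alpha) interval."""
    y_hat = r_xy * (sd_y / sd_x) * (x - mean_x) + mean_y
    se_est = sd_y * math.sqrt(1 - r_xy ** 2)       # s_{Y.X}
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}
    return y_hat, (y_hat - z * se_est, y_hat + z * se_est)

# Hypothetical example: test mean 50 (SD 10), criterion mean 3.0 (SD 0.5), r = .60
y_hat, (lower, upper) = predict_with_interval(60, 50, 10, 3.0, 0.5, 0.60)
print(round(y_hat, 2), round(lower, 2), round(upper, 2))
```

The interval width depends only on the criterion SD and the validity coefficient, so a higher validity narrows the band around every predicted score.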

C. Dichotomous prediction

  1. Hit rate = (total correct predictions) / (total sample size)
  2. Base rate
     a) Maximum chance criterion (PM) = the largest prior probability.
     b) Chance criterion (PC) = $P_1^2 + P_2^2 + \dots + P_K^2$, where $P_k$ is the prior probability of group k.
     c) When P = .5 and K = 2, PM = PC; otherwise PM > PC.
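The hit rate and the two chance criteria can be computed as follows. This is a small sketch with hypothetical function names and made-up predictions; the priors are the group proportions.

```python
def hit_rate(predicted, actual):
    """Proportion of cases whose predicted group matches the actual group."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

def max_chance(priors):
    """PM: the largest prior probability."""
    return max(priors)

def proportional_chance(priors):
    """PC: the sum of squared prior probabilities."""
    return sum(p * p for p in priors)

# Hypothetical two-group example with priors .7 and .3
rate = hit_rate([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
pm = max_chance([0.7, 0.3])
pc = proportional_chance([0.7, 0.3])
print(rate, pm, round(pc, 2))
```

A hit rate is informative only when it exceeds these base rates; with equal priors in two groups both criteria reduce to .5.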

  1. E(distractor) = (# of persons answering an item incorrectly) / (# of distractors)

  2. An extremely popular distractor is likely to lower the reliability and validity of the test.

B. Item difficulty

  1. $p_i$ = (# of persons answering item i correctly) / (total # of persons taking the item)

  2. An item with a p-value of 0 or 1 is useless.

  3. If $p_i = .5$, the variance of the item score, $p_i q_i$, is maximized, but if the inter-item correlation is 1, the test will classify examinees into only two groups.

  4. Items with $p_i$ in the range .3 - .7 and an average $p_i$ of .5 are ideal.

C. Item discrimination index

  1. $d_i = U_i/n_{iU} - L_i/n_{iL}$ where $U_i$: # of people in the upper group who answer item i correctly, $n_{iU}$: # of people in the upper group, $L_i$: # of people in the lower group who answer item i correctly, and $n_{iL}$: # of people in the lower group.

  2. In many cases, the upper group and the lower group are equal in sample size, thus $d_i = (U_i - L_i)/n_i$.

  3. $d_i$ is the difference between the proportion of high-scoring examinees who get the item correct and the proportion of low-scoring examinees who get the item correct.

  4. The upper and lower groups each typically comprise between 10% and 33% of the examinees. If the test scores are normally distributed, 27% is optimum.

  5. The range of $d_i$ is -1 to +1. If $d_i$ is negative, the item should be discarded.

D. Item-total correlation (point-biserial or biserial)

$$r_{iX} = \frac{\bar{X}_i - \bar{X}}{s_X}\sqrt{\frac{p_i}{1 - p_i}}$$

where $\bar{X}_i$: the mean test score of those who answer item i correctly, $\bar{X}$ and $s_X$ are the mean and SD of all examinees, and $p_i$ is the item difficulty index.

  1. If the item score and the test score are highly correlated, the coefficient becomes high.
  2. The easier the item and the larger the difference between $\bar{X}_i$ and $\bar{X}$, the higher the correlation.

E. Item reliability and item validity
  1. Assume a case where we want to select k items from N items (N ≥ k) to build a test.
  2. $s_i$: item score standard deviation, $s_i = \sqrt{p_i(1 - p_i)}$.
  3. Item reliability index: $s_i r_{iX^*}$ (X*: total test score from N items).
  4. Item validity index: $s_i r_{iY}$ (Y: criterion score).
  5. Using the previous information we can compute some statistics for the newly selected test with k items:

$$\bar{X} = \sum_{i=1}^{k} p_i$$

$$s_X = \sum_{i=1}^{k} s_i r_{iX}$$

$$r_{XX'} = \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{\left(\sum_{i=1}^{k} s_i r_{iX}\right)^2}\right]$$

$$r_{XY} = \frac{\sum_{i=1}^{k} s_i r_{iY}}{\sum_{i=1}^{k} s_i r_{iX}}$$

F. Item-Characteristic Curves

  1. A graphical display of the relationship between the probability of passing an item and the examinee’s position on the underlying trait (test score).
  2. A positive slope and a moderate difficulty level are desirable.
  3. Related to IRT.

G. Factor analysis

  1. Items measuring the same trait should be highly correlated with each other and have high loadings on the same factor.
  2. The type of correlation and some demographic variables may influence the factor analysis results.
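The classical item statistics in this chapter (item difficulty, the discrimination index, and the point-biserial item-total correlation) can be sketched on a small 0/1 response matrix. The function names and the data are hypothetical, the upper and lower groups are taken as the top and bottom 27% by total score, and ties in total score are broken by the order of the rows.

```python
import math
from statistics import mean, pstdev

def item_difficulty(responses, i):
    """p_i: proportion of examinees answering item i correctly."""
    return mean(row[i] for row in responses)

def discrimination_index(responses, i, fraction=0.27):
    """d_i: proportion correct in the upper group minus the lower group."""
    ranked = sorted(responses, key=sum, reverse=True)
    n_group = max(1, round(fraction * len(ranked)))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    return (sum(r[i] for r in upper) - sum(r[i] for r in lower)) / n_group

def point_biserial(responses, i):
    """r_iX from the group-mean formula; undefined when p_i is 0 or 1."""
    totals = [sum(row) for row in responses]
    p = item_difficulty(responses, i)
    mean_correct = mean(t for row, t in zip(responses, totals) if row[i] == 1)
    return (mean_correct - mean(totals)) / pstdev(totals) * math.sqrt(p / (1 - p))

# Hypothetical responses: 8 examinees (rows) by 4 items (columns)
data = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0],
    [0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0],
]
print(item_difficulty(data, 0), discrimination_index(data, 0),
      round(point_biserial(data, 0), 3))
```

Here the total score includes the item being analyzed, which matches the formula above; a corrected item-total correlation would exclude the item from the total.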

Ch. 2: Concepts, Assumptions, and Models

I. Concepts and Features
A. An examinee’s performance on a test is a monotonically increasing function of a set of latent traits or abilities.
B. The item characteristic curve (ICC) describes the function between an examinee’s performance on a test item and the latent trait.
C. The IRT model was adopted from psychophysics (threshold) and biology (lethal dose). → OHP.
D. The IRT models are falsifiable models.
E. The IRT models offer the possibility of computing invariant item and person parameters.

II. Assumptions
A. Unidimensionality of the latent space

  1. Only one latent trait is measured by the items in one test.
  2. In reality, if the items in one test measure one dominant latent trait, the unidimensionality assumption is met.
  3. Multi-dimensional IRT models can be developed, Mulaik (1972) and Samejima (1974).
  4. Factor analysis can be used to test the unidimensionality assumption.

B. Local Independence (Conditional Independence)

  1. For a given ability level, the probability of getting one item correct is independent of the probability of getting other items correct: $P(u_i = 1, u_j = 1 \mid \theta) = P(u_i = 1 \mid \theta)\,P(u_j = 1 \mid \theta)$, $i \neq j$.
  2. If we partial out the effect of ability ($\theta$), then the probabilities of getting the items correct are independent of each other.
  3. The ability specified in the model is the only factor influencing an examinee’s performance.
  4. If we meet the unidimensionality assumption, then the complete latent space has only one trait.
  5. Even if the latent space has multiple traits (multi-dimensionality), as long as all traits are specified in the model, the local independence assumption will hold.
  6. If one item is related to another item (cue or hint), the local independence assumption will not hold.

III. IRT models
A. Two mathematical models

  1. Normal Ogive Models: utilize the cumulative normal curve.
  2. Logistic Models: mathematically simpler than the Normal Ogive Models, and the practical difference from the N.O.M. is less than .01 (OHP for ICC).

B. Normal Ogive Models: by Lord (1952)

  1. One-parameter Normal Ogive Model

     a) $P_i(\theta) = \int_{-\infty}^{\theta - b_i} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$

        where $P_i(\theta)$: the probability of getting item i correct for a given $\theta$, $\theta$: latent trait (ability or proficiency), $b_i$: item difficulty parameter, and z: the standard normal deviate, $z = (X - \mu)/\sigma$.

     b) The probability of getting item i correct is a function of ability ($\theta$) and the item difficulty parameter ($b_i$) only.

  2. Two-parameter Normal Ogive Model

     a) $P_i(\theta) = \int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$

        where $a_i$: item discrimination parameter.

     b) $P_i(\theta)$ is a function of $a_i$ and $b_i$ in addition to $\theta$.

  3. Three-parameter Normal Ogive Model

     a) $P_i(\theta) = c_i + (1 - c_i)\int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz$

        where $c_i$: pseudo-chance parameter.

     b) $P_i(\theta)$ is a function of $a_i$, $b_i$, $c_i$, and $\theta$.

  4. Since the Normal Ogive Models contain double integrals for the ability estimation, the models are of theoretical interest only.
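The claim that the logistic curve stays within .01 of the normal ogive can be checked numerically. The sketch below assumes the usual scaling factor D = 1.7 and scans a grid of deviates from -4 to 4; the function names are illustrative only.

```python
import math

def normal_ogive(z):
    """Cumulative standard normal probability at z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic(z, D=1.7):
    """Logistic curve rescaled by D to approximate the normal ogive."""
    return 1 / (1 + math.exp(-D * z))

# Largest absolute gap between the two curves over the grid
grid = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(normal_ogive(z) - logistic(z)) for z in grid)
print(round(max_gap, 4))
```

Setting `D=1.0` in `logistic` shows how much larger the gap becomes without the scaling factor.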

C. Logistic Models (Birnbaum, 1968)

  1. One-parameter Logistic (1-pl) Model (Rasch Model)

     a) $P_i(\theta) = \frac{e^{D(\theta - b_i)}}{1 + e^{D(\theta - b_i)}} = \frac{1}{1 + e^{-D(\theta - b_i)}} = \left[1 + e^{-D(\theta - b_i)}\right]^{-1}$

        *Original Rasch Model: $P_i(\theta^*) = \frac{\theta^*}{\theta^* + b^*}$ (later).

     b) D = 1; D is a scaling factor for the N.O.M.
     c) $P_i(\theta)$ is a function of $b_i$ and $\theta$.

  2. Two-parameter Logistic (2-pl) Model

     a) $P_i(\theta) = \frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} = \frac{1}{1 + e^{-D a_i(\theta - b_i)}}$

     b) $P_i(\theta)$ is a function of $a_i$ and $b_i$ in addition to $\theta$.
     c) D = 1.7, a scaling factor for the N.O.M.

  3. Three-parameter Logistic (3-pl) Model

     a) $P_i(\theta) = c_i + (1 - c_i)\frac{e^{D a_i(\theta - b_i)}}{1 + e^{D a_i(\theta - b_i)}} = c_i + \frac{1 - c_i}{1 + e^{-D a_i(\theta - b_i)}}$

     b) $P_i(\theta)$ is a function of $a_i$, $b_i$, $c_i$, and $\theta$.
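The three logistic models are nested, so one function covers them all. In this sketch the name `irf` and the parameter values are hypothetical: setting `c=0` gives the 2-pl model, and additionally fixing `a=1` gives the 1-pl model.

```python
import math

def irf(theta, a=1.0, b=0.0, c=0.0, D=1.7):
    """3-pl item response function: c + (1 - c) / (1 + exp(-D a (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

# At theta = b the logistic part equals .5, so P = c + (1 - c) / 2
p_3pl = irf(1.0, a=1.2, b=1.0, c=0.2)
p_2pl = irf(1.0, a=1.2, b=1.0)
p_1pl = irf(0.5, b=0.0, D=1.0)
print(p_3pl, p_2pl, round(p_1pl, 3))
```

As theta decreases, the 3-pl probability approaches the pseudo-chance level c instead of zero.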

e.g. For u’ = [1, 1, 0], we can have $L = P_1 P_2 Q_3$. If $P_1 = .50$, $P_2 = .40$, and $P_3 = .30$, then L = (.50)(.40)(.70) = .14.

II. Ability Estimation
A. For a given set of item responses, the main job of a psychometrician is to estimate the examinee’s true ability using the likelihood function of the response pattern.
B. There are several estimation methods depending on the algorithm applied (MLE, BME, EAP, WLE).
C. MLE

  1. The value of $\theta$ that maximizes the likelihood function is taken as the estimate of the examinee’s true ability.
  2. Obtain the first derivative of the natural log of the likelihood function (l) to get the slope of the function:

     $$l' = \frac{\partial \ln L}{\partial \theta} = \sum_{i} \frac{P_i'\,(u_i - P_i)}{P_i Q_i}$$

     where

     $$P_i' = \frac{D a_i Q_i (P_i - c_i)}{1 - c_i}$$

  3. Set l’ to zero (the slope is set to zero), then solve the equation for $\theta$. → MLE.
  4. There is no closed-form solution for the equation. → Use numerical analysis methods.
  5. Either the bisection or the Newton-Raphson method.
  6. When an examinee answers all items correctly or all items incorrectly, then $\hat{\theta} = +\infty$ or $-\infty$, respectively. → Truncation is needed.
  7. Sometimes the maximum value does not exist. → Aberrant responses.
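The MLE steps above can be sketched with the bisection method applied to l'. This is an illustration under stated assumptions, not a production estimator: the item parameters are hypothetical, the search is truncated to [-4, 4], and bisection assumes l' changes sign only once in that range, which holds for the c = 0 items used here.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def score(theta, items, u, D=1.7):
    """l'(theta) = sum of P_i'(u_i - P_i) / (P_i Q_i)."""
    total = 0.0
    for (a, b, c), ui in zip(items, u):
        p = p3pl(theta, a, b, c, D)
        q = 1 - p
        dp = D * a * q * (p - c) / (1 - c)     # first derivative P_i'
        total += dp * (ui - p) / (p * q)
    return total

def mle_theta(items, u, lo=-4.0, hi=4.0, tol=1e-8):
    """Bisection on l'; all-correct or all-incorrect patterns are truncated."""
    if all(u):
        return hi
    if not any(u):
        return lo
    if score(lo, items, u) <= 0:
        return lo
    if score(hi, items, u) >= 0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid, items, u) > 0:           # still climbing: move right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical items as (a, b, c) tuples and the response pattern [1, 1, 0]
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_hat = mle_theta(items, [1, 1, 0])
print(round(theta_hat, 3))
```

With c > 0 the log-likelihood can have more than one local maximum, so a real implementation would check several starting intervals.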

*Bayes’ Rule

$$P(A_i \mid B) = \frac{P(A_i \cap B)}{P(B)}$$

$$= \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} \quad \text{(binary)}$$

$$= \frac{P(B \mid A_i)\,P(A_i)}{\sum_i P(B \mid A_i)\,P(A_i)} \quad \text{(polytomous)}$$

$$= \frac{P(B \mid A_i)\,P(A_i)}{\int P(B \mid A)\,P(A)\,dA} \quad \text{(continuous)}$$

Example: What is the probability that your house is on fire when the fire alarm goes off? A: your house is on fire. B: the fire alarm goes off. P(A): the prior probability, the marginal probability that a randomly chosen house is on fire (.0001), one out of 10K houses. $P(\bar{A})$: 1 – P(A) = .9999. P(B|A) = hit ratio (.98). $P(B \mid \bar{A})$ = false alarm (.01).

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}$$
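Plugging the fire-alarm numbers into the binary form of Bayes' rule can be done with a short Python sketch. The function name `posterior` is illustrative; the probabilities are the ones given in the example.

```python
def posterior(prior, hit, false_alarm):
    """P(A|B) for a binary event A with evidence B."""
    evidence = hit * prior + false_alarm * (1 - prior)   # P(B)
    return hit * prior / evidence

# P(A) = .0001, P(B|A) = .98, P(B|not A) = .01
p_fire = posterior(prior=0.0001, hit=0.98, false_alarm=0.01)
print(round(p_fire, 4))
```

Even with a 98% hit ratio, the posterior stays below 1% because the prior probability of a fire is so small.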

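D. Bayesian Methods

  1. Basic Bayesian Model

     $$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_i P(B \mid A_i)\,P(A_i)} \quad \text{(polytomous)}$$

     $$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\int P(B \mid A)\,P(A)\,dA} \quad \text{(continuous)}$$

     where P(A|B) = posterior probability of A given B, and P(A) = prior probability of A, the marginal probability of A.

  2. In IRT the posterior distribution of $\theta$ given u can be expressed as

     $$P(\theta \mid u) = \frac{L(u \mid \theta)\, f(\theta)}{\int L(u \mid \theta)\, f(\theta)\, d\theta}$$

     where $L(u \mid \theta) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1 - u_i}$, the conditional probability of u given $\theta$, $f(\theta)$ = the prior distribution of $\theta$ [e.g., N(0, 1)], and $P(\theta \mid u)$ = posterior probability of $\theta$ given u.

  3. EAP (Expected A Posteriori): the mean of the posterior distribution of $\theta$.
  4. $E(X) = \sum_{i=1}^{n} p(X = x_i)\, x_i$ for a discrete variable, or $E(X) = \int p(x)\, x\, dx$ for a continuous variable.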
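An EAP estimate can be approximated by replacing the integrals with sums over a grid of ability values. The sketch below assumes a N(0, 1) prior, an evenly spaced grid on [-4, 4], and hypothetical item parameters; operational programs typically use quadrature points and weights instead.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def likelihood(theta, items, u):
    """L(u|theta): product of P_i for correct and Q_i for incorrect items."""
    value = 1.0
    for (a, b, c), ui in zip(items, u):
        p = p3pl(theta, a, b, c)
        value *= p if ui == 1 else 1 - p
    return value

def eap(items, u, lo=-4.0, hi=4.0, n_points=81):
    """Posterior mean and SD of theta on an evenly spaced grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    # Unnormalized posterior: likelihood times the N(0, 1) prior density shape
    post = [likelihood(t, items, u) * math.exp(-t * t / 2) for t in grid]
    total = sum(post)
    post_mean = sum(t * w for t, w in zip(grid, post)) / total
    post_var = sum((t - post_mean) ** 2 * w for t, w in zip(grid, post)) / total
    return post_mean, math.sqrt(post_var)

# Hypothetical two-item test; an all-correct pattern still gets a finite estimate
items = [(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
theta_eap, post_sd = eap(items, [1, 1])
print(round(theta_eap, 3), round(post_sd, 3))
```

Unlike the MLE, the EAP remains finite for all-correct and all-incorrect response patterns because the prior pulls the estimate toward its mean.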

III. Item parameter estimation
A. Likelihood function

$$L = L(u \mid \theta, a, b, c) = \prod_{i=1}^{n} P_i^{u_i} Q_i^{1 - u_i}$$

B. Log likelihood function: $l = \ln L = \sum \left[u_i \ln P_i + (1 - u_i) \ln Q_i\right]$.
C. Compute the first derivative of l with regard to each item parameter.
D. Set each of the derivatives to zero and simultaneously solve the equations with 3 unknowns.
E. The multivariate Newton-Raphson method will be used for each item.
F. Local independence is not required since we are estimating each item separately. Instead, the independence of examinees’ responses is required.

IV. Joint estimation of the item and ability parameters
A. Joint MLE

  1. Stage 1.
     a) Ability parameters are computed from X/(N-X) for each examinee as a starting value (X = number-right score for a given test).
     b) ln[X/(N-X)] for each examinee is standardized to a (0, 1) distribution (indeterminacy).
     c) The standardized ln[X/(N-X)] values are treated as known ability values. Then, item parameters are estimated with the known ability values.
  2. Stage 2: Using the estimated item parameters we can estimate the ability parameters.
  3. May have the same MLE problems as in the ability estimation (examinees who answer all items correctly or all items incorrectly will have ability estimates of + or - infinity).

B. Marginal MLE (MMLE)

  1. Assume a prior form (usually normal) of the ability distribution.
  2. Based on the prior (marginal) distribution, we estimate the item parameters (this solves some problems of the joint MLE).
  3. A large number of examinees is important for the prior ability distribution.
  4. MMLE was implemented in BILOG.
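Stage 1 of the joint MLE procedure starts from standardized log-odds of the number-right scores. A minimal sketch, with a hypothetical function name and scores, is shown below; perfect and zero scores are rejected because their log-odds are infinite, and at least two distinct scores are needed for the standardization.

```python
import math
from statistics import mean, pstdev

def starting_abilities(number_right, n_items):
    """Standardized ln[X / (N - X)] values used as initial ability estimates."""
    if any(x <= 0 or x >= n_items for x in number_right):
        raise ValueError("scores of 0 or N have infinite log-odds")
    logits = [math.log(x / (n_items - x)) for x in number_right]
    center, spread = mean(logits), pstdev(logits)
    return [(value - center) / spread for value in logits]

# Hypothetical number-right scores on a 10-item test
start = starting_abilities([3, 5, 7], 10)
print([round(v, 3) for v in start])
```

The standardization fixes the origin and unit of the ability scale, which is otherwise undetermined.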

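V. Item information function and standard error

A. $$I(\theta, u_i) = \frac{P_i'^{\,2}}{P_i Q_i} = \frac{D^2 a_i^2 (1 - c_i)}{\left[c_i + e^{D a_i(\theta - b_i)}\right]\left[1 + e^{-D a_i(\theta - b_i)}\right]^2} = \frac{1}{Var(\hat{\theta} \mid \theta)}$$

   Inverse of the variance of $\hat{\theta}$ given $\theta$.

B. $SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}} = \sqrt{Var(\hat{\theta} \mid \theta)}$.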

Ch. 4: Model-Data Fit

I. Introduction
A. IRT has different models (1-, 2-, and 3-pl models, and uni- and multidimensional models).
B. If a model does not fit the data, IRT will lose its advantages over CTT.
C. Three methods of checking the model-data fit.

  1. Assumption checking.
  2. Invariance checking.
  3. Prediction checking.

II. Assumption checking
A. Unidimensionality checking

  1. Factor analysis for a one-factor solution on the inter-item correlation matrix (tetrachoric). → Looking for one dominant eigenvalue.
  2. Local independence can be checked by investigating the variance-covariance matrix of items.
  3. The b-values can be estimated in two different ability groups. If the b-values from the two groups are linearly related, then the unidimensionality assumption is met.

B. Equal discrimination index checking (a)

  1. If the equal discrimination index assumption is violated, the Rasch model (1-pl) is not valid for the data.
  2. The item-test correlation distribution can be used. If it is homogeneous, the a-values are equal.

C. Minimal guessing checking (c-parameter)

  1. If the c-parameter is not minimal, the 3-pl model is valid.
  2. If the performance of the low-ability students on the most difficult items is close to zero, the c-parameter is minimal.

D. Non-speeded test checking

  1. The variance of the number of omitted items and the variance of the number of incorrectly answered items can be checked. If the ratio of the two variances (O/I) is close to zero, the assumption of a non-speeded test is met.
  2. The test scores of examinees under the specific time limit and without a time limit can be compared. If they are similar to each other, the assumption is met.
  3. The percentage of examinees completing the test, the percentage of examinees completing 75% of the test, and the number of items completed by 80% of the examinees can be reviewed. If nearly all examinees complete nearly all of the items, speed is not an important factor.

III. Invariance checking
A. Checking the ability parameter ($\theta$)

  1. Make two tests with different b-values from a unidimensional ability item bank.
  2. Administer the two tests to a group of examinees.

Ch. 5: Ability Scale

I. Introduction
A. In CTT, X (number-right score) is an unbiased estimate of a person’s true score T. → E(X) = T.
B. In IRT, a person’s ability, $\theta$, is monotonically related to the person’s true score. The monotonic relationship is non-linear and strictly increasing.
C. General procedure in estimating the true ability of a person.

  1. Obtain item responses of 1s and 0s (binary items).
  2. When item parameters are known, use one of the estimation methods (MLE, BME, EAP, WLE).
  3. When item parameters are unknown, use joint MLE or MMLE for the item parameters and then use one of MLE, BME, EAP, and WLE.
  4. The estimated $\hat{\theta}$, or a transformation of it, can be reported.

II. Nature of ability scale
A. In CTT,

  1. E(X) = T
  2. X/N = proportion correct in a criterion-referenced test.
  3. Norm-referenced test: stanines, percentiles, T-score, ACT, SAT, GRE.
  4. X is test- and person-dependent.

B. In IRT

  1. $\theta$ (ability, trait, or proficiency level) is test-independent.
  2. $\theta$ may be an ordinal scale, but has a limited ratio scale interpretation.

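III. Transformation of $\theta$
A. Linear transformation of $\theta$, b, and a. Let $\theta^* = \alpha\theta + \beta$, $b^* = \alpha b + \beta$, and $a^* = a/\alpha$, then the 3-pl is

$$P(\theta^*) = c + \frac{1 - c}{1 + e^{-D a^*(\theta^* - b^*)}}$$

$$= c + \frac{1 - c}{1 + e^{-D (a/\alpha)\left[(\alpha\theta + \beta) - (\alpha b + \beta)\right]}}$$

$$= c + \frac{1 - c}{1 + e^{-D (a/\alpha)\,\alpha(\theta - b)}}$$

$$= c + \frac{1 - c}{1 + e^{-D a(\theta - b)}} = P(\theta).$$

P($\theta$) is invariant under the linear transformation of $\theta$, b, and a. → Indeterminacy. (Woodcock-Johnson’s scale p. 80).

B. Non-linear transformation of $\theta$ and b: partial ratio-scale interpretation only for the Rasch model.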

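  1. Let $\theta^* = e^{D\theta}$ and $b^* = e^{Db}$, where D = 1 and a = 1.
  2. $$P(\theta) = \frac{e^{D(\theta - b)}}{1 + e^{D(\theta - b)}} = \frac{e^{D\theta - Db}}{1 + e^{D\theta - Db}} = \frac{e^{D\theta}/e^{Db}}{1 + e^{D\theta}/e^{Db}} = \frac{e^{D\theta}/e^{Db}}{e^{Db}/e^{Db} + e^{D\theta}/e^{Db}} = \frac{e^{D\theta}}{e^{Db} + e^{D\theta}}$$

     so that

     $$P(\theta^*) = \frac{\theta^*}{\theta^* + b^*} \quad \text{(The first Rasch model)}$$

     If $\theta^* = b^*$, then $P(\theta^*) = .5$.

     $$Q(\theta^*) = 1 - P(\theta^*) = 1 - \frac{\theta^*}{\theta^* + b^*} = \frac{\theta^* + b^*}{\theta^* + b^*} - \frac{\theta^*}{\theta^* + b^*} = \frac{b^*}{\theta^* + b^*}$$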

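  3. The odds for success, O

     $$O = \frac{P(\theta^*)}{Q(\theta^*)} = \frac{\theta^*/(\theta^* + b^*)}{b^*/(\theta^* + b^*)} = \frac{\theta^*}{b^*}$$

  4. Ratio of success odds for two examinees ($\theta_1^*$, $\theta_2^*$)

     Given $O_{p1} = \frac{\theta_1^*}{b^*}$ and $O_{p2} = \frac{\theta_2^*}{b^*}$,

     $$\frac{O_{p1}}{O_{p2}} = \frac{\theta_1^*}{\theta_2^*}$$

     If $\theta_1^* = 2\theta_2^*$, then examinee 1 has twice the odds of examinee 2 of answering the item correctly (ratio-scale property).

  5. Ratio of success odds for two items ($b_1^*$, $b_2^*$)

     Given $O_{i1} = \frac{\theta^*}{b_1^*}$ and $O_{i2} = \frac{\theta^*}{b_2^*}$, then

     $$\frac{O_{i1}}{O_{i2}} = \frac{b_2^*}{b_1^*}$$

     If $b_2^* = 2b_1^*$, then an examinee has twice the odds of answering item 1 correctly as of answering item 2 correctly (ratio-scale property).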

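  6. Log-odds (logits)

     $$\frac{O_{p1}}{O_{p2}} = \frac{\theta_1^*}{\theta_2^*} = \frac{e^{D\theta_1}}{e^{D\theta_2}} = e^{D(\theta_1 - \theta_2)}$$

     $$\ln\frac{O_{p1}}{O_{p2}} = D(\theta_1 - \theta_2)$$

     In the 1-pl model, D is omitted (D = 1.0):

     $$\ln\frac{O_{p1}}{O_{p2}} = \theta_1 - \theta_2$$

     In the same way,

     $$\ln\frac{O_{i1}}{O_{i2}} = b_2 - b_1$$

  7. Logits can also be obtained from the original model.

     $$P_i(\theta) = \frac{e^{(\theta - b)}}{1 + e^{(\theta - b)}}, \quad \text{and} \quad Q_i(\theta) = \frac{1}{1 + e^{(\theta - b)}}.$$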
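Two properties from this chapter can be verified numerically: the log of an odds ratio under the Rasch model equals a simple difference in ability or difficulty, and the 3-pl probability is unchanged by a linear rescaling of the ability, difficulty, and discrimination values. The sketch below uses hypothetical parameter values and function names; the rescaling constants `alpha` and `beta` are arbitrary.

```python
import math

def p3pl(theta, a=1.0, b=0.0, c=0.0, D=1.0):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def odds(theta, b):
    """Success odds P/Q under the Rasch model (a = 1, c = 0, D = 1)."""
    p = p3pl(theta, b=b)
    return p / (1 - p)

# Two examinees on the same item: the log odds ratio is theta_1 - theta_2
log_ratio_persons = math.log(odds(1.5, 0.3) / odds(0.5, 0.3))
# One examinee on two items: the log odds ratio is b_2 - b_1
log_ratio_items = math.log(odds(0.0, -1.0) / odds(0.0, 1.0))

# Linear transformation: theta* = alpha*theta + beta, b* = alpha*b + beta, a* = a/alpha
alpha, beta = 2.0, 3.0
original = p3pl(0.4, a=1.2, b=-0.5, c=0.2, D=1.7)
rescaled = p3pl(alpha * 0.4 + beta, a=1.2 / alpha, b=alpha * -0.5 + beta,
                c=0.2, D=1.7)
print(round(log_ratio_persons, 6), round(log_ratio_items, 6),
      round(original - rescaled, 12))
```

The first two results do not depend on which item or examinee is held fixed, which is the sense in which the Rasch scale supports ratio statements about odds.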

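Ch. 6: Item and Test Information

I. Item information function

A. $$I_i(\theta) = \frac{1}{Var(\hat{\theta} \mid \theta)} = \frac{\left[P_i'(\theta)\right]^2}{P_i(\theta)\,Q_i(\theta)} = D^2 a_i^2\,\frac{Q_i(\theta)}{P_i(\theta)}\left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2$$

   where $P_i'(\theta)$ = the first derivative of $P_i(\theta)$, $P_i(\theta)$ = item response function, and $Q_i(\theta) = 1 - P_i(\theta)$.

B. Lord (1980) and Birnbaum (1968) showed that

   $$I_i(\theta) = \frac{2.89\, a_i^2 (1 - c_i)}{\left[c_i + e^{1.7 a_i(\theta - b_i)}\right]\left[1 + e^{-1.7 a_i(\theta - b_i)}\right]^2}$$

  1. When b is close to $\theta$, the exponential term is minimized. → $I_i(\theta)$ is maximized.
  2. When a is high, $I_i(\theta)$ is large.
  3. When c approaches zero, $I_i(\theta)$ is high.
  4. Compute $I_i(\theta)$ for some items:
     a = 1, b = 1.2, c = .2, $\theta$ = 1.5 → $I_i(\theta)$ = .48
     a = 1, b = 1, c = .2, $\theta$ = 1.2 → $I_i(\theta)$ = .49
     a = 2, b = 1.2, c = .2, $\theta$ = 1.5 → $I_i(\theta)$ = 1.68
     a = 1, b = 1.2, c = 0, $\theta$ = 1.5 → $I_i(\theta)$ = .68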
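The four item information values in the exercise can be computed directly from the Lord and Birnbaum form of the function. This sketch assumes D = 1.7 and uses an illustrative function name.

```python
import math

def item_information(theta, a, b, c, D=1.7):
    """3-pl item information in the Lord (1980) / Birnbaum (1968) form."""
    L = D * a * (theta - b)
    return (D * a) ** 2 * (1 - c) / ((c + math.exp(L)) * (1 + math.exp(-L)) ** 2)

# The four cases from the exercise, as (a, b, c, theta)
cases = [(1, 1.2, 0.2, 1.5), (1, 1.0, 0.2, 1.2), (2, 1.2, 0.2, 1.5), (1, 1.2, 0.0, 1.5)]
values = [item_information(theta, a, b, c) for a, b, c, theta in cases]
print([round(v, 3) for v in values])
```

Comparing the first case with the third and fourth shows the effect of raising a and of lowering c while everything else is held fixed.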

C. An item’s maximum information is obtained at $\theta = \theta_{max}$:

   $$\theta_{max} = b_i + \frac{1}{1.7\, a_i}\,\ln\left[.5\left(1 + \sqrt{1 + 8c_i}\right)\right]$$

   Assuming a = 1,

  1. If c = 0, then ln(1) = 0 and $\theta_{max} = b_i$. If c = .2, then ln(1.306) = .267 and $\theta_{max} = b_i + .157$. If c = .5, then ln(1.618) = .4812 and $\theta_{max} = b_i + .283$.
  2. Examples on pp. 92-93.

II. Test information function

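A. $$I(\theta) = \sum_{i=1}^{n} I_i(\theta) = \sum_{i=1}^{n} \frac{P_i'^{\,2}}{P_i Q_i}$$

B. The contribution of each item is independent of the contribution of other items. → Unique feature of IRT.

C. Standard error of estimation, $SE(\hat{\theta})$

  1. $SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}$ with MLE.
  2. Plays the same role as SEM in CTT, $SEM = s_X\sqrt{1 - r_{XX'}}$.
  3. $SE(\hat{\theta})$ varies with $\theta$, a, b, and c.
  4. The sampling distribution of $\hat{\theta}$ is approximately normal, with standard deviation $SE(\hat{\theta})$, when the test is long.
  5. In an adaptive testing situation, we can stop the test by setting a criterion of $SE(\hat{\theta}) \le .20$ or $.25$ ($I \ge 25$ for $SE \le .20$) or $n \ge 40$.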

III. Relative efficiency

A. $$RE(\theta) = \frac{I_A(\theta)}{I_B(\theta)}$$

B. RE($\theta$) can be treated as a ratio scale, assuming that the items have comparable statistical quality.

Ch. 7: Test Construction

I. Introduction
A. Test construction is a process of item selection from an item bank for a specified purpose.
B. In CTT, item and test indices vary across groups.

  1. The index group and the target group should be matched in their ability (school start, end, and year).
  2. Not all experimental (filler) items can be included in one test. → No guarantee for equivalent indices.
  3. Standard error of measurement (SEM) and test reliability are general indices for a whole test. We cannot have SEM and reliability for a specific ability range.
  4. The contribution of each item to a test cannot be isolated. Each item is related to other items in a test.

C. In IRT, item parameters are invariant.

  1. Problems 1 – 4 in CTT are resolved in IRT.
  2. $\theta$ and b are on the same scale (0, 1). → Select appropriate items for a specific ability range.
  3. The item information of each item is independent of other items in a test. $SE(\hat{\theta})$ can be set and items with large $I_i(\theta)$ can be selected for a specific purpose of a test.

II. Procedure (Lord, 1977)
A. Develop a target information function for a test (p. 104, f.7.1).

  1. Decide the ability range (e.g., $-2 \le \theta \le 2$).
  2. Decide $SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}$ (e.g., .50, .33, .25).

B. Select items from the item bank to match the target information function. Pay special attention to the tails of the ability distribution.
C. After each item is added to the test, calculate the test information function.
D. Continue the item selection procedure until the test information function is similar to the target information function.
E. We may set a criterion for the number of items for the test as an alternate for the