Reliability
William Revelle and David M. Condon
Northwestern University
Abstract
Separating the signal in a test from the irrelevant noise is a challenge for all
measurement. Low test reliability limits test validity, attenuates important
relationships, and can lead to regression artifacts. Multiple approaches to the
assessment and improvement of reliability are discussed. The advantages and
disadvantages of several different approaches to reliability are considered.
Practical advice on how to assess reliability using open source software is
provided.
Contents
Introduction
  Reliability and Validity
Using reliability
  Correction for attenuation
  Regression to the mean
  Standard Error of Observed Score
True score theory
  Estimating reliability using parallel tests
  Estimating reliability using τ equivalent measures
  Estimating reliability using congeneric measures
Reliability over what?
  Reliability over alternate forms
  Stability over time
  Split half reliability: the reliability of composites
Internal consistency estimates of reliability
  KR-20, λ3, and α as indicators of internal consistency
  Standard error of alpha
contact: William Revelle [email protected]
To appear in the Wiley-Blackwell Handbook of Psychometric Testing,
Paul Irwing, Tom Booth and David Hughes (Eds.)
Draft version of February 12, 2014

Contents (continued)

  Reliability and item analysis
  Reliability of scales formed from dichotomous or polytomous variables
Domain sampling theory and structural measures of reliability
  Seven measures of internal consistency: α, λ3, λ6, β, ωg, ωt, and λ4
  When α goes wrong: the misuse of α
Other approaches
  Generalizability theory: reliability over facets
  A special case of generalizability theory: the Intraclass correlations and the reliability of ratings
  Reliability of composites
  Reliability of difference scores
Conclusion
Appendix
  R functions called
  Sample R code for basic reliability calculations

Introduction

All measures reflect an unknown mixture of interesting signal and uninteresting or irrelevant noise. Separating signal from noise is the primary challenge of measurement and is the fundamental goal of all approaches to reliability theory. What makes this challenge particularly difficult is that what is signal to some is noise to others. In climate science, short term variations in weather mask long term trends in climate. In oceanography, variations in waves mask tidal effects; waves and tides in turn mask long term changes in sea level. Within psychology, stable individual differences in affective traits contaminate state measures of momentary affective states; acquiescence and extreme response tendencies contaminate trait measures; moment to moment or day to day fluctuations in alertness or motivation affect measures of ability. All of these examples may be considered as problems of reliability: separating signal from noise. They also demonstrate that the classification of signal depends on what is deemed relevant. For indeed, meteorologists care about the daily weather, climate scientists about long term trends in climate; similarly, emotion researchers care about one’s current emotion, personality researchers care about stable consistencies and long term trends.

Whether recording the time spent walking to work or the number of questions answered on an exam, people differ. They differ not only from each other, but from measure to measure. Thus, one of us walks to work in 16 minutes one day, but 15.5 minutes the next and 16.5 on the third day. We can say that his mean time is 16 minutes with a standard deviation of .5 minutes. When asked how long it takes him to get to work, we should say that our best estimate is 16 minutes but we would expect to observe anything between 15 and 17 minutes. We tend to emphasize his central tendency (16 minutes) and consider the variation in his walking rate as irrelevant noise.

The expected score across multiple replications that minimizes squared deviations is just the arithmetic average score. In the “classical test theory” (CTT) as originally developed by Spearman (1904), the expected score across an infinite number of replications is known as the true score. True score defined this way should not be confused with Platonic Truth, for if there is any systematic bias in the observations, then the mean score will show this bias (Lord & Novick, 1968).
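The walking example can be checked in a few lines. The following is a minimal sketch using only the Python standard library; the variable and function names are ours, and the two comparison values (15.9 and 16.1 minutes) are arbitrary choices made only to illustrate that the mean minimizes the squared deviations.

```python
from statistics import mean, stdev

# The three observed walking times (minutes) from the example in the text.
walk_times = [16.0, 15.5, 16.5]

def sum_sq_dev(values, center):
    """Sum of squared deviations of the values around a candidate center."""
    return sum((v - center) ** 2 for v in values)

best_estimate = mean(walk_times)   # arithmetic average of the replications
spread = stdev(walk_times)         # sample standard deviation

# The mean should give a smaller squared error than nearby candidates.
error_at_mean = sum_sq_dev(walk_times, best_estimate)
error_below = sum_sq_dev(walk_times, 15.9)
error_above = sum_sq_dev(walk_times, 16.1)

print(best_estimate, spread)
print(error_at_mean, error_below, error_above)
```

With only three replications the mean is of course just an estimate of the expected score over an infinite number of replications.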

Using reliability

There are three primary reasons to be concerned about a measure’s reliability. The first is that the relationship between any two constructs will be attenuated by the level of reliability of each measure: two constructs can indeed be highly related at a latent level, but if the measures are not very reliable, the observed correlation will be reduced. The second reason that understanding reliability is so important is the problem of regression to the mean. Failing to understand how reliability affects the relationship between observed scores and their expected values plagues economists, sports fanatics, and military training officers. The final reason to examine reliability is to estimate the true score given an observed score and to establish confidence intervals around this estimate based upon the standard error of the observed scores.

Correction for attenuation

The original development of reliability theory was to estimate the correlation of latent variables in terms of observed correlations “corrected” for their reliability (Spearman, 1904). Measures of “mental character” showed almost the same correlations (.52) between pairs of brothers as did various physical characteristics (.52), but when the mental character correlations were corrected for their reliability, they were shown to be much more related (.81) (Spearman, 1904). The logic of correcting for attenuation is straightforward. For even if observed scores are contaminated by error, the errors are independent of the true scores, and the covariance between the observed scores of two different measures, x and y, just reflects the covariance of their true scores. Consider two observed variables, x and y, which are imperfect measures of two latent traits, θ and ψ:

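$$x = \theta + \varepsilon, \qquad y = \psi + \zeta \tag{1}$$

with variances

$$\sigma^2_x = \sigma^2_\theta + \sigma^2_\varepsilon, \qquad \sigma^2_y = \sigma^2_\psi + \sigma^2_\zeta \tag{2}$$

and reliabilities

$$\rho^2_{\theta x} = \frac{\sigma^2_\theta}{\sigma^2_x}, \qquad \rho^2_{\psi y} = \frac{\sigma^2_\psi}{\sigma^2_y} \tag{3}$$

and covariance

$$\sigma_{xy} = \sigma_{(\theta+\varepsilon)(\psi+\zeta)} = \sigma_{\theta\psi} + \underbrace{\sigma_{\theta\zeta} + \sigma_{\psi\varepsilon} + \sigma_{\varepsilon\zeta}}_{=\,0} = \sigma_{\theta\psi}.$$

Then the correlation between the two latent variables may be expressed in terms of the observed correlation and the two reliabilities:

$$\rho_{\theta\psi} = \frac{\sigma_{\theta\psi}}{\sigma_\theta\sigma_\psi} = \frac{\sigma_{xy}}{\sqrt{\rho^2_{\theta x}\sigma^2_x\,\rho^2_{\psi y}\sigma^2_y}} = \frac{r_{xy}}{\sqrt{\rho^2_{\theta x}\rho^2_{\psi y}}}$$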

That is, the correlation between the true parts of any two tests will be the ratio of their observed correlation to the square root of the product of their respective reliabilities. This correction for attenuation is perhaps the most important use of reliability theory, for it allows for an estimate of the true correlation between two constructs when the constructs are perfectly measured, without error. It does require, however, that we find the reliability of the separate tests. The concept that observed covariances reflect true covariances is the basis for structural equation modeling in which relationships between observed scores are expressed in terms of relationships between latent scores and the reliability of the measurement of the latent variables. By correcting for unreliability in this way we are able to determine the underlying latent relationships without the distraction of measurement error.
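The correction for attenuation is simple enough to sketch directly. The function name and the numerical inputs below are hypothetical and chosen only for illustration; they are not Spearman's data.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the latent correlation from an observed correlation
    and the reliabilities of the two measures."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: an observed correlation of .30 between two
# measures with reliabilities of .70 and .80.
observed_r = 0.30
corrected_r = correct_for_attenuation(observed_r, rel_x=0.70, rel_y=0.80)
print(round(corrected_r, 3))
```

Because the reliabilities are themselves estimates, a corrected correlation can exceed 1 when they are underestimated, so the result is best read as an estimate and not as an observed quantity.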

Regression to the mean

First considered by Galton (1886, 1889) as he was developing the correlation coefficient, reversion to mediocrity was the observation that the offspring of tall parents tended to be shorter than their parents, just as those of short parents tended to be taller. Although originally interpreted as of interest only to geneticists, the concept of regression to the mean is a classic problem of reliability theory that is unfortunately not as well recognized as it should be (Stigler, 1986, 1997). Whenever groups are selected on the basis of being extreme on an observed variable, the scores on a retest will be closer to the mean than they were originally. Classic examples include the tendency of companies with award winning CEOs to become less successful than comparable companies whose CEOs do not win the award (Malmendier & Tate, 2009), for flight instructors to think that rewarding good pilots is counterproductive because they get worse on their next flight (Kahneman & Tversky, 1973), for athletes who make the cover of Sports Illustrated to do less well following the publication (Gilovich, 1991), for training programs for disadvantaged children to help their students (Campbell & Kenny, 1999) and for the breeding success of birds to improve following prior failures (Kelly & Price, 2005). Indeed the effect of regression to the mean artifacts on the market value of baseball players was the subject of the popular book and movie, Moneyball (Lewis, 2004). A critical review of various examples of regression artifacts in chronobiology has the impressive title of “how to show that unicorn milk is a chronobiotic” and provides thoughtful simulated examples (Atkinson, Waterhouse, Reilly, & Edwards, 2001). Regression effects should be controlled for when trying to separate placebo from treatment effects in behavioral and drug intervention studies (Davis, 2002). So, if it is so well known, what is it?
If observed score is imperfectly correlated with true score (Equation 3) then it is also correlated with error because

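$$\sigma^2_x = \sigma^2_\theta + \sigma^2_\varepsilon$$

$$\sigma_{\varepsilon x} = \sigma_{\varepsilon(\theta+\varepsilon)} = \underbrace{\sigma_{\varepsilon\theta}}_{=\,0} + \sigma^2_\varepsilon = \sigma^2_\varepsilon$$

and thus

$$\rho_{\varepsilon x} = \frac{\sigma_{\varepsilon x}}{\sqrt{\sigma^2_\varepsilon\sigma^2_x}} = \frac{\sigma^2_\varepsilon}{\sqrt{\sigma^2_\varepsilon\sigma^2_x}} = \frac{\sigma_\varepsilon}{\sigma_x}$$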

That is, individuals with extreme observed scores might well have extreme true scores, but are most likely to also have extreme error scores. From Equation 6 we see that for any observed score, the expected true score is regressed towards the mean with a slope of $\rho^2_{x\theta}$. In the case of pilot trainees, if the reliability of flying skill is .5, with a mean score of 50 and a standard deviation of 20, the top 10% of the fliers will have an average score of 75.6 on their first trial but will regress 12.8 points (the reliability value * their deviation score) towards the mean or have a score of 62.8 on the second trial. The bottom 10%, on the other hand, will have scores far below the mean with an average of 24.4 but improve 12.8 points on their second flight to 37.2. Flight instructors seeing this result will falsely believe that punishing those who do badly leads to improvement, perhaps due to heightened effort, while rewarding those who do well leads to a decrease in effort (Kahneman & Tversky, 1973). Similarly, because the mean batting average in baseball is ≈ .260 with a standard deviation of ≈ .0275 and has a year to year reliability of ≈ .38, those who have a batting average of 3σ or .083 above the average in one year (.343 instead of .260) are expected to be just 1.1σ (≈ .031) above the average, or .291, in the succeeding year (Schall & Smith, 2000). That is, a spectacular year is most likely followed by a return, not to the overall average, but rather to the player’s average.
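The arithmetic of these examples follows from one line: the expected retest score is the mean plus the reliability times the deviation score. The sketch below reproduces the numbers in the text; the function name is ours.

```python
def expected_retest(observed, mean, reliability):
    """Expected score on retest: the observed deviation from the mean
    is shrunk by the reliability of the measure."""
    return mean + reliability * (observed - mean)

# Pilot trainees: mean 50, reliability .5, first-trial scores of 75.6 and 24.4.
pilot_high = expected_retest(75.6, mean=50.0, reliability=0.5)
pilot_low = expected_retest(24.4, mean=50.0, reliability=0.5)

# Batting averages: mean .260, sd .0275, year-to-year reliability .38,
# for a player 3 standard deviations above the mean in one year.
batter_next_year = expected_retest(0.260 + 3 * 0.0275, mean=0.260, reliability=0.38)

print(round(pilot_high, 1), round(pilot_low, 1), round(batter_next_year, 3))
```

Nothing in the calculation refers to reward or punishment; the change from first to second trial is expected from unreliability alone.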

Estimating reliability using τ equivalent measures

The previous derivation requires the assumption that the two measures of the latent (unobserved) trait are exactly equally good and that they have equal error variances. These are very strong assumptions for “unlike the correlation coefficient, which is merely an observed fact, the reliability coefficient has embodied in it a belief or point of view of the investigator” (Kelley, 1942, p 75). Kelley, of course, was commenting upon the assumption of parallelism as well as the assumption that the test means the same thing when it is given again. With the assumption of parallelism it is possible to solve the 3 equations (two for variances and one for the covariance) shown in the first two rows of Table 1 for the three unknowns ($\sigma^2_\theta$, $\sigma^2_\varepsilon$, and $\lambda_1$). A relaxation of the exact parallelism assumption is to assume that the covariances of observed scores with true scores are equal ($\lambda_1 = \lambda_2 = \lambda_3$), but that the error variances are unequal (Table 1 lines 1-3). With this assumption of equal covariances with true score (known as tau equivalence) we have six equations (one for each correlation between the three tests, and one for each variance) and five unknowns ($\sigma^2_\theta$, $\lambda_1 = \lambda_2 = \lambda_3$, and the three error variances, $\varepsilon_1^2$, $\varepsilon_2^2$, $\varepsilon_3^2$) and we can solve using simple algebra.

Table 1: Estimating the parameters of parallel, τ equivalent, and congeneric tests. Solving for two parallel tests (lines 1-2) requires the assumption of equal true ($\lambda_1\sigma_\theta = \lambda_2\sigma_\theta$) and error ($\varepsilon_1^2 = \varepsilon_2^2$) variances. To solve the six equations for three τ equivalent tests (lines 1-3) we can relax the assumption of equal error variances, but still require the assumption of equal covariances with true score. Congeneric measures (four or more tests) can be solved with no further assumptions.

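Observed correlations and modeled parameters

| Variable | Test 1 | Test 2 | Test 3 | Test 4 |
|---|---|---|---|---|
| Test 1 | $\sigma^2_{x_1} = \lambda_1\sigma^2_\theta + \varepsilon_1^2$ | | | |
| Test 2 | $\sigma_{x_1x_2} = \lambda_1\sigma_\theta\lambda_2\sigma_\theta$ | $\sigma^2_{x_2} = \lambda_2\sigma^2_\theta + \varepsilon_2^2$ | | |
| Test 3 | $\sigma_{x_1x_3} = \lambda_1\sigma_\theta\lambda_3\sigma_\theta$ | $\sigma_{x_2x_3} = \lambda_2\sigma_\theta\lambda_3\sigma_\theta$ | $\sigma^2_{x_3} = \lambda_3\sigma^2_\theta + \varepsilon_3^2$ | |
| Test 4 | $\sigma_{x_1x_4} = \lambda_1\sigma_\theta\lambda_4\sigma_\theta$ | $\sigma_{x_2x_4} = \lambda_2\sigma_\theta\lambda_4\sigma_\theta$ | $\sigma_{x_3x_4} = \lambda_3\sigma_\theta\lambda_4\sigma_\theta$ | $\sigma^2_{x_4} = \lambda_4\sigma^2_\theta + \varepsilon_4^2$ |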

Estimating reliability using congeneric measures

If there are at least four tests, it is possible to solve for the unknown parameters (covariances with true score, true score variance, error score variances) without any further assumptions other than that all of the tests are imperfect measures of the same underlying construct (Table 1). In terms of factor analysis, the congeneric model merely assumes that all measures load on one common factor. Indeed, with four or more measures of the same construct it is possible to evaluate how well each measure reflects the construct, $\lambda_i$, and the amount of error variance in each measure, since $r^2_{x_i\theta} = \lambda_i^2 / \sigma^2_{x_i}$.

Reliability over what?

The previous paragraphs discuss reliability in terms of the correlations between two or more measures. What is unstated is when or where these measures are given, as well as the meaning of alternative measures. Reliability estimates can be found based upon variations in the overall test, variations over time, variation over items in a test, and variability associated with who is giving the test. Each of these alternatives has a different meaning and sometimes a number of different estimates. In the abstract case of parallel tests or congeneric measurement, the domain of generalization (time, form, items) is not specified. It is possible, however, to organize reliability coefficients in terms of a simple taxonomy (Table 2). Each of these alternatives is discussed in more detail in the subsequent sections.

Table 2: Reliability is the ability to generalize about individual differences across alternative sources of variation. Generalizations within a domain of items use internal consistency estimates. If the items are not necessarily internally consistent, reliability can be estimated based upon the worst split half, β, the average split (corrected for test length) or the best split, λ4. Reliability across forms or across time is just the Pearson correlation. Reliability across raters depends upon the particular rating design and is one of the family of Intraclass correlations. Functions in R may be used to find all of these coefficients. Except for cor, all functions are in the psych package.

| Generalization over | Type of reliability | R function | Name |
|---|---|---|---|
| Unspecified | Parallel tests | cor(xx′) | rxx |
| | Tau equivalent tests | cov(xx′) and fa | rxx |
| | Congeneric tests | cov(xx′) and fa | rxx |
| Forms | Alternative form | cor(x,y) | rxx |
| Time | Test-retest | cor(time1, time2) | rxx |
| Split halves | random split half | splitHalf | rxx |
| | worst split half | iclust or splitHalf | β |
| | best split half | splitHalf | λ4 |
| Items | Internal consistency | | |
| | general factor (g) | omega or omegaSem | ωh |
| | average | alpha or scoreItems | α |
| | smc | alpha or scoreItems | λ6 |
| | all common (h²) | omega or omegaSem | ωt |
| Raters | Single rater | ICC | ICC1, ICC2, ICC3 |
| | Average rater | ICC | ICC1k, ICC2k, ICC3k |

Reliability over alternate forms

Perhaps the easiest two to understand, because they are just raw correlations, are the reliability of alternate forms and the reliability of tests over time (test-retest). Alternative form reliability is just the correlation between two tests measuring the same construct, both measures of which are thought to measure the construct equally well. Such tests might be the same items presented in a different order to avoid cheating on an exam, or made up of different items with similar but not identical content (e.g., 2 + 4 = ? and 4 + 2 = ?). Ideally, to be practically useful, such equivalent forms should have equal means and equal variances. Although the intuitive expectation might be that two such tests would correlate perfectly, they won’t, for all of the reasons previously discussed. Indeed, the correlation between two such alternate forms gives us an estimate of the amount of variance that each test shares with the latent construct being measured. If we have three or more alternate forms, then their correlations may be treated as if they were τ equivalent or congeneric measures and we can use factor analysis to find each test’s covariance (λi) with the latent factor, the square of which will be the reliability.
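For exactly three forms, a one-factor model is just identified and the loadings can be found by hand without a factor analysis program. The sketch below is such a shortcut, not the fa function named in Table 2; it assumes standardized tests, a single common factor with unit variance, and positive correlations. The three correlations are hypothetical.

```python
import math

def triad_loadings(r12, r13, r23):
    """One-factor loadings for three standardized tests.

    Under a single common factor, r_ij = lambda_i * lambda_j, so each
    squared loading is a ratio of products of the three correlations.
    """
    if min(r12, r13, r23) <= 0:
        raise ValueError("this sketch assumes positive correlations")
    lambda1 = math.sqrt(r12 * r13 / r23)
    lambda2 = math.sqrt(r12 * r23 / r13)
    lambda3 = math.sqrt(r13 * r23 / r12)
    return lambda1, lambda2, lambda3

# Hypothetical correlations among three alternate forms.
loadings = triad_loadings(r12=0.56, r13=0.63, r23=0.72)

# The squared loading is the reliability of each form.
reliabilities = [loading ** 2 for loading in loadings]
print([round(l, 2) for l in loadings], [round(r, 2) for r in reliabilities])
```

With four or more forms the model is overidentified and a fitting procedure is needed, which is where factor analysis comes in.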

The construction of such alternate forms can be done formulaically by randomizing items from one form to prepare a second or third form, or by creating quasi-matching pairs of items across forms (“the capital of Brazil is ?” and “Brasília is the capital of ?”). To control for subtle differences in difficulty, multiple groups of items can be matched across forms (e.g., 4 * 9 = ? and 3 * 7 = ? might be easier than 9 * 4 = ? and 7 * 3 = ?, so form A could have 4 * 9 = ? and 7 * 3

Split half reliability: the reliability of composites

For his dissertation research at the University of London, William Brown (1910) examined the correlations of a number of simple cognitive tasks (e.g., crossing out e’s and r’s from jumbled French text, adding up single digits in groups of ten) given two weeks apart. For each task, he measured the test-retest reliability by correlating the two time periods and then formed a composite based upon the average of the two scores. He then wanted to know the reliability of these composites so that he could correct the correlations with other composites for their reliability. That is, given a two test composite, X, with a reliability for each test, ρ, what would the composite correlate with a similar (but unmeasured) composite, X′? Consider X and X′, both made up of two subtests. The reliability of X is just its correlation with X′ and can be thought of in terms of the variance-covariance matrix, $\Sigma_{XX'}$:

$$\Sigma_{XX'} = \left(\begin{array}{c|c} V_x & C_{xx'} \\ \hline C_{xx'} & V_{x'} \end{array}\right)$$

and letting $V_x = \mathbf{1}\mathbf{V}_x\mathbf{1}'$ and $C_{XX'} = \mathbf{1}\mathbf{C}_{XX'}\mathbf{1}'$ where $\mathbf{1}$ is a column vector of 1s and $\mathbf{1}'$ is its transpose, the correlation between the two tests will be

$$\rho_{xx'} = \frac{C_{xx'}}{\sqrt{V_x V_{x'}}}$$

But the variance of a test is simply the sum of the true covariances and the error variances and we can break up each test into two subtests (X 1 and X 2 ) and their respective variances and covariances. The structure of the two tests seen in Equation 12 becomes

$$\Sigma_{XX'} = \left(\begin{array}{cc|cc}
V_{x_1} & C_{x_1x_2} & C_{x_1x'_1} & C_{x_1x'_2} \\
C_{x_1x_2} & V_{x_2} & C_{x_2x'_1} & C_{x_2x'_2} \\ \hline
C_{x_1x'_1} & C_{x_2x'_1} & V_{x'_1} & C_{x'_1x'_2} \\
C_{x_1x'_2} & C_{x_2x'_2} & C_{x'_1x'_2} & V_{x'_2}
\end{array}\right)$$

Because the splits are done at random and the second test is parallel with the first test, the expected covariances between splits are all equal to the true score variance of one split (Vt 1 ), and the variance of a split is the sum of true score and error variances:

$$\Sigma_{XX'} = \left(\begin{array}{cc|cc}
V_{t_1} + V_{e_1} & V_{t_1} & V_{t_1} & V_{t_1} \\
V_{t_1} & V_{t_1} + V_{e_1} & V_{t_1} & V_{t_1} \\ \hline
V_{t_1} & V_{t_1} & V_{t'_1} + V_{e'_1} & V_{t'_1} \\
V_{t_1} & V_{t_1} & V_{t'_1} & V_{t'_1} + V_{e'_1}
\end{array}\right)$$

The correlation between a test made up of two halves with intercorrelation ($r_1 = V_{t_1}/V_{x_1}$) with another such test is

$$r_{xx'} = \frac{4V_{t_1}}{\sqrt{(4V_{t_1} + 2V_{e_1})(4V_{t_1} + 2V_{e_1})}} = \frac{4V_{t_1}}{2V_{t_1} + 2V_{x_1}} = \frac{4r_1}{2r_1 + 2}$$

and thus

$$r_{xx'} = \frac{2r_1}{1 + r_1} \tag{14}$$
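As a quick illustration of the step up from the correlation between two halves to the reliability of the whole, here is a minimal sketch of the split half formula. The function name and the example correlations are ours.

```python
def split_half_reliability(r_halves):
    """Reliability of a test made of two equivalent halves, given the
    correlation between the halves."""
    return 2 * r_halves / (1 + r_halves)

# Hypothetical half-test correlations and the corresponding reliabilities.
for r in (0.3, 0.5, 0.7):
    print(r, round(split_half_reliability(r), 3))
```

The stepped up value is always at least as large as the correlation between the halves (for non-negative correlations), which is why the raw correlation between two halves understates the reliability of the full test.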

Equation 14 is known as the split half estimate of reliability. It is important to note that the split half reliability is not the correlation between the two halves, but rather is adjusted upwards by Equation 14. In the more general case where the two splits do not have equal variance ($V_{x_1} \neq V_{x_2}$), Equation 14 becomes a little more complicated and may be expressed in terms of the total test variance as well as the covariance between the two subtests, or in terms of the subtest variances and correlations (J. Flanagan as cited in Rulon, 1939):

$$r_{xx'} = \frac{4C_{x_1x_2}}{V_x} = \frac{4C_{x_1x_2}}{2C_{x_1x_2} + V_{x_1} + V_{x_2}} = \frac{4r_{x_1x_2}s_{x_1}s_{x_2}}{2C_{x_1x_2} + V_{x_1} + V_{x_2}} = \frac{4r_{x_1x_2}s_{x_1}s_{x_2}}{2r_{x_1x_2}s_{x_1}s_{x_2} + s^2_{x_1} + s^2_{x_2}} \tag{15}$$

Because the total variance $V_{x_1+x_2} = V_{x_1} + V_{x_2} + 2C_{x_1x_2}$, and the variance of the differences is $V_{x_1-x_2} = V_{x_1} + V_{x_2} - 2C_{x_1x_2}$, then $C_{x_1x_2} = \frac{V_{x_1} + V_{x_2} - V_{x_1-x_2}}{2}$, and we can express reliability as a function of the variances of difference scores between the splits and the variances of the two splits

$$r_{xx'} = \frac{4C_{x_1x_2}}{V_x} = \frac{2(V_{x_1} + V_{x_2} - V_{x_1-x_2})}{V_{x_1} + V_{x_2} + 2C_{x_1x_2}} = \frac{2(V_{x_1} + V_{x_2} - V_{x_1-x_2})}{V_{x_1} + V_{x_2} + V_{x_1} + V_{x_2} - V_{x_1-x_2}} = \frac{V_{x_1} + V_{x_2} - V_{x_1-x_2}}{V_{x_1} + V_{x_2} - \frac{V_{x_1-x_2}}{2}} \tag{16}$$
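That the covariance form and the variance-of-differences form give the same answer can be verified numerically. The sketch below computes both from two half-test scores; the function names and the six pairs of scores are hypothetical and serve only as an illustration.

```python
from statistics import pvariance

def covariance(a, b):
    """Population covariance of two equal-length lists of scores."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def reliability_from_covariance(half1, half2):
    """Four times the covariance of the halves over the total test variance."""
    total = [x + y for x, y in zip(half1, half2)]
    return 4 * covariance(half1, half2) / pvariance(total)

def reliability_from_differences(half1, half2):
    """The same quantity using only variances of the halves and of their difference."""
    diff = [x - y for x, y in zip(half1, half2)]
    v1, v2, vd = pvariance(half1), pvariance(half2), pvariance(diff)
    return (v1 + v2 - vd) / (v1 + v2 - vd / 2)

# Hypothetical scores on two halves of a test for six people.
odd_items = [3, 5, 2, 6, 4, 7]
even_items = [2, 6, 3, 5, 4, 8]

r_cov = reliability_from_covariance(odd_items, even_items)
r_diff = reliability_from_differences(odd_items, even_items)
print(round(r_cov, 4), round(r_diff, 4))
```

Population variances are used throughout; using sample variances for every term would give the same ratio.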

When calculating correlations was tedious compared to finding variances, Equation 16 was a particularly useful formula because it just required finding variances of the two halves as well as the variance of their differences. It is still useful, for it expresses reliability in terms of test variances and recognizes that unreliability is associated with the variances of the difference scores (perfect reliability implies that $V_{x_1-x_2} = 0$).

But how to decide how to split a test? Brown compared the scores at time one with those at time two and then formed a composite of the tests taken at both times. But estimating reliability based upon stability over time implies no change in the underlying construct over time. This is reasonable if measuring speed of processing but is a very problematic assumption if measuring something more complicated:

...the reliability coefficient has embodied in it a belief or point of view of the investigator. Consider the score resulting from the item, “Prove the Pythagorean theorem.” One teacher asserts that this is a unique demand and that there is no other theorem in geometry that can be paired with it as a similar measure. It cannot be paired with itself if there is any memory, conscious or subconscious, of the first attempt at proof at the time the second attempt is made, for then the mental processes are clearly different in the two cases. The writer suggests that anyone doubting this general principle take, say, a contemporary-affairs test and then retake it a day later. He will undoubtedly note that he works much faster and the depth and breadth of his thinking is much less, - he simply is not doing the same sort of thing as before. (Kelley, 1942, p 75-76)

The alternative to estimating composite reliability by repeating the measure to get two splits is to split the test items from one administration. Thus, it is possible to consider splits such as the odd versus even items of a test. This would reflect differences in speed of taking a test in a different manner than would splitting a test into a first and second part (Brown, 1910). Unfortunately, the number of ways to split an n item test into two is an explosion of possible combinations

$$\binom{n}{n/2}\Big/2 = \frac{n!}{2\,\left((n/2)!\right)^2}.$$

A 16 item test has 6,435 possible 8 item splits and a 20 item test has 92,378 10 item splits. Most of these possible splits will yield slightly different split half estimates. Consider all possible splits
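The counts of possible splits are easy to confirm. The sketch below counts the equal-sized splits of an n item test and, for small tests, enumerates them; the function names are ours, and the enumeration assumes the item labels are distinct.

```python
import math
from itertools import combinations

def n_half_splits(n_items):
    """Number of distinct ways to divide an even number of items into two equal halves."""
    if n_items % 2:
        raise ValueError("this sketch assumes an even number of items")
    return math.comb(n_items, n_items // 2) // 2

def enumerate_half_splits(items):
    """Yield each split once by always placing the first item in half A."""
    items = list(items)
    first, rest = items[0], items[1:]
    for chosen in combinations(rest, len(items) // 2 - 1):
        half_a = (first,) + chosen
        half_b = tuple(item for item in items if item not in half_a)
        yield half_a, half_b

print(n_half_splits(16), n_half_splits(20))
print(len(list(enumerate_half_splits(range(6)))))
```

Fixing the first item in one half is what removes the double counting of a split and its mirror image, which is the division by two in the formula.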

That is, the reliability of a composite of n tests (or items) increases as a function of the number of items and the average intercorrelation or covariance of the tests (items). By combining items, each of which is a mixture of signal and noise, the ratio of signal to noise (S/N) increases linearly with the number of items and the resulting composite is a purer measure of signal (Cronbach & Gleser, 1964). If we think of every item as a very weak thread (the amount of signal is small compared to the noise), we can make a very strong rope by binding many threads together (Equation 17). Considering how people differ from item to item and from trial to trial, Guttman (1945) defined reliability as variation over trials.

Using this definition, no assumptions of zero means for errors or zero correlations are needed to prove that the total variance of the test is the sum of the error variance and the variance of expected scores; this relationship between variances is an algebraic identity. Therefore, the reliability coefficient is defined without assumptions of independence as the complement of the ratio of error variance to total variance (Guttman, 1945, p 257).

That is,

$$r_{xx} = \frac{\sigma^2_x - \sigma^2_e}{\sigma^2_x} = 1 - \frac{\sigma^2_e}{\sigma^2_x}$$

KR-20, λ3, and α as indicators of internal consistency

Although originally developed to predict the reliability of a composite where the reliability of the subtests is found from their test-retest correlation, the Spearman-Brown methodology was quickly applied to estimating reliability based upon the internal structure of a particular test. Because of the difficulty of finding the average between-item correlation or covariance in equations 17 or 18, reliability was expressed in terms of the total test variance, $V_x$, and a function of the item variances, $V_{x_i}$. For dichotomous items with a probability of being correct, p, or being wrong, q, $V_{x_i} = p_iq_i$ (Kuder & Richardson, 1937). This approach was subsequently generalized to polytomous and continuous items by Guttman (1945) and by Cronbach (1951). The approach to finding $\sigma^2_e$ for dichotomous items taken by Kuder & Richardson (1937) was to recognize that, for an n-item test, the average covariance between items estimates the reliable variance of each item, and the error variance for each item will therefore be

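$$\sigma^2_{e_i} = \sigma^2_{x_i} - \bar{\sigma}_{ij} = \sigma^2_{x_i} - \frac{\sigma^2_x - \Sigma\sigma^2_{x_i}}{n(n-1)} = p_iq_i - \frac{\sigma^2_x - \Sigma p_iq_i}{n(n-1)}$$

and thus

$$r_{xx} = \frac{\sigma^2_x - \sigma^2_e}{\sigma^2_x} = \frac{\sigma^2_x - \Sigma\left(p_iq_i - \frac{\sigma^2_x - \Sigma p_iq_i}{n(n-1)}\right)}{\sigma^2_x}$$

or

$$r_{xx} = \frac{\sigma^2_x - \Sigma p_iq_i}{\sigma^2_x}\,\frac{n}{n-1} \tag{20}$$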

Expressed in terms of the total test variance and the sum of the (dichotomous) item variances, Equation 20 was the 20th equation in Kuder & Richardson (1937) and is thus known as the Kuder-Richardson (20) or KR 20 formula for reliability. Generalizing this to the polytomous or continuous item case, it is known either as α (Cronbach, 1951) or as λ3 (Guttman, 1945):

$$r_{xx} = \alpha = \lambda_3 = \frac{\sigma^2_x - \Sigma\sigma^2_{x_i}}{\sigma^2_x}\,\frac{n}{n-1} \tag{21}$$
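To make the link between KR-20 and α concrete, here is a minimal sketch that computes both from a small matrix of item scores (rows are people, columns are items). It is not the alpha function of the psych package; the function names and the five hypothetical response patterns are ours, and population variances are used so that the variance of a dichotomous item equals pq.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Alpha (lambda 3) from the total score variance and the item variances."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("at least two items are needed")
    item_vars = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    total_var = pvariance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (total_var - sum(item_vars)) / total_var

def kr20(scores):
    """KR-20 for dichotomous (0/1) items, using p*q as each item variance."""
    n_items, n_people = len(scores[0]), len(scores)
    p = [sum(row[j] for row in scores) / n_people for j in range(n_items)]
    sum_pq = sum(p_j * (1 - p_j) for p_j in p)
    total_var = pvariance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (total_var - sum_pq) / total_var

# Hypothetical right/wrong responses of five people to four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha_value = coefficient_alpha(responses)
kr20_value = kr20(responses)
print(round(alpha_value, 3), round(kr20_value, 3))
```

For dichotomous items the two functions agree, which is the sense in which α generalizes KR-20 to polytomous and continuous items.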

Guttman (1945) considered six different ways to estimate reliability from the pattern of item correlations. His λ3 coefficient used the average inter-item covariance as an estimate of the reliable variance for each item. He also suggested an alternative, λ6, which is to use the amount of an item’s variance which is predictable by all of the other variables. That is, to find the squared multiple correlation or smc of the item with all the other items and then find the shared variance as $V_{s_i} = smc_i V_{x_i}$:

$$\lambda_6 = \frac{V_x - \Sigma V_{x_i} + \Sigma V_{s_i}}{V_x} \tag{22}$$

Guttman (1945) also considered the maximum split half reliability (λ4). Both λ4 and λ6 are obviously more complicated to find than λ3 or α. To find λ4 requires finding the maximum among many possible splits and λ6 requires taking the inverse of the correlation matrix to find the smc. But with modern computational power, it is easy to find λ6 using the alpha, scoreItems or splitHalf functions in the psych package. It is a little more tedious to find λ4 but this can be done by comparing all possible splits for up to 16 items or by sampling thousands of times for larger data sets using the splitHalf function. Consider the 16 ability items with the range of split half correlations as shown in Figure 1. Using the splitHalf function we find that the range of possible splits is from .73 to .87 with an average of .83, α = .83, λ6 = .84 and a maximum (λ4) of .87.

Standard error of alpha

There are at least two ways to find the standard error of the estimated α. One is through bootstrapping, the other is through normal theory. Consider the variability in values of α for the 16 ability items for 1525 subjects found in the ability data set. Using the alpha function to bootstrap by randomly resampling the data (with replacement) 10,000 times yields a distribution of alpha that ranges from .802 to .848 with a mean value of .829 (Figure 2). Compare this to the observed value of .829. Using the assumption of multivariate normality, Duhachek & Iacobucci (2004) showed that the standard error of α, ASE, is a function of the covariance matrix of the items, V, the number of items, n, and the sample size, N. Defining Q as

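$$Q = \frac{2n^2}{(n-1)^2(\mathbf{1}'V\mathbf{1})^3}\left[\mathbf{1}'V\mathbf{1}\,(\operatorname{tr}V^2 + \operatorname{tr}^2 V) - 2\operatorname{tr}V\,(\mathbf{1}'V^2\mathbf{1})\right] \tag{23}$$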

where tr is the trace of a matrix (the sum of the diagonal of a matrix), and $\mathbf{1}$ is a column vector of 1’s, then the standard error of α is

$$ASE = \sqrt{\frac{Q}{N}}$$

and the resulting 95% confidence interval is

$$\alpha \pm 1.96\sqrt{\frac{Q}{N}}$$

These confidence intervals are reported in the alpha function. For the 1525 subjects in the 16 item ability data set, the 95% confidence interval using normal theory is from 0.8123 to 0.8462, which is very similar to the empirical bootstrapped estimates of 0.8166 to 0.8403.
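The normal theory calculation can be checked by hand on a small example. The Python sketch below assumes the standard error is $\sqrt{Q/N}$, with N the sample size and n the number of items, and applies it to a hypothetical six-item matrix with every intercorrelation equal to .25 and a hypothetical N of 2,800. The function name `alpha_and_ase` is illustrative and is not the psych implementation.

```python
import math


def alpha_and_ase(V, N):
    """Coefficient alpha and its normal theory standard error for an
    item covariance matrix V (n items) and sample size N."""
    n = len(V)
    total = sum(sum(row) for row in V)                 # 1'V1
    tr_V = sum(V[i][i] for i in range(n))              # tr V
    tr_V2 = sum(V[i][j] * V[j][i]
                for i in range(n) for j in range(n))   # tr V^2
    row_sums = [sum(row) for row in V]                 # V1
    one_V2_one = sum(s * s for s in row_sums)          # 1'V^2 1 (V is symmetric)
    Q = (2.0 * n ** 2 / ((n - 1) ** 2 * total ** 3)) * (
        total * (tr_V2 + tr_V ** 2) - 2.0 * tr_V * one_V2_one)
    alpha = (n / (n - 1)) * (1.0 - tr_V / total)
    return alpha, math.sqrt(Q / N)


# Hypothetical data: six standardized items, all intercorrelations .25
n_items, r = 6, 0.25
V = [[1.0 if i == j else r for j in range(n_items)] for i in range(n_items)]
alpha, ase = alpha_and_ase(V, N=2800)
lower, upper = alpha - 1.96 * ase, alpha + 1.96 * ase
print(round(alpha, 3), round(ase, 4), round(lower, 3), round(upper, 3))
```

For these inputs α is about .667 and the standard error is a little under .01, giving an interval of roughly .65 to .69. Real item sets are not equicorrelated or exactly normal, so values reported by the alpha function will differ.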

The signal to noise (S/N) ratio is just

$$\frac{S}{N} = \frac{\rho^2}{1-\rho^2},$$

which, for the same assumptions as for α, will be

$$\frac{S}{N} = \frac{n\bar{r}}{1-\bar{r}}.$$
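A short numerical check of the two forms of the ratio, written as a Python sketch. It assumes the standardized form of α, $n\bar{r}/(1+(n-1)\bar{r})$, and a hypothetical average correlation of .25; the function names are illustrative.

```python
def signal_to_noise(reliability):
    """Ratio of reliable to unreliable variance for a given reliability."""
    return reliability / (1.0 - reliability)


def alpha_from_items(n, r_bar):
    """Standardized alpha for n items with average intercorrelation r_bar."""
    return n * r_bar / (1.0 + (n - 1) * r_bar)


r_bar = 0.25  # hypothetical average inter-item correlation
for n in (5, 10, 20, 40):
    alpha = alpha_from_items(n, r_bar)
    direct = n * r_bar / (1.0 - r_bar)  # S/N computed from n and r_bar
    print(n, round(alpha, 3), round(signal_to_noise(alpha), 3), round(direct, 3))
```

Doubling the number of items doubles the S/N ratio, while α rises by progressively smaller steps.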

That is, the S/N ratio increases linearly with the number of items and also increases with the average intercorrelation. By thinking in terms of this ratio, the benefits of increased reliability due to increasing the number of items are seen not to be negatively accelerated, as they appear to be when thinking just in reliability units (Equation 21). Indeed, while the S/N ratio is linear with the number of items, it is an accelerating function of the conventional measures of reliability. That is, while the S/N = 1 for a test with a reliability of .5, it is 2 for a test with a reliability of .66, 3 for .75, and 4 for .8, it is 9 for a test with a reliability of .9 and 19 for a reliability of .95 (right hand panel of Figure 3). Depending upon whether the test is norm referenced (comparing two individuals) or domain referenced (comparing an individual to a criterion), there are several different S/N ratios to consider, but all follow this same general form (Brennan & Kane, 1977).

It is not unusual when creating a set of items thought to measure one construct to have some items that do not really belong. This is, of course, an opportunity to use factor analysis to explore the structure of the data. If just a few items are suspect, it is possible to find α and λ6 for all the subsets found by dropping out one item. That is, if an item doesn’t really fit, the α and λ6 values of a scale without that item will actually be higher (Table 3). In the example, five items measuring Agreeableness and one measuring Conscientiousness were scored using the alpha function. Although the α and λ6 values for all six items were .66 and .65 respectively, if item C1 is dropped, the values become .70 and .68. For all other single items, dropping the item leads to a decrease in α and λ6, either because it reduces the average r (items A2-A5), because the test length is less (items A1-A5), or both. Note that the alpha function recognizes that one item (A1) needs to be reverse scored. If it were not reverse scored, the overall α value would be .44. This reverse scoring is done by finding the sign of the loading of each item on the first principal component of the item set and then reverse scoring those with a negative loading.

If internal consistency were the only goal when creating a test, clearly reproducing the same item many times will lead to an extraordinary reliability. (It will not be one because, given the same item repeatedly, some people will in fact change their answers.) But this kind of tautological consistency is meaningless and should be avoided. Items should have similar domain content, but not identical content.
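The logic of dropping one item at a time, described earlier in this section, can be sketched in Python for a hypothetical four-item correlation matrix in which one item (labeled X4 here) barely correlates with the rest. The matrix, the labels, and the helper functions are assumptions for illustration; the alpha function in psych reports more than this.

```python
def std_alpha(R):
    """Standardized alpha from a correlation matrix."""
    n = len(R)
    total = sum(sum(row) for row in R)
    return (n / (n - 1)) * (1.0 - n / total)


def alpha_if_dropped(R, labels):
    """Standardized alpha of the scale formed by leaving out each item in turn."""
    out = {}
    for k, label in enumerate(labels):
        keep = [i for i in range(len(R)) if i != k]
        sub = [[R[i][j] for j in keep] for i in keep]
        out[label] = std_alpha(sub)
    return out


# Hypothetical items: X1-X3 intercorrelate .4, X4 correlates .05 with each
labels = ["X1", "X2", "X3", "X4"]
R = [[1.00, 0.40, 0.40, 0.05],
     [0.40, 1.00, 0.40, 0.05],
     [0.40, 0.40, 1.00, 0.05],
     [0.05, 0.05, 0.05, 1.00]]
full = std_alpha(R)
dropped = alpha_if_dropped(R, labels)
print("all items:", round(full, 3))
for label in labels:
    print("without", label, round(dropped[label], 3))
```

Here α for all four items is about .54. Dropping the poorly fitting item X4 raises it to about .67, while dropping any of the other three lowers it to about .38.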

Reliability of scales formed from dichotomous or polytomous variables

Whether using true/false items to assess ability or 4-6 level polytomous (Likert-like) items to assess interests, attitudes, or temperament, the inter-item correlations are reduced from what would be observed with continuous measures. The tetrachoric and polychoric correlation coefficients are estimates of what the relationship would be between two bivariate, normally distributed items if they had not been dichotomized (tetrachoric) or trichotomized, tetrachotomized, pentachotomized or otherwise broken into discrete but ordered categories (Pearson, 1901). The use of tetrachoric correlations to model what would be the case if the data were in fact bivariate normal had they not been dichotomized is not without critics. The most notable was Yule (1912), who suggested that some phenomena (vaccinated vs. not vaccinated, alive vs. dead) were truly dichotomous, while Pearson & Heron (1913) defended the use of Pearson’s tetrachoric correlation.

Some have proposed that one should use tetrachoric or polychoric correlations when finding the reliability of categorical scales (Gadermann, Guhn, & Zumbo, 2012; Zumbo, Gadermann, & Zeisser, 2007). We disagree. The Zumbo et al. (2007) procedure estimates the correlation between

Table 3: Item analysis of five Agreeableness items and one Conscientiousness item from the bfi data set using the alpha function. Note that one item is automatically reversed. Without reverse scoring item A1, α = .44 and λ6 = .52. The items are: A1: “Am indifferent to the feelings of others.”, A2: “Inquire about others’ well-being”, A3: “Know how to comfort others”, A4: “Love children”, A5: “Make people feel at ease” and C1: “Am exacting in my work.” The item statistics include the number of subjects who answered the item, the raw correlation (inflated by item overlap), a correlation that corrects for scale unreliability and item overlap, the correlation with the scale without that item, the mean and standard deviation for each item. Examining the effect of dropping one item at a time or by looking at the correlations of the item with the scale, item C1 does not belong to this set of items.

alpha(bfi[1:6])

Reliability analysis
Call: alpha(x = bfi[1:6])

  raw_alpha  std.alpha  G6(smc)  average_r    ase  mean   sd
       0.66       0.66     0.65       0.25  0.014   4.6   0.

  lower  alpha  upper     95% confidence boundaries
   0.63   0.66     0.

Reliability if an item is dropped:
      raw_alpha  std.alpha  G6(smc)  average_r  alpha se
A1-        0.66       0.66     0.64       0.28        0.
A2         0.56       0.56     0.55       0.20        0.
A3         0.55       0.55     0.53       0.20        0.
A4         0.61       0.62     0.61       0.24        0.
A5         0.58       0.58     0.56       0.22        0.
C1         0.70       0.71     0.68       0.33        0.

Item statistics
         n     r  r.cor  r.drop  mean   sd
A1-   2784  0.52   0.35    0.27   4.6   1.
A2    2773  0.72   0.67    0.55   4.8   1.
A3    2774  0.74   0.71    0.57   4.6   1.
A4    2781  0.61   0.48    0.39   4.7   1.
A5    2784  0.69   0.61    0.49   4.6   1.
C1    2779  0.37   0.13    0.11   4.5   1.

Non missing response frequency for each item
         1     2     3     4     5     6  miss
A1    0.33  0.29  0.14  0.12  0.08  0.03    0.
A2    0.02  0.05  0.05  0.20  0.37  0.31    0.
A3    0.03  0.06  0.07  0.20  0.36  0.27    0.
A4    0.05  0.08  0.07  0.16  0.24  0.41    0.
A5    0.02  0.07  0.09  0.22  0.35  0.25    0.
C1    0.03  0.06  0.10  0.24  0.37  0.21    0.

Table 4: The average inter-item correlation, and thus α, varies as a function of the number of categories in a scale as well as the discrimination parameter (factor loadings) of the items. α based upon the raw correlations more closely approximates the squared correlation of the observed scores with the latent score, ρ²_oθ, than does the α based upon the polychoric correlations. The ratio of alpha to the squared correlation is shown for both the raw, α/ρ², and the polychoric based α, α_poly/ρ². Simulated data using the sim.congeneric function.

Simulated results for 10,000 cases.

Factor   Number of
loading  categories     r̄     α   ρ_oθ  ρ²_oθ  r̄_poly  α_poly  ρ_pθ  ρ²_pθ  α/ρ²  α_poly/ρ²
.33      2           0.06  0.48  0.69   0.47    0.10    0.60  0.65   0.43  1.02     1.
.33      3           0.07  0.53  0.73   0.53    0.10    0.61  0.71   0.51  1.00     1.
.33      4           0.08  0.56  0.75   0.56    0.10    0.60  0.74   0.54  0.99     1.
.33      5           0.09  0.58  0.76   0.59    0.10    0.61  0.75   0.57  0.99     1.
.33      6           0.09  0.58  0.76   0.58    0.10    0.60  0.76   0.58  1.00     1.
.33      7           0.09  0.59  0.77   0.59    0.10    0.61  0.76   0.57  1.00     1.
.47      2           0.14  0.69  0.83   0.69    0.22    0.80  0.83   0.69  1.00     1.
.47      3           0.16  0.73  0.85   0.73    0.22    0.80  0.85   0.73  1.00     1.
.47      4           0.18  0.76  0.87   0.77    0.22    0.80  0.87   0.76  0.99     1.
.47      5           0.19  0.77  0.88   0.77    0.22    0.80  0.88   0.77  0.99     1.
.47      6           0.20  0.78  0.88   0.78    0.22    0.80  0.88   0.78  1.00     1.
.47      7           0.21  0.79  0.89   0.79    0.22    0.80  0.89   0.79  1.00     1.
.63      2           0.26  0.83  0.90   0.80    0.39    0.90  0.90   0.81  1.03     1.
.63      3           0.29  0.85  0.92   0.85    0.39    0.90  0.91   0.84  1.00     1.
.63      4           0.33  0.87  0.93   0.87    0.39    0.90  0.93   0.87  1.00     1.
.63      5           0.35  0.88  0.94   0.88    0.39    0.90  0.94   0.88  1.00     1.
.63      6           0.36  0.89  0.94   0.89    0.39    0.90  0.94   0.89  1.00     1.
.63      7           0.37  0.89  0.94   0.89    0.39    0.90  0.94   0.89  1.00     1.
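The reduction in the observed correlation that comes from dichotomizing, which underlies the difference between the raw and polychoric columns of Table 4, can be illustrated with a small Python simulation. This sketch covers only the special case of two standard normal variables that are both split at their medians, where the expected phi coefficient is $(2/\pi)\arcsin\rho$. The latent correlation of .5, the sample size, and the function names are hypothetical, and general tetrachoric estimation (as in psych) handles unequal cut points.

```python
import math
import random


def phi_after_median_split(rho):
    """Expected phi when two standard normal variables with correlation
    rho are both dichotomized at their medians."""
    return 2.0 * math.asin(rho) / math.pi


def rho_from_phi_median_split(phi):
    """Invert the relation above (valid only for median splits)."""
    return math.sin(math.pi * phi / 2.0)


def simulate_phi(rho, n_cases=20000, seed=1):
    """Simulate bivariate normal data, dichotomize at 0, return the observed phi."""
    rng = random.Random(seed)
    n11 = n10 = n01 = n00 = 0
    for _ in range(n_cases):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if z1 > 0 and z2 > 0:
            n11 += 1
        elif z1 > 0:
            n10 += 1
        elif z2 > 0:
            n01 += 1
        else:
            n00 += 1
    numerator = n11 * n00 - n10 * n01
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return numerator / denominator


rho = 0.5  # hypothetical latent correlation
expected_phi = phi_after_median_split(rho)
observed_phi = simulate_phi(rho)
print(round(expected_phi, 3), round(observed_phi, 3),
      round(rho_from_phi_median_split(expected_phi), 3))
```

A latent correlation of .5 is expected to appear as a phi of one third after the split, and applying the inverse relation to that phi recovers .5.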

Domain sampling theory and structural measures of reliability

A great deal of space has been devoted to finding λ3 or α. This is not because we recommend the routine use of either, for we don’t. They are important to discuss both for historical reasons and because so many applied researchers use them. It would seem that one can not publish a paper without reporting “Cronbach’s α”. This is unfortunate, for as we (Revelle, 1979; Revelle & Zinbarg, 2009) and many others (e.g., Bentler, 2009; Green & Yang, 2009; Lucke, 2005; Schmitt, 1996; Sijtsma, 2009), including Cronbach & Shavelson (2004), have discussed, α is neither a measure of how well a test measures one thing (Revelle, 1979; Revelle & Zinbarg, 2009; Zinbarg, Revelle, Yovel, & Li, 2005), nor the greatest lower bound for reliability (Bentler, 2009).

The basic problem is that α assesses neither the reliability of a test, nor the internal consistency of a test unless the test items all represent just one factor. This is generally not the case.

When we think about a test made up of specific items thought to measure a construct, we are concerned not so much with those particular items as we are with how those items represent the larger (perhaps infinite) set of possible items that reflect that construct. Thus, extraversion is not just responding with strong agreement to an item asking about enjoying lively parties, but it also reflects a preference for talking to people rather than reading books, to seeking out exciting situations, to taking charge, and many, many more affective, behavioral, cognitive and goal directed items (Wilt & Revelle, 2009). Nor is general intelligence just the ability to do spatial rotation problems, to do number or word series, or the ability to do a matrix reasoning task (Gottfredson, 1997).

Items will correlate with each other not just because they share a common core or general factor, but also because they represent some subgroups of items which share some common affective, behavioral or cognitive content, i.e., group factors. Tests made up of such items will correlate with other tests to the extent they both represent the general core that all items share, but also to the extent that specific group factors match across tests.

By a general factor, we mean a factor on which most if not all of the items have a substantial loading. It is analogous to the general 3° background radiation in radio astronomy used as evidence for the “Big Bang”. That is, it pervades all items (Revelle & Wilt, 2013). Group factors, on the other hand, represent item clusters where only some or a few items share some common variance in addition to that shared with the general factor. These group factors represent systematic content (e.g., party going behavior vs. talkativeness in measures of extraversion, spatial and verbal content in measures of ability) over and above what is represented by the general factor. Typically, when we assign a name to a scale we are implicitly assuming that a substantial portion of that scale does in fact reflect one thing: the general factor.

Reliability is both the ratio of true score variance to observed variance as well as the correlation of a test with a test just like it (Equation 11). But what does it mean to be a test just like another test? If we are concerned with a test made up of a set of items sampled from a domain, then the other test should also represent samples from that same domain. If we are interested in what is common to all the items in the domain, we are interested in the general factor saturation of the test. If we are interested in a test that shares general as well as group factors with another test, then we are concerned with the total reliability of the test.

Seven measures of internal consistency: α, λ3, λ6, β, ωg, ωt, and λ4

This distinction between general, group, and total variance in a test, and the resulting correlations with similar tests has led to at least seven different coefficients of internal consistency. These are: α (Cronbach, 1951) (Equation 21) and its equivalent, λ3 (Guttman, 1945) (Equation 21), which are estimates based upon the average inter-item covariance; λ6, an estimate based upon the squared multiple correlations of the items (Equation 22); β, defined as the worst split half reliability (Revelle, 1979); ωg (McDonald, 1999; Revelle & Zinbarg, 2009; Zinbarg et al., 2005), the amount of general factor saturation; ωt, the total reliable variance estimated by a factor model; and λ4, the greatest split half reliability.
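To make the distinction between general factor saturation and total reliable variance concrete, the following Python sketch computes ωg, ωt, and α from a hypothetical set of loadings. It assumes standardized items, one general factor, and orthogonal group factors, so that the total test variance is the squared sum of the general loadings, plus the squared sums of the loadings on each group factor, plus the summed uniquenesses. The loadings and the function name are illustrative only and do not reproduce the estimation done by the omega function in psych.

```python
def omega_from_loadings(general, group_factors):
    """omega_g, omega_t and alpha for standardized items, given loadings on
    one general factor and on orthogonal group factors (0 for non-members)."""
    n = len(general)
    communality = [general[i] ** 2 + sum(f[i] ** 2 for f in group_factors)
                   for i in range(n)]
    unique_var = sum(1.0 - h2 for h2 in communality)
    general_var = sum(general) ** 2                      # variance due to g
    group_var = sum(sum(f) ** 2 for f in group_factors)  # variance due to groups
    Vx = general_var + group_var + unique_var            # total test variance
    omega_g = general_var / Vx
    omega_t = (general_var + group_var) / Vx
    alpha = (n / (n - 1)) * (1.0 - n / Vx)               # item variances are all 1
    return omega_g, omega_t, alpha


# Hypothetical test: six items load .5 on g; two group factors of three items load .4
general = [0.5] * 6
group_factors = [[0.4, 0.4, 0.4, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.4, 0.4, 0.4]]
omega_g, omega_t, alpha = omega_from_loadings(general, group_factors)
print(round(omega_g, 3), round(alpha, 3), round(omega_t, 3))
```

With these loadings ωg is about .58, α about .73, and ωt about .77. In this example α overstates the general factor saturation and understates the total reliable variance.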

As an example of the use of these coefficients, consider the 16 ability items discussed earlier (Figure 1). We have already shown that this set has α = λ3 = .83 with a λ6 = .84 and a λ4 = .87. To find the other coefficients requires either cluster analysis for β or factor analysis for the two ω coefficients. A parallel analysis of random data (Horn, 1965) suggests that two principal components or four factors should be extracted. When four factors are found, the resulting structure may be seen in the left hand panel of Figure 4. But these factors are moderately correlated, and when the matrix of factor correlations is in turn factored, the hierarchical structure may be seen in the right hand panel of Figure 4.

Although hierarchical, higher level, or bifactor models of ability have been known for years (Holzinger & Swineford, 1937, 1939; Schmid & Leiman, 1957), it is only relatively recently that these models have been considered when addressing the reliability of a test (McDonald, 1999; Revelle & Zinbarg, 2009; Zinbarg et al., 2005). Rather than consider the reliable variance of a test as reflecting