Reliability from α to ω: A Tutorial

William Revelle and David M. Condon

Northwestern University and University of Oregon

contact: William Revelle [email protected] Preparation of this manuscript was funded in part by grant SMA-1419324 from the National Science Foundation to WR. This is the revised version as submitted for review to Psychological Assessment on June 11, 2019.

Abstract

Reliability is a fundamental problem for measurement in all of science. Although defined in multiple ways, and estimated in even more ways, the basic concepts seem straightforward and need to be understood by practitioners as well as methodologists. Reliability theory is not just for the psychometrician estimating latent variables, it is for everyone who wants to make inferences from measures of individuals or of groups. For the case of a single test administration, we consider multiple measures of reliability, ranging from the worst (β) to average (α, λ3) to best (λ4) split half reliabilities, and consider why model based estimates (ωh, ωt) should be reported. We also address the utility of test-retest and alternate form reliabilities. The advantages of immediate versus delayed retests to decompose observed score variance into specific, state, and trait scores are discussed. But reliability is not just for test scores, it is also important when evaluating the use of ratings. Estimates that may be applied to continuous data include a set of intraclass correlations while discrete categorical data need to take advantage of the family of κ statistics. Examples of these various reliability estimates are given using state and trait measures of anxiety given with different delays and under different conditions. An online supplement is provided with more detail and elaboration. The online supplement is also used to demonstrate applications of open source software to examples of real data, and comparisons are made between the many types of reliability.

Public Significance: A tutorial on the estimation of the reliability of test scores considers classical and model based approaches. Examples using open source software applied to several real world data sets are provided.

Keywords: Reliability; Generalizability; Classical Test Theory; R packages

Reliability

Reliability is a fundamental problem for measurement in all of science for “(a)ll measurement is befuddled by error” (McNemar, 1946, p. 294). Perhaps because psychological measures are more befuddled than those of the other natural sciences, psychologists have long studied the problem of reliability (Spearman, 1904b; Kuder & Richardson, 1937; Guttman, 1945; Lord, 1955; Cronbach,
1951; Feldt & Brennan, 1989; McDonald, 1999) and it remains an active topic of research (Sijtsma, 2009; Revelle & Zinbarg, 2009; Bentler, 2009; McNeish, 2017; Wood, Harms, Lowman, & DeSimone, 2017). Unfortunately, although recent advances in the theory and measurement of reliability have gone far beyond the earlier contributions, much of this literature is more technical than readable and is aimed for the specialist rather than the practitioner. (This is not a new problem, e.g., Anastasi, 1967; Glass, 1986, bemoan this tendency). We hope to remedy this issue somewhat, for an appreciation of the problems and importance of reliability is critical to the activity of measurement across many disciplines. Reliability theory is not just for the psychometrician estimating latent variables, but also for the baseball manager trying to predict how well a high performing player will perform the next year, for accurately estimating agreement among doctors in patient diagnoses, and in evaluations of the extent to which stock market advisors under-perform the market. Issues of reliability are fundamental to understanding how correlations between observed variables are (attenuated) underestimates of the relationships between the underlying constructs, how observed estimates of a person’s score are biased estimates of their latent score, and how to estimate the confidence intervals around any particular measurement. Understanding the many ways to estimate reliability as well as the ways to use these estimates allows one to better assess individuals and to evaluate selection and prediction techniques. This is not just a problem for measurement specialists but for all who want to make theoretical inferences from observed data. Schmidt & Hunter (1996) discuss 26 ways that not correcting for the effects of reliability and measurement error can hinder progress in many areas of psychological research. 
However, Borsboom & Mellenbergh (2002) take a contrary position and suggest that using classical measures of reliability for such corrections is an error. The fundamental question in reliability is to what extent do scores measured at one time and place with one instrument predict scores at another time and/or place and perhaps measured with a different instrument? That is, given a person’s score on test 1 at time 1, what score should be expected at a second measurement occasion? The naive belief is that if the tests measure the same construct, then people will do just as well on the second measure as they did on the first. This mistaken belief contributes to several errors including the common view that punishment improves and rewards diminish subsequent performance (Kahneman & Tversky, 1973) and other popular phenomena like the “sophomore slump" and the “Sports Illustrated jinx" (Schall & Smith, 2000). More formally, the expectation for the second measure is just the regression of observations at time 2 on the observations at time 1. If both the time 1 and time 2 measures are equally “befuddled by error" then the observed relationship is the reliability of the measure: the ratio of the latent score variance to the observed score variance.
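The regression view of the retest expectation can be sketched in a few lines. This is a minimal illustration rather than anything taken from the paper: it assumes both occasions share the same mean and variance, so that the regression slope equals the reliability, and the function name `expected_retest` and the example numbers are hypothetical.

```python
def expected_retest(score, group_mean, reliability):
    """Expected time-2 score given a time-1 score.

    Assumes equal means and variances at both occasions, so the
    regression slope of time 2 on time 1 is the reliability itself.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    # Regress the deviation from the mean back toward the mean.
    return group_mean + reliability * (score - group_mean)


# Hypothetical example: a score 30 points above the mean on a measure
# with reliability .80 is expected to be only 24 points above it next time.
high_scorer = expected_retest(130.0, 100.0, 0.8)
low_scorer = expected_retest(70.0, 100.0, 0.8)
print(round(high_scorer, 2), round(low_scorer, 2))
```

Under these assumptions extreme scorers are expected to move toward the mean in both directions, which is the pattern behind the "sophomore slump" and the apparent effects of punishment and reward described above.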

Reliability as a variance ratio

The basic concept of reliability seems to be very simple: observed scores reflect an unknown mixture of signal and noise. To detect the signal, we need to reduce the noise. Reliability thus defined is a function of the ratio of signal to noise. The signal might be something as esoteric as a gravity wave produced by a collision of two black holes, or as prosaic as estimating the expected batting average of a baseball player based upon the performance of the prior year. The noise in gravity wave detectors includes the seismic effects of cows wandering in fields near the detector as well as passing ice cream trucks. The noise in batting averages includes the effect of opposing pitchers, variations in wind direction, and the effects of jet lag and sleep deprivation. We can enhance the signal/noise ratio by either increasing the signal or reducing the noise. Unfortunately, this classic statement of reliability ignores the need for unidimensionality of our measures and equates expected scores with construct scores, a relationship that needs to be tested rather than

[Figure 1: path diagram. Labels recovered from the original: Construct 1 (χ) and Construct 2 (η); latent measures; observed measures x, x′, y, y′; latent correlation ρχη (validity); paths √rxx′ and √ryy′; observed correlations rxx′, ryy′, and rxy.]

Figure 1. The basic concept of reliability and correcting for attenuation. Adjusting observed correlations (rxy) by reliabilities (rxx′, ryy′) estimates underlying latent correlations (ρχη). (See Equation 3.) Observed variables and correlations are shown in conventional Roman fonts, latent variables and latent paths in Greek fonts.
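The adjustment described in the Figure 1 caption can be illustrated with the standard correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. Equation 3 itself is not reproduced in this excerpt, so the sketch below assumes that standard form; the function name `disattenuate` and the example values are hypothetical.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Estimate the latent correlation from an observed correlation.

    r_xy: observed correlation between measures x and y
    r_xx, r_yy: reliabilities of x and y (must be in (0, 1])
    """
    for name, value in (("r_xx", r_xx), ("r_yy", r_yy)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1]")
    # Each observed variable reflects its construct with a path of
    # sqrt(reliability), so the observed r is the latent r times both paths.
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values: an observed r of .42 between two measures with
# reliabilities of .49 and .64 corresponds to a latent correlation of .75.
print(round(disattenuate(0.42, 0.49, 0.64), 3))
```

Because the correction divides by a quantity no larger than 1, the estimate is never smaller in absolute value than the observed correlation, and with low reliabilities and sampling error it can exceed 1.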

Equations 1 - 6 are intellectually interesting, but not very helpful, for they decompose an observed measure into the two unobservable variables of latent score and latent error. To make it even more complicated, all tests are assumed to measure something stable over time (denoted as T for trait like), something that varies over time (reflecting the current state and denoted as S), some specific variance (s) that is stable but does not measure our trait of interest, and some residual, random error (E) (Baltes, 1987; Cattell, 1966b; Hamaker, Schuurman, & Zijlmans, 2017; Kenny & Zautra, 1995). Although we are ultimately interested in the precision of a score for each individual, reliability is expressed as a ratio of variances between individuals^3: The reliability of X (rxx) is just the percentage of total variability (VX) that is not error. Unless we have repeated measures, error is an unknown mixture of variability due to differences in item means, the person x item interaction, and some un-modeled residual. The reliable between person variance is a mixture of Trait, State, and specific variance. (For instance, an item such as “I enjoy a lively party” is an unknown mixture of trait extraversion, state positive affect, and the specific wording of the item – how one interprets lively and party.) If we are interested in the stable trait scores, reliability is the ratio of (unobservable) trait variance (σ^2_T) to (observable) total test variance (VX). (We use σ^2 to represent unobservable variances, V to represent observable variance.) That is,

$$r_{tt} = \frac{\sigma^2_T}{V_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_S + \sigma^2_s + \sigma^2_e} \tag{7}$$

(^3) We can also find within subject reliability across time. This will be discussed later and in the online supplement.

However, if we are interested in how well we are measuring a particular fluctuating state (e.g. an emotion) we want to know

$$r_{ss} = \frac{\sigma^2_S}{V_X} = \frac{\sigma^2_S}{\sigma^2_T + \sigma^2_S + \sigma^2_s + \sigma^2_e} \tag{8}$$

The problem becomes how to find σ^2_T or σ^2_S and how to separate their effects. Although Trait scores are thought to be stable over time, State scores, while fluctuating, show some (unknown) short term temporal stability. Consider a measure of depression. Part of an individual’s depression score will reflect long term trait neuroticism and some of it reflects current negative emotional state. Two measures taken a few hours apart should produce similar trait and state values, although measures taken a year apart should reflect just the trait. In all cases, we are interested in the scores for the individuals being measured. To make the problem even more complicated, it is likely that our Trait or State scores reflect some aggregation of item responses or of the ratings of judges. Thus, we want to assess the variance due to Traits or States that is independent of the effects of items or judges, how much variance is due to the items or judges, and finally how much variance is due to the interactions of items/judges with the Trait/State measures^4. To be consistent with much of the literature, we will treat Trait and State as both latent sources of variance for the observed score X and refer to Trait as stable across time and State as varying across time. We recognize, of course, that Traits do change over longer periods of time but will use this stable/unstable distinction for relatively short temporal durations. Although some prefer to think of specific variance (σ^2_s) and error variance (σ^2_e) as hopelessly confounded, we prefer to separate them for there are some designs (e.g., test-retest vs. parallel forms) that allow us to distinguish them. Reliability as defined in equations 7 and 8 is not just a function of the test, but also of who is being tested, where they are tested and when they are tested. Because it is a variance ratio, increasing between person variance without increasing the error variance will increase the reliability. 
Similarly, decreasing between person variance will decrease reliability. Generalizability theory (Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965) is one way to estimate the individual variance components rather than their ratio. Another approach is Item Response Theory (e.g., Embretson, 1996; Lord & Novick, 1968; Lumsden, 1976; Mellenbergh, 1996; Rasch, 1966; Reise & Waller, 2009), which addresses this problem by attempting to get a measure of precision for a person’s estimate that is independent of the variance of the population and depends upon just the probability of a particular person answering particular items.
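The variance ratios in equations 7 and 8 can be made concrete with a small sketch. In practice the four components are unobservable and must be estimated from an appropriate design; here they are simply supplied as hypothetical numbers, and the function name `reliability_ratios` is an assumption made for illustration.

```python
def reliability_ratios(var_trait, var_state, var_specific, var_error):
    """Return trait (r_tt) and state (r_ss) reliability as variance ratios.

    The observed variance V_X is taken to be the sum of the four
    components, which assumes they are uncorrelated.
    """
    components = (var_trait, var_state, var_specific, var_error)
    if any(v < 0 for v in components):
        raise ValueError("variances cannot be negative")
    total = sum(components)
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {"r_tt": var_trait / total, "r_ss": var_state / total}


# Hypothetical components: trait 4, state 2, specific 1, error 3.
baseline = reliability_ratios(4.0, 2.0, 1.0, 3.0)

# Same test, same error, but a sample with more between person
# trait variance (e.g., a less range-restricted population).
wider = reliability_ratios(8.0, 2.0, 1.0, 3.0)

print(baseline)
print(wider)
```

The second call shows the point made above: the trait reliability rises when between person variance increases while the error variance stays fixed, even though nothing about the test itself has changed.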

Consistency, reliability and the data box

When Cattell (1946) introduced the data box it was a three way organization of measures taken over people, tests, and time. In typical Cattellian fashion, over the years this simple idea grew to as many as 10 dimensions (Cattell, 1966a; Cattell & Tsujioka, 1964). However, the three primary distinctions are still useful today (Nesselroade & Molenaar, 2016; Revelle & Wilt, 2016). Using these dimensions, Cattell (1964) distinguished between three ways that tests can be consistent: across occasions (reliability), across items (homogeneity), and across people (transferability or hardiness). We consider the first two of these concepts and leave the latter to a discussion of validity. These various types of reliability may be summarized graphically in terms of latent traits, paths, observed variables and correlations (Figure 2).

(^4) Unfortunately, some prefer to use State to reflect the measure at a particular time point and to decompose this “State" into Trait and Occasion components (Cole, Martin, & Steiger, 2005).

Table 1. Steps toward reliability analysis: choosing the appropriate R function to find reliability. All functions except for the cfa and cor functions are in the psych package.

Steps                                            Statistic              R function
Preliminaries
  Hypothesis development
  Data collection
  Data input                                                            read.file
  Data screening
  Descriptive statistics                         μ, σ, range            describe
Analysis of internal structure
  Exploratory Factor Analysis                    R = FφF′ + U^2         fa
  Hierarchical structure                         ωh, ωt                 omega, omegaSem
  Confirmatory Factor Analysis                                          lavaan::cfa
Estimation of various reliabilities
  Items (dichotomous, polytomous or continuous)
    One occasion
      general factor saturation                  ωh                     omega
      total common variance                      ωt                     omega
      average interitem r                        rij                    omega, alpha
      median interitem r                                                omega, alpha
      mean test retest (tau equivalent)          α, λ3                  omega, alpha
      smallest split half reliability            β                      splitHalf, iclust
      greatest split half reliability            λ4                     splitHalf, guttman
    Two occasions
      test-retest correlation                    r                      cor
      variance components                        σ^2_p, σ^2_i, σ^2_t    testRetest
    Multiple occasions
      within subject reliability                 α                      multilevel.reliability
      variance components                        σ^2_p, σ^2_i, σ^2_t    multilevel.reliability
  Ratings (Ordinal or Interval)
      Single rater reliability                   ICC 1..3, 1            ICC
      Multiple rater reliability                 ICC 1..3, k            ICC
  Ratings (Categorical)
      Two raters                                 κ                      cohen.kappa

An impressive example of a correlation of the same measure over time is the correlation of .66 of ability as measured by the Moray House Exam at age 11 with the same test given to the same participants 69 years later when they were 80 years of age (Deary, Whiteman, Starr, Whalley, & Fox, 2004). This correlation was partially attenuated due to restriction of range for the 80 year old participants. (The less able 11 year olds were less likely to appear in the 80 year old sample.) When correcting for this restriction (Sackett & Yang, 2000), the correlation was found to be .73. But the Scottish Longitudinal Study is unusually long, and it is more common to take test-retests over much shorter periods of time. In most cases it is important that we do not assume that the State effect is 0 (Chmielewski & Watson, 2009). It is more typical to find a pattern of correlations diminishing as a function of the time lag but not asymptotically approaching zero (Cole et al., 2005; Damian, Spengler, Sutu, & Roberts, 2018). This pattern is taken to represent a mixture of stable Trait variance and diminishing State effects such that the test-retest reliability across two time periods as shown in Equation 9 will become smaller the greater the time lag. Unfortunately, with only two time points we cannot distinguish between the Trait and State effects. However, with three or more time points (t1, t2, t3, ..., tn), we can decompose the resulting correlations (rx1x2, rx1x3, rx2x3, ...) into Trait and State components using Structural Equation Modeling (SEM) procedures (Hamaker et al., 2017) or simple path tracing rules (Chmielewski & Watson, 2009), and the resolution continues to improve with four or more time points (Cole et al., 2005; Kenny & Zautra, 1995). (See Figure 1 in the online supplement.)

A large test-retest correlation over a long period of time indicates temporal stability (Boyle, Stankov, & Cattell, 1995; Cattell, 1964; Chmielewski & Watson, 2009). This should be expected if we are assessing something trait like (such as cognitive ability or perhaps emotional stability or extraversion) but not if we are assessing something thought to represent an emotional state (e.g., alertness or arousal). Because we are talking about correlations, mean levels can increase or decrease over time with no change in the correlation^5. Measures of trait stability are a mixture of immediate test-retest dependability and longer term trait effects (Cattell, 1964; Chmielewski & Watson, 2009). For Boyle et al. (1995) and Cattell (1964), dependability was the immediate test-retest correlation; for Chmielewski & Watson (2009), a time lag of two weeks is considered an index of dependability. To Wood et al. (2017), dependability is assessed by repeating the same items later in one testing session. All of these indicators of dependability and stability are in contradiction to the long held belief that a problem with test-retest reliability is that it introduces memory effects of learning and practice (Kuder & Richardson, 1937). As evidence for the memory effect, Wood et al. (2017) report that the average response time to the second administration of identical items in the same session is about 80% of the time of the first administration. In the online supplement (Table 1) we compare multiple estimates of reliability for three different example data sets available in the psychTools package (Revelle, 2019b) in the R open source statistical system (R Core Team, 2019). We describe these data sets in some detail as they are useful demonstrations of trait and state variations. We compare immediate retest to short ( minutes) and then longer delay (one to seven days) on 10 mood items and 9-24 item trait scales for one to four weeks. 
To compare the effects of an immediate retest versus a short delay versus a somewhat longer delay, consider the msqR, sai and tai example data sets^6 the analyses discussed below are demon-

(^5) For example, participants in the Scottish Longitudinal Study performed better in adulthood than they did as 11 year olds but the correlations showed remarkable stability.

(^6) The data 4, (N > 4,000) were collected as part of a long term series of studies of the interrelationships between

2006; Shrout & Lane, 2012). This is implemented in the multilevel.reliability function and discussed in more detail in a tutorial for analyzing dynamic data (Revelle & Wilt, 2019) as well as in the online supplement. Although these variance components can be found using traditional repeated measures analysis of variance, it is more appropriate to use multi-level techniques, particularly in the case of missing data. Stability needs to be adjusted for dependability and thus the .36 stability over two days of the SAI (online supplement Table 1) should be adjusted for the immediate dependability of .85 to suggest a two day stability of anxious mood of .42 which is notably similar to that of the state-trait correlation of .43. When measuring mood, we need to disentangle the episodic memory components of the state measure from the semantic memory involved when answering trait like questions (Cattell, 1964; Chmielewski & Watson, 2009). State measures of affectivity probably involve episodic memory whereas trait measures of similar constructs (e.g., trait anxiety or neuroticism) likely tap semantic memory (Klein, Cosmides, Tooby, & Chance, 2002). With only two measures of state anxiety and one of trait anxiety, we can not disentangle how much of the trait measure is state (Equation 9) but if we had more measures over longer periods of time we would be able to do so.
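The adjustment of stability for dependability quoted above is consistent with a simple ratio of the two coefficients. The worked arithmetic below is offered only as a reading that reproduces the reported value; the paper's exact procedure is described in its online supplement, which is not part of this excerpt, and the subscript labels are ours.

```latex
\[
r_{\text{adjusted}}
  = \frac{r_{\text{two-day stability}}}{r_{\text{immediate dependability}}}
  = \frac{.36}{.85}
  \approx .42
\]
```

Read this way, the adjustment asks how stable anxious mood would appear if the measure itself were perfectly dependable over an immediate retest.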

Alternate Forms

If we do not want to wait for a long time and we do not want to exactly repeat the same items, we can estimate reliability by creating another test (an alternate form) that has conceptually similar but semantically different items. If measuring the same construct (e.g., arithmetic performance) we can subtly duplicate items on each form and even match for possible difficulty of order effects (a1: what is 6 + 3?, a2: what is 4 + 5? versus b1: what is 3 + 6? and b2: what is 5 + 4?). Cattell (1964) discusses “Herringbone” consistency, which is essentially a parallel forms design: Each half of the test is made up of half of the items of multiple constructs, and each is duplicated in the other half (math, English, social studies). Although creating alternate forms by hand is tedious, it has become possible for ability items to generate alternate forms using computer Automatic Item Generation techniques (Embretson, 1999; Leon & Revelle, 1985; Loe & Rust, 2017). Alternate forms given at the same time eliminate the effect of the specific item variance but do not remove any motivational state effect. Sometimes alternate forms can be developed when a longer test is split into multiple shorter tests. As an example of this, consider the sai data set discussed in the online supplement, which includes 20 items, 10 of which overlapped with the msqR data set and were used for our examples of test-retest and repeated measure reliability. The other 10 can be thought of as an alternate form of the anxiety measure and indeed correlate. with the target items from the sai and msqR. These correlations are less than when we actually repeat the same items by correlating the overlapping items of the sai and msqR (.85); are almost identical when we consider their short term dependability (.76); but less than estimates of internal consistency such as α or the average split half reliability (.83 - .87).

Split half (adjusted for test length)

If we have gone to the trouble of developing two alternate forms for a concept, and then administered both forms to a sample of participants, it is logical to ask what is the reliability of the composite formed from both of these tests. That is, if we have the correlation between two five item tests, what would be the reliability of the composite 10 item test? With a bit of algebra, we can predict it using a formula developed by Spearman (1910) and Brown (1910):

$$r_{xx} = \frac{2\, r_{x_1 x_2}}{1 + r_{x_1 x_2}} \tag{12}$$

It is important to note that the correlation between the two parts (rx1x2) is not the split half reliability, but is used to find the split half reliability (rxx) by the “Spearman-Brown prophecy formula” (Equation 12). Given that we have written n items and formed them into two splits of length n/2, what if we formed a different split? How should we split the items into two groups? Odd/even, first half/last half, randomly? This is a combinatorially difficult problem, in that there are $\frac{n!}{2\,(n/2)!\,(n/2)!}$ unique ways to split a test into two equal parts. While there are only 126 possible splits for the 10 anxiety items discussed above, this becomes 6,435 for a 16 item ability test, 1,352,078 for the 24 item EPI Extraversion scale (Eysenck & Eysenck, 1964) and over 4.5 billion for a 36 item test. The splitHalf function will try all possible splits for tests of 16 items or less, and then sample 10,000 splits for tests longer than that. The distribution of all possible splits for the 10 state anxiety items discussed earlier shows that the greatest split-half reliability is .92, the average is .87, and the lowest is .66 (Figure 3 panel A). This is in contrast to all the possible splits of 16 ability items taken from the International Cognitive Ability Resource (ICAR, Condon & Revelle, 2014) collected as part of the SAPA project (Revelle, Wilt, & Rosenthal, 2010; Revelle et al., 2016), where the greatest split half reliability was .87, the average was .83, and the lowest was .73 (Figure 3 panel B). The 24 items of the EPI show strong evidence for non-homogeneity, with a maximum split half reliability of .81, an average of .73, and a minimum of .42 (Figure 3 panel C). This supports the criticism that the EPI E scale tends to measure the two barely related constructs of sociability and impulsivity (Rocklin & Revelle, 1981). The EPI-N scale, on the other hand, shows a maximum split half of .85, a mean of .80, and a minimum of .65, providing strong evidence for a relatively homogeneous scale (Figure 3 panel D). Examining these various splits is one way to understand the homogeneity of the test. For if the various splits differ a great deal (e.g., the EPI E scale) this can be taken as a warning that the test is not unidimensional.
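The Spearman-Brown step and the count of unique splits can both be checked with a short sketch. This is an independent illustration using only the Python standard library, not the psych package's splitHalf; the function names are hypothetical. The enumerator fixes the first item in one half so that each split is generated once rather than twice.

```python
from itertools import combinations
from math import comb


def spearman_brown(r_halves):
    """Reliability of the full test from the correlation of its two halves."""
    return 2.0 * r_halves / (1.0 + r_halves)


def n_split_halves(n_items):
    """Number of unique ways to split n items into two equal halves."""
    if n_items < 2 or n_items % 2:
        raise ValueError("n_items must be a positive even number")
    # n! / (2 * (n/2)! * (n/2)!): choose one half, then halve the count
    # because swapping the two halves gives the same split.
    return comb(n_items, n_items // 2) // 2


def unique_splits(n_items):
    """Yield each unique split once as a pair of item-index tuples."""
    if n_items < 2 or n_items % 2:
        raise ValueError("n_items must be a positive even number")
    for rest in combinations(range(1, n_items), n_items // 2 - 1):
        half_a = (0,) + rest  # item 0 always sits in the first half
        chosen = set(half_a)
        half_b = tuple(i for i in range(n_items) if i not in chosen)
        yield half_a, half_b


for n in (10, 16, 24, 36):
    print(n, n_split_halves(n))

# A correlation of .50 between two halves implies a full-test
# reliability of about .67.
print(round(spearman_brown(0.5), 2))
```

The counts printed for 10, 16, 24 and 36 items match the figures quoted in the text (126; 6,435; 1,352,078; and roughly 4.5 billion), which is why exhaustive enumeration is only practical for short tests.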

Internal consistency and domain sampling

All of the above procedures are finding the correlation between two forms or occasions of a test. But what if there is just one form and one occasion? The approaches that consider just one test are collectively known as internal consistency procedures but also borrow from the concepts of domain sampling and can use the variance decomposition techniques discussed earlier. Some of these techniques, e.g., Cronbach (1951); Guttman (1945); Kuder & Richardson (1937) were developed before advances in computational speed made it trivial to find the factor structure of tests, and were based upon test and item variances. These procedures (α, λ3, KR20) were essentially short cuts for estimating reliability. The variance decomposition procedures continued this approach but expanded to be known as generalizability theory (Cronbach et al., 1963; Gleser et al., 1965; Vispoel, Morris, & Kilinc, 2018) and allow for the many reliability estimates discussed before. There are a number of different approaches for estimating reliability when there is just one test and one time. The earliest was to split the test into two random split halves and then adjust the resulting correlation between these two splits using the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910). Unfortunately, as we showed in Figure 3, not all random splits produce equal estimates. If we consider all of the items in the test to be randomly sampled from some larger domain (e.g., trait-descriptive adjectives sampled from all words in the Oxford Unabridged Dictionary or sociability items sampled from a potentially infinite number of ways of being sociable) then we can think of the test as a sample of that domain. Because the item covariances should reflect just shared domain variance, but item variance will be an unknown mixture of domain and specific and error variance, the amount of domain variance in a test would vary as the square of the number of items in the

(Teo & Fan, 2013). In addition to λ 3 , Guttman (1945) considered five alternative ways of estimating reliability by correcting for the error variance of each item. All of these equations recognize that some of the item is reliable variance, the problem is how much? λ 3 and α assume that the average item covariance is a good estimate, λ 6 uses the Squared Multiple Correlation (smc) for each item as an estimate of its reliable variance while λ 2 uses a function of the average squared item covariance. λ 4 is just the maximum split half reliability. One advantage of using the mean item covariance is that it can be identified from an analysis of variance perspective rather than actually finding all the inter-covariances. That is, just decompose the total test variance into three components: the between person variance σ P^2 , the between item

variance, σ^2 I , and the interaction of person x item, σ^2 e. Then reliability is just 1 − σ

(^2) e σ^2 P^ (Feldt, Woodruff, & Salih, 1987; Hoyt, 1941). By expressing it in this manner, Feldt et al. (1987) were able to derive an F distribution for α , and thus a means for finding confidence intervals. This is implemented as the alpha.ci function in the psych package. Alternative procedures for the confidence interval for α have been developed by Duhachek & Iacobucci (2004). Perhaps the biggest advantage to the variance approach to KR20, α , or λ 3 was that in the 1930s-1950s calculations were done with desk calculators rather than computers and it was far simpler to find the n item variances and one total test variance than it was to find the n(n-1)/2 item covariances. In the modern era, such short cuts are no longer necessary. Two problems with α.* Although easy to calculate from just the item statistics and the total score, α and λ 3 are routinely criticized as poor estimates of reliability because they do not reflect the structure of the test (Bentler, 2009; Cronbach & Shavelson, 2004; S. Green & Yang, 2009; Revelle & Zinbarg, 2009; Sijtsma, 2009). Perhaps because the ability to find α is available in easy to use software packages, it is routinely used. This is unfortunate; except for very rare condi- tions, α is both an underestimate of the reliability of a test (because of the lack of τ equivalancy, Bentler, 2009),(Bentler, 2017; Sijtsma, 2009) and an overestimate of the fraction of test variance that is associated with the general variance in the test (Revelle, 1979; Revelle & Zinbarg, 2009; Zinbarg, Revelle, Yovel, & Li, 2005). As we show in the online supplement (Table 2), α provides no information about the constancy or stability of the test. For our mood items, α (.83 - .87) exceeded the short term constancy estimates (.42 - .76) and greatly exceeded the two day stability coefficients (.36 - .39). 
For the trait measures (particularly of impulsivity), the low α (.51) did not reflect the relatively high (.70) two to four week stability of the measures. That is to say, knowing α told us nothing about test-retest constancy or stability. If not an estimate of reliability, does α measure internal consistency? No, for it is just a function of the number of items and the average correlation between the items. It is not a function of the unidimensionality of the test. It is easy to construct example tests with equal α values that reflect one test with homogeneous items, two slightly related subtests, or even two unrelated subtests each with homogeneous items (see, e.g., Revelle, 1979; Revelle & Wilt, 2013).
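The claim that α depends only on the number of items and their average intercorrelation can be checked numerically. The following Python sketch (the paper's own analyses use R and the psych package) builds three hypothetical 12-item correlation matrices. The within-cluster and between-cluster correlations are illustrative assumptions chosen so that the average correlation is .30 in every case; they are not data from the paper.

```python
def block_corr_matrix(n_per_block, within, between):
    """Correlation matrix for two equal-sized item clusters (hypothetical)."""
    n = 2 * n_per_block
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                R[i][j] = 1.0
            elif (i < n_per_block) == (j < n_per_block):
                R[i][j] = within      # both items in the same cluster
            else:
                R[i][j] = between     # items in different clusters
    return R


def alpha(R):
    """Coefficient alpha from a correlation matrix: n/(n-1) * (Vx - tr(R)) / Vx."""
    n = len(R)
    total = sum(sum(row) for row in R)          # Vx, the total test variance
    trace = sum(R[i][i] for i in range(n))      # tr(R), the item variances
    return n / (n - 1) * (total - trace) / total


structures = {
    "one homogeneous test": block_corr_matrix(6, 0.30, 0.30),
    "two slightly related subtests": block_corr_matrix(6, 0.54, 0.10),
    "two unrelated subtests": block_corr_matrix(6, 0.66, 0.00),
}
alphas = {name: alpha(R) for name, R in structures.items()}
for name, a in alphas.items():
    print(f"{name}: alpha = {a:.3f}")
```

All three structures yield the same α (about .84), even though only the first is unidimensional.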

Model based estimates

That “internal consistency” estimates do not reflect the internal structure of the test becomes apparent when we apply “model based” techniques to examine the factor structure of the test. These procedures actually examine the correlations or covariances of the items in the test. Thanks to improvements in computational power, the task of finding correlations and the factor structure of a 10 item test has been transformed over the past two generations from being a summer research project for an advanced graduate student to an afternoon homework assignment for undergraduates. Using the latent variable modeling approach of factor analysis, these procedures decompose the test variance into that which is common to all items (g, a general factor), that which is specific to some items (orthogonal group factors, f), and that which is unique to each item (typically confounding specific, s, and error variance, e). Many researchers have discussed this approach in great detail (e.g., Bentler, 2017; McDonald, 1999; Revelle & Zinbarg, 2009; Zinbarg et al., 2005) and we just summarize the main points here. Most importantly for applied researchers, as we show in the online supplement, model based techniques are just as easy to implement in modern software as are the more conventional approaches. The observed score on a test item may be modeled in terms of the sum of the products of factor scores (g, f, s, e) and loadings (c, A, D) on these factors:

x = cg + Af + Ds + e (14)

Because the reliable variance of the test is that which is not error, the reliability of a test with standardized items should be

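$$\omega_t = \frac{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1} + \mathbf{1}'\mathbf{A}\mathbf{A}'\mathbf{1}}{V_x} = 1 - \frac{\sum (1 - h_i^2)}{V_x} = 1 - \frac{\sum u_i^2}{V_x}$$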

where $h_i^2$ is the item communality and $u_i^2$ is the item uniqueness. The percentage of the total variance that is due to the general factor ($\omega_g$, McDonald, 1999) is

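$$\omega_g = \frac{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1}}{V_x} = \frac{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1}}{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1} + \mathbf{1}'\mathbf{A}\mathbf{A}'\mathbf{1} + \mathbf{1}'\mathbf{D}\mathbf{D}'\mathbf{1} + \mathbf{1}'\mathbf{e}\mathbf{e}'\mathbf{1}} = \frac{(\sum c_i)^2}{V_x}$$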

where the total test variance ($V_x$) is the sum of the elements of all the item variances and covariances and $(\sum c_i)^2$ is the squared sum of the loadings on the general factor. Normally, the specific item variance is confounded with the residual item (error) variance, but if we have a way of estimating the specific variance by examining the correlations with items not in the test (e.g., repeated items, Wood et al., 2017), then we can include it as part of the reliable variance (Bentler, 2017):

ωt = 1cc1 + 1AA1 + 1DD1 VX

1cc1 + 1AA1 + 1DD1 1cc1 + 1AA1 + 1DD1 + 1ee1
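As a rough numerical illustration of these formulas, the Python sketch below computes ωt and ωg from the bi-factor loadings as read from Table 2. It makes several assumptions that are not part of the paper: the factors are taken to be orthogonal, the items standardized, specific variance is left confounded with error (the first form of ωt, without the D term), and the model-implied total variance is used in place of the observed total variance. Because the loadings are rounded to two decimals, the results only approximate the tabled values.

```python
# Bi-factor loadings as read from Table 2 (rounded to two decimals).
g = [0.41, 0.47, 0.43, 0.54, 0.34, 0.66, 0.69, 0.11, 0.31, 0.68]    # general factor loadings, c
f1 = [0.60, 0.41, 0.60, 0.58, 0.32, 0.00, 0.00, 0.00, 0.00, 0.00]   # group factor 1 loadings
f2 = [0.00, 0.00, 0.00, 0.00, 0.00, 0.47, 0.31, 0.77, 0.74, 0.33]   # group factor 2 loadings

general = sum(g) ** 2                     # 1'cc'1: squared sum of general loadings
group = sum(f1) ** 2 + sum(f2) ** 2       # 1'AA'1: squared column sums of A

# Communality of each standardized item under orthogonal factors.
communality = [a * a + b * b + c * c for a, b, c in zip(g, f1, f2)]
unique = sum(1.0 - h2 for h2 in communality)   # sum of item uniquenesses u_i^2

v_model = general + group + unique        # model-implied total test variance

omega_t = (general + group) / v_model     # equivalently 1 - unique / v_model
omega_g = general / v_model

print(f"omega_t = {omega_t:.3f}, omega_g = {omega_g:.3f}")
```

Under these assumptions the sketch gives values of roughly .88 and .55, close to the ωt and ωg reported in Table 2.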

Unfortunately, in his development of ω, McDonald (1999) refers to two formulae (6.20a and 6.20b), one for ωt and one for ωg, and calls them both ω (Zinbarg et al., 2005). These two coefficients are very different, for one is an estimate of the total reliability of the test (ωt), while the second is an estimate of the amount of variance in the test due to a single general factor (ωg). Then, to make it even more complicated, there are two ways to find the general factor. One method uses a bi-factor solution (Holzinger & Swineford, 1937; Reise, 2012; Rodriguez, Reise, & Haviland, 2016) estimated with structural equation modeling software (e.g., lavaan, Rosseel, 2012); the other extracts a higher order factor from the correlation matrix of lower level factors and then applies a transformation developed by Schmid & Leiman (1957) to find the general loadings on the original items. The bi-factor solution (ωg) tends to produce slightly larger estimates than the Schmid-Leiman procedure (ωh) because it forces all the cross loadings of the lower level factors to be 0. Following Zinbarg et al. (2005) we designate the Schmid-Leiman solution as ωh, recognizing the hierarchical nature of the solution. Both approaches are implemented in the psych package. An important question when examining a hierarchical structure is how many group factors to specify when calculating ωh. The Schmid-Leiman procedure is defined if there are three or more group factors, and with only two group factors the default is to assume that they are both equally

Table 2. Calculating multiple measures of internal consistency reliability, demonstrated on 10 items from the Motivational State Questionnaire (msqR data set, N = 3032). The ten items may be thought of as measures of state anxiety. Five are positively scored, five negatively. General factor loadings (g) and group factor loadings were found from the omegaSem function, which applies a bi-factor solution. The hierarchical solution from omega applies the Schmid-Leiman transformation and has slightly lower general factor loadings. Split half calculations were done by finding all possible splits of the test. Although the statistics shown are done by hand, they are all done automatically in various psych functions (see Table 1).

10 anxiety items from the msqR data set

| Variable | anxis | jttry | nervs | tense | upset | at.s- | calm- | cnfd- | cntn- | rlxd- | g | F1* | F2* | h² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anxious | 1.00 | | | | | | | | | | 0.41 | 0.60 | 0.00 | 0. |
| jittery | 0.47 | 1.00 | | | | | | | | | 0.47 | 0.41 | 0.00 | 0. |
| nervous | 0.54 | 0.48 | 1.00 | | | | | | | | 0.43 | 0.60 | 0.00 | 0. |
| tense | 0.57 | 0.48 | 0.57 | 1.00 | | | | | | | 0.54 | 0.58 | 0.00 | 0. |
| upset | 0.29 | 0.16 | 0.35 | 0.45 | 1.00 | | | | | | 0.34 | 0.32 | 0.00 | 0. |
| at.ease- | 0.23 | 0.24 | 0.29 | 0.35 | 0.30 | 1.00 | | | | | 0.66 | 0.00 | 0.47 | 0. |
| calm- | 0.28 | 0.32 | 0.31 | 0.36 | 0.23 | 0.62 | 1.00 | | | | 0.69 | 0.00 | 0.31 | 0. |
| confident- | -0.01 | -0.02 | 0.08 | 0.07 | 0.17 | 0.44 | 0.31 | 1.00 | | | 0.11 | 0.00 | 0.77 | 0. |
| content- | 0.07 | 0.04 | 0.13 | 0.19 | 0.29 | 0.56 | 0.45 | 0.61 | 1.00 | | 0.31 | 0.00 | 0.74 | 0. |
| relaxed- | 0.27 | 0.34 | 0.30 | 0.40 | 0.29 | 0.60 | 0.56 | 0.34 | 0.45 | 1.00 | 0.68 | 0.00 | 0.33 | 0. |
| SMC | 0.42 | 0.37 | 0.44 | 0.52 | 0.27 | 0.55 | 0.47 | 0.40 | 0.51 | 0. | | | | |
| rii | 0.73 | 0.69 | 0.70 | 0.77 | 0.76 | 0.63 | 0.65 | 0.70 | 0.73 | 0. | | | | |

| Statistic | Reliability measure (formula) | Calculation |
|---|---|---|
| Total variance = $V_x = \sum R_{ij}$ = 39.60 | | |
| Total reliable item variance = $\sum r_{ii}$ = 6.97 | glb = $\frac{V_x - tr(R) + \sum r_{ii}}{V_x}$ | $\frac{39.60 - 10 + 6.97}{39.60}$ = .923 |
| r best split (A = 1, 2, 5, 6, 8 vs. B = 3, 4, 7, 9, 10) = .81 | $\lambda_4$ = best split half = $\frac{2 r_{ab}}{1 + r_{ab}}$ | $\frac{2 * .81}{1 + .81}$ = .895 |
| Total common variance = $\sum h_i^2$ = 5.37 | $\omega_t = \frac{V_x - tr(R) + \sum h_i^2}{V_x}$ | $\frac{39.60 - 10 + 5.37}{39.60}$ = .883 |
| Total squared multiple correlation = $\sum SMC$ = 4.43 | $\lambda_6 = \frac{V_x - tr(R) + \sum SMC}{V_x}$ | $\frac{39.60 - 10 + 4.43}{39.60}$ = .859 |
| Average squared correlation = $\bar{r^2} = \frac{\sum R_{ij}^2 - tr(R^2)}{n(n-1)}$ = .137 | $\lambda_2 = \frac{V_x - tr(R) + \sqrt{n(n-1)\,\bar{r^2} * n/(n-1)}}{V_x}$ | $\frac{39.60 - 10 + \sqrt{90 * .137 * 10/9}}{39.60}$ = .841 |
| | $\alpha = \frac{n}{n-1}\,\frac{V_x - tr(R)}{V_x}$ | $\frac{10}{9} * \frac{39.60 - 10}{39.60}$ = .831 |
| Average correlation = $\bar{r} = \frac{V_x - tr(V_x)}{n(n-1)}$ = 0.329 | $\alpha = \frac{n \bar{r}}{1 + (n-1)\bar{r}}$ | $\frac{10 * .329}{1 + 9 * .329}$ = .831 |
| r worst split (A = 1-5 vs. B = 6-10) = .385 | $\beta$ = worst split half = $\frac{2 r_{ab}}{1 + r_{ab}}$ | $\frac{2 * .385}{1 + .385}$ = .556 |
| Sum of g loadings = 4.65 (bi-factor) | $\omega_g = \frac{(\sum g_i)^2}{V_x}$ | $\frac{4.65^2}{39.60}$ = .545 |
| Sum of g loadings = 4.09 (Schmid-Leiman) | $\omega_h = \frac{(\sum g_i)^2}{V_x}$ | $\frac{4.09^2}{39.60}$ = .422 |

becomes less as a fraction of the total test variance. Thus, the limit of the glb, λ4, ωt, λ6, λ2, and α as n increases to infinity is 1. ωh does not have this problem, as it will increase towards the limit of

$$\omega_{g\infty} = \frac{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1}}{V_x} = \frac{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1}}{\mathbf{1}'\mathbf{c}\mathbf{c}'\mathbf{1} + \mathbf{1}'\mathbf{A}\mathbf{A}'\mathbf{1} + \mathbf{1}'\mathbf{D}\mathbf{D}'\mathbf{1}}.$$

When comparing reliabilities between tests of different lengths, it is useful to include the reliability of each test as if they were just one item each. In the case of α, $\alpha_1 = \bar{r}_{ij}$. Other single item reliability measures are the average item test retest ($glb_1 = \bar{r}_{ii}$), the average communality ($\omega_{t1} = \bar{h}_i^2$), the average SMC ($\lambda_{61} = \overline{SMC}_i$), or the square root of the average squared correlation ($\lambda_{21} = \sqrt{\bar{r}^2_{ij}}$).
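The hand calculations of Table 2 can be reproduced from its summary statistics alone. The Python sketch below is only an illustration (the variable and function names are ours, and in practice the psych functions listed in Table 1 do this directly from the raw data). λ2 is computed from Guttman's definition, in which the square root term reduces to n times the root mean squared off-diagonal correlation.

```python
from math import sqrt

n = 10        # number of items, so tr(R) = 10 for standardized items
v_x = 39.60   # total test variance, the sum of all elements of R


def item_corrected(sum_reliable_item_variance):
    """Replace the item variances on the diagonal with an estimate of their reliable part."""
    return (v_x - n + sum_reliable_item_variance) / v_x


def split_half(r_ab):
    """Reliability of the full test from the correlation between its two halves."""
    return 2 * r_ab / (1 + r_ab)


estimates = {
    "glb": item_corrected(6.97),                # sum of item test-retest correlations
    "lambda_4": split_half(0.81),               # best split
    "omega_t": item_corrected(5.37),            # sum of communalities
    "lambda_6": item_corrected(4.43),           # sum of squared multiple correlations
    "lambda_2": item_corrected(n * sqrt(0.137)),
    "alpha": n / (n - 1) * (v_x - n) / v_x,
    "beta": split_half(0.385),                  # worst split
    "omega_g": 4.65 ** 2 / v_x,                 # bi-factor general loadings
    "omega_h": 4.09 ** 2 / v_x,                 # Schmid-Leiman general loadings
}
for name, value in estimates.items():
    print(f"{name:9s} {value:.3f}")
```

Small discrepancies in the third decimal (for example for ωg) reflect rounding of the tabled inputs.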

Generalizability Theory

Most discussions of reliability consider reliability as the correlation of a test with a test just like it. Test-retest and alternate form reliabilities are the most obvious examples. Internal consistency measures are functionally estimating the correlation of a test with an imaginary test just like it. These estimates are based upon the patterns of correlations of the items within the test. An alternative approach makes use of Analysis of Variance procedures to decompose the total test variance into that due to individuals, to items, to time, relevant interactions, and to residual (Cronbach et al., 1963; Gleser et al., 1965; Shavelson, Webb, & Rowley, 1989; Vispoel et al., 2018). We have already discussed this in the context of test-retest reliability. This technique is most frequently applied to the question of the reliability of judges who are making ratings of targets, but the logic can be applied equally easily to item analysis.
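Applied to items rather than judges, the analysis of variance logic reproduces α. The Python sketch below uses a small, purely hypothetical persons × items score matrix, estimates reliability as one minus the ratio of the residual (person × item) mean square to the between person mean square (Hoyt, 1941), and compares the result with α computed from the item and total score variances. The function names are ours.

```python
def mean(values):
    return sum(values) / len(values)


def alpha_from_anova(scores):
    """1 - MS(person x item) / MS(persons), from a two-way decomposition."""
    n_p, n_i = len(scores), len(scores[0])
    grand = mean([x for row in scores for x in row])
    person_means = [mean(row) for row in scores]
    item_means = [mean([row[j] for row in scores]) for j in range(n_i)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_person = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_item = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_resid = ss_total - ss_person - ss_item
    ms_person = ss_person / (n_p - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_i - 1))
    return 1 - ms_resid / ms_person


def alpha_from_variances(scores):
    """Conventional alpha from the item variances and the total score variance."""
    n_i = len(scores[0])

    def var(values):
        m = mean(values)
        return sum((x - m) ** 2 for x in values) / (len(values) - 1)

    item_vars = [var([row[j] for row in scores]) for j in range(n_i)]
    total_var = var([sum(row) for row in scores])
    return n_i / (n_i - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 6 persons (rows) answering 4 items (columns).
scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 3, 4, 4],
    [3, 3, 2, 4],
]
a_anova = alpha_from_anova(scores)
a_var = alpha_from_variances(scores)
print(f"ANOVA: {a_anova:.4f}  variances: {a_var:.4f}")
```

The two routes agree to within floating point error, which is why the variance components approach can stand in for α in item analysis.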

Reliability of raters

Consider the case where we are rating numerous subjects with only a few judges. We might do a small study first to determine how much our judges agree with each other, and depending upon this result, decide upon how many judges to use going forward. As an example, examine the data from 5 judges (raters) who are rating the anxiety of 10 subjects (Table 5 in the online supplement). If raters are expensive, we might want to use the ratings of just one judge rather than all five. In this case, we will want to know how ratings of any single judge will agree with those from the other judges. In this case, differences in leniency (the judges’ means) between judges will make a difference in their judgements. In addition, different judges might use the scale differently, with some having more variance than others. We also need to think about how we will use the judges. Will we use their ratings as given, will we use their ratings as deviations from their mean, or will we pool the judges? All of these choices lead to different estimates of generalizability. Shrout & Fleiss (1979) provide a very clear exposition of three different cases and the resulting equations for reliability. (See Equations 17-19 in the supplement.) Although they express their treatment in terms of Mean Squares derived from an analysis of variance (e.g., the aov function in R), it is equally easy to do this with variance components estimated using a mixed effects linear model (e.g., lmer from the lme4 package (Bates et al., 2015) in R). Both of these procedures are implemented in the ICC function in the psych package. This is discussed in more detail in the online supplement. The intraclass correlation is appropriate when ratings are numerical, but sometimes ratings are categorical (particularly in clinical diagnosis or in evaluating themes in stories). This then leads to measures of agreement of nominal ratings. 
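For numerical ratings, the three single-rater cases can be computed directly from the mean squares of a targets × judges analysis of variance. The Python sketch below is an illustration only: it does not use the Table 5 data from the supplement but the small six-target, four-judge example usually associated with Shrout & Fleiss (1979), entered here as an assumption, and the function names are ours. In R, the ICC function in psych reports these coefficients along with their averaged-rater counterparts.

```python
def mean_squares(x):
    """Mean squares for targets (rows) by judges (columns)."""
    n, k = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(k)]
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_between = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_judges = n * sum((m - grand) ** 2 for m in col_means)    # between judges
    ss_error = ss_total - ss_between - ss_judges                # residual
    msb = ss_between / (n - 1)
    msw = (ss_total - ss_between) / (n * (k - 1))               # within targets
    msj = ss_judges / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return msb, msw, msj, mse


def icc_single_rater(x):
    """ICC(1,1), ICC(2,1) and ICC(3,1) for the reliability of one judge."""
    n, k = len(x), len(x[0])
    msb, msw, msj, mse = mean_squares(x)
    icc1 = (msb - msw) / (msb + (k - 1) * msw)                          # one-way random
    icc2 = (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n)    # two-way random
    icc3 = (msb - mse) / (msb + (k - 1) * mse)                          # judges fixed
    return icc1, icc2, icc3


ratings = [        # rows are targets, columns are judges
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc1, icc2, icc3 = icc_single_rater(ratings)
print(f"ICC1 = {icc1:.2f}, ICC2 = {icc2:.2f}, ICC3 = {icc3:.2f}")
```

The large gap between ICC2 and ICC3 in this example comes from differences in the judges' means (leniency), which count against agreement in the first two cases but are removed in the third.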
Rediscovered multiple times and given different names (Conger, 1980; Scott, 1955; Hubert, 1977; Zapf, Castell, Morawietz, & Karch, 2016) perhaps the most standard coefficient is known as Cohen’s Kappa (Cohen, 1960, 1968) which adjusts observed proportions of agreement by the expected proportion:

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{f_o - f_e}{N - f_e}$$
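A minimal Python sketch of this coefficient for two raters follows. The 2 × 2 agreement table is hypothetical, and the function name is ours; rows index the category assigned by the first rater and columns the category assigned by the second.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square table of joint classification frequencies."""
    k = len(table)
    n = sum(sum(row) for row in table)                   # N, number of rated targets
    f_o = sum(table[i][i] for i in range(k))             # observed agreements
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreements expected by chance, from the product of the marginals.
    f_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n
    return (f_o - f_e) / (n - f_e)


# Hypothetical counts: 50 targets classified into two categories by two raters.
table = [
    [20, 5],
    [10, 15],
]
kappa = cohen_kappa(table)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 35 of 50 targets (70%), but 25 agreements would be expected by chance from the marginals, so κ is only .40.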

Composite Scores

The typical use of reliability coefficients is to estimate the reliability of relatively homogeneous tests. Indeed, the distinctions made between ωh, α, and ωt are minimized if the test is completely homogeneous. But if the test is intentionally made up of unrelated or partly unrelated content, then we need to consider the reliability of a composite score. A composite is sometimes referred to as a stratified test, where the strata may be difficulty or content based (Cronbach, Schönemann, & McKie, 1965). The stratified reliability ( ρxxs ) of a composite test is found by replacing the variance of each subtest in the total test with its reliable variance and then dividing the resulting sum by the total test variance:

$$\rho_{xxs} = \frac{V_t - \sum v_i + \sum \rho_{xxi} v_i}{V_t}$$

where $\rho_{xxi}$ is the reliability of subtest i, $v_i$ is the variance of that subtest, and $V_t$ is the total test variance (Rae, 2007). Conceptually, this approach is very similar to ωt (McDonald, 1999). A procedure for weighting the elements of the composite to maximize the reliability of composite scores is discussed by Cliff & Caruso (1998), who suggest this as a procedure for Reliable Components Analysis (RCA), which they see as an alternative to EFA or PCA.
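The stratified reliability formula is simple to apply once the subtest variances and reliabilities are known. The Python sketch below uses hypothetical values (two subtests with variances 10 and 15, reliabilities .80 and .70, and a covariance of 4); the function name is ours.

```python
def stratified_reliability(total_variance, subtest_variances, subtest_reliabilities):
    """Replace each subtest variance by its reliable part and divide by total variance."""
    reliable = sum(r * v for r, v in zip(subtest_reliabilities, subtest_variances))
    return (total_variance - sum(subtest_variances) + reliable) / total_variance


# Hypothetical composite of two subtests.
variances = [10.0, 15.0]
reliabilities = [0.80, 0.70]
covariance = 4.0
total_variance = sum(variances) + 2 * covariance      # V_t = 33

rho_composite = stratified_reliability(total_variance, variances, reliabilities)
print(f"stratified reliability = {rho_composite:.3f}")
```

With these values the composite reliability is about .80; the covariance between the subtests counts entirely as reliable variance, which is why the composite can be more reliable than its least reliable part.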

Reliability of a difference score

Logically similar to the reliability of a composite is the reliability of a difference score (equation 20). Sometimes researchers want to find the difference between two scores (e.g., verbal and spatial ability or anxiety and depression). Even though the two tests themselves are highly reliable (ρxx, ρyy), if they also have a high correlation (rxy), the reliability of the difference will be substantially lower. Indeed, if the correlation between the two scales matches their reliability, the reliability of the difference will be 0. Given this reduction in reliability, individual differences in change or in pattern should be interpreted cautiously. We give an example of this problem when comparing the difference of two cognitive tests (i.e., verbal vs. spatial reasoning) in the online supplement.

$$\rho_{x-y} = \frac{\rho_{xx} + \rho_{yy} - 2 r_{xy}}{2 (1 - r_{xy})} \qquad (20)$$
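A short Python sketch shows how quickly the reliability of a difference score falls as the two measures become more correlated. The reliabilities and correlations are hypothetical, the function name is ours, and the formula assumes standardized (equal variance) scores as in the equation above.

```python
def difference_reliability(rel_x, rel_y, r_xy):
    """Reliability of the difference between two standardized scores."""
    return (rel_x + rel_y - 2 * r_xy) / (2 * (1 - r_xy))


# Two hypothetical tests, each with reliability .90, at increasing intercorrelations.
results = {r_xy: difference_reliability(0.90, 0.90, r_xy) for r_xy in (0.0, 0.5, 0.8, 0.9)}
for r_xy, rel in results.items():
    print(f"r_xy = {r_xy:.1f}  reliability of difference = {rel:.2f}")
```

For uncorrelated tests the difference is as reliable as the tests themselves (.90); at a correlation of .8 it drops to .50, and at .9 (equal to the reliabilities) it is 0.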

Beyond Classical Test Theory

Reliability is a joint property of the test and the people being measured by the test (refer back to Equation 2). For a fixed amount of error, reliability is a function of the variance of the people being assessed. Scores from a test of ability will be reliable if given to a random sample of 18-year-olds, but much less reliable if given to students at a particularly selective college because there will be less between person variance. The reliability of scores of emotional stability will be higher if given to a mixture of psychiatric patients and their spouses than it will be if given just to the patients. That is, reliability is not a property of test scores independent of the people taking the test. This is the basic concept of Item Response Theory (IRT), called by some the “new psychometrics” (Embretson, 1996, 1999; Embretson & Reise, 2000), which models the individual’s pattern of responses as a function of parameters (discrimination, difficulty) of the item. By focusing on item difficulty (endorsement frequency) it is possible to consider the range of application of our scores. Items are most informative if they are equally likely to be passed or failed (endorsed or not endorsed). But this can only be the case for a particular person taking the test and cannot be the case for a person with a higher or lower latent score. Although test scores are maximally reliable if all of the items are equally difficult, such scores will not be very discriminating at any level other than that one (Loevinger, 1954). Thus, we need to focus on spreading out the items across the range to be measured. The essential assumption of IRT is that items can differ in how hard they are, as well as in how well they measure the latent trait. Although seemingly quite different from classical approaches, there is a one-to-one mapping between the difficulty and discrimination parameters of IRT and the factor loadings and item response thresholds found by factor analysis of the polychoric correlations of the items (Kamata & Bauer, 2008; Markon, 2013; McDonald, 1999). The relationship of the IRT approach to classical reliability theory is given a very clear explication by Markon (2013), who examines how test information (and thus the reliability) varies by subject variance as well as trait level. A test can be developed to be reliable for certain discriminations (e.g., between psychiatric patients) and less reliable for discriminating between members of a control group. The particular strength of IRT approaches is their use in tailored or adaptive testing, where the focus is on the reliability for a particular person at a particular level of the latent trait. (See the discussion of IRT in the supplement, particularly Figure 3, which shows how reliability differs as a function of latent score.)

The several uses of reliability

Reliability is measured for at least three different purposes: correcting for attenuation, estimating expected scores, and providing confidence intervals around these estimates. When comparing test reliabilities, it is useful to remember that reliability has non-linear relations with the standard error as well as with the signal/noise ratio (Cronbach et al., 1965). That is, seemingly small differences in reliability between tests can reflect large differences in the ratio of reliable signal to unreliable noise or in the size of the standard error of measurement. Consider the signal to noise ratio of tests with reliability of .7, .8, .9, and .95.

$$\frac{Signal}{Noise} = \frac{\rho_{xx}}{1 - \rho_{xx}}$$

Thus an improvement in reliability from .7 (.7/.3 = 2.33) to .8 (.8/.2 = 4) is a much smaller change in signal to noise than that from .8 to .9 (.9/.1 = 9), which in turn is much less than that from .9 to .95 (.95/.05 = 19).
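These ratios, together with the corresponding standard error of measurement, can be tabulated in a few lines of Python. The standard error is expressed in observed standard deviation units using the classical relation $\sigma_e = \sigma_x \sqrt{1 - \rho_{xx}}$, which is a standard result added here for illustration; the function names are ours.

```python
from math import sqrt


def signal_to_noise(rho):
    """Ratio of reliable to unreliable variance."""
    return rho / (1 - rho)


def standard_error_of_measurement(rho, sd=1.0):
    """Classical standard error of measurement, in the units of sd."""
    return sd * sqrt(1 - rho)


rows = [(rho, signal_to_noise(rho), standard_error_of_measurement(rho))
        for rho in (0.70, 0.80, 0.90, 0.95)]
for rho, s_n, sem in rows:
    print(f"reliability = {rho:.2f}  signal/noise = {s_n:5.2f}  SEM = {sem:.2f} sd")
```

Equal steps in reliability are therefore not equal steps in measurement quality: moving from .90 to .95 roughly doubles the signal to noise ratio.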

Corrections for attenuation

Reliability theory was originally developed to adjust observed correlations between related constructs for the error of measurement in each construct (Spearman, 1904b). Such corrections for attenuation were perhaps the primary purpose behind reliability and are the reason that some recommend routinely correcting for reliability when doing meta-analyses (Schmidt & Hunter, 1999). However, such a correction is appropriate only if the measure is seen as the expected value of a single underlying construct. Examples of when the expected score of a test is not the same as the theoretical construct that accounts for the correlations between the observed variables include chicken sexing (Lord & Novick, 1968) or the diagnosis of Alzheimer's (Borsboom & Mellenbergh, 2002). Modern software for Structural Equation Modeling (e.g., Rosseel, 2012) models the pattern of observed correlations in terms of a measurement (reliability) model as well as a structural (validity) model.
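The classical correction divides the observed correlation by the square root of the product of the two reliabilities. The Python sketch below applies this standard Spearman formula to hypothetical values; the function name is ours, and the caveat above about when such a correction is appropriate still applies.

```python
from math import sqrt


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimated correlation between true scores: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical observed correlation and reliabilities.
observed_r = 0.30
corrected_r = correct_for_attenuation(observed_r, rel_x=0.70, rel_y=0.80)
print(f"observed r = {observed_r:.2f}, corrected r = {corrected_r:.2f}")
```

With reliabilities of .70 and .80, an observed correlation of .30 corresponds to an estimated latent correlation of about .40.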

Reversion to mediocrity

Given a particular observed score, what do we expect that score to be if the measure is given again? That high scores decrease and low scores increase is just a function of the reliability of the