02a: Test-Retest and Parallel Forms Reliability, Slides of Mechanics

02a: Test-Retest and Parallel Forms Reliability. Quantitative Variables. 1. Classic Test Theory (CTT). 2. Correlation for Test-retest (or Parallel Forms): ...

Typology: Slides

2022/2023

Uploaded on 03/01/2023

myboy
02a: Test-Retest and Parallel Forms Reliability
Quantitative Variables
1. Classic Test Theory (CTT)
2. Correlation for Test-retest (or Parallel Forms): Stability and Equivalence for Quantitative Measures
3. Consistency vs Agreement
4. Intraclass Correlation Coefficient, ICC
5. ICC with SPSS
6. ICC Real Data
7. Comparison of Results with Menon
8. ICC with More than Two Assessment Periods or Forms
9. Single Item Test-retest
10. Published Examples of Test-retest (to be updated)
Qualitative Variables
11. Percent Agreement (see presentation “07a Coder Agreement for Nominal Data”)
12. Nominal Variables: ICC, Kappa, Scott’s Pi, Krippendorff’s alpha (see “07a Coder Agree. for Nominal Data”)
13. Ordinal Variables: Weighted Kappa (see “07b Coder Agreement for Ranked Data”)
Quantitative Variables
1. Classic Test Theory (CTT)
CTT holds that when we attempt to measure something, such as test anxiety, the score we observe (the observed
score, X) is made of two parts, a true score (T) and error (E):
X = T + E
We would like to know how much error, E, is included when we use observed scores, X, because the more error, the
worse our measurement and the less confidence we have that X measures what we hope it measures.
Since there will almost always be variability in scores, we can say that the variance for scores will be greater than 0.00. If
we use the symbol X for test anxiety scores, we can indicate the variance like this:
VAR(X)
We can also expect variance in both true scores, T, and error in measurement, E, so we can symbolize these variances
too:
VAR(T) and VAR(E)
Reliability is defined as the ratio of true score variance to observed score variance:
$$ \text{Reliability, } r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} $$
Since X = T + E, we can show that reliability is the ratio of true score variance to true score variance plus error variance:
$$ \text{Reliability, } r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(T) + \mathrm{VAR}(E)} $$
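As a minimal sketch of the reliability ratio defined above, the function below computes VAR(T) / (VAR(T) + VAR(E)). The function name and the variance values are hypothetical and chosen only for illustration; the sketch assumes true scores and errors are uncorrelated, so that observed variance is the simple sum of the two parts.

```python
def reliability(var_true: float, var_error: float) -> float:
    """CTT reliability: true score variance over observed score variance."""
    # Assumes Cov(T, E) = 0, so VAR(X) = VAR(T) + VAR(E).
    var_observed = var_true + var_error
    if var_observed <= 0:
        raise ValueError("observed score variance must be positive")
    return var_true / var_observed


# Hypothetical variances, not taken from the slides.
print(reliability(80.0, 20.0))  # 80% of observed variance is true score variance
print(reliability(80.0, 0.0))   # no measurement error, so reliability is 1.0
```

With zero error variance the ratio is exactly 1.00, and it falls toward 0.00 as error variance grows relative to true score variance.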

Reliability

  • is the proportion of true score variance to observed score variance;
  • should not be less than 0.00;
  • should not be greater than 1.00;
  • is written r or rxx as a sample symbol,
  • is written ρ or ρxx as a population symbol, and
  • unfortunately, both r and ρ are also symbols for the Pearson correlation, so it is easy to confuse the two.

If there were no error in measurement, then VAR(E) would be zero, VAR(E) = 0.00, and reliability would be equal to 1.00:

$$ r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(T) + \mathrm{VAR}(E)} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(T) + 0} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(T)} = 1.00 $$

A reliability of 1.00 means no measurement error and therefore we have true scores.

Assumptions of CTT:
  • Expected value of E = 0.00 (i.e., mean of errors will be 0.00)
  • Covariance T and E = 0.00; Cov(T,E) = 0.00 (i.e., correlation of T with E = 0.00)
  • Covariance Ej and Ek = 0.00, Cov(Ej,Ek) = 0.00 (i.e., correlation of Ej with Ek = 0.00)

In words, CTT indicates that measurement error, E, is random and therefore correlates with nothing; if E does show a correlation with something, it will likely be a weak correlation that is random (i.e., it varies across samples because of sampling variation).

Technical note: VAR(X) = VAR(T) + VAR(E) + 2Cov(T,E). Since E does not correlate with anything, Cov(T,E) = 0.00 and the last term drops out, leaving VAR(X) = VAR(T) + VAR(E).

2. Correlation for Test-retest or Parallel Forms: Stability and Equivalence for Quantitative Measures

As a reminder, recall that test-retest reliability refers to situations in which an instrument is administered to participants, time elapses, then the instrument is re-administered to the same participants. Scores from both time periods are assessed to determine stability of scores.

For parallel forms reliability, one administers to participants two forms of an instrument, both designed to measure the same thing and provide the same scores for a given individual, and then assesses equivalence of scores.

Both test-retest and parallel forms reliability follow the same mechanics and use the same reliability estimates, so the logic and estimation methods presented below apply equally to both.

According to CTT, the Pearson product moment correlation, r, is a measure of reliability between two parallel measures, or test-retest measures, that provide quantitative scores:

$$ \text{Pearson } r = \text{Reliability, } r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} $$

Example 2: True Scores with Error Added

| Student | True Score | Error Time 1 | Error Time 2 | Observed Time 1 (True + Error 1) | Observed Time 2 (True + Error 2) |
|---|---|---|---|---|---|
| 1 | 95 | 3 | -3 | 98 | 92 |
| 2 | 90 | -3 | -3 | 87 | 87 |
| 3 | 85 | 3 | 3 | 88 | 88 |
| 4 | 80 | -3 | 3 | 77 | 83 |
| 5 | 75 | -3 | 3 | 72 | 78 |
| 6 | 70 | 3 | 3 | 73 | 73 |
| 7 | 65 | -3 | -3 | 62 | 62 |
| 8 | 60 | 3 | -3 | 63 | 57 |

Note: Cov(e1,e2) = 0.00, Cov(e1,T) = 0.00, Cov(e2,T) = 0.00; errors are uncorrelated with each other and with true scores.

How well does Pearson r work if “random” measurement error is introduced to true scores? Variances for true scores and observed scores in Example 2 are reported below.

VAR(T) = 140.00 (Variance of 16 true scores to mimic test and retest situation)
VAR(X) = 149.60 (Variance of both Time 1 and Time 2 observed scores combined)

The CTT reliability is

$$ r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{149.60} = 0.935 $$

which means that 93.5% of variance in observed scores is due to true score variance, and 100(1 - .935) = 6.5% is error variance.

Pearson correlation for these data is r = 0.935. (Note: Demonstrate for class that Pearson r obtained in SPSS or Excel is .935.)

Results show that Pearson r works well to measure reliability when only random measurement error is included and the means for both sets of scores are the same or similar. In Example 2 above, the means for Observed Time 1 = 77.50 and for Observed Time 2 = 77.50. However, Pearson r can fail when non-random error is included that changes means between the two sets of scores.
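The Example 2 figures can be checked with a short script. This is a sketch using only the Python standard library rather than SPSS or Excel; the variable and function names are my own, and the true scores and errors are those listed in Example 2. It also checks the CTT assumption that the error terms are uncorrelated with each other and with the true scores.

```python
from statistics import mean, variance

# Example 2 data: true scores and the two sets of "random" errors.
true_scores = [95, 90, 85, 80, 75, 70, 65, 60]
error_1 = [3, -3, 3, -3, -3, 3, -3, 3]
error_2 = [-3, -3, 3, 3, 3, 3, -3, -3]

observed_1 = [t + e for t, e in zip(true_scores, error_1)]
observed_2 = [t + e for t, e in zip(true_scores, error_2)]


def covariance(x, y):
    """Sample covariance (n - 1 in the denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Pearson product moment correlation."""
    return covariance(x, y) / (variance(x) * variance(y)) ** 0.5


# Variances over all 16 scores (both administrations combined).
var_t = variance(true_scores * 2)
var_e = variance(error_1 + error_2)
var_x = variance(observed_1 + observed_2)

print("Cov(e1,e2), Cov(e1,T), Cov(e2,T):",
      covariance(error_1, error_2),
      covariance(error_1, true_scores),
      covariance(error_2, true_scores))
print(f"VAR(T) = {var_t:.2f}, VAR(E) = {var_e:.2f}, VAR(X) = {var_x:.2f}")
print(f"rxx = VAR(T)/VAR(X) = {var_t / var_x:.4f}")
print(f"Pearson r = {pearson_r(observed_1, observed_2):.4f}")
```

Because all three covariances are zero, VAR(X) equals VAR(T) + VAR(E), and the variance ratio and Pearson r coincide at about 0.9358, which the slides report as .935.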

3. Consistency vs. Agreement

Consistency refers to the relative position of scores across two sets of scores. Consistency is an assessment of whether two sets of scores tend to rank order something in similar positions.

Agreement refers to the degree to which two sets of scores agree or show little difference in actual scores; the lower the absolute difference, the greater the agreement between scores.

Pearson r is designed to provide a measure of consistency. Loosely described, this means Pearson r helps assess whether relative rank appears to be replicated from one set of scores to another.

Pearson r does not assess the magnitude of absolute differences and can therefore present a misleading assessment of reliability when test-retest scores or parallel scores show large differences. As Example 3 demonstrates, Pearson r shows a value of .91 for the Relative Reliability scores, but note that the actual scores are very different (mean for Test 1 = 77.50, mean for Test 2 = 16.62).

Example 3: Relative vs. Absolute Reliability

Relative Reliability, Consistency

| Student | Test 1 | Rank 1 | Test 2 | Rank 2 |
|---|---|---|---|---|
| 1 | 95 | 1 | 44 | 1 |
| 2 | 90 | 2 | 22 | 2 |
| 3 | 85 | 3 | 20 | 3 |
| 4 | 80 | 4 | 19 | 4 |
| 5 | 75 | 5 | 10 | 5 |
| 6 | 70 | 6 | 9 | 6 |
| 7 | 65 | 7 | 8 | 7 |
| 8 | 60 | 8 | 1 | 8 |

Test 1 and 2 Pearson r = .91

Absolute Reliability, Agreement

| Student | Test 1 | Test 2 | Difference |
|---|---|---|---|
| 1 | 95 | 92 | 3 |
| 2 | 90 | 91 | -1 |
| 3 | 85 | 83 | 2 |
| 4 | 80 | 79 | 1 |
| 5 | 75 | 78 | -3 |
| 6 | 70 | 72 | -2 |
| 7 | 65 | 64 | 1 |
| 8 | 60 | 61 | -1 |

Test 1 and 2 Pearson r = .99 (computed from the scores listed)

Example 4 helps to solidify the problem with using Pearson r to assess test-retest and parallel forms reliability. In Example 4, note that Time 2 scores contain error but also a growth component of 20 points relative to Time 1. The two sets of observed scores, Time 1 and Time 2, are no longer equivalent, so scores are no longer stable over time.

Example 4: True Scores with Error and Systematic Difference Added

| Student | True Score | Error Time 1 | Error Time 2 | Time 2 Change | Observed Time 1 (True + Error 1) | Observed Time 2 (True + Error 2 + Change) |
|---|---|---|---|---|---|---|
| 1 | 95 | 3 | -3 | 20 | 98 | 112 |
| 2 | 90 | -3 | -3 | 20 | 87 | 107 |
| 3 | 85 | 3 | 3 | 20 | 88 | 108 |
| 4 | 80 | -3 | 3 | 20 | 77 | 103 |
| 5 | 75 | -3 | 3 | 20 | 72 | 98 |
| 6 | 70 | 3 | 3 | 20 | 73 | 93 |
| 7 | 65 | -3 | -3 | 20 | 62 | 82 |
| 8 | 60 | 3 | -3 | 20 | 63 | 77 |

Variances for true scores and observed scores:

VAR(T) = 140.00 (Variance of 16 true scores to mimic test and retest situation)
VAR(X) = 256.26 (Variance of both Time 1 and Time 2 observed scores combined)

The CTT reliability is

$$ r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{256.26} = 0.546 $$

which means that 54.6% of variance in observed scores is due to true score variance. The Pearson correlation between observed scores at Time 1 and Time 2, however, is still .935, because adding a constant 20 points to every Time 2 score changes the mean but not the relative position of scores.
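The contrast between consistency and agreement in Example 3, and the effect of the constant 20-point change in Example 4, can be checked numerically. The sketch below uses only the Python standard library; the helper names (`pearson_r`, `mean_abs_diff`) are my own, and the mean absolute difference is used here simply as an informal indicator of agreement, not as a statistic from the slides.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product moment correlation for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_abs_diff(x, y):
    """Average absolute difference between paired scores."""
    return mean([abs(a - b) for a, b in zip(x, y)])


# Example 3 scores.
test_1 = [95, 90, 85, 80, 75, 70, 65, 60]
test_2_consistency = [44, 22, 20, 19, 10, 9, 8, 1]
test_2_agreement = [92, 91, 83, 79, 78, 72, 64, 61]

# Example 4 observed scores (Time 2 includes the 20-point change).
time_1 = [98, 87, 88, 77, 72, 73, 62, 63]
time_2 = [112, 107, 108, 103, 98, 93, 82, 77]

pairs = {
    "Example 3, consistency": (test_1, test_2_consistency),
    "Example 3, agreement": (test_1, test_2_agreement),
    "Example 4, Time 1 vs Time 2": (time_1, time_2),
}

for label, (x, y) in pairs.items():
    print(f"{label}: r = {pearson_r(x, y):.4f}, "
          f"mean |difference| = {mean_abs_diff(x, y):.2f}")
```

Pearson r stays high in all three comparisons (roughly .91, .99, and .94) even though the average gap between paired scores ranges from under 2 points to more than 60, which is the sense in which r reflects consistency rather than agreement.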

Figure 1: Data Entry in SPSS to Obtain ANOVA Estimates to Calculate ICC
Figure 2: SPSS Commands for ANOVA Results
Figure 3: ANOVA Commands in Univariate
Figure 4: SPSS Univariate ANOVA Output

The three variances needed are listed below, where k = 2 is the number of time periods and n = 8 is the number of students. Note that in these formulas VAR(T) denotes variance due to time, not the true score variance of Section 1.

VAR(R) = (Mean Square Between - Mean Square Error) / k = (310.286 - 10.286) / 2 = 150.000
VAR(T) = (Mean Square Time - Mean Square Error) / n = (1600 - 10.286) / 8 = 198.714
VAR(E) = Mean Square Error = 10.286

To calculate ICC for absolute agreement for a single rater or single form, the type we typically see with test-retest and parallel form studies, the formula follows:

$$ ICC_{Agreement} = \frac{\mathrm{VAR}(R)}{\mathrm{VAR}(R) + \mathrm{VAR}(T) + \mathrm{VAR}(E)} = \frac{150}{150 + 198.714 + 10.286} = \frac{150}{359} = .417 $$

Recall that the CTT reliability for these data (with VAR(T) as true score variance) is

$$ r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{256.26} = 0.546 $$

so the ICC of .417 provides a more realistic estimate of reliability than does the Pearson correlation of .935 provided above.

The ICC can also be calculated to estimate consistency of scores, just like Pearson r. The ICC formula for consistency, rather than agreement, omits variance due to time or differences across the two sets of scores. In this example, the ICC for consistency matches the Pearson r for these data:

$$ ICC_{Consistency} = \frac{\mathrm{VAR}(R)}{\mathrm{VAR}(R) + \mathrm{VAR}(E)} = \frac{150}{150 + 10.286} = \frac{150}{160.286} = .935 $$

As stated, the ICC consistency formula above omits the variance for time; this shows that variance resulting from differences in scores across time is ignored, hence the agreement between the ICC for consistency and Pearson r.
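The mean squares and both ICC values can be reproduced from the Example 4 observed scores without SPSS. The following is a sketch under the formulas given in this section (a two-way layout with students as rows and time periods as columns); the function name `icc_from_scores` and the dictionary keys are hypothetical, and this is not the SPSS procedure itself.

```python
from statistics import mean

# Example 4 observed scores, one list per time period.
time_1 = [98, 87, 88, 77, 72, 73, 62, 63]
time_2 = [112, 107, 108, 103, 98, 93, 82, 77]


def icc_from_scores(columns):
    """Single-measure ICCs from a students x time-periods layout."""
    k = len(columns)          # number of time periods (or forms)
    n = len(columns[0])       # number of students
    rows = list(zip(*columns))
    all_scores = [v for col in columns for v in col]
    grand = mean(all_scores)

    # Sums of squares for students (rows), time (columns), and residual.
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_time = n * sum((mean(c) - grand) ** 2 for c in columns)
    ss_total = sum((v - grand) ** 2 for v in all_scores)
    ss_error = ss_total - ss_rows - ss_time

    ms_rows = ss_rows / (n - 1)
    ms_time = ss_time / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Variance components as defined in the text.
    var_r = (ms_rows - ms_error) / k
    var_t = (ms_time - ms_error) / n   # variance due to time
    var_e = ms_error

    return {
        "ms_between": ms_rows,
        "ms_time": ms_time,
        "ms_error": ms_error,
        "icc_agreement": var_r / (var_r + var_t + var_e),
        "icc_consistency": var_r / (var_r + var_e),
    }


result = icc_from_scores([time_1, time_2])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The mean squares should match the 310.286, 1600, and 10.286 shown above. Since the function accepts any number of columns, the same sketch could in principle be applied to more than two assessment periods, assuming the same variance-component formulas are appropriate.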

5. ICC with SPSS

Below are screenshots showing how to obtain the ICC in SPSS for test-retest and parallel forms reliability. Note that data entry for SPSS requires two columns, one for the first set of scores and one for the second set. It is critical that scores be matched to respondents; otherwise, the estimates obtained will be incorrect. This is also true for Pearson r.

Commands are Analyze → Scale → Reliability Analysis

Figure 5: SPSS Data Entry for Test-retest
Figure 6: SPSS Command Selection

With the new screen, identify the two variables that represent the test-retest scores, in this case Time 1 and Time 2 as shown in Figure 7. Move both to the Items box.

Figure 7: Selection of Variables

Perceived Control
Q1: I can influence the way work is done in my department
Q2: I can influence decisions taken in my department
Q3: I have the authority to make decisions at work

Goal Internalization
Q4: I am inspired by what we are trying to achieve as an organization
Q5: I am inspired by the goals of the organization
Q6: I am enthusiastic about working toward the organization’s objectives

Perceived Competence
Q7: I have the capabilities required to do my job well
Q8: I have the skills and abilities to do my job well
Q9: I have the competence to work effectively

None of these items requires reverse coding (if this term has not been presented already, it will be presented in this course), so composite variables can be computed directly by taking the mean across the three indicators for each construct.

SPSS data file link: http://www.bwgriffin.com/gsu/courses/edur9131/2018spr-content/06-reliability/06-EDUR9131-EmploymentThoughts-Merged.sav

Items with _1 are from the first administration, and those with _2 are from the second. Two composite scores have been created for Perceived Control and Goal Internalization using the Transform → Compute command for Time 1 and Time 2. (If Transform → Compute for composite scores has not been presented already, it will be covered in more detail soon.)

Figure 10: Transform, Compute to Calculate Composite Scores
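For readers working outside SPSS, the composite score described above (the mean across the three indicators of a construct) can be sketched as follows. The responses and the column names `Q1_1`, `Q2_1`, and so on are hypothetical and follow only the _1/_2 suffix convention described in the text; they are not the variable names in the linked data file. The sketch averages the items that were answered and skips missing responses, which is an assumption about how one would want missing data handled.

```python
from statistics import mean

# Hypothetical responses for two respondents (not from the course data file).
respondents = [
    {"Q1_1": 4, "Q2_1": 5, "Q3_1": 3, "Q1_2": 4, "Q2_2": 4, "Q3_2": 4},
    {"Q1_1": 2, "Q2_1": None, "Q3_1": 3, "Q1_2": 3, "Q2_2": 3, "Q3_2": 2},
]

# Hypothetical item lists for the Perceived Control construct.
control_time_1 = ["Q1_1", "Q2_1", "Q3_1"]
control_time_2 = ["Q1_2", "Q2_2", "Q3_2"]


def composite(row, items):
    """Mean of the answered items; None if no item was answered."""
    answered = [row[item] for item in items if row.get(item) is not None]
    return mean(answered) if answered else None


for i, row in enumerate(respondents, start=1):
    t1 = composite(row, control_time_1)
    t2 = composite(row, control_time_2)
    print(f"Respondent {i}: Control Time 1 = {t1}, Control Time 2 = {t2}")
```

The resulting Time 1 and Time 2 composites are the paired scores that would then be passed to Pearson r or the ICC.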

Figure 11: Compute Window, Use Mean(x,y,z) Function to Calculate Mean Score Composites

Find Pearson r and ICC Absolute Agreement for

  • Perceived Control times 1 and 2
  • Goal Internalization times 1 and 2

Also

  • Compute composite scores for time 1 and 2 for Perceived Competence, then
  • Find Pearson r and ICC Absolute Agreement for Perceived Competence times 1 and 2

7. Comparison of Results with Menon

Note that estimates of reliability are sample specific, so one should always check reliability for each sample. It can be useful to know how well an instrument behaves across samples. How do our results for this sample of graduate students at Georgia Southern compare with results reported by Menon (2001), p. 164?

Menon reports test-retest reliability estimates but does not indicate which form of test-retest estimate was used. Pearson r is the traditional assessment for test-retest reliability, so Menon likely used Pearson r.

For test-retest the following estimates were obtained with the GSU sample data linked above.

| Construct | Menon (Pearson r?) | EDUR 9131 Pearson r | EDUR 9131 ICC Absolute |
|---|---|---|---|
| P. Control | .87 | .852 | |
| Goal Internalization | .86 | .806 | |
| P. Competence | .77 | .512 | |

Figure 13: ICC Results for Q1_1Work, Q2_1Decisions, Q3_1Authority

The ICC of .788 indicates a satisfactory level of agreement among the three items. This illustrates that the ICC can be extended beyond two time periods for test-retest and beyond two forms for parallel forms.

9. Single Item Test-retest

Sometimes researchers use measures that are composed of just one item. For example, a single item often works well for assessing student evaluations of instruction, e.g.,

Overall, how would you rate this instructor?
Sample response options: Very Poor, Poor, Satisfactory, Good, Very Good

A similar item could be used to measure job satisfaction, e.g.,

Overall, how satisfied are you with your job?
Sample response options: Very Poor, Poor, Satisfactory, Good, Very Good

To assess test-retest reliability for such items, use the same procedures outlined above.

10. Published Examples of Test-retest (to be updated)

Below are sample publications in which test-retest or parallel forms reliability is presented or discussed. These examples help show how to present reliability results.

Published Examples of Test-Retest Assessment

  • Kush & Watkins 1996, see Table 2 for test-retest correlations, repeated measures ANOVA to test mean differences p. 317
  • Sapountzi-Krepia 2005, see Table 3 for use of Kappa to assess stability of categorical responses; Table 4 for intraclass correlation and rho (Pearson r) for stability and mean differences (also, a nice discussion of the limitations of correlation and of alternatives)

Published Examples of Test-Retest Assessment for Single Item

  • K Milton, F C Bull, A Bauman 2010 Reliability and validity testing of a single-item physical activity measure, Br J Sports Med; see Tables 3 and 4, Table 1 presents similar studies.
  • Moss 2008 Single Item Measures, psychlopedia entry; no tables of examples, instead it presents citations for studies with examples

Qualitative Variables

  1. Percent Agreement: see Inter-rater Agreement for Nominal Data (presented later in course)
  2. Dichotomous Variables: ICC, Kappa, Scott’s Pi, Krippendorff’s alpha: see Inter-rater Agreement for Nominal Data
  3. Ordinal Variables: Weighted Kappa (to be added)

Scoring/rating agreement for judges or evaluators in test-retest situations can be examined using the same procedures used for establishing inter-rater agreement. Details of these procedures are discussed later in the course under agreement for multiple coders or raters when evaluating written responses to open-ended or essay/short answer type items.

For dichotomous (binary) variables, such as pass vs fail, or employed vs not employed, one could use ICC as illustrated above.

(More to be added)

References

Bobko, P., Roth, P., & Buster, M. (2007). The usefulness of unit weights in creating composite scores. Organizational Research Methods, 10, 689-709.

Hendrickson, A., Patterson, B., & Melican, G. (2008). The effect of using different weights for multiple-choice and free-response item sections. Presentation at the National Council on Measurement in Education, New York.

Menon, S. (2001). Employee empowerment: An integrative psychological approach. Applied Psychology: An International Review, 50, 153-180.

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.