








02a: Test-Retest and Parallel Forms Reliability. Quantitative Variables. 1. Classic Test Theory (CTT). 2. Correlation for Test-retest (or Parallel Forms): ...









02a: Test-Retest and Parallel Forms Reliability
Quantitative Variables
1. Classic Test Theory (CTT)
Reliability is the
Example 2: True Scores with Error Added

| Student | True Score | Error Time 1 | Error Time 2 | Observed Time 1 (True + Error 1) | Observed Time 2 (True + Error 2) |
|---|---|---|---|---|---|
| 1 | 95 | 3 | -3 | 98 | 92 |
| 2 | 90 | -3 | -3 | 87 | 87 |
| 3 | 85 | 3 | 3 | 88 | 88 |
| 4 | 80 | -3 | 3 | 77 | 83 |
| 5 | 75 | -3 | 3 | 72 | 78 |
| 6 | 70 | 3 | 3 | 73 | 73 |
| 7 | 65 | -3 | -3 | 62 | 62 |
| 8 | 60 | 3 | -3 | 63 | 57 |

Note: Cov(e1, e2) = 0.00, Cov(e1, T) = 0.00, Cov(e2, T) = 0.00; the errors are uncorrelated with each other and with the true scores.

How well does Pearson r work if "random" measurement error is introduced to true scores? Variances for true scores and observed scores in Example 2 are reported below.

VAR(T) = 140.00 (variance of 16 true scores, to mimic the test and retest situation)
VAR(X) = 149.60 (variance of the Time 1 and Time 2 observed scores combined)

The CTT reliability is

$$r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{149.60} \approx .936$$
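The Example 2 figures can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes: the helper name `pearson_r` and the variable names are illustrative assumptions, and only the scores themselves come from the Example 2 table.

```python
from statistics import mean, variance

# Scores from Example 2 (true scores and the two error columns)
true = [95, 90, 85, 80, 75, 70, 65, 60]
err1 = [3, -3, 3, -3, -3, 3, -3, 3]
err2 = [-3, -3, 3, 3, 3, 3, -3, -3]

# Observed score = true score + error
obs1 = [t + e for t, e in zip(true, err1)]
obs2 = [t + e for t, e in zip(true, err2)]


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists (illustrative helper)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Sample variances over all 16 scores (true scores repeated for test and retest)
var_t = variance(true + true)
var_x = variance(obs1 + obs2)

r_xx = var_t / var_x            # CTT reliability: VAR(T) / VAR(X)
r_12 = pearson_r(obs1, obs2)    # test-retest Pearson correlation

print(f"VAR(T) = {var_t:.2f}, VAR(X) = {var_x:.2f}")
print(f"CTT reliability = {r_xx:.4f}, Pearson r = {r_12:.4f}")
```

With purely random, uncorrelated error of equal size at both times, the variance ratio and the test-retest Pearson r coincide for these data.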
3. Consistency vs. Agreement

Consistency refers to the relative position of scores across two sets of scores. It is an assessment of whether two sets of scores tend to rank order something in similar positions.

Agreement refers to the degree to which two sets of scores agree or show little difference in actual scores; the lower the absolute difference, the greater the agreement between scores.

Pearson r is designed to provide a measure of consistency. Loosely described, this means Pearson r helps assess whether relative rank appears to be replicated from one set of scores to another.
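To make the distinction concrete, the sketch below compares Pearson r (consistency) with the mean absolute difference (a simple agreement index) for the two pairs of score sets from Example 3 in this section. The helper names `pearson_r` and `mean_abs_diff` are illustrative assumptions, and mean absolute difference is used here only as a convenient agreement summary, not as the method prescribed in these notes.

```python
from statistics import mean

# Example 3 scores: one Test 1 column paired with two different Test 2 columns
test1 = [95, 90, 85, 80, 75, 70, 65, 60]
test2_relative = [44, 22, 20, 19, 10, 9, 8, 1]      # same rank order, very different scores
test2_absolute = [92, 91, 83, 79, 78, 72, 64, 61]   # similar scores


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists (illustrative helper)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_abs_diff(x, y):
    """Average absolute difference between paired scores (illustrative helper)."""
    return mean([abs(a - b) for a, b in zip(x, y)])


r_rel = pearson_r(test1, test2_relative)
r_abs = pearson_r(test1, test2_absolute)
mad_rel = mean_abs_diff(test1, test2_relative)
mad_abs = mean_abs_diff(test1, test2_absolute)

print(f"Relative pair: r = {r_rel:.2f}, mean |difference| = {mad_rel:.2f}")
print(f"Absolute pair: r = {r_abs:.2f}, mean |difference| = {mad_abs:.2f}")
```

Both pairs produce a high Pearson r, but only the second pair shows small score differences, which is why r alone cannot establish agreement.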
Pearson r does not assess the magnitude of absolute differences and can therefore present a misleading assessment of reliability when test-retest or parallel forms scores show large differences. As Example 3 demonstrates, Pearson r is .91 for the Relative Reliability scores, yet the actual scores are very different (mean for Test 1 = 77.50, mean for Test 2 = 16.62).

Example 3: Relative vs. Absolute Reliability

Relative Reliability, Consistency

| Student | Test 1 | Rank 1 | Test 2 | Rank 2 |
|---|---|---|---|---|
| 1 | 95 | 1 | 44 | 1 |
| 2 | 90 | 2 | 22 | 2 |
| 3 | 85 | 3 | 20 | 3 |
| 4 | 80 | 4 | 19 | 4 |
| 5 | 75 | 5 | 10 | 5 |
| 6 | 70 | 6 | 9 | 6 |
| 7 | 65 | 7 | 8 | 7 |
| 8 | 60 | 8 | 1 | 8 |

Test 1 and 2 Pearson r = .91

Absolute Reliability, Agreement

| Student | Test 1 | Test 2 | Difference |
|---|---|---|---|
| 1 | 95 | 92 | 3 |
| 2 | 90 | 91 | -1 |
| 3 | 85 | 83 | 2 |
| 4 | 80 | 79 | 1 |
| 5 | 75 | 78 | -3 |
| 6 | 70 | 72 | -2 |
| 7 | 65 | 64 | 1 |
| 8 | 60 | 61 | -1 |

Test 1 and 2 Pearson r ≈ .99

Example 4 helps to solidify the problem with using Pearson r to assess test-retest and parallel forms reliability. In Example 4, note that the Time 2 scores contain error and also a growth component of 20 points relative to Time 1. The two sets of observed scores, Time 1 and Time 2, are no longer equivalent, so scores are no longer stable over time.

Example 4: True Scores with Error and Systematic Difference Added

| Student | True Score | Error Time 1 | Error Time 2 | Time 2 Change | Observed Time 1 (True + Error 1) | Observed Time 2 (True + Error 2 + Change) |
|---|---|---|---|---|---|---|
| 1 | 95 | 3 | -3 | 20 | 98 | 112 |
| 2 | 90 | -3 | -3 | 20 | 87 | 107 |
| 3 | 85 | 3 | 3 | 20 | 88 | 108 |
| 4 | 80 | -3 | 3 | 20 | 77 | 103 |
| 5 | 75 | -3 | 3 | 20 | 72 | 98 |
| 6 | 70 | 3 | 3 | 20 | 73 | 93 |
| 7 | 65 | -3 | -3 | 20 | 62 | 82 |
| 8 | 60 | 3 | -3 | 20 | 63 | 77 |

Variances for true scores and observed scores:

VAR(T) = 140.00 (variance of 16 true scores, to mimic the test and retest situation)
VAR(X) = 256.26 (variance of the Time 1 and Time 2 observed scores combined)

The CTT reliability is

$$r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{256.26} = .546$$

which means that 54.6% of the variance in observed scores is due to true score variance. The Pearson correlation between observed scores at Time 1 and Time 2, however, is .935.
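The contrast in Example 4 can be reproduced directly. The sketch below is illustrative only (the helper and variable names are assumptions); it rebuilds the observed scores from the Example 4 columns and shows that adding a constant 20 points at Time 2 lowers the CTT variance ratio while leaving Pearson r untouched.

```python
from statistics import mean, variance

# Example 4 columns: true scores, errors, and a constant 20-point change at Time 2
true = [95, 90, 85, 80, 75, 70, 65, 60]
err1 = [3, -3, 3, -3, -3, 3, -3, 3]
err2 = [-3, -3, 3, 3, 3, 3, -3, -3]
change = 20

obs1 = [t + e for t, e in zip(true, err1)]            # True + Error 1
obs2 = [t + e + change for t, e in zip(true, err2)]   # True + Error 2 + Change


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists (illustrative helper)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


var_t = variance(true + true)    # variance of the 16 true scores
var_x = variance(obs1 + obs2)    # variance of all 16 observed scores

r_xx = var_t / var_x             # CTT reliability drops because of the shift
r_12 = pearson_r(obs1, obs2)     # Pearson r ignores the constant shift

print(f"VAR(X) = {var_x:.2f}")
print(f"CTT reliability = {r_xx:.3f}, Pearson r = {r_12:.3f}")
```

Because a correlation is unchanged when a constant is added to one variable, Pearson r cannot detect the systematic 20-point difference between administrations.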
Figure 1: Data Entry in SPSS to Obtain ANOVA Estimates for Calculating ICC
Figure 2: SPSS Commands for ANOVA Results
Figure 3: ANOVA Commands in Univariate
Figure 4: SPSS Univariate ANOVA Output

The three variances needed are listed below, where k = 2 is the number of administrations and n = 8 is the number of respondents. Note that in the ICC formulas VAR(T) denotes variance due to time, not true score variance.

VAR(R) = (Mean Square Between − Mean Square Error) / k = (310.286 − 10.286) / 2 = 150.000
VAR(T) = (Mean Square Time − Mean Square Error) / n = (1600 − 10.286) / 8 = 198.714
VAR(E) = Mean Square Error = 10.286

To calculate ICC for absolute agreement for a single rater or single form, the type we typically see with test-retest and parallel form studies, the formula follows:

$$\mathrm{ICC}_{\text{Agreement}} = \frac{\mathrm{VAR}(R)}{\mathrm{VAR}(R) + \mathrm{VAR}(T) + \mathrm{VAR}(E)} = \frac{150}{150 + 198.714 + 10.286} = \frac{150}{359} = .417$$

Recall that the CTT reliability for these data, with VAR(T) here being the true score variance, is

$$r_{xx} = \frac{\mathrm{VAR}(T)}{\mathrm{VAR}(X)} = \frac{140.00}{256.26} = .546$$

so the ICC of .417 provides a more realistic estimate of reliability than does the Pearson correlation of .935 provided above.

The ICC can also be calculated to estimate consistency of scores, just like Pearson r. The ICC formula for consistency, rather than agreement, omits variance due to time or differences across the two sets of scores. In this example, the ICC for consistency matches the Pearson r for these data:

$$\mathrm{ICC}_{\text{Consistency}} = \frac{\mathrm{VAR}(R)}{\mathrm{VAR}(R) + \mathrm{VAR}(E)} = \frac{150}{150 + 10.286} = \frac{150}{160.286} = .935$$
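The mean squares and both ICC variants can also be derived from the raw Example 4 scores without SPSS. The following is a minimal sketch under stated assumptions: it treats the data as a respondents-by-time two-way layout with one score per cell, the variable names are illustrative, and the Time 2 scores are rebuilt as True + Error 2 + 20 from the Example 4 columns.

```python
from statistics import mean

# Rebuild Example 4 observed scores from its columns
true = [95, 90, 85, 80, 75, 70, 65, 60]
err1 = [3, -3, 3, -3, -3, 3, -3, 3]
err2 = [-3, -3, 3, 3, 3, 3, -3, -3]
time1 = [t + e for t, e in zip(true, err1)]
time2 = [t + e + 20 for t, e in zip(true, err2)]

n, k = len(time1), 2                      # n respondents, k administrations
grand = mean(time1 + time2)
row_means = [(a + b) / k for a, b in zip(time1, time2)]
col_means = [mean(time1), mean(time2)]

# Sums of squares for a two-way layout with one observation per cell
ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between respondents
ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between times
ss_total = sum((x - grand) ** 2 for x in time1 + time2)
ss_error = ss_total - ss_rows - ss_cols

ms_rows = ss_rows / (n - 1)
ms_cols = ss_cols / (k - 1)
ms_error = ss_error / ((n - 1) * (k - 1))

# Variance components as defined in the text
var_r = (ms_rows - ms_error) / k
var_t = (ms_cols - ms_error) / n          # variance due to time
var_e = ms_error

icc_agreement = var_r / (var_r + var_t + var_e)
icc_consistency = var_r / (var_r + var_e)

print(f"MS between = {ms_rows:.3f}, MS time = {ms_cols:.3f}, MS error = {ms_error:.3f}")
print(f"ICC agreement = {icc_agreement:.4f}, ICC consistency = {icc_consistency:.4f}")
```

The only difference between the two coefficients is whether the time variance enters the denominator, which is why the agreement ICC falls so far below the consistency ICC when scores shift by 20 points.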
5. ICC with SPSS

Below are screenshots showing how to obtain the ICC in SPSS for test-retest and parallel forms reliability. Note that data entry for SPSS requires two columns, one for the first set of scores and one for the second set. It is critical that scores be matched to respondents; otherwise, the obtained estimates will be incorrect. This is also true for Pearson r.

Commands are Analyze → Scale → Reliability Analysis

Figure 5: SPSS Data Entry for Test-retest
Figure 6: SPSS Command Selection

With the new screen, identify the two variables that represent the test-retest scores, in this case Time 1 and Time 2 as shown in Figure 7. Move both to the Items box.

Figure 7: Selection of Variables
Perceived Control
Q1: I can influence the way work is done in my department
Q2: I can influence decisions taken in my department
Q3: I have the authority to make decisions at work

Goal Internalization
Q4: I am inspired by what we are trying to achieve as an organization
Q5: I am inspired by the goals of the organization
Q6: I am enthusiastic about working toward the organization's objectives

Perceived Competence
Q7: I have the capabilities required to do my job well
Q8: I have the skills and abilities to do my job well
Q9: I have the competence to work effectively

None of these items required reverse coding (if this term has not been presented already, it will be presented in this course), so composite variables can be computed directly by taking the mean across the three indicators for each construct.

SPSS data file link:
http://www.bwgriffin.com/gsu/courses/edur9131/2018spr-content/06-reliability/06-EDUR9131-EmploymentThoughts-Merged.sav

Items with _1 are from the first administration, and those with _2 are from the second. Two composite scores have been created for Perceived Control and Goal Internalization using the Transform → Compute command for Time 1 and Time 2. (If Transform and Compute for composite scores has not been presented already, it will be covered in more detail soon.)

Figure 10: Transform, Compute to Calculate Composite Scores
Figure 11: Compute Windows, Use Mean(x,y,z) Function to Calculate Mean Score Composites

Find Pearson r and ICC Absolute Agreement for the three constructs:

| Construct | Pearson r | ICC Absolute |
|---|---|---|
| 1. P. Control | .87 | .85 |
| 2. Goal Internalization | .86 | .806 |
| 3. P. Competence | .77 | .512 |
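The composite-score step described above (the mean across a construct's three indicators, as SPSS's Mean(x,y,z) function does) can be sketched outside SPSS as follows. The response values are hypothetical, and the column names are assumed from the item labels shown in the Figure 13 caption; the actual data file may name them differently.

```python
from statistics import fmean

# Hypothetical Time 1 responses for the three Perceived Control items
# (values are invented for illustration; column names are an assumption)
responses = [
    {"Q1_1Work": 4, "Q2_1Decisions": 5, "Q3_1Authority": 3},
    {"Q1_1Work": 2, "Q2_1Decisions": 2, "Q3_1Authority": 3},
    {"Q1_1Work": 5, "Q2_1Decisions": 4, "Q3_1Authority": 5},
]
control_items = ["Q1_1Work", "Q2_1Decisions", "Q3_1Authority"]


def composite_mean(row, items):
    """Mean of the listed items for one respondent (no reverse coding applied)."""
    return fmean([row[item] for item in items])


# One composite Perceived Control score per respondent
control_time1 = [composite_mean(row, control_items) for row in responses]

for i, score in enumerate(control_time1, start=1):
    print(f"Respondent {i}: Perceived Control (Time 1) = {score:.2f}")
```

The same function would be applied to the _2 items to obtain the Time 2 composite, after which the two composites can be correlated or entered into an ICC analysis.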
Figure 13: ICC Results for Q1_1Work, Q2_1Decisions, Q3_1Authority

The ICC value of .788 indicates a satisfactory level of agreement among the three items. This illustrates that the ICC can be extended beyond two time periods for test-retest and beyond two forms for parallel forms.

9. Single Item Test-retest

Sometimes researchers use measures that are composed of just one item. For example, a single item often works well for assessing student evaluations of instruction, e.g.,

Overall, how would you rate this instructor?
Sample response options: Very Poor, Poor, Satisfactory, Good, Very Good

A similar item could be used to measure job satisfaction, e.g.,

Overall, how satisfied are you with your job?
Sample response options: Very Poor, Poor, Satisfactory, Good, Very Good

To assess test-retest reliability for such items, use the same procedures outlined above.

10. Published Examples of Test-retest (to be updated)

Below are sample publications in which test-retest or parallel forms reliability is presented or discussed. These examples help show how to present reliability results.

Published Examples of Test-Retest Assessment
Qualitative Variables