Reliability Analysis: ICC & Bland-Altman Plot for Therapist Agreement

Instructions on how to calculate the intraclass correlation coefficient (ICC) and create a Bland-Altman plot to assess the reliability of measurements between two therapists. The analysis is based on data collected from 10 participants and uses SPSS and MS Excel. The document also discusses the importance of reliability measures in establishing test-retest reliability and provides references for further reading.

Uploaded on 07/05/2022

Statistical Analysis 9: Some reliability measures
Research question type: Reliability of repeated measurements
What kind of variables? Continuous (scale/interval/ratio)
Common Applications: A repeatability study is required to help establish and quantify
reproducibility, and thus provide an indication of the 'test-retest' reliability of a measurement.
The measurements could be from two people (or two types of equipment), or from the same person on
two or more occasions.
Table 1 shows data used for illustration in the following examples. These examples are based on
those provided by Rankin & Stokes (1998), of which a pdf and data files can be found in
W:\EC\STUDENT\ MATHS SUPPORT CENTRE STATS WORKSHEETS\.
Two techniques that explore the variability of the data to gauge reliability are demonstrated:
the intraclass correlation coefficient (ICC) and the Bland & Altman plot. Both SPSS and MS Excel are used
in this worksheet.
There are various forms of ICC and they are discussed in the paper, along with their associated
labels and formulae for calculation, although the worksheet uses SPSS for their calculations. The
Bland & Altman plot is illustrated in MS Excel.
An ICC is measured on a scale of 0 to 1; 1 represents perfect reliability with no measurement
error, whereas 0 indicates no reliability.
Table 1: Collected data from 2 therapists (GR & MS)

Participant   Therapist 1 (GR)   Therapist 2 (MS)   Therapist 1 (GR)
              1st reading                           2nd reading
1             17.13              18.78              16.78
2             16.08              17.42              16.31
3             10.91              10.73              10.60
4             14.96              15.65              14.70
5             13.00              11.52              12.63
6             18.27              17.51              18.57
7             14.99              15.81              15.81
8             15.64              16.88              15.22
9             10.93              12.19              13.46
10            16.48              18.16              17.51

Example 1 (Interrater reliability):
A comparison of the reliability of measurements from two therapists was performed.
Data from real time ultrasound imaging of a muscle in 10 participants, one reading per therapist,
are recorded in columns 2 and 3 in Table 1.
[NB At this stage we are not using the second set of readings]
Research question: Do the two therapists produce 'reliable' readings?
Loughborough University Mathematics Learning Support Centre

Coventry University Mathematics Support Centre

Steps in SPSS (PASW) to obtain an ICC:

With data entered as shown in columns 1-3 in Figure 1 (see Rankin.sav)

  • choose Analyze > Scale > Reliability Analysis…
  • move the variables for comparison into the Items: list (in this case Therapist1 and Therapist2 )
  • select the Statistics… button
  • select Intraclass Correlation Coefficient
  • select Item in the Descriptives for list
  • select Consistency in the Type: list
  • Continue and OK

Figure 1: Steps in SPSS to obtain ICC

Results: Tables 2 & 3 show some of the output from the reliability analysis. Table 2 gives the mean and standard deviation (SD) of the data from each therapist. Overall, it appears that therapist 2 measures slightly higher and more variably than therapist 1.

Table 3 shows information relating to the ICC calculations. Use the 'Single Measures' option, as individual values are collected.

Our estimated reliability between therapists is 0.92, with 95% CI (0.72, 0.98), which is quite 'wide'.

Conclusion: We have evidence to support the reliability of this measurement between the two therapists.

See the Rankin & Stokes paper for more detail on the calculation of this ICC. There are several ICCs; this one is coded (3,1).

Table 2: Item Statistics

             Mean    Std. Deviation   N
Therapist1   14.84   2.50             10
Therapist2   15.47   2.93             10

Table 3: Intraclass Correlation Coefficient

                   Intraclass    95% Confidence Interval     F Test with True Value 0
                   Correlation   Lower Bound   Upper Bound   Value   df1   df2   Sig
Single Measures    .92           .72           .98           24.37   9     9
Average Measures   .96           .84           .99           24.37   9     9
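As a cross-check on the SPSS output, the consistency ICC can be recomputed from the mean squares of a two-way ANOVA on columns 2 and 3 of Table 1. The following is a minimal sketch in Python (not part of the original worksheet); the function name `icc_consistency` is illustrative, and the confidence interval is not reproduced because it needs F-distribution quantiles that the standard library does not provide.

```python
from statistics import mean

# Columns 2 and 3 of Table 1 (one reading per therapist).
therapist1 = [17.13, 16.08, 10.91, 14.96, 13.00, 18.27, 14.99, 15.64, 10.93, 16.48]
therapist2 = [18.78, 17.42, 10.73, 15.65, 11.52, 17.51, 15.81, 16.88, 12.19, 18.16]


def icc_consistency(*raters):
    """Return (single, average) consistency ICCs from a two-way ANOVA.

    Illustrative helper: 'single' corresponds to ICC(3,1) and 'average'
    to the ICC for the mean of the k raters.
    """
    k = len(raters)                      # number of raters (columns)
    n = len(raters[0])                   # number of participants (rows)
    rows = list(zip(*raters))
    grand = mean(v for row in rows for v in row)

    # Partition the total sum of squares into subjects, raters and residual.
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_subjects = k * sum((mean(row) - grand) ** 2 for row in rows)
    ss_raters = n * sum((mean(col) - grand) ** 2 for col in raters)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    average = (ms_subjects - ms_error) / ms_subjects
    return single, average


single, average = icc_consistency(therapist1, therapist2)
print(f"Single measures ICC:  {single:.2f}")
print(f"Average measures ICC: {average:.2f}")
```

With the Table 1 data the two values round to .92 and .96, matching the 'Single Measures' and 'Average Measures' rows of Table 3.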


Results: For the two readings taken by therapist 1 (columns 2 and 4 of Table 1), the ICC = 0.93, with 95% CI (0.75, 0.98). Hence, there is evidence for the repeatability of measurements between scans for therapist 1. A copy of the Bland and Altman plot for these data is given in rankin.xlsx; it shows good agreement for most cases (seven differences are close to zero), but with one outlier (i.e. one value outside the limits of agreement, LOA).
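The quantities behind a Bland and Altman plot can be sketched in a few lines of Python using the two readings from therapist 1 in Table 1. This is an illustration only, not the contents of rankin.xlsx: it assumes the conventional limits of agreement of mean difference ± 1.96 SD, and the variable names are hypothetical. In the plot itself, each pair's average goes on the horizontal axis and its difference on the vertical axis.

```python
from statistics import mean, stdev

# Columns 2 and 4 of Table 1: first and second reading by therapist 1 (GR).
reading1 = [17.13, 16.08, 10.91, 14.96, 13.00, 18.27, 14.99, 15.64, 10.93, 16.48]
reading2 = [16.78, 16.31, 10.60, 14.70, 12.63, 18.57, 15.81, 15.22, 13.46, 17.51]

# Plot coordinates: average of each pair (x) against their difference (y).
averages = [(a + b) / 2 for a, b in zip(reading1, reading2)]
differences = [a - b for a, b in zip(reading1, reading2)]

bias = mean(differences)        # mean difference between readings
sd = stdev(differences)         # sample SD of the differences

# Assumed convention: limits of agreement = bias +/- 1.96 * SD.
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

# Participants (numbered from 1) whose difference falls outside the LOA.
outside = [i + 1 for i, d in enumerate(differences) if not lower <= d <= upper]

print(f"Mean difference: {bias:.2f}")
print(f"Limits of agreement: {lower:.2f} to {upper:.2f}")
print(f"Participants outside the LOA: {outside}")
```

Under this assumption the limits come out at roughly -2.15 to 1.51, and participant 9 is the only case outside them, which agrees with the single outlier described above.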

You might like to repeat the analysis for the data given in the paper for day 1, and compare your results with those given in Table 4 on page 191 of the paper, and the plot in Figure 2 on page 192.

Comments

The Rankin & Stokes (1998) paper gives much more detailed discussion around measures of reliability. In particular they give references for the following comments:

  • Pearson's correlation coefficient is an inappropriate measure of reliability because the strength of linear association, and not agreement, is measured (it is possible to have a high degree of correlation when agreement is poor).
  • A paired t-test assesses whether there is any evidence that two sets of measurements agree on average. However, it is the difference between within-subjects scores that is of interest (taking the mean score of all subjects has potential to provide misleading estimates).
  • A high scatter of individual differences can result in the difference between the means being non-significant.
  • It is no longer considered to be appropriate (in most cases) to use the coefficient of variation (CV) to calculate reliability.
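The first comment can be illustrated with a small sketch. The second rater below is hypothetical (not from the worksheet or the paper): it reads exactly 5 units higher than therapist 1 on every participant, so the two sets of readings never agree, yet the Pearson correlation is perfect.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Therapist 1, first reading (Table 1).
rater_a = [17.13, 16.08, 10.91, 14.96, 13.00, 18.27, 14.99, 15.64, 10.93, 16.48]
# Hypothetical second rater with a constant offset of 5 units.
rater_b = [value + 5 for value in rater_a]

r = pearson_r(rater_a, rater_b)
mean_difference = mean(b - a for a, b in zip(rater_a, rater_b))

print(f"Pearson r: {r:.3f}")
print(f"Mean difference between raters: {mean_difference:.2f}")
```

The correlation is 1 (up to floating-point rounding) even though every pair of readings differs by 5 units, which is why agreement needs to be assessed separately from linear association.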

'Single measure' applies to single measurements—for example, the rating of judges, individual item scores, or the body weights of individuals. 'Average measure', however, applies to average measurements, for example, the average rating of k judges, or the average score for a k-item test.

The Rankin paper also discusses an ICC (1,2) for a reliability measure using the average of two readings per day.

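For data measured at nominal level, e.g. agreement (concordance) between 2 health professionals classifying patients as 'at risk' or 'not at risk' of a fall, Cohen's kappa is used. It compares the observed agreement with the agreement expected by chance, where the expected frequencies are calculated in the same way as for the chi-squared test.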
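A minimal sketch of the kappa calculation follows. The 2 × 2 counts are invented purely for illustration (they are not from the worksheet or the paper), and `cohens_kappa` is a hypothetical helper name.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement from the row and column totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts for 50 patients:
#                      B: at risk   B: not at risk
counts = [[20, 5],   # A: at risk
          [10, 15]]  # A: not at risk

kappa = cohens_kappa(counts)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the raters agree on 35 of 50 patients (0.70), chance agreement is 0.50, and kappa is (0.70 - 0.50) / (1 - 0.50) = 0.40.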

Rankin G & Stokes M (1998). Statistical analysis of reliability studies. Clinical Rehabilitation 12: 187-