Tutorials in Quantitative Methods for Psychology

2012, Vol. 8(1), p. 23-34.

Computing Inter-Rater Reliability for Observational Data:

An Overview and Tutorial

Kevin A. Hallgren

University of New Mexico

Many research designs require the assessment of inter-rater reliability (IRR) to

demonstrate consistency among observational ratings provided by multiple coders.

However, many studies use incorrect statistical procedures, fail to fully report the

information necessary to interpret their results, or do not address how IRR affects the

power of their subsequent analyses for hypothesis testing. This paper provides an

overview of methodological issues related to the assessment of IRR with a focus on

study design, selection of appropriate statistics, and the computation, interpretation,

and reporting of some commonly used IRR statistics. Computational examples include

SPSS and R syntax for computing Cohen’s kappa and intra-class correlations to assess

IRR.

The assessment of inter-rater reliability (IRR, also called

inter-rater agreement) is often necessary for research designs

where data are collected through ratings provided by

trained or untrained coders. However, many studies use

incorrect statistical analyses to compute IRR, misinterpret

the results from IRR analyses, or fail to consider the

implications that IRR estimates have on statistical power for

subsequent analyses.

This paper will provide an overview of methodological

issues related to the assessment of IRR, including aspects of

study design, selection and computation of appropriate IRR

statistics, and interpreting and reporting results.

Computational examples include SPSS and R syntax for

computing Cohen’s kappa for nominal variables and intra-

class correlations (ICCs) for ordinal, interval, and ratio

variables. Although it is beyond the scope of the current

paper to provide a comprehensive review of the many IRR

statistics that are available, references will be provided to

other IRR statistics suitable for designs not covered in this

tutorial.

A Primer on IRR

The assessment of IRR provides a way of quantifying the

degree of agreement between two or more coders who make

independent ratings about the features of a set of subjects. In

this paper, subjects will be used as a generic term for the

people, things, or events that are rated in a study, such as

the number of times a child reaches for a caregiver, the level

of empathy displayed by an interviewer, or the presence or

absence of a psychological diagnosis. Coders will be used as

a generic term for the individuals who assign ratings in a

study, such as trained research assistants or randomly

selected participants.
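
To make the idea of quantifying agreement concrete, the following Python sketch computes Cohen's kappa for two coders who each rate the same ten subjects on a nominal variable. It is only an illustration under stated assumptions: the function name cohens_kappa, the variable names, and the ratings themselves are hypothetical and do not come from any study. Kappa is computed as observed agreement corrected for the agreement expected by chance from each coder's marginal category frequencies.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two coders rating the same subjects (nominal data)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("Both coders must rate the same non-empty set of subjects.")
    n = len(ratings_a)

    # Observed agreement: proportion of subjects given identical codes.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two coders' marginal proportions,
    # summed over categories.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings: presence or absence of a diagnosis for ten subjects.
coder_a = ["present", "present", "absent", "absent", "present",
           "absent", "present", "absent", "absent", "present"]
coder_b = ["present", "present", "absent", "absent", "present",
           "absent", "absent", "absent", "present", "present"]

kappa = cohens_kappa(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

In this made-up data set the coders agree on eight of ten subjects, while the marginal frequencies imply chance agreement of one half, so kappa is noticeably lower than the raw percentage of agreement.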

In classical test theory (Lord, 1959; Novick, 1966), observed

scores (X) from psychometric instruments are thought to be

composed of a true score (T) that represents the subject’s

score that would be obtained if there were no measurement

error, and an error component (E) that is due to

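measurement error (also called noise), such that

$\text{Observed Score} = \text{True Score} + \text{Measurement Error}$,

or in abbreviated symbols,

$X = T + E$. (1)

Equation 1 also has the corresponding equation

$\mathrm{Var}(X) = \mathrm{Var}(T) + \mathrm{Var}(E)$, (2)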

where the variance of the observed scores is equal to the

variance of the true scores plus the variance of the

measurement error, if the assumption that the true scores

and errors are uncorrelated is met.
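
The variance decomposition can be checked numerically with a small simulation. The Python sketch below is an illustration only: the function name, the sample size, the means, and the standard deviations are all hypothetical choices. It draws true scores and errors independently, adds them to form observed scores, and compares the variance of the observed scores with the sum of the two component variances. The covariance term is computed explicitly to show why the decomposition relies on true scores and errors being uncorrelated.

```python
import random
import statistics


def simulate_scores(n_subjects=10000, true_sd=2.0, error_sd=1.0, seed=42):
    """Draw independent true scores (T) and errors (E); return T, E, and X = T + E."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_subjects)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_subjects)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def population_covariance(xs, ys):
    """Population covariance (divides by n, matching statistics.pvariance)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


true_scores, errors, observed = simulate_scores()

var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(observed)
cov_te = population_covariance(true_scores, errors)

print(f"Var(X)          = {var_x:.4f}")
print(f"Var(T) + Var(E) = {var_t + var_e:.4f}")
print(f"2 * Cov(T, E)   = {2 * cov_te:.4f}")
```

In any finite sample, Var(X) equals Var(T) + Var(E) + 2 Cov(T, E) exactly. Because T and E are drawn independently here, the covariance term is close to zero and the sum of the component variances is close to the observed variance.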

Measurement error (E) prevents one from being able to

observe a subject’s true score directly, and may be

introduced by several factors. For example, measurement

error may be introduced by imprecision, inaccuracy, or poor
