Understanding Reliability and Its Role in Research: Types and Theories, Summaries of Human Physiology

This chapter introduces the concept of reliability in research and discusses its definitions, theories, and types. Reliability is crucial for ensuring the consistency and dependability of research measures, and it is a necessary, but not sufficient, condition for validity. Topics include reliability errors, generalizability theory, interrater and intrarater reliability, test-retest reliability, parallel forms reliability, and split-half reliability. Understanding reliability is essential for researchers to design effective and accurate research studies.

2021/2022

Uploaded on 07/04/2022

jacqueline_nel

CHAPTER 4
Reliability
Achieving consistency in research is as complicated as it is in everyday life. We often expect that most things we plan for on a daily basis are actually going to happen. Whether you are in the working world or a college student, you face the daily task of getting from where you live to either work or college. Regardless of whether you get to your destination by train, car, bicycle, or some other mode of transportation, you expect that it will consistently get you where you need to be. What would you do if, every time you got in your car, you never knew whether the car would start? What would you do if the train sometimes did not arrive on time or stopped running during the commute? We all expect a level of consistency in everything we do.
With that being said, research is no different. We expect some level of consistency when conducting research. Consistency in research is referred to as reliability. Before beginning a discussion of reliability, it is logical to ask, “What is reliability?” Reliability has a variety of definitions, such as the extent to which a measure is dependable or consistent (Gatewood & Field, 2001), the consistency of a measure across subsequent tests or over time, the stability of results on a measure, the preciseness of a measure, systematic or consistent scores (Schwab, 2005), consistency (Shadish, Cook, & Campbell, 2002), or the degree to which the results can be replicated under similar conditions (McBride, 2010). Regardless of the definition, the common theme is that when a measure is reliable, the results are consistent, dependable, precise, or stable. Reliability is based on probability, with a reliability coefficient ranging from 0 to 1. A reliability coefficient of 1 would mean that there is 100% reliability in the measure, and a reliability coefficient of 0 would mean that there is 0% reliability in the measure.
In addition to reliability, another important concept, validity, is discussed in Chapters 5 and 6. Validity is related to the accuracy of the results or process in a study. Not only are we concerned about how consistent or reliable the measures used in an experiment are, but we also need to ensure that the results are accurate or valid.
reliability: The extent to which a measure or process is consistent, dependable, precise, or stable
validity: The extent to which a measure or process is accurate
©SAGE Publications

44 PART I FOUNDATION OF RESEARCH METHODS

The problem with measuring variables within an organization is that human behavior or any process relying on human interaction is not always 100% predictable. There will always be some variation within the measurement of any variable. Regardless of this variability, reliability is important for two reasons:

  1. Reliability is a necessary, but not sufficient condition for validity.
  2. Reliability is the upper limit to validity.

The first statement implies that having a reliable measure does not mean that it will always be valid. For example, you may have a scale that consistently measures an individual’s weight. However, if the scale is set back five pounds without anyone knowing, then the weight it reports is not valid. The second statement implies that the validity coefficient will never be higher than the reliability coefficient. This means that if the reliability coefficient calculated from your measure is 0.6, then the validity coefficient cannot be higher than 0.6. This is critical because, when studying human behavior, no one test is perfect.

RELIABILITY THEORIES

As previously mentioned, reliability is concerned with the consistency of a measure, with the goal of reducing errors in measurement. Almost every measure of human behavior has some degree of error associated with the tool. Reliability errors are classified as random errors and systematic errors; systematic errors are also called nonrandom errors. While the purpose of this book is to provide a researcher with the tools to conduct well-developed applied research or evaluate existing research, we must still refer to some critical theoretical concepts. Two such theories for errors in measurement are as follows:
  1. classical test theory or true score theory
  2. generalizability theory

Classical test theory, or true score theory, is based on the assumption that measurement error exists. This theory is derived from the thought that a raw score (X) on a measure is composed of a true component (T) and a random error component (E), such that the formula for a raw score is X = T + E. The true component represents the score that the participant would receive on a measure in the absence of measurement error. The random error component represents the amount by which the participant’s score was influenced by other factors unrelated to the construct at the time the measurement was observed. The combination of the true component and random

random error: A type of error in measurement where any factor or variable randomly has an impact on the measured variable
systematic error: A type of error in measurement where any factor or variable that is not random has an impact on the measured variable
classical test theory: Also referred to as true score theory. A measurement error theory derived from the thought that a raw score consists of a true and a random component.
true score theory: Also referred to as classical test theory. A measurement error theory derived from the thought that a raw score consists of a true and a random component.
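The decomposition X = T + E can be illustrated with a small simulation. The sketch below is not from the chapter: the function names, the normally distributed scores, and the use of the ratio of true-score variance to raw-score variance as a reliability index (a standard classical test theory definition that the text above does not state) are all assumptions made for illustration.

```python
import random
import statistics


def simulate_raw_scores(true_scores, error_sd, rng):
    """Return raw scores X = T + E, where E is random error with mean 0."""
    return [t + rng.gauss(0.0, error_sd) for t in true_scores]


def variance_ratio(true_scores, raw_scores):
    """Share of raw-score variance that is true-score variance (assumed index)."""
    return statistics.pvariance(true_scores) / statistics.pvariance(raw_scores)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    # Hypothetical true scores for 2,000 participants.
    true_scores = [rng.gauss(50.0, 10.0) for _ in range(2000)]
    for error_sd in (0.0, 5.0, 10.0):
        raw = simulate_raw_scores(true_scores, error_sd, rng)
        print(error_sd, round(variance_ratio(true_scores, raw), 2))
```

With no random error the ratio is 1, and it falls toward 0 as the error component grows, which mirrors the 0 to 1 range of the reliability coefficient described earlier in the chapter.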

assess job performance ratings over time to determine if the measures are consistent with the passage of time (Salgado, Moscoso, & Lado, 2003; Viswesvaran, Ones, & Schmidt, 1996).

Interrater Reliability or Intrarater Reliability – A researcher may desire to ensure that ratings of the same target are consistent, whether they come from the same rater or from different raters: for example, a measure of job performance based on the same person rating performance versus different people rating performance (Viswesvaran et al., 1996).

Parallel Forms Reliability or Equivalent Forms Reliability – If multiple tests are developed to measure the same construct or variable, then these tests should measure similar items. For example, standardized tests such as the Graduate Record Exam (GRE), Scholastic Assessment Test (SAT), or Graduate Management Admission Test (GMAT) all have multiple versions that are designed to measure the same constructs. Bing, Stewart, and Davison (2009) examined multiple forms of the Personnel Test for Industry-Numerical Test and an Employee Aptitude Survey Numerical Ability Test, comparing results obtained with a calculator versus without a calculator. To test this comparison, they utilized Forms A and B, which are two different but equivalent tests for assessing ability, and found support for both tests reliably assessing the same constructs.

Split Half Reliability – A test may be divided into multiple parts that are compared to ensure they are measuring the same constructs. For example, Damitz, Manzey, Kleinmann, and Severin (2003) conducted a study examining the validity of using an assessment center to select pilots. The assessment center consisted of data on nine cognitive ability tests, four different assessment exercises measuring nine behavioral dimensions, and nine behaviorally anchored peer ratings on training performance. Split half reliability was calculated by using the peer ratings because each group of peers rated the same student. The peers were therefore randomly divided into two groups to calculate a mean rating for each group, and a Spearman-Brown correction was then used to estimate reliability.

Internal Consistency / Coefficient Alpha – Items that measure similar constructs and appear throughout a test should be related to each other. For example, Cheng, Huang, Li, and Hsu (2011) administered a self-report questionnaire to Taiwanese workers to examine the extent to which burnout had an impact on employment insecurity and workplace justice. In total, there were six items on employment insecurity and nine items on workplace justice. The purpose was to determine whether these items measured the variables they were developed for.

Test-Retest Reliability

The first type of reliability is intuitive from its name. Test-retest reliability is when a researcher gives a participant the same test at two different points in time. The purpose of this type of reliability is to show that scores are consistent on multiple administrations of the same test over time. When a test is found to have test-retest reliability, it is expected that a participant’s scores on multiple administrations would be similar.
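As a rough illustration, test-retest reliability is often estimated as the correlation between the scores from the two administrations. The chapter does not prescribe a formula at this point, so the use of the Pearson correlation, the function name, and the scores below are assumptions for illustration only.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_1 = sum(first) / len(first)
    mean_2 = sum(second) / len(second)
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    # Note: undefined (division by zero) if either list has no variation.
    return cov / math.sqrt(ss_1 * ss_2)


if __name__ == "__main__":
    # Hypothetical scores for five participants tested at time 1 and time 2.
    time_1 = [12, 15, 9, 20, 17]
    time_2 = [13, 14, 10, 19, 18]
    print(round(pearson_r(time_1, time_2), 2))
```

One property worth noticing is that a constant shift in every score, like the scale that is set back five pounds, leaves this correlation unchanged. A measure can therefore look highly consistent on this index while still being inaccurate.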

Salgado et al. (2003) examined test-retest reliability with measures of job performance. Since supervisory ratings are frequently used for validation purposes within selection research (Barrick, Mount, & Judge, 2001), Salgado et al. (2003) conducted a study assessing the reliability of supervisory ratings on several dimensions of job performance and on overall job performance. They found that the test-retest reliability of overall job performance was 0.79 and that of other measures of performance ranged from 0.40 to 0.67, thus providing support for the test-retest reliability of performance ratings. Similarly, Sturman, Cheramie, and Cashen (2005) conducted a meta-analysis using 22 studies on the test-retest reliability of job performance ratings over time. They found test-retest reliability coefficients over time for low complexity jobs to be 0. and 0.50 for high complexity jobs. Despite these findings, they state that it is impossible to estimate the true stability of job performance because time is an important factor affecting job performance ratings. The main issue with test-retest reliability, as Sturman et al. (2005) point out, is that differences between the first and second administrations could affect the reliability because of the following factors:

  1. Time interval between test administrations
  2. The test or other factors associated with the participant

With respect to the time interval, consider a researcher who measured job dissatisfaction through negative affectivity and hypothesized that this measure is stable over time. To test this, the researcher measured negative affectivity again 1 year later. In this case, the time lapse between the two administrations of the same test may influence the reliability of the measurement. The result was that the true value of this measure a year later may have been underestimated. The error in measurement was associated with transient factors, such as the participant’s mood, emotion, or feeling at the time. These measures could be different a week or day later and

test-retest reliability: The extent to which test scores are similar when participants are given the same test more than once
interrater reliability: The extent to which measurement ratings are consistent among different raters
intrarater reliability: The extent to which measurement ratings are consistent among the same raters
parallel forms reliability: Also referred to as equivalent forms reliability. The extent to which two tests developed to measure the same construct of interest are consistent with each other.
equivalent forms reliability: Also referred to as parallel forms reliability. The extent to which two tests developed to measure the same construct of interest are consistent with each other.
split half reliability: Measures the internal consistency of items on a test by comparing different items that assess the same construct throughout the test
internal consistency: Also referred to as coefficient alpha. Measures the consistency of the items on a test that measure the same construct.
coefficient alpha: Also referred to as internal consistency reliability. Measures the consistency of the items on a test that measure the same construct.

Parallel or Equivalent Forms Reliability

The next type of reliability is parallel or equivalent forms reliability. By definition, this type of reliability is where a researcher creates two different but similar tests that measure the same construct. Standardized tests are among the better-known examples of parallel or equivalent forms. The process of developing a standardized test is extremely arduous and requires extreme precision to ensure that the psychometric properties of the items are similar. In practice, it is possible to create parallel or equivalent forms of a test, but the approach may not be widely used because of the effort involved in developing multiple tests.

The idea behind parallel or equivalent forms reliability is to have two conceptually identical tests that use separate questions to measure the same construct of interest. The number of items that could be used to measure a particular construct of interest is unlimited. Therefore, it is not possible for a test or measure to include every possible item to measure the constructs of interest. This has an important implication for reliability because creating a test to measure human behavior with a reliability coefficient of 1.0 is unlikely. On the other hand, having many possible items to measure the same construct is a benefit when using parallel or equivalent forms reliability, because it allows multiple similar but different tests to be created. The challenge when multiple tests are created to measure the same construct is that the items on the two versions may not actually measure the same construct.

From an applied perspective, Chan, Schmitt, Deshon, Clause, and Delbridge (1997) were interested in the relationships that factors such as race, test-taking motivation, and performance had on a cognitive ability test. To do this, they created a parallel form cognitive ability test battery that was used in an actual employment testing project. They found that the correlation between the first test and the parallel test was 0.84 (p < 0.05). This indicates that the two forms of the cognitive ability test were adequate with regard to parallel forms reliability.

Similarly, Bing et al. (2009) conducted a study that used multiple forms (Forms A and B) of a Personnel Test for Industry-Numerical Test and an Employee Aptitude Survey Numerical Ability Test and compared the results of participants who used a calculator with those of participants who did not. Multiple comparisons were conducted to examine the reliability of these different conditions, and they found support that the results of the calculator and noncalculator conditions were similar on both forms of the test.

Split Half Reliability

The next type of reliability is split half reliability. The purpose of split half reliability is to divide the test or measure into two halves and test the internal consistency of the items used. Split half reliability is similar to parallel or equivalent forms reliability, with one key difference. Parallel or equivalent forms reliability requires two versions of a test. With split half reliability, a researcher conducts only one administration of the test or measure and splits the test in half (e.g., even vs. odd questions or the first half vs. the second half), so multiple versions of a test are not necessary. One common criticism of this technique concerns where to split the test, because the result depends on how the items are divided. A few techniques to split the test or measure include using odd and even numbered items, randomly selecting the items, or using the first half and second half of the test. The most commonly used method of split half reliability within research is through odd and even items (Aamodt, 2007). As an example, Damitz et al. (2003) examined the validity of an assessment center used to select pilots. As a part of the assessment center, each group of peers had rated the same students; therefore, the researchers randomly divided each group into two equal-sized subgroups. This grouping allowed for calculation of split half reliability using the Spearman-Brown correction.

Internal Consistency/Coefficient Alpha

The last reliability technique is internal consistency reliability, which is also referred to as coefficient alpha or Cronbach’s alpha. This is the most common and widely used reliability technique for reporting the reliability of a test or measure in experiments in applied settings (Edwards, Scott, & Raju, 2003) and is also by far the most commonly

Box 4.1 In-Basket Test Reliability

When looking to select the most appropriate employee for a position, it is important to properly evaluate the candidate’s ability to succeed because replacing an employee can be extremely costly. In the day and age of cutting budgets to conserve money, an employer must be able to consistently (reliability) and accurately (validity) select one qualified candidate from a pool of applicants. One methodology to select employees is through the use of an assessment center. Within an assessment center, an applicant is given a variety of test batteries that may include simulations, tests, exercises, etc. in which they perform in a simulated work environment (Berry, 2002). One such test battery in an assessment center is an in-basket exercise. The purpose of this exercise is to give an applicant the opportunity to manage a variety of issues that could accumulate in a day, such as letters, memos, telephone messages, reports, or other items that may come up throughout the course of a day (Berry, 2002).

In an effort to assess the reliability of the in-basket test, Schippmann, Prien, and Katz (1990) reviewed the existing literature on various components of the reliability of an in-basket test. The reliability of the in-basket test was examined through interrater reliability, parallel forms reliability, and split-half reliability. While none of these three reliability techniques proved superior in assessing the reliability of the in-basket test, a lot of useful information was learned. In terms of interrater reliability, the range of reliability coefficients suggests that some other variable, possibly rater training, may be shaping the rating patterns. Differences in parallel forms reliability coefficients may be a result of confounding with performance on the test. Lastly, for split-half reliabilities based on odd and even items, Schippmann et al. (1990) suggest that further developing the test content or adopting a more systematic and objective approach to scoring the test may yield more encouraging reliability coefficients. In summary, the evidence on reliability and validity provides only modest support for the usefulness of the in-basket test.
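The odd-even split with a Spearman-Brown correction described under Split Half Reliability, and coefficient alpha from the section that follows it, can be sketched as below. The chapter names these techniques without giving formulas, so the standard textbook formulas used here (Spearman-Brown: 2r / (1 + r); alpha: k / (k - 1) times 1 minus the sum of item variances over the variance of total scores), the function names, and the data layout (one list of item scores per participant) are assumptions for illustration.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


def split_half_reliability(responses):
    """Odd-even split half reliability with the Spearman-Brown correction.

    responses: one list of item scores per participant.
    """
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    # Step the half-test correlation up to an estimate for the full test.
    return 2 * r_half / (1 + r_half)


def cronbach_alpha(responses):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(responses[0])  # number of items
    item_vars = [statistics.variance([row[i] for row in responses])
                 for i in range(k)]
    total_var = statistics.variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical answers of five participants to a four-item scale.
    data = [
        [4, 5, 4, 5],
        [2, 3, 2, 2],
        [5, 5, 4, 4],
        [1, 2, 2, 1],
        [3, 3, 4, 3],
    ]
    print(round(split_half_reliability(data), 2))
    print(round(cronbach_alpha(data), 2))
```

Both functions need only a single administration of the test, which is the practical difference from parallel or equivalent forms reliability noted above. Changing the slices in `split_half_reliability` to a first-half versus second-half split would show how much the estimate depends on where the test is divided.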

the goal is to have a reliability coefficient close to 1, which would indicate a high degree of consistency and a low degree of measurement error with the measure.

Simply having a high reliability coefficient in your study does not necessarily mean that the measure on your test is valid. The reason is that there may be potential threats to validity that can provide an explanation for a high reliability coefficient. Across the different reliability methods, what matters most is how the results found within the research are explained. For example, when evaluating research that states the best predictor of future performance is past behavior, you, as a consumer of information, have to know that this result is true and that there is no other explanation that can justify this result.

Whenever a measure of human behavior exists, there is some level of measurement error that occurs. While the reliability coefficient ranges from 0 to 1, with 1 being perfectly reliable, we know that no measure of human behavior is capable of achieving perfect reliability. There is bound to be some error associated with any measurement. Even classical test theory posits that the raw score of a measure is composed of a true component and a random error component.

Therefore, the goal of being a consumer of information is to know and understand the various reliability techniques as well as the relationship between reliability and validity. Whenever the relationship within an experiment can be explained by alternative explanations, there is a threat to the validity of the experiment and the reliability of the results. Validity is discussed in detail in Chapters 5 and 6. An alternative explanation for a result means that some other variable can explain the relationship between the cause and the effect.

validity: The accuracy of the results of a research study

CHAPTER SUMMARY

  • Reliability is the consistency of a measure with a coefficient between 0 and 1 and is important for two reasons: (1) Reliability is a necessary, but not sufficient, condition for validity and (2) reliability is the upper limit to validity. This implies that a reliable measure may not always be valid and a validity coefficient can never be higher than the reliability coefficient.
  • When a test is developed to measure a construct, the test may not be perfect, and there could be a degree of error associated with it, but techniques can be utilized to improve the reliability of a test. Error in measurement can be categorized as random or systematic (nonrandom) error.
  • All tests or measures have some degree of error associated with the measurement, and there are two theories aimed at understanding these errors in measurement. Classical test theory or true score theory is based on the assumption that every raw score observation is composed of two components, a true measurement and an error measurement. Generalizability theory extends the principles of classical test theory/true score theory on the premise of developing a set of observations that can be generalized from the sample collected to the population from which it was sampled.

  • There are five main goals of reliability and they relate to the different types of reliability, which are as follows: test-retest, interrater/intrarater, parallel or equivalent forms, split half, and internal consistency/coefficient alpha reliability. These different types of reliability are aimed at ensuring that a test consistently measures a construct of interest.

DISCUSSION QUESTIONS

  • What are some ways that a researcher or practitioner can reduce systematic errors within a study design?
  • How does classical test theory/true score theory or generalizability theory apply to research design?
  • Given that internal consistency reliability is the most commonly reported reliability technique, how might you use split half, parallel forms, intrarater/interrater, or test-retest reliability to demonstrate the consistency of your measures?

CHAPTER KEY TERMS

Classical Test Theory
Generalizability Theory
Random Error
Reliability
Reliability, Coefficient Alpha
Reliability, Equivalent Forms
Reliability, Internal Consistency
Reliability, Interrater
Reliability, Intrarater
Reliability, Parallel Forms
Reliability, Split Half
Reliability, Test-Retest
Systematic Error
True Score Theory
Validity