






This summary covers the concept of reliability in research, discussing its various definitions, theories, and types. Reliability is crucial for ensuring the consistency and dependability of research measures, and it is a necessary, but not sufficient, condition for validity. Topics include reliability errors, generalizability theory, interrater and intrarater reliability, test-retest reliability, parallel forms reliability, and split-half reliability. Understanding reliability is essential for researchers to design effective and accurate research studies.







CHAPTER 4 Reliability

Achieving consistency in research is as complicated as it is in everyday life. We often expect that most things we plan for on a daily basis are actually going to happen. Whether you are in the working world or a college student, you are faced with the daily task of getting from where you live to either work or college. Regardless of whether you get to your destination by train, car, bicycle, or any other mode of transportation, you expect that it will consistently get you where you need to be. What would you do if, every time you got in your car, you never knew whether the car would start? What would you do if the train sometimes did not arrive on time or stopped running during the commute? We all expect a level of consistency in everything we do. Research is no different. We expect some level of consistency when conducting research. Consistency in research is referred to as reliability.

Prior to beginning a discussion on reliability, it is logical to ask, “What is reliability?” Reliability has a variety of different definitions, such as the extent to which a measure is dependable or consistent (Gatewood & Field, 2001), the consistency of a measure across subsequent tests or over time, the stability of results on a measure, the preciseness of a measure, systematic or consistent scores (Schwab, 2005), consistency (Shadish, Cook, & Campbell, 2002), or the degree to which the results can be replicated under similar conditions (McBride, 2010). Regardless of the definition, the common theme is that when a measure is reliable, the results are consistent, dependable, precise, or stable. Reliability is based on probability with a reliability coefficient ranging from 0 to 1.
A reliability coefficient of 1 would mean that there is 100% reliability in the measure, and a reliability coefficient of 0 would mean that there is 0% reliability in the measure. In addition to reliability, another important concept, validity, is discussed in Chapters 5 and 6. Validity is related to the accuracy of the results or process in a study. Not only are we concerned about how consistent or reliable the measures used in an experiment are, but we also need to ensure that these results are accurate or valid.

reliability: The extent to which a measure or process is consistent, dependable, precise, or stable
validity: The extent to which a measure or process is accurate
The problem with measuring variables within an organization is that human behavior or any process relying on human interaction is not always 100% predictable. There will always be some variation within the measurement of any variable. Regardless of this variability, reliability is important for two reasons:
assess job performance ratings over time to determine if the measures are consistent with the passage of time (Salgado, Moscoso, & Lado, 2003; Viswesvaran, Ones, & Schmidt, 1996).

Interrater Reliability or Intrarater Reliability – A researcher may desire to ensure that multiple items on a given survey or questionnaire produce a similar participant response to all the items on the survey or questionnaire: for example, a measure of job performance based on the same person rating performance versus different people rating performance (Viswesvaran et al., 1996).

Parallel Forms Reliability or Equivalent Forms Reliability – If multiple tests are developed to measure the same construct or variable, then these subsequent tests should measure similar items. For example, standardized tests such as the Graduate Record Examination (GRE), Scholastic Assessment Test (SAT), and Graduate Management Admission Test (GMAT) all have multiple versions that are designed to measure the same constructs. Bing, Stewart, and Davison (2009) examined the multiple forms of the Personnel Test for Industry-Numerical Test and an Employee Aptitude Survey Numerical Ability Test, comparing results based on using a calculator versus not using a calculator. To test this comparison, they utilized Forms A and B, which are two different but equivalent tests for assessing ability, and found support for both tests reliably assessing the same constructs.

Split Half Reliability – A test may be divided into multiple parts and compared to ensure they are measuring the same constructs. For example, Damitz, Manzey, Kleinmann, and Severin (2003) conducted a study examining the validity of using an assessment center to select pilots. The assessment center consisted of data on nine cognitive ability tests, four different assessment exercises measuring nine behavioral dimensions, and nine behaviorally anchored peer ratings on training performance.
Split half reliability was calculated by using the peer ratings because each group of peers rated the same student. The peers were therefore randomly divided into two groups, a mean rating was calculated for each group, and a Spearman-Brown correction was then used to estimate reliability.

Internal Consistency / Coefficient Alpha – Items that measure similar constructs that appear throughout a test should be related to each other. For example, Cheng, Huang, Li, and Hsu (2011) gave a self-administered questionnaire to Taiwanese workers to examine the extent to which burnout had an impact on employment insecurity and workplace justice. In total, there were six items on employment insecurity and nine items on workplace justice. The purpose was to determine if these items measure the variables they were developed for.

Test-Retest Reliability

The first type of reliability is intuitive from its name. Test-retest reliability is when a researcher provides a participant with the same test at two different points in time. The purpose of this type of reliability is to show that scores are consistent on multiple administrations of the same test over time. When a test is found to have test-retest reliability, it is expected that a participant’s scores on multiple administrations would be similar.
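As a rough illustration of the idea, test-retest reliability is commonly estimated as the correlation between scores from the two administrations. The sketch below assumes a Pearson correlation is an appropriate estimate; the function name `pearson_r` and the score lists are hypothetical and are not taken from any study cited in this chapter.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores.

    Assumes both lists contain some variation (non-zero variance).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for five participants on the same test,
# administered at two different points in time.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 12, 14, 19, 21]

test_retest_r = pearson_r(time_1, time_2)
print(round(test_retest_r, 3))
```

A coefficient close to 1 would suggest that participants keep roughly the same relative standing across the two administrations; how long to wait between administrations is a design decision this sketch does not address.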
Salgado et al. (2003) examined test-retest reliability with measures of job performance. Since supervisory ratings are frequently used for validation purposes within selection research (Barrick, Mount, & Judge, 2001), Salgado et al. (2003) conducted a study assessing the reliability of supervisory ratings on several dimensions of job performance and on overall job performance. They found that the test-retest reliability of overall job performance was 0.79 and that of other measures of performance ranged from 0.40 to 0.67, thus providing support for the test-retest reliability of performance ratings.

Similarly, Sturman, Cheramie, and Cashen (2005) conducted a meta-analysis using 22 studies on the test-retest reliability of job performance ratings over time. They found test-retest reliability coefficients over time for low complexity jobs to be 0. and 0.50 for high complexity jobs. Despite these findings, they state that it is impossible to estimate the true stability of job performance because time is an important factor affecting job performance ratings. The main issue with test-retest reliability, as Sturman et al. (2005) point out, is that the difference in measures between the first and second administration could impact the reliability due to the following factors:
Parallel or Equivalent Forms Reliability

The next type of reliability is parallel or equivalent forms reliability. By definition, this type of reliability is where a researcher creates two different but similar tests that measure the same construct. Standardized tests are among the better-known examples of parallel or equivalent forms. The process of developing a standardized test is extremely arduous and requires extreme precision to ensure the psychometric properties of the items are similar. In practice, it is possible to create parallel or equivalent forms of a test, but the approach may not be widely used because of the effort involved in developing multiple tests.

The idea behind parallel or equivalent forms reliability is to have two conceptually identical tests that utilize separate questions to measure the same construct of interest. The number of items that could be used to measure a particular construct of interest is unlimited. Therefore, it is not possible for a test or measure to include every possible item to measure the constructs of interest. This has an important implication for reliability because creating a test to measure human behavior with a reliability coefficient of 1.0 is unlikely. On the other hand, having many possible items to measure the same construct can be a benefit when using parallel or equivalent forms reliability to create multiple similar but different tests. The challenge you face when multiple tests are created to measure the same construct is that the items on both versions of the same test may not actually measure the same construct.

From an applied perspective, Chan, Schmitt, Deshon, Clause, and Delbridge (1997) were interested in the relationships that factors such as race, test-taking motivations, and performance had on a cognitive ability test. To do this, a parallel form cognitive ability test battery was created that was used in an actual employment testing project.
They found that the correlation between the first test and the parallel test was 0.84 (p < 0.05). This indicates that the two forms of the cognitive ability test were adequate with regard to parallel form reliability. Similarly, Bing et al. (2009) conducted a study that utilized multiple forms (Forms A and B) of a Personnel Test for Industry-Numerical Test and an Employee Aptitude Survey Numerical Ability Test and compared results for participants using a calculator with results for participants not using a calculator. Multiple comparisons were conducted to examine the reliability of these different conditions, and they found support that the results of both the calculator and noncalculator conditions were similar on both forms of the test.

Split Half Reliability

The next type of reliability is split half reliability. The purpose of split half reliability is to divide the test or measure into two halves and test the internal consistency of the items used. Split half reliability is similar to parallel or equivalent forms reliability, with a couple of exceptions. Parallel or equivalent forms reliability requires two versions of a test. With split half reliability, a researcher conducts only one administration of the test or measure and splits the test in half (e.g., even vs. odd questions or the first half vs. the second half). This is different from parallel or equivalent forms because multiple versions of a test are not necessary. One common criticism of this technique is the difficulty of determining where to split the test, given how the items are divided within the test or measure. A few techniques to split the test or
measure include using odd and even numbered items, randomly selecting the items, or using the first half and second half of the test. The most commonly used method of split-half reliability within research is through odd and even items (Aamodt, 2007). As an example, Damitz et al. (2003) examined the validity of an assessment center used to select pilots. As a part of the assessment center, each group of peers had rated the same students; therefore, the researchers randomly divided the group into two equal-sized subgroups. This grouping allowed for calculation of split half reliability utilizing the Spearman-Brown correction.

Internal Consistency/Coefficient Alpha

The last reliability technique is internal consistency reliability, also referred to as coefficient alpha or Cronbach’s alpha. This is the most common and widely used reliability technique for reporting the reliability of a test or measure in experiments in applied settings (Edwards, Scott, & Raju, 2003) and is also by far the most commonly

Box 4.1 In-Basket Test Reliability

When looking to select the most appropriate employee for a position, it is important to properly evaluate their ability to succeed because replacing an employee can be extremely costly. In the day and age of cutting budgets to conserve money, an employer must be able to consistently (reliability) and accurately (validity) select one qualified candidate from a pool of applicants. One methodology to select employees is through the use of an assessment center. Within an assessment center, an applicant is given a variety of test batteries, which may include simulations, tests, and exercises, designed to have the applicant perform in a simulated work environment (Berry, 2002). One such test battery in an assessment center is an in-basket exercise.

The purpose of this exercise is to give an applicant the opportunity to manage a variety of issues that could accumulate in a day, such as letters, memos, telephone messages, reports, or other items that may come up throughout the course of a day (Berry, 2002). In an effort to assess the reliability of the in-basket test, Schippmann, Prien, and Katz (1990) reviewed the existing literature on various components of the reliability of an in-basket test. The reliability of the in-basket test was examined through interrater reliability, parallel forms reliability, and split-half reliability. While none of these three reliability techniques proved superior in assessing the reliability of the in-basket test, a lot of useful information was learned. In terms of interrater reliability, the range of reliability coefficients suggests that some other variable, possibly a function of rater training, may create the rating patterns. Differences in parallel forms reliability coefficients may be a result of confounding with performance on the test. Lastly, for split-half reliabilities based on odd and even items, Schippmann et al. (1990) suggest that further developing the test content, or taking a more systematic and objective approach to scoring the test, may yield more encouraging reliability coefficients. In summary, the reliability and validity evidence provides only modest support for the usefulness of the in-basket test.
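The split-half and coefficient alpha techniques described in this chapter can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses an odd-even split, the standard Spearman-Brown formula for doubling test length, 2r / (1 + r), and the usual coefficient alpha formula, (k / (k - 1)) × (1 − sum of item variances / variance of total scores). The function names and the `responses` data (five participants answering four items) are hypothetical and do not come from any of the studies cited here.

```python
from math import sqrt
from statistics import variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    item_scores holds one list of item scores per participant.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


def cronbach_alpha(item_scores):
    """Coefficient alpha computed from item variances and total-score variance."""
    k = len(item_scores[0])  # number of items
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 participants x 4 items, each rated 1 to 5.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

split_half = split_half_reliability(responses)
alpha = cronbach_alpha(responses)
print(round(split_half, 3), round(alpha, 3))
```

The odd-even split is used here because the chapter names it as the most common approach; a random split or a first-half versus second-half split would only change how the two half totals are formed.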
the goal is to have a reliability coefficient close to 1, which would indicate a high degree of consistency and a low degree of measurement error in the measure. Simply having a high reliability coefficient in your study does not necessarily mean that the measure on your test is valid. The reason is that there may be potential threats to validity that can provide an explanation for a high reliability coefficient. When discussing the different reliability methods, the main conclusion to draw across the results of the different types of reliability concerns the explanation of the results found within the research. For example, when evaluating research that states the best predictor of future performance is past behavior, you, as a consumer of information, have to know that this result is true and that there is no other explanation that can justify this result.

Whenever a measure of human behavior exists, some level of measurement error occurs. While the reliability coefficient ranges from 0 to 1, with 1 being perfectly reliable, we know that no measure of human behavior is capable of achieving perfect reliability. There is bound to be some error associated with any measurement. Even classical test theory posits that the raw score of a measure is composed of a true component and a random error component. Therefore, the goal of being a consumer of information is to know and understand the various aspects of reliability techniques as well as the relationship between reliability and validity. Whenever a possibility exists that the relationship within an experiment can be explained by alternative explanations, there is a threat to the validity of the experiment and the reliability of the results. Validity is discussed in detail in Chapters 5 and 6. An alternative explanation for a result means that some other variable can explain the cause-and-effect relationship.

CHAPTER SUMMARY