CHAPTER 3 VALIDITY AND RELIABILITY, Summaries of Logic

Reliability and validity form the crux of any measurement since they are important in establishing the credibility and truthfulness of the findings.

Typology: Summaries

2022/2023

Uploaded on 03/01/2023

CHAPTER 3

VALIDITY AND RELIABILITY

3 VALIDITY AND RELIABILITY

3.1 INTRODUCTION

In Chapter 2, the study’s aim of exploring how objects can influence the level of construct validity of a Picture Vocabulary Test was discussed, and the literature on the various factors that can influence the validity level was reviewed. In this chapter validity and reliability are discussed. They are given a chapter of their own in order to provide a better understanding of their role in this study.

3.2 VALIDITY

Although validity is much debated and an entire thesis could be written about it, only the aspects relevant to this study are mentioned and discussed in the sections to follow.

3.2.1 Validity

Bond (2003, p. 179) comments that:

…validity is foremost on the mind of those developing measures and that genuine scientific measurement is foremost in the minds of those who seek valid outcomes from assessment.

From this quote, validity can be seen as the core of any form of assessment that is trustworthy and accurate (Bond, 2003, p. 179). Validity, according to Messick (1989, p. 6)

… always refers to the degree to which empirical evidences and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores.

Borsboom, Mellenbergh and van Heerden (2004, p. 1061) have a different take on validity stating that: “…a test is valid for measuring an attribute if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement”. They do not agree with Messick’s conception of validity. In this study Messick’s viewpoint will be followed.

Validity forms the crux of this research project and is an essential issue not only for assessment but for measurement as a whole. In addition, a valid assessment can be used across countries and cultures; if this is not the case, the assessment can be seen as biased. What is more, validity influences the way that instruction changes once the results of an assessment have been correctly interpreted (Gregory, 2000; Linn, 1998; Mahoney, 2008; Messick, 1989; Popham, 2003; Frederiksen & Collins, 1989; Gay & Airasian, 2003; Sullivan, Karlsson & Ware, 1995).

For instance, an Intelligence Quotient (IQ) test measures the intelligence of the learner (an existing attribute), and not all learners will have the same intelligence (variations in the attribute). This implies that when a specific attribute needs to be investigated, the interpretations or inferences made from the test have to be valid. As Ukrainetz and Blomquist (2002, p. 60) put it, validity concerns “…how well a test measures what it is purported to measure”.

Validity is an evolving complex concept because it relates to the inferences regarding assessment results. Focusing on the consequences of the inferences made implies that they should be appropriate and adequate. Messick (1989, p. 6) points out that inferences are hypotheses, and when these inferences are validated it amounts to hypothesis-testing. As a result, validity is seen as evaluative judgements that are made on the inferences of assessment results or test scores, that is whether correct interpretations are made and actions are taken based on the inferences. These evaluative judgements need to be correct and reflective of the truth. An assessment

Content-related validity is another type of validity. As its name implies, it explores how the content of the assessment performs. To determine content-related validity, the researcher establishes whether all areas or domains are appropriately covered within the assessment. Furthermore, it deals with how the assessment is designed, for example the size of the font, sufficiency of work space for learners, correct language usage and clarity of instructions (Fraenkel & Wallen, 2003). With the original PIPS instrument the content-related validity was carefully considered and of a high standard.

3.2.2 Construct Validity

In 1955, Cronbach and Meehl elaborated on the various methods to determine construct validity. For a construct to be scientifically acceptable it had to be located in a ‘nomological network’, made up of laws that are either statistical or deterministic. These laws tie observable properties to one another, in other words the same topics or constructs are grouped together (Garrison, 1994; Moss, 1992; Cronbach & Meehl, 1955).

The main purpose of a researcher in exploring construct validity is to determine whether the inferences made about the results of the assessment are meaningful and serve the purpose of the assessment. Construct validity is empirically explored by means of Rasch analyses and, as mentioned above, is central to any quality assessment. Whenever a certain attribute has to be measured, construct validity is involved, as it is the most applicable form of validity to assess measurements (Andrews, 1984; Creswell, 2005; Mahoney, 2008; Messick, 1981, 1989; Popham, 2003; Embretson & Gorin, 2001; Gay & Airasian, 2003; McMillan & Schumacher, 2006).

In particular, construct validity is concerned with the efficacy of a test to gauge learner knowledge about the relevant topics of concern. The test must be relevant, appropriate and utilised correctly, with the focal point being the integration of evidence that produces inferences about assessment results. These inferences must be meaningful, trustworthy and serve the purpose of the assessment for construct validity to reach its goal (Messick, 1989).

If these three steps are integrated into an assessment then the degree of construct validity will be high. In the same way, this study explores whether these three steps are successfully implemented in the Picture Vocabulary Test, as discussed in Chapter 1. On the other hand, if construct validity is not the central focus of an assessment, the assessment may not assess what it is supposed to, which lowers the validity level. If an assessment does not produce the same results across different groups then the level of construct validity comes into question.

Cronbach and Meehl (1955, p. 283) view construct validity as asking to what extent a test is culture-free. There are two major threats to construct validity. The first is that the construct is underrepresented because the test covers limited facets of the construct, or has too few relevant items to accurately assess the desired topic. The second is ‘construct-irrelevant variance’, meaning that the test contains reliable variance that is irrelevant to the construct being measured, for instance when certain items are easier or harder for certain learners for reasons unrelated to the construct (Messick, 1989; Moss, 1992).

Linking the above to this study, the focus of construct for the overall SAMP assessment was to determine the relevant level of the academic knowledge and skills possessed by Grade 1 learners across different fields (see Chapter 1, Section 1.2.4). As for this study, the construct of the Picture Vocabulary Test was under investigation, in particular its implementation across three different language groups. When assessing learners using a Picture Vocabulary Test, the objects used in the test must be familiar to all participating learners, implying that most of the objects used in the three pictures of this study must have been observed by the learners at some time or another in their past. For the Picture Vocabulary Test to have a high level of construct validity, all the objects (items) that need to be identified by all learners have to perform the same across all three groups. Alternatively, if this is not the case this study will provide suggestions as to how the level of construct validity can be increased. Meaningful, accurate and justifiable inferences can only be achieved through a high level of construct validity.

group that had similar or equal abilities (Bond & Fox, 2001). If the items, or in this case the order of the objects used in the pictures, are not replicable across the three groups, then the reliability of the test is in jeopardy. In this study the real person and real item separation reliabilities were explored. Similar to internal consistency, separation reliability takes values between 0 and 1, and high values are beneficial to an assessment (Scherman, 2007).
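The idea of a separation reliability between 0 and 1 can be illustrated with a minimal sketch. It assumes the commonly used definition of reliability as the share of the observed variance in the measures that is not error variance; the function name, the use of the population variance and the example values are hypothetical and are not taken from this study or from any particular software output.

```python
from statistics import pvariance


def separation_reliability(measures, standard_errors):
    """Share of the observed variance in the measures that is not error variance."""
    observed_var = pvariance(measures)
    if observed_var == 0:
        return 0.0  # no spread in the measures, so nothing can be separated
    # Error variance is taken as the mean of the squared standard errors (assumption).
    error_var = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var


# Hypothetical item difficulties (in logits) and their standard errors.
item_measures = [-2.0, -1.0, 0.0, 1.0, 2.0]
item_errors = [0.5, 0.5, 0.5, 0.5, 0.5]
print(round(separation_reliability(item_measures, item_errors), 3))
```

A value close to 1 suggests that the spread of the measures is large relative to their estimation error, which is the sense in which high values are beneficial.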

Item performance can be assessed through Rasch analyses, which can alert the researcher to the ordering of the items. Each item’s difficulty is situated along a logit scale, together with its degree of error. The more information available about the difficulty of the item, the smaller the estimation error. If items are clumped then their difficulties are not evenly dispersed, which influences the level of validity as well as the reliability of a test. With the help of Rasch analyses, items that are clumped, too easy or too difficult can be identified and dealt with accordingly. If an assessment is focussed only on reliability, the validity level of the assessment will decrease. As was pointed out earlier, Messick’s (1989) unified concept of validity also includes reliability, and the two overlap. Consequently, a balance has to be reached between reliability and validity.

3.4 CONCLUSION

In this study empirical investigations took place by means of Rasch analyses in order to determine the level of validity of the Picture Vocabulary Test, following Messick’s (1989) concept of validity and reliability interrelating within an assessment. These empirical investigations led the study to adopt a Positivist theoretical framework, which aided in investigating the items and the level of validity of the assessment as distinct from the learners or other factors. Statistical procedures were used as noted earlier, and these are discussed in Chapter 4.

CHAPTER 4

METHODOLOGY

4 METHODOLOGY

4.1 INTRODUCTION

As mentioned before this research study was quantitative, with the research design illustrating the procedures followed and how the data was collected and analysed in order to answer the research questions. The results and findings of this study will be discussed in Chapter 5.

A detailed description of the Rasch analyses that were conducted to analyze the data of the Picture Vocabulary Test is provided in Section 4.1.6. In Section 4.1, the research methodology pertaining to this study is discussed, including how the data was collected and how the sample was selected. The ethical considerations are addressed in Section 4.2, before the conclusion in Section 4.3.

The main research question that was asked in this study was:

How do objects used in a Picture Vocabulary Test influence the level of validity?

The main research question has been broken down into more detailed questions that can lead the research study to explore objective answers.

What barriers to validity used in a Picture Vocabulary Test can be identified from literature?

Literature was explored to identify the barriers to the validity level of the Picture Vocabulary Test that were most applicable to this study. These areas were language, culture and Visual Literacy, as discussed in Chapter 2.

French philosopher, in his famous Meditations, wrote that he had accepted false opinions to be true from his youth, and promised to rid himself of the opinions he had adopted. Locking himself in a small room with a fireplace, during the winter, he examined all his beliefs. The false opinions created a desire in Descartes to establish a rigid superstructure of the sciences, but the only concrete, secure belief he could find was ‘I think, therefore I am’. Descartes being a Foundationalist and a member of its rationalist division, identified the foundation of Positivism based on “…what could not possibly be rationally doubted and seemed indubitably true should be accepted as true” (Phillips & Burbules, 2000, p. 6). In other words, truth and reality do exist and can be measured and explained if found, leaving no room for doubt (Clark, 1998; Guba, 1990). These premises were the foundation upon which Positivism was built.

The Positivist approach has been a recurring theme since Plato, who believed that nature had certain unalterable ideas (Loving, 1997) that needed to be tested and proven true. These seeds of Positivism began sprouting in the 17th and 18th centuries during the period of Enlightenment (Ponterotto, 2005). Francis Bacon (1561-1626) had a sincere commitment towards the rules of evidence, feeling that modern science’s purpose was to investigate a nature that was waiting to be discovered, and defined by man through induction. The two notions of induction proposed by Bacon were one of pure discovery and the other a method of observing and then testing hypotheses which lead to logic or justification (Abraham, 1996; Alexander, 2006; Loving, 1997; Phillips & Burbules, 2000). Later, in the 1920s, the philosophy of Logical Positivism was developed by the Vienna Circle, a small group of philosophers, physical scientists, social scientists and mathematicians, who focused on the ‘Received View’ (Abraham, 1996; Phillips & Burbules, 2000), postulating that what is seen is believed. It was seen as a pointless task to make statements about happenings that could not be verified by the senses. Positivism was developed further by B.F. Skinner, a behaviourist psychologist who also had a major influence on scholars in the Positivist movement (Abraham, 1996; Buchanan, 1998; Kidd, 2002; Lather, 2006; Phillips & Burbules, 2000; Schulze, 2003; Wardlow, 1989). In Table 4.1 the basic constructs of Positivism and how they link with this research study (highlighted in light green) are depicted.

Table 4.1: Positivism linked to the research study (Adapted and combined from Scherman, 2007 & Guba & Lincoln, 1994)

Inquiry Aim
Positivism: To explain, predict or control
In this study: To explain how the items function in the Picture Vocabulary Test

Method
Positivism: Quantitative
In this study: A scientific Rasch analysis will make the study quantitative

Logic
Positivism: Deductive
In this study: If the items do not function as expected then we can deduce that the Construct validity which has implications for interpretations about validity

Epistemology
Positivism: Objectivist - objectively true
In this study: The researcher and the subjects are independent of each other

Ontology
Positivism: Realism
In this study: When the items are analysed they show a true reflection of how they function in reality to the learners

The basic constructs of Positivism are linked to this study and further explained.

Inquiry Aim The aim of the inquiry is to make use of a Rasch analysis to determine how each item functions in the Picture Vocabulary Test. The data will be analyzed and the order of difficulty of the objects as experienced by the learners will be revealed. This will all be shown in an item pathway, a figure in which the items are represented on a vertical axis from the easiest at the base to the most difficult at the top. If certain items do not follow the logical flow according to the Guttman scale they will be identified and an explanation given. On a Guttman scale items are arranged in a

study. The most beneficial factor of Positivism is that it is open to professional inspection and critique (Fischer, 1991).

Through Positivism, science sets out to find the truth about how physical, social and personal worlds are configured. This is done by means of empirical testing and evidence in order for the truth to be claimed. Science is seen as universal, and though different methods are incorporated, the same methodology is used. Science may be considered rational if its truths are seen as similar across cultures (Mathews, 2004, p. 23), but such a claim is highly problematic since cultures vary widely in their understanding of even the basic tenets of reason and science. With the help of a Positivist approach, the items’ performance across three languages was explored to create a better understanding of the roles they play in the assessment.

By making use of statistical procedures and empirical testing to determine how items perform in the Picture Vocabulary Test, a Positivist approach is taken. Simultaneously the level of validity of the Picture Vocabulary Test was also explored. With the help of Rasch analyses, a statistical procedure used in this study (discussed in the following chapter), the validity level was investigated. Rasch analyses can specifically be used to determine the level of construct validity of an assessment (Tennant & Conaghan, 2007).

Reliability and validity form the crux of any measurement since they are important in establishing the credibility and truthfulness of the findings. Both reliability and validity are represented in many types and forms and have multiple meanings (Neuman, 2003).

4.3 RESEARCH METHODOLOGY

Quantitative research was used in this study as described in McMillan and Schumacher (2006). The study was carefully planned and conducted in order to enhance the credibility of the results. An existing general assessment from the UK, which was used to assess Grade 1 learners, was explored so that statistical conclusions could be drawn regarding the data and level of validity. By employing quantitative methods, measurements are made of each item’s difficulty compared to the learner’s ability, thus helping to establish whether the items follow the correct order of difficulty as well as to detect any possible bias that will influence the level of construct validity. Table 4.2 summarises the different instruments used and analyses conducted in order to address each research question.

Table 4.2 Research questions, instrument and analysis

Research question: What barriers to validity used in a Picture Vocabulary Test can be identified from literature?
Instrument: Literature review

Research question: To what extent is a unidimensional trait measured by the Picture Vocabulary Test?
Instrument: Picture Vocabulary Test
Analysis: Rasch analysis; developmental pathway

Research question: To what extent do the items in the Picture Vocabulary Test function the same for the different language groups?
Instrument: Picture Vocabulary Test
Analysis: Differential item functioning

Research question: How can the identified barriers that decrease the level of validity be minimized?
Instrument: Picture Vocabulary Test
Analysis: Reflections on the analysis

4.3.1 The SAMP Sample

SAMP chose the target population of Grade 1 learners speaking Afrikaans, English and Sepedi within Pretoria, Gauteng, South Africa. These languages were selected because they are the most dominant in the Pretoria area and were also the most accessible population for the SAMP project. Multi-phase sampling was used whereby schools were stratified according to medium of instruction. Eight schools were

Picture Vocabulary Test, which is a subtest of the SAMP assessment, was administered to the Grade One learners on a one-to-one basis by the fieldworkers, who were trained to ensure that the assessment was administered in a standard way to all the learners. The assessment took place in the area designated to the fieldworkers by the participating school over a two-day period. The fieldworker fetched each Grade One learner from his or her classroom, then followed the correct protocol by setting the learner at ease before the assessment started. The fieldworker then asked the learner to identify various objects from different pictures used in the Picture Vocabulary Test. The learner received a mark of 1 for each correct answer and 0 for an incorrect answer. Each object was worth one mark, and together these made up the total a learner could achieve per picture. The number of correct answers given by the learner was compared to the total number of objects the learner was asked to identify per picture. The results for the Picture Vocabulary Test were then worked out per learner and captured electronically.
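The dichotomous scoring described above can be sketched as follows. This is an illustration only: the function name, the picture labels and the response values are hypothetical and do not reproduce the actual SAMP data or capturing procedure.

```python
def score_picture(responses):
    """Return (marks obtained, marks available) for one picture."""
    if any(r not in (0, 1) for r in responses):
        raise ValueError("responses must be coded 0 (incorrect) or 1 (correct)")
    return sum(responses), len(responses)


# Hypothetical learner: one list of 0/1 marks per picture, one mark per object.
learner = {
    "picture_1": [1, 1, 0, 1],
    "picture_2": [0, 1, 1],
    "picture_3": [1, 0, 0, 0, 1],
}

totals = {name: score_picture(marks) for name, marks in learner.items()}
for name, (obtained, available) in totals.items():
    print(f"{name}: {obtained} out of {available}")
```

Keeping the marks as 0 and 1 per object, rather than storing only the totals, preserves the item-level data that a Rasch analysis requires.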

The responses were marked in pencil on an optical reader by the fieldworkers administering the test. An optical reader is a form that allows the fieldworker to colour in a circle next to the correct response. If the response is incorrect the circle is not coloured in. Once the participating schools had completed the SAMP assessment, the optical readers were sent to independent data capturers. The optical readers were processed through specialised machines. The circles coloured in pencil allowed for the data to be magnetically screened. The lead in the pencil allows for easy recognition by the machines capturing the data. The data captured from the optical readers was then sent via e-mail to the CEA to be further processed.

4.3.4 Data analysis

“Data analysis is the vehicle used to generate and validate interpretations, formulate inferences, and draw conclusions”, as stated by Scherman (2007, p. 147). The data analysis for this research study followed a quantitative approach, using Rasch analyses (see Section 4.3.6). By following statistical measures of enquiry, exact measurements can be made to determine whether there are any significant differences in the performance of the items used in the test. The inferences made about the results of the learners with regard to the Picture Vocabulary Test can help to determine the level of construct validity.

By making use of Rasch analyses, the functioning of the items of the test can be scientifically investigated. Rasch analyses are quantitative in nature because of the attributes of ‘additivity’ and ‘ordinality’ (Acton, 2003, p. 902). The Rasch model has been in use for the past 40 years and is increasingly used as a research tool, as the adequacy of an instrument and its level of construct validity can be verified with it (Callingham & Bond, 2006; Rasch, n.d.; Tennant & Conaghan, 2007). Analyses of the data made use of Descriptive Statistics (Section 4.3.5), Rasch Analyses (Section 4.3.6), and Differential Item Functioning (Section 4.3.7). These are discussed in the sections that follow.

4.3.5 Descriptive statistics

In order to report on the data analyzed, the mean, mode, median, range of scores, minimum, maximum and standard deviation were calculated. This was done for each object as well as each individual picture. The descriptive statistics facilitated the process of writing about the results of the data that was analyzed (Scherman, 2007). For all the necessary analyses, pathways and graphical representations described in the above sections, a statistical programme, WINSTEPS (Section 4.3.8), was used.
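As an illustration of the descriptive statistics listed above, the following minimal sketch computes them for a small set of hypothetical picture totals using Python's standard library. The scores are invented for the example, and the choice of the population standard deviation is an assumption; the study itself used WINSTEPS.

```python
from statistics import mean, median, multimode, pstdev


def describe(scores):
    """Summarise a list of scores with basic descriptive statistics."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": multimode(scores),  # a list, since scores can have several modes
        "minimum": min(scores),
        "maximum": max(scores),
        "range": max(scores) - min(scores),
        "std_dev": pstdev(scores),  # population standard deviation (assumption)
    }


# Hypothetical totals obtained by six learners on one picture.
picture_totals = [3, 5, 5, 6, 8, 9]
summary = describe(picture_totals)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The same function could be applied per object (a list of 0 and 1 marks across learners) or per picture, mirroring the two levels at which the statistics were reported.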

4.3.6 Rasch Analysis

In the 1960s, Georg Rasch, a Danish mathematician, introduced a simple logistic model to construct objective measures (Boone & Rogan, 2005). The model was designed to overcome the problem of defining the difficulty of an item independently of the subject, and it also determines the ability of the individual independently of the items. Relevant to this study, the data of the Picture Vocabulary Test can be analyzed independently of the subject (Bond & Fox, 2001; Bush & Schumacker, 1993; Linacre, 1993; McCamey, 2002; Tennant & Conaghan, 2007; Waugh, 1999).

The Rasch model can be applied to analyse dichotomous data and polytomous data (Pallant & Tennant, 2007), and various types of questions or items. Dichotomous data can only be right or wrong, assigned the value of 1 or 0 respectively. There are

Statistics’ were used in the study to shed more light on the different objects’ difficulties. Bohlig, Fisher, Masters and Bond (1998) argue that misfitting items are not to be thrown out but rather contemplated and explored for possible reasons.

4.3.6.1 Item Characteristic Curve (ICC)

The probability of the learner responding correctly to the item is expressed through an Item Characteristic Curve (ICC). A correct response to an item is dependent on both the learner’s ability and the item’s characteristics. The ICC is a mathematical function, usually shown as a graph, that relates the learner’s ability to the probability of a correct response given the item’s characteristics. An ICC has two asymptotes: the upper asymptote lies at a probability of 1.0 on the vertical axis, and at the lower end the curve approaches but never reaches 0, as seen in Figure 4.1:

Figure 4.1: Item Characteristic Curve (de Beer, 2004)

The probability of a correct response to an item by the learner is a continually increasing curve (de Beer, 2004). ICCs differ from one another in the horizontal location of the inflection point along the ability axis, which shifts more to the right or to the left. The inflection occurs where the curvature of the ICC changes direction, and its location shows the difficulty level of the item. The ability axis is also known as the theta (θ) axis on an ICC. The horizontal point where the inflection occurs is known as the difficulty level, b-parameter, b-value or item difficulty (Bond & Fox, 2001). If a DIF analysis detects any objects that function differently for one group, ICCs will be generated through the Rasch analysis. The ICCs will then show where the inflection occurs and how the objects function.

The b-parameter reflects the point on the ability scale where the probability of the learner giving a correct response is 0.5 or 50%. The greater the value of the b-parameter, the more difficult the item. Theoretically the b-value ranges from minus infinity to plus infinity (-∞ to +∞), but -2.5 to +2.5 is the typical range, where -2.5 indicates a very easy item and +2.5 a very difficult item.
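The relationship between ability, item difficulty and the probability of a correct response can be made concrete with a short sketch of the dichotomous Rasch model, in which the probability depends only on the difference between the learner's ability (theta) and the item's difficulty (b), both in logits. The function name and the example values are illustrative assumptions.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and item difficulty b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical item of difficulty b = 1.0, evaluated at several ability levels.
for theta in (-2.5, 0.0, 1.0, 2.5):
    print(theta, round(rasch_probability(theta, b=1.0), 3))
```

When the learner's ability equals the item's difficulty the function returns 0.5, which corresponds to the inflection point of the ICC, and for a fixed ability a larger b gives a smaller probability of success.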

4.3.6.2 The One-Parameter Dichotomous Rasch model

Using the Rasch model that follows a Guttman scale will result in some learners being seen as having more ability than others, and there is a greater probability that the learners with high ability will get the easier items correct. If this is not the case, then the assessment is faulty or has a low level of construct validity (Sick, 2008). Rasch analyses generate separate estimates of each item’s difficulty and the learner’s ability. These estimates give the researcher a value relative to every individual’s ability and every item’s difficulty. In other words, a Rasch analysis tells the researcher how the item is functioning relative to the ability being assessed. It also provides indices to determine whether the items are spread out or in ‘clumps’. The items should move up in difficulty at equal levels and not be grouped on one difficulty level. If clumping happens in an assessment, the level of construct validity would be in jeopardy since the items do not follow the true Guttman style, in which each question becomes progressively more difficult (Bond & Fox, 2001).
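The notion of a response pattern that follows the Guttman style can be illustrated with a small sketch that counts Guttman errors, that is, cases where a learner fails an easier item but passes a harder one. This is a simple illustrative check under the assumption that the items are already ordered from easiest to most difficult; it is not the fit statistic computed by Rasch software, and the function name and patterns are hypothetical.

```python
def guttman_errors(responses_by_difficulty):
    """Count pairs where an easier item is failed (0) and a harder item is passed (1).

    The responses must be ordered from the easiest item to the most difficult.
    """
    errors = 0
    for i, easier in enumerate(responses_by_difficulty):
        for harder in responses_by_difficulty[i + 1:]:
            if easier == 0 and harder == 1:
                errors += 1
    return errors


# Hypothetical response patterns over five items, easiest item first.
print(guttman_errors([1, 1, 1, 0, 0]))  # perfect Guttman pattern
print(guttman_errors([1, 0, 1, 0, 1]))  # some unexpected responses
```

A pattern with no errors is perfectly ordered, whereas a high count suggests that the learner's responses do not follow the expected order of item difficulty and may warrant closer inspection.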

The Rasch model further provides an opportunity to examine the responses received from the learners to see if they form a pattern that suits the expected outcome. These response patterns from the assessment are tested against what is expected from the specific assessment. The Rasch model is a powerful tool for determining item ordering (Tennant & Conaghan, 2007, p. 1361). In this study, the response patterns