



















tualized, as the APA Committee on Psychological Tests learned when
before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term
(Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were pub-
tation of construct validity is not "official" and deals with some areas in which the Committee would probably not be unanimous. The present
and elaborate its implications. Identification of construct validity was not an isolated development.
deal of dissatisfaction with conventional notions of validity, and introduced new terms and ideas, but the resulting aggregation of types of
NOTE: The second author worked on this problem in connection with his appointment to the Minnesota Center for Philosophy of Science. We are indebted to the other members of the Center (Herbert Feigl, Michael Scriven, Wilfrid Sellars), and to D. L. Thistlethwaite of the University of Illinois, for their major contributions to our thinking and their suggestions for improving this paper. The paper first appeared in Psychological Bulletin, July 1955, and is reprinted here, with minor alterations, by permission of the editor and of the authors.
CONSTRUCT VALIDITY IN PSYCHOLOGICAL TESTS
validity seems only to have stirred the muddy waters. Portions of the distinctions we shall discuss are implicit in Jenkins' paper, "Validity for What?" (33), Gulliksen's "Intrinsic Validity" (27), Goodenough's dis-
validity" (25), and Mosier's papers on "face validity" and "validity gen-
ment of construct validity as we shall present it.
The categories into which the Recommendations divide validity studies are: predictive validity, concurrent validity, content validity, and construct validity. The first two of these may be considered together as criterion-oriented validation procedures.
is primarily interested in some criterion which he wishes to predict. He administers the test, obtains an independent criterion measure on the
some time after the test is given, he is studying predictive validity. If the test score and criterion score are determined at essentially the same time, he is studying concurrent validity. Concurrent validity is studied when one test is proposed as a substitute for another (for example, when
tion), or a test is shown to correlate with some contemporary criterion (e.g., psychiatric diagnosis).
a sample of a universe in which the investigator is interested. Content
verse of items and sampling systematically within this universe to establish the test. Construct validation is involved whenever a test is to be interpreted
no new scientific approach. Much current research on tests of per-
L. J. Cronbach and P. E. Meehl
Construct validity is not to be identified solely by particular investigative procedures, but by the orientation of the investigator. Criterion-
acceptance of a set of operations as an adequate definition of whatever is to be measured." When an investigator believes that no criterion available to him is fully valid, he perforce becomes interested in construct validity because this is the only way to avoid the "infinite frustration" of relating every criterion to some more ultimate standard (21). In content validation, acceptance of the universe of content as defining the variable to be measured is essential. Construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured. Determining what psychological constructs account for test performance is desirable for almost any test. Thus, although the MMPI was originally established on the basis of empirical discrimination between patient groups and so-called normals (concurrent validity), continuing research has tried to provide a basis for describing the personality associated with each score pattern. Such interpretations permit the clinician to predict performance with respect to criteria which have not yet been employed
We can distinguish among the four types of validity by noting that
current validity, the criterion behavior is of concern to the tester, and he may have no concern whatsoever with the type of behavior exhibited in the test. (An employer does not care if a worker can manipulate blocks, but the score on the block test may predict something he cares about.) Content validity is studied when the tester is concerned with the type of behavior involved in the test performance. Indeed, if the test is a work sample, the behavior represented in the test may be an end in itself. Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test
Construct validation is important at times for every sort of psychological test: aptitude, achievement, interests, and so on. Thurstone's statement is interesting in this connection: In the field of intelligence tests, it used to be common to define validity as the correlation between a test score and some outside criterion. We have reached a stage of sophistication where the test-criterion correlation
is too coarse. It is obsolete. If we attempted to ascertain the validity of a test for the second space-factor, for example, we would have to get judges [to] make reliable judgments about people as to this factor. Ordinarily their [the available judges'] ratings would be of no value as a criterion. Consequently, validity studies in the cognitive functions now depend on criteria of internal consistency ... (60, p. 3).
Construct validity would be involved in answering such questions as: To what extent is this test of intelligence culture-free? Does this test of "interpretation of data" measure reading ability, quantitative reasoning, or response sets? How does a person with A in Strong Accountant, and B in Strong CPA, differ from a person who has these scores reversed?
Example of construct validation procedure. Suppose measure X cor-
tell a student that he has failed a Psychology I exam. Predictive validity
of the experimental and sampling conditions. If someone were to ask, "Isn't there perhaps another way to interpret this correlation?" or "What other kinds of evidence can you bring to support your interpretation?" we would hardly understand what he was asking because no interpretation has been made. These questions become relevant when the correlation is advanced as evidence that "test X measures anxiety proneness." Alternative interpretations are possible; e.g., perhaps the test measures "academic aspiration," in which case we will expect dif-
then reasonable to inquire about other kinds of evidence. Add these facts from further studies: Test X correlates .45 with fra-
amount of intellectual inefficiency induced by painful electric shock, and .68 with the Taylor Anxiety Scale. Mean X score decreases among four diagnosed groups in this order: anxiety state, reactive depression, "normal," and psychopathic personality. And palmar sweat under threat of failure in Psychology I correlates .60 with threat of failure in mathematics. Negative results eliminate competing explanations of the X score; thus, findings of negligible correlations between X and social class, vocational aim, and value-orientation make it fairly safe to reject the suggestion that X measures "academic aspiration." We can have substantial confidence that X does measure anxiety proneness if the
sive pattern, etc. Our proper conclusion is that, from this evidence, the four tests and the psychiatrist all assess some common factor. The asymmetry between the "test" and the so-designated "criterion" arises only because the terminology of predictive validity has become a commonplace in test analysis. In this study where a construct is the central concern, any distinction between the merit of the test and
that the psychiatrist's theory and operations were excellent measures of the attribute.
The proposal to validate constructual interpretations of tests runs counter to suggestions of some others. Spiker and McCandless (57) favor an operational approach. Validation is replaced by compiling statements as to how strongly the test predicts other observed variables of interest. To avoid requiring that each new variable be investigated com-
a new test is demonstrated to predict the scores on an older, well-established test, then an evaluation of the predictive power of the older
only if the two tests correlate so highly that there is negligible reliable variance in either test, independent of the other. Where the correspondence is less close, one must either retain all the separate variables operationally defined or embark on construct validation. The practical user of tests must rely on constructs of some generality to make predictions about new situations. Test X could be used to predict palmar sweating in the face of failure without invoking any
in diverse or even unique situations for which the correlation of test X is unknown. Significant predictions rely on knowledge accumulated
mendations state: It is ordinarily necessary to evaluate construct validity by integrating evidence from many different sources. The problem of construct validation becomes especially acute in the clinical field since for many of the constructs dealt with it is not a question of finding an imperfect criterion but of finding any criterion at all. The psychologist interested in construct validity for clinical devices is concerned with making an estimate of a hypothetical internal process, factor, system, structure, or state and cannot expect to find a clear unitary behavioral criterion. An attempt
This appears to conflict with arguments for specific criteria promi-
statements of the latter character: "It is only as a measure of a specifically defined criterion that a test can be objectively validated at all...
validation. Tests can be profitably interpreted if we "know the relationships between the tested behavior ... and other behavior samples, none of these behavior samples necessarily occupying the preeminent position of a criterion" (p. 75). Factor analysis with several partial criteria might be used to study whether a test measures a postulated "general learning ability." If the data demonstrate specificity of ability instead, such specificity is "useful in its own right in advancing our knowledge of behavior; it should not be construed as a weakness of the tests" (p. 75). We depart from Anastasi at two points. She writes, "The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration." We, however, regard such analysis as a most important type of validation. Second, she refers to "the will-o'-the-wisp of psychological processes which are distinct from performance" (2, p. 77). While we agree that psychological processes are elusive, we are sympathetic to attempts to formulate and clarify constructs which are evidenced by performance but distinct from it. Surely an inductive inference based on a pattern of correlations cannot be dismissed as "pure speculation."
SPECIFIC CRITERIA USED TEMPORARILY: THE "BOOTSTRAPS" EFFECT
Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations co-vary with some other observation which possesses greater reliability or is more intimately correlated with relevant experimental changes than
is the original measure, or both. For example, the notion of temperature
expansion of a mercury column does not have face validity as an index of hotness. But it turns out that (a) there is a statistical relation be-
mercury method with good interobserver agreement; (c) the regularity of observed relations is increased by using the thermometer (e.g., melting points of samples of the same material vary little on the thermometer; we obtain nearly linear relations between mercury measures and pressure of a gas). Finally, (d) a theoretical structure involving unobservable microevents (the kinetic theory) is worked out which explains
ceptual enrichment begins with what in retrospect we see as an extremely fallible "criterion": the human temperature sense. That original
ourselves by our bootstraps, but in a legitimate and fruitful way. Similarly, the Binet scale was first valued because children's scores
this agreement, it would have been discarded along with reaction time and the other measures of ability previously tried. Teacher judgments once constituted the criterion against which the individual intelligence test was validated. But if today a child's IQ is 135 and three of his teachers complain about how stupid he is, we do not conclude that the test has failed. Quite to the contrary, if no error in test procedure can be argued, we treat the test score as a valid statement about an important quality, and define our task as that of finding out what other variables (personality, study skills, etc.) modify achievement or distort teacher judgment.
VALIDATION PROCEDURES
We can use many methods in construct validation. Attention should particularly be drawn to Macfarlane's survey of these methods as they apply to projective devices (41).
directly. Thus Thurstone and Chave validated the Scale for Measuring
Attitude Toward the Church by showing score differences between church members and nonchurchgoers. Churchgoing is not the criterion of attitude, for the purpose of the test is to measure something other than the crude sociological fact of church attendance; on the other hand, failure to find a difference would have seriously challenged the test. Only coarse correspondence between test and group designation is expected. Too great a correspondence between the two would indicate that the test is to some degree invalid, because members of the groups are expected to overlap on the test. Intelligence test items are selected initially on the basis of a correspondence to age, but an item that correlates .95 with age in an elementary school sample would surely be suspect.
measure the same construct, a correlation between them is predicted. (An exception is noted where some second attribute has positive loading in the first test and negative loading in the second test; then a low correlation is expected. This is a testable interpretation provided an
obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. Guilford (26) has discussed the place of factor analysis in construct validation. His statements may be extracted as follows: "The personnel
and practical criteria in a matrix and factor it to identify 'real dimensions of human personality.' A factorial description is exact and stable;
which can be combined to predict complex behaviors." It is clear that factors here function as constructs. Eysenck, in his "criterion analysis" (18), goes farther than Guilford, and shows that factoring can be used explicitly to test hypotheses about constructs. Factors may or may not be weighted with surplus meaning. Certainly when they are regarded as "real dimensions" a great deal of surplus meaning is implied, and the interpreter must shoulder a substantial burden of proof. The alternative view is to regard factors as defining a working reference frame, located in a convenient manner in the "space"
negative evidence on construct validity. A recent analysis of "empathy" tests is perhaps worth citing (14). "Empathy" has been operationally defined in many studies by the ability of a judge to predict what re-
observed briefly. A mathematical argument has shown, however, that the scores depend on several attributes of the judge which enter into his perception of any individual, and that they therefore cannot be interpreted as evidence of his ability to interpret cues offered by particular individuals, or of his intuition.
THE NUMERICAL ESTIMATE OF CONSTRUCT VALIDITY
There is an understandable tendency to seek a "construct validity coefficient." A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis, but since present methods of factor analysis are based on linear relations, more general methods will ultimately be needed to deal with many quantitative problems of construct validation. Rarely will it be possible to estimate definite "construct saturations," because no factor corresponding closely to the construct will be available. One can only hope to set upper and lower bounds to the "load-
tory performance on problems such as Maier's "hatrack" would scarcely be an ideal measure of creativity, but it would be somewhat relevant. If its correlation with the test is .60, this permits a tentative estimate of 36 per cent as a lower bound. (The estimate is tentative because the test might overlap with the irrelevant portion of the laboratory measure.) The saturation seems to lie between 36 and 84 per cent; a cumulation of studies would provide better limits.
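The arithmetic behind these bounds can be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure given in the paper: it assumes that the squared correlation with a relevant measure gives a tentative lower bound on the construct saturation, and that one minus the squared correlation with an irrelevant measure gives an upper bound. The function name `saturation_bounds` is hypothetical, and the .40 correlation with an irrelevant measure is an assumed value, chosen only because it reproduces the 84 per cent figure quoted above.

```python
def saturation_bounds(r_relevant, r_irrelevant):
    """Return tentative (lower, upper) bounds on construct saturation.

    r_relevant   -- correlation of the test with a measure judged relevant
                    to the construct (e.g., a laboratory task)
    r_irrelevant -- correlation of the test with a measure judged irrelevant
                    to the construct (an assumed input for this sketch)
    """
    for r in (r_relevant, r_irrelevant):
        if not -1.0 <= r <= 1.0:
            raise ValueError("a correlation must lie between -1 and 1")
    # Variance shared with the relevant measure: tentative lower bound.
    lower = r_relevant ** 2
    # Variance not shared with the irrelevant measure: upper bound.
    upper = 1.0 - r_irrelevant ** 2
    return lower, upper


# .60 is the correlation quoted in the text; .40 is an assumed value.
lower, upper = saturation_bounds(0.60, 0.40)
print(f"saturation lies between {lower:.0%} and {upper:.0%}")
```

As the text cautions, the lower bound remains tentative, because the test may overlap with the irrelevant portion of the relevant measure; the sketch does not correct for that.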
is not to conclude that the test "is valid" for measuring the construct
The Logic of Construct Validation
Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim. The philosophy of science which we believe does most justice to actual scientific practice will now be briefly and dogmatically set forth. Readers interested in further study of the philosophical underpinning are referred
THE NOMOLOGICAL NET
The fundamental principles are these: 1. Scientifically speaking, to "make clear what something is" means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a nomological net-
"laws" may be statistical or deterministic.
is that it occur in a nomological net, at least some of whose laws involve
plicitly define the construct, and the (derived) nomologicals of type a. These latter propositions permit predictions about events. The construct is not "reduced" to the observations, but only combined with other constructs in the net to make predictions about observables.
rating the nomological network in which it occurs, or of increasing the definiteness of the components. At least in the early history of a con-
to the theory is justified if it generates nomologicals that are confirmed
predict the same observations. When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network. That is, there may be alternative constructs or ways of organizing the net which for the time being are equally defensible.
logical net tie them to the same construct variable. Our confidence in this identification depends upon the amount of inductive support we have for the regions of the net involved. It is not necessary that a direct observational comparison of the two operations be made; we may be content with an intra-network proof indicating that the two operations yield estimates of the same network-defined quantity. Thus, physicists are content to speak of the "temperature" of the sun and the "temperature" of a gas at room temperature even though the test operations are nonoverlapping because this identification makes theoretical sense. With these statements of scientific methodology in mind, we return to the specific problem of construct validity as applied to psychological tests. The preceding guide rules should reassure the "toughminded," who fear that allowing construct validation opens the door to nonconfirm-
case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences. A rigorous (though perhaps probabilistic) chain of inference is required to establish a test as a measure of a construct. To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist. When a construct is fairly new, there may be
research proceeds, the construct sends out roots in many directions, which attach it to more and more facts or other constructs. Thus the
electron has more accepted properties than the neutrino; numerical
"Acceptance," which was critical in criterion-oriented and content validities, has now appeared in construct validity. Unless substantially the same nomological net is accepted by the several users of the con-
overt assault on others, and B's usage includes repressed hostile reactions,
A that the test does not. Hence, the investigator who proposes to estab-
sufficiently clearly so that others can accept or reject it (cf. 41, p. 406). A consumer of the test who rejects the author's theory cannot accept the author's validation. He must validate the test for himself, if he
of them concern the amount of "theory," in any high-level sense of that word, which enters into a construct-defining network of laws or
always has a very elaborate theoretical network, rich in hypothetical processes or entities. Constructs as inductive summaries. In the early stages of development of a construct or even at more advanced stages when our orienta-
formulated entirely in terms of descriptive (observational) dimensions although not all of the relevant observations have actually been made.
sense that it purports to characterize the behavior facets which belong
erates predictions about hitherto unsampled regions of the phenotypic space. Even though no unobservables or high-order theoretical constructs are introduced, an element of inductive extrapolation appears in the claim that a cluster including some elements not-yet-observed has been identified. Since, as in any sorting or abstracting task involving a finite set of complex elements, several nonequivalent bases of categorization are available, the investigator may choose a hypothesis which generates erroneous predictions. The failure of a supposed, hitherto untried, mem-
directly to our second important qualification upon the network schema. The idealized picture is one of a tidy set of postulates which jointly entail the desired theorems; since some of the theorems are coordinated to the observation base, the system constitutes an implicit definition of the theoretical primitives and gives them an indirect empirical meaning. In practice, of course, even the most advanced physical sciences only approximate this ideal. Questions of "categoricalness" and the like, such as logicians raise about pure calculi, are hardly even statable for empirical networks. (What, for example, would be the desiderata of a "well-formed formula" in molar behavior theory?) Psychology works with crude, half-explicit formulations. We do not worry about such advanced formal questions as "whether all molar-behavior statements are decidable by appeal to the postulates" because we know that no existing theoretical network suffices to predict even the known descriptive laws. Nevertheless, the sketch of a network is there; if it were not, we would not be saying anything intelligible about our constructs. We do not have the rigorous implicit definitions of formal calculi (which still, be it noted, usually permit of a multiplicity of interpretations). Yet the vague, avowedly incomplete network still gives the constructs whatever meaning they do have. When the network is very incomplete, having many strands missing entirely and some constructs tied in only by tenuous threads, then the "implicit definition" of these constructs is disturbingly loose; one might say that the meaning of the constructs is underdetermined. Since the meaning of theoretical constructs is set forth by stating the laws in which they occur, our incomplete knowledge of the laws of
all of the laws involving it; meanwhile, since we are in the process of discovering these laws, we do not yet know precisely what anxiety is.
Conclusions Regarding the Network after Experimentation
the construct is inserted into the accepted network. The network then generates a testable prediction about the relation of the test scores to certain other variables, and the investigator gathers data. If prediction and result are in harmony, he can retain his belief that the test measures
the construct. The construct is at best adopted, never demonstrated to be "correct." We do not first "prove" the theory, and then validate the test, nor conversely. In any probable inductive type of inference from a pattern of observations, we examine the relation between the total network of theory and observations. The system involves propositions relating test to construct, construct to other constructs, and finally relating some of these constructs to observables. In ongoing research the chain of in-
diagram showing the numerous inferences required in validating a prediction from assessment techniques, where theories about the criterion situation are as integral a part of the prediction as are the test data.
leading to that prediction. Traditionally the proposition claiming to interpret the test has been set apart as the hypothesis being tested, but actually the evidence is significant for all parts of the chain. If the prediction is not confirmed, any link in the chain may be wrong. A theoretical network can be divided into subtheories used in making
subtheory are of course evidence in favor of that theory. Such a subtheory may be so well confirmed by voluminous and diverse evidence
the test's validity. If the theory, combined with a proposed test interpre-
On the other hand, the accumulated evidence for a test's construct
us to modify the subtheory employing the construct rather than deny the claim that the test measures the construct. Most cases in psychology today lie somewhere between these extremes. Thus, suppose we fail to find a greater incidence of "homosexual signs" in the Rorschach records of paranoid patients. Which is more strongly disconfirmed: the Rorschach signs or the orthodox theory of paranoia? The negative finding shows the bridge between the two to be undependable, but this is all we can say. The bridge cannot be used unless one end is placed on solider ground. The investigator must decide which end it is best to relocate. Numerous successful predictions dealing with phenotypically diverse "criteria" give greater weight to the claim of construct validity than do
fewer predictions, or predictions involving very similar behaviors. In arriving at diverse predictions, the hypothesis of test validity is connected each time to a subnetwork largely independent of the portion previously used. Success of these derivations testifies to the inductive power of the test-validity statement, and renders it unlikely that an equally effective alternative can be offered.
IMPLICATIONS OF NEGATIVE EVIDENCE
The investigator whose prediction and data are discordant must make
practice the distinction is worth making.)
a reasonable possibility, his proper response is to perform an adequate study, meanwhile making no report. When faced with the other two alternatives, he may decide that his test does not measure the construct adequately. Following that decision, he will perhaps prepare and validate a new test. Any rescoring or new interpretative procedure for the original instrument, like a new test, requires validation by means of a fresh
The investigator may regard interpretation 2 as more likely to lead to eventual advances. It is legitimate for the investigator to call the network defining the construct into question, if he has confidence in the test. Should the investigator decide that some step in the network is unsound, he may be able to invent an alternative network. Perhaps he modifies the network by splitting a concept into two or more portions, e.g., by designating types of anxiety, or perhaps he specifies added
modifies the theory in such a manner, he is now required to gather a
are consistent with the modified network, he is free from the fear that his nomologicals were gerrymandered to fit the peculiarities of his first
because his test results behave as predicted.
The choice among alternatives, like any strategic decision, is a gamble as to which course of action is the best investment of effort.
is co!lfirmed by prior data, and how well the modifications fit available observations. Is it worth while to modify the test in the hope that it will fit the construct? That depends on how much evidence there is- apart from this abortive experiment-to support the hope, and also on how much it is worth to the investigator's ego to salvage the test. TI1e choice among alternatives is a matter of research planning and no routine policy can be stated.
predictions and subsequent data. When the evidence from a proper investigation of a published t est is essentially negative, it should be reported as a stop sign to discourage use of the test pending a recon-
the test has not been published, it should be restricted to research use
await the results of the investigator's gamble, with confidence that proper application of the scientific method will ultimately tell whether the test has value. Until the evidence is in, he has no justification for employing the test as a basis for terminal decisions. The test may serve, at best, only as a source of suggestions about individuals to be confirmed
There are two perspectives in test validation. From the viewpoint of
predictions made from such measures are consistent with the best available theory of the trait. In the view of the test developer, however, both the test and the theory are under scrutiny. He is free to say to himself
the theory." This way lies delusion, unless he continues his research using a better theory.
REPORTING OF POSITIVE RESULTS

The test developer who finds positive correspondence between his proposed interpretation and data is expected to report the basis for
various criteria. Thus, while Strong's Vocational Interest Blank is de·
expected to be satisfied in an occupation if he has interests common to men now happy in the occupation. When Strong finds that those
proposed use of the Engineer score (predictive validity). Since the evidence is consistent with the theory on which all the test keys were
have predictive validity. How strong is this presumption? Not very, from the viewpoint of the traditional skepticism of science. Engineering in-
work are still unstable. A claim cannot be made that the whole Strong approach is valid just because one score shows predictive validity. But if thirty interest scores were investigated longitudinally and all of them showed the type of validity predicted by Strong's theory, we would indeed be caviling to say that this evidence gives no confidence in the long-range validity of the thirty-first score. Confidence in a theory is increased as more relevant evidence confirms it, but it is always possible that tomorrow's investigation will render the theory obsolete. The Technical Recommendations suggest a rule of
predictive validities for all possible criteria; similarly, no developer can run all possible experimental tests of his proposed interpretation. But the recommendation is more subtle than advice that a lot of validation is better than a little. Consider the Rorschach test. It is used for many inferences, made
the simple unrationalized correspondences presumed to exist between certain signs and psychiatric diagnoses. Validating such a sign does nothing to substantiate Rorschach theory. For other Rorschach formulas an explicit a priori rationale exists (for instance, high F per cent in-
shows correspondence with criteria, its rationale is supported just a little. At a still higher level of abstraction, a considerable body of
different constructs. As evidence cumulates, one should be able to decide what specific inference-making chains within this system can be depended upon. One should also be able to conclude, or deny, that so much of the system has stood up under test that one has some confidence in even the untested lines in the network. In addition to relatively delimited nomological networks surrounding
theory of perception and set, or a theory stated in terms of learned habit patterns. Whatever the theory of the interpreter, whenever he validates an inference from the system, he obtains some reason for added confidence in his overriding system. His total theory is not tested, however, by experiments dealing with only one limited set of constructs. The test developer must investigate far-separated, independent sections of the network. The more diversified the predictions the system is re-
parts of the system will later prove faulty. Here we begin to glimpse a logic to defend the judgment that the test and its whole interpretative system is valid at some level of confidence. There are enthusiasts who would conclude from the foregoing paragraphs that since there is some evidence of correct, diverse predictions made from the Rorschach, the test as a whole can now be accepted as validated. This conclusion overlooks the negative evidence. Just one finding contrary to expectation, based on sound research, is sufficient to wash a whole theoretical structure away. Perhaps the remains can be sal-
is sufficient negative evidence to prevent acceptance of the Rorschach and its accompanying interpretative structures as a whole. So long as any aspects of the overriding theory stated for the test have been disconfirmed, this structure must be rebuilt.
would interpret the personality "globally." They may argue that a test is best validated in matching studies. Without going into detailed questions of matching methodology, we can ask whether such a study validates the nomological network "as a whole." The judge does employ some network in arriving at his conception of his subject, integrating
REFERENCES

1. American Psychological Association. Ethical Standards of Psychologists. Washington, D.C.: Amer. Psychological Assn., 1953.