Understanding Construct Validity in Psychological Tests: A Historical Perspective

An historical account of the concept of construct validity in psychological tests. The authors discuss the limitations of traditional validation methods and introduce construct validity as an alternative approach. They explore the different types of construct validity and provide examples of its application. The document also touches upon the importance of integrating evidence from various sources and the role of mathematical analysis in construct validation.




L. J. CRONBACH and P. E. MEEHL

Construct Validity in Psychological Tests

Validation of psychological tests has not yet been adequately conceptualized, as the APA Committee on Psychological Tests learned when it undertook (1950-54) to specify what qualities should be investigated before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term construct validity.* This idea was first formulated by a subcommittee (Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were published in the Technical Recommendations (59). The present interpretation of construct validity is not "official" and deals with some areas in which the Committee would probably not be unanimous. The present writers are solely responsible for this attempt to explain the concept and elaborate its implications.

Identification of construct validity was not an isolated development. Writers on validity during the preceding decade had shown a great deal of dissatisfaction with conventional notions of validity, and introduced new terms and ideas, but the resulting aggregation of types of validity seems only to have stirred the muddy waters. Portions of the distinctions we shall discuss are implicit in Jenkins' paper, "Validity for What?" (33), Gulliksen's "Intrinsic Validity" (27), Goodenough's distinction between tests as "signs" and "samples" (22), Cronbach's separation of "logical" and "empirical" validity (11), Guilford's "factorial validity" (25), and Mosier's papers on "face validity" and "validity generalization" (49, 50). Helen Peak (52) comes close to an explicit statement of construct validity as we shall present it.

* Referred to in a preliminary report (58) as congruent validity.

NOTE: The second author worked on this problem in connection with his appointment to the Minnesota Center for Philosophy of Science. We are indebted to the other members of the Center (Herbert Feigl, Michael Scriven, Wilfrid Sellars), and to D. L. Thistlethwaite of the University of Illinois, for their major contributions to our thinking and their suggestions for improving this paper. The paper first appeared in Psychological Bulletin, July 1955, and is reprinted here, with minor alterations, by permission of the editor and of the authors.

Four Types of Validation

The categories into which the Recommendations divide validity studies are: predictive validity, concurrent validity, content validity, and construct validity. The first two of these may be considered together as criterion-oriented validation procedures.

The pattern of a criterion-oriented study is familiar. The investigator is primarily interested in some criterion which he wishes to predict. He administers the test, obtains an independent criterion measure on the same subjects, and computes a correlation. If the criterion is obtained some time after the test is given, he is studying predictive validity. If the test score and criterion score are determined at essentially the same time, he is studying concurrent validity. Concurrent validity is studied when one test is proposed as a substitute for another (for example, when a multiple-choice form of spelling test is substituted for taking dictation), or a test is shown to correlate with some contemporary criterion (e.g., psychiatric diagnosis).
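The computation at the heart of a criterion-oriented study can be sketched in a few lines. The following is a minimal illustration, not a procedure taken from the Recommendations: the function name `pearson_r` and the six pairs of scores are hypothetical, and the coefficient is the ordinary product-moment correlation. Whether the result is read as predictive or concurrent validity depends only on when the criterion was collected, not on the arithmetic.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)

# Hypothetical data: test scores and an independent criterion measure
# obtained on the same six subjects.
test_scores = [12, 15, 9, 20, 17, 11]
criterion = [30, 34, 25, 41, 33, 29]

validity_coefficient = pearson_r(test_scores, criterion)
print(f"validity coefficient: {validity_coefficient:.2f}")
```

The coefficient alone says nothing about what the test measures; it only summarizes how well one set of scores orders subjects in the same way as the other.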

Content validity is established by showing that the test items are a sample of a universe in which the investigator is interested. Content validity is ordinarily to be established deductively, by defining a universe of items and sampling systematically within this universe to establish the test.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not "operationally defined." The problem faced by the investigator is, "What constructs account for variance in test performance?" Construct validity calls for no new scientific approach. Much current research on tests of personality (9) is construct validation, usually without the benefit of a clear formulation of this process.

Construct validity is not to be identified solely by particular investigative procedures, but by the orientation of the investigator. Criterion-oriented validity, as Bechtoldt emphasizes (3, p. 1245), "involves the acceptance of a set of operations as an adequate definition of whatever is to be measured." When an investigator believes that no criterion available to him is fully valid, he perforce becomes interested in construct validity because this is the only way to avoid the "infinite frustration" of relating every criterion to some more ultimate standard (21). In content validation, acceptance of the universe of content as defining the variable to be measured is essential. Construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured. Determining what psychological constructs account for test performance is desirable for almost any test. Thus, although the MMPI was originally established on the basis of empirical discrimination between patient groups and so-called normals (concurrent validity), continuing research has tried to provide a basis for describing the personality associated with each score pattern. Such interpretations permit the clinician to predict performance with respect to criteria which have not yet been employed in empirical validation studies (cf. 46, pp. 49-50, 110-11).

We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion. In predictive or concurrent validity, the criterion behavior is of concern to the tester, and he may have no concern whatsoever with the type of behavior exhibited in the test. (An employer does not care if a worker can manipulate blocks, but the score on the block test may predict something he cares about.) Content validity is studied when the tester is concerned with the type of behavior involved in the test performance. Indeed, if the test is a work sample, the behavior represented in the test may be an end in itself. Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the scores on the criteria (59, p. 14).

Construct validation is important at times for every sort of psychological test: aptitude, achievement, interests, and so on. Thurstone's statement is interesting in this connection:

In the field of intelligence tests, it used to be common to define validity as the correlation between a test score and some outside criterion. We have reached a stage of sophistication where the test-criterion correlation is too coarse. It is obsolete. If we attempted to ascertain the validity of a test for the second space-factor, for example, we would have to get judges [to] make reliable judgments about people as to this factor. Ordinarily their [the available judges'] ratings would be of no value as a criterion. Consequently, validity studies in the cognitive functions now depend on criteria of internal consistency ... (60, p. 3).

Construct validity would be involved in answering such questions as: To what extent is this test of intelligence culture-free? Does this test of "interpretation of data" measure reading ability, quantitative reasoning, or response sets? How does a person with A in Strong Accountant, and B in Strong CPA, differ from a person who has these scores reversed?

Example of construct validation procedure. Suppose measure X correlates .50 with Y, the amount of palmar sweating induced when we tell a student that he has failed a Psychology I exam. Predictive validity of X for Y is adequately described by the coefficient, and a statement of the experimental and sampling conditions. If someone were to ask, "Isn't there perhaps another way to interpret this correlation?" or "What other kinds of evidence can you bring to support your interpretation?" we would hardly understand what he was asking because no interpretation has been made. These questions become relevant when the correlation is advanced as evidence that "test X measures anxiety proneness." Alternative interpretations are possible; e.g., perhaps the test measures "academic aspiration," in which case we will expect different results if we induce palmar sweating by economic threat. It is then reasonable to inquire about other kinds of evidence.

Add these facts from further studies: Test X correlates .45 with fraternity brothers' ratings on "tenseness." Test X correlates .55 with amount of intellectual inefficiency induced by painful electric shock, and .68 with the Taylor Anxiety Scale. Mean X score decreases among four diagnosed groups in this order: anxiety state, reactive depression, "normal," and psychopathic personality. And palmar sweat under threat of failure in Psychology I correlates .60 with threat of failure in mathematics. Negative results eliminate competing explanations of the X score; thus, findings of negligible correlations between X and social class, vocational aim, and value-orientation make it fairly safe to reject the suggestion that X measures "academic aspiration." We can have substantial confidence that X does measure anxiety proneness if the [...]

[...] sive pattern, etc. Our proper conclusion is that, from this evidence, the four tests and the psychiatrist all assess some common factor. The asymmetry between the "test" and the so-designated "criterion" arises only because the terminology of predictive validity has become a commonplace in test analysis. In this study where a construct is the central concern, any distinction between the merit of the test and criterion variables would be justified only if it had already been shown that the psychiatrist's theory and operations were excellent measures of the attribute.

Inadequacy of Validation in Terms of Specific Criteria

The proposal to validate constructual interpretations of tests runs counter to suggestions of some others. Spiker and McCandless (57) favor an operational approach. Validation is replaced by compiling statements as to how strongly the test predicts other observed variables of interest. To avoid requiring that each new variable be investigated completely by itself, they allow two variables to collapse into one whenever the properties of the operationally defined measures are the same: "If a new test is demonstrated to predict the scores on an older, well-established test, then an evaluation of the predictive power of the older test may be used for the new one." But accurate inferences are possible only if the two tests correlate so highly that there is negligible reliable variance in either test, independent of the other. Where the correspondence is less close, one must either retain all the separate variables operationally defined or embark on construct validation.

The practical user of tests must rely on constructs of some generality to make predictions about new situations. Test X could be used to predict palmar sweating in the face of failure without invoking any construct, but a counselor is more likely to be asked to forecast behavior in diverse or even unique situations for which the correlation of test X is unknown. Significant predictions rely on knowledge accumulated around the generalized construct of anxiety. The Technical Recommendations state:

It is ordinarily necessary to evaluate construct validity by integrating evidence from many different sources. The problem of construct validation becomes especially acute in the clinical field since for many of the constructs dealt with it is not a question of finding an imperfect criterion but of finding any criterion at all. The psychologist interested in construct validity for clinical devices is concerned with making an estimate of a hypothetical internal process, factor, system, structure, or state and cannot expect to find a clear unitary behavioral criterion. An attempt to identify any one criterion measure or any composite as the criterion aimed at is, however, usually unwarranted (59, pp. 14-15).

This appears to conflict with arguments for specific criteria prominent at places in the testing literature. Thus Anastasi (2) makes many statements of the latter character: "It is only as a measure of a specifically defined criterion that a test can be objectively validated at all ... To claim that a test measures anything over and above its criterion is pure speculation" (p. 67). Yet elsewhere this article supports construct validation. Tests can be profitably interpreted if we "know the relationships between the tested behavior ... and other behavior samples, none of these behavior samples necessarily occupying the preeminent position of a criterion" (p. 75). Factor analysis with several partial criteria might be used to study whether a test measures a postulated "general learning ability." If the data demonstrate specificity of ability instead, such specificity is "useful in its own right in advancing our knowledge of behavior; it should not be construed as a weakness of the tests" (p. 75).

We depart from Anastasi at two points. She writes, "The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration." We, however, regard such analysis as a most important type of validation. Second, she refers to "the will-o'-the-wisp of psychological processes which are distinct from performance" (2, p. 77). While we agree that psychological processes are elusive, we are sympathetic to attempts to formulate and clarify constructs which are evidenced by performance but distinct from it. Surely an inductive inference based on a pattern of correlations cannot be dismissed as "pure speculation."

SPECIFIC CRITERIA USED TEMPORARILY: THE "BOOTSTRAPS" EFFECT

Even when a test is constructed on the basis of a specific criterion, it may ultimately be judged to have greater construct validity than the criterion. We start with a vague concept which we associate with certain observations. We then discover empirically that these observations co-vary with some other observation which possesses greater reliability or is more intimately correlated with relevant experimental changes than is the original measure, or both. For example, the notion of temperature arises because some objects feel hotter to the touch than others. The expansion of a mercury column does not have face validity as an index of hotness. But it turns out that (a) there is a statistical relation between expansion and sensed temperature; (b) observers employ the mercury method with good interobserver agreement; (c) the regularity of observed relations is increased by using the thermometer (e.g., melting points of samples of the same material vary little on the thermometer; we obtain nearly linear relations between mercury measures and pressure of a gas). Finally, (d) a theoretical structure involving unobservable microevents (the kinetic theory) is worked out which explains the relation of mercury expansion to heat. This whole process of conceptual enrichment begins with what in retrospect we see as an extremely fallible "criterion": the human temperature sense. That original criterion has now been relegated to a peripheral position. We have lifted ourselves by our bootstraps, but in a legitimate and fruitful way.

Similarly, the Binet scale was first valued because children's scores tended to agree with judgments by schoolteachers. If it had not shown this agreement, it would have been discarded along with reaction time and the other measures of ability previously tried. Teacher judgments once constituted the criterion against which the individual intelligence test was validated. But if today a child's IQ is 135 and three of his teachers complain about how stupid he is, we do not conclude that the test has failed. Quite to the contrary, if no error in test procedure can be argued, we treat the test score as a valid statement about an important quality, and define our task as that of finding out what other variables (personality, study skills, etc.) modify achievement or distort teacher judgment.

Experimentation to Investigate Construct Validity

VALIDATION PROCEDURES

We can use many methods in construct validation. Attention should particularly be drawn to Macfarlane's survey of these methods as they apply to projective devices (41).

Group differences. If our understanding of a construct leads us to expect two groups to differ on the test, this expectation may be tested directly. Thus Thurstone and Chave validated the Scale for Measuring Attitude Toward the Church by showing score differences between church members and nonchurchgoers. Churchgoing is not the criterion of attitude, for the purpose of the test is to measure something other than the crude sociological fact of church attendance; on the other hand, failure to find a difference would have seriously challenged the test.

Only coarse correspondence between test and group designation is expected. Too great a correspondence between the two would indicate that the test is to some degree invalid, because members of the groups are expected to overlap on the test. Intelligence test items are selected initially on the basis of a correspondence to age, but an item that correlates .95 with age in an elementary school sample would surely be suspect.
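The reasoning about group differences can be made concrete with a small sketch. It is an illustration under stated assumptions, not a method from the paper: the scores are invented, the function names are hypothetical, and the two cutoffs (`too_low`, `too_high`) are arbitrary values chosen only to show the logic that both a missing difference and a near-perfect separation count against the test. The statistic used is the point-biserial correlation between test score and group label.

```python
from math import sqrt

def point_biserial(group_0, group_1):
    """Correlation between test score and a 0/1 group label.

    Positive values mean group_1 scores higher than group_0.
    """
    n0, n1 = len(group_0), len(group_1)
    if n0 == 0 or n1 == 0:
        raise ValueError("both groups must be non-empty")
    scores = list(group_0) + list(group_1)
    n = n0 + n1
    mean_all = sum(scores) / n
    # Standard deviation of all scores pooled (divisor n, not n - 1).
    sd_all = sqrt(sum((s - mean_all) ** 2 for s in scores) / n)
    if sd_all == 0:
        raise ValueError("all scores are identical")
    mean_0 = sum(group_0) / n0
    mean_1 = sum(group_1) / n1
    return (mean_1 - mean_0) / sd_all * sqrt(n0 * n1) / n

def appraise_group_difference(r, too_low=0.20, too_high=0.90):
    """Rough reading of r; both cutoffs are illustrative assumptions."""
    if r < too_low:
        return "challenges the test"
    if r > too_high:
        return "suspect: test may only restate group membership"
    return "consistent with the construct"

# Hypothetical attitude-scale scores.
nonchurchgoers = [5, 6, 4, 7, 8, 3]
church_members = [8, 9, 7, 10, 6, 9]

r_groups = point_biserial(nonchurchgoers, church_members)
verdict = appraise_group_difference(r_groups)
print(f"r = {r_groups:.2f}: {verdict}")
```

With these invented scores the groups differ but overlap, which is the coarse correspondence the argument calls for.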

Correlation matrices and factor analysis. If two tests are presumed to measure the same construct, a correlation between them is predicted. (An exception is noted where some second attribute has positive loading in the first test and negative loading in the second test; then a low correlation is expected. This is a testable interpretation provided an external measure of either the first or the second variable exists.) If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies.

Guilford (26) has discussed the place of factor analysis in construct validation. His statements may be extracted as follows: "The personnel psychologist wishes to know 'why his tests are valid.' He can place tests and practical criteria in a matrix and factor it to identify 'real dimensions of human personality.' A factorial description is exact and stable; it is economical in explanation; it leads to the creation of pure tests which can be combined to predict complex behaviors." It is clear that factors here function as constructs. Eysenck, in his "criterion analysis" (18), goes farther than Guilford, and shows that factoring can be used explicitly to test hypotheses about constructs.

Factors may or may not be weighted with surplus meaning. Certainly when they are regarded as "real dimensions" a great deal of surplus meaning is implied, and the interpreter must shoulder a substantial burden of proof. The alternative view is to regard factors as defining a working reference frame, located in a convenient manner in the "space" defined by all behaviors of a given type. Which set of factors from a [...]

[...] negative evidence on construct validity. A recent analysis of "empathy" tests is perhaps worth citing (14). "Empathy" has been operationally defined in many studies by the ability of a judge to predict what responses will be given on some questionnaire by a subject he has observed briefly. A mathematical argument has shown, however, that the scores depend on several attributes of the judge which enter into his perception of any individual, and that they therefore cannot be interpreted as evidence of his ability to interpret cues offered by particular individuals, or of his intuition.

THE NUMERICAL ESTIMATE OF CONSTRUCT VALIDITY

There is an understandable tendency to seek a "construct validity coefficient." A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis, but since present methods of factor analysis are based on linear relations, more general methods will ultimately be needed to deal with many quantitative problems of construct validation.

Rarely will it be possible to estimate definite "construct saturations," because no factor corresponding closely to the construct will be available. One can only hope to set upper and lower bounds to the "loading." If "creativity" is defined as something independent of knowledge, then a correlation of .40 between a presumed test of creativity and a test of arithmetic knowledge would indicate that at least 16 per cent of the reliable test variance is irrelevant to creativity as defined. Laboratory performance on problems such as Maier's "hatrack" would scarcely be an ideal measure of creativity, but it would be somewhat relevant. If its correlation with the test is .60, this permits a tentative estimate of 36 per cent as a lower bound. (The estimate is tentative because the test might overlap with the irrelevant portion of the laboratory measure.) The saturation seems to lie between 36 and 84 per cent; a cumulation of studies would provide better limits.
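The arithmetic behind these bounds is simply the squaring of the two correlations. A minimal sketch follows; the function name `saturation_bounds` and its parameter names are hypothetical, and the calculation assumes, as the passage does, that one measure is wholly irrelevant to the construct and the other at least partly relevant to it.

```python
def saturation_bounds(r_irrelevant, r_relevant):
    """Bounds on the share of reliable test variance tied to the construct.

    r_irrelevant: correlation of the test with a measure defined as
        independent of the construct (here, arithmetic knowledge).
    r_relevant: correlation of the test with a measure thought to be
        at least partly relevant (here, the laboratory problem).
    """
    lower = r_relevant ** 2          # variance shared with the relevant measure
    upper = 1.0 - r_irrelevant ** 2  # what remains after the irrelevant share
    if lower > upper:
        raise ValueError("the two correlations imply inconsistent bounds")
    return lower, upper

# The correlations given in the text: .40 with arithmetic, .60 with "hatrack".
lower, upper = saturation_bounds(0.40, 0.60)
print(f"saturation between {lower:.0%} and {upper:.0%}")
```

As the parenthetical caution in the text notes, the lower bound is only tentative, so the interval is better read as a working estimate than as a confidence statement.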

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (35, p. 284). The problem is not to conclude that the test "is valid" for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have.

The Logic of Construct Validation

Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim. The philosophy of science which we believe does most justice to actual scientific practice will now be briefly and dogmatically set forth. Readers interested in further study of the philosophical underpinning are referred to the works by Braithwaite (6, especially Chapter III), Carnap (7; 8, pp. 56-69), Pap (51), Sellars (55, 56), Feigl (19, 20), Beck (4), Kneale (37, pp. 92-110), Hempel (29; 30, §7).

THE NOMOLOGICAL NET

The fundamental principles are these:

1. Scientifically speaking, to "make clear what something is" means to set forth the laws in which it occurs. We shall refer to the interlocking system of laws which constitute a theory as a nomological network.

2. The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These "laws" may be statistical or deterministic.

3. A necessary condition for a construct to be scientifically admissible is that it occur in a nomological net, at least some of whose laws involve observables. Admissible constructs may be remote from observation, i.e., a long derivation may intervene between the nomologicals which implicitly define the construct, and the (derived) nomologicals of type a. These latter propositions permit predictions about events. The construct is not "reduced" to the observations, but only combined with other constructs in the net to make predictions about observables.

4. "Learning more about" a theoretical construct is a matter of elaborating the nomological network in which it occurs, or of increasing the definiteness of the components. At least in the early history of a construct the network will be limited, and the construct will as yet have few connections.

5. An enrichment of the net such as adding a construct or a relation to the theory is justified if it generates nomologicals that are confirmed by observation or if it reduces the number of nomologicals required to predict the same observations. When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network. That is, there may be alternative constructs or ways of organizing the net which for the time being are equally defensible.

6. We can say that "operations" which are qualitatively very different "overlap" or "measure the same thing" if their positions in the nomological net tie them to the same construct variable. Our confidence in this identification depends upon the amount of inductive support we have for the regions of the net involved. It is not necessary that a direct observational comparison of the two operations be made; we may be content with an intra-network proof indicating that the two operations yield estimates of the same network-defined quantity. Thus, physicists are content to speak of the "temperature" of the sun and the "temperature" of a gas at room temperature even though the test operations are nonoverlapping because this identification makes theoretical sense.

With these statements of scientific methodology in mind, we return to the specific problem of construct validity as applied to psychological tests. The preceding guide rules should reassure the "toughminded," who fear that allowing construct validation opens the door to nonconfirmable test claims. The answer is that unless the network makes contact with observations, and exhibits explicit, public steps of inference, construct validation cannot be claimed. An admissible psychological construct must be behavior-relevant (59, p. 15). For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences.

A rigorous (though perhaps probabilistic) chain of inference is required to establish a test as a measure of a construct. To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist. When a construct is fairly new, there may be

few specifiable associations by which to pin down the concept. As research proceeds, the construct sends out roots in many directions, which attach it to more and more facts or other constructs. Thus the electron has more accepted properties than the neutrino; numerical ability has more than the second space factor.

"Acceptance," which was critical in criterion-oriented and content validities, has now appeared in construct validity. Unless substantially the same nomological net is accepted by the several users of the construct, public validation is impossible. If A uses aggressiveness to mean overt assault on others, and B's usage includes repressed hostile reactions, evidence which convinces B that a test measures aggressiveness convinces A that the test does not. Hence, the investigator who proposes to establish a test as a measure of a construct must specify his network or theory sufficiently clearly so that others can accept or reject it (cf. 41, p. 406). A consumer of the test who rejects the author's theory cannot accept the author's validation. He must validate the test for himself, if he wishes to show that it represents the construct as he defines it.

Two general qualifications are in order with reference to the methodological principles 1-6 set forth at the beginning of this section. Both of them concern the amount of "theory," in any high-level sense of that word, which enters into a construct-defining network of laws or lawlike statements. We do not wish to convey the impression that one always has a very elaborate theoretical network, rich in hypothetical processes or entities.

Constructs as inductive summaries. In the early stages of development of a construct or even at more advanced stages when our orientation is thoroughly practical, little or no theory in the usual sense of the word need be involved. In the extreme case the hypothesized laws are formulated entirely in terms of descriptive (observational) dimensions although not all of the relevant observations have actually been made.

The hypothesized network "goes beyond the data" only in the limited sense that it purports to characterize the behavior facets which belong to an observable but as yet only partially sampled cluster; hence, it generates predictions about hitherto unsampled regions of the phenotypic space. Even though no unobservables or high-order theoretical constructs are introduced, an element of inductive extrapolation appears in the claim that a cluster including some elements not-yet-observed has been identified. Since, as in any sorting or abstracting task involving a finite set of complex elements, several nonequivalent bases of categorization are available, the investigator may choose a hypothesis which generates erroneous predictions. The failure of a supposed, hitherto untried, mem-


Vagueness of present psychological laws. This line of thought leads directly to our second important qualification upon the network schema. The idealized picture is one of a tidy set of postulates which jointly entail the desired theorems; since some of the theorems are coordinated to the observation base, the system constitutes an implicit definition of the theoretical primitives and gives them an indirect empirical meaning. In practice, of course, even the most advanced physical sciences only approximate this ideal. Questions of "categoricalness" and the like, such as logicians raise about pure calculi, are hardly even statable for empirical networks. (What, for example, would be the desiderata of a "well-formed formula" in molar behavior theory?) Psychology works with crude, half-explicit formulations. We do not worry about such advanced formal questions as "whether all molar-behavior statements are decidable by appeal to the postulates" because we know that no existing theoretical network suffices to predict even the known descriptive laws. Nevertheless, the sketch of a network is there; if it were not, we would not be saying anything intelligible about our constructs. We do not have the rigorous implicit definitions of formal calculi (which still, be it noted, usually permit of a multiplicity of interpretations). Yet the vague, avowedly incomplete network still gives the constructs whatever meaning they do have. When the network is very incomplete, having many strands missing entirely and some constructs tied in only by tenuous threads, then the "implicit definition" of these constructs is disturbingly loose; one might say that the meaning of the constructs is underdetermined. Since the meaning of theoretical constructs is set forth by stating the laws in which they occur, our incomplete knowledge of the laws of nature produces a vagueness in our constructs (see Hempel, 30; Kaplan, 34; Pap, 51). We will be able to say "what anxiety is" when we know all of the laws involving it; meanwhile, since we are in the process of discovering these laws, we do not yet know precisely what anxiety is.

Conclusions Regarding the Network after Experimentation

The proposition that x per cent of test variance is accounted for by the construct is inserted into the accepted network. The network then generates a testable prediction about the relation of the test scores to certain other variables, and the investigator gathers data. If prediction and result are in harmony, he can retain his belief that the test measures the construct. The construct is at best adopted, never demonstrated to be "correct."

We do not first "prove" the theory, and then validate the test, nor conversely. In any probable inductive type of inference from a pattern of observations, we examine the relation between the total network of theory and observations. The system involves propositions relating test to construct, construct to other constructs, and finally relating some of these constructs to observables. In ongoing research the chain of inference is very complicated. Kelly and Fiske (36, p. 124) give a complex diagram showing the numerous inferences required in validating a prediction from assessment techniques, where theories about the criterion situation are as integral a part of the prediction as are the test data.

A predicted empirical relationship permits us to test all the propositions leading to that prediction. Traditionally the proposition claiming to interpret the test has been set apart as the hypothesis being tested, but actually the evidence is significant for all parts of the chain. If the prediction is not confirmed, any link in the chain may be wrong.

A theoretical network can be divided into subtheories used in making particular predictions. All the events successfully predicted through a subtheory are of course evidence in favor of that theory. Such a subtheory may be so well confirmed by voluminous and diverse evidence that we can reasonably view a particular experiment as relevant only to the test's validity. If the theory, combined with a proposed test interpretation, mispredicts in this case, it is the latter which must be abandoned.

On the other hand, the accumulated evidence for a test's construct validity may be so strong that an instance of misprediction will force us to modify the subtheory employing the construct rather than deny the claim that the test measures the construct.

Most cases in psychology today lie somewhere between these extremes. Thus, suppose we fail to find a greater incidence of "homosexual signs" in the Rorschach records of paranoid patients. Which is more strongly disconfirmed, the Rorschach signs or the orthodox theory of paranoia? The negative finding shows the bridge between the two to be undependable, but this is all we can say. The bridge cannot be used unless one end is placed on solider ground. The investigator must decide which end it is best to relocate.

Numerous successful predictions dealing with phenotypically diverse "criteria" give greater weight to the claim of construct validity than do fewer predictions, or predictions involving very similar behaviors. In arriving at diverse predictions, the hypothesis of test validity is connected each time to a subnetwork largely independent of the portion previously used. Success of these derivations testifies to the inductive power of the test-validity statement, and renders it unlikely that an equally effective alternative can be offered.
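The two extremes described in this section (a well-confirmed subtheory with a new test, and a well-validated test with a shaky subtheory) can be made concrete with a small numerical sketch. The Bayesian framing, the function name `blame_after_failure`, and every probability below are illustrative assumptions introduced here, not part of the original argument; the sketch only shows how prior confidence in each end of the "bridge" determines which end a single failed prediction discredits.

```python
from itertools import product


def blame_after_failure(p_theory, p_test, hit_if_both=0.9, hit_otherwise=0.1):
    """Posterior belief in the subtheory and in the test interpretation
    after ONE failed prediction.

    All four numbers are illustrative assumptions, not values from the text:
      p_theory      prior probability that the subtheory is sound
      p_test        prior probability that the test measures the construct
      hit_if_both   chance the prediction succeeds when both are sound
      hit_otherwise chance it succeeds anyway when at least one is faulty
    The two priors are treated as independent.
    """
    weights = {}
    for theory_ok, test_ok in product((True, False), repeat=2):
        prior = (p_theory if theory_ok else 1 - p_theory) * (
            p_test if test_ok else 1 - p_test
        )
        p_hit = hit_if_both if (theory_ok and test_ok) else hit_otherwise
        # Weight of this case given that the prediction FAILED.
        weights[(theory_ok, test_ok)] = prior * (1 - p_hit)

    total = sum(weights.values())
    post_theory = (weights[(True, True)] + weights[(True, False)]) / total
    post_test = (weights[(True, True)] + weights[(False, True)]) / total
    return post_theory, post_test


# Well-confirmed subtheory, new test: the test interpretation takes the blame.
theory_a, test_a = blame_after_failure(p_theory=0.99, p_test=0.50)

# Well-validated test, shaky subtheory: the subtheory takes the blame.
theory_b, test_b = blame_after_failure(p_theory=0.50, p_test=0.99)

# Neither end on solid ground: the failure cannot say which end to relocate.
theory_c, test_c = blame_after_failure(p_theory=0.60, p_test=0.60)

print(round(theory_a, 3), round(test_a, 3))
print(round(theory_b, 3), round(test_b, 3))
print(round(theory_c, 3), round(test_c, 3))
```

Under these assumed numbers, the first case leaves belief in the subtheory near 0.98 and drops belief in the test interpretation to roughly 0.11; the second case reverses the two; in the third both fall equally, which mirrors the remark that a negative finding alone shows only that the bridge is undependable.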

IMPLICATIONS OF NEGATIVE EVIDENCE

The investigator whose prediction and data are discordant must make strategic decisions. His results can be interpreted in three ways:

1. The test does not measure the construct variable.

2. The theoretical network that generated the hypothesis is incorrect.

3. The experimental design failed to test the hypothesis properly. (Strictly speaking this may be analyzed as a special case of 2, but in practice the distinction is worth making.)

For further research. If a specific fault of procedure makes the third a reasonable possibility, his proper response is to perform an adequate study, meanwhile making no report. When faced with the other two alternatives, he may decide that his test does not measure the construct adequately. Following that decision, he will perhaps prepare and validate a new test. Any rescoring or new interpretative procedure for the original instrument, like a new test, requires validation by means of a fresh body of data.

The investigator may regard interpretation 2 as more likely to lead to eventual advances. It is legitimate for the investigator to call the network defining the construct into question, if he has confidence in the test. Should the investigator decide that some step in the network is unsound, he may be able to invent an alternative network. Perhaps he modifies the network by splitting a concept into two or more portions, e.g., by designating types of anxiety, or perhaps he specifies added conditions under which a generalization holds. When an investigator modifies the theory in such a manner, he is now required to gather a fresh body of data to test the altered hypotheses. This step should normally precede publication of the modified theory. If the new data are consistent with the modified network, he is free from the fear that his nomologicals were gerrymandered to fit the peculiarities of his first sample of observations. He can now trust his test to some extent, because his test results behave as predicted.

The choice among alternatives, like any strategic decision, is a gamble as to which course of action is the best investment of effort. Is it wise to modify the theory? That depends on how well the system is confirmed by prior data, and how well the modifications fit available observations. Is it worth while to modify the test in the hope that it will fit the construct? That depends on how much evidence there is, apart from this abortive experiment, to support the hope, and also on how much it is worth to the investigator's ego to salvage the test. The choice among alternatives is a matter of research planning and no routine policy can be stated.

For practical use of the test. The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test. If the test has not been published, it should be restricted to research use until some degree of validity is established (1). The consumer can await the results of the investigator's gamble, with confidence that proper application of the scientific method will ultimately tell whether the test has value. Until the evidence is in, he has no justification for employing the test as a basis for terminal decisions. The test may serve, at best, only as a source of suggestions about individuals to be confirmed by other evidence (15, 47).

There are two perspectives in test validation. From the viewpoint of the psychological practitioner, the burden of proof is on the test. A test should not be used to measure a trait until its proponent establishes that predictions made from such measures are consistent with the best available theory of the trait. In the view of the test developer, however, both the test and the theory are under scrutiny. He is free to say to himself privately, "If my test disagrees with the theory, so much the worse for the theory." This way lies delusion, unless he continues his research using a better theory.

REPORTING OF POSITIVE RESULTS

The test developer who finds positive correspondence between his proposed interpretation and data is expected to report the basis for


which in itself provides a linkage between the various keys and the various criteria. Thus, while Strong's Vocational Interest Blank is developed empirically, it also rests on a "theory" that a youth can be expected to be satisfied in an occupation if he has interests common to men now happy in the occupation. When Strong finds that those with high Engineering interest scores in college are preponderantly in engineering careers nineteen years later, he has partly validated the proposed use of the Engineer score (predictive validity). Since the evidence is consistent with the theory on which all the test keys were built, this evidence alone increases the presumption that the other keys have predictive validity. How strong is this presumption? Not very, from the viewpoint of the traditional skepticism of science. Engineering interests may stabilize early, while interests in art or management or social work are still unstable. A claim cannot be made that the whole Strong approach is valid just because one score shows predictive validity. But if thirty interest scores were investigated longitudinally and all of them showed the type of validity predicted by Strong's theory, we would indeed be caviling to say that this evidence gives no confidence in the long-range validity of the thirty-first score.

Confidence in a theory is increased as more relevant evidence confirms it, but it is always possible that tomorrow's investigation will render the theory obsolete. The Technical Recommendations suggest a rule of reason, and ask for evidence for each type of inference for which a test is recommended. It is stated that no test developer can present predictive validities for all possible criteria; similarly, no developer can run all possible experimental tests of his proposed interpretation. But the recommendation is more subtle than advice that a lot of validation is better than a little.

Consider the Rorschach test. It is used for many inferences, made

by means of nomological networks at several levels. At a low level are the simple unrationalized correspondences presumed to exist between certain signs and psychiatric diagnoses. Validating such a sign does nothing to substantiate Rorschach theory. For other Rorschach formulas an explicit a priori rationale exists (for instance, high F per cent interpreted as implying rigid control of impulses). Each time such a sign shows correspondence with criteria, its rationale is supported just a little. At a still higher level of abstraction, a considerable body of theory surrounds the general area of outer control, interrelating many different constructs. As evidence cumulates, one should be able to decide what specific inference-making chains within this system can be depended upon. One should also be able to conclude, or deny, that so much of the system has stood up under test that one has some confidence in even the untested lines in the network.

In addition to relatively delimited nomological networks surrounding

control or aspiration, the Rorschach interpreter usually has an overriding theory of the test as a whole. This may be a psychoanalytic theory, a theory of perception and set, or a theory stated in terms of learned habit patterns. Whatever the theory of the interpreter, whenever he validates an inference from the system, he obtains some reason for added confidence in his overriding system. His total theory is not tested, however, by experiments dealing with only one limited set of constructs. The test developer must investigate far-separated, independent sections of the network. The more diversified the predictions the system is required to make, the greater confidence we can have that only minor parts of the system will later prove faulty. Here we begin to glimpse a logic to defend the judgment that the test and its whole interpretative system is valid at some level of confidence.

There are enthusiasts who would conclude from the foregoing paragraphs that since there is some evidence of correct, diverse predictions made from the Rorschach, the test as a whole can now be accepted as validated. This conclusion overlooks the negative evidence. Just one finding contrary to expectation, based on sound research, is sufficient to wash a whole theoretical structure away. Perhaps the remains can be salvaged to form a new structure. But this structure now must be exposed to fresh risks, and sound negative evidence will destroy it in turn. There is sufficient negative evidence to prevent acceptance of the Rorschach and its accompanying interpretative structures as a whole. So long as any aspects of the overriding theory stated for the test have been disconfirmed, this structure must be rebuilt.

Talk of areas and structures may seem not to recognize those who would interpret the personality "globally." They may argue that a test is best validated in matching studies. Without going into detailed questions of matching methodology, we can ask whether such a study validates the nomological network "as a whole." The judge does employ some network in arriving at his conception of his subject, integrating specific inferences from specific data. Matching studies, if successful, demonstrate only that each judge's interpretative theory has some validity, that it is not completely a fantasy. Very high consistency between judges is required to show that they are using the same network, and very high success in matching is required to show that the network is dependable.

If inference is less than perfectly dependable, we must know which aspects of the interpretative network are least dependable and which are most dependable. Thus, even if one has considerable confidence in a test "as a whole" because of frequent successful inferences, one still returns as an ultimate aim to the request of the Technical Recommendations for separate evidence on the validity of each type of inference to be made.

Recapitulation

Construct validation was introduced in order to specify types of research required in developing tests for which the conventional views on validation are inappropriate. Personality tests, and some tests of ability, are interpreted in terms of attributes for which there is no adequate criterion. This paper indicates what sorts of evidence can substantiate such an interpretation, and how such evidence is to be interpreted. The following points made in the discussion are particularly significant.

1. A construct is defined implicitly by a network of associations or propositions in which it occurs. Constructs employed at different stages of research vary in definiteness.

2. Construct validation is possible only when some of the statements in the network lead to predicted relations among observables. While some observables may be regarded as "criteria," the construct validity of the criteria themselves is regarded as under investigation.

3. The network defining the construct, and the derivation leading to the predicted observation, must be reasonably explicit so that validating evidence may be properly interpreted.

4. Many types of evidence are relevant to construct validity, including content validity, interitem correlations, intertest correlations, test-"criterion" correlations, studies of stability over time, and stability under experimental intervention. High correlations and high stability may constitute either favorable or unfavorable evidence for the proposed interpretation, depending on the theory surrounding the construct.

5. When a predicted relation fails to occur, the fault may lie in the proposed interpretation of the test or in the network. Altering the network so that it can cope with the new observations is, in effect, redefining the construct. Any such new interpretation of the test must be validated by a fresh body of data before being advanced publicly. Great care is required to avoid substituting a posteriori rationalizations for proper validation.

6. Construct validity cannot generally be expressed in the form of a single simple coefficient. The data often permit one to establish upper and lower bounds for the proportion of test variance which can be attributed to the construct. The integration of diverse data into a proper interpretation cannot be an entirely quantitative process.

7. Constructs may vary in nature from those very close to "pure description" (involving little more than extrapolation of relations among observation-variables) to highly theoretical constructs involving hypothesized entities and processes, or making identifications with constructs of other sciences.

8. The investigation of a test's construct validity is not essentially different from the general scientific procedures for developing and confirming theories.

Without in the least advocating construct validity as preferable to the other three kinds (concurrent, predictive, content), we do believe it imperative that psychologists make a place for it in their methodological thinking, so that its rationale, its scientific legitimacy, and its dangers may become explicit and familiar. This would be preferable to the widespread current tendency to engage in what actually amounts to construct validation research and use of constructs in practical testing, while talking an "operational" methodology which, if adopted, would force research into a mold it does not fit.

REFERENCES

1. American Psychological Association. Ethical Standards of Psychologists. Washington, D.C.: Amer. Psychological Assn., 1953.

2. Anastasi, Anne. "The Concept of Validity in the Interpretation of Test Scores," Educational and Psychological Measurement, 10:67-78 (1950).

3. Bechtoldt, H. P. "Selection," in S. S. Stevens (ed.), Handbook of Experimental Psychology, pp. 1237-67. New York: Wiley, 1951.

4. Beck, L. W. "Constructions and Inferred Entities," Philosophy of Science, 17:74-86 (1950). Reprinted in H. Feigl and M. Brodbeck (eds.), Readings in the Philosophy of Science, pp. 368-81. New York: Appleton-Century-Crofts, 1953.