Evaluating the Constructs Underlying Ability, Achievement, and Other Tests of Cognitive Competencies


2To be sure, other frameworks have been proposed, involving emotional, moral, multiple, and practical intelligence; but these have yet to generate meaningful empirical advances beyond what conventional cognitive ability assessments afford (cf. Brody, 2003; Gottfredson, 2003b; Hunt, 1999; Lubinski & Benbow, 1995). Messick (1992) in particular has skillfully demonstrated that many of these proposed innovations are found in earlier frameworks.

3Attracting considerable attention nowadays is "stereotype threat," a hypothesis purporting that the validity of psychometric assessments is markedly attenuated for certain underrepresented groups. For multiple reasons, assessment specialists question the tenability of this idea; interested readers will find the following five pages both informative and intriguing (Jensen, 1998, pp. 513-515; Sackett, Schmidt, et al., 2001, pp. 309-310). (Also see Cullen, Hardison, & Sackett, 2004; Sackett, Hardison, & Cullen, 2004; Stricker & Bejar, 2004; Stricker & Ward, 2004.) Finally, the National Academy of Sciences has published several validation reports on psychological tests in school and work settings (e.g., Hartigan & Wigdor, 1989; Wigdor & Garner, 1982; Wigdor & Green, 1991).

[Figure 8.2 appears here: a plot titled "Radex Organization of Human Abilities," in which individual tests (e.g., W-information, W-similarities, W-picture completion, W-picture arrangement, W-block design, W-object assembly, W-digit span forward and backward, Raven, hidden figures, surface development, paper form board, Street gestalt, visual number span, word beginnings and endings, uses, film memory, arithmetic, and achievement composites) are arrayed by content and by complexity.]
FIGURE 8.2. Each point in the diagram represents a test. These tests are organized by content and by complexity. Complex, intermediate, and simple tests are indicated by squares, triangles, and circles, respectively. Distinct forms of content are represented as black (verbal), dotted (numerical), and white (figural-spatial). Clusters of abilities that define well-known factors are indicated by a G. Gf = fluid ability, Gc = crystallized ability, Gv = spatial visualization. Tests having the greatest complexity are located near the centroid of the radex. Reprinted from Intelligence, 7, B. Marshalek, D. F. Lohman, and R. E. Snow, "The Complexity Continuum in the Radex and Hierarchical Models of Intelligence," p. 122. Copyright 1983, with permission from Elsevier.

It is important to appreciate that the vertex of the hierarchical model (derived factor analytically) and the centroid of the radex model (uncovered through multidimensional scaling) constitute psychologically equivalent factors. This dimension accounts for approximately 50% of the common variance running through heterogeneous collections of cognitive tests across a wide range of talent. It is also a source of variance traveling through more assessment vehicles than many psychologists realize, and it accounts for the preponderance of criterion variance that cognitive abilities are capable of predicting. There are certainly other cognitive abilities beyond the general factor that are useful in predicting real-world criteria (Humphreys, Lubinski, & Yao, 1993; Shea, Lubinski, & Benbow, 2001), and a subsequent section will provide examples. However, to evaluate the unique psychological import of specific abilities, it is necessary to establish their discriminant validity from the general factor, as well as their incremental validity, relative to it, in the prediction of meaningful psychological criteria. This latter point needs to be particularly stressed because, all too often, innovative instruments are correlated with general intelligence and manifest modest correlations. Then, high reliability coefficients are used to argue that the new indicator is distinctive because the majority of its reliable variance is unique relative to general intelligence (i.e., not shared with it). This line of reasoning says nothing about the measure's psychological importance.
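A hypothetical numerical sketch (the figures below are illustrative, not taken from any study cited here) shows why such reasoning stops short:

```python
# Illustrative numbers only: a new measure correlates modestly with a
# general-intelligence (g) composite and is highly reliable.
r_with_g = 0.45      # hypothetical correlation of the new measure with g
reliability = 0.90   # hypothetical reliability of the new measure

shared_with_g = r_with_g ** 2                   # ~0.20 of its variance overlaps with g
reliable_unique = reliability - shared_with_g   # ~0.70 is reliable and not shared with g

print(f"shared with g: {shared_with_g:.2f}; reliable unique variance: {reliable_unique:.2f}")
# The sizable reliable-unique slice establishes only discriminant validity.
# Whether that slice matters psychologically is the separate question of
# incremental validity: does the measure improve prediction of external
# criteria after g has already been entered?
```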

Reliable variance that is psychologically uninteresting. The measurement literature is replete with examples of components of variance that are reliable but not necessarily psychologically important: method variance (Campbell & Fiske, 1959), constant error (Loevinger, 1954), systematic bias (Humphreys, 1990), systematic ambient noise (Lykken, 1968), and crud (Meehl, 1990). These terms denote reliable sources of variance that are construct irrelevant (Cook & Campbell, 1979), which saturate all psychological measuring devices. The point is that discriminant validity is only one step in the process of evaluating the psychological significance of measuring tools. Incremental validity is also needed (Sechrest, 1963). Moreover, given that general intelligence accounts for the preponderance of variance that cognitive abilities account for in predicting performance criteria in educational, employment, and training settings (Brody, 1992; Jensen, 1980, 1998; Schmidt & Hunter, 1998; Viswesvaran & Ones, 2002), unless there are compelling reasons to do otherwise, parsimony suggests that innovative tools be evaluated for their incremental validity relative to this standard (cf. Lubinski, 2000; Lubinski & Dawis, 1992). After all, general intelligence is the ability construct with the most referent generality, which is why Humphreys (1976) refers to this dimension as "the primary mental ability." Earlier treatments of this idea are well worth reading (Humphreys, 1962, 1976; McNemar, 1964), as are the words of Messick (1992, p. 379): "Because IQ is merely a way of scaling measures of general intelligence ["g"], the burden of proof in claiming to move beyond IQ is to demonstrate empirically that . . . test scores tap something more than or different from general intelligence. . . ."

Reliable variance that is psychologically interesting. Technically, of course, all ability tests carry multiple constructs. That this is true of all assessments of individual differences is partly what motivated Campbell and Fiske (1959) to develop the multitrait-multimethod (MTMM) matrix and the idea of convergent validity. (Actually, method variance itself can be construed as a construct.) However, some ability tests carry large components of variance relevant to psychologically important general (complexity) and specific (more content-focused) constructs, whereas others carry variance primarily restricted to the former. For multifaceted indicators containing appreciable components of multiple constructs (e.g., those illustrated in Figure 8.3 [viz., X1, X2, and X3]), it is important to ascertain whether general ("g") or specific constructs (viz., S1, S2, or S3) are at work when they manifest validity by forecasting important external criteria: Are the scale's external relationships due to common variance (g, shared with all cognitive measures) or specific variance (S, more indicative of the scale's manifest content)? Answering this question speaks to Messick's (1992) requirement for going beyond IQ.

Given the preceding, several considerations need to be entertained before launching causal inferences about constructs underlying test performance (Gustafsson, 2002; Lubinski, 2004). Consider, for example, mathematical, spatial, and verbal abilities. Content-focused measures within intermediate tiers of the hierarchical organization of cognitive abilities typically carry appreciable components of general and specific variance. So studies of the external validity of specific ability instruments (e.g., the constituents in Figure 8.3, viz., X1, X2, or X3) need to incorporate indicators consisting predominantly of general-factor variance (e.g., the composite in Figure 8.3, viz., X1 + X2 + X3) if the underlying constructs are to be appraised. Doing so enables evaluations of the extent to which general or specific constructs or both are operating (and to what degree). When measures consisting of predominantly specific variance add incremental validity to the prediction of relevant criteria, after a composite that consists largely of the general factor has been entered in a multiple regression equation, evidence is gleaned for the psychological significance of an important cognitive ability distinct from general intelligence.4 This is also true for innovative tests developed to measure innovative constructs.
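A minimal sketch of this hierarchical regression logic follows; the simulated data, variable names, and effect sizes are hypothetical and serve only to illustrate the incremental-validity comparison described above.

```python
# Incremental validity sketch: does a content-focused (spatial) measure add to
# the prediction of a criterion after a general-factor composite is entered?
# All data and parameter values below are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

g = rng.normal(size=n)                    # general factor
s_spatial = rng.normal(size=n)            # spatial-specific factor

verbal = 0.6 * g + 0.8 * rng.normal(size=n)
quant = 0.6 * g + 0.8 * rng.normal(size=n)
spatial = 0.6 * g + 0.6 * s_spatial + 0.5 * rng.normal(size=n)

# Criterion influenced by g and, more modestly, by the spatial-specific factor.
criterion = 0.5 * g + 0.3 * s_spatial + rng.normal(size=n)

g_composite = verbal + quant + spatial    # stands in for X1 + X2 + X3 in Figure 8.3

m1 = sm.OLS(criterion, sm.add_constant(np.column_stack([g_composite]))).fit()
m2 = sm.OLS(criterion, sm.add_constant(np.column_stack([g_composite, spatial]))).fit()

print(f"R^2, g composite only:       {m1.rsquared:.3f}")
print(f"R^2, adding spatial:         {m2.rsquared:.3f}")
print(f"Incremental validity (dR^2): {m2.rsquared - m1.rsquared:.3f}")
```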

FIGURE 8.3. Three scales each composed of 35% common variance, 55% specific variance, and 10% error variance (top). When these three scales are aggregated (bottom), the resulting composite consists mostly of the variance they share (61% common variance). Modified and reproduced by special permission of the Publisher CPP, Inc. Mountain View, CA 94043, from "Aptitudes, skills, and proficiencies" by D. Lubinski & R. V. Dawis, in Handbook of Industrial and Organizational Psychology (2nd ed., Vol. 3), by M. D. Dunnette & L. M. Hough (Eds.). Copyright 1992 by CPP, Inc. All rights reserved. Future reproduction is prohibited without the Publisher's written consent.

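The caption's percentages can be checked with a little arithmetic, under the idealizing assumption that the three scales are standardized and covary only through their common factor (the figure itself supplies only the percentages):

```python
# Check of the aggregation arithmetic behind Figure 8.3, assuming standardized
# scales whose specific and error components are uncorrelated across scales.
import math

common, specific, error = 0.35, 0.55, 0.10
k = 3  # number of scales aggregated

# Unit-weighted composite: k variances of 1.0 plus k*(k-1) covariances of `common`.
composite_var = k * (common + specific + error) + k * (k - 1) * common   # 5.10

# Common-factor contribution: (sum of loadings)^2, with each loading = sqrt(common).
common_part = (k * math.sqrt(common)) ** 2                               # 3.15

print(f"{common_part / composite_var:.2f}")   # ~0.62, in line with the ~61% in the caption
```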


Other tests of cognitive competencies. A large literature has emerged (independent of the psychometric assessment of human abilities) to suggest that investigators from other disciplines have built measures of general intelligence without knowing it. This is why it is important to distinguish between the complexity and the content of a test and, subsequently, to conduct incremental validity appraisals (Sanders, Lubinski, & Benbow, 1995).

4When general factor variance is operating predominantly, as Cronbach and Snow (1977) have revealed for many educational treatments, and Schmidt and Hunter (1998) have revealed for multiple performance criteria in the world of work, content-focused specific ability tests will typically achieve significant results as well, if used in isolation. However, the general variance is probably what is doing the work. The construct operating involves the complexity of the test (or its general factor variance) rather than the content of the test (or its specific factor variance). Venturing causal inferences about the operative construct from specific ability measures used in isolation is hazardous. For the same reasons, venturing causal inferences about the operative construct underlying innovative measures without considering their overlap with powerful preexisting measures within the hierarchy of cognitive abilities is hazardous as well. All too often, what is purported to be a major advance turns out to be a manifestation of the jangle fallacy (Kelley, 1927) or a "psychological factor of no importance" (Kelley, 1939). For further and more detailed reading on these ideas, see Gustafsson (2002) and Lubinski (2004; Lubinski & Dawis, 1992).

Complexity travels through multiple content domains and frequently carries most, sometimes all, of the predictable criterion variance.

Three independent lines of work on "functional literacy" (i.e., health literacy [National Work Group on Literacy and Health, 1998], adult literacy [Kirsch, Jungeblut, Jenkins, & Kolstad, 1993], and worker literacy [Sticht, 1975]) have generated distinct measures designed to assess, respectively, knowledge critical for healthy behavior (taking medication properly), everyday skills (interpreting a bus schedule accurately), and employability (work-related skills). Each team built assessment tools with content saturated with effective functioning for good health, life in general, or the world of work (see Gottfredson, 2002, for a detailed analysis of these three lines of research). These instruments, however, all appear to converge on the same underlying construct: a dominant dimension involving individual differences in processing complex information (the vertex of Carroll's hierarchy, or the centroid of Snow's radex). In the information age, it is the processing of complex information that is critical for adaptive performance in multiple arenas. Indeed, the authors of the U.S. Department of Education's National Adult Literacy Survey (NALS) began their scale construction procedures aiming to assess three kinds of literacy: prose, document, and quantitative. They found, however, that despite an effort to create relatively distinctive measures, their three scales correlate over .90. Thus, "major [NALS] survey results are nearly identical for each of the three scales . . . with findings appearing to be reported essentially in triplicate, as it were" (Reder, 1998, pp. 39, 44).

Just as the inventors of the initial specific ability measures aimed for a parsimonious set of relatively independent dimensions, primarily defined by different content and a theory of group factors (Kelley, 1928; Thurstone, 1938), modern investigators seeking to evaluate individual differences pertaining to health knowledge, reading comprehension, and work competency underappreciated the amount of psychological similarity running through these various cognitive tasks—content domains—that people encounter in everyday life.

Higher levels of general intelligence facilitate the acquisition of many different kinds of knowledge, relative to lower levels.

The preceding findings demonstrate the scope of general intelligence. This construct travels through many different kinds of assessment vehicles because it travels through many different aspects of life. Individual differences in this attribute reflect differential capabilities for assimilating cultural content. Therefore, general intelligence seems closely aligned with Woodrow's (1921) initial characterization; namely, intelligence is "the capacity to acquire capacity."

In the preceding example, the instruments that were developed all measure general intelligence to a remarkable degree. It would be interesting to ascertain whether they manifest any incremental validity beyond general intelligence in the life domains that they were designed for. Moreover, reading, per se, is not the source of overlap, because these findings replicate when questions are given orally (Kirsch et al., 1993; Sticht, 1975). Assimilating, comprehending, and processing information are the individual differences assessed by these measures. Moreover, as reading researchers discovered long ago, there is much more to reading comprehension than simply decoding words. These orally administered assessments constitute an important line of convergent validity (Campbell & Fiske, 1959) because they use a distinctly different medium (viz., oral as opposed to written instructions or listening as opposed to reading). For native speakers, reading ability is comprehension (Jensen, 1980, pp. 325-326); spoken language and written language are just different vehicles—different methods—for conveying information (cf. Carroll, 1997).

Therefore, social scientists interested in the study of health, life competencies, and work competencies can draw on the broad nomological network afforded by decades of psychometric research on ability tests. Traditional ability tests assess individual differences relevant to the phenomena that these investigators are interested in, and such tests generalize to many important life domains beyond the educational, occupational, and training settings used for their initial development (cf. Gordon, 1997; Gottfredson, 2002, 2004).

Distinguishing ability and achievement tests. The foregoing discussion about the generality of contrasting "literacies," or tests initially designed to measure more circumscribed competencies, is in many ways not surprising because, again, the general factor accounts for the majority of the variance that cognitive abilities are capable of predicting. Therefore, it might be expected that when measures are developed without taking into consideration what conventional tests afford, some reinventing of the wheel will occur. Back in the 1920s, Kelley (1927) knew that tiny slivers of general intelligence run through all achievement items. Therefore, when achievement items are sampled broadly and aggregated to form composites, functionally equivalent measures of general intelligence are formed (cf. Roznowski, 1987). Indeed, when Kelley (1927) bemoaned the amount of redundancy across multiple psychological tests and introduced his well-known jangle fallacy to bring this problem to light, he used ability and achievement tests to exemplify the problem.

Equally contaminating to clear thinking is the use of two separate words or expressions covering in fact the same basic situation, but sounding different, as though they were in truth different. The doing of this . . . the writer would call the "jangle" fallacy. "Achievement" and "intelligence" . . . We can mentally conceive of individuals differing in these two traits, and we can occasionally actually find such by using the best of our instruments of mental measurement, but to classify all members of a single school grade upon the basis of their difference in these two traits is sheer absurdity. (Kelley, 1927, p. 64)

Cronbach (1976) reinforced this idea 50 years later: When heterogeneous collections of routine achievement measures are combined to form a total score, an excellent measure of general intelligence is formed. Just prior to this, an APA task force concluded that different "achievement" and "aptitude or ability" tests can be reduced to four dimensions (Cleary et al., 1975): (a) breadth of item sampling, (b) the extent to which items are tied to a specific educational program, (c) recency of the learning assessed, and (d) the purpose of assessment (viz., current status [concurrent validity] or potential for growth [predictive validity]). Ability and achievement tests do not differ qualitatively; they differ quantitatively along these four dimensions. Indeed, the same kinds of items (frequently identical items) are routinely found on both "kinds" of tests.

When large numbers of items are broadly sampled from different kinds of information and problem-solving content, not necessarily tied to an educational curriculum, and involving recent as well as old learning (acquired formally or informally), their aggregation forms a composite that accurately assesses general intelligence. This idea, of course, is Spearman's (1927) indifference of the indicator. However, if familiar achievement or information items are to be used—rather than relatively content-free reasoning problems (e.g., Raven matrices)—it is important to stress that sampling should be broad if the general factor is to be assessed (cf. Roznowski, 1987). The reason "achievement" items are not used more routinely in assessing broad cognitive abilities is that they contain less construct-relevant variance than conventional "ability" items (which are more abstract, complex, and content free). Hence, to assess abilities with high referent generality, many more achievement (knowledge) items than ability (reasoning) items are required (Brody, 1994)—but this is a technical matter, not a conceptual or psychological issue.

Although pools formed by heterogeneous collections of information items may appear unsystematic, or a "hotchpotch" (Spearman, 1927), the communality they distill generates functionally equivalent correlates (Hulin & Humphreys, 1980). To be clear, each individual item carries a large component of construct-irrelevant uniqueness; at the item level, more than 95% construct-irrelevant variance is typical (Green, 1978). However, aggregation attenuates these contaminants and reduces their overall influence in the composite; the small communality associated with each item piles up as more items are added. The composite comes to reflect mostly construct-relevant variance (signal), even though each item consists mostly of construct-irrelevant variance (noise).
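A back-of-the-envelope calculation, assuming a single common factor and item uniquenesses that are uncorrelated across items (an idealization, with the 5% figure taken only as a round illustration of Green's point), shows how the signal piles up:

```python
# How weak per-item signal accumulates in a composite, assuming each item has
# 5% common (construct-relevant) variance and 95% uncorrelated uniqueness.
def proportion_construct_relevant(n_items: int, item_common: float = 0.05) -> float:
    """Share of composite variance attributable to the common factor."""
    common = (n_items * item_common ** 0.5) ** 2   # (sum of loadings)^2
    unique = n_items * (1 - item_common)           # uniquenesses add linearly
    return common / (common + unique)

for n in (1, 10, 50, 100, 200):
    print(n, round(proportion_construct_relevant(n), 2))
# 1 item -> 0.05 (mostly noise); 100 items -> 0.84; 200 items -> 0.91 (mostly signal)
```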

Evaluating the interchangeability of tests. The foregoing discussion highlights why general intelligence variance needs to be controlled before specific ability and innovative cognitive competency measures can be adequately appraised. It also illustrates how similar constructs may travel through instruments that differ widely in content. However, although Cronbach (1970; Cronbach & Snow, 1977) has stressed the importance of a general cognitive ability dimension, he also has expressed concern about capturing this dimension precisely. His concern is germane to other dimensions of cognitive abilities as well.

Reflecting on "construct validity after 30 years," Cronbach (1989) noted that localizing the general dimension running through all cognitive abilities is problematic: Because the center of the radex (Snow & Lohman, 1989), or the vertex of a hierarchical organization (Carroll, 1993), always varies somewhat from sample to sample, and as a function of the diversity of the tests used, how is one ever to know whether the "true" center or summit has been found? Clearly, a method is needed to ascertain when experimentally distinct indicators measure the same construct in the same way.

To determine if two experimentally distinct assessment vehicles are indeed measuring the same construct to the same degree, Fiske (1971) developed extrinsic convergent validity. The idea is this: Two measures may be considered conceptually equivalent and empirically interchangeable if they display corresponding correlational profiles across a heterogeneous collection of external criteria. Examples of the integrative power of this idea can be found in the psychological literature (Judge, Erez, Bono, & Thoresen, 2002; Lubinski, Tellegen, & Butcher, 1983; Schmidt, Lubinski, & Benbow, 1998), but it is surprising that it is not more routinely used, given the amount of concern about redundancy in psychological measuring instruments (cf. Block, 2002; Dawis, 1992; Tellegen, 1993).

Another attractive feature of this method is that when multiple measures generate the same pattern of correlations across a heterogeneous collection of external criteria (Judge et al., 2002; Lubinski et al., 1983; Schmidt et al., 1998), ostensibly distinct bodies of literature may be combined under one unifying construct. Consider Table 8.1, which reinforces the earlier discussion of a general cognitive ability running through all specific ability measures as well as all achievement tests.

Table 8.1 contains three experimentally independent measures with verbal content: literary information, reading comprehension, and vocabulary. They have intercorrelations of around .75, so they share approximately half of their variance. And their uniform reliabilities (high .80s) afford each appreciable nonerror uniqueness. Yet, examine the correspondence across their external correlational profiles, which include criteria ranging from other specific abilities to vocational interests. All three correlational profiles are essentially functionally equivalent. All three measures assess the same underlying construct, even though each possesses a large component of nonerror uniqueness or room for divergence. Essentially all the information they afford about individual differences is located in their overlap (or communality). To refer to these three measures as assessing distinct constructs just because they have different labels and manifest content would constitute the jangle fallacy. For many research purposes, these three measures may be used interchangeably.
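A sketch of how such extrinsic convergent validation might be checked in practice follows; the two external-correlation profiles below are made-up illustrations, not the values in Table 8.1.

```python
# Extrinsic convergent validation (Fiske, 1971), sketched with hypothetical data:
# two measures are treated as interchangeable when their correlations with a
# heterogeneous set of external criteria form essentially the same profile.
import numpy as np

criteria = ["mechanical reasoning", "quantitative ability", "spatial visualization",
            "music knowledge", "social studies knowledge", "vocational interest"]

vocabulary_profile = np.array([0.45, 0.60, 0.40, 0.55, 0.65, -0.05])
reading_comp_profile = np.array([0.43, 0.62, 0.38, 0.52, 0.63, -0.02])

# Profile similarity: correlate the two columns of external correlations.
similarity = np.corrcoef(vocabulary_profile, reading_comp_profile)[0, 1]
print(f"profile similarity: {similarity:.2f}")   # near 1.0 -> functionally equivalent
```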

Notice also in Table 8.1 how these verbal measures covary with other cognitive abilities (quantitative and spatial reasoning) and tests of achievement (music and social studies knowledge), but are only lightly associated with educational-vocational interest measures, if at all. This convergent-discriminant pattern reflects the construct of general intelligence. When measures of quantitative, spatial, and verbal ability are systematically aggregated, an excellent measure of general intelligence is formed (Figure 8.3). The question now becomes, Is the external validity evinced by the three measures in Table 8.1 a function of their verbal content, or of the general factor ("g") that runs through them? This is important to ascertain because, again, only after incremental validity relative to the general factor has been demonstrated can the measures' specific (verbal) content be credited with their external validity.

TABLE 8.1. Extrinsic Convergent Validation Profiles Across Three Measures Having Verbal Content
[The table reports the correlations of three verbal measures—literature, vocabulary, and reading comprehension—with a heterogeneous set of external criteria, beginning with aptitude tests such as mechanical reasoning.]
