Internal sources of validity include the intrinsic characteristics of a test, especially its content, assessment methods, structure, and theoretical underpinnings. In this section, several sources of evidence internal to tests are described, including content validity, substantive validity, and structural validity.
Content validity is the degree to which elements of a test, ranging from items to instructions, are relevant to and representative of varying facets of the targeted construct (Haynes, Richard, & Kubany, 1995). Content validity is typically established through the use of expert judges who review test content, but other procedures may also be employed (Haynes et al., 1995). Hopkins and Antes (1978) recommended that tests include a table of content specifications, in which the facets and dimensions of the construct are listed alongside the number and identity of items assessing each facet.
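A table of content specifications of the kind Hopkins and Antes recommended can be sketched as a simple mapping from facets to items. The facet names and item identifiers below are hypothetical and chosen only for illustration; they are not drawn from any published instrument.

```python
from collections import defaultdict

# Hypothetical item bank: (item identifier, facet of the construct it assesses).
ITEMS = [
    ("item_01", "whole numbers"),
    ("item_02", "whole numbers"),
    ("item_03", "fractions"),
    ("item_04", "decimals"),
    ("item_05", "number concepts"),
    ("item_06", "basic operations"),
    ("item_07", "basic operations"),
    ("item_08", "basic operations"),
]


def content_specifications(items):
    """Group item identifiers by facet, preserving the order of the item bank."""
    table = defaultdict(list)
    for item_id, facet in items:
        table[facet].append(item_id)
    return dict(table)


def print_table(table):
    """Print one row per facet: facet name, number of items, item identifiers."""
    for facet, item_ids in table.items():
        print(f"{facet:<18}{len(item_ids):>3}  {', '.join(item_ids)}")


if __name__ == "__main__":
    print_table(content_specifications(ITEMS))
```

A table like this makes thin coverage visible at a glance: in the hypothetical bank above, fractions and decimals are each sampled by a single item.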
Content differences across tests purporting to measure the same construct can explain why similar tests sometimes yield dissimilar results for the same examinee (Bracken, 1988). For example, the universe of mathematical skills includes varying types of numbers (e.g., whole numbers, decimals, fractions), number concepts (e.g., half, dozen, twice, more than), and basic operations (addition, subtraction, multiplication, division). The extent to which tests differentially sample content can account for differences between tests that purport to measure the same construct.
Tests should ideally include enough diverse content to adequately sample the breadth of construct-relevant domains, but content sampling should not be so diverse that scale coherence and uniformity are lost. Tests marked by construct underrepresentation, stemming from narrow and homogeneous content sampling, tend to yield higher reliabilities than tests with heterogeneous item content, at the potential cost of generalizability and external validity. In contrast, tests with more heterogeneous content may show higher validity at the concomitant cost of scale reliability. Clinical inferences made from tests with excessively narrow breadth of content may be suspect, even when other indexes of validity are satisfactory (Haynes et al., 1995).
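The trade-off between content homogeneity and reliability can be illustrated with the standardized form of coefficient alpha, which depends only on the number of items and the mean inter-item correlation. The sketch below is a simplification that assumes standardized items; the item count and correlation values are hypothetical, chosen only to contrast a narrow scale with a broad one.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized coefficient alpha: k * r / (1 + (k - 1) * r)."""
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    k, r = n_items, mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)


# Hypothetical scales of equal length (10 items each).
# Narrow, homogeneous content: items correlate more highly with one another.
homogeneous = standardized_alpha(10, 0.40)
# Broad, heterogeneous content: items correlate less with one another.
heterogeneous = standardized_alpha(10, 0.15)

if __name__ == "__main__":
    print(f"homogeneous scale alpha:   {homogeneous:.2f}")
    print(f"heterogeneous scale alpha: {heterogeneous:.2f}")
```

Under these assumed values the homogeneous scale shows the higher internal consistency, which says nothing about whether it samples the construct adequately.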
The formulation of test items and procedures based on and consistent with a theory has been termed substantive validity (Loevinger, 1957). The presence of an underlying theory enhances a test's construct validity by providing a scaffolding between content and constructs, which logically explains relations between elements, predicts undetermined parameters, and explains findings that would be anomalous within another theory (e.g., Kuhn, 1970). As Crocker and Algina (1986) suggest, "psychological measurement, even though it is based on observable responses, would have little meaning or usefulness unless it could be interpreted in light of the underlying theoretical construct" (p. 7).
Many major psychological tests remain psychometrically rigorous but impoverished in terms of theoretical underpinnings. For example, there is conspicuously little theory associated with most widely used measures of intelligence (e.g., the Wechsler scales), behavior problems (e.g., the Child Behavior Checklist), neuropsychological functioning (e.g., the Halstead-Reitan Neuropsychology Battery), and personality and psychopathology (the MMPI-2). There may be some post hoc benefits to tests developed without theories; as observed by Nunnally and Bernstein (1994), "Virtually every measure that became popular led to new unanticipated theories" (p. 107).
Personality assessment has taken a leading role in theory-based test development, while cognitive-intellectual assessment has lagged. Describing best practices for the measurement of personality some three decades ago, Loevinger (1972) commented, "Theory has always been the mark of a mature science. The time is overdue for psychology, in general, and personality measurement, in particular, to come of age" (p. 56). In the same year, Meehl (1972) renounced his former position as a "dustbowl empiricist" in test development:
I now think that all stages in personality test development, from initial phase of item pool construction to a late-stage optimized clinical interpretive procedure for the fully developed and "validated" instrument, theory—and by this I mean all sorts of theory, including trait theory, developmental theory, learning theory, psychodynamics, and behavior genetics—should play an important role. . . . [P]sychology can no longer afford to adopt psychometric procedures whose methodology proceeds with almost zero reference to what bets it is reasonable to lay upon substantive personological horses. (pp. 149-151)
Leading personality measures with well-articulated theories include those built on the "Big Five" factors of personality and on Millon's "three polarity" bioevolutionary theory. Newer theory-based intelligence tests, such as the Kaufman Assessment Battery for Children (Kaufman & Kaufman, 1983) and the Cognitive Assessment System (Naglieri & Das, 1997), represent evidence of substantive validity in cognitive assessment.
Structural validity relies mainly on factor analytic techniques to identify a test's underlying dimensions and the variance associated with each dimension. Also called factorial validity (Guilford, 1950), this form of validity may utilize other methodologies such as multidimensional scaling to help researchers understand a test's structure. Structural validity evidence is generally internal to the test, based on the analysis of constituent subtests or scoring indexes. Structural validation approaches may also combine two or more instruments in cross-battery factor analyses to explore evidence of convergent validity.
The two leading factor-analytic methodologies used to establish structural validity are exploratory and confirmatory factor analyses. Exploratory factor analyses allow for empirical derivation of the structure of an instrument, often without a priori expectations, and are best interpreted according to the psychological meaningfulness of the dimensions or factors that emerge (e.g., Gorsuch, 1983). Confirmatory factor analyses help researchers evaluate the congruence of the test data with a specified model, as well as measuring the relative fit of competing models. Confirmatory analyses explore the extent to which the proposed factor structure of a test explains its underlying dimensions as compared to alternative theoretical explanations.
As a recommended guideline, the underlying factor structure of a test should be congruent with its composite indexes (e.g., Floyd & Widaman, 1995), and the interpretive structure of a test should be the best fitting model available. For example, several interpretive indexes for the Wechsler Intelligence Scales (i.e., the verbal comprehension, perceptual organization, working memory/freedom from distractibility, and processing speed indexes) match the empirical structure suggested by subtest-level factor analyses; however, the original Verbal-Performance Scale dichotomy has never been supported unequivocally in factor-analytic studies. At the same time, leading instruments such as the MMPI-2 yield clinical symptom-based scales that do not match the structure suggested by item-level factor analyses. Several new instruments with strong theoretical underpinnings have been criticized for mismatch between factor structure and interpretive structure (e.g., Keith & Kranzler, 1999; Stinnett, Coombs, Oehler-Stinnett, Fuqua, & Palmer, 1999) even when there is a theoretical and clinical rationale for scale composition. A reasonable balance should be struck between theoretical underpinnings and empirical validation; that is, if factor analysis does not match a test's underpinnings, is that the fault of the theory, the factor analysis, the nature of the test, or a combination of these factors? Carroll (1983), whose factor-analytic work has been influential in contemporary cognitive assessment, cautioned against overreliance on factor analysis as principal evidence of validity, encouraging use of additional sources of validity evidence that move beyond factor analysis (p. 26). Consideration and credit must be given to both theory and empirical validation results, without one taking precedence over the other.
Evidence of test score validity also includes the extent to which the test results predict meaningful and generalizable behaviors independent of actual test performance. Test results need to be validated for any intended application or decision-making process in which they play a part. In this section, external classes of evidence for test construct validity are described, including convergent, discriminant, criterion-related, and consequential validity, as well as specialized forms of validity within these categories.
In a frequently cited 1959 article, D. T. Campbell and Fiske described a multitrait-multimethod methodology for investigating construct validity. In brief, they suggested that a measure is jointly defined by its methods of gathering data (e.g., self-report or parent-report) and its trait-related content (e.g., anxiety or depression). They noted that test scores should be related to (i.e., strongly correlated with) other measures of the same psychological construct (convergent evidence of validity) and comparatively unrelated to (i.e., weakly correlated with) measures of different psychological constructs (discriminant evidence of validity). The multitrait-multimethod matrix allows for the comparison of the relative strength of association between two measures of the same trait using different methods (monotrait-heteromethod correlations), two measures with a common method but tapping different traits (heterotrait-monomethod correlations), and two measures tapping different traits using different methods (heterotrait-heteromethod correlations), all of which are expected to yield lower values than internal consistency reliability statistics using the same method to tap the same trait.
The multitrait-multimethod matrix offers several advantages, such as the identification of problematic method variance. Method variance is a measurement artifact that threatens validity by producing spuriously high correlations between similar assessment methods of different traits. For example, high correlations between digit span, letter span, phoneme span, and word span procedures might be interpreted as stemming from the immediate memory span recall method common to all the procedures rather than any specific abilities being assessed. Method effects may be assessed by comparing the correlations of different traits measured with the same method (i.e., monomethod correlations) and the correlations among different traits across methods (i.e., heteromethod correlations). Method variance is said to be present if the heterotrait-monomethod correlations greatly exceed the heterotrait-heteromethod correlations in magnitude, assuming that convergent validity has been demonstrated.
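The comparison described above can be sketched for a small matrix of two traits and two methods. All correlation values below are hypothetical and serve only to show how the three classes of correlations are sorted and compared; averaging within each class is a simplification, not a prescribed procedure.

```python
from itertools import combinations
from statistics import mean

# Each measure is identified by (trait, method). All values are hypothetical.
MEASURES = [
    ("anxiety", "self-report"),
    ("depression", "self-report"),
    ("anxiety", "parent-report"),
    ("depression", "parent-report"),
]

CORRELATIONS = {
    frozenset([("anxiety", "self-report"), ("anxiety", "parent-report")]): 0.55,
    frozenset([("depression", "self-report"), ("depression", "parent-report")]): 0.50,
    frozenset([("anxiety", "self-report"), ("depression", "self-report")]): 0.45,
    frozenset([("anxiety", "parent-report"), ("depression", "parent-report")]): 0.40,
    frozenset([("anxiety", "self-report"), ("depression", "parent-report")]): 0.20,
    frozenset([("depression", "self-report"), ("anxiety", "parent-report")]): 0.15,
}


def classify(a, b):
    """Label a pair of distinct measures by shared trait and shared method."""
    same_trait, same_method = a[0] == b[0], a[1] == b[1]
    if same_trait and same_method:
        raise ValueError("same trait and method: a reliability value, not a validity value")
    trait = "monotrait" if same_trait else "heterotrait"
    method = "monomethod" if same_method else "heteromethod"
    return f"{trait}-{method}"


def summarize(measures, correlations):
    """Average the correlations within each class of the matrix."""
    groups = {}
    for a, b in combinations(measures, 2):
        groups.setdefault(classify(a, b), []).append(correlations[frozenset((a, b))])
    return {label: mean(values) for label, values in groups.items()}


if __name__ == "__main__":
    summary = summarize(MEASURES, CORRELATIONS)
    for label, value in summary.items():
        print(f"{label:<26}{value:.3f}")
    gap = summary["heterotrait-monomethod"] - summary["heterotrait-heteromethod"]
    print(f"monomethod minus heteromethod (heterotrait): {gap:.3f}")
```

With these invented values the heterotrait-monomethod mean clearly exceeds the heterotrait-heteromethod mean, the pattern that would suggest method variance once convergent validity is established.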
Fiske and Campbell (1992) subsequently recognized shortcomings in their methodology: "We have yet to see a really good matrix: one that is based on fairly similar concepts and plausibly independent methods and shows high convergent and discriminant validation by all standards" (p. 394). At the same time, the methodology has provided a useful framework for establishing evidence of validity.
How well do test scores predict performance on independent criterion measures and differentiate criterion groups? The relationship of test scores to relevant external criteria constitutes evidence of criterion-related validity, which may take several different forms. Evidence of validity may include criterion scores that are obtained at about the same time (concurrent evidence of validity) or criterion scores that are obtained at some future date (predictive evidence of validity). External criteria may also include functional, real-life variables (ecological validity), diagnostic or placement indexes (diagnostic validity), and intervention-related approaches (treatment validity).
The emphasis on understanding the functional implications of test findings has been termed ecological validity (Neisser, 1978). Banaji and Crowder (1989) suggested, "If research is scientifically sound it is better to use ecologically lifelike rather than contrived methods" (p. 1188). In essence, ecological validation efforts relate test performance to various aspects of person-environment functioning in everyday life, including identification of both competencies and deficits in social and educational adjustment. Test developers should show the ecological relevance of the constructs a test purports to measure, as well as the utility of the test for predicting everyday functional limitations for remediation. In contrast, tests based on laboratory-like procedures with little or no discernible relevance to real life may be said to have little ecological validity.
The capacity of a measure to produce relevant applied group differences has been termed diagnostic validity (e.g., Ittenbach, Esters, & Wainer, 1997). When tests are intended for diagnostic or placement decisions, diagnostic validity refers to the utility of the test in differentiating the groups of concern. The process of arriving at diagnostic validity may be informed by decision theory, a process involving calculations of decision-making accuracy in comparison to the base rate occurrence of an event or diagnosis in a given population. Decision theory has been applied to psychological tests (Cronbach & Gleser, 1965) and other high-stakes diagnostic tests (Swets, 1992) and is useful for identifying the extent to which tests improve clinical or educational decision-making.
The method of contrasted groups is a common methodology to demonstrate diagnostic validity. In this methodology, test performance of two samples that are known to be different on the criterion of interest is compared. For example, a test intended to tap behavioral correlates of anxiety should show differences between groups of normal individuals and individuals diagnosed with anxiety disorders. A test intended for differential diagnostic utility should be effective in differentiating individuals with anxiety disorders from diagnoses that appear behaviorally similar. Decision-making classification accuracy may be determined by developing cutoff scores or rules to differentiate the groups, so long as the rules show adequate sensitivity, specificity, positive predictive power, and negative predictive power. These terms may be defined as follows:
• Sensitivity: the proportion of cases in which a clinical condition is detected when it is in fact present (true positive).
• Specificity: the proportion of cases for which a diagnosis is rejected, when rejection is in fact warranted (true negative).
• Positive predictive power: the probability of having the diagnosis given that the score exceeds the cutoff score.
• Negative predictive power: the probability of not having the diagnosis given that the score does not exceed the cutoff score.
All of these indexes of diagnostic accuracy are dependent upon the prevalence of the disorder and the prevalence of the score on either side of the cut point.
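The four indexes defined above can be computed directly from the counts in a two-by-two classification table. The sketch below uses hypothetical counts, and the second function holds sensitivity and specificity fixed (a simplifying assumption) to show how positive predictive power shifts with the base rate of the diagnosis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationCounts:
    """Counts from comparing a cutoff-based decision with the true diagnosis."""
    true_pos: int   # diagnosis present, score exceeds cutoff
    false_neg: int  # diagnosis present, score does not exceed cutoff
    false_pos: int  # diagnosis absent, score exceeds cutoff
    true_neg: int   # diagnosis absent, score does not exceed cutoff


def diagnostic_indexes(c):
    """Return sensitivity, specificity, and positive/negative predictive power."""
    return {
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        "positive_predictive_power": c.true_pos / (c.true_pos + c.false_pos),
        "negative_predictive_power": c.true_neg / (c.true_neg + c.false_neg),
    }


def positive_predictive_power(sensitivity, specificity, base_rate):
    """Positive predictive power from Bayes' rule, with fixed sensitivity and specificity."""
    hits = sensitivity * base_rate
    false_alarms = (1 - specificity) * (1 - base_rate)
    return hits / (hits + false_alarms)


if __name__ == "__main__":
    # Hypothetical sample: 50 diagnosed and 150 non-diagnosed individuals.
    counts = ClassificationCounts(true_pos=40, false_neg=10, false_pos=20, true_neg=130)
    for name, value in diagnostic_indexes(counts).items():
        print(f"{name:<28}{value:.3f}")

    # Same hypothetical test applied where the diagnosis is common versus rare.
    for base_rate in (0.50, 0.05):
        ppp = positive_predictive_power(0.80, 0.90, base_rate)
        print(f"base rate {base_rate:.2f}: positive predictive power {ppp:.3f}")
```

In this illustration the same cutoff yields far lower positive predictive power when the diagnosis is rare, which is one reason accuracy figures from one setting should be cross-validated before use in another.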
Findings pertaining to decision-making should be interpreted conservatively and cross-validated on independent samples because (a) classification decisions should in practice be based upon the results of multiple sources of information rather than test results from a single measure, and (b) the consequences of a classification decision should be considered in evaluating the impact of classification accuracy. A false negative classification, in which a child is incorrectly classified as not needing special education services, could mean the denial of needed services to a student. Alternatively, a false positive classification, in which a typical child is recommended for special services, could result in a child's being labeled unfairly.
Treatment validity refers to the value of an assessment in selecting and implementing interventions and treatments that will benefit the examinee. "Assessment data are said to be treatment valid," commented Barrios (1988), "if they expedite the orderly course of treatment or enhance the outcome of treatment" (p. 34). Other terms used to describe treatment validity are treatment utility (Hayes, Nelson, & Jarrett, 1987) and rehabilitation-referenced assessment (Heinrichs, 1990).
Whether the stated purpose of clinical assessment is description, diagnosis, intervention, prediction, tracking, or simply understanding, its ultimate raison d'etre is to select and implement services in the best interests of the examinee, that is, to guide treatment. In 1957, Cronbach described a rationale for linking assessment to treatment: "For any potential problem, there is some best group of treatments to use and best allocation of persons to treatments" (p. 680).
The origins of treatment validity may be traced to the concept of aptitude by treatment interactions (ATI) originally proposed by Cronbach (1957), who initiated decades of research seeking to specify relationships between the traits measured by tests and the intervention methodology used to produce change. In clinical practice, promising efforts to match client characteristics and clinical dimensions to preferred therapist characteristics and treatment approaches have been made (e.g., Beutler & Clarkin, 1990; Beutler & Harwood, 2000; Lazarus, 1973; Maruish, 1999), but progress has been constrained in part by difficulty in arriving at consensus for empirically supported treatments (e.g., Beutler, 1998). In psychoeducational settings, test results have been shown to have limited utility in predicting differential responses to varied forms of instruction (e.g., Reschly, 1997). It is possible that progress in educational domains has been constrained by underestimation of the complexity of treatment validity. For example, many ATI studies utilize overly simple modality-specific dimensions (auditory-visual learning style or verbal-nonverbal preferences) because of their easy appeal. New approaches to demonstrating ATI are described in the chapter on intelligence in this volume by Wasserman.
In recent years, there has been an increasing recognition that test usage has both intended and unintended effects on individuals and groups. Messick (1989, 1995b) has argued that test developers must understand the social values intrinsic to the purposes and application of psychological tests, especially those that may act as a trigger for social and educational actions. Linn (1998) has suggested that when governmental bodies establish policies that drive test development and implementation, the responsibility for the consequences of test usage must also be borne by the policymakers. In this context, consequential validity refers to the appraisal of value implications and the social impact of score interpretation as a basis for action and labeling, as well as the actual and potential consequences of test use (Messick, 1989; Reckase, 1998).
This new form of validity represents an expansion of traditional conceptualizations of test score validity. Lees-Haley (1996) has urged caution about consequential validity, noting its potential for encouraging the encroachment of politics into science. The Standards for Educational and Psychological Testing (1999) recognize but carefully circumscribe consequential validity:
Evidence about consequences may be directly relevant to validity when it can be traced to a source of invalidity such as construct underrepresentation or construct-irrelevant components. Evidence about consequences that cannot be so traced—that in fact reflects valid differences in performance—is crucial in informing policy decisions but falls outside the technical purview of validity. (p. 16)
Evidence of consequential validity may be collected by test developers during a period starting early in test development and extending through the life of the test (Reckase, 1998). For educational tests, surveys and focus groups have been described as two methodologies to examine consequential aspects of validity (Chudowsky & Behuniak, 1998; Pomplun, 1997). As the social consequences of test use and interpretation are ascertained, the development and determinants of the consequences need to be explored. A measure with unintended negative side effects calls for examination of alternative measures and assessment counterproposals. Consequential validity is especially relevant to issues of bias, fairness, and distributive justice.