From the inception of psychological testing, problems with racial, ethnic, and gender bias have been apparent. As early as 1911, Alfred Binet (Binet & Simon, 1911/1916) was aware that a failure to represent diverse socioeconomic classes would affect normative performance on intelligence tests. He deleted classes of items that related more to quality of education than to mental faculties. Early editions of the Stanford-Binet and the Wechsler intelligence scales were standardized on entirely White, native-born samples (Terman, 1916; Terman & Merrill, 1937; Wechsler, 1939, 1946, 1949). In addition to sample limitations, early tests also contained items that reflected positively on Whites. Early editions of the Stanford-Binet included an Aesthetic Comparisons item in which examinees were shown a well-coiffed, blond White woman and a disheveled woman with African features; the examinee was asked, "Which one is prettier?" The original MMPI (Hathaway & McKinley, 1943) was normed on a convenience sample of White adult Minnesotans and contained true-false, self-report items referring to culture-specific games (drop-the-handkerchief), literature (Alice in Wonderland), and religious beliefs (the second coming of Christ). These types of problems, normative samples without minority representation and racially or ethnically insensitive items, are now routinely avoided by most contemporary test developers.
In spite of these advances, the fairness of educational and psychological tests represents one of the most contentious and psychometrically challenging aspects of test development. Numerous methodologies have been proposed to assess item effectiveness for different groups of test takers, and the definitive text in this area is Jensen's (1980) thoughtful Bias in Mental Testing. The chapter by Reynolds and Ramsay in this volume also describes a comprehensive array of approaches to test bias. Most of the controversy regarding test fairness relates to the lay and legal perception that any group difference in test scores constitutes bias, in and of itself. For example, Jencks and Phillips (1998) stress that the test score gap is the single most important obstacle to achieving racial balance and social equity.
In landmark litigation, Judge Robert Peckham in Larry P. v. Riles (1972/1974/1979/1984/1986) banned the use of individual IQ tests for placing Black children into educable mentally retarded classes in California, concluding that the cultural bias of the IQ tests was hardly disputed in the litigation. He asserted, "Defendants do not seem to dispute the evidence amassed by plaintiffs to demonstrate that the IQ tests in fact are culturally biased" (Peckham, 1972, p. 1313) and later concluded, "An unbiased test that measures ability or potential should yield the same pattern of scores when administered to different groups of people" (Peckham, 1979, pp. 954-955).
The belief that any group test score difference constitutes bias has been termed the egalitarian fallacy by Jensen (1980, p. 370):
This concept of test bias is based on the gratuitous assumption that all human populations are essentially identical or equal in whatever trait or ability the test purports to measure. Therefore, any difference between populations in the distribution of test scores (such as a difference in means, or standard deviations, or any other parameters of the distribution) is taken as evidence that the test is biased. The search for a less biased test, then, is guided by the criterion of minimizing or eliminating the statistical differences between groups. The perfectly nonbiased test, according to this definition, would reveal reliable individual differences but not reliable (i.e., statistically significant) group differences. (p. 370)
However this controversy is viewed, the perception of test bias stemming from group mean score differences remains a deeply ingrained belief among many psychologists and educators. McArdle (1998) suggests that large group mean score differences are "a necessary but not sufficient condition for test bias" (p. 158). McAllister (1993) has observed, "In the testing community, differences in correct answer rates, total scores, and so on do not mean bias. In the political realm, the exact opposite perception is found; differences mean bias" (p. 394).
The newest models of test fairness describe a systemic approach utilizing both internal and external sources of evidence of fairness that extend from test conception and design through test score interpretation and application (Camilli & Shepard, 1994; McArdle, 1998; Willingham, 1999). These models are important because they acknowledge the importance of the consequences of test use in a holistic assessment of fairness and a multifaceted methodological approach to accumulate evidence of test fairness. In this section, a systemic model of test fairness adapted from the work of several leading authorities is described.
Three key terms appear in the literature associated with test score fairness: bias, fairness, and equity. These concepts overlap but are not identical; for example, a test that shows no evidence of test score bias may be used unfairly. To some extent these terms have historically been defined by families of relevant psychometric analyses—for example, bias is usually associated with differential item functioning, and fairness is associated with differential prediction to an external criterion. In this section, the terms are defined at a conceptual level.
Test score bias tends to be defined in a narrow manner, as a special case of test score invalidity. According to the most recent Standards (1999), bias in testing refers to "construct under-representation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers" (p. 172). This definition implies that bias stems from nonrandom measurement error, provided that the typical magnitude of random error is comparable for all groups of interest. Accordingly, test score bias refers to the systematic and invalid introduction of measurement error for a particular group of interest. The statistical underpinnings of this definition have been underscored by Jensen (1980), who asserted, "The assessment of bias is a purely objective, empirical, statistical and quantitative matter entirely independent of subjective value judgments and ethical issues concerning fairness or unfairness of tests and the uses to which they are put" (p. 375). Some scholars consider the characterization of bias as objective and independent of the value judgments associated with fair use of tests to be fundamentally incorrect (e.g., Willingham, 1999).
Test score fairness refers to the ways in which test scores are utilized, most often for various forms of decision-making such as selection. Jensen (1980) suggests that test fairness refers "to the ways in which test scores (whether of biased or unbiased tests) are used in any selection situation" (p. 376), arguing that fairness is a subjective policy decision based on philosophic, legal, or practical considerations rather than a statistical decision. Willingham (1999) describes a test fairness manifold that extends throughout the entire process of test development, including the consequences of test usage. Embracing the idea that fairness is akin to demonstrating the generalizability of test validity across population subgroups, he notes that "the manifold of fairness issues is complex because validity is complex" (p. 223). Fairness is a concept that transcends a narrow statistical and psychometric approach.
Finally, equity refers to a social value associated with the intended and unintended consequences and impact of test score usage. Because of the importance of equal opportunity, equal protection, and equal treatment in mental health, education, and the workplace, Willingham (1999) recommends that psychometrics actively consider equity issues in test development. As Tiedeman (1978) noted, "Test equity seems to be emerging as a criterion for test use on a par with the concepts of reliability and validity" (p. xxviii).
The internal features of a test related to fairness generally include the test's theoretical underpinnings, item content and format, differential item and test functioning, measurement precision, and factorial structure. The two best-known procedures for evaluating test fairness include expert reviews of content bias and analysis of differential item functioning. These and several additional sources of evidence of test fairness are discussed in this section.
In efforts to enhance fairness, the content and format of psychological and educational tests commonly undergo subjective bias and sensitivity reviews one or more times during test development. In this review, independent representatives from diverse groups closely examine tests, identifying items and procedures that may yield differential responses for one group relative to another. Content may be reviewed for cultural, disability, ethnic, racial, religious, sex, and socioeconomic status bias. For example, a reviewer may be asked a series of questions including, "Does the content, format, or structure of the test item present greater problems for students from some backgrounds than for others?" A comprehensive item bias review is available from Hambleton and Rodgers (1995), and useful guidelines to reduce bias in language are available from the American Psychological Association (1994).
Ideally, there are two objectives in bias and sensitivity reviews: (a) eliminate biased material, and (b) ensure balanced and neutral representation of groups within the test. Among the potentially biased elements of tests that should be avoided are
• Material that is controversial, emotionally charged, or inflammatory for any specific group.
• Language, artwork, or material that is demeaning or offensive to any specific group.
• Content or situations with differential familiarity and relevance for specific groups.
• Language and instructions that have different or unfamiliar meanings for specific groups.
• Information or skills that may not be expected to be within the educational background of all examinees.
• Format or structure of the item that presents differential difficulty for specific groups.
Among the prosocial elements that ideally should be included in tests are
• Presentation of universal experiences in test material.
• Balanced distribution of people from diverse groups.
• Presentation of people in activities that do not reinforce stereotypes.
• Item presentation in a sex-, culture-, age-, and race-neutral manner.
• Inclusion of individuals with disabilities or handicapping conditions.
In general, the content of test materials should be relevant and accessible for the entire population of examinees for whom the test is intended. For example, the experiences of snow and freezing winters are outside the range of knowledge of many Southern students, thereby introducing a geographic regional bias. Use of utensils such as forks may be unfamiliar to Asian immigrants who may instead use chopsticks. Use of United States coinage means that the test cannot be validly used with examinees from countries with different currency.
Tests should also be free of controversial, emotionally charged, or value-laden content, such as violence or religion. The presence of such material may prove distracting, offensive, or unsettling to examinees from some groups, detracting from test performance.
Stereotyping refers to the portrayal of a group using only a limited number of attributes, characteristics, or roles. As a rule, stereotyping should be avoided in test development. Specific groups should be portrayed accurately and fairly, without reference to stereotypes or traditional roles regarding sex, race, ethnicity, religion, physical ability, or geographic setting. Group members should be portrayed as exhibiting a full range of activities, behaviors, and roles.
Are item and test statistical properties equivalent for individuals of comparable ability, but from different groups? Differential test and item functioning (DTIF, or DTF and DIF) refers to a family of statistical procedures aimed at determining whether examinees of the same ability but from different groups have different probabilities of success on a test or an item. The most widely used of DIF procedures is the Mantel-Haenszel technique (Holland & Thayer, 1988), which assesses similarities in item functioning across various demographic groups of comparable ability. Items showing significant DIF are usually considered for deletion from a test.
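The core of the Mantel-Haenszel technique can be sketched briefly. The following Python fragment is a minimal illustration, not drawn from the chapter, using hypothetical counts: it computes the Mantel-Haenszel common odds ratio for a single item, with examinees stratified into ability levels by total test score. Values near 1.0 suggest the item functions similarly for the reference and focal groups; values far from 1.0 flag potential DIF.

```python
# Illustrative sketch with hypothetical counts: the Mantel-Haenszel
# common odds ratio for one item, stratified by ability level.

def mantel_haenszel_odds_ratio(strata):
    """Each stratum: (ref_correct, ref_wrong, focal_correct, focal_wrong)."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n  # reference-correct x focal-wrong, weighted by stratum size
        den += b * c / n  # reference-wrong x focal-correct, weighted by stratum size
    return num / den

# Hypothetical 2 x 2 tables at three ability strata (low, middle, high):
strata = [
    (30, 20, 25, 25),
    (40, 10, 30, 20),
    (45, 5, 40, 10),
]

alpha_mh = mantel_haenszel_odds_ratio(strata)
print(round(alpha_mh, 2))  # prints 2.0 for these data: the item favors the reference group
```

In operational DIF analyses the odds ratio is typically accompanied by a chi-square significance test and an effect-size classification before an item is considered for deletion.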
DIF has been extended by Shealy and Stout (1993) to a test score-based level of analysis known as differential test functioning, a multidimensional nonparametric IRT index of test bias. Whereas DIF is expressed at the item level, DTF aggregates differential functioning across two or more items at the test score level, with scores on a valid subtest used to match examinees according to ability level. Tests may show evidence of DIF on some items without evidence of DTF, provided item bias statistics are offsetting and eliminate differential bias at the test score level.
Although psychometricians have embraced DIF as a preferred method for detecting potential item bias (McAllister, 1993), this methodology has been subjected to increasing criticism because of its dependence upon internal test properties and its inherent circular reasoning. Hills (1999) notes that two decades of DIF research have failed to demonstrate that removing biased items affects test bias and narrows the gap in group mean scores. Furthermore, DIF rests on several assumptions, including the assumptions that items are unidimensional, that the latent trait is equivalently distributed across groups, that the groups being compared (usually racial, sex, or ethnic groups) are homogeneous, and that the overall test is unbiased. Camilli and Shepard (1994) observe, "By definition, internal DIF methods are incapable of detecting constant bias. Their aim, and capability, is only to detect relative discrepancies" (p. 17).
The demonstration that a test has equal internal integrity across racial and ethnic groups has been described as a way to demonstrate test fairness (e.g., Mercer, 1984). Among the internal psychometric characteristics that may be examined for this type of generalizability are internal consistency, item difficulty calibration, test-retest stability, and factor structure.
With indexes of internal consistency, it is usually sufficient to demonstrate that the test meets the guidelines such as those recommended above for each of the groups of interest, considered independently (Jensen, 1980). Demonstration of adequate measurement precision across groups suggests that a test has adequate accuracy for the populations in which it may be used. Geisinger (1998) noted that "subgroup-specific reliability analysis may be especially appropriate when the reliability of a test has been justified on the basis of internal consistency reliability procedures (e.g., coefficient alpha). Such analysis should be repeated in the group of special test takers because the meaning and difficulty of some components of the test may change over groups, especially over some cultural, linguistic, and disability groups" (p. 25). Differences in group reliabilities may be evident, however, when test items are substantially more difficult for one group than another or when ceiling or floor effects are present for only one group.
A Rasch-based methodology to compare relative difficulty of test items involves separate calibration of items of the test for each group of interest (e.g., O'Brien, 1992). The items may then be plotted against an identity line in a bivariate graph and bounded by 95 percent confidence bands. Items falling within the bands are considered to have invariant difficulty, whereas items falling outside the bands have different difficulty and may have different meanings across the two samples.
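The comparison step of this Rasch-based approach can be sketched numerically rather than graphically. The fragment below is an illustration with hypothetical calibrations (it assumes the separate Rasch calibrations have already produced a difficulty estimate and standard error per item for each group): items whose between-group difficulty difference exceeds a 95% confidence band around the identity line are flagged as non-invariant.

```python
# Hypothetical per-group Rasch calibrations: item -> (difficulty in logits,
# standard error). Flag items outside a 95% confidence band.
import math

def invariant_items(calib_a, calib_b, z=1.96):
    flags = {}
    for item in calib_a:
        d_a, se_a = calib_a[item]
        d_b, se_b = calib_b[item]
        band = z * math.sqrt(se_a ** 2 + se_b ** 2)  # half-width of the 95% band
        flags[item] = abs(d_a - d_b) <= band  # True = invariant difficulty
    return flags

group1 = {"item1": (-1.2, 0.15), "item2": (0.3, 0.12), "item3": (1.1, 0.20)}
group2 = {"item1": (-1.1, 0.14), "item2": (1.0, 0.13), "item3": (1.2, 0.21)}

print(invariant_items(group1, group2))
# {'item1': True, 'item2': False, 'item3': True}: item2 is harder for group2
```

Item2 here falls outside the band and, in O'Brien's terms, may carry a different meaning across the two samples.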
The temporal stability of test scores should also be compared across groups, using similar test-retest intervals, in order to ensure that test results are equally stable irrespective of race and ethnicity. Jensen (1980) suggests,
If a test is unbiased, test-retest correlation, of course with the same interval between testings for the major and minor groups, should yield the same correlation for both groups. Significantly different test-retest correlations (taking proper account of possibly unequal variances in the two groups) are indicative of a biased test. Failure to understand instructions, guessing, carelessness, marking answers haphazardly, and the like, all tend to lower the test-retest correlation. If two groups differ in test-retest correlation, it is clear that the test scores are not equally accurate or stable measures of both groups. (p. 430)
As an index of construct validity, the underlying factor structure of psychological tests should be robust across racial and ethnic groups. A difference in the factor structure across groups provides some evidence for bias even though factorial invariance does not necessarily signify fairness (e.g., Meredith, 1993; Nunnally & Bernstein, 1994). Floyd and Widaman (1995) suggested, "Increasing recognition of cultural, developmental, and contextual influences on psychological constructs has raised interest in demonstrating measurement invariance before assuming that measures are equivalent across groups" (p. 296).
Beyond the concept of internal integrity, Mercer (1984) recommended that studies of test fairness include evidence of equal external relevance. In brief, this determination requires the examination of relations between item or test scores and independent external criteria. External evidence of test score fairness has been accumulated in the study of comparative prediction of future performance (e.g., use of the Scholastic Assessment Test across racial groups to predict a student's ability to do college-level work). Fair prediction and fair selection are two objectives that are particularly important as evidence of test fairness, in part because they figure prominently in legislation and court rulings.
Prediction bias can arise when a test differentially predicts future behaviors or performance across groups. Cleary (1968) introduced a methodology that evaluates comparative predictive validity between two or more salient groups. The Cleary rule states that a test may be considered fair if it has approximately the same regression equation, that is, comparable slope and intercept, explaining the relationship between the predictor test and an external criterion measure in the groups undergoing comparison. A slope difference between the two groups conveys differential validity and indicates that one group's performance on the external criterion is predicted less well than the other's performance. An intercept difference suggests a difference in the level of estimated performance between the groups, even if the predictive validity is comparable. It is important to note that this methodology assumes adequate levels of reliability for both the predictor and criterion variables. This procedure has several limitations that have been summarized by Camilli and Shepard (1994). The demonstration of equivalent predictive validity across demographic groups constitutes an important source of fairness that is related to validity generalization.
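The Cleary comparison can be sketched with ordinary least squares fitted within each group. The data below are hypothetical (a test score predicting, say, first-year grade point average) and are contrived so the two groups share a slope but differ in intercept, the second pattern described above.

```python
# Minimal sketch of the Cleary approach with hypothetical data:
# fit criterion-on-predictor regression separately within each group.

def ols(x, y):
    """Simple least squares; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

# Hypothetical predictor scores and criterion values for two groups:
group_a = ([400, 500, 600, 700], [2.0, 2.5, 3.0, 3.5])
group_b = ([400, 500, 600, 700], [1.8, 2.3, 2.8, 3.3])

slope_a, int_a = ols(*group_a)
slope_b, int_b = ols(*group_b)

print(abs(slope_a - slope_b) < 1e-9)  # True: equal slopes, comparable predictive validity
print(round(int_a - int_b, 2))        # 0.2: an intercept difference between the groups
```

In practice the slope and intercept contrasts are tested for statistical significance (and, per the caveat above, interpreted in light of predictor and criterion reliability) rather than eyeballed as here.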
The consequences of test score use for selection and decision-making in clinical, educational, and occupational domains constitute a source of potential bias. The issue of fair selection addresses the question of whether the use of test scores for selection decisions unfairly favors one group over another. Specifically, test scores that produce adverse, disparate, or disproportionate impact for various racial or ethnic groups may be said to show evidence of selection bias, even when that impact is construct relevant. Since enactment of the Civil Rights Act of 1964, demonstration of adverse impact has been treated in legal settings as prima facie evidence of test bias. Adverse impact occurs when there is a substantially different rate of selection based on test scores and other factors that works to the disadvantage of members of a race, sex, or ethnic group.
Federal mandates and court rulings have frequently indicated that adverse, disparate, or disproportionate impact in selection decisions based upon test scores constitutes evidence of unlawful discrimination, and differential test selection rates among majority and minority groups have been considered a bottom line in federal mandates and court rulings. In its Uniform Guidelines on Employee Selection Procedures (1978), the Equal Employment Opportunity Commission (EEOC) operationalized adverse impact according to the four-fifths rule, which states, "A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact" (p. 126). Adverse impact has been applied to educational tests (e.g., the Texas Assessment of Academic Skills) as well as tests used in personnel selection. The U.S. Supreme Court held in 1988 that differential selection ratios can constitute sufficient evidence of adverse impact. The 1991 Civil Rights Act, Section 9, specifically and explicitly prohibits any discriminatory use of test scores for minority groups.
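The four-fifths rule quoted above reduces to a direct computation: divide each group's selection rate by the highest group's rate and compare to 0.8. The sketch below uses hypothetical selection rates.

```python
# The EEOC four-fifths rule applied to hypothetical selection rates.

def adverse_impact(selection_rates, threshold=0.8):
    """selection_rates: dict group -> proportion of applicants selected."""
    highest = max(selection_rates.values())
    # True where a group's rate falls below four-fifths of the highest rate:
    return {g: r / highest < threshold for g, r in selection_rates.items()}

rates = {"group_a": 0.60, "group_b": 0.45}  # hypothetical rates
print(adverse_impact(rates))
# {'group_a': False, 'group_b': True}: 0.45 / 0.60 = 0.75, below four-fifths
```

As the quoted guideline indicates, a ratio below 0.8 is treated by enforcement agencies as evidence of adverse impact, not as conclusive proof of unlawful discrimination.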
Since selection decisions involve the use of test cutoff scores, an analysis of costs and benefits according to decision theory provides a methodology for fully understanding the consequences of test score usage. Cutoff scores may be varied to provide optimal fairness across groups, or alternative cutoff scores may be utilized in certain circumstances. McArdle (1998) observes, "As the cutoff scores become increasingly stringent, the number of false negative mistakes (or costs) also increase, but the number of false positive mistakes (also a cost) decrease" (p. 174).
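McArdle's tradeoff can be made concrete with a small sketch. Using hypothetical scores and hypothetical true outcomes (whether each examinee would in fact have succeeded), the fragment counts false negatives (qualified examinees screened out) and false positives (unqualified examinees selected) at increasingly stringent cutoffs.

```python
# Hypothetical illustration of the false-negative / false-positive
# tradeoff as the selection cutoff becomes more stringent.

def error_counts(scores, succeeded, cutoff):
    fn = sum(1 for s, ok in zip(scores, succeeded) if s < cutoff and ok)      # false negatives
    fp = sum(1 for s, ok in zip(scores, succeeded) if s >= cutoff and not ok)  # false positives
    return fn, fp

scores    = [45, 50, 55, 60, 65, 70, 75, 80]
succeeded = [False, False, True, False, True, True, True, True]

for cutoff in (50, 60, 70):
    print(cutoff, error_counts(scores, succeeded, cutoff))
# 50 (0, 2)
# 60 (1, 1)
# 70 (2, 0)
```

As the cutoff rises from 50 to 70, false negatives climb from 0 to 2 while false positives fall from 2 to 0, which is the pattern McArdle describes; a decision-theoretic analysis attaches costs to each error type and chooses the cutoff minimizing expected cost.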