Jensen (1980) compiled an extensive early review of test bias studies. One concern addressed in the review was rational judgments that test items were biased based on their content or phrasing. For scientists, rational judgments are those based on reason rather than empirical findings. Such judgments may seem sound or even self-evident, but they often conflict with each other and with scientific evidence.
A WISC-R item often challenged on rational grounds is, "What is the thing to do if a boy/girl much smaller than yourself starts to fight with you?" Correct responses include, "Walk away," and, "Don't hit him back." CTBH proponents criticized this item as biased against inner-city Black children, who may be expected to hit back to maintain their status, and who may therefore respond incorrectly for cultural reasons. Jensen (1980) reviewed large-sample research indicating that proportionately more Black children than White children responded correctly to this item. Miele (1979), who also researched this item in a large-N study, concluded that the item was easier for Blacks than for Whites. As with this item, empirical results often contradict rational judgments.
Jensen (1980) addressed bias in predictive and construct validity, along with situational bias. Bias in predictive validity, as defined by Jensen, is systematic error in predicting a criterion variable for people of different groups. This bias occurs when one regression equation is incorrectly used for two or more groups. The review included studies involving Blacks and Whites, the two most frequently researched groups. The conclusions reached by Jensen were that (a) a large majority of studies showed that tests were equally valid for these groups and that (b) when differences were found, the tests overpredicted Black examinees when compared with White examinees. CTBH would have predicted the opposite result.
Bias in construct validity occurs when a test measures groups of examinees differently. For example, a test can be more difficult, valid, or reliable for one group than for another. Construct bias involves the test itself, whereas predictive bias involves a test's prediction of a result outside the test.
Jensen (1980) found numerous studies of bias in construct validity. As regards difficulty, when item scores differed for ethnic groups or social classes, the differences were not consistently associated with the culture loadings of the tests. Score differences between Black and White examinees were larger on nonverbal than on verbal tests, contrary to beliefs that nonverbal tests are culture fair or unbiased. The sizes of Black-White differences were positively associated with tests' correlations with g, or general ability. In tests with several item types, such as traditional intelligence tests, the rank orders of item difficulties for different ethnic groups were very highly correlated. Items that discriminated most between Black and White examinees also discriminated most between older and younger members of each ethnic group. Finally, Blacks, Whites, and Mexican Americans showed similar correlations between raw test scores and chronological ages.
In addition, Jensen (1980) reviewed results pertaining to validity and reliability. Black, White, and Mexican American examinees produced similar estimates of internal consistency reliability. As regards validity, Black and White samples showed the same factor structures. Jensen wrote that the evidence was generally inconclusive for infrequently researched ethnic groups, such as Asian Americans and Native Americans.
Jensen's (1980) term situational bias refers to "influences in the test situation, but independent of the test itself, that may bias test scores" (p. 377). These influences may include, among others, characteristics of the test setting, the instructions, and the examiners themselves. Examples include anxiety, practice and coaching effects, and examiner dialect and ethnic group (Jensen, 1984). As Jensen (1980) observed, situational influences would not constitute test bias, because they are not attributes of the tests themselves. Nevertheless, they should emerge in studies of construct and predictive bias. Jensen concluded that the situational variables reviewed did not influence group differences in scores.
Soon after Jensen's (1980) review was published, the National Academy of Sciences and the National Research Council commissioned a panel of 19 experts, who conducted a second review of the test bias literature. The panel concluded that well-constructed tests were not biased against African Americans or other English-speaking minority groups (Wigdor
& Garner, 1982). Later, a panel of 52 professionals signed a position paper that concluded, in part, "Intelligence tests are not culturally biased against American blacks or other native-born, English-speaking peoples in the U.S. Rather, IQ scores predict equally accurately for all such Americans, regardless of race and social class" ("Mainstream Science," 1994, p. A18). That same year, a task force of 11 psychologists, established by the American Psychological Association, concluded that no test characteristic reviewed made a substantial contribution to Black-White differences in intelligence scores (Neisser et al., 1996). Thus, several major reviews have failed to support CTBH (see also Reynolds, 1998a, 1999).
Review by Reynolds, Lowe, and Saenz

Content Validity
Reynolds, Lowe, et al. (1999) categorized findings under content, construct, and predictive validity. Content validity is the extent to which the content of a test is a representative sample of the behavior to be measured (Anastasi, 1988). Items with content bias should behave differently from group to group for people of the same standing on the characteristic being measured. Typically, reviewers judge an intelligence item to have content bias because the information or solution method required is unfamiliar to disadvantaged or minority individuals, or because the test's author has arbitrarily decided on the correct answer, so that minorities are penalized for giving responses that are correct in their own culture but not in the author's culture.
The issue of content validity with achievement tests is complex. Important variables to consider include exposure to instruction, general ability of the group, and accuracy and specificity of the items for the sample (Reynolds, Lowe, et al., 1999; see also Schmidt, 1983). Little research is available for personality tests, but cultural variables that may be found to influence some personality tests include beliefs regarding discipline and aggression, values related to education and employment, and perceptions concerning society's fairness toward one's group.
Camilli and Shepard (1994; Reynolds, 2000a) recommended techniques based on item-response theory (IRT) to detect differential item functioning (DIF). DIF statistics detect items that behave differently from one group to another. A statistically significant DIF statistic, by itself, does not indicate bias but may lead to later findings of bias through additional research, with consideration of the construct meant to be measured. For example, if an item on a composition test were about medieval history, studies might be conducted to determine if the item is measuring composition skill or some unintended trait, such as historical knowledge. For smaller samples, a contingency table (CT) procedure is often used to estimate DIF. CT approaches are relatively easy to understand and interpret.
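One widely used CT index is the Mantel-Haenszel common odds ratio. The sketch below is a minimal illustration of that statistic, not a reproduction of any reviewed study; the examinee counts are invented for the example.

```python
# Hedged sketch: Mantel-Haenszel common odds ratio, a standard
# contingency-table (CT) index of differential item functioning (DIF).
# Examinees are stratified by total score; within each stratum, a 2x2
# table crosses group membership with passing or failing the item.

def mantel_haenszel_or(strata):
    """strata: list of (A, B, C, D) tuples, one per ability stratum:
    A = reference-group correct, B = reference-group incorrect,
    C = focal-group correct,     D = focal-group incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Three total-score strata for one item (hypothetical counts):
strata = [(40, 10, 40, 10), (30, 20, 30, 20), (15, 35, 15, 35)]
print(round(mantel_haenszel_or(strata), 2))  # -> 1.0
```

A ratio near 1.0 means matched members of the two groups have similar odds of answering correctly; ratios well above or below 1.0 flag the item for the kind of follow-up research described above.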
Nandakumar, Glutting, and Oakland (1993) used a CT approach to investigate possible racial, ethnic, and gender bias on the Guide to the Assessment of Test Session Behavior (GATSB). Participants were boys and girls aged 6-16 years, of White, Black, or Hispanic ethnicity. Only 10 of 80 items produced statistically significant DIFs, suggesting that the GATSB has little bias for different genders and ethnicities.
In very-large-N studies, Reynolds, Willson, and Chatman (1984) used a partial correlation procedure (Reynolds, 2000a) to estimate DIF in tests of intelligence and related aptitudes. The researchers found no systematic bias against African Americans or women on measures of English vocabulary. Willson, Nolan, Reynolds, and Kamphaus (1989) used the same procedure to estimate DIF on the Mental Processing scales of the K-ABC. The researchers concluded that there was little apparent evidence of race or gender bias.
Jensen (1976) used a chi-square technique (Reynolds, 2000a) to examine the distribution of incorrect responses for two multiple-choice intelligence tests, RPM and the Peabody Picture-Vocabulary Test (PPVT). Participants were Black and White children aged 6-12 years. The errors for many items were distributed systematically over the response options. This pattern, however, was the same for Blacks and Whites. These results indicated bias in a general sense, but not racial bias. On RPM, Black and White children made different types of errors, but for few items. The researcher examined these items with children of different ages. For each of the items, Jensen was able to duplicate Blacks' response patterns using those of Whites approximately two years younger.
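The logic of comparing error distributions can be sketched as follows. The counts are hypothetical and the function is a generic Pearson chi-square, not Jensen's data or code; the question is simply whether the two groups spread their errors over the distractors in the same way.

```python
# Hedged sketch: do two groups distribute their *errors* over an item's
# wrong answer options in the same way? A small chi-square statistic
# (relative to its critical value) suggests the same distractors attract
# both groups, i.e., no group-specific error pattern.

def chi_square(table):
    """Pearson chi-square statistic for an r x c table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# Errors on one item's three distractors, by group (hypothetical):
errors = [[60, 25, 15],   # group 1 examinees
          [58, 27, 15]]   # group 2 examinees
print(round(chi_square(errors), 2))  # -> 0.11
```

Here the statistic (0.11 on 2 degrees of freedom) falls far below the .05 critical value of 5.99, consistent with the systematic-but-shared error pattern Jensen reported.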
Scheuneman (1987) used linear methodology on Graduate Record Examination (GRE) item data to show possible influences on the scores of Black and White test-takers. Vocabulary content, true-false response, and presence or absence of diagrams were among the item characteristics examined. Paired, experimental items were administered in the experimental section of the GRE General Test, given in December 1982. Results indicated that certain characteristics common to a variety of items may have a differential influence on Blacks' and Whites' scores. These items may be measuring, in part, test content rather than verbal, quantitative, or analytical skill.
Jensen (1974, 1976, 1977) evaluated bias on the Wonderlic Personnel Test (WPT), PPVT, and RPM using correlations between P decrements (Reynolds, 2000a) obtained by Black students and those obtained by White students. P is the probability of passing an item, and a P decrement is the size of the difference in P from one item to the next.