There have been many definitions of test bias (Flaugher, 1978; and see chapter 4 of this volume). Messick's (1980, 1989, 1995) conceptions of test bias are perhaps the most widely accepted and emerge from the perspective of construct validity. Messick portrayed bias as a specific source of test variance other than the valid variance associated with the desired construct. Bias is associated with some irrelevant variable, such as race, gender, or in the case of cross-cultural testing, culture (or perhaps country of origin). van de Vijver and his associates (van de Vijver, 2000; van de Vijver & Leung, 1997; van de Vijver & Poortinga, 1997) have perhaps best characterized bias within the context of cross-cultural assessment. "Test bias exists, from this viewpoint, when an existing test does not measure the equivalent underlying psychological construct in a new group or culture, as was measured within the original group in which it was standardized" (Allen & Walsh, 2000, p. 67). van de Vijver describes bias as "a generic term for all nuisance factors threatening the validity of cross-cultural comparisons. Poor item translations, inappropriate item content, and lack of standardization in administration procedures are just a few examples" (van de Vijver & Leung, p. 10). He also describes bias as "a lack of similarity of psychological meaning of test scores across cultural groups" (van de Vijver, p. 88).
The term bias, as can be seen, is very closely related to conceptual equivalence. van de Vijver describes the distinction as follows: "The two concepts are strongly related, but have a somewhat different connotation. Bias refers to the presence or absence of validity-threatening factors in such assessment, whereas equivalence involves the consequences of these factors on the comparability of test scores" (van de Vijver, 2000, p. 89). In his various publications, van de Vijver identifies three groupings of bias: construct, method, and item bias. Each is described in the following sections.
Measures that do not examine the same construct across cultural groups exhibit construct bias, which clearly is highly related to the notion of conceptual equivalence previously described. Construct bias would be evident in a measure that has been (a) factor analyzed in its original culture with the repeated finding of a four-factor solution, and that is then (b) translated to another language and administered to a sample from the culture in which that language is spoken, with a factor analysis of the test results indicating a two-factor and therefore different solution. When the constructs underlying the measures vary across cultures, with culture-specific components of the construct present in some cultures, such evidence is not only likely to result, it should result if both measures are to measure the construct validly in their respective cultures. Construct bias can occur when constructs only partially overlap across cultures, when there is a differential appropriateness of behaviors comprising the construct in different cultures, when there is a poor sampling of relevant behaviors constituting the construct in one or more cultures, or when there is incomplete coverage of all relevant aspects of the construct (van de Vijver, 2000).
An example of construct bias can be seen in the following. In many instances, inventories (such as personality inventories) are largely translated from the original language to a second or target language. If the culture that uses the target language has culture-specific aspects of personality that either do not exist or are not as prevalent as in the original culture, then these aspects will certainly not be translated into the target-language form of the assessment instrument.
The concept of construct bias has implications for both cross-cultural research and cross-cultural psychological practice. Cross-cultural or etic comparisons are unlikely to be very meaningful if the construct means something different in the two or more cultures, or if it is a reasonably valid representation of the construct in one culture but less so in the other. The practice implications in the target language emerge from the fact that the measure may not be valid as a measure of culturally relevant constructs that may be of consequence for diagnosis and treatment.
Method Bias

van de Vijver (2000) has identified a number of types of method bias, including sample, instrument, and administration bias. The different issues composing this type of bias were given the name method bias because they relate to the kinds of topics covered in methods sections of journal articles (van de Vijver). Method biases often affect performance on the assessment instrument as a whole (rather than affecting only components of the measure). Some of the types of method bias are described in the following.
In studies comparing two or more cultures, the samples from each culture may differ on variables related to test-relevant background characteristics. These differences may affect the comparison. Examples of such characteristics would include fluency in the language in which testing occurs, general education levels, and underlying motivational levels (van de Vijver, 2000). Imagine an essay test that is administered in a single language. Two groups are compared: Both have facility with the language, but one has substantially more ability. Regardless of the knowledge involved in the answers, it is likely that the group that is more facile in the language will provide better answers on average because of their ability to employ the language better in answering the question.
Instrument bias is much like sample bias, but the groups being tested tend to differ in less generic ways that are more specific to the testing method itself, as when a test subject has some familiarity with the general format of the testing or some other form of test sophistication. van de Vijver (2000) states that the most common forms of this bias exist when groups differ by response styles in answering questions, or by their familiarity with the materials on the test. As is described later in this chapter, attempts to develop culture-fair or culture-free intelligence tests (e.g., Bracken, Naglieri, & Baardos, 1999; Cattell, 1940; Cattell & Cattell, 1963) have often used geometric figures rather than language in the effort to avoid dependence upon language. Groups that differ in educational experience or by culture also may have differential exposure to geometric figures. This differential contact with the stimuli composing the test may bias the comparison in a manner that is difficult to disentangle from the construct of intelligence measured by the instrument.
Alternatively, different cultures vary in the tendency of their members to disclose personal issues about themselves.
When two cultures are compared, depending upon who is making the comparison, it is possible that one group could look overly self-revelatory or the other too private.
Imagine the use of a measure such as the Thematic Apperception Test (TAT) in a cross-cultural comparison. Not only do the people pictured on many of the TAT cards not look like persons from some cultures, but the scenes themselves have a decidedly Western orientation. Respondents from some cultures would obviously find such pictures more foreign and strange.
Geisinger (1994) recommended the use of enhanced test-practice exercises to attempt to reduce differences in test-format familiarity. Such exercises could be performed at the testing site or in advance of the testing, depending upon the time needed to gain familiarity.
The final type of method bias, administration bias, emerges from the interactions between the test-taker or respondent and the individual administering the test, whether the test, questionnaire, or interview is individually administered or is completed in more of a large-group situation. Such biases could come from language problems on the part of an interviewer, who may be conducting the interview in a language in which he or she is less adept than might be ideal (van de Vijver & Leung, 1997). Communications problems may result from other language problems; for example, the common mistakes individuals often make in second languages in regard to the use of the familiar second person.
Another example of administration bias may be seen in the multicultural testing literature in the United States. The theory of stereotype threat (Steele, 1997; Steele & Aronson, 1995) suggests that African Americans, when taking an individualized intellectual assessment or other similar measure that is administered by someone whom the African American test-takers believe holds negative stereotypes about them, will perform at the level expected by the test administrator. Steele's theory holds that negative stereotypes can have a powerful influence on the results of important assessments—stereotypes that can influence test-takers' performance and, ultimately, their lives. Of course, in a world where there are many cultural tensions and cultural preconceptions, it is possible that this threat may apply to groups other than Whites and Blacks in the American culture. van de Vijver (2000) concludes his chapter, however, with the statement that with notable exceptions, responses to either interviews or most cognitive tests do not seem to be strongly affected by the cultural, racial, or ethnic status of administrators.
In the late 1970s, those involved in large-scale testing, primarily psychometricians, began studying the possibility that items could be biased; that is, that the format or content of items could influence responses to individual items on tests in unintended ways. Because of the connotations of the term biased, the topic has more recently been termed differential item functioning, or DIF (e.g., Holland & Wainer, 1993). van de Vijver and his associates prefer continuing to use the term item bias, however, to accentuate the notion that these factors are measurement issues that, if not handled appropriately, may bias the results of measurement. Essentially, on a cognitive test, an item is biased for a particular group if it is more difficult for members of that group at a certain level of overall ability than it is for members of the other group who have that same level of overall ability. Items may appear biased in this way because they deal with content that is not uniformly available to members of all cultures involved. They may also be identified as biased because translations have not been adequately performed.
In addition, there may be factors such as words describing concepts that are differentially more difficult in one language than the other. A number of studies are beginning to appear in the literature describing the kinds of item problems that lead to differential levels of difficulty (e.g., Allalouf, Hambleton, & Sireci, 1999; Budgell, Raju, & Quartetti, 1995; Hulin, 1987; Tanzer, Gittler, & Sim, 1994). Some of these findings may prove very useful for future test-translation projects, and may even influence the construction of tests that are likely to be translated. For example, in an early study of Hispanics and Anglos taking the Scholastic Aptitude Test, Schmitt (1988) found that verbal items that used roots common to English and Spanish appeared to help Hispanics. Limited evidence suggested that false cognates (words that appear to have the same roots but have different meanings in the two languages) and homographs (words spelled alike in both languages but with different meanings in the two) caused difficulties for Hispanic test-takers. Allalouf et al. found that on a verbal test for college admissions that had been translated from Hebrew to Russian, analogy items presented the most problems. Most of these difficulties (items identified as differentially more difficult) emerged from word difficulty, especially in analogy items. Interestingly, many of the analogies were easier in Russian, the target language. Apparently, the translators chose easier words, thus making items less difficult. Sentence completion items had been difficult to translate because of different sentence structures in the two languages. Reading Comprehension items also led to some problems, mostly related to the specific content of the reading passages in question. In some cases, the content was seen as culturally specific. Allalouf et al. also concluded that differences in item difficulty can emerge from differences in wording, differences in content, differences in format, or differences in cultural relevance.
If the responses to questions on a test are not objective in format, differences in scoring rubrics can also lead to item bias. Budgell et al. (1995) used an expert committee to review the results of statistical analyses that identified items as biased; in many cases, the committee could not provide logic as to why an item translated from English to French was differentially more difficult for one group or the other.
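Statistical analyses that flag items as biased rest on the idea of comparing groups only after matching them on overall ability. One way to make that concrete is the Mantel-Haenszel procedure, which this section does not name and which is used here purely as an illustration. The sketch below assumes that total test score serves as the ability-matching variable, that the groups are labeled "ref" and "focal", and that the response counts are invented for the example; the function and helper names are hypothetical.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(records):
    """Estimate Mantel-Haenszel DIF for a single item.

    records: iterable of (group, total_score, item_correct), where group is
    "ref" or "focal" (assumed labels) and item_correct is 0 or 1.
    Returns (alpha, delta): the common odds ratio and its value on the
    delta scale, -2.35 * ln(alpha).
    """
    # One 2x2 table per score level: [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n  # ref right * focal wrong
        den += b * c / n  # ref wrong * focal right
    if num == 0 or den == 0:
        raise ValueError("odds ratio is undefined for these data")

    alpha = num / den
    return alpha, -2.35 * math.log(alpha)


def expand(group, score, n_correct, n_incorrect):
    """Hypothetical helper: turn counts into individual response records."""
    return [(group, score, 1)] * n_correct + [(group, score, 0)] * n_incorrect


# Invented counts for one item at two total-score levels.
records = (
    expand("ref", 1, 10, 10) + expand("focal", 1, 5, 15)
    + expand("ref", 2, 16, 4) + expand("focal", 2, 12, 8)
)
alpha, delta = mantel_haenszel_dif(records)
print(f"alpha = {alpha:.2f}, delta = {delta:.2f}")
```

An odds ratio above 1 (a negative delta) means that, among test-takers with the same total score, the item was harder for the focal group. A flag of this kind says nothing about why the item behaves differently, which is the question an expert committee such as the one Budgell et al. convened is asked to answer.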
Item bias has been studied primarily in cognitive measures: ability and achievement tests. van de Vijver (2000) correctly notes that measures such as the Rorschach should also be evaluated for item bias. It is possible that members of different cultures could differentially interpret the cards on measures such as the Rorschach or the TAT.