The very nature of cross-cultural psychology places a heavy emphasis upon assessment. In particular, measures that are used to make comparisons across cultural groups need to measure the characteristic unvaryingly in two or more cultural groups. Of course, in some settings, this procedure may be rather simple; a comparison of British and American participants with regard to a variable such as depression or intelligence may not produce unusual concerns. The language, English, is, of course, the same for both groups. Minor adjustments in the spelling of words (e.g., behavioral becomes behavioural) would first be needed. Some more careful editing of the items composing scales would also be needed, however, to ensure that none of the items include content that has differing cultural connotations in the two countries. A question about baseball, for example, could affect resultant comparisons. These examples are provided simply to present the nature of the issue. Cross-cultural psychologists have focused upon the nature of equivalence and, in particular, have established qualitative levels of equivalence.
Many writers have considered the notion of equivalence in cross-cultural testing. Lonner (1979) is acknowledged often for systematizing our conception of equivalence in testing in cross-cultural psychology. He described four kinds of equivalence: linguistic equivalence, conceptual equivalence, functional equivalence, and metric equivalence (Nichols, Padilla, & Gomez-Maqueo, 2000). Brislin (1993) provided a similar nomenclature with three levels of equivalence: translation, conceptual, and metric, leaving out functional equivalence, an important kind of equivalence, as noted by Berry (1980), Butcher and Han (1998), and Helms (1992). van de Vijver and Leung (1997) operationalized four hierarchical levels of equivalence as well, encompassing construct inequivalence, construct equivalence, measurement unit equivalence, and scalar or full-score comparability. It should be noted, however, that like the concepts of test reliability and validity, equivalence is not a property resident in a particular test or assessment device (van de Vijver & Leung). Rather, the construct is tied to a particular instrument and the cultures involved. Equivalence is also time dependent, given the changes that may occur in cultures. Lonner's approach, which would appear to be most highly accepted in the literature, is described in the next section, followed by an attempt to integrate some other approaches to equivalence.
When a cross-cultural study involves two or more settings in which different languages are employed for communication, the quality and fidelity of the translation of tests, testing materials, questionnaires, interview questions, open-ended responses from test-takers, and the like are critical to the validity of the study. Differences in the wording of questions on a test, for example, can have a significant impact on both the validity of research results and the applicability of a measure in a practice setting. If items include idioms from the home language in the original form, the translation of those idioms is unlikely to convey the same meaning in the target language. The translation of tests from host language to target language has been a topic of major concern to cross-cultural psychologists and psychometricians involved in this work. A discussion of issues and approaches to the translation of testing materials appears later in this chapter.
Most of the emphasis on linguistic equivalence has concerned the translation of tests and testing materials. Moreland (1996) called attention to the translation of test-taker responses as well as of testing materials. Using objective personality inventories, for example, in a new culture when they were developed in another requires substantial revisions in terms of language and cultural differences. It is relatively easy to use a projective device such as the Rorschach inkblots in a variety of languages. That is because such measures normally consist of nonverbal stimuli, which upon first glance do not need translating in terms of language. However, pictures and stimuli that are often found in such measures may need to be changed to be consistent with the culture in which they are to be used. (The images of stereotypic people may differ across cultures, as may other aspects of the stimuli that appear in the projective techniques.) Furthermore, it is critical in such a case that the scoring systems, including rubrics when available, be carefully translated as well. The same processes that are used to ensure that test items are acceptable in both languages must be followed if the responses are to be evaluated in equivalent manners.
The question asked in regard to conceptual equivalence may be seen as whether the test measures the same construct in both (or all) cultures, or whether the construct underlying the measure has the same meaning in all languages (Allen & Walsh, 2000). Conceptual equivalence therefore relates to test validity, especially construct validity.
Cronbach and Meehl (1955) established the conceptual structure for construct validity with the model of a nomological network. The nomological network concept is based upon our understanding of psychological constructs (hypothetical psychological variables, characteristics, or traits) through their relationships with other such variables. What psychologists understand about in vivo constructs emerges from how those constructs relate empirically to other constructs. In naturalistic settings, psychologists tend to measure two or more constructs for all participants in the investigation and to correlate scores among variables. Over time and multiple studies, evidence is amassed so that the relationships among variables appear known. From their relationships, the structure of these constructs becomes known and a nomological network can be imagined and charted; variables that tend to be highly related are closely aligned in the nomological network and those that are not related have no connective structure between them. The construct validity of a particular test, then, is the extent to which it appears to measure the theoretical construct or trait that it is intended to measure. This construct validity is assessed by determining the extent to which the test correlates with variables in the patterns predicted by the nomological network. When the test correlates with other variables with which it is expected to correlate, evidence of construct validation, called convergent validation, is found (Campbell & Fiske, 1959; Geisinger, 1992). Conversely, when a test does not correlate with a measure with which the theory of the psychological construct suggests it should not correlate, positive evidence of construct validation, called discriminant validation (Campbell & Fiske, 1959), is also found.
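The logic of convergent and discriminant evidence can be sketched with a small computation. The example below is only an illustration: the scores for six examinees are hypothetical, and the names `new_test`, `related_measure`, and `unrelated_measure` are assumptions introduced here rather than measures discussed in the chapter. It computes a Pearson correlation between a test and a measure that theory says should be related to it, and between the test and a measure that theory says should not.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    return cross / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))

# Hypothetical scores for six examinees (illustration only, not real data).
new_test = [10, 12, 14, 16, 18, 20]
related_measure = [11, 13, 13, 17, 18, 21]   # theory predicts a strong relation
unrelated_measure = [7, 4, 8, 8, 4, 7]       # theory predicts no relation

convergent_r = pearson_r(new_test, related_measure)
discriminant_r = pearson_r(new_test, unrelated_measure)
print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
```

A high first coefficient together with a near-zero second one is the pattern that would count as convergent and discriminant evidence, respectively. In practice such evidence accumulates over many samples and studies, not from a single small data set.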
Consider the following simple example. Intelligence and school performance are constructs measured, respectively, by the Wechsler Intelligence Scale for Children-III (WISC-III) and grade point average (GPA), in this instance in the fourth grade. Numerous investigations in the United States provide data showing the two constructs to correlate moderately. The WISC-III is translated into French and a similar study is performed with fourth-graders in schools in Quebec, where a GPA measure similar to that in U.S. schools is available. If the correlation is similar between the two variables (intelligence and school performance), then some degree of conceptual equivalence between the English and French versions of the WISC-III is demonstrated. If a comparable result is not found, however, it is unclear whether (a) the WISC-III was not properly translated and adapted to French; (b) the GPA in the Quebec study is different somehow from that in the American studies; (c) one or both of the measured constructs (intelligence and school performance) do not exist in the same fashion in Quebec as they do in the United States; or (d) the constructs simply do not relate to each other the way they do in the United States. Additional research would be needed to establish the truth in this situation. This illustration is also an example of what van de Vijver and Leung (1997) have termed construct inequivalence, which occurs when an assessment instrument measures different constructs in different languages. No etic comparisons can be made in such a situation, because the comparison would be a classic apples-and-oranges contrast.
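One common way to ask whether two such correlations are "similar" is Fisher's r-to-z test for correlations from independent samples. The chapter does not name this test, so the sketch below is offered as one possible approach, not as the procedure the author has in mind. The function name and the correlation and sample-size values are hypothetical and do not come from any actual WISC-III study.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two correlations from independent samples.

    Returns the z statistic and a two-sided p value.
    """
    # Transform each correlation to Fisher's z (the inverse hyperbolic tangent).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Standard error of the difference between the two transformed values.
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# Hypothetical values: a U.S. sample and a Quebec sample (not real results).
z, p = compare_independent_correlations(r1=0.50, n1=300, r2=0.45, n2=250)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Two cautions apply. A nonsignificant difference does not by itself establish equivalence, and a significant difference does not indicate which of explanations (a) through (d) is responsible.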
Ultimately and theoretically, conceptual equivalence is achieved when a test that has considerable evidence of construct validity in the original or host language and culture is adapted for use in a second language and culture, and the target-language nomological network is identical to the original one. When such a nomological network has been replicated, it might be said that the construct validity of the test generalizes from the original test and culture to the target one. Factor analysis has long been used as a technique of choice for this equivalence evaluation (e.g., Ben-Porath, 1990). Techniques such as structural equation modeling are even more useful for such analyses (e.g., Byrne, 1989, 1994; Loehlin, 1992), in which the statistical model representing the nomological network in the host culture can be applied and tested in the target culture. Additional information on these approaches is provided later in this chapter. (Note that, throughout this chapter, the terms conceptual equivalence and construct equivalence are used synonymously.)
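When factor analysis is used for this purpose, the loadings obtained in the host and target cultures must be compared in some way. One widely used index is Tucker's congruence coefficient; it is not named in this chapter, so the sketch below is an illustrative assumption about how such a comparison might be made. The loadings are hypothetical values for a single factor measured by four items.

```python
import math

def tucker_congruence(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two vectors of factor loadings.

    Values near 1 indicate that the factor has a similar loading pattern
    in the two groups; the index ranges from -1 to 1.
    """
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    return cross / norm

# Hypothetical loadings of four items on one factor (illustration only).
host_loadings = [0.70, 0.65, 0.60, 0.55]
target_loadings = [0.68, 0.60, 0.62, 0.20]   # the fourth item loads weakly here

phi = tucker_congruence(host_loadings, target_loadings)
print(f"congruence = {phi:.3f}")
```

A coefficient of this kind summarizes a whole factor, so a single poorly functioning item (the fourth one in this example) can be masked by the others. Multigroup structural equation models test the same question more formally.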
Functional equivalence is achieved when the domain of behaviors sampled on a test has the same purpose and meaning in both cultures in question. "For example, in the United States the handshake is functionally equivalent to the head bow with hands held together in India" (Nichols et al., 2000, p. 260). When applied to testing issues, functional equivalence is generally dealt with during the translation phase. The individuals who translate the test must actually perform a more difficult task than a simple translation. They frequently must adapt questions as well. That is, direct, literal translation of questions may not convey meaning because the behaviors mentioned in some or all of the items might not generalize across cultures. Therefore, those involved in adapting the original test to a new language and culture must remove or change those items that deal with behavior that does not generalize equivalently in the target culture. When translators find functionally equivalent behaviors to use to replace those that do not generalize across cultures, they are adapting, rather than translating, the test; for this reason, it is preferable to state that the test is adapted rather than translated (Geisinger, 1994; Hambleton, 1994). Some researchers appear to believe that functional equivalence has been subsumed by conceptual equivalence (e.g., Brislin, 1993; Moreland, 1996).
Nichols et al. (2000) have defined metric equivalence as "the extent to which the instrument manifests similar psychometric properties (distributions, ranges, etc.) across cultures" (p. 256). According to Moreland (1996), the standards for meeting metric equivalence are higher than reported by Nichols et al. First, metric equivalence presumes conceptual equivalence. The measures must quantify the same variable in the same way across the cultures. Specifically, scores on the scale must convey the same meaning, regardless of which form was administered. There are some confusing elements to this concept. On one hand, such a standard does not require that the arithmetic means of the tests be the same in both cultures (Clark, 1987), but does require that individual scores be indicative of the same psychological meaning. Thus, it is implied that scores must be criterion referenced. An individual with a given score on a measure of psychopathology would need treatment, regardless of which language version of the form was taken. Similarly, an individual with the same low score on an intelligence measure should require special education, whether that score is at the 5th or 15th percentile of his or her cultural population. Some of the statistical techniques for establishing metric equivalence are described later in this chapter.
Part of metric equivalence is the establishment of comparable reliability and validity across cultures. Sundberg and Gonzales (1981) have reported that the reliability of measures is unlikely to be influenced by translation and adaptation. This writer finds such a generalization difficult to accept as a conclusion. If scores in one culture cluster higher or lower and so span a narrower range, for example, reliability as traditionally defined will be reduced in that testing. The quality of the test adaptation, too, would have an impact. Moreland (1996) suggests that investigators would be wise to ascertain test stability (i.e., test-retest reliability) and internal consistency in any adapted measure, and this writer concurs with this recommendation.
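The dependence of reliability on the spread of scores can be illustrated with a short simulation of internal consistency. This is a sketch under stated assumptions, not an analysis from the chapter: the data are simulated, the helper `simulate_group` and its parameters are hypothetical, and Cronbach's alpha is used as one conventional internal-consistency coefficient. Both simulated groups answer the same five items with the same amount of measurement error; they differ only in how widely the underlying trait varies.

```python
import random
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of persons, each a list of k item scores."""
    k = len(item_scores[0])
    item_variances = [
        statistics.variance([person[i] for person in item_scores]) for i in range(k)
    ]
    total_variance = statistics.variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def simulate_group(true_score_sd, n_persons=500, n_items=5, seed=0):
    """Simulate item scores as a person's true score plus unit-variance error."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        true_score = rng.gauss(0.0, true_score_sd)
        data.append([true_score + rng.gauss(0.0, 1.0) for _ in range(n_items)])
    return data

# Same items and same error variance; only the spread of true scores differs.
alpha_wide = cronbach_alpha(simulate_group(true_score_sd=1.0, seed=1))
alpha_narrow = cronbach_alpha(simulate_group(true_score_sd=0.5, seed=2))
print(f"alpha (wider range) = {alpha_wide:.2f}, alpha (narrower range) = {alpha_narrow:.2f}")
```

The group with the narrower range of true scores yields the lower coefficient even though the items themselves are unchanged, which is why reliability should be estimated separately for each adapted version and each population.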
Geisinger (1992) has considered the validation of tests for two populations or in two languages. There are many ways to establish validity evidence: content-related approaches, criterion-related approaches, and construct-related approaches. Construct approaches were already discussed with respect to conceptual equivalence.
In regard to establishing content validity, the adequacy of sampling of the domain is critical. To establish comparable content validity in two forms of a measure, each in a different language for a different cultural group, one must gauge first whether the content domain is the same or different in each case. In addition, one must establish that the domain is sampled with equivalent representativeness in all cases. Both of these determinations may be problematic. For example, imagine the translation of a fourth-grade mathematics test that, in its original country, is given to students who attend school 10 months of the year, whereas in the target country, the students attend for only 8 months. In such an instance, the two domains of mathematics taught in the fourth year of schooling are likely to be overlapping, but not identical. Because the students from the original country have already attended three longer school years prior to this year, they are likely to begin at a more advanced level. Furthermore, given that the year is longer, they are likely to cover more material during the academic year. In short, the domains are not likely to be identical. Finally, the representativeness of the domains must be considered.