Methods Of Evaluating Test Equivalence

Once a test has been adapted into a target language, it is necessary to establish that the test has the kinds of equivalence that are needed for proper test interpretation and use. Methodologists and psychometricians have worked for several decades on this concern, and a number of research designs and statistical methods are available to provide data for this analysis, which ultimately informs the test-development team's judgment regarding test equivalence. Such research is essential for tests that are to be used with clients in settings that differ from that in which the test was originally developed and validated.

Methods to Establish Equivalence of Scores

Historically, a number of statistical methods have been used to establish the equivalence of scores emerging from a translated test. Four techniques are noted in this section: exploratory factor analysis, structural equation modeling (including confirmatory factor analysis), regression analysis, and item-response theory. Cook, Schmitt, and Brown (1999) provide a far more detailed description of these techniques. Individual items that are translated or adapted from one language to another should also be subjected to item bias, or differential item functioning (DIF), analyses. Holland and Wainer (1993) have provided an excellent resource on DIF techniques, and van de Vijver and Leung (1997) devote the better part of an outstanding chapter (pp. 62-88) specifically to the use of item bias techniques. Allalouf et al. (1999) and Budgell et al. (1995) are other fine examples of this methodology in the literature.

Exploratory, Replicatory Factor Analysis

Many psychological tests, especially personality measures, have been subjected to factor analysis, a technique that has often been used in psychology in an exploratory fashion to identify dimensions or consistencies among the items composing a measure (Anastasi & Urbina, 1997). To establish that the internal relationships of items or test components hold across different language versions of a test, a factor analysis of the translated version is performed. A factor analysis normally begins with the correlation matrix of all the items composing the measure. The factor analysis looks for patterns of consistency or factors among the items. There are many forms of factor analysis (e.g., Gorsuch, 1983) and techniques differ in many conceptual ways. Among the important decisions made in any factor analysis are determining the number of factors, deciding whether these factors are permitted to be correlated (oblique) or forced to be uncorrelated (orthogonal), and interpreting the resultant factors. A component of the factor analysis is called rotation, whereby the dimensions are changed mathematically to increase interpretability. The exploratory factor analysis that bears upon the construct equivalence of two measures has been called replicatory factor analysis (RFA; Ben-Porath, 1990) and is a form of cross-validation. In this instance, the number of factors and whether the factors are orthogonal or oblique are constrained to yield the same number of factors as in the original test. In addition, a rotation of the factors is made to attempt to maximally replicate the original solution; this technique is called target rotation. Once these procedures have been performed, the analysts can estimate how similar the factors are across solutions. Van de Vijver and Leung (1997) provide indices that may be used for this judgment (e.g., the coefficient of proportionality).
Although RFA has probably been the most used technique for estimating congruence (van de Vijver & Leung, 1997), it does suffer from a number of problems. One of these is simply that newer techniques, especially confirmatory factor analysis, can now perform a similar analysis while also testing whether the similarity is statistically significant through hypothesis testing. A second problem is that different researchers have not employed standard procedures and do not always rotate their factors to a target solution (van de Vijver & Leung). Finally, many studies do not compute indices of factor similarity across the two solutions and make this discernment only judgmentally (van de Vijver & Leung). Nevertheless, a number of outstanding researchers (e.g., Ben-Porath, 1990; Butcher, 1996) have recommended the use of RFA to establish equivalence and this technique has been widely used, especially in validation efforts for various adaptations of the frequently translated MMPI and the Eysenck Personality Questionnaire.
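The target rotation and similarity index described above can be illustrated with a minimal sketch. It assumes an orthogonal Procrustes rotation as the target rotation and Tucker's congruence coefficient as the coefficient of proportionality; the function names and the loading matrices are hypothetical and serve only as an illustration, not as the procedure used in any cited study.

```python
import numpy as np

def target_rotate(loadings, target):
    """Orthogonally rotate `loadings` as close as possible to `target`."""
    # The SVD of the cross-product matrix yields the least-squares rotation.
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return loadings @ (u @ vt)

def congruence(a, b):
    """Tucker's congruence coefficient for each pair of matching factors."""
    numerator = np.sum(a * b, axis=0)
    denominator = np.sqrt(np.sum(a ** 2, axis=0) * np.sum(b ** 2, axis=0))
    return numerator / denominator

# Hypothetical loadings: 6 items on 2 factors in the original-language version.
original = np.array([[0.8, 0.1], [0.7, 0.0], [0.6, 0.2],
                     [0.1, 0.7], [0.0, 0.8], [0.2, 0.6]])

# Hypothetical adapted-version loadings: same structure, arbitrarily rotated.
angle = np.deg2rad(30)
spin = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle), np.cos(angle)]])
adapted = original @ spin

rotated = target_rotate(adapted, original)
print("before rotation:", congruence(adapted, original))
print("after rotation: ", congruence(rotated, original))
```

Because the hypothetical adapted loadings are an exact rotation of the original ones, the coefficients return to 1 after target rotation; with real data they would fall below 1, and the analyst would still have to judge whether they are high enough.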


Regression

Regression approaches are generally used to establish the relationships between the newly translated measure and measures with which it has traditionally correlated in the original culture. The new test can be correlated statistically with other measures, and the resulting correlation coefficients may be compared statistically with the corresponding coefficients found in the original population. There may be one or more such correlated variables. When there is more than one independent variable, the technique is called multiple regression. In this case, the adapted test serves as the dependent variable, and the other measures as the independent variables. When multiple regression is used, the independent variables are used to predict the adapted test scores. Multiple regression weights the independent variables mathematically to optimally predict the dependent variable. The regression equation for the original test in the original culture may be compared with that for the adapted test; where there are differences between the two regression lines, whether in the slope or the intercept, or in some other manner, bias in the testing is often presumed.

If the scoring of the original- and target-language measures is the same, it is also possible to include cultural group membership in a multiple regression equation. Such a nominal variable is added as what has been called a dummy-coded variable. In such an instance, if the dummy-coded variable is assigned a weighting as part of the multiple regression equation, indicating that it predicts test scores, evidence of cultural differences across either the two measures or the two cultures may be presumed (van de Vijver & Leung, 1997).
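The dummy-coding approach can be sketched as an ordinary least-squares fit. The sketch below assumes a single predictor, a 0/1 group code, and an added group-by-predictor interaction term so that slope differences as well as intercept differences become visible; the interaction term, the function name, and the noise-free data are illustrative assumptions, and a real analysis would also test each weight for statistical significance.

```python
import numpy as np

def fit_with_group_dummy(predictor, test_score, group):
    """Least-squares fit of test scores on a predictor, a 0/1 group dummy,
    and their interaction; returns the four weights by name."""
    x = np.asarray(predictor, dtype=float)
    y = np.asarray(test_score, dtype=float)
    g = np.asarray(group, dtype=float)
    design = np.column_stack([np.ones_like(x), x, g, x * g])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    names = ["intercept", "slope", "group_intercept_shift", "group_slope_shift"]
    return dict(zip(names, weights))

# Hypothetical data: 0 = original-language group, 1 = adapted-language group.
predictor = np.linspace(0.0, 10.0, 20)
group = np.tile([0, 1], 10)
# The adapted group scores 0.5 points higher at every predictor level.
test_score = 1.0 + 2.0 * predictor + 0.5 * group

weights = fit_with_group_dummy(predictor, test_score, group)
for name, value in weights.items():
    print(f"{name}: {value:.3f}")
```

A nonzero weight on the dummy-coded variable signals an intercept difference between the groups, and a nonzero interaction weight signals a slope difference.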

Structural Equation Modeling, Including Confirmatory Factor Analysis

Structural equation modeling (SEM; Byrne, 1994; Loehlin, 1992) is a more general and statistically sophisticated procedure that encompasses both factor analysis and regression analysis, and does so in a manner that permits elegant hypothesis testing. When SEM is used to perform factor analysis, it is typically called a confirmatory factor analysis, which is defined by van de Vijver and Leung (1997) as "an extension of classical exploratory factor analysis. Specific to confirmatory factor analysis is the testing of a priori specified hypotheses about the underlying structure, such as the number of factors, loadings of variables on factors, and factor correlations" (p. 99). Essentially, the results of factor-analytic studies of the measure in the original language are imposed as constraints on the adapted measure, data from the adapted measure are analyzed, and a statistical goodness-of-fit test is performed.

Regression approaches to relationships among a number of tests can also be studied with SEM. Elaborate models of relationships among other tests, measuring variables hypothesized and found through previous research to be related to the construct measured by the adapted test, also may be tested using SEM. In such an analysis, it is possible for a researcher to approximate the kind of nomological net conceptualized by Cronbach and Meehl (1955), and test whether the structure holds in the target culture as it does in the original culture. Such a test should be the ideal to be sought in establishing the construct equivalence of tests across languages and cultures.

Item-Response Theory

Item-response theory (IRT) is an alternative to classical psychometric true-score theory as a method for analyzing test data. Allen and Walsh (2000) and van de Vijver and Leung (1997) provide descriptions of the way that IRT may be used to compare items across two forms of a measure that differ by language. Although a detailed description of IRT is beyond the scope of this chapter, the briefest of explanations may provide a conceptual understanding of how the procedure is used, especially for cognitive tests. An item characteristic curve (ICC) is computed for each item. This curve has as the x axis the overall ability level of test takers, and as the y axis, the probability of answering the question correctly. Different IRT models have different numbers of parameters, with one-, two-, and three-parameter models most common. These parameters correspond to difficulty, discrimination, and the ability to get the answer correct by chance, respectively. The ICCs are plotted as normal ogive curves. When a test is adapted, each translated item may be compared across languages graphically by overlaying the two ICCs as well as by comparing the item parameters mathematically. If there are differences, these may be considered conceptually. This method, too, may be considered as one technique for identifying item bias.
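The comparison of two ICCs can be sketched numerically. The sketch below assumes the three-parameter normal ogive form, with discrimination a, difficulty b, and chance-level parameter c; the function names, the ability grid, and the item parameters are hypothetical, and the largest vertical gap between the curves is only one simple way to summarize their difference.

```python
import math
import numpy as np

def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct answer at ability `theta` under a
    three-parameter normal ogive model."""
    z = a * (np.asarray(theta, dtype=float) - b)
    # Standard normal cumulative distribution, written with the error function.
    normal_cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    return c + (1.0 - c) * normal_cdf

def max_icc_gap(params_original, params_adapted, grid=None):
    """Largest vertical distance between two ICCs over an ability grid."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 161)
    return float(np.max(np.abs(icc(grid, *params_original)
                               - icc(grid, *params_adapted))))

# Hypothetical (a, b, c) estimates for one item in each language version.
item_original = (1.2, 0.0, 0.2)
item_adapted = (1.2, 0.6, 0.2)   # harder in the adapted version

print("largest gap between the two curves:",
      round(max_icc_gap(item_original, item_adapted), 3))
```

Setting c to 0 gives the two-parameter curve, and additionally fixing a at a common value gives the one-parameter curve.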

Methods to Establish Linkage of Scores

Once the conceptual equivalence of an adapted measure has been met, researchers and test developers often wish to provide measurement-unit and metric equivalence, as well. For most measures, this requirement is met through the process of test equating. As noted throughout this chapter, merely translating a test from one language to another, even if cultural biases have been eliminated, does not ensure that the two different-language forms of a measure are equivalent. Conceptual or construct equivalence needs to be established first. Once such a step has been taken, then one can consider higher levels of equivalence. The mathematics of equating may be found in a variety of sources (e.g., Holland & Rubin, 1982; Kolen & Brennan, 1995), and Cook et al. (1999) provide an excellent integration of research designs and analysis for test adaptation; research designs for such studies are abstracted in the following paragraphs.

Sireci (1997) clarified three experimental designs that can be used to equate adapted forms to their original-language scoring systems and, perhaps, norms. He refers to them as (a) the separate-monolingual-groups design, (b) the bilingual-group design, and (c) the matched-monolingual-groups design. A brief description of each follows.

Separate-Monolingual-Groups Design

In the separate-monolingual-groups design, two different groups of test takers are involved, one from each language or cultural group. Although some items may simply be assumed to be equivalent across both tests, data can be used to support this assumption. These items serve as what is known in equating as anchor items. IRT methods are then generally used to calibrate the two tests to a common scale, most typically the one used by the original-language test (Angoff & Cook, 1988; O'Brien, 1992; Sireci, 1997). Translated items must then be evaluated for invariance across the two different-language test forms; that is, they are assessed to determine whether their difficulty differs across forms. This design does not work effectively if the two groups actually differ, on average, on the characteristic that is assessed (Sireci); in fact, in such a situation, one cannot disentangle differences in the ability measured from differences in the two measures. The method also assumes that the construct measured is based on a single, unidimensional factor. Measures of complex constructs, then, are not good prospects for this method.
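One way to place two separately calibrated forms on a common scale with anchor items is the mean-sigma linking method, sketched below. It is offered as one common linking technique, not necessarily the one used in the studies cited; the function names and the difficulty estimates are hypothetical.

```python
import numpy as np

def mean_sigma_link(anchor_b_original, anchor_b_adapted):
    """Return slope A and intercept B such that A * b_adapted + B is on the
    original-language scale, estimated from anchor-item difficulties."""
    b_orig = np.asarray(anchor_b_original, dtype=float)
    b_adap = np.asarray(anchor_b_adapted, dtype=float)
    slope = np.std(b_orig) / np.std(b_adap)
    intercept = np.mean(b_orig) - slope * np.mean(b_adap)
    return slope, intercept

def rescale_item(a, b, slope, intercept):
    """Move an adapted-form item's discrimination and difficulty onto the
    original-language scale."""
    return a / slope, slope * b + intercept

# Hypothetical anchor-item difficulties from two separate calibrations.
anchors_original = np.array([-1.0, 0.0, 1.0, 2.0])
anchors_adapted = (anchors_original - 0.5) / 2.0

slope, intercept = mean_sigma_link(anchors_original, anchors_adapted)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))

# A hypothetical non-anchor translated item, expressed on the original scale.
print(rescale_item(a=2.4, b=0.25, slope=slope, intercept=intercept))
```

After rescaling, a translated item whose difficulty still differs markedly from that of its original-language counterpart is a candidate for the invariance check described above.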

Bilingual-Group Design

In the bilingual-group design, a single group of bilingual individuals takes both forms of the test in counterbalanced order. An assumption of this method is that the individuals in the group are all equally bilingual, that is, equally proficient in each language. In Maldonado and Geisinger (in press), all participants first were tested in both Spanish and English competence to gain entry into the study. Even under such restrictive circumstances, however, a ceiling effect made a true assessment of equality impossible. The problem of finding equally bilingual test takers is almost insurmountable. Also, if knowledge of what is on the test in one language affects performance on the other test, it is possible to use two randomly assigned groups of bilingual individuals (where their level of language skill is equated via randomization). In such an instance, it is possible either to give each group one of the tests or to give each group one-half of the items (counterbalanced) from each test in a nonoverlapping manner (Sireci, 1997). Finally, one must question how representative the equally bilingual individuals are of the target population; thus the external validity of the sample may be questioned.

Matched-Monolingual-Groups Design

This design is conceptually similar to the separate-monolingual-groups design, except that in this case the study participants are matched on the basis of some variable expected to correlate highly with the construct measured. By being matched in this way, the two groups are made more equal, which reduces error. "There are not many examples of the matched monolingual group linking design, probably due to the obvious problem of finding relevant and available matching criteria" (Sireci, 1997, p. 17). The design is nevertheless an extremely powerful one.
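The matching step can be sketched as a greedy one-to-one pairing on a single matching variable. The caliper (the largest difference allowed within a pair), the function name, and the scores are assumptions made for illustration; the source does not prescribe a particular matching procedure.

```python
def match_pairs(scores_a, scores_b, caliper):
    """Pair each member of group A with the closest unused member of group B,
    keeping only pairs whose matching scores differ by at most `caliper`."""
    available = dict(enumerate(scores_b))
    pairs = []
    for index_a, score_a in enumerate(scores_a):
        if not available:
            break
        # Closest remaining member of group B on the matching variable.
        index_b = min(available, key=lambda k: abs(available[k] - score_a))
        if abs(available[index_b] - score_a) <= caliper:
            pairs.append((index_a, index_b))
            del available[index_b]
    return pairs

# Hypothetical matching-variable scores for the two monolingual groups.
original_group = [10, 20, 30]
adapted_group = [31, 11, 50]

print(match_pairs(original_group, adapted_group, caliper=2))
```

Participants left unmatched are dropped from the linking study, which illustrates the practical cost of the design: a strict caliper yields closer groups but a smaller sample.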
