Jensen (1980; Brown et al., 1999) contended that three fallacious assumptions were impeding the scientific study of test bias: (a) the egalitarian fallacy, that all groups were equal in the characteristics measured by a test, so that any score difference must result from bias; (b) the culture-bound fallacy, that reviewers can assess the culture loadings of items through casual inspection or armchair judgment; and (c) the standardization fallacy, that a test is necessarily biased when used with any group not included in large numbers in the norming sample. In Jensen's view, the mean-difference-as-bias approach is an example of the egalitarian fallacy.

A prior assumption of equal ability is as unwarranted scientifically as the opposite assumption. Studies have shown group differences for many abilities and even for sensory capacities (Reynolds, Willson, et al., 1999). Both equalities and inequalities must be found empirically, that is, through scientific observation. An assumption of equality, if carried out consistently, would have a stultifying effect on research. Torrance (1980) observed that disadvantaged Black children in the United States have sometimes earned higher creativity scores than many White children. This finding may be important, given that Blacks are underrepresented in classes for gifted students. The egalitarian assumption implies that these Black children's high creativity is an artifact of tests, foreclosing on more substantive interpretations—and on possible changes in student placement.

Equal ability on the part of different ethnic groups is not, in itself, a fallacy. A fallacy, as best understood, is an error in judgment or reasoning, but the question of equal ability is an empirical one. By contrast, an a priori assumption of either equal or unequal ability can be regarded as fallacious. The assumption of equal ability is most relevant, because it is implicit when any researcher interprets a mean difference as test bias.

The impossibility of proving a null hypothesis is relevant here. Scientists never regard a null hypothesis as proven, because the absence of a counterinstance cannot prove a rule. If 100 studies do not provide a counterinstance, the 101st study may. Likewise, the failure to reject a hypothesis of equality between groups (that is, a null hypothesis) cannot prove that the groups are equal. This hypothesis, then, is not falsifiable and is therefore problematic for researchers.

As noted above, a mean difference by itself does not show bias. One may ask, then, what (if anything) it does show. It indicates simply that two groups differ when means are taken to represent their performance. Thus, its accuracy depends on how well means, as opposed to other measures of the typical score, represent the two groups; on how well any measure of the typical score can represent the two groups; and on how well differences in typical scores, rather than in variation, asymmetry, or other properties, can represent the relationships between the two groups. Ramsay (2000) reanalyzed a study in which mean differences between groups had been found. The reanalysis showed that the two groups differed much more in variation than in typical scores.
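The point about variation can be illustrated with a brief sketch. The scores below are invented for illustration and are not Ramsay's (2000) data; the group names and the `summarize` helper are likewise hypothetical. The two groups have identical means, so a mean difference would report no difference at all, yet their standard deviations differ widely.

```python
from statistics import mean, stdev

# Hypothetical scores, invented for illustration only.
group_a = [96, 98, 99, 100, 100, 101, 102, 104]
group_b = [70, 85, 95, 100, 100, 105, 115, 130]


def summarize(scores):
    """Return the mean and sample standard deviation of a list of scores."""
    return {"mean": mean(scores), "sd": stdev(scores)}


summary_a = summarize(group_a)
summary_b = summarize(group_b)

# The difference in typical scores, as measured by means.
mean_difference = summary_a["mean"] - summary_b["mean"]

# The difference in variation, expressed as a ratio of standard deviations.
sd_ratio = summary_b["sd"] / summary_a["sd"]

print(f"Mean difference: {mean_difference}")
print(f"SD ratio (B / A): {sd_ratio:.2f}")
```

In this invented case the mean difference is zero, while the standard deviation of the second group is roughly seven times that of the first. A comparison restricted to means would miss the only way in which these groups differ.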

Most important, a mean difference provides no information as to why two groups differ: because of test bias, genetic influences, environmental factors, a gene-environment interaction, or perhaps biases in society recorded by tests. Rather than answering this question, mean differences raise it in the first place. Thus, they are a starting point—but are they a good one? Answering this question is a logical next step.

A difference between group means is easy to obtain. In addition, it permits an easy, straightforward interpretation—but a deceptive one. It provides scant information, and none at all regarding variation, kurtosis, or asymmetry. These additional properties are needed to understand any group's scores.

Moreover, a mean is often an inaccurate measure of center, and a difference between means inherits that inaccuracy. If a group's scores are highly asymmetric (that is, if the high scores taper off gradually but the low scores clump together, or vice versa), their mean is always too high or too low, pulled as it is toward the scores that taper gradually. Symmetry should never be assumed, even for standardized test scores. A test with a large, national norming sample can produce symmetric scores with that sample but asymmetric or skewed scores for particular schools, communities, or geographic regions. Results for people in these areas, if skewed, can produce an inaccurate mean and therefore an inaccurate mean difference. Even a large norming sample can include very small samples for one or more groups, producing misleading mean differences for the norming sample itself.
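A small numerical sketch may help show how asymmetry pulls a mean away from the bulk of the scores. The scores are hypothetical and were chosen so that the low scores clump together while the high scores taper off gradually; the median is used here only as one convenient alternative measure of center.

```python
from statistics import mean, median

# Hypothetical right-skewed scores: low scores clump, high scores taper off.
scores = [82, 84, 85, 85, 86, 87, 88, 90, 104, 129]

score_mean = mean(scores)
score_median = median(scores)

# Count how many people in the group scored below the group's own mean.
below_mean = sum(1 for s in scores if s < score_mean)

print(f"Mean: {score_mean}")
print(f"Median: {score_median}")
print(f"Scores below the mean: {below_mean} of {len(scores)}")
```

With these invented scores the mean is 92 and the median is 86.5, and eight of the ten scores fall below the mean. The two highest scores pull the mean upward, so it overstates the typical performance of the group.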

Finally, a mean is a point estimate: a single number that summarizes the scores of an entire group of people. A group's scores can have little skew or kurtosis but vary so widely that the mean is not typical of the highest and lowest scores. In addition to being potentially inaccurate, then, a mean can be unrepresentative of the group it purports to summarize.

Thus, means have numerous potential limitations as a way to describe groups and differences between groups. In addition to a mean, measures of shape and spread, sometimes called distribution and variation, are necessary. Researchers, including clinical researchers, may sometimes need to use different measures of center entirely: medians, modes, or modified M statistics. Most basically, we always need a thoroughgoing description of each sample. Furthermore, it is both possible and necessary to test the characteristics of each sample to assess their representativeness of the respective population characteristics. This testing can be a simple process, often using group confidence intervals.
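As a hedged illustration of the kind of simple check described above, the sketch below computes a confidence interval for a group mean and asks whether a reference value falls inside it. Several elements are assumptions made for the example: the scores are invented, the reference value of 100 is a hypothetical population mean, the function name `mean_confidence_interval` is not drawn from any source, and the interval uses a normal (z) approximation. With samples this small, an interval based on the t distribution would be more appropriate in practice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(scores, confidence=0.95):
    """Return a (low, high) confidence interval for the mean of `scores`.

    Uses a normal approximation; this is an assumption of the sketch,
    and a t-based interval is preferable for small samples.
    """
    n = len(scores)
    center = mean(scores)
    standard_error = stdev(scores) / sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


# Hypothetical sample of scores and a hypothetical population mean.
sample = [88, 92, 95, 97, 99, 100, 101, 103, 106, 109]
reference_mean = 100

low, high = mean_confidence_interval(sample)
contains_reference = low <= reference_mean <= high

print(f"95% CI for the sample mean: ({low:.2f}, {high:.2f})")
print(f"Interval contains {reference_mean}: {contains_reference}")
```

Here the interval runs from about 95.1 to 102.9 and therefore includes the reference value, so this sample's mean gives no indication of departing from the assumed population mean. The same approach could in principle be applied to measures of spread or shape, although those intervals require different formulas.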

Once we know what we have found—which characteristics vary from group to group—we can use this information to start to answer the question why. That is, we can begin to investigate causation. Multivariate techniques are often suitable for this work. Bivariate techniques address only two variables, as the name implies. Thus, they are ill suited to pursue possible causal relationships, because they cannot rule out alternative explanations posed by additional variables (Ramsay, 2000).

Alternatively, we can avoid the elusive causal question why and instead use measurement techniques developed to assess bias. Reynolds (1982a) provides copious information about these techniques. Such procedures cannot tell us if group differences result from genetic or environmental factors, but they can suggest whether test scores may be biased. Researchers have generated a literature of considerable size and sophistication using measurement techniques for examining test bias. This chapter now addresses the results of such research.
