The present-day conflict over bias in standardized tests is motivated largely by public concerns. The impetus, it may be argued, lies with beliefs fundamental to democracy in the United States. Most Americans, at least those of majority ethnicity, view the United States as a land of opportunity: increasingly, equal opportunity that is extended to every person. We want to believe that any child can grow up to be president. Concomitantly, we believe that everyone is created equal, that all people harbor the potential for success and achievement. This equality of opportunity seems most reasonable if everyone is equally able to take advantage of it.
Parents and educational professionals have corresponding beliefs: The children we serve have an immense potential for success and achievement; the great effort we devote to teaching or raising children is effort well spent; my own child is intelligent and capable. The result is a resistance to labeling and alternative placement, which are thought to discount students' ability and diminish their opportunity. This terrain may be a bit more complex for clinicians, because certain diagnoses have consequences desired by clients. A disability diagnosis, for example, allows people to receive compensation or special services, and insurance companies require certain serious conditions for coverage.
The nature of psychological characteristics and their measurement is partly responsible for long-standing concern over test bias (Reynolds & Brown, 1984a). Psychological characteristics are internal, so scientists cannot observe or measure them directly but must infer them from a person's external behavior. By extension, clinicians must contend with the same limitation.
According to MacCorquodale and Meehl (1948), a psychological process is an intervening variable if it is treated only as a component of a system and has no properties beyond the ones that operationally define it. It is a hypothetical construct if it is thought to exist and to have properties beyond its defining ones. In biology, a gene is an example of a hypothetical construct. The gene has properties beyond its use to describe the transmission of traits from one generation to the next. Both intelligence and personality have the status of hypothetical constructs. The nature of psychological processes and other unseen hypothetical constructs is often a subject of persistent debate (see Ramsay, 1998b, for one approach). Intelligence, a highly complex psychological process, has given rise to disputes that are especially difficult to resolve (Reynolds, Willson, et al., 1999).
Test development procedures (Ramsay & Reynolds, 2000a) are essentially the same for all standardized tests. Initially, the author of a test develops or collects a large pool of items thought to measure the characteristic of interest. Theory and practical usefulness are standards commonly used to select an item pool. The selection process is a rational one. That is, it depends upon reason and judgment; rigorous means of carrying it out simply do not exist. At this stage, then, test authors have no generally accepted evidence that they have selected appropriate items.
A common second step is to discard items of suspect quality, again on rational grounds, to reduce the pool to a manageable size. Next, the test's author or publisher administers the items to a group of examinees called a tryout sample. Statistical procedures then help to identify items that seem to be measuring an unintended characteristic or more than one characteristic. The author or publisher discards or modifies these items.
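The text does not name the statistical procedures applied to the tryout sample. As one illustration only, the sketch below uses the corrected item-total correlation, a statistic often used in item analysis, to flag items that do not move with the rest of the pool. The response data, the cutoff of 0.2, and all function names are hypothetical and are not taken from any actual test.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))


def corrected_item_total(responses):
    """Correlate each item with the total score on the remaining items."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # total without item i
        result.append(pearson(item, rest))
    return result


# Hypothetical tryout data: rows are examinees, columns are items (1 = correct).
TRYOUT = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

CUTOFF = 0.2  # hypothetical threshold, chosen only for this illustration

correlations = corrected_item_total(TRYOUT)
flagged = [i for i, r in enumerate(correlations) if r < CUTOFF]
print(flagged)  # indices of items a test author might review
```

With these made-up responses, only the last item (index 3) is flagged, because examinees who do well on the other items tend to miss it. A real tryout would involve far more examinees and items, and a flag would prompt review of the item rather than automatic removal.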
Finally, examiners administer the remaining items to a large, diverse group of people called a standardization sample or norming sample. This sample should reflect every important characteristic of the population who will take the final version of the test. Statisticians compile the scores of the norming sample into an array called a norming distribution.
Eventually, clients or other examinees take the test in its final form. The scores they obtain, known as raw scores, do not yet have any interpretable meaning. A clinician compares these scores with the norming distribution. The comparison is a mathematical process that results in new, standard scores for the examinees. Clinicians can interpret these scores, whereas interpretation of the original, raw scores would be difficult and impractical (Reynolds, Lowe, et al., 1999).
Standard scores are relative. They have no meaning in themselves but derive their meaning from certain properties of the norming distribution, typically its mean and standard deviation. The norming distributions of many ability tests, for example, have a mean score of 100 and a standard deviation of 15. A client might obtain a standard score of 127. This score would be well above average, because 127 is almost 2 standard deviations above the mean of 100 (27 points, or 1.8 standard deviations of 15). Another client might obtain a standard score of 96. This score would be a little below average, because 96 is about one quarter of a standard deviation below the mean of 100.
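The arithmetic behind these two examples can be checked with a short sketch. The function name `sd_units` is an illustrative choice; the mean of 100 and standard deviation of 15 are the values given in the text.

```python
def sd_units(standard_score, mean=100.0, sd=15.0):
    """Distance of a standard score from the norming mean, in standard deviations."""
    return (standard_score - mean) / sd


for score in (127, 96):
    # Positive values lie above the mean, negative values below it.
    print(score, round(sd_units(score), 2))
```

A score of 127 comes out at 1.8 standard deviations above the mean, and a score of 96 at about 0.27 of a standard deviation below it.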
Here, the reason why raw scores have no meaning gains a little clarity. A raw score of, say, 34 is high if the mean is 30 but low if the mean is 50. It is very high if the mean is 30 and the standard deviation is 2, but less high if the mean is again 30 and the standard deviation is 15. Thus, a clinician cannot know how high or low a score is without knowing certain properties of the norming distribution. The standard score is the one that has been compared with this distribution, so that it reflects those properties (see Ramsay & Reynolds, 2000a, for a systematic description of test development).
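To make the point concrete, the sketch below places the same raw score of 34 against three norming distributions. It assumes a simple linear conversion to a scale with mean 100 and standard deviation 15; published tests may instead rely on norm tables or other transformations. The function name is illustrative, and the standard deviation of 15 for the third case is an assumption, since the text gives only its mean of 50.

```python
def to_standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Linearly rescale a raw score using the norming mean and standard deviation."""
    z = (raw - norm_mean) / norm_sd  # distance from the norming mean in SD units
    return scale_mean + scale_sd * z


RAW = 34
# (norming mean, norming SD); the SD in the last case is assumed, not from the text.
NORMS = [(30, 2), (30, 15), (50, 15)]

for norm_mean, norm_sd in NORMS:
    standard = to_standard_score(RAW, norm_mean, norm_sd)
    print(f"mean={norm_mean}, sd={norm_sd}: standard score {standard:.0f}")
```

Under these assumptions, the same raw score of 34 converts to roughly 130, 104, and 84. The raw number carries no fixed meaning until it is compared with the norming distribution.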
Charges of bias frequently spring from low proportions of minorities in the norming sample of a test and correspondingly small influence on test results. Many norming samples include only a few minority participants, eliciting suspicion that the tests produce inaccurate scores—misleadingly low ones in the case of ability tests—for minority examinees. Whether this is so is an important question that calls for scientific study (Reynolds, Lowe, et al., 1999).
Test development is a complex and elaborate process (Ramsay & Reynolds, 2000a). The public, the media, Congress, and even the intelligentsia find it difficult to understand. Clinicians, and psychologists outside the measurement field, commonly have little knowledge of the issues surrounding this process. Its abstruseness, as much as its relative nature, probably contributes to the amount of conflict over test bias. Physical and biological measurements such as height, weight, and even risk of heart disease elicit little controversy, although they vary from one ethnic group to another. As explained by Reynolds, Lowe, et al. (1999), this is true in part because such measurements are absolute, in part because they can be obtained and verified in direct and relatively simple ways, and in part because they are free from the distinctive social implications and consequences of standardized test scores. Reynolds et al. correctly suggest that test bias is a special case of the uncertainty that accompanies all measurement in science. Ramsay (2000) and Ramsay and Reynolds (2000b) present a brief treatment of this uncertainty incorporating Heisenberg's model.