Notes. PPVT = Peabody Picture Vocabulary Test; RPM = Raven's Progressive Matrices; SB L-M = Stanford-Binet, Form LM; WISC-R = Wechsler Intelligence Scale for Children-Revised; WPT = Wonderlic Personnel Test; Sandoval, 1979 = Medians for 10 WISC-R subtests, excluding Coding and Digit Span. aMales. bFemales. cMales and females combined.

difference between Ps for one item and the next. Jensen also obtained correlations between the rank orders of item difficulties for Blacks and Whites. Results for rank orders and P decrements, it should be noted, differ from those that would be obtained for the scores themselves.

The tests examined were RPM; the PPVT; the WISC-R; the WPT; and the Revised Stanford-Binet Intelligence Scale, Form L-M. Jensen (1974) obtained the same data for Mexican American and White students on the PPVT and RPM. Table 4.1 shows the results, with similar findings obtained by Sandoval (1979) and Miele (1979). The correlations showed little evidence of content bias in the scales examined. Most correlations appeared large. Some individual items were identified as biased, but they accounted for only 2% to 5% of the variation in score differences.

Hammill (1991) used correlations of P decrements to examine the Detroit Tests of Learning Aptitude (DTLA-3). Correlations exceeded .90 for all subtests, and most exceeded .95. Reynolds and Bigler (1994) presented correlations of P decrements for the 14 subtests of the Test of Memory and Learning (TOMAL). Correlations again exceeded .90, with most exceeding .95, for males and females and for all ethnicities studied.
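
The two indices described above can be sketched in a few lines. The P values below are hypothetical illustrations, not data from any study cited here, and the helper names are assumptions made for this example. P is taken to be the proportion of a group passing an item, and a P decrement is the difference between the Ps for one item and the next.

```python
import numpy as np

def rank_order(values):
    """Ranks 1..n by ascending value (ties are not handled in this sketch)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def difficulty_agreement(p_a, p_b):
    """Return (rank-order correlation, P-decrement correlation) for two groups."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    rank_r = pearson(rank_order(p_a), rank_order(p_b))
    # P decrement: P for one item minus P for the next item in test order
    dec_a = p_a[:-1] - p_a[1:]
    dec_b = p_b[:-1] - p_b[1:]
    return rank_r, pearson(dec_a, dec_b)

# Hypothetical proportions passing six items, listed in test order
p_group_a = [0.92, 0.85, 0.71, 0.60, 0.44, 0.30]
p_group_b = [0.88, 0.80, 0.62, 0.55, 0.35, 0.21]

rank_r, dec_r = difficulty_agreement(p_group_a, p_group_b)
print(f"rank-order r = {rank_r:.2f}, P-decrement r = {dec_r:.2f}")
```

High values on both indices mean that the items are ordered, and spaced, similarly in difficulty for the two groups, even if one group scores lower overall.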

Another procedure for detecting item bias relies on the partial correlation between an item score and a nominal variable such as ethnic group. The correlation partialed out is that between total test score and the nominal variable. If the variable and the item score are correlated after the partialed correlation is removed, the item is performing differently from group to group, which suggests bias. Reynolds, Lowe, et al. (1999) describe this technique as "the simplest and perhaps the most powerful" means of detecting item bias. They note, however, that it is a relatively recent application. Thus, it may have limitations not yet known.
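
As a minimal sketch of this procedure, the first-order partial correlation can be computed from the three pairwise correlations. The simulated data, the variable names, and the size of the simulated item bias are all assumptions made for illustration. They do not reproduce any analysis cited above.

```python
import numpy as np

def partial_correlation(item, group, total):
    """Correlation between item score and group with total score partialed out."""
    item, group, total = (np.asarray(v, dtype=float) for v in (item, group, total))
    r_ig = np.corrcoef(item, group)[0, 1]
    r_it = np.corrcoef(item, total)[0, 1]
    r_gt = np.corrcoef(group, total)[0, 1]
    return float((r_ig - r_it * r_gt) / np.sqrt((1 - r_it ** 2) * (1 - r_gt ** 2)))

# Simulated (hypothetical) examinees: group is coded 0/1
rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, size=n)
ability = rng.normal(size=n)

# Total score: sum of 20 simulated items that behave the same in both groups
total = sum((ability + rng.normal(size=n) > 0).astype(int) for _ in range(20))

# One item that behaves the same in both groups, and one made harder for group 1
fair_item = (ability + rng.normal(size=n) > 0).astype(int)
biased_item = (ability + rng.normal(size=n) - 1.5 * group > 0).astype(int)

r_fair = partial_correlation(fair_item, group, total)
r_biased = partial_correlation(biased_item, group, total)
print(f"fair item: {r_fair:.3f}, biased item: {r_biased:.3f}")
```

A partial correlation near zero suggests that the item behaves alike across groups once total score is taken into account. A value clearly different from zero flags the item for closer review.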

Research on item bias in personality measures is sparse but has produced results similar to those with ability tests (Moran, 1990; Reynolds, 1998a, 1998b; Reynolds & Harding, 1983). The few studies of behavior rating scales have produced little evidence of bias for White, Black, and Hispanic and Latin populations in the United States (James, 1995; Mayfield & Reynolds, 1998; Reynolds & Kamphaus, 1992).

Not all studies of content bias have focused on items. Researchers evaluating the WISC-R have defined bias differently. Few results are available for the WISC-III; future research should utilize data from this newer test. A recent book by Prifitera and Saklofske (1998), however, addresses the WISC-III and ethnic bias in the United States. These results are discussed later (see the "Construct Validity" and "Predictive Validity" sections).

Reynolds and Jensen (1983) examined the 12 WISC-R subtests for bias against Black children using a variation of the group by item analysis of variance (ANOVA). The researchers matched Black children to White children from the norming sample on the basis of gender and Full Scale IQ. SES was a third matching variable and was used when a child had more than one match in the other group. Matching controlled for g, so a group difference indicated that the subtest in question was more difficult for Blacks or for Whites.

Black children exceeded White children on Digit Span and Coding. Whites exceeded Blacks on Comprehension, Object Assembly, and Mazes. Blacks tended to obtain higher scores on Arithmetic and Whites on Picture Arrangement. The actual differences were very small, and variance due to ethnic group was less than 5% for each subtest. If the WISC-R is viewed as a test measuring only g, these results may be interpretable as indicating subtest bias. Alternatively, the results may indicate differences in Level II ability (Reynolds, Willson, et al., 1999) or in specific or intermediate abilities.

Taken together, studies of major ability and personality tests show no consistent evidence for content bias. When bias is found, it is small. Tests with satisfactory reliability, validity, and norming appear also to have little content bias. For numerous standardized tests, however, results are not yet available. Research with these tests should continue, investigating possible content bias with differing ethnic and other groups.

Construct Validity

Anastasi (1988) defines construct validity as the extent to which a test may be said to measure a theoretical construct or trait. Test bias in construct validity, then, may be defined as the extent to which a test measures different constructs for different groups.

Factor analysis is a widely used method for investigating construct bias (Reynolds, 2000a). This set of complex techniques groups together items or subtests that correlate highly among themselves. When a group of items correlates highly together, the researcher interprets them as reflecting a single characteristic. The researcher then examines the pattern of correlations and induces the nature of this characteristic. Table 4.2 shows a simple example.

In the table, the subtests picture identification, matrix comparison, visual search, and diagram drawing have high correlations in the column labeled "Factor 1." Definitions, antonyms, synonyms, and multiple meanings have low correlations in this column but much higher ones in the column labeled "Factor 2." A researcher might interpret these results as indicating that the first four subtests correlate with factor 1 and the second four correlate with factor 2. Examining the table, the researcher might see that the subtests correlating highly with factor 1 require visual activity, and he or she might therefore label this factor Visual Ability. The same researcher might see that the subtests correlating highly with factor 2 involve the meanings of words, and he or she might label this factor Word Meanings. To label factors in this way, researchers must be familiar with the subtests or items, common responses to them, and scoring of these responses (see also Ramsay & Reynolds, 2000a). The results in Table 4.2 are called a two-factor solution. Actual factor analysis is a set of advanced statistical techniques, and the explanation presented here is necessarily a gross oversimplification.

TABLE 4.2 A Sample Factor Structure

Subtest                   Factor 1   Factor 2
Picture Identification      .78        .17
Matrix Comparison           .82        .26
Visual Search               .86        .30
Diagram Drawing             .91        .29
Definitions                 .23        .87
Antonyms                    .07        .92
Synonyms                    .21        .88
Multiple Meanings           .36        .94

Very similar factor analytic results for two or more groups, such as genders or ethnicities, are evidence that the test responses being analyzed behave similarly as to the constructs they represent and the extent to which they represent them. As noted by Reynolds, Lowe, et al. (1999), such comparative factor analyses with multiple populations are important for the work of clinicians, who must know that a test functions very similarly from one population to another to interpret scores consistently.

Researchers most often calculate a coefficient of congruence or simply a Pearson correlation to examine factorial similarity, often called factor congruence or factor invariance. The variables correlated are one group's item or subtest correlations (shown in Table 4.2) with another's. A coefficient of congruence may be preferable, but the commonly used techniques produce very similar results, at least with large samples (Reynolds & Harding, 1983; Reynolds, Lowe, et al., 1999). Researchers frequently interpret a value of .90 or higher as indicating factor congruity. For other applicable techniques, see Reynolds (2000a).
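
Both indices are easy to compute. The sketch below uses the Factor 1 column of Table 4.2 as one group's values and a hypothetical second group's values invented for illustration. The congruence function follows the usual Tucker form, the sum of cross products divided by the square root of the product of the sums of squares. Treating that as the coefficient meant here is an assumption, since the text does not give a formula.

```python
import numpy as np

def congruence(a, b):
    """Coefficient of congruence between two vectors of factor correlations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def pearson(a, b):
    """Pearson correlation between the same two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Group A: Factor 1 column of Table 4.2
factor1_group_a = [0.78, 0.82, 0.86, 0.91, 0.23, 0.07, 0.21, 0.36]
# Group B: hypothetical values for the same eight subtests (illustration only)
factor1_group_b = [0.74, 0.85, 0.80, 0.88, 0.28, 0.12, 0.18, 0.31]

phi = congruence(factor1_group_a, factor1_group_b)
r = pearson(factor1_group_a, factor1_group_b)
print(f"congruence = {phi:.3f}, Pearson r = {r:.3f}")
```

Unlike the Pearson correlation, the coefficient of congruence does not subtract each vector's mean, so it is sensitive to the overall size of the values as well as to their pattern.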

Extensive research regarding racial and ethnic groups is available for the widely used WISC and WISC-R. This work consists largely of factor analyses. Psychometricians are trained in this method, so its usefulness in assessing bias is opportune. Unfortunately, many reports of this research fail to specify whether exploratory or confirmatory factor analysis has been used. In factor analyses of construct and other bias, exploratory techniques are most common. Results with the WISC and WISC-R generally support factor congruity. For preschool-age children also, factor analytic results support congruity for racial and ethnic groups (Reynolds, 1982a).

Reschly (1978) conducted factor analyses comparing WISC-R correlations for Blacks, Whites, Mexican Americans, and Papagos, a Native American group, all in the southwestern United States. Reschly found that the two-factor solutions were congruent for the four ethnicities. The 12 coefficients of congruence ranged from .97 to .99. For the less widely used three-factor solutions, only results for Whites and Mexican Americans were congruent. The one-factor solution showed congruence for all four ethnicities, as Miele (1979) had found with the WISC.

Oakland and Feigenbaum (1979) factor analyzed the 12 WISC-R subtests separately for random samples of normal Black, White, and Mexican American children from an urban school district in the northwestern United States. Samples were stratified by race, age, sex, and SES. The researchers used a Pearson r for each factor to compare it for the three ethnic groups. The one-factor solution produced rs of .95 for Black and White children, .97 for Mexican American and White children, and .96 for Black and Mexican American children. The remaining rs ranged from .94 to .99. Thus, WISC-R scores were congruent for the three ethnic groups.

Gutkin and Reynolds (1981) compared factor analytic results for the Black and White children in the WISC-R norming sample. Samples were stratified by age, sex, race, SES, geographic region, and community size to match 1970 U.S. Census Bureau data. The researchers compared one-, two-, and three-factor solutions using magnitudes of unique variances, proportion of total variance accounted for by common factor variance, patterns of correlations with each factor, and percentage of common factor variance accounted for by each factor. Coefficients of congruence were .99 for comparisons of the unique variances and of the three solutions examined. Thus, the factor correlations were congruent for Black and White children.

Dean (1979) compared three-factor WISC-R solutions for White and Mexican American children referred because of learning difficulties in the regular classroom. Analyzing the 10 main WISC-R subtests, Dean found these coefficients of congruence: .84 for Verbal Comprehension, .89 for Perceptual Organization, and .88 for Freedom from Distractibility.

Gutkin and Reynolds (1980) compared one-, two-, and three-factor principal-factor solutions of the WISC-R for referred White and Mexican American children. The researchers also compared their solutions to those of Reschly (1978) and to those derived from the norming sample. Coefficients of congruence were .99 for Gutkin and Reynolds's one-factor solutions and .98 and .91 for their two-factor solutions. Coefficients of congruence exceeded .90 in all comparisons of Gutkin and Reynolds's solutions to Reschly's solutions for normal Black, White, Mexican American, and Papago children and to solutions derived from the norming sample. Three-factor results were more varied but also indicated substantial congruity for these children.

DeFries et al. (1974) administered 15 ability tests to large samples of American children of Chinese or Japanese ancestry. The researchers examined correlations among the 15 tests for the two ethnic groups and concluded that the cognitive organization of the groups was virtually identical. Willerman (1979) reviewed these results and concluded, in part, that the tests were measuring the same abilities for the two groups of children.

Results with adults are available as well. Kaiser (1986) and Scholwinski (1985) have found the Wechsler Adult Intelligence Scale-Revised (WAIS-R) to be factorially congruent for Black and White adults from the norming sample. Kaiser conducted separate hierarchical analyses for Black and White participants and calculated coefficients of congruence for the General, Verbal, and Performance factors. Coefficients for the three factors were .99, .98, and .97, respectively. Scholwinski (1985) selected Black and White participants closely matched in age, sex, and Full Scale IQ from the WAIS-R norming sample. Results again indicated factorial congruence.

Researchers have also assessed construct bias by estimating internal consistency reliabilities for different groups. Internal consistency reliability is the extent to which all items of a test are measuring the same construct. A test is unbiased with regard to this characteristic to the extent that its reliabilities are similar from group to group.

Jensen (1977) used Kuder-Richardson formula 21 to estimate internal consistency reliability for Black and White adults on the Wonderlic Personnel Test. Reliability estimates were .86 and .88 for Blacks and Whites, respectively. In addition, Jensen (1974) used Hoyt's formula to obtain internal consistency estimates of .96 on the PPVT for Black, White, and Mexican American children. The researcher then subdivided each group of children by gender and obtained reliabilities of .95-.97. Raven's colored matrices produced internal consistency reliabilities of .86-.91 for the same six race-gender groupings. For these three widely used aptitude tests, Jensen's (1974, 1976) results indicated homogeneity of test content and consistency of measurement by gender and ethnicity.
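
For readers unfamiliar with Kuder-Richardson formula 21, the sketch below shows how the estimate is obtained from only the number of items, the mean total score, and the variance of total scores. The numbers in the example are hypothetical and are not Jensen's data. Whether a sample or population variance is supplied is left to the caller, which is an assumption of this sketch.

```python
def kr21(n_items, mean_score, score_variance):
    """Kuder-Richardson formula 21 estimate of internal consistency reliability.

    n_items: number of dichotomously scored items (k)
    mean_score: mean total score (M)
    score_variance: variance of total scores (s^2)
    """
    if n_items < 2:
        raise ValueError("KR-21 needs at least two items")
    if score_variance <= 0:
        raise ValueError("score variance must be positive")
    k = float(n_items)
    m = float(mean_score)
    # KR-21 = k / (k - 1) * (1 - M * (k - M) / (k * s^2))
    return (k / (k - 1.0)) * (1.0 - (m * (k - m)) / (k * score_variance))

# Hypothetical summary statistics for two groups on the same 50-item test
reliability_group_a = kr21(n_items=50, mean_score=25.0, score_variance=100.0)
reliability_group_b = kr21(n_items=50, mean_score=20.0, score_variance=90.0)
print(round(reliability_group_a, 3), round(reliability_group_b, 3))
```

Comparing the two estimates side by side mirrors the logic of the studies above: similar reliabilities across groups are taken as evidence against this form of construct bias.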

Sandoval (1979) and Oakland and Feigenbaum (1979) have extensively examined the internal consistency reliability of the WISC-R subtests, excluding Digit Span and Coding, for which internal consistency analysis is inappropriate. Both studies included Black, White, and Mexican American children. Both samples were large, and Sandoval's exceeded 1,000.

Sandoval (1979) estimated reliabilities to be within .04 of each other for all subtests except Object Assembly. This subtest was most reliable for Black children at .95, followed by Whites at .79 and Mexican Americans at .75. Oakland and Feigenbaum (1979) found reliabilities within .06, again excepting Object Assembly. In this study, the subtest was most reliable for Whites at .76, followed by Mexican Americans at .67 and Blacks at .64. Oakland and Feigenbaum also found consistent reliabilities for males and females.

Dean (1979) assessed the internal consistency reliability of the WISC-R for Mexican American children tested by White examiners. Reliabilities were consistent with, although slightly larger than, those reported by Wechsler (1975) for the norming sample.

Results with the WISC-III norming sample (Prifitera, Weiss, & Saklofske, 1998) suggested a substantial association between IQ and SES. WISC-III Full Scale IQ was higher for children whose parents had high education levels, and parental education is considered a good measure of SES. The children's Full Scale IQs were 110.7, 103.0, 97.9, 90.6, and 87.7, respectively, in the direction of highest (college or above) to lowest (< 8th grade) parental education level. Researchers have reported similar results for other IQ tests (Prifitera et al.). Such results should not be taken as showing SES bias because, like ethnic and gender differences, they may reflect real distinctions, perhaps influenced by social and economic factors. Indeed, IQ is thought to be associated with SES. By reflecting this theoretical characteristic of intelligence, SES differences may support the construct validity of the tests examined.

Psychologists view intelligence as a developmental phenomenon (Reynolds, Lowe, et al., 1999). Hence, similar correlations of raw scores with age may be evidence of construct validity for intelligence tests. Jensen (1976) found that these correlations for the PPVT were .73 with Blacks, .79 with Whites, and .67 with Mexican Americans. For Raven's colored matrices, correlations were .66 for Blacks, .72 for Whites, and .70 for Mexican Americans. The K-ABC produced similar results (Kamphaus & Reynolds, 1987).

A review by Moran (1990) and a literature search by Reynolds, Lowe, et al. (1999) indicated that few construct bias studies of personality tests had been published. This limitation is notable, given large mean differences on the Minnesota Multiphasic Personality Inventory (MMPI), and possibly the MMPI-2, for different genders and ethnicities (Reynolds et al.). Initial results for the Revised Children's Manifest Anxiety Scale (RCMAS) suggest consistent results by gender and ethnicity (Moran, 1990; Reynolds & Paget, 1981).

To summarize, studies using different samples, methodologies, and definitions of bias indicate that many prominent standardized tests are consistent from one race, ethnicity, and gender to another (see Reynolds, 1982b, for a review of methodologies). These tests appear to be reasonably unbiased for the groups investigated.

Predictive Validity

As its name implies, predictive validity pertains to prediction from test scores, whereas content and construct validity pertain to measurement. Anastasi (1988) defines predictive or criterion-related validity as "the effectiveness of a test in predicting an individual's performance in specified activities" (p. 145). Thus, test bias in predictive validity may be defined as systematic error that affects examinees' performance differentially depending on their group membership. Cleary et al. (1975) defined predictive test bias as constant error in an inference or prediction, or error in a prediction that exceeds the smallest feasible random error, as a function of membership in a particular group. Oakland and Matuszek (1977) found that fewer children were wrongly placed using these criteria than using other, varied models of bias. An early court ruling also favored Cleary's definition (Cortez v. Rosen, 1975).

Of importance, inaccurate prediction sometimes reflects inconsistent measurement of the characteristic being predicted, rather than bias in the test used to predict it. In addition, numerous investigations of predictive bias have addressed the selection of employment and college applicants of different racial and ethnic groups. Future studies should also address personality tests (Moran, 1990). As the chapter will show, copious results for intelligence tests are available.

Under the definition presented by Cleary et al. (1975), the regression line formed by any predictor and criterion (e.g., total test score and a predicted characteristic) must be the same for each group with whom the test is used. A regression line consists of two parameters: a slope a and an intercept b. Too great a group difference in either of these parameters indicates that a regression equation based on the combined groups would predict inaccurately (Reynolds, Lowe, et al., 1999). A separate equation for each group then becomes necessary with the groups and characteristics for which bias has been found.
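
The comparison of regression lines can be illustrated with a short sketch. It fits one line per group and one pooled line, then forms a generic F statistic for whether separate lines fit better than the pooled line. This is an assumed, simplified stand-in and not necessarily the exact procedure used in the studies discussed in this chapter. The simulated scores, function names, and the size of the intercept difference are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; returns slope, intercept, and sum of squared errors."""
    slope, intercept = np.polyfit(x, y, 1)
    sse = float(np.sum((y - (slope * x + intercept)) ** 2))
    return float(slope), float(intercept), sse

def compare_regressions(x_a, y_a, x_b, y_b):
    """Compare group-specific regression lines with a single pooled line."""
    x_a, y_a, x_b, y_b = (np.asarray(v, dtype=float) for v in (x_a, y_a, x_b, y_b))
    slope_a, intercept_a, sse_a = fit_line(x_a, y_a)
    slope_b, intercept_b, sse_b = fit_line(x_b, y_b)
    _, _, sse_pooled = fit_line(np.concatenate([x_a, x_b]),
                                np.concatenate([y_a, y_b]))
    sse_separate = sse_a + sse_b
    df_resid = len(x_a) + len(x_b) - 4  # two slopes and two intercepts estimated
    # Two extra parameters are gained by fitting separate lines
    f_stat = ((sse_pooled - sse_separate) / 2.0) / (sse_separate / df_resid)
    return {
        "slope_a": slope_a, "intercept_a": intercept_a,
        "slope_b": slope_b, "intercept_b": intercept_b,
        "f_stat": float(f_stat), "df": (2, df_resid),
    }

# Simulated predictor (test score) and criterion (achievement) for two groups
rng = np.random.default_rng(1)
x_a = rng.normal(100, 15, size=200)
x_b = rng.normal(100, 15, size=200)
y_a = 10 + 0.5 * x_a + rng.normal(0, 5, size=200)
y_b = 0 + 0.5 * x_b + rng.normal(0, 5, size=200)  # same slope, lower intercept

result = compare_regressions(x_a, y_a, x_b, y_b)
print(result)
```

A large F statistic relative to its degrees of freedom indicates that a single equation for the combined groups would predict inaccurately for at least one group. Follow-up tests are then needed to tell slope differences from intercept differences.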

Hunter, Schmidt, and Hunter (1979) reviewed 39 studies, yielding 866 comparisons, of Black-White test score validity in personnel selection. The researchers concluded that the results did not support a hypothesis of differential or single-group validity. Several studies of the Scholastic Aptitude Test (SAT) indicated no predictive bias, or small bias against Whites, in predicting grade point average (GPA) and other measures of college performance (Cleary, 1968; Cleary et al., 1975).

Reschly and Sabers (1979) examined the validity of WISC-R IQs in predicting the Reading and Math subtest scores of Blacks, Whites, Mexican Americans, and Papago Native Americans on the Metropolitan Achievement Tests (MAT). The MAT has undergone item analysis procedures to eliminate content bias, making it especially appropriate for this research: Content bias can be largely ruled out as a competing explanation for any invalidity in prediction. WISC-R IQs underpredicted MAT scores for Whites, compared with the remaining groups. Overprediction was greatest for Papagos. The intercept typically showed little bias.

Reynolds and Gutkin (1980) conducted similar analyses for WISC-R Verbal, Performance, and Full Scale IQs as predictors of arithmetic, reading, and spelling. The samples were large groups of White and Mexican American children from the southwestern United States. Only the equation for Performance IQ and arithmetic achievement differed for the two groups. Here, an intercept bias favored Mexican American children.

Likewise, Reynolds and Hartlage (1979) assessed WISC and WISC-R Full Scale IQs as predictors of Blacks' and Whites' arithmetic and reading achievement. The children's teachers had referred them for psychological services in a rural, southern school district. The researchers found no statistically significant differences for these children. Many participants, however, had incomplete data (34% of the total).

Prifitera, Weiss, and Saklofske (1998) noted studies in which the WISC-III predicted achievement equally for Black, White, and Hispanic children. In one study, Weiss and Prifitera (1995) examined WISC-III Full Scale IQ as a predictor of Wechsler Individual Achievement Test (WIAT) scores for Black, White, and Hispanic children aged 6 to 16 years. Results indicated little evidence of slope or intercept bias, a finding consistent with those for the WISC and WISC-R. Weiss, Prifitera, and Roid (1993) reported similar results.

Bossard, Reynolds, and Gutkin (1980) analyzed the 1972 Stanford-Binet Intelligence Scale when used to predict the reading, spelling, and arithmetic attainment of referred Black and White children. No statistically significant bias appeared in comparisons of either correlations or regression analyses.

Reynolds, Willson, and Chatman (1985) evaluated K-ABC scores as predictors of Black and White children's academic attainment. Some of the results indicated bias, usually over-prediction of Black children's attainment. Of 56 Potthoff comparisons, however, most indicated no statistically significant bias. Thus, evidence for bias had low method reliability for these children.

In addition, Kamphaus and Reynolds (1987) reviewed seven studies on predictive bias with the K-ABC. Overprediction of Black children's scores was more common than with other tests and was particularly common with the Sequential Processing Scale. The differences were small and were mitigated by using the K-ABC Mental Processing Composite. Some underprediction of Black children's scores also occurred.

A series of very-large-N studies reviewed by Jensen (1980) and Sattler (1974) have compared the predictive validities of group IQ tests for different races. This procedure has an important limitation. If validities differ, regression analyses must also differ. If validities are the same, regression analyses may nonetheless differ, making additional analysis necessary (but see Reynolds, Lowe, et al., 1999). In addition, Jensen and Sattler found few available studies. Lorge-Thorndike Verbal and Nonverbal IQs were the results most often investigated. The reviewers concluded that validities were comparable for Black and White elementary school children. In the future, researchers should broaden the range of group intelligence tests that they examine. Emphasis on a small subset of available measures is a common limitation of test research.

Guterman (1979) reported an extensive analysis of the Ammons and Ammons Quick Test (QT), a verbal IQ measure, with adolescents of different social classes. The variables predicted were (a) social knowledge measures; (b) school grades obtained in Grades 9, 10, and 12; (c) Reading Comprehension Test scores on the Gates Reading Survey; and (d) Vocabulary and Arithmetic subtest scores on the General Aptitude Test Battery (GATB). Guterman found little evidence of slope or intercept bias with these adolescents, except that one social knowledge measure, sexual knowledge, showed intercept bias.

Another extensive analysis merits attention, given its unexpected results. Reynolds (1978) examined seven major preschool tests: the Draw-a-Design and Draw-a-Child subtests of the McCarthy Scales, the Mathematics and Language subtests of the Tests of Basic Experiences, the Preschool Inventory-Revised Edition, the Lee-Clark Readiness Test, and the Metropolitan Readiness Tests (MRT). Variables predicted were four MAT subtests: Word Knowledge, Word Discrimination, Reading, and Arithmetic. Besides increased content validity, the MAT had the advantage of being chosen by teachers in the district as the test most nearly measuring what was taught in their classrooms. Reynolds compared correlations and regression analyses for the following race-gender combinations: Black females versus Black males, White females versus White males, Black females versus White females, and Black males versus White males. The result was 112 comparisons each for correlations and regression analyses.

For each criterion, scores fell in the same rank order: White females < White males < Black females < Black males. Mean validities comparing pre- and posttest scores, with 12 months intervening, were .59 for White females, .50 for White males, .43 for Black females, and .30 for Black males. In spite of these overall differences, only three differences between correlations were statistically significant, a chance finding with 112 comparisons. Potthoff comparisons of regression lines, however, indicated 43 statistically significant differences. Most of these results occurred when race rather than gender was compared: 31 of 46 comparisons (p < .01). The Preschool Inventory and Lee-Clark Test most frequently showed bias; the MRT never did. The observed bias overpredicted scores of Black and male children.

Researchers should investigate possible reasons for these results, which may have differed for the seven predictors but also by the statistical results compared. Either Potthoff comparisons or comparisons of correlations may be inaccurate or inconsistent as analyses of predictive test bias (see also Reynolds, 1980).

Brief screening measures tend to have low reliability compared with major ability and aptitude tests such as the WISC-III and the K-ABC. Low reliability can lead to bias in prediction (Reynolds, Lowe, et al., 1999). More reliable measures, such as the Metropolitan Readiness Tests (MRT), the WPPSI, and the McCarthy Scales, have shown little evidence of internal bias. The WPPSI and McCarthy Scales have not been assessed for predictive bias with differing racial or ethnic groups (Reynolds, Lowe, et al., 1999).

Reynolds (1980) examined test and subtest scores for the seven tests noted above when used to predict MAT scores for males and females and for diverse ethnic groups. The researcher examined residuals, the differences between predicted scores and actual scores obtained by examinees. Techniques used were multiple regression to obtain residuals and race by gender ANOVA to analyze them.

ANOVA results indicated no statistically significant differences in residuals for ethnicities or genders, and no statistically significant interactions. Reynolds (1980) then examined a subset of the seven-test battery. No evidence of racial bias appeared. The results indicated gender bias in predicting two of the four MAT subtests, Word Discrimination and Word Knowledge. The seven tests consistently underpredicted females' scores. The difference was small, on the order of .13 to .16 standard deviation.

For predictive validity, as for content and construct validity, the results reviewed above suggest little evidence of bias, be it differential or single-group validity. Differences are infrequent. Where they exist, they usually take the form of small overpredictions for lower scoring groups, such as disadvantaged, low-SES, or ethnic minority examinees. These overpredictions are unlikely to account for adverse placement or diagnosis of these groups. On a grander scale, the small differences found may be reflections, but would not be major causes, of sweeping social inequalities affecting ethnic group members. The causes of such problems as employment discrimination and economic deprivation lie primarily outside the testing environment.
