placement, or selection
.90 or greater
proposed guidelines for the evaluation of test score reliability coefficients (e.g., Bracken, 1987; Cicchetti, 1994; Clark & Watson, 1995; Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001), depending upon whether test scores are to be used for high- or low-stakes decision-making. High-stakes tests refer to tests that have important and direct consequences such as clinical-diagnostic, placement, promotion, personnel selection, or treatment decisions; by virtue of their gravity, these tests require more rigorous and consistent psychometric standards. Low-stakes tests, by contrast, tend to have only minor or indirect consequences for examinees.
After a test meets acceptable guidelines for minimal acceptable reliability, there are limited benefits to further increasing reliability. Clark and Watson (1995) observe that "Maximizing internal consistency almost invariably produces a scale that is quite narrow in content; if the scale is narrower than the target construct, its validity is compromised" (pp. 316-317). Nunnally and Bernstein (1994, p. 265) state more directly: "Never switch to a less valid measure simply because it is more reliable."
Internal consistency indexes of reliability provide a single average estimate of measurement precision across the full range of test scores. In contrast, local reliability refers to measurement precision at specified trait levels or ranges of scores. Conditional error refers to the measurement variance at a particular level of the latent trait, and its square root is a conditional standard error. Whereas classical test theory posits that the standard error of measurement is constant and applies to all scores in a particular population, item response theory posits that the standard error of measurement varies according to the test scores obtained by the examinee but generalizes across populations (Embretson & Hershberger, 1999).
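As a minimal sketch of the classical view described above, the standard error of measurement can be computed from a score standard deviation and a reliability coefficient using the standard classical formula SEM = SD * sqrt(1 - r). The function name and the scale with a standard deviation of 15 are illustrative assumptions, not values taken from any test discussed here.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory SEM: one constant value applied to all scores."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical scale with SD = 15 (an assumption for illustration only).
sem_at_90 = standard_error_of_measurement(15.0, 0.90)
sem_at_80 = standard_error_of_measurement(15.0, 0.80)
print(round(sem_at_90, 2), round(sem_at_80, 2))
```

Under these assumptions, raising reliability from .80 to .90 shrinks the standard error by roughly two score points, which is one way to see why the stricter criterion is attached to high-stakes decisions.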
As an illustration of the use of classical test theory to determine local reliability, the Universal Nonverbal Intelligence Test (UNIT; Bracken & McCallum, 1998) reports reliabilities for scores near a key decision point. Based on the rationale that a common cut score for classification of individuals as mentally retarded is an FSIQ equal to 70, the reliability of test scores surrounding that decision point was calculated. Specifically, coefficient alpha reliabilities were calculated for FSIQs between 1.33 and 2.66 standard deviations below the normative mean. Reliabilities were corrected for restriction in range, and results showed that composite IQ reliabilities exceeded the suggested criterion of .90. That is, the UNIT is sufficiently precise in this ability range to reliably identify individual performance near a common cut point for classification as mentally retarded.
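Coefficient alpha, the index computed within the restricted score band above, can be sketched as follows. This is a generic illustration using the usual formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores); the item scores are invented, and the sketch omits the correction for restriction in range, whose exact form the text does not specify.

```python
from statistics import variance

def coefficient_alpha(scores):
    """Cronbach's alpha for a persons-by-items table of item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across persons.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each person's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 examinees, 3 items.
scores = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4], [3, 3, 3]]
alpha = coefficient_alpha(scores)
print(round(alpha, 3))
```

To approximate a local reliability in this spirit, the same function would be applied only to examinees whose total scores fall in the band of interest, after which a range-restriction correction would still be needed.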
Item response theory permits the determination of conditional standard error at every level of performance on a test. Several measures, such as the Differential Ability Scales (Elliott, 1990) and the Scales of Independent Behavior-Revised (SIB-R; Bruininks, Woodcock, Weatherman, & Hill, 1996), report local standard errors or local reliabilities for every test score. This methodology not only determines whether a test is more accurate for some members of a group (e.g., high-functioning individuals) than for others (Daniel, 1999), but also promises that many other indexes derived from reliability indexes (e.g., index discrepancy scores) may eventually become tailored to an examinee's actual performance. Several IRT-based methodologies are available for estimating local scale reliabilities using conditional standard errors of measurement (Andrich, 1988; Daniel, 1999; Kolen, Zeng, & Hanson, 1996; Samejima, 1994), but none has yet become a test industry standard.
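One way to see how a conditional standard error varies with trait level is to compute it from the test information function. The sketch below assumes a two-parameter logistic (2PL) model, which is a choice made here for illustration rather than the model used by any of the instruments cited; the item parameters are hypothetical.

```python
import math

def item_information_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information of one 2PL item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def conditional_se(theta: float, items) -> float:
    """Conditional standard error: 1 / sqrt(test information at theta)."""
    info = sum(item_information_2pl(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical (discrimination, difficulty) pairs centered on theta = 0.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(conditional_se(theta, items), 2))
```

With these assumed items, the standard error is smallest near the center of the difficulty range and grows toward the extremes, which is the pattern that makes a test more accurate for some examinees than for others.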
Are test scores consistent over time? Test scores must be reasonably consistent to have practical utility for making clinical and educational decisions and to be predictive of future performance. The stability coefficient, or test-retest score reliability coefficient, is an index of temporal stability that can be calculated by correlating test performance for a large number of examinees at two points in time. Two weeks is considered a preferred test-retest time interval (Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001), because longer intervals increase the amount of error (due to maturation and learning) and tend to lower the estimated reliability.
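A stability coefficient of the kind described above is simply the Pearson correlation between scores from the two administrations. The following sketch uses only the standard library; the eight pairs of scores are invented for illustration, and a real study would need a much larger sample.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same examinees tested two weeks apart.
time1 = [95, 102, 88, 110, 120, 99, 105, 91]
time2 = [97, 100, 90, 112, 118, 101, 103, 94]
stability = pearson_r(time1, time2)
print(round(stability, 2))
```

Because a correlation ignores shifts in the mean, a uniform practice gain between administrations would leave this coefficient unchanged; mean differences have to be inspected separately.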
Bracken (1987; Bracken & McCallum, 1998) recommends that a total test stability coefficient should be greater than or equal to .90 for high-stakes tests over relatively short test-retest intervals, whereas a stability coefficient of .80 is reasonable for low-stakes testing. Stability coefficients may be spuriously high, even with tests with low internal consistency, but tests with low stability coefficients tend to have low internal consistency unless they are tapping highly variable state-based constructs such as state anxiety (Nunnally & Bernstein, 1994). As a general rule of thumb, measures of internal consistency are preferred to stability coefficients as indexes of reliability.
Whenever tests require observers to render judgments, ratings, or scores for a specific behavior or performance, the consistency among observers constitutes an important source of measurement precision. Two separate methodological approaches have been utilized to study consistency and consensus among observers: interrater reliability (using correlational indexes to reference consistency among observers) and interrater agreement (addressing percent agreement among observers; e.g., Tinsley & Weiss, 1975). These distinctive approaches are necessary because it is possible to have high interrater reliability with low manifest agreement among raters if ratings are different but proportional. Similarly, it is possible to have low interrater reliability with high manifest agreement among raters if consistency indexes lack power because of restriction in range.
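The two dissociations described above can be reproduced with small invented data sets. In this sketch, interrater reliability is indexed by a Pearson correlation and interrater agreement by the proportion of exact matches; both are simple stand-ins chosen for illustration, and all ratings are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation, used here as a simple reliability index."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def percent_agreement(x, y):
    """Proportion of cases on which two raters give the identical rating."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Case 1: ratings differ but are proportional (rater B doubles rater A).
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 4, 6, 8, 10]
r_proportional = pearson_r(rater_a, rater_b)
agree_proportional = percent_agreement(rater_a, rater_b)

# Case 2: restricted range, so nearly every rating is a 3.
rater_c = [3, 3, 3, 4, 3, 3, 3, 3, 3, 4]
rater_d = [3, 3, 3, 3, 3, 3, 4, 3, 3, 4]
r_restricted = pearson_r(rater_c, rater_d)
agree_restricted = percent_agreement(rater_c, rater_d)

print(r_proportional, agree_proportional)
print(round(r_restricted, 3), agree_restricted)
```

In the first case the correlation is perfect although the raters never give the same rating; in the second the raters match on most cases, yet the correlation is low because there is little variance to correlate.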
Interrater reliability refers to the proportional consistency of variance among raters and tends to be correlational. The simplest index involves correlation of total scores generated by separate raters. The intraclass correlation is another index of reliability commonly used to estimate the reliability of ratings. Its value ranges from 0 to 1.00, and it can be used to estimate the expected reliability of either the individual ratings provided by a single rater or the mean rating provided by a group of raters (Shrout & Fleiss, 1979). Another index of reliability, Kendall's coefficient of concordance, establishes how much reliability exists among ranked data. This procedure is appropriate when raters are asked to rank order the persons or behaviors along a specified dimension.
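For ranked data, Kendall's coefficient of concordance can be computed directly from the rank sums. The sketch below uses the standard formula for untied ranks, W = 12S / (m^2 (n^3 - n)), where m is the number of raters, n the number of persons ranked, and S the sum of squared deviations of the rank sums from their mean; the rankings are hypothetical, and tied ranks would require a correction not shown here.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters who each rank the same n objects (no ties)."""
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each object received across raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of four children by three raters.
rankings = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
w = kendalls_w(rankings)
print(round(w, 3))
```

A value of 1.0 would indicate that all raters produced the same ordering, and a value near 0 would indicate no concordance among the rankings.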
Interrater agreement refers to the interchangeability of judgments among raters, addressing the extent to which raters make the same ratings. Indexes of interrater agreement typically estimate percentage of agreement on categorical and rating decisions among observers, differing in the extent to which they are sensitive to degrees of agreement and correct for chance agreement. Cohen's kappa is a widely used statistic of interobserver agreement intended for situations in which raters classify the items being rated into discrete, nominal categories. Kappa ranges from -1.00 to +1.00; kappa values of .75 or higher are generally taken to indicate excellent agreement beyond chance, values between .60 and .74 are considered good agreement, those between .40 and .59 are considered fair, and those below .40 are considered poor (Fleiss, 1981).
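Cohen's kappa for two raters can be sketched as observed agreement corrected for the agreement expected by chance from each rater's category proportions: kappa = (p_o - p_e) / (1 - p_e). The classifications below are invented, and the labeling function simply encodes the interpretive bands quoted above.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_chance = sum(counts1[c] * counts2[c] for c in counts1) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_chance) / (1.0 - p_chance)

def interpret_kappa(kappa):
    """Interpretive bands as quoted in the text."""
    if kappa >= 0.75:
        return "excellent"
    if kappa >= 0.60:
        return "good"
    if kappa >= 0.40:
        return "fair"
    return "poor"

# Hypothetical yes/no classifications of 12 cases by two observers.
rater1 = ["y"] * 6 + ["n"] * 6
rater2 = ["y"] * 5 + ["n"] * 7
kappa = cohens_kappa(rater1, rater2)
print(round(kappa, 3), interpret_kappa(kappa))
```

Here the raters agree on 11 of 12 cases, but because half of that agreement would be expected by chance, kappa is lower than the raw agreement rate.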
Interrater reliability and agreement may vary logically depending upon the degree of consistency expected from specific sets of raters. For example, it might be anticipated that people who rate a child's behavior in different contexts (e.g., school vs. home) would produce lower correlations than two raters who rate the child within the same context (e.g., two parents within the home or two teachers at school). In a review of 13 preschool social-emotional instruments, the vast majority of reported coefficients of interrater congruence were below .80 (range .12 to .89). Walker and Bracken (1996) investigated the congruence of biological parents who rated their children on four preschool behavior rating scales. Interparent congruence ranged from a low of .03 (Temperament Assessment Battery for Children Ease of Management through Distractibility) to a high of .79 (Temperament Assessment Battery for Children Approach/Withdrawal). In addition to concern about low congruence coefficients, the authors voiced concern that 44% of the parent pairs had a mean discrepancy across scales of 10 to 13 standard score points; differences ranged from 0 to 79 standard score points.
Interrater studies are preferentially conducted under field conditions, to enhance generalizability of testing by clinicians "performing under the time constraints and conditions of their work" (Wood, Nezworski, & Stejskal, 1996, p. 4). Cone (1988) has described interscorer studies as fundamental to measurement, because without scoring consistency and agreement, many other reliability and validity issues cannot be addressed.
When two parallel forms of a test are available, then correlating scores on each form provides another way to assess reliability. In classical test theory, strict parallelism between forms requires equality of means, variances, and covariances (Gulliksen, 1950). A hierarchy of methods for pinpointing sources of measurement error with alternative forms has been proposed (Nunnally & Bernstein, 1994; Salvia & Ysseldyke, 2001): (a) assess alternate-form reliability with a two-week interval between forms, (b) administer both forms on the same day, and if necessary (c) arrange for different raters to score the forms administered with a two-week retest interval and on the same day. If the score correlation over the two-week interval between the alternative forms is lower than coefficient alpha by .20 or more, then considerable measurement error is present due to internal consistency, scoring subjectivity, or trait instability over time. If the score correlation is substantially higher for forms administered on the same day, then the error may stem from trait variation over time. If the correlations remain low for forms administered on the same day, then the two forms may differ in content with one form being more internally consistent than the other. If trait variation and content differences have been ruled out, then comparison of subjective ratings from different sources may permit the major source of error to be attributed to the subjectivity of scoring.
In item response theory, test forms may be compared by examining the forms at the item level. Forms with items of comparable item difficulties, response ogives, and standard errors by trait level will tend to have adequate levels of alternate form reliability (e.g., McGrew & Woodcock, 2001). For example, when item difficulties for one form are plotted against those for the second form, a clear linear trend is expected. When raw scores are plotted against trait levels for the two forms on the same graph, the ogive plots should be identical.
At the same time, scores from different tests tapping the same construct need not be parallel if both involve sets of items that are close to the examinee's ability level. As reported by Embretson (1999), "Comparing test scores across multiple forms is optimal when test difficulty levels vary across persons" (p. 12). The capacity of IRT to estimate trait level across differing tests does not require assumptions of parallel forms or test equating.
Reliability generalization is a meta-analytic methodology that investigates the reliability of scores across studies and samples (Vacha-Haase, 1998). An extension of validity generalization (Hunter & Schmidt, 1990; Schmidt & Hunter, 1977), reliability generalization investigates the stability of reliability coefficients across samples and studies. In order to demonstrate measurement precision for the populations for which a test is intended, the test should show comparable levels of reliability across various demographic subsets of the population (e.g., gender, race, ethnic groups), as well as salient clinical and exceptional populations.