Test Score Reliability

If measurement is to be trusted, it must be reliable. It must be consistent, accurate, and uniform across testing occasions, across time, across observers, and across samples. In psychometric terms, reliability refers to the extent to which measurement results are precise and accurate, free from random and unexplained error. Because test score reliability sets the upper limit of validity, unreliable test scores cannot be considered valid.

Reliability has been described as "fundamental to all of psychology" (Li, Rosenthal, & Rubin, 1996), and its study dates back nearly a century (Brown, 1910; Spearman, 1910).

Concepts of reliability in test theory have evolved, including an emphasis in item response theory (IRT) models on the test information function as an advance over classical indices (e.g., Hambleton et al., 1991) and attempts to provide new unifying and coherent models of reliability (e.g., Li & Wainer, 1997). For example, Embretson (1999) challenged the classical test theory tradition by asserting that "Shorter tests can be more reliable than longer tests" (p. 12) and that the "standard error of measurement differs between persons with different response patterns but generalizes across populations" (p. 12). In this section, reliability is described according to classical test theory and item response theory, and guidelines are provided for the objective evaluation of reliability.
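The test information function mentioned above can be made concrete with a conventional IRT formulation; the expression below is the standard form for the two-parameter logistic model and is not taken from this chapter. Information sums across items and varies with ability level, so the standard error of measurement varies with ability as well:

    I(\theta) = \sum_{i=1}^{k} a_i^{2}\, P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr], \qquad \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}}

Here P_i(\theta) is the probability that an examinee of ability \theta answers item i correctly and a_i is the item's discrimination. Because I(\theta) peaks where items are well matched to the examinee's ability, precision is greatest there, which is what allows a short, well-targeted adaptive test to be more reliable than a longer fixed test.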

Internal Consistency

Determination of a test's internal consistency addresses the degree of uniformity and coherence among its constituent parts; tests that are more uniform tend to be more reliable. As a measure of internal consistency, the reliability coefficient is the square of the correlation between obtained test scores and true scores; it will be high when there is relatively little measurement error and low when error is large. In classical test theory, reliability is based on the assumption that measurement error is distributed normally and equally across all score levels. By contrast, item response theory posits that reliability differs between persons with different response patterns and levels of ability but generalizes across populations (Embretson & Hershberger, 1999).
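In conventional classical test theory notation (standard usage, not notation introduced in this chapter), the observed score X decomposes into a true score T plus error E, and the reliability coefficient is equivalently the proportion of observed score variance attributable to true scores:

    X = T + E, \qquad r_{XX'} = \rho_{XT}^{2} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = 1 - \frac{\sigma_E^{2}}{\sigma_X^{2}}

When error variance \sigma_E^{2} is small relative to observed variance, the coefficient approaches 1; when error dominates, it approaches 0.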

Several statistics are typically used to calculate internal consistency. The split-half method of estimating reliability splits the test items into halves (e.g., odd items and even items) and correlates the score from each half of the test with the score from the other half. Because each half contains only half as many items as the full test, this technique tends to underestimate the reliability of the full-length test. Use of the Spearman-Brown prophecy formula permits extrapolation from the obtained half-test coefficient to the original length of the test, typically raising the estimated reliability. Perhaps the most common statistical index of internal consistency is Cronbach's alpha, which provides a lower-bound estimate of test score reliability equivalent to the average split-half coefficient across all possible divisions of the test into halves; a worked sketch of all three statistics follows below. Note that item response theory implies that under some conditions (e.g., adaptive testing, in which only the items closest to an examinee's ability level need be administered), short tests can be more reliable than longer tests (e.g., Embretson, 1999).
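The following Python sketch illustrates the three estimators just described on simulated data. The data set, function names, and the choice of an odd/even split are assumptions made for illustration; none come from this chapter.

    import numpy as np

    def split_half_reliability(scores):
        """Correlate odd-item and even-item half-test scores (Pearson r)."""
        odd = scores[:, 0::2].sum(axis=1)
        even = scores[:, 1::2].sum(axis=1)
        return np.corrcoef(odd, even)[0, 1]

    def spearman_brown(r_half, length_factor=2.0):
        """Spearman-Brown prophecy formula: project reliability to a test
        length_factor times as long as the one that produced r_half."""
        return (length_factor * r_half) / (1 + (length_factor - 1) * r_half)

    def cronbach_alpha(scores):
        """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / total-score variance)."""
        k = scores.shape[1]
        item_vars = scores.var(axis=0, ddof=1)
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Simulate 200 examinees answering 20 dichotomous items driven by one ability factor.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(200, 1))
    items = (ability + rng.normal(size=(200, 20)) > 0).astype(float)

    r_half = split_half_reliability(items)
    print(f"Split-half r:     {r_half:.3f}")
    print(f"Spearman-Brown r: {spearman_brown(r_half):.3f}")
    print(f"Cronbach's alpha: {cronbach_alpha(items):.3f}")

As expected, the uncorrected split-half coefficient runs below the Spearman-Brown projection, reflecting the shortened test, while alpha, being the average over all possible splits, typically lands close to the corrected value.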

In general, minimum levels of acceptable reliability should be determined by the intended application and likely consequences of test scores. Several psychometricians have proposed guidelines for acceptable internal consistency coefficients, summarized in Table 3.1.

TABLE 3.1 Guidelines for Acceptable Internal Consistency Reliability Coefficients

Test Methodology      Purpose of Assessment      Median Reliability Coefficient
Group assessment      Programmatic