Classical test theory traces its origins to the procedures pioneered by Galton, Pearson, Spearman, and E. L. Thorndike, and it is usually defined by Gulliksen's (1950) classic book. Classical test theory has shaped contemporary investigations of test score reliability, validity, and fairness, as well as the widespread use of statistical techniques such as factor analysis.
At its heart, classical test theory is based upon the assumption that an obtained test score reflects both true score and error score. Test scores may be expressed in the familiar equation
Observed Score = True Score + Error
In this framework, the observed score is the test score that was actually obtained. The true score is the hypothetical amount of the designated trait specific to the examinee, a quantity that would be expected if the entire universe of relevant content were assessed or if the examinee were tested an infinite number of times without any confounding effects of such things as practice or fatigue. Measurement error is defined as the difference between true score and observed score. Error is uncorre-lated with the true score and with other variables, and it is distributed normally and uniformly about the true score. Because its influence is random, the average measurement error across many testing occasions is expected to be zero.
Many of the key elements from contemporary psychomet-rics may be derived from this core assumption. For example, internal consistency reliability is a psychometric function of random measurement error, equal to the ratio of the true score variance to the observed score variance. By comparison, validity depends on the extent of nonrandom measurement error. Systematic sources of measurement error negatively influence validity, because error prevents measures from validly representing what they purport to assess. Issues of test fairness and bias are sometimes considered to constitute a special case of validity in which systematic sources of error across racial and ethnic groups constitute threats to validity generalization. As an extension of classical test theory, generalizabil-ity theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965) includes a family of statistical procedures that permits the estimation and partitioning of multiple sources of error in measurement. Generalizability theory posits that a response score is defined by the specific conditions under which it is produced, such as scorers, methods, settings, and times (Cone, 1978); generalizability coefficients estimate the degree to which response scores can be generalized across different levels of the same condition.
Classical test theory places more emphasis on test score properties than on item parameters. According to Gulliksen (1950), the essential item statistics are the proportion of persons answering each item correctly (item difficulties, or p values), the point-biserial correlation between item and total score multiplied by the item standard deviation (reliability index), and the point-biserial correlation between item and criterion score multiplied by the item standard deviation (validity index).
Hambleton, Swaminathan, and Rogers (1991) have identified four chief limitations of classical test theory: (a) It has limited utility for constructing tests for dissimilar examinee populations (sample dependence); (b) it is not amenable for making comparisons of examinee performance on different tests purporting to measure the trait of interest (test dependence); (c) it operates under the assumption that equal measurement error exists for all examinees; and (d) it provides no basis for predicting the likelihood of a given response of an examinee to a given test item, based upon responses to other items. In general, with classical test theory it is difficult to separate examinee characteristics from test characteristics. Item response theory addresses many of these limitations.
Was this article helpful?