The process of item and scale calibration dates back to the earliest attempts to measure temperature. Early in the seventeenth century, there was no method to quantify heat and cold except through subjective judgment. Galileo and others experimented with devices that expanded air in glass as heat increased; use of liquid in glass to measure temperature was developed in the 1630s. Some two dozen temperature scales were available for use in Europe in the seventeenth century, and each scientist had his own scales with varying gradations and reference points. It was not until the early eighteenth century that more uniform scales were developed by Fahrenheit, Celsius, and de Reaumur.
The process of calibration has similarly evolved in psychological testing. In classical test theory, item difficulty is judged by the p value, or the proportion of people in the sample that passes an item. During ability test development, items are typically ranked by p value or the amount of the trait being measured. The use of regular, incremental increases in item difficulties provides a methodology for building scale gradations. Item difficulty properties in classical test theory are dependent upon the population sampled, so that a sample with higher levels of the latent trait (e.g., older children on a set of vocabulary items) would show different item properties (e.g., higher p values) than a sample with lower levels of the latent trait (e.g., younger children on the same set of vocabulary items).
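The sample dependence of classical p values can be seen in a minimal sketch. The response data and the function name `item_p_values` below are hypothetical and chosen only for illustration; each row is one examinee and each column one vocabulary item scored 0 (fail) or 1 (pass).

```python
def item_p_values(responses):
    """Return the proportion of examinees passing each item.

    `responses` is a list of examinee records, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(person[i] for person in responses) / n_examinees
            for i in range(n_items)]


# Hypothetical data: the same three items given to two samples of four children.
younger = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
older = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 1]]

p_younger = item_p_values(younger)  # [0.75, 0.25, 0.0]
p_older = item_p_values(older)      # [1.0, 0.75, 0.5]
```

The items keep the same rank order of difficulty in both samples, but every p value is higher in the older sample, so the classical difficulty index describes the item and the sample together.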
In contrast, item response theory includes both item properties and levels of the latent trait in analyses, permitting item calibration to be sample-independent. The same item difficulty and discrimination values will be estimated regardless of trait distribution. This process permits item calibration to be "sample-free," according to Wright (1999), so that the scale transcends the group measured. Embretson (1999) has stated one of the new rules of measurement as "Unbiased estimates of item properties may be obtained from unrepresentative samples" (p. 13).
Item response theory permits several item parameters to be estimated in the process of item calibration. Among the indexes calculated in widely used Rasch model computer programs (e.g., Linacre & Wright, 1999) are item fit-to-model expectations, item difficulty calibrations, item-total correlations, and item standard error. The conformity of any item to expectations from the Rasch model may be determined by examining item fit. Items are said to have good fits with typical item characteristic curves when they show expected patterns near to and far from the latent trait level for which they are the best estimates. Measures of item difficulty adjusted for the influence of sample ability are typically expressed in logits, permitting approximation of equal difficulty intervals.
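As a rough illustration of what the logit scale buys, the sketch below codes the standard Rasch item characteristic curve, in which the probability of passing depends only on the difference between person trait level and item difficulty. The function names and the two item difficulties are assumptions made for this example and are not taken from any particular calibration program.

```python
import math


def rasch_probability(theta, b):
    """Probability of passing an item under the Rasch model.

    theta: person trait level in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p):
    """Convert a probability to log-odds (logits)."""
    return math.log(p / (1.0 - p))


# Hypothetical item difficulties, in logits.
easy_item, hard_item = -0.5, 1.0

# The difficulty gap between the two items, measured in logits, is the
# same at every trait level, which is the sense in which the calibration
# does not depend on who happens to be in the sample.
gaps = []
for theta in (-1.0, 0.0, 2.0):
    gap = (log_odds(rasch_probability(theta, easy_item))
           - log_odds(rasch_probability(theta, hard_item)))
    gaps.append(gap)
```

Each entry in `gaps` equals the difference between the two difficulties (1.5 logits) up to floating-point rounding, whereas the corresponding gap in raw passing proportions would change with the trait level of the examinees.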
The item gradient of a test refers to how steeply or gradually items are arranged by trait level and the resulting gaps that may ensue in standard scores. In order for a test to have adequate sensitivity to differing degrees of ability or any trait being measured, it must have adequate item density across the distribution of the latent trait. The larger the resulting standard score differences in relation to a change in a single raw score point, the less sensitive, discriminating, and effective a test is.
For example, on the Memory subtest of the Battelle Developmental Inventory (Newborg, Stock, Wnek, Guidubaldi, & Svinicki, 1984), a child aged 1 year, 11 months who earned a raw score of 7 would have performance ranked at the 1st percentile for age, whereas a raw score of 8 leaps to a percentile rank of 74. The steepness of this gradient in the distribution of scores suggests that this subtest is insensitive to even large gradations in ability at this age.
A similar problem is evident on the Motor Quality index of the Bayley Scales of Infant Development-Second Edition Behavior Rating Scale (Bayley, 1993). A 36-month-old child with a raw score rating of 39 obtains a percentile rank of 66. The same child obtaining a raw score of 40 is ranked at the 99th percentile.
As a recommended guideline, tests may be said to have adequate item gradients and item density when there are approximately three items per Rasch logit, or when passage of a single item results in a standard score change of less than one third standard deviation (0.33 SD) (Bracken, 1987; Bracken & McCallum, 1998). Items that are not evenly distributed in terms of the latent trait may yield steeper change gradients that will decrease the sensitivity of the instrument to finer gradations in ability.
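The one-third standard deviation guideline can be checked mechanically against a norm table. The sketch below assumes a standard score metric with a standard deviation of 15 and a hypothetical list of standard scores indexed by raw score for a single age group; the function name and the data are illustrative only.

```python
def steep_gradient_steps(standard_scores, sd=15.0):
    """Return (raw_score, jump) pairs where gaining one raw score point
    changes the standard score by one third of an SD or more."""
    limit = sd / 3.0
    steps = []
    for raw in range(1, len(standard_scores)):
        jump = standard_scores[raw] - standard_scores[raw - 1]
        if jump >= limit:
            steps.append((raw, jump))
    return steps


# Hypothetical norms for one age group: index = raw score.
norms = [70, 74, 78, 82, 94, 98]

flagged = steep_gradient_steps(norms)  # [(4, 12)]
```

Here only the step from a raw score of 3 to a raw score of 4 is flagged: a single additional item moves the standard score by 12 points, well over the 5-point limit implied by the guideline.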
Do tests have adequate breadth, bottom and top? Many tests yield their most valuable clinical inferences when scores are extreme (i.e., very low or very high). Accordingly, tests used for clinical purposes need sufficient discriminating power in the extreme ends of the distributions.
The floor of a test represents the extent to which an individual can earn appropriately low standard scores. For example, an intelligence test intended for use in the identification of individuals diagnosed with mental retardation must, by definition, extend at least 2 standard deviations below normative expectations (IQ < 70). In order to serve individuals with severe to profound mental retardation, test scores must extend even further to more than 4 standard deviations below the normative mean (IQ < 40). Tests without a sufficiently low floor would not be useful for decision-making for more severe forms of cognitive impairment.
A similar situation arises for test ceiling effects. An intelligence test with a ceiling greater than 2 standard deviations above the mean (IQ > 130) can identify most candidates for intellectually gifted programs. To identify individuals as exceptionally gifted (i.e., IQ > 160), a test ceiling must extend more than 4 standard deviations above normative expectations. There are several unique psychometric challenges to extending norms to these heights, and most extended norms are extrapolations based upon subtest scaling for higher ability samples (i.e., older examinees than those within the specified age band).
As a rule of thumb, tests used for clinical decision-making should have floors and ceilings that differentiate the extreme lowest and highest 2% of the population from the middlemost 96% (Bracken, 1987, 1988; Bracken & McCallum, 1998). Tests with inadequate floors or ceilings are inappropriate for assessing children with known or suspected mental retardation, intellectual giftedness, severe psychopathology, or exceptional social and educational competencies.
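A simple screening check follows from this rule of thumb. The sketch assumes a normally distributed standard score with a mean of 100 and a standard deviation of 15, and treats 2 standard deviations as the cutoff for roughly the extreme 2% at each end; the function name and the example score ranges are hypothetical.

```python
def has_adequate_floor_and_ceiling(min_score, max_score,
                                   mean=100.0, sd=15.0, z_cut=2.0):
    """Return (floor_ok, ceiling_ok) for the obtainable score range.

    The floor is adequate if the lowest obtainable standard score lies at
    least z_cut SDs below the mean; the ceiling is adequate if the highest
    obtainable score lies at least z_cut SDs above it.
    """
    floor_ok = min_score <= mean - z_cut * sd
    ceiling_ok = max_score >= mean + z_cut * sd
    return floor_ok, ceiling_ok


# Hypothetical obtainable ranges at one age for two tests.
wide_range = has_adequate_floor_and_ceiling(55, 145)     # (True, True)
shallow_floor = has_adequate_floor_and_ceiling(75, 150)  # (False, True)
```

Because floors and ceilings usually vary by age, a check of this kind would need to be run separately for each age band in the norm tables.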
Item response theory yields several different kinds of interpretable scores (e.g., Woodcock, 1999), only some of which are norm-referenced standard scores. Because most test users are most familiar with the use of standard scores, it is the process of arriving at this type of score that we discuss. Transformation of raw scores to standard scores involves a number of decisions based on psychometric science and more than a little art.
The first decision involves the nature of raw score transformations, based upon theoretical considerations (Is the trait being measured thought to be normally distributed?) and examination of the cumulative frequency distributions of raw scores within age groups and across age groups. The objective of this transformation is to preserve the shape of the raw score frequency distribution, including mean, variance, kurtosis, and skewness. Linear transformations of raw scores are based solely on the mean and distribution of raw scores and are commonly used when distributions are not normal; linear transformation assumes that the distances between scale points reflect true differences in the degree of the measured trait present. Area transformations of raw score distributions convert the shape of the frequency distribution into a specified type of distribution. When the raw scores are normally distributed, then they may be transformed to fit a normal curve, with corresponding percentile ranks assigned in a way so that the mean corresponds to the 50th percentile, -1 SD and +1 SD correspond to the 16th and 84th percentiles, respectively, and so forth. When the frequency distribution is not normal, it is possible to select from varying types of nonnormal frequency curves (e.g., Johnson, 1949) as a basis for transformation of raw scores, or to use polynomial curve fitting equations.
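The difference between the two families of transformation can be sketched in a few lines. The code below is an illustration under stated assumptions and not a test publisher's procedure: the target metric (mean 100, SD 15), the use of midpoint percentile ranks for the area transformation, the function names, and the raw scores are all choices made for this example.

```python
from statistics import NormalDist, mean, pstdev


def linear_standard_scores(raw_scores, new_mean=100.0, new_sd=15.0):
    """Linear transformation: rescale z scores, keeping the shape
    (skewness, kurtosis) of the raw score distribution."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [new_mean + new_sd * (x - m) / s for x in raw_scores]


def area_standard_scores(raw_scores, new_mean=100.0, new_sd=15.0):
    """Area transformation: map each raw score's percentile rank onto
    the normal curve, forcing the result toward a normal shape."""
    n = len(raw_scores)
    normal = NormalDist()
    scores = []
    for x in raw_scores:
        below = sum(1 for y in raw_scores if y < x)
        tied = sum(1 for y in raw_scores if y == x)
        # Midpoint percentile rank, always strictly between 0 and 1.
        pr = (below + 0.5 * tied) / n
        scores.append(new_mean + new_sd * normal.inv_cdf(pr))
    return scores


# Hypothetical, positively skewed raw scores for one age group.
raw = [2, 3, 3, 4, 5, 9]
linear = linear_standard_scores(raw)
area = area_standard_scores(raw)
```

With these skewed data the highest raw score receives a more extreme standard score under the linear transformation than under the area transformation, because the linear version carries the long upper tail through unchanged while the area version pulls it in toward the normal curve.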
Following raw score transformations is the process of smoothing the curves. Data smoothing typically occurs within groups and across groups to correct for minor irregularities, presumably those irregularities that result from sampling fluctuations and error. Quality checking also occurs to eliminate vertical reversals (such as those within an age group, from one raw score to the next) and horizontal reversals (such as those within a raw score series, from one age to the next). Smoothing and elimination of reversals serve to ensure that raw score to standard score transformations progress according to growth and maturation expectations for the trait being measured.
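A quality check for reversals might look like the following sketch. It assumes a trait that increases with age, so that a given raw score should never earn a higher standard score in an older age group than in a younger one; the table layout, the function name, and the values are hypothetical.

```python
def find_reversals(norm_table):
    """Locate reversals in a raw score to standard score table.

    norm_table: list of rows, one per age group in ascending age order;
    each row lists standard scores indexed by raw score, and all rows
    have the same length.
    Returns (vertical, horizontal) lists of (age_index, raw_score) pairs.
    """
    vertical, horizontal = [], []
    # Vertical: within an age group, a higher raw score earns a lower
    # standard score than the raw score just below it.
    for age, row in enumerate(norm_table):
        for raw in range(1, len(row)):
            if row[raw] < row[raw - 1]:
                vertical.append((age, raw))
    # Horizontal: the same raw score earns a higher standard score in an
    # older group than in the next younger group.
    for age in range(1, len(norm_table)):
        for raw, score in enumerate(norm_table[age]):
            if score > norm_table[age - 1][raw]:
                horizontal.append((age, raw))
    return vertical, horizontal


# Hypothetical unsmoothed norms: rows are age groups, columns raw scores 0-3.
table = [
    [85, 90, 95, 100],
    [80, 86, 84, 96],
    [78, 82, 88, 94],
]

vertical, horizontal = find_reversals(table)  # [(1, 2)], [(2, 2)]
```

In this table the second age group shows a vertical reversal at a raw score of 2, and the third age group shows a horizontal reversal at the same raw score; both cells would be candidates for adjustment during smoothing.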