Item selection. Adapting test difficulty creates challenges for the test developers. First, items must be selected according to the test specification plan. The specifications detail the content of the test, including the knowledge and skills to be assessed, and how many items should be included in each area assessed. For paper-and-pencil tests, test developers thoroughly inspect a test form to ensure that it satisfies the test specifications and avoids item cluing (i.e., one item provides information that can be used to answer another item).

To satisfy content requirements on adaptive tests, subject matter experts must code each item for its content. In addition, an "enemy list" must be developed that enumerates sets of items from which only one item can be selected. For example, suppose items 13, 72, and 547 are enemies because they have highly similar content or provide cluing; if any one of these items is selected, the others become ineligible for inclusion in the CAT. Stocking and Swanson's (1993) weighted deviation model and van der Linden's (2000) shadow test provide sophisticated methods for accomplishing these goals; a simpler approach introduced by Kingsbury and Zara (1991) is described in the following section.
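The enemy-list rule can be illustrated with a minimal Python sketch. It is not taken from any of the cited methods; the function name `eligible_items` and the small pool are hypothetical, and the enemy set reuses the item numbers from the example above.

```python
def eligible_items(pool, administered, enemy_sets):
    """Return the items in `pool` that may still be selected.

    An item is ineligible if it has already been administered or if it
    shares an enemy set with an administered item.
    """
    administered = set(administered)
    blocked = set(administered)
    for enemies in enemy_sets:
        # Once any member of an enemy set is used, block the whole set.
        if administered & set(enemies):
            blocked |= set(enemies)
    return [item for item in pool if item not in blocked]


# Hypothetical four-item pool; items 13, 72, and 547 are enemies.
pool = [13, 72, 100, 547]
enemy_sets = [{13, 72, 547}]
remaining = eligible_items(pool, administered=[72], enemy_sets=enemy_sets)
```

After item 72 is administered, only item 100 remains eligible in this toy pool.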

Scoring. With paper-and-pencil tests, the most common approach to scoring is the number of correct answers given by an examinee. The number correct score is usually transformed to a score scale that is used for reporting the results of the exam, but scaled scores are generally based on the number of correct answers. With a CAT, number correct scoring is not appropriate because two examinees may have answered the same number of items correctly, but the difficulty of the items may be dramatically different. Thus, a more sophisticated approach to scoring, based on item response theory (IRT), is necessary. Reise and Waller (2002) provide a lucid introduction to IRT; a more detailed treatment of IRT and computerized testing is given by van der Linden and Glas (2000).

IRT models the probability of a positive response as a function of an individual's standing on the latent trait. In cognitive ability testing, the most commonly used model is the three-parameter logistic model. Here, the probability of a positive response to item $i$, $P_i(\theta)$, is

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i(\theta - b_i)]},$$

where $\theta$ is an individual's standing on the trait assessed by the test, $D$ is a constant set equal to 1.702 for historical reasons, $a_i$ is a parameter that describes the extent to which the item discriminates between individuals with higher and lower $\theta$s, $b_i$ is a parameter that indexes item difficulty, and $c_i$ is called the guessing parameter because it reflects the chance that very low ability examinees will answer correctly. The trait $\theta$ is usually scaled to have a mean of 0 and a standard deviation of 1.

The two-parameter logistic model sets $c_i = 0$ for all items, so that

$$P_i(\theta) = \frac{1}{1 + \exp[-D a_i(\theta - b_i)]}.$$

This model is often used for personality assessment (Reise & Waller, 1990; see also Chernyshenko, Stark, Chan, Drasgow, & Williams, 2001). The one-parameter logistic or Rasch model makes the additional restriction that $a_i = 1$:

$$P_i(\theta) = \frac{1}{1 + \exp[-D(\theta - b_i)]}.$$

This model is widely used in Europe and in licensing and certification testing programs in the United States.

Figure 7.1 shows a plot of $P_i(\theta)$, which is usually called an item characteristic curve, for an item with $a_i = 1.1$, $b_i = 0.4$, and $c_i = 0.15$. Note that the curve is nearly flat at low and high $\theta$ levels. Consequently, the item provides little discrimination between individuals with, say, $\theta$ of -3.0 versus -2.0 or 2.0 versus 3.0. In the psychometric argot, the item provides little information in these ranges of $\theta$. Alternatively, note the difference in the probability of a positive response between individuals with $\theta$ = 0.0 versus 1.0: .42 versus .79. Here the lower $\theta$ individuals have clearly lower chances of responding positively than the higher $\theta$ individuals, and so the item provides substantial information in this range of $\theta$ values.
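As a minimal sketch (not part of the original text), the three logistic models described above can be written as a single Python function, with the two-parameter and Rasch forms obtained by fixing c = 0 and a = 1. The function name `icc` and its argument names are illustrative assumptions; the item parameters are those given for Figure 7.1.

```python
import math

D = 1.702  # scaling constant given in the text


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Item characteristic curve: probability of a positive response.

    a, b, c are the discrimination, difficulty, and guessing parameters.
    c=0 gives the two-parameter model; a=1 and c=0 give the Rasch form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Item from Figure 7.1: a = 1.1, b = 0.4, c = 0.15
p_at_0 = icc(0.0, a=1.1, b=0.4, c=0.15)
p_at_1 = icc(1.0, a=1.1, b=0.4, c=0.15)
```

With these parameters the function evaluates to roughly .42 at a trait level of 0.0 and .79 at 1.0, in line with the probabilities quoted for the figure.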

The item characteristic curves for Rasch model items are particularly convenient. With $a_i = 1$ and $c_i = 0$ for all items, the only item parameter that varies is item difficulty, $b_i$. The restricted form of Rasch model item characteristic curves leads to many desirable statistical properties. However, it is an empirical question whether $a_i = 1$ for all items; this condition should be carefully examined before applying the Rasch model.

Scoring an individual's responses in IRT refers to locating the value along the $\theta$ continuum that best represents the individual's standing on the latent trait. Maximum likelihood estimation can be used for this purpose; the principle of maximum likelihood estimation states that the estimate $\hat{\theta}$ of $\theta$ should be the value that makes the individual's responses appear most probable. If the responses are coded $u_i = 1$ for a correct or positive response and $u_i = 0$ for an incorrect or negative response, then the likelihood of a positive response is just $P_i(\theta)$, and the likelihood of a negative response is $[1 - P_i(\theta)]$. Mathematically, this can be expressed compactly as

$$P_i(\theta)^{u_i}[1 - P_i(\theta)]^{1 - u_i}.$$

Provided that the test or scale is unidimensional (i.e., all the items measure a single latent trait), the likelihood of all the responses is

$$L = \prod_{i=1}^{n} P_i(\theta)^{u_i}[1 - P_i(\theta)]^{1 - u_i},$$

and the maximum likelihood ability estimate $\hat{\theta}$ is the value along the $\theta$ continuum that maximizes $L$. This value is obtained by iterative numerical methods that would be very difficult to compute by hand but can be determined by the computer nearly instantly.
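The maximization can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used by any particular testing program: it replaces the iterative numerical methods mentioned above with a plain grid search over trait values, and the three items, their parameters, and the function names are hypothetical.

```python
import math

D = 1.702  # scaling constant given in the text


def icc(theta, a, b, c):
    """Three-parameter logistic probability of a positive response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log of the product of P^u * (1 - P)^(1 - u) over all items."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = icc(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def ml_estimate(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search stand-in for an iterative maximizer (assumed bounds)."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical items as (a, b, c); the examinee misses only the hardest one.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
theta_hat = ml_estimate(items, responses=[1, 1, 0])
```

Working with the log of the likelihood leaves the location of the maximum unchanged and avoids multiplying many small probabilities. The grid bounds also keep the estimate finite for an all-correct or all-incorrect response pattern, for which the likelihood has no interior maximum.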

After the estimate $\hat{\theta}$ is obtained, it is usually transformed to a score scale that is used to report scores to examinees. Let $X = \sum_{i=1}^{n} u_i$ denote the number right score on a conventional test; the process of transforming $\hat{\theta}$ to the reporting scale is analogous to the process of transforming $X$. The simplest transformation is linear (e.g., $k_1\hat{\theta} + k_2$ or $k_3 X + k_4$). More complicated transformations are sometimes used; see Kolen and Brennan (1995) for details.

Adapting item difficulty to a person's ability level. A critical element in adaptive testing is selecting items that are most informative about an individual's standing on the trait assessed by the test or scale. Going beyond the simple notion of branching to a more difficult item following a correct answer and an easier item following an incorrect answer, it is possible to determine the item in the item pool that is most informative about an examinee's ability. Statisticians have developed the notion of information to refer to the reduction of uncertainty about a parameter being estimated ($\theta$ in the present context). For IRT and psychological measurement, Lord (1980a) showed that the information an item provides at ability level $\theta$ is

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)[1 - P_i(\theta)]},$$

where $P_i'(\theta)$ is the slope (i.e., derivative) of the item characteristic curve at ability $\theta$.

$I_i(\theta)$ is called the item information function, and it shows the range of $\theta$ where an item is discriminating (that is, has a large value of $I_i(\theta)$) and where the item is not discriminating. Figure 7.2 shows the item information function for the item with $a_i = 1.1$, $b_i = 0.4$, and $c_i = 0.15$ described previously. Note that this item provides substantial information near its item difficulty (for $\theta$ values near 0.4), but little information for $\theta$ levels below -1.0 or above 2.0.
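The item information function can be computed directly once the slope of the item characteristic curve is written out. The sketch below assumes the three-parameter logistic model with D = 1.702, for which the slope is D·a·(P − c)(1 − P)/(1 − c); the function names are illustrative, and the item is the one plotted in Figure 7.2.

```python
import math

D = 1.702  # scaling constant given in the text


def icc(theta, a, b, c):
    """Three-parameter logistic probability of a positive response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Squared slope of the curve divided by P * (1 - P)."""
    p = icc(theta, a, b, c)
    # Derivative of the three-parameter logistic curve with respect to theta.
    slope = D * a * (p - c) * (1.0 - p) / (1.0 - c)
    return slope ** 2 / (p * (1.0 - p))


# Item from Figure 7.2: a = 1.1, b = 0.4, c = 0.15
info_near_b = item_information(0.4, 1.1, 0.4, 0.15)
info_low = item_information(-1.0, 1.1, 0.4, 0.15)
info_high = item_information(2.0, 1.1, 0.4, 0.15)
```

For this item the information near the difficulty value is several times larger than at trait levels of -1.0 or 2.0, which matches the shape described for the figure.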

Thus, a simple approach to determining which item to administer next, given that a respondent's current ability estimate is $\hat{\theta}$, is to compute $I_i(\hat{\theta})$ for all the items in the item pool and select the item with the largest item information. This approach, called maximum information item selection, must be modified in high-stakes testing programs because items must be selected to satisfy content specifications, avoid violating enemy lists, and satisfy item exposure controls. Kingsbury and Zara's (1991) method provides a straightforward means of accomplishing these goals. It involves forming subsets of items according to the content specifications, selecting the number of items from each subset as dictated by the content specifications, and using maximum information item selection to determine which items within a subset would be the best choices for administration. To avoid overexposure, Kingsbury and Zara suggested picking items at random from among the items in each subset with the greatest information.
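The selection step just described can be sketched as follows. This is a simplified illustration in the spirit of the content-subset approach, not a reproduction of Kingsbury and Zara's procedure: the pool, the content labels, the `top_k` cutoff, and the function names are all hypothetical, and enemy lists are left out for brevity.

```python
import math
import random

D = 1.702  # scaling constant given in the text


def item_information(theta, a, b, c):
    """Item information under the three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def select_next(pool, theta_hat, area, administered, top_k=3, rng=random):
    """Pick at random among the top_k most informative unused items in `area`."""
    candidates = [it for it in pool
                  if it["area"] == area and it["id"] not in administered]
    if not candidates:
        raise ValueError("no eligible items left in this content area")
    ranked = sorted(
        candidates,
        key=lambda it: item_information(theta_hat, it["a"], it["b"], it["c"]),
        reverse=True,
    )
    return rng.choice(ranked[:top_k])


# Hypothetical pool with two content areas.
pool = [
    {"id": 1, "area": "algebra", "a": 1.0, "b": -1.0, "c": 0.2},
    {"id": 2, "area": "algebra", "a": 1.2, "b": 0.0, "c": 0.2},
    {"id": 3, "area": "algebra", "a": 0.8, "b": 1.5, "c": 0.2},
    {"id": 4, "area": "geometry", "a": 1.5, "b": 0.1, "c": 0.2},
]
choice = select_next(pool, theta_hat=0.0, area="algebra", administered={1}, top_k=1)
```

Setting `top_k=1` reduces the rule to pure maximum information selection within the content area; larger values trade a little information for lower exposure of the best items.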

Precision of estimation. The total amount of information at ability $\theta$ given that $n$ items have been administered is defined as

$$I(\theta) = \sum_{i=1}^{n} I_i(\theta).$$

Lord (1980) showed that the conditional standard error of measurement is

$$SEM(\hat{\theta} \mid \theta) = \frac{1}{\sqrt{I(\theta)}}$$

for the maximum likelihood ability estimate $\hat{\theta}$ of $\theta$. Consequently, after an adaptive test is completed, $I(\hat{\theta}) = \sum_{i=1}^{n} I_i(\hat{\theta})$ can be estimated, and the standard error of $\hat{\theta}$ can be determined.
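A short numerical illustration of the two quantities above, assuming the three-parameter logistic model with D = 1.702. The function names and the items are hypothetical; the items are chosen so that the result can be checked by hand.

```python
import math

D = 1.702  # scaling constant given in the text


def item_information(theta, a, b, c):
    """Item information under the three-parameter logistic model."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * logistic
    slope = D * a * (1.0 - c) * logistic * (1.0 - logistic)
    return slope ** 2 / (p * (1.0 - p))


def total_information(theta, items):
    """Sum of the item information functions of the administered items."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


def standard_error(theta_hat, items):
    """Conditional standard error: one over the square root of total information."""
    return 1.0 / math.sqrt(total_information(theta_hat, items))


# Hypothetical items as (a, b, c), each with difficulty equal to theta_hat = 0.
one_item = [(1.0, 0.0, 0.0)]
four_items = one_item * 4
sem_1 = standard_error(0.0, one_item)
sem_4 = standard_error(0.0, four_items)
```

Because information adds across items, quadrupling the number of equally informative items halves the standard error, which is why an adaptive test that keeps choosing informative items can reach a target precision with fewer items.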

Test security. Computerized testing has allowed many exams to go to "walk-in testing," where examinees schedule tests at testing centers at times that are convenient to them. To minimize cheating, however, testing programs must take care to prevent overexposure of some items.

First, note that if a CAT begins by initializing $\hat{\theta} = 0$ (i.e., by assuming that an examinee is average before any items have been administered) and then selects the item with the greatest information at that ability level, all examinees will receive the same first item. Moreover, all examinees who answer correctly will be branched to the same second item. Such item selection algorithms are said to be deterministic because individuals with the same sequence of right and wrong answers will receive the same set of items.

There are two interrelated problems with deterministic item selection algorithms. First, many items in the item pool will never be administered. In a simulation of a CAT with an item pool of 260 items, Hulin, Drasgow, and Parsons (1983) found that 141 items were never chosen by the item selection algorithm just described. Thus, functionally, the Hulin et al. item pool consisted of just 119 items. It is widely believed that CATs with smaller item pools are more easily compromised than CATs with larger item pools, so using just 119 of the 260 items would be a source of concern for a high-stakes testing program.

The second problem with deterministic item selection arises when coaching schools or other conspiracies attempt to "crack the test" by having a series of individuals take the test and memorize items. The first person to take the test would memorize the first item and report it to the coaching school or post it on a Web site. The correct answer would then be quickly determined. The second conspirator to take the exam would see the same first item, answer it correctly, and then be branched upward to a second item, which the conspirator would memorize and report. The third conspirator would be able to answer the first two items correctly and be branched upward to a third item, which that person would memorize. As the Educational Testing Service learned with the Graduate Record Exam, it is possible for a relatively small number of conspirators to compromise a CAT.

To minimize the chance of cheating, it is critical to control how often each item is administered. Item exposure control algorithms (see Stocking & Lewis, 2000, for a review) use randomization for this purpose. For example, the exam may begin by randomly selecting one of the 20 items with the largest information at $\theta = 0$.

The Sympson-Hetter (1985) method has frequently been used to control item exposure. Here every item in the item pool has an exposure control parameter, which is a number between zero and one. If an item is tentatively selected for administration, a random uniform number between zero and one is drawn. If the random number is less than the exposure control parameter, the item is administered; otherwise, the item is rejected and another item is tentatively selected and the process repeated. The exposure control parameters are chosen so that no item is administered to more than a prespecified percentage (say, 15%) of examinees.
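The administration step of this method can be sketched as a simple accept-or-reject loop. The sketch covers only that step: it takes the exposure control parameters as given rather than deriving them, and the function name, the item labels, and the choice to return `None` when every candidate is rejected are assumptions made for illustration.

```python
import random


def first_item_passing_exposure_check(candidates, exposure, rng=random):
    """Walk through tentatively selected items in order of preference.

    An item is administered if a uniform draw on [0, 1) falls below its
    exposure control parameter; otherwise the next candidate is tried.
    Returns None if every candidate is rejected (an assumed convention).
    """
    for item_id in candidates:
        if rng.random() < exposure[item_id]:
            return item_id
    return None


# Hypothetical parameters: item "A" is the most informative but is capped.
rng = random.Random(0)
exposure = {"A": 0.25, "B": 1.0}
draws = [first_item_passing_exposure_check(["A", "B"], exposure, rng)
         for _ in range(10000)]
share_a = draws.count("A") / len(draws)
```

In this toy run item "A" is always tentatively selected first, so its administration rate settles near its exposure control parameter of 0.25, while item "B" absorbs the remaining examinees.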

Experience with high-stakes CATs clearly indicates that item pools must be quite large to maintain test security (Mills, 1999). If computer simulations show that a CAT has satisfactory psychometric properties with an item pool of 250 items, a high-stakes CAT may need a pool of perhaps 2,500 items to resist compromise.
