## P0xaocLfO17

In this equation, L stands for the likelihood function, and the latent trait distribution is denoted by f(θ). Warm (1989) used a prior distribution for f(θ) that is equal to the square root of the Fisher information function. Such a prior is called noninformative because it is a constant over the latent dimension and does not contain information about the latent distribution of person abilities. By means of Warm's method, an estimator for θ is obtained that has a finite value for persons who solve either none or every item. It also has a smaller bias than the MLE. The estimator is called the weighted likelihood estimator (WLE) or Warm's estimator.
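The finiteness of the WLE for extreme scores can be illustrated with a minimal sketch. It assumes a dichotomous Rasch model with known item difficulties and maximizes the log-likelihood plus the log of the square root of the test information over a simple grid. The function names, the grid bounds, and the five difficulty values are illustrative assumptions, not taken from the chapter.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def weighted_log_likelihood(theta, responses, difficulties):
    """Log-likelihood plus the log of the square root of the test information."""
    log_lik = 0.0
    info = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        log_lik += math.log(p) if x == 1 else math.log(1.0 - p)
        info += p * (1.0 - p)  # Fisher information of a Rasch item
    return log_lik + 0.5 * math.log(info)


def wle(responses, difficulties, lo=-8.0, hi=8.0, steps=3201):
    """Grid-search approximation of the weighted likelihood estimate."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: weighted_log_likelihood(t, responses, difficulties))


# Hypothetical item difficulties, chosen only for illustration.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
wle_all_correct = wle([1, 1, 1, 1, 1], difficulties)
wle_none_correct = wle([0, 0, 0, 0, 0], difficulties)
print(wle_all_correct, wle_none_correct)
```

For the all-correct and all-incorrect patterns, the plain likelihood keeps increasing toward the edge of the grid, whereas the weighted version peaks at an interior, finite value.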

Using an informative instead of a noninformative prior in combination with the likelihood function, an a posteriori distribution is obtained that contains all information about the θ_v parameters. The expectation of this distribution for a single person can be used as a point estimator of θ_v, called the expected a posteriori estimator (EAP; Bock & Aitkin, 1981). In contrast to the WLE, this estimator does not minimize the bias but the mean square error, $E[(\hat{\theta}_v - \theta_v)^2]$.
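As a rough sketch of how an EAP can be computed, the following code evaluates the posterior on an equally spaced grid and takes its mean. It assumes a Rasch model with known item difficulties and a normal prior; the function names, the grid, and the difficulty values are illustrative assumptions.

```python
import math


def posterior_weights(responses, difficulties, nodes, prior_mean=0.0, prior_sd=1.0):
    """Normalized posterior weights on a grid (normal prior times Rasch likelihood)."""
    weights = []
    for t in nodes:
        w = math.exp(-0.5 * ((t - prior_mean) / prior_sd) ** 2)  # prior, up to a constant
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(t - b)))
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def eap(responses, difficulties, n_nodes=121):
    """Posterior mean (EAP) and posterior standard deviation on the grid."""
    nodes = [-6.0 + 12.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    post = posterior_weights(responses, difficulties, nodes)
    mean = sum(t * w for t, w in zip(nodes, post))
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, post))
    return mean, math.sqrt(var)


# Hypothetical item difficulties, chosen only for illustration.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
eap_all_correct, sd_all_correct = eap([1, 1, 1, 1, 1], difficulties)
print(eap_all_correct, sd_all_correct)
```

Because the prior pulls the estimate toward its mean, the EAP for a perfect score stays moderate, which is the shrinkage that produces the small mean square error.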

Although the WLE and EAP are the best point estimators of each individual's latent trait value, they do have one major drawback: The sample distributions of both point estimators do not converge to the latent trait distribution. Although the mean of the latent distribution can consistently be estimated by both point estimators, the variance cannot. The variance of the WLE exceeds the latent variance because the measurement error is part of it. In contrast, the EAP distribution is shrunken compared to the latent distribution because the EAPs (which are the means of each individual's posterior distribution) have smaller variances than the "true values."

Alternatively, it is possible to draw a number of values at random from each individual's posterior distribution and to use these values for the estimation of the latent distribution. Because these values are plausible estimators for the individuals' latent trait value, they are called plausible values (Mislevy, Beaton, Kaplan, & Sheehan, 1992). The variance of these plausible values is a consistent estimate of the variance of the latent distribution. Although plausible values reproduce the latent variance, they are poor point estimators as compared to the WLE or EAP.
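The contrast between shrunken EAPs and plausible values can be illustrated with a small simulation. The sketch below assumes a Rasch model with 10 items, a standard normal latent distribution that is also used as the prior, and one plausible value per person drawn from a grid approximation of the posterior. All numbers and names are illustrative assumptions, not the PISA data.

```python
import math
import random
import statistics


def posterior_on_grid(responses, difficulties, nodes):
    """Normalized posterior weights: N(0, 1) prior times Rasch likelihood."""
    weights = []
    for t in nodes:
        w = math.exp(-0.5 * t * t)
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(t - b)))
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


random.seed(1)
difficulties = [-2.0 + 4.0 * i / 9 for i in range(10)]  # hypothetical items
nodes = [-5.0 + 10.0 * k / 100 for k in range(101)]
true_thetas = [random.gauss(0.0, 1.0) for _ in range(2000)]

eaps, plausible_values = [], []
for theta in true_thetas:
    # Simulate one response pattern for this person.
    responses = [
        1 if random.random() < 1.0 / (1.0 + math.exp(-(theta - b))) else 0
        for b in difficulties
    ]
    post = posterior_on_grid(responses, difficulties, nodes)
    eaps.append(sum(t * w for t, w in zip(nodes, post)))
    # One random draw from the posterior serves as a plausible value.
    plausible_values.append(random.choices(nodes, weights=post, k=1)[0])

var_eap = statistics.pvariance(eaps)
var_pv = statistics.pvariance(plausible_values)
print(var_eap, var_pv)
```

In this setup the variance of the EAPs falls clearly below the latent variance of 1, whereas the variance of the plausible values stays close to it.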

The advantage of the MML method is not only the consistent estimation of the latent distribution, but also the possibility of applying it under missing data conditions. In this case, the property of estimating the latent distribution consistently depends on the condition that the missing data are missing at random (MAR). MAR means that the missing data are a random sample of the observed data (Schäfer, 1997). If this assumption is true, then the estimation of the latent distribution only on the basis of the observed data will be consistent. The likelihood of the observed data contains all the necessary and sufficient information for this estimation. Missing data that are due to an incomplete test design usually fulfill the MAR condition so that consistent estimates are ensured.

A third advantage of the MML method is connected to its property of providing consistent estimates of the latent variance. The estimator of the latent variance can be used as a measure of the "true score" variance in the classical definition of reliability. In the definition of reliability as the ratio of true score variance and observed score variance, the true score variance is represented by the variance of the latent trait, and the observed score variance is given by the variance of the WLE estimates. Thus, reliability can be calculated as

$$\mathrm{Rel} = \frac{\mathrm{est.Var}(\theta)}{\mathrm{Var}(\hat{\theta}_{\mathrm{WLE}})}$$

where est.Var(θ) is the variance of the latent trait estimated by means of the MML method.
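The ratio just described can be written as a short function. In this minimal sketch, the WLE values and the latent variance are made-up numbers used only to show the calculation; the function name is likewise an assumption.

```python
import statistics


def mml_reliability(est_latent_variance, wle_estimates):
    """Ratio of the MML-estimated latent variance to the variance of the WLEs."""
    observed_variance = statistics.variance(wle_estimates)
    return est_latent_variance / observed_variance


# Hypothetical values, chosen only for illustration.
wle_estimates = [-1.8, -0.9, -0.4, 0.1, 0.3, 0.8, 1.5, 2.0]
est_latent_variance = 1.0
reliability = mml_reliability(est_latent_variance, wle_estimates)
print(round(reliability, 3))
```

Because the variance of the WLEs includes measurement error, it exceeds the latent variance and the ratio stays below 1.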

This measure of reliability has turned out to be much closer to classical measures. In particular, it is not distorted by the overestimation of error variance defined as the expected standard error of person parameter estimates (Andrich, 1988; Rost, 1996).

After the estimation of the model parameters and the calculation of a measure of accuracy, it is necessary to determine whether the considered model fits the data. This is a question of internal validity. In other words, a measure of a latent trait by means of multimethod data is said to be internally valid if the model used to estimate the trait fits the data. One possibility is to calculate the ratio of the likelihoods (LR) of the model being considered and a less-restrictive model (likelihood ratio test). Then the test statistic -2 ln(LR) is approximately chi-square distributed, and the degrees of freedom are equal to the difference between the numbers of parameters of the two models. If the likelihood ratio test is significant, then the assumptions of the more restrictive model do not hold.
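A likelihood ratio test of this kind can be sketched in a few lines. The code below computes -2 ln(LR) from two log-likelihoods and obtains the p value from a closed-form expression for the chi-square upper tail with integer degrees of freedom, so that no external library is needed. The log-likelihoods and parameter counts in the example are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)  # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # df = 1
    # Recurrence for the regularized upper incomplete gamma function.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_restricted, k_restricted, loglik_general, k_general):
    """Return the statistic -2 ln(LR), its degrees of freedom, and the p value."""
    statistic = -2.0 * (loglik_restricted - loglik_general)
    df = k_general - k_restricted
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical log-likelihoods and parameter counts for two nested models.
statistic, df, p_value = likelihood_ratio_test(-10250.0, 70, -10230.0, 76)
print(statistic, df, p_value)
```

A small p value would indicate that the restrictions of the more restrictive model do not hold for the data at hand.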

The likelihood ratio test requires—among other things—that the model under consideration be a submodel of the comparison model. If two or more nonnested models are to be compared, it is possible to use so-called information indices like Akaike's information criterion (AIC) or Bayes information criterion (BIC). These indices allow the comparison of models by combining the likelihood value and the number of model parameters. The rationale is that models that have more parameters can fit the data better than models with fewer parameters. Consequently, each model is penalized for its number of parameters. Additionally, the sample size is also considered in BIC estimates because the number of different response patterns usually increases with the number of subjects so that a model assuming many parameters is more likely when the sample size is large. Because this effect is considered by the BIC, this index is preferable to the AIC if the number of response patterns is high. Although it is possible to compare very different models by means of these two indices, a serious disadvantage is that it is not possible to make the difference between two values subject to a statistical significance test.
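For illustration, the two indices can be computed with their standard definitions, AIC = -2 ln L + 2k and BIC = -2 ln L + k ln N, where k is the number of parameters and N the sample size. In the sketch below, the log-likelihoods and parameter counts are hypothetical; only the sample size of 781 is taken from the analysis reported in this chapter.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's information criterion; smaller values indicate the preferred model."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_persons):
    """Bayes information criterion; the penalty grows with the log of the sample size."""
    return -2.0 * log_likelihood + n_params * math.log(n_persons)


# Hypothetical log-likelihoods and parameter counts for two nonnested models.
models = {"model_a": (-10250.0, 70), "model_b": (-10235.0, 85)}
n_persons = 781

for name, (log_lik, k) in models.items():
    print(name, aic(log_lik, k), round(bic(log_lik, k, n_persons), 1))
```

With these made-up numbers, both indices come out equal on the AIC but differ on the BIC, which penalizes the additional parameters more heavily once ln N exceeds 2.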

AN ANALYSIS OF THE GERMAN SCIENCE TEST IN OECD/PISA 2003 (FIELD TRIAL DATA)

In this section, the models presented earlier will be used to analyze the field trial data of the German science test of the OECD Program for International Student Assessment in 2003 (OECD/PISA 2003). OECD/PISA 2003 is the second of at least three cycles of an international large-scale study designed to assess and analyze the educational systems of more than 30 countries (of which almost all are members of the Organization for Economic Cooperation and Development, OECD). The target population consists of 15-year-old high school students whose skills and competencies are assessed in the domains of reading, mathematical, and scientific literacy in real-life settings.

In addition to the international part of the study, each participating country has the opportunity to examine special national research questions by administration of their own national test. In Germany, the scientific literacy component has been (and still is) the responsibility of the Leibniz Institute for Science Education. A national expert team consisting of biologists, chemists, physicists, educational researchers, and psychologists from this institute and several German universities constructed the national science test. Based on the experience with the national science test of PISA 2000, a complete two-facet design was created for the national science test of PISA 2003.

The first facet refers to the content areas of the test items, which can be assigned to the three science subjects of the German school system, biology, chemistry, and physics. The test covered four content areas from physics, four from biology, and two from chemistry, that is, a total of 10 different contents. The second facet refers to seven so-called cognitive components:

1. "Evaluating" comprises the ability to analyze a specific, complex, and problematic situation in which no simple solution exists but several possible options to act exist.

2. "Divergent thinking" is the ability to create a number of different but correct answers to a cognitive task for which there is not just one solution.

3. "Dealing with graphical representations" stands for the ability to solve a cognitive task by using the information that is provided by a graph, a diagram, or an illustration.

4. "Convergent thinking/reasoning" represents the ability to solve problems by means of an inferred or introduced rule.

5. "Using mental models": A mental model can be described as a spatial or geometrical concept that represents scientific facts and their relations. By means of this concept, the student should be able to predict and explain experimental results and empirical findings.

6. "Describing the phenomenon" is the ability of a person to correctly describe the pieces of information given in tables, diagrams, graphs, or illustrations.

7. "Dealing with numbers" stands for the ability to perform numerical calculations in the context of a scientific concept. For a correct calculation it is not only necessary to do the right arithmetical operations, but it is essential that the underlying scientific concept has been understood.

In the facet design of the 2003 German science test, each combination of a content and a cognitive component is represented by one item. In summary, the field trial test version was based on a two-facet design with 70 items addressing 10 content areas and 7 cognitive components. Because of limited testing time, not all 70 tasks could be administered to each of the 1,955 students assessed in the field trial. Therefore, sampling followed a multimatrix design in which 10 subgroups of students responded to the tasks of either one, three, or five content areas. As a result, the final data matrix included an average of 70% missing data that is due to the test design. In the analyses described below, only those cases were included that responded to at least 21 items (781 students).

For the purpose of illustrating the analysis of multimethod test data, we will consider the content areas as the "items" and the cognitive components as the "methods." In the case of the German PISA 2003 science field test, a biological content such as "predators and prey" is connected to each of the cognitive competencies, so students work on a task dealing with, for example, the different possibilities for increasing the lynx population ("predators and prey" in connection with "divergent thinking"). In the terminology of the multitrait-multimethod approach, the combination of content area, for instance, "predators and prey," and cognitive components, for instance, "divergent thinking," would be called "items," whereas the content area would be the "trait" and the cognitive component would be the "method."

But the chosen kind of analysis here is not the only possible way of looking at these data. For instance, the items of one cognitive component ("method") are rather different with respect to the particular pieces of knowledge included in the 10 tasks. There is more variation among the items of the same method than could be expected by simply taking into account the differences in the content areas. This implies, for example, that a model assuming constant item (content) difficulties under all methods has little chance to fit the data. However, this conclusion will be drawn from the results and need not be subject to speculation in advance. The analyses were conducted along the structure of the family of multimethod Rasch models described earlier. In particular, the LLTM, the Rasch model, and its generalizations to the multidimensional and mixed population case will be applied.

Table 18.5 gives an overview of the application of all models mentioned. The first four models have been calculated using the ConQuest program, and the mixed Rasch models were calculated by means of a not-yet-published program by Matthias von Davier. Note that the likelihood and the number of parameters increase from left to right in Table 18.5. For model selection, the Bayes information criterion (BIC) is preferable to Akaike's information criterion (AIC) because the German PISA science test consists of 70 items and was administered to 781 students, resulting in a large number of different