Generalizability Theory

The central issue in generalizability (G) theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) is the generalization from a sample of measurements to a universe of possible measurements. This universe is defined in terms of measurement conditions, of which the observed measurements are assumed to be a random sample. The question is how well measures taken under one set of conditions generalize to other conditions; in other words, how well the observed scores correspond to the average scores that would be obtained under all possible conditions.

In the classical true score model, an observed score consists of two components: a systematic component called the true score and a random error component. Reliability is then defined as the squared correlation between the observed and the true scores, which equals the proportion of observed-score variance that is true-score variance. In generalizability theory, the variance of the measurements is partitioned into several variance components rather than two. The generalizability coefficient based on this partition is defined analogously to the reliability coefficient: the universe-score variance, which plays the role of the true-score variance, divided by the expected observed-score variance (Shavelson & Webb, 1991).

The variance partition in generalizability theory requires a clear description of all relevant measurement conditions. These conditions are called facets. (The terminology is similar to that of facet theory, but there the facets refer mostly to question formats, whereas in generalizability theory they refer mostly to measurement conditions.) In the simplest case, there is only one facet. For instance, when students take a test consisting of 20 multiple-choice items at the end of a course, the examiner is not interested in the answers to these particular 20 items, but in the students' knowledge of the whole course content. From this perspective, the 20 items are a sample from all possible items, and the items are the facet of the measurement. When all students answer the same 20 items, the design is crossed: all students are measured under the same conditions (items). When each student answers a different set of items, the design is nested: the items are nested within students, so every student is measured under different conditions.
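To make the variance partition concrete for the crossed one-facet case (every student answers the same items), the sketch below estimates the person, item, and residual variance components from the usual two-way ANOVA mean squares and then forms a generalizability coefficient for relative decisions. The function names, the dictionary keys, and the small data set are hypothetical and chosen only for illustration. The sketch also assumes complete data, at least two students and two items, and the common convention of setting negative variance estimates to zero.

```python
def one_facet_g_study(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` is a list of rows: one row per person, one column per item.
    Assumes complete data with at least two persons and two items.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items, and the residual
    # (person-by-item interaction confounded with random error).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_res = sum(
        (scores[p][j] - person_means[p] - item_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the components.
    # Assumption: negative estimates are truncated to zero.
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    return {"person": var_p, "item": var_i, "residual": var_res}


def g_coefficient(components, n_items):
    """Universe-score variance over expected observed-score variance
    for a test of `n_items` items (relative decisions)."""
    rel_error = components["residual"] / n_items
    denom = components["person"] + rel_error
    return components["person"] / denom if denom > 0 else float("nan")


# Hypothetical data: 4 students (rows) answering the same 3 items (columns).
demo = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
demo_components = one_facet_g_study(demo)
print(demo_components, g_coefficient(demo_components, n_items=3))
```

In this crossed design the coefficient coincides with Cronbach's alpha as long as the person component is not truncated at zero. Passing a different `n_items` to `g_coefficient` shows how the coefficient would be expected to change if the test were lengthened or shortened.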

Assume that the 20 items of the foregoing example are not multiple-choice items, but behavioral observations. When these observations are coded by trained judges, the design becomes a two-facet design. The observations are the first facet and the judges the second facet. In this case we must generalize over both observations and judges to obtain an estimate of the true score we are interested in.
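The two-facet case can be written out as a variance decomposition. The following is a sketch under assumptions not stated in the text: the design is fully crossed (every judge codes every observation of every person), both facets are random, and the symbols p (persons), o (observations), j (judges), n_o, and n_j are notation introduced here for illustration.

```latex
% Decomposition of the variance of a single coded observation X_{poj}
\sigma^2(X_{poj}) = \sigma^2_p + \sigma^2_o + \sigma^2_j
                  + \sigma^2_{po} + \sigma^2_{pj} + \sigma^2_{oj}
                  + \sigma^2_{poj,e}

% Generalizability coefficient for relative decisions, based on the mean
% over n_o observations and n_j judges
E\rho^2 = \frac{\sigma^2_p}
               {\sigma^2_p + \dfrac{\sigma^2_{po}}{n_o}
                           + \dfrac{\sigma^2_{pj}}{n_j}
                           + \dfrac{\sigma^2_{poj,e}}{n_o\, n_j}}
```

Only the components that involve persons enter the error term, because the main effects of observations and judges shift all persons equally and do not affect their relative standing. Increasing either the number of observations or the number of judges reduces the corresponding error terms, which is what generalizing over both facets amounts to in practice.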