When data are continuous and there are one or more underlying factors, confirmatory factor analysis procedures may be used to test measurement invariance. Meredith (1993) considered the issue of measurement invariance across groups and developed a sequence of model comparisons that closely parallels the IRT approach. Widaman and Reise (1997) presented a clear description of these procedures, and Meredith and Horn (2001) have recently extended this approach to testing measurement invariance over time. In brief, a hierarchical set of models with increasingly strict constraints is compared. First, a baseline model is estimated in which the factor loadings of each measured variable on an underlying construct may differ over time. For example, consider the model of conscientiousness (Figure 21.1) discussed in the prior section on "examining stability: autoregressive models." Suppose we had allowed the factor loadings to vary over time (Model 1) and this model fit the data. Such a model, known as a configural model, would suggest that similar constructs were measured at each measurement wave. In contrast, imagine that although the single factor of conscientiousness fit the data adequately at Wave 1, over the course of a longer-term study the conscientiousness factor split into two separate factors: one representing orderliness and reliability and a second representing decisiveness and industriousness. Such a result would indicate that the fundamental nature of the conscientiousness factor had changed over time (failure of configural invariance), making any interpretation of stability or change in conscientiousness difficult.

When the configurai model fits the data (as in our earlier example), we can investigate questions related to the rank-order stability of the general construct. Note, however, that the conscientiousness latent construct (factor) at each measurement wave would not necessarily be characterized by a scale with the same units. To establish that the units are identical over time, we need to show that the factor loadings are equal across time. As we saw in the model represented in Figure 21.1, the imposition of equal factor loadings did not significantly affect the fit of the model in our example. Thus, our study of stability was improved by our ability to correlate constructs measured using the same units at each measurement wave.

Finally, suppose that we wish to establish that the scale of the construct has both the same units and the same origin over time (i.e., interval level of measurement). Recall that this condition must be met for proper growth modeling. To illustrate differences in the origin, consider that the Celsius and Kelvin temperature scales have identical units (a 1-degree difference is identical on both scales). However, the origin (0 degrees) of the Celsius scale is the freezing point of water, whereas the origin of the Kelvin scale is absolute zero (where molecular motion stops). To establish that the origins are identical, we need to consider the level of each measured variable (mean structure) in addition to the covariance structure. If the origin of the scale does not change over time, then the intercept (the predicted value on each measured variable when the level of the underlying construct η = 0) also must not change over time. This condition is established if the fit of a model in which the intercepts for each measured variable are allowed to vary over time does not significantly differ from that of a more restricted model in which each variable's intercept is constrained to be equal over time.3 If this condition can be met, then the level of measurement invariance over time necessary for proper growth curve modeling has been established. Widaman and Reise (1997) discussed still more restrictive forms of measurement invariance that can be useful in some specialized applications. Muthén (1996), Mehta et al. (2004), and Millsap and Tein (in press) present extensions of the confirmatory factor analysis approach that can be used to establish measurement invariance for multidimensional constructs measured by dichotomous or ordered categorical measures.
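The nested-model comparisons described here rest on the chi-square difference (likelihood-ratio) test. A minimal sketch in Python illustrates the arithmetic, using hypothetical fit statistics (the values below are invented for illustration, not taken from the example study):

```python
from scipy.stats import chi2

# Hypothetical fit statistics for two nested models
chisq_free = 128.4   # intercepts free to vary over time
df_free = 80
chisq_equal = 134.9  # intercepts constrained equal over time
df_equal = 86

# Chi-square difference test: the constrained model's chi-square
# minus the free model's, on the difference in degrees of freedom
delta_chisq = chisq_equal - chisq_free
delta_df = df_equal - df_free
p_value = chi2.sf(delta_chisq, delta_df)

print(f"delta chi2({delta_df}) = {delta_chisq:.1f}, p = {p_value:.3f}")
# A nonsignificant difference supports equality of intercepts,
# i.e., the origin of the scale does not change over time.
```

A nonsignificant `p_value` would support retaining the more restricted (equal-intercept) model, which is the condition described above.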

VERTICAL EQUATING: ADDRESSING AGE-RELATED CHANGES IN ITEM CONTENT

The items required to measure a latent construct can change as participants age. In educational research children are expected to acquire knowledge and learn appropriate skills. For example, in a test of mathematical proficiency, items related to multiplication may be needed in third grade, whereas items related to fractions may be needed in sixth grade. The test forms for each grade level must be equated onto a single common metric to measure educational progress. Vertical equating must be achieved externally prior to any longitudinal modeling of the data.

Vertical equating uses Rasch or two-parameter IRT models to calibrate tests onto a single common "long" interval scale. This "long" scale covers the full range of proficiency as assessed using easier tests in the lower grade levels and more difficult tests in the higher grade levels. The equating of test forms is made possible by embedding common item sets in the test forms; these common item sets serve as "anchor" or "link" items for the equating. Any change in the probability of answering an item correctly should occur only if there is a change in the individual's level on the underlying construct; otherwise, the item is showing DIF as a function of grade level. For example, an item that assesses problem-solving skills at Grade 2 but only routine skills at Grade 4 may very likely show DIF. Even though the wording of the item is identical, this item functions differently across the two grades and will not make a good link item. Thus, for unidimensional constructs, vertical equating combines testing for DIF with establishing measurement invariance of link items and linking scales (see Embretson & Reise, 2000). Applications of these equating procedures permit the development of computerized adaptive tests (see Drasgow & Chuah, chap. 7, this volume) that select the set of items that most precisely assess each participant's level on the underlying latent construct.6 Unfortunately, vertical equating of multidimensional constructs is difficult to achieve because the rate or form of growth may vary across dimensions, so that common item set(s) that adequately represent each of the dimensions cannot always be constructed.
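The anchor-item logic can be sketched numerically. The sketch below assumes (hypothetically) that Rasch item difficulties have been estimated separately within each grade-level form, and uses simple mean/mean linking of the common items to place the upper-grade form on the lower-grade scale; the difficulty values are invented for illustration:

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model: probability of a correct response given
    ability theta and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical difficulty estimates for the anchor (link) items,
# calibrated separately within each grade-level form
anchors_grade3 = np.array([-0.50, 0.10, 0.80])   # Grade 3 metric
anchors_grade6 = np.array([-1.30, -0.70, 0.00])  # Grade 6 metric

# Mean/mean linking: the constant that shifts the Grade 6
# calibration onto the Grade 3 ("long") scale
shift = anchors_grade3.mean() - anchors_grade6.mean()

# Apply the shift to all Grade 6 item difficulties
# (only the anchors are shown here)
grade6_on_common_scale = anchors_grade6 + shift
```

Because the Rasch model is an equal-discrimination model, a single additive shift suffices; two-parameter models require linking both a slope and an intercept. An anchor item showing DIF would disrupt this linking, which is why DIF screening precedes equating.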

5The full confirmatory factor analysis model including mean structure can be expressed as Y = ν + Λη + ε. Y is the p × 1 vector of observed scores, ν is the p × 1 vector of intercepts, η is the m × 1 vector of latent variables, Λ is the p × m matrix of the loadings of the observed scores on the latent variables η, and ε is the p × 1 vector of residuals. For modeling longitudinal measurement, a model in which both Λ and ν are constrained to be equal over time must fit the data.
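Under this model, the model-implied mean vector is μ = ν + Λα and the model-implied covariance matrix is Σ = ΛΨΛᵀ + Θ, where α and Ψ are the latent means and covariances and Θ is the residual covariance matrix. A small numerical sketch of these standard identities (all parameter values hypothetical):

```python
import numpy as np

# Hypothetical one-factor model: p = 3 indicators, m = 1 factor
Lambda = np.array([[1.0], [0.8], [1.2]])  # loadings, p x m
nu = np.array([2.0, 1.5, 2.5])            # intercepts, length p
alpha = np.array([0.3])                   # latent mean, length m
Psi = np.array([[0.9]])                   # latent variance, m x m
Theta = np.diag([0.4, 0.5, 0.3])          # residual variances, p x p

# Model-implied moments: mu = nu + Lambda @ alpha,
# Sigma = Lambda @ Psi @ Lambda.T + Theta
mu = nu + Lambda @ alpha
Sigma = Lambda @ Psi @ Lambda.T + Theta
```

Constraining Λ and ν to equality over time forces the waves to share these implied moments up to differences in the latent parameters α and Ψ, which is what puts change on a common metric.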

In contrast to research on measures of educational progress and abilities, far less attention has been given to equating psychological constructs such as traits and attitudes across age. Typically, the same instrument is used at each measurement wave to assess individuals on a construct of interest. This practice is often appropriate when the time span of the study is relatively short and the study does not cross different periods of development. If the reading level and the response format are appropriate for the participants over the duration of the study, serious age-related problems with the instrument are unlikely to occur. However, when a measure crosses developmental periods, for example, in a study that follows subjects from adolescence to young adulthood, the instrument may not capture the same construct adequately as subjects mature. Some items may need to be phased out over time while other items are phased in. The result is a set of instruments that are not identical but that have overlapping items for different developmental periods. For example, the Achenbach Youth Self-Report externalizing scale (YSRE) was developed for youth up to age 18 (Achenbach & Edelbrock, 1987), and the Young Adult Self-Report externalizing scale (YASRE) was developed for young adults over age 18. Each measure has approximately 30 items, yet only 19 of these items are common across the two forms. If participants were administered the YSRE and the YASRE during a longitudinal study that crossed these developmental periods, the two forms would need to be equated onto a common scale if growth is to be studied. Such vertical equating of psychological measures is rare.

Many of the standard measures used in psychology were designed for cross-sectional studies to examine differences between individuals; they were not developed for the study of change within an individual across time. As an illustration, many traditional instruments used for research in developmental psychology are normed separately for different ages. Norm-referenced metrics do not constitute an interval scale and are often not suitable for capturing change. One example of a norm-referenced metric is the grade-equivalent scale (e.g., reading at a fifth-grade level) used in measuring reading achievement. Seltzer, Frank, and Bryk (1994) compared growth models of reading achievement using the grade-equivalent metric and using interval-level scores based on Rasch calibration. They found that the results were highly sensitive to the metric used.

Theoretically, structural equation modeling approaches could also be used for vertical equating. However, McArdle, Grimm, Hamagami, and Ferrer-Caja (2002) noted that such efforts to date with continuous measures have typically involved untestable assumptions and have often led to estimation difficulties. At the same time, studies to date have not carefully established common pools of items (or subscales) that could be used to link the different forms of the instrument. Mehta et al. (2004) addressed vertical equating of ordinal items.
