As has been shown, the most restrictive models that fit the data are the quasi-independence model with a constrained δ parameter and the symmetry model.
Because the two models are not nested, the likelihood ratio difference test cannot be conducted (see Figure 17.1). To decide which model fits best, information criteria have to be considered. Information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are based on the χ² value, and they add a penalty for the number of parameters of a model in order to identify the most parsimonious model (for more details, see Akaike, 1987; Bozdogan, 1987; Sclove, 1987). The model with the smallest information criterion is the most parsimonious one. The symmetry model has the smaller value on both criteria; it is therefore the most parsimonious well-fitting model and should be considered the model of choice (see Table 17.8).
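The comparison logic can be sketched in a few lines. The sketch below assumes the general textbook definitions AIC = -2 log L + 2k and BIC = -2 log L + k ln(n); the chapter does not state in this section which variant underlies its table, so the numbers here will not reproduce it. The model names, log-likelihoods, parameter counts, and sample size are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion in the form -2 log L + 2 k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion in the form -2 log L + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical inputs (NOT the chapter's data): two non-nested models
# with the same number of parameters, fitted to the same n_obs ratings.
candidates = {
    "model_a": {"log_likelihood": -205.0, "n_params": 6},
    "model_b": {"log_likelihood": -203.0, "n_params": 6},
}
n_obs = 100

criteria = {
    name: {
        "aic": aic(m["log_likelihood"], m["n_params"]),
        "bic": bic(m["log_likelihood"], m["n_params"], n_obs),
    }
    for name, m in candidates.items()
}

# The model with the smallest criterion value is preferred.
best_by_aic = min(criteria, key=lambda name: criteria[name]["aic"])
best_by_bic = min(criteria, key=lambda name: criteria[name]["bic"])

for name, values in criteria.items():
    print(f"{name}: AIC={values['aic']:.2f}  BIC={values['bic']:.2f}")
print("preferred by AIC:", best_by_aic)
print("preferred by BIC:", best_by_bic)
```

When two models have the same number of parameters, as in this hypothetical case, both criteria simply rank the models by their fit; the penalty only matters when the parameter counts differ.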
Besides these statistical considerations, theoretical considerations may also influence the choice of a model. Compared with the quasi-independence models, the quasi-symmetry and symmetry models offer the benefit that observer differences and category distinguishability can be examined in detail (Darroch & McCloud, 1986) because both agreement and disagreement have to be modeled. If the quasi-symmetry model holds, we can presume that raters produce the same amount of under- or overrepresentation for given combinations of categories and are thus interchangeable to a certain degree. Moreover, if the symmetry model holds, both raters are completely interchangeable (Agresti, 1992). A symmetry model that fits better than the quasi-symmetry model indicates a stronger association between ratings and greater interchangeability of raters.

Comparison of the Information Criteria for the Quasi-Independence and Symmetry Model for the Data Presented in Table 17.3

Model                 χ²     df   p     Information criteria
Quasi-independenceᵃ   5.96   3    .11   420.69   405.53
Symmetry              1.49   3    .69   415.47   400.32

Note. ᵃQuasi-independence model with constrained δ parameters on the main diagonal.
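To make the symmetry model concrete, the following minimal sketch fits it to a square two-rater table. It relies on the standard closed-form result that the fitted count for cell (i, j) under symmetry is the mean of the observed counts in cells (i, j) and (j, i), with I(I - 1)/2 degrees of freedom for an I × I table. The function name `fit_symmetry` and the 3 × 3 counts are hypothetical and are not the data of Table 17.3.

```python
import math


def fit_symmetry(table):
    """Fit the symmetry model to a square table of counts.

    Returns the fitted counts, the likelihood ratio statistic G^2,
    the Pearson chi-square statistic, and the degrees of freedom.
    """
    k = len(table)
    # Under symmetry, cells (i, j) and (j, i) share one expected count.
    fitted = [
        [(table[i][j] + table[j][i]) / 2.0 for j in range(k)]
        for i in range(k)
    ]
    g2 = 0.0
    x2 = 0.0
    for i in range(k):
        for j in range(k):
            observed, expected = table[i][j], fitted[i][j]
            if expected > 0:
                x2 += (observed - expected) ** 2 / expected
                if observed > 0:
                    g2 += 2.0 * observed * math.log(observed / expected)
    df = k * (k - 1) // 2
    return fitted, g2, x2, df


# Hypothetical ratings: rows = rater A, columns = rater B.
ratings = [
    [20, 5, 3],
    [7, 15, 4],
    [1, 6, 25],
]

fitted, g2, x2, df = fit_symmetry(ratings)
print(f"G2 = {g2:.3f}, X2 = {x2:.3f}, df = {df}")
```

Small values of the fit statistics relative to the degrees of freedom mean that the off-diagonal cells are close to mirror images of each other, which is the sense in which the two raters are interchangeable.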
General Discussion of Association Methods for Categorical Data
As has been shown, there is no single best way to measure agreement and disagreement by general agreement indices. However, some basic comparisons of association methods and models can be made. In general, associations can be detected by the χ² test, and rater agreement, as a special case of association, may be detected by coefficient κ. Model-based analysis of associations yields additional and more precise information than that provided by general association methods. In contrast to coefficient κ, model-based analysis can also examine agreement when the raters use different numbers of categories. Loglinear models allow testing of the goodness of fit. They provide fitted cell probabilities and enable researchers to predict classifications under certain conditions, such as the response of one observer given the responses of other observers, a response given the true status of an observation, or the true status of an observation given ratings by several observers (Agresti, 1990, 1992; Bishop et al., 1975; Goodman, 1978; Haberman, 1978, 1979; Hagenaars, 1990). Thus, first analyses of rater agreement, as a special variant of convergence between multiple methods, can be conducted by overall agreement indices, but more detailed information is only available by use of loglinear models.
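The two general indices mentioned above can be computed directly from a two-rater contingency table. The sketch below uses the usual definitions: the Pearson χ² statistic for independence, and Cohen's κ = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the proportion expected from the marginals. The function names and the counts are hypothetical and serve only as an illustration.

```python
def pearson_chi2_independence(table):
    """Pearson chi-square statistic and df for the independence model."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    df = (n_rows - 1) * (n_cols - 1)
    return chi2, df


def cohen_kappa(table):
    """Cohen's kappa for a square two-rater table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of cases on the main diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the marginal proportions per category.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical ratings: rows = rater A, columns = rater B.
ratings = [
    [20, 5, 3],
    [7, 15, 4],
    [1, 6, 25],
]

chi2, df = pearson_chi2_independence(ratings)
kappa = cohen_kappa(ratings)
print(f"chi2 = {chi2:.2f} (df = {df}), kappa = {kappa:.3f}")
```

Note that `cohen_kappa` requires a square table, whereas the χ² statistic does not; this mirrors the limitation of κ when raters use different numbers of categories.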
Extensions and Special Variants of Methods for Rater Agreement
In this chapter, emphasis was placed on the analysis of rater agreement of two observers who rate each subject once. Extensions of this design have already been developed. Conger (1980) presented a generalization of coefficient kappa to assess agreement of multiple raters. Tanner and Young (1985) proposed loglinear models that determine interrater agreement when there are more than two observers; moreover, they developed a method to analyze agreement between several observers and a standard even if the nonstandard raters examine different subsamples of a larger sample. The model presented by Hui and Zhou (1998) examines the sensitivity and specificity of ratings when there is no "gold" standard.
Hagenaars (1993) introduced latent variables to the framework of loglinear modeling. Moreover, he showed that latent class analysis (LCA; Clogg, 1995; Lazarsfeld & Henry, 1968) is a special variant of loglinear modeling with latent variables. LCA has also been used to examine rater agreement (see Rost & Walter, chap. 18, this volume). For example, Dillon and Mullani (1984) developed a probabilistic latent class model for assessing interjudge reliability, and Agresti and Lang (1993) proposed symmetric latent class models to analyze rater agreement when there is more than one variable.
Much work has focused on solving the problems that arise during the analysis of contingency tables by use of loglinear models (for an overview, see Clogg & Eliason, 1987). Some solutions for specific problems will be mentioned here. For example, many studies involve varying panels of diagnosticians and a varying number of ratings per case; Uebersax and Grove (1990) provided a model to estimate diagnostic accuracy even under these conditions. Moreover, Becker and Agresti (1992) introduced the use of a jackknife procedure to solve sparse table problems that may arise when many observers are involved who each rate only a few subjects. Hui and Zhou (1998) provided a model to estimate sensitivity and specificity of ratings when no gold standard is available. Rindskopf (1990) gave some valuable hints on how to deal with structurally missing data. His approach can even be used to detect homogeneous subgroups in multidimensional tables. Clogg (1982) as well as Becker and Clogg (1989) also developed mixture models to detect subgroups for which different models of agreement apply. Their extensions are based on the partitioning of χ² values.
Although there are various extensions of the loglinear modeling approach, much work is still necessary. For instance, we need additional information about the minimum sample size requirements for any given number of raters, number of categories, and number of observations to gain valid results. Moreover, the influence of chance on coefficient κ when the joint distribution cannot be assumed to follow independence is worth investigating. Because marginal distributions are not always independent of each other, individual decision-making processes have to be examined to detect the cases in which raters are guessing, in which they feel rather sure, or in which they feel absolutely sure. Only if we know more about the decision-making process can agreement by chance be solidly determined and rater agreement be accurately identified. These problems affect not only coefficient κ but loglinear models as well. As demonstrated by Schuster (2001), coefficient κ can be incorporated in the symmetry model, yielding a new (equivalent) model equation. Hence, the process of decision making should be examined more deeply, and the findings should be incorporated into models of rater agreement.