In general, associations between variables or methods can be detected by the χ² test. Normally, this test is conducted on the basis of the null hypothesis that all variables are independent. The χ² value indicates whether the observed data differ significantly from the expected cell frequencies. Information about the strength of association can be obtained from the corrected contingency coefficient and Cramér's V.
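
As a minimal sketch of these quantities, the following Python function computes the χ² statistic, its degrees of freedom, Cramér's V, and a corrected contingency coefficient from a table of observed counts. The function name and the example counts are hypothetical, and the correction used here (dividing Pearson's C by its maximum attainable value for the table size) is assumed to be the one intended.

```python
from math import sqrt

def chi_square_association(table):
    """Return (chi2, dof, cramers_v, corrected_c) for a table of observed counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Expected frequency under independence: row total * column total / n.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    k = min(len(row_totals), len(col_totals))

    cramers_v = sqrt(chi2 / (n * (k - 1)))
    c = sqrt(chi2 / (chi2 + n))      # Pearson's contingency coefficient
    c_max = sqrt((k - 1) / k)        # largest value C can reach for this table size
    return chi2, dof, cramers_v, c / c_max

# Hypothetical 2 x 2 table of observed counts (illustration only).
counts = [[30, 10],
          [10, 30]]
chi2, dof, v, c_corr = chi_square_association(counts)
print(chi2, dof, v, c_corr)
```

For these made-up counts, χ² is 20 with 1 degree of freedom, which exceeds the critical value of 3.84 at the .05 level, and Cramér's V is 0.5.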

The special case of rater agreement, on the other hand, can be estimated by several methods. As pointed out earlier, many of them suffer from specific problems. The most promising approach seems to be the κ coefficient, a chance-corrected version of proportion agreement. Suen, Ary, and Ary (1986) demonstrated the mathematical relationship between κ and proportion agreement and also provided procedures for converting one index into the other. Unfortunately, most journal articles do not provide sufficient information for taking advantage of these direct comparisons (Suen & Ary, 1989). In the early psychological literature, and increasingly less often since, research reports presented percentage agreement values with no information about how much they were inflated by chance. To overcome this unsatisfactory situation, Berk (1979) suggested that researchers should also report the original statistics (cell frequencies and marginals).
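
The relationship between proportion agreement and κ can be sketched in a few lines of Python. The example assumes the usual definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion agreement and p_e the agreement expected by chance from the marginals. The function name and the cell frequencies are hypothetical.

```python
def proportion_agreement_and_kappa(table):
    """Return (p_o, p_e, kappa) for a square table of rater-by-rater counts."""
    n = sum(sum(row) for row in table)

    # Observed agreement: share of cases on the main diagonal.
    p_o = sum(table[i][i] for i in range(len(table))) / n

    # Chance agreement: products of matching marginal proportions.
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_p, col_p))

    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, p_e, kappa

# Hypothetical cell frequencies: rows are rater 1, columns are rater 2.
ratings = [[40, 10],
           [5, 45]]
p_o, p_e, kappa = proportion_agreement_and_kappa(ratings)
print(p_o, p_e, kappa)
```

With these counts the raters agree on 85% of the cases, half of which would be expected by chance alone, so κ is 0.7. Reporting the cell frequencies, as Berk (1979) recommended, is what makes this recalculation possible for a reader.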

Many authors consider κ the preferable agreement index because it corrects for chance agreement, is related to percentage (proportion) agreement, and is comparable between studies (see Suen & Ary, 1989), whereas others state that it is not comparable between studies (Cicchetti & Feinstein, 1990; Feinstein & Cicchetti, 1990; Thompson & Walter, 1988a, 1988b; Uebersax, 1987). Indeed, κ can be used to test whether ratings agree to a greater extent than expected by chance. Yet there is still concern about using κ as a measure of agreement because its chance correction rests on the assumption of independent ratings, an assumption that is made implicitly but is by no means justified. Uebersax (1987) demonstrated how differences in the accuracy with which positive and negative cases can be detected (i.e., differences in the mathematical characteristics of the particular decision-making process) affect the value of κ. Moreover, this problem is aggravated when base rates differ. In general, if the sample consists of cases that belong to an easily identifiable category, a higher κ is obtained than in a sample of less easily identifiable cases, although the diagnostic accuracy remains the same. Diagnosability curves, which represent the degree to which diagnosticians are able to judge subjects accurately with respect to the subjects' true status, may actually differ so much that κ values obtained for the same symptom (criterion) with similar base rates cannot be compared across studies. Unless there is an explicit model of rater decision making, it remains unclear how chance affects the decisions of actual raters and how one might correct for it (Uebersax, 1987).
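
The dependence of κ on the base rate can be illustrated with a deliberately simple model. This is a sketch under stated assumptions, not Uebersax's (1987) model: two raters share the same sensitivity and specificity, and their ratings are independent given the subject's true status. The function name and parameter values are hypothetical.

```python
def kappa_for_two_raters(base_rate, sensitivity, specificity):
    """Expected kappa for two equally accurate, conditionally independent raters."""
    pi = base_rate

    # Probability that both raters say "positive" or both say "negative",
    # averaged over truly positive and truly negative subjects.
    both_pos = pi * sensitivity ** 2 + (1 - pi) * (1 - specificity) ** 2
    both_neg = pi * (1 - sensitivity) ** 2 + (1 - pi) * specificity ** 2
    p_o = both_pos + both_neg

    # Each rater's marginal probability of a positive rating.
    p_pos = pi * sensitivity + (1 - pi) * (1 - specificity)
    p_e = p_pos ** 2 + (1 - p_pos) ** 2

    return (p_o - p_e) / (1 - p_e)

# Same accuracy (0.9 / 0.9), two different base rates.
kappa_balanced = kappa_for_two_raters(0.5, 0.9, 0.9)
kappa_rare = kappa_for_two_raters(0.1, 0.9, 0.9)
print(kappa_balanced, kappa_rare)
```

Under these assumptions κ falls from 0.64 at a base rate of .50 to about 0.39 at a base rate of .10, although the raters' accuracy and the observed proportion agreement (.82) are unchanged.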

All agreement indices were introduced for the simplest case: two variables with two categories each, yielding a contingency table of four (2 × 2) cells. If each variable has more than two categories, the association indices presented here can be applied in a straightforward manner. However, when the number of observers increases, the application of the general agreement indices becomes more complicated. In this case, κ should be determined for each rater pair, and the median value should be taken as the overall value (Conger, 1980; Fleiss, 1971). Fleiss (1971) developed modifications of κ to determine rater agreement when objects are rated by the same number of nonidentical raters, to compute agreement with regard to a particular object, and to estimate agreement within a particular category.
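
The pairwise procedure described above can be sketched as follows. The helper names and the ratings are hypothetical; the code computes the unweighted κ for every pair of raters and takes the median as the overall value. It does not implement the modifications developed by Fleiss (1971).

```python
from itertools import combinations
from statistics import median

def cohen_kappa(a, b):
    """Unweighted kappa for two equally long sequences of category labels."""
    n = len(a)
    categories = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def median_pairwise_kappa(ratings_by_rater):
    """Median of kappa over all rater pairs."""
    kappas = [cohen_kappa(a, b) for a, b in combinations(ratings_by_rater, 2)]
    return median(kappas)

# Hypothetical ratings of ten objects by three raters (1 = present, 0 = absent).
ratings = [
    [1, 1, 0, 0, 1, 0, 1, 0, 1, 0],  # rater A
    [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],  # rater B
    [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],  # rater C
]
overall = median_pairwise_kappa(ratings)
print(overall)
```

In this made-up example the three pairwise values are roughly 0.6, 0.6, and 0.2, so the median is 0.6. The median is less sensitive than the mean to a single poorly agreeing pair.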

Coefficient κ can also be computed if some categories have more in common than others. Assume that two child psychologists want to categorize a child's behavior as "very active," "easy to distract," "impulsive," "aggressive," or "restless," which are all indicators of hyperactivity, or as "playful." The overlap among the first five categories can be taken into account by using weighted kappa (Cohen, 1968). Weighted kappa allows for differential weights for individual observed cells and individual marginals. A disagreement in which one rater chooses "very active" and the other "impulsive" can be regarded as less serious than one in which the raters choose "playful" and "impulsive." Thus, the latter combination must be given a larger weight than the former. Coefficient weighted kappa can be computed by
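
The formula is not reproduced at this point. One common way of stating Cohen's (1968) weighted kappa, offered here as an assumption about the intended form, uses disagreement weights w_ij (with w_ii = 0) applied to the observed cell proportions p_o,ij and to the chance-expected proportions p_e,ij derived from the marginals: κ_w = 1 − (Σ w_ij · p_o,ij) / (Σ w_ij · p_e,ij). The Python sketch below follows that form. The function name, the three categories, the counts, and the weights are hypothetical.

```python
def weighted_kappa(table, weights):
    """Weighted kappa with disagreement weights (0 on the diagonal)."""
    n = sum(sum(row) for row in table)
    size = len(table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]

    # Weighted observed and chance-expected disagreement.
    observed = sum(weights[i][j] * table[i][j] / n
                   for i in range(size) for j in range(size))
    expected = sum(weights[i][j] * row_p[i] * col_p[j]
                   for i in range(size) for j in range(size))
    return 1 - observed / expected

# Hypothetical counts for three categories:
# 0 = "very active", 1 = "impulsive", 2 = "playful".
table = [[20, 5, 1],
         [4, 15, 2],
         [1, 2, 50]]

# Confusing two hyperactivity indicators costs 1; confusing either
# with "playful" costs 3 (illustrative values only).
clinical_weights = [[0, 1, 3],
                    [1, 0, 3],
                    [3, 3, 0]]
unit_weights = [[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]]

kappa_w = weighted_kappa(table, clinical_weights)
kappa_unweighted = weighted_kappa(table, unit_weights)
print(kappa_w, kappa_unweighted)
```

With all off-diagonal weights equal to 1, the expression reduces to the unweighted κ. In this example the weighted value is higher than the unweighted one because most disagreements fall in the lightly weighted cells.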
