The proportion agreement index (p0), which indicates how often two observers' ratings concur, is an intuitive and useful first measure of observer agreement. It is computed by dividing the number of times the raters agree by the number of objects rated:

$$p_0 = \frac{\sum_{i} n_{ii}}{N}$$

Here $n_{ij}$ denotes the number of cases in cell ij of the cross-classified table, $n_{ii}$ denotes the cells on the main diagonal (where i = j), which indicate concordant ratings, and $N$ is the total number of objects rated. The same information is provided by the percentage agreement index (p%), which is the p0 index multiplied by 100 to obtain the actual percentages (see Suen & Ary, 1989).
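The computation described above (concordant ratings on the main diagonal divided by the number of objects rated) can be sketched in a few lines of Python. This is only an illustration: the function name `proportion_agreement` and the 3 x 3 example counts are hypothetical and are not taken from the chapter's tables.

```python
from typing import Sequence


def proportion_agreement(table: Sequence[Sequence[int]]) -> float:
    """Return p0 for a square cross-classification table of counts.

    table[i][j] is the number of objects that Rater A placed in
    category i and Rater B placed in category j.
    """
    if any(len(row) != len(table) for row in table):
        raise ValueError("table must be square")
    n_total = sum(sum(row) for row in table)
    if n_total == 0:
        raise ValueError("table contains no rated objects")
    # Concordant ratings sit on the main diagonal (i == j).
    n_agree = sum(table[i][i] for i in range(len(table)))
    return n_agree / n_total


# Hypothetical counts for illustration only.
example = [
    [20, 3, 1],
    [2, 15, 4],
    [1, 2, 12],
]

p0 = proportion_agreement(example)
print(f"p0 = {p0:.2f}, p% = {100 * p0:.0f}%")
```

With these made-up counts, 47 of the 60 objects fall on the diagonal, so p0 is roughly .78 and p% roughly 78.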

In Table 17.1a this index is p0 = .93. Sometimes p0 and p% are referred to as percent agreement (Hartmann, 1977), interval-by-interval agreement (Hawkins & Dotson, 1975), exact agreement (Repp, Deitz, Boles, Deitz, & Repp, 1976), overall reliability (Hopkins & Hermann, 1977), total agreement (House, House, & Campbell, 1981), or point-by-point reliability (Kelly, 1977).

Unfortunately, as Suen and Ary (1989) have shown, the proportion agreement index is inflated by chance agreement and depends on the marginal distributions. This is best illustrated by the data in Table 17.1b. Assume, for example, that 55 pupils actually are hyperactive and 445 are not. Both raters agreed 445 times in diagnosing pupils as "not hyperactive," whereas in the other 55 cases Rater A correctly judged "hyperactive" while Rater B assessed the same pupils as "not hyperactive." The proportion agreement index yields a value of p0 = .89, which is quite similar to the value obtained for the data presented in Table 17.1a. However, the raters did not agree on a single critical case, whereas in the first data set they agreed on 40 critical cases. The high agreement stems only from the low prevalence of hyperactivity, which is correctly reflected by the marginal distribution of Psychologist A's judgments, and from the agreement between both raters on "normal" cases. Because A correctly identified the hyperactive pupils, the high proportion agreement index may lead to the improper conclusion that B did as well, yet B did not detect a single critical case. Hence, both the percentage agreement index and the proportion agreement index suffer severely from their insensitivity to critical cases and their dependency on the criterion's distribution (i.e., its prevalence). As the actual prevalence of behavior occurrence approaches unity or zero, the proportion agreement index is increasingly likely to be inflated (Costello, 1973; Hartmann, 1977; Hopkins & Hermann, 1977; Johnson & Bolstad, 1973; Mitchell, 1979). The closer the prevalence is to .50, the less likely it is that the proportion agreement index is inflated (Suen & Ary, 1989).
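The arithmetic behind this example can be checked with a short Python sketch. The counts are the ones stated in the text for Table 17.1b; the layout of the table (rows for Rater A, columns for Rater B) and the variable names are assumptions made for illustration.

```python
# Counts as described in the text for Table 17.1b.
# Assumed layout: rows = Rater A, columns = Rater B,
# category order = ("hyperactive", "not hyperactive").
table_17_1b = [
    [0, 55],    # A: hyperactive     -> B: hyperactive / not hyperactive
    [0, 445],   # A: not hyperactive -> B: hyperactive / not hyperactive
]

n_total = sum(sum(row) for row in table_17_1b)
n_agree = table_17_1b[0][0] + table_17_1b[1][1]
p0 = n_agree / n_total

# Agreements on the critical category ("hyperactive").
critical_agreements = table_17_1b[0][0]

# Marginal counts of "hyperactive" judgments for each rater.
a_hyperactive = sum(table_17_1b[0])
b_hyperactive = table_17_1b[0][0] + table_17_1b[1][0]

print(f"p0 = {p0:.2f}")
print(f"agreements on 'hyperactive': {critical_agreements}")
print(f"'hyperactive' judgments: A = {a_hyperactive}, B = {b_hyperactive}")
```

The index comes out at .89 even though the cell for joint "hyperactive" judgments is empty and Rater B never uses that category.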

If, for example, both raters assume a prevalence of .50 and both are only guessing, their ratings could be based on the toss of a coin, yielding a probability of .25 for each cell of the cross-table. That is, given independent ratings (coin tosses), a base agreement of .25 would be expected in each of the two diagonal cells (see Table 17.2a), and, therefore, p0 = .50. Assuming a prevalence of .90, these base agreements are not equally distributed but strongly skewed (see Table 17.2b), and p0 is much higher (p0 = .82). Base agreement is most often referred to as agreement by chance (although this term is a bit misleading). Agreement by chance corresponds to the expected cell frequencies under the assumption of independence.
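The two values of p0 quoted above follow from multiplying the marginal probabilities of independent raters. The following Python sketch reproduces them; the function names `chance_table` and `chance_agreement` are hypothetical, and both raters are assumed to use the same prevalence.

```python
from typing import List


def chance_table(prev_a: float, prev_b: float) -> List[List[float]]:
    """Expected 2 x 2 cell probabilities if two raters judge independently.

    prev_a and prev_b are the probabilities with which Rater A and
    Rater B assign the "behavior present" category.
    """
    return [
        [prev_a * prev_b, prev_a * (1 - prev_b)],
        [(1 - prev_a) * prev_b, (1 - prev_a) * (1 - prev_b)],
    ]


def chance_agreement(prev_a: float, prev_b: float) -> float:
    """Sum of the diagonal cells of the independence table."""
    cells = chance_table(prev_a, prev_b)
    return cells[0][0] + cells[1][1]


for prevalence in (0.50, 0.90):
    cells = chance_table(prevalence, prevalence)
    p0 = chance_agreement(prevalence, prevalence)
    print(f"prevalence = {prevalence:.2f}")
    for row in cells:
        print("  " + "  ".join(f"{cell:.2f}" for cell in row))
    print(f"  p0 expected by chance = {p0:.2f}")
```

For a prevalence of .50 every cell is .25 and the diagonal sums to .50; for a prevalence of .90 the diagonal cells are .81 and .01, which sum to .82.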

Because the magnitude of percentage agreement can be inflated by agreement by chance—which itself depends on the prevalence of behavior—it is impossible to provide a reasonable threshold for acceptable and unacceptable interobserver agreement. Additionally, the magnitudes of interobserver agreement cannot be directly compared between studies with different rates of prevalence. Thus, many authors have argued that the proportion agreement index should no longer be used (Hartmann, 1977; Hartmann & Wood, 1982; Hawkins & Dotson, 1975; Kratochwill & Wetzel, 1977; Suen &

Fridtjof W. Nussbeck Agreement by Chance

(a) Agreement by chance with a prevalence rate of .50
