exp indicates how kappa is computed (.84 in this case). Agreement expected due just to chance is subtracted from both numerator and denominator, thus kappa gives the proportion of agreement corrected for chance.
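The computation just described can be sketched in a few lines. This is a minimal illustration, not GSEQ's code: the function name `cohen_kappa`, the variable names `p_obs` and `p_exp`, and the tallies are assumptions made for the example (the tallies are not those of Figure 10.3). Rows are taken to be the first coder's codes and columns the second coder's.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square agreement matrix (rows: coder 1, columns: coder 2)."""
    n = sum(sum(row) for row in matrix)
    rows = [sum(row) for row in matrix]          # coder 1 marginal totals
    cols = [sum(col) for col in zip(*matrix)]    # coder 2 marginal totals
    # Observed agreement: proportion of tallies on the diagonal
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Agreement expected by chance: products of matching marginals
    p_exp = sum(r * c for r, c in zip(rows, cols)) / n**2
    # Chance agreement is subtracted from both numerator and denominator
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical tallies for two codes (illustration only)
tallies = [[40, 10], [5, 45]]
print(round(cohen_kappa(tallies), 2))  # prints 0.7
```

Here observed agreement is .85 and chance agreement is .50, so kappa is .35/.50 = .70.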
Exact second-by-second agreement may be too stringent. Given human reaction time and other considerations, investigators may be willing to permit their coders some tolerance, which the GSEQ program allows. For the tallies in Figure 10.3, a 2-second tolerance was specified; thus agreements were tallied as long as the second observer agreed with the first within 2 seconds. Figure 10.4 shows how this works in practice. Displayed is a 40-second segment of a time plot from a reliability session of the sort GSEQ produces. The first coder's second-by-second record is shown on the first line, the second coder's on the second line, and disagreements on the third line. Seconds underlined with periods were disagreements but were not counted as such because the second coder agreed with the first coder within 2 seconds. For example, coder 1 assigned cj at 54:23 and 54:24 whereas coder 2 assigned Ss. However, coder 2 assigned cj at 54:21 and 54:22, which was within the tolerance specified. In contrast, seconds underlined with hyphens did count as disagreements. For example, coder 1 assigned Ss at 54:35; because coder 2 did not assign Ss within 2 seconds (i.e., from 54:33 to 54:37), this counted as a disagreement.
FIGURE 10.4. A segment from a plot, as displayed by GSEQ, of 40 seconds coded by two observers, with a tolerance of 2 seconds. Here, s = supported, c = coordinated, S = symbol-infused supported, and C = symbol-infused coordinated joint engagement. Seconds underlined with dashes are counted as disagreements. Seconds underlined with periods were not counted as disagreements because there was agreement within ±2 seconds, but they would be counted as disagreements if a tolerance of 0 were specified. Seconds not underlined represent exact agreement.
[Time plot: rows for Coder 1 and Coder 2, with the time axis labeled at 0:54:20, 0:54:30, 0:54:40, and 0:54:50.]
The procedure of cross-classifying time units to assess observer reliability raises a couple of potential concerns. First, because the time unit is arbitrary (recall the minute vs. moment discussion earlier), what would happen if half-seconds or tenths of a second were used instead, thereby doubling or increasing the number of tallies by an order of magnitude? Other things being equal, the value of kappa would not be affected; it is a magnitude of effect statistic, unchanged by the number of tallies (unlike, e.g., chi-square). True, its standard error would decrease with more tallies, but whether or not a kappa is statistically significantly different from zero is almost never of concern; significant kappas could still indicate unacceptable agreement. Quite rightly, investigators are concerned with the size of kappa, not its statistical significance. For example, Fleiss (1981) characterized values over .75 as excellent, between .60 and .75 as good, and between .40 and .60 as fair; nonetheless, Bakeman and Gottman (1997) recommended viewing values of kappa less than .70 with some concern.
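The claim that kappa does not depend on the number of tallies, whereas chi-square does, can be checked numerically. The sketch below multiplies every cell of a hypothetical agreement matrix by 10, as if tenths of a second had been tallied instead of seconds and nothing else had changed. The function names and the tallies are assumptions made for the illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square agreement matrix."""
    n = sum(sum(row) for row in matrix)
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n
    p_exp = sum(r * c for r, c in zip(rows, cols)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def chi_square(matrix):
    """Pearson chi-square for the same table (observed vs. expected counts)."""
    n = sum(sum(row) for row in matrix)
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    return sum(
        (matrix[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(len(rows))
        for j in range(len(cols))
    )

# Hypothetical tallies by seconds, then the same pattern tallied by tenths of a second
seconds = [[40, 10], [5, 45]]
tenths = [[10 * cell for cell in row] for row in seconds]

print(cohen_kappa(seconds), cohen_kappa(tenths))  # same value of kappa
print(chi_square(seconds), chi_square(tenths))    # chi-square grows tenfold
```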
Which coder is designated first and which second is also arbitrary. When no tolerance is specified, values of kappa are identical, no matter which coder is considered first. However, and this is the second concern, when a tolerance is specified, slightly different values of kappa are generated depending on which coder is first (because the algorithm considers each time unit for the first coder in turn and tallies an agreement if a match is found for the second coder within the tolerance specified). In practice, any difference in the values of the two kappas is usually quite small. Nonetheless, such indeterminacy makes most of us a bit uncomfortable, and so we recommend computing both values and then reporting the lower of the two, which seems a conservative strategy.
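The order dependence can be seen in a small sketch of a tolerance-based tally. This is an illustration under stated assumptions, not GSEQ's actual implementation: the function names are hypothetical, and the rule for what is tallied when no match is found within the window (the second coder's code at that same time unit) is an assumption. The two sequences are invented so that the effect is easy to see, which makes the difference far larger than is usual in practice.

```python
def tally_with_tolerance(first, second, tolerance):
    """Tally (first_code, second_code) pairs, one per time unit of `first`."""
    tallies = {}
    for t, code in enumerate(first):
        lo, hi = max(0, t - tolerance), min(len(second), t + tolerance + 1)
        # Agreement if `second` used the same code anywhere in the window;
        # otherwise (assumption) tally the code `second` used at time t
        other = code if code in second[lo:hi] else second[t]
        tallies[(code, other)] = tallies.get((code, other), 0) + 1
    return tallies

def kappa_from_tallies(tallies):
    """Cohen's kappa from a dict of (first_code, second_code) -> count."""
    codes = sorted({c for pair in tallies for c in pair})
    n = sum(tallies.values())
    p_obs = sum(tallies.get((c, c), 0) for c in codes) / n
    p_exp = sum(
        sum(v for (a, _), v in tallies.items() if a == c)
        * sum(v for (_, b), v in tallies.items() if b == c)
        for c in codes
    ) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical one-code-per-second records; coder A has a 1-second "c" that B lacks
coder_a = list("sssscsssscccc")
coder_b = list("ssssssssscccc")

kappa_ab = kappa_from_tallies(tally_with_tolerance(coder_a, coder_b, tolerance=1))
kappa_ba = kappa_from_tallies(tally_with_tolerance(coder_b, coder_a, tolerance=1))
reported = min(kappa_ab, kappa_ba)  # report the lower, more conservative value
print(round(kappa_ab, 3), round(kappa_ba, 3), round(reported, 3))
```

With coder A first, the isolated c has no match within 1 second and counts as a disagreement (kappa of about .83). With coder B first, every second finds a match within the window (kappa of 1.00). The lower value is the one reported.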
Cohen's kappa has many advantages over the traditional percentage of agreement. By eliminating the portion of nonreliable agreement due to chance from the total agreement, the index becomes an index of reliability in a classical measurement theory sense that assumes a ratio between true variance and total variance. It can also be weighted when the variable is ordinal so that more versus less serious confusions about codes between observers can be taken into account (for details see Bakeman & Gottman, 1997). Kappa can be calculated both for the general category system and for each single category (by extracting a series of 2 × 2 tables from the agreement matrix); thus, with the help of the agreement matrix, different kappas for different codes within the same set can be compared to detect particularly unreliable codes.
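Extracting a 2 × 2 table for each code can be sketched as follows. Each table contrasts one code with all other codes combined, and kappa is then computed on that table. The function names, code labels, and tallies are hypothetical and serve only as an illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square agreement matrix."""
    n = sum(sum(row) for row in matrix)
    rows = [sum(row) for row in matrix]
    cols = [sum(col) for col in zip(*matrix)]
    p_obs = sum(matrix[i][i] for i in range(len(matrix))) / n
    p_exp = sum(r * c for r, c in zip(rows, cols)) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

def collapse(matrix, i):
    """Collapse a k x k agreement matrix into a 2 x 2 table: code i vs. all other codes."""
    n = sum(sum(row) for row in matrix)
    both = matrix[i][i]                               # both coders assigned code i
    first_only = sum(matrix[i]) - both                # only coder 1 assigned code i
    second_only = sum(row[i] for row in matrix) - both  # only coder 2 assigned code i
    neither = n - both - first_only - second_only
    return [[both, first_only], [second_only, neither]]

# Hypothetical agreement matrix for three codes
codes = ["a", "b", "c"]
matrix = [[30, 2, 1],
          [3, 25, 4],
          [0, 6, 29]]

print("overall", round(cohen_kappa(matrix), 2))
for i, code in enumerate(codes):
    # A code with a noticeably lower kappa is a candidate for retraining or redefinition
    print(code, round(cohen_kappa(collapse(matrix, i)), 2))
```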
In sum, when using observational methods, reliability is a central concern, from training of coders to publication of research reports. Validity is a concern too, but one that applies to all our studies, all the time, no matter what measurement approach is used, and that usually is integrated with data analysis. Still, it is worth noting that one common approach to training observers combines both validity and reliability concerns. This approach involves preparation of standard protocols that are assumed accurate and against which observers are tested. The standard protocol is regarded as a "gold standard," one that the researcher prepares with the consultation of experts and that is regarded as representing "the true state of affairs"; that is, in psychometric terms, it is an external measure that the researcher can reasonably assume to be accurate. Comparing each observer with this protocol by means of the confusion matrix and Cohen's kappa provides a simple way to understand if observers are really coding what the researcher wants them to code. This procedure has at least two advantages: It identifies coders' errors while eliminating the possibility that the coders share a common but nonetheless deviant worldview, and it permits all future coders to be trained to a common criterion of known (presumed) validity.
When observational methods are used and the timing of events is recorded (i.e., onset times and implicit or explicit offset times), a circumstance that current technology makes easy, routine, and increasingly common, assessing reliability of coders is facilitated when data are represented as successive coded time units (e.g., seconds). An alternative strategy, sometimes encountered in older literature, is to attempt to align two protocols and somehow, attempting to take commissions and omissions into account, identify similar stretches of time assigned the same code as a single agreement, and then report a percentage of agreement statistic. This strategy is imprecise and does not give coders credit for the moment-by-moment nature of their decisions. It also does not give them credit for not coding an event, even when that may be the correct decision. The time-based approach to reliability presented here seems preferable. In the next section we demonstrate how representing data as successive coded time units can facilitate data analysis as well.
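Representing timed-event data as successive coded time units can be sketched as follows. The example assumes a hypothetical record format of (onset second, code) pairs with implicit offsets, so that each code lasts until the next onset. The function name, the records, and the session length are invented for the illustration.

```python
from collections import Counter

def to_time_units(events, end):
    """Expand (onset_second, code) events into one code per second.

    Offsets are implicit: each code lasts until the next onset, or until `end`.
    """
    events = sorted(events)
    next_onsets = [onset for onset, _ in events[1:]] + [end]
    units = []
    for (onset, code), nxt in zip(events, next_onsets):
        units.extend([code] * (nxt - onset))
    return units

# Hypothetical 10-second records from two coders
coder1 = to_time_units([(0, "s"), (4, "c"), (7, "s")], end=10)
coder2 = to_time_units([(0, "s"), (5, "c"), (7, "s")], end=10)

# Cross-classify the two records second by second (no tolerance)
tallies = Counter(zip(coder1, coder2))
for (code1, code2), count in sorted(tallies.items()):
    print(code1, code2, count)
```

Each second contributes one tally, so seconds in which both coders assigned the same code count as agreements, including seconds in which neither coded a change.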