The standard psychometric concerns of reliability and validity are in no way unique to observational methods. The precision and accuracy of any measuring device need to be established before weight can be given to the data collected with it. Such considerations apply no matter whether observational methods or other measuring approaches are used. Nonetheless, for some measuring approaches, reliability issues do not loom large. For example, usually we assume that, once calibrated, electromechanical measuring instruments are accurate. Similarly, we assume that a transcriber is accurate and do not ask what the reliability is of a transcription (although perhaps we should). Furthermore, for some kinds of measurement, reliability matters are quite codified, and so it is routine to compute and report Cronbach's internal consistency alpha for self-report scales.
In contrast, for observational methods, reliability issues do loom large and are quite central to the approach. For the sort of observational systems described here, the measuring device consists of trained human observers applying a coding scheme or schemes to streams of behavior, often video recorded. Thus the main source of error in observational methodology is the human observer. The careful training of observers, and establishing their reliability, is an important part of the observational enterprise. Quite correctly, we are a bit skeptical of our fellow humans and want to assure ourselves, and others, that data recorded by one observer are not idiosyncratic, unique to that observer's way of viewing the world.
Thus the first concern is for reliability. Validity is more complex, and evidence for it accumulates slowly, as we will discuss subsequently. As is standard (e.g., Nunnally & Bernstein, 1994; Pedhazur & Schmelkin, 1991), by reliability we understand agreement and replicability. Whatever is being measured is being measured consistently. When two observers agree with each other, or agree with themselves over time, we have evidence for reliability. It is possible, of course, that two observers might share a deviant worldview, in which case they would be reliable but not valid. Validity implies accuracy, that we are indeed measuring what we intend. As is widely appreciated, measures may be reliable without being valid, but they cannot be valid without being reliable.
Reliability can be established using fairly narrow statistical means (e.g., Cronbach's alpha for internal consistency of self-report scales), whereas validity involves demonstrating that a measure correlates in sensible ways with different measures allegedly associated with it in the present (concurrent validity) or in the future (predictive validity) and with other measures assumed to measure the same construct (convergent validity), and does not correlate with other measures assumed to measure other constructs (divergent validity). Validity is not necessarily demonstrated in one study; rather, it requires an accumulation of evidence, coupled with some judgment. Because reliability is required for validity, and because a specific statistic (Cohen's kappa) dominates observational reliability, whereas validity approaches are much more general (as, indeed, the content and organization of this volume attest), in this chapter we emphasize reliability as applied to observational methods. We might add that not all authors regard agreement as "an index of reliability at all" because it addresses only a particular source of error (Pedhazur & Schmelkin, 1991, p. 145). Earlier, Bakeman and Gottman (1997) attempted to distinguish reliability from agreement but, on reflection, we would argue that the distinction is less useful than the more firmly psychometric view presented here.
As previously noted, usually observers are asked to make categorical distinctions; thus the most common statistic used to establish interobserver reliability in the context of observational studies is Cohen's kappa, a coefficient of agreement for categorical scales (Cohen, 1960; also see Nussbeck, chap. 17, this volume). Cohen's kappa corrects for chance agreement and thus is much preferred to the percentage agreement statistics sometimes encountered, especially in older literature. Moreover, the agreement matrix (also called a confusion matrix), required for its computation, is helpful when training observers because of the graphic way it portrays specific sources of disagreement. In the following paragraphs, we demonstrate the use of kappa using an example based on research in the development of joint attention in infants and toddlers by Adamson and colleagues. This example is useful because it allows us to integrate material introduced earlier in this chapter concerning coding schemes and the representation of observational data with reliability in a way that demonstrates what has been a theme throughout, the usefulness of conceptualizing observational data as a sequence of coded time units.
First, the coding schemes: Adamson, Bakeman, and Deckner (2004) have examined how language (or symbolic means generally) becomes infused into and transforms joint attention with toddlers. To this end, and based on earlier work (Bakeman & Adamson, 1984), they defined seven engagement states for toddlers playing with their mothers. Four, listed first, are of primary theoretic interest, whereas three more complete the ME&E set:
1. Supported Joint Engagement (sj), infant and mother actively involved with same object, but the infant does not overtly acknowledge the mother's participation; symbols (primarily language) not involved.
2. Coordinated Joint Engagement (cj), infant and mother actively involved with same object or event, and the infant acknowledges the mother's participation; symbols not involved.
3. Symbol-Infused Supported Joint Engagement (Ss), toddler and mother involved with same object, the toddler is attending to symbols, but the toddler does not overtly acknowledge mother's participation.
4. Symbol-Infused Coordinated Joint Engagement (Cs), toddler and mother involved with same object, the toddler is attending to symbols, and the toddler actively acknowledges mother's participation.
5. Unengaged, Onlooking, or Person (ulp). Initially these three were coded separately but the distinctions were not of primary interest, so we combined them into a single code (using GSEQ's OR command).
6. Object, infant engaged with objects alone (ob).
7. Symbol-Only, Object-Symbol, Person-Symbol (Yop). These three were defined to complete logical possibilities but, as expected, were very infrequent, and not of primary interest, so we combined them into a single code.
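Combining several codes into one, as described for items 5 and 7, amounts to a simple recoding of the sequence of coded time units. The following is a minimal Python sketch of that idea, not GSEQ's OR command itself; the function name and the three original code names (`un`, `on`, `pe`) are hypothetical and chosen only for illustration.

```python
def merge_codes(sequence, merge_map):
    """Return a new sequence in which every code listed in merge_map
    is replaced by its combined code; other codes pass through unchanged."""
    return [merge_map.get(code, code) for code in sequence]


# Hypothetical separate codes for Unengaged, Onlooking, and Person,
# all mapped to the single combined code used in the list above.
MERGE_MAP = {"un": "ulp", "on": "ulp", "pe": "ulp"}

# One code per second of observation (made-up data).
seconds = ["un", "un", "ob", "on", "sj", "pe"]
merged = merge_codes(seconds, MERGE_MAP)
print(merged)  # ['ulp', 'ulp', 'ob', 'ulp', 'sj', 'ulp']
```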
Once codes are defined, a focus on issues of reliability serves several purposes, ranging from training of coders to final publication of research reports. Three important questions investigators face are as follows: First, given that we are asking coders to identify times when behaviors (in this case, engagement states) occur, how do we provide them feedback concerning their reliability? Second, how do we assure ourselves that they are reliable? And third, how do we convince colleagues, including editors and reviewers, that they are reliable? When the timing of events is recorded, these questions become tractable once we identify a time unit as the coding unit and represent the data as a sequence of coded time units, as discussed earlier. Assume a time unit of a second, as is common. Then the seconds of the observation become the thing tallied. Rows and columns of a matrix are labeled with the codes in a ME&E set. Rows represent one observer and columns a second observer. Then, each second of the observation is cross classified. Tallies in the cells on the upper-left to lower-right diagonal represent agreement, whereas tallies in off-diagonal cells represent disagreement.
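The cross-classification just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure used in the study: the function name is hypothetical, only three of the seven codes are used, and the two observers' records are made-up sequences with one code per second.

```python
def agreement_matrix(observer1, observer2, codes):
    """Cross-classify two observers' coded time units.

    Rows index the first observer's code, columns the second observer's.
    Both sequences must hold one code per time unit (e.g., per second).
    """
    if len(observer1) != len(observer2):
        raise ValueError("both observers must code the same number of time units")
    index = {code: i for i, code in enumerate(codes)}
    k = len(codes)
    matrix = [[0] * k for _ in range(k)]
    for code1, code2 in zip(observer1, observer2):
        matrix[index[code1]][index[code2]] += 1
    return matrix


# Made-up example: eight seconds coded by two observers.
codes = ["ob", "sj", "cj"]
observer1 = ["ob", "ob", "sj", "sj", "sj", "cj", "cj", "ob"]
observer2 = ["ob", "sj", "sj", "sj", "cj", "cj", "cj", "ob"]

matrix = agreement_matrix(observer1, observer2, codes)
for code, row in zip(codes, matrix):
    print(code, row)

# Tallies on the diagonal are seconds on which the observers agreed.
agreements = sum(matrix[i][i] for i in range(len(codes)))
```

Here six of the eight seconds fall on the diagonal; the two off-diagonal tallies are seconds the first observer coded `ob` or `sj` and the second coded `sj` or `cj`.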
Such an agreement matrix provides coders a graphic display of their agreement and disagreement. It pinpoints codes on which they disagree; for example, for the agreement matrix in Figure 10.3 the most common disagreement was between object and supported joint engagement (31 seconds). Moreover, when codes are ordered from simpler to more complex (as they are in Figure 10.3), tallies disproportionately above the diagonal, for example, would suggest that the second observer consistently had lower thresholds (was more sensitive) than the first. Thus patterns of disagreement suggest areas for further training, whereas patterns of agreement assure investigators that coders are faithfully executing the coding.
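Both kinds of feedback, the most frequent confusion and the balance of tallies above versus below the diagonal, can be read off the matrix programmatically. The Python sketch below assumes rows are the first observer and columns the second, with codes ordered from simpler to more complex; the function names and the tallies are hypothetical and are not the values in Figure 10.3.

```python
def largest_disagreement(matrix, codes):
    """Return (row_code, column_code, tally) for the largest off-diagonal cell.

    Ties are resolved in favor of the first cell encountered in row order.
    """
    best = None
    for i, row in enumerate(matrix):
        for j, tally in enumerate(row):
            if i != j and (best is None or tally > best[2]):
                best = (codes[i], codes[j], tally)
    return best


def triangle_totals(matrix):
    """Return (tallies above the diagonal, tallies below the diagonal)."""
    k = len(matrix)
    above = sum(matrix[i][j] for i in range(k) for j in range(i + 1, k))
    below = sum(matrix[i][j] for i in range(k) for j in range(i))
    return above, below


# Made-up tallies (seconds); codes ordered from simpler to more complex.
codes = ["ob", "sj", "cj"]
matrix = [
    [50, 9, 1],
    [3, 40, 6],
    [0, 2, 30],
]

worst = largest_disagreement(matrix, codes)
above, below = triangle_totals(matrix)
print(worst)         # ('ob', 'sj', 9)
print(above, below)  # 16 5
```

In this made-up matrix, more tallies lie above the diagonal than below it, the pattern the text associates with a second observer who more readily assigns the more complex codes.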
Moreover, the extent of agreement can be quantified using Cohen's kappa (Cohen, 1960; Robinson & Bakeman, 1998), which is used in published reports to assure others of the reliability with which the coding scheme was applied. Kappa is an index that summarizes agreement between two coders when assigning things (here seconds) to the codes of an ME&E set. Thus kappa is an index of the reliability with which two coders use a categorical scale (i.e., a set of ME&E codes), derived from the agreement matrix. Let $x_{ij}$ indicate a cell of the matrix and a plus sign indicate summation; then $x_{i+}$ indicates the total for the $i$th row and $x_{++}$ indicates the total number of tallies in the matrix, where $k$ is the number of codes in the set.
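As a hedged sketch of where this notation leads: Cohen's (1960) kappa is conventionally defined as $\kappa = (P_{obs} - P_{exp}) / (1 - P_{exp})$, where $P_{obs} = \sum_{i=1}^{k} x_{ii} / x_{++}$ is the observed proportion of agreement and $P_{exp} = \sum_{i=1}^{k} x_{i+} x_{+i} / x_{++}^{2}$ is the proportion of agreement expected by chance. The labels $P_{obs}$ and $P_{exp}$, and the column total $x_{+i}$ (written by analogy with the row total), are notation assumed here. The Python functions below, with hypothetical names and made-up tallies, compute kappa and simple proportion agreement from an agreement matrix.

```python
def proportion_agreement(matrix):
    """Observed proportion of agreement: diagonal tallies over all tallies."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("agreement matrix contains no tallies")
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def cohens_kappa(matrix):
    """Cohen's kappa for a square agreement matrix (rows: observer 1,
    columns: observer 2)."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("agreement matrix contains no tallies")
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_obs = proportion_agreement(matrix)
    # Chance agreement: product of matching row and column proportions.
    p_exp = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    if p_exp == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1 - p_exp)


# Made-up two-code example with 50 tallies.
example = [
    [20, 5],
    [10, 15],
]
p_obs = proportion_agreement(example)  # 35 / 50 = 0.7
kappa = cohens_kappa(example)          # approximately 0.4
print(round(p_obs, 2), round(kappa, 2))
```

In this example the observers agree on 70% of the tallies, yet half of that agreement would be expected by chance given the row and column totals, so kappa is only about .40. This is the sense in which kappa corrects the percentage agreement statistic.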