Measurement reliability is often thought of as a high test-retest correlation, which is not an accurate conceptualization. Reliability, as an aspect of measurement, refers to the degree to which observed scores reflect the "true" amount of the construct being measured. We never have access to "true" scores, so we must estimate reliability. A test-retest correlation is an appropriate estimate of reliability for between-subjects constructs (i.e., for trait constructs), where the variability of interest is between participants. With trait constructs we assume little or no meaningful within-participant variance. Intelligence is a good example of a between-subject, or strictly trait, construct, where we assume that, for any single individual, intelligence is stable and not easily changed, at least not over a few weeks or months. As such, reliable measures of intelligence will demonstrate high test-retest correlations, and test-retest results are adequate reliability estimates.
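The definition above can be made concrete with a small simulation. What follows is only a sketch, assuming the classical test theory model in which each observed score is a stable true score plus independent random error. The function names, sample size, and standard deviations are illustrative choices, not values from the text. Under those assumptions, reliability is the share of observed variance that comes from true scores, and for a purely trait construct the test-retest correlation should approximate it.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_test_retest(n=5000, true_sd=1.0, error_sd=0.5, seed=0):
    """Simulate two administrations of a trait measure (hypothetical values).

    True scores do not change between occasions; only the random
    measurement error differs from one administration to the next.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n)]
    time1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    time2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson(time1, time2)


# Reliability as the ratio of true-score variance to observed variance.
theoretical_reliability = 1.0 ** 2 / (1.0 ** 2 + 0.5 ** 2)
estimated_reliability = simulate_test_retest()
print(theoretical_reliability, estimated_reliability)
```

With a stable trait, the simulated test-retest correlation lands close to the variance ratio, which is why the correlation serves as an adequate reliability estimate in that case.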
Emotion, on the other hand, is typically conceptualized as a within-subject construct (i.e., as a state construct), and we therefore assume that it may change frequently within any single individual. To make matters more complicated, as already noted, emotion can be both a between-subjects and a within-subject construct: there is meaningful variance within people over time (i.e., reactivity) as well as meaningful variance between individuals (i.e., individual differences in set-point or expected value). Because emotions are hybrid state-trait constructs, a low test-retest correlation may reflect genuine change rather than measurement error, so we cannot use simple test-retest correlations as estimates of measurement reliability.
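The two sources of variance can be separated descriptively when each person is measured on several occasions. The sketch below uses made-up ratings for four hypothetical people on four occasions; the function name and the data are assumptions for illustration only. It is a simple descriptive split, not a formal multilevel model or intraclass correlation.

```python
import statistics


def variance_components(ratings):
    """Split repeated emotion ratings into two descriptive components.

    ratings: dict mapping a person label to that person's ratings
    across occasions (each person needs at least two occasions).
    Returns (between, within), where `between` is the variance of the
    person means (individual differences in set-point) and `within`
    is the average variance of each person's ratings around their own
    mean (change over time).
    """
    person_means = [statistics.fmean(r) for r in ratings.values()]
    between = statistics.variance(person_means)
    within = statistics.fmean(statistics.variance(r) for r in ratings.values())
    return between, within


# Hypothetical data: four people rated on four occasions.
ratings = {
    "A": [2, 4, 3, 3],
    "B": [5, 7, 6, 6],
    "C": [3, 5, 4, 4],
    "D": [6, 8, 7, 7],
}
between_var, within_var = variance_components(ratings)
print(between_var, within_var)
```

When the within-person component is large and substantively meaningful, treating it as error (as a test-retest correlation implicitly does) understates how well the measure tracks the construct.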
Reliability can also be estimated through internal consistency indicators, such as coefficient alpha or odd-even item-composite correlations. These are actually measures of item homogeneity, assessing the degree to which items measure the same underlying construct. Because many self-report emotion measures are factor analytically constructed, such item homogeneity is built in during the scale construction process. Internal consistency analysis is thus one way to estimate reliability, and it works equally well for both state and trait measures. However, internal consistency estimates of reliability work only for multi-item scales. Single-item measures, whether they are self-reports, observer ratings, or experimental tasks, simply cannot be examined in terms of internal consistency. One could, however, estimate the consistency across multiple indicators, but such an analysis verges more on construct validity than on the classical concern with the measurement reliability of the single measures themselves.
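As an illustration of the internal consistency approach, coefficient alpha can be computed from a respondents-by-items table with the standard formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The sketch below is a minimal version; the function name and the response data are hypothetical. It also shows why the approach is unavailable for single-item measures: the formula is undefined when k = 1.

```python
import statistics


def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items table of scores.

    scores: list of rows, one row per respondent, one column per item.
    """
    k = len(scores[0])
    if k < 2:
        # A single-item measure has no internal consistency to assess.
        raise ValueError("coefficient alpha needs at least two items")
    item_variances = [
        statistics.variance([row[j] for row in scores]) for j in range(k)
    ]
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: five respondents, three items on a 1-5 scale.
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because the calculation uses only a single administration, it applies equally to state and trait measures, as noted above.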
Researchers using single measures of emotion are really left in the dark about measurement reliability. One might choose to ignore reliability concerns altogether and focus instead on concerns about validity. This appears reasonable because measurement reliability sets the upper bound on validity correlations. In other words, a measure cannot correlate with external criteria higher than it can correlate with itself. As such, valid measures are de facto reliable. However, a researcher who passes up reliability concerns proceeds at some risk of being unable to draw credible conclusions, particularly if some hypothesized validity relationship is not found. Conversely, strong evidence for validity, with multiple converging methods and replicated patterns of association, can add credibility to the claim that a particular measure is reliable.
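The ceiling that reliability places on validity is usually formalized in classical test theory through the attenuation formula: the observed correlation equals the correlation between true scores multiplied by the square root of the product of the two reliabilities. On that formulation the ceiling for a measure is the square root of its reliability coefficient. The sketch below assumes that model; the function names and the example reliabilities are illustrative, not taken from the text.

```python
import math


def attenuated_correlation(true_r, rel_x, rel_y=1.0):
    """Expected observed correlation under classical test theory.

    true_r: correlation between the true scores of measure and criterion.
    rel_x, rel_y: reliabilities of the measure and the criterion (0 to 1).
    """
    for rel in (rel_x, rel_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(rel_x * rel_y)


def validity_ceiling(rel_x, rel_y=1.0):
    """Largest observable validity correlation given the reliabilities."""
    return attenuated_correlation(1.0, rel_x, rel_y)


# Hypothetical values: an emotion measure with reliability .64.
ceiling = validity_ceiling(0.64)
# A true correlation of .50, with a criterion of reliability .81.
observed = attenuated_correlation(0.5, 0.64, 0.81)
print(ceiling, observed)
```

Read in the other direction, the same relationship supports the point made above: an observed validity correlation of a given size implies that the measure's reliability cannot be lower than the square of that correlation.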
Measurement reliability is most crucial when it comes to interpreting failures to reject the null hypothesis. For example, if a study is conducted and the predicted relationships are not found, three obvious reasons must be entertained: either the theory was wrong, the measures used were not reliable, or some auxiliary conditions of the study were not met (cf. Meehl, 1978, for more detailed discussion). When a study fails to find the predicted results, and the researcher is confident that the measures used are reliable, then the set of reasons is narrowed to questioning the theory or looking for something that might have gone wrong with the procedure or data. It is precisely in such circumstances that reliability evidence is crucial. In the absence of evidence for reliability, conclusions cannot be made about whether the theory was adequately tested. However, because there are typically several different measures of the same facet of emotion (e.g., multiple observational ratings, or ratings from several raters), the reliability of that facet's measures can be modeled.
As for validity concerns, the construct of emotion also poses several unique challenges to researchers. Emotion is a theoretical construct that is only probabilistically linked to observable indicators. Even though it may be represented by many different measures, emotion is not equivalent to, nor can it be reduced to, any single measure. This underscores the importance of construct validity, especially multimethod construct validity, in understanding the scientific meaning of emotion terms.
In construct validity (Cronbach & Meehl, 1955), meaning is given to a scientific term (e.g., emotion) by the nomological network of assertions in which that term appears. Our theories and measurement models guide us in proposing a network of associations around the construct. The proposed links in this network then become hypotheses to be tested in empirical research. In construct validation, theory testing and measurement development proceed in tandem. Each link in the network helps add to the scientific meaning of the term as well as providing evidence on the validity of the measure.
Some links in the network refer to positive associations (convergent validity), and some refer to negative or null associations (discriminant validity). In addition, some links specify the conditions under which emotions are likely to be evoked (predictive validity).
The total collection of relationships (the links in the nomological network) built up around the construct of "emotion," or around specific emotions, creates a mosaic of research findings (Messick, 1980). When enough pieces of the network are in place, we "get the picture." That is, when enough empirical results are available about what something is, what it isn't, and what it predicts, we begin to have the feeling that we "understand" it and can measure it. Moreover, the credibility of our scientific understanding of the meaning of a construct grows with the diversity of the methods that go into establishing the nomological links. That is, the greater the methodological distance between two nodes in a nomological network (e.g., a physiological measure correlating with an evaluative self-report), the greater the credibility given to claims regarding the scientific meaning of the construct. This is not to say that our understanding of a construct is complete at this point; construct validity is always unfinished, and things are always "true until further notice." Nevertheless, there comes a point where we reach some consensual agreement about the scientific meaning of a construct, such as emotion, as well as the utility of the different measures that are used as indicators of that construct.
Because emotions implicate multiple response systems (e.g., facial action, autonomic activity, subjective experience, action tendencies), the question arises whether we should expect strong convergence among indicators of these different response systems. This validity question is particularly vexing in emotion research because of the nature of the multiple response systems (i.e., loosely coupled and complexly interacting systems). Moreover, the various response systems have functions beyond serving as indicators of emotion. For instance, in addition to reacting to emotions, the autonomic nervous system functions to regulate metabolic input and output and to maintain homeostasis. The facial muscles, in addition to producing outward expressions of feelings, are used for vocalizing and eating. The cardiovascular system, in addition to speeding up during emotion, functions mainly to circulate blood to all organs of the body. These so-called emotional response systems have more to do than just respond to emotions, and this should make any researcher question the validity of any single measure of "emotion," such as heart rate. Perhaps the strongest evidence for validity is when a theory about a particular emotion can be used to generate predictions about the conditions under which that emotion will be evoked, or the type of persons for whom that emotion will be most easily evoked. Couple this with measurement theory and a knowledge of multiple measures of emotion, and very specific predictions may be generated and tested.
For the reasons just described, the various components of emotion will never correlate substantially with each other, and they therefore pose special challenges to researchers interested in using some of the statistical models described elsewhere in this handbook. That is, emotion measures may not cohere well enough to be modeled by the standard techniques. More advanced methods, especially those that can accommodate multiple but weakly correlated measures, will be needed in the area of emotion.
Emotion researchers need to keep clearly in mind that constructs are never purely measured. Rather, all measures are construct-method composites. For example, the measurement of anxiety is not the same across different methodological contexts. Instead, we should consider using terms that specify the method and the construct together, such as self-reported anxiety or cardiovascular anxiety or observer-rated anxiety. This acknowledges the fact that the theoretical meaning of a construct is given, in part, by the methods used to measure it. We turn now to a consideration of specific methods of measurement commonly used in the emotion domain.