## Learning Effects for the Mammography Experiment

In the mammography experiment, the radiologists saw each study at least five times over the complete course of the experiment. These five versions were the analog original, the digitized version, and the three wavelet compressed versions. Some images were seen more than five times, because there were also JPEG compressed images and some repeated images, the latter included so that intraobserver variability could be measured directly.

In this work, we looked for whether learning effects were present in the management outcomes using what is known in statistics as a "runs" test. We illustrate the method with an example. Suppose a study was seen exactly five times. The management outcomes take on four possible values (RTS, F/U, C/B, BX). Suppose that for a particular study and radiologist, the observed outcomes were BX three times and C/B two times. If there were no learning, then all possible "words" of length five with three BXs and two C/Bs should be equally likely. There are 10 possible words that have three BXs and two C/Bs. These words have the outcomes ordered by increasing session number; that is, in the chronological order in which they were produced. For these 10 words, we can count the number of times that a management outcome made on one version of a study differs from that made on the immediately previous version of the study. The number ranges from one (e.g., BX BX BX C/B C/B) to four (BX C/B BX C/B BX). The expected number of changes in management decision is 2.4, and the variance is 0.84. If the radiologists had learned from previous films, one would expect that there would be fewer changes of management prescription than would be seen by chance. This is a conditional runs test, which is to say that we are studying the conditional permutation distribution of the runs.
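The conditional moments quoted in the example can be checked by brute-force enumeration of the distinct words. The following Python sketch is illustrative only and is not the code used in the study; the function names `count_changes` and `conditional_moments` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def count_changes(word):
    """Number of positions where an outcome differs from the previous one."""
    return sum(a != b for a, b in zip(word, word[1:]))


def conditional_moments(outcomes):
    """Enumerate the distinct orderings of `outcomes` (all equally likely
    under the no-learning hypothesis) and return the number of words together
    with the exact mean and variance of the number of changes."""
    words = set(permutations(outcomes))
    counts = [count_changes(w) for w in words]
    n = len(counts)
    mean = Fraction(sum(counts), n)
    var = Fraction(sum(c * c for c in counts), n) - mean ** 2
    return n, mean, var


# The example from the text: BX three times and C/B two times.
n_words, mean, var = conditional_moments(["BX", "BX", "BX", "C/B", "C/B"])
print(n_words, float(mean), float(var))  # 10 2.4 0.84
```

Enumeration is practical here because each study contributes only five outcomes; for much longer sequences a closed-form expression for the moments of the number of runs would be preferable.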

We assume that these "sequence data" are independent across studies for a fixed radiologist, since examining films for one patient probably does not help in evaluating a different patient. So we can pool the studies by summing over studies the observed values of the number of changes, subtracting the summed (conditional) expected value, and dividing this by the square root of the sum of the (conditional) variances. The attained significance level (p-value) of the resultant Z value is the probability that a standard Gaussian is less than Z. This is a lower-tail probability: learning would produce fewer changes than chance, and hence a negative Z and a small p-value.
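The pooling step can be sketched as follows for a single radiologist. This is a minimal illustration under stated assumptions, not the study's code: the two sequences in `sequences` are invented for the example, and the helper names (`count_changes`, `conditional_moments`, `pooled_z`) are hypothetical.

```python
import math
from fractions import Fraction
from itertools import permutations
from statistics import NormalDist


def count_changes(word):
    """Number of positions where an outcome differs from the previous one."""
    return sum(a != b for a, b in zip(word, word[1:]))


def conditional_moments(outcomes):
    """Exact mean and variance of the number of changes over all distinct
    orderings of the observed outcomes."""
    counts = [count_changes(w) for w in set(permutations(outcomes))]
    mean = Fraction(sum(counts), len(counts))
    var = Fraction(sum(c * c for c in counts), len(counts)) - mean ** 2
    return mean, var


def pooled_z(sequences):
    """Sum observed changes, conditional means, and conditional variances
    over studies, then standardize."""
    observed = sum(count_changes(s) for s in sequences)
    moments = [conditional_moments(s) for s in sequences]
    expected = sum(m for m, _ in moments)
    variance = sum(v for _, v in moments)
    if variance == 0:
        raise ValueError("no informative studies: pooled variance is zero")
    return float(observed - expected) / math.sqrt(float(variance))


# Hypothetical chronological outcome sequences for one radiologist.
sequences = [
    ("BX", "BX", "BX", "C/B", "C/B"),
    ("RTS", "F/U", "RTS", "RTS", "RTS"),
]
z = pooled_z(sequences)
p = NormalDist().cdf(z)  # lower-tail probability P(N(0, 1) < Z)
print(round(z, 3), round(p, 3))
```

A sequence whose outcome never changes would add zero to the observed count, the expected value, and the variance, so including it in `sequences` would leave `z` unchanged.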

Those studies for which the management advice never changes have zero observed changes. Such studies are not informative with regard to learning, since it is impossible to say whether unwavering management advice is the result of perfect learning that occurs with the very first version seen, or whether it is the result of the obvious alternative: that the study in question was clear-cut, and the radiologist simply interpreted it the same way, independently, each time. Conditional on the observed outcomes, the expected number of changes and its variance are also zero for such studies, so they do not contribute in any way to the computation of the statistic. Ignoring the JPEG versions and the repeated images, as this analysis does, is believed to make the analysis and its p-values conservative. If there were no learning going on, then the additional versions make no difference. However, if there were learning, then the additional versions (and additional learning) should mean that there would be even fewer management changes among the five versions that figure in this analysis.

The runs test for learning did not find any learning effect at the 5% significance level for the management outcomes in the mammography experiment. For each of the three judges, approximately half of the studies were not included in the computation of the statistic, since the management decision was unchanging. For the three judges, the numbers of studies retained in the computation were 28, 28, and 27. The Z values obtained were -0.12, -0.86, and -0.22, with corresponding p-values of 0.452, 0.195, and 0.413.
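As a quick arithmetic check, the reported p-values can be recovered from the reported Z values as lower-tail standard Gaussian probabilities. The short Python sketch below uses only the figures quoted above; the variable names are our own.

```python
from statistics import NormalDist

# (Z value, reported p-value) for the three judges, as quoted in the text.
reported = [(-0.12, 0.452), (-0.86, 0.195), (-0.22, 0.413)]

std_normal = NormalDist()  # mean 0, standard deviation 1
recomputed = [round(std_normal.cdf(z), 3) for z, _ in reported]

for (z, p_reported), p_check in zip(reported, recomputed):
    print(f"Z = {z:+.2f}  reported p = {p_reported:.3f}  recomputed p = {p_check:.3f}")
```

All three recomputed values agree with the reported ones to three decimal places, and none falls below 0.05.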
