## N

In the consensus gold standard, there were never more than four abnormalities found. So the $\Delta_i$ are fractions with denominators not more than 4 and are utterly non-Gaussian. (For the personal gold standard, the denominator could be as large as 8.) Therefore, computations of attained significance (p-values) are based on the restricted permutation distribution of $t_{BFW}$. For each of the $N$ images, we can permute the results from the two levels [A → B and B → A] or not. There are $2^N$ points possible in the full permutation distribution, and we calculate $t_{BFW}$ for each one. The motivation for the permutation distribution is that if there were no difference between the bit rates, then in computing the differences $\Delta_i$, it should not matter whether we compute level A − level B or vice versa, and we would not expect the "real" $t_{BFW}$ to be an extreme value among the $2^N$ values. If $k$ is the number of permuted $t_{BFW}$ values that exceed the "real" one, then $(k + 1)/2^N$ is the attained one-sided significance level for the test of the null hypothesis that the lower bit rate performs at least as well as the higher one. As discussed later, the one-sided test of significance is chosen to be conservative and to argue most strongly against compression.
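The permutation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the per-image differences are assumed to be "higher bit rate minus lower bit rate", and `stand_in_statistic` is a simple placeholder, not the Behrens-Fisher-Welch statistic itself, whose definition is not repeated in this section. Swapping the two levels for an image is modeled as flipping the sign of that image's difference.

```python
from itertools import product
import math


def stand_in_statistic(deltas):
    """Hypothetical placeholder for t_BFW (NOT the statistic from the text):
    sum of the differences divided by the root of their sum of squares."""
    denom = math.sqrt(sum(d * d for d in deltas))
    return sum(deltas) / denom if denom > 0 else 0.0


def one_sided_permutation_p(deltas, statistic=stand_in_statistic):
    """Attained one-sided significance level (k + 1) / 2**N.

    deltas: one difference per image (assumed: higher bit rate minus lower).
    Enumerates all 2**N swap/no-swap patterns, so it is practical only for
    modest N.
    """
    n = len(deltas)
    observed = statistic(deltas)
    k = 0  # number of permuted statistics strictly exceeding the observed one
    for signs in product((1, -1), repeat=n):
        permuted = [s * d for s, d in zip(signs, deltas)]
        if statistic(permuted) > observed:
            k += 1
    return (k + 1) / 2 ** n


if __name__ == "__main__":
    # Made-up differences with denominators of at most 4, as in the text.
    example = [0.25, 0.5, 0.25, 0.0, 0.5]
    print(one_sided_permutation_p(example))
```

With all differences non-negative, no sign pattern can exceed the observed value, so $k = 0$ and the smallest attainable level for five images is $1/2^5$. Full enumeration is chosen here for clarity; for large $N$ a random sample of sign patterns would be the usual substitute.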

When the judges were evaluated separately, level A (the lowest bit rate) was found to be significantly different at the 5% level against most of the other levels for two of the judges, for both lung and mediastinum sensitivity. No differences were found among levels B through G. There were no significant differences found between any pair of levels for PVP. When judges were pooled, more significant differences were found. Level A was generally inferior to the other levels for both lung and mediastinal sensitivity. Also, levels B and C differed from level G for lung sensitivity (p = 0.016 for both) and levels B and C differed from level G for mediastinal sensitivity (p = 0.008 and 0.016, respectively). For PVP, no differences were found against level A with the exception of A vs E and F for the lungs (p = 0.039 and 0.012, respectively), but B was somewhat different from C for the lungs (p = 0.031), and C was different from E, F, and G for the mediastinum (p = 0.016, 0.048, and 0.027, respectively).

Using the consensus gold standard, the results indicate that level A (0.56 bpp) is unacceptable for diagnostic use. Since the blocking and prediction artifacts became quite noticeable at level A, the judges tended not to attempt to mark any abnormality unless they were quite sure it was there. This explains the initially surprising result that level A did well for PVP, but very poorly for sensitivity. Since no differences were found among levels D (1.8 bpp), E (2.2 bpp), F (2.64 bpp), and G (original images at 12 bpp), despite the biases against compression contained in our analysis methods, these three compressed levels are clearly acceptable for diagnostic use in our applications. The decision concerning levels B (1.18 bpp) and C (1.34 bpp) is less clear, and would require further tests involving a larger number of detection tasks, more judges, or use of a different gold standard that in principle could remove at least one of the biases against compression that are present in this study.

Since the personal gold standard has the advantage of using all the images in the study, and the consensus gold standard has the advantage of having little bias between original and compressed images, we can capitalize on both sets of advantages with a two-step comparison. Sensitivity and PVP values relative to the consensus gold standard show there to be no significant differences between the slightly compressed images (levels D, E, and F) and the originals. This is true for both disease categories, for judges evaluated separately and pooled, and using both the Behrens-Fisher-Welch test to examine the sensitivity and PVP separately and using the McNemar test (discussed in the next chapter) to examine them in combination. With this assurance, the personal gold standard can then be used to look for differences between the more compressed levels (A, B, C) and the less compressed ones (D, E, F). The most compressed level A (0.56 bpp, 21:1 compression ratio) is unacceptable as observations made on these images were significantly different from those on less compressed images for two judges. Level B (1.18 bpp) is also unacceptable, although barely so, because the only significant difference was between the sensitivities at levels B and F for a single disease category and a single judge. No differences were found between level C and the less compressed levels, nor were there any significant differences between levels D, E, and F.
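The quoted 21:1 ratio for level A follows from dividing the original bit rate by the compressed one. The short sketch below does this for every level, assuming (as the text states for level G) that the originals are stored at 12 bpp; the variable and function names are illustrative only.

```python
# Bit rates in bits per pixel (bpp), as quoted in the text.
ORIGINAL_BPP = 12.0
BIT_RATES = {"A": 0.56, "B": 1.18, "C": 1.34, "D": 1.8, "E": 2.2, "F": 2.64}


def compression_ratio(bpp, original_bpp=ORIGINAL_BPP):
    """Compression ratio relative to the uncompressed original."""
    return original_bpp / bpp


ratios = {level: compression_ratio(bpp) for level, bpp in BIT_RATES.items()}

if __name__ == "__main__":
    for level, ratio in ratios.items():
        print(f"level {level}: {BIT_RATES[level]} bpp, about {ratio:.1f}:1")
```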

In summary, using the consensus gold standard alone, the results indicate that levels D, E, and F are clearly acceptable for diagnostic use, level A is clearly unacceptable, and levels B and C are marginally unacceptable. Using the personal and consensus gold standard data jointly, the results indicate that levels C, D, E, and F are clearly acceptable for diagnostic use, level A is clearly unacceptable, and level B is marginally unacceptable.

We would like to conclude that there are some compression schemes whose implementation would not degrade clinical practice. To make this point, we must either use tests that are unbiased, or, acting as our own devil's advocates, use tests that are biased against compression. This criterion is met by the fact that the statistical approach described here contains four identifiable biases, none of which favors compression. The biases are as follows.

(1) As discussed in the previous chapter, the gold standard confers an advantage upon the original images relative to the compressed levels. This bias is mild in the case of the consensus gold standard, but severe in the case of the personal gold standard.

(2) There is a bias introduced by multiple comparisons [20]. Since (for each gold standard) we perform comparisons for all possible pairs out of the 7 levels, for both sensitivity and PVP, for both lung and mediastinal images, and both for 3 judges separately and for judges pooled, we are reporting on 21 × 2 × 2 × 4 = 336 tests for each gold standard. One would expect that, even if there were no effect of compression upon diagnosis, 5% of these comparisons would show significant differences at the 5% significance level.

(3) A third element that argues against compression is the use of a one-sided test instead of a two-sided test. In most contexts, for example when a new and old treatment are being compared and subjects on the new treatment do better than those on the old, we do a two-sided test of significance. Such two-sided tests implicitly account for both possibilities: that new interventions may make for better or worse outcomes than standard ones. For us, a two-sided test would implicitly recognize the possibility that compression improves, not degrades, clinical practice. In fact, we believe this can happen, but to incorporate such beliefs in our formulation of a test would make us less our own devil's advocates than would our use of a one-sided test. Our task is to find when compression might be used with clinical impunity, not when it might enhance images.

(4) The fourth bias stems from the fact that the summands in the numerator of $t_{BFW}$ may well be positively correlated (in the statistical sense), though we have no way to estimate this positive dependence from our data. If we did, the denominator of $t_{BFW}$ would typically be smaller, and such incorporation would make finding "significant" differences between compression levels more difficult.

For all of these reasons, we believe that the stated conclusions are conservative.
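The test count in bias (2) can be checked with a few lines of arithmetic. The sketch below reproduces the 336 tests per gold standard and the number of nominally significant results one would expect by chance alone. It assumes each test rejects a true null hypothesis with probability exactly 0.05; with a discrete permutation distribution the true rate can be somewhat lower, so the figure is best read as an upper guide. The variable names are illustrative.

```python
from math import comb

N_LEVELS = 7                 # levels A through G
pairs = comb(N_LEVELS, 2)    # all unordered pairs of levels
measures = 2                 # sensitivity and PVP
regions = 2                  # lung and mediastinum
judge_groupings = 4          # 3 judges separately, plus judges pooled

n_tests = pairs * measures * regions * judge_groupings

ALPHA = 0.05                 # nominal significance level (assumed exact)
# Expectation is linear, so this holds whether or not the tests are independent.
expected_by_chance = ALPHA * n_tests

if __name__ == "__main__":
    print(f"pairs of levels: {pairs}")
    print(f"tests per gold standard: {n_tests}")
    print(f"expected chance rejections: {expected_by_chance:.1f}")
```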
