FIGURE 9 Mean apme vs mean bit rate using the personal gold standard. The dotted, dashed, and dash-dot curves are quadratic splines fit to the data points for judges 1, 2, and 3, respectively. The solid curve is a quadratic spline fit to the data points for all judges pooled.

judge 3 overmeasured at the compressed bit rates with respect to the personal gold standard.

The t-test results indicate that the pme at levels 1 (0.36 bpp) and 4 (1.14 bpp) differs significantly from the personal gold standard. The results of the Wilcoxon signed rank test on percent measurement error using the personal gold standard are similar to those obtained with the independent gold standard. In particular, only level 1 at 0.36 bpp differed significantly from the originals. Furthermore, levels 1, 3, and 4 were significantly different from level 5.

FIGURE 10 Apme vs actual bit rate using the personal gold standard. The x's indicate data points for all images, pooled across judges and compression levels. The solid curve is a quadratic spline fit to the data.

Since the t-test indicates that some results are marginally significant when the Wilcoxon signed rank test indicates the results are not significant, a Bonferroni simultaneous test (union bound) was constructed. This technique uses the significance levels of two different tests to obtain a significance level that is simultaneously applicable to both. For example, to obtain a simultaneous significance level of α% with two tests, we could set the significance level of each test to (α/2)%. With the simultaneous test, the pme at level 4 (1.14 bpp) is not significantly different from the uncompressed level. As such, the simultaneous test indicates that only level 1 (0.36 bpp) has significantly different pme from the uncompressed level. This agrees with the corresponding result using the independent gold standard. Thus, pme at compression levels down to 0.55 bpp does not seem to differ significantly from the pme at the 9.0 bpp original.
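The union-bound construction described above can be sketched in a few lines of Python. This is only an illustration: the function name and the p-values below are hypothetical and are not taken from the study; the one thing carried over from the text is the rule that each of the two tests is run at half the desired simultaneous level.

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Apply a Bonferroni (union bound) correction to a family of tests.

    p_values: dict mapping a test name to its p-value (hypothetical inputs).
    alpha:    desired simultaneous significance level for the whole family.
    Returns the per-test threshold and a dict of reject/keep decisions.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    per_test_alpha = alpha / m  # with two tests this is alpha / 2
    decisions = {name: p <= per_test_alpha for name, p in p_values.items()}
    return per_test_alpha, decisions


# Illustrative p-values only; they are NOT the study's results.
marginal = {"t_test": 0.04, "wilcoxon": 0.20}   # marginal on one test alone
clear = {"t_test": 0.001, "wilcoxon": 0.004}    # clear on both tests

threshold, marginal_decisions = bonferroni_decisions(marginal)
_, clear_decisions = bonferroni_decisions(clear)
print(threshold, marginal_decisions, clear_decisions)
```

Under this rule a result that is only marginally significant on one test at the 5% level, as in the first hypothetical case, is no longer declared significant, which mirrors how the disagreement at level 4 was resolved.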

In summary, with both the independent and personal gold standards, the t-test and the Wilcoxon signed rank test indicate that pme at compression levels down to 0.55 bpp did not differ significantly from the pme at the 9.0 bpp original. This was shown to be true for the independent gold standard by a direct application of the tests. For the personal gold standard, this was resolved by using the Bonferroni test for simultaneous validity of multiple analyses. The status of measurement accuracy at 0.36 bpp remains unclear, with the t-test concluding no difference and the Wilcoxon indicating significant difference in pme from the original with the independent gold standard, and both tests indicating significant difference in pme from the original with the personal gold standard. Since the model for the t-test is fitted only fairly to moderately well by the data, we lean towards the more conservative conclusion that lossy compression by our vector quantization compression method is not a cause of significant measurement error at bit rates ranging from 9.0 bpp down to 0.55 bpp, but it does introduce error at 0.36 bpp.

A radiologist's subjective perception of quality changes more rapidly and drastically with decreasing bit rate than does the actual measurement error. Radiologists evidently believe that the usefulness of images for measurement tasks degrades rapidly with decreasing bit rate. However, their actual measurement performance on the images was shown by both the t-test and Wilcoxon signed rank test (or the Bonferroni simultaneous test to resolve differences between the two) to remain consistently high down to 0.55 bpp. Thus, the radiologist's opinion of an image's diagnostic utility seems not to coincide with its utility for the clinical purpose for which the image was taken. The radiologist's subjective opinion of an image's usefulness for diagnosis should not be used as the sole predictor of actual usefulness.

3.2 Discussion

There are issues of bias and variability to consider in comparing and contrasting gold standards. One disadvantage of an independent gold standard is that since it is determined by the measurements of radiologists who do not judge the compressed images, significant differences between a compressed level and the originals may be due to differences between judges. For example, a biased judge who tends to overmeasure at all bit rates may have high pme that will not be entirely reflective of the effects of compression. In our study, we determined that two judges consistently overmeasured relative to the independent gold standard. The personal gold standard, however, overcomes this difficulty. A personal gold standard also has the advantage of reducing pme and apme at the compressed levels. This will result in a clarification of trends in a judge's performance across different compression levels. Differences will be based solely on compression level and not on differences between judges. Another argument in favor of a personal gold standard is that in some clinical settings a fundamental question is how the reports of a radiologist whose information is gathered from compressed images compare to what they would have been on the originals. Indeed, systematic biases of a radiologist are sometimes well recognized and corrected for by the referring physicians.

One disadvantage with the personal gold standard, however, is that by defining the measurements on the original images to be "correct," we are not accounting for the inherent variability of a judge's measurement on an uncompressed image. For example, if a judge makes an inaccurate measurement on the original and accurate measurements on the compressed images, these correct measurements will be interpreted as incorrect. Thus the method is biased against compression. An independent gold standard reduces the possibility of this situation occurring since we need an agreement by two independent radiologists on the "correct" measurement.
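The bias against compression described above can be made concrete with a small numerical sketch. It assumes that pme is the signed percent deviation of a measurement from the gold standard, 100 × (measurement − gold) / gold, and that apme is its absolute value; the vessel sizes are invented for illustration and do not come from the study.

```python
def pme(measurement, gold):
    """Percent measurement error (assumed definition): signed % deviation from gold."""
    return 100.0 * (measurement - gold) / gold


def apme(measurement, gold):
    """Absolute percent measurement error (assumed definition)."""
    return abs(pme(measurement, gold))


# Hypothetical values (in mm) for one vessel and one judge.
independent_gold = 5.0        # consensus of the independent radiologists
judge_on_original = 5.5       # the judge overmeasures on the original image
judge_on_compressed = 5.0     # the judge is accurate on the compressed image

# Against the independent gold standard, the compressed reading is error-free.
print(pme(judge_on_compressed, independent_gold))

# Against the personal gold standard (the judge's own reading of the original),
# the same accurate reading is scored as an error of about -9.1%.
print(pme(judge_on_compressed, judge_on_original))
```

In this hypothetical case the only inaccurate measurement is the one made on the original, yet under the personal gold standard the error is charged to the compressed image.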

The analysis previously presented was based on judges, vessels, and images pooled. Other analyses in which the performances of judges on particular vessels and images are separated demonstrate additional variability. Judges seem to have performed significantly differently from each other. Judges 2 and 3 consistently overmeasured. As a result, the Wilcoxon signed rank test using the independent gold standard indicates significant differences between the gold standard and the measurements of judges 2 and 3 at all compression levels, including the original. Judge 1, however, does not have any significant performance differences between the gold standard and any compression levels. In addition, certain vessels and images had greater variability in pme than others. To examine the validity of pooling the results of all judges, vessels, and images, an analysis of variance (ANOVA) [21] was used to assess whether this variability is significant. The ANOVA took the judges, vessels, and images to be random effects and the levels to be fixed effects, and separated out the variance due to each effect. For technical reasons it is not feasible here to use direct F-tests on each of the variances estimated. Thus, we obtained confidence regions for each component of variance using a jackknife technique [21]. In particular, if zero falls within the 95% confidence interval of a certain effect, then the effect is not considered significant at the 5% level. Using the jackknife technique, the ANOVA indicates that the variability due to judges, vessels, images, and levels was not significantly different from zero, thereby validating the pooling.
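The decision rule above (an effect is not significant if zero lies inside its 95% confidence interval) can be illustrated with a generic delete-one jackknife. The sketch below is not the variance-components procedure of [21]: it jackknifes a plain sample variance over hypothetical per-unit values and uses a normal approximation for the interval, both of which are simplifying assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev, variance


def jackknife_ci(units, statistic, confidence=0.95):
    """Delete-one jackknife estimate and normal-approximation confidence interval.

    units:     list of per-unit observations (hypothetical data).
    statistic: function mapping a list of units to a number.
    """
    n = len(units)
    if n < 3:
        raise ValueError("need at least three units")
    full = statistic(units)
    # Pseudo-values: n * full estimate - (n - 1) * leave-one-out estimate.
    pseudo = [n * full - (n - 1) * statistic(units[:i] + units[i + 1:])
              for i in range(n)]
    estimate = mean(pseudo)
    std_err = stdev(pseudo) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return estimate, (estimate - z * std_err, estimate + z * std_err)


def is_significant(interval):
    """An effect is treated as significant only if zero lies outside the interval."""
    low, high = interval
    return not (low <= 0.0 <= high)


# Hypothetical per-unit values; NOT data from the study.
values = [1.0, 2.0, 3.0, 4.0, 10.0]
estimate, interval = jackknife_ci(values, variance)
print(estimate, interval, is_significant(interval))
```

A full analysis would apply the same leave-one-out idea to each variance component of the mixed-effects ANOVA rather than to a single sample variance.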
