FIGURE 12 Subjective scores: Lossy compressed digital at 0.15 bpp. Reprinted with permission from S.M. Perlmutter, P.C. Cosman, R.M. Gray, R.A. Olshen, D. Ikeda, C.N. Adams, B.J. Betts, M. Williams, K.O. Perlmutter, J. Li, A. Aiyer, L. Fajardo, R. Birdwell, and B.L. Daniel, Image Quality in Lossy Compressed Digital Mammograms, Signal Processing, 59:189-210, 1997. © Elsevier.

Using the Wilcoxon signed rank test, the results were as follows.

Judge A: All levels were significantly different from each other except the digital to 0.4 bpp, digital to 1.75 bpp, and 0.4 to 1.75 bpp.

Judge B: The only differences that were significant were 0.15 bpp to 0.4 bpp and 0.15 bpp to digital.

Judge C: All differences were significant.

All judges pooled: All differences were significant except digital to 0.15 bpp, digital to 1.75 bpp, 0.15 to 0.4 bpp, and 0.15 to 1.75 bpp.

Comparing differences from the independent gold standard, for judge A all were significant except digital uncompressed; for judge B all were significant; and for judge C all were significant except 1.75 bpp. When the judges were pooled, all differences were significant.
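As a rough illustration of the statistic behind these comparisons, the following sketch computes the Wilcoxon signed rank sums for paired ratings. The function name and the ratings are hypothetical and are not data from the study; zero differences are dropped and tied absolute differences receive average ranks, which is one common convention.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped; tied absolute differences share
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    # Indices of the differences, ordered by absolute value.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical ratings by one judge of the same six images at two levels.
level_1 = [4, 5, 3, 4, 5, 2]
level_2 = [3, 3, 3, 5, 2, 1]
w_plus, w_minus = wilcoxon_signed_rank(level_1, level_2)
```

The smaller of the two sums is then compared against the null distribution of the statistic (from tables or a normal approximation) to decide whether the difference between the two levels is significant.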

There were many statistically significant differences in subjective ratings between the analog and the various digital modalities, but some of these may have been a result of the different printing processes used to create the original analog films and the films printed from digital files. The films were clearly different in size and in background intensity. The judges in particular expressed dissatisfaction with the fact that the background in the digitally produced films was not as dark as that of the photographic films, even though this ideally had nothing to do with their diagnostic and management decisions.

6 Diagnostic Accuracy and ROC Methodology

Diagnostic "accuracy" is often used to mean the fraction of cases on which a physician is "correct," where correctness is determined by comparing the diagnostic decision to some definition of "truth." There are many different ways that "truth" can be determined, and this issue is discussed in Section 7. Apart from this issue, this simple definition of accuracy is flawed in two ways. First, it is strongly affected by disease prevalence. For a disease that appears in less than 1% of the population, a screening test could trivially be more than 99% accurate simply by ignoring all evidence and declaring the disease to be absent. Second, the notion of "correctness" does not distinguish between the two major types of errors: calling positive a case that is actually negative (a false positive), and calling negative a case that is actually positive (a false negative). The relative costs of these two types of errors are generally not equal. The two can be differentiated by measuring diagnostic performance with a pair of statistics reflecting the relative frequencies of the two error types.
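The prevalence effect can be made concrete with a small sketch. The population below is hypothetical (1000 cases, 5 of them diseased), and `accuracy` is an illustrative helper rather than anything defined in the text.

```python
def accuracy(decisions, truth):
    """Fraction of cases on which the decision matches the truth."""
    return sum(d == t for d, t in zip(decisions, truth)) / len(truth)


# Hypothetical screening population: 5 diseased cases out of 1000 (0.5%).
truth = [True] * 5 + [False] * 995

# A "test" that ignores all evidence and always declares the disease absent.
always_negative = [False] * len(truth)

acc = accuracy(always_negative, truth)
detected = sum(d and t for d, t in zip(always_negative, truth))
```

Here `acc` is 0.995 even though `detected` is 0: the rule is "correct" on 99.5% of cases while finding none of the diseased ones, which is why a single accuracy figure is not enough.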

Toward this end suppose for the moment that there exists a "gold standard" defining the "truth" of existence and locations of all lesions in a set of images. With each lesion identified in the gold standard, a radiologist either gets it correct (true positive or TP) or misses it (false negative or FN). For each lesion identified by the radiologist, either it agrees with the gold standard (TP as above) or it does not (false positive or FP).

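The sensitivity or true positive rate (or true positive fraction (TPF)) is the probability pTP that a lesion is said to be there given that it is there. This can be estimated by the relative frequency

$$\text{Sensitivity} = \frac{\#\text{TP}}{\#\text{TP} + \#\text{FN}},$$

where #TP and #FN denote the numbers of true positives and false negatives.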

The complement of sensitivity is the false negative rate (or fraction) pFN = 1 - pTP, the probability that a lesion is said to not be there given that it is there.
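A minimal sketch of these two estimates follows, assuming each lesion can be identified by an (image id, lesion label) pair so that a radiologist's mark matches a gold standard lesion exactly. That matching rule and all of the data are assumptions made for illustration; in practice some location tolerance would be needed.

```python
# Hypothetical gold standard lesions, keyed by (image id, lesion label).
gold = {("img1", "a"), ("img1", "b"), ("img2", "a"), ("img4", "a")}

# Hypothetical lesions marked by the radiologist.
marked = {("img1", "a"), ("img2", "a"), ("img3", "a")}

tp = len(gold & marked)  # gold standard lesions that were found
fn = len(gold - marked)  # gold standard lesions that were missed

sensitivity = tp / (tp + fn)
false_negative_rate = fn / (tp + fn)  # equals 1 - sensitivity
```

With these sets, two of the four gold standard lesions are found, so both the sensitivity and the false negative rate are 0.5.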

In an apparently similar vein, the false positive rate pFP (or false positive fraction (FPF)) is the probability that a lesion is said to be there given that it is not there and the true negative rate pTN or specificity is its complement. Here, however, it is not possible to define a meaningful relative frequency estimate of these probabilities except when the detection problem is binary, that is, each image can have only one lesion of a single type or no lesions at all. In this case, exactly one lesion is not there if and only if 0 lesions are present, and one can define a true negative TN as an image that is not a true positive. Hence if there are N images, the relative frequency becomes

$$\text{Specificity} = \frac{\#\text{TN}}{\#\text{TN} + \#\text{FP}}, \qquad \#\text{TN} + \#\text{FP} = N - (\#\text{TP} + \#\text{FN}).$$

As discussed later, in the nonbinary case, however, specificity cannot be defined in a meaningful fashion on an image-by-image basis.

In the binary case, specificity shares importance with sensitivity because perfect sensitivity alone does not preclude numerous false alarms, while specificity near 1 ensures that missing no tumors does not come at the expense of calling false ones.
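The point can be illustrated with a short sketch for the binary case, taking specificity as the number of true negatives divided by the number of lesion-free images, which is the usual definition. The helper name and the ten-image data set are hypothetical.

```python
def sensitivity_specificity(truth, calls):
    """Binary case: each image has a lesion (True) or none (False)."""
    tp = sum(t and c for t, c in zip(truth, calls))
    fn = sum(t and not c for t, c in zip(truth, calls))
    tn = sum(not t and not c for t, c in zip(truth, calls))
    fp = sum(not t and c for t, c in zip(truth, calls))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical set of 10 images, 3 of which contain a lesion.
truth = [True] * 3 + [False] * 7

# Reader 1 calls every image positive: no misses, but all false alarms.
call_all = [True] * 10
se1, sp1 = sensitivity_specificity(truth, call_all)

# Reader 2 finds every lesion and raises a single false alarm.
careful = [True] * 3 + [True] + [False] * 6
se2, sp2 = sensitivity_specificity(truth, careful)
```

Both readers have sensitivity 1, but only the specificity (0 for the first reader, 6/7 for the second) reveals how many false alarms were paid for it.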

An alternative statistic that is well defined in the nonbinary case and also penalizes false alarms is the predictive value positive (PVP), also known as positive predictive value (PPV) [64]. This is the probability that a lesion is there given that it is said to be there, which can be estimated by the relative frequency

$$\text{PVP} = \frac{\#\text{TP}}{\#\text{TP} + \#\text{FP}}.$$
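The following sketch estimates PVP as the fraction of the radiologist's marks that agree with the gold standard, in a nonbinary setting where images may hold several lesions. The lesion keys and the data are hypothetical, and exact matching of marks to gold standard lesions is assumed for simplicity.

```python
# Hypothetical gold standard: two lesions in total.
gold = {("img1", "a"), ("img2", "a")}

# A reader who marks both true lesions plus eight spurious ones.
false_marks = {("img%d" % i, "x") for i in range(3, 11)}
marked = gold | false_marks

tp = len(gold & marked)    # marks that agree with the gold standard
fn = len(gold - marked)    # gold standard lesions that were missed
fp = len(marked - gold)    # marks with no gold standard counterpart

sensitivity = tp / (tp + fn)
pvp = tp / (tp + fp)
```

This reader has sensitivity 1.0 but a PVP of only 0.2, so PVP penalizes the false alarms without requiring any count of true negatives.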
