Sensitivity, PVP, and, when it makes sense, specificity can be estimated from clinical trial data and provide an indication of the quality of detection. The next issues are these:

(1) How does one design and conduct clinical experiments to estimate these statistics?

(2) How are these statistics used in order to make judgments about diagnostic accuracy?

Together, the responses to these questions form a protocol for evaluating diagnostic accuracy and drawing conclusions on the relative merits of competing image processing techniques. Before describing the dominant methodology used, it is useful to formulate several attributes that a protocol might reasonably be expected to have:

• The protocol should simulate ordinary clinical practice as closely as possible. Participating radiologists should perform in a manner that mimics their ordinary practice. The trials should require little or no special training of their clinical participants.

• The clinical trials should include examples of images containing the full range of possible anomalies, all but extremely rare conditions.

• The findings should be reportable using the American College of Radiology (ACR) Standardized Lexicon.

• Statistical analyses of the trial outcomes should be based on assumptions as to the outcomes and sources of error that are faithful to the clinical scenario and tasks.

• The number of patients should be sufficient to ensure satisfactory size and power for the principal statistical tests of interest.

• "Gold standards" for evaluation of equivalence or superiority of algorithms must be clearly defined and consistent with experimental hypotheses.

• Careful experimental design should eliminate or minimize any sources of bias in the data that are due to differences between the experimental situation and ordinary clinical practice, e.g., learning effects that might accrue if a similar image is seen using separate imaging modalities.

Receiver operating characteristic (ROC) analysis is the dominant technique for evaluating the suitability of radiologic techniques for real applications [26,38,39,61]. ROC analysis has its origins in signal detection theory. A filtered version of a signal plus Gaussian noise is sampled and compared to a threshold. If the sample is greater than the threshold, the signal is declared to be there; otherwise, it is declared absent. As the threshold is varied in one direction, the probability of erroneously declaring a signal absent when it is there (a false dismissal) goes down, but the probability of erroneously declaring a signal there when it is not (a false alarm) goes up.

Suppose one has a large database of waveforms, some of which actually contain a signal, and some of which do not. Suppose further that for each waveform, the "truth" is known of whether a signal is present or not. One can set a value of the threshold and examine whether the test declares a signal present or not for each waveform. Each value of the threshold will give rise to a pair (TPF, FPF), and these points can be plotted for many different values of the threshold. The ROC curve is a smooth curve fitted through these points.

The ROC curve always passes through the point (1,1) because if the threshold is taken to be lower than the lowest value of any waveform, then all samples will be above the threshold, and the signal will be declared present for all waveforms. In that case, the true positive fraction is 1. The false positive fraction is also equal to 1, since there are no true negative decisions. Similar reasoning shows that the ROC curve must also always pass through the point (0, 0), because the threshold can be set very large, and all cases will be declared negative. A variety of summary statistics such as the area under the ROC curve can be computed and interpreted to compare the quality of different detection techniques. In general, larger area under the ROC curve is better.
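The threshold sweep described above can be sketched in a few lines of Python. This is a minimal illustration, not the fitting procedure used in the cited studies: the function names and the six example scores are hypothetical, the points are joined by straight lines rather than a fitted smooth curve, and the area is computed with the trapezoidal rule. Points are stored as (FPF, TPF) so that the first coordinate is the horizontal axis.

```python
def roc_points(scores, truth):
    """Sweep the threshold from strict to lax and collect (FPF, TPF) pairs.

    scores: one number per waveform (larger means "more signal-like").
    truth:  1 if the waveform really contains a signal, 0 otherwise.
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    # Threshold above every score: all cases declared negative.
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        # A case is declared positive when its score reaches the threshold.
        tp = sum(1 for s, y in zip(scores, truth) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, truth) if s >= t and not y)
        points.append((fp / n_neg, tp / n_pos))
    # The last threshold equals the lowest score, so the final point is (1, 1).
    return points


def trapezoid_area(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical data: three waveforms with a signal, three without.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
truth = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, truth)
area = trapezoid_area(points)  # 8/9 for this toy data set
```

With these toy values, eight of the nine (positive, negative) pairs are ranked correctly, which is why the empirical area comes out to 8/9.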

ROC analysis has a natural application to some problems in medical diagnosis. For example, in a blood serum assay of carbohydrate antigens (e.g., CA 125 or CA 19-9) to detect the presence of certain types of cancer, a single number results from the diagnostic test. The distributions of result values in actually positive and actually negative patients overlap. So no single threshold or decision criterion can be found that separates the populations cleanly. If the distributions did not overlap, then such a threshold would exist, and the test would be perfect. In the usual case of overlapping distributions, a threshold must be chosen, and each possible choice of threshold will yield different frequencies of the two types of errors. By varying the threshold and calculating the false alarm rate and false dismissal rate for each value of the threshold, an ROC curve is obtained.

Transferring this type of analysis to radiological applications requires the creation of some form of threshold whose variation allows a similar trade-off. For studies of the diagnostic accuracy of processed images, this is accomplished by asking radiologists to provide a subjective confidence rating of their diagnoses (typically on a scale of 1-5) [39,61]. An example of such ratings is shown in Table 5.

First, only those responses in the category of highest certainty of a positive case are considered positive. This yields a pair (TPF, FPF) that can be plotted in ROC space and corresponds to a stringent threshold for detection. Next, those cases in either of the highest two categories of certainty of a positive decision are counted positive. Another (TPF, FPF) point is obtained, and so forth. The last nontrivial point is obtained by scoring any case as positive if it corresponds to any of the highest four categories of certainty for being positive. This corresponds to a very lax threshold for detection of disease. There are also two trivial (TPF, FPF) points that can be obtained, as discussed previously: All cases can be declared negative (TPF = 0, FPF = 0) or all cases can be declared positive (TPF = 1, FPF = 1).
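The cumulative scoring just described can be sketched as follows. The ratings and truth labels are invented for illustration, and the function name is hypothetical; the code assumes the 5-point scale of Table 5, with 5 the highest certainty of a positive case.

```python
def rating_roc_points(ratings, truth, n_categories=5):
    """Turn confidence ratings into (FPF, TPF) operating points.

    ratings: integer confidence per case, 1 (negative) to n_categories (positive).
    truth:   1 if the case is actually positive, 0 otherwise.
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]  # trivial point: every case declared negative
    # Cutoffs 5, 4, 3, 2: the stringent threshold first, the lax one last.
    for cutoff in range(n_categories, 1, -1):
        tp = sum(1 for r, y in zip(ratings, truth) if r >= cutoff and y)
        fp = sum(1 for r, y in zip(ratings, truth) if r >= cutoff and not y)
        points.append((fp / n_neg, tp / n_pos))
    points.append((1.0, 1.0))  # trivial point: every case declared positive
    return points


# Hypothetical reader data: four positive and four negative cases.
ratings = [5, 4, 4, 2, 3, 1, 2, 5]
truth = [1, 1, 0, 1, 0, 0, 0, 1]
rating_points = rating_roc_points(ratings, truth)
# [(0.0, 0.0), (0.0, 0.5), (0.25, 0.75), (0.5, 0.75), (0.75, 1.0), (1.0, 1.0)]
```

A 5-point scale therefore yields four nontrivial operating points plus the two trivial ones.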

This type of analysis has been used extensively to examine the effects of computer processing on the diagnostic utility of medical images. Types of processing that have been evaluated include compression [6,9,14,21,22,30,35,56,65] and enhancement (unsharp masking, histogram equalization, and noise reduction).

TABLE 5 Subjective confidence ratings used in ROC analysis

1  Definitely or almost definitely negative
2  Probably negative
3  Possibly negative
4  Probably positive
5  Definitely or almost definitely positive

Although by far the dominant technique for quantifying diagnostic accuracy in radiology, ROC analysis possesses several shortcomings for this application. In particular, it violates several of the stated goals for a clinical protocol. By and large, the necessity for the radiologists to choose 1 of 5 specific values to indicate confidence departs from ordinary clinical practice. Although radiologists are generally cognizant of differing levels of confidence in their findings, this uncertainty is often represented in a variety of qualitative ways, rather than with a numerical ranking. Further, as image data are non-Gaussian, methods that rely on Gaussian assumptions are suspect. Modern computer-intensive statistical sample reuse techniques can help get around the failures of Gaussian assumptions. Classical ROC analysis is not location specific. The case in which an observer misses the lesion that is present in an image but mistakenly identifies some noise feature as a lesion in that image would be scored as a true-positive event.

Most importantly, many clinical detection tasks are nonbinary, in which case sensitivity can be suitably redefined, but specificity cannot. That is, sensitivity as defined in Eq. (3) yields a fractional number for the whole data set. But for any one image, sensitivity takes on only the values 0 and 1. The sensitivity for the whole data set is then the average value of these binary-valued sensitivities defined for individual images. When the detection task for each image becomes nonbinary, it is possible to redefine sensitivity for an individual image:

$$\text{Sensitivity} = \frac{\text{\# of true positive decisions within 1 image}}{\text{\# of actually positive items in that 1 image}}$$

Or, changing the language slightly,

$$\text{Sensitivity} = \frac{\text{\# of abnormalities correctly found}}{\text{\# of abnormalities actually there}}$$

In this case, the sensitivity for each individual image becomes a fractional number between 0 and 1, and the sensitivity for the entire data set is still the average of these sensitivities defined for individual images. A similar attempt to redefine the specificity leads to

$$\text{Specificity} = \frac{\text{\# of abnormalities correctly said not to be there}}{\text{\# of abnormalities actually not there}}$$

This does not make sense because it has no natural or sensible denominator, as it is not possible to say how many abnormalities are absent. This definition is fine for a truly binary diagnostic task such as detection of a pneumothorax, for if the image is normal, then exactly one abnormality is absent. Early studies were able to use ROC analysis by focusing on detection tasks that were either truly binary or that could be rendered binary. For example, a nonbinary detection task such as "locating any and all abnormalities that are present" can be rendered binary simply by rephrasing the task as one of "declaring whether or not disease is present." Otherwise, such a nonbinary task is not amenable to traditional ROC analysis techniques. Extensions to ROC to permit consideration of multiple abnormalities have been developed [11-13,18,59]. For example, the free-response receiver operating characteristic (FROC) observer performance experiment allows an arbitrary number of abnormalities per image, and the observer indicates their perceived locations and a confidence rating for each one. While FROC resolves the binary task limitations and location insensitivity of traditional ROC, FROC does retain the constrained 5-point integer rating system for observer confidence, and makes certain normality assumptions about the resultant data. Finally, ROC analysis has no natural extension to the evaluation of measurement accuracy in compressed medical images. By means of specific examples we describe an approach that closely simulates ordinary clinical practice, applies to nonbinary and non-Gaussian data, and extends naturally to measurement data.
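As a small illustration of the per-image sensitivity defined earlier, the sketch below averages the fraction of abnormalities found in each image. The function names and counts are hypothetical. One assumption goes beyond the text: an image with no abnormalities has an undefined per-image sensitivity (zero denominator), and this sketch simply leaves such images out of the average.

```python
def image_sensitivity(n_found, n_present):
    """Fraction of the abnormalities in one image that were correctly found.

    Returns None when the image contains no abnormalities, because the
    ratio then has a zero denominator (an assumption of this sketch).
    """
    if n_present == 0:
        return None
    return n_found / n_present


def dataset_sensitivity(images):
    """Average the per-image sensitivities over images where they are defined."""
    values = [image_sensitivity(found, present) for found, present in images]
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined)


# Hypothetical study: (abnormalities correctly found, abnormalities actually there).
images = [(2, 2), (1, 3), (0, 1), (0, 0)]
overall = dataset_sensitivity(images)  # (1 + 1/3 + 0) / 3 = 4/9
```

No analogous function can be written for specificity in the nonbinary case, since the count of "abnormalities actually not there" that it would need as a denominator is not defined.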

The recent Stanford Ph.D. thesis by Bradley J. Betts, mentioned earlier, includes new technologies for analyses of ROC curves. His focus is on regions of interest of the curve, that is, on the intersection of the area under the curve with rectangles determined by explicit lower bounds on sensitivity and specificity. He has developed sample reuse techniques for making inferences concerning the areas enclosed and also for constructing rectangular confidence regions for points on the curve.
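To give a flavor of what a sample reuse technique looks like, here is a generic percentile bootstrap for the area under an empirical ROC curve. It is not the method of the thesis, whose details are not given here; it is a simplified sketch that assumes independent cases, resamples positive and negative cases separately, and uses hypothetical names and data throughout.

```python
import random


def auc(pos, neg):
    """Empirical area under the ROC curve (Mann-Whitney form).

    Counts the fraction of (positive, negative) pairs in which the positive
    case scores higher; ties count as one half.
    """
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_interval(pos, neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (simplified sketch)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample cases with replacement within each truth class.
        p = rng.choices(pos, k=len(pos))
        n = rng.choices(neg, k=len(neg))
        stats.append(auc(p, n))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical scores for actually positive and actually negative cases.
pos_scores = [0.9, 0.8, 0.6, 0.75, 0.55]
neg_scores = [0.7, 0.4, 0.3, 0.5, 0.2]
estimate = auc(pos_scores, neg_scores)
lo, hi = bootstrap_auc_interval(pos_scores, neg_scores)
```

The interval makes no Gaussian assumption about the scores. A real reader study would also have to account for correlation between readers and between modalities, which this sketch ignores.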
