## ROC Curve Analysis

It is important to note that the preceding table was produced by a case detection algorithm operating at a preselected, fixed "detection threshold." Classification algorithms, such as case detection algorithms, typically produce continuous output.

Figure 20.1 Classifier discrimination. Subjects are classified as positive when the output from the classifier exceeds a threshold, and negative otherwise. TP, true positive; FN, false negative; FP, false positive; TN, true negative.

The left-most curve in Figure 20.1 is the distribution of numeric test results for control patients, and the right-most curve is the distribution of numeric results for cases. We can understand the numeric results produced by a classification algorithm as a measurement of some property of the individual patients (e.g., temperature) that we are hoping will be useful for discriminating between individuals with disease and individuals without disease. The fact that the two curves in Figure 20.1 overlap indicates that the classifier cannot perfectly discriminate between cases and controls.

Since most classification algorithms produce numeric results, the evaluator must pick some threshold to convert the numeric output of a case detection algorithm (or diagnostic test) into a determination that an individual "has disease" or "does not have disease." Figure 20.1 shows an arbitrarily selected detection threshold of six (dashed line).
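A minimal sketch of this thresholding step, using made-up scores and case/control labels (the threshold of six echoes the dashed line in Figure 20.1; all data here are illustrative, not from the chapter's experiment):

```python
# Sketch: converting continuous classifier output into binary calls at a
# fixed detection threshold, then tallying the four cells of the 2 x 2 table.

def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN at a given detection threshold.

    labels: 1 = case (disease), 0 = control (no disease).
    A subject is called positive when its score exceeds the threshold.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Illustrative data: some case and control scores overlap, as in Figure 20.1.
scores = [3.2, 7.0, 5.5, 8.9, 4.0, 6.5, 9.8, 5.9]
labels = [0,   0,   1,   1,   0,   1,   1,   0]

tp, fp, tn, fn = confusion_counts(scores, labels, threshold=6)
sensitivity = tp / (tp + fn)   # fraction of cases correctly classified
specificity = tn / (tn + fp)   # fraction of controls correctly classified
print(tp, fp, tn, fn, sensitivity, specificity)
```

Because one control scores above the threshold and one case scores below it, the counts land in all four cells, just as the overlapping curves in Figure 20.1 imply.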

The results of the experiment—including the counts in the 2 × 2 table above and the measured sensitivity and specificity—are highly dependent on the threshold selected by the evaluator. In fact, the areas under the curves labeled TN, FN, TP, and FP in Figure 20.1 correspond to the numbers that go into each of the four cells in the above table.

Note that if the evaluator were to change the detection threshold in Figure 20.1 by moving it to the left or to the right, the areas under the curves corresponding to TN, FN, FP, and TP would all change, as would the numbers in the table and the sensitivity and specificity. As she increases the threshold (by moving it to the right), the sensitivity (fraction of the disease cases that are correctly classified) decreases, but the specificity (fraction of controls that are correctly classified) increases. Since both sensitivity and specificity are desirable properties of a classifier, there is an inherent tradeoff between sensitivity and specificity when setting the threshold for a classifier.

Since an evaluator does not know whether a future user of a classifier will prefer sensitivity to specificity, she explores a range of thresholds spanning the overlap between the two curves (thresholds from five to approximately 12 in Figure 20.1), recording the sensitivity and specificity at each threshold and plotting a graph of sensitivity versus specificity. This type of graph is called a receiver operating characteristic (ROC) curve (Egan, 1975; Metz, 1978; Lusted, 1971). Note that most evaluators plot 1 − specificity, which as we previously discussed is the false-alarm rate, on the horizontal axis. Their reason is that users of classifiers are most interested in the tradeoff between the value of detecting cases (sensitivity) and the cost and other consequences of misdiagnosing healthy people.
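The threshold sweep described above can be sketched as follows. The scores and labels are made up, and the threshold range of 5 to 12 mirrors the overlap region cited for Figure 20.1:

```python
# Sketch: sweeping the detection threshold across the overlap region and
# recording (false-alarm rate, sensitivity) at each setting. Plotting these
# points, with 1 - specificity on the horizontal axis, traces an ROC curve.

def roc_points(scores, labels, thresholds):
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 0)
        sensitivity = tp / (tp + fn)
        false_alarm_rate = fp / (fp + tn)   # 1 - specificity
        points.append((false_alarm_rate, sensitivity))
    return points

# Illustrative data: five controls followed by five cases.
scores = [4.1, 5.6, 6.2, 7.8, 5.0, 8.3, 9.1, 11.2, 6.9, 10.4]
labels = [0,   0,   0,   0,   0,   1,   1,   1,    1,   1]

for far, sens in roc_points(scores, labels, thresholds=range(5, 13)):
    print(f"false-alarm rate = {far:.2f}, sensitivity = {sens:.2f}")
```

Running the sweep shows both quantities falling as the threshold rises, which is exactly the tradeoff the text describes: a high threshold misses cases (low sensitivity) but rarely flags controls (low false-alarm rate).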

In Figure 20.2, an evaluator has plotted two ROC curves, corresponding to the results of evaluations of two different case detection algorithms. The evaluator would conclude that Algorithm 1 is a better classifier than Algorithm 2 for the hypothetical disease in question because at every value of false-alarm rate, Algorithm 1 has better sensitivity than Algorithm 2.

Figure 20.2 ROC curves from a hypothetical experiment comparing two algorithms.

Evaluators use the area under an ROC curve as an overall measure of the classification accuracy of a classifier. The technical term they use is area under (the ROC) curve (AUC). AUC has a very natural interpretation: if the evaluator selected one positive patient and one negative patient and asked the classifier to compare the two as to which is more likely to be infected with SARS, then the AUC is the probability that the classifier would correctly rank the positive patient higher than the negative patient. A perfect classifier has an AUC of 1.0, and a classifier that guesses at random has an AUC of 0.5, which corresponds to a diagonal ROC curve from the point (0,0) to the point (1,1) in Figure 20.2.