## Annex Algorithm Evaluation

Whenever objects are detected automatically, the performance of the algorithm has to be evaluated. In the medical domain, results are normally compared to the results obtained by one or more specialists.

Let us consider a medical examination (diagnostic test). Often, such a test can only be positive or negative (the patient suffers from the disease or not). In order to evaluate the efficiency of this diagnostic test, its result is compared to reality; "the truth" is found by other diagnostic methods. For this comparison, we define:

• True Positive (TP): The patient suffers from the disease and the test was positive.

• False Positive (FP): The patient does not suffer from the disease, but the test was positive.

• True Negative (TN): The patient does not suffer from the disease, and the test was negative.

• False Negative (FN): The patient suffers from the disease, but the test was negative.

With these definitions, we can evaluate the performance of a diagnostic test by means of sensitivity and specificity, defined as

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}$$

TP + FN is the number of patients suffering from the disease and TN + FP is the number of patients not suffering from the disease; the sensitivity is the percentage of detected cases of the disease and the specificity is the percentage of correctly classified healthy persons.
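As a minimal sketch of these two ratios, the following Python snippet computes sensitivity and specificity from the four counts. The function names and the example counts are hypothetical and chosen only for illustration; the sketch assumes that at least one diseased and one healthy patient are present, so neither denominator is zero.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of diseased patients that the test detects: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of healthy patients correctly classified: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only (not data from the text).
tp, fp, tn, fn = 45, 10, 90, 5

sens = sensitivity(tp, fn)  # 45 diseased patients detected out of 50
spec = specificity(tn, fp)  # 90 healthy patients recognised out of 100
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```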

These definitions can be transferred to the evaluation of detection/classification algorithms, i.e. true positives are correctly detected pathologies, false positives are nonpathological objects falsely classified as pathological by the algorithm, etc.

There is, however, a difference between detection and classification algorithms: in a classification problem the number of objects is fixed in advance (e.g. the patients to be classified), whereas in a detection problem it is not. Consequently, true negatives cannot be meaningfully defined for detection problems. There are two ways to resolve this problem:

• If the number of objects is an important quantity (number of lesions, e.g. microaneurysms), then the number of false positives may be a good indicator for the quality of the algorithm.

• If the number cannot be determined or is not the important quantity (as is the case when lesions vary strongly in shape and size, e.g. exudates), a pixel-wise comparison between the two results is preferable. In this case, the predictive value can be calculated:

$$\text{predictive value} = \frac{TP}{TP + FP}$$

This is the probability that an object (or pixel) classified as positive is really positive.
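The pixel-wise comparison can be sketched as follows. This is only an illustrative sketch: it assumes both results are available as binary masks of equal size (nested lists, 1 = lesion pixel), and it takes the predictive value to be TP / (TP + FP), in line with the description above. The function name `pixelwise_counts` and the two small masks are hypothetical. True negatives are deliberately not counted, since they are not needed for sensitivity or predictive value.

```python
def pixelwise_counts(result, reference):
    """Count TP, FP and FN pixels between two binary masks of equal size."""
    tp = fp = fn = 0
    for result_row, reference_row in zip(result, reference):
        for detected, truth in zip(result_row, reference_row):
            if detected and truth:
                tp += 1  # lesion pixel found by the algorithm
            elif detected and not truth:
                fp += 1  # pixel wrongly marked as lesion
            elif truth and not detected:
                fn += 1  # lesion pixel missed by the algorithm
    return tp, fp, fn


# Hypothetical 4x4 masks, for illustration only (not data from the text).
reference = [  # e.g. the specialist's marking
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
result = [  # e.g. the algorithm's output
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

tp, fp, fn = pixelwise_counts(result, reference)
pixel_sensitivity = tp / (tp + fn)
predictive_value = tp / (tp + fp)
print(tp, fp, fn, pixel_sensitivity, predictive_value)
```

For object-wise evaluation (e.g. microaneurysms), the same two ratios would be formed from object counts instead of pixel counts, with the number of false positives reported alongside the sensitivity.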

With these values (sensitivity, number of false positives, predictive value), the quality of automatic pathology detection algorithms can be assessed.
