The typical scenario for evaluating the diagnostic accuracy of computer-processed medical images involves taking some database of original unprocessed images, applying some processing to them, and having the entire set of images judged in some specified fashion by radiologists. Whether the subsequent analyses will be done by ROC or some other means, it is necessary to determine a "gold standard" that can represent the diagnostic truth of each original image and can serve as a basis of comparison for the diagnoses on all the processed versions of that image. There are many possible choices for the gold standard:
• A consensus gold standard is determined by the consensus of the judging radiologists on the original.
• A personal gold standard uses each judge's readings on an original (uncompressed) image as the gold standard for the readings of that same judge on the compressed versions of that same image.
• An independent gold standard is formed by the agreement of the members of an independent panel of particularly expert radiologists.
• A separate gold standard is produced by the results of autopsy, surgical biopsy, reading of images from a different imaging modality, or subsequent clinical or imaging studies.
The consensus method has the advantage of simplicity, but the judging radiologists may not agree on the exact diagnosis, even on the original image. Of course, this may happen among the members of the independent panel as well, but in that case an image can be removed from the trial or additional experts called upon to assist. Either case may entail the introduction of concerns as to generalizability of subsequent results.
In the CT study, in an effort to achieve consensus for those cases where the initial CT readings disagreed in the number or location of abnormalities, the judges were asked separately to review their readings of that original. If this did not produce agreement, the judges discussed the image together. Six images in each CT study could not be assigned a consensus gold standard because of irreconcilable disagreement. This was a fundamental drawback of the consensus gold standard, and our subsequent studies did not use this method. Although those images eliminated were clearly more controversial and difficult to diagnose than the others, it cannot be said whether the removal of diagnostically controversial images from the study biases the results in favor of compression or against it. The failure to establish a consensus gold standard for these images was based only on the uncompressed versions, and it cannot be said a priori that the compression algorithm would have a harder time compressing such images. The consensus, when achieved, could be attained either by initial concordance among the readings of the three radiologists, or by subsequent discussion of the readings, during which one or more judges might change their decisions. Consensus was clearly more likely to be attained for those original images on which the judges were in perfect agreement initially, and thus where the original images would have perfect diagnostic accuracy relative to that gold standard. Therefore, this gold standard has a slight bias favoring the originals, which is thought to help make the study safely conservative, and not unduly promotional of specific compression techniques.
The personal gold standard is even more strongly biased against compression. It defines a judge's reading on an original image to be perfect, and uses that reading as the basis of comparison for the compressed versions of that image. If there is any component of random error in the measurement process, since the personal gold standard defines the diagnoses on the originals to be correct (for that image and that judge), the compressed images cannot possibly perform as well as the originals according to this standard. That there is a substantial component of random error in these studies is suggested by the fact that there were several images during our CT tests on which judges changed their diagnoses back and forth, marking, for example, one lesion on the original image as well as on compressed levels E and B, and marking two lesions on the in-between compressed levels F, D, and A. With a consensus gold standard, such changes tend to balance out. With a personal gold standard, the original is always right, and the changes count against compression. Because the compressed levels have this severe disadvantage, the personal gold standard is useful primarily for comparing the compressed levels among themselves. Comparisons of the original images with the compressed ones are conservative. The personal gold standard has, however, the advantage that all images can be used in the study. We no longer have to be concerned with the possible bias from the images eliminated due to failure to achieve consensus. One argument for the personal standard is that in some clinical settings a fundamental question is how the reports of a radiologist whose information is gathered from compressed images compare to what they would have been on the originals, the assumption being that systematic biases of a radiologist are well recognized and corrected for by the referring physicians who regularly send cases to that radiologist. 
The personal gold standard thus concentrates on consistency of individual judges.
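The random-error argument can be made concrete with a small simulation. This is an illustrative sketch, not data from the study: the error probability, lesion counts, and image counts below are all assumed values. Even when compression is modeled as having no effect whatsoever on a judge's readings, the compressed readings agree with the personal gold standard less than 100% of the time, because the original's own random error is baked into the standard.

```python
import random

random.seed(0)

# Assumed parameters for illustration only (not from the study).
P_FLIP = 0.1  # hypothetical probability that a judge miscounts by one lesion

def read_image(true_lesions):
    """Simulated reading: with probability P_FLIP the judge miscounts by one."""
    if random.random() < P_FLIP:
        return true_lesions + random.choice([-1, 1])
    return true_lesions

n_images = 20000
agree_compressed = 0
for _ in range(n_images):
    truth = 1                               # actual lesion count (never observed)
    personal_standard = read_image(truth)   # reading on the original defines "truth"
    compressed_reading = read_image(truth)  # same error process; compression has no effect here
    agree_compressed += (compressed_reading == personal_standard)

rate_original = 1.0  # the original agrees with itself by definition
rate_compressed = agree_compressed / n_images
print(rate_original, rate_compressed)
```

Under these assumed numbers the compressed readings agree with the personal standard only about 80% of the time, despite being generated by exactly the same process as the original readings, which is the sense in which this standard is biased against compression.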
The independent gold standard is what many studies use, and would seem to be a good choice. However, it is not without drawbacks. First of all, there is the danger of a systematic bias appearing in the diagnoses of a judge in comparison to the gold standard. For example, a judge who consistently chooses to diagnose tiny equivocal dots as abnormalities when the members of the independent panel choose to ignore such occurrences would have a high false positive rate relative to that independent gold standard. The computer processing may have some actual effect on this judge's diagnoses, but this effect might be swamped in comparison to this baseline high false positive rate. This is an argument for using the personal gold standard as well as the independent gold standard. The other drawback of an independent gold standard is somewhat more subtle and is discussed later. In the MR study, the independent panel was composed of two senior radiologists who first measured the blood vessels separately and then discussed and remeasured in those cases where there was initial disagreement.
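The swamping effect described above can also be sketched numerically. The rates below are hypothetical, chosen only to illustrate the point: a judge whose baseline disagreement with the independent panel is large will show nearly the same false positive rate on originals and on processed images, so a small processing-induced shift is hard to see against that baseline.

```python
import random

random.seed(1)

# Assumed rates for illustration only (not study data).
n = 20000
BASELINE_FP = 0.30       # hypothetical judge-vs-panel disagreement on equivocal dots
PROCESSING_SHIFT = 0.02  # hypothetical extra false positives induced by processing

def fp_rate(p):
    """Fraction of normal images the judge marks abnormal, at per-image FP probability p."""
    return sum(random.random() < p for _ in range(n)) / n

fp_original = fp_rate(BASELINE_FP)                     # relative to the independent panel
fp_processed = fp_rate(BASELINE_FP + PROCESSING_SHIFT)
print(fp_original, fp_processed)
```

Both rates hover near 30%, and the 2-point processing effect is small relative to the baseline; a personal gold standard removes that baseline entirely, which is why the text argues for using both standards.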
A separate standard would seem to be the best choice, but it is generally not available. With phantom studies, there is of course a "diagnostic truth" that is established and known entirely separately from the diagnostic process. But with actual clinical images, there is often no autopsy or biopsy, as the patient may be alive and not operated upon. There are unlikely to be any images from other imaging modalities that can add to the information available from the modality under test, since there is typically one best way for imaging a given pathology in a given part of the body. And the image data set for the clinical study may be very difficult to gather if one wishes to restrict the image set to those patients for whom there are follow-up procedures or imaging studies that can be used to establish a gold standard. In any case, limiting the images to those patients who have subsequent studies done would introduce obvious bias into the study.
In summary, the consensus method of achieving a gold standard has a major drawback together with the minor advantage of ease of availability. The other three methods for achieving a gold standard all have both significant advantages and disadvantages, and perhaps the best solution is to analyze the data against more than one definition of diagnostic truth.