## CT Study Example of Detection Accuracy

The detection of abnormal lymphoid tissue is an important aspect of chest imaging. This is true especially for the mediastinum, the central portion of the chest that contains the heart, major blood vessels, and other structures. Abnormally enlarged lymph nodes, or lymphadenopathy, in the mediastinum can be caused by primary malignancy such as lymphoma, metastatic disease that results from the spread of breast or lung cancer through the lymphatics, tuberculosis, or non-infectious inflammatory diseases such as sarcoidosis. Typically radiologists can easily locate lymph nodes in a chest scan. The detection task is therefore to determine which of the located lymph nodes are enlarged.

The detection of lung nodules is another major objective in diagnostic imaging of the chest. A common cause of these nodules is malignancy, primary or metastatic. The latter, which spreads through the bloodstream from a primary cancer almost anywhere in the body, can cause multiple nodules in one or both lungs. Other causes include fungal and bacterial infections and noninfectious inflammatory conditions. Nodules range in size from undetectably small to large enough to fill an entire segment of the lung.

Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.

The compressed and original images were viewed by three radiologists. For each of the 30 images in a study, each radiologist viewed the original and 5 of the 6 compressed levels, and thus 360 images were seen by each judge. The judges were blinded in that no information concerning patient study or compression level was indicated on the film. Images were viewed on hardcopy film on a lightbox, the usual way in which radiologists view images. The "windows and levels" adjustment to the dynamic range of the image was applied to each image before filming. This simple contrast adjustment technique sets maximum and minimum intensities for the image. All intensities above the maximum are thresholded to equal that maximum value. All intensities below the minimum are thresholded to equal that minimum value. This minimum value will be displayed on the output device as black, and the maximum value will be displayed as white. All intensity values lying in between the minimum and maximum are linearly rescaled to lie between black and white. This process allows for more of the dynamic range of the display device (in this case, the film) to be used for the features of interest. A radiologist who was not involved in the judging applied standard settings for windows and levels for the mediastinal images, and different standard settings for the lung nodule images. The compressed and original images were filmed in standard 12-on-1 format on 14" x 17" film using the scanner that produced the original images.
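The windows-and-levels adjustment described above amounts to a clip followed by a linear rescale. A minimal Python sketch follows; the function name `window_level`, the output range of 0 (black) to 1 (white), and the window bounds in the usage note are illustrative assumptions, not the settings used in the study.

```python
def window_level(pixels, lo, hi, out_min=0.0, out_max=1.0):
    """Clip intensities to [lo, hi], then rescale linearly to [out_min, out_max].

    lo maps to out_min (black) and hi maps to out_max (white).
    All parameter names and defaults here are illustrative assumptions.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    result = []
    for value in pixels:
        clipped = min(max(value, lo), hi)        # threshold at both ends
        fraction = (clipped - lo) / (hi - lo)    # position inside the window
        result.append(out_min + fraction * (out_max - out_min))
    return result
```

For example, with a hypothetical window of -160 to 240, `window_level([-1000, -160, 40, 240, 1000], -160, 240)` sends everything at or below -160 to black, everything at or above 240 to white, and the midpoint 40 to mid-gray.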

The viewings were divided into 3 sessions during which the judges independently viewed 10 pages, each with 6 lung nodule images and 6 mediastinal images. The judges marked abnormalities directly on the films with a grease pencil, although mediastinal lymph nodes were not marked unless their smallest cross-sectional diameter measured 10 mm or greater. All judges were provided with their own copy of the films for marking. No constraints were placed on the viewing time, the viewing distance, or the lighting conditions; the judges were encouraged to simulate the conditions they would use in everyday work. They were, however, constrained to view the 10 pages in the predetermined order, and could not go back to review earlier pages. At each session, each judge saw each image at 2 of the 7 levels of compression (the 7 levels include the original). The two levels never appeared on the same film, and the ordering of the pages ensured that they never appeared with fewer than 3 pages separating them. This was intended to reduce learning effects, which are discussed in the next chapter. A given image at a given level was never seen more than once by any one judge, and so intraobserver variability was not explicitly measured. Of the 6 images in one study on any one page, only one image was shown as the original, and exactly 5 of the 6 compressed levels were represented. The original versions of the images are denoted "g." The compressed versions are "a" through "f." The randomization follows what is known as a "Latin square" arrangement.
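To illustrate what a Latin square arrangement looks like, the sketch below builds a cyclic Latin square over the seven level labels "a" through "g". This is only one simple way to construct such a square. The function name is an assumption, and how rows and columns map onto images, pages, and sessions in the actual study is not specified here.

```python
def cyclic_latin_square(labels):
    """Return an n x n Latin square in which row r is the label list shifted by r.

    Every label appears exactly once in each row and once in each column.
    This cyclic construction is an illustrative assumption, not the study's design.
    """
    n = len(labels)
    return [[labels[(row + col) % n] for col in range(n)] for row in range(n)]


if __name__ == "__main__":
    for row in cyclic_latin_square(list("abcdefg")):
        print(" ".join(row))
```

The defining property, each level appearing once per row and once per column, is what balances the levels against position effects such as viewing order.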

The consensus gold standard for the lung determined that there were, respectively, 4 images with 0 nodules, 9 with 1, 4 with 2, 5 with 3, and 2 with 4 among those images retained. For the mediastinum, there were 3 images with 0 abnormal nodes, 17 with 1, 2 with 2, and 2 with 3.

Once a gold standard is established, a value can be assigned to the sensitivity and the predictive value positive (PVP). The sensitivity and PVP results are shown graphically using scatter plots, spline fits, and associated confidence regions. The spline fits are quadratic splines with a single knot at 1.5 bits per pixel (bpp), as given in the previous chapter. The underlying probability model that governs the 450 observed values of $y$ (= sensitivity or PVP) is taken to be as follows. The random vector of quadratic spline coefficients $(a_0, a_1, a_2, b_2)$ has a single realization for each (judge, image) pair. What is observed as the bit rate varies is the value for the chosen five compression levels plus independent mean 0 noise. The expected value of $y$ is

$$E(y) = E(a_0) + E(a_1)x + E(a_2)x^2 + E(b_2)(\max(0, x - 1.5))^2,$$

where the expectation is with respect to the unconditional distribution of the random vector $(a_0, a_1, a_2, b_2)$. Associated with each spline fit is the residual root mean square (RMS), an estimate of the standard deviation of the individual measurements from an analysis of variance of the spline fits.

The standard method for computing simultaneous confidence regions for such curves is the "S" (or "Scheffé") method [20], which is valid under certain Gaussian assumptions that do not hold for our data. Therefore we use the statistical technique called "the bootstrap" [4,10-12], specifically a variation of the "correlation model" [13] that is related to the bootstrap-based prediction regions of Olshen et al. [22]. We denote the estimate of PVP for the lung study at a bit rate bpp by $\hat{E}(y(\mathrm{bpp}))$.

(1) A quadratic spline equation can be written as

$$\hat{E}(y(\mathrm{bpp})) = a_0 + a_1 x + a_2 x^2 + b_2(\max(0, x - x_0))^2,$$

where $x_0$ is the "knot" (in this study, $x$ = bit rate and $x_0 = 1.5$ bpp). This equation comes from the linear model

$$Y = D\beta + e,$$

with one entry of $Y$ (and corresponding row of $D$) per observation. $D$ is the "design matrix" of size $450 \times 4$. It has four columns, the first having the multiple of $a_0$ (always 1), the second the multiple of $a_1$ (that is, the bit rate), and so on. We use $\hat{E}(a)$ to denote the four-dimensional vector of estimated least squares coefficients:

$$\hat{E}(a) = (D^t D)^{-1} D^t Y.$$

(2) For a given bit rate $b$, write the row vector dictated by the spline as $d^t = d^t(b)$. Thus, $\hat{E}(y(\mathrm{bpp})) = d^t\hat{E}(a)$.

(3) The confidence region will be of the form

$$d^t\hat{E}(a) - S\sqrt{F}\sqrt{d^t(D^tD)^{-1}d} \le y \le d^t\hat{E}(a) + S\sqrt{F}\sqrt{d^t(D^tD)^{-1}d},$$

where $S$ is the square root of the residual mean square from an analysis of variance of the data. So, if $Y$ is $n \times 1$ and $\beta$ is $k \times 1$, then

$$S^2 = (n - k)^{-1}\,\lVert Y - D\hat{E}(a) \rVert^2.$$

The region will be truncated, if necessary, so that always $0 \le y \le 1$.
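Steps 1 through 3 can be sketched in plain Python as follows. The function names are assumptions made for illustration, the least squares coefficients are obtained from the normal equations, and S is taken to be the usual residual mean square with n - k degrees of freedom. The multiplier `sqrt_f` is passed in as a parameter because the text chooses it by bootstrap in the later steps.

```python
import math


def design_row(b, knot=1.5):
    """Row d^t(b) = (1, b, b^2, max(0, b - knot)^2) of the design matrix."""
    hinge = max(0.0, b - knot)
    return [1.0, b, b * b, hinge * hinge]


def solve(a, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_spline(bit_rates, ys, knot=1.5):
    """Return (coefficients, D^t D, S) for the least squares quadratic spline."""
    rows = [design_row(b, knot) for b in bit_rates]
    k = len(rows[0])
    dtd = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    dty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve(dtd, dty)  # normal equations: (D^t D) coef = D^t Y
    rss = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, y in zip(rows, ys))
    s = math.sqrt(rss / (len(ys) - k))  # assumes more observations than coefficients
    return coef, dtd, s


def confidence_band(b, coef, dtd, s, sqrt_f, knot=1.5):
    """Band limits at bit rate b, truncated to the interval [0, 1]."""
    d = design_row(b, knot)
    center = sum(c * v for c, v in zip(coef, d))
    z = solve(dtd, d)  # z = (D^t D)^{-1} d
    half = s * sqrt_f * math.sqrt(sum(di * zi for di, zi in zip(d, z)))
    return max(0.0, center - half), min(1.0, center + half)
```

Solving the normal equations directly is adequate for a four-column design like this one; a QR-based solver would be the more numerically robust choice for larger or ill-conditioned problems.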

(4) The bootstrapping is conducted by first drawing a sample of size 3 with replacement from our group of three judges. This bootstrap sample may include one, two, or all three of the judges. For each chosen judge (including multiplicities, if any), we draw a sample of size 30 with replacement from the set of 30 original images. It can be shown that typically about 63% ($= 100(1 - e^{-1})\%$) of the images will appear at least once in each bootstrap sample of images. For each chosen judge and original image, we include in the bootstrap sample all five of the observed values of $y$. The motivation for this bootstrap sampling is simple: The bootstrap sample bears the same relationship to the original sample that the original sample bears to "nature." We do not know the real relationship between the true data and nature; if we did, we would use it in judging coverage probabilities in steps 7 and 8. However, we do know the data themselves, and so we can imitate the relationship between nature and the data by examining the observed relationship between the data and a sample from them [10,12].
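The two-stage resampling in step 4 can be sketched as below. The function names and the layout of `observations` (a dictionary keyed by (judge, image) pairs) are assumptions made for illustration. The helper `expected_unique_fraction` checks the 63% figure: for n = 30 the exact expected fraction is 1 - (1 - 1/30)^30, which is close to the limiting value 1 - 1/e.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected fraction of n items seen at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_sample(judges, images, observations, rng):
    """Resample judges, then images within each drawn judge, both with replacement.

    observations maps (judge, image) to that pair's list of (bit_rate, y) values;
    all of them are kept together, as in the text. The data layout is hypothetical.
    """
    sample = []
    for judge in rng.choices(judges, k=len(judges)):
        for image in rng.choices(images, k=len(images)):
            sample.extend(observations[(judge, image)])
    return sample


if __name__ == "__main__":
    print(round(expected_unique_fraction(30), 3), round(1 - math.exp(-1), 3))
```

Keeping all five observations of a (judge, image) pair together preserves the within-pair correlation that the model attributes to the shared random spline coefficients.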

(5) Each bootstrap sample $Y^*$ entails a bootstrap design matrix $D^*$, as well as corresponding $\hat{E}^*(a)$ and $S^*$. This bootstrap process will be carried out $n_B = 1000$ times.

(6) For the $j$th bootstrap sample compute the four new bootstrap quantities as in step 5.

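(7) Compute for each $\sqrt{F}$

$$G_B(\sqrt{F}) = (n_B)^{-1}\,\#\{j : d^t\hat{E}^*(a) - S^*\sqrt{F}\sqrt{d^t(D^{*t}D^*)^{-1}d} \le d^t\hat{E}(a) \le d^t\hat{E}^*(a) + S^*\sqrt{F}\sqrt{d^t(D^{*t}D^*)^{-1}d} \ \text{for all } d\}$$

$$= (n_B)^{-1}\,\#\{j : (\hat{E}^*(a) - \hat{E}(a))^t(D^{*t}D^*)(\hat{E}^*(a) - \hat{E}(a)) \le F(S^*)^2\}.$$

Note that the latter expression is what is used in the computation. This is the standard Scheffé method, as described in [20].

(8) For a $100p\%$ confidence region compute $(\sqrt{F})_p = \min\{\sqrt{F} : G_B(\sqrt{F}) \ge p\}$ and use that value in the equation in step 3. In our case, we are interested in obtaining a 95% confidence region, so $\sqrt{F}$ is chosen so that for 95% of the bootstrap samples

$$(\hat{E}^*(a) - \hat{E}(a))^t(D^{*t}D^*)(\hat{E}^*(a) - \hat{E}(a)) \le F(S^*)^2.$$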
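Steps 7 and 8 reduce to computing one scalar per bootstrap sample and then taking an empirical quantile. A hedged sketch follows; the function names are assumptions, and the per-sample statistic is the quadratic form divided by (S*)^2, so that the bootstrap sample is counted in G_B whenever the statistic is at most F.

```python
import math


def scheffe_statistic(coef_star, coef_hat, dtd_star, s_star):
    """Quadratic form (E*(a) - E(a))^t (D*^t D*) (E*(a) - E(a)) divided by (S*)^2."""
    delta = [cs - ch for cs, ch in zip(coef_star, coef_hat)]
    k = len(delta)
    form = sum(delta[i] * dtd_star[i][j] * delta[j]
               for i in range(k) for j in range(k))
    return form / (s_star ** 2)


def sqrt_f_for_coverage(statistics, p=0.95):
    """Smallest sqrt(F) such that at least a fraction p of the statistics are <= F."""
    if not statistics or not 0.0 < p <= 1.0:
        raise ValueError("need at least one statistic and 0 < p <= 1")
    ordered = sorted(statistics)
    n_b = len(ordered)
    for count, value in enumerate(ordered, start=1):
        if count / n_b >= p:  # G_B reaches p at this order statistic
            return math.sqrt(value)
    return math.sqrt(ordered[-1])  # defensive; count / n_b reaches 1.0 at the end
```

Because G_B is a step function that jumps only at the observed statistics, the minimum in step 8 is always attained at one of the sorted values, which is why a single pass over the ordered list suffices.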

In this model, the bit rate is treated as a nonrandom predictor that we control, and the judges and images are "random effects" because our three judges and 30 images have been sampled from arbitrarily large numbers of possibilities.

Figure 1 displays all data for lung sensitivity and lung PVP (calculated relative to the consensus gold standard) for all 24 images, judges, and compressed levels for which there was a consensus gold standard. There are 360 x's: 360 = 3 judges x 24 images x 5 compressed levels seen for each image. Figure 2 is the corresponding figure for the mediastinum relative to the personal gold standard. The o's mark the average of the x's for each bit rate. The values of the sensitivity and PVP are simple fractions such as 1/2 and 2/3 because there are at most a few abnormalities in each image. The curves are least squares quadratic spline fits to the data with a single knot at 1.5 bpp, together with the two-sided 95% confidence regions. Since the sensitivity and PVP cannot exceed 1, the upper confidence curve was thresholded at 1. The residual RMS is the square root of the residual mean square from an analysis of variance of the spline fits. Sensitivity for the lung seems to be nearly as good at low rates of compression as at high rates, but sensitivity for the mediastinum drops off at the lower bit rates, driven primarily by the results for one judge. PVP is roughly constant across the bit rates for both the lung and the mediastinum.
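The remark that sensitivity and PVP take values such as 1/2 and 2/3 follows directly from their definitions as ratios of small counts. A minimal sketch, using the standard definitions (sensitivity is the fraction of gold-standard abnormalities that were marked, PVP is the fraction of marks that are true abnormalities): the function name is an assumption, and returning `None` when a denominator is zero is an illustrative choice, since this section does not say how such images were scored.

```python
from fractions import Fraction


def sensitivity_and_pvp(true_pos, false_neg, false_pos):
    """Return (sensitivity, PVP) for one reading of one image as exact fractions.

    sensitivity = TP / (TP + FN), PVP = TP / (TP + FP).
    None is returned when a denominator is zero (an illustrative convention).
    """
    truly_abnormal = true_pos + false_neg
    marked = true_pos + false_pos
    sensitivity = Fraction(true_pos, truly_abnormal) if truly_abnormal else None
    pvp = Fraction(true_pos, marked) if marked else None
    return sensitivity, pvp
```

For instance, a judge who marks one of two gold-standard nodules with no false marks gets sensitivity 1/2 and PVP 1, while a judge who finds both nodules and adds one false mark gets sensitivity 1 and PVP 2/3.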

Table 1 shows the numbers of original test images (out of 30 total) that contain the listed number of abnormalities for each disease type according to each judge. Also, the rows marked All show the number of original test images (out of

TABLE 1 Number of test images that contain the listed number of abnormalities (Mdst = mediastinum)

Number of abnormalities

Type |
Judge |
