## Study Design and Statistical Analysis

To simulate normal clinical practice, test images were selected from 30 sequential thoracic MR examinations of diagnostic quality obtained after February 1, 1991. The patients studied included 16 females and 14 males, with ages ranging from 1 to 93 years and an average age of 48.0 ± 24.7 years (mean ± s.d.). Clinical diagnoses included aortic aneurysm (n = 11), thoracic tumors (n = 11), pre- or post-lung transplant (n = 5), constrictive pericarditis (n = 1), and subclavian artery rupture (n = 1). From each examination, one image that best demonstrated all four major vessels of interest was selected. The training images were selected similarly from different examinations. All analyses were based solely on measurements made on the test images.

The 30 test scans compressed to 5 bit rates plus the originals give rise to a total of 180 images. These images were arranged in a randomized sequence and presented on separate hardcopy films to three radiologists. The viewing protocol consisted of 3 sessions held at least 2 weeks apart. Each session included 10 films viewed in a predetermined order, with 6 scans on each film. Each of the 3 radiologists began viewing the films at a different starting point in the randomized sequence. To minimize the probability of remembering measurements from past images, a radiologist saw only 2 of the 6 levels of each image in each session, with the second occurrence of each image spaced at least 4 films after the first occurrence of that image.

Following standard clinical methods for detecting aneurysms, the radiologists used calipers and a millimeter scale available on each image to measure the four blood vessels appearing on each scan. Although the use of digital calipers might have allowed more accurate measurements, this would have violated one of our principal goals, namely to follow as closely as possible actual clinical practice. It is the standard practice of almost all radiologists to measure with manual calipers. This is largely because they lack the equipment, or they would prefer not to take the time to bring up the relevant image on the terminal and then perform the measurements with electronic calipers. We asked radiologists to make all measurements between the outer walls of the vessels along the axis of maximum diameter. It is this maximum diameter measurement that is used to make clinical decisions. Both the measurements and axes were marked on the film with a grease pencil.

The independent gold standard was set by having two radiologists come to an agreement on vessel sizes on the original scans. They first independently measured the vessels on each scan and then remeasured only those vessels on which they initially differed until an exact agreement on the number of millimeters was reached. These two radiologists are different from the three radiologists whose judgments are used to determine diagnostic accuracy. A personal standard was also derived for each of the three judging radiologists by taking their own measurements on the original images.

Once the gold standard measurement for each vessel in each image was assigned, measurement error can be quantified in a variety of ways. If $z$ is the radiologist's measurement and $g$ represents the gold standard measurement, then some potential summary statistics are

$$z - g, \qquad \frac{z - g}{g}, \qquad \log\frac{z}{g}.$$

These statistics have invariance properties that bear upon understanding the data. For example, $z - g$ is invariant to the same additive constant (that is, to a change in origin), $\log(z/g)$ is invariant to the same multiplicative constant (that is, to a change in scale), and $(z - g)/g$ is invariant to the same multiplicative constant and to the same sign changes. For simplicity and appropriateness in the statistical tests carried out, the error parameters chosen for this study are percent measurement error (pme),

$$\mathit{pme} = \frac{z - g}{g} \times 100\%,$$

and absolute percent measurement error (apme),

$$\mathit{apme} = \frac{|z - g|}{g} \times 100\%,$$

both of which scale the error by the gold standard measurement to give a concept of error relative to the size of the vessel being measured.
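For concreteness, the two error measures can be computed directly from paired arrays of measurements and gold standard values. This is a minimal sketch with hypothetical numbers, not data from the study:

```python
import numpy as np

def pme(z, g):
    """Percent measurement error: signed error relative to the gold standard."""
    z, g = np.asarray(z, dtype=float), np.asarray(g, dtype=float)
    return (z - g) / g * 100.0

def apme(z, g):
    """Absolute percent measurement error: magnitude of pme."""
    return np.abs(pme(z, g))

# Hypothetical vessel diameters (mm) against their gold standard values
z = [32.0, 28.5, 19.0]
g = [30.0, 30.0, 20.0]
print(pme(z, g))   # signed errors in percent
print(apme(z, g))  # unsigned errors in percent
```

Because both measures divide by $g$, a 2 mm error on a small vessel counts for more than the same 2 mm error on a large aorta, matching the intent described above.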

Whether the differences in error achieved at the various bit rates are statistically significant can be assessed by many tests. Each should respect the pairing of the measurements being compared and the multiplicity of comparisons being made. To ensure that our conclusions are not governed by the particular test being used, we chose two of the most common, the t and Wilcoxon tests. We also employed statistical techniques that account for this multiplicity of tests. The measurements are considered paired in a comparison of two bit rates since the same vessel in the same image is measured by the same radiologist at both bit rates. For instance, let $x_1$ be the measurement of a vessel at bit rate 1, $x_2$ its measurement at bit rate 2, and $g$ the vessel's gold standard measurement. Then the pme at bit rates 1 and 2 are

$$\mathit{pme}_1 = \frac{x_1 - g}{g} \times 100\% \qquad \text{and} \qquad \mathit{pme}_2 = \frac{x_2 - g}{g} \times 100\%,$$

and their difference is

$$\mathit{pme}_D = \mathit{pme}_1 - \mathit{pme}_2 = \frac{x_1 - x_2}{g} \times 100\%.$$
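The text does not spell out which multiple-comparison correction was employed; as one illustrative possibility only, a Bonferroni adjustment over the 15 pairwise comparisons among the 6 levels would look like this:

```python
from itertools import combinations

# 0 denotes the originals; 1-5 denote the five compressed levels
levels = [0, 1, 2, 3, 4, 5]
pairs = list(combinations(levels, 2))   # all 15 pairwise level comparisons

alpha = 0.05
alpha_per_test = alpha / len(pairs)     # Bonferroni: split alpha across tests
print(len(pairs), alpha_per_test)
```

Any single pairwise t or Wilcoxon test would then be declared significant only if its p-value fell below `alpha_per_test` rather than 0.05.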

In such a two-level comparison, pme more accurately preserves the difference between two errors than does apme. A vessel that is overmeasured by a% (positive) on bit rate 1 and undermeasured by a% (negative) on bit rate 2 has an error distance of 2a% if pme is used, but a distance of zero if apme is used. Therefore, both the t-test and the Wilcoxon signed rank test are carried out using only pme. Apme is used later to present a more accurate picture of error when we plot an average of apme across the 30 test images vs bit rate.
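The cancellation argument can be made concrete with a single hypothetical vessel (the numbers are illustrative only):

```python
g = 30.0                    # hypothetical gold standard diameter in mm
z1 = g * 1.10               # overmeasured by 10% at bit rate 1
z2 = g * 0.90               # undermeasured by 10% at bit rate 2

pme1 = (z1 - g) / g * 100.0   # about +10
pme2 = (z2 - g) / g * 100.0   # about -10
apme1, apme2 = abs(pme1), abs(pme2)

print(pme1 - pme2)    # pme difference: about 20, i.e., a distance of 2a%
print(apme1 - apme2)  # apme difference: about 0, the errors cancel
```

This is exactly why pme is used for the paired significance tests, while apme is reserved for summarizing overall error magnitude.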

The t-statistic quantifies the statistical significance of the observed difference between two data sets in which the data can be paired. Unlike the CT study, in which the Behrens-Fisher-Welch t-test was used because of the obviously different variances present for different images, here the ordinary t-test was applicable. The difference in error for two bit rates is calculated for all the vessels measured at both bit rates. If the radiologists made greater errors at bit rate 1 than at bit rate 2, the average difference in error over all the data will be positive. If bit rate 1 is no more or less likely to cause error than bit rate 2, the average difference in error is zero. The t-test assumes that the sample average difference in error between two bit rates varies in a Gaussian manner about the real average difference [27]. If the data are Gaussian, which they clearly cannot exactly be in our application, the paired t-test is an exact test. Quantile-quantile (Q-Q) plots of pme differences for comparing levels vary from linear to S-shaped; in general, the Q-Q plots indicate a moderate fit to the Gaussian model. The size of our data set (4 vessels x 30 images x 6 levels x 3 judges = 2160 data points) makes a formal test for normality nearly irrelevant: the large number of data points serves to guarantee failure of even fairly Gaussian data at conventional levels of significance. (That is, the generating distribution is likely not to be exactly Gaussian, and with enough data, even a tiny discrepancy from Gaussian will be apparent.) Even if the data are non-Gaussian, however, the central limit theorem renders the t-test approximately valid. With the Wilcoxon signed rank test [27], the significance of the difference between the bit rates is obtained by comparing a standardized value of the Wilcoxon statistic against the normal standard deviate at the 95% two-tail confidence level. The distribution of this standardized Wilcoxon is nearly exactly Gaussian for samples as small as 20 if the null hypothesis is true.
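Both tests are available in modern libraries; as a sketch on synthetic paired pme samples (the study itself predates these tools, and the data here are invented), SciPy's `ttest_rel` and `wilcoxon` carry out the paired t-test and the signed rank test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic paired pme values (%) for the same vessels at two bit rates
pme_rate1 = rng.normal(loc=2.0, scale=5.0, size=120)
pme_rate2 = pme_rate1 + rng.normal(loc=0.0, scale=1.0, size=120)

# Paired t-test on the per-vessel differences pme1 - pme2
t_stat, t_p = stats.ttest_rel(pme_rate1, pme_rate2)

# Wilcoxon signed rank test on the same pairs
w_stat, w_p = stats.wilcoxon(pme_rate1, pme_rate2)

print(f"t = {t_stat:.3f}, p = {t_p:.3f}")
print(f"W = {w_stat:.1f}, p = {w_p:.3f}")
```

Agreement between the two p-values, one from a parametric test and one from a rank-based test, is the kind of robustness check described above.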

### Results Using the Independent Gold Standard

Plots of trends in measurement error as a function of bit rate are presented in Figs. 3-6. In all cases, the general trend of the data is indicated by fitting the data points with a quadratic spline having one knot at 1.0 bpp. Figure 3 gives average pme against the mean bit rate for all radiologists pooled (i.e., the data for all radiologists, images, levels, and structures, with each radiologist's measurements compared to the independent gold standard) and for each of the three radiologists separately. In Fig. 4, the pme vs actual achieved bit rate is plotted for all data points. The relatively flat curve begins to increase slightly at the lowest bit rates, levels 1 and 2 (0.36, 0.55 bpp). It is apparent from an initial observation of these plots that, except for measurement at the lowest bit rates, accuracy does not vary greatly with lossy compression. Possibly significant increases in error appear only at the lowest bit rates, whereas at the remaining bit rates measurement accuracy is similar to that obtained with the originals. The average performance on images compressed to level 5 (1.7 bpp) is actually better than performance on the originals.

FIGURE 3 Mean pme vs mean bit rate using the independent gold standard. The dotted, dashed, and dash-dot curves are quadratic splines fit to the data points for judges 1, 2, and 3, respectively. The solid curve is a quadratic spline fit to the data points for all judges pooled. The splines have a single knot at 1.0 bpp.

FIGURE 5 Mean apme vs mean bit rate using the independent gold standard. The dotted, dashed, and dash-dot curves are quadratic splines fit to the data points for judges 1, 2, and 3, respectively. The solid curve is a quadratic spline fit to the data points for all judges pooled.
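The trend curves in these figures can be reproduced in spirit with a least-squares quadratic spline having one interior knot at 1.0 bpp. The points below are invented, including an assumed bit rate of about 9 bpp for the originals, purely to show the fitting mechanics:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

# Hypothetical (bit rate, mean apme) points; the study's actual values differ
bpp  = np.array([0.36, 0.55, 0.82, 1.14, 1.70, 9.0])
apme = np.array([6.1,  5.2,  4.6,  4.5,  4.2,  4.4])

# Quadratic spline (k=2) with a single interior knot at 1.0 bpp
spline = LSQUnivariateSpline(bpp, apme, t=[1.0], k=2)

grid = np.linspace(bpp.min(), bpp.max(), 200)
trend = spline(grid)
print(trend[:3])  # smoothed trend values near the lowest bit rates
```

The single knot lets the curve bend at 1.0 bpp, where the plots suggest error begins to rise, while staying smooth elsewhere.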

Although the trends in pme vs bit rate are useful, overmeasurement (positive error) can cancel undermeasurement (negative error) when these errors are being averaged or fitted with a spline. For this reason, we turn to apme, which measures the error made by a radiologist regardless of whether it originated from overmeasurement or undermeasurement. Figure 5 plots average apme vs average bit rate for each radiologist and for all radiologists pooled. Figure 6 shows actual apme vs actual bit rate achieved. These plots show trends similar to those observed before. The original level contains more or less the same apme as compression levels 3, 4, and 5 (0.82, 1.14, 1.7 bpp). Levels 1 and 2 (0.36, 0.55 bpp) show slightly higher error. These plots provide only approximate visual trends in the data.

FIGURE 4 Percent measurement error vs actual bit rate using the independent gold standard. The x's indicate data points for all images, pooled across judges and compression levels. The solid curve is a quadratic spline fit to the data with a single knot at 1.0 bpp. Reprinted with permission, from Proceedings First International Conference on Image Processing, ICIP '94, I: 861-865, Austin, Texas, Nov. 1994.

FIGURE 6 Apme vs actual bit rate using the independent gold standard. The x's indicate data points for all images, pooled across judges and compression levels. The solid curve is a quadratic spline fit to the data.

The t-test was used to test the null hypothesis that the "true" pme between two bit rates is zero. The standardized average difference is compared with the "null" value of zero by comparison with standard normal tables. None of the compressed images, down to the lowest bit rate of 0.36 bpp, was found to have a significantly higher pme than the error made on the originals. Among the compressed levels, however, level 1 (0.36 bpp) was found to be significantly different from level 5 (1.7 bpp). As was mentioned, the performance on level 5 was better than that on all other levels, including the uncompressed level.

When using the Wilcoxon signed rank test to compare compressed images against the originals, only level 1 (0.36 bpp) differed significantly in the distribution of pme. Within the levels representing the compressed images, levels 1, 3, and 4 (0.36, 0.82, 1.14 bpp) had significantly different pme than those at level 5 (1.7 bpp). Since measurement accuracy is determined from the differences with respect to the originals only, a conservative view of the results of the analyses using the independent gold standard is that accuracy is retained down to 0.55 bpp (level 2).
