## Info FIGURE 1 Relative to the consensus gold standard: (a) lung sensitivity (RMS = 0.177), (b) lung PVP (RMS = 0.215).

24 total) that contain the listed number of abnormalities according to the consensus gold standard. We examine this table to determine whether or not it is valid to pool the sensitivity and PVP results across judges. Simple chi-square tests for homogeneity show that for both the lung and the mediastinum, judges do not differ beyond chance from equality in the numbers of abnormalities they found. In particular, if for the lung we categorize abnormalities found as 0, 1, 2, 3, or at least 4, then the chi-square statistic is 3.16 (on 8 degrees of freedom). Six cells have expectations below 5, a traditional concern, but an exact test would not lead to a different conclusion. Similar comments apply to the mediastinum, where the chi-square value (on 6 degrees of freedom) is 8.83. However, Table 1 does not fully indicate the variability among the judges. For example, the table shows that each judge found six lung nodules in an original test image only once, but the image on which this occurred was not the same for all three judges.
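As a rough check on the figures quoted above, the following sketch computes a Pearson chi-square statistic for homogeneity and the upper-tail p-value for an even number of degrees of freedom, using only the Python standard library. The function names and the small count table are hypothetical illustrations (the counts are not the values from Table 1); only the two reported statistics, 3.16 on 8 degrees of freedom and 8.83 on 6 degrees of freedom, come from the text.

```python
import math

def chi_square_homogeneity(table):
    """Pearson chi-square statistic and degrees of freedom for a rows x cols count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form used here needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # term is now (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Hypothetical counts: 3 judges (rows) x 5 categories 0, 1, 2, 3, >=4 (columns).
# These are NOT the values from Table 1; each row sums to 24 images.
example = [
    [10, 6, 4, 2, 2],
    [11, 5, 4, 3, 1],
    [9, 7, 3, 3, 2],
]
stat, df = chi_square_homogeneity(example)
print(f"hypothetical table: chi-square = {stat:.2f} on {df} df")

# p-values for the two statistics reported in the text.
p_lung = chi_square_sf_even_df(3.16, 8)
p_mediastinum = chi_square_sf_even_df(8.83, 6)
print(f"lung: p = {p_lung:.3f}; mediastinum: p = {p_mediastinum:.3f}")
```

Both p-values come out well above conventional significance thresholds (roughly 0.92 and 0.18), which agrees with the statement that the judges do not differ beyond chance. The degrees of freedom follow from (judges − 1) × (categories − 1), so 8 for three judges and five categories.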

2.1 Behrens-Fisher-Welch t-statistic

FIGURE 2 Relative to the personal gold standard: (a) Mediastinum Sensitivity (RMS = 0.243), (b) Mediastinum PVP (RMS = 0.245).

The comparison of sensitivity and PVP at different bit rates was carried out using a permutation distribution of a two-sample t-test that is sometimes called the Behrens-Fisher-Welch test [3,18]. The statistic takes account of the fact that the within-group variances are different. In the standard paired t-test where we have $n$ pairs of observations, let $\delta$ denote the true, and unknown, average difference between the members of a pair. If we denote the sample mean difference between the members of the pairs by $\bar{D}$, and the estimate of the standard deviation of these differences by $s_D$, then the quantity

$$t = \frac{\bar{D} - \delta}{s_D / \sqrt{n}}$$

follows (under certain normality assumptions) Student's t distribution with $(n - 1)$ degrees of freedom, and this may be used to test the null hypothesis that $\delta = 0$, that is, that there is no difference between the members of a pair. Now, with our sensitivity and PVP data, there is no single estimate $s_D$ of the standard deviation that can be made. For an image $I_1$ that has only one abnormality according to the consensus gold standard, the judges can have sensitivity equal to either 0 or 1, but for an image $I_2$ with three abnormalities the sensitivity can equal 0, 0.33, 0.67, or 1. So, in comparing bit rates $b_1$ and $b_2$, when we form a pair out of image $I_1$ seen at bit rates $b_1$ and $b_2$, and we form another pair out of image $I_2$ seen at bit rates $b_1$ and $b_2$, we see that the variance associated with some pairs is larger than that associated with other pairs. The Behrens-Fisher-Welch test takes account of this inequality of variances. The test is exact and does not rely on Gaussian assumptions that would be patently false for this data set.

The use of this statistic is illustrated by the following example. Suppose Judge 1 has judged $N$ lung images at both levels A and B. These images can be divided into 5 groups, according to whether the consensus gold standard for the image contained 0, 1, 2, 3, or 4 abnormalities. Let $N_i$ be the number of images in the $i$th group. Let $\Delta_{ij}$ represent the difference in sensitivities (or PVP) for the $j$th image in the $i$th group seen at level A and at level B. Let $\bar{\Delta}_i$ be the average difference:

$$\bar{\Delta}_i = \frac{1}{N_i} \sum_j \Delta_{ij}.$$

We define

$$S_i^2 = \frac{1}{N_i - 1} \sum_j \left(\Delta_{ij} - \bar{\Delta}_i\right)^2,$$

and then the Behrens-Fisher-Welch t statistic is given by

$$t_{BFW} = \frac{\sum_i \bar{\Delta}_i}{\sqrt{\sum_i S_i^2 / N_i}}.$$
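The statistic and its permutation distribution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It takes the statistic to be the sum of the group mean differences divided by the square root of the sum of the group sample variances each divided by its group size. It approximates the permutation distribution by randomly flipping the sign of each paired difference, which corresponds to swapping levels A and B within a pair under the null hypothesis. The sign-flip scheme, the Monte Carlo resampling in place of full enumeration, the handling of degenerate groups, the function names, and the data are all hypothetical.

```python
import math
import random

def bfw_statistic(groups):
    """Sum of group mean differences over sqrt(sum of S_i^2 / N_i)."""
    numerator = 0.0
    variance_sum = 0.0
    for diffs in groups:
        n = len(diffs)
        if n < 2:
            continue  # assumption: skip groups whose sample variance is undefined
        mean = sum(diffs) / n
        s2 = sum((d - mean) ** 2 for d in diffs) / (n - 1)
        numerator += mean
        variance_sum += s2 / n
    if variance_sum == 0.0:
        # Degenerate case (assumption): no within-group spread at all.
        return 0.0 if numerator == 0.0 else math.copysign(math.inf, numerator)
    return numerator / math.sqrt(variance_sum)

def sign_flip_p_value(groups, n_resamples=2000, seed=0):
    """Two-sided Monte Carlo p-value from random sign flips of the paired differences."""
    rng = random.Random(seed)
    observed = abs(bfw_statistic(groups))
    hits = 0
    for _ in range(n_resamples):
        flipped = [[d if rng.random() < 0.5 else -d for d in diffs]
                   for diffs in groups]
        if abs(bfw_statistic(flipped)) >= observed:
            hits += 1
    # Count the observed arrangement itself so the estimate is never zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical differences (level A minus level B) in sensitivity for one judge,
# grouped by the number of abnormalities (1, 2, 3, 4) in the gold standard.
groups = [
    [0.0, 1.0, 0.0, 0.0, -1.0, 0.0],
    [0.5, 0.0, -0.5, 0.5, 0.0],
    [1 / 3, 0.0, -1 / 3, 2 / 3],
    [0.25, 0.0, 0.5],
]
t_bfw = bfw_statistic(groups)
p_value = sign_flip_p_value(groups)
print(f"t_BFW = {t_bfw:.3f}, Monte Carlo p-value = {p_value:.3f}")
```

Because the differences are fractions with small denominators, the reference distribution is built from the data themselves rather than from a Gaussian approximation. With few images, all sign patterns could be enumerated instead of sampled, which would make the test exact in the sense described above.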