Comparison of Judges

In the CT study, judges were compared with one another using the permutation distribution of Hotelling's paired T² statistic applied to the consensus gold standard results. T², as we use it, is a generalization of (the square of) a univariate paired t statistic. We illustrate its use with an example. Suppose that judges 1 and 2 are compared for their sensitivities on compressed lung images. The vector for comparison is six-dimensional, with one coordinate for each level of compression. Each image i and bit rate b evaluated by both judges gives rise to a difference d(i, b) of the sensitivities, judge 1 minus judge 2, and, averaging over images, to a sample mean d(b) and sample variance s²(b). Each image i that both judges evaluated at bit rates b and b' contributes a term proportional to

$$(d(i, b) - d(b))\,(d(i, b') - d(b'))$$

to the sample covariance s(b, b'). Write D for the column vector with bth coordinate d(b), and S for the 6×6 matrix with (b, b') entry s(b, b'). The version of T² we use is

$$T^2 = D^{\mathsf{T}} S^{-1} D.$$

It differs from the usual version [7] by a norming constant; with that constant, the statistic has an F distribution when the {d(i, b)} are jointly Gaussian and the numbers of (b, b') pairs are equal. As our data are decidedly non-Gaussian, attained significance levels are again computed from the permutation distribution of T² [7]. We use only 999 permutations plus the unpermuted value rather than the full distribution, which is neither computationally feasible nor necessary.
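
The computation of D, S, and T² described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: it assumes T² has the form D^T S^{-1} D, that a missing evaluation is stored as None, and that each covariance entry is normalized by the number of contributing images minus one. The function names (mean_and_cov, solve, hotelling_t2) are hypothetical, and only the standard library is used so that the sketch is self-contained.

```python
from itertools import product


def mean_and_cov(d):
    """Return (D, S) for d[i][b], the difference for image i at bit rate b.

    A missing evaluation is stored as None. Assumption: each covariance
    entry is normalised by (number of contributing images - 1).
    """
    n_rates = len(d[0])
    means = []
    for b in range(n_rates):
        vals = [row[b] for row in d if row[b] is not None]
        means.append(sum(vals) / len(vals))
    cov = [[0.0] * n_rates for _ in range(n_rates)]
    for b, c in product(range(n_rates), repeat=2):
        # Only images evaluated at both bit rates contribute to s(b, c).
        pairs = [(r[b], r[c]) for r in d if r[b] is not None and r[c] is not None]
        total = sum((x - means[b]) * (y - means[c]) for x, y in pairs)
        cov[b][c] = total / (len(pairs) - 1)
    return means, cov


def solve(a, y, tol=1e-12):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) < tol:
            raise ValueError("covariance matrix is singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def hotelling_t2(d):
    """T^2 = D^T S^{-1} D, computed without forming the inverse of S."""
    means, cov = mean_and_cov(d)
    x = solve(cov, means)  # x = S^{-1} D
    return sum(mb * xb for mb, xb in zip(means, x))


# Toy data: four images, two bit rates (values are made up for illustration).
toy = [[1, 0], [0, 1], [1, 1], [0, 0]]
t2_toy = hotelling_t2(toy)
```

Solving the linear system S x = D rather than inverting S explicitly is a design choice of this sketch; it also makes a singular S surface as an explicit error instead of a silently unreliable value.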

The permutation distribution is motivated by the fact that, were there no difference between the judges, it should not matter in computing the difference d(i, b) whether we compute judge 1 minus judge 2 or vice versa, or whether we randomize the choice with a fair coin toss. The latter is exactly what we do, but we constrain the randomization so that, for fixed i, the same sign is applied to every d(i, b). The constraint tends to preserve the covariance structure of the set of differences, at least when the null hypothesis of no difference is approximately true. (Unconstrained randomization would render the signs of d(i, b) and d(i, b') independent, and this is clearly not consistent with the data.) After randomizing the signs of all differences, we compute T² again; the process is repeated a total of 999 times. The result is a list of 1000 T² values: the "real" (unpermuted) one and 999 others. Were there no difference between the judges, the 1000 values would be (conditional on the data) independent and identically distributed. Otherwise, we expect the "real" value to be larger than most of the others. The attained significance level for the test of the null hypothesis that there is no difference between the judges is (k + 1)/1000, where k is the number of randomly permuted T² values that exceed the "real" one.
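
The constrained sign randomization and the attained significance level (k + 1)/1000 can be sketched as follows. This is an illustrative sketch, not the study's code: the function names, the use of None for a missing evaluation, and the fixed seed are assumptions. The test statistic is passed in as a function; the example uses a simple hypothetical stand-in (the sum of squared mean differences) where the study uses T².

```python
import random


def flip_signs(d, rng):
    """Apply one fair coin toss per image i to the whole row d[i].

    Every d(i, b) for a given image gets the same sign multiplier, which
    is the constraint described in the text. Missing values stay None.
    """
    flipped = []
    for row in d:
        sign = rng.choice((1, -1))
        flipped.append([None if v is None else sign * v for v in row])
    return flipped


def permutation_p_value(d, statistic, n_perm=999, seed=0):
    """Attained significance level (k + 1) / (n_perm + 1).

    k counts the sign-randomized statistics that strictly exceed the
    observed ("real") one.
    """
    rng = random.Random(seed)
    observed = statistic(d)
    k = sum(statistic(flip_signs(d, rng)) > observed for _ in range(n_perm))
    return (k + 1) / (n_perm + 1)


def sum_sq_means(d):
    """Hypothetical stand-in for T^2: sum over bit rates of squared means."""
    total = 0.0
    for b in range(len(d[0])):
        vals = [row[b] for row in d if row[b] is not None]
        total += (sum(vals) / len(vals)) ** 2
    return total


# Made-up data: judge 1 exceeds judge 2 on every image at every bit rate.
consistent = [[1, 1, 1] for _ in range(12)]
p_consistent = permutation_p_value(consistent, sum_sq_means)
```

With 999 randomizations the smallest attainable significance level is 1/1000, which is what the toy data above produce, because no sign-randomized value can exceed the observed one.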

Some of the comparisons we hoped to make with T² could not be computed: not only are the d(i, b) not independent, but S was also singular. We could have extended the Hotelling T² approach to the case of a noninvertible covariance matrix by making an arbitrary choice of pseudoinverse. However, this is not a customary approach to T² in the usual Gaussian case, and the inferences we draw are quite clear without resorting to that technique.
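
Whether a given S is singular can be checked before any attempt to invert it. The sketch below estimates the rank of a matrix by Gaussian elimination. The function names and the absolute tolerance are assumptions made for illustration, and the example matrix is made up: it mimics a case in which the differences at one bit rate are all zero, so that the corresponding row and column of S vanish.

```python
def numerical_rank(matrix, tol=1e-10):
    """Estimate the rank of a matrix by Gaussian elimination with pivoting.

    Pivots with absolute value <= tol are treated as zero (an assumption;
    a scale-aware tolerance may be preferable for real data).
    """
    m = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        piv = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[piv][col]) <= tol:
            continue  # no usable pivot in this column
        m[rank], m[piv] = m[piv], m[rank]
        for r in range(rank + 1, n_rows):
            f = m[r][col] / m[rank][col]
            for k in range(col, n_cols):
                m[r][k] -= f * m[rank][k]
        rank += 1
    return rank


def is_singular(cov, tol=1e-10):
    """True when the square matrix cov has less than full rank."""
    return numerical_rank(cov, tol) < len(cov)


# Made-up 3x3 covariance in which the third coordinate has zero variance.
s_example = [
    [0.25, 0.10, 0.0],
    [0.10, 0.30, 0.0],
    [0.0, 0.0, 0.0],
]
singular_example = is_singular(s_example)
```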

The p-value for the comparison of judges 1 and 2 on sensitivity in finding lung nodules is not significant, and the same is true of the comparisons of judges 1 and 3 and of judges 2 and 3. Two of the three comparisons of predictive value positive for the lung are not significant; the third (judge 1 vs. judge 2) could not be computed because S is singular. The analogous comparisons for the mediastinum give rather different results. Judge 2 seems to differ from both other judges in sensitivity (both p-values about 0.04). Judge 2 also seems to differ from judge 3 in predictive value positive, with the same p-value. Similar results were obtained from a seven-dimensional comparison, in which the additional coordinate comes from data on the original images. The basic message is that the judges seem to differ from one another in judging the mediastinum but not the lung.
