## W

N(2,1)

N(2,2)

equal, then the conditional distribution of N(1,2) given N(1,2) + N(2,1) is binomial with parameters N(1,2) + N(2,1) and 0.5; that is,

$$\Pr\{N(1,2) = k \mid N(1,2) + N(2,1) = n\} = \binom{n}{k} 2^{-n}, \qquad k = 0, 1, \ldots, n.$$

This is the conditional distribution under the null hypothesis that the two modalities are equivalent. The extent to which N(1,2) differs from (N(1,2) + N(2,1))/2 is the extent to which the technologies were found to differ in the quality of performance achieved with their use. Let n denote the observed value of N(1,2) + N(2,1), let k = n(1,2) denote the observed value of N(1,2), and let B(n, 1/2) denote a binomial random variable with parameters n and 1/2. A statistically significant difference at level 0.05, say, is detected if the observed k is so unlikely under this binomial distribution that a hypothesis test of size 0.05 would reject the null hypothesis on observing k. Thus, if

$$\Pr\left(|B(n, 1/2) - n/2| > |n(1,2) - n/2|\right) < 0.05,$$

then we declare that a statistically significant difference has occurred.
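The two-sided tail probability in this test can be computed exactly by summing binomial probabilities. The following is a minimal sketch, not the authors' code: the function name `exact_binomial_p` is hypothetical, and it assumes the conventional inclusive tail (outcomes at least as far from n/2 as the observed count), which is slightly more conservative than a strict inequality.

```python
from math import comb


def exact_binomial_p(k: int, n: int) -> float:
    """Two-sided exact tail probability for k discordant pairs out of n.

    Under the null hypothesis, k is a draw from Binomial(n, 1/2).
    Assumption: the tail is inclusive, i.e. it sums over every outcome j
    with |j - n/2| >= |k - n/2|.
    """
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    distance = abs(k - n / 2)
    # Count the equally weighted sequences that land at least this far from n/2.
    tail = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= distance)
    return tail / 2 ** n


if __name__ == "__main__":
    # Illustrative counts only: 2 of 10 discordant pairs favor one modality.
    print(exact_binomial_p(2, 10))  # 112/1024 = 0.109375
```

With these illustrative counts the tail probability exceeds 0.05, so no significant difference would be declared; with 1 of 10 the value is 22/1024, which falls below 0.05.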

Whether and how to agglomerate the multiple tables is an issue. Generally speaking, we stratify the data so that any test statistics we apply can be assumed to have sampling distributions that we could defend in practice. It is always interesting simply to pool the data within a radiologist across all gold standard values, though it is really an analysis of the off-diagonal entries of such a table that is of primary interest. If we look at such a 4 × 4 table before deciding which entry to focus on, then we must contend with problems of multiple testing, which would lower the power of our various tests. Pooling the data within gold standard values but across radiologists is problematic because our radiologists are patently different in their clinical performances. This is consistent with what we found in the CT and MR studies.

Thus, even if one does agglomerate, there is the issue of how. Minus twice the sum over tables of the natural logarithms of the attained significance levels has, apart from considerations of discreteness of the binomial distribution, a chi-square distribution with degrees of freedom equal to twice the number of summands, provided that the null hypothesis of no difference is true for each table and that the outcomes of the tables are independent. This method was made famous by R. A. Fisher.

Then again, $(N(1,2) - N(2,1))^2 / (N(1,2) + N(2,1))$ has, at least approximately, a chi-square distribution with one degree of freedom if the null hypothesis of no difference in technologies is correct for the table. One can sum this statistic across tables and compare the sum with chi-square tables, taking the degrees of freedom to be the number of summands; this is a valid test if the tables are independent.
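Both ways of combining tables described above reduce to referring a sum to a chi-square distribution. The sketch below illustrates them using only the Python standard library; the function names (`chi2_sf`, `fisher_combined`, `summed_discordance_chi2`) are hypothetical, and the input counts in the usage example are made up for illustration. Tables with no discordant pairs are skipped in the second function, which is an assumption, since such tables contribute no information to the statistic.

```python
from math import erfc, exp, gamma, log, sqrt


def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability of a chi-square variable with df degrees of freedom.

    Uses the recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1)
    for the regularized upper incomplete gamma function, started at
    Q(1, y) = exp(-y) for even df or Q(1/2, y) = erfc(sqrt(y)) for odd df.
    """
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    y = x / 2
    a, q = (1.0, exp(-y)) if df % 2 == 0 else (0.5, erfc(sqrt(y)))
    while a < df / 2:
        q += y ** a * exp(-y) / gamma(a + 1)
        a += 1
    return q


def fisher_combined(p_values):
    """Fisher's method: -2 * sum(ln p) referred to chi-square with 2m df.

    Assumes independent tables and strictly positive p-values.
    """
    stat = -2 * sum(log(p) for p in p_values)
    df = 2 * len(p_values)
    return stat, df, chi2_sf(stat, df)


def summed_discordance_chi2(tables):
    """Sum of (n12 - n21)**2 / (n12 + n21) over independent tables.

    `tables` is a list of (n12, n21) pairs of off-diagonal counts.
    Assumption: tables with n12 + n21 == 0 are dropped.
    """
    used = [(a, b) for a, b in tables if a + b > 0]
    if not used:
        raise ValueError("no table has discordant pairs")
    stat = sum((a - b) ** 2 / (a + b) for a, b in used)
    return stat, len(used), chi2_sf(stat, len(used))


if __name__ == "__main__":
    # Made-up inputs: three per-table p-values and three (n12, n21) pairs.
    print(fisher_combined([0.11, 0.04, 0.30]))
    print(summed_discordance_chi2([(8, 2), (5, 4), (7, 1)]))
```

Fisher's method works from the exact per-table significance levels and so inherits their discreteness, whereas the summed statistic relies on the chi-square approximation holding within each table.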