studies or how many radiologists we would have, one could always vary the parameters so that we would need more of either or both.

If we take sensitivity for detection to be 0.85, say, then at least for that quantity 400 studies and 9 radiologists seem ample. At this time one good recommendation would be to start with 400 studies and 12 radiologists, three at each of four centers, and find an attained significance level for a test of the null hypothesis that there is no difference between technologies. And, perhaps at least as important, estimate the parameters of Table 2. At that point the numbers of further radiologists or studies required, if any, could be estimated for particular values of size and power that reviewers might require. The design could be varied so that the pool of studies would include more than 400, but no single radiologist would read more than 400. In this way we could assess fairly easily the impact of variable prevalence of adverse findings in the gold standard, though we could get at that issue even in the situation we study here.

Computations of power apply equally well in our formulation to sensitivity and specificity. They are based on a sample of 400 studies for which prudent medical practice would dictate return to screening for 200, and something else (6-month followup, additional assessment needed, or biopsy) for the other 200. Thus, there are 200 studies that figure in computation of sensitivity and the same number for specificity. All comparisons are in the context of "clinical management," which can be "right" or "wrong." It is a given that there is an agreed-upon gold standard, independent or separate. For a given radiologist who has judged two technologies — here called I and II and meant to be digital and analog or analog and lossy compressed digital in application — a particular study leads to an entry in a 2 by 2 agreement table of the form of Table 1.
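For concreteness, such an agreement table can be tallied from paired right/wrong outcomes for a single radiologist; a minimal sketch, assuming the conventional cell layout for Table 1 (both right, I right only, II right only, both wrong — the function and variable names are ours):

```python
# Tally a 2-by-2 agreement table for one radiologist who has judged
# technologies I and II on the same studies.  Each entry of right_I /
# right_II records whether clinical management under that technology
# agreed with the gold standard ("right" or "wrong").
def agreement_table(right_I, right_II):
    a = sum(1 for i, j in zip(right_I, right_II) if i and j)       # both right
    b = sum(1 for i, j in zip(right_I, right_II) if i and not j)   # I right, II wrong
    c = sum(1 for i, j in zip(right_I, right_II) if not i and j)   # I wrong, II right
    d = sum(1 for i, j in zip(right_I, right_II) if not (i or j))  # both wrong
    return a, b, c, d
```

Only the discordant cells b and c carry information about a difference between the technologies, which is what the McNemar test below exploits.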

If the null hypothesis of "no difference in technologies" is true, then whatever the values of Δ and γ, h = 0. An alternative hypothesis would specify h ≠ 0, and without loss of generality (since we are free to call whichever technology we want I or II) we may take h > 0 under the alternative hypothesis that there is a true difference in technologies. Under the null, given b + c, b has a binomial distribution with parameters b + c and 1/2. Under the alternative, given b + c, b is binomial with parameters b + c and (1 − Δ − h − γ)/(2 − 2Δ − 2γ − h). The usual McNemar conditional test of the null hypothesis is based on (b − c)²/(b + c) having approximately a chi-square distribution with one degree of freedom.
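The conditional test can be sketched in a few lines (the function name is ours; scipy's chi-square survival function supplies the approximate p-value):

```python
from scipy.stats import chi2

def mcnemar(b, c):
    """McNemar conditional test: refer (b - c)^2 / (b + c) to chi-square
    with one degree of freedom.  b and c are the discordant cell counts."""
    stat = (b - c) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)
```

For example, b = 30 and c = 10 give a statistic of 400/40 = 10.0 and a p-value well below 0.05.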

In actual practice we intend to use R radiologists for R = 9, 12, 15, or 18, to assume that their findings are independent, and to combine their data by adding the respective values of their McNemar statistics. We always intend that the size (probability of Type I error) be 0.05. Since the sum of independent chi-square random variables is distributed as chi-square with degrees of freedom the sum of the respective degrees of freedom, it is appropriate to take as the critical value for our test the number C, where

Pr(χ²_R > C) = 0.05.

The four respective values of C are therefore 16.92, 21.03, 25.00, and 28.87.
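These quantiles can be checked directly; a sketch using scipy:

```python
from scipy.stats import chi2

# Critical value C with Pr(chi-square with R d.f. > C) = 0.05,
# for each candidate panel size R.
for R in (9, 12, 15, 18):
    C = chi2.ppf(0.95, df=R)
    print(f"R = {R:2d}: C = {C:.2f}")
```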

Computation of power is tricky because it is unconditional: before the experiment, b + c for each radiologist is random. Thus, the power is the probability that a noncentral chi-square random variable with R degrees of freedom and noncentrality parameter

[(p1 − q1)²/(4p1q1)](b1 + c1 + ⋯ + bR + cR)

exceeds C/(4p1q1), where each bi + ci has a binomial distribution with parameters N and 2 − 2Δ − 2γ − h; the R random integers are independent; and p1 = (1 − Δ − h − γ)/(2 − 2Δ − 2γ − h) = 1 − q1. This entails that the noncentrality parameter of the chi-square random variable that figures in the computation of power is itself random. Note that a noncentral chi-square random variable with R degrees of freedom and noncentrality parameter Q has the distribution of (G1 + √Q)² + G2² + ⋯ + GR², where G1, …, GR are independent, identically distributed standard Gaussians. On the basis of previous work and this pilot study, we have chosen to compute the power of our size 0.05 tests for N always 200; Δ from 0.55 to 0.85 in increments of 0.05; γ = 0.03, 0.05, 0.10; and, as was stated, R = 9, 12, 15, and 18. The simulated values of power can be found in [2] and code for carrying out these computations is in Appendix C of Betts [1]. These form the basis of our earlier estimates for the necessary number of patients and will be updated as data are acquired.
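The unconditional power computation can be sketched as a Monte Carlo over the random counts bi + ci; a sketch under the parametrization above (the function name, default parameter values, and simulation size are ours, not from [1] or [2]):

```python
import numpy as np
from scipy.stats import chi2, ncx2

def power(R, N=200, delta=0.85, gamma=0.05, h=0.05, n_sim=20000, seed=0):
    """Simulated power of the size-0.05 summed-McNemar test with R
    radiologists, each reading N studies; delta, gamma, h are the
    agreement-table parameters of the text."""
    rng = np.random.default_rng(seed)
    p_disc = 2 - 2 * delta - 2 * gamma - h         # P(a study is discordant, b or c)
    p1 = (1 - delta - h - gamma) / p_disc          # P(entry is b | discordant)
    q1 = 1 - p1
    C = chi2.ppf(0.95, df=R)                       # critical value of the summed statistic
    # Draw the R independent discordance counts b_i + c_i for each replication.
    counts = rng.binomial(N, p_disc, size=(n_sim, R)).sum(axis=1)
    lam = (p1 - q1) ** 2 / (4 * p1 * q1) * counts  # random noncentrality parameter
    # Average the noncentral chi-square tail probability over the replications.
    return ncx2.sf(C / (4 * p1 * q1), df=R, nc=lam).mean()
```

When h = 0 the noncentrality vanishes and the computed power reduces to the size 0.05, a useful sanity check on the simulation.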
