## N

The null hypothesis that there is no difference between the proportions of type "A" individuals in the two populations is

Denote r + s by n. If the null hypothesis holds, then given n disparate or "untied" pairs, the number of pairs of type 2 (or of type 3) follows a binomial distribution with parameter 1/2. Typically, a large-sample test is obtained by regarding the quantity $(r - s)/\sqrt{r + s}$ as a standardized normal deviate. However, in this study we make no assumption of normality.

This McNemar analysis is applied to study intrasession learning effects in the CT study as follows. In each session, each image was seen at exactly two compression levels, and the ordering of the pages ensured that the two occurrences never appeared with fewer than three pages separating them. For each judge J, each session S, and each image I, we pair the judge's reading for a given compression level L1 with the same judge's reading for compression level L2 for the same image and same session, where L1 was seen before L2. For each member of the pair, the reading is either perfect (sensitivity = 1 and PVP = 1, type "A") or not perfect (type "not A").

For example, judge 1, in evaluating lung nodules over the course of three sessions, saw 71 pairs of images, in which an image seen at one compression level in a given session is paired with the same image seen at a different level in the same session. Of the 71 pairs, 53 times both images in the pair were judged perfectly, and 5 times both images were judged incorrectly.
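The pairing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_perfect` and `tally_pairs` and the toy readings are hypothetical and are not taken from the study, and each reading is assumed to be available as a (sensitivity, PVP) tuple.

```python
from collections import Counter
from typing import Iterable, Tuple

Reading = Tuple[float, float]  # (sensitivity, PVP) for one reading; assumed layout


def is_perfect(sensitivity: float, pvp: float) -> bool:
    """A reading is type "A" (perfect) only if sensitivity and PVP both equal 1."""
    return sensitivity == 1.0 and pvp == 1.0


def tally_pairs(pairs: Iterable[Tuple[Reading, Reading]]) -> Counter:
    """Count the four pair types.

    Each pair holds the reading seen first and the reading seen second for
    the same judge, session, and image. Keys of the returned Counter are
    (first_is_perfect, second_is_perfect).
    """
    counts: Counter = Counter()
    for first, second in pairs:
        counts[(is_perfect(*first), is_perfect(*second))] += 1
    return counts


# Hypothetical toy readings, for illustration only (not data from the study).
toy_pairs = [
    ((1.0, 1.0), (1.0, 1.0)),  # both perfect
    ((1.0, 1.0), (0.5, 1.0)),  # first perfect, second not perfect
    ((1.0, 0.5), (1.0, 1.0)),  # first not perfect, second perfect
    ((0.5, 0.5), (0.5, 1.0)),  # both not perfect
]
counts = tally_pairs(toy_pairs)
discordant = counts[(True, False)] + counts[(False, True)]  # the "untied" pairs
```

Only the two discordant cells, `(True, False)` and `(False, True)`, enter the test; the concordant cells are conditioned on.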

We concern ourselves with the other 13 pairs: 9 times the image seen first was judged incorrectly while the second was judged correctly, and 4 times the image seen second was judged incorrectly while the first was judged correctly. If it did not matter whether an image was seen first or second, then conditional on the numbers of the other two types, these counts would have a binomial distribution with parameters 13 and 1/2. This example is shown in Table 4. The probability that a fair coin flipped 13 times will produce a heads/tails split at least as great as 9 to 4 is 0.267; thus, this result is not significant.

TABLE 4 Judge 1, pairing for CT lung nodules

| First occurrence | Second occurrence: Perfect | Second occurrence: Not perfect | Total |
|---|---|---|---|
| Perfect | 53 | 4 | 57 |
| Not perfect | 9 | 5 | 14 |
| Total | 62 | 9 | 71 |

These calculations were carried out for 2 × 4 × 2 = 16 different subsets of the data (lungs vs. mediastinum (2); judges 1, 2, and 3 considered separately or pooled together (4); and consensus or personal gold standards (2)), and in no case was a significant difference found at the 5% significance level (p-values ranged from 0.06 to 1.0). An analysis of variance using the actual sensitivity and PVP observations (without combining them into "perfect" and "not perfect") similarly indicated that page order and session order had no significant effect on the diagnostic result.
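The coin-flip calculation for the judge 1 example (9 versus 4 discordant pairs) can be reproduced with a short sketch. The function names `exact_mcnemar_p` and `large_sample_z` are hypothetical, and the two-sided p-value is assumed to be twice the upper tail of a Binomial(n, 1/2) distribution, capped at 1; the large-sample quantity is included only for comparison, since the study itself does not rely on normality.

```python
from math import comb, sqrt


def exact_mcnemar_p(r: int, s: int) -> float:
    """Two-sided exact p-value for r versus s discordant pairs.

    Under the null hypothesis the split follows Binomial(r + s, 1/2), so the
    p-value is the probability of a split at least as uneven as r to s.
    """
    n = r + s
    k = max(r, s)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)  # symmetric distribution: double one tail


def large_sample_z(r: int, s: int) -> float:
    """Normal-approximation statistic (r - s) / sqrt(r + s), for comparison only."""
    return (r - s) / sqrt(r + s)


p_value = exact_mcnemar_p(9, 4)
print(round(p_value, 3))  # 0.267, matching the value quoted in the text
z = large_sample_z(9, 4)
```

The exact form is used here because with only 13 discordant pairs the binomial tail is cheap to compute and needs no approximation.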