## Detailed Description of the Spatial Scan Statistic

As discussed above, Kulldorff's spatial scan statistic attempts to detect spatial regions where the underlying disease rates $q_i$ are significantly higher inside the region than outside the region. Thus, we wish to test the null hypothesis $H_0$ ("the underlying disease rate is spatially uniform") against the set of alternative hypotheses $H_1(S)$: "The underlying disease rate is higher inside region $S$ than outside region $S$." More precisely, we have:

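$H_0$: $c_i \sim \mathrm{Poisson}(q_{\mathrm{all}}\, p_i)$ for all locations $s_i$, for some constant $q_{\mathrm{all}}$.

$H_1(S)$: $c_i \sim \mathrm{Poisson}(q_{\mathrm{in}}\, p_i)$ for all locations $s_i$ in $S$, and $c_i \sim \mathrm{Poisson}(q_{\mathrm{out}}\, p_i)$ for all locations $s_i$ outside $S$, for some constants $q_{\mathrm{in}} > q_{\mathrm{out}}$.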
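To make the two hypotheses concrete, the following minimal sketch draws counts under each of them. The toy populations, the rates, and the helper names `poisson` and `sample_counts` are illustrative assumptions, not part of the original method description; the Poisson sampler uses Knuth's multiplication method, which is adequate only for small means.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small means)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_counts(pops, in_region, q_in, q_out, rng):
    """Draw c_i ~ Poisson(q * p_i), with q = q_in inside S and q = q_out outside S."""
    return [poisson(rng, (q_in if inside else q_out) * p)
            for p, inside in zip(pops, in_region)]

# Hypothetical toy data: six locations, the middle two forming region S.
pops = [1000, 1500, 800, 1200, 900, 1100]
in_region = [False, False, True, True, False, False]
rng = random.Random(0)

# H0 is the special case q_in == q_out == q_all.
h0_counts = sample_counts(pops, in_region, 0.005, 0.005, rng)
# H1(S): the rate inside S is higher than the rate outside S.
h1_counts = sample_counts(pops, in_region, 0.015, 0.005, rng)
```

Under this sketch the null hypothesis needs no separate code path: it is simply the alternative with equal inside and outside rates.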

The test statistic that we use is the likelihood ratio, that is, the likelihood (denoted by Pr) of the data under the alternative hypothesis H1(S) divided by the likelihood of the data under the null hypothesis H0. This gives us, for any region S, a score function:

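$$D(S) = \frac{\Pr(\text{Data} \mid H_1(S))}{\Pr(\text{Data} \mid H_0)}$$

For Kulldorff's statistic, we obtain

$$D(S) = \left(\frac{C_{\mathrm{in}}}{P_{\mathrm{in}}}\right)^{C_{\mathrm{in}}} \left(\frac{C_{\mathrm{out}}}{P_{\mathrm{out}}}\right)^{C_{\mathrm{out}}} \left(\frac{C_{\mathrm{all}}}{P_{\mathrm{all}}}\right)^{-C_{\mathrm{all}}}$$

if $C_{\mathrm{in}}/P_{\mathrm{in}} > C_{\mathrm{out}}/P_{\mathrm{out}}$, and $D(S) = 1$ otherwise; see Kulldorff (1997) for a derivation. In this equation, $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ represent the aggregate count $\sum c_i$ inside and outside region $S$, $P_{\mathrm{in}}$ and $P_{\mathrm{out}}$ represent the aggregate population $\sum p_i$ inside and outside region $S$, respectively, and $C_{\mathrm{all}} = C_{\mathrm{in}} + C_{\mathrm{out}}$ and $P_{\mathrm{all}} = P_{\mathrm{in}} + P_{\mathrm{out}}$ are the totals over the whole search area. See Figure 16.3 for an example of the evaluation of $D(S)$ for a region. Kulldorff (1997) proved that this likelihood ratio statistic is individually most powerful for finding a single region of elevated disease rate: for the given model assumptions ($H_0$ and $H_1$), for a fixed false alarm rate, and for a given set of regions searched, it is more likely to detect the cluster than any other test statistic.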
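As an illustration, the score of a single region can be computed from its four aggregates. The sketch below assumes the standard Poisson form of Kulldorff's likelihood ratio (maximum likelihood rates plugged into numerator and denominator, with a score of 1 when the inside rate is not elevated); the function name `kulldorff_score` and the example numbers are hypothetical. It works in log space to avoid underflow for large counts.

```python
import math

def kulldorff_score(c_in, p_in, c_out, p_out):
    """Likelihood ratio D(S) from the aggregate counts and populations."""
    # No elevated rate inside S (or an empty inside/outside): the ratio is 1.
    if p_in <= 0 or p_out <= 0 or c_in / p_in <= c_out / p_out:
        return 1.0
    c_all, p_all = c_in + c_out, p_in + p_out

    def term(c, p):
        # c * log(c / p), using the convention 0 * log(0) = 0.
        return c * math.log(c / p) if c > 0 else 0.0

    log_d = term(c_in, p_in) + term(c_out, p_out) - term(c_all, p_all)
    return math.exp(log_d)

# Hypothetical region: 20 cases among 100 people inside, 80 among 900 outside.
d_elevated = kulldorff_score(20, 100, 80, 900)
# Equal rates inside and outside give no evidence for a cluster.
d_uniform = kulldorff_score(10, 100, 90, 900)
```

Because the score depends only on the aggregates, any two regions containing the same locations receive the same value.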

Figure 16.3 Counts inside and outside rectangular regions.

Given the above test statistic $D(S)$, the spatial scan statistic method can be easily applied by choosing a set of regions $S$, calculating the score function $D(S)$ for each of these regions, and obtaining the highest scoring region $S^*$ and its score $D^* = D(S^*)$. We can imagine this procedure as moving a "spatial window" (like the rectangle drawn in Figure 16.3) all around the search area, changing the size and shape of the window as desired, and finding the window which gives the highest score $D(S)$. Even though there are an infinite number of possible window positions, sizes, and shapes, we only need to evaluate the score function a finite number of times, since any two regions containing the same set of spatial locations $s_i$ will have the same score. The region with the highest score $D(S)$ is the "most significant region," that is, the region that is most likely to have been generated under the alternative hypothesis rather than the null hypothesis, and thus the region most likely to be a cluster.

We typically search over the set of all "spatial windows" of a given shape and varying size; for example, Kulldorff et al. (1997) search over circular regions, Neill and Moore (2004a) search over square regions, and Neill and Moore (2004b) search over rectangular regions. Searching over a set of regions which includes both compact and elongated regions (e.g., rectangles or ellipses) has the advantage of higher power to detect elongated clusters resulting from wind dispersal of pathogens, but because the number of regions to search is increased, this also makes the scan statistic more difficult to compute. We discuss computational issues in more detail below. Chapter 19 describes more accurate modeling of windborne dispersion patterns.
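The search described above can be sketched as an exhaustive scan over all axis-aligned rectangles of a small grid. This is a naive illustration under stated assumptions, not the fast algorithms of the cited papers: the data are assumed to be aggregated to a uniform grid, the names `score` and `scan_rectangles` and the toy grid are hypothetical, and each rectangle's aggregates are recomputed from scratch for clarity rather than speed.

```python
import math

def score(c_in, p_in, c_all, p_all):
    """Kulldorff's D(S) from the inside aggregates and the grid totals."""
    c_out, p_out = c_all - c_in, p_all - p_in
    if p_in <= 0 or p_out <= 0 or c_in / p_in <= c_out / p_out:
        return 1.0

    def term(c, p):
        return c * math.log(c / p) if c > 0 else 0.0

    return math.exp(term(c_in, p_in) + term(c_out, p_out) - term(c_all, p_all))

def scan_rectangles(counts, pops):
    """Return (D*, S*), with S* given as (top, left, bottom, right), inclusive."""
    n_rows, n_cols = len(counts), len(counts[0])
    c_all = sum(map(sum, counts))
    p_all = sum(map(sum, pops))
    best_score, best_region = 1.0, None
    for top in range(n_rows):
        for left in range(n_cols):
            for bottom in range(top, n_rows):
                for right in range(left, n_cols):
                    cells = [(r, c) for r in range(top, bottom + 1)
                             for c in range(left, right + 1)]
                    c_in = sum(counts[r][c] for r, c in cells)
                    p_in = sum(pops[r][c] for r, c in cells)
                    d = score(c_in, p_in, c_all, p_all)
                    if d > best_score:
                        best_score, best_region = d, (top, left, bottom, right)
    return best_score, best_region

# Hypothetical 4 x 4 grid with an elevated 2 x 2 block in the middle.
counts = [[2, 2, 2, 2],
          [2, 8, 8, 2],
          [2, 8, 8, 2],
          [2, 2, 2, 2]]
pops = [[100] * 4 for _ in range(4)]
best_score, best_region = scan_rectangles(counts, pops)
```

On this toy grid the highest-scoring rectangle is the planted block, rows 1 to 2 and columns 1 to 2. An $N \times N$ grid has on the order of $N^4$ rectangles, which is why larger or more flexible region sets quickly become expensive to search.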

Once we have found the regions with the highest scores $D(S)$, we must still determine which of these "potential clusters" are likely to be "true clusters" resulting from a disease outbreak, and which are likely to be due to chance. To do so, we calculate the statistical significance (p-value) of each potential cluster, and all clusters with p-value less than some fixed significance level $\alpha$ are reported. Because of the multiple hypothesis testing problem discussed above, we cannot simply test each region score $D(S)$ for significance separately, because we would obtain a large number of false positives, proportional to the number of regions searched. Instead, for each region $S$, we ask the question, "If this data set were generated under the null hypothesis $H_0$, how likely would we be to find any regions with scores higher than $D(S)$?"

To answer this question, we use the method known as randomization testing: we randomly generate a large number of "replicas" under the null hypothesis, and compute the maximum score $D^* = \max_S D(S)$ of each replica. More precisely, each replica is a copy of the original search area that has the same population values $p_i$ as the original, but has each value $c_i$ randomly drawn from a Poisson distribution with mean $\frac{C_{\mathrm{all}}}{P_{\mathrm{all}}}\, p_i$, where $C_{\mathrm{all}}$ and $P_{\mathrm{all}}$ are respectively the total number of cases and the total population for the original search area. Once we have obtained $D^*$ for each replica, we can compute the statistical significance of any region $S$ by comparing $D(S)$ to these replica values of $D^*$, as shown in Figure 16.4. The p-value of region $S$ can be computed as

$$\frac{R_{\mathrm{beat}} + 1}{R + 1},$$

where $R$ is the total number of replicas created, and $R_{\mathrm{beat}}$ is the number of replicas with $D^*$ greater than $D(S)$. If this p-value is less than our significance level $\alpha$, we conclude that the region is significant (likely to be a true cluster); if the p-value is greater than $\alpha$, we conclude that the region is not significant (likely to be due to chance). We typically start from the most significant region $S^*$ and test regions in order of decreasing $D(S)$, since if a region $S$ is not significant, no region with lower $D(S)$ will be significant.

We note that the randomization testing approach given here has the benefit of bounding the overall false positive rate: regardless of the number of regions searched, the probability of any false alarms is bounded by the significance level $\alpha$. Also, the more replications performed (i.e., the larger the value of $R$), the more precise the p-value we obtain; a typical value would be $R = 1000$. However, since the run time is proportional to the number of replications performed, a large $R$ dramatically increases the amount of computation necessary.

Finally, we note that spatial scan software is available at www.satscan.org and www.autonlab.org. The former is the very widely used SaTScan software (Kulldorff and Information Management Services Inc., 2002). The latter is prototype software that is discussed below.
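The randomization test can be sketched in a few lines. This is a minimal illustration under assumptions, not the implementation used by either software package: the regions are all axis-aligned rectangles of a small hypothetical grid, the helper names (`poisson`, `score`, `max_score`, `randomization_p_value`) are invented for the example, the Poisson sampler is Knuth's method (suitable only for small means), and $R = 199$ is chosen to keep the run short.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; fine for small means)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def score(c_in, p_in, c_all, p_all):
    """Kulldorff's D(S) from the inside aggregates and the grid totals."""
    c_out, p_out = c_all - c_in, p_all - p_in
    if p_in <= 0 or p_out <= 0 or c_in / p_in <= c_out / p_out:
        return 1.0
    term = lambda c, p: c * math.log(c / p) if c > 0 else 0.0
    return math.exp(term(c_in, p_in) + term(c_out, p_out) - term(c_all, p_all))

def max_score(counts, pops):
    """D* = max over all axis-aligned rectangles of grid cells (brute force)."""
    n_rows, n_cols = len(counts), len(counts[0])
    c_all, p_all = sum(map(sum, counts)), sum(map(sum, pops))
    best = 1.0
    for top in range(n_rows):
        for left in range(n_cols):
            for bottom in range(top, n_rows):
                for right in range(left, n_cols):
                    cells = [(r, c) for r in range(top, bottom + 1)
                             for c in range(left, right + 1)]
                    c_in = sum(counts[r][c] for r, c in cells)
                    p_in = sum(pops[r][c] for r, c in cells)
                    best = max(best, score(c_in, p_in, c_all, p_all))
    return best

def randomization_p_value(d_obs, counts, pops, n_replicas, rng):
    """p-value (R_beat + 1) / (R + 1) for a region with observed score d_obs."""
    rate = sum(map(sum, counts)) / sum(map(sum, pops))  # C_all / P_all
    n_beat = 0
    for _ in range(n_replicas):
        # Same populations as the original, counts redrawn under H0.
        replica = [[poisson(rng, rate * p) for p in row] for row in pops]
        if max_score(replica, pops) > d_obs:
            n_beat += 1
    return (n_beat + 1) / (n_replicas + 1)

# Hypothetical 4 x 4 grid with an elevated 2 x 2 block in the middle.
counts = [[2, 2, 2, 2], [2, 8, 8, 2], [2, 8, 8, 2], [2, 2, 2, 2]]
pops = [[100] * 4 for _ in range(4)]
d_star = max_score(counts, pops)
p_value = randomization_p_value(d_star, counts, pops, 199, random.Random(0))
```

Each replica is scored by its own maximum $D^*$ rather than by the score of the same region, which is what makes the resulting p-value account for the number of regions searched. With $R$ replicas the smallest attainable p-value is $1/(R+1)$, so $R$ limits the resolution of the test.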