## Generalizing the Spatial Scan Statistic

In this subsection, we consider a general statistical framework for the spatial scan statistic, extending it to allow for a large class of underlying models and thus a wide variety of application domains. As above, we wish to test a null hypothesis $H_0$ (a model of how the data is generated, assuming there are no clusters of interest) against the set of alternative hypotheses $H_1(S)$, each of which represents a relevant cluster in some region $S$ of space. Assuming that the null hypothesis and each alternative hypothesis are point hypotheses (with no free parameters), we can use the likelihood ratio

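$$D(S) = \frac{\Pr(\text{Data} \mid H_1(S))}{\Pr(\text{Data} \mid H_0)}$$

as our test statistic. A more interesting question is what to do when each hypothesis has some parameter space $\Theta$: let $\theta_1(S) \in \Theta_1(S)$ denote parameters for the alternative hypothesis $H_1(S)$, and let $\theta_0 \in \Theta_0$ denote parameters for the null hypothesis $H_0$. There are two possible answers to this question. In the more typical maximum likelihood framework, we use the estimates of each set of parameters that maximize the likelihood of the data:

$$D(S) = \frac{\max_{\theta_1(S) \in \Theta_1(S)} \Pr(\text{Data} \mid H_1(S), \theta_1(S))}{\max_{\theta_0 \in \Theta_0} \Pr(\text{Data} \mid H_0, \theta_0)}.$$

We then perform randomization testing using the maximum likelihood estimates of the parameters under the null hypothesis. In the marginal likelihood framework, we instead average over the possible values of each parameter:

$$D(S) = \frac{\int_{\Theta_1(S)} \Pr(\text{Data} \mid H_1(S), \theta_1(S)) \, \Pr(\theta_1(S)) \, d\theta_1(S)}{\int_{\Theta_0} \Pr(\text{Data} \mid H_0, \theta_0) \, \Pr(\theta_0) \, d\theta_0}.$$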

Figure 16.4. Calculating the largest region score 1000 times. On the left it is calculated on the real data. On the right it is calculated 999 times on randomized data.

Neill et al. (2005c) present a Bayesian variant of Kulldorff's spatial scan statistic using the marginal likelihood framework; here we focus on the simpler, maximum likelihood approach, and give an example of how new scan statistics can be derived.

Our first step is to choose the null hypothesis and set of alternative hypotheses that we are interested in testing. Here we consider the expectation-based scan statistic discussed above, where we are given the baseline (or expected count) $b_i$ and the observed count $c_i$ for each spatial location $s_i$, and our goal is to determine if any spatial location $s_i$ has $c_i$ significantly greater than $b_i$. We test the null hypothesis $H_0$ against the set of alternative hypotheses $H_1(S)$, where:

$H_0$: $c_i \sim \text{Poisson}(b_i)$ for all spatial locations $s_i$.

$H_1(S)$: $c_i \sim \text{Poisson}(q b_i)$ for all spatial locations $s_i$ in $S$, and $c_i \sim \text{Poisson}(b_i)$ for all spatial locations $s_i$ outside $S$, for some constant $q > 1$.

Here, the alternative hypothesis $H_1(S)$ has one parameter, $q$ (the relative risk in region $S$), and the null hypothesis $H_0$ has no parameters. Computing the likelihood ratio, and using the maximum likelihood estimate for our parameter $q$, we obtain the following expression for $D(S)$:

$$D(S) = \max_{q \ge 1} \frac{\prod_{s_i \in S} \Pr(c_i \sim \text{Poisson}(q b_i))}{\prod_{s_i \in S} \Pr(c_i \sim \text{Poisson}(b_i))}$$
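The maximization over $q$ can be carried out in closed form. As a sketch of the algebra, write $C = \sum_{s_i \in S} c_i$ and $B = \sum_{s_i \in S} b_i$ and expand the Poisson probabilities; the factorials and the $b_i^{c_i}$ terms cancel between numerator and denominator:

$$\begin{aligned}
\prod_{s_i \in S} \frac{(q b_i)^{c_i} e^{-q b_i} / c_i!}{b_i^{c_i} e^{-b_i} / c_i!} &= q^{C} e^{(1 - q) B}, \\
\frac{d}{dq} \left[ C \log q + (1 - q) B \right] &= \frac{C}{q} - B = 0 \;\Rightarrow\; q = \frac{C}{B}.
\end{aligned}$$

The log-likelihood ratio is concave in $q$, so under the constraint $q \ge 1$ the maximum is attained at $q = C/B$ when $C > B$ and at the boundary $q = 1$ otherwise.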

We find that the value of $q$ that maximizes the numerator is $q = \max(1, C/B)$, where $C = \sum_{s_i \in S} c_i$ and $B = \sum_{s_i \in S} b_i$ are the total count and total baseline of region $S$ respectively. Plugging in this value of $q$, and working through some algebra, we obtain $D(S) = (C/B)^C \exp(B - C)$ if $C > B$, and $D(S) = 1$ otherwise. Then the most significant region $S^*$ is the one with the highest value of $D(S)$, as above. We can calculate the statistical significance (p-value) of this region by randomization testing as above, where replicas are generated under the null hypothesis $c_i \sim \text{Poisson}(b_i)$.
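The score and the randomization test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of a region as a tuple of location indices, the simple Poisson sampler, and the p-value convention (beat + 1) / (replicas + 1) are all choices made here for illustration. The score is computed on the log scale, which ranks regions identically to $D(S)$ and avoids overflow.

```python
import math
import random


def log_score(C, B):
    """log D(S) = C*log(C/B) + B - C when C > B, and 0 (i.e. D(S) = 1) otherwise."""
    if C > B:
        return C * math.log(C / B) + B - C
    return 0.0


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method.

    Illustrative only: adequate for small means, slow and numerically
    unsafe for very large ones.
    """
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def max_log_score(counts, baselines, regions):
    """Largest log D(S) over the candidate regions (tuples of location indices)."""
    return max(
        log_score(sum(counts[i] for i in S), sum(baselines[i] for i in S))
        for S in regions
    )


def randomization_p_value(counts, baselines, regions, n_replicas=999, seed=0):
    """Compare the observed maximum score with maxima from null replicas."""
    rng = random.Random(seed)
    observed = max_log_score(counts, baselines, regions)
    beat = 0
    for _ in range(n_replicas):
        # Replica counts are drawn under H0: c_i ~ Poisson(b_i).
        replica = [sample_poisson(b, rng) for b in baselines]
        if max_log_score(replica, baselines, regions) >= observed:
            beat += 1
    return (beat + 1) / (n_replicas + 1)


# Hypothetical toy data: six locations on a line, with an excess at 2 and 3.
baselines = [10.0] * 6
counts = [9, 11, 30, 28, 10, 8]
# Candidate regions: every single location and every adjacent pair.
regions = [(i,) for i in range(6)] + [(i, i + 1) for i in range(5)]

print(max_log_score(counts, baselines, regions))
print(randomization_p_value(counts, baselines, regions))
```

With one evaluation on the real data and 999 on replicas, the maximum region score is computed 1000 times in total. Taking the maximum over all regions inside each replica is what accounts for the multiple regions being searched.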

For the population-based method, a very similar derivation can be used to obtain Kulldorff's statistic: we compute the maximum likelihood parameter estimates $q_{in} = C_{in}/P_{in}$, $q_{out} = C_{out}/P_{out}$, and $q_{all} = C_{all}/P_{all}$. We have also used this general framework to derive scan statistics assuming that counts $c_i$ are generated from normal distributions with mean (i.e., expected count) $\mu_i$ and variance $\sigma_i^2$; these statistics are useful if counts might be overdispersed or underdispersed. Many other likelihood ratio scan statistics are possible, including models with simultaneous attacks in multiple regions and models with spatially varying (rather than uniform) disease rates. We believe that some of these more complex model specifications may have more power to detect relevant and interesting clusters, while excluding those potential clusters which are not epidemiologically relevant.
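Returning to the population-based method mentioned above, the closed form can be sketched as follows. The setup is an assumption, since this section does not spell it out: each location has a population $p_i$, the null hypothesis is $c_i \sim \text{Poisson}(q_{all}\, p_i)$ everywhere, and the alternative is $c_i \sim \text{Poisson}(q_{in}\, p_i)$ inside $S$ and $c_i \sim \text{Poisson}(q_{out}\, p_i)$ outside $S$ with $q_{in} > q_{out}$. Here $C$ and $P$ denote total counts and populations inside $S$, outside $S$, and overall. Substituting the maximum likelihood estimates, the exponential terms cancel because $C_{in} + C_{out} = C_{all}$, leaving:

$$D(S) = \begin{cases}
\left( \dfrac{C_{in}}{P_{in}} \right)^{C_{in}} \left( \dfrac{C_{out}}{P_{out}} \right)^{C_{out}} \left( \dfrac{C_{all}}{P_{all}} \right)^{-C_{all}} & \text{if } \dfrac{C_{in}}{P_{in}} > \dfrac{C_{out}}{P_{out}}, \\[2ex]
1 & \text{otherwise.}
\end{cases}$$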
