In this chapter, we focus on the task of spatial cluster detection: finding spatial areas where some monitored quantity is significantly higher than expected. For example, we may want to monitor the observed number of cases of influenza, or some other specific type of disease, and find any regions where the number of cases is abnormally high. The spatial cluster detection techniques that we describe are disease independent; that is, they are capable of detecting clusters of any type of disease including those of previously unknown diseases.

At present, health departments are using automatic spatial cluster detection primarily on syndromic data with a goal of detecting regions in a city or even a country with abnormally high case counts of some syndrome (e.g., respiratory), based on observed quantities such as the number of emergency department visits or sales of over-the-counter cough and cold medication. The detected clusters of disease may be indicative of a naturally occurring outbreak, a bioterrorist attack (e.g., anthrax release) or an environmental hazard.

The main goals of spatial scanning are to identify the locations, shapes, and sizes of potential clusters (i.e., pinpointing those areas which are most relevant), and to determine whether each of these potential clusters is more likely to be a "true" cluster (requiring further investigation by public health officials) or simply a chance occurrence (which can safely be ignored).

Handbook of Biosurveillance ISBN 0-12-369378-0

Elsevier Inc. All rights reserved.

As mentioned previously, the spatial cluster detection task involves two questions: Is anything unexpected going on? If so, where? In order to answer these questions, we must first have some idea of what we expect to see. We typically take one of two approaches, illustrated in Figure 16.1. In the population-based approach, we estimate an at-risk population for each area (e.g., zip code); this population can either be estimated simply from census data or adjusted for a variety of covariates (patients' age and gender, seasonal and day-of-week effects, etc.). Then the expected number of cases in an area is assumed to be proportional to its at-risk population, and thus clusters are areas where the disease rate (number of cases per unit population) is significantly higher inside the region than outside. The expectation-based approach directly estimates the number of cases we expect to see in each area, typically by fitting a model based on past data (e.g., the number of cases in each area on each previous day). Wagner et al. (2003) describe such an approach. Using an expectation-based method, clusters are areas where the number of cases is significantly greater than its expectation. One important difference between these two approaches is their response to a global increase (i.e., where the number of cases increases in the entire area being monitored). The expectation-based approach will find this increase very significant because the counts everywhere are higher than expected. However, the population-based approach will only find the increase significant if there is spatial variation in the amount of increase; otherwise the ratio of disease rate inside the region to disease rate outside the region remains constant, so no significant increase is detected. We discuss these approaches in more detail below.
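The different responses to a global increase can be illustrated with a small numeric sketch. The function names, the four-area data set, and the choice of region are hypothetical and serve only as an illustration; the baselines are assumed to be proportional to population.

```python
def expectation_based_ratios(counts, baselines):
    """Observed/expected ratio for each area (expectation-based view)."""
    return [c / b for c, b in zip(counts, baselines)]


def population_based_relative_risk(counts, populations, region):
    """Disease rate inside `region` divided by the rate outside it."""
    inside = set(region)
    c_in = sum(c for i, c in enumerate(counts) if i in inside)
    p_in = sum(p for i, p in enumerate(populations) if i in inside)
    c_out = sum(counts) - c_in
    p_out = sum(populations) - p_in
    return (c_in / p_in) / (c_out / p_out)


# Hypothetical data: four areas, baselines proportional to population.
populations = [1000, 2000, 1500, 500]
baselines = [10.0, 20.0, 15.0, 5.0]

# Global increase: the count in every area is twice its baseline.
doubled = [2 * b for b in baselines]

ratios = expectation_based_ratios(doubled, baselines)
relative_risk = population_based_relative_risk(doubled, populations, region=[0, 1])

print(ratios)         # every area is at twice its expected count
print(relative_risk)  # inside and outside rates are equal, so the ratio is 1
```

Under these assumptions every area looks elevated to the expectation-based view, while the inside-to-outside rate ratio used by the population-based view is unchanged.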

For now, let us consider the expectation-based approach, assuming that we are given the observed number of cases $c$, as well as an estimated mean $\mu$ and standard deviation $\sigma$, for each zip code. How can we tell whether any zip code has a number of cases that is significantly higher than expected? One simple possibility would be to perform a separate statistical test for each zip code: for example, we might want to detect all zip codes with observed count more than three standard deviations above the mean. However, there are two main problems with this simple method. First, treating each zip code separately prevents us from using information about the spatial proximity of adjacent zip codes. For instance, while a single zip code with count two standard deviations above the mean might not be sufficiently surprising to trigger an alarm, we would probably be interested in detecting a cluster of four adjacent zip codes each with count two standard deviations above the mean. Thus, the first problem with performing separate statistical tests for each zip code is reduced power to detect clusters spanning multiple zip codes: we cannot detect such increases unless the amount of increase is so large as to make each zip code individually significant. A second, and somewhat more subtle, problem is that of multiple hypothesis testing. We typically perform statistical tests to determine if an area is significant at some fixed level $\alpha$, such as $\alpha = 0.05$, which means that if there is no abnormality in that area (i.e., the "null hypothesis" of no clusters is true) our probability of a false alarm is at most $\alpha$. A lower value of $\alpha$ results in fewer false alarms, but also reduces our chance of detecting a true disease cluster. Now let us imagine that we are searching for clusters in a large area containing 1000 zip codes, and that there happen to be no outbreaks today, so any areas we detect are false alarms.
If we perform a separate significance test for each zip code, we expect each test to trigger an alarm with probability $\alpha = 0.05$. But because we are doing 1000 separate tests, our expected number of false alarms is $1000 \times 0.05 = 50$. Moreover, if these 1000 tests were independent, we would expect to get at least one false alarm with probability $1 - (1 - 0.05)^{1000} \approx 1$. Of course, counts of adjacent zip codes are likely to be correlated, so the assumption of independent tests is not usually correct. The main point here, however, is that we are almost certain to get false alarms every day, and the number of such false alarms is proportional to the number of tests performed. One way to correct for multiple tests is the Bonferroni correction (Bonferroni, 1935): If we want to ensure that our probability of getting any false alarms is at most $\alpha$, we report only those regions which are significant at level $\alpha/N$, where $N$ is the number of tests. The problem with the Bonferroni correction is that it is too conservative, thus reducing the power of the test to detect true outbreaks. In our example, with $\alpha = 0.05$ and $N = 1000$, we only signal an alarm if a region's statistical significance (p-value) is less than 0.00005, and thus only very obvious outbreaks can be detected.
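The arithmetic in this example can be checked directly. The following is a minimal sketch with illustrative function names; the probability of at least one false alarm assumes independent tests, which is usually not the case for adjacent zip codes.

```python
def expected_false_alarms(n_tests, alpha):
    """Expected number of false alarms when every null hypothesis is true."""
    return n_tests * alpha


def prob_any_false_alarm(n_tests, alpha):
    """Probability of at least one false alarm, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that keeps the overall level at most alpha."""
    return alpha / n_tests


n_tests, alpha = 1000, 0.05
print(expected_false_alarms(n_tests, alpha))  # about 50 false alarms per day
print(prob_any_false_alarm(n_tests, alpha))   # indistinguishable from 1
print(bonferroni_threshold(n_tests, alpha))   # 0.00005 per test
```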

As an alternative to this simple method, we can choose a set of regions to search over, where each region consists of a set of one or more zip codes. We can define the set of regions based on what we know about the size and shape of potential outbreaks; we can either fix the region shape and size, or let these vary as desired. We can then do a separate test for each region rather than for each zip code. This resolves the first problem of the previous method: assuming we have chosen the set of regions well, we can now detect attacks whether they affect a single zip code, a large number of zip codes, or anything in between. However, the disadvantage of this method is that it makes the multiple hypothesis testing problem even worse: the number of regions searched, and thus the number of tests performed, is typically much larger than the number of zip codes. In principle, the number of regions could be as high as $2^Z$, where $Z$ is the number of zip codes, but in practice the number of regions searched is much smaller (because we want to enforce constraints on the connectedness, size, and shape of regions). For example, if we consider circular regions centered at the centroid of some zip code, with continually varying radius (assuming that a region contains all zip codes with centroids inside the circle), the number of distinct regions is proportional to $Z^2$. For the example above, this would give us one million regions to search, creating a huge multiple hypothesis testing problem; less restrictive constraints (such as testing ellipses rather than circles) would require testing an even larger number of regions.

This method of searching over regions, without adjusting for multiple hypothesis testing, was first used by Openshaw et al. (1988) in their geographical analysis machine (GAM). Openshaw et al. test a large number of overlapping circles of fixed radius, and draw all of the significant circles on a map; Figure 16.2 gives an example of what the output of the GAM might look like. Because we expect a large number of circles to be drawn even if there are no outbreaks present, the presence of detected clusters is not sufficient to conclude that there is an outbreak. Instead, the GAM can be used as a descriptive tool for outbreak detection: whether any outbreaks are present, and the location of such outbreaks, must be inferred manually from the number and spatial distribution of detected clusters. For example, in Figure 16.2 the large number of overlapping circles in the upper right of the figure may indicate an outbreak, while the other circles might be due to chance. The problem is that we have no way of determining whether any given circle or set of circles is statistically significant, or whether they are due to chance and multiple testing; it is also difficult to precisely locate those clusters which are most likely to correspond to true outbreaks. Besag and Newell (1991) propose a related approach, where the search is performed over circles containing a fixed number of cases; this approach also suffers from the multiple hypothesis testing problem, but again is valuable as a descriptive method for visualizing potential clusters.

The scan statistic was first proposed by Naus (1965) as a solution to the multiple hypothesis testing problem. Let us assume we have a score of some sort for each region, such as the Z-score, $Z = (c - \mu)/\sigma$. The Z-score is the number of standard deviations by which the observed count $c$ exceeds the expected count $\mu$; a large Z-score indicates that the observed number of cases is much higher than expected. Rather than triggering an alarm if any region has Z-score higher than some fixed threshold, we instead find the distribution of the maximum score of all regions under the null hypothesis of no outbreaks. This distribution tells us what we should expect the most alarming score to be when the system is executed on data in which there is no outbreak. Then we compare the score of the highest-scoring (most significant) region on our data against this distribution to determine its statistical significance (or p-value). In other words, the scan statistic attempts to answer the question, "If there were no outbreaks, and we searched over all of these regions, how likely would we be to find any regions that score at least this high?" If the analysis shows that we would be very unlikely to find any such regions under the null hypothesis, we can conclude that the discovered region is a significant cluster. The main advantage of the scan statistic approach is that we can adjust correctly for multiple hypothesis testing: we can fix a significance level $\alpha$, and ensure that the probability of having any false alarms on a given day is at most $\alpha$, regardless of the number of regions searched. Moreover, because the scan statistic accounts for the fact that our tests are not independent, it will typically have much higher power to detect clusters than a Bonferroni-corrected method. In some applications, the scan statistic results in a most powerful statistical test (see Kulldorff, 1997 for more details).
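The idea of comparing the highest region score against the null distribution of the maximum can be sketched with a simple simulation. This is an illustration under stated assumptions rather than the method of any of the cited authors: the region score is an aggregated Z-score that treats zip codes as independent, the null distribution is simulated by drawing each count from a Gaussian with the given mean and standard deviation, and the data (six zip codes on a line, with contiguous runs as regions) and all function names are hypothetical.

```python
import math
import random


def region_z(counts, mu, sigma, region):
    """Aggregated Z-score of a region, assuming independent zip codes."""
    excess = sum(counts[i] - mu[i] for i in region)
    sd = math.sqrt(sum(sigma[i] ** 2 for i in region))
    return excess / sd


def max_score(counts, mu, sigma, regions):
    """Score of the highest-scoring region."""
    return max(region_z(counts, mu, sigma, r) for r in regions)


def scan_p_value(counts, mu, sigma, regions, n_replicas=999, seed=0):
    """Monte Carlo p-value of the maximum region score.

    Assumption: under the null, counts are independent Gaussian draws.
    """
    rng = random.Random(seed)
    observed = max_score(counts, mu, sigma, regions)
    at_least = 0
    for _ in range(n_replicas):
        simulated = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        if max_score(simulated, mu, sigma, regions) >= observed:
            at_least += 1
    # Count the observed data as one of the replicas.
    return observed, (at_least + 1) / (n_replicas + 1)


# Hypothetical data: zip codes 2-4 are each two standard deviations high.
mu = [20.0] * 6
sigma = [4.0] * 6
counts = [21, 19, 28, 28, 28, 20]
regions = [range(i, j) for i in range(6) for j in range(i + 1, 7)]

observed, p_value = scan_p_value(counts, mu, sigma, regions)
print(observed, p_value)
```

In this example no single zip code exceeds two standard deviations, yet the three adjacent elevated zip codes together form the highest-scoring region. Because the p-value refers to the maximum over all 21 regions, it already accounts for the number of regions searched.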

Although the scan statistic focuses on finding the single most significant region, it can also be used to find multiple regions: secondary clusters can be examined, and their significance found, though the test is typically somewhat conservative for these. The technical difficulty, however, is finding the distribution of the maximum region score under the null hypothesis. Turnbull et al. (1990) solved this problem for circular regions of fixed population, using the maximum number of cases in a circle as the test statistic, and using the method of randomization testing (discussed below) to find the statistical significance of discovered regions. The disadvantage of this approach is that it requires a fixed population size circle, and thus a multiple hypothesis testing problem still exists if we want to search over regions of multiple sizes or shapes. Kulldorff and Nagarwalla (1995) and Kulldorff (1997) solved the problem for variable size regions using a likelihood ratio test: The test statistic is the maximum, over the regions searched, of the ratio of the likelihoods under the alternative and null hypotheses, where the alternative hypothesis represents clustering in that region and the null hypothesis assumes no clusters. We discuss their method, the "spatial scan statistic," in the following section.
