Computational Considerations

While considering deployment of spatial methods, it may be necessary to account for the computational time of an algorithm if it is to be run over a very large area or use many randomizations. In this section we return to the question of what set of regions to search over, and discuss how to perform this search efficiently. First, we note that the run time of the spatial scan can be approximated by the product of three factors: the number of replications R, the average number of regions searched per replication ISI, and the average time to search a region t. The number of replications R is typically fixed in advance, but we can stop early if many replicas beat the original search area (i.e., the maximum region scores D* of the replicas are higher than the maximum region score D* of the original). If this happens, it is clear that no significant clusters are present.The other two factors ISI and t depend on both the set of regions to be searched and the algorithm used to search these regions. For a set of M distinct spatial locations in two dimensions, the number of circular or axis-aligned square regions (assuming that the size of the circle or square can vary) is proportional to M3, while the number of axis-aligned rectangular regions (assuming that both dimensions of the rectangle can vary) is proportional to M4. For non—axis-aligned squares or rectangles, we must also multiply this number by the number of different orientations searched. However, most algorithms only search a subset of these regions: for example, Kulldorff (1999) algorithm searches only circles centered at one of the M spatial locations, and the number of such regions is proportional to M2, not M3. Another possibility is to aggregate the spatial locations to a grid, either uniform or based on the distinct spatial coordinates of the data points. For a two-dimensional NxN grid, the number of axis-aligned square regions is proportional to N3, and the number of axis-aligned rectangular regions is proportional to N4. Whatever set of regions we choose, the simplest possible implementation of the scan statistic is to search each of these regions by stepping through the M spatial locations, determining which locations are inside and outside the region, computing the aggregate populations/baselines and counts, and applying the score function. Thus, in this approach, we have ISI (number of regions searched per replication) equal to the total number of distinct regions, and t (time to search a region) proportional to the number of spatial locations M.

There are several possible ways to improve on the runtime of this naïve approach. First, we can reduce the time to search a region t, making this search time independent of the number of spatial locations M. We consider two possible methods for searching a region in constant time. The first method, which we call "incremental addition,'' assumes that we want to search over all regions of a given type: for example, in the approach of Kulldorff (1999), we want to search all distinct circular regions centered at one of the spatial locations. To do so, we increase the region's size incrementally, such that one new spatial location at a time enters the region; for each new location, we can add that location's count and population/baseline to the aggregates, and recompute the score function. For example, in Kulldorff's method, for each location si we keep a list of the other locations, sorted from closest to furthest away. Then we can search over the M distinct circular regions centered at si by adding the other points one at a time in order. Because the sorting only has to be done once (and does not have to be repeated for each replication), this results in constant search time per region. In other words, Kulldorff's method requires time proportional to M2 to search over all M2 regions. This must be done for each of the R replications, giving total search time proportional to RM2. The second method assumes that points have been aggregated to an NxN grid, and that we are searching over squares or rectangles. We can use the well-known "cumulative counts'' technique to search in constant time per region; see Neill and Moore (2004b) for more details. As a result, we can perform the scan statistic for gridded square or rectangular regions in time proportional to R times the number of regions, that is, RN3 or RN4 for square or rectangular regions, respectively.

Even if we can search in constant time per region, the spatial scan statistic is still extremely computationally expensive, because of the large number of regions searched. For example, to search over all rectangular regions on a 256x256 grid, and perform randomization testing (assuming R = 1000 replications), we must search a total of 1.1 trillion regions, which would take 14 to 45 days on our test systems. This is clearly far too slow for real-time detection of emerging outbreaks. While one option is to simply search fewer regions, this reduces our power to detect disease clusters. A better option is provided by the fast spatial scan algorithms of Neill and Moore (2004a, 2004b) and Neill et al. (2005a), which allow us to reduce the number of regions searched, but without losing any accuracy. The idea is that, since we only care about the most significant regions, that is, those with the highest scores D(S), we do not need to search a region S if we know that it will not have a high score. Thus, we start by examining large regions S, and if we can show that none of the smaller regions contained in S can have high scores, we do not need to actually search each of these regions. Thus, we can achieve the same result as if we had searched all possible regions, but by only searching a small fraction of these. Further speedups are gained by the use of multiresolution data structures, which allow us to efficiently move between searching at coarse and fine resolutions. A detailed description of these structures is beyond the scope of this chapter, but this has been an active area of research in computer science and biomedical informatics. In these algorithms, we apply multiresolution techniques in order to accelerate the spatial scan statistic for searching square regions (Neill et al. 2005a) or rectangular regions (Neill and Moore, 2004b). We have also extended these techniques to rotated rectangular regions and multidimensional data sets (Neill et al., 2005a). These methods are able to search hundreds or thousands of times faster than an exhaustive search, without any loss of accuracy (i.e., the fast spatial scan finds exactly the same region and p-value as exhaustive search). As a result, these methods have enabled us to perform spatial scans on data sets such as nationwide over-the-counter sales data, from over 20,000 stores in near real time, searching for disease clusters in minutes or hours rather than days or weeks. In Figure 16.5, we show a screen shot from the current version of our spatial scan software, available from our websites (Auton Laboratory) and (RODS Laboratory).

Was this article helpful?

0 0

Post a comment