## Calculation of Populations and Baselines

Figure 16.5 A screen shot from the current RODS spatial scan software. The bottom time series is the recent history of electrolyte sales in a region north of Indianapolis that was determined as the most significant region in the state on that day (the actual region is hidden in order to avoid providing information that might reveal the data providers).

In the discussion of the spatial scan methods above, we have paid relatively little attention to the question of how the underlying populations or baselines are obtained. In the population-based methods, we often start from census data, which gives an unadjusted population $p_i$ corresponding to each spatial location $s_i$. This population can then be adjusted for covariates such as the distribution of patient age and gender, giving an estimated "at-risk population" for each spatial location. In a recent paper, Kleinman et al. (2005) suggest two additional model-based adjustments to the population estimates. First, they present a method for temporal adjustment (accounting for day of week, month of the year, and holidays), making the populations larger on days when more visits are likely (e.g., Mondays during influenza season) and smaller on days when fewer visits are likely (e.g., Sundays and holidays). Second, they apply a "generalized linear mixed models" (GLMM) approach, first presented in Kleinman et al. (2004), to adjust for the differing baseline risk in each census tract. This makes the adjusted population larger in tracts that have a larger baseline risk, which makes sense since a given number of observed cases should not be as significant if the observed counts in that region are consistently high. These baseline risks are computed from historical data, that is, the time series of past counts in each census tract, using the GLMM version of logistic regression to fit the model; see Kleinman et al. (2004) for details.
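The direction of the two adjustments described above can be illustrated with a minimal sketch. It is not the GLMM procedure of Kleinman et al.; as a deliberately crude stand-in, it scales each census population by two empirical ratios taken from historical counts: a day-of-week factor and a per-tract risk ratio. The function name `adjust_populations`, its arguments, and the input layout are hypothetical, and the sketch assumes every tract has a count for every historical day and that the historical counts are not all zero.

```python
def adjust_populations(census_pop, daily_counts, weekdays, today_weekday):
    """Crude empirical stand-in for model-based population adjustment.

    census_pop:    dict tract -> unadjusted census population
    daily_counts:  dict tract -> list of historical daily counts (oldest first)
    weekdays:      list of weekday codes (0-6), one per historical day
    today_weekday: weekday code of the day being analyzed

    Assumption: every tract has one count per historical day, at least one
    historical day shares today's weekday, and total counts are nonzero.
    """
    n_days = len(weekdays)

    # Total count over all tracts on each historical day.
    totals = [sum(daily_counts[t][d] for t in census_pop) for d in range(n_days)]
    overall_mean = sum(totals) / n_days

    # Temporal factor: mean total on days sharing today's weekday,
    # relative to the mean total over all days.
    same_day = [totals[d] for d in range(n_days) if weekdays[d] == today_weekday]
    dow_factor = (sum(same_day) / len(same_day)) / overall_mean

    # Overall historical rate: cases per person per day.
    overall_rate = sum(totals) / (n_days * sum(census_pop.values()))

    adjusted = {}
    for tract, pop in census_pop.items():
        # Tract-specific historical rate relative to the overall rate.
        tract_rate = sum(daily_counts[tract]) / (n_days * pop)
        risk_ratio = tract_rate / overall_rate
        adjusted[tract] = pop * risk_ratio * dow_factor
    return adjusted
```

In this sketch, a tract whose historical rate is above the overall rate receives a larger adjusted population, and all tracts are scaled up on weekdays that historically see more visits. A fitted model would additionally shrink noisy per-tract estimates toward the overall rate, which these raw ratios do not do.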

In the expectation-based methods, we also make use of the historical data, but for these methods the goal is to directly estimate the number of cases we expect to see in each area. Thus, we must predict the expected number of cases $b_i$ for each spatial location $s_i$ based on the history of past counts at that location (and, optionally, on the spatial correlation of counts at nearby locations). This becomes, in essence, a univariate time series analysis problem, and many of the techniques discussed in Chapter 14 can be used. For example, to adjust for day of week effects, we can either stratify by day of week (i.e., predict Tuesday's expected count by using only prior Tuesdays) or adjust for day of week using the sickness availability method (see Chapter 14). For data sets without strong seasonal effects, simple mean or exponentially weighted moving average methods (e.g., estimating the expected value of today's count as the mean of the counts 7, 14, 21, and 28 days ago) can be sufficient, but for data sets with strong seasonality, these methods will lag behind the seasonal trend, resulting in numerous false positives for increasing trends (e.g., sales of cough and cold medication at the start of winter) or false negatives for decreasing trends (e.g., cough and cold sales at the end of winter). To account for these trends, we recommend the use of regression methods (either weighted linear regression or nonlinear regression depending on the data) to extrapolate the current counts; see Neill et al. (2005b) for more details of this expectation-based approach.

Another possibility is to make the assumption of independence of space and time, as in Kulldorff et al. (2005); this means that the expected count in a given region is equal to the total count of the entire area under surveillance, multiplied by the historical proportion of counts in that region. This approach is successful in detecting very localized outbreaks, but loses power to detect more widespread outbreaks (Neill et al., 2005b). The reason for this is that a widespread outbreak will increase the total count significantly, thus increasing the expected count in the outbreak region, and hence making the observed increase in counts seem less significant. In the worst case, a massive outbreak that causes a constant, multiplicative increase in counts across the entire area under surveillance would be totally ignored by this approach; this is also true for many of the population-based methods, since they only detect spatial variation in disease rate, not an overall increase in counts. If these methods are used, we recommend using a purely temporal method in parallel to ensure that large-scale outbreaks (as well as localized outbreaks) can be detected. Either way, the accurate inference of expected counts from historical data is still an open problem, with different methods performing well for different data sets and outbreak types. See Neill et al. (2005b) for empirical testing of the various time series methods and further discussion.
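To make the baseline estimators discussed above concrete, the following sketch implements three of them for illustration: the mean of the counts 7, 14, 21, and 28 days ago, a one-step linear extrapolation, and the space-time independence baseline. All function names and input layouts are hypothetical. The regression here is plain unweighted least squares, a simplification of the weighted and nonlinear regressions mentioned in the text, and none of this is taken from the cited papers' implementations.

```python
def lagged_mean_baseline(counts, lags=(7, 14, 21, 28)):
    """Expected count for today as the mean of the counts 7, 14, 21, and
    28 days ago. counts[-1] is yesterday, so counts[-7] is 7 days ago.
    Assumption: counts holds at least max(lags) days."""
    return sum(counts[-lag] for lag in lags) / len(lags)


def linear_trend_baseline(counts, window=28):
    """Expected count for today from an unweighted least-squares line
    fitted to the last `window` days and extrapolated one day ahead.
    Assumption: at least two days of history are available."""
    recent = counts[-window:]
    n = len(recent)
    x_mean = (n - 1) / 2
    y_mean = sum(recent) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(recent))
    slope = sxy / sxx
    # Today corresponds to x = n, one step past the fitted window.
    return y_mean + slope * (n - x_mean)


def space_time_independent_baselines(history, today_counts):
    """Expected count per location under independence of space and time:
    today's total count times the location's historical share of counts.

    history:      dict location -> list of past counts
    today_counts: dict location -> today's observed count
    """
    hist_totals = {loc: sum(past) for loc, past in history.items()}
    grand_total = sum(hist_totals.values())
    today_total = sum(today_counts.values())
    return {
        loc: today_total * hist_totals[loc] / grand_total
        for loc in history
    }
```

Two small cases show the behaviors described in the text. For a steadily rising series such as `list(range(28))`, the lagged mean gives 10.5 while the linear extrapolation gives about 28, so the lagged mean trails the trend. For the independence baseline, if `history = {"north": [10] * 28, "south": [30] * 28}` and every count doubles today (`{"north": 20, "south": 60}`), the expected counts equal the observed counts and the outbreak is invisible; if only the north doubles (`{"north": 20, "south": 30}`), the expected count for the north is 12.5, well below the observed 20.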
