## Correlation Analysis

If the evaluator can obtain a second time series of data from the outbreak (e.g., pneumonia and influenza deaths), he can use the correlation function to compare the two time series. The correlation function finds the time lag at which the correlation between the two time series is maximized and reports the strength of the correlation at that lag (Figure 21.2).

The second time series must have face validity; that is, it must already be known to reflect outbreak activity.

Figure 21.2 Correlation analysis. Two time series (A and B) show an outbreak effect. The correlation function finds the time shift that will bring the two time series into maximal alignment. An evaluator uses this difference (hypothetically given as two weeks in the example) as a measure of the relative earliness of two types of surveillance data.

1 It may be possible to bring signal out of noise by focusing the analysis more precisely on the subset of patients affected by the outbreak. For example, if all of the affected individuals live in the same zip code, the evaluator will create a time series for that zip code. If all of the affected individuals are also children, the evaluator will create a time series of data for children in that zip code.

Pneumonia and influenza deaths, for example, have been used for many years by researchers to identify the existence of influenza outbreaks and, therefore, have face validity.

The evaluator uses the time lag as a relative indication of earliness of detection (relative to the second time series). He uses the strength of correlation as an indication of the degree to which the perturbations in the surveillance data are caused by outbreak versus non-outbreak effects.
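The lag search described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in any of the cited studies: the function name `best_lag`, the `max_lag` search window, the use of the Pearson coefficient as the measure of correlation strength, and the synthetic weekly series are all assumptions made for the example.

```python
import numpy as np

def best_lag(a, b, max_lag):
    """Return (lag, r) maximizing the Pearson correlation of a[t] with b[t + lag].

    A positive lag means series `a` leads series `b` by that many time steps.
    Only the overlapping portion of the two series is compared at each lag.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("series must have the same length")
    n = len(a)
    best = (0, -np.inf)
    for lag in range(-max_lag, max_lag + 1):
        # Align a[t] with b[t + lag] and drop the non-overlapping ends.
        x = a[max(0, -lag): n - max(0, lag)]
        y = b[max(0, lag): n - max(0, -lag)]
        r = np.corrcoef(x, y)[0, 1]
        if r > best[1]:
            best = (lag, r)
    return best

# Synthetic (hypothetical) weekly counts: B has the same outbreak shape as A,
# delayed by two weeks, with a different baseline and magnitude.
weeks = np.arange(30)
series_a = 10 + 50 * np.exp(-0.5 * ((weeks - 12) / 2.0) ** 2)
series_b = 5 + 20 * np.exp(-0.5 * ((weeks - 14) / 2.0) ** 2)

lag, r = best_lag(series_a, series_b, max_lag=6)
print(lag, round(float(r), 3))
```

With these noise-free synthetic series, the search recovers a lag of two weeks with a correlation close to 1. Real surveillance data contain noise and non-outbreak effects, so the maximum correlation would be lower.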

Correlation analysis supports only limited conclusions about the timeliness of surveillance data for the outbreak in question. The correlation function finds the time lag at which the peaks of the signals in the two time series (which typically occur mid-outbreak) are most highly correlated. The evaluator, however, is interested in the latency between the initial upticks in the two time series, which correlation analysis cannot measure. The time difference between peaks equals the time difference between initial upticks only if the shape of the outbreak effect is identical in the two signals. For this reason, most evaluators also use the detection algorithm method, described in the next section.
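A small synthetic example may make this limitation concrete. The sketch below builds two hypothetical outbreak curves with different shapes (a narrow one and a broad one) and compares three quantities: the lag that maximizes correlation, the difference between peaks, and the difference between initial upticks. Defining the "initial uptick" as the first week a series exceeds 5% of its peak is an assumption made only for this illustration, as are the function names and curve parameters.

```python
import numpy as np

def best_lag(a, b, max_lag):
    """Lag (in time steps) maximizing Pearson r between a[t] and b[t + lag]."""
    n = len(a)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        x = a[max(0, -lag): n - max(0, lag)]
        y = b[max(0, lag): n - max(0, -lag)]
        scores[lag] = np.corrcoef(x, y)[0, 1]
    return max(scores, key=scores.get)

def onset_week(series, fraction=0.05):
    """First index where the series exceeds `fraction` of its peak.

    This threshold rule is an assumed stand-in for the 'initial uptick'.
    """
    return int(np.argmax(series > fraction * series.max()))

weeks = np.arange(60)
# Series A: sharp outbreak peaking at week 25.
series_a = 40 * np.exp(-0.5 * ((weeks - 25) / 1.5) ** 2)
# Series B: broad outbreak peaking at week 28, so it starts rising earlier.
series_b = 15 * np.exp(-0.5 * ((weeks - 28) / 4.0) ** 2)

corr_lag = best_lag(series_a, series_b, max_lag=10)
peak_lag = int(np.argmax(series_b) - np.argmax(series_a))
onset_lag = onset_week(series_b) - onset_week(series_a)
print(corr_lag, peak_lag, onset_lag)
```

Under these assumptions, the correlation lag and the peak difference both suggest that series A leads series B by three weeks, yet series B crosses the onset threshold three weeks before series A. The correlation lag tracks the alignment of the peaks, not the order of the initial upticks.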

The use of the correlation function to evaluate biosurveillance data was first demonstrated by Welliver et al. (1979). Other studies of biosurveillance data that have used the correlation function include Hogan et al. (2003), Magruder and Florio (2003), Espino et al. (2003), Campbell et al. (2004), and Johnson et al. (2004).