Introduction

Healthcare data are commonly represented as a sequence of observations recorded at regular time intervals. For instance, these observations could be the number of emergency department (ED) cases per hour, the number of ED cases involving respiratory symptoms per day, or the number of thermometers sold per day. This sequence of observations is called a univariate time series. A time series is a sequence of observations made over time from a random process. The term univariate refers to the fact that we only get one piece of information per time step, such as the number of patients with respiratory symptoms appearing at an ED. Multivariate data consist of records with more than one variable associated with each record. An example of multivariate data would be patient records containing information about the patient's age, gender, home zip code, work zip code, and symptoms. In this chapter, we will discuss algorithms for detecting unexpected increases in a univariate time series that might indicate an outbreak. We will present multivariate detection algorithms in the next chapter.

This chapter tours a nonexhaustive set of approaches and, along the way, introduces the reader to some of the issues faced when choosing or implementing a time-series-based method. We keep the discussion at a high level, with plenty of examples, and without going into the many statistical details. Our intent is that anyone who wishes to skip over the equations we do provide will still understand the central issues in this kind of analysis, primarily through the worked examples.

Figure 14.1 contains an example of a univariate time series. It plots, for a simulated town, the number of physician visits per day in which the patient's recorded complaint was a respiratory problem, for each day between August 12, 2004 and September 30, 2004. This graph also illustrates the effects of a simulated outbreak on the data. One of the key assumptions of many detection algorithms is that the observed data are the sum of cases from regular background activity plus any cases from an outbreak. We produced this graph according to this assumption: outbreak cases were injected into the background activity.

One simple way to detect anomalies in a univariate time series is to look for an obvious spike. However, by visually inspecting Figure 14.1, it is difficult to determine exactly when the outbreak begins: it is clearly happening by Monday, September 27, but could it have started before? The answer is revealed in Figure 14.2: the dark gray line indicates a "ramp outbreak" that starts on Saturday, September 25. This injection simulates a situation in which there were 10 extra visits on top of the initial series on Saturday, 20 on Sunday, 30 on Monday, 40 on Tuesday, and 50 on Wednesday. By inspecting Figure 14.2 carefully, it is possible to notice day-of-week effects in the weeks leading up to the simulated outbreak. Most noticeably, Sundays tend to have much lower counts, and Mondays often have slightly higher counts than the surrounding days. Interestingly, if we take day-of-week effects into account, the increment on Sunday, September 26 is noticeable: it is the first Sunday we have seen with counts higher than the previous Friday. It is arguable whether the small rise on Saturday is in any way detectable.
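The injection described above can be sketched in a few lines of Python. This is only an illustration of the additive assumption (observed count = background count + outbreak count). The baseline values below are hypothetical, chosen for the example, and are not the data behind Figures 14.1 and 14.2; only the ramp sizes (10, 20, 30, 40, 50) and the dates come from the text. The function name `inject_ramp` is likewise our own.

```python
from datetime import date, timedelta


def inject_ramp(baseline, start_index, ramp):
    """Return a copy of baseline with ramp[i] extra cases added on day start_index + i."""
    injected = list(baseline)
    for offset, extra in enumerate(ramp):
        idx = start_index + offset
        if idx < len(injected):  # ignore ramp days that fall past the end of the series
            injected[idx] += extra
    return injected


# HYPOTHETICAL background counts for Mon Sep 20 through Wed Sep 29, 2004.
first_day = date(2004, 9, 20)
baseline = [110, 100, 98, 102, 95, 80, 78, 112, 101, 99]

# Ramp outbreak from the text: 10 extra cases on Saturday, Sep 25, growing by 10 per day.
ramp = [10, 20, 30, 40, 50]
start = (date(2004, 9, 25) - first_day).days

injected = inject_ramp(baseline, start, ramp)

# Print date, background count, outbreak cases, and observed (injected) count.
for i, (b, x) in enumerate(zip(baseline, injected)):
    day = first_day + timedelta(days=i)
    print(day.strftime("%a %b %d"), b, x - b, x)
```

With these made-up numbers, the injected Sunday count (98) exceeds the preceding Friday's count (95), mirroring the observation made about Figure 14.2, while the Saturday rise stays within the range of ordinary weekday variation.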

Let us now consider how to develop automated systems to notice rises such as these in time series data. If we could model the background activity, which represents conditions during non-outbreak periods, we could quantify how far the observed count for a day departs from the predicted count. We could then detect outbreaks by looking for severe deviations from the background activity. The majority of univariate algorithms discussed in this chapter work exactly in this manner. In the first step, the algorithm estimates the background activity according to various assumptions, such as requiring the background activity to have a normal distribution. This first step is the main difference between the various algorithms we will present. Some algorithms are able to characterize the background data more accurately by taking temporal fluctuations into account. In the second step, limits are imposed to determine the amount of deviation from the background that will be tolerated before an alert is raised.
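As a minimal sketch of this two-step pattern, the Python below estimates the background as the mean and standard deviation of an initial window of days (step one) and raises an alert whenever a later count exceeds the mean plus k standard deviations (step two). It is not one of the specific algorithms presented later in the chapter. The function name `threshold_alerts`, the default multiplier `k=3.0`, and the counts are all assumptions made for illustration, and the sketch deliberately ignores day-of-week effects.

```python
from statistics import mean, stdev


def threshold_alerts(counts, n_baseline, k=3.0):
    """Two-step detector sketch.

    Step 1: estimate background activity from the first n_baseline counts.
    Step 2: alert on any later day whose count exceeds mean + k * sd.
    Returns (limit, list of alerting day indices).
    """
    background = counts[:n_baseline]
    mu = mean(background)
    sigma = stdev(background)  # sample standard deviation; needs at least 2 points
    limit = mu + k * sigma
    alerts = [i for i in range(n_baseline, len(counts)) if counts[i] > limit]
    return limit, alerts


# HYPOTHETICAL daily counts: eight background days, then two days to be tested.
counts = [100, 104, 98, 102, 96, 101, 99, 103, 105, 130]
limit, alerts = threshold_alerts(counts, n_baseline=8)
print(f"alert limit = {limit:.1f}, alerting days = {alerts}")
```

Raising k makes the detector more reluctant to alert, and lowering it makes the detector more sensitive. A fixed initial window is the simplest possible background model; replacing it with something that tracks temporal fluctuations is exactly where the algorithms in this chapter differ.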

Handbook of Biosurveillance ISBN 0-12-369378-0

Figure 14.1 A time series of physician visits over time, with a simulated outbreak added. A puzzle for the reader: when did the simulated outbreak start? (See answer in the text.)

Figure 14.2 Light gray ("inject") is the simulated data. It is the sum of two time series: the black baseline ("count"), which is mostly covered by "inject," and the simulated outbreak (labeled "ramp," dark gray).

[x-axis ticks: Mondays, August 23, 2004 through September 27, 2004]

How can we tell if one detection algorithm is better than another? This is a very large and important question, described in detail in Chapter 20. For the purposes of this chapter, we will evaluate detection algorithms based on the tradeoff between the detection time and the false-positive rate. The detection time is defined as the number of time steps needed to detect the outbreak after it begins. A false positive is defined as an alert that is raised in the absence of an outbreak. The false-positive rate is the number of false positives divided by the total number of time steps. The tradeoff between detection time and false-positive rate is best explained by considering the two extremes. At one extreme, we have a "boy who cried wolf" algorithm that raises an alert extremely frequently. Its false-positive rate will be high, but because it alarms so often, its time to detect an outbreak will be very short. At the other extreme, we have a very insensitive algorithm that is reluctant to raise an alarm unless it is very sure an outbreak is occurring. In this case, the false-positive rate will be lower at the expense of a longer detection time. The ideal detection algorithm finds just the right compromise between the detection time and the false-positive rate.
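The two evaluation measures can be computed from a detector's daily alert flags, as in the Python sketch below. It follows the definitions given above but adds two assumptions of our own: the outbreak is taken to last from its start until the end of the series, and an alert on the very first outbreak day counts as a detection time of 0. The function names and the two alert sequences are hypothetical.

```python
def detection_time(alerts, outbreak_start):
    """Time steps from outbreak start to the first alert at or after it.

    Returns None if the outbreak is never detected.
    ASSUMPTION: an alert on the first outbreak day counts as detection time 0.
    """
    for t, alert in enumerate(alerts):
        if t >= outbreak_start and alert:
            return t - outbreak_start
    return None


def false_positive_rate(alerts, outbreak_start):
    """False positives divided by the total number of time steps.

    ASSUMPTION: the outbreak runs from outbreak_start to the end of the series,
    so only alerts before outbreak_start are false positives.
    """
    false_positives = sum(1 for alert in alerts[:outbreak_start] if alert)
    return false_positives / len(alerts)


# HYPOTHETICAL alert flags over 10 days; the outbreak begins on day index 5.
outbreak_start = 5
cry_wolf = [False, True, False, True, False, True, True, True, True, True]
insensitive = [False, False, False, False, False, False, False, True, True, True]

for name, alerts in [("cry wolf", cry_wolf), ("insensitive", insensitive)]:
    print(name, detection_time(alerts, outbreak_start),
          false_positive_rate(alerts, outbreak_start))
```

In this toy comparison the frequently alarming detector catches the outbreak immediately but pays for it with false alarms, while the insensitive detector raises no false alarms and detects the outbreak two days late.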

In subsequent sections of this chapter, we will describe some illustrative time series detection algorithms. Before that, we will briefly describe the simulated data we will be using.