## Bayesian Wrapper Method

In this section, we use the Cryptosporidium example to demonstrate how to compute the probability of an outbreak after observing a spike in the surveillance data when using a non-Bayesian detection algorithm. We refer to this approach as creating a Bayesian wrapper for detection systems because the approach uses Bayes' theorem to incorporate the prior probability of the outbreak into the interpretation of an alert from a non-Bayesian system. To make the example as concrete as possible, we demonstrate this technique for a specific detection system and decision scenario. We used the results of these calculations in the decision analysis that we presented in the previous chapter.

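$$
\begin{aligned}
P(\text{Outbreak} \mid \text{Alarm})
&= \frac{P(\text{Alarm} \mid \text{Outbreak})\,P(\text{Outbreak})}{P(\text{Alarm} \mid \text{Outbreak})\,P(\text{Outbreak}) + P(\text{Alarm} \mid \text{No Outbreak})\,[1 - P(\text{Outbreak})]} \\
&\le \frac{\{\text{Sensitivity of Best System}\}\,P(\text{Outbreak})}{\{\text{Sensitivity of Best System}\}\,P(\text{Outbreak}) + P(\text{Alarm} \mid \text{No Outbreak})\,[1 - P(\text{Outbreak})]} \\
&= \frac{\{1.0\}\,P(\text{Outbreak})}{\{1.0\}\,P(\text{Outbreak}) + P(\text{Alarm} \mid \text{No Outbreak})\,[1 - P(\text{Outbreak})]} \\
&= \frac{\{1.0\}\,0.0035}{\{1.0\}\,0.0035 + 0.0027\,[1 - 0.0035]} = 0.5654
\end{aligned}
$$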
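As a rough numerical check, the bound can be reproduced in a few lines of Python. This is only a sketch: the function name `posterior_given_alarm` is ours, and the inputs (a sensitivity of 1.0, a daily false-alarm probability of 0.0027, and a prior of 0.0035) are the values that appear in the calculation above.

```python
def posterior_given_alarm(sensitivity, false_alarm_prob, prior):
    """Bayes' theorem for a single alarm: P(Outbreak | Alarm)."""
    true_alarm = sensitivity * prior                # P(Alarm | Outbreak) P(Outbreak)
    false_alarm = false_alarm_prob * (1.0 - prior)  # P(Alarm | No Outbreak) [1 - P(Outbreak)]
    return true_alarm / (true_alarm + false_alarm)


# A perfectly sensitive system (sensitivity = 1.0) gives the upper bound.
upper_bound = posterior_given_alarm(sensitivity=1.0,
                                    false_alarm_prob=0.0027,
                                    prior=0.0035)
print(round(upper_bound, 4))
```

Because the expression increases with sensitivity, any real system with sensitivity below 1.0 yields a smaller posterior for the same prior and false-alarm probability.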

### 3.1. Detection System

As discussed in Chapter 29, we created a hypothetical surveillance system that monitors the area supplied by a single water treatment plant for a spike in over-the-counter (OTC) sales of diarrhea remedies. We use the Chicago metropolitan area for convenience because we previously studied detectability of Cryptosporidium from OTC data in Chicago. Chicago is served by two water treatment plants. We, therefore, assumed that each treatment plant serves half of the city, and that our detection system receives aggregated sales for the region served by a single plant.

The detection system uses the CuSum algorithm as its detector, with a moving average component that accounts for day-of-week effects in the surveillance data. In biosurveillance, CuSum algorithms are typically used to monitor for outbreaks in which people become sick over several days or weeks (Page, 1954; Hutwagner et al., 1997; Morton et al., 2001; Sonesson and Bock, 2003; Rogerson and Yamada, 2004), as is historically the case with Cryptosporidium outbreaks caused by water contamination. The moving average component enables the algorithm to adapt to short- and long-term trends in the baseline data while still detecting outbreaks of moderate duration. The CuSum algorithm uses a fixed detection threshold that we set to limit false alarms to a rate of one per year. Every day, the moving average component forecasts the daily sales of diarrhea remedies by using an analysis of variance (ANOVA) model with a day-of-week factor that is fit to the sales data for the previous 28 days. The biosurveillance system then computes the CuSum statistic from the standardized forecast error, determines whether the statistic exceeds the detection threshold, and, if so, alerts the biosurveillance staff.
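The daily procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the system's actual implementation. It assumes that day of week is the only factor in the ANOVA model (so the forecast reduces to the mean of the same weekday over the previous four weeks), that the CuSum is one-sided, and that the statistic is not reset after an alert. The reference value `k` and threshold `h` are placeholder values, since the text states only that the threshold is tuned to one false alarm per year.

```python
import numpy as np


def cusum_alerts(sales, k=0.5, h=4.0, window=28):
    """Flag days on which a one-sided CuSum of standardized forecast errors exceeds h.

    k and h are illustrative placeholders, not values from the text.
    """
    if window % 7 != 0:
        raise ValueError("window must be a whole number of weeks")
    sales = np.asarray(sales, dtype=float)
    alerts = np.zeros(len(sales), dtype=bool)
    # Position i in the history window shares a weekday with today when i % 7 == 0.
    weekday = np.arange(window) % 7
    cusum = 0.0
    for t in range(window, len(sales)):
        history = sales[t - window:t]
        # One-way ANOVA with a day-of-week factor: fitted values are weekday means.
        means = np.array([history[weekday == d].mean() for d in range(7)])
        residuals = history - means[weekday]
        resid_sd = np.sqrt(np.sum(residuals ** 2) / (window - 7))
        # Standardize by the residual SD (a simplification of the forecast SD).
        z = (sales[t] - means[0]) / max(resid_sd, 1e-9)
        cusum = max(0.0, cusum + z - k)
        alerts[t] = cusum > h
    return alerts
```

In practice, `h` would be tuned on nonoutbreak data until the alert rate matches the target of one false alarm per year.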

### 3.2. Decision Scenario

We reproduce Figure 29.3 (as Figure 30.3) to show a graphical representation of the decision model. The base model consists of a decision node, act now, with two possible actions (yes, which corresponds to issuing an immediate boil-water advisory, and no, which corresponds to testing and waiting for the results), and chance nodes labelled Crypto outbreak. We represent the costs and benefits associated with each combination of action and chance event by cb1, cb2, cb3, and cb4. The probability p is the posterior probability that there is an ongoing Cryptosporidium outbreak.

We assume that the decision to act now or wait is made by the biosurveillance staff in response to an alert received from the detection system. After receiving the alert, the biosurveillance staff reviews the logs from the surveillance system for the previous five days and finds that there were no alerts over this period. The probability p in Figure 30.3 is, therefore, the posterior probability that there is an ongoing Cryptosporidium outbreak given that there was an alert today and no alerts in the previous five days.

### 3.3. Bayes' Theorem

Figure 30.3. Decision tree model.

We formalize the calculation by letting $C$ denote the event that there is an ongoing water-borne Cryptosporidium outbreak. We refer to the information that there was an alert today and no alerts over the past five days as the alert history and denote it by $A$. With this notation, the posterior probability that there is an ongoing water-borne Cryptosporidium outbreak given the alert history can be written as $P(C \mid A)$. We use Bayes' theorem to calculate this probability:

$$
P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A \mid C)\,P(C) + P(A \mid \bar{C})\,P(\bar{C})}
$$

In the equation above, $P(C)$ is the prior probability that there is an ongoing Cryptosporidium outbreak, and $P(\bar{C}) = 1 - P(C)$ is the prior probability that there is not a Cryptosporidium outbreak. The conditional probability $P(A \mid C)$ is the probability that the system generates the alert history given that an outbreak is ongoing. Similarly, $P(A \mid \bar{C})$ is the conditional probability of the alert history when there is not an outbreak of Cryptosporidium.

The calculation of $P(A \mid C)$ requires additional notation for characteristics of an outbreak and to account for time. We denote the outbreak size (defined as the proportion of the population that is affected by an outbreak) by $S$, the duration of an outbreak by $D$, and the number of days that an ongoing outbreak has been in progress by $Y$. We calculate $P(A \mid C)$ by finding the expectation (average value) of $P(A \mid S, D, Y, C)$ under the joint conditional distribution of $S$, $D$, and $Y$ given $C$: $P(A \mid C) = E_{S,D,Y \mid C}[P(A \mid S, D, Y, C)]$, where $P(A \mid S, D, Y, C)$ is the conditional probability that the system generates the alert history given that a Cryptosporidium outbreak of size $S$ and duration $D$ has been ongoing for $Y$ days.

We need two types of probabilities to compute the posterior probability of a Cryptosporidium outbreak. One type consists of prior probabilities. These include $P(C)$, which arises in the application of Bayes' theorem above, and the joint conditional distribution of $S$, $D$, and $Y$ given $C$, which is used to calculate $P(A \mid C)$. The other type consists of the conditional probabilities of the alert history: $P(A \mid S, D, Y, C)$ and $P(A \mid \bar{C})$.

### 3.4. Priors

The priors are subjective probabilities, elicited in the absence of the alert history. In a real analysis, we would elicit prior probabilities from domain experts and from literature review. In this analysis, we simply chose priors that appeared reasonable to us in order to illustrate the approach.

We need to specify P(C), as well as the joint conditional distribution of S, D, and Y given C. It is difficult to directly produce these priors owing to possible relationships among S, D, Y, and C. Instead, we specify several simpler priors that we then use to derive the required priors.

We suppose that the outbreak size $S$ has a uniform distribution between 0% and 50%, that the duration $D$ is uniformly distributed from 21 to 56 days, and that $S$ and $D$ are independent. We also suppose that $Y$ is uniformly distributed from one to $D$ days given $D$ and $C$. Finally, we suppose that there might be one outbreak every 30 years on average; that is, the probability that an outbreak starts on any given day is $1/(30 \times 365)$. We use these probability statements to derive the prior probability that there is an ongoing outbreak on any given day: $P(C) = 38.5/(30 \times 365) = 0.0035$, where 38.5 days is the mean outbreak duration. We also use these priors to find the joint conditional distribution of $S$, $D$, and $Y$ given $C$.
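The arithmetic behind the prior can be checked with a short Python sketch. It assumes that durations are whole numbers of days from 21 to 56 and that outbreaks do not overlap, so the chance that an outbreak is ongoing today is the daily start probability multiplied by the mean duration. The variable names are ours.

```python
# Assumption: D takes each whole-day value from 21 to 56 with equal probability.
durations = range(21, 57)
mean_duration = sum(durations) / len(durations)

# One outbreak every 30 years on average.
daily_start_prob = 1.0 / (30 * 365)

# An outbreak is ongoing today if one started within the last D days, so
# P(C) is approximately the daily start probability times the mean duration.
prior_outbreak = mean_duration * daily_start_prob

print(mean_duration, round(prior_outbreak, 4))
```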

### 3.5. Conditional Probabilities

We then need only the conditional probabilities $P(A \mid S, D, Y, C)$ and $P(A \mid \bar{C})$.

We begin with $P(A \mid \bar{C})$. This is the probability of observing the alert history in a period during which there is no Cryptosporidium outbreak. One approach to computing this probability is to formulate a model for the surveillance data under nonoutbreak conditions. By using the model and the detection algorithm, we can compute the probability either analytically or by simulation.

An alternative approach is to compute this probability empirically by obtaining a sample of surveillance data for a period during which there are no outbreaks and computing the fraction of windows (with length equal to the length of the alert history) within the time series with alert histories equal to $A$. Unless there are several years of surveillance data available for analysis, this approach may fail because it is possible that the alert history $A$ may not have arisen in the historical data, and therefore, we would calculate $P(A \mid \bar{C}) = 0$ and ultimately find $P(C \mid A) = 1$.
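The window-counting estimate can be sketched in Python as follows. This is a minimal illustration under our own naming: `alerts` is a hypothetical sequence of daily alert flags from a nonoutbreak period, and `history` is the alert pattern of interest (here five quiet days followed by an alert).

```python
def empirical_history_prob(alerts, history):
    """Fraction of sliding windows in `alerts` that match `history` exactly."""
    alerts, history = list(alerts), list(history)
    n_windows = len(alerts) - len(history) + 1
    if n_windows <= 0:
        raise ValueError("alert series is shorter than the alert history")
    matches = sum(
        alerts[i:i + len(history)] == history for i in range(n_windows)
    )
    return matches / n_windows


# Alert history A: no alerts on the previous five days, an alert today.
history_a = [False] * 5 + [True]

# A short hypothetical record with a single alert on day 10.
record = [False] * 10 + [True] + [False] * 9
print(empirical_history_prob(record, history_a))

# The failure mode noted in the text: if A never occurs, the estimate is zero.
print(empirical_history_prob([False] * 20, history_a))
```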

A third approach, and the one that we use in this analysis, is to supplement a sample of nonoutbreak surveillance data with a theoretical assumption about the surveillance data and detection algorithm. Specifically, we assume that false alarms occur approximately independently of each other. In this case, we compute the probability as follows: $P(A \mid \bar{C}) = (1 - P(\text{False Alarm}))^{A^-} \times P(\text{False Alarm})^{A^+}$, where $A^-$ is the number of days in the alert history without an alert and $A^+$ is the number of days with an alert. Under the independence assumption, we only need to know the false-alarm rate of the detection system to compute $P(A \mid \bar{C})$ for these scenarios. We compute the false-alarm rate of the system empirically by running the detection algorithm on our sample of nonoutbreak surveillance data and calculating the fraction of days with alerts.
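Under the independence assumption, the calculation is a one-liner, sketched here in Python. The function names are ours. The rate of 0.0027 per day is used only for illustration: it corresponds roughly to the one-false-alarm-per-year target of Section 3.1, and the empirically estimated rate is not reported in this section.

```python
def false_alarm_rate(alerts):
    """Fraction of days with an alert in a nonoutbreak sample."""
    alerts = list(alerts)
    return sum(bool(a) for a in alerts) / len(alerts)


def history_prob_no_outbreak(rate, quiet_days, alert_days):
    """P(A | no outbreak), assuming false alarms are independent across days."""
    return (1.0 - rate) ** quiet_days * rate ** alert_days


# Alert history A: five days without an alert, then one day with an alert.
# The rate 0.0027 is an illustrative assumption (about one alarm per year).
p_history = history_prob_no_outbreak(0.0027, quiet_days=5, alert_days=1)
print(round(p_history, 6))
```

Because the false-alarm rate is small, the five quiet days change the result very little; the probability is dominated by the single alert day.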

We now turn to $P(A \mid S, D, Y, C)$, the conditional probability of the alert history given an outbreak. Ideally, we would use real outbreaks of Cryptosporidium that have occurred in the region under study to compute this probability. However, the rarity of real outbreaks makes this strategy generally infeasible. The alternative approaches require either simulation or modeling of the surveillance data that would arise during an outbreak.

The HiFIDE methodology (Wallstrom et al., 2004) uses surveillance data collected during a real outbreak to construct a Poisson generalized linear model (McCullagh and Nelder, 1983) for the effect of the outbreak on surveillance data. Under our model, the average effect is a smooth function of time that we estimate by using the EM algorithm (Dempster et al., 1977) in conjunction with Bayesian adaptive regression splines (Dimatteo et al., 2001).

We use the HiFIDE approach to model the effect of a Cryptosporidium outbreak on sales of diarrhea remedies using published sales data during the North Battleford Cryptosporidium outbreak (Stirling et al., 2001a,b). We then simulate the effect of an outbreak from this model and inject that effect into OTC data for Chicago. HiFIDE accounts for the population difference between Chicago and North Battleford as well as differences in OTC data quality owing to retailer market share (Wallstrom et al., 2004). This process enables us to parameterize the known outbreak effect on OTC sales in North Battleford in a way that we can vary the size ($S$) and duration ($D$) of the outbreak in Chicago. We use this approach to construct multiple outbreak data sets. Each data set contains the simulated daily sales of diarrhea remedies during a Cryptosporidium outbreak of size $S$ and duration $D$. For each outbreak data set and each day within the outbreak, the detection system determines whether or not to send an alert. We store these results and use them to compute $P(A \mid S, D, Y, C)$.

We use Monte Carlo integration (Gilks et al., 1996; Robert and Casella, 2004) and the equation $P(A \mid C) = E_{S,D,Y \mid C}[P(A \mid S, D, Y, C)]$ to compute the probability $P(A \mid C)$ from the values of $P(A \mid S, D, Y, C)$. Specifically, we sample 10,000 values of $S$, $D$, and $Y$ from their joint conditional distribution given $C$. For each set of values $(s, d, y)$, we compute $P(A \mid s, d, y, C)$. The probability $P(A \mid C)$ is then computed by averaging these values of $P(A \mid s, d, y, C)$.
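The averaging step can be sketched in Python with NumPy. Several pieces here are assumptions rather than details given in the text. Durations are treated as whole days from 21 to 56. The conditional distribution of $D$ given $C$ is taken to be length-biased (weights proportional to $d$), which is one reading of how the joint conditional distribution follows from the priors when outbreaks start at a constant daily rate. Finally, `toy_cond_prob` is a purely hypothetical stand-in for the stored values of $P(A \mid s, d, y, C)$ obtained from the simulated outbreak data sets.

```python
import numpy as np


def sample_outbreak_given_c(n, rng):
    """Draw (S, D, Y) given that an outbreak is ongoing (assumptions in lead-in)."""
    durations = np.arange(21, 57)
    # Assumed length bias: longer outbreaks are more likely to be ongoing today.
    weights = durations / durations.sum()
    d = rng.choice(durations, size=n, p=weights)
    s = rng.uniform(0.0, 0.5, size=n)   # outbreak size, independent of D
    y = rng.integers(1, d + 1)          # day of outbreak, uniform on 1..D
    return s, d, y


def monte_carlo_p_alert_given_c(cond_prob, n=10_000, seed=0):
    """Estimate P(A | C) by averaging cond_prob(s, d, y) over sampled outbreaks."""
    rng = np.random.default_rng(seed)
    s, d, y = sample_outbreak_given_c(n, rng)
    return float(np.mean(cond_prob(s, d, y)))


def toy_cond_prob(s, d, y):
    """HYPOTHETICAL placeholder for P(A | s, d, y, C); not from the source."""
    return np.clip(0.2 * s * (y / d), 0.0, 1.0)


estimate = monte_carlo_p_alert_given_c(toy_cond_prob)
print(estimate)
```

In a real analysis, `cond_prob` would look up the stored detection results for the simulated outbreak that matches each sampled $(s, d, y)$.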

### 3.6. Posterior Probability

We used the above Bayesian wrapper method to compute p, the posterior probability that there is an ongoing Cryptosporidium outbreak given that there was an alert today and no alerts in the previous five days. We found p = 0.0410. Therefore, after observing the spike in sales of diarrhea remedies, we conclude that there is approximately a one out of 25 chance that there is a Cryptosporidium outbreak.
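One way to read this result is through the odds form of Bayes' theorem, in which the posterior odds equal the prior odds multiplied by the likelihood ratio $P(A \mid C)/P(A \mid \bar{C})$. The Python sketch below backs out the likelihood ratio implied by the two rounded figures reported in this section, $P(C) = 0.0035$ and $p = 0.0410$. The result is therefore approximate, and the helper name `odds` is ours.

```python
def odds(probability):
    """Convert a probability to odds in favor."""
    return probability / (1.0 - probability)


prior = 0.0035      # P(C), from Section 3.4
posterior = 0.0410  # p = P(C | A), reported above

# Posterior odds = prior odds x likelihood ratio, so the ratio is their quotient.
likelihood_ratio = odds(posterior) / odds(prior)
one_in = 1.0 / posterior

print(round(likelihood_ratio, 2), round(one_in, 1))
```

On these figures, the alert history raises the odds of an outbreak roughly twelvefold, yet the posterior stays small because the prior is so low.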