## Determining Whether Bayesian Algorithms Are Well Calibrated

For algorithms that compute posterior probabilities (of a case, an outbreak, or an outbreak characteristic) from surveillance data, it is important to know whether those probabilities are accurate; that is, whether in the long run they reflect the actual frequency of cases or outbreaks as determined by a gold standard. We refer to an algorithm that satisfies this requirement as well calibrated. For example, if an evaluator runs a case detection algorithm on a sample of 100 patients, of whom 10 have the disease in question and 90 do not, the posterior probabilities that a well-calibrated case detection algorithm assigns to the 100 patients would sum to approximately 10.
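The 100-patient example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `expected_vs_observed` and the posterior values (0.82 and 0.02) are made up for the example, and the labels stand in for a gold-standard determination.

```python
import math

def expected_vs_observed(posteriors, labels):
    """Return (sum of posterior probabilities, number of gold-standard positives)."""
    if len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must have the same length")
    expected = math.fsum(posteriors)            # expected number of cases
    observed = sum(1 for y in labels if y)      # cases per the gold standard
    return expected, observed

# Hypothetical sample: 10 patients with the disease, 90 without.
labels = [1] * 10 + [0] * 90
# Made-up posteriors: 0.82 for each diseased patient, 0.02 for each of the others.
posteriors = [0.82] * 10 + [0.02] * 90

expected, observed = expected_vs_observed(posteriors, labels)
print(f"sum of posteriors = {expected:.2f}, gold-standard cases = {observed}")
```

With these illustrative values the posteriors sum to about 10 (10 × 0.82 + 90 × 0.02), matching the 10 gold-standard cases.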

We note that an algorithm can trivially meet this requirement by simply outputting the prior probability of an outbreak for each day; such an evaluation is therefore a necessary but not sufficient test of the algorithm. As illustrated in Figure 13.13 (Chapter 13), the RODS system uses the case detection algorithm SyCO2 to compute a time series of daily counts of "respiratory syndrome" by summing the posterior probabilities of "respiratory syndrome" that SyCO2 produces for all patients visiting an emergency room on each day. If SyCO2 is well calibrated, the daily sums should equal the actual number of patients with respiratory illness each day, which could be verified by examining patient charts or by some other gold-standard ascertainment method.
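The daily-sum check, and the way a prior-only output can pass an aggregate test, can be illustrated with a small sketch. Everything here is hypothetical: the records, the day names, and the helper `daily_totals` are invented for illustration and are not SyCO2 output or part of RODS.

```python
import math
from collections import defaultdict

# Hypothetical records: (day, posterior of "respiratory syndrome", chart-review label).
RECORDS = [
    ("day 1", 0.9, 1), ("day 1", 0.8, 1), ("day 1", 0.2, 0), ("day 1", 0.1, 0),
    ("day 2", 0.1, 0), ("day 2", 0.1, 0), ("day 2", 0.3, 0), ("day 2", 0.5, 1),
]

def daily_totals(records):
    """Return {day: (sum of posteriors, gold-standard case count)}."""
    posteriors = defaultdict(list)
    cases = defaultdict(int)
    for day, posterior, label in records:
        posteriors[day].append(posterior)
        cases[day] += label
    return {day: (math.fsum(ps), cases[day]) for day, ps in posteriors.items()}

# Made-up informative posteriors: each daily sum matches that day's case count.
informative = daily_totals(RECORDS)

# Prior-only baseline: every visit receives the overall prevalence (3 cases / 8 visits).
prior = sum(label for _, _, label in RECORDS) / len(RECORDS)
prior_only = daily_totals([(day, prior, label) for day, _, label in RECORDS])

for day in informative:
    print(day, "informative:", informative[day], "prior-only:", prior_only[day])
```

In this toy data the prior-only output sums to 3 over both days, the same as the total number of cases, yet it assigns 1.5 expected cases to each day while the chart-review counts are 2 and 1. Comparing sums day by day is therefore a more demanding check than comparing a single grand total.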

Calibration matters because these algorithms may be used within a decision-analytic framework to compute the expected utility of a decision (Part V), a computation that requires accurate posterior probabilities.
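To see why miscalibration matters for decision making, consider a minimal sketch of an expected-utility comparison. The two actions ("investigate" and "wait") and all utility values are hypothetical numbers chosen only to make the arithmetic visible; they are not taken from Part V.

```python
def expected_utility(p, utility_if_outbreak, utility_if_none):
    """Expected utility of an action when the outbreak probability is p."""
    return p * utility_if_outbreak + (1.0 - p) * utility_if_none

# Hypothetical utilities: (utility if outbreak, utility if no outbreak).
UTILITIES = {
    "investigate": (100.0, -10.0),
    "wait": (-200.0, 0.0),
}

def best_action(p):
    """Return the action with the highest expected utility at probability p."""
    return max(UTILITIES, key=lambda a: expected_utility(p, *UTILITIES[a]))

# Suppose the true probability is 0.10 but a miscalibrated algorithm reports 0.02.
print(best_action(0.10))  # investigate
print(best_action(0.02))  # wait
```

Under these assumed utilities the two actions break even near p = 0.032, so an algorithm that reports 0.02 when the true probability is 0.10 leads to the opposite decision.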

As with other evaluations of case and outbreak detection algorithms, the case data required to measure the calibration of a case detection algorithm are far more readily available than the outbreak data required to evaluate an outbreak detection algorithm. To evaluate the calibration of an outbreak detection algorithm, an evaluator would assemble a library of time series consisting of outbreak (simulated or real) data and non-outbreak data. The evaluator would then compute the posterior probability of an ongoing outbreak for each day in each of the time series. The algorithm is well calibrated if the sum of the posterior probabilities across all time series equals the total number of outbreak days in the time series. Again, the evaluator would have to ensure that the outbreak detection algorithm is not "gaming the system" by always outputting a posterior probability that is equal to the prior probability.
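The procedure for outbreak detection algorithms might be sketched as follows. The library below is a tiny made-up example with one outbreak series and one non-outbreak series, and the constant-output check in `is_constant_output` is only one simple way, assumed here for illustration, to flag an algorithm that always returns the same value.

```python
import math

def outbreak_calibration(library):
    """library: list of time series; each series is a list of
    (posterior of ongoing outbreak, is_outbreak_day) pairs, one per day.
    Returns (sum of posteriors, total number of outbreak days)."""
    expected = math.fsum(p for series in library for p, _ in series)
    observed = sum(1 for series in library for _, y in series if y)
    return expected, observed

def is_constant_output(library, tol=1e-12):
    """Flag an algorithm whose posterior never moves from a single value."""
    values = [p for series in library for p, _ in series]
    return max(values) - min(values) <= tol

# Hypothetical library: the first series contains a 3-day outbreak, the second none.
library = [
    [(0.05, 0), (0.10, 0), (0.70, 1), (0.90, 1), (0.75, 1)],
    [(0.05, 0), (0.10, 0), (0.05, 0), (0.20, 0), (0.10, 0)],
]
expected, observed = outbreak_calibration(library)
print(f"sum of posteriors = {expected:.2f}, outbreak days = {observed}")

# An algorithm that always outputs the prior (3 outbreak days / 10 days = 0.3).
gamed = [[(0.3, y) for _, y in series] for series in library]
print(outbreak_calibration(gamed), is_constant_output(gamed))
```

Both the illustrative posteriors and the constant 0.3 output sum to about 3, the number of outbreak days, but only the constant output is flagged by `is_constant_output`. This is why the sum test alone cannot distinguish an informative algorithm from one that merely reports the prior.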