Semisynthetic Test Data

Evaluators can generate semisynthetic data by injecting geometrically shaped spikes into real surveillance data collected during non-outbreak periods (Goldenberg et al., 2002; Zhang et al., 2003; Reis et al., 2003; Reis and Mandl, 2003). This technique was used for illustration in Chapter 14. The advantage of the semisynthetic approach is that it allows the evaluator to vary the spike size to find the smallest spike that the algorithm can detect above the background noise in real surveillance data. The evaluator can also vary the shape of the spike and its time course, and can inject the spike on each day of the year in turn to explore how seasonal and day-of-week variations in the surveillance data affect detectability.
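The injection procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method of any of the cited studies: the baseline is simulated only so that the example runs (in practice it would be real non-outbreak counts), and the triangular spike shape, the 28-day z-score detector, and all function names are assumptions introduced here.

```python
import random
import statistics


def make_baseline(n_days=364, seed=0):
    """Placeholder for real non-outbreak counts (simulated only so the sketch runs)."""
    rng = random.Random(seed)
    series = []
    for day in range(n_days):
        weekend_bump = 10 if day % 7 in (5, 6) else 0  # crude day-of-week effect
        series.append(max(0, round(50 + weekend_bump + rng.gauss(0, 5))))
    return series


def triangular_spike(peak, duration):
    """Linear ramp that reaches `peak` extra units on the last day."""
    return [peak * (i + 1) / duration for i in range(duration)]


def inject(baseline, spike, start):
    """Return a copy of `baseline` with `spike` added from day `start` onward."""
    series = list(baseline)
    for offset, extra in enumerate(spike):
        if start + offset < len(series):
            series[start + offset] += extra
    return series


def alarm(series, day, window=28, z=3.0):
    """Toy detector (assumed): alarm when the count exceeds mean + z * sd of the prior window."""
    history = series[day - window:day]
    mean = statistics.mean(history)
    sd = statistics.pstdev(history) or 1.0  # guard against a flat window
    return series[day] > mean + z * sd


def sensitivity(baseline, peak, duration=7, window=28):
    """Fraction of injection start days on which the spike triggers at least one alarm."""
    starts = range(window, len(baseline) - duration)
    spike = triangular_spike(peak, duration)
    hits = 0
    for start in starts:
        series = inject(baseline, spike, start)
        if any(alarm(series, d, window) for d in range(start, start + duration)):
            hits += 1
    return hits / len(starts)


baseline = make_baseline()
# peak=0 injects nothing, so that row reflects background alarms only.
for peak in (0, 10, 20, 40, 80):
    print(f"peak={peak:3d}  fraction of injections flagged={sensitivity(baseline, peak):.2f}")
```

Sweeping `peak` upward until the flagged fraction reaches a chosen target gives an estimate of the smallest detectable spike. Grouping the results by start day would show how detectability changes with the day of the week and the season.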

The key problem with using semisynthetic data is that the resulting measures of sensitivity, false alarm rate, and timeliness of detection are for spike detection, not outbreak detection. For example, a semisynthetic analysis may determine that an algorithm can detect an outbreak that causes a 10-unit increase in sales of cough products. However, unless the evaluator already knows how often sick individuals purchase cough products, the conclusions that can be drawn about the detectability properties of an algorithm (defined above as the relationship between outbreak size and sensitivity, specificity, and timeliness) are limited and rest on assumptions about the purchasing behaviors of sick individuals. The technique is, nevertheless, useful for early studies of algorithms and has greater validity than fully synthetic data because it employs real baseline data.
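The dependence on purchasing behavior can be made concrete with a small calculation. The sketch below assumes the simplest possible model, in which each sick individual independently buys one unit with a fixed probability. Both the model and the probabilities used are hypothetical values chosen for illustration, not estimates from the literature.

```python
def implied_outbreak_size(spike_units, purchase_prob, units_per_purchase=1.0):
    """Number of sick individuals needed, on average, to produce `spike_units` extra sales.

    Assumes (hypothetically) that each sick individual buys `units_per_purchase`
    units with probability `purchase_prob`, independently of everyone else.
    """
    if not 0 < purchase_prob <= 1:
        raise ValueError("purchase_prob must be in (0, 1]")
    return spike_units / (purchase_prob * units_per_purchase)


# The same 10-unit spike under three assumed purchase probabilities.
for p in (0.05, 0.25, 0.5):
    print(f"purchase_prob={p:.2f}  implied sick individuals ~ {implied_outbreak_size(10, p):.0f}")
```

Under these assumptions, the same 10-unit spike corresponds to roughly 200 sick individuals when 5% of them buy a product, but only about 20 when half of them do. The smallest detectable outbreak therefore cannot be stated without knowing the purchase rate.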

There is one other aspect of the semisynthetic approach that affects its validity. Geometric shapes are noise-free and, thus, underestimate the noise one expects to find in real surveillance data, a bias that may result in the overestimation of an algorithm's sensitivity and timeliness.
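One way an evaluator might probe this bias is to compare a noise-free geometric spike with a version whose daily values are random draws around the same shape. The source does not prescribe this. The sketch below uses Poisson noise, which is an assumption made here; the sampler is Knuth's simple method, and the function names are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small means)."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def noisy_spike(shape, rng):
    """Replace each noise-free daily value with a Poisson draw having that mean."""
    return [poisson(value, rng) for value in shape]


rng = random.Random(1)
clean = [2, 4, 6, 8, 10]  # noise-free ramp, extra units per day
print("clean:", clean)
print("noisy:", noisy_spike(clean, rng))
```

Injecting the noisy version instead of the clean one, and comparing the resulting sensitivity and timeliness, would give a rough sense of how much the noise-free shape flatters the algorithm.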
