## Modeling

Our methodology uses Bayesian networks to explicitly model an entire population of individuals. Because this chapter focuses on detecting disease outbreaks from syndromic information, we refer to models of these individuals as person models, although the same ideas could be applied to other entities that might provide information about disease outbreaks, such as biosensors and livestock.

We explicitly model each person in the population, and thus in our BN there will exist (at least conceptually) a subnetwork (or object) Pi for each person. Each such subnetwork is in essence a diagnostic expert system applied to a given person. By connecting those subnetworks appropriately, we obtain a population diagnostic model that is built from person-specific diagnostic components. An example of a complete model for four people is shown in Figure 18.2, where each person in the population is represented with a simple six-node subnetwork structure. In this particular example, there is only one person model (class), but the methodology can allow for more.

In this chapter, we restrict the methodology to modeling noncontagious diseases. We partition all the nodes X in the network into three parts:

1. A set of global nodes G,

2. A set of interface nodes, I, and

3. A set of person subnetworks, P = {P1, P2, ..., Pn}.

The set G, defined as G = X \ (I ∪ P), contains nodes that represent global features common to all people. For the example in Figure 18.2, G consists of two nodes: Terror Alert Level (having states Green, White, Yellow, Orange, and Red), and Anthrax Release (having states Yes and No). Set I contains factors that directly influence the status of the outbreak of disease in people in the population. Each Pi subnetwork (object) represents a person in the population.

Structurally, we make the following two assumptions. Both assumptions use the notion of d-separation, which is described fully in (Neapolitan, 2004). In brief, d-separation is a graphical condition on Bayesian network structures that implies conditional independence. Thus, if for some BN the nodes in set A d-separate each node in set B from each node in set C, then it follows from the BN Markov condition that each node in B is independent of each node in C, given the states of the nodes in A.
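To make the d-separation condition concrete, the following is a minimal sketch of a d-separation test in plain Python. It uses the standard moralized-ancestral-graph criterion rather than any procedure from this chapter, and it assumes the three node sets are disjoint. The function names and the toy graph (one global node, one interface node, and one node from each of two person subnetworks) are illustrative assumptions.

```python
from collections import deque


def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separated(parents, xs, ys, zs):
    """True if the set zs d-separates every node in xs from every node in ys.

    `parents` maps each node to the set of its parents.
    Assumes xs, ys, and zs are pairwise disjoint.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors(parents, xs | ys | zs)

    # Build the moral graph of the ancestral subgraph: link each child to
    # its parents, and link every pair of parents that share a child.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for p in pars:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for a in pars:
            for b in pars:
                if a != b:
                    neighbors[a].add(b)

    # Search outward from xs without passing through any node in zs.
    frontier = deque(xs)
    seen = set(xs)
    while frontier:
        node = frontier.popleft()
        if node in ys:
            return False
        for nxt in neighbors[node]:
            if nxt not in zs and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True


# Hypothetical toy structure: global node -> interface node -> two persons.
toy_parents = {
    "Anthrax Release": set(),
    "Location of Release": {"Anthrax Release"},
    "Disease Status 1": {"Location of Release"},
    "Disease Status 2": {"Location of Release"},
}

persons_separated_given_interface = d_separated(
    toy_parents, {"Disease Status 1"}, {"Disease Status 2"}, {"Location of Release"}
)
persons_separated_given_nothing = d_separated(
    toy_parents, {"Disease Status 1"}, {"Disease Status 2"}, set()
)
```

In this toy graph the two person nodes are d-separated once the interface node is given, but not when nothing is given, because they share the interface node as a common cause.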

Assumption 1: The interface nodes, I, d-separate the person subnetworks from each other, and any arc between an interface node and a node X in some person subnetwork Pi is oriented from the interface node to X.

Thus, we do not allow arcs between the person models.

Figure 18.2. A simplified four-person model for detecting an outbreak of anthrax. Each person P in the population is represented explicitly by a six-node subnetwork.

Assumption 2: The interface nodes, I, d-separate the nodes in G from the nodes in P, and any arc between a global node in G and an interface node in I is oriented from the global node to the interface node.

Figure 18.3 presents the above two assumptions in diagrammatic form.
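The arc constraints implied by the two assumptions can be checked mechanically. The sketch below is an illustration, not part of the described system: it labels each node with its part ("G", "I", or a person label such as "P1") and reports any arc that is neither internal to one part, nor oriented from G to I, nor oriented from I into a person subnetwork. The function name `check_arcs` and the specific arcs in the example are assumptions; only the node names are taken from the text.

```python
def check_arcs(part_of, arcs):
    """Return the arcs that violate the structural constraints.

    part_of: maps node name -> "G", "I", or a person label such as "P1".
    arcs:    iterable of (parent, child) pairs.
    Allowed arcs: within one part, from G to I, or from I into a person part.
    """
    violations = []
    for parent, child in arcs:
        src, dst = part_of[parent], part_of[child]
        is_person = dst not in ("G", "I")
        allowed = (
            src == dst
            or (src == "G" and dst == "I")
            or (src == "I" and is_person)
        )
        if not allowed:
            violations.append((parent, child))
    return violations


# Hypothetical two-person fragment; the arcs are assumed for illustration.
part_of = {
    "Terror Alert Level": "G",
    "Anthrax Release": "G",
    "Location of Release": "I",
    "Time of Release": "I",
    "Home Zip 1": "P1",
    "Disease Status 1": "P1",
    "Home Zip 2": "P2",
    "Disease Status 2": "P2",
}
arcs = [
    ("Terror Alert Level", "Anthrax Release"),
    ("Anthrax Release", "Location of Release"),
    ("Anthrax Release", "Time of Release"),
    ("Location of Release", "Disease Status 1"),
    ("Time of Release", "Disease Status 1"),
    ("Home Zip 1", "Disease Status 1"),
    ("Location of Release", "Disease Status 2"),
    ("Time of Release", "Disease Status 2"),
    ("Home Zip 2", "Disease Status 2"),
]

ok_violations = check_arcs(part_of, arcs)

# An arc between two person subnetworks is not allowed.
bad_arc = ("Disease Status 1", "Disease Status 2")
bad_violations = check_arcs(part_of, arcs + [bad_arc])
```

When every arc passes this check, any path leaving a person subnetwork must pass through an interface node along an arc that points away from that node, so the path is blocked once the interface nodes are given.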

For noncontagious diseases that may cause outbreaks, Assumptions 1 and 2 are reasonable when I contains all of the factors that significantly influence the status of an outbreak disease in individuals in the population. In the case of bioterrorist-released bioagents, for example, such information includes the time and place of release of the agent. Key characteristics of nodes in I are that they have arcs to the nodes in one or more person models, and they induce the conditional independence relationships described in Assumptions 1 and 2. Often the variables in I will be unmeasured. It is legitimate, however, to have measured variables in I. For example, the regional smog level (not shown in Figure 18.2) might be a measured variable that influences the disease status of people in the population, and thus it would be located in I.

Let T be a variable in G that represents a disease outbreak. In Figure 18.2, T is the node Anthrax Release. The goal of our biosurveillance method is to continually derive an updated posterior probability over the states of T as data about the population streams in.
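As a rough illustration of how such a posterior can be computed, the sketch below enumerates the joint states of the interface nodes and exploits the fact that, given those states, the evidence from different people is conditionally independent, so the per-person likelihoods multiply. It is a simplified sketch under several assumptions: T is the only global node, I is collapsed into a single variable, and all names and probabilities are hypothetical numbers chosen for the example, not values from the chapter.

```python
from math import prod


def posterior_over_T(prior_T, p_I_given_T, person_likelihoods):
    """Posterior P(T | evidence) by enumeration over interface states.

    prior_T:            dict t -> P(T = t)
    p_I_given_T:        dict t -> dict i -> P(I = i | T = t)
    person_likelihoods: list with one dict per person, i -> P(e_k | I = i)
    """
    joint = {}
    for t, p_t in prior_T.items():
        total = 0.0
        for i, p_i in p_I_given_T[t].items():
            # Person evidence is independent given I, so likelihoods multiply.
            total += p_i * prod(lik[i] for lik in person_likelihoods)
        joint[t] = p_t * total
    norm = sum(joint.values())
    return {t: value / norm for t, value in joint.items()}


# Hypothetical numbers for illustration only.
prior_T = {"yes": 0.01, "no": 0.99}
p_I_given_T = {
    "no": {"none": 1.0, "zone_A": 0.0, "zone_B": 0.0},
    "yes": {"none": 0.0, "zone_A": 0.5, "zone_B": 0.5},
}
person_likelihoods = [
    # Person 1: lives in zone A, seen at the ED with a respiratory complaint.
    {"none": 0.05, "zone_A": 0.30, "zone_B": 0.05},
    # Person 2: same situation as person 1.
    {"none": 0.05, "zone_A": 0.30, "zone_B": 0.05},
    # Person 3: lives in zone B, no ED visit.
    {"none": 0.95, "zone_A": 0.95, "zone_B": 0.70},
]

posterior = posterior_over_T(prior_T, p_I_given_T, person_likelihoods)
```

Calling the function again each time a new person's evidence arrives gives a simple, if inefficient, way to keep the posterior current. Enumeration becomes impractical as the number of interface states grows.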

Figure 18.3. The closed regions represent Bayesian subnetworks. The circles on the edges of the subnetworks denote nodes that are connected by arcs that bridge the subnetworks. Only two such "I/O" nodes are shown per subnetwork, but in general there could be any number. The arrows between subnetworks show the direction in which the Bayesian-network arcs are oriented between the subnetworks. The braces show which nodes can (possibly) be connected by arcs. In subnetwork I, the I/O nodes on the left and those on the right are not necessarily distinct.

We consider spatio-temporal data in deriving the posterior probability of T. For example, we consider information about when patient cases appear at the ED, as well as the home location (at the level of zip codes) of those patients. In our current implementation, spatio-temporal information is explicitly represented by nodes in the network, such as Location of Release, Time of Release, and patient Home Zip. We note that in Figure 18.2, the Disease Status nodes contain values that indicate when the disease started (if ever) and ended. This temporal representation has the advantage (over, for example, dynamic Bayesian networks (DBNs) [Neapolitan, 2004]) of being relatively compact. It allows us to create a network with fewer parameters than the corresponding DBN, and it simplifies our method for performing real-time inference.
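One way to picture a single node whose values carry onset and end times is sketched below. The chapter does not spell out the state space, so this encoding is purely an illustrative assumption: each state is either "no disease" or an (onset day, end day) pair within a fixed window of days. The function name and the day-level granularity are hypothetical.

```python
def disease_status_states(num_days):
    """Enumerate hypothetical states for one Disease Status node.

    A state is either the string "no disease" or a pair (onset_day, end_day)
    with 1 <= onset_day <= end_day <= num_days.
    """
    states = ["no disease"]
    for onset in range(1, num_days + 1):
        for end in range(onset, num_days + 1):
            states.append((onset, end))
    return states


states_3_days = disease_status_states(3)
```

Under this encoding each person has one Disease Status variable regardless of the window length, whereas a DBN would carry a separate copy of the variable in every time slice.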