Inference

When performing inference for biosurveillance, our goal is to continuously monitor target variable T by deriving its updated posterior probability as new data arrives. At any given time, there are two general sources of evidence we consider:

1. General evidence at the global level: $\mathbf{g} = \{G = g : G \in \mathbf{G}\}$

2. The collective set of evidence $\mathbf{e}$ that we observe from the population of people: $\mathbf{e} = \{X = e : X \in \mathbf{P}_i,\ \mathbf{P}_i \in \mathbf{P}\}$

In our application, g might consist of the observation that the Terror Alert Level = Orange, and e might include information about the patients who have visited EDs in the region in recent days, as well as demographic information (e.g., age, gender, and home zip code) for the people in the region who have not recently visited the ED.

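Given e and g, our goal is to calculate the following:

$$P(T \mid \mathbf{e}, \mathbf{g}) \propto P(\mathbf{e} \mid T, \mathbf{g}) \cdot P(T \mid \mathbf{g}) \qquad (2)$$

where the proportionality constant is $k = 1 / \sum_{T} P(\mathbf{e} \mid T, \mathbf{g}) \cdot P(T \mid \mathbf{g})$.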
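The normalization described above can be sketched in a few lines of Python. This is only an illustration: the function name `posterior`, the two states of T, and all probability values are hypothetical and are not taken from the model described here.

```python
def posterior(prior, likelihood):
    """Return P(T | e, g) given P(T | g) and P(e | T, g).

    Both arguments are dicts keyed by the states of T.
    """
    # Unnormalized posterior: P(e | T, g) * P(T | g) for each state of T.
    unnormalized = {t: likelihood[t] * prior[t] for t in prior}
    # Proportionality constant k = 1 / sum_T P(e | T, g) * P(T | g).
    k = 1.0 / sum(unnormalized.values())
    return {t: k * value for t, value in unnormalized.items()}


# Hypothetical numbers, for illustration only.
prior = {"outbreak": 0.01, "no_outbreak": 0.99}       # P(T | g)
likelihood = {"outbreak": 0.2, "no_outbreak": 0.05}   # P(e | T, g)
post = posterior(prior, likelihood)
```

Because k depends only on the sum over the states of T, the two terms on the right-hand side can be computed separately and combined at the end.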

Since T and g are in G, it follows from Assumptions 1 and 2 that the term $P(T \mid \mathbf{g})$ in Eq. 2 can be calculated using Bayesian network inference on just the portion of the model that includes G. Performing BN inference over just the nodes in G is much preferable to inference over all the nodes in X, because in the model we evaluated the number of nodes in X is approximately $10^7$.

The term $P(\mathbf{e} \mid T, \mathbf{g})$ in Eq. 2 can be derived as follows:

$$P(\mathbf{e} \mid T, \mathbf{g}) = \sum_{\mathbf{i}} P(\mathbf{e} \mid \mathbf{I} = \mathbf{i}) \cdot P(\mathbf{I} = \mathbf{i} \mid T, \mathbf{g})$$

because, by Assumption 2, the set I renders the nodes in P (including e) independent of the nodes in G (including T and g). The above summation can be very demanding computationally, because e usually contains many nodes; therefore, we next discuss its computation in greater detail.
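As a minimal sketch of the summation over interface configurations, the following Python snippet assumes a single binary interface node. The function name `evidence_likelihood`, the configuration labels `i0` and `i1`, and the probability values are all hypothetical.

```python
def evidence_likelihood(p_e_given_i, p_i_given_t_g):
    """Compute P(e | T, g) = sum_i P(e | I = i) * P(I = i | T, g).

    Both arguments are dicts keyed by interface configuration i;
    p_i_given_t_g is for one fixed state of T and one fixed g.
    """
    return sum(p_e_given_i[i] * p_i_given_t_g[i] for i in p_i_given_t_g)


# Hypothetical values for a single binary interface node.
p_e_given_i = {"i0": 1e-4, "i1": 4e-4}      # P(e | I = i)
p_i_given_t_g = {"i0": 0.25, "i1": 0.75}    # P(I = i | T, g)
p_e = evidence_likelihood(p_e_given_i, p_i_given_t_g)
```

The number of terms in the sum grows with the number of joint configurations of the interface nodes, and each term requires the probability of all the evidence e, which is where the cost arises.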

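We first show an example from Figure 18.2. Here we are modeling exactly four people in the population. The two on the left have identical attributes, as do the two on the right. We want to calculate the probability of this configuration of evidence, given the interface nodes. For this example, we have two distinct sets of evidence, $\mathbf{e}_1$ = {Home Zip=15213, Age=20-30, Gender=M, Date Admitted=never, Respiratory symptoms=unknown} and $\mathbf{e}_2$ = {Home Zip=15260, Age=20-30, Gender=F, Date Admitted=today, Respiratory symptoms=yes}. We need to calculate:

$$P(\mathbf{e} \mid \mathbf{I}) = P(\mathbf{P}_1 = \mathbf{e}_1, \mathbf{P}_2 = \mathbf{e}_1, \mathbf{P}_3 = \mathbf{e}_2, \mathbf{P}_4 = \mathbf{e}_2 \mid \mathbf{I})$$

By Assumption 1, I d-separates each person model from every other, so this equation can be factored as follows:

$$P(\mathbf{e} \mid \mathbf{I}) = P(\mathbf{P}_1 = \mathbf{e}_1 \mid \mathbf{I}) \cdot P(\mathbf{P}_2 = \mathbf{e}_1 \mid \mathbf{I}) \cdot P(\mathbf{P}_3 = \mathbf{e}_2 \mid \mathbf{I}) \cdot P(\mathbf{P}_4 = \mathbf{e}_2 \mid \mathbf{I})$$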
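The factorization for the four-person example can be illustrated with a short Python sketch. The function name `joint_evidence_prob` and the per-person probabilities are hypothetical; in the real model each per-person term would come from BN inference on a person model, not from a hand-written table.

```python
import math


def joint_evidence_prob(person_evidence, p_evidence_given_i):
    """P(e | I = i) as a product of per-person terms P(P_k = e_j | I = i).

    person_evidence lists the evidence label observed for each person;
    p_evidence_given_i maps each label to its probability for one fixed
    interface configuration i.
    """
    return math.prod(p_evidence_given_i[e] for e in person_evidence)


# Four people as in the example: two with evidence e1, two with evidence e2.
person_evidence = ["e1", "e1", "e2", "e2"]
# Hypothetical per-person probabilities for one fixed configuration i.
p_evidence_given_i = {"e1": 0.5, "e2": 0.25}
joint = joint_evidence_prob(person_evidence, p_evidence_given_i)
```

With very many people, a product like this would underflow in floating point, so a practical version would likely sum log-probabilities instead.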

It follows from Assumptions 1 and 2 that we can derive each quantity $P(\mathbf{P}_k = \mathbf{e}_j \mid \mathbf{I} = \mathbf{i})$ via BN inference using just the model fragment defined over the set of nodes in $\mathbf{P}_k \cup \mathbf{I}$. However, this quantity must be calculated for all configurations I = i of the interface nodes. Performing this calculation for each of millions of person models would not be feasible within the time limits required for real-time biosurveillance. We could cache these conditional probability tables so that at run time each calculation reduces to a constant-time table lookup. This technique is problematic, however, because it requires caching a conditional probability table for all configurations of I and for all possible states of evidence $\mathbf{e}_j$. Such a table would be very large. As described in the next two sections, we use two techniques to deal with the large size of the inference problem: equivalence classes and incremental updating. Using equivalence classes both saves space and reduces inference time. Using incremental updating also reduces inference time, often dramatically so.