As we discussed previously, the background activity in actual healthcare utilization data is never a clean signal. There are noisy fluctuations in the data due to seasonal, day-of-week, and holiday effects. The detection algorithms described up to this point have a difficult time working with such data because the background activity is so difficult to model. If the background cases are estimated from a time period with a high count due to seasonality, then the detection algorithm becomes very insensitive to high counts in future data. Conversely, if the background is estimated during a valley, then the algorithm becomes too sensitive. In this section we describe how a simple and familiar approach—linear regression—can help with these kinds of problems. First, we will very briefly recall what linear regression does. Next, we will see how it can help model seasonal effects and then other effects, such as the day of week.
Regression is a statistical method used to determine the relationship between an output Y and a given set of inputs X. In linear regression, Y is assumed to be a linear combination of the set of inputs X, that is, $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_m X_m$. The output Y, however, is assumed to be shifted by some noise, where the noise is assumed to be generated from a Gaussian random variable with mean 0 and variance $\sigma^2$. We will refer to the variable Y as the dependent variable and the set X as the independent variables.
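Written with an explicit noise term, the model described above can be stated in one line. The symbol $\varepsilon$ for the noise is introduced here only for illustration, and the coefficients are written as $\beta_0, \dots, \beta_m$:

$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_m X_m + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$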
We will illustrate linear regression using a simple example in which the set X consists of only one variable, $X_1$. A typical regression problem begins with a set of n observed data points of the form $(x_i, y_i)$, such as the set of two-dimensional coordinates shown in Figure 14.12. From these data, we would like to find a line of the form $Y = \beta_1 X + \beta_0$ that best fits the data. The best-fitting line is defined as the line that minimizes the sum of squared residuals, where a residual is the difference between the actual value and the predicted value, that is, $y_i - \beta_1 x_i - \beta_0$. The method of least squares picks the values of $\beta_0$ and $\beta_1$ that minimize the formula:

$$\sum_{i=1}^{n} \left( y_i - \beta_1 x_i - \beta_0 \right)^2$$
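For a single input variable, the sum of squared residuals has a well-known closed-form minimizer: the slope is the ratio of the x-y co-variation to the x variation, and the intercept makes the line pass through the point of means. The following is a minimal sketch of that calculation. The function names `fit_line` and `sum_squared_residuals` and the sample data are hypothetical and are not taken from Figure 14.12.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) minimizing sum((y - beta1*x - beta0)**2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Variation of x about its mean, and co-variation of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x-values are identical; the slope is undefined")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean  # line passes through (x_mean, y_mean)
    return beta0, beta1


def sum_squared_residuals(xs, ys, beta0, beta1):
    """The quantity that least squares minimizes."""
    return sum((y - beta1 * x - beta0) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
beta0, beta1 = fit_line(xs, ys)
print(beta0, beta1, sum_squared_residuals(xs, ys, beta0, beta1))
```

Any other choice of intercept and slope gives a sum of squared residuals at least as large as the one at the fitted values, which is a quick way to sanity-check an implementation.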
in our historical data, and the Y-values are the corresponding counts in our historical data. For example, if we wish to predict the expected count for January 1, 2005, from the data in Figure 14.5 in this manner, the regression we must perform is shown in Figure 14.14. It turns out that our example data sets in the regression tutorial were derived in exactly this way.
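One plausible way to set up such a regression is sketched below. It assumes that the x-values are historical dates converted to day offsets from the first observation and that the y-values are the counts on those dates. The source text does not spell out this encoding, so it is an assumption, and the counts are invented for illustration rather than taken from Figure 14.5. The helper names `fit_line` and `predict_count` are likewise hypothetical.

```python
from datetime import date


def fit_line(xs, ys):
    """Ordinary least squares for one input; returns (beta0, beta1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    return y_mean - beta1 * x_mean, beta1


def predict_count(history, target):
    """Fit a line to (date, count) pairs and evaluate it at the target date."""
    origin = history[0][0]
    # Assumed encoding: x is the number of days since the first observation.
    xs = [(day - origin).days for day, _ in history]
    ys = [count for _, count in history]
    beta0, beta1 = fit_line(xs, ys)
    return beta0 + beta1 * (target - origin).days


# Invented daily counts for the last days of 2004.
history = [
    (date(2004, 12, 27), 41),
    (date(2004, 12, 28), 43),
    (date(2004, 12, 29), 42),
    (date(2004, 12, 30), 46),
    (date(2004, 12, 31), 47),
]
expected = predict_count(history, date(2005, 1, 1))
print(round(expected, 2))
```

The fitted line is simply extended one day past the historical window, and the resulting value serves as the expected background count against which the observed count can be compared.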
If we perform a linear regression on the points in Figure 14.12, the optimal values of $\beta_0$ and $\beta_1$ turn out to be $\beta_0 = 70.97$ and $\beta_1 = 34.01$. The resulting predictions are