9.2. Using Regression to Account for Seasonality
In order to handle seasonal effects in the data, we can use the part of the year that we are currently in as a hint to the method that predicts the expected count for today. For example, we can define the term hours_of_daylight as:
$$\text{hours\_of\_daylight} = -\cos\left(\frac{2\pi \times \text{Number of days since July 31}}{365.25}\right)$$
This expression does not really model hours of daylight, but it provides a numerical value that is low (-1) in summer, when we expect little influenza activity, and high (+1) in winter, when we expect more. How can we best predict the expected count for today from this hours_of_daylight feature and a table of historical data? The answer is, of course, to use linear regression, where the X-values are the hours_of_daylight values from all previous days and the Y-values are the counts observed on those days.
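The idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the seasonal term is a negated cosine with a 365.25-day period (so that it equals -1 on July 31 and is close to +1 six months later, as described above), and the `history` values, the helper names, and the query date are invented for the example.

```python
import math
from datetime import date


def hours_of_daylight(day: date) -> float:
    """Seasonal term: -1 on July 31, close to +1 about six months later.

    ASSUMPTION: a negated cosine with a 365.25-day period, chosen to match
    the description in the text (low in summer, high in winter).
    """
    anchor = date(day.year, 7, 31)
    if day < anchor:  # before this year's July 31: count from last year's
        anchor = date(day.year - 1, 7, 31)
    days_since = (day - anchor).days
    return -math.cos(2 * math.pi * days_since / 365.25)


def fit_line(xs, ys):
    """Ordinary least squares for y = p0 + p1 * x; returns (p0, p1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx
    p0 = mean_y - p1 * mean_x
    return p0, p1


# Hypothetical history of (date, daily count) pairs, invented for illustration.
history = [
    (date(2003, 1, 30), 48),
    (date(2003, 4, 30), 30),
    (date(2003, 7, 31), 12),
    (date(2003, 10, 30), 29),
    (date(2004, 1, 30), 51),
]
xs = [hours_of_daylight(d) for d, _ in history]
ys = [count for _, count in history]
p0, p1 = fit_line(xs, ys)

# Predict the expected count for a date that is not in the history.
today = date(2004, 2, 15)
expected_count = p0 + p1 * hours_of_daylight(today)
print(f"expected count for {today}: {expected_count:.1f}")
```

In practice the history would contain one row per previous day rather than a handful of hand-picked dates.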
Using this approach, the predicted counts can be modeled as shown in Figure 14.15. The upper limit can be derived using a well-known technique called regression analysis, which we will not describe here. Note how, in Figure 14.15, the expected counts and upper limits loosely follow our expectations based on seasonality. As a result, the estimated standard deviation of the noise is lower than in the control chart of Figure 14.5, so the upper limit can stay closer to the expectation. This makes the alarms more sensitive, and the reaction to the synthetic ramp outbreak is far stronger.
There is an interesting feature in Figure 14.15 which might have caught the eye of a reader familiar with time-series regression. Why do the magnitudes of the sinusoids change over time? The reason is that, at each date, the regression method can be trained only on data from before the current date. The training data set therefore changes as time goes by, so the fitted linear relationship changes too, which in turn causes the expected counts and upper limits to follow a varying pattern.
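The effect of retraining at every date can be sketched as an expanding-window loop. This is an illustrative sketch only: the function names, the `min_history` parameter, and the toy data are assumptions made for the example, and the upper limit is left out because its derivation is not described here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = p0 + p1 * x; returns (p0, p1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    p1 = sxy / sxx
    return mean_y - p1 * mean_x, p1


def expanding_window_fits(xs, ys, min_history=3):
    """For each day t, fit on days before t only, then predict day t.

    Returns a list with one entry per day: None while there is too little
    history, otherwise a tuple (p0, p1, predicted_count).
    """
    results = []
    for t in range(len(xs)):
        if t < min_history:
            results.append(None)  # not enough earlier data to fit a line
            continue
        p0, p1 = fit_line(xs[:t], ys[:t])  # day t itself is excluded
        results.append((p0, p1, p0 + p1 * xs[t]))
    return results


# Toy data, invented for illustration: a feature value and a count per day.
feature = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [1.0, 3.0, 6.0, 7.0, 9.0]
for day, fit in enumerate(expanding_window_fits(feature, counts)):
    print(day, fit)
```

Because each fit sees a different training set, the slope and intercept drift from one day to the next, which is the source of the changing sinusoid magnitudes described above.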
In addition to seasonal effects, regression can also deal with day-of-week effects. For example, if we know that on Mondays, the count for health utilization cases tends to differ relative to the surrounding days, we can add a binary term is_monday to the regression. With this extra term, the independent variable X no longer consists of a single value. Instead, X is now a tuple, meaning that each data point consists of the values
(hours_of_daylight_i, is_monday_i, y_i), where is_monday takes the value 0 on all days that are not Mondays and the value 1 on all Mondays. Linear regression then learns the following relationship:
$$\text{expected\_count} = \begin{cases} P_0 + P_1 \times \text{hours\_of\_daylight} & \text{if today is not a Monday} \\ P_0 + P_1 \times \text{hours\_of\_daylight} + P_2 & \text{if today is a Monday} \end{cases}$$
Thus, Mondays can be given a little "bump" in their expectation, and historical data combined with linear regression can determine the best value for that bump. A close-up of the resulting predictions is shown in Figure 14.16, and the predictions across all three years are shown in Figure 14.17. Note that the alarms grow stronger and stronger during the ramp outbreak. Monday's alarm is substantial, but less substantial than in the earlier method because we were already anticipating the Monday "bump." Note in Figure 14.17 how our injected spike now has a far stronger alarm than any other period in the three years.
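To make the Monday "bump" concrete, the following sketch fits the three coefficients P0, P1, and P2 by ordinary least squares using the normal equations. It is an assumption-laden illustration rather than the method behind the figures: the helper names are hypothetical, the data are synthetic (generated from P0 = 20, P1 = 10, P2 = 5 with no noise), and a small pure-Python solver stands in for a numerical library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes the matrix is square and non-singular.
    """
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_least_squares(rows, ys):
    """Least squares via the normal equations (X^T X) c = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic (hours_of_daylight, is_monday) pairs, invented for illustration.
samples = [(-1.0, 0), (-0.5, 1), (0.0, 0), (0.5, 0), (1.0, 1), (0.2, 0)]
# Noise-free counts generated from P0 = 20, P1 = 10, P2 = 5.
counts = [20.0 + 10.0 * h + 5.0 * m for h, m in samples]

# Each row is [1, hours_of_daylight, is_monday]; the leading 1 carries P0.
rows = [[1.0, h, float(m)] for h, m in samples]
p0, p1, p2 = fit_least_squares(rows, counts)


def expected_count(hours_of_daylight, is_monday):
    """Prediction from the fitted model; P2 is added only on Mondays."""
    return p0 + p1 * hours_of_daylight + p2 * is_monday


print(p0, p1, p2)
```

With real, noisy counts the fitted P2 would be the average amount by which Mondays differ from other days after the seasonal term has been accounted for.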
Suppose we would also like to account for the fact that a date falls on a Tuesday. We can simply add another binary term called is_tuesday to the regression. We can even consider all of the days of the week by adding the six binary terms is_monday, is_tuesday, is_wednesday, is_thursday, is_friday, and is_saturday. Note that a term for is_sunday is not needed because a value of 0 for all six binary terms would indicate that the day is a Sunday.
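The six binary terms can be built directly from a calendar date, as in the sketch below. The function names and the layout of the feature row are illustrative assumptions; the only library behavior relied on is that `date.weekday()` returns 0 for Monday through 6 for Sunday.

```python
from datetime import date

# Sunday is deliberately absent: it is the baseline day.
DAY_TERMS = (
    "is_monday", "is_tuesday", "is_wednesday",
    "is_thursday", "is_friday", "is_saturday",
)


def day_of_week_terms(day: date) -> list:
    """Return the six 0/1 terms for `day`; all zeros means Sunday."""
    weekday = day.weekday()  # Monday == 0, ..., Sunday == 6
    return [1 if weekday == i else 0 for i in range(len(DAY_TERMS))]


def feature_row(day: date, seasonal_value: float) -> list:
    """Regressors for one day: intercept, seasonal term, six day terms."""
    return [1.0, seasonal_value] + day_of_week_terms(day)


print(dict(zip(DAY_TERMS, day_of_week_terms(date(2024, 1, 1)))))  # a Monday
print(day_of_week_terms(date(2024, 1, 7)))                        # a Sunday
```

Leaving out is_sunday also keeps the regression well posed: with an intercept present, seven day-of-week terms would always sum to 1 and duplicate the intercept column, so their coefficients could not be determined uniquely.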