The Hotelling Statistic

The Hotelling $T^2$ test checks whether the recent mean of a set of signals deviates from its expected value, accounting for known correlation between the signals. Consider a data set with $p$ variables and $n$ samples from the recent past, and let $X_i$, a vector of length $p$, denote the $i$th sample. The Hotelling $T^2$ statistic is defined as:

$$T^2 = n\,(\bar{X} - \mu_0)^{\top} S^{-1} (\bar{X} - \mu_0)$$

where $\bar{X}$ is the sample mean vector of the $n$ recent samples, $S$ is the sample variance-covariance matrix, and $\mu_0$ is the expected mean vector. For example, in Figure 15.2, February 20 would be the data point with the largest $T^2$ value.
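The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`mean_vector`, `covariance_matrix`, `solve`, `hotelling_t2`) are hypothetical, the covariance uses the unbiased divisor n - 1, and only the standard library is used so that the sketch has no dependencies. In practice a linear algebra library would normally handle the matrix solve.

```python
from statistics import fmean


def mean_vector(samples):
    """Column-wise mean of a list of equal-length sample vectors."""
    return [fmean(col) for col in zip(*samples)]


def covariance_matrix(samples):
    """Unbiased sample variance-covariance matrix S (divides by n - 1)."""
    n = len(samples)
    xbar = mean_vector(samples)
    p = len(xbar)
    centered = [[x[j] - xbar[j] for j in range(p)] for x in samples]
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    m = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular (is n > p?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (aug[i][m] - tail) / aug[i][i]
    return x


def hotelling_t2(samples, mu0):
    """T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0) for n samples of p variables."""
    n = len(samples)
    xbar = mean_vector(samples)
    diff = [xb - m for xb, m in zip(xbar, mu0)]
    # Solve S w = diff rather than forming the inverse of S explicitly.
    w = solve(covariance_matrix(samples), diff)
    return n * sum(d * wi for d, wi in zip(diff, w))
```

Calling `hotelling_t2(samples, mu0)` with `samples` as a list of n vectors of length p returns a single non-negative number. It is zero when the sample mean equals `mu0` and grows as the mean moves away from `mu0` in directions where the signals vary little.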

It is well known that $T^2$ is distributed as

$$T^2 \sim \frac{p(n-1)}{n-p}\, F_{p,\,n-p}$$

where $F_{p,\,n-p}$ represents the Fisher F distribution with $p$ and $n-p$ degrees of freedom.
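Equivalently, the distributional statement can be rearranged so that a rescaled statistic follows the F distribution directly, which is the form usually needed when looking up a p-value. With $T^2$, $n$, and $p$ as defined in the text:

$$\frac{n-p}{p(n-1)}\, T^2 \sim F_{p,\,n-p}$$

For instance, with $n = 10$ recent samples of $p = 2$ variables, the scaling factor is $8/18 \approx 0.444$, so $0.444\,T^2$ is referred to an F distribution with 2 and 8 degrees of freedom.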

The Hotelling $T^2$ test can be applied to a multivariate time series to detect whether a new data point deviates substantially from its expected value. The expected value $\mu_0$ and the covariance matrix $S$ can be computed from historical data. For example, we can take the mean and covariance of all the data seen so far. A more sophisticated approach, which accounts for gradual drift in the process, is to model the time series (using AR or MA models, as mentioned before) and compute $\mu_0$ and $S$ from this model.

Figure 15.2 A scatterplot of sales reveals an outlier.

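At any time $t$, we consider the past $n$ values of the $p$ variables. Let $\bar{X}$, a vector of size $p$, be the sample mean of these $n$ values, and let $\mu_0$ be the expected mean vector. We can test the null hypothesis that the true mean of these values equals $\mu_0$ by computing

$$T^2 = n\,(\bar{X} - \mu_0)^{\top} S^{-1} (\bar{X} - \mu_0)$$

and comparing it with the critical value $\frac{p(n-1)}{n-p} F_{\alpha}(p,\, n-p)$ for a suitably chosen $\alpha$. If the $T^2$ statistic exceeds this critical value, we can reject the null hypothesis with a confidence of $(1-\alpha)$ and signal an alarm.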
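The alarm rule can be sketched as follows for the bivariate case only. This is an illustrative special case, not the general procedure: when p = 2, the upper quantile of the F distribution with 2 and m degrees of freedom has the closed form (m/2)(α^(-2/m) - 1), so no statistics library is needed. For other values of p, the quantile would have to come from F tables or a statistics package. The function names are hypothetical, and the default α = 0.05 is an assumption rather than a recommendation from the text.

```python
def f_upper_quantile_2df(alpha, m):
    """Upper-alpha quantile of F(2, m).

    Valid ONLY for 2 numerator degrees of freedom, where the survival
    function is (1 + 2x/m) ** (-m/2) and can be inverted in closed form.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (m / 2.0) * (alpha ** (-2.0 / m) - 1.0)


def t2_critical_value_bivariate(n, alpha=0.05):
    """Critical value p(n-1)/(n-p) * F_alpha(p, n-p) for p = 2 variables."""
    p = 2
    if n <= p:
        raise ValueError("need more samples than variables (n > p)")
    return p * (n - 1) / (n - p) * f_upper_quantile_2df(alpha, n - p)


def signal_alarm(t2, n, alpha=0.05):
    """Return True when T^2 exceeds the critical value (reject the null)."""
    return t2 > t2_critical_value_bivariate(n, alpha)
```

With n = 10 and α = 0.05, the F quantile is about 4.46 and the critical value is about 10.03, so a window whose $T^2$ is 12 would raise an alarm while one whose $T^2$ is 8 would not.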

Several biosurveillance research groups have used Hotelling methods with considerable success. Good examples, within the ESSENCE framework, are described in Burkom et al. (2004, 2005).