## Implications

1. When they are valid, parametric tests are more statistically powerful than nonparametric tests. This means that more probesets should pass a threshold based on p values from ANOVA than would pass if the p values were based on a nonparametric test of the same data.

2. Log-transformed microarray data approximate a normal distribution. Even though the data are not perfectly normal, they are close enough to allow the use of parametric tests. To confirm the accuracy of parametric tests, such as t tests and ANOVA, on log-transformed signal data, we can use the residuals. Residuals are the measurement errors, in this case the differences between each observed value and the mean of all observations. If parametric tests are valid, the residuals should be normally distributed. At first glance, the residuals for all genes are not normally distributed (Figure 3.12 A). Observe that low signal values are more variable than moderate-to-high signal values. This means that the variance is unstable across intensity and therefore the residuals are unstable. A standardized residual is the difference between each observation and the mean, divided by the standard deviation of the measurements. Standardizing therefore removes the effect of the differing variances on the measurement. The standardized residuals are nearly perfectly normal (Figure 3.12 B and Table 3.1). The instability of the variance over intensity fully explains the non-normality in the distribution of residuals. As a result, we can predict when parametric tests will be the most accurate. The practical implication is that in the rare instances in which a signal level changes over several orders of magnitude, the p value estimate is less accurate. For typical, small changes in signal, the difference in variance is small and the p values are accurate.

Fig. 3.12 Residuals of ANOVA for each probeset over a time series. The x axis represents the log-transformed signal values. (A) Pattern of residuals over signal intensity. Such a nonnormal pattern may influence the accuracy of p values produced by a parametric test. (B) The standardized residuals have a random scatter consistent with the assumptions of parametric tests. The practical result of this observation is that the p values derived from ANOVA of the log signal are reasonably accurate for the majority of probesets. For those rare probesets that change dramatically and therefore have dramatically changed variance, the p values are less accurate. As a rule, variance-stabilizing methods minimize these effects, allowing us to get accurate p values from parametric tests.

Tab. 3.1 Summary data for residuals. These distributions demonstrate that divergence of residuals from normal is an effect of the instability of variance across intensity and that, for most genes, which do not change, this instability does not affect the p value, as the standardized residuals are nearly perfectly normal. We expect therefore that only those rare probesets that change dramatically in magnitude will have inaccurate p values.

| | Residuals | Standardized residuals | Ideally normal residuals |
|---|---|---|---|
| Mean | -2.31e-11 | -7.20e-11 | 0 |
| Standard deviation | 0.3995765 | 1.000002 | 1 |
| Variance | 0.1596614 | 1.000004 | 1 |
| Skewness | -0.2025563 | -0.1261272 | 0 |
| Kurtosis | 6.9743 | 2.9944 | 3 |
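The effect of standardizing residuals described in point 2 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, the noise model (standard deviation falling linearly as log signal rises) is invented for illustration, and the function names `moments` and `simulate_residuals` are hypothetical. It pools raw residuals across probesets with unequal variances, then repeats the summary after dividing each residual by its probeset's sample standard deviation.

```python
import random
import statistics


def moments(values):
    """Return (mean, sd, skewness, kurtosis); kurtosis is 3 for a normal distribution."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n
    return mean, sd, skew, kurt


def simulate_residuals(n_probesets=2000, n_obs=24, seed=0):
    """Simulate raw and standardized residuals for hypothetical probesets."""
    rng = random.Random(seed)
    raw, standardized = [], []
    for _ in range(n_probesets):
        log_signal = rng.uniform(4.0, 14.0)
        # Assumed noise model (not from the text): the SD shrinks linearly
        # from 1.0 at the lowest signal to 0.1 at the highest signal.
        noise_sd = 1.0 - 0.9 * (log_signal - 4.0) / 10.0
        obs = [rng.gauss(log_signal, noise_sd) for _ in range(n_obs)]
        mean = statistics.fmean(obs)
        sd = statistics.stdev(obs)  # per-probeset sample standard deviation
        for x in obs:
            raw.append(x - mean)                   # raw residual
            standardized.append((x - mean) / sd)   # standardized residual
    return raw, standardized


if __name__ == "__main__":
    raw, standardized = simulate_residuals()
    for label, values in (("raw", raw), ("standardized", standardized)):
        mean, sd, skew, kurt = moments(values)
        print(f"{label:>12}: mean={mean:.3g} sd={sd:.3f} "
              f"skew={skew:.3f} kurtosis={kurt:.3f}")
```

In this simulation the pooled raw residuals should be heavy-tailed (kurtosis well above 3) purely because probesets with different variances are mixed, while the standardized residuals should sit close to the normal value. With only a few observations per probeset, dividing by an estimated standard deviation pulls the kurtosis slightly below 3, so the match is approximate rather than exact.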

3. The p values provided by a multiclass nonparametric test such as the Kruskal-Wallis test are more significant (lower p value) for linear responses than for nonlinear responses (Figure 3.13). A nonparametric test is based on rank order. For example, in a linear time response there are changes in rank order at every time point, but a spiking pattern may only have changes in rank order at a single time point. This is not necessarily true for a nonparametric test when the magnitude of change is also taken into account. This is an example of the principle that the mechanics of a statistical test may crucially influence which probesets are selected by p value and therefore the interpretation of the results.
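The rank-order argument in point 3 can be checked on toy data. The sketch below is an illustration only: the two response patterns, the noise level, and the helper names (`kruskal_wallis_h`, `anova_f`, `make_patterns`) are assumptions, not values or code from the experiment. It computes the Kruskal-Wallis H statistic by hand (without tie correction) and, for contrast, a one-way ANOVA F statistic, for a linear response and a single-time-point spike that share the same simulated noise.

```python
import random


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, without tie correction (assumes distinct values)."""
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    rank_sums = [0.0] * len(groups)
    for rank, (_, g) in enumerate(pooled, start=1):
        rank_sums[g] += rank
    n = len(pooled)
    weighted = sum(r * r / len(group) for r, group in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


def anova_f(groups):
    """One-way ANOVA F statistic with each time point treated as a class."""
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def make_patterns(n_reps=4, noise_sd=0.05, seed=1):
    """Build a linear and a spiking response (hypothetical values) with shared noise."""
    rng = random.Random(seed)
    linear_means = [8.0, 8.5, 9.0, 9.5, 10.0, 10.5]  # steady rise in log signal
    spike_means = [8.0, 11.0, 8.0, 8.0, 8.0, 8.0]    # transient spike at one point
    noise = [[rng.gauss(0.0, noise_sd) for _ in range(n_reps)] for _ in linear_means]
    linear = [[m + e for e in row] for m, row in zip(linear_means, noise)]
    spike = [[m + e for e in row] for m, row in zip(spike_means, noise)]
    return linear, spike


if __name__ == "__main__":
    linear, spike = make_patterns()
    print("Kruskal-Wallis H: linear =", round(kruskal_wallis_h(linear), 2),
          " spike =", round(kruskal_wallis_h(spike), 2))
    print("ANOVA F:          linear =", round(anova_f(linear), 1),
          " spike =", round(anova_f(spike), 1))
```

With these assumed patterns the linear response separates every time point in rank and so reaches the largest H the design allows, whereas the spike only separates one group from the rest. The F statistic, which uses the magnitude of the change, ranks the two patterns the other way round here, which is the sense in which the mechanics of the test shape the list of selected probesets.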

Fig. 3.13 Gene expression responses to time (log(signal) as a function of time in hours) from four different probesets with the lowest (best) p values produced by four analytical methods. Each panel represents four replicates of the same probeset. (A) ANOVA of changes. Time and the differences between time points were tested by a categorical ANOVA. This is a good technique for finding transient spikes. (B) ANOVA model with time treated as a categorical variable. (C) Kruskal-Wallis test. This signal response is nearly linear, because a linear change with time appears to have the most significant rank order changes. Patterns such as those in (B) are much less likely to be discovered by the Kruskal-Wallis test than those in (A), because in (B) the gene is relatively stable in ranking over most of the time course. (D) Linear regression of log-transformed data with time. This method cannot detect the patterns in (A) or (B), as the divergence from linearity is penalized.

4. Each time or dose point may have a different biological significance. By contrast, statistical tests treat each point in a time or concentration series identically. If the biological significance of a time point changes, it should be reflected in the statistical model. For example, toxicology experiments may split time into two classes that correspond to acute and chronic responses. The decision to treat time and dose as continuous variables or as class variables should be made explicitly, based on the biology. The gene expression pattern in Figure 3.13 A is not readily captured by models in which time is continuous. Furthermore, in the context of a toxicological response, the rapid change in the transcript in Figure 3.13 B in the 0 to 1 h time interval is much more interesting than the same response at any other time point.
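The consequence of treating time as continuous rather than as a class variable can be sketched on a simulated transient spike. Everything specific here is an assumption made for illustration: the sampling times, the spike size, the noise level, and the function names `r2_linear`, `r2_categorical`, and `spike_series`. The sketch compares the fraction of variance explained by a straight-line fit on time with the fraction explained by a model that has one mean per time point.

```python
import random


def r2_linear(times, values):
    """R^2 of a straight-line fit of values on time (time as a continuous variable)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(values) / n
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, values))
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_yy = sum((y - mean_y) ** 2 for y in values)
    return s_ty ** 2 / (s_tt * s_yy)


def r2_categorical(times, values):
    """R^2 of a model with one mean per time point (time as a class variable)."""
    by_time = {}
    for t, y in zip(times, values):
        by_time.setdefault(t, []).append(y)
    mean_y = sum(values) / len(values)
    ss_total = sum((y - mean_y) ** 2 for y in values)
    ss_within = sum(
        (y - sum(group) / len(group)) ** 2
        for group in by_time.values()
        for y in group
    )
    return 1.0 - ss_within / ss_total


def spike_series(n_reps=4, noise_sd=0.05, seed=2):
    """Simulate a hypothetical probeset that spikes at 1 h and then returns to baseline."""
    rng = random.Random(seed)
    level = {0: 8.0, 1: 11.0, 2: 8.0, 4: 8.0, 8: 8.0, 24: 8.0}  # hours -> log signal
    times, values = [], []
    for hour, mean in level.items():
        for _ in range(n_reps):
            times.append(hour)
            values.append(mean + rng.gauss(0.0, noise_sd))
    return times, values


if __name__ == "__main__":
    times, values = spike_series()
    print("R^2, time continuous: ", round(r2_linear(times, values), 3))
    print("R^2, time categorical:", round(r2_categorical(times, values), 3))
```

For a spike of this kind the straight-line model explains only a small share of the variance, while the class-variable model explains almost all of it, so a regression on continuous time would be unlikely to select such a probeset.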

5. Vehicle and naïve controls are commonly used in toxicology. Toxins are usually dissolved in a solvent or 'vehicle' when applied to a system. The vehicle by itself may have a significant response that needs to be controlled. Usually this is done by comparison to a 'naïve' control that has not been exposed to the vehicle or the toxin. The responses for these two types of controls can be classed together in an ANOVA, because neither response is related to the toxin. However, modelling the naïve and vehicle controls in a split-plot design more accurately reflects the experiment (Cobb 1997, Chapter 8, Section 3).

6. Matching the logic of the statistical test to the biological interpretation is very important. Do not use an experimental design without first verifying that valid interpretation of the statistical results addresses the questions of biological interest.
