## Postquantification Data Analysis

### 3.7.1. Microarray Data Quality Control (see Note 16 and Fig. 4)

1. For microarray data on a given DNA binding protein, for each of the triplicate protein binding microarrays, remove data corresponding to any flagged spots (i.e., spots that had dust flecks, and so forth).

2. Normalize the data from each of the three replicate microarrays according to total signal intensity, so that the average spot intensity is the same for all three microarrays.

3. Within each individual microarray, separate the data into sectors, according to their local region on the slide. For example, for the whole-genome yeast intergenic arrays (2), we sectored the spots into the 32 subgrids of the printed microarray.

4. Normalize the data again so that the mean spot intensity is the same over all the sectors. This corrects for any region-specific inhomogeneities in the background and in the binding and labeling reactions.

5. Remove any spots for which the standard deviation (SD) of the pixel signal intensities divided by the median value is greater than 2, i.e., spots with highly variable pixel signal intensities.

6. Average the background-subtracted, normalized signal intensities for all spots with reliable data in at least two of the three replicate microarrays, and calculate the SD/mean value. Remove any spots for which the SD/mean value is greater than 1.

7. Treat the SYBR Green I microarray data exactly the same way, except here remove any spots with fewer than 50% pixels with signal intensities greater than 2 SDs beyond the median background signal intensity, as these spots presumably do not have enough DNA present to allow accurate quantification of signal intensities (2) (see Note 17).
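The normalization and replicate-averaging steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used for the published analysis. It assumes each microarray is held as a dictionary mapping a spot identifier to its background-subtracted intensity, that flagged spots have already been deleted from the dictionaries, and that spots with a nonpositive mean are discarded. The function names and the `sector_of` lookup are hypothetical.

```python
from statistics import mean, stdev

def scale_to_common_mean(arrays):
    """Rescale each array (dict: spot id -> intensity) to a shared mean intensity."""
    target = mean(mean(a.values()) for a in arrays)
    return [{s: v * target / mean(a.values()) for s, v in a.items()} for a in arrays]

def normalize_sectors(array, sector_of):
    """Rescale one array so that every sector has the array-wide mean intensity."""
    overall = mean(array.values())
    by_sector = {}
    for spot, value in array.items():
        by_sector.setdefault(sector_of[spot], []).append(value)
    sector_mean = {sec: mean(vals) for sec, vals in by_sector.items()}
    return {s: v * overall / sector_mean[sector_of[s]] for s, v in array.items()}

def combine_replicates(arrays, max_cv=1.0, min_replicates=2):
    """Average spots present in at least `min_replicates` arrays; drop noisy spots."""
    combined = {}
    for spot in set().union(*arrays):
        values = [a[spot] for a in arrays if spot in a]
        if len(values) < min_replicates:
            continue
        avg = mean(values)
        # Assumption: spots with a nonpositive mean cannot be scored and are dropped.
        if avg > 0 and stdev(values) / avg <= max_cv:
            combined[spot] = avg
    return combined
```

A typical call sequence would be `scale_to_common_mean` on the three replicates, `normalize_sectors` on each result, and then `combine_replicates` on the three normalized arrays.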

### 3.7.2. Identification of the 'Bound' Spots

1. Calculate the log2 ratio of the mean PBM signal intensity divided by the mean SYBR Green I signal intensity, and create a scatter plot of the log ratio vs the SYBR Green I signal intensities of the spots.

2. Although we expect that the log ratio should be independent of DNA concentration, we have found that higher DNA concentrations, as determined by higher SYBR Green I signal intensities, appear to bind proportionately less protein. To restore the independence of log ratio and SYBR Green I intensity, fit the scatter plot with a locally weighted least-squares regression using the LOWESS function (smoothing parameter = 0.5) (8) of the R statistics package.

3. Subtract the value of the regression at each spot from its log ratio, yielding a modified log ratio that is independent of DNA concentration.

4. Plot the distribution of all log ratios as a histogram (bin size = 0.05), which, for a DNA binding protein with clear sequence specificity, is expected to resemble a Gaussian distribution with a heavy upper tail.

5. Determine the mode of the distribution by searching for the window of nine bins with the highest number of spots and taking the middle bin.

6. Reflect all values less than the mode and fit these values to a Gaussian function using the Mathematica software package. This provides the mean and SD of the distribution of nonspecifically bound spots.

7. Adjust the log ratios so that the peak of the distribution of nonspecifically bound spots is centered on zero.

8. Calculate a p value for each spot based on z, the number of standard deviations that the spot's log ratio departs from the mean of the Gaussian distribution, using the normal error integral (9) (see Note 18). The p value can be calculated easily in Microsoft Excel using the standard normal cumulative distribution function: normsdist(-z). This p value for each spot represents the probability that the spot is contained within the distribution of nonspecifically bound spots. Thus, spots with very small p values in the heavy upper tail of the real distribution are likely to be bound sequence-specifically by the given DNA binding protein.

9. To correct for multiple hypothesis testing, adjust all individual p values to a modified significance level using the modified Bonferroni method (10,11). For significance testing of the PBM data, we recommend using an initial α = 0.001, which corresponds to α′ equal to approximately 1.5 × 10⁻⁷ for the highest ranking test case when evaluating approximately 6400 unique spots, which is the case for typical yeast intergenic microarray data (2). Spots meeting or exceeding α′ are considered 'bound' at a statistically significant threshold (see Fig. 3A).
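Steps 4 to 9 can be illustrated with the following Python sketch, which takes log ratios that have already been corrected for DNA concentration. Several choices here are assumptions rather than the published procedure. The SD of the nonspecific distribution is estimated directly from the values at or below the mode (a moment estimate standing in for the Gaussian curve fit). The modified Bonferroni method is assumed to be a sequential, Holm-type procedure, which is consistent with the α′ quoted for the highest ranking test case (0.001/6400 ≈ 1.56 × 10⁻⁷). All function names are hypothetical.

```python
import math
from collections import Counter

def find_mode(values, bin_size=0.05, window=9):
    """Center of the middle bin of the `window` adjacent bins holding the most values."""
    counts = Counter(math.floor(v / bin_size) for v in values)
    lo, hi = min(counts), max(counts)
    best_start, best_total = lo, -1
    for start in range(lo, max(lo, hi - window + 1) + 1):
        total = sum(counts.get(b, 0) for b in range(start, start + window))
        if total > best_total:
            best_start, best_total = start, total
    middle_bin = best_start + window // 2
    return (middle_bin + 0.5) * bin_size

def null_sd(values, mode):
    """SD of the nonspecific distribution from values at or below the mode.

    Mirroring these values about the mode doubles both the sum of squares and
    the count, so the reflected set gives the same estimate as the left half.
    """
    left = [v for v in values if v <= mode]
    return math.sqrt(sum((v - mode) ** 2 for v in left) / len(left))

def p_value(log_ratio, mode, sd):
    """Upper-tail normal probability, equivalent to normsdist(-z)."""
    z = (log_ratio - mode) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def sequential_bonferroni(p_values, alpha=0.001):
    """Indices of spots called 'bound' under a Holm-type step-down test (assumed)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    bound = []
    for rank, i in enumerate(order):
        # The smallest p value is tested against alpha / n, the next against
        # alpha / (n - 1), and so on; stop at the first failure.
        if p_values[i] > alpha / (n - rank):
            break
        bound.append(i)
    return bound
```

Subtracting the value returned by `find_mode` from every log ratio centers the nonspecific peak on zero, as in step 7.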

### 3.7.3. Discovery of the DNA Binding Site Motif

1. To search for a DNA motif that is overrepresented in PBM data and thus is the likely DNA binding site motif of the given DNA binding protein, select the sequences from all the spots that had a Bonferroni-corrected p value less than or equal to 0.001.

2. For this set of input sequences, use BioProspector (4) (see Note 19) to perform separate motif searches at each width between 6 and 18 nucleotides to identify the highest scoring motifs at each width.

3. Identify all matches to the motif within all sequences spotted on the microarray, and then calculate the group specificity score (5) of each discovered motif. These tasks can be accomplished with the pair of programs ScanACE and MotifStats (5), or with the software package MultiFinder (12).

4. Choose the single motif with the lowest group specificity score (5) to be the most significant, using the set of all sequences spotted on the microarray as the background. We use this scoring metric because it indicates the degree to which the property of containing the sequence motif is specific to the input set of intergenic regions, as determined from the most significantly bound spots on the microarrays. A lower, and thus better, group specificity score indicates that the motif is more specific to the input set of spots (i.e., the spots beyond a 0.001 p value threshold in the PBM data, or the randomly selected spots in the computational negative controls [see step 5 below]).

5. To assess the statistical significance of the DNA sequence motifs resulting from analysis of the PBM experiments, perform a set of computational negative control motif searches. Specifically, perform identical motif searches on at least 10 separate sets of randomly selected spots from the same microarrays used to perform the PBM experiments, with each of the 10 random sets containing the same number of sequences as the original input set for the given PBM dataset.

6. Motifs with group specificity scores that are more significant compared with the group specificity scores of the corresponding computational negative control sets are considered to likely correspond to the DNA binding site motif for the given DNA binding protein (see Fig. 3B). Examples of the ranges of group specificity scores for computational negative controls and for actual PBM data for yeast transcription factors can be found in ref. 2.
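The computational negative controls in steps 5 and 6 can be set up as sketched below. This is only an illustration: the group specificity scores themselves are assumed to come from the motif-scoring software named above and are passed in as plain numbers. Requiring the PBM motif to score lower than every control is one possible reading of "more significant", not a rule stated in the protocol, and the function names and fixed random seed are hypothetical.

```python
import random

def draw_control_sets(all_spot_ids, set_size, n_sets=10, seed=0):
    """Draw `n_sets` random sets of spots, each the size of the PBM input set."""
    rng = random.Random(seed)          # fixed seed so the controls are reproducible
    spots = sorted(all_spot_ids)       # sampling needs an ordered sequence
    return [rng.sample(spots, set_size) for _ in range(n_sets)]

def beats_controls(motif_score, control_scores):
    """True if the PBM motif score is lower (better) than every control score."""
    return motif_score < min(control_scores)
```

Each control set would then be run through the same motif searches and scoring as the real input set, and the resulting scores collected for `beats_controls`.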
