Data Handling and Data Visualization

Handling the massive amounts of data generated by a high-throughput screening lab is a continuing challenge that will be further exacerbated by the move towards assay miniaturization, with its potential to generate data at a 10- to 100-fold greater rate. Whether the physical format of the test device is 1536-well, 9600-well, or chip-based, one of the potential rate-limiting activities is the transfer of data (raw or processed) acquired from measurement devices, the association of this data with unique indexing or cataloging identifiers, the matching of the data back to the compounds tested, and the loading of the data into a database that can be queried. Once the data is in the database, the additional bottleneck is data presentation as an aid to supporting decisions about the quality and significance of the data accumulated. Even at current production rates of 100-400 96-well plates per day, reviewing reams of tabular information or plate-by-plate graphical reports is untenable.

The HTS community would do well to learn from the progress towards automating DNA sequencing and transcriptional profiling made by functional genomics companies. All data loading is done automatically, from sequencing readers directly into ORACLE relational tables as raw fluorescent traces, which are then provisionally assigned base calls of A, C, T, or G by automatically launched processes. Sequencing run sets, associated primer sequences, filenames, and the actual raw traces are automatically linked and available for downstream queries, such as searches against public databases. The end user can monitor the progress of any sequencing effort in real time, drill down to the primary data to verify data quality, and edit these data files, all from a client-server environment. Removing all human intervention except at the point of data review streamlines data handling.

Handling and processing these increased volumes of data, and automatically loading both raw and processed data into a robust relational database management system (RDBMS), is central to an HTS process. All processed data is derived by automatically launched algorithms that calculate it from the underlying raw primary data. This paradigm uncouples the data loading process from the data review and analysis process. Some HTS groups load only validated data, but this creates a bottleneck for data capture and exposes the system to misplaced or lost data; the loss of a link back to the primary data precludes verification or modification during subsequent analyses. For instance, a test plate may initially appear out of specification, but upon review it may be found that a spurious control well has thrown off all of the calculated percentage inhibitions; masking this value (not deleting it from the database) and recalculating yields an acceptable data set from this test plate. Loading only a processed data set of percentage inhibitions would have led to a loss of data. More importantly, at any later date another scientist could examine the underlying data in order to verify the rejection of the "spurious" control. These quality-control decisions are made after the data is acquired and already in the database. HTS is a highly iterative and dynamic process: certain criteria, such as hit rejection thresholds, are provisionally set at the start of an HTS campaign and need to be updated throughout it, and the final screen selections need to be reviewed as a whole after screening of the collection is complete. This cannot be done in a manner based on sound statistical analysis if data is rejected too early in the process.
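As a concrete illustration of this mask-and-recalculate paradigm, the sketch below recomputes percentage inhibitions from raw well signals after a suspect control well is masked. The numbers, array names, and the simple high/low-control normalization are assumptions made for illustration, not a prescribed HTS calculation.

```python
import numpy as np

# Hypothetical raw signals for one test plate: the last high control is spurious.
high_controls = np.array([1050.0, 1032.0, 1048.0, 180.0])  # 0% inhibition wells
low_controls = np.array([95.0, 101.0, 98.0, 103.0])         # 100% inhibition wells
samples = np.array([990.0, 510.0, 140.0, 870.0])            # compound wells

def percent_inhibition(samples, high, low, mask=None):
    """Recompute percent inhibition from raw signals; masked control wells are
    excluded from the control means but remain untouched in the database."""
    if mask is not None:
        high = high[mask]
    hi, lo = high.mean(), low.mean()
    return 100.0 * (hi - samples) / (hi - lo)

# First pass: the spurious control drags the high-control mean down and the
# plate looks out of specification (negative inhibitions appear).
print(percent_inhibition(samples, high_controls, low_controls))

# After review: mask the outlying control (index 3) and recalculate.
keep = np.array([True, True, True, False])
print(percent_inhibition(samples, high_controls, low_controls, mask=keep))
```

Because the raw signals stay in the database, the masked value can be inspected, or the masking decision reversed, at any later review.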

Regardless of the particular RDBMS, any HTS group should, at the outset, give considerable thought to defining their workflow so as to highlight the kinds of data being generated and used and, more importantly, to defining what kinds of queries they might want to ask of their database and their processes.

It is particularly important in image-based technologies that the primary raw data, e.g., pixel intensities, be stored directly to tables, rather than the processed image that appears as a grey-scale or color-coded rendering on a CCD camera's terminal. These visual images can be readily distorted by changing gain and contrast, whereas the primary absolute intensity values do not change. A link back from a processed image and its settings files to the table of raw intensity values would be indispensable.
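One way to preserve that link is to keep the absolute intensities and the acquisition settings in separate tables joined by a key, so any rendered image can always be traced back to unmodified primary values. The schema below is a minimal sketch using SQLite from Python; the table and column names are hypothetical and would need to be adapted to an actual imaging platform.

```python
import sqlite3

# Minimal illustrative schema: raw per-well pixel intensities plus a foreign key
# to the acquisition settings used when the image was captured.
con = sqlite3.connect("hts_imaging.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS acquisition_settings (
    settings_id  INTEGER PRIMARY KEY,
    instrument   TEXT,
    exposure_ms  REAL,
    gain         REAL,
    filter_set   TEXT
);
CREATE TABLE IF NOT EXISTS raw_intensity (
    plate_barcode TEXT,
    well          TEXT,      -- e.g. 'A01'
    pixel_row     INTEGER,
    pixel_col     INTEGER,
    intensity     REAL,      -- absolute detector value, never rescaled
    settings_id   INTEGER REFERENCES acquisition_settings(settings_id)
);
""")
con.commit()
```

Display adjustments (gain, contrast, false color) then act only on copies derived from these tables, never on the stored values themselves.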

Another key to data analysis is that human beings see patterns better than computers do, in the absence of considerable investment in algorithms and code. It is therefore better to have good visual representations of the data than simple tables. Rapid graphical representation of data is critical during two main HTS activities: the initial data review, to uncover experimental artifacts, and the final review of library performance against an assay and across assays, as an assessment of library quality.

Data from runs represented as grids of plate data are useful for getting a feel for how a particular plate performed on a given day. However, as run sets have routinely become larger, there is less value in reviewing each data set as individual plates, except as an alert that there may be some assay artifact on a given plate. An aggregate histogram or scattergram therefore becomes more useful, with the ability to drill back to an individual plate associated with a section of that scattergram. Visualization is especially valuable for recognizing plate artifacts such as edge effects or pipetting errors. If all plates run on a given day were collapsed onto a "virtual" aggregate plate with some measure of hit frequency per well position, it would be obvious whether a statistically anomalous number of hits were concentrated in one area of the plates, and potential artifacts of this type could be noted readily.
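A rough sketch of such a "virtual" aggregate plate follows: a day's worth of 96-well plates is collapsed into one 8 x 12 grid of per-position hit frequencies, and positions whose frequency is statistically anomalous are flagged. The simulated data, the 50% inhibition hit threshold, and the three-sigma flagging rule are illustrative assumptions.

```python
import numpy as np

# Simulate a day's run of 200 plates of percentage-inhibition data (8 x 12 wells)
# and inject an artifact into the last column to mimic an edge effect.
rng = np.random.default_rng(0)
day_run = rng.normal(loc=5.0, scale=10.0, size=(200, 8, 12))
day_run[:, :, 11] += 60.0

hit_threshold = 50.0                                # illustrative hit criterion
hit_freq = (day_run >= hit_threshold).mean(axis=0)  # fraction of hits per well position

# Flag well positions whose hit frequency is far above the plate-wide average:
# a positional cluster points to an artifact rather than to chemistry.
flagged = np.argwhere(hit_freq > hit_freq.mean() + 3 * hit_freq.std())
for r, c in flagged:
    print(f"row {chr(ord('A') + int(r))}, column {int(c) + 1}: "
          f"hit frequency {hit_freq[r, c]:.2f}")
```

In a production system the same aggregation would be driven from the RDBMS, with each flagged position linking back to the individual plates and raw wells behind it.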

A dynamically updated histogram of all aggregated data for a screen, which can be broken into constituent parts and presented as subset histograms according to a day's run, the last batch run, etc., would facilitate decisions about data rejection or inclusion. It would also indicate whether activity was distributed evenly across the library or concentrated in a subset of it (i.e., whether the distribution was bimodal).
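A minimal sketch of such subset histograms, assuming synthetic percentage-inhibition values tagged by run date, is shown below; a run whose distribution departs from the aggregate stands out immediately.

```python
import numpy as np

# Synthetic screen data: two typical runs and one shifted, suspect run.
rng = np.random.default_rng(7)
percent_inhibition = np.concatenate([
    rng.normal(3, 12, 20_000),    # run of 2024-05-01
    rng.normal(4, 11, 20_000),    # run of 2024-05-02
    rng.normal(25, 15, 20_000),   # run of 2024-05-03: shifted distribution
])
run_day = np.repeat(["2024-05-01", "2024-05-02", "2024-05-03"], 20_000)

bins = np.linspace(-50, 100, 31)
print("aggregate median % inhibition:",
      round(float(np.median(percent_inhibition)), 1))

for day in np.unique(run_day):
    subset = percent_inhibition[run_day == day]
    counts, _ = np.histogram(subset, bins=bins)
    # A subset histogram shifted away from the aggregate flags that run for
    # review before inclusion/rejection decisions are finalized.
    print(day, "median:", round(float(np.median(subset)), 1),
          "modal bin starts at:", round(float(bins[np.argmax(counts)]), 1))
```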

During the final review, visualizing the data as a multidimensional surface of the library's biological activity against several assays would uncover hot spots of activity for certain target types. These could then be parsed out for analysis against chemical-space parameters to uncover additional pharmacophores. There is a need to display these n dimensions simultaneously as a 3-D projection, and several software companies have enabled such features; the data visualizer from Spotfire, for example, allows dynamic display of up to six dimensions or characteristics for up to 150K samples.
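The sketch below is a generic illustration of the idea, not Spotfire's interface: three assay activities occupy the axes of a 3-D scatter plot, while marker color and size carry two additional chemical-space parameters. All values and descriptor choices are synthetic assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic library of 5,000 compounds screened against three assays, with two
# chemical-space descriptors (molecular weight, cLogP) as extra dimensions.
rng = np.random.default_rng(42)
n = 5_000
act_a, act_b, act_c = rng.uniform(0, 100, (3, n))  # % inhibition in assays A, B, C
mol_wt = rng.uniform(150, 600, n)                  # mapped to marker color
clogp = rng.uniform(-1, 6, n)                      # mapped to marker size

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(act_a, act_b, act_c, c=mol_wt, s=5 + 4 * clogp,
                cmap="viridis", alpha=0.4)
ax.set_xlabel("% inhibition, assay A")
ax.set_ylabel("% inhibition, assay B")
ax.set_zlabel("% inhibition, assay C")
fig.colorbar(sc, label="molecular weight")
plt.show()
```

Compounds clustering in a high-activity corner for one or more assays would be the "hot spots" to parse out for further chemical-space analysis.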
