Organization of the Database

To readily interpret the data in GEO, it helps to have a general understanding of the database structure and content. Researchers provide their data in four sections: Platform, Sample, Series (which receive persistent GPLxxx, GSMxxx, and GSExxx accession numbers, respectively), and raw data. A Platform defines the array template and contains sequence identity tracking information for each feature on the array. A Sample record contains the measured hybridization data, along with a description of the biological source and treatment protocols. A Series record ties together experimentally related Samples. Accompanying raw data (e.g., Affymetrix original probe data or cDNA array .tif images) may optionally be supplied.
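As a rough illustration of how these three record types relate, the following Python sketch models them as simple data classes and maps an accession prefix to its record type. The class layouts, field names, and the `record_type` helper are assumptions made for illustration only; they are not GEO's actual schema or API.

```python
import re
from dataclasses import dataclass, field

# Accession prefixes and record types as named in the text.
ACCESSION_PREFIXES = {"GPL": "Platform", "GSM": "Sample", "GSE": "Series"}


def record_type(accession: str) -> str:
    """Return the record type implied by an accession's three-letter prefix."""
    match = re.fullmatch(r"(GPL|GSM|GSE)(\d+)", accession)
    if match is None:
        raise ValueError(f"not a Platform, Sample, or Series accession: {accession!r}")
    return ACCESSION_PREFIXES[match.group(1)]


@dataclass
class Platform:
    """Array template: one sequence identity per feature (hypothetical layout)."""
    accession: str
    features: dict[str, str]  # feature ID -> sequence identity


@dataclass
class Sample:
    """One hybridization: measurements plus a description of the source."""
    accession: str
    platform: str             # accession of the Platform the Sample was run on
    source: str               # description of the biological source
    values: dict[str, float]  # feature ID -> measured hybridization value


@dataclass
class Series:
    """Ties together experimentally related Samples."""
    accession: str
    samples: list[str] = field(default_factory=list)  # Sample accessions
```

With these definitions, `record_type("GSM12")` yields `"Sample"`, and a Series simply holds the accessions of the Samples it groups. The accession numbers used here are invented.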

The hardware and software packages that generate and process microarray data produce a wide assortment of data styles and formats—Platform and Sample tables can take on many different structures and contain multiple and varying types of ancillary and supporting information. Furthermore, microarray-based technologies and processing strategies are continually evolving. The GEO database was designed with these considerations in mind and has a flexible and open architecture that can accommodate variety. However, data provided in different styles and formats are not readily interpretable or analyzable, even by the experienced user. To address this issue, an upper level of organization is applied. Despite the diversity, a common core of relevant data is supplied to GEO:

• Sequence identity tracking information for each feature on the array.

• Normalized hybridization measurements.

• A description of the biological source used in each hybridization.

These data are extracted from the submitter-supplied records and reassembled by GEO staff into an upper-level unit called a GEO DataSet (assigned a persistent GDSxxx accession number). A DataSet represents a collection of similarly processed, experimentally related hybridizations. Samples within a DataSet are further organized into experimental variable subgroups; for example, they may be categorized by age or disease state. A DataSet can be rendered to generate two separate representations of the data:

1. An experiment-centered perspective that encapsulates the whole study. This information is presented as a DataSet record. DataSet records comprise a synopsis of the experiment, a breakdown of the experimental variables, access to several data display and analysis tools, and download options (Fig. 1).

2. A gene-centered perspective that provides quantitative gene expression measurements for individual genes across the DataSet. This information is presented as a GEO Profile. GEO Profiles comprise gene identity annotation, the DataSet title, and a chart depicting value and rank measurements for that gene across all Samples in that DataSet (Fig. 2).

DataSets represent a standardized format for the data in GEO. All the data analysis and mining tools described in this chapter are based on DataSets.
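The two perspectives can be pictured with a small Python sketch over an invented toy DataSet. Everything here is hypothetical: the sample accessions, gene names, values, and the helpers `group_by_subgroup` and `gene_profile`. In particular, the sketch assumes that "rank" means a gene's position among all genes measured in the same Sample (1 = highest value); GEO's own rank calculation may differ.

```python
from collections import defaultdict

# Hypothetical toy DataSet: Sample accession -> (subgroup label, {gene: value}).
samples = {
    "GSM1": ("control", {"geneA": 5.0, "geneB": 2.0, "geneC": 9.0}),
    "GSM2": ("control", {"geneA": 6.0, "geneB": 1.0, "geneC": 3.0}),
    "GSM3": ("disease", {"geneA": 1.0, "geneB": 4.0, "geneC": 8.0}),
}


def group_by_subgroup(samples):
    """Experiment-centered view: Sample accessions per experimental subgroup."""
    groups = defaultdict(list)
    for accession, (subgroup, _values) in samples.items():
        groups[subgroup].append(accession)
    return dict(groups)


def gene_profile(samples, gene):
    """Gene-centered view: (Sample, value, rank) for one gene across all Samples.

    Rank is computed within each Sample (assumed definition, 1 = highest value).
    """
    profile = []
    for accession, (_subgroup, values) in samples.items():
        ordered = sorted(values, key=values.get, reverse=True)
        profile.append((accession, values[gene], ordered.index(gene) + 1))
    return profile


print(group_by_subgroup(samples))
print(gene_profile(samples, "geneA"))
```

The first call groups Samples by subgroup, much as a DataSet record breaks down its experimental variables; the second returns the per-Sample value and rank pairs that a profile chart would plot for a single gene.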
