Fig. 10.7 Microarray experimental components - information management (IM) systems for data acquisition and biological interpretation. (Figure labels: Data Analysis; normalization; filtering; clustering; Software; classification; pattern discovery; biological interpretation.)

cal and experimental components relating to chip design and construction, data acquisition, image analysis, and data analysis. It will also track clone set gene sequences, descriptors, and other genomic annotations, as well as associated toxicological/pathological endpoints, and will provide basic bioinformatics tools for data analysis and biological interpretation.

CEBS Phase I will be protocol-driven. All datasets within CEBS will be linked by reference to an experimental protocol number and metadata that will specify standard operating procedures, observations, and measurements to be recorded. CEBS Phase I will include complete sample annotation (e.g., sample name, organism, biosource provider, sample source, developmental stage, age and units, time points, organ/tissue, growth conditions, medium, culture temperature, genetic variation, individual name or ID, disease state, additional clinical information and units, target cell type, cell line, treatment application, treatment type, separation technique, sample extraction method, amplification method, label, etc.). All the data types (numbers, graphs, observations, images, etc.) will be related by the experimental protocol. The data to be stored and their location will be similarly identified in the process of defining the experimental protocol, as will reports to be generated and analyses to be performed. The purpose of this high degree of context documentation is to facilitate extensive query and biological interpretation. Domain-specific metadata will introduce experimental datasets in each analytical domain: transcriptomics, toxicology, pathology, etc. CEBS Phase I will incorporate raw microarray image files as well as fully processed outlier gene lists, together with appropriate visualization tools. Results will be displayed or juxtaposed in various views or graphical user interfaces that will provide insights, facilitate further analysis, and suggest new hypotheses to test.
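The protocol-driven linkage described above can be illustrated with a small sketch. It is not the CEBS schema: the class names, field names, protocol number, and file locations below are all hypothetical, chosen only to show how datasets from different analytical domains might be tied together by a shared protocol number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """An experimental protocol (hypothetical, minimal fields)."""
    number: str
    sops: tuple  # names of standard operating procedures


@dataclass(frozen=True)
class Dataset:
    """One stored dataset; every dataset must point at a protocol."""
    protocol_number: str
    domain: str      # e.g. "transcriptomics", "toxicology", "pathology"
    data_type: str   # e.g. "image", "numbers", "observations"
    location: str    # where the data are stored


class ProtocolRegistry:
    """In-memory stand-in for a protocol-keyed store (an assumption)."""

    def __init__(self):
        self._protocols = {}
        self._datasets = []

    def add_protocol(self, protocol):
        self._protocols[protocol.number] = protocol

    def add_dataset(self, dataset):
        # Refuse datasets that do not reference a known protocol.
        if dataset.protocol_number not in self._protocols:
            raise KeyError(f"unknown protocol: {dataset.protocol_number}")
        self._datasets.append(dataset)

    def datasets_for(self, protocol_number, domain=None):
        # Return all datasets for a protocol, optionally within one domain.
        return [
            d for d in self._datasets
            if d.protocol_number == protocol_number
            and (domain is None or d.domain == domain)
        ]


registry = ProtocolRegistry()
registry.add_protocol(Protocol("P-001", ("array hybridization SOP",)))
registry.add_dataset(Dataset("P-001", "transcriptomics", "image", "raw/scan1.tif"))
registry.add_dataset(Dataset("P-001", "pathology", "observations", "path/obs1.txt"))
```

Because every dataset carries the protocol number, a query by protocol returns the image data and the pathology observations together, which is the kind of cross-domain juxtaposition the text describes.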

CEBS also will access biological, chemical, and toxicological resources in public domain databases, as well as pathway information such as that available in the Kyoto Encyclopedia of Genes and Genomes (KEGG; Ogata et al. 1999) and in "What Is There?" (WIT; Selkov et al. 1998). Links will be built to other databases such as the European Bioinformatics Institute ArrayExpress database (ArrayExpress/arrayexpress.html), the National Library of Medicine's Gene Expression Omnibus (GEO) database (et al. 2002), and the NTP's new Oracle toxicology information bank.

To address the first of the bioinformatics and interpretive challenges mentioned above, basic gene annotation in CEBS Phase I will be largely automated, and annotation resources will be routinely consulted to provide a complete range of updated gene/protein information. The process of gene annotation is illustrated in Figure 10.8, and some major biological data and information resources for gene annotation are shown in Table 10.1. The links for these annotation resources were operational at the time of writing this chapter. However, please consult the NCT website at http:// for a current list of links.

Continuous refinement of gene annotation and sequence definition will improve the interoperability of cross-platform datasets (Zweiger 1999). Steps for keeping sequence data current can be as follows: (1) sequence all cDNA clone sets and refer to the known sequences of oligonucleotide sets, (2) reference GenBank accession numbers and UniGene ID numbers for genes and GenBank accession numbers and dbEST cluster ID numbers for ESTs, and (3) reference TIGR gene indices (http://www. ; Quackenbush et al. 2001) for EST or oligonucleotide consensus sequences and perform a MegaBLAST against trace archives for genomes of interest. Performing a MegaBLAST against trace archives means comparing nucleotide sequence data against the current raw data underlying first-pass sequences generated by various genome sequencing centres. This is particularly important for the rat genome, which is presently very incomplete. This effort to derive new information about incomplete genomes will substantially enhance the discovery value of ESTs on cDNA chips and will facilitate cross-species investigation of gene/protein functional analogies, as will be discussed.

Functional characterization presents a second bioinformatics and interpretive challenge. Functional characterization can involve the grouping of similar genes and

Fig. 10.8 Information (annotation) associated with a single gene (adapted from Gibas and Jambeck 2001). (Figure labels: Phylogenetic Relationship; Sequence; Gene Expression; Gene; Sequence Tags; Raw Images; Experimental Data.)

10.8 Phased Development of the CEBS Knowledge Base

Tab. 10.1 Some major biological data and information resources for gene annotation.

Biomedical literature: PubMed
Nucleic acid sequence (e.g., for the rat): GenBank
Genome sequence: GenBank

Further data types listed: Protein sequence; Protein structure; Protein mass spectra; Post-translational modifications; Biochemical pathways.
Further resources listed: GenBank; SwissProt; Protein DB; PIR.





gene products. There are a number of conventional means to accomplish this, including supervised and unsupervised classification/prediction, artificial intelligence, various genetic algorithms, as well as a number of annotation resources, as just discussed. We propose to use these methods and resources in concert with querying the scientific literature to develop knowledge of the function of genes and gene products.

Literature queries can facilitate gene annotation as well as biological interpretation of microarray expression results. The challenge is to deal not only with accepted microarray gene annotation names but also with legacy data in the earlier scientific literature, with the ultimate objective of making linkages of gene and protein annotations with literature based on sequence information. MEDLINE is the most widely accessible repository of biomedical literature, currently containing over 11 million abstracts and growing rapidly. Unfortunately, it is difficult to use the gene name found in a nucleotide sequence database record (or as presented in a list of outliers) to search the biomedical literature effectively.

The generation of names for genes and gene products based on sequence information is a significant challenge. Ultimately, genes and gene products must be linked by sequence data. Sequence-based synonym naming requires expertise in both data extraction and bioinformatics. Expertise in bioinformatics is required, since much of the searching will need to be done using BLAST (http:// ; Altschul et al. 1990). Genomic BLAST pages are available for human, mouse, rat, zebrafish, and other eukaryotic and microbial genomes at the NCBI's BLAST website.

Nucleotide sequence databases, e.g., GenBank and UniGene, do not contain a 'gene product' name field. Instead, the name is embedded in other information. For example, the GenBank nucleotide definition for 'Estrogen Receptor 1' (the HUGO recognized name for this receptor) is 'Homo sapiens estrogen receptor 1 (ESR1), mRNA'. Extraction of the appropriate search terms 'Estrogen Receptor 1' and 'ESR1' from the GenBank definition is a trivial task for a single gene, but one that becomes intractable when a large number of genes or protein products are being searched in the literature or when the process is being automated, as is being contemplated in the development of the CEBS knowledge base.
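As a rough illustration of the extraction step, the sketch below pulls a name and a symbol out of a definition line shaped like the example above. The pattern is an assumption made for this example: it expects a two-word organism name, then the gene name, then a symbol in parentheses, then a molecule type. Real definition lines vary widely, and lines that do not fit simply return None, which is one reason the task becomes hard at scale. The function name and the second test string are hypothetical.

```python
import re

# Assumed layout: "<Genus species> <gene name> (<SYMBOL>), <molecule>"
DEFINITION_RE = re.compile(
    r"^(?P<organism>[A-Z][a-z]+ [a-z]+) "   # e.g. "Homo sapiens"
    r"(?P<name>.+?) "                        # gene or product name
    r"\((?P<symbol>[^()]+)\), "              # symbol in parentheses
    r"(?P<molecule>\w+)\.?$"                 # e.g. "mRNA"
)


def extract_search_terms(definition):
    """Return (name, symbol) from a definition line, or None if the
    line does not follow the assumed layout."""
    match = DEFINITION_RE.match(definition.strip())
    if match is None:
        return None
    return match.group("name"), match.group("symbol")


terms = extract_search_terms("Homo sapiens estrogen receptor 1 (ESR1), mRNA")
unparsed = extract_search_terms("Rattus norvegicus clone without symbol mRNA sequence")
```

Here `terms` holds the name and symbol for the estrogen receptor example, while `unparsed` is None because the second line has no parenthesized symbol. In practice such misses would need manual review or additional patterns.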

To improve the interoperability between microarray gene annotation and the scientific literature, all genes in the clone lists are being provided with vetted name lists. By vetting, we mean that each gene name is searched in MEDLINE, and the way in which MEDLINE parses the name is examined to ensure that it is being searched in the desired manner. For example, searching MEDLINE via Entrez with the query phrase 'Estrogen Receptor 1' does not return any abstracts. Closer inspection of the search results indicates that this is because this phrase does not occur in the MEDLINE phrase index. The vast literature (more than 10 000 abstracts) concerning this receptor is only accessible with the legacy names 'Estrogen Receptor' and 'Estrogen Receptor alpha'.

Once name lists suitable for searching MEDLINE are available, we have two tools to help mine the literature data, OmniViz and PDQ_MED. OmniViz (Battelle Memorial Laboratory, Columbus, OH, USA) is a global literature search and visualization software package that can be of great help in obtaining an overview of relevant biomedical publications. The proximity-of-data query software, InPharmix's PDQ_MED (Sluka 2002), can facilitate rapid access to relevant abstracts in MEDLINE for multiple genes (e.g., from a list of outliers).

In CEBS Phase I, a database of gene identifiers, gene sequences, and synonym names suitable for searching the scientific literature will be available; such a database is currently in beta test at NIEHS for human, mouse, rat, and yeast chips printed at the NMG. An Internet interface to the database will be provided, allowing CEBS users to enter a chip name and a list of gene IDs or GenBank accession numbers. The output from the interface will be a list of names suitable for searching in MEDLINE or for use with literature-mining tools such as PDQ_MED or OmniViz. This is an important step toward improving the interoperability between microarray gene annotations and the scientific literature and, ultimately, toward building knowledge in CEBS. We have only begun to determine the optimal approaches to tackle the problems at hand that impede progress in functional annotation.
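The interface described above, which takes a chip name and a list of gene IDs and returns names suitable for literature searching, might look roughly like the following sketch. Everything here is hypothetical: the chip name, the accession-style identifiers, the in-memory table, and the function names are placeholders, not the NIEHS database or its interface. Only the two legacy names come from the estrogen receptor example in the text.

```python
# Hypothetical vetted-name table keyed by (chip name, gene identifier).
VETTED_NAMES = {
    ("example_chip", "ACC_0001"): ["Estrogen Receptor", "Estrogen Receptor alpha"],
}


def lookup_names(chip_name, gene_ids):
    """Map each gene ID on a chip to its vetted literature names.
    IDs with no vetted entry map to an empty list."""
    return {
        gene_id: list(VETTED_NAMES.get((chip_name, gene_id), []))
        for gene_id in gene_ids
    }


def build_query(names):
    """Join vetted names into one OR query of quoted phrases."""
    return " OR ".join(f'"{name}"' for name in names)


names_by_id = lookup_names("example_chip", ["ACC_0001", "ACC_9999"])
query = build_query(names_by_id["ACC_0001"])
```

Returning an empty list for unknown IDs, rather than raising an error, lets a whole outlier list be processed in one pass while still showing which genes lack vetted names.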
