Searches Using NCBI's Entrez Search Engine

Most biological scientists are familiar with Entrez, using it routinely to search NCBI databases like PubMed and GenBank (3,4). It has a straightforward interface by which users can simply type in key words and terms to locate relevant material, as well as capabilities to perform complex searches across multiple databases. The Entrez system currently comprises 25 interlinked databases, two of which store GEO data:

1. Entrez GEO DataSets contains DataSet definitions. Researchers can search for DataSets using text key word terms. Retrievals display the DataSet title, a synoptic description, the organism, and the experimental variables, as well as links to the complete GDS record (Fig. 1), parent Platform, and reference Series records. A user can quickly scan through the retrievals and identify DataSets that look relevant to the area of study. Entrez GEO DataSets is searchable using the "DataSets" query box on the GEO home page, or at fcgi?db=gds.

2. Entrez GEO Profiles contains individual, normalized, DataSet-specific gene expression profiles. Researchers can query for specific genes by name, symbol, accession number, and clone identifier, or genes of interest based on characteristics of the expression profiles. Retrievals display the mapped gene name (determined by the sequence reference provided on the array), the DataSet title, and a thumbnail chart of the gene expression profile generated from the normalized values for an individual gene reporter in each Sample of the DataSet. Clicking on the thumbnail image will enlarge the chart to display the full profile details and the Sample subset partitions that reflect experimental design (Fig. 2). Entrez GEO Profiles is searchable using the "Gene profiles" query box on the GEO home page, or at http://

It is usually sufficient to simply enter text key words to retrieve data of interest. For example, searching GEO Profiles with the term "Klk13" will retrieve all profiles that match that gene name. However, information is indexed into several categories (Table 1), and these indices can be used to refine and restrict Entrez queries. To perform such a search, specify the search terms, their fields, and the Boolean operations to perform on the terms using the following syntax:

term [field] OPERATOR term [field]

where term(s) are the search terms, the field(s) are the search fields and qualifiers, and the OPERATOR(s) are the Boolean operators (uppercase AND, OR, NOT). The Preview/Limits link on the Entrez tool bar assists greatly in construction of complex queries. Alternatively, complex search statements can be written and executed directly in the search boxes. The indices (available on the Preview/Limits page) may be used to browse and/or select the terms by which the data are described.
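The syntax above can also be assembled programmatically. The following Python fragment is a minimal sketch, not part of Entrez itself; the helper names qualify and combine are hypothetical and were chosen only for this illustration. The field names are taken from Table 1.

```python
# Hypothetical helpers for illustration; they only build query strings.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def qualify(term, field=None):
    """Return the bare term, or term[field] when a qualifier field is given."""
    term = term.strip()
    return f"{term}[{field}]" if field else term


def combine(operator, *clauses):
    """Join two or more clauses with an uppercase Boolean operator."""
    op = operator.upper()
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    if len(clauses) < 2:
        raise ValueError("at least two clauses are required")
    return f" {op} ".join(clauses)


query = combine(
    "AND",
    qualify("Klk13", "Gene Description"),
    qualify("mouse", "Organism"),
)
print(query)  # Klk13[Gene Description] AND mouse[Organism]
```

The resulting string can be pasted directly into the GEO Profiles search box.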

3.1.1. Identifying Experiments of Interest

In many cases, a researcher will begin by looking for DataSets that are pertinent to the area of study. For example, to locate experiments that investigate spermatogenesis or testis development in mice using Affymetrix GeneChip technology, search Entrez GEO DataSets with "(spermatogenesis OR testis development) AND Affymetrix AND mouse."
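The same query can in principle be submitted without the web form. The sketch below only constructs a request URL and sends nothing. It assumes the NCBI E-utilities esearch endpoint and its db, term, and retmax parameters, none of which are described in this chapter; the function name esearch_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities esearch endpoint (not covered in this chapter).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, query, retmax=20):
    """Build (but do not send) an esearch request URL for an Entrez database."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# "gds" is the Entrez database name for GEO DataSets.
query = "(spermatogenesis OR testis development) AND Affymetrix AND mouse"
url = esearch_url("gds", query)
print(url)
```

urlencode takes care of escaping the spaces and parentheses, so the query can be written exactly as it would be typed into the search box.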

Table 1
Entrez Qualifier Fields (a)

Field name                    Field description

GEO DataSets
Author                        Authors associated with the experiment
Experiment Type               The experiment type, e.g., cDNA, genomic, protein, SAGE
GDS Text                      DataSet description text
GEO Accession                 The GEO accession number (GPLxxx, GSMxxx, GSExxx, GDSxxx)
GEO Description/Title Text    Text provided in the description/title of original records
Number of Samples             The number of Samples in the DataSet*
Number of Platform Probes     The number of Platform reporters in the DataSet*
Organism                      The organism from which the reporters on the array were derived/designed
Reporter Identifier           The identifier for the array reporter (GenBank accession, gene name, and so on)
Sample Source                 The source biological material of the Sample
Sample Title                  Sample Title
Submitter Institute           Submitter institute
Subset Description            The description of the experimental variable
Subset Variable Type          The type of experimental variable, e.g., age, strain, gender

GEO Profiles
Experiment Type               The experiment type, e.g., cDNA, genomic, protein, SAGE
Flag Information              Specific experimental variable flags, e.g., age, strain, gender
Flag Type                     Flag types, e.g., rank and value subset effects
GDS Text                      DataSet description text
GEO Accession                 The GEO accession number (GPLxxx, GSMxxx, GSExxx, GDSxxx)
GEO Description/Title Text    Text provided in the description/title of original records
GI                            Mapped GenBank identifier
Gene Description              Gene description, symbol, alias
ID_REF                        The unique identifier for a reporter as given on the array
Max Value Rank                The maximum value rank*
Max value in profile          The maximum value in profile*
Median value in GDS           The median value in DataSet*
Median value in profile       The median value in profile*
Min Value Rank                The minimum value rank*
Min value in profile          The minimum value in profile*
Number of Samples             The number of Samples in the DataSet*
Organism                      The organism from which the Samples were derived
Ranked Standard Deviation     The ranked standard deviation
Reporter Identifier           The identifier for a reporter
Sample Source                 The source biological material of the Sample

(a) Useful qualifier fields for performing restricted GEO DataSets and GEO Profiles queries. (*) indicates possible range operation, e.g., 20:50[Number of Samples] will find DataSets containing 20 to 50 Samples.

3.1.2. Identifying Gene Expression Profiles of Interest

Once the researcher has located relevant DataSet(s), he or she can use the DataSet accession number (GDSxxx) to restrict searches to that experiment. For example, to view the profiles of all heat shock genes in DataSet GDS181, he or she could query with "GDS181 AND heat shock."

If a DataSet accession is not specified, then the search will be performed across all GEO data. For example, to view profiles of kallikrein family genes in any DataSet that investigates progesterone, enter "kallikrein AND progesterone[GDS Text]."

Often the gene name will contain words that can be found in the DataSet description, and vice versa. In these cases, the only way to specifically retrieve data would be to restrict the search using the appropriate qualifier. For example, a researcher trying to locate profiles for the glycine receptor gene would have to restrict the search to "glycine receptor[Gene Description]"; otherwise, he or she would also retrieve all profiles from GDS967, a DataSet investigating a "Glycine receptor beta subunit mutant model for hyperekplexia."
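The effect of a qualifier can be illustrated with a toy model. The sketch below is not how Entrez is implemented: it uses simple substring matching over two records, and the gene names and the first DataSet description are invented for illustration. Only the GDS967 title comes from the example above.

```python
# Toy records. Only the second "GDS Text" value is taken from the chapter;
# everything else is made up for this illustration.
PROFILES = [
    {
        "id": "p1",
        "Gene Description": "glycine receptor, alpha 1",
        "GDS Text": "hypothetical heat shock time course",
    },
    {
        "id": "p2",
        "Gene Description": "kallikrein 13",
        "GDS Text": "Glycine receptor beta subunit mutant model for hyperekplexia",
    },
]

ALL_FIELDS = ("Gene Description", "GDS Text")


def search(phrase, field=None):
    """Return ids of records whose searched field(s) contain the phrase."""
    needle = phrase.lower()
    fields = (field,) if field else ALL_FIELDS
    return [
        record["id"]
        for record in PROFILES
        if any(needle in record[name].lower() for name in fields)
    ]


print(search("glycine receptor"))                            # ['p1', 'p2']
print(search("glycine receptor", field="Gene Description"))  # ['p1']
```

Without the qualifier, the second record is retrieved only because its DataSet description mentions the gene; restricting the search to the Gene Description field removes it.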

Several fields are available for refining a search to help identify interesting or significant profiles based on the characteristics of the expression pattern. For example, the expression measurements of each Sample in a DataSet are rank ordered, so it is possible to refine searches to identify genes that fall within a specified abundance bracket. To view profiles that fall into the top 5% abundance rank bracket in at least one Sample in DataSet GDS182, search with "GDS182 AND 96:100[Max Value Rank]." Alternatively, to search for profiles in which the median value level across the DataSet is high (approximately 12 to 14 for this DataSet), query with "GDS182 AND 12:14[Median value in profile]."
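Range clauses follow the low:high[field] pattern and are easy to generate. The following sketch uses hypothetical helper names; the mapping from "top N%" to a rank bracket is inferred from the single 96:100 example above and assumes ranks run from 1 to 100.

```python
def range_clause(low, high, field):
    """Format an inclusive range restriction such as 96:100[Max Value Rank]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{low}:{high}[{field}]"


def top_rank_bracket(percent):
    """Rank bounds for the top `percent` bracket, assuming ranks of 1 to 100."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return 101 - percent, 100


low, high = top_rank_bracket(5)
query = f"GDS182 AND {range_clause(low, high, 'Max Value Rank')}"
print(query)  # GDS182 AND 96:100[Max Value Rank]
```

Only the fields marked with an asterisk in Table 1 are described as accepting range operations.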

Typically, researchers seek genes that vary their expression depending on experimental factors. As described in Subheading 2., GEO DataSets are partitioned into subsets that reflect experimental design. Profiles are flagged if they display a significant effect in relation to subsets, that is, if the expression values or ranks pass a threshold of statistical difference between any non-single experimental variable subset and another. These flags assist in the identification of candidate genes as follows:

• Users can restrict their searches to find any profile that exhibits an effect within specific DataSets. For example, to view profiles showing interesting value subset effects in either DataSet GDS186 or GDS187, query with "(GDS186 OR GDS187) AND "value subset effect" [Flag Type]." A convenient way to run this search is from the DataSet record page (see subset effect box in Fig. 1).

• Users can search across the whole GEO for genes that show an effect with respect to a particular experimental variable type. For example, to search for any gene that shows an effect with respect to gender, query with "gender [Flag Information]."

• Standard GEO Profile retrievals are default ordered according to subset effect flags, bringing potentially significant and interesting profiles to the fore.
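The flag-based queries in the list above combine a parenthesized group of accessions with a quoted phrase. A minimal sketch of how such strings might be generated follows; the function names are hypothetical. The examples in this chapter appear both with and without a space before the bracketed field, and the sketch omits it.

```python
def any_of(accessions):
    """Group one or more DataSet accessions with OR, adding parentheses if needed."""
    if not accessions:
        raise ValueError("at least one accession is required")
    if len(accessions) == 1:
        return accessions[0]
    return "(" + " OR ".join(accessions) + ")"


def flag_type_query(accessions, flag_type):
    """Restrict to profiles in the given DataSets that carry a flag type."""
    return f'{any_of(accessions)} AND "{flag_type}"[Flag Type]'


def flag_information_query(variable_type):
    """Search all of GEO for profiles flagged for one experimental variable type."""
    return f"{variable_type}[Flag Information]"


print(flag_type_query(["GDS186", "GDS187"], "value subset effect"))
# (GDS186 OR GDS187) AND "value subset effect"[Flag Type]
print(flag_information_query("gender"))
# gender[Flag Information]
```

The parentheses matter: without them, the flag restriction could end up applied to only one of the OR-ed accessions rather than to the group as a whole.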

The Entrez search system is a powerful tool that interlinks many diverse data domains. Regular users of NCBI resources are well advised to familiarize themselves with the advanced mining features available through Entrez (see Note 1).
