The major functions and tools of SMD are described below.

3.1. Finding Data

The most common approach to dealing with microarray data is to analyze a set of hybridizations that together form an experiment, the individual arrays representing various time points, conditions, biological or surgical samples, or other combinations of experimental factors. The Publication List and the Basic and Advanced Search tools discussed in this section facilitate identification and selection of microarrays for analysis. After the initial selection of arrays, reporters (e.g., clones, oligos, or genes) can be winnowed down based on their measurement quality, expression pattern, or other criteria as described in Subheading 3.2.

Fig. 1. The SMD home page. Important navigational elements are indicated.

Alternatively, investigators may start with an individual gene and identify experiments in which that gene was affected. The "Name Search" and "Expression History" tools support this approach.

3.1.1. Access Control

SMD is a research database, containing both published and unpublished data. By default, data are only visible to the owner, members of the owner's group (usually a laboratory), and the database curators. Access may be assigned to other users and groups, or to the public, at the owner's discretion. Therefore:

• All users, including unregistered or "public" users, may view and retrieve data that have been made public upon publication or at the owner's discretion. It is SMD's policy that all data supporting a publication must be made public.

• Registered users may view and retrieve their own data, data belonging to members of their group, and any other data to which they have been explicitly granted access.

• Registered users may edit and delete only their own, nonpublic data. Once data have been made public, the owner may no longer delete them, although editing is still possible.

• Curators may view, edit, and delete all data.

This system permits very flexible collaborations. Unpublished data may be kept private or shared with as many collaborators as required. At the appropriate time, it is easy to make data public, and thousands of microarrays have been published and/or released to the public through SMD.
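The access rules listed above can be summarized as a small permission check. The following is only an illustrative sketch: the class names, field names, and functions are hypothetical and do not describe how SMD is implemented internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class User:
    name: str
    group: Optional[str] = None   # usually a laboratory
    is_curator: bool = False


@dataclass
class Dataset:
    owner: str
    owner_group: str
    is_public: bool = False
    shared_users: set = field(default_factory=set)    # explicit grants to users
    shared_groups: set = field(default_factory=set)   # explicit grants to groups


def can_view(user, data):
    """Return True if `user` may view `data`; pass None for a public user."""
    if data.is_public:
        return True
    if user is None:
        return False
    if user.is_curator or user.name == data.owner:
        return True
    if user.group is not None and user.group == data.owner_group:
        return True
    return user.name in data.shared_users or user.group in data.shared_groups


def can_edit(user, data):
    """Owners and curators may edit, whether or not the data are public."""
    return user is not None and (user.is_curator or user.name == data.owner)


def can_delete(user, data):
    """Owners may delete only nonpublic data; curators may delete all data."""
    if user is None:
        return False
    if user.is_curator:
        return True
    return user.name == data.owner and not data.is_public


private = Dataset(owner="alice", owner_group="lab1")
print(can_view(User("bob", "lab1"), private))   # same group as the owner
print(can_view(None, private))                  # public user, private data
```

Modeling the public user as `None` keeps the three cases (public, registered, curator) in one function each, which mirrors the way the rules are stated in the list.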

Some functions of SMD require users to log in before use. Registered users may provide their username and password to gain access to unpublished data. Public users may instead activate a "world session" directly from SMD's home page, gaining access to all published data and most of the analysis tools provided by SMD.

3.1.2. The Publication List

Navigation (no login required):

Home page: "Publications" button.

List menu: "Publications" option.

The publication list organizes data that support manuscripts or other published materials. This is usually the simplest way for "public" users to find data of interest. Citations are listed along with links to the data in SMD, a PubMed link if appropriate, the full text of the article if it is available online, and any supplemental web site for the publication. The list may be sorted and searched by organism studied, date of publication, citation, and authors.

Clicking on the "SMD" icon leads to a page with the title, citation, and abstract of the article, along with the links listed above. In addition, there are links to the "raw" data files for the experiment(s) (see Subheading 3.2.2.) and to machine-readable metadata describing the microarrays included. There are also options to display (see Subheading 3.2.1.) or retrieve (see Subheading 3.2.4.) and cluster (see Subheading 3.2.5.) the data, organized by experiment sets (see Subheading 3.4.3.).

3.1.3. Basic Search

Navigation (login required):

Search menu: "Basic Search" option.

The "Basic Search" tool permits users to browse organized data sets. Users may select an organism and then browse publications (see Subheading 3.1.2.), experiment sets (see Subheading 3.4.3.), or all arrays annotated to a given category (see Subheading 3.4.1.). Options are provided to display (see Subheading 3.2.1.) or retrieve (see Subheading 3.2.4.) and cluster (see Subheading 3.2.5.) the selected data.

Basic Search is frequently the best way to find public, as yet unpublished data, although not all data that a user is allowed to see can be found through this interface. The arrays that are shown in the Basic Search tool are only those that have been organized by the experimenters or curators into experiment sets (and to which the user has access), so hybridizations of interest will not appear if they have not been so organized. In that case, the "Advanced Search" tool is required (see Subheading 3.1.4.). (In the case of browsing by "category," all accessible arrays are presented by Basic Search, but in a manner determined by the annotation provided by the experimenter.)

Basic Search is most powerful in allowing access to experiment sets (see Subheading 3.4.3.), which are the primary means of collaborative communication and data organization in SMD. By using Basic Search rather than the "Advanced" option, collaborators can easily view data as organized and annotated by their coworkers, eliminating confusion and redundancy.

3.1.4. Advanced Search

Navigation (login required):

Search menu: "Advanced Search" option.

The "Advanced Search" tool provides several ways to identify, browse, and select data for analysis. All data to which the user has access may be found with this utility. Hybridizations may be identified by their owner (listed by username, which may be looked up in the User List), print run or array design, or key words (i.e., Category and Subcategory). Registered users may additionally use personal "array lists" for sets of hybridizations that they routinely revisit (see Subheading 3.4.2.). Data may be displayed (see Subheading 3.2.1.) or retrieved (see Subheading 3.2.4.) and then clustered (see Subheading 3.2.5.). Advanced Search is also the jumping-off point for creation of experiment sets (see Subheading 3.4.3.) and array lists.

This is the most powerful search tool and usually the most suitable for experimenters working with their own current data or assembling data from various sources. However, it can be difficult to identify data of interest other than your own, since that depends on the quality of annotations assigned by the data owners (frequently for their own use). Basic Search or the publication list is much more convenient when the data of interest have already been organized and annotated.

3.1.5. Reporter/Gene-Centric Search

Navigation (login required):

Search menu: "Name Search" option.

The "Name Search" tool finds reporters (clones, oligos, or other molecules placed or synthesized on a microarray), rather than experiments or hybridizations. Only those reporters found on microarrays in the database may be found in this way; this is not a general replacement for NCBI's Entrez or Stanford's SOURCE (4). Reporters may be identified by organism and gene name, description, identifier, and so on (wildcards are allowed). All identifiers, annotations, and so forth will be returned for all matches to the search term.

If data for a so-found reporter are available, a link to the reporter's "expression history" will be presented. This will lead to a histogram of all data for the reporter (from arrays to which the user has access). The graph is interactive: clicking on a bar in the histogram will produce a list of the hybridizations in which the reporter had that value, with options to display (see Subheading 3.2.1.) or retrieve (see Subheading 3.2.4.) and cluster (see Subheading 3.2.5.) all data from those hybridizations. This can serve to identify experiments in which a particular gene was affected. Note, however, that microarray data may not be easily comparable across different experimental conditions, technical protocols, and reference RNA mixtures, so the histogram may be somewhat misleading.
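The idea behind the expression history can be sketched as a simple binning step: each measurement for a reporter is assigned to a histogram bar, and each bar remembers which hybridizations contributed to it. The function name, the bin width, and the sample values below are assumptions made for illustration; SMD's own binning is not described here.

```python
import math
from collections import defaultdict


def expression_history(values, bin_width=0.5):
    """Group (array_name, value) pairs into fixed-width histogram bins.

    Returns a dict mapping the lower edge of each bin to the list of
    arrays whose value falls in that bin, ordered by bin.
    """
    bins = defaultdict(list)
    for array_name, value in values:
        lower_edge = math.floor(value / bin_width) * bin_width
        bins[lower_edge].append(array_name)
    return dict(sorted(bins.items()))


# Hypothetical log ratios for one reporter across four hybridizations.
measurements = [("array_1", 0.2), ("array_2", 0.4),
                ("array_3", -1.3), ("array_4", 2.0)]
history = expression_history(measurements)
for lower_edge, arrays in history.items():
    print(lower_edge, arrays)
```

Looking up one key of the returned dict corresponds to clicking on one bar of the histogram. The caveat in the text still applies: values from different protocols or reference mixtures end up in the same bins even when they are not directly comparable.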

3.2. Analysis

SMD provides several methods for unsupervised analysis, primarily hierarchical clustering (see Subheading 3.2.5.), and supervised analysis methods such as Gene Ontology (GO) enrichment analysis, using GO::TermFinder (5). However, SMD is primarily a platform for data storage, quality assessment, and collaboration. There are many excellent software packages for data analysis, such as the R statistical programming language, MatLab, and specialized tools such as Significance Analysis of Microarrays (SAM), to name a few. SMD currently supports analysis using these packages by allowing users to easily download the entire raw data for an array or to filter and download selected data from multiple arrays, in a convenient text-based format. Importing into other software packages is usually a matter of making simple changes to the data format, using a spreadsheet program such as Excel. Several analysis tools, such as the BioConductor packages for R, have facilities for reading the data files available from SMD with no editing required.

3.2.1. Display Data

Navigation (login required unless entering via the publication list): Any search tool: "Display Data" button.

The first task in analysis, of course, is to identify the data to be analyzed. The search tools described in Subheading 3.1. are the entry to this process. For detailed information on the hybridizations found by the search tools, click the "Display Data" button. This appears next to each experiment set contained in a publication (see Subheading 3.1.2.) on the Basic (see Subheading 3.1.3.) and Advanced (see Subheading 3.1.4.) Search pages and with the list of arrays selected through the "expression history" tool (see Subheading 3.1.5.).

The display page presents each array selected or found by the search, along with a number of options for examining the data (Fig. 2). Most of these options are discussed below, in Subheading 3.3. (Quality Control and Other Tools). Most relevant here is the "view details" option, invoked by clicking on the "View" icon for an array of interest. This page presents all descriptive information provided by the experimenter, including channel and general descriptions, and any procedural information entered, as well as links to various quality control utilities. This information indicates the role of the hybridization in an experiment (e.g., the time point or the tissue type or disease state of the sample, and so on), and is thus critical for proper data analysis and understanding of the experiment.

The display page provides additional information if entered from Basic Search or the publication list. In this case, an entry for the experiment set appears at the top of the page. The "view" icon for this entry leads to summary information for the experiment as a whole, including the experimenter's description of the overall experimental design and a listing of experimental factors and their values for each array (e.g., incubation time, age, disease state, or whatever factors were deemed critical by the data owner). If provided by the experimenter, this summary serves as a guide to supervised analysis with other data analysis packages and to understanding the experiment.

3.2.2. Raw Data Files

Navigation (login required unless entering via the publication list):

Any search tool: "Display Data" button: "Raw Data" icon (single array).

Publication list: SMD "book" icon: "Raw Data" icon (all arrays in set).

Lists menu: "All Programs" option: "Get Public Data by Organism" link (public users only).

All measured data, array layout, manufacture information, and reporter annotations for a single array are combined in the "raw data" file for each hybridization. These files can be easily examined and edited in a spreadsheet program and contain all available information to support analysis in other software.

Files in SMD's format are provided for two-color arrays and for summary (at the gene or "probe set" level) data for Affymetrix-style single-channel arrays. Affymetrix probe-level data are provided in the original .cel files, for better compatibility with the many analysis packages that use the .cel format. When requested by registered users, these files (other than .cel files) are dynamically generated, in order to present the most current annotations for the reporters (cDNA clones, and so on) on the array. Public users will receive a static file from SMD's ftp site, which is refreshed periodically with current annotations.

Fig. 2. The "Display Data" view of search results. This page presents many options for examining data, many of which are described in Subheading 3.3. The icons are for various tools, as shown: select and filter data; download raw data; view description of hybridization; generate GFF file; edit description of hybridization; delete data from database; view gridded array image; view clickable array image; plot array data; view gridded Affymetrix data; view clickable Affymetrix array image.

Navigation (login required unless entering via the publication list): Any search tool: "Display Data" button: "Data" icon.

This tool returns a subset of the data in a raw data file (see Subheading 3.2.2.). The user may select any or all of the measured data fields, array layouts, and manufacture information (e.g., block, row, and column, or polymerase chain reaction [PCR] quality code), and reporter annotations (e.g., gene symbol or UniGene cluster ID). The data may be filtered for spot quality according to metrics chosen by the user (several dozen are available) and sorted by any data field. For example, to list the well-measured spots with the highest ratio of red to green signal, the user could filter for spots with a measured foreground signal more than twice the locally measured background and sort by red to green ratio in descending order. This tool may be used interactively, or the user may specify the fields and filters and download a file of all results.
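The same filter-and-sort example can be reproduced offline on a downloaded file. The sketch below is a minimal illustration under several assumptions: the column names (`red_fg`, `red_bg`, `green_fg`, `green_bg`) and the sample rows are hypothetical rather than the field names used in SMD's raw data files, the foreground test is applied to both channels, and the ratio is computed from background-subtracted signals.

```python
import csv
import io

# Hypothetical tab-delimited excerpt; real raw data files use different
# column names and contain many more fields.
SAMPLE = (
    "spot\tgene\tred_fg\tred_bg\tgreen_fg\tgreen_bg\n"
    "1\tGENE_A\t1000\t100\t300\t100\n"
    "2\tGENE_B\t150\t100\t900\t100\n"
    "3\tGENE_C\t5000\t200\t500\t200\n"
)


def well_measured_by_ratio(rows, min_fold=2.0):
    """Keep spots whose foreground exceeds min_fold x local background in
    both channels, then sort by red/green ratio, highest first."""
    kept = []
    for row in rows:
        red_fg, red_bg = float(row["red_fg"]), float(row["red_bg"])
        green_fg, green_bg = float(row["green_fg"]), float(row["green_bg"])
        if red_fg > min_fold * red_bg and green_fg > min_fold * green_bg:
            # Assumption: ratio of background-subtracted signals.
            ratio = (red_fg - red_bg) / (green_fg - green_bg)
            kept.append({**row, "rg_ratio": ratio})
    return sorted(kept, key=lambda r: r["rg_ratio"], reverse=True)


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
ranked = well_measured_by_ratio(rows)
for row in ranked:
    print(row["gene"], round(row["rg_ratio"], 2))
```

With the sample rows, the second spot is dropped because its red foreground is less than twice its background, and the remaining spots are listed from the highest ratio down.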

3.2.4. Data Retrieval from Multiple Hybridizations

Navigation (login required unless entering via the publication list):

Any search tool: "Data Retrieval and Analysis" button.

Most forms of analysis require data from multiple arrays. SMD provides a "preclustering" (pcl) file of data selected and filtered as specified by the user (see for details of the pcl format). These files can be used for hierarchical clustering, either within SMD (see Subheading 3.2.5.) or using a variety of external tools. They are also suitable for use with other analysis tools, generally with only simple modifications in a spreadsheet program.

To generate a pcl file, a user selects arrays using any of the search tools (see Subheading 3.1.) and then clicks on the "Data Retrieval and Analysis" button. This leads to an interactive list of the search results, which may be refined further by selecting specific arrays. Clicking again on the "Data Retrieval and Analysis" button proceeds with retrieval, whereas clicking on the "Display Data" button allows examination of the selected arrays (see Subheading 3.2.1.). Registered users will also see options to create experiment sets or array lists (see Subheading 3.4.).

Proceeding with data retrieval, the user is presented with a series of pages for setting retrieval parameters. Detailed help documents, describing the various options, are available within SMD. Briefly, the major choices to be made are: which arrays to include (determined at the previous step); which reporters to include (all, or a restricted list); which field to retrieve (generally log ratio for two-color data, but any measured value may be selected); what spot quality filters to impose; what if any filters to impose on the expression pattern of each reporter (e.g., rank or variance filters); and whether to transform the data (by centering, and/or by log transformation or simple variance stabilization for single-channel data).
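To make these choices concrete, the sketch below applies three of them in order to a small table of red/green ratios: log transformation, a filter on the fraction of usable measurements plus a variance filter, and centering of each reporter's values. The function names, thresholds, and data are assumptions chosen for illustration, and a value of `None` stands for a spot that failed the quality filters. This is not SMD's retrieval code.

```python
import math
from statistics import mean, pvariance


def log2_ratios(ratios):
    """Log-transform ratios; missing or non-positive values become None."""
    return [math.log2(r) if r is not None and r > 0 else None for r in ratios]


def center(values):
    """Subtract the mean of the present values from each present value."""
    present = [v for v in values if v is not None]
    m = mean(present)
    return [None if v is None else v - m for v in values]


def retrieve(ratio_table, min_present=0.8, min_variance=0.0):
    """Return centered log2 ratios for reporters passing both filters."""
    result = {}
    for reporter, ratios in ratio_table.items():
        logged = log2_ratios(ratios)
        present = [v for v in logged if v is not None]
        # Require enough good measurements across the selected arrays.
        if not present or len(present) < min_present * len(logged):
            continue
        # Variance filter: drop reporters that barely change.
        if pvariance(present) < min_variance:
            continue
        result[reporter] = center(logged)
    return result


table = {
    "reporter_1": [2.0, 0.5, 1.0, 4.0],    # varies across arrays
    "reporter_2": [1.0, 1.0, 1.0, 1.0],    # flat profile
    "reporter_3": [2.0, None, None, None], # mostly missing
}
print(retrieve(table, min_present=0.75, min_variance=0.5))
```

Only the first reporter survives in this example: the second has no variance and the third has too few usable measurements.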

The pcl file may be downloaded upon retrieval from the database, or after filtering for expression profile and/or transformation. A summary of the retrieval and filtering options is also provided for download, making it a simple matter to record the procedure. Alternatively, the user may stay within the SMD system and use the hierarchical clustering facilities on the data (see Subheading 3.2.5.).

3.2.5. Hierarchical Clustering

Navigation (login required unless entering via the publication list):

Any search tool: "Data Retrieval and Analysis" button: data retrieval: cluster options page.

Hierarchical clustering is a primary tool for discovery and data exploration in microarray research (and many other fields). Clustering is an iterative process of grouping items, in this case expression patterns or vectors, according to their similarity. There are many examples of cluster analysis in the microarray literature; see in particular ref. 6 for an early demonstration of the utility of clustering of microarray data. The technique itself is well described as a means for finding patterns and subgroups within data; see ref. 7 for detailed descriptions of clustering methods.

In SMD, the clustering tool is available as the final step in the data retrieval process described in Subheading 3.2.4. Both reporters (genes) and arrays (conditions) may be clustered by centroid linkage, using either a Euclidean distance metric, or a centered or uncentered Pearson correlation metric. SMD also supports data partitioning, using self-organizing maps (SOMs; 8,9).
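The difference between the three metrics is easiest to see on a pair of expression vectors. The sketch below uses the standard textbook definitions of Euclidean distance and of centered and uncentered Pearson correlation; it illustrates the metrics only, not the centroid linkage step, and it is not taken from SMD's code. The vectors are made up for the example.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y, centered=True):
    """Pearson correlation; with centered=False the means are not
    subtracted, so a constant offset between vectors lowers the score."""
    if centered:
        mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
        x = [a - mean_x for a in x]
        y = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator


# Two profiles with the same shape but a constant offset of 10.
x = [1.0, 2.0, 3.0]
y = [11.0, 12.0, 13.0]
print(euclidean(x, y))
print(pearson(x, y, centered=True))
print(pearson(x, y, centered=False))
```

The centered correlation treats the two profiles as identical in shape, the uncentered correlation penalizes the offset slightly, and the Euclidean distance is dominated by it. Which behavior is appropriate depends on whether absolute level or pattern matters for the question at hand.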

The clustering results may be examined using several tools within SMD, including GeneExplorer (10), and the Java TreeView applet (11). Alternatively, users may download the result files for examination with other tools. SMD supports GO (12) term enrichment analysis (see Subheading 3.2.6.) for nodes of the gene cluster tree, currently for human, mouse, and yeast data, via the Web-based interactive cluster viewer. To access this tool, users click on the miniature image of the cluster heat map displayed at the end of the clustering process.

3.2.6. GO Term Analysis

Navigation (login required—registered users only):

Lists: All Programs option: "Ontology Term Finder" link.

GO term analysis is frequently an illuminating way to explore the biological significance of clustering results or of any procedure that produces a list of genes as its result. Rather than the researcher needing to know the functions of all the gene products, and the pathways in which they perform those functions, this type of analysis provides a robust way of determining whether there is a biological theme to a group of genes, and whether that theme is present at a rate higher than would be expected by chance.

SMD's GO term analysis tool is not limited to reporters found in the database; any human, mouse, or yeast genes may be analyzed, although not all genes have been annotated to the GO ontology at this time. The pcl file that was clustered, or another list of "background" genes, may be supplied to improve the accuracy of the calculation. All terms to which any gene in the list is annotated are returned, along with p values and estimated false discovery rates. A detailed help document is available within SMD.
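The role of the background list can be illustrated with a hypergeometric test, a common choice for this kind of enrichment calculation. The text above does not specify the statistic SMD uses, so the test itself, the function name, and the numbers below are assumptions. The sketch computes an uncorrected p value for a single term and does not estimate a false discovery rate.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n,
    when K of the N background genes carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical numbers: 40 of 6000 background genes are annotated to a
# term, and 8 of the 50 genes in a cluster carry that annotation.
p = enrichment_p_value(k=8, n=50, K=40, N=6000)
print(p)
```

Changing the background size `N` changes the result, which is why supplying the pcl file that was clustered gives a more accurate calculation than assuming every gene in the genome was measured.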
