With complete, or almost complete, genome sequences from a large number of species becoming available, the issue of assigning a function, if any, to each string of nucleotides has now moved to the forefront of activity in the human genome project (1). A string of nucleotides involved in a physiological process, such as encoding part of a protein (an exon) or specifying the spatiotemporal pattern of gene expression (e.g., a binding site for a transcription factor), is referred to here as a functional element in the genome. Much progress has been made in identifying genes using either ab initio predictions or evidence-based predictions, but a complete set of genes for most organisms cannot be unambiguously assigned (2). Computational detection of noncoding functional elements is even less well developed, mainly because of the limited understanding of the role of DNA sequences in the molecular mechanisms of gene regulation or other noncoding functions (3-5). However, methods of comparative genomics succeed at a sufficiently high rate that they are commonly used to predict candidate cis-regulatory elements for experimental validation (e.g., 6-10).

From: Methods in Molecular Biology, vol. 338: Gene Mapping, Discovery, and Expression: Methods and Protocols Edited by: M. Bina © Humana Press Inc., Totowa, NJ

cis-regulatory modules (CRMs) are sets of functional elements that are clustered to form a regulatory unit (such as a promoter or enhancer) that acts in cis to a gene to control its expression level, timing, or tissue specificity. A large number of bioinformatic approaches have been developed to help investigators predict CRMs. This chapter describes how to use publicly available, web-based bioinformatic servers developed in our research group and those of our collaborators to predict CRMs based on properties of vertebrate genomic sequence alignments. Additional excellent servers are described in other chapters in this book; some are listed in Table 1.

The Methods section (Subheading 3.) refers to several functions computed from genomic sequence alignments to bring out different features associated with regulatory functions. For instance, a fundamental observation is whether a sequence falls within an alignment. The methods discussed in this chapter use precomputed, whole-genome alignments of sequences from several species, generated with the programs blastZ (11) and/or multiZ (12). Several other alignment algorithms and servers have been developed, as described in a recent review (13). More recently, servers with improved features have been developed that provide enhanced abilities to align and analyze sequences provided by the user (Table 1).

Evidence of purifying (or negative) selection is one of the most general genomic indicators of function. The precomputed, whole-genome alignments have been analyzed for evidence of purifying selection acting on the aligned sequences since their divergence from a common ancestor. This type of selection can be inferred using the phastCons program (14), which estimates, for each nucleotide in a sequence (represented as a column in the alignment), the likelihood that it is in the 10% most slowly changing sequences in the genome. The phastCons scores are displayed in the "conservation" track at the UCSC Genome Browser (15). The scores are highly resolved, have a wide dynamic range, and increase with stronger evolutionary constraint. Higher scores suggest function, but they provide no insight into the nature of that function.
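As an illustration of how per-nucleotide conservation scores can be turned into candidate segments, the following minimal sketch collects runs of consecutive positions whose score meets a threshold. It assumes the scores have already been retrieved as a simple list of numbers; the function name, the threshold of 0.8, and the minimum run length are hypothetical choices for illustration and are not part of phastCons or the UCSC Genome Browser.

```python
from typing import List, Tuple


def high_score_intervals(
    scores: List[float], threshold: float, min_len: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs in which every score >= threshold."""
    intervals = []
    start = None  # start index of the run currently being extended, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at position i; keep it only if it is long enough.
            if i - start >= min_len:
                intervals.append((start, i))
            start = None
    # Close a run that extends to the end of the score list.
    if start is not None and len(scores) - start >= min_len:
        intervals.append((start, len(scores)))
    return intervals


# Hypothetical per-base scores (not real phastCons output).
example = [0.1, 0.9, 0.95, 0.2, 0.8, 0.85, 0.99, 0.3]
print(high_score_intervals(example, threshold=0.8, min_len=2))
# prints [(1, 3), (4, 7)]
```

The appropriate threshold and minimum length depend on the locus and on how many candidates can be tested experimentally.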

Table 1
URLs for Servers Used to Predict CRMs

Surviving entries (URLs not recovered): Genome sequences, and annotations; Browser and Table Browser; ECR Browser; rVista 2.0; Gene expression; Motif discovery; Crème 2.0.

The precomputed, whole-genome alignments have also been analyzed for the likelihood of involvement as a CRM, computed as a regulatory potential (RP) score (16,17). The alignments are treated as short runs of columns (containing from two to five aligned positions), and each run is evaluated by its frequency of appearance in a training set of known regulatory elements vs a training set of ancestrally derived neutral DNA. This function is influenced by the degree of evolutionary constraint, as is phastCons, but it also incorporates information about patterns in the alignments (16). Empirical evaluations of the effectiveness of this approach for finding regulatory regions of proven function show that both RP and phastCons work well with some highly conserved datasets, such as enhancers of developmental genes (18). RP performs better than phastCons on a very difficult reference set containing all the CRMs in the human HBB gene complex.
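The idea of comparing how often short runs of alignment columns occur in regulatory vs neutral training data can be illustrated with a simple log-odds calculation. The sketch below is a deliberately simplified stand-in and is not the published RP method: the encoding of a column as a short string, the function names, the pseudocount, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Column = str  # one alignment column, e.g. "AA" for two species (illustrative)
Run = Tuple[Column, ...]


def count_runs(alignments: List[List[Column]], k: int) -> Counter:
    """Count every run of k consecutive columns across the given alignments."""
    counts: Counter = Counter()
    for cols in alignments:
        for i in range(len(cols) - k + 1):
            counts[tuple(cols[i:i + k])] += 1
    return counts


def log_odds_table(
    regulatory: List[List[Column]],
    neutral: List[List[Column]],
    k: int,
    pseudo: float = 1.0,
) -> Dict[Run, float]:
    """Log ratio of run frequencies in regulatory vs neutral training sets."""
    reg, neu = count_runs(regulatory, k), count_runs(neutral, k)
    keys = set(reg) | set(neu)
    # Pseudocounts keep runs seen in only one training set from giving log(0).
    reg_total = sum(reg.values()) + pseudo * len(keys)
    neu_total = sum(neu.values()) + pseudo * len(keys)
    return {
        key: math.log(
            ((reg[key] + pseudo) / reg_total) / ((neu[key] + pseudo) / neu_total)
        )
        for key in keys
    }


def score_region(cols: List[Column], table: Dict[Run, float], k: int) -> float:
    """Average log-odds over all runs in a region; unseen runs contribute 0."""
    runs = [tuple(cols[i:i + k]) for i in range(len(cols) - k + 1)]
    if not runs:
        return 0.0
    return sum(table.get(run, 0.0) for run in runs) / len(runs)


# Toy training data (hypothetical): matches dominate the regulatory set.
regulatory_set = [["AA", "CC", "GG", "TT", "AA"]]
neutral_set = [["AG", "CT", "G-", "TA", "AC"]]
table = log_odds_table(regulatory_set, neutral_set, k=2)

conserved_like = score_region(["AA", "CC", "GG"], table, k=2)
diverged_like = score_region(["AG", "CT", "G-"], table, k=2)
print(conserved_like > 0, diverged_like < 0)  # prints True True
```

A positive score means the region's column patterns resemble the regulatory training set more than the neutral one. Real training sets are far larger than this toy example.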

None of the alignment-derived scores, including phastCons and RP scores, are sufficiently specific for highly reliable predictions of CRMs (19). Therefore, it is prudent to combine these with other features commonly found in CRMs, such as binding sites for transcription factors. Many binding site motifs have been discovered and are recorded in resources such as TRANSFAC (20) and JASPAR (21). Tools to identify motifs, based on overrepresentation of sequence strings in a given set of sequences, are also widely used (5,22). In general, any approach that searches for motifs in a single sequence returns an excess of false positives. Requiring strict conservation in alignments of human, mouse, and rat sequences reduces the number of hits to binding sites for transcription factors by a factor of about 40 (23). This chapter describes how to access matches to conserved transcription factor binding sites (cTFBS) computed by the program tffind (24).
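The effect of requiring conservation can be seen in a small sketch that reports a motif match only when it occurs, ungapped, at the same alignment columns in every species. This is not the tffind program. The consensus WGATAR is the commonly cited GATA-1 motif, and the toy sequences, the function names, and the restriction to the forward strand are assumptions made for illustration.

```python
import re
from typing import Dict, List

# IUPAC nucleotide codes translated to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> "re.Pattern[str]":
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def conserved_matches(alignment: Dict[str, str], consensus: str) -> List[int]:
    """Return start columns where the consensus matches in every aligned row.

    Gap characters never match, so windows containing gaps are excluded.
    Only the forward strand is scanned in this sketch.
    """
    pattern = consensus_to_regex(consensus)
    width = len(consensus)
    rows = [seq.upper() for seq in alignment.values()]
    length = min(len(row) for row in rows)
    hits = []
    for start in range(length - width + 1):
        if all(pattern.fullmatch(row[start:start + width]) for row in rows):
            hits.append(start)
    return hits


# Toy three-species alignment (hypothetical sequences, "-" marks a gap).
alignment = {
    "human": "CCTGATAAGG-CAGATAGTT",
    "mouse": "CCAGATAAGGACAGATTGTT",
    "rat":   "CCAGATAAGGACTGATTGTT",
}

print(conserved_matches({"human": alignment["human"]}, "WGATAR"))  # prints [2, 12]
print(conserved_matches(alignment, "WGATAR"))                      # prints [2]
```

In the toy data the human sequence alone yields two matches, but only one survives the requirement that mouse and rat match at the same columns.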

Precomputed binding sites allow a user to look for sites of interest that fall within a neighborhood of a genomic locus, without setting strict limitations on the amount of sequence being submitted in the search. In contrast, someone using a server to find matches to TFBS in a sequence will typically extract a few kilobases upstream and downstream of a gene to submit. The limitation of the analysis to a certain distance around a gene may inadvertently exclude important regions. The use of precomputed binding sites allows a user to select a larger region and subsequently reduce it through queries of a more refined region.

The data discussed in this chapter are stored in databases at the University of California at Santa Cruz (UCSC) Genome Browser (15) and GALA, a database of genome sequence alignments and annotations (25,26) (Table 1). A recently released metaserver, Galaxy, provides a platform for integrative analyses of genomic sequences and annotations (27). The metaserver uses the query engines from remote databases such as the UCSC Table Browser (28) and other resources to retrieve primary data, and it provides operations and tools to filter, combine, and analyze the data. The Galaxy metaserver project is new and should grow to connect to many data repositories and provide a large suite of operations and tools. GALA is a more mature database project that also provides access to alignment and annotation results. GALA follows the traditional approach of recording all the data in a database on one large machine, whereas Galaxy accesses data from remote sites. Instructions for acquiring and analyzing data to predict CRMs using both Galaxy (in conjunction with the UCSC Table Browser) and GALA are presented in the Methods section (Subheading 3.).

The basic method described in this chapter is to retrieve candidate CRMs in erythroid cells as noncoding DNA segments with a high phastCons score or high RP score and a conserved match to a GATA-1 binding site. GATA-1 is a transcription factor that is essential for proper gene expression during late erythroid maturation (29). A description is given of how to obtain noncoding genomic DNA segments with the desired phastCons or RP scores, how to obtain conserved GATA-1 binding sites, and how to identify all the conserved or high-RP intervals with a conserved GATA-1 binding site in close proximity. A similar approach could be followed for any binding site of interest, when some information is known regarding preferential tissue specificity of the factor.
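The final step of this method, keeping only the high-scoring intervals that have a conserved binding site nearby, amounts to a proximity join between two sets of genomic intervals. The sketch below shows that operation in isolation. It is not the Galaxy or GALA implementation; the interval type, the function name, the max_dist parameter, and the coordinates are hypothetical, and the exclusion of coding exons is assumed to have been done beforehand.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive (half-open coordinates)


def intervals_near_sites(
    intervals: List[Interval], sites: List[Interval], max_dist: int = 0
) -> List[Interval]:
    """Return intervals with at least one site within max_dist bases.

    With max_dist=0 the site must overlap the interval. This is a simple
    pairwise scan, adequate for a single locus but not for a whole genome.
    """
    selected = []
    for iv in intervals:
        for site in sites:
            if site.chrom != iv.chrom:
                continue
            # Overlap test after extending the interval by max_dist per side.
            if site.start < iv.end + max_dist and site.end > iv.start - max_dist:
                selected.append(iv)
                break  # one nearby site is enough
    return selected


# Hypothetical coordinates for illustration only.
scored = [
    Interval("chr1", 100, 300),
    Interval("chr1", 1000, 1200),
    Interval("chr2", 100, 300),
]
gata1_sites = [Interval("chr1", 250, 256), Interval("chr1", 1230, 1236)]

print(len(intervals_near_sites(scored, gata1_sites)))               # prints 1
print(len(intervals_near_sites(scored, gata1_sites, max_dist=50)))  # prints 2
```

Widening max_dist trades specificity for sensitivity: the second interval is recovered only when a site up to 50 bases away is accepted.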

Although the approach described using premapped matches to binding sites for the entire genome is useful, other computational tools are being developed to discover motifs (short nucleotide strings). These extensions of basic pattern matching require a given motif to be enriched in, for example, sequences immediately upstream from a set of coexpressed genes. Thus, they are frequently used to find candidates for common regulatory elements controlling similarly expressed genes. Clusters of coexpressed genes are commonly deduced from transcriptional profiles based on microarray or other experiments measuring expression. Two large public databases of gene expression data are located at the Gene Expression Omnibus and ArrayExpress (Table 1). A sample of motif-finding servers is listed in Table 1. These simply require that users submit a list of sequences, such as the promoters (known or predicted) for a set of coexpressed genes. The servers use different methods (5) for motif discovery. An evaluation of the performance of these methods was recently published and provides further information on the subject (22).
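The notion of a motif being enriched in the upstream sequences of coexpressed genes can be illustrated with a naive comparison of k-mer occurrence between a foreground and a background set. The motif-finding servers use considerably more sophisticated statistical methods. The function names, the pseudocount, and the toy sequences below are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def kmer_presence(seqs: List[str], k: int) -> Counter:
    """Count how many sequences contain each k-mer at least once."""
    counts: Counter = Counter()
    for seq in seqs:
        s = seq.upper()
        # A set ensures each k-mer is counted once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def enriched_kmers(
    foreground: List[str], background: List[str], k: int, pseudo: float = 1.0
) -> List[Tuple[str, float]]:
    """Rank foreground k-mers by the ratio of foreground to background frequency."""
    fg = kmer_presence(foreground, k)
    bg = kmer_presence(background, k)
    ranked = []
    for kmer, n in fg.items():
        fg_frac = (n + pseudo) / (len(foreground) + pseudo)
        bg_frac = (bg[kmer] + pseudo) / (len(background) + pseudo)
        ranked.append((kmer, fg_frac / bg_frac))
    # Highest ratio first; ties are broken alphabetically for reproducibility.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Toy sequences (hypothetical): the foreground shares an embedded AGATAA.
promoters = ["TTAGATAAGC", "CCAGATAAGT", "GGAGATAATC"]
background = ["ACGTACGTAC", "TTTTCCCCGG", "GCGCGCATAT"]

top_kmer, ratio = enriched_kmers(promoters, background, k=6)[0]
print(top_kmer, ratio)  # prints AGATAA 4.0
```

A ratio of this kind has no significance estimate attached, which is one reason the published tools rely on explicit statistical models.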
