Novel strategies could help with discovery and mapping of the "codes" in DNA that regulate the expression of genes. The completed sequence of the human genome (1-3) and the genomic sequences of model organisms offer a rich source of data for addressing this problem. Not surprisingly, most efforts have focused on discovery of codes that control the expression of protein-coding genes since these codes are the key components of the complex networks and pathways through which various cell types are produced.

We have aimed at discovering regulatory codes, irrespective of their orientation and order in genomic DNA (4). The underlying hypothesis that drives our strategy is that short-sequence motifs occurring frequently in promoter regions of genes may correspond to codes that act in cis to regulate the expression of linked genes (4). This research problem has also been addressed by others (see, for example, ref. 5-7).

From: Methods in Molecular Biology, vol. 338: Gene Mapping, Discovery, and Expression: Methods and Protocols Edited by: M. Bina © Humana Press Inc., Totowa, NJ

Our strategy relies on collecting 9-base elements that occur in the promoter regions of human genes (4). The hypothesis is that these elements may encompass, include, or overlap with "words" through which the regulatory codes could be described (4). To examine this hypothesis, we have created a "dictionary" consisting of all possible 9-mers, irrespective of their orientation in DNA (4). Previously we used the dictionary as a reference for collecting 9-mers from proximal promoter regions that were experimentally defined and from promoter regions that were deduced with respect to the 5' end of a nonredundant set of human expressed sequence tags (ESTs) (4). This latter dataset was obtained from a previous publication (8).
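The idea of an orientation-independent dictionary can be illustrated with a minimal Python sketch. It is not the Toolkit's code: the function names, the choice of the lexicographically smaller strand as the representative, and the ID numbering are all assumptions made for illustration. Because 9 is odd, no 9-mer equals its own reverse complement, so the 4^9 = 262,144 possible 9-mers pair up into 131,072 dictionary entries.

```python
from itertools import product

K = 9
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumption: the lexicographically smaller strand is used.
    """
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def build_dictionary(k=K):
    """Map every orientation-independent k-mer to an integer ID.

    The numbering (1, 2, ... in sorted order) is hypothetical.
    """
    entries = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {seq: kmer_id for kmer_id, seq in enumerate(entries, start=1)}


def collect_kmers(promoter, dictionary, k=K):
    """Count dictionary IDs over all overlapping k-base windows of a sequence."""
    promoter = promoter.upper()
    counts = {}
    for i in range(len(promoter) - k + 1):
        window = promoter[i:i + k]
        # Skip windows containing N or other ambiguity codes.
        if set(window) <= set("ACGT"):
            kmer_id = dictionary[canonical(window)]
            counts[kmer_id] = counts.get(kmer_id, 0) + 1
    return counts
```

With this representation, a promoter sequence and its reverse complement yield the same counts, which is what "irrespective of orientation" requires.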

In this chapter, we describe the schema of the database and the associated programs developed for data collection. The design philosophy was to create a set of relatively small programs that could be used along with system utilities and ad hoc scripts to create databases. The database was constructed to provide multifaceted views of the data as well as a mechanism to relate ancillary data to the originally processed data without having to reprocess the original sequences. Although this approach requires the researcher to be more comfortable in a shell/command-line environment, it allows flexibility in research direction. Furthermore, multiple contributors to the Toolkit need to know only the general design philosophy rather than the intricate details of one or two monolithic programs.
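How ancillary data might be related to already-processed data without reprocessing sequences can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical and are not the schema described in this chapter; the sketch only shows that when 9-mer occurrences are stored against dictionary IDs, later annotations can be attached by joining on the same ID.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical four-table layout keyed on dictionary IDs."""
    conn.executescript("""
        CREATE TABLE dictionary (
            kmer_id  INTEGER PRIMARY KEY,
            sequence TEXT UNIQUE NOT NULL
        );
        CREATE TABLE promoter (
            promoter_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE occurrence (
            kmer_id     INTEGER NOT NULL REFERENCES dictionary(kmer_id),
            promoter_id INTEGER NOT NULL REFERENCES promoter(promoter_id),
            position    INTEGER NOT NULL
        );
        -- Ancillary data added later; it joins on kmer_id only.
        CREATE TABLE annotation (
            kmer_id INTEGER NOT NULL REFERENCES dictionary(kmer_id),
            note    TEXT NOT NULL
        );
    """)


def promoters_per_kmer(conn):
    """Return (sequence, number of distinct promoters, note) for each 9-mer seen."""
    return conn.execute("""
        SELECT d.sequence, COUNT(DISTINCT o.promoter_id), a.note
        FROM dictionary AS d
        JOIN occurrence AS o ON o.kmer_id = d.kmer_id
        LEFT JOIN annotation AS a ON a.kmer_id = d.kmer_id
        GROUP BY d.kmer_id, a.note
        ORDER BY d.kmer_id
    """).fetchall()


def load_toy_data(conn):
    """Insert placeholder rows (not real data) to exercise the query."""
    conn.executemany("INSERT INTO dictionary VALUES (?, ?)",
                     [(1, "AAAAAAAAA"), (2, "AAAAAAAAC")])
    conn.executemany("INSERT INTO promoter VALUES (?, ?)",
                     [(1, "promoter_1"), (2, "promoter_2")])
    conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?)",
                     [(1, 1, 10), (1, 2, 25), (2, 1, 40)])
    conn.execute("INSERT INTO annotation VALUES (?, ?)", (1, "toy note"))
```

In this layout, adding a new kind of ancillary data means adding another table keyed on `kmer_id` (or `promoter_id`); the `occurrence` rows produced from the original sequences are left untouched.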

The Toolkit for creating the database can be downloaded from the following web site:

Our studies have focused on analyzing the promoter regions of human protein-coding genes (4). The strategy is general and can be applied to collect data from the regulatory segments of other species. Since the sequences in our dictionary are associated with identifiers (IDs), we propose that collections from various species could help with creating a framework for examining the evolution of regulatory codes in DNA. The reference table and the strategy may also help with a general format for organizing and studying the single-nucleotide polymorphisms (SNPs) associated with regulatory codes in DNA.
