To create a reference, we computationally generated all possible 9-mers, producing 4^9 (262,144) sequences (2). To facilitate data utilization, each 9-mer was assigned an identifier (ID) (2). To make the 9-mers independent of their orientation in DNA, each 9-mer is paired with its reverse complement. Because a sequence of odd length cannot be its own reverse complement, the set contains exactly 131,072 pairs of 9-mers. This set was named BINA_RF_sorted_master9 and can be downloaded from the web (see Note 1).

Each pair in that set defines the forward (f) and the reverse complement (r) of a 9-mer. For example, RF107075f corresponds to TCGCGAGCG, and RF107075r defines its reverse complement (CGCTCGCGA). The sequences are written in the 5' to 3' direction, by convention (see Note 2).
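
As an illustration only, the construction of such a reference set can be sketched in Python. The function names and the rule used to choose the key for each pair (the alphabetically smaller member) are assumptions made for this example; they do not reproduce the ID assignment used in BINA_RF_sorted_master9.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def all_kmers(k):
    """Generate every possible k-mer over the alphabet A, C, G, T."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def pair_with_reverse_complements(kmers):
    """Group k-mers into pairs of a k-mer and its reverse complement.

    Assumption: the alphabetically smaller member serves as the key.
    """
    pairs = {}
    for kmer in kmers:
        rc = reverse_complement(kmer)
        key = min(kmer, rc)
        pairs[key] = (key, max(kmer, rc))
    return pairs


kmers = all_kmers(9)
pairs = pair_with_reverse_complements(kmers)
print(len(kmers), len(pairs))            # 262144 131072
print(reverse_complement("TCGCGAGCG"))   # CGCTCGCGA
```

The pair count is exactly half the number of 9-mers because no 9-mer equals its own reverse complement.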

The human promoters that we have analyzed for data collection (2) correspond to a subset of sequences described in a previous report (9). The sequences were derived with respect to the 5' end of a relatively large set of human expressed sequence tags (ESTs) (9).

To facilitate data collection and management, we created a database. For the database engine, we chose MySQL (see Note 3). The data were collected by comparing each 9-mer in human promoter sequences (between positions -500 and +50) with the reference set consisting of all possible 9-mers (see Note 1).
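
The comparison step can be pictured as a sliding window of width 9 moved along each promoter sequence. The following Python sketch is a simplified stand-in for the database-driven procedure, not the original implementation. It assumes that windows containing characters other than A, C, G, or T are skipped and that both orientations of a 9-mer are counted under one key; the names `canonical` and `count_kmers` are hypothetical.

```python
from collections import Counter

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return one fixed representative of a k-mer and its reverse complement.

    Assumption: the alphabetically smaller of the two is used.
    """
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(sequence, k=9):
    """Count every overlapping k-mer in a sequence, ignoring orientation."""
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        # Skip windows with ambiguous bases such as N.
        if set(window) <= set("ACGT"):
            counts[canonical(window)] += 1
    return counts


# A 10-base toy sequence yields two overlapping 9-mers.
counts = count_kmers("TCGCGAGCGA")
print(sum(counts.values()))  # 2
```

Applied to a promoter spanning positions -500 to +50, the same loop would visit every 9-mer window in that region once.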

A summary table was created to include the 9-mers that occurred in three regions: between positions -500 and +50; between positions -50 and +50 (considered the basal promoter); and between positions -500 and -50 (considered the proximal promoter). The 9-mers were ranked according to their abundance in promoters relative to their abundance in total human genomic DNA (2). A 9-mer that occurs with equal relative frequency in the specified promoter region and in total genomic DNA is expected to have a ranking in the vicinity of 1. Statistically significant 9-mers would have a ranking of 5 or higher (2). The data for proximal promoters can be downloaded from http://
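
To make the idea of the ranking concrete, the sketch below computes a simple frequency ratio in Python. The exact formula is given in (2); the ratio used here (fraction of promoter 9-mers divided by fraction of genomic 9-mers) is an assumption chosen because it yields a value near 1 for equally abundant 9-mers. The function name `abundance_ratio` and all counts are invented for illustration.

```python
def abundance_ratio(promoter_counts, genome_counts, threshold=5.0):
    """Return (ratios, enriched) for 9-mers seen in promoters.

    Assumed ratio: (promoter count / promoter total) divided by
    (genome count / genome total). 9-mers absent from the genome
    counts are skipped to avoid division by zero.
    """
    promoter_total = sum(promoter_counts.values())
    genome_total = sum(genome_counts.values())
    ratios = {}
    for kmer, p_count in promoter_counts.items():
        g_count = genome_counts.get(kmer, 0)
        if g_count == 0:
            continue
        ratios[kmer] = (p_count / promoter_total) / (g_count / genome_total)
    # Keep the 9-mers at or above the threshold, highest ratio first.
    enriched = sorted(
        (k for k, r in ratios.items() if r >= threshold),
        key=lambda k: ratios[k],
        reverse=True,
    )
    return ratios, enriched


# Made-up counts, for illustration only.
promoter_counts = {"CGCTCGCGA": 50, "AAAAAAAAA": 50}
genome_counts = {"CGCTCGCGA": 10, "AAAAAAAAA": 90}
ratios, enriched = abundance_ratio(promoter_counts, genome_counts)
print(enriched)
```

With these toy counts the first 9-mer makes up half of the promoter set but only a tenth of the genomic set, so it reaches the threshold of 5, while the second falls below 1.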

The database also includes the accession numbers of the ESTs used for localization of the predicted transcription initiation sites of genes. Perl scripts were developed for correlating the accession numbers to the definitions described in GenBank files (see Note 3).
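
The original scripts were written in Perl and are not reproduced here. As a rough illustration of the same idea, the Python sketch below maps accession numbers to definitions by reading the ACCESSION and DEFINITION lines of GenBank flat-file records, which end with a line containing "//". The function name and the sample record, including the accession ZZ000001, are hypothetical.

```python
def parse_genbank_definitions(text):
    """Map each record's primary accession to its DEFINITION text."""
    result = {}
    accession = None
    definition_parts = []
    in_definition = False
    for line in text.splitlines():
        if line.startswith("DEFINITION"):
            # The keyword field occupies the first 12 columns.
            definition_parts = [line[12:].strip()]
            in_definition = True
        elif in_definition and line.startswith(" "):
            # Indented lines continue the definition.
            definition_parts.append(line.strip())
        else:
            in_definition = False
            if line.startswith("ACCESSION"):
                fields = line.split()
                accession = fields[1] if len(fields) > 1 else None
            elif line.startswith("//"):
                # End of record: store it and reset the state.
                if accession:
                    result[accession] = " ".join(definition_parts)
                accession, definition_parts = None, []
    return result


# Hypothetical record, for illustration only.
sample = "\n".join([
    "LOCUS       ZZ000001     300 bp    mRNA    linear   EST 01-JAN-2000",
    "DEFINITION  hypothetical example EST clone, 5' end,",
    "            mRNA sequence.",
    "ACCESSION   ZZ000001",
    "VERSION     ZZ000001.1",
    "//",
])
definitions = parse_genbank_definitions(sample)
print(definitions["ZZ000001"])
```

For production use, an established parser such as the one in Biopython would be a more robust choice than this hand-written loop.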
