1. MySQL is available through a no-cost license for noncommercial use at http://

2. Perl is installed by default on most Linux distributions along with many publicly available modules. See for stand-alone Perl distributions.

3. The repository for many publicly available Perl modules is Both modules for download and their documentation can be found there.

4. See

5. There are several such version-dependent directories. For version 5.x.y of Perl, one is typically /usr/lib/perl5/site_perl. The command to list these directories is "perl -V". Look for the lines following "@INC:".

6. The commands to set environment variables are shell dependent:

c-shell: setenv SEQUENCE_ARCHIVE /var/genome

sh, bash: export SEQUENCE_ARCHIVE=/var/genome
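As an illustration of how a program can pick up a variable set this way, the following minimal Python sketch reads SEQUENCE_ARCHIVE from the environment. The helper name sequence_archive is hypothetical and is not part of the Toolkit, which is written in Perl.

```python
import os


def sequence_archive(default=None):
    """Return the directory named by SEQUENCE_ARCHIVE, or `default` if unset.

    `sequence_archive` is a hypothetical helper used only for illustration.
    """
    return os.environ.get("SEQUENCE_ARCHIVE", default)


if __name__ == "__main__":
    # Prints the value exported by the shell, or a marker if it was never set.
    print(sequence_archive(default="(SEQUENCE_ARCHIVE is not set)"))
```

A variable set with setenv or export is visible only to programs started from that shell afterward, so it must be set before the program is run.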

7. UCSC Genome browser is located at

8. A list of genome sequences at the UCSC can be obtained from http://hgdownload.

9. The promoter sequences should be a nonredundant set: one promoter per gene. For human promoters, we have analyzed (4) a subset of the sequences published in ref. 8. From the UCSC browser (see Note 7), it is possible to retrieve sequences corresponding to promoter regions predicted with respect to ESTs (8). However, the retrieved sequences should be checked for redundancies. Alternatively, you can create a nonredundant set of cDNA accession numbers, upload it, and retrieve the corresponding promoter sequences. The promoters obtained this way are predicted with respect to the 5' end of the cDNA sequences identified by the uploaded accession numbers.

10. Structured Query Language (SQL) is an ANSI standard that several relational databases use. Although MySQL does not conform strictly to the standard, it does conform for the commands used here. These commands are well documented in the MySQL documentation and in that of other database engines. The MySQL documentation is available at

Some command formats are determined by your operational environment. Note that the database may be remote; it need not reside on the same system as the sequence files and the Toolkit.

11. MySQL has an account scheme independent of the account scheme of the hosting OS; other SQL engines use a more integrated account scheme. Password protection for MySQL user accounts is available. If used, users may set the value of the environment variable MYSQL_PWD to the password so that the user is not prompted for a password each time a program is run (see Note 6). See the MySQL documentation for more information about securing database accounts.

12. The master9 file contains two columns. Each line provides the ID and the sequence of a 9-mer, for example:

RF49903f AGGGTTAAG

RF49903r CTTAACCCT

The first column provides the IDs, and the second column contains the sequence of the corresponding 9-mer. As shown above, the 9-mers are analyzed as pairs, forward (f) and reverse (r) complement. The sequences are listed from the 5' to 3' direction by convention. The 9-mers are analyzed and collected as pairs in order to make them independent of their orientation in the DNA (4). This strategy also eliminates problems associated with redundancy and overcounting.
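The pairing of forward and reverse-complement 9-mers can be checked with a short Python sketch. It assumes a whitespace-separated two-column layout and IDs ending in "f" or "r", as in the example above. The function names are hypothetical and are not part of the Toolkit.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def parse_master9(lines):
    """Map each ID to its 9-mer, assuming two whitespace-separated columns."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rf_id, mer = fields
        table[rf_id] = mer.upper()
    return table


def mismatched_pairs(table):
    """Return base IDs whose 'r' entry is not the reverse complement of 'f'."""
    bad = []
    for rf_id, mer in table.items():
        if rf_id.endswith("f"):
            base = rf_id[:-1]
            if table.get(base + "r") != reverse_complement(mer):
                bad.append(base)
    return bad
```

For the pair shown above, reverse_complement("AGGGTTAAG") yields "CTTAACCCT", so mismatched_pairs reports nothing for it.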

13. An absolute path is one that explicitly states the location of a file or directory among all file systems available to the user, starting at the root file system. (File systems are arranged in a tree structure.) An example would be /export/home/joe/project1/mydata. If user joe has his home directory (/export/home/joe) set as the current working directory, the relative path would be project1/mydata.
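The relationship between the two kinds of path can be sketched in Python with the standard posixpath module. The directory names are the ones used in this note, and the helper names are hypothetical.

```python
import posixpath


def to_absolute(root, relative_path):
    """Join a relative path onto an absolute root directory."""
    return posixpath.normpath(posixpath.join(root, relative_path))


def to_relative(absolute_path, start):
    """Express an absolute path relative to the directory `start`."""
    return posixpath.relpath(absolute_path, start)


if __name__ == "__main__":
    home = "/export/home/joe"
    print(to_absolute(home, "project1/mydata"))
    print(to_relative("/export/home/joe/project1/mydata", home))
```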

14. Batch Entrez is located at =Nucleotide/.

Follow the second procedure for uploading the file that contains the accession numbers. After uploading the file, click on retrieve. The page will return a listing of files that no longer exist. At the bottom of that page, you will find a link for retrieving the existing sequences. Click on that link. You will obtain a listing of the accession numbers and the corresponding definitions. To download the actual sequences, use the pull-down menu next to summary and select GenBank. You will obtain a large file that contains all sequences in GenBank format. At the top of the page, click on send all to file. You will be shown a form for opening or saving the file. Click on save. It may take a while to complete the download.

15. The grep and sed commands are system commands (installed with the operating system). The former looks for a given pattern and outputs lines containing it. The latter edits this output to strip away the beginning accession number and separating tab.
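For readers without these system commands, the same two steps can be approximated in Python. The sketch assumes plain substring matching (as with a fixed-string grep) and a tab between the accession number and the rest of the line. The accession numbers in the usage example are made up for illustration.

```python
def grep_and_strip(lines, pattern):
    """Keep lines containing `pattern`, then drop the leading accession and tab.

    The first step plays the role of grep, the second the role of sed.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if pattern not in line:
            continue
        _accession, tab, rest = line.partition("\t")
        # If there is no tab, leave the line unchanged, as sed would.
        kept.append(rest if tab else line)
    return kept


if __name__ == "__main__":
    sample = [
        "ACC0001\tHomo sapiens example gene A",  # hypothetical accession
        "ACC0002\tMus musculus example gene B",  # hypothetical accession
    ]
    print(grep_and_strip(sample, "Homo sapiens"))
```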

16. The output is in a format that can be used by map, a program by the Genetics Computer Group (GCG). For details go to gcg/.

17. Multisequence files are ones that contain multiple, possibly unrelated, sequences, one after the other. In FASTA files, a new sequence starts with the ">" character at the beginning of a line. In GenBank files, a double slash (//) terminates a sequence, and another such sequence may occur afterward. The first sequence of the file is at position 1, the second at position 2, and so on.
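The 1-based position of a sequence in a multisequence FASTA file can be illustrated with the following Python sketch. It handles only the FASTA case described above, and the function names are hypothetical.

```python
def split_fasta(text):
    """Split multisequence FASTA text into (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" at the start of a line begins a new sequence.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_at(text, position):
    """Return the record at a 1-based position (first sequence is 1)."""
    records = split_fasta(text)
    if not 1 <= position <= len(records):
        raise IndexError("no sequence at position %d" % position)
    return records[position - 1]
```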

18. This is not a particularly important feature for this study, but it is useful for subsequently planned studies.

Table 1 RF code hits

Column/field Data type Explanation

RF_id CHARACTER ID assigned to a 9-mer

RF_9mers CHARACTER Nucleotide sequence of a 9-mer

Num_of_hits INTEGER Number of times the 9-mer occurs in the RF_code_locate table
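The definition of Num_of_hits amounts to a grouped count over the RF_code_locate table. The sketch below shows that idea with Python's built-in sqlite3 module standing in for MySQL. The single-column table layout is an assumption made for brevity, and the IDs in the usage example are made up.

```python
import sqlite3


def count_hits(rf_ids):
    """Count how often each RF_id occurs in a stand-in RF_code_locate table.

    SQLite is used here in place of MySQL, and the table is reduced to the
    one column needed for counting; both are assumptions of this sketch.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE RF_code_locate (RF_id TEXT)")
        conn.executemany(
            "INSERT INTO RF_code_locate (RF_id) VALUES (?)",
            [(rf_id,) for rf_id in rf_ids],
        )
        query = (
            "SELECT RF_id, COUNT(*) AS Num_of_hits "
            "FROM RF_code_locate GROUP BY RF_id ORDER BY RF_id"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical IDs: one 9-mer seen twice, another seen once.
    print(count_hits(["RF00001f", "RF00002f", "RF00001f"]))
```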

Table 2


Column/field Data type Explanation

RF_id CHARACTER ID of a 9-mer

RF_9mer CHARACTER Nucleotide sequence of a 9-mer

Mer_count INTEGER Number of times a 9-mer occurs in the genome

Repeat_count INTEGER Number of times a 9-mer occurs within the regions classified as repetitive DNA

CDS_count INTEGER A count of 9-mers that exist in the coding regions of analyzed cDNA sequences

Table 3 Ranking

Column/field Data type Explanation

RF_id CHARACTER ID of a 9-mer

Rank_A_Ei INTEGER Number of times the 9-mer occurred in the A region

FLOAT Rank of this 9-mer using the Rank_A_Ei in this record

Rank_B_Ei INTEGER Number of times the 9-mer occurred in the B region

FLOAT Rank of this 9-mer using the Rank_B_Ei in this record

Rank_C_Ei INTEGER Number of times the 9-mer occurred in the C region

FLOAT Rank of this 9-mer using the Rank_C_Ei in this record


Wyss, Lazarus, and Bina

Table 4



Data type




Numeric index to uniquely identify this

Name/title/description of the sequence. We used the definition in the GenBank files.

Unique short name given to a promoter sequence (lab-name or promoter1, promoter2, and so on)

Name of the file containing the actual promoter sequence. This may be a relative name. The Toolkit uses an external environment variable to specify the absolute path of the relative root (see Note 13).

Number of seconds past Jan 1, 1970, when the file was last modified. If this time does not match that given by the file system, any data in processed tables referring to this sequence are

Position of the sequence in a multisequence file (see Note 17)
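The modification-time field in Table 4 can be compared against the file system as in the following Python sketch. The source text is cut off before it says what happens to the processed data on a mismatch, so the sketch only detects the mismatch. The helper name is hypothetical, and comparing whole seconds is an assumption.

```python
import os


def is_stale(path, recorded_mtime):
    """Return True if the file's modification time differs from the recorded one.

    Both values are compared as whole seconds since Jan 1, 1970.
    """
    return int(os.path.getmtime(path)) != int(recorded_mtime)
```

A caller would store the value of os.path.getmtime when a sequence file is first registered and call is_stale before trusting previously processed data.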

Table 5



Data type




Code for uniquely identifying a record in this table

Arbitrary text specified at processing time by experimenter

The numerical index that appears in the Sequence_name table (see Table 4)

Position of the transcription initiation site within a promoter sequence

Specifies the first nucleotide that was used for data collection. The Start_position is specified with respect to TI.


Creating a Database of 9-Mers Table 5 (Continued)


Data type




Specifies the last position that was used for data collection. The End_location is specified with respect to TI.



The first 9-mer that was read from a promoter sequence. This is used to verify the integrity of the collected data.



The last 9-mer that was read from the promoter sequence. This is used to verify the integrity of the collected data.



Database user identifier of user inserting this data record into the table (i.e., responsible for this processing run)



Time this record was created

Table 6



Data type




ID assigned to a 9-mer



The numerical index that appears in the Sequence_name table (see Table 4)



Location of a 9-mer in a promoter sequence with respect to TI (e.g., -400)



The 9-mers are collected as a pair of complementary sequences. A 9-mer is assigned the direction + if it is found on the same strand as the coding sequence. A - is assigned to the complementary sequence of that 9-mer on the other strand.



Single character to flag a 9-mer as special. Specifically used to identify 9-mers that lie in a region of repetitive DNA. N.B.: Flag must be specified in the study_seq command line if this field is to be included in this table.



Refers to a field in the Expt table (see Table 5)


1. International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.

2. Venter, J. C., et al. (2001) The sequence of the human genome. Science 291, 1304-1351.

3. Collins, F. S., Green, E. D., Guttmacher, A. E., and Guyer, M. S. (2003) A vision for the future of genomics research. Nature 422, 835-847.

4. Bina, M., Wyss, P., Ren, W., et al. (2004) Exploring the characteristics of sequence elements in proximal promoters of human genes. Genomics 84, 929-940.

5. Hutchinson, G. B. (1996) The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comput. Appl. Biosci. 12, 391-398.

6. Marino-Ramirez, L., Spouge, J. L., Kanga, G. C., and Landsman, D. (2004) Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res. 32, 949-958.

7. FitzGerald, P. C., Shlyakhtenko, A., Mir, A. A., and Vinson, C. (2004) Clustering of DNA sequences in human promoters. Genome Res. 14, 1562-1574.

8. Trinklein, N. D., Aldred, S. J., Saldanha, A. J., and Myers, R. M. (2003) Identification and functional analysis of human transcriptional promoters. Genome Res. 13, 308-312.

9. Kent, W. J., Sugnet, C. W., Furey, T. S., et al. (2002) The human genome browser at UCSC. Genome Res. 12, 996-1006.

10. Karolchik, D., Baertsch, R., Diekhans, M., et al. (2003) The UCSC Genome Browser Database. Nucleic Acids Res. 31, 51-54.

11. Karolchik, D., Hinrichs, A. S., Furey, T. S., et al. (2004) The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32 (Database issue), D493-496.
