Populating the Tables

The order with which the tables are populated is based on convenience. It is possible to populate the tables in batches or manually, i.e., one sequence at a time. In this section, the emphasis is on procedures for analyzing large batches of sequences.

1. Use the following commands to populate three of the tables (RF_code_hits, Genome_ count, and Ranking; Tables 1, 2, and 3) with the information stored in master9. This file contains all possible 9-mers (see Note 12). In the following command line, myquery corresponds to a general utility in the Toolkit (see Subheading 3.5., step 8). The procedure executed by the command line also sets the initial counts of each 9-mer to zero. The Ranking table uses only the RF_id, so the cut utility is used to create a separate temporary file.

shell> myquery --parameters=master9 --query="INSERT INTO RF_code_hits

(RF_id,RF_9mers) VALUES (?,?)" shell> myquery --parameters=master9 --query="INSERT INTO Genome_

count (RF_id,RF_9mer) VALUES (?,?)" shell> cut --f1 master9 >RFid shell> myquery --parameters=RFid --query="INSERT INTO Ranking (RF_id)

Genome_count will hold information about the occurrences of each 9-mer in total genomic DNA and in sequences classified as repetitive DNA. RF_code_hits will summarize the data collection counts of the 9-mers in the promoter analyzed. Rankings summarize a calculated ranking for each 9-mer (Tables 1, 2, and 3).

2. Update the counts of 9-mers in the Genome_count table using the downloaded genome files (in our case, human genome). This is done using the study_chromo program (see Subheading 3.5., step 11). In the total genomic sequences that are downloaded from the genome center at the UCSC (see Note 7), the regions that correspond to repetitive DNA are shown in lower case letters. The study_chromo program counts the total occurrence of 9-mers in a specified chromosome file as well as the counts of all 9-mers that overlap a region of repetitive DNA. This program requires the files be a random access format. A file of this format can be generated from FASTA format using the fasta2random_access Toolkit program (see Subheading 3.5., step 10). Shown below is an example for creating a random access file and collecting the 9-mers that occur in human chromosome 1. shell> fasta2random_access chrl.fa shell> study_chromo chrl.ra >chr1.cnt shell> myquery --parameters=chr1.cnt --query="UPDATE Genome_count SET Mer_count = Mer_count+?, Repeat_count=Repeat_count+? WHERE RF_9mer = ?" shell> rm chrl.cnt chrl.ra Repeat the procedure for each chromosome file (chr2.fa, chr3.fa, ...) to build the total genomic count. The output of study_chromo is a file to be used by myquery to update the counts in Genome_count table. The example assumes the commands are executed with the current working directory set to the directory in which the chromosome files reside. Relative and absolute path names can be used to specify input and output if that is not the case (see Note 13). The chrl.cnt and chrl.ra file may be deleted afterward.

3. Use the utility split_seq for creating a separate file for each promoter sequence from the downloaded multisequence file. This provides for more efficient processing of the promoter sequences in subsequent steps (see Subheading 3.5., step 9). When splitting the multipromoter-sequence file, the program names the resulting promoter files according to the accession number of cDNAs from which the promoter regions of genes were derived. Files are placed in the current working directory. In the example, the downloaded FASTA file is called promoter_sequences, located in the home (default) directory, although it may be located via any path name (see Note 13). The noprompt option instructs the program to extract all files without any additional user intervention. The example places promoter files in a subdirectory called promoters of the SEQUENCE_ARCHIVE directory (see Subheading 3.5., step 1).

shell> cd $SEQUENCE_ARCHIVE shell> mkdir promoters shell> cd promoters shell> split_seq --noprompt $HOME/promoter_sequences shell> cd $HOME

4. From GenBank, obtain the files for the cDNA sequences used for localization of the promoter regions. Each promoter has been named according to the accession number of the cDNA to which it corresponds (8). The actual cDNA sequence files are useful for analyzing the number of times a 9-mer occurs in the coding regions (CDS) and for obtaining information about the encoded proteins. This information (DEFINITION in GenBank files) is added to the database. To do so, the first step is to collect a listing that contains the cDNA accession numbers associated with the promoter sequences. This file can then be uploaded in Batch Entrez (see Note 14). Following the example in Subheading 3.4., step 3, to get a listing of cDNA accessions, use this command:

shell> ls $SEQUENCE_ARCHIVE/promoters/* >cDNA_accession The file cDNA_accession should be uploaded to Batch Entrez (see Note 14). You will obtain a multisequence GenBank file. Place this file in the $SEQUENCE_ ARCHIVE directory. Batch Entrez will also provide a file of any sequences that were withdrawn from GenBank. Remove them from the $SEQUENCE_ARCHIVE/pro-moters directory before continued processing.

5. Use the Toolkit program gb_titles to extract the accession number to gene definition (title) mapping. This mapping is used for adding to the database the definitions in GenBank files. In the continuing example, assume that the downloaded multi-sequence file is named cDNA_sequences_march22_05.gb. The following command extracts the information to a file called "titles.map":

shell> gb_titles cDNA_sequences_march22_05.bg >titles.map The output, titles.map, will be used below, in the next step of data processing.

6. For data processing, analyze the promoter sequences you have generated in step 3. The analysis uses several programs for populating specific tables. The description shown below is for collecting the data in batch.

In step 3, a multisequence promoter file was split, creating a file for each promoter sequence. For data collection, each promoter sequence must be registered. Registration populates the Sequence_name table (see Table 4) with information extracted from each sequence file and augmented by information provided in the command line (see Subheading 3.5., step 2). In the example below, the information extracted in Subheading 3.4., step 5 is inserted as part of the registration process via the command line. This registration is associated with a "labname" to which it is referred by study_seq (see Subheading 3.5., step 3). The study_seq program collects data about the location of 9-mers occurring in a registered sequence. The program populates the Expt table (Table 5) with its input process parameters. The program also populates RF_code_locate (Table 6) with location and orientation of 9-mers. Study_seq can flag each 9-mer location as being in a repetitive versus nonrepetitive DNA region. In promoter sequences downloaded from the genome browser at UCSC, regions that correspond to repetitive DNA are shown in lower case letters.

In the course of registering and analyzing thousands of promoter regions, it is wise to automate the process. It is simple to do this in the example case as all promoter sequences have the same transcription initiation site (+1) and are all of the same length. The following ad hoc shell script does this for the promoter sequence files created in the example of Subheading 3.4., step 3 incorporating the data from the GenBank files obtained in Subheading 3.4., step 4. The labnames created are Promoter.1, Promoter.2, and so on (see Note 15).

foreach file ($SEQUENCE_ARCHIVE/promoters/*) set filename='basename $file' echo $filename set title='grep "A$filename" titles.map | sed -e "s/A[A\t]\t]//'" register_seq --noprompt --format=fasta --idtag=gb promoters/$filename --title="$title" "Promoter.$x"

study_seq --noprompt --start=-500 --TI_site=551 --flag Promoter.$x

Assuming the script file is called process_promoters, an easy way to execute the script is shell> csh process_promoters

7. Use the information in RF_code_locate table (Table 6) to obtain the number of times a 9-mer was encountered in all analyzed sequences to update the RF_code_ hits table. The update_hits program will do this (see Subheading 3.5., step 4).

shell> update_hits

8. Populate Genome_count (see Table 2) to include the number of times a 9-mer has been found in the coding regions of genes. This information is obtained as aggregate 9-mer counts from the coding region of the cDNA sequences downloaded from GenBank (see Subheading 3.4.).

Use the study_cds (see Subheading 3.5., step 12) to extract the information into file cds.par. The input to study_cds is the GenBank file obtained in Subheading 3.4. There is no repetitive DNA indication in the CDS regions of the GenBank file, so that field is cut out. Use myquery with the cds.par file as parameter input to update the Genome_count table's CDS_count field with the extracted data. shell> study_cds cDNA_sequences_march22_05.gb | cut -d" " -f1,3 >cds.par shell> myquery --parameters=cds.par --query="UPDATE Genome_count SET CDS_count = ? WHERE RF_9mer = ?"

9. Compute rankings of 9-mers based on their over/underrepresentation in the promoter region. To do this we need: the total number of 9-mers in the human genome (G), the total number of 9-mers in the promoter regions of interest (E), the number of times the ¿th 9-mer occurs in the human genome (G), and the number of times the ¿th 9-mer occurs in the proximal promoter region of interest (4). The ranking is


To get the total number of 9-mers in the human genome (G), simply sum up the count of all the 9-mers:

shell> myquery -query="SELECT SUM(Mer_count) FROM Genome_count SUM(Mer_count) 5731480962

1 row(s) selected

In this example, the promoter region of interest is from -500 to -50 with respect to the transcription initiation site. Since each record in the RF_code_locate represents one record, all records with location from -500 to -50 will be counted to get E. shell> myquery -query="SELECT COUNT(Location) FROM

RF_code_locate WHERE Location >= -500 AND Location <= -50" COUNT(Location)

19181030 1 row(s) selected This gives

This is used to generate the ranking of each 9-mer using the rank.sql file in the Toolkit. In this file, the phrase FACTOR is replaced with the value, and the resulting file is placed in the STORED_PROCEDURE directory (see Subheading 3.1., step 5). The following will generate a ranking for the -500 to -50 promoter region for all 9-mers in the file rankings.

shell> myquery -noheader rank.sql -- . -500 -50 >rankings The -- in the command prevents the -500 and -50 from being processed as command options. The dot is a regular expression indicating that all flags (i.e., both R and N, repetitive DNA and nonrepetitive DNA regions) are to be counted. See more about regular expressions and the REGEXP operator in MySQL documentation. For convenience, the rankings are placed in the Rankings table (Table 3). The region -500 to -50 is associated with the Rank_B and Rank_B_Ei fields. The same procedure can be used to associate other regions with other rank fields.

shell> myquery -parameters=rankings --query="UPDATE Rankings SET Rank B Ei=?, Rank B=? WHERE RF id=?"

3.5. Programs in the Toolkit

1. Use the explain program in the Toolkit to list a description of command line option and environment variables required for each perl script. The explain program extracts the first segment of POD (plain-old documentation) from the programs. Explain searches the PATH environment for the program specified in the command line. Example:

shell> explain study_seq

2. Use register_seq to register a sequence selected for data collection. The result of this program is to populate the Sequence_name table (see Table 4 for a listing and descriptions of the fields in Sequence_name). The Sequence_name table is the main table to which almost all other tables refer, either directly or indirectly. Other data analysis programs that analyze the sequence place the sequence identifier (Sequence _id in the Sequence_name table) in other tables along with the results of the analysis. This allows the results to be linked back to all the information in the Sequence _name table.

The information stored in the sequence name table is extracted from the sequence file registered. Command line parameters may give values to fields not extract-able from the sequence file itself. For example, FASTA files do not always have a description of gene function, so the command line option named title is used to specify it. Command line options also specify the format of the file. In the previous section, the files were analyzed in batch. In the example shown below, a single-sequence file (U24128) is analyzed to specify the format, and to include a title (or definition) and a labname (PC1). From a GenBank-formatted file, the definition is automatically extracted.

In the example shown below, the file should be located in the SEQUENCE_ ARCHIVE directory or subdirectory thereof (see Subheading 3.1.5.).

shell> register_seq --format=fasta --title="prohormone convertase" U24128 PC1 Otherwise, the subdirectory must be specified along with the sequence file name. For example, if the sequence file were in the subdirectory hs (for homo sapiens) in the SEQUENCE_ARCHIVE directory, the command line would be shell> register_seq --format=fasta -title="prohormone convertase" hs/ U24128 PC1

3. Use study_seq for extracting the 9-mers from a registered promoter sequence. The output places the 9-mers according to their name, their sequence, and their location within the promoter sequence into the RF_code_locate table (Table 6). The operation identifies the 9-mers in both forward and reverse direction (see Note 12). In the RF_code_locate table, the field named location captures the position of the 9-mer, and its complement, with respect to the transcription start site. Another field (named Direction in the RF_code_locate Table) provides the orientation (+ or -) with respect to the chromosome strand.

The RF_code_locate table also includes a field to flag a 9-mer as special. The study_seq program will record R in the flag field if that instance of the 9-mer is found in a repetitive DNA sequence in the promoter region and flagged as N if there is no overlap with DNA marked as repetitive.

The processing parameters of study_seq and the sequence_id of the analyzed promoter sequence are stored as records in the Expt table (see Table 5 for a listing and descriptions of the fields in Expt). When you invoke the program, include the required criteria. In the example:

shell> study_seq --flag --start=-300 --end=+55 --TI_site=4375 PC1 The program is instructed to use the sequence with labname PC1, to start from position -300, and to end at position +55 with respect to the transcription initiation site (TI). The TI_site, in the command line, informs the program about the position of the transcription initiation site in the promoter sequence that is being analyzed for data collection. This feature is particularly useful for collecting data from GenBank files since these files are numbered with respect to the beginning of a submitted sequence. For historic reasons, the study_seq program also handles RF_ code_locate tables without the flag field. The flag option must be used with study_seq if the flag field is included; a database error will be generated if it is not included.

4. Use update_hits to update the collective hit counts (number of occurrences) for each 9-mer in RF_code_locate into RF_code_hits table. The program counts the number of times each 9-mer occurs in the RF_code_locate table and updates the Num_of_hits field in the record with the matching 9-mer. A query file with parameters can be specified to return a subset. Suppose that a file in the STORED_PRO-CEDURES (see Subheading 3.1., step 5) directory named hit_filter contained the following query:

SELECT COUNT(RF_id), RF_id, FROM RF_code_location WHERE Flag = ? GROUP BY RF_id The command line line shell> update_hits --query=hit_filter:N Would count only the 9-mers flagged with N, that is, ones not overlapping with a region of repetitive DNA.

5. Use hit_summary to produce a GCG-formatted report (see Note 16) on the 9-mers using both the RF_code_hits and RF_code_locate database tables. The report lists each 9-mer, RF-id of the 9-mer, the number of registered sequence in which it appears (by labname), and the location of the 9-mer in each of the sequences in which they appear. Since study_seq populates the RF_code_locate table, only registered sequences that have been studied will appear in the report. Command line options for restricting the report to a specific hit count or 9-mers with hit counts greater than some threshold are available.

6. Use myquery to execute predefined or on-the-fly SQL statements. Query parameters can be specified on the command line or in a parameter file. Predefined queries are stored in the directory defined by the environment variable STORED_ PROCEDURES. When the statements are executed, parameter place-holders, indicated using question marks in the SQL statement, are replaced by the parameters in the command line or parameter file in the order in which they occur. The parameters on the command line are the last items on the command line. The query is executed once for command line parameters. The query is executed as many times as there are lines in a parameter file. The program allows not only for selecting some dataset from the database, but also for operators to populate and update arbitrary tables in the database.

7. Use split_seq to extract one or more sequences in a multisequence file to separate sequence files (see Note 17). By default the files are named by the accession number of the sequence at each position within the combo-file. The individual files are named by their accession number or other identifier in file. In FASTA files, there are sometimes several identification tags listed (gi and gb), and any can be used to name the files using the idtag option. In interactive mode, the user is asked whether the file should be extracted. The --noprompt command option skips the user prompt. The format of the multisequence file can be any that the BioPerl package can interpret, FASTA and GenBank being the two obvious formats.

8. Use fasta2random_access to convert a FASTA-formatted file to a random access file. This format makes it easier to convert base coordinates to file coordinates as they are equal. Programmatic access to locations within the file are done with seek or other similar file system calls (see Note 18). Random access files are created for large sequence files, typically representing an entire chromosome. Programs to extract subsequences from the chromosome given the sequence coordinates are extremely simple to program.

9. Use study_chromo to obtain a count of all 9-mers in a random access sequence file. The program name implies that the file represents a large sequence, representing an entire chromosome. The results are outputted in column format that is structured to be used as input as a parameter file to myquery to update tables. As with other programs that produce only aggregate count values, study_chromo does not require the chromosome sequence to be registered in the database. Instead, the file name of the file containing the sequence is specified in the command line. The program provides a count of each 9-mer in a chromosome. The program also produces a count of the 9-mers in segments with lower case bases, denoting regions of repetitive DNA. The program does have a command line option to count oligomers of different sizes.

10. Use study_cds to count 9-mers in the CDS region of a GenBank-formatted sequence file. The format must be GenBank as the program uses the features documented in the file to determine the CDS region. If the file is a multisequence file it will count each 9-mer in all the CDS regions in total (not per sequence). As with other programs that produce only aggregate count values, study_cds does not require the chromosome sequence to be registered in the database. Instead, the file name of the file containing the sequence is specified in the command line. The program does have a command line option to count oligomers of different sizes.

11. The gb_titles script is more of an ad hoc script than the other more flexible programs provided. It has no command line options. It also removes the string "Homo Sapiens CDS" and similar strings from the description as the project involved human genome analysis. It is, however, extremely simple as it uses the BioPerl functions to do the extraction of the description line. Only a simple knowledge of Perl is needed to modify the script to handle another specific species.

12. The retrieve_seq and download_seq programs are command-line programs that will download a sequence file from GenBank given an accession number. The output option allows the file to be named and placed to non-default locations. The default name is the accession number. The default place is SEQUENCE_SOURCE (retrieve_seq) and SEQUENCE_ARCHIVE (download_seq). Specification of an absolute path name will place the file in the explicit path (see Note 13). The format option allows the file to be stored in formats such as GenBank, FASTA, or others.

0 0

Post a comment