Functional Characterization of Genes by Gene Ontology

Duplicate genes may undergo pseudogenization, subfunctionalization, or neofunctionalization (17). To identify the putative function and fates of duplicate genes, an in silico analysis of gene function should be undertaken using the Gene Ontology (GO) resource (18).

1. Obtain the geneID for each duplicated gene from the NCBI website (extract the ID from the gene2refseq.gz file). The geneID is a unique NCBI identifier (previously Locus Link ID) for each curated RefSeq entry. The GO database can be searched by this unique ID to extract precomputed gene ontology information. Additional information on the GO project is available on the GO project website.

2. Using the unique geneID, assign each gene its GO annotations from each of the three GO taxonomies (biological process, cellular component, and molecular function) with the GO Tree Machine. You will need to create an account; registration is free and allows the user to save and retrieve analyses.

3. Create a text file containing the list of geneIDs and save it.

a. Log onto the GO Tree Machine site, and give the analysis a relevant name for future access.

b. From the drop-down menu for "Select the ID type in your file," select Locus Link ID (same as geneID).

c. For "What kind of analysis do you want to do?" select "single gene list" to perform a functional characterization of the duplicated genes.

d. Upload the previously created text file containing the list of geneIDs and select "MAKE TREE."

e. Alternatively, if, for step 3c, you select "interesting gene list vs. reference gene list" you can perform a statistical analysis of duplicated genes to detect GO terms that are relatively enriched compared with the full RefSeq data set. You will need to choose the "MOUSE" reference list.
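Steps 1 and 3 (extracting geneIDs and saving them to a text file) could be scripted roughly as follows. This is a minimal sketch, not part of the original protocol. It assumes that the duplicated genes are already known by their RefSeq accessions and that gene2refseq.gz follows the tab-delimited layout NCBI documents for it (tax_id in column 1, GeneID in column 2, RNA accession in column 4, protein accession in column 6). The function names `read_gene_ids` and `write_gene_id_list` are hypothetical.

```python
import gzip


def read_gene_ids(gene2refseq_path, accessions, tax_id=None):
    """Map RefSeq accessions to NCBI geneIDs using gene2refseq.gz.

    Assumed column layout (tab-delimited, header lines start with '#'):
    1 = tax_id, 2 = GeneID, 4 = RNA accession.version,
    6 = protein accession.version.
    """
    # Compare accessions without their version suffix (e.g. ".4").
    wanted = {acc.split(".")[0] for acc in accessions}
    found = {}
    with gzip.open(gene2refseq_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            row = line.rstrip("\n").split("\t")
            if len(row) < 6:
                continue
            # Optionally restrict the search to a single species.
            if tax_id is not None and row[0] != str(tax_id):
                continue
            for acc in (row[3], row[5]):
                base = acc.split(".")[0]
                if base in wanted:
                    found[base] = row[1]
    return found


def write_gene_id_list(gene_ids, out_path):
    """Write the unique geneIDs, one per line, to a plain text file."""
    unique_ids = sorted(set(gene_ids), key=int)
    with open(out_path, "w") as out:
        out.write("\n".join(unique_ids) + "\n")
    return unique_ids
```

Calling `read_gene_ids` with the path to gene2refseq.gz and the list of accessions returns a dictionary from accession to geneID; passing its values to `write_gene_id_list` produces the text file to upload in step 3d. Writing one ID per line is an assumption about the upload format, so check the requirements stated on the upload page.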

4. Notes

1. Gene duplication allows for relaxed selection owing to redundancy, and this may allow for processes such as subfunctionalization, neofunctionalization, and pseudogenization. Subfunctionalization occurs when two gene copies specialize to perform complementary functions. Neofunctionalization occurs when one of the duplicated genes acquires a new biochemical function. Pseudogenization occurs when one of the duplicated genes acquires mutations that render it nonfunctional.

2. Since chromosome sequence FASTA files are quite large and range in size from 50 to 250 Mb, a significant amount of computational power and memory is required to perform the sequence alignments using MegaBLAST.

3. We will explain how to perform this analysis in a serial manner. It is up to the reader to understand the nuances of their particular cluster or supercomputing installation in order to parallelize the algorithm and achieve the desired results in less time. This means understanding whether using MPI or forking and executing processes is suitable.

4. This protocol can be implemented in any programming language, such as Perl, Java, Python, Ruby, C, or C++. In bioinformatics applications, however, such algorithms are typically written in Perl.

5. The BioPerl package is available from

6. The current Generic Feature Format version 3 (GFF3) specification is available at

7. Assembled genomes of several species, such as human, rat, chimpanzee, dog, chicken, and others, are available from the download page of the University of California at Santa Cruz (UCSC).

8. The main chromosome sequence assemblies are found in the chrN.fa files, where N is the name of the chromosome. The chrN_random.fa files are pseudo chromosomes containing sequences that are not yet finished or cannot be localized with certainty at any particular place in the chromosome assembly. The chrUn_random.fa file is another pseudo chromosome containing clones that have not been localized to a particular chromosome in the genome. These pseudo chromosomes should not be overlooked, since they can often contain sequences that are involved in segmental duplications and have not been included in the main genome assembly, perhaps because of their duplicated nature.

9. If the precompiled binaries do not match your computing environment, source code is available from NCBI at tar.gz. The instructions detail how to compile and install this suite of tools for your particular computing environment.

10. A total of N² sequence alignments are performed for all sequence files, where N is the number of files in the genome (e.g., chr1.fa vs. the chr2 BLAST database and chr2.fa vs. the chr1 BLAST database). Sequence comparisons are required for all chromosomes in the genome, including the pseudo chromosomes.
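The N² comparisons described in note 10 can be enumerated as ordered (query, database) pairs. The sketch below is illustrative only: the file names are hypothetical, and it assumes that each file is also compared against its own BLAST database, which is what a total of N² implies.

```python
from itertools import product


def alignment_jobs(fasta_files):
    """Return every (query FASTA, database FASTA) pair, N * N in total."""
    return list(product(fasta_files, repeat=2))


# Hypothetical file names; the pseudo chromosomes are included as well.
files = ["chr1.fa", "chr2.fa", "chr1_random.fa", "chrUn_random.fa"]
jobs = alignment_jobs(files)
print(len(jobs))  # 4 files give 4 * 4 = 16 alignments
for query, database in jobs[:3]:
    print(f"{query} vs. {database} BLAST database")
```

Each pair is independent of the others, so a list like this is also a natural unit of work to run serially or to distribute across a cluster, as discussed in note 3.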
