Figure 3 A decade of growth for GenBank and NCBI.

The NCBI Web site (http://www.ncbi.nlm.nih.gov/) exemplifies the flood of data and analysis tools that appeared in many new forms during NCBI's first ten years. This year-by-year tour marks highlights from NCBI's first decade. A detailed description of all the products and services mentioned in Figure 3 is available at the NCBI home page.
The flood of sequence data, mainly from completed genomes, has led to problems with quality control and with the speed at which the data can be searched and analyzed [8-10]. Most biological databases now need to be reengineered and updated to cope with this deluge of information. As stated by Baker and Brass, databases can be split into four broad categories based on the source of their data:
1. Primary databases: These contain one principal kind of information (e.g., sequence data), which may be derived from many sources, such as large-scale sequencing projects, individual submissions, the literature, and other databases. Examples of primary databases include protein sequence databases such as the Protein Information Resource (PIR).
2. Secondary databases: These contain one principal kind of information (e.g., alignment data), which is derived solely from other databases. The data may be a straightforward subset of another database or may be derived by analysis of another database. Examples of secondary databases include motif and pattern databases like BLOCKS (http://www.blocks.fhcrc.org/), derived from PROSITE (http://expasy.hcuge.ch/sprot/prosite.html), and PRINTS (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/), derived from OWL (http://www.biochem.ucl.ac.uk/bsm/dbbrowser/OWL/OWL.html). These databases are described in detail in Sec. IV.
3. Knowledge databases: These are specialist databases containing related information from many sources, such as the literature, expert input, and other databases. Examples of knowledge databases include structural classification databases like SCOP (http://scop.mrc-lmb.cam.ac.uk/), built on the PDB (http://www.pdb.bnl.org/), and the E. coli biochemical pathway tool EcoCyc (http://ecocyc.panbio.com/ecocyc/).
4. Integrated database systems: These are combinations of primary and secondary databases. Examples of integrated databases include corporate genomic databases developed by pharmaceutical and biotechnology companies as one-stop sources of genomic information for their biologists.
The different types of databases described above serve as repositories for storing and accessing biological data. As mentioned earlier, the goal is to use these data for high-throughput identification of novel targets. Two resources widely used for identifying novel targets are the Expressed Sequence Tag (EST) databases and the high-throughput genomic sequences from the International Human Genome Project. These two high-impact data sources are described in the next subsections.
A. Expressed Sequence Tag (EST) Database
ESTs provide a direct window onto the expressed genome. They are single-pass partial sequences from cDNA libraries. The main public source of EST sequences is dbEST (http://www.ncbi.nlm.nih.gov/dbest/), and the EST clones are available from the I.M.A.G.E. Consortium (Integrated Molecular Analysis of Genomes and their Expression; http://www.bio.llnl.gov/bbrp/image/image.html). As of March 14, 2001, there were 7,550,778 ESTs in the public domain. The number of publicly available ESTs for human, mouse, rat, nematode, fruit fly, and zebrafish is given in Table 1.
EST databases and cDNA sequencing are now used widely as part of both academic and commercial gene discovery projects. With the availability of high
Table 1 Number of Available ESTs in the Public Domain (as of 03/14/2001)
Homo sapiens (human)                    3,288,343
Mus musculus and domesticus (mouse)     1,955,500
Rattus species (rat)                      265,763
Caenorhabditis elegans (nematode)         109,215
Drosophila melanogaster (fruit fly)       116,471
Danio rerio (zebrafish)                    85,586
performance computing and a sharp decline in its cost, large-scale analysis and review of EST sequence data, particularly with regard to data quality, are now possible. The EST data, in addition to being a source of novel targets, are also linked with other genomic information in databases such as EGAD and XREFDb.
The analysis of EST databases has resulted in the identification of several successful novel targets. One of the success stories of using EST databases is the identification and subsequent cloning of a candidate gene for chromosome 1 familial Alzheimer's disease (STM2). Another example is the identification of hMLH1, a human homologue of the bacterial DNA mismatch-repair gene mutL, found by way of its yeast counterpart. Missense mutations in hMLH1 have been associated with chromosome-3-linked hereditary nonpolyposis colorectal cancer [15,16]. This success and the potential for novel gene discovery from EST databases have also been capitalized on by biotechnology companies such as Human Genome Sciences (http://www.humangenomesciences.com) and Incyte (http://www.incyte.com).
EST sequences are generated by shotgun sequencing methods. Because the sequencing is random, the same sequence can be generated several times, resulting in a large amount of redundancy in the database. Large-scale bioinformatics and experimental comparative genomics are complex and time-consuming, so one challenge is to eliminate this redundancy. Sequence-cluster databases such as UniGene, EGAD, and STACK (sequence tag alignment and consensus knowledgebase) address the redundancy problem by coalescing sequences that are sufficiently similar that one may reasonably infer they are derived from the same gene. Many companies, for example, Celera (http://www.celera.com) and Incyte, have their own clustering software. A commercial clustering package based on the D2 algorithm is marketed by DoubleTwist (http://doubletwist.com/). In addition to EST sequences, there is another major source of sequence data in the public domain: the human genomic data, which is the focus of the next section.
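The coalescing idea behind sequence-cluster databases can be illustrated with a toy greedy procedure. This is purely a sketch: the EST reads below are invented, and the k-mer-overlap similarity measure is a simplistic stand-in for the alignment-based and D2-style comparisons actually used by UniGene, STACK, and commercial tools.

```python
def kmers(seq, k=8):
    """Return the set of k-mers (length-k substrings) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similar(a, b, k=8, threshold=0.5):
    """Crude similarity: fraction of shared k-mers relative to the smaller set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return False
    return len(ka & kb) / min(len(ka), len(kb)) >= threshold

def cluster_ests(seqs, k=8, threshold=0.5):
    """Greedy single-linkage clustering: each EST joins the first existing
    cluster containing a sufficiently similar member, else starts its own."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if any(similar(s, m, k, threshold) for m in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

# Hypothetical ESTs: two overlapping reads from one "gene", one unrelated read.
ests = [
    "ATGGCGTACGTTAGCTAGGCTAACGTTAGC",
    "GTACGTTAGCTAGGCTAACGTTAGCAATGC",   # overlaps the first read
    "TTTTCCCCGGGGAAAATTTTCCCCGGGGAA",   # unrelated sequence
]
print(len(cluster_ests(ests)))  # → 2: the overlapping reads coalesce
```

The quadratic all-against-all comparison here is exactly why production clustering pipelines rely on indexing and heuristics; the sketch only conveys the single-linkage "same gene" inference.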
B. Human Genome Data from the Human Genome Project

The main public-domain source for human genome data is the Human Genome Project (HGP). This international research program was designed to construct detailed genetic and physical maps of the human genome, to determine the complete nucleotide sequence of human DNA, to localize the estimated 30,000-40,000 genes within the human genome, and to perform similar analyses on the genomes of several other organisms used extensively as model systems in research laboratories. The scientific products of the HGP will comprise a resource of detailed information about the structure, organization, and function of human DNA: the information that constitutes the basic set of inherited "instructions" for the development and functioning of a human being. The working draft of the human genome was published on June 26, 2000, and the International Human Genome Sequencing Consortium published "Initial sequencing and analysis of the human genome" in the journal Nature on February 15, 2001. The sequencing status of the HGP as of March 14, 2001 was:
Finished sequence: 1,040,372 kb (32.5% of the genome)
Working draft sequence: 1,951,344 kb (61.0% of the genome)
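As a quick sanity check, the two status figures above imply a consistent total genome size; a few lines of arithmetic, using only the numbers quoted here, recover the familiar estimate of roughly 3.2 billion base pairs:

```python
# Consistency check on the HGP status figures quoted above:
# each (kilobases, fraction-of-genome) pair implies a total genome size.
finished_kb, finished_frac = 1_040_372, 0.325
draft_kb, draft_frac = 1_951_344, 0.610

total_from_finished = finished_kb / finished_frac   # ~3.20e6 kb
total_from_draft = draft_kb / draft_frac            # ~3.20e6 kb

# Both estimates agree to within a fraction of a percent.
print(round(total_from_finished / 1e6, 2))  # → 3.2 (billion base pairs)
print(round(total_from_draft / 1e6, 2))     # → 3.2
```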
It is estimated that 3% of the total human genome encodes proteins. The major challenge in analyzing genomic DNA sequence is to find these protein-coding regions and other functional sites. Current laboratory methods are adequate only for characterizing sequences of a few hundred bases at loci of special interest (e.g., disease genes); they are laborious and not suitable for annotating multimegabase-long anonymous sequences. Computational methods are a potential alternative for characterizing and annotating these megabase-long sequences, either in an automated or a semiautomated way.
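The simplest computational approach to locating candidate protein-coding regions is open-reading-frame (ORF) scanning. The sketch below is illustrative only, with an invented toy sequence; real gene finders for human DNA must additionally model exon/intron structure, splice sites, codon bias, and other signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=10):
    """Naively scan the three forward reading frames for open reading
    frames: an ATG start codon followed in-frame by a stop codon.
    Returns (start, end) coordinates, end exclusive of nothing
    (i.e., the stop codon is included in the span)."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Hypothetical toy sequence: start codon, five lysine codons, stop codon.
seq = "ATG" + "AAA" * 5 + "TAA"
print(find_orfs(seq, min_codons=3))  # → [(0, 21)]
```

Even this naive scanner hints at why annotation is hard: on anonymous multimegabase sequence, spurious ORFs abound, which is what motivates the statistical gene-prediction tools discussed next.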
Several computational tools have been developed in recent years to tackle the gene prediction problem. A listing of gene identification resources freely available for academic use is given in Table 2. It is important to distinguish two different goals in gene-finding research. The first is to provide computational methods to aid in the annotation of the large volume of genomic data produced by genome sequencing efforts. The second is to provide a computational model to help elucidate the mechanisms involved in transcription, splicing, polyadenylation, and other critical steps from genome to proteome. No single computational gene-finding approach will be optimal for both of these goals. Also, no single gene-finding tool can claim to be successful in completely identi-