So far, we have discussed the various ways of finding novel proteins related to a query of interest. The converse problem arises when you have a set of sequences and would like to identify a putative functional assignment for them. The sequences can come from any source, such as a cDNA library, a subtractive hybridization study, transcriptional profiling, a chromosomal region implicated in a disease, or genomic data from the Human Genome Project. In the case of genomic data, gene finding tools have to be applied first to find the exons. The resulting exons can then be analyzed for functional assignment. As mentioned earlier, the Genewise suite of programs (http://www.sanger.ac.uk/Software/Wise2/index.shtml) can take a genomic sequence as input and search it against the HMM profile libraries. In this section, we assume that users have a set of sequences and no a priori functional knowledge about them.
The analysis pipeline is presented in Figure 7. The input sequences considered here are EST sequences and genomic data. The EST sequences have to be masked for repetitive elements, and low-quality sequence regions should be removed. They are then compared against the existing sequence contigs. Each EST will either become part of an existing contig or form a novel singleton cluster. Similarly, genomic sequences have to be masked, and the ORFs predicted by gene finding programs are compared to existing EST databases. As mentioned earlier, a new sequence can merge two clusters in the database, extend a cluster, or form a new singleton cluster. Whether to combine EST data with predicted genomic data is at the user's discretion; many companies may want to keep experimental data and predicted data separate. The first step is to analyze those sequences that are new and not present in internal databases, using a sequence-based approach. Partial matches with database sequences have to be analyzed carefully, since a partial match may reflect nothing more than a shared domain in a multidomain protein. Because ESTs and exons predicted from genomic data are themselves partial sequences, only partial matches to database sequences can be expected.
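The three clustering outcomes described above (merge, extend, or new singleton) can be illustrated with a minimal Python sketch. It assumes that an upstream comparison step, not shown here, has already reported which existing clusters a new sequence matches. The function name `place_sequence`, the cluster identifiers, and the dictionary layout are hypothetical and chosen only for illustration; they do not correspond to any particular clustering tool.

```python
from typing import Dict, Iterable, Set


def place_sequence(seq_id: str,
                   hit_cluster_ids: Iterable[str],
                   clusters: Dict[str, Set[str]]) -> str:
    """Place a new sequence into a cluster table and report the outcome.

    `clusters` maps a cluster id to the set of member sequence ids.
    `hit_cluster_ids` lists the clusters the new sequence matched
    (assumed to come from a prior similarity search).
    """
    # Keep only hits that refer to clusters that actually exist.
    hits = [c for c in sorted(set(hit_cluster_ids)) if c in clusters]

    if not hits:
        # No match to any existing cluster: novel singleton cluster.
        clusters[f"singleton_{seq_id}"] = {seq_id}
        return "new_singleton"

    target = hits[0]
    clusters[target].add(seq_id)

    if len(hits) == 1:
        # Matches exactly one cluster: the sequence joins (extends) it.
        return "extends_cluster"

    # Matches several clusters: the sequence bridges them, so merge
    # all of them into the first one.
    for other in hits[1:]:
        clusters[target] |= clusters.pop(other)
    return "merges_clusters"


if __name__ == "__main__":
    table = {"c1": {"est_a"}, "c2": {"est_b"}}
    print(place_sequence("est_x", [], table))
    print(place_sequence("est_y", ["c1"], table))
    print(place_sequence("est_z", ["c1", "c2"], table))
```

In practice the decision to merge would also depend on alignment quality and overlap length; this sketch deliberately ignores those criteria and only shows the bookkeeping.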