5. Interpretation of Results
This is the most challenging part of the search process. The scores that the program calculates using statistical measures serve as guidelines, and they are useful most of the time. In cases of weak similarity or alignments with low statistical significance, biological knowledge and intuition can help in the interpretation. A biologically significant alignment need not be statistically significant, and vice versa. The common way to interpret the results is to use the E value. The E value of a match is the expected number of sequences in the database that would achieve a given score by chance. Because E values depend on the number of entries in the database, they change as the database grows or shrinks. Nonetheless, low E values (1e-5 or less) can suggest meaningful similarity. It is highly advisable to look at the actual alignment and check which region of the query matches the target. If this region is functionally important and the residues known to have a functional role are matched, then the alignment may be biologically significant even if it has a high E value.
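The dependence of the E value on database size can be illustrated with a short sketch. It assumes the standard Karlin-Altschul relation for normalized bit scores, E = m * n * 2^(-S'), where m is the query length and n is the total database length in residues. The text above speaks of the number of entries, so using total length is a simplification introduced here, and real search programs apply further corrections. The function name and the numbers are illustrative only.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with at least this bit score.

    Uses E = m * n * 2**(-S'), with m = query length and n = total
    database length (an assumption; edge-effect corrections are ignored).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same alignment score, same query, two hypothetical database sizes.
small_db = expected_hits(bit_score=50.0, query_len=300, db_len=100_000_000)
large_db = expected_hits(bit_score=50.0, query_len=300, db_len=200_000_000)

# Doubling the database doubles the E value of an otherwise identical hit.
print(large_db / small_db)
```

Under this relation an identical alignment becomes less significant simply because the database has grown, which is why E values from searches against different database releases should not be compared directly.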
The presence of sequence similarity allows the inference of homology, and homology can in turn help to infer whether two sequences share a function. Inferring function from homologous matched sequences can be tricky. If the score is good and the alignment covers the entire protein, then there is a very good chance that the two sequences share the same or a related function. If only part of the target sequence matches the query, the two might share a single domain that contributes only one aspect of the overall function. This situation is common with multidomain proteins. One should be cautious before jumping to any functional conclusion. An EST that matches the zinc finger domain of a nuclear hormone receptor (NHR) need not come from a gene encoding an NHR; the match may be to a DNA binding domain shared by many families of proteins. Many sequences have diverged greatly during evolution, and their relationship cannot be detected by simple sequence similarity search methods. Thus failure to find a significant match does not indicate that no homologues exist in the database. It suggests instead that a more powerful computational tool, one that goes beyond simple pairwise sequence similarity, should be used. This leads us to the other database search method, which is based on profiles.
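The distinction between a full-length match and a domain-only match can be made explicit by computing how much of each sequence the alignment covers. The following is a minimal sketch; the function names, the 1-based inclusive coordinates, and the 0.8 coverage threshold are assumptions chosen for illustration, not values from the text.

```python
def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by an alignment (1-based, inclusive)."""
    if not 1 <= start <= end <= length:
        raise ValueError("alignment coordinates fall outside the sequence")
    return (end - start + 1) / length


def classify_hit(q_start: int, q_end: int, q_len: int,
                 t_start: int, t_end: int, t_len: int,
                 min_cov: float = 0.8) -> str:
    """Label a hit by how much of the query and target it spans.

    min_cov is an arbitrary illustrative threshold.
    """
    q_cov = coverage(q_start, q_end, q_len)
    t_cov = coverage(t_start, t_end, t_len)
    if q_cov >= min_cov and t_cov >= min_cov:
        return "full-length match"
    return "partial match (possible shared domain)"


# A 300-residue query aligned along nearly all of a 310-residue target.
print(classify_hit(1, 290, 300, 5, 295, 310))
# A 70-residue fragment matching one region of a 600-residue target.
print(classify_hit(1, 70, 70, 200, 269, 600))
```

A hit flagged as partial is a prompt to inspect the domain architecture of the target before transferring any functional annotation.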
Analysis of a multiple sequence alignment can reveal gene functions that are not clear from simple pairwise sequence alignment. Software packages are available that take a multiple sequence alignment and build a profile from it. As stated by Sean Eddy, a profile incorporates position-specific information that is derived from the frequency with which a given residue (an amino acid or a nucleic acid base) is seen in an aligned column. The residues that make up active sites, ligand binding pockets, or functional motifs tend to be well conserved within sequence families. Using this information, which covers both conserved and less-conserved positions, a sensitive database search is possible.
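The idea of position-specific information can be sketched by counting residue frequencies in each column of a toy alignment. This is only the frequency part of a profile: real profile software also adds pseudocounts and models insertions and deletions, which are omitted here. The three sequences are invented for illustration.

```python
from collections import Counter


def column_frequencies(alignment: list[str]) -> list[dict[str, float]]:
    """Per-column residue frequencies for an alignment of equal-length rows.

    Gap characters ('-' and '.') are ignored when counting.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")

    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in "-."]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: c / total for r, c in counts.items()} if total else {})
    return profile


# Toy alignment: columns 1, 3 and 4 are conserved, column 2 varies.
toy_alignment = ["CKGC", "CRGC", "CK-C"]
profile = column_frequencies(toy_alignment)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Conserved columns show a single residue with frequency 1.0, while variable columns spread the frequency across several residues; a profile search rewards matches at the conserved positions more strongly.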
Much of the new software for profile searches is based on statistical models called hidden Markov models (HMMs). This section is an introduction to profile-based HMM methods; more comprehensive reviews are available elsewhere. Profile-based searches can be done in two ways: by using publicly available HMM profiles, or by creating a new HMM profile from aligned sequence data.
Pfam is a database of protein domain families. It is available in the public domain at http://www.sanger.ac.uk/Software/Pfam/ and http://www.cgr.ki.se/Pfam/ (Europe) and at http://pfam.wustl.edu/ (USA). Using the publicly available HMM profiles is convenient if the domain of interest is already present in the Pfam database.
Pfam contains curated multiple sequence alignments for each family. These multiple sequence alignments are used to create HMM profiles, which are then used to identify protein domains in uncharacterized sequences. Pfam also contains functional annotation, literature references, and database links for each family. There are two multiple sequence alignments for each Pfam family: the seed alignment, which contains a relatively small number of representative members of the family, and the full alignment, which contains all members in the database that can be detected. All sequences are taken from pfamseq, a nonredundant set composed of SWISS-PROT and SP-TrEMBL. Pfam-A is the collection of protein family alignments that were constructed semiautomatically using HMMs. Sequences that were not covered by Pfam-A were clustered and aligned automatically and are released as Pfam-B.
The Pfam distribution contains a number of files: Pfam-A.seed, Pfam-A.full, Pfam, PfamFrag, SwissPfam, Pfam-B, diff, and pfamseq. Pfam-A.seed and Pfam-A.full contain the seed and full alignments, with their annotation, in a marked-up alignment format called the Stockholm format. The Pfam file contains the library of Pfam profile HMMs. Any given sequence can be searched against this file to find the Pfam domains present in it. The Pfam models are defined iteratively: they start from clear homologues and incorporate increasingly distant family members in the process.
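As an illustration of the Stockholm format, the following minimal sketch reads the aligned sequences from a Stockholm-style text. It relies only on the basic conventions of the format (markup lines begin with "#", sequence lines hold a name followed by aligned text, and "//" ends an alignment) and reads only the first alignment in the input. The family name and sequences in the example are invented.

```python
def parse_stockholm(text: str) -> dict[str, str]:
    """Return {sequence name: aligned sequence} for the first alignment.

    Markup lines (starting with '#') are skipped; interleaved blocks for
    the same sequence name are concatenated.
    """
    sequences: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or annotation markup
        if line == "//":
            break  # end of the first alignment
        name, chunk = line.split(None, 1)
        sequences[name] = sequences.get(name, "") + chunk.replace(" ", "")
    return sequences


# Hypothetical two-sequence alignment in Stockholm layout.
example = """# STOCKHOLM 1.0
#=GF ID toy_family
seq1/1-4  CKGC
seq2/1-4  CRGC
//
"""
aligned = parse_stockholm(example)
print(aligned)
```

The annotation carried on the "#" lines is discarded here; a fuller reader would keep it, since it is what makes Stockholm a marked-up format rather than a plain alignment.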
PfamFrag is a library of profile HMMs designed specifically to find matches to protein fragments. SwissPfam is a file containing the domain organization for each protein in the database. Pfam-B contains the data for Pfam-B families in the Stockholm format; these families come from sequences that are not covered when Pfam-A is generated and are instead clustered and aligned automatically. The diff file contains the changes between releases, which allows incremental updates of Pfam-derived data. The pfamseq file contains the underlying sequence database, in fasta format, from which all sequences in Pfam are taken. The Pfam package contains the above-mentioned files, and executables are available for different operating systems.
Figure 5 Steps in creating a new HMM profile: collect a nonredundant set of sequences belonging to a novel family; generate a multiple sequence alignment; build the profile from the alignment and calibrate it (calibration estimates the parameters needed for calculating E values in database searches); run HMMsearch to search a database using the newly created profile.
HMMER is a freely distributed implementation of profile HMM software for protein sequence analysis. It is available at http://hmmer.wustl.edu. There are currently nine programs in the HMMER package. These programs can be used to search a protein database or to create a new HMM profile. If the domain of interest is not present in the Pfam database, then the user has to create a new HMM profile for the desired domain. The steps for creating a new profile are presented in Figure 5. Descriptions of the nine programs and an online manual for HMMER are available at the HMMER URL given above. For nucleic acid analysis, a newer package called Wise2 is available from the Sanger Centre, UK (http://www.sanger.ac.uk/Software/Wise2/index.shtml). It can compare a single protein or a profile HMM to a genomic DNA sequence and predict a gene structure. The genomic sequence analysis algorithm is called Genewise, and the corresponding one for ESTs is called ESTwise.
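The workflow of Figure 5 can be sketched as a sequence of HMMER command lines. The sketch below only assembles the commands and does not run them. It assumes the basic invocations hmmbuild (profile file, then alignment file), hmmcalibrate (profile file), and hmmsearch (profile file, then sequence database); hmmcalibrate belongs to the HMMER 2 releases and is absent from HMMER 3, so it is optional here. All file names are hypothetical.

```python
def profile_search_commands(alignment_file: str, hmm_file: str,
                            sequence_db: str,
                            calibrate: bool = True) -> list[list[str]]:
    """Assemble HMMER commands to build, calibrate, and search with a profile.

    Set calibrate=False for HMMER versions that have no hmmcalibrate step.
    """
    commands = [["hmmbuild", hmm_file, alignment_file]]
    if calibrate:
        # Estimates the parameters needed for E value calculation.
        commands.append(["hmmcalibrate", hmm_file])
    commands.append(["hmmsearch", hmm_file, sequence_db])
    return commands


# Hypothetical file names for a newly aligned family.
steps = profile_search_commands("novel_family.sto", "novel_family.hmm",
                                "protein_db.fasta")
for step in steps:
    print(" ".join(step))
```

Each command list could be passed to subprocess.run(step, check=True) on a system where the matching HMMER version is installed; the options accepted by each program should be checked against the manual for that version.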
The earlier sections gave an overview of the sequence-based and profile-based methods. The next step is to use this information to identify novel proteins of interest, and the following section describes the application of these methods to novel gene discovery.
C. Identification of a Novel Protein: An Example
This example illustrates the use of database search methods and the application of the strategy for identifying novel genes described above. The example chosen here is the family of G-protein coupled receptors (GPCRs). They are excellent drug targets, and approximately 50-60% of marketed drugs act on GPCRs. With recent advances in genomics, more information on the functional role of GPCRs is available, and an increasing number of mutations in GPCRs have been found to be associated with disease conditions.
Figure 6 gives an overview of the protocol that can be used to identify novel GPCRs. This method can be applied to any protein family. It is advisable to use both sequence-based and profile-based search methods to ensure that the search is as complete as possible and does not miss any weak homology hits. This type of process can be easily automated by using PERL scripts. Designing a user-friendly web interface for accessing the search output data would help biologists to browse through the results.
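The step of combining sequence-based and profile-based results can be sketched as follows. The text mentions PERL for automation; Python is used here purely for illustration. The sketch assumes that each search has already been parsed into a mapping from sequence identifier to E value, and it keeps every candidate rather than applying a cutoff so that weak hits are not lost. The identifiers and E values are invented, and taking the minimum E value across methods is a simplification, since E values from different programs are not strictly comparable.

```python
def merge_hits(sequence_hits: dict[str, float],
               profile_hits: dict[str, float]) -> dict[str, dict]:
    """Combine hits from two search methods, recording which method found each.

    Returns {sequence id: {"methods": [...], "best_evalue": float}}.
    """
    merged: dict[str, dict] = {}
    for method, hits in (("sequence", sequence_hits), ("profile", profile_hits)):
        for seq_id, evalue in hits.items():
            entry = merged.setdefault(
                seq_id, {"methods": [], "best_evalue": evalue})
            entry["methods"].append(method)
            entry["best_evalue"] = min(entry["best_evalue"], evalue)
    return merged


# Hypothetical parsed results from the two kinds of search.
sequence_hits = {"cand1": 1e-30, "cand2": 0.5}
profile_hits = {"cand2": 1e-8, "cand3": 1e-4}

candidates = merge_hits(sequence_hits, profile_hits)
for seq_id, info in sorted(candidates.items()):
    print(seq_id, info["methods"], info["best_evalue"])
```

A candidate such as cand2, which is weak in the sequence-based search but strong in the profile-based search, is exactly the kind of hit that using both methods is meant to recover.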