3 Application of Phylogenetic Algorithms to Assess Rab Functional Relationships

By Ruth N. Collins


Researchers looking to solve biological problems have access to enormous amounts of sequence information and the desktop computational infrastructure to personally interrogate and analyze large datasets. Many powerful bioinformatics tools are available online; however, reliance on these web-based tools discourages the customized analysis of data that is necessary for the experimental scientist to make maximally effective use of the information. In addition, a customized environment facilitates the critical evaluation of bioinformatic methods. This chapter presents a protocol developed to aid in the classification of subfamilies and subclasses of a superfamily using a personal desktop computer. The visual representation of the qualitative and quantitative results of data analyses is also considered. The examples focus on Rab GTPases but are more widely applicable to the classification of any given protein family.


Protein sequence search algorithms are powerful tools in bioinformatics. They are used to identify functional relationships, define subgroups of large protein families, and dissect the relationship between structure and function at the molecular level. The explosion of publicly accessible databases and the revolution in desktop computing power have put the experimental biologist in the pilot seat, facilitating the collection and analysis of sequence data. For the experimental biologist who is most cognizant of the biological questions, these in silico methods offer valuable tools to examine hypotheses and, importantly, can also help narrow the range of experimental options. A commonly encountered issue is that taking full advantage of phylogenetic algorithms and critically evaluating their output require a certain familiarity with statistics and computational science; adequate integration of these topics remains a challenge in the training of biomedical research scientists in cell and molecular biology (Bialek and Botstein, 2004). Informatic analyses of large datasets also raise the question of how best to present the output. For maximum clarity, a visual presentation is preferable, and the example below outlines a simple method that generates a graphic representation of a large phylogenetic dataset to identify subclasses of the Rab GTPase family.

Generation of Protein Sequence Alignment

1. The first stage is to mine publicly available sequence information to generate a customized database containing all the protein sequences of interest. Using a web browser, navigate to the BLAST (Altschul et al., 1990) website (http://www.ncbi.nlm.nih.gov/BLAST/) and search for sequences related to the sequence of interest. Visually inspect the BLAST results and select sequences for downloading, taking care to avoid duplicates. Select sequences that are more distant from the original search sequence and use these sequences as seeds for further BLAST searching, a process termed sequence space hopping. Check both single-pass methods and pattern-based searching such as PSI-BLAST (Altschul et al., 1997). The goal is to obtain as many sequences as possible in the twilight zone of 20-30% sequence similarity while avoiding false positives.

2. It is not unusual for some database records to be of low accuracy and it is advisable to manually inspect each sequence and, where possible, to correlate the file information with EST databases and check the predicted splice sites of hypothetical ORFs generated from genomic sequencing.

3. Clustal X is a general purpose multiple alignment program for DNA or proteins that uses a window interface for sequence input and display (Chenna et al., 2003; Thompson et al., 1997). Clustal X is open source software that is available for different platforms, and a compiled version for Mac OS X can be downloaded from http://www.embl.de/~chenna/clustal/darwin/. The quality of the alignment is critical and needs to be fine-tuned manually through the examination and modification of penalties in different regions of the alignment.

4. Once the multiple sequence alignments have been obtained, the next step is to convert the alignments into a matrix in which each value is a score representing the relationship between a pair of sequences. As the proteins are roughly of similar overall length, one method for doing this is to create a percent identity matrix based on the alignment. MegAlign (DNAStar, Madison, WI), a commercial implementation of the Clustal W algorithm, can perform this calculation to one decimal place. Alternatively, the bioinformatics group at Cornell has created a web page where the Clustal file can be uploaded to return the percent identity matrix (http://ser-loopp.tc.cornell.edu/cbsu/align_convert.htm). Sequence identity, however, does not take into account the chemical nature of amino acids or the detailed knowledge that has been accumulated regarding amino acid mutation frequencies. A better scoring method is to use a weighted model of amino acid replacement to create a sequence similarity table. One method for doing this is to output the Clustal alignment file in PHYLIP format, which can then be used as input to the PHYLIP ProtDist program to generate the pairwise distances between aligned sequences (http://evolution.genetics.washington.edu/phylip.html).
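
The percent identity calculation described in step 4 can be sketched in Python. This is a minimal illustration only, not the procedure used by MegAlign or the Cornell web page: the function names, the decision to skip alignment columns in which either sequence has a gap, and the short toy sequences are all assumptions made for the example.

```python
from itertools import combinations

GAP = "-"  # assumed gap character in the aligned sequences


def percent_identity(seq_a, seq_b):
    """Percent identity of two aligned sequences, to one decimal place.

    Columns where either sequence has a gap are ignored (an assumed
    convention; other tools may count gaps differently).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP or b == GAP:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return round(100.0 * matches / compared, 1)


def identity_matrix(alignment):
    """Build a symmetric percent identity matrix from {name: aligned_sequence}."""
    names = list(alignment)
    n = len(names)
    matrix = [[100.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        value = percent_identity(alignment[names[i]], alignment[names[j]])
        matrix[i][j] = value
        matrix[j][i] = value
    return names, matrix


# Hypothetical toy alignment (made-up strings, not real Rab sequences).
toy_alignment = {
    "seqA": "MKT-GLLVIG",
    "seqB": "MKTAGLLVIG",
    "seqC": "MRT-GILV-G",
}
names, matrix = identity_matrix(toy_alignment)
for name, row in zip(names, matrix):
    print(name, row)
```

Each row of the resulting matrix describes one sequence by its identity to every other sequence, which is the form of input that the principal components analysis in the next section expects.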

Principal Components Analysis of Alignment Data

5. The matrix containing the sequence relationship values is then analyzed by principal components analysis (PCA). PCA is, in general, a multivariate statistical method for finding linear combinations of variables that can be grouped together and specified by a single variable or component. Standard computer programs are available to do these types of analyses, and the statistics toolbox from MATLAB (The MathWorks, Inc.) is recommended. In theory, there are as many principal components as there are variables, but in practice, usually only a few of the principal components need to be identified to account for most of the data variance. The goal in this analysis is to take a large dataset with n variables and find a small number of principal components (pcp) that encompass most of the data variability.
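
For readers without access to MATLAB, the PCA in step 5 can be sketched with NumPy. This is a substitute illustration rather than the toolbox routine recommended above: it assumes each row of the matrix is one sequence and each column one variable, and it computes the components from a singular value decomposition of the column-centered data. The function name `pca` and the 4 x 4 toy matrix are hypothetical.

```python
import numpy as np


def pca(matrix):
    """Return (scores, explained_variance_ratio) for a rows-by-variables matrix."""
    data = np.asarray(matrix, dtype=float)
    # Center each column so the components describe variance about the mean.
    centered = data - data.mean(axis=0)
    # SVD of the centered data; singular values come back in descending order.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every row on every principal component.
    scores = u * s
    # Variance carried by each component, then its share of the total.
    variances = s ** 2 / (data.shape[0] - 1)
    total = variances.sum()
    ratio = variances / total if total > 0 else np.zeros_like(variances)
    return scores, ratio


# Hypothetical distance-like matrix with two obvious groups (rows 0-1, rows 2-3).
toy = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.9, 0.8],
    [0.8, 0.9, 0.0, 0.1],
    [0.9, 0.8, 0.1, 0.0],
]
scores, ratio = pca(toy)
print(np.round(ratio, 3))         # share of variance per component
print(np.round(scores[:, 0], 3))  # first-component coordinate of each row
```

In this toy case the first component alone separates the two groups; with real data, any pair of columns of `scores` can be passed to a plotting library to produce a two-dimensional scatter plot.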

6. Typically, the sum of the variances of the first few principal components exceeds 80% of the total variance of the original data and the pcp output can then be plotted. In the example in Fig. 1, a two-dimensional plot of the second and third principal components enables the visualization of the sequence variability among 560 unique Rab GTPase sequences, culled from a rigorous analysis of the NCBI database in early 2004. Notably, this analysis identifies 10 major groups or subclasses of Rab proteins. These subclasses contain orthologs among different species, while paralogs within a species are distributed among groups. The area of the plot where x > 0.02 and y > 0.02 shows many sequences that do not fall into clusters and is shown in greater detail in Fig. 2A. Figure 2A indicates that in the most highly studied examples, the data points in this region are derived from Rab proteins known to regulate exocytic function, as the area is bounded by subclasses of sequence groups that include Rab27, Rab3, and Rab8. This analysis might suggest that the outliers in the plot (identified in Fig. 2B) may have a common involvement in exocytic processes, whether ubiquitous, or regulated traffic with vesicles of either biosynthetic or lysosomal/endocytic origin. One implication evident from the plot is that a major evolutionary divergence in Rab sequences lies in their regulation of the exocytosis of specialized organelles. Such organelles are found in differentiated cells of higher eukaryotes (e.g., exocytosis of lung surfactant from type II alveolar cells) and in single-celled eukaryotes such as apicomplexans (e.g., rhoptries of Toxoplasma gondii), but not in more streamlined model eukaryotes such as Saccharomyces cerevisiae, whose entire complement of Rab protein sequences can be grouped within the major subclasses of Rabs identified in Fig. 1.
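
The 80% criterion at the start of step 6 amounts to a cumulative sum over the component variances. The sketch below shows one way to count how many leading components are needed; the function name `components_needed`, the default threshold argument, and the toy variances are assumptions for illustration and are not taken from the Rab dataset.

```python
def components_needed(variances, threshold=0.80):
    """Smallest number of leading components whose summed variance
    exceeds `threshold` as a fraction of the total variance."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    # Principal components are conventionally ordered by decreasing variance.
    ordered = sorted(variances, reverse=True)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return count
    return len(ordered)


# Hypothetical component variances (made-up numbers).
toy_variances = [6.0, 2.5, 1.0, 0.3, 0.2]
k = components_needed(toy_variances)
print(k)
```

The comparison is strict, matching the wording "exceeds 80%"; a different cutoff can be supplied through the `threshold` argument.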

7. An understanding of molecular mechanism requires a detailed analysis of the residues contributing to variability. Therefore, a valid question is which individual residues are responsible for the clustering of the Rab subclasses. Casari et al. (1995) have made use of PCA to identify such discriminant residues, where the principal components are calculated for the actual amino acid identities in the homology alignment. The problem is depicted in Fig. 3, which shows an alignment of consensus sequences for the Ras, Rho, and Rab subfamilies of the Ras superfamily. The identity of conserved positions is indicated with uppercase bold
