1UCLA-DOE Center for Genomics and Proteomics and UCLA Molecular Biology Institute, University of California, Los Angeles, California; 2Protein Pathways, Inc., Woodland Hills, California
The rapidly growing genomic sequence databases are creating new challenges concerning how to use genomic data to learn about the functions of proteins and their functional relationships with each other in the cell. A variety of experimental and computational approaches are emerging for deciphering which proteins are working with which others as parts of functional networks in the cell. Here, we review some of the various new computational techniques that are able to draw connections between distinct but functionally linked proteins based on certain patterns of occurrence or arrangement across multiple fully sequenced genomes.
In the last few years, advances in genomic and proteomics technologies have produced an explosion of raw data on biological systems at the molecular level. The rapidly growing number of organisms for which the genomes have been completely sequenced serves as a dramatic example. At the time of this writing, the genome sequences of more than 100 organisms are publicly available, including a draft sequence of the human genome released last year [2,3]. As a result, the speed with which protein sequences are being acquired has vastly outpaced our ability to assign functions to them directly by experimental (e.g., biochemical and genetic) methods. This growing disparity between known sequences and known functions for these proteins has created a unique challenge. How can we infer the functions of proteins on the genomic scale? A variety of methods have been devised to meet this post-genomic challenge. While some genome-wide analyses are mainly experimental in nature, others are predominantly computational, and some combine aspects of both approaches. In this chapter, we touch first on experimental approaches to genome-wide analysis (covered in more detail in Chapter II.C) and then focus on computational analyses of whole genomes.
Approaches to Analyzing Protein Functions on a Genome-Wide Scale
One theme emerging from recent work is that consideration of the genomic context of a protein can provide valuable information about the function of a protein, even in the absence of experimental studies. Analyses of various kinds of patterns across the burgeoning genomic databases can provide insight into functional relationships among distinct (nonhomologous) protein sequences. Consequently, this has led to a natural shift from asking what a particular protein does to asking what other biomolecules that protein interacts with or, to use a broader phrasing, is functionally linked to. This expanded perspective of protein function within the context of pathways and networks forms the basis for many of the recent developments in genomics.
Copyright © 2003, Elsevier Science (USA). All rights reserved.
Handbook of Cell Signaling, Volume 1
Some experimental genomic approaches make direct observations of functional linkages, while others require subsequent computational analyses to make statistical inferences about the existence of such linkages. Two techniques for making direct observations of physical protein-protein interactions are yeast two-hybrid methods [4-6] and mass spectrometry methods [7,8]. Both approaches are being applied on a genome-wide scale to generate maps of physical protein-protein interactions. Another genome-wide experimental approach, the synthetic genetic array, makes observations of functional linkages among proteins at the genetic rather than the physical level.
Some genome-wide studies combine experimentation and subsequent computational analysis. One example of this combined approach is the inference of functional linkages from mRNA expression data obtained from DNA microarrays. In these studies, computational analysis of the raw expression data produces functional linkages between genes for which expression patterns vary in correlated ways with respect to changes in variables such as time, growth conditions, or tissue type [10-14].
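As a rough illustration of linking genes whose expression patterns vary in correlated ways, the sketch below computes Pearson correlations between expression profiles and keeps pairs above a cutoff. The choice of Pearson correlation, the threshold of 0.9, the gene names, and the expression values are all assumptions made for this example; the cited studies use a range of measures and clustering schemes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return sxy / sqrt(sxx * syy)

def coexpression_links(expr, threshold=0.9):
    """Return (gene, gene, r) for every pair with correlation >= threshold."""
    genes = sorted(expr)
    links = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(expr[g], expr[h])
            if r >= threshold:
                links.append((g, h, r))
    return links

# Hypothetical expression values across five conditions (not real data).
expr = {
    "geneA": [0.1, 0.8, 1.5, 0.9, 0.2],
    "geneB": [0.2, 0.9, 1.6, 1.0, 0.3],
    "geneC": [1.4, 0.3, 0.1, 0.5, 1.2],
}
links = coexpression_links(expr)
for g, h, r in links:
    print(g, h, round(r, 3))
```

Only geneA and geneB are linked here, because geneC rises where the other two fall. In practice the cutoff would be calibrated rather than fixed in advance.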
Functional Linkages from Genome Sequence Data: Nonhomology Methods
The traditional computational method for inferring the function of an uncharacterized protein relies on establishing a statistically significant similarity between the sequence of the uncharacterized protein and that of a protein whose function has already been experimentally determined. The vast majority of entries in the sequence databases have acquired their functional annotations via this technique. Here, we refer to this large family of sequence-based approaches as the homology method because they assign functions to proteins based on homology. While this classical approach has played a major role in shaping molecular biology, its limitation is clear. It can only infer relationships between similar sequences. The homology approach does not shed light on functional linkages between different (nonhomologous) proteins. In one situation of special interest, typically accounting for a third to half of the open reading frames in a newly sequenced genome, a protein sequence from one genome may have homologs in other genomes, but it may be that none of these proteins has ever been characterized experimentally. Sequence comparison would tell us that these proteins are all evolutionarily related to each other, but nothing more.
A series of recent computational innovations (reviewed in references [15-18]), denoted here as nonhomology methods, utilize patterns discovered at the higher level of genomic organization to infer functional linkages between nonhomologous proteins. Such linkages provide a rich source of functional information, even for the problematic situation of proteins without any characterized homologs. We describe three different nonhomology methods followed by an illustration of their application.
Two or more proteins that act together in the cell as part of the same complex or pathway should all be present in any organism that uses that complex or pathway. Conversely, it is natural to expect them all to be absent from organisms that do not use that complex or pathway. A protein phylogenetic profile is a vector that describes the presence or absence of a particular protein across a set of genomes. Two or more different (nonhomologous) proteins that share very similar phylogenetic profiles are likely to be functionally coupled. Pellegrini et al. developed the phylogenetic profile method to establish functional linkages among proteins on the genomic scale. Related ideas and data structures were also discussed by others. Statistical treatments have improved the original calculations, and the profiles have been used in other applications such as predicting the subcellular localization of proteins.
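The profile comparison can be sketched as follows. Each profile is a tuple of 0/1 values, one per genome, and similarity is scored by a hypergeometric tail probability: the chance that two proteins placed at random, each in its observed number of genomes, would share at least as many genomes as observed. The hypergeometric score is one plausible choice assumed for this example, not necessarily the calculation used in the cited work, and the protein names and profiles are invented.

```python
from math import comb

def cooccurrence_pvalue(p, q):
    """Probability of sharing >= the observed number of genomes by chance.

    p, q: equal-length 0/1 sequences (presence/absence across genomes).
    Assumes genomes are independent observations, which is a simplification.
    """
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    n = len(p)                      # number of genomes
    a, b = sum(p), sum(q)           # genomes containing each protein
    k = sum(1 for x, y in zip(p, q) if x and y)  # genomes containing both
    total = comb(n, b)
    tail = sum(comb(a, i) * comb(n - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / total

# Hypothetical profiles over ten genomes (1 = homolog present).
prot_a = (1, 1, 0, 0, 1, 0, 1, 0, 0, 1)
prot_b = (1, 1, 0, 0, 1, 0, 1, 0, 0, 1)   # identical to prot_a
prot_c = (1, 0, 1, 0, 0, 1, 1, 0, 1, 0)   # shares two genomes with prot_a

p_ab = cooccurrence_pvalue(prot_a, prot_b)
p_ac = cooccurrence_pvalue(prot_a, prot_c)
print(p_ab, p_ac)
```

The identical pair gets a small probability and would be a candidate functional link; the pair sharing only two genomes is indistinguishable from chance.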
Two proteins, A and B, that are separate entities in one organism are sometimes found fused together in a single larger protein A-B in the genome of another organism. The evolutionary fusion of these two proteins is taken as evidence that they are functionally linked. The fusion protein is dubbed a Rosetta Stone because it allows a functional linkage to be drawn between the two separate proteins A and B. This idea was first applied on a genome-wide scale by Marcotte et al. and then by others.
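A minimal sketch of the Rosetta Stone idea, assuming that sequence alignments have already been computed elsewhere: two query proteins are linked when both align to the same protein in another genome but cover essentially non-overlapping regions of it. The input tuple layout, the `max_overlap` tolerance, and all names are hypothetical choices for illustration.

```python
from itertools import combinations

def rosetta_links(hits, max_overlap=10):
    """Find pairs of proteins that align to distinct parts of one protein.

    hits: iterable of (query, subject, start, end), where start/end are
    1-based residue positions of the alignment on the subject protein.
    Returns a set of (query_1, query_2, subject) with query_1 < query_2.
    """
    by_subject = {}
    for query, subject, start, end in hits:
        by_subject.setdefault(subject, []).append((query, start, end))

    links = set()
    for subject, segments in by_subject.items():
        for (qa, sa, ea), (qb, sb, eb) in combinations(segments, 2):
            if qa == qb:
                continue
            # Residues covered by both alignments (negative if disjoint).
            overlap = min(ea, eb) - max(sa, sb) + 1
            if overlap <= max_overlap:
                first, second = sorted((qa, qb))
                links.add((first, second, subject))
    return links

# Hypothetical alignments of three proteins to one candidate fusion protein.
hits = [
    ("protA", "fusion1", 1, 200),
    ("protB", "fusion1", 220, 450),
    ("protC", "fusion1", 10, 190),   # covers the same region as protA
]
links = rosetta_links(hits)
print(sorted(links))
```

protA and protC align to the same region of the candidate fusion protein, so they are more likely homologs of each other than fusion partners and are not linked. A real pipeline would also need to filter promiscuous domains, which this sketch ignores.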
Especially in prokaryotic organisms, functionally linked proteins are sometimes encoded near each other on the chromosome (e.g., as in operons). When two or more proteins tend to be encoded in proximity, especially in relatively divergent microorganisms, this argues strongly for a functional linkage between the proteins. The information embodied in conserved gene order or proximity was first applied on a genome-wide scale to establish possible functional linkages by Overbeek and coworkers [24,25].
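The gene neighbor idea can be sketched by counting the genomes in which two protein families are encoded within a few gene positions of each other. The data layout (family name mapped to gene index), the `max_gap` cutoff, and the toy genomes are assumptions for illustration; the sketch also ignores strand, circular chromosomes, and paralogs.

```python
def neighbor_support(genomes, fam_a, fam_b, max_gap=3):
    """Count genomes where two families are encoded close together.

    genomes: dict genome name -> dict family name -> gene index on the
    chromosome. Returns (close, both): the number of genomes where the two
    families lie within max_gap positions, and the number containing both.
    """
    close = both = 0
    for order in genomes.values():
        if fam_a in order and fam_b in order:
            both += 1
            if abs(order[fam_a] - order[fam_b]) <= max_gap:
                close += 1
    return close, both

# Hypothetical gene positions in four genomes (not real data).
genomes = {
    "genome1": {"famA": 10, "famB": 11, "famC": 400},
    "genome2": {"famA": 250, "famB": 252, "famC": 30},
    "genome3": {"famA": 77, "famB": 75},
    "genome4": {"famA": 5, "famC": 6},
}
ab = neighbor_support(genomes, "famA", "famB")
ac = neighbor_support(genomes, "famA", "famC")
print(ab, ac)
```

famA and famB are neighbors in every genome that contains both, which supports a link. famA and famC are adjacent in only one of three genomes, which is weak evidence on its own.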
To illustrate the ideas here, algorithms based on the three methods discussed above were applied to the genome of Escherichia coli, and the results for a well-known cell signaling pathway were investigated. The protein flgE was chosen as a somewhat arbitrary starting point for investigating the bacterial flagellar complex. Using the multiple methods, high confidence links were established for this protein. Subsequently, high confidence links were established to those proteins first connected to flgE. This process was repeated until links of third order from the central protein were included. The results are shown in Fig. 1. Many of the proteins involved in flagellar biosynthesis and assembly (flgA, flgB, flgC, flgD, flgE, flgF, flgG, flgH, flgI, flgJ, flgK, flgL, fhiA, fliE, fliF, fliP, fliR), export (flhA, flhB, fliH), and motor switching (fliG, fliM, fliN) were recovered. Links emanating from these flagellar proteins established connections to the chemosensing proteins (cheA, cheB, cheR, cheY, cheW, tap, tar, trg, tsr) whose signals ultimately drive the flagella, to the transcriptional proteins (fliA, rpoN) that regulate the production of flagellar proteins, and to the ATPase complex (fliI, atpA, atpB, atpC, atpG) that supplies energy for the flagellar motion. The illustration of the chemotaxis system in E. coli shows that in favorable cases these computational nonhomology methods not only can recover links among proteins involved in a complex or pathway but also can reveal higher order functional relations among the complexes and pathways.
Figure 1 An illustration of functional linkages inferred by computational analysis of genomic data (nonhomology methods). The flagellar protein flgE was taken as a query protein. Computational methods were used to predict functional linkages between flgE and other proteins in the E. coli genome. Subsequent links (of second and third order) were generated from these to others. (A) This procedure produced the network shown, which includes many proteins known to participate in motility and chemotaxis. The computed functional links include proteins involved in various aspects of this biological system, from signal transduction to flagellar assembly and regulation. Each link is coded according to the computational method by which it was inferred. Links from the method of phylogenetic profiles are in solid lines, gene neighbor links are dashed, and Rosetta Stone links are dotted. In some instances (not illustrated), multiple methods produced the same link. The three methods are shown at the bottom. Each panel illustrates the pattern in the genomic data that allowed one of the inferences at the top to be made. (B) The gene neighbor method draws a functional linkage between two proteins if they tend to be encoded in adjacent or nearby positions on the chromosomes of multiple organisms [25,26]. (C) In the method of protein phylogenetic profiles, the presence or absence of a protein across a set of genomes is analyzed. The two linked proteins shown have profiles for which the similarity is statistically significant. (D) In the Rosetta Stone method, the two separate proteins from one genome are functionally linked because they are found in some other genome as combined parts of a single larger protein [23,24].
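The iterative expansion from a query protein out to third-order links amounts to a breadth-first traversal of the link graph with a depth limit. The sketch below assumes the high confidence links are already available as a list of protein pairs; apart from the seed name flgE, the node names and edges are placeholders and do not reproduce the network in Fig. 1.

```python
from collections import deque

def expand_network(links, seed, max_order=3):
    """Return {protein: order} for proteins within max_order links of seed."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    order = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if order[node] == max_order:
            continue  # do not follow links beyond the requested order
        for nxt in neighbors.get(node, ()):
            if nxt not in order:
                order[nxt] = order[node] + 1
                queue.append(nxt)
    return order

# Placeholder links: a chain of four proteins plus one direct partner.
links = [
    ("flgE", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n4"),
    ("flgE", "n5"),
]
order = expand_network(links, "flgE")
print(sorted(order.items(), key=lambda kv: (kv[1], kv[0])))
```

Protein n4 lies four links from the seed and is therefore left out, mirroring the cutoff at third order described above.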
Other methods for inferring functional linkages have also been explored. For instance, mRNA expression data have been combined with promoter motif detection algorithms to identify regulatory networks [26,27]. In contrast to the analysis of large sets of experimental measurements, another computational approach seeks to distill large volumes of experimental results through the mining of the published literature [28-31]. These methods attempt to ascertain, in an automated fashion, the existence of experimentally established functional relationships among proteins from computational analysis of millions of biomedical literature abstracts. Efforts involving some amount of manual curation have also been conducted. The Database of Interacting Proteins (DIP) is the result of one such effort.
Computational methods like those discussed here provide only circumstantial evidence that various proteins are actually functionally linked in the cell. This makes quality control a particularly important problem. Two complementary approaches to this problem are the development of probabilistic models to evaluate statistical significance and the use of known functional relationships for benchmarking.
Statistical approaches for assessing inferences made by nonhomology methods have only begun to be addressed. One of the statistical difficulties that has not been explored deeply concerns how to handle correlated observations. For example, among the organisms whose genomes have been sequenced, some are much more closely related than others. This complicates the probabilistic treatment of features such as conserved relative positions (or presence versus absence) of proteins across the known genomes. Suppose for example that two (or more) proteins exhibit some genomic pattern that is evident only among very closely related organisms. Such a pattern has not survived over a long evolutionary time scale and so may not indicate a significant functional linkage between the proteins in question.
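To make the problem of correlated observations concrete, the sketch below collapses closely related genomes into a single clade before counting co-occurrences. This collapsing step is only one simple possibility, assumed here for illustration and not taken from the text; the clade assignments and profiles are invented.

```python
def collapse_by_clade(profile, clade_of):
    """Merge a genome-level presence profile into a clade-level profile.

    profile: dict genome -> 0/1; clade_of: dict genome -> clade label.
    A clade counts as 'present' if any of its member genomes has the protein.
    """
    collapsed = {}
    for genome, present in profile.items():
        clade = clade_of[genome]
        collapsed[clade] = max(collapsed.get(clade, 0), present)
    return collapsed

def shared_presence(p, q):
    """Number of entries in which both profiles record presence."""
    return sum(1 for key in p if p[key] and q.get(key))

# Four near-identical strains plus two distant genomes (hypothetical).
clade_of = {
    "strain1": "cladeX", "strain2": "cladeX",
    "strain3": "cladeX", "strain4": "cladeX",
    "genomeY": "cladeY", "genomeZ": "cladeZ",
}
prot_a = {"strain1": 1, "strain2": 1, "strain3": 1, "strain4": 1,
          "genomeY": 0, "genomeZ": 0}
prot_b = dict(prot_a)  # same pattern, seen only in the related strains

raw = shared_presence(prot_a, prot_b)
independent = shared_presence(collapse_by_clade(prot_a, clade_of),
                              collapse_by_clade(prot_b, clade_of))
print(raw, independent)
```

Four apparent co-occurrences shrink to a single independent one once the related strains are merged, which is the situation in which a pattern may not indicate a significant linkage.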
Regardless of the simplicity or sophistication of the statistical analyses performed, experience has shown that the various computational methods must be calibrated by examining how well they perform on proteins whose functions are already known. One reasonable benchmarking approach is to measure the fraction of predicted functional linkages that are corroborated by the linked proteins having similar functional categories or keywords in annotated protein databases (e.g., SWISS-PROT, MIPS, KEGG) [33-35]. A related strategy is to use the inferences from one computational method to evaluate another. The general idea of using multiple methods to generate linkages with higher confidence was first applied by Marcotte et al. Multiple sources of experimental measurements have also been used to similar effect.
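The keyword-based benchmark can be sketched as the fraction of predicted links whose two proteins share at least one annotation keyword. The rule of one shared keyword, the decision to skip links with an unannotated protein, and the example data are assumptions for illustration; real benchmarks depend on how the database categories are defined.

```python
def benchmark_links(links, annotations):
    """Fraction of annotated links whose proteins share a keyword.

    links: iterable of (protein, protein) predictions.
    annotations: dict protein -> set of functional keywords.
    Returns (fraction, scored), where scored is the number of links in which
    both proteins are annotated; fraction is None if no link can be scored.
    """
    scored = agree = 0
    for a, b in links:
        keys_a, keys_b = annotations.get(a), annotations.get(b)
        if not keys_a or not keys_b:
            continue  # cannot judge a link involving an unannotated protein
        scored += 1
        if keys_a & keys_b:
            agree += 1
    fraction = agree / scored if scored else None
    return fraction, scored

# Hypothetical predictions and keyword annotations.
predicted = [("p1", "p2"), ("p1", "p3"), ("p2", "p4"), ("p5", "p6")]
annotations = {
    "p1": {"motility"},
    "p2": {"motility", "chemotaxis"},
    "p3": {"transport"},
    "p4": {"chemotaxis"},
}
fraction, scored = benchmark_links(predicted, annotations)
print(fraction, scored)
```

Two of the three scorable links are corroborated. The link between p5 and p6 is exactly the kind of prediction about uncharacterized proteins that such a benchmark cannot assess directly.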