The robust sequencing of large numbers of ESTs is a well-established route to sampling the collection of genes that are actively expressed within a given tissue under particular developmental, homeostatic, biotic and abiotic conditions. This snapshot is, however, biased towards the sampled biological system. While such bias is often a primary objective of EST sequencing strategies, it also means that many gene sequences that either have low expression levels or are expressed in only a few cells in response to precise signals will remain unsampled. For many genomic approaches this bias is undesirable. There are solutions to this conundrum that fall within the realm of sub-genomic sampling. The technologies that we will address here include draft or partial genome sequencing, BAC end sequencing, methylation filtration of the gene space and high-C0t selection of low-copy, high-complexity DNA. Each of these approaches is directed at sampling a fraction of the genome in a gene-expression-independent manner.
Figure 1. Looking at the sampled species reveals two new facets of EST biology. Compared with earlier reviews that have considered EST collections and their taxonomic content (e.g. Rudd 2003, 2005), there is now a more meaningful sample of species that fall outside the angiosperm crop species (e.g. Adiantum, Nuphar and Saruma). Also of considerable note is that for several genera multiple species have been sampled (e.g. Populus, Lactuca, Pinus and Gossypium), illustrating that the power of comparative genomics is being applied to 'genomeless' organisms
The objectives of a typical genome project are to sequence the bulk of the accessible euchromatic DNA, and to assemble the resulting sequence reads into a minimal set of "contigs" representing large contiguous stretches of genomic DNA. The process of sequencing requires sufficient sequence redundancy that the total pool of underlying sequences can be unequivocally assembled into super-scaffolds whilst allowing for random sampling effects. The Arabidopsis genome has been sequenced to approximately seven genome equivalents (AGI 2000), meaning that the average DNA residue represents a consensus of at least 7 sequenced nucleotides. Under a 1x genome sequencing strategy a single residue would represent the consensus of a single sequence read, so through chance alone some bases will be sequenced more frequently whilst others may remain un-sequenced. A partial genome shotgun sequencing strategy with, say, 0.5x genome coverage is therefore reasonable. Whilst using 10-times fewer reagents, and thus costing 10-times less, than a 5x genome sequence, such a strategy does not preclude the species from further completion in the future. This technique has been applied successfully within the crop species Brassica oleracea (Ayele et al. 2005; Katari et al. 2005). The results of this strategy revealed the potential of genome survey sequencing for comparative genomic analysis and as a tool for gene identification. Brassica oleracea, however, has a compact genome of 650 Mbp, with the result that 35% of the Brassica sequence corresponded to protein coding gene sequence in the close relative Arabidopsis thaliana. When we consider large genomes such as that of wheat, such an approach would seem impractical. A draft genome sequencing approach might therefore make sense for the smaller crop plant genomes, and could be extremely valuable in surveying closely related species, but it is not a reliable route for the larger plant genomes.
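The random sampling effect described above can be quantified with the standard Lander-Waterman/Poisson approximation: under c-fold shotgun coverage, the number of reads covering any given base is approximately Poisson-distributed with mean c, so roughly e^-c of the genome remains unsequenced. A minimal sketch, using the coverage values discussed above:

```python
import math

# Under c-fold random shotgun coverage, the reads covering a given base
# are ~Poisson(c), so the expected unsequenced fraction is e^-c.
def unsampled_fraction(coverage):
    return math.exp(-coverage)

for c in (0.5, 1, 5, 7):
    print(f"{c}x coverage: ~{unsampled_fraction(c):.1%} of bases unsampled")
```

At 0.5x coverage roughly 61% of bases remain unsequenced, which is why such a survey samples genes rather than assembles them, while at 7x the expected gap fraction falls below 0.1%.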
Bacterial artificial chromosomes (BACs) are large genomic inserts that have been cloned into bacterial vectors. BAC-by-BAC sequencing arguably provides a more efficient strategy for complete genome sequencing than the shotgun sequencing approach. A BAC library may be constructed that contains a certain number of genome equivalents. By sequencing both ends of a large enough number of BACs, a sequence resource can be generated that provides some insight into the content of the underlying genome. As with the genome survey approach described in the previous section on draft genome sequencing, this approach is largely impotent for gene discovery within the larger genomes, but it can at least be used to establish the types of retroelement that will be encountered within a more thorough genome sampling, and can provide a route to the discovery of molecular markers. Large BAC libraries are available for a very large number of species and are widely used within map-based cloning and the discovery of candidate genes (e.g. Ling and Chen 2005; Gaafar et al. 2005). The sequencing of BAC ends within the context of genome survey has been described for at least ginseng (Hong et al. 2004), soybean (Marek et al. 2001) and maize (Gardiner et al. 2004). The deep sequencing of BAC ends only really becomes appropriate once a whole genome sequencing strategy has been adopted, but nevertheless there remain significant Genome Survey Sequence (GSS) resources in the public domain.
Current biology and post-genomic technologies are largely biased towards the understanding of protein coding genes and their regulatory elements. Random genome sampling techniques appear unsuitable for large genomes, so alternative approaches have been sought that sample the parts of the genome enriched for the gene space in a largely unbiased manner. It is well known that the bulk of a larger plant genome is repetitive, and that much of this repetitive content consists of retroelements. It has also been shown that much of this repetitive DNA is hyper-methylated in comparison to the hypo-methylated 'gene-space'. These observations have been applied to the development of technologies for creating libraries that are enriched for the hypo-methylated genome fraction. Genomic DNA, which is of mixed methylation states, is isolated, sheared and cloned into an E. coli strain containing a 5-methylcytosine restriction system: methylated inserts are cut and lost, whilst unmethylated fragments are successfully cloned and propagated. The construction of such libraries has been coined "methylation filtration" (Rabinowicz et al. 2003; Rabinowicz et al. 1999). The technology has been demonstrated in both maize (Palmer et al. 2003) and sorghum (Bedell et al. 2005).
High-C0t sequencing is based upon the renaturation of sheared genome fragments, and has been elegantly demonstrated, again, using the sorghum (Peterson et al. 2002a) and maize genomes (Yuan et al. 2003b). Genomic DNA is isolated and sheared into fragments that are sufficiently small that 'gene' sequence can likely be dissociated from any adjacent retroelement or other repeat; the experimental fragment size was 1.8 kbp on average for maize (Yuan et al. 2003b). The resulting fragments are then melted, and renaturation is performed in a controlled and gradual manner. The study of the resulting population of DNA fragments reveals the C0t-values, where C0 is the initial nucleotide concentration and t is the reassociation time (see (Paterson 2006) for an illustrated review). With the establishment of a C0t curve, particular fractions may be identified that contain high-complexity, low-copy DNA fragments, and which should therefore contain significantly less retroelement or other repeat sequence. Shotgun sequencing of clones isolated from high-C0t fractions has indeed revealed a clear enrichment for gene sequence with a concomitant reduction in the amount of repeat or retroelement sequence. This technology has been demonstrated in at least sorghum (Peterson et al. 2002a), maize (Yuan et al. 2003b) and wheat (Lamoureux et al. 2005). The selection of high-C0t libraries thus provides another route into a preferential sampling of the gene-space.
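The behaviour underlying C0t fractionation follows ideal second-order reassociation kinetics, in which the fraction of DNA remaining single-stranded is C/C0 = 1/(1 + k·C0t), with the rate constant k proportional to a component's copy number and inversely related to its complexity. A sketch with purely illustrative rate constants:

```python
# Ideal second-order reassociation: fraction of a DNA component still
# single-stranded after annealing to a given C0t value. The rate constant
# k is larger for high-copy (low-complexity) components, so repeats anneal
# at low C0t while low-copy 'gene-space' DNA persists to high C0t.
def single_stranded(c0t, k):
    return 1.0 / (1.0 + k * c0t)

for name, k in [("high-copy repeat", 100.0), ("low-copy gene-space", 0.01)]:
    print(f"{name}: {single_stranded(1.0, k):.3f} single-stranded at C0t = 1")
```

In this toy model the repetitive component has almost entirely reannealed by C0t = 1 while the low-copy fraction is still about 99% single-stranded; DNA that only forms duplexes at high C0t is therefore depleted of repeats.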
There are a number of extremely powerful techniques that may be used to access the content of the protein coding space within a genome. While EST sequencing has been very widely adopted throughout the research community, the other approaches to sampling the gene space have been demonstrated as effective and powerful, but have been adopted only within the context of very large pilot projects within the scope of further complete genome sequencing. These true genome sampling technologies, in addition to providing routes to the unbiased collection of protein-coding genes, give access to transcribed but non-polyadenylated features, to random slices of the genome, to hypo-methylated genome fragments, or to any continuum of features between these. With DNA as the starting material rather than poly-adenylated RNA there is much greater versatility for genome sampling and genomeless genomics, especially since these technologies provide access to the regulatory regions upstream of the genes themselves and to both intronic and exonic sequence.
These methodological advances in conquering the plant genome have focused on applying traditional di-deoxy sequencing methods to DNA libraries constructed using complex techniques. The sequencing method itself has remained largely unchanged throughout, albeit with greater automation. Common sense would suggest that solving the seemingly insurmountable issues of plant genome sequencing requires technologies capable of sequencing significantly longer DNA regions. It is therefore counterintuitive that one emerging technology excels in the production of shorter sequence reads (Margulies et al. 2005). Genome sequencing using 454 sequencing yields as many as 500,000 sequences in parallel from a single run. Each read is significantly shorter than a typical 'Sanger read', with an average length of perhaps only 110 nucleotides. While this technology does not solve the issues of plant genome sequencing, the ability to rapidly sequence vast amounts of short sequence from the genome, the transcriptome or any other reduced representation library does open up some rather fantastic opportunities for the plant research community.
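The scale argument can be made concrete with some back-of-envelope arithmetic using the figures quoted above; the 16 Gbp wheat-sized genome used below is an assumed round figure for illustration only:

```python
# Throughput per 454 run, using the read count and length quoted in the text
reads_per_run = 500_000
read_length_nt = 110
run_yield_nt = reads_per_run * read_length_nt
print(f"Yield per run: {run_yield_nt / 1e6:.0f} Mbp")

# Illustrative: runs needed for a 0.5x survey of a wheat-sized genome
# (16 Gbp is an assumed round figure, not a measured value)
genome_size_nt = 16e9
runs_needed = 0.5 * genome_size_nt / run_yield_nt
print(f"Runs for a 0.5x survey: ~{runs_needed:.0f}")
```

Even at 55 Mbp per run, surveying a very large genome remains a substantial undertaking, which is consistent with the text's caution that the technology reshapes rather than solves the plant genome sequencing problem.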
In future we will undoubtedly see a much wider adoption of these non-EST technologies, but it seems that for the moment at least, the crop plant community has the best access to EST resources. ESTs can certainly provide an answer to many questions posed, and for the remainder of this review I will focus solely on the EST sequences and their applications.
5. COMPUTERS, DATABASES AND THE REPRESENTATION OF CROP EST SEQUENCE DATA
The volumes of publicly available EST data have created a formidable resource for the research community, as exemplified by the dbEST database (Boguski et al. 1993). This resource, however, has not been designed for the needs of the biologist. The dbEST database (or its siblings such as the EST division of the EMBL database at the EBI (Cochrane et al. 2006)) has been designed as a sequence repository to which researchers are free to contribute their sequences, and to which they are expected to deposit their sequences upon publication. The sequence repository is therefore humble in its offerings: the sequences and their critical annotations and associations are maintained as a flat, textual representation of the data, rather than as a structured, maintained or curated collection of sequences. This does not mean that these sequences alone are without use. Public BLAST (Altschul et al. 1990) servers such as the NCBI BLAST server use the available EST resources as a substrate against which user sequences may be compared. The data within these primary databases is also freely available to download in full. This free data accessibility also provides the gateway into secondary sequence databases that exploit the fuller potential of the contained information.
The plant research community is no stranger to databases and web-based methods for the presentation and dissemination of biological knowledge. The Arabidopsis (Garcia-Hernandez et al. 2002; Hubbard et al. 2005; Schoof et al. 2004) and rice (Karlowski et al. 2003; Schoof et al. 2005; Yuan et al. 2003a) genome databases have paved the way for the exploitation of genome data in research, and further 'generic' database infrastructures for the description and exploitation of plant genomic data have been discussed (e.g. Hubbard et al. 2005; Lawrence et al. 2005; Schoof et al. 2005), so that the crop plant research community will be a direct beneficiary of the forthcoming crop-plant genome initiatives. The critical aspect of a meaningful database is that data should be stored and maintained in a structure such that biologically meaningful queries can be addressed to the data collection and meaningful results retrieved quickly and simply. The Arabidopsis, rice and maize genome databases act as repositories for the raw genome sequences; while important, this raw sequence is of little direct relevance to the community. Onto this data substrate additional annotation and analyses of varying dimensionality are layered (Reed et al. 2006). The information added typically includes gene-models, map positions, similarity and identity to known genes, and descriptions that relate to function, structure, ontology or domain content.
When we consider the primitive data-types that are associated with sequences in a genome database the content is not completely dissimilar to the information content that could (or should) be associated with EST sequences. As argued earlier, an EST collection is fragmentary at best with a significant background of sequencing error (empirically shown as approximately 1.5% mismatch error per nucleotide using Arabidopsis EST sequence, data not shown) and massive sequence redundancy. To make sense of the sequence data, it is therefore imperative that the EST sequences be cleaned, clustered and assembled to produce a minimal 'unigene-set'. This
Figure 2. A diagram illustrating the flow of information within the openSputnik EST sequence and annotation database (Rudd 2005). Collections of EST sequence are downloaded from public sequence databases such as dbEST and are used to build species-centric databases. The EST sequences are aggressively trimmed of known and probable sources of contamination such as E. coli sequence, cloning vector etc. The 'filtered' sequences are then clustered and assembled using specialist software, resulting in a 'unigene set' of reduced redundancy and increased complexity. These unigene sequences are used to predict probable molecular markers and, in conjunction with other assembled sequence collections, are used to establish comparative genomics resources. The comparison of the unigene set to the fully sequenced genomes and to reference protein databases such as UniProt allows the tentative assignment of role and function and facilitates unigene assignment to the Gene Ontology. Peptide sequence is predicted on the basis of codon usage and maximum likelihood assessment of the sequence, and the resulting peptides are annotated for protein domains, signal peptides, transmembrane domains and sub-cellular localisation. This full repertoire of annotations ensures that a simple EST collection may be exhaustively searched for biological context that may be used for the selection of candidate genes for crop improvement
unigene sequence collection can then be annotated and analysed to build a sequence resource that may be used for comparative and functional genomics and that may act as a plentiful resource of candidate genes for further analysis. Figure 2 shows the analytical graph that is used within the openSputnik EST sequence database (Rudd 2005) to assign meaning to ESTs and their parent unigenes.
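The clustering and assembly step of such a pipeline is normally performed by specialist assembly software; purely as an illustration of the principle, a toy single-linkage clustering of reads into 'unigene' groups by overall similarity (the threshold and sequences here are invented for the example) might look like:

```python
from difflib import SequenceMatcher

def greedy_cluster(ests, min_identity=0.9):
    """Toy single-linkage clustering of EST reads into 'unigene' groups.

    Real pipelines use specialist assemblers; this sketch simply groups
    reads whose global similarity to a cluster's first member exceeds a
    threshold, collapsing the redundancy within the collection.
    """
    clusters = []
    for est in ests:
        for cluster in clusters:
            if SequenceMatcher(None, est, cluster[0]).ratio() >= min_identity:
                cluster.append(est)
                break
        else:
            clusters.append([est])
    return clusters

# Two identical reads, one read with a single mismatch, one unrelated read
reads = ["ATGGCGTTAGGC", "ATGGCGTTAGGC", "ATGGCGATAGGC", "TTTTCCCCGGGA"]
print(len(greedy_cluster(reads)))  # 2 'unigene' groups
```

The redundant reads collapse into one group while the unrelated read forms its own, reducing redundancy exactly as described for the unigene set above.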
The openSputnik database (Rudd 2005) is not alone in providing a repository of processed EST data for a collection of crop species. Other databases may contain significant volumes of processed data for several species (e.g. plantGDB (Dong et al. 2004), the TIGR gene indices (Lee et al. 2005) or the NCBI UniGene resource (Wheeler et al. 2006)) or may contain focussed and deep annotation and analysis for a more restricted collection of species (e.g. parasitic plants (Torres et al. 2005), flowering plants (Albert et al. 2005), peach (Lazzari et al. 2005), pineapple (Moyle et al. 2005), or many others).
It is reasonable to summarise that EST databases are an essential resource: for migrating information between genomes, for understanding what an EST might do and where else it might be found, as a general resource for understanding the content of a crop species, and as a fundamental resource in the selection of candidate genes during the process of crop improvement. The assumption that a large crop EST collection may be treated as a genome project in miniature, while broadly correct, is also naïve. There is much more that can be done using the large and already existing sequence resources, or with sequence collections that may be created for a specific need.
6. GENOME SEQUENCE, SEQUENCE HETEROGENEITY AND MOLECULAR MARKERS
It seems likely that for the foreseeable future many of the crop species upon which we currently rely will remain unsequenced. EST sequencing has proven to be a technology that can establish a glimpse of the underlying gene content, and databases have been established that work around the limitations and caveats of EST sequence to provide the maximal available context to interested researchers. Meanwhile, some of the caveats of EST sequence data (namely the vast redundancy within sequence collections) have been turned to the advantage of the community. Mining EST collections for molecular markers is routinely performed and robust methods have been developed for this.
While genomics is a relatively new approach within the crop plant community, genetics is a traditional approach whereby researchers and breeders attempt to delimit the chromosomal intervals containing traits that enhance the performance, value or scientific merit of the plant. To make sense of breeding populations, breeders have been constructing both genetic and physical maps for many genomes. The genetic maps rely upon molecular markers, which can be of a few different types. The most popular markers include simple sequence repeat (SSR) markers, single nucleotide polymorphism (SNP) markers and, more recently, conserved ortholog set (COS) markers. While SSR and SNP markers were traditionally identified manually, the vast volumes of data in the EST databases have led to the development of automated processes for candidate marker selection.
SSRs, also known as microsatellite markers, consist of a variable number of typically di-nucleotide or tri-nucleotide repeats. The variability in the number of repeat units segregates between breeding populations and may be used in the construction of a genetic map. ESTs have been widely used in the generation of SSR markers within several grass species (and in other crops), e.g. (Barkley et al. 2005; Graham et al. 2004; Gupta et al. 2003; Mian et al. 2005; Saha et al. 2004; Thiel et al. 2003). The aim of these experiments was to identify potential microsatellites and to investigate whether length variability could be identified within other populations. SSR markers are cheap to develop and test, and will remain a favourite of the crop research community.
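The automated detection of candidate SSRs in EST or unigene sequence is essentially a pattern-matching exercise. A minimal sketch of such a search follows; the thresholds are illustrative and far simpler than the published marker-selection pipelines cited above:

```python
import re

def find_ssrs(seq, unit_len=2, min_units=5):
    """Locate simple sequence repeats (SSRs) of a given unit length.

    Returns (motif, repeat_count, start) tuples. Homopolymer runs
    (e.g. AAAAA...) are skipped, since they make poor markers.
    """
    pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_units - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        full, unit = m.group(1), m.group(2)
        if len(set(unit)) > 1:  # skip homopolymers
            hits.append((unit, len(full) // unit_len, m.start()))
    return hits

# Invented EST fragment containing a (GA)7 dinucleotide repeat
print(find_ssrs("ATGAGAGAGAGAGAGAGCCT"))  # [('GA', 7, 2)]
```

A marker pipeline would then design PCR primers flanking each hit and test whether the repeat length segregates across the populations of interest.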
SNPs are an enticing form of molecular marker; they are simple in that they consist of a nucleotide difference at a single position within the much larger chromosomal context. This single difference reliably differentiates between given varieties, cultivars or ecotypes. There has been much recent discussion of computational methods for selecting SNPs from within the redundancy of large sequence collections (Huntley et al. 2006; Kota et al. 2003; Marth 2003; Matukumalli et al. 2006; Weil et al. 2004). Regardless of the underlying methodologies, the selection of candidate SNPs has provided an expedited route to the discovery and validation of novel markers.
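At its core, candidate SNP selection from a redundant EST collection amounts to examining each column of a unigene's read alignment and asking whether the minor allele occurs too often to be dismissed as sequencing error. A hedged sketch of this idea; the thresholds are invented for illustration and are not taken from the cited methods:

```python
from collections import Counter

def candidate_snp(column, min_minor=2, min_fraction=0.2):
    """Assess one column of a unigene read alignment for a candidate SNP.

    A position qualifies when the second most frequent base is observed
    often enough, in absolute count and as a fraction of read depth,
    that random sequencing error is an unlikely explanation.
    """
    counts = Counter(base for base in column if base in "ACGT")
    if len(counts) < 2:
        return None
    (major, _), (minor, n_minor) = counts.most_common(2)
    depth = sum(counts.values())
    if n_minor >= min_minor and n_minor / depth >= min_fraction:
        return (major, minor)
    return None

print(candidate_snp("AAAAGGA"))  # ('A', 'G'): minor allele seen twice
print(candidate_snp("AAAAAAG"))  # None: a lone mismatch looks like error
```

The depth requirement is what turns the redundancy of EST collections from a nuisance into an asset: only positions sampled by several independent reads can yield trustworthy candidate SNPs.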
Not only can plant breeders utilise large sequence collections that reside in the public domain for the selection of candidate markers, but there are also suggestions (in mammalian systems at least) that SNP markers developed in one system may be applicable to other systems (Grapes et al. 2006). The selection of likely candidate SNPs from pig protein coding sequences and their comparison to known human SNPs has revealed that there is a reasonable correlation in gene-to-gene variability across species opening the prospects for site-directed mining of SNPs between species.