TABLE 1 Strategies for microbial genome sequencing
Strategy | Insert size | Cloning | Mapping | Problems | Sequence | Side-products
Whole genome shotgun | Small (< 2 kb) | Simple | Not required | Genome complexity, repetitive DNA, occasional unclonability | Late | Small insert clones for SAGE
Directed or clone based | Large (> 35 kb) | Complex | Labour intensive but valuable | Unclonability, incomplete genome coverage | Early, constant delivery | Small insert clones for SAGE. Large insert clones for biology, genetics, etc.
SAGE, systematic analysis of gene expression.
An efficient compromise, which has been used in the M. tuberculosis H37Rv genome sequencing project, is to combine both approaches: selected cosmids (Philipp et al 1996a) or bacterial artificial chromosome (BAC) clones containing large inserts (Brosch et al 1998) are sequenced in full, and a large number of clones from a whole genome shotgun library are sequenced to give two- to threefold genomic coverage. In parallel, the ends of the inserts present in several thousand cosmid or BAC clones are sequenced. In this way, one not only generates data rapidly and efficiently but also benefits from an in-built topological check on shotgun sequence assembly provided by the end sequences. This is particularly important for genomes rich in repetitive DNA, as is true for M. tuberculosis. In the final stages of sequencing, the BAC clones in particular serve as templates for gap filling, as they generally carry those regions of the genome that are underrepresented in, or missing from, cosmid or plasmid libraries. This was indeed the case with H37Rv: several loci that were apparently unclonable in multicopy cosmids, as Sau3AI partial digest fragments, or in small insert libraries containing randomly sheared fragments, were fully covered in the pBeloBAC11 bank, which carries large inserts generated by partial digestion with HindIII (Brosch et al 1998). From the ~5000 clones initially isolated, a canonical set of 68 BACs has been established that covers essentially the complete genome. This readily manageable collection of stable clones represents a valuable resource for the future, and should find use in applications as diverse as comparative genome mapping and the systematic analysis of gene expression (SAGE).
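The topological check that clone end sequences provide on a shotgun assembly can be illustrated with a minimal sketch. It is not the software used in the H37Rv project. The record type `EndHit`, the function `check_clone`, the clone names and the insert-size limits are all hypothetical and chosen only for illustration. The idea is that the two end reads of one cosmid or BAC should land on the same contig, point towards each other, and lie roughly one insert length apart. A pair that lands on two different contigs is not an error but evidence that the clone links those contigs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndHit:
    """Placement of one clone end read on an assembled contig (hypothetical record)."""
    contig: str
    position: int   # contig coordinate of the read's 5' end
    forward: bool   # True if the read runs along the contig's forward strand


def check_clone(left: EndHit, right: EndHit, min_insert: int, max_insert: int) -> str:
    """Classify one clone's pair of end reads against the assembly."""
    if left.contig != right.contig:
        # Ends on different contigs: the clone bridges a gap between them.
        return "links_contigs"
    if left.forward == right.forward:
        # Both reads on the same strand cannot come from one intact insert.
        return "bad_orientation"
    fwd, rev = (left, right) if left.forward else (right, left)
    span = rev.position - fwd.position
    if span < 0:
        # The reads point away from each other instead of towards each other.
        return "bad_orientation"
    if not (min_insert <= span <= max_insert):
        # Right orientation, wrong separation: possible collapsed or expanded repeat.
        return "bad_distance"
    return "consistent"


if __name__ == "__main__":
    # Illustrative values only: assumed insert-size limits for a BAC library.
    MIN_INSERT, MAX_INSERT = 25_000, 120_000
    clones = {
        "clone_A": (EndHit("contig1", 10_000, True), EndHit("contig1", 78_000, False)),
        "clone_B": (EndHit("contig1", 900_000, True), EndHit("contig2", 4_000, False)),
        "clone_C": (EndHit("contig3", 50_000, True), EndHit("contig3", 350_000, False)),
    }
    for name, (left, right) in clones.items():
        print(name, check_clone(left, right, MIN_INSERT, MAX_INSERT))
```

In a repeat-rich genome, a cluster of `bad_distance` or `bad_orientation` pairs around one region would suggest that the assembly there should be re-examined, while `links_contigs` pairs indicate which clones could serve as templates for gap filling.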
At the time of writing, the sequencing of the H37Rv genome is entering the final phase. There are currently about 15 sequence contigs, ranging in size from 1.2 to 1540 kb, that are all linked to their neighbours by BAC or cosmid clones. The total contig length is 4393 kb, which is close to the genome size of ~4400 kb estimated for the circular chromosome of H37Rv by pulsed field gel electrophoresis, thus indicating that the remaining gaps are probably small. Intensive efforts are now being made to close these and to analyse and annotate the composite sequence.
From our preliminary inspection of completed cosmid sequences it is clear that M. tuberculosis, unlike the leprosy bacillus (Cole 1994, Honore et al 1993), has densely packed coding regions and the genome is likely to comprise ~4000 protein-coding sequences. Genes are identified by a combination of methods involving positional base preference, codon usage, pattern recognition, and similarity to known genes or their products. Database searches have led to the attribution of precise, or putative, functions to over 70% of the genes and, as in other genome sequencing projects, it is a matter of great interest to determine the biological roles of the remaining 30% that have no counterparts in other bacteria. Much insight into the biochemistry, general metabolism and physiology of M. tuberculosis has been obtained from our initial analysis of the large body of sequence data, and many new leads for chemotherapy and immunoprophylaxis have been highlighted, but these will not be discussed here. Instead, we will concentrate on two salient features: (1) the possible role of insertion sequences in genome dynamics; and (2) the potential significance of two dispersed repetitive DNA sequences, i.e. the major polymorphic tandem repeat (MPTR; Hermans et al 1992) and the polymorphic GC-rich sequence (PGRS; Poulet & Cole 1995a, Ross et al 1992), that both belong to multigene families encoding acidic proteins.
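The inference that the remaining gaps are probably small follows from simple arithmetic on the figures quoted above, which the following sketch makes explicit. The function name `gap_summary` is hypothetical. The sketch assumes that each of the roughly 15 contigs on a circular chromosome is separated from the next by exactly one gap, and it treats the pulsed field gel estimate of ~4400 kb as exact, which it is not, so the results are rough indications only.

```python
def gap_summary(contig_total_kb: float, genome_estimate_kb: float, n_contigs: int):
    """Return (missing kb, fraction of genome in contigs, mean gap size in kb)."""
    missing_kb = genome_estimate_kb - contig_total_kb
    fraction_in_contigs = contig_total_kb / genome_estimate_kb
    # On a circular chromosome, n contigs are separated by n gaps (assumption).
    mean_gap_kb = missing_kb / n_contigs
    return missing_kb, fraction_in_contigs, mean_gap_kb


# Figures taken from the text; the genome size is an approximate estimate.
CONTIG_TOTAL_KB = 4393
GENOME_ESTIMATE_KB = 4400
N_CONTIGS = 15
N_GENES = 4000  # approximate number of protein-coding sequences

missing_kb, fraction_in_contigs, mean_gap_kb = gap_summary(
    CONTIG_TOTAL_KB, GENOME_ESTIMATE_KB, N_CONTIGS
)
# Average genome length available per protein-coding sequence.
kb_per_gene = GENOME_ESTIMATE_KB / N_GENES

if __name__ == "__main__":
    print(f"unaccounted for: {missing_kb} kb")
    print(f"fraction in contigs: {fraction_in_contigs:.2%}")
    print(f"mean gap: {mean_gap_kb:.2f} kb")
    print(f"genome per gene: {kb_per_gene:.2f} kb")
```

Under these assumptions about 7 kb of the chromosome is unaccounted for, or roughly 0.5 kb per gap on average, and ~4000 protein-coding sequences in ~4400 kb corresponds to about one gene per 1.1 kb, in keeping with the densely packed coding regions noted in the text.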