19.12 Whole-genome shotgun sequencing utilizes sequence overlap to align sequenced fragments.

The Human Genome Project

By 1980, methods for mapping and sequencing DNA fragments had been sufficiently developed that geneticists began seriously proposing that the entire human genome could be sequenced. An international collaboration was planned to undertake the Human Genome Project (Figure 19.13); initial estimates suggested that 15 years and $3 billion would be required to accomplish the task. As a part of the effort, the genomes of several model organisms, including Escherichia coli, Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fruit fly), Arabidopsis thaliana (a plant), and Caenorhabditis elegans (a nematode), were to be sequenced as well. The genomes of these model organisms were sequenced to help develop methods that could then be applied to the sequencing of the human genome and to provide sequenced genomes with which to compare the organization and structure of the human genome.

19.13 The Human Genome Project has produced an initial draft of the sequence for the human genome. (Mario Tama/Getty.)

The Human Genome Project officially got underway in October 1990. Initial efforts focused on developing new and automated methods for cloning and sequencing DNA and on generating detailed physical and genetic maps of the human genome. The methods described earlier for mapping, sequencing, and assembling DNA fragments were pivotal in these early stages of the project. By 1993, large-scale physical maps were completed for all 23 pairs of human chromosomes. At the same time, automated sequencing techniques (Figure 19.14) had been developed that made large-scale sequencing feasible.

The initial effort to sequence the genome was a public project consisting of the international collaboration of 20 research groups and hundreds of individual researchers who formed the International Human Genome Sequencing Consortium. In 1998, Craig Venter announced that he would lead a company called Celera Genomics in a private effort to sequence the human genome.

The public and private efforts moved forward simultaneously but used different approaches. The Human Genome Consortium used a map-based approach: many copies of the human genome were cut up into fragments of about 150,000 bp each, which were inserted into bacterial artificial chromosomes (BACs). Yeast artificial chromosomes (YACs) and cosmids had been used in early stages of the project but did not prove to be as stable as the BAC clones, although YAC clones were instrumental in putting together some of the larger contigs. Restriction fingerprints were used to assemble the BAC clones into contigs, which were positioned on the chromosomes by using genetic markers and probes. The individual BAC clones were sheared into smaller overlapping fragments and sequenced, and the whole genome was assembled by putting together the sequences of the BAC clones.

Celera Genomics used a whole-genome shotgun approach to determine the human genome sequence, although the genetic and physical maps produced by the public effort helped Celera assemble the final sequence. In this approach, small-insert clones were prepared directly from genomic DNA and then sequenced. The overlapping of DNA sequences among these small-insert clones was then used to assemble the entire genome.
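The idea of assembling a sequence from overlapping fragments can be illustrated with a small Python sketch. It is a toy, not the method used by either project: it assumes error-free reads from a single strand, no repeated sequences, and a simple greedy rule (always merge the pair of fragments with the longest overlap). The function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical names chosen for this illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that equals a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap.

    Returns the contigs that remain when no overlaps of at least
    `min_len` bases are left.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        # Join the two fragments, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    fragments = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
    print(greedy_assemble(fragments))
```

Real genomes contain many repeated sequences, which create ambiguous overlaps; this is one reason why the genetic and physical maps remained useful for assembling the final sequence.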

Both public and private sequencing projects announced the completion of a rough draft that included most of the sequence of the human genome in the summer of 2000, 5 years ahead of schedule. Analysis of this sequence was published 6 months later.

The availability of the complete sequence of the human genome is proving to be of enormous benefit. It is greatly facilitating the identification and isolation of genes that contribute to many human diseases and is providing probes that can be used in genetic testing, diagnosis, and drug development. The sequence is also providing important information about many basic cellular processes. Comparisons of the human genome with those of other organisms are adding to our understanding of evolution and the history of life.

19.14 Automated sequencers and powerful computers allowed a rough draft of the human genome sequence to be completed in just 10 years. (Whitehead/MIT Genome Center, 2001; from NATURE 409: 860-921.)

The New Genetics

Mapping the Human Genome: Where It Leads, What It Means

In June 2000, scientists from the Human Genome Project and Celera Genomics stood at a podium with President Bill Clinton to announce a stunning achievement—they had successfully constructed a sequence of the entire human genome. Soon this process of identifying and sequencing each and every human gene became characterized as "mapping the human genome." As with maps of the physical world, the map of the human genome provides a picture of locations, terrains, and structures. But, like explorers, scientists must continue to decipher what each location on the map can tell us about diseases, human health, and biology. The map accelerates this process because it allows researchers to identify key structural dimensions of the gene that they are exploring and reminds them where they have been and where they have yet to explore.

What does the map of the human genome depict? When researchers discuss the sequencing of the genome, they are describing the identification of the patterns and order of the 3 billion human DNA base pairs. Although such identification provides valuable information about overall structure and the evolution of humans in relation to other organisms, researchers really want the key information encoded in just 2% of this enormous map: the information that encodes most of the proteins of which you and I are composed. Proteins stand as the link between genes and pharmaceutical drug development; they show which genes are being expressed at any given moment, and they provide information about gene function.

Knowing our genes will lead to a greater understanding and radically improved treatment of many diseases. However, sequencing the entire human genome, in conjunction with sequencing of various nonhuman genomes under the same project, has raised fundamental questions about what it means to be human. After all, fruit flies possess about one-third the number of genes possessed by humans, and an ear of corn has approximately the same number of genes as a human. In addition, the overall DNA sequence of a chimpanzee is about 99% the same as the human genome sequence. As the genomes of other species become available, the similarities to the human genome in both structure and sequence pattern will continue to be identified. At a basic level, the discovery of so many commonalities and links and ancestral trees with other species adds credence to principles of evolution and Darwinism.

Dr. Craig Venter (Celera Genomics), President Clinton, and Dr. Francis Collins (NIH). (Ron Edmonds/AP)

Some of the most anticipated developments and potential benefits of the Human Genome Project directly affect human health; researchers, practicing physicians, and the general public eagerly await the development of targeted pharmaceutical agents and more specific diagnostic tests. Pharmacogenomics is at the intersection of genetics and pharmacology; it is the study of how one's genetic makeup will affect one's response to various drugs. In the future, medicine will potentially be safer, cheaper, and more disease specific, all while causing fewer side effects and acting more effectively the first time around.

Arthur L. Caplan and Kelly A. Carroll

There are, however, some hard ethical questions that follow in the wake of new genetic knowledge. Patients will have to undergo genetic testing in order to match drugs to their genetic makeup. Who will have access to these results—just the health care practitioner? Or will the patient's insurance company, employer, school, or family have access? Although the tests may have been administered for one case, will information derived from them be used for other purposes, such as for the identification of other conditions or future diseases or even in research studies?

How should researchers conduct studies in pharmacogenomics? Often, they need to study subjects by some kind of identifiable trait that they believe will assist in separating groups of drugs, and in turn they separate people into populations. The order of almost all the DNA base pairs (99.9%) is exactly the same in all humans, leaving a small window of difference. The potential for the stigmatization of individuals and groups of people based on race and ethnicity is inherent in genomic research and analysis. As scientists continue drug development, they must be careful not to further such ideas, especially because studies of nuclear DNA indicate that there is often more genetic variation within ethnic groups or cultures than between ethnic groups or cultures.

These are just a few of the ethical issues arising out of one development of the Human Genome Project. The potential applications of genome research are staggering, and the mapping is just the beginning.


The Human Genome Project is an effort to sequence the entire human genome. The project began in 1990, and in 2000 two competing teams, an international consortium of publicly supported investigators and a private company, each completed a rough draft of the genome sequence.

Single-Nucleotide Polymorphisms

In addition to the DNA sequence of an entire genome, several other types of data are useful for genomic projects and have been the focus of sequencing efforts. One consists of single-nucleotide polymorphisms (SNPs, pronounced "snips"), which are single-base-pair differences in DNA sequence between individual members of a species. Arising through mutation, SNPs are inherited as allelic variants (just like alleles that produce phenotypic differences, such as blood types), although SNPs do not usually produce a phenotypic difference. Single-nucleotide polymorphisms are numerous and are present throughout genomes. In a comparison of the same chromosome from two different people, a SNP can be found approximately every 1000 bp.
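As a rough illustration of what a SNP is, the following Python sketch compares two copies of the same stretch of DNA and reports each position at which they differ by a single base. It assumes that the two sequences are already aligned and of equal length, which real data rarely are without an alignment step. The function names `find_snps` and `expected_snp_count` are hypothetical. The second function simply applies the figure of one SNP per 1000 bp to a genome of about 3 billion bp.

```python
def find_snps(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (position, base_in_a, base_in_b) for every single-base difference.

    Positions are 1-based. Both sequences must already be aligned
    and of the same length (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    ]


def expected_snp_count(genome_bp: int, spacing_bp: int = 1000) -> int:
    """Estimate how many SNPs separate two genomes at one SNP per `spacing_bp`."""
    return genome_bp // spacing_bp


if __name__ == "__main__":
    person_1 = "ATGCCGTA"
    person_2 = "ATGTCGTA"
    print(find_snps(person_1, person_2))      # one difference, at position 4
    print(expected_snp_count(3_000_000_000))  # about 3 million differences
```

At a spacing of about 1000 bp, two people would be expected to differ at roughly 3 million positions across a genome of 3 billion base pairs.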

Because of their variability and widespread occurrence throughout the genome, SNPs are valuable as markers in linkage studies. For example, human SNPs are being cataloged and mapped for use in identifying genes that contribute to disease. When a SNP is physically close to a disease-causing locus, it will tend to be inherited along with the disease-causing allele. Thus the SNP marks the location of a genetic locus that causes the disease. A SNP can also be useful for determining family relationships—most SNPs are unique within a population, having arisen only once by mutation. Thus the presence of the same SNP in two persons often indicates that they have a common ancestor.

Expressed-Sequence Tags

Another type of data identified by sequencing projects consists of databases of expressed-sequence tags (ESTs). In most eukaryotic organisms, only a small percentage of the DNA actually encodes proteins; in humans, less than 2% of the DNA encodes the amino acids of proteins. If only protein-encoding genes are of interest, it is often more efficient to examine RNA than the entire DNA genomic sequence. RNA can be examined by using ESTs, markers associated with DNA sequences that are expressed as RNA. Expressed-sequence tags are obtained by isolating RNA from a cell and subjecting it to reverse transcription, producing a set of cDNA fragments that correspond to RNA molecules from the cell. Short stretches of these cDNA fragments are then sequenced, and the sequence obtained (called a tag) provides a marker that identifies the DNA fragment. Expressed-sequence tags can be used to find active genes in a particular tissue or at a particular point in development.


In addition to the genomic-sequence data, genomic projects are collecting databases of nucleotides that vary among individuals (single-nucleotide polymorphisms, SNPs) and markers associated with transcribed sequences (expressed-sequence tags, ESTs).



Bioinformatics

By the time this book is published, complete genome sequences will have been determined for more than 100 different organisms, with many additional projects underway. These studies are producing tremendous quantities of sequence data. GenBank, one of the major databases of DNA sequence information, now contains more than 19 billion base pairs of sequence, and this number increases every month. Cataloging, storing, retrieving, and analyzing this huge data set pose a major challenge for modern genetics. Bioinformatics is an emerging field that combines molecular biology and computer science and centers on developing databases, computer-search algorithms, gene-prediction software, and other analytical tools that are used to make sense of DNA, RNA, and protein sequence data. Bioinformatics develops and applies these tools to "mine the data," extracting the useful information from sequencing projects.

Before being sequenced, most genomes contain few genes whose locations have already been determined, which, coupled with the enormous amount of DNA in a genome and the complexities of gene structure, makes finding genes a difficult task. Computer programs have been developed to look for specific sequences in DNA that are associated with certain genes. For example, protein-encoding genes are characterized by an open reading frame (ORF), which includes a start codon and a stop codon in the same reading frame. Specific sequences mark the splice sites at the beginning and end of introns; other specific sequences are present in promoters immediately upstream of start codons. Still other sequences are associated with particular functions in certain classes of proteins. Computer programs have been developed that scan the DNA for these sequences and identify genes on the basis of their presence and position. Some of these programs are capable of examining databases of EST and protein sequences to see if there is evidence that a potential gene is expressed.
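The search for open reading frames can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a description of any particular gene-prediction program: it assumes the standard start codon ATG and the stop codons TAA, TAG, and TGA, scans only the three reading frames of the strand it is given, and ignores introns, splice sites, and promoters. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Find open reading frames in the three frames of one DNA strand.

    Returns (start, end, sequence) tuples, with 0-based `start` and
    exclusive `end`. `min_codons` counts the codons from the start codon
    up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                # Walk codon by codon until a stop codon in the same frame.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))
```

Because the sketch ignores introns, it is closer to how genes might be sought in a bacterial genome than in a eukaryotic one, where the coding sequence of a gene is typically interrupted.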

It is important to recognize that the programs that have been developed to identify genes on the basis of DNA sequence are not perfect. Therefore, the numbers of genes reported in most genome projects are estimates. The presence of multiple introns, alternative splicing, multiple copies of some genes, and much noncoding DNA between genes makes accurate identification and counting of genes difficult.

Functional Genomics

A genomic sequence is, by itself, of limited use. It would be like having a huge set of encyclopedias without being able to read: you could recognize the different letters, but the text would be meaningless. Functional genomics is, in essence, probing genome sequences for meaning: identifying genes, recognizing their organization, and understanding their function. The goals of functional genomics include identifying all the RNA molecules transcribed from a genome (the transcriptome) and all the proteins encoded by the genome (the proteome). Functional genomics exploits both bioinformatics and laboratory-based experimental approaches in its search to define the function of DNA sequences.

Chapter 18 considered several methods for identifying genes and assessing their functions, including in situ hybridization, DNA footprinting, experimental mutagenesis, and the use of transgenic animals and knockouts. These methods can be applied to individual genes and can provide important information about the locations and functions of genetic information. In this section, we will focus primarily on methods that rely on knowing the sequences of other genes or that can be applied to large numbers of genes simultaneously.

Predicting Function from Sequence

The nucleotide sequence of a gene can be used to predict the amino acid sequence of the protein that it encodes. The protein can then be synthesized or isolated and its properties studied to determine its function. However, this biochemical approach to understanding gene function is both time consuming and expensive. A major goal of functional genomics has been to develop computational methods that allow gene function to be identified from DNA sequence alone, bypassing the laborious process of isolating and characterizing individual proteins.

Homology searches. One computational method (often the first employed) for determining gene function is to conduct a homology search, which relies on comparing DNA and protein sequences from the same and different organisms. Genes that are evolutionarily related are said to be homologous. Homologous genes found in different species that evolved from the same gene in a common ancestor are called orthologs (Figure 19.15). For example, both mouse and human genomes contain a gene that encodes the alpha subunit of hemoglobin; the mouse and human alpha-hemoglobin genes are said to be orthologs, because both genes evolved from an alpha-hemoglobin gene in a mammalian ancestor common to mice and humans. Homologous genes in the same organism (arising by duplication of a single gene in the evolutionary past) are called paralogs (see Figure 19.15). Within the human genome is a gene that
