The Human Genome Project at UCSC
Race to complete the first working draft
Science Magazine reflects on the 10th anniversary of human genome sequencing
Read about the entire Human Genome Project on the NHGRI website
The challenge of bioinformatics
Essay by David Haussler
Webcast: NHGRI genome symposium, "From Double Helix to Human Genome—and Beyond"
April 2003
The International Human Genome Project (IHGP) came to UC Santa Cruz in December 1999 when Eric Lander, the director of the Whitehead sequencing center (Whitehead Institute/MIT Center for Genome Research), invited David Haussler to help annotate the human genome. In particular, Lander wanted help in discovering the locations of the genes, which make up only approximately 1.5% of the sequence.
Haussler had previously applied a mathematical technique known as hidden Markov models (HMMs) to the task of computer gene-finding. This application of HMMs had quickly become the dominant gene-finding methodology and was used successfully on the Drosophila melanogaster (fruit fly) genome.
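To give a feel for how an HMM can label a DNA sequence, here is a minimal sketch of Viterbi decoding over a two-state model. It is not the model Haussler's group used: the state names, the probabilities in `START`, `TRANS`, and `EMIT`, and the function name `viterbi` are all hypothetical, chosen only so that one state favors A/T and the other favors G/C. Real gene-finders use many more states (exons, introns, splice sites, and so on) with parameters trained on known genes.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, chosen only for illustration.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    # Work with log probabilities so long sequences do not underflow.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("A" * 10 + "G" * 10)` labels the A-rich half as `noncoding` and the G-rich half as `coding` under these made-up parameters; the point is only that the model picks the single most probable labeling of the whole sequence rather than classifying each base in isolation.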
Haussler enlisted Jim Kent, then a graduate student in UCSC’s Department of Molecular, Cell, & Developmental Biology, along with systems engineer Patrick Gavin and graduate students Terrence Furey and David Kulp (who had led the gene-finding effort on the Drosophila genome). This was the birth of the UCSC Genome Bioinformatics Group.
It was a crucial time for the international project. The private company Celera Genomics had announced its intention to assemble the human genome sequence well in advance of the public effort, raising the fear that the sequence would be protected by patents and thus not be freely available to scientists. At this point, a number of groups within the IHGP were trying to assemble the genome sequence, which turned out to be like an extremely difficult jigsaw puzzle having many similar-looking, noncontiguous, overlapping pieces. The progress was slow and arduous.
Motivated to prevent Celera and its clients from locking up significant portions of the human genome in patents, Kent dropped his other work in May of 2000 to focus on the assembly problem. Within four weeks, he developed a 10,000-line computer program that assembled the working draft of the human genome. The program, called GigAssembler, finished the job on June 22, 2000, just days before Celera completed its first assembly.
On July 7, 2000, after further examination by the principal scientists of the public genome project, the UCSC Genome Bioinformatics Group released this first working draft on the web at http://genome.ucsc.edu. The scientific community downloaded one-half trillion bytes of information from the UCSC genome server in the first 24 hours of free and unrestricted access to the assembled blueprint of our human species.
With the genome assembly 90% complete, the assembled genome was published along with the findings of hundreds of researchers worldwide in the February 15, 2001 issue of Nature, which was largely devoted to the human genome.
The initial assembled human genome sequence was referred to as a working draft because there remained gaps where DNA sequence was missing, due either to a lack of raw sequence data or to ambiguities in the positions of the fragments. In the months following the release of the working draft, the UCSC team worked with other researchers worldwide to fill in the gaps. The resulting finished sequence made its debut in April of 2003. It encompasses 99% of the gene-containing regions of the human genome and is 99.99% accurate.
No laboratory system is available that can read along the entire length of a DNA strand to determine the order of nucleotide bases (A, G, C, and T for adenine, guanine, cytosine, and thymine). DNA sequences are determined by a variety of methods, some automated. They all involve breaking DNA into fragments by some chemical method, such as the use of enzymes, and then determining the order of the nucleotides in the fragments. Because the fragments are sequenced without knowing where in the genome they came from, a computer must then reassemble them by finding where they overlap, a strategy known as the shotgun approach.
The assembly task is further complicated by the need for considerable redundancy: to get an accurate map, the sequenced segments must together contain several times the number of bases in the genome being studied. A supercomputer (such as UCSC’s PitaKluster) tackling this task will produce a series of longer assembled segments, each of which is contiguous and which together represent non-overlapping portions of the genome. These are called contigs. To join the contigs together, researchers must go back to the wet lab and sequence the gaps between the contigs. They home in on the missing sequences using the ends of the existing ones.
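The idea of building contigs from overlapping fragments can be sketched with a toy greedy assembler. This is not GigAssembler or any production algorithm: the function names `overlap`, `greedy_assemble`, and `coverage`, the `min_len` threshold, and the assumption of error-free reads all from the same strand are simplifications made for illustration. Real assemblers must also cope with sequencing errors, repeats, and both DNA strands.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap is at least min_len long.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair repeatedly; return the contigs left."""
    unique = set(reads)
    # Drop reads wholly contained in another read, then fix an order
    # so the result does not depend on set iteration order.
    seqs = sorted(r for r in unique
                  if not any(r != o and r in o for o in unique))
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            # No remaining overlaps: what is left are separate contigs,
            # with gaps of unknown sequence between them.
            return seqs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs)
                if k not in (i, j) and s not in merged] + [merged]


def coverage(reads, genome_length):
    """Average number of times each base of the genome was sequenced."""
    return sum(len(r) for r in reads) / genome_length
```

Given the made-up reads `["ATGGCGTA", "GCGTACCT", "TACCTTAG", "CAAGTCTG", "TCTGAACG"]`, the first three chain together into one contig and the last two into another; nothing links the two contigs, which is the situation in which researchers return to the wet lab to sequence the gap. The `coverage` helper shows the redundancy arithmetic: total sequenced bases divided by the length of the region they came from.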
The shotgun approach can be more effective if it is informed by other knowledge of the genome that is already available. The human genome resides on 23 pairs of chromosomes. The locations of many genes on these chromosomes are already known, which allows some sequences to be placed on the map. The genome can then be pieced together from these fixed segments. This is a bit like solving a jigsaw puzzle using the picture on the cover of the box as a guide.
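As a rough illustration of using known locations as a guide, the sketch below places contigs on a map by looking for known marker sequences inside them. The marker table, its chromosome numbers and positions, and the function name `place_contigs` are hypothetical; real mapping data and matching methods are considerably more involved.

```python
def place_contigs(contigs, markers):
    """Anchor contigs to map positions using known marker sequences.

    markers maps a marker sequence to a (chromosome, position) pair.
    Returns (placed, unplaced), where placed is a list of
    (chromosome, position, contig) tuples sorted by map location.
    """
    placed, unplaced = [], []
    for contig in contigs:
        # Collect the map locations of every marker found in this contig.
        hits = [loc for seq, loc in markers.items() if seq in contig]
        if hits:
            chrom, pos = min(hits)  # anchor on the earliest marker
            placed.append((chrom, pos, contig))
        else:
            unplaced.append(contig)
    placed.sort()
    return placed, unplaced
```

With hypothetical markers `{"GGCGTA": (7, 120), "GTCTGA": (2, 45)}`, a contig containing `GTCTGA` would be assigned to chromosome 2 and one containing `GGCGTA` to chromosome 7, while a contig matching no marker stays unplaced until more information is available.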