New findings from the ENCODE project challenge established views on human genome

This was the centerfold in the June 14, 2007 issue of Nature, from the ENCODE Consortium paper.
Wednesday, June 13, 2007
Branwyn Wagman and NHGRI

The University of California, Santa Cruz played a key role in an international project to catalog all of the biologically functional elements in 1 percent of the human genome. The project has culminated today with the publication of a set of papers that promise to reshape our understanding of how the human genome functions. The UCSC Genome Bioinformatics Group, headed by biomolecular engineering professor David Haussler and associate research scientist Jim Kent, adapted the internationally recognized UCSC Genome Browser as the data repository for this project.

The findings challenge the traditional view of our genetic blueprint as a tidy collection of independent genes, pointing instead to a complex network in which genes, along with regulatory elements and other types of DNA sequences that do not code for proteins, interact in overlapping ways not yet fully understood.

The UCSC Genome Browser web site allows researchers unencumbered access to the wealth of data produced by the international consortium. It showcases this data so that genetic scientists can mine it for clues about how the body works in health and in disease.

In a group paper published in the June 14 issue of Nature and in 28 companion papers published in the June issue of Genome Research, the ENCyclopedia Of DNA Elements (ENCODE) consortium, which is organized by the National Human Genome Research Institute (NHGRI), part of the National Institutes of Health (NIH), reported results of its exhaustive, four-year effort to build a parts list of all biologically functional elements in 1 percent of the human genome. Carried out by 35 groups from 80 organizations around the world, the research served as a pilot to test the feasibility of a full-scale initiative to produce a comprehensive catalog of all components of the human genome crucial for biological function.

"The sheer number of ENCODE data providers and the diversity of experimental methods used to generate this data presented a challenge to the UCSC team," remarked Kate Rosenbloom, lead software developer on the UCSC ENCODE team. "We were continually customizing our software for effective visualization and efficient retrieval of new data types."

The Nature issue will include a pull-out poster that is a screen shot of the UCSC Genome Browser concisely displaying a broad range of the ENCODE data.

In addition to serving as a data repository, the UCSC team has provided programming for the comparative genomics aspect of the ENCODE project through the Multi-Species Sequence Analysis Group. By aligning the human genome with multiple other species, it is possible to glean an understanding of the relative importance and roles of DNA sequences. The UCSC Genome Browser has been designed to provide such alignments, and it currently displays 38 species, from simple organisms such as yeast and worms to mice, chimps, and humans. Comparative genomics is a research focus of the Haussler laboratory.

Authors of the ENCODE papers include researchers from academic, governmental and industry organizations located in Australia, Austria, Canada, Germany, Japan, Singapore, Spain, Sweden, Switzerland, the United Kingdom and the United States. The ENCODE project has been open to all interested researchers who agree to abide by the consortium's guidelines.

The UCSC team, led by Haussler and Kent, includes software developers Kate Rosenbloom and Rachel Harte, project manager Donna Karolchik, quality assurance manager Robert Kuhn, graduate student Daryl Thomas, and postdoctoral scholar Ting Wang, along with the entire genome browser staff and other Haussler lab graduate students and postdocs.

Several of the UCSC participants attended an "ENCODE analysis jamboree" in Washington, DC in July 2005, where they provided custom programming services to consortium members and trained them in the use of the UCSC ENCODE browser. In fall 2005, the UCSC group hosted two ENCODE analysis groups for several days of focus on genes, gene transcription, and transcription regulation.

"This impressive effort has uncovered many exciting surprises and blazed the way for future efforts to explore the functional landscape of the entire human genome," said NHGRI Director Francis S. Collins, M.D., Ph.D. "Because of the hard work and keen insights of the ENCODE consortium, the scientific community will need to rethink some long-held views about what genes are and what they do, as well as how the genome's functional elements have evolved. This could have significant implications for efforts to identify the DNA sequences involved in many human diseases."

The completion of the Human Genome Project in April 2003--aided by the bioinformatics contribution of Haussler and Kent--was a major achievement, but the sequencing of the genome marked just the first step toward the goal of using such information to diagnose, treat and prevent disease. Having the human genome sequence is similar to having all the pages of an instruction manual needed to make the human body. Researchers still must learn how to read the manual's language so they can identify every part and understand how the parts work together to contribute to health and disease.

In recent years, researchers have made major strides in using DNA sequence data to identify genes, which are traditionally defined as the parts of the genome that code for proteins. The protein-coding component of these genes makes up just a small fraction of the human genome--1.5 percent to 2 percent. Evidence exists that other parts of the genome also have important functions.

However, until now, most studies have concentrated on functional elements associated with specific genes and have not provided insights about functional elements throughout the genome. The ENCODE project represents the first systematic effort to determine where all types of functional elements are located and how they are organized.

In the pilot phase, ENCODE researchers devised and tested high-throughput approaches for identifying functional elements in the genome. Those elements included genes that code for proteins; genes that do not code for proteins; regulatory elements that control the transcription of genes; and elements that maintain the structure of chromosomes and mediate the dynamics of their replication.

The collaborative study focused on 44 targets, which together cover about 1 percent of the human genome sequence, or about 30 million DNA base pairs. The targets were strategically selected to provide a representative cross section of the entire human genome. All told, the ENCODE consortium generated more than 200 datasets and analyzed more than 600 million data points.

"Our results reveal important principles about the organization of functional elements in the human genome, providing new perspectives on everything from DNA transcription to mammalian evolution. In particular, we gained significant insight into DNA sequences that do not encode proteins, which we knew very little about before," said Ewan Birney, Ph.D., head of genome annotation at the European Molecular Biology Laboratory's European Bioinformatics Institute (EBI) in Hinxton, England, who led ENCODE's massive data integration and analysis effort.

The ENCODE consortium's major findings include the discovery that the majority of DNA in the human genome is transcribed into functional molecules, called RNA, and that these transcripts extensively overlap one another. This broad pattern of transcription challenges the long-standing view that the human genome consists of a relatively small set of discrete genes, along with a vast amount of so-called junk DNA that is not biologically active.

The new data indicate the genome contains very little unused sequence and, in fact, is a complex, interwoven network. In this network, genes are just one of many types of DNA sequences that have a functional impact. "Our perspective of transcription and genes may have to evolve," the researchers state in their Nature paper, noting the network model of the genome "poses some interesting mechanistic questions" that have yet to be answered.

"Teamwork was essential to the success of this effort. No single experimental approach can be used to identify all functional elements in the genome. So, it was necessary to conduct multiple, diverse experiments and then analyze them using multiple computational methods," said Elise A. Feingold, Ph.D., program director for ENCODE in NHGRI's Division of Extramural Research, which provided most of the funding for the pilot project.

"Following the Human Genome Project's model of free and rapid data access, we have designated ENCODE as a community resource project. This designation means all ENCODE data were deposited in public databases as soon as they were experimentally verified," said Peter Good, Ph.D., program director for genome informatics in NHGRI's Division of Extramural Research.

The main portal for ENCODE data is the University of California, Santa Cruz's ENCODE Genome Browser; the analysis effort is coordinated from Ensembl, a joint project of EBI and the Wellcome Trust Sanger Institute. Much of the primary data have been deposited in databases at the NIH's National Center for Biotechnology Information and at EBI. More detailed information on the ENCODE project, including the consortium's data release and accessibility policies and a list of NHGRI-funded participants, is available online.

"It would have been impossible to conduct a scientific exploration of this magnitude without the skills and talents of groups representing many different disciplines. Thanks to the ENCODE collaboration, individual researchers around the world now have access to a wealth of new data that they can use to inform and shape research related to the human genome," said Eric D. Green, M.D., Ph.D., director of NHGRI's Division of Intramural Research, which has multiple investigators participating in the ENCODE research consortium.

Article in Nucleic Acids Research, Dec. 13, 2006: "The ENCODE Project at UC Santa Cruz"

