Two large parallel processor clusters, called Swarm and PitaKluster, and a bank of web servers support the UCSC Genome Browser and its associated tools and databases. These facilities also support much of the computational genome research conducted within CBSE. Swarm consists of 256 quad-core Intel Xeon processor compute nodes, each with 8 gigabytes of memory. Swarm has a total of 1024 cores on 4 double-sided racks; it has a theoretical peak performance of over 10,000 gigaflops. It runs Rocks Linux, a Linux distribution optimized for clustering applications. The PitaKluster consists of 128 dual AMD Opteron processor compute nodes, each having 4 gigabytes of memory, housed in three Rackable storage units. The PitaKluster's 396 processors, which also run a Linux operating system, can perform over a trillion instructions per second. Both systems were designed to provide an exceptional amount of inexpensive computing power in minimal space. For memory-intensive jobs, CBSE employs a cluster of 8 machines with dual dual-core AMD processors, each with 32 gigabytes of memory. These computational clusters are supported by 24 data servers and 8 metadata servers running IBM GPFS; together they provide almost 240 terabytes of usable, replicated network storage. The clusters connect to the network infrastructure with at least 1-gigabit Ethernet.
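As a quick check on the Swarm figures quoted above, the aggregate numbers follow from simple multiplication. The short sketch below uses only values stated in the text; the per-core rate it derives is a rough lower bound implied by "over 10,000 gigaflops," not a published specification, and the variable names are illustrative.

```python
# Figures quoted in the text for the Swarm cluster.
SWARM_NODES = 256          # compute nodes
CORES_PER_NODE = 4         # quad-core processors
MEMORY_PER_NODE_GB = 8     # gigabytes of memory per node
PEAK_GFLOPS = 10_000       # "over 10,000", so treated here as a lower bound

# Aggregate core count and memory across the whole cluster.
total_cores = SWARM_NODES * CORES_PER_NODE
total_memory_gb = SWARM_NODES * MEMORY_PER_NODE_GB

# Implied theoretical peak per core (lower bound, derived, not quoted).
gflops_per_core = PEAK_GFLOPS / total_cores

if __name__ == "__main__":
    print(f"total cores:       {total_cores}")
    print(f"total memory (GB): {total_memory_gb}")
    print(f"GFLOPS per core:   {gflops_per_core:.2f}")
```

The computed core count matches the 1024 cores stated in the text, and the cluster holds about 2 terabytes of memory in total.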
CBSE also employs a high-availability setup of connected computers for virtual machine hosting, along with two computers for software and database development, each with 64 computing cores and 1 terabyte of memory. These attach to 24 terabytes of local disk space. Together, the redundant high-availability machines have 64 computing cores, 500 gigabytes of memory, and access to 20 terabytes of local disk space.
These computing capabilities are replicated at the UC San Diego Supercomputer Center, where CBSE maintains a file system with 600-terabyte capacity, a 16-node, 512-core cluster, and a machine with 1 terabyte of memory and 64 computing cores for software development, all connected by an internal 10-gigabit-per-second network infrastructure.
The web servers for the UCSC Genome Browser are housed in a data center designed to operate around the clock, every day of the year. They consist of 6 machines with dual 12-core AMD Opteron processors; each offers 64 gigabytes of internal solid-state storage and 128 gigabytes of memory. These machines have access to a central file server that provides 10 extra terabytes of shared disk area and a central MySQL database server that holds up to 13 terabytes of genomic data. Fifteen additional servers with 16 to 64 gigabytes of memory, plus one 8-gigabyte cloud server, provide web access to BLAT (BLAST-like alignment tool) software and its memory-intensive calculations. Servers available for public use include a genome preview server for access to raw data before it has gone through QA, a server that hosts all the browser MySQL data, a server to store user-generated custom tracks, and a wiki server that holds public information and can keep track of named sessions. A local download server, which lets users download our data, serves nearly 2 terabytes of data every day. For redundancy and load balancing, we house an identical download server and one additional file server at the UC San Diego Supercomputer Center.
Why Parallel Processors?
Computer clusters such as these are a cost-effective way to process large amounts of data. Because many bioinformatics problems are “embarrassingly parallel,” meaning the work splits into jobs that run independently of one another, they do not require high-speed inter-process communication to perform calculations. This eliminates the need for high-priced networking equipment. By running parallel but separate computations on many processors, we have pioneered the development of “super-computing on-the-cheap” for the specific needs of genome presentation, annotation, and analysis.
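To illustrate what "embarrassingly parallel" means in practice, here is a minimal sketch in Python. It is not the software used on these clusters: the GC-content calculation, the function names, and the use of a local thread pool are all assumptions chosen for illustration. The point is that each job reads only its own input and shares nothing with the others, so the same pattern could be spread across many cluster nodes without fast interconnects.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G/C bases in one sequence.

    The function touches no shared state, so every call is independent.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def run_independent_jobs(sequences, max_workers=4,
                         executor_cls=ThreadPoolExecutor):
    """Apply gc_fraction to every sequence as a separate job.

    executor_cls is a hypothetical hook: a process pool or a cluster
    scheduler wrapper could be substituted, because the jobs never
    communicate with one another.
    """
    with executor_cls(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(gc_fraction, sequences))


if __name__ == "__main__":
    demo = ["ACGT", "GGCC", "ATAT", "GATTACA"]
    for seq, frac in zip(demo, run_independent_jobs(demo)):
        print(f"{seq}: {frac:.3f}")
```

A thread pool keeps the sketch small and portable; for CPU-bound work in Python, `concurrent.futures.ProcessPoolExecutor` can be passed as `executor_cls` instead, and on a real cluster each job would typically be handed to a batch scheduler.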
The Swarm cluster is the fourth-generation bioinformatics cluster at UCSC, operating alongside the third-generation PitaKluster, which gradually took over from the second-generation system, the KiloKluster. The first generation was a cluster of 100 Pentium III processors built to assemble the first working draft of the human genome in June 2000, using GigAssembler, a 10,000-line program written by Jim Kent.
These computing systems are funded through the Howard Hughes Medical Institute, the National Human Genome Research Institute (NHGRI), the California Institute for Quantitative Biosciences (QB3), and the National Cancer Institute.