Mapping the Human Genome was the Easy Part

Storing and managing all of the data created in the genotyping process, and in the analysis that follows, creates unique requirements for research scientists.

When the Human Genome Project (HGP) wrapped up its work in 2003, it had accomplished most of its major goals. The project had identified all of the approximately 20,000-25,000 genes in human DNA, determined the sequences of the 3 billion chemical base pairs that make up human DNA, stored that information in databases, improved tools for data analysis, and transferred related technologies to the private sector. One of the recipients was the Center for Inherited Disease Research (CIDR) at Johns Hopkins University in Baltimore, MD.

CIDR, like many other university- and corporate-based research facilities, has initiated projects that build upon the foundations of the HGP. One of these research projects involves the exhaustive comparison of genotypes in persons with certain diseases.

“Using HGP data,” says Lee Watkins, Jr., director of bioinformatics at CIDR, “we are seeking to understand the genetic bases of certain complex diseases. We get DNA from patients and compare it to unaffected persons to understand patterns and differences in genetic makeup. We are doing these genetic comparisons on a massive scale to identify single-nucleotide differences, as well as small rearrangements of the genetic material.” Improved understanding could lead to better diagnosis, treatments and, perhaps, cures.

In the face of this lofty goal, Watkins quickly found himself challenged by a more mundane task: figuring out how to store and manage all of the data created in the genotyping process and in subsequent analytical activities. The primary data collected is image data: microscopic photographs of genetic material in fluorescent red and green hues.

“We produce multiple TIFF (tagged image file format) images of around 30 to 80 MB per file. This data undergoes some normalization. Then it is used to create another set of data in a format used by analysis tools. Then a final data set is produced that is hundreds of GBs in size,” Watkins explained, adding that, not only is the final data set retained, but so is the “level zero” scientific data (the original imaging data).

“The original image data needs to be retained forever. As new analytical techniques come into play, someone will want to look at the level zero data again,” Watkins said. “It is very expensive to produce all of the data again.”

The massive quantity of data, ranging from a terabyte per week to a terabyte per day, immediately required two things, Watkins said: an archive strategy and storage platforms with sufficient capacity to store the increase in data without bankrupting the research facility. In early 2007, he learned about Capricorn Technologies’ PetaBoxes, which the vendor touts on its Web site as having the lowest cost per TB of any storage platform. Each PetaBox is completely scalable, from individual terabyte nodes to petabyte clusters. A single 19-inch rack can support up to 160TB of raw disk space, which the vendor claims is achieved through an economy of design that consumes as little as 27 watts per terabyte. CIDR deployed a dozen 3TB units initially.
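To put those figures in perspective, here is a back-of-the-envelope sketch in Python that uses only the numbers quoted above. It assumes no replication or file system overhead, so real usable capacity would be lower; the function and variable names are illustrative, not anything CIDR or Capricorn published.

```python
# Rough capacity arithmetic from the figures quoted in the article.
# Assumption: raw capacity only, with no replication or file system overhead.

INITIAL_UNITS = 12    # "a dozen 3TB units"
TB_PER_UNIT = 3
RACK_RAW_TB = 160     # vendor figure for one 19-inch rack
WATTS_PER_TB = 27     # vendor's best-case figure ("as little as")


def days_until_full(capacity_tb, tb_per_period, days_per_period):
    """Days until capacity_tb is consumed at a steady ingest rate."""
    return capacity_tb / tb_per_period * days_per_period


initial_capacity_tb = INITIAL_UNITS * TB_PER_UNIT                # 36 TB
days_at_weekly_rate = days_until_full(initial_capacity_tb, 1, 7)  # 1 TB per week
days_at_daily_rate = days_until_full(initial_capacity_tb, 1, 1)   # 1 TB per day
rack_power_watts = RACK_RAW_TB * WATTS_PER_TB                    # full rack, best case

print(f"Initial raw capacity: {initial_capacity_tb} TB")
print(f"Full in {days_at_weekly_rate:.0f} days at 1 TB/week")
print(f"Full in {days_at_daily_rate:.0f} days at 1 TB/day")
print(f"Full rack at best-case power: {rack_power_watts} W")
```

At the high end of the quoted ingest range, the initial dozen units would hold roughly five weeks of raw data, which suggests why low cost per terabyte mattered so much to the purchase.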

With the hardware platform selected, Watkins says that his team built its own archive software -- initially cobbled together from a database and Samba to give researchers SMB/CIFS-based access to the data. “We were mostly interested in simplicity of management across commodity hardware,” he said, “and we wanted to cluster systems for resiliency and to set duplication of data at a desired rate that would let us recover at any point.”

The solution worked…up to a point. Watkins said that he had underestimated the volume of data and the challenge of managing such an expanding Internet technology archive (writes to the repository are made via HTTP). That’s when Capricorn turned him on to Caringo, an Austin, Texas-based software developer specializing in content addressable storage (CAS).

Caringo’s CAStor software enables any storage platform to be used as a content-addressable storage repository. According to Jonathan Ring, Caringo’s president and founder, CAStor enables an access technique based on a unique and non-changing data object ID that is created when files are ingested into the storage repository.

Said Ring, “CIDR needed to support millions of transactions in simultaneous ingestion streams. They also needed the WORM [write once read many]-like characteristics that would help to ensure the integrity of the data over time. CAStor is simply the only product on the market that can do this.”
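For readers new to the idea, the following is a minimal Python sketch of a write-once store in which each object receives a fixed ID at ingestion. The article does not say how CAStor generates its IDs, so deriving the ID from a SHA-256 hash of the content is purely an assumption made here for illustration. The class `WriteOnceStore` and its methods are hypothetical and are not Caringo's API.

```python
import hashlib


class WriteOnceStore:
    """Toy in-memory store: IDs are assigned at ingest and objects never change.

    Assumption: the ID is the SHA-256 digest of the content. This is one common
    way to build content addressing, not a description of CAStor internals.
    """

    def __init__(self):
        self._objects = {}

    def ingest(self, data):
        """Store the bytes and return their permanent object ID."""
        object_id = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites an existing entry (WORM-like behavior).
        self._objects.setdefault(object_id, bytes(data))
        return object_id

    def read(self, object_id):
        """Return the bytes stored under object_id."""
        return self._objects[object_id]

    def verify(self, object_id):
        """Re-hash the stored bytes to confirm they still match their ID."""
        stored = self._objects[object_id]
        return hashlib.sha256(stored).hexdigest() == object_id


store = WriteOnceStore()
image_id = store.ingest(b"level-zero image bytes")
same_id = store.ingest(b"level-zero image bytes")  # re-ingesting yields the same ID
```

In a scheme like this, the ID doubles as an integrity check: if the stored bytes were ever altered, `verify` would return `False`.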

Watkins said that he could not accept the promises of Caringo at face value. The vendor stated that implementing a CAS repository would be as simple as loading some software from a USB key onto each node in the cluster. He challenged Caringo to demonstrate the functionality, which provided for resiliency by redistributing data ingestion streams in the face of hardware failures without stopping operations.

Watkins reported that problems were detected early on, but that he was impressed by Caringo’s diligence in resolving them. “Early in our use of CAStor, when we were rapidly expanding the cluster, we added a bunch of new nodes with slower (100Mb) NICs. The speed/latency differences with the 1Gb NICs eventually caused some nodes to get confused and take themselves offline. This was a fairly major problem that Caringo quickly addressed -- not quite a crash but definitely a freak-out. Officially this was an unsupported configuration but not explicitly noted that way in their sysadmin guide, which has since been corrected.”

With the fixes in, said Watkins, “We tried to crash the cluster every way we could think of, but it simply wouldn’t fail.” That turned out to be a lifesaver for CIDR. While the data center was being overhauled, Watkins recounted, some of his equipment was staged outside of the power-protected, climate-controlled, raised-floor shop. “We learned the truth about failure rates in commodity SATA disk drives…and we had power failures in a few instances,” he recalled. In the case of disk failures, the Caringo-enabled cluster performed without a hitch, failing streams over to the other clustered storage repository. In the case of power outages, “the clusters simply came back up without error following the outage,” Watkins said.
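The behavior Watkins describes, with writes continuing on surviving hardware and data remaining readable after a failure, can be illustrated with a toy Python sketch. It assumes a fixed replica count and a simple "first online nodes" placement rule. The `Node` and `Cluster` classes are hypothetical and say nothing about how CAStor actually places or recovers data.

```python
class Node:
    """A storage node that can be marked offline to simulate a failure."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.objects = {}


class Cluster:
    """Toy cluster that keeps `replicas` copies of every object."""

    def __init__(self, nodes, replicas=2):
        self.nodes = nodes
        self.replicas = replicas

    def write(self, object_id, data):
        """Store data on the first `replicas` online nodes; return their names."""
        targets = [n for n in self.nodes if n.online][: self.replicas]
        if len(targets) < self.replicas:
            raise RuntimeError("not enough online nodes for the replica count")
        for node in targets:
            node.objects[object_id] = data
        return [n.name for n in targets]

    def read(self, object_id):
        """Return the object from any online node that holds a copy."""
        for node in self.nodes:
            if node.online and object_id in node.objects:
                return node.objects[object_id]
        raise KeyError(object_id)


nodes = [Node("a"), Node("b"), Node("c")]
cluster = Cluster(nodes, replicas=2)
first_placement = cluster.write("img-001", b"tiff bytes")

nodes[0].online = False                      # simulate a failed node
survivor_copy = cluster.read("img-001")      # served from the remaining replica
second_placement = cluster.write("img-002", b"more tiff bytes")
```

With two replicas across three nodes, losing one node leaves every object readable and new writes simply land on the two survivors. A production system would also re-replicate the lost copies, which this sketch omits.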

Watkins has been operating the CAStor-enabled archive for nearly six months now. He said he has added a second product from Caringo, a File System Gateway, which provides SMB/CIFS access for those users who want file system access to the content-addressed storage rather than using HTTP.

Solving the issues of scalability, reliability, and performance enables CIDR to remain focused on its primary goal: obtaining a better understanding of the genetics behind complex diseases. For Caringo, a relative newcomer in the still proprietary and stovepiped world of storage, CIDR is among the first of a growing number of customers who need content addressable storage technology that will work with any storage target. Its goal, to establish CAS as a data management enabler in a hardware-agnostic manner, has been realized in an elegant and affordable software-only solution.

It’s worth a look.