How Big Data is Driving Data Dispersal

Increasing data volumes make RAID inadequate for storage integrity. Data dispersal can provide greater data protection and substantial cost savings for digital content storage.

By Chris Gladwin, CEO, Cleversafe

Big Data is circulating through the industry as the latest buzzword. It refers to the rapid growth of unstructured datasets that strain the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time. Prior to 1999, most data was structured, but the introduction of digital music began the proliferation of unstructured data. Back in 1999, 50 percent of data was classified as unstructured, but according to a 2008 survey from HP, about 70 percent of data is estimated to be unstructured. That was three years ago, and that number is likely to be higher today. The Internet has already turned into a video distribution machine, and all this unstructured data is taxing storage systems and creating the need to store multiple petabytes.

Current data storage arrays based on RAID were designed for storing smaller amounts of data. As a result, the cost of RAID-based storage systems increases as the total amount of data stored increases, and data protection degrades, resulting in digital assets that can be lost forever (RAID schemes are based on parity and, at their root, if more than two drives fail simultaneously, data is unrecoverable). The statistical likelihood of multiple drive failures has not been an issue in the past; however, as systems grow to hundreds of terabytes and petabytes, multiple drive failures are now a realistic risk.

Further, drives aren’t perfect, and typical SATA drives have a published bit error rate (BER) of 1 in 10^14, meaning once every 100,000,000,000,000 bits there will be a bit that is unrecoverable. Doesn’t seem significant? In today’s larger storage systems, it is. Unfortunately, having one drive fail and then encountering a bit error while rebuilding from the remaining RAID set is highly probable in real-world scenarios. To put this into perspective, when reading 10 terabytes, an unreadable bit is more likely than not (roughly 55 percent); when reading 100 terabytes, it is nearly certain (99.97 percent).
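
The arithmetic behind those percentages can be sketched in a few lines of Python. This is an illustrative model only: it assumes bit errors are independent, that a terabyte is 10^12 bytes, and that the error rate is exactly one unrecoverable bit per 10^14 bits read. The function name prob_unreadable_bit is made up for this sketch.

```python
import math

BITS_PER_TERABYTE = 8 * 10**12  # assumption: decimal terabytes (10^12 bytes)


def prob_unreadable_bit(terabytes_read: float, bits_per_error: float = 1e14) -> float:
    """Probability of hitting at least one unrecoverable bit while reading.

    Assumes each bit fails independently with probability 1 / bits_per_error.
    """
    bits = terabytes_read * BITS_PER_TERABYTE
    per_bit_error = 1.0 / bits_per_error
    # 1 - (1 - p)^n, written with log1p/expm1 so the tiny p is not lost to rounding
    return -math.expm1(bits * math.log1p(-per_bit_error))


for tb in (10, 100):
    print(f"{tb} TB read: {prob_unreadable_bit(tb):.2%}")
```

Under these assumptions the model gives a little over half for 10 terabytes and about 99.97 percent for 100 terabytes, in line with the figures quoted above.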

RAID advocates tout its data protection capabilities based on models using vendor-specified Mean Time to Failure (MTTF) values. In reality, drive failures within a batch of disks are strongly correlated over time, meaning if a disk has failed in a batch, there is a significant probability of a second failure of another disk.

To combat the shortcomings of RAID, some enterprises use replication, a technique of making additional copies of their data to avoid unrecoverable errors and lost data. However, those copies add cost; typically 133 percent or more additional storage is needed for each additional copy (after including the overhead associated with a typical RAID 6 configuration). Organizations also use replication to help with failure scenarios, such as a location failure, power outages, or bandwidth unavailability. Having seamless access to data is key to keeping businesses running and profitable. As storage grows from the terabyte to the petabyte range, the number of copies required to keep the level of data protection constant increases. This means the storage system will get more expensive as the amount of data increases.

Moore’s Law and the Rise of Dispersed Storage

Before entering the storage industry, I worked in the wireless industry, which has been working with bit error rates for decades. When you talk to storage professionals, they often have no idea what their bit error rate is, whereas wireless professionals have been handling bit errors with erasure coding for 30 years. So as a wireless professional, you look at storage and wonder why the storage industry hasn’t been doing the same thing. The reason this hasn’t happened in the storage industry is Moore’s Law.

Back in 1991, CPUs were far slower. Microprocessors were a million times slower than they are today, so extra calculations, such as implementing erasure coding in software, made systems slower still. Things have changed. In 1991, a hard drive took 60 seconds to read or write an entire platter of data. Drives have grown, but their speed has not kept pace: it now takes a hard drive about 16 hours to read or write a platter of data on a 2-TB drive. Meanwhile, CPUs have become much faster. Once you factor in that the math is a million times cheaper, you can do a lot of computation without noticeable performance degradation or increased cost. That performance, driven by Moore’s Law, enables these advanced erasure coding techniques for storage.

Emerging technologies such as dispersed storage can help organizations significantly reduce costs, power consumption, and the storage footprint, as well as streamline IT management processes. Information dispersal algorithms (IDAs) separate data into unrecognizable slices of information, which are then dispersed to disparate storage locations, which can be located anywhere. Because the data is dispersed across devices, it is more resilient against natural disasters or technological failures, like system crashes and network failures. Thanks to redundant storage and the IDAs, only a subset of slices is needed to reconstitute the original data, so in the event of multiple simultaneous failures across a string of hosting devices, servers, or networks, data can still be accessed in real-time.

For example, in a configuration that generates 16 slices where only 10 are required for data reconstitution, up to six simultaneous failures can occur while still providing users with seamless access to data. From a security standpoint, each individual slice does not contain enough information to understand the original data, and only a subset of the slices from dispersed nodes is needed to fully retrieve all of the data.
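
To make the 16-slice, 10-threshold example concrete, here is a minimal Python sketch of the fault-tolerance arithmetic. It does not implement an information dispersal algorithm; it only counts slices. The per-slice availability of 0.99 is a hypothetical number chosen for illustration, and the model assumes slices fail independently, which real hardware does not always honor.

```python
from math import comb


def tolerated_failures(total_slices: int, threshold: int) -> int:
    """Number of slices that can be lost while the data stays readable."""
    return total_slices - threshold


def availability(total_slices: int, threshold: int, slice_up: float) -> float:
    """Probability that at least `threshold` slices are reachable.

    Assumes every slice is independently reachable with probability `slice_up`.
    """
    return sum(
        comb(total_slices, k) * slice_up**k * (1 - slice_up) ** (total_slices - k)
        for k in range(threshold, total_slices + 1)
    )


print("tolerated failures:", tolerated_failures(16, 10))
# 0.99 per-slice availability is a hypothetical input, not a measured value
print("data availability:", availability(16, 10, slice_up=0.99))
```

The point of the sketch is that data becomes unavailable only when seven or more slices are unreachable at the same time, so overall availability is far higher than that of any single slice.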

The Museum of Broadcast Communications successfully deployed dispersed storage for unstructured data. The non-profit organization collects, preserves, and streams a wide array of historic and contemporary radio and television programs. By leveraging dispersed storage, the organization is able to store 400,000 videos without incurring costly overhead in storage and bandwidth, and still can scale to match capacity as their digital collection grows.

One benefit of dispersing data is that costly replication can be eliminated. Traditionally, information is replicated to create additional copies in case there is an outage or failure at the first location. Because dispersed storage can be configured across multiple sites, it can tolerate a site failure and still have seamless read and write access to the data across the other sites. For example, a configuration with 16 slices and a threshold of 10 could be split across four data centers with four slices in each. If one of the data centers is down, there are still 12 slices across the other three data centers with which to recreate data.
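
The site-failure reasoning can be checked with a short Python sketch. The data center names (dc1 through dc4) and the helper functions are hypothetical; the layout and threshold come from the example above.

```python
from itertools import combinations


def surviving_slices(slices_per_site: dict[str, int], failed_sites: set[str]) -> int:
    """Count slices still reachable after the given sites go down."""
    return sum(n for site, n in slices_per_site.items() if site not in failed_sites)


def max_site_failures_tolerated(slices_per_site: dict[str, int], threshold: int) -> int:
    """Largest number of simultaneous site failures that always leaves
    at least `threshold` slices reachable, whichever sites fail."""
    sites = list(slices_per_site)
    tolerated = 0
    for n_failed in range(1, len(sites) + 1):
        worst_case_ok = all(
            surviving_slices(slices_per_site, set(failed)) >= threshold
            for failed in combinations(sites, n_failed)
        )
        if not worst_case_ok:
            break
        tolerated = n_failed
    return tolerated


# 16 slices spread evenly over four hypothetical data centers, threshold of 10
layout = {"dc1": 4, "dc2": 4, "dc3": 4, "dc4": 4}
print("slices left after one site fails:", surviving_slices(layout, {"dc1"}))
print("site failures tolerated:", max_site_failures_tolerated(layout, threshold=10))
```

With this layout, losing one site leaves 12 slices, which is above the threshold, while losing two sites leaves 8, which is not.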

This means that instead of replicating data across two or three data centers with 360% or more raw storage required (factoring in RAID overhead), dispersed storage can deliver the same data protection with only 160% raw storage required. The beauty of a dispersed storage architecture is that it is highly fault tolerant, so it can route around site failures and network connectivity problems. As long as the threshold number of slices can be retrieved from any of the sites, it can deliver data.
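
The raw-storage comparison can be reproduced with a small Python sketch. The 1.2x RAID overhead per copy is an assumption inferred from the 360% figure for three copies; actual overhead depends on the RAID layout. The function names and the 1 PB usable capacity are illustrative.

```python
def replication_raw_factor(copies: int, raid_overhead: float) -> float:
    """Raw storage per unit of usable data when each copy sits on RAID."""
    return copies * raid_overhead


def dispersal_raw_factor(total_slices: int, threshold: int) -> float:
    """Raw storage per unit of usable data for a slices/threshold dispersal."""
    return total_slices / threshold


usable_pb = 1.0  # illustrative: one petabyte of usable content

# Assumption: three copies, each with 1.2x RAID overhead (3 x 1.2 = 3.6)
replicated = replication_raw_factor(copies=3, raid_overhead=1.2)
dispersed = dispersal_raw_factor(total_slices=16, threshold=10)

print(f"replication: {replicated:.0%} raw, {usable_pb * replicated:.1f} PB of disk")
print(f"dispersal:   {dispersed:.0%} raw, {usable_pb * dispersed:.1f} PB of disk")
```

On these assumptions, each usable petabyte needs 3.6 PB of disk with replication and 1.6 PB with a 16/10 dispersal.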

If a dispersed storage system is configured across multiple data centers, there is a heavy reliance on network connectivity, so multi-site dispersed storage is not well suited for applications that require low latency. However, it is well suited for massive content libraries, especially those growing to petabyte scale, because it is cost-prohibitive to replicate petabytes.

Everyone’s data, structured or unstructured, is growing at an alarming rate. If an organization continues relying on RAID and replication to manage big data, its storage system will eventually become cost-prohibitive, or the organization will have to sacrifice data protection to store all of its data and will inevitably push limits that expose it to data loss.

Forward-thinking executives will make a strategic shift to dispersal to realize greater data protection as well as substantial cost savings for digital content storage. They will also benefit from migrating while their storage is still relatively small, and will realize the cost savings within their own tenure. Executives who wait, or who are in denial about the limitations of RAID, will most likely see their systems fail on their watch.

Chris Gladwin is the co-founder and CEO of Cleversafe. The company was established in 2004 after he wrote algorithms for the first dispersed storage software prototype. Chris was previously the creator of the first workgroup storage server at Zenith Data Systems and was a manager of corporate storage standards at Lockheed Martin. Chris has created over 300 issued and pending patents related to dispersed storage technology, wireless remote interface, and Internet service technology, and his work has been recognized by 32 industry awards for the products, services, and companies he has created. Chris holds an Engineering degree from the Massachusetts Institute of Technology. You can contact the author at pr@cleversafe.com.