
5 Costliest Deduplication Problems -- And How To Avoid Them

How do you pick the best deduplication strategy? We explain how to avoid the five costliest mistakes you can make when choosing deduplication technology for big data environments.

By Jeff Tofano

Deduplication has quickly moved from a hot new technology to a mature, essential requirement for most enterprise data centers. Along the way, several different approaches to deduplication have emerged to address the specific requirements of today's data centers. Some deduplication technologies are best suited to remote/branch offices and data centers protecting smaller data volumes; others are designed for large, data-intensive environments with "big backup" requirements.

The consequences of choosing the wrong deduplication solution can be costly and time consuming, particularly for large enterprises. In this article, we explain how to avoid the five costliest mistakes IT managers can make when choosing a deduplication technology for a big backup environment.

Problem #1: Silos cause costly data center sprawl

Arguably, the costliest mistake a large enterprise can make is choosing a deduplication technology that does not scale across nodes and capacity. "Siloed" deduplication is far less efficient at reducing capacity than global coherent deduplication because siloed systems cannot compare data across systems to identify duplicates. This allows a significant volume of duplicate data to sit undetected on multiple systems.

Siloed solutions also have a significantly higher total cost of ownership (TCO) than do scalable alternatives. One reason for this high TCO is that they force enterprises to purchase new systems every time they run out of either performance or capacity, quickly causing costly data center sprawl. Labor costs increase as IT administrators are required to load balance new systems and to perform ongoing system administration, upgrades, and maintenance tasks on multiple individual systems.

As enterprises continue to increase the volume of data that each administrator is responsible for managing, the added complexity caused by sprawl adds significant unnecessary cost and risk to enterprise data centers. Siloed solutions also cause costly over-buying because they require enterprises to purchase a new system to add performance even if they have not yet used all of their existing capacity.

Solution: Grid-scalable global deduplication

Grid-scalable systems with global coherent deduplication built specifically for enterprise environments are a more cost-efficient choice than siloed deduplication in several ways. First, they are inherently more efficient at reducing capacity because they compare all data as a whole to find duplicates. By deduplicating all data as a single "pool," they can save thousands of dollars in reduced capacity requirements.
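To make the capacity difference concrete, here is a minimal Python sketch, with hypothetical chunk names and a simplified fixed chunk size, that compares what three independent silos store against what a single global pool stores when the same data lands in each silo:

# Minimal sketch: capacity stored by siloed vs. global deduplication.
# Chunk names and the fixed 1 GB chunk size are hypothetical; real systems
# track far more metadata, but the arithmetic is the same.

CHUNK_SIZE_GB = 1

# Each silo backs up an overlapping set of chunks (e.g., shared OS images
# and common databases replicated to several sites).
silo_chunks = [
    {"os_image", "exchange_db", "sql_db", "file_share_a"},
    {"os_image", "exchange_db", "sql_db", "file_share_b"},
    {"os_image", "exchange_db", "sql_db", "file_share_c"},
]

# Siloed dedup: each appliance deduplicates only the data it sees.
siloed_stored = sum(len(chunks) for chunks in silo_chunks) * CHUNK_SIZE_GB

# Global dedup: one pool, duplicates detected across all nodes.
global_stored = len(set().union(*silo_chunks)) * CHUNK_SIZE_GB

print(f"Siloed systems store {siloed_stored} GB")  # 12 GB
print(f"Global pool stores   {global_stored} GB")  # 6 GB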

Second, they allow enterprises to control costs and eliminate over-buying by enabling them to buy only the performance and capacity they need and to add more as it is needed. This linear "pay-as-you-grow" scalability has the added benefit of eliminating the cumbersome process of load balancing additional systems. Scalable deduplication systems also automate all disk subsystem management tasks (including load balancing), eliminating manual processes and wasted administration time.

Third, a grid-scalable deduplication system with a high degree of automation can save thousands of dollars in labor costs by enabling a single administrator to manage data protection for tens (in some cases hundreds) of petabytes of data. At a glance, administrators can see the status of data as it moves through backup, deduplication, replication, archive, restore, and erasure processes. They can make adjustments as needed quickly and easily.

Problem #2: Insufficient performance to meet backup windows and restore requirements

Big backup environments face an ongoing challenge of completing backups within a fixed backup window. They are also challenged to restore massive data volumes within stringent recovery-time objectives (RTOs). Enterprises need a data protection solution that delivers predictable, reliable, high performance that can be scaled up as needed. However, the backup (data ingest) performance of hash-based deduplication solutions can be unpredictable, and these systems typically slow down over time. The performance is unpredictable because they deduplicate by building an index of data as it is backed up. Over time, the index grows, and the "lookups" against the index to find duplicates become increasingly processing intensive, making backup and restore performance slow and unpredictable, with costly results.
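To see why the index becomes a bottleneck, consider the following simplified Python sketch of hash-based chunk deduplication. The fixed chunk size and in-memory dictionary are illustrative assumptions rather than any particular product's design, but the basic shape is the same: fingerprint each chunk, then look it up in an ever-growing index.

import hashlib

# Minimal sketch of hash-based deduplication (simplified assumptions,
# not any particular vendor's implementation).
CHUNK_SIZE = 4 * 1024  # fixed 4 KB chunks; real systems often chunk variably

chunk_index = {}   # fingerprint -> stored chunk location; grows without bound
chunk_store = []   # unique chunks kept on disk

def ingest(stream: bytes) -> list:
    """Back up a stream, storing only chunks not already in the index."""
    recipe = []  # fingerprints needed to rebuild the stream on restore
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_index:          # index lookup on every chunk
            chunk_index[fp] = len(chunk_store)
            chunk_store.append(chunk)
        recipe.append(fp)
    return recipe

# Every unique chunk adds an index entry; once the index outgrows memory,
# lookups spill to disk and ingest performance becomes unpredictable.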

Solution: Scalable, byte-differential deduplication

Deduplication technologies that use a byte-differential methodology to identify duplicates deliver the deterministic high performance that data-intensive environments need without the performance slow-downs that plague hash-based alternatives. Byte-differential technologies use built-in intelligence to identify duplicate data based on the content of the backup stream itself and to perform comparisons at the byte level, eliminating the "growing index" issue.

They also leverage the benefits of "concurrent" processing. That is, they can perform backup, deduplication, replication, and restore processes concurrently using multiple processing nodes for sustained, predictable, high performance. As a result, byte-level differential deduplication technologies designed for large enterprises provide a cost-efficient, reliable way to meet backup windows and RTOs.
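The following Python sketch illustrates the general idea of byte-level differential storage: compare a new backup against a previously stored reference copy and keep only the bytes that changed. It is a generic delta-encoding illustration under simplified assumptions, not any vendor's algorithm.

from difflib import SequenceMatcher

# Generic sketch of byte-level differential storage.

def byte_delta(reference: bytes, new: bytes) -> list:
    """Return operations that rebuild `new` from `reference`,
    storing new bytes only where the streams differ."""
    delta = []
    matcher = SequenceMatcher(None, reference, new, autojunk=False)
    for op, r1, r2, n1, n2 in matcher.get_opcodes():
        if op == "equal":
            delta.append(("copy", r1, r2))        # reuse bytes already stored
        else:
            delta.append(("insert", new[n1:n2]))  # store only the new bytes
    return delta

def rebuild(reference: bytes, delta: list) -> bytes:
    """Restore the new backup from the reference copy plus the delta."""
    out = bytearray()
    for entry in delta:
        if entry[0] == "copy":
            out += reference[entry[1]:entry[2]]
        else:
            out += entry[1]
    return bytes(out)

ref = b"A" * 100 + b"payroll-2011" + b"B" * 100
new = b"A" * 100 + b"payroll-2012" + b"B" * 100
d = byte_delta(ref, new)
assert rebuild(ref, d) == new  # only the changed bytes are stored in `d`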

Problem #3: Unrealistic deduplication capacity reduction ratios

Capacity reduction ratios advertised by many hash-based deduplication vendors are best-case scenarios based on overly optimistic assumptions geared to small to midsize enterprises. In fact, most of the actual capacity savings are realized at ratios below 5:1; the added value of greater ratios is negligible. More important, generic ratios do not reveal that hash-based deduplication technologies typically achieve very poor reduction efficiency for structured data, such as Exchange, Oracle, or SQL databases. Because structured data typically comprises a large proportion of the overall enterprise data volume, the inability to deduplicate it could cost enterprises thousands of dollars in added capacity, power, cooling, and data center footprint.
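The arithmetic behind these diminishing returns is simple: the fraction of capacity saved is 1 - 1/ratio, so the curve flattens quickly as the advertised ratio grows. A few lines of Python make the point:

# Capacity saved as a function of the advertised deduplication ratio:
# saved fraction = 1 - 1/ratio.
for ratio in (2, 5, 10, 20, 50):
    print(f"{ratio}:1 reduces capacity by {(1 - 1/ratio):.0%}")

# 2:1  -> 50%    5:1  -> 80%    10:1 -> 90%
# 20:1 -> 95%    50:1 -> 98%
# Going from 5:1 to 20:1 saves only 15 more points of capacity,
# while the jump from 1:1 to 5:1 delivers the first 80 percent.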

Solution: Enterprise-class deduplication optimized for structured data

Instead of focusing on reduction ratios, choose a deduplication system that delivers the highest reduction efficiency for your specific environment and requirements. Because they are optimized for massive data volumes containing a high proportion of structured data, enterprise-class deduplication technologies that use a byte-level differential methodology deliver significantly more capacity reduction and lower TCO than hash-based alternatives.

Problem #4: Inefficient "all-or-nothing" deduplication wastes processing resources

A typical big backup environment contains a wide range of data types and manages a complex array of retention periods, regulatory requirements, and business continuity policies. The "all-or-nothing" deduplication required by hash-based technologies can be costly and inefficient in these big backup environments. For example, these technologies waste precious processing cycles and slow performance by deduplicating large volumes of data that have little to no duplicate content or potential for capacity reduction. Without the ability to recognize data types, hash-based solutions cannot be "tuned" for efficiency, resulting in significant over-buying of performance and capacity.

Solution: Flexible, "tunable" deduplication technology

Byte-differential deduplication that is content aware reads the content of the data as it is backed up and provides administrators with a wide range of options for tuning the deduplication methodology to their specific mix of data types. For example, they can turn deduplication off for designated volumes of backup data to meet regulatory requirements or to avoid wasting processing cycles on image data or other data that is unlikely to contain duplicates.

These deduplication technologies also automatically detect the type of data being backed up and apply the deduplication algorithm that is most efficient for that data type. They make more efficient use of processing resources and deliver significantly greater capacity reduction than hash-based deduplication solutions.
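As an illustration of what such tuning might look like, the sketch below maps hypothetical data types to deduplication settings; the policy names and options are assumptions made for the sake of the example, not a real product's configuration syntax.

# Hypothetical policy table for a content-aware, tunable deduplication
# target (illustrative only; names and options are assumptions).
DEDUP_POLICIES = {
    "exchange_db":     {"dedup": True,  "method": "byte_differential"},
    "oracle_rman":     {"dedup": True,  "method": "byte_differential"},
    "file_shares":     {"dedup": True,  "method": "byte_differential"},
    "medical_images":  {"dedup": False, "reason": "little duplicate content"},
    "compliance_copy": {"dedup": False, "reason": "regulatory retention requirement"},
}

def plan_backup(data_type: str) -> dict:
    """Look up how a backup stream of the given type should be handled."""
    # Unknown types fall back to deduplication with the default method.
    return DEDUP_POLICIES.get(data_type, {"dedup": True, "method": "byte_differential"})

print(plan_backup("medical_images"))  # dedup skipped, processing cycles saved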

Problem #5: Poor management control through the life cycle

As big backup environments grow, enterprises often make the costly error of trying to "divide and conquer" their backup environments by using multiple siloed backup targets. This approach increases the TCO of data protection for the reasons we've described and prevents IT management from ever getting a holistic view of the data protection environment.

As a result, they never have the information or control they need to make important adjustments or buying decisions to improve their efficiency and cost savings. This fragmented approach dramatically increases the manual intervention needed to manage this data and eliminates the ability to automate important capabilities, such as intelligent tiering, needed to keep pace with continued growth.

Solution: Automation, dashboards, and reporting

Choose a data protection and deduplication technology that enables administrators to monitor and manage the precise status of all data in the backup environment as it passes through backup, deduplication, replication, and restore operations, as well as electronic data destruction. Ensure that reporting delivers granular deduplication ratios by backup policy or data type. Backup technologies with dashboards and reporting that support a holistic view of the data protection environment enable IT managers in big backup environments to manage more data per administrator, to fine-tune the backup environment for optimal efficiency, and to plan accurately for future performance and capacity requirements. These solutions also provide the ability to tier data automatically for cost-saving efficiency.
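As a simple illustration of the kind of granular reporting to look for, the sketch below computes a deduplication ratio per backup policy from logical (pre-deduplication) and stored (post-deduplication) capacity; the policy names and figures are hypothetical.

# Illustrative reporting sketch: per-policy deduplication ratios computed
# from logical (pre-dedup) and stored (post-dedup) capacity. The policy
# names and figures are hypothetical.
backup_jobs = [
    {"policy": "exchange_daily", "logical_gb": 4000, "stored_gb": 250},
    {"policy": "oracle_weekly",  "logical_gb": 9000, "stored_gb": 600},
    {"policy": "image_archive",  "logical_gb": 2000, "stored_gb": 1900},
]

for job in backup_jobs:
    ratio = job["logical_gb"] / job["stored_gb"]
    print(f'{job["policy"]:<15} {ratio:4.1f}:1')

# A low ratio (image_archive above) signals that deduplication should be
# turned off for that policy rather than wasting processing cycles on it.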

A Final Word

Deduplication is a critical part of today's data center. Although there are many deduplication options that are well suited to smaller data centers, big data environments need more powerful, scalable solutions that can help them manage their data protection in a cost-effective, efficient way.

Jeff Tofano, the chief technology officer at SEPATON, Inc., has more than 30 years of experience in the data protection, storage, and high-availability industries. As CTO, he leads SEPATON's technical direction and drives product and architectural decisions to take advantage of SEPATON's ability to address the data protection needs of large enterprises. His experience includes serving as CTO of Quantum and holding senior architect roles with Oracle Corporation, Tandem Computers, and Stratus Technologies. Jeff earned a BA in computer science and mathematics from Colgate University. You can contact the author at [email protected].
