Deduplication: Five Key Questions Help You Find the Right Solution

Deduplication on virtual tape libraries is a hot technology. Before jumping in, however, you’ll need answers to five important questions.

by Miklos Sandorfi

Deduplication on virtual tape libraries (VTLs) is a hot new technology that promises to help offset the exponential data growth facing today’s enterprises. Data deduplication vendors are actively promoting the relative features and benefits of their products and, in the process, creating a great deal of confusion about which solution is best in any given backup environment.

Answering the following questions will help you eliminate this confusion and use meaningful criteria to choose a solution that best meets your needs.

1. How fast can you back up data with deduplication?

Backup performance is essential for successful data protection. Choose a deduplication technology that will allow you to complete backups within your backup windows -- and plan for meeting those windows even after three years of continued data growth.

Some technologies deduplicate data before backing it up to the VTL (in-line deduplication); others back up the data to the VTL and then deduplicate it (post-process deduplication). Generally speaking, in-line deduplication solutions offer a fast, efficient option for small to medium-sized organizations or departments within larger enterprises. However, in-line deduplication cannot handle large backup volumes or scale performance and capacity to meet the needs of enterprise-class environments.

Post-process deduplication backs up data to the VTL at wire speed and then performs the analysis, comparison, deduplication, and capacity reclamation processes. Although this method needs slightly more disk space to handle incoming backups and the deduplication process, it can scale performance and capacity easily to handle up to petabytes of data on a single appliance. In addition, because it backs up a full set of data before deduplicating it, the post-process method enables more rigorous data integrity checking.

To assess whether your backup volume is small-to-midsize or enterprise class, calculate how much data you back up nightly and in your weekly full backups. Compare this volume to the maximum hourly multi-stream throughput available on the VTLs with the deduplication method you are considering.
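The comparison described above is simple arithmetic, and a rough sketch like the following can make it concrete. The function names, the 30 percent growth rate, and every figure in the example are hypothetical placeholders, not vendor specifications; substitute your own measured volumes and the throughput quoted for the VTL you are evaluating.

```python
def hours_to_back_up(data_tb: float, throughput_tb_per_hr: float) -> float:
    """Hours needed to move data_tb at a sustained throughput."""
    if throughput_tb_per_hr <= 0:
        raise ValueError("throughput must be positive")
    return data_tb / throughput_tb_per_hr


def projected_volume_tb(data_tb: float, annual_growth: float, years: int) -> float:
    """Backup volume after compound annual growth (0.30 means 30% per year)."""
    return data_tb * (1 + annual_growth) ** years


def fits_window(data_tb: float, throughput_tb_per_hr: float, window_hr: float,
                annual_growth: float = 0.0, years: int = 0) -> bool:
    """True if the (optionally projected) volume finishes inside the window."""
    future_tb = projected_volume_tb(data_tb, annual_growth, years)
    return hours_to_back_up(future_tb, throughput_tb_per_hr) <= window_hr


# Hypothetical site: 40 TB weekly full, 5 TB/hr VTL, 12-hour backup window.
print(fits_window(40, 5, 12))                                # True: 8 hours needed
print(fits_window(40, 5, 12, annual_growth=0.30, years=3))   # False after 3 years
```

In this made-up case the appliance meets the window today but not after three years of growth, which is the situation the sizing exercise is meant to expose.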

2. How fast can you restore your data?

Ensure your deduplication solution can restore data quickly and easily. Understand the time required to restore files that were backed up within the previous 30 days (the most common category of restore request).

Some deduplication technologies use the first backup as a reference copy. All subsequent backups are compared to it for duplicates. Duplicate blocks in every subsequent backup are replaced with pointers to this earliest reference copy. As a given piece of data is backed up numerous times, it is broken up into more and more pointers. To restore that data, the software has to locate and compile it correctly from these pointers, which can be a complex and time-consuming process.

Other technologies perform this function in reverse. They use the data in the most recent backup as the reference copy and replace duplicate data stored in previous backups with pointers. This technique can only be used with post-process deduplication technologies. Its main advantage is that it enables the software to restore recently backed up data nearly instantaneously by eliminating or minimizing the amount of reassembly required.
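The difference between the two referencing schemes can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not any vendor's implementation: each backup is a list of block labels, each unique block is stored exactly once, and the helper names `placement` and `pointers_needed` are invented for this example.

```python
def placement(backups, newest_wins):
    """Map each block to the index of the backup that holds its single stored copy.

    newest_wins=False models the first scheme (earliest backup is the reference).
    newest_wins=True models the second scheme (most recent backup is the reference).
    """
    home = {}
    for index, blocks in enumerate(backups):
        for block in blocks:
            if newest_wins or block not in home:
                home[block] = index
    return home


def pointers_needed(backups, target, newest_wins):
    """Count blocks of backups[target] that must be fetched through a pointer."""
    home = placement(backups, newest_wins)
    return sum(1 for block in backups[target] if home[block] != target)


# Three hypothetical nightly backups; letters stand for data blocks.
backups = [
    ["a", "b", "c", "d"],   # oldest
    ["a", "b", "c", "e"],
    ["a", "b", "f", "e"],   # most recent
]

print(pointers_needed(backups, target=2, newest_wins=False))  # 3
print(pointers_needed(backups, target=2, newest_wins=True))   # 0
```

In this toy model, restoring the most recent backup follows three pointers when the earliest backup is the reference and none when the most recent backup is the reference. The cost does not disappear: it moves to the older backups, which are restored less often.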

3. How efficient will deduplication be in your specific environment?

Look for vendors that will test and characterize samples of your backup data and provide clear expectations of the levels of deduplication you can expect from their technology -- before you buy.

Deduplication efficiency depends on many variables related to both the deduplication technology and the specific environment in which it is used. It stands to reason that the more duplicate data in your backups, the more capacity reduction your deduplication technology can provide.

The deduplication efficiency delivered by different technologies also varies widely. To deduplicate data, the software must compare all data in the backup set to the previously stored data. Comparing every byte of backup data to every byte of stored data would be a time-consuming, processing-intensive challenge. Instead, some technologies run incoming data through a hashing algorithm to create a hash: a small representation of the data that serves as a unique identifier for it. They then compare the hash to previous hashes stored in a lookup table. When a match is found, the duplicate data is replaced with a pointer to the copy already stored under that hash. If there is no match, the data is stored and its hash is added to the lookup table. In a small-to-midsize backup environment, this method may offer sufficient deduplication efficiency (typically 20:1) with reasonable performance. This technique also works well for in-line deduplication.
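The hash-comparison method can be sketched in a few lines. This is a minimal illustration only, under several assumptions that are not taken from any product: fixed-size 4 KB chunks, SHA-256 as the hashing algorithm, an in-memory dictionary as the lookup table, and the invented function names `deduplicate` and `restore`.

```python
import hashlib


def deduplicate(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and keep one copy of each unique chunk.

    store is the lookup table (hash -> chunk). The returned list of hashes is
    the set of pointers needed to rebuild data later.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:        # no match: store the new chunk
            store[digest] = chunk
        recipe.append(digest)          # match or not, record a pointer
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data by following the pointers in order."""
    return b"".join(store[digest] for digest in recipe)


# Two hypothetical backups that share their first 4 KB chunk.
store = {}
monday = b"A" * 4096 + b"B" * 4096
tuesday = b"A" * 4096 + b"C" * 4096

monday_recipe = deduplicate(monday, store)
tuesday_recipe = deduplicate(tuesday, store)

print(len(store))                                   # 3 unique chunks, not 4
print(restore(tuesday_recipe, store) == tuesday)    # True
```

Four chunks were backed up but only three are stored, because the shared chunk hashes to the same value both times. Real products differ in chunking strategy, hash choice, and how the table is persisted.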

At the enterprise scale, however, the hash-comparison method may not deliver a sufficient level of capacity reduction or performance. Some enterprise-class deduplication technologies use built-in knowledge of the backup data sets to compare backup data to stored data at the object level (e.g., Word files to Word files, Excel files to Excel files). This process identifies likely areas of duplication that can then be compared at the byte level for optimal capacity reduction (25:1 or 50:1 when combined with hardware compression on the VTL).

Another distinction between deduplication technologies is how well or poorly they perform on incremental backups or “incrementals forever” backup scenarios. Read the fine print on the data sheets. Ask for references from customers that use the same backup application and have policies and data types similar to yours.

4. How scalable is the solution? What happens when you run out of capacity and performance?

Calculate how much data you will be able to store on a single VTL with deduplication given your specific deduplication ratios, policies, and data types. Understand the implications of exceeding that capacity in terms of administrative complexity, capital expense, and disruption to your environment.
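As a rough sketch of that calculation, the snippet below converts physical disk capacity into logical (pre-deduplication) capacity and then into weeks of retained full backups. The function names and all figures are hypothetical, and the model assumes the deduplication ratio stays constant over the retention period, which will not hold exactly in practice.

```python
def logical_capacity_tb(physical_tb: float, dedup_ratio: float) -> float:
    """Backup data that fits on physical_tb of disk at a given ratio (20 means 20:1)."""
    return physical_tb * dedup_ratio


def weeks_of_retention(physical_tb: float, dedup_ratio: float,
                       weekly_backup_tb: float) -> int:
    """Whole weeks of backups that fit, assuming a constant ratio."""
    return int(logical_capacity_tb(physical_tb, dedup_ratio) // weekly_backup_tb)


# Hypothetical appliance: 100 TB of disk, 50 TB backed up per week.
print(weeks_of_retention(100, 20, 50))   # 40 weeks at 20:1
print(weeks_of_retention(100, 10, 50))   # 20 weeks at 10:1
```

Halving the ratio halves the retention, which is why the ratio a vendor measures on samples of your own data matters more than the figure on the data sheet.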

Some deduplication technologies, particularly those performed in-line, cannot easily scale. To perform backups fast enough to stay within your backup window with these technologies, you need to add multiple independently managed appliances. Overall efficiency of deduplication is reduced because the data comparisons that identify duplicate data are only performed within individual devices.
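A small model shows why independent appliances lose efficiency. In this sketch, which is an illustration under simplifying assumptions rather than a description of any product, each backup stream is a list of block labels and each appliance keeps its own private set of unique blocks. The function name `stored_blocks` and the sample data are invented.

```python
def stored_blocks(streams_per_appliance):
    """Total blocks kept on disk when each appliance deduplicates only its own streams."""
    total = 0
    for streams in streams_per_appliance:
        unique = set()                 # this appliance's private index
        for stream in streams:
            unique.update(stream)
        total += len(unique)
    return total


# Two hypothetical servers that share operating system and application blocks.
server_a = ["os", "app", "db1"]
server_b = ["os", "app", "db2"]

print(stored_blocks([[server_a, server_b]]))     # 4: one appliance sees both streams
print(stored_blocks([[server_a], [server_b]]))   # 6: two appliances, nothing shared
```

With a single index the shared blocks are stored once. Split across two appliances, every block is stored as if no deduplication between the servers had taken place.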

Some VTL technologies can handle large backup volumes, but require a “forklift upgrade,” even within the product line, to add performance and capacity. Consider an enterprise-class deduplication solution that can back up and restore data as fast as 17 TB/hr and handle petabytes of data in a single appliance.

5. How much will you save?

Evaluate the deduplication technology within the context of your overall backup and recovery solution. Avoid solutions that replace one big problem with a handful of smaller ones.

Avoid deduplication technologies that reduce storage capacity requirements but require you to manage multiple independent VTLs. The reduced cost of power and floor space achieved through these solutions is offset by added administrative complexity. Given the cost of labor and the relative risk of human error, the “silos of storage” method may be more costly than no deduplication at all.

Deduplication technology is an exciting new development in backup and recovery that promises to dramatically reduce capacity requirements, power consumption, and footprint. However, to gain these benefits in a real-world environment, you need to ignore the hype, understand the technologies, and ask the hard questions.

- - -

Miklos Sandorfi is chief technology officer at SEPATON, Inc. You can reach the author at