Nearline Solutions: Reducing Data Storage Infrastructure Costs

Why a nearline component solution is the most practical and cost-effective way to keep data warehouse storage costs under control.

In the course of recent discussions with major outsourcers, systems integrators, and IT departments, I learned a very surprising fact: these days, data warehousing professionals typically estimate the total cost of storing data at about $250,000 per terabyte per year. This figure may seem inordinately high. When disk storage can be purchased for as little as $1,000 for a 1-terabyte array, how can the cost of keeping data be so high?

The answer is that many more factors are involved in managing data environments than simple disk storage capacity. Consider the following:

  • To meet agreed-upon service levels for data availability, integrity, and performance, multiple copies of the data are typically maintained on mirrored devices, at off-site disaster recovery facilities, and on development systems

  • Storage devices for industrial use require components with far greater performance and reliability than typically provided by low-cost drives

  • Large amounts of expensive cache memory are required to get adequate performance from large storage arrays

  • Significant processing power is required to perform I/O operations and processes to ensure data availability (backups, for example)

  • Storage farms and the associated processors require expenditures on energy, space, and other environmental factors

  • Off-site “hot sites” require network bandwidth to maintain data currency

  • Considerable administrative work (for example, index building) is involved in retrieving data and making it ready for analysis

  • Data backup and long-term data retention typically require tape subsystems capable of meeting the performance levels needed to guarantee the availability of production data

The combined result: for every terabyte of required production data, as many as six or seven terabytes must be kept online and under management across the enterprise, along with associated infrastructure costs to maintain adequate service levels.
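
As a rough illustration of the arithmetic above, the following sketch divides the quoted total annual cost by the number of terabytes kept under management. The $250,000 figure and the six-to-seven multiplier come from this article; the function name and the assumption that cost is spread evenly across all managed copies are illustrative only.

```python
def cost_per_managed_tb(total_cost_per_production_tb, managed_tb_per_production_tb):
    """Annual cost attributable to each terabyte actually kept under management.

    Assumption (for illustration only): the total cost is spread evenly
    across every managed copy of the data.
    """
    if managed_tb_per_production_tb <= 0:
        raise ValueError("managed_tb_per_production_tb must be positive")
    return total_cost_per_production_tb / managed_tb_per_production_tb


if __name__ == "__main__":
    # Figures quoted in the article: $250,000 per production TB per year,
    # and six or seven managed TB for every production TB.
    for copies in (6, 7):
        per_tb = cost_per_managed_tb(250_000, copies)
        print(f"{copies} managed TB per production TB: "
              f"${per_tb:,.0f} per managed TB per year")
```

With the article's figures, this works out to roughly $36,000 to $42,000 per managed terabyte per year, still far above the raw price of disk, which reflects the other cost factors listed above.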

For these reasons, when a mid-tier service level agreement (SLA) is in place, the cost of provisioning a terabyte of data can easily exceed $200,000 per year. When viewed in light of today's unprecedented proliferation of business data – with some organizations generating terabytes of data every day in the course of normal operations – along with new requirements for data retention, this can make maintenance of an adequate enterprise data warehouse facility seem a prohibitively expensive proposition.

Recent advances in data management technology may hold the key to implementing a cost-effective enterprise data warehouse that can maintain required levels of performance and service, even in the face of the current "data explosion." This article describes the available alternatives, and shows why a solution based on a nearline component is the most practical and cost-effective way to keep data warehouse storage costs under control.

Alternative #1: Keep All the Data in the Warehouse

In a typical environment, many factors may make it difficult or costly to scale data warehouses to handle rapidly expanding volumes of data. The pre-calculated OLAP “cubes” constructed to support multi-dimensional reporting can further add to the data growth, because of the considerable space they occupy.

Furthermore, data warehousing “best practices” often involve maintenance of multiple copies of all the warehouse data. Keeping all data in the warehouse “footprint” means that the typically large amount of data that is infrequently accessed and rarely updated ("static" data) will require the same management attention as frequently used current data: the same high-performance storage environment and the same frequency of processing to ensure availability and integrity. As the warehouse grows, this is likely to dramatically affect service levels, costs, accessibility, or all three.

The removal of relatively static data from the warehouse not only reduces expenditures on high-performance storage, it also allows “housekeeping” activities to be performed much more quickly, resulting in higher availability of data to end users.

Alternative #2: Archive the Static Data

Many effective archiving tools are available that allow data to be retained for as long as it may be required to satisfy regulatory requirements or to meet the data needs of the enterprise. If all an organization's data is effectively archived, it might be asked, why can’t an archive facility be used to relieve the warehouse from the burden of static data?

The answer involves data accessibility. Mounting an operation to recover archived data for use in analysis or reporting can be prohibitively expensive and time-consuming, raising the specter of lost business opportunity, fines from regulatory bodies, and enormous negative impact on productivity. Because the activities that require such data are often unplanned, a standing archive retrieval “process” is impractical to implement, and it is typically difficult to determine what level of resources to keep ready for them.

Alternative #3: Nearline Storage

A more efficient solution than either of the first two alternatives is to move static data out of the main "current" warehouse into a nearline repository where it can be economically stored, normally in a highly compressed format, while remaining easily accessible when required. Nearline capabilities allow less-frequently-used active data (such as aged information or detailed transactions required to rebuild or recast OLAP cubes) to be stored more cost-effectively and without the same degree of replication as online data.
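
To make the trade-off concrete, here is a minimal sketch of how the saving might be estimated. It is not a vendor formula: the function name, the per-terabyte nearline cost, the static fraction, and the compression ratio are all hypothetical inputs that each organization would need to measure for itself. Only the $250,000 online figure is taken from this article.

```python
def annual_storage_cost(production_tb, static_fraction, online_cost_per_tb,
                        nearline_cost_per_tb, compression_ratio):
    """Estimated annual cost when static data is moved to a nearline tier.

    All inputs are hypothetical and for illustration only:
      static_fraction      - share of production data that is static (0 to 1)
      online_cost_per_tb   - fully loaded annual cost per production TB online
      nearline_cost_per_tb - annual cost per stored (compressed) TB nearline
      compression_ratio    - original size divided by compressed size (>= 1)
    """
    if not 0 <= static_fraction <= 1:
        raise ValueError("static_fraction must be between 0 and 1")
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be at least 1")

    static_tb = production_tb * static_fraction
    online_tb = production_tb - static_tb
    # Static data occupies less space once compressed in the nearline store.
    nearline_stored_tb = static_tb / compression_ratio
    return (online_tb * online_cost_per_tb
            + nearline_stored_tb * nearline_cost_per_tb)


if __name__ == "__main__":
    # Hypothetical scenario: 100 TB warehouse, half of it static,
    # nearline at $50,000 per stored TB, 5:1 compression.
    all_online = annual_storage_cost(100, 0.0, 250_000, 50_000, 5)
    with_nearline = annual_storage_cost(100, 0.5, 250_000, 50_000, 5)
    print(f"All online:    ${all_online:,.0f} per year")
    print(f"With nearline: ${with_nearline:,.0f} per year")
```

The model deliberately folds replication, administration, and environmental costs into the two per-terabyte figures, so its output is only as good as those estimates.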

As the name suggests, nearline solutions provide near-real-time access to static data, no matter how large the volume involved. Much of the static data may be accessed only infrequently, but when it is needed, it must be readily available, without extensive data preparation, for use on its own or in conjunction with more recently created data.

Keeping extensive historical and detailed information in a nearline repository makes access much simpler. In some cases, the access performance of nearline data can match that of the “online” warehouse, and can be offered to users in a "seamless" fashion to meet their analysis and reporting needs.

If nearline is so effective, why archive at all? Nearline storage is designed to allow the warehouse to scale cost-effectively to house many terabytes (and even petabytes) of accessible data. However, the nearline data store is often not the “point of record” for the original data. In some cases, nearline data is not in its original form, but represents an abstraction or transformation of the original. In such cases, certified archival storage may still be required.

Also, it may still be necessary to guarantee the retention of data so old that it will probably never be required for reporting or analysis. In these cases, archival storage can be less expensive than retaining the data in accessible form.
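
The resulting division of labor among the online warehouse, the nearline repository, and the archive can be sketched as a simple placement rule. This is an illustration only: the article does not prescribe thresholds, so the use of time since last access as the criterion, the 90-day and 10-year defaults, and all names below are assumptions.

```python
from enum import Enum


class Tier(Enum):
    ONLINE = "online"      # current, frequently used data
    NEARLINE = "nearline"  # static data that must stay readily accessible
    ARCHIVE = "archive"    # data unlikely ever to be needed for analysis


def choose_tier(days_since_last_access,
                nearline_after_days=90,      # hypothetical threshold
                archive_after_days=3650):    # hypothetical threshold
    """Pick a storage tier from the time since the data was last accessed."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must not be negative")
    if nearline_after_days > archive_after_days:
        raise ValueError("nearline threshold must not exceed archive threshold")

    if days_since_last_access >= archive_after_days:
        return Tier.ARCHIVE
    if days_since_last_access >= nearline_after_days:
        return Tier.NEARLINE
    return Tier.ONLINE


if __name__ == "__main__":
    for days in (10, 400, 4000):
        print(f"{days} days since last access: {choose_tier(days).value}")
```

A real policy would likely also consider update frequency and regulatory requirements, and data held nearline may still need a certified archival copy when the nearline version is not the point of record.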

In an age when the data volumes handled by organizations are growing exponentially, it is becoming impossibly expensive to maintain all enterprise data in a traditional "online" data warehouse while providing adequate performance and service levels to users. Implementing a nearline component is an ideal way to relieve the main warehouse of the burden of data that is rarely changed, while still keeping it readily available for access when required. A nearline solution will ultimately enable you to greatly reduce expenditures on storage infrastructure, not simply by offering high levels of data compression, but also by reducing the amounts of data that need to be replicated or moved across the network in the course of normal warehouse operations.

About the Author

Jerry Shattner is executive vice president, marketing and corporate alliances, at SAND Technology.