
What Does "Archive" Really Mean? (Part 2 of 2)

More misconceptions clarified about the meaning of storage archiving, plus criteria you should use to make smart archive purchases.

Last week I described the formation of the Active Archive Alliance, and cleared the air about what "archive" really means for storage professionals.

This week I will conclude my discussion of the term "archive" by exploring two more misconceptions about the technology, and I'll offer a list of ten criteria you should use when making a smart archive purchase.

Storage Resource Management is Not "Archive"

Like hierarchical storage management (HSM), discussed last week, storage resource management (SRM) is not a substitute for intelligent data management. Also like HSM, SRM focuses on the efficiency of capacity allocation rather than on the data itself. A survey of SRM product whitepapers and marketing materials reveals that storage resource management means different things to different vendors.

Most SRM products, for example, provide tools for managing and monitoring storage device configurations, connectivity, capacity, and performance. Some SRM products provide tools for designing, monitoring, and maintaining replication processes intended to protect data, including backup and various forms of local and remote mirroring. Still other SRM products provide tools for automating those routine management tasks that lend themselves to automation, reducing the labor cost component of IT administration.

SRM vendors do offer a range of reports that can be useful for exploring data repositories and for identifying candidates both for deletion (contraband files identifiable by their file extensions) and for migration (files identifiable by their last-accessed/last-modified metadata). These reports are also useful for showing how data is currently laid out on infrastructure and for spotting “hot spots” or other escalating conditions that may impair access speeds or application performance.
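
To make that reporting function concrete, here is a minimal sketch of the kind of scan such reports are built on. It is not modeled on any particular SRM product; the “contraband” extension list, the 180-day inactivity threshold, and the root path are assumptions chosen purely for illustration.

    # Minimal sketch of an SRM-style report: flag deletion candidates by file
    # extension and migration candidates by last-access age. The extension set
    # and the 180-day threshold are illustrative assumptions, not product defaults.
    import os
    import time

    CONTRABAND_EXTENSIONS = {".mp3", ".avi", ".mov"}    # hypothetical "contraband" file types
    MIGRATION_AGE_DAYS = 180                             # hypothetical inactivity threshold

    def scan(root):
        deletion_candidates, migration_candidates = [], []
        now = time.time()
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    info = os.stat(path)
                except OSError:
                    continue                             # skip files that cannot be stat'ed
                ext = os.path.splitext(name)[1].lower()
                if ext in CONTRABAND_EXTENSIONS:
                    deletion_candidates.append(path)     # candidate for deletion
                elif (now - info.st_atime) > MIGRATION_AGE_DAYS * 86400:
                    migration_candidates.append(path)    # not accessed recently; candidate for migration
        return deletion_candidates, migration_candidates

    if __name__ == "__main__":
        delete, migrate = scan("/data")                  # example root path (assumption)
        print(f"{len(delete)} deletion candidates, {len(migrate)} migration candidates")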

Some provide tools to “watch” the on-array algorithms governing value-add features such as thin provisioning or de-duplication -- often to provide consumers with greater oversight and confidence in these technologies.

Unfortunately, many leading SRM vendors misrepresent their products as data management products. Intentionally or not, they claim that SRM delivers the utilization efficiency that is often missing from storage, when the correct wording would be “allocation efficiency” or “operational efficiency” of the infrastructure. As the name implies, storage resource management is about managing storage resources, not data.

Information Lifecycle Management is Not "Archive"

Perhaps the greatest damage done to intelligent data management and archive was the marketing surrounding information lifecycle management (ILM) promulgated by leading storage array vendors in the late 1990s and early 2000s. ILM is an old idea, its origins harking back to mainframes. IBM correctly asserted that ILM involved, at a minimum, four things.

First, you needed a means to classify data assets (to understand what to move around infrastructure). Second, you needed a method for classifying storage assets (to know the targets to which classified data could be moved). Third, true ILM required a Policy Engine that would describe what data to move and under what circumstances or conditions. Finally, you needed a data mover -- software that would move the data physically from device to device based on policy.
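
To make those four parts concrete, here is a minimal sketch of how they might fit together. The data classes, storage tiers, and the single 90-day retention rule are invented for illustration and are not drawn from IBM's implementation.

    # Minimal sketch of the four ILM components named above: data classification,
    # storage classification, a policy engine, and a data mover. The classes,
    # tiers, and the single 90-day rule are invented for illustration.
    import os
    import shutil
    import time

    # Storage classification: named classes of storage mapped to targets.
    STORAGE_CLASSES = {"tier1_disk": "/mnt/tier1", "archive": "/mnt/archive"}

    def classify_data(path):
        """Data classification: a deliberately trivial rule based on file extension."""
        return "project_record" if path.endswith(".docx") else "general"

    def policy_engine(path, data_class):
        """Policy engine: decide what to move and under what conditions."""
        age_days = (time.time() - os.stat(path).st_mtime) / 86400
        if data_class == "project_record" and age_days > 90:   # hypothetical retention rule
            return "archive"
        return "tier1_disk"

    def data_mover(path, target_class):
        """Data mover: physically relocate the file to the target storage class."""
        target_dir = STORAGE_CLASSES[target_class]
        if not path.startswith(target_dir):                    # already on the right tier?
            shutil.move(path, os.path.join(target_dir, os.path.basename(path)))

    def manage(path):
        data_mover(path, policy_engine(path, classify_data(path)))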

When EMC resurrected the ILM concept in the late 1990s and began a marketing barrage that found many competitors pitching the same functionality, the company ignored the first three parts of IBM's definition and instead proffered machine processes for data movement: basically, HSM. The result came to be criticized as “information Feng Shui management” because it provided no support for data classification, storage classification, or policy-driven management -- the “heavy lifting” of any ILM process.

True ILM is synonymous with intelligent data management, but ILM as represented in most vendor marketing is not true ILM. It is instead analogous to the electronic filing of tax returns: a few years ago, the IRS provided a means to transmit completed returns electronically, but offered preparers no tools for sorting out which receipts were deductible expenses (data classification), for deciding which forms to use for which declarations (storage classification), or for determining which returns needed to be retained in an available state for possible review by examiners (policy rules). The IRS, in other words, did not provide a complete “ILM” solution: the heavy lifting of tax preparation remains the burden of the preparer.

Archive is part of a true ILM strategy, but it is not ILM. It is, rather, one of a set of services that must be applied to data based on a thoughtful assessment of the data's business context and as part of its lifecycle management.

Going Forward

Clearly, a mixture of deliberate and inadvertent confusion has entered into the discussion around archive. The Active Archive Alliance is seeking to redress this confusion and to help consumers address the root cause of high storage costs, out-of-control file proliferation, and regulatory non-compliance: data mismanagement. Getting there will require a systematic, policy-driven mechanism for data classification and for routing data across infrastructure, and ultimately a reliable archival repository -- preferably one that is open and standards-based.

That we haven’t seen any vendor become the industry’s “archive giant” reflects the fact that disinformation around archive is so pervasive, and the educational hurdles to be surmounted before a sale can be made are so daunting, that no single vendor is able or willing to make the necessary investment. Perhaps the Active Archive Alliance can.

One thing that might help is the creation of a simple checklist of criteria that the consumer can use to make smart archive purchases. This list should include, at a minimum, the following:

  1. The system should enable the classification of data assets in a granular and business-focused manner, ideally at the time of inception or creation of the data.
  2. The system should enable the creation and centralization of policy-based rules governing data classes.
  3. The system should monitor and maintain a consistent view of storage assets and provide a clear understanding of where data assets are positioned on storage infrastructure at any given time.
  4. The system should be capable of establishing, directly or indirectly, the routing of data assets through infrastructure so that data is exposed to (or excluded from) “storage services” for data protection, reduction, access security, encryption, etc. Ideally, it should also enable the routing of data to hosting platforms that accommodate necessary usage characteristics associated with the data in terms of accessibility and performance.
  5. The system should leverage existing infrastructure management capabilities where appropriate, including pre-defined or domain-based access controls whether organized by user or server. Active Directory in the Microsoft environment is an example.
  6. The system should be file system agnostic to the greatest possible extent, supporting data management regardless of the operating system or file system environment.
  7. The system should not compromise the integrity of the data files themselves. Processes used to inventory files and to manage their movement should in no way truncate the file header, which would make it impossible to re-scan the collection of files should the file management system itself become compromised. (Some products strip the metadata header from the file and place only the file payload in a repository. The metadata header is stored in the data management system’s database, and the payload cannot be recovered if that database is corrupted. This type of “stubbing” should be avoided because it puts managed data at risk.)
  8. The system itself should have security features to prevent unauthorized access to or revision of data management policies. Ideally, it should also be designed for use by different data managers, including governance risk and compliance (GRC) managers, business department heads, IT management, or other groups who might have legitimate roles in defining and policing data management policy.
  9. The system should be flexible in its support for both centralized archive and decentralized or federated data management.
  10. As a practical matter, the system should be transparent to users and applications. Repeated case studies have shown that approaches that require end users to classify their own data tend not to be sustainable. Moreover, it should be as automated as possible, enabling policies to be applied quickly to an existing corpus of data and extended to new data rapidly using existing policy templates. A number of approaches have been suggested for implementing data management. Automatic classification based on “deep blue math” algorithms remains the holy grail, as does the wholesale replacement of the conventional file system with a database or other organizing metaphor. Today, however, data classification based on user role shows the most promise as a methodology for applying classification in a way that satisfies the transparency and automation goals of effective data management. (A minimal sketch of role-based classification appears after this list.)
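
To illustrate criteria 1, 2, and 10 together, here is a minimal sketch of role-based classification applied through central policy templates. The roles, data classes, and retention values are hypothetical and are not drawn from any Alliance specification or shipping product.

    # Minimal sketch of role-based data classification driven by central policy
    # templates, per criteria 1, 2, and 10 above. The roles, data classes, and
    # retention values are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Policy:
        data_class: str
        retention_years: int
        archive_after_days: int

    # Criterion 2: centralized, policy-based rules, keyed here by business role.
    POLICY_TEMPLATES = {
        "finance":     Policy("financial_record", retention_years=7, archive_after_days=365),
        "engineering": Policy("design_document",  retention_years=5, archive_after_days=180),
        "default":     Policy("general",          retention_years=1, archive_after_days=90),
    }

    def classify_at_creation(owner_role: str) -> Policy:
        """Criteria 1 and 10: classify data at inception from the creator's role,
        so the end user never has to tag files by hand."""
        return POLICY_TEMPLATES.get(owner_role, POLICY_TEMPLATES["default"])

    if __name__ == "__main__":
        print(classify_at_creation("finance"))   # -> Policy(data_class='financial_record', ...)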

Sometimes simple tools can make all the difference. We wish the Active Archive Alliance good luck.

Your comments, as always, are welcome: [email protected].
