What's in a Name?

Because we lack a good data-naming methodology, we're placing mission-critical data—and ultimately entire business operations—at risk.

Since the earliest days of recorded history, we've tried to control the unknown forces that affect our lives by assigning names to them. Many religious liturgies have their roots in this practice: Ancient Egyptians sought to appease the forces of nature that made the Nile River overflow its banks, flood crops and destroy homes by reciting or singing a lengthy list of grandiose names for the river god. Their hope was that one of the names would flatter the unknown deity, causing floodwaters to recede.

In our culture today, we've tended to replace superstition with empirical, scientific precepts. Engineering teaches us that it's less important to understand the nature of a thing than to discern enough about its behavior to harness and control it.

However, when it comes to enterprise data storage, we may need to rethink this bias. To cope with burgeoning data and its management, a naming system is exactly where we need to begin—not to appease some unseen "god of storage," but to enable truly effective storage management.

What we need is a way to name data, to classify it and encode it with certain key descriptive traits at the point where it was first created. If we had that, then the subsequent control of data—its management, its security, and its optimal provisioning across storage platforms and across networks—would be a much more approachable task.

Today, of course, data isn't named or categorized as it's created. The file systems of most server operating systems usually "time stamp" data, perhaps adding a bit of metadata describing its gross accessibility parameters. However, there's no consistent "header" or other mechanism used by all operating systems to associate data (1) with the specific business process that it supports, or (2) with the applications that either generate or use the data. Moreover, there's no "thumb-printing" method used consistently by all operating systems to ensure non-repudiation (that data remains valid and unchanged except by its owner).
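As a purely hypothetical illustration (no operating system provides anything like this today, and every field name below is invented for the sketch), a self-describing header of the kind envisioned here might carry the business process, the originating application, a time stamp, and a content fingerprint for non-repudiation:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class DataHeader:
    """Hypothetical self-describing header, attached at data creation.

    Field names are illustrative only; no OS implements this scheme.
    """
    business_process: str   # the business process the data supports
    application: str        # the application that generated the data
    created_on: str         # creation time stamp (ISO 8601)
    content_digest: str     # "thumb-print" supporting non-repudiation

def make_header(business_process: str, application: str,
                created_on: str, content: bytes) -> DataHeader:
    # The digest fingerprints the content so later tampering is detectable.
    return DataHeader(business_process, application, created_on,
                      hashlib.sha256(content).hexdigest())

header = make_header("claims-processing", "claims-app-v2",
                     "2002-06-01T09:30:00Z", b"policy record 12345")
print(json.dumps(asdict(header), indent=2))
```

Anything downstream—a backup tool, a provisioning engine, a security monitor—could then read these attributes instead of guessing at the data's requirements.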

Given that, we're forced to manage mission-critical data using the clumsiest of methods and tools. Rather than managing storage based on the intrinsic requirements of the data itself, we instead struggle to manage the equipment that is used to store data (for example, disk arrays) and the network pipes used to move it or copy it from one place to another. That is Storage Resource Management (SRM) in a nutshell, and it explains why SRM is a source of chronic pain for most storage administrators.

Why is All Data Treated the Same?
Think of it this way: Because data is not "self-describing," there's no easy way to optimize its storage on any contemporary storage platform. For example, there's no way to provision the data belonging to a multimedia application with the physical storage arrangement that would best facilitate its use. Optimally, to provide efficient "jitter-free" playback of multimedia content, such data would be written to the outermost tracks of each disk in an array or SAN—that is, to the longest contiguous amount of storage space. But, because the data isn't self-describing, there's no way to automatically provision physical storage to meet optimal data service requirements. There's simply no way to discern what type of storage (in terms of RAID level or physical layout on media) various types of data require. Data is simply data. It's all treated the same.

Similarly, in the absence of well-described data, it's impossible to optimize data movement schemes. Current data backup (making copies of mission-critical data to an alternate location) and hierarchical storage management (migrating data from more expensive to less-expensive storage media over time) are cases in point. Without an elegant system of data description, we tend to use brute force to segregate critical data from non-critical data when defining a backup scheme. Because data isn't named with its originating application or its business process association, this task is burdensome, time-consuming and often subject to error. In general, we back up a lot more data than we probably have to in order to assure business continuity.

Furthermore, without assigning a fingerprint to data at time of creation, information security becomes a kludge of encryption keys, virtual networks, and administrative headaches—none of which are particularly daunting to an earnest hacker. If data were provided some sort of inviolable checksum or Message Digest in its header to guarantee that its contents didn't change except as a result of valid and appropriate application work, securing the data would be considerably easier.
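The message-digest idea can be sketched in a few lines. Assuming SHA-256 as the digest algorithm (the article doesn't prescribe one), a digest recorded at creation makes any later modification of the content detectable:

```python
import hashlib

def digest(content: bytes) -> str:
    # A cryptographic digest changes if even one bit of the content changes.
    return hashlib.sha256(content).hexdigest()

original = b"quarterly revenue: $4.2M"
recorded = digest(original)           # stored in the data's header at creation

# Later, recomputing the digest reveals any unauthorized modification:
tampered = b"quarterly revenue: $9.2M"
print(digest(original) == recorded)   # True:  content unchanged
print(digest(tampered) == recorded)   # False: content was altered
```

Note that the digest alone proves only that content changed, not who changed it; full non-repudiation would also require binding the digest to the owner, for example with a digital signature.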

Consequences of Inconsistent Data Naming
You can see problems caused by the lack of common data-naming schemes daily in the reduced performance and cost-efficiency of all storage platforms. From a strategic point of view, the lack of an effective data-naming methodology places mission-critical data—and ultimately entire business operations—at risk.

It should come as no surprise that a growing number of industry visionaries have begun to view the absence of a coherent data-naming scheme as one of the most important problems confronting IT this century. I've had conversations about this with leading information security analysts such as Robin Bloor of Bloor Research in Austin, Texas and Bletchley, UK; storage industry luminaries including STK's Randy Chalfant, Horison Information Strategies CEO Fred Moore, and Avamar Technologies CEO Kevin Daly; and notable "out-of-the-box" technology thinkers such as father-of-the-VAX-turned-independent-consultant Richard Lary.

In each case, our discussions of IT challenges led to the same conclusion: Data naming is a prerequisite to taking data-storage technology in particular—and IT in general—to the next level. It's more important than issues of SAN plumbing, tape versus disk mirroring, network convergence, or any of the other "holy wars" in today's trade press. Yet the topic draws little attention, perhaps because data-naming schemes seem to be the stuff of "geek meets"—debate fodder for gatherings of engineers and propeller heads laboring in obscure and rarified technology associations. They aren't, well, sexy.

Change in the Works
With the release of EMC's Centera platform, however, this may be changing. Earlier this year, EMC announced Centera in response to the burgeoning requirements of customers confronting new regulatory mandates for long-term data storage in the healthcare and financial industries. The Hopkinton, Mass.-based company would sell inexpensive storage boxes built on commodity drives, but augmented with a specialized data-naming scheme. EMC had correctly perceived that much of the data floating around corporate information systems is relatively static and unchanging. Overlaying this data with a content-addressing scheme would enable it to be tracked through its migration from one platform to another over time—a boon for anyone who's struggled to find a single piece of data in a growing haystack of storage.

In effect, EMC had seized upon new Health Insurance Portability and Accountability Act (HIPAA) and Securities and Exchange Commission (SEC) record-keeping rules—which require that certain types of data be maintained for long periods in a manner that assures their accessibility, integrity and non-repudiation—to create a new product offering. While Centera provides only a minimum of the functionality required of a full-fledged "point-of-creation" data-naming strategy, and a proprietary scheme at that, it was (and is) a watershed in the movement of data naming into the realm of practical business solutioneering.

What Goes into Naming Data?
We need to answer two key questions to arrive at a non-proprietary data-naming methodology. First, what attributes are required to make data self-describing? Second, how do we best imprint data with the taxonomy produced by answering the first question?

To discern what kind of information about data would be useful to make data truly self-describing, we can look to disaster recovery planning for guidance. In preparing for the possibility of a disaster, we first need to go through an intensive process of data classification so we can identify which data assets need to be accorded priority during a restore process.

We try our best to determine what data belongs to what application and what application belongs to what business process. That lets us then determine whether the data inherits any critical aspects from the application or from the business process it supports. Only in that way can we allocate scarce resources cost-effectively, provisioning recoverability and security resources first to those data sets that support our most critical business operations. Without such an analysis, we're forced to try to recover everything at the same time, an increasingly expensive and trouble-prone approach. (For my suggestions on a data-attribute-naming scheme, see "Five Proposed Rules for Naming Data Attributes.")
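The inheritance chain just described—data belongs to an application, the application belongs to a business process, and the data inherits the process's criticality—can be sketched with hypothetical names (none of these data sets, applications, or processes come from a real environment):

```python
# Hypothetical mappings of the kind a disaster recovery analysis produces.
APP_OF_DATASET = {"orders.db": "order-entry-app", "memos.doc": "word-processor"}
PROCESS_OF_APP = {"order-entry-app": "order-fulfillment",
                  "word-processor": "back-office"}
PROCESS_CRITICALITY = {"order-fulfillment": "critical", "back-office": "low"}

def criticality(dataset: str) -> str:
    """Data inherits criticality from the business process it supports."""
    app = APP_OF_DATASET[dataset]
    process = PROCESS_OF_APP[app]
    return PROCESS_CRITICALITY[process]

# During a restore, critical data sets come first.
restore_order = sorted(APP_OF_DATASET,
                       key=lambda d: criticality(d) != "critical")
print(restore_order)  # ['orders.db', 'memos.doc']
```

If these associations were carried in the data's own name rather than rebuilt by hand, the classification step of disaster recovery planning would largely disappear.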

A data-naming taxonomy is only a beginning, of course. I encourage you to contribute your own suggestions about what descriptors would be useful in a data-naming methodology. If you provide an original and worthwhile contribution, in fact, I'll send you a free copy of my next book, The Holy Grail of Network Storage Management (Prentice Hall PTR). In that book, I'm dedicating a chapter to describing a data-naming taxonomy that vendors and users can freely use to design their own data-naming schemes. E-mail your suggestions to me at jtoigo@intnet.net.

Five Proposed Rules for Naming Data Attributes

Using the data-classification effort described here as guidance for a data-naming taxonomy, useful attributes in a naming scheme would include:

1. A reference to the application and operating system used in data creation
We need to know how the data was created and what semantics have been imposed on it, since both determine our ability to read it correctly. This is especially important for long-term data storage, because applications tend to come and go, and operating systems change and evolve over time, potentially affecting our ability to read the data at a later date.

2. A reference to the business process supported by the data
Data "inherits" its criticality to an organization based on the criticality of the business process that uses it (or the regulatory mandate that it fulfills), so it would be useful to have some mechanism for associating data with the business process it supports.

3. A "useful life" indication
We tend to hold on to electronic data for a much longer period of time than we need to. Old, infrequently accessed, or stale data needs to be purged from online storage once it exceeds its useful life, or, in hierarchical storage management (HSM) schemes, needs to be migrated to offline or near-line storage. A data-naming scheme should provide a "created on" date and a "discard by" date to imbue the data with a useful life parameter.
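A "useful life" check of this kind is trivial once the dates exist in the data's name; the sketch below assumes the two date fields proposed above (the record itself is invented for illustration):

```python
from datetime import date

def past_useful_life(discard_by: date, today: date) -> bool:
    # Data whose "discard by" date has passed is a candidate for purging,
    # or, under an HSM scheme, for migration to near-line or offline storage.
    return today > discard_by

record = {"name": "q1-report.doc",
          "created_on": date(2002, 1, 15),   # "created on" attribute
          "discard_by": date(2002, 7, 15)}   # "discard by" attribute

print(past_useful_life(record["discard_by"], date(2002, 9, 1)))  # True
```

The hard part isn't the check—it's getting every platform to stamp and honor the same two dates.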

4. Error-checking
As data moves from platform to platform over its useful life, we need to ensure that its contents aren't changed—whether intentionally by hackers or unintentionally by compression algorithms, random bit flipping or other noise. This involves more than a simple checksum operation or byte count. The movement of data from one media type (say, a hard disk) to another (say, a CD-ROM) will change the size of a file by virtue of file system semantics. Thus, the mechanism used to track data integrity needs to do more than a simple comparison of file lengths to ensure data validity and non-repudiability.
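The point that byte counts are an unreliable integrity test can be shown concretely. In this simplified sketch, the "padding" imposed by a second medium's file system is simulated with trailing null bytes, and a digest of the logical content catches what a length comparison misses:

```python
import hashlib

def content_digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

# The same logical content on two media whose file system semantics
# store it differently (block sizes, padding, metadata).
on_disk  = b"ledger entry 42"
on_cdrom = b"ledger entry 42" + b"\x00" * 17   # simulated padded copy

# Byte counts differ even though nothing meaningful changed...
print(len(on_disk) == len(on_cdrom))           # False

# ...so validity checks must digest the logical content, after
# stripping media-imposed padding (greatly simplified here).
print(content_digest(on_cdrom.rstrip(b"\x00")) == content_digest(on_disk))  # True
```

A real mechanism would need to define precisely which bytes constitute the logical content on each platform, which is exactly the kind of convention a common naming scheme would have to standardize.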

5. Content addressing
We need to be able to track and find data over time as it migrates across platforms. For greater efficiency, the data-naming scheme should provide data with a unique identifier that remains consistent even as data changes location. This would solve a problem with many "application-centric" SRM tools today: coping with infrastructure changes. Today, migrating from one platform to another over time typically carries with it a tedious step of reconstructing the carefully built tables that associate applications and data. A content-addressing scheme would solve this problem.
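To make the idea concrete, here is a toy content-addressed store. It is a sketch of the general technique only—Centera's actual scheme is proprietary and not described here—using a SHA-256 digest of the content as the location-independent identifier:

```python
import hashlib

class ContentAddressedStore:
    """Toy content-addressed store: objects are retrieved by a digest-based
    identifier that never changes, no matter where the bytes physically live."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content
        return address  # stable identifier: record it once, use it forever

    def get(self, address: str) -> bytes:
        return self._objects[address]

store = ContentAddressedStore()
addr = store.put(b"static archival record")
# Applications keep only `addr`; migrating the bytes to another platform
# doesn't invalidate it, so application-to-data tables needn't be rebuilt.
print(store.get(addr) == b"static archival record")  # True
```

Because the address derives from the content itself, a platform migration changes only the mapping from address to physical location—the identifier the applications hold never goes stale.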