Columns
        
        What's in a Name?
Because we lack a good data-naming methodology, we're placing mission-critical data, and ultimately entire business operations, at risk.
        
        
Since the earliest days of recorded history, we've tried to control the unknown forces that affect our lives by assigning names to them. Many religious liturgies have their roots in this practice: Ancient Egyptians sought to appease the forces of nature that made the Nile River overflow its banks, flood crops, and destroy homes by reciting or singing a lengthy list of grandiose names for the river god. Their hope was that one of the names would flatter the unknown deity, causing floodwaters to recede.
      
In our culture today, we've tended to replace superstition with empirical, scientific precepts. Engineering teaches us that it's less important to understand the nature of a thing than to discern enough about its behavior to harness and control it.
However, when it comes to enterprise data storage, we may need to rethink this bias. To cope with burgeoning data and its management, a naming system is exactly where we need to begin: not to appease some unseen "god of storage," but to enable truly effective storage management.
What we need is a way to name data, to classify it and encode it with certain key descriptive traits at the point where it's first created. If we had that, then the subsequent control of data (its management, its security, and its optimal provisioning across storage platforms and networks) would be a much more approachable task.
Today, of course, data isn't named or categorized as it's created. The file systems of most server operating systems usually "time stamp" data, perhaps adding a bit of metadata describing its gross accessibility parameters. However, there's no consistent "header" or other mechanism used by all operating systems to associate data (1) with the specific business process that it supports, or (2) with the applications that either generate or use the data. Moreover, there's no "thumb-printing" method used consistently by all operating systems to ensure non-repudiation (that data remains valid and unchanged except by its owner).
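To make the missing "header" concrete, here is a minimal sketch, in Python, of what a hypothetical point-of-creation header might look like. The field names and structure are my illustrative assumptions, not a facility of any existing operating system:

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class DataHeader:
    """Hypothetical point-of-creation header making a data object self-describing."""
    source_application: str  # (1) application that generated or uses the data
    business_process: str    # (2) business process the data supports
    created_on: float        # POSIX timestamp, as file systems already record
    fingerprint: str         # digest of the payload, for non-repudiation

def create_named_object(payload: bytes, app: str, process: str) -> dict:
    """Attach a self-describing header to raw data at the point of creation."""
    header = DataHeader(
        source_application=app,
        business_process=process,
        created_on=time.time(),
        fingerprint=hashlib.sha256(payload).hexdigest(),
    )
    return {"header": asdict(header), "payload": payload}

obj = create_named_object(b"Q3 ledger entries", "general-ledger", "financial-reporting")
print(json.dumps(obj["header"], indent=2))
```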
Given that, we're forced to manage mission-critical data using the clumsiest of methods and tools. Rather than managing storage based on the intrinsic requirements of the data itself, we instead struggle to manage the equipment used to store data (for example, disk arrays) and the network pipes used to move or copy it from one place to another. That is Storage Resource Management (SRM) in a nutshell, and it explains why SRM is a source of chronic pain for most storage administrators.
Why Is All Data Treated the Same?
Think of it this way: Because data is not "self-describing," there's no easy way to optimize its storage on any contemporary storage platform. For example, there's no way to provision the data belonging to a multimedia application with the physical storage arrangement that would best facilitate its use. Optimally, to provide efficient "jitter-free" playback of multimedia content, such data would be written to the outermost tracks of each disk in an array or SAN, that is, to the longest contiguous stretch of storage space. But because the data isn't self-describing, there's no way to automatically provision physical storage to meet optimal data service requirements. There's simply no way to discern what type of storage (in terms of RAID level or physical layout on media) various types of data require. Data is simply data. It's all treated the same.
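If data carried even a simple class name, provisioning could become a table lookup. Here's a minimal sketch in Python; the class names and policy values are illustrative assumptions, not the behavior of any shipping product:

```python
# Policy-driven provisioning that self-describing data would permit.
STORAGE_POLICIES = {
    "multimedia": {"raid_level": 0, "layout": "outer-tracks-contiguous"},
    "transactional": {"raid_level": 10, "layout": "striped-mirrored"},
    "archival": {"raid_level": 6, "layout": "any"},
}

def provision(data_class: str) -> dict:
    """Pick a physical storage arrangement from the data's declared class."""
    # Without a name, every object falls through to the same default.
    return STORAGE_POLICIES.get(data_class, {"raid_level": 5, "layout": "any"})

print(provision("multimedia"))  # optimized for jitter-free playback
print(provision("unnamed"))     # today's reality: data is simply data
```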
Similarly, in the absence of well-described data, it's impossible to optimize data-movement schemes. Current data backup (making copies of mission-critical data to an alternate location) and hierarchical storage management (migrating data from more expensive to less expensive storage media over time) are cases in point. Without an elegant system of data description, we tend to use brute force to segregate critical data from non-critical data when defining a backup scheme. Because data isn't named with its originating application or its business-process association, this task is burdensome, time-consuming, and often subject to error. In general, we back up a lot more data than we probably need to in order to assure business continuity.
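Under a naming scheme, that segregation could become a simple filter rather than an error-prone manual sort. A minimal sketch, assuming hypothetical business-process tags:

```python
# Process names and the criticality table are illustrative assumptions.
CRITICAL_PROCESSES = {"order-fulfillment", "financial-reporting"}

catalog = [
    {"file": "orders.db", "business_process": "order-fulfillment"},
    {"file": "menus.doc", "business_process": "cafeteria-planning"},
]

backup_set = [d["file"] for d in catalog
              if d["business_process"] in CRITICAL_PROCESSES]
print(backup_set)  # only data tied to critical processes gets copied
```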
Furthermore, without assigning a fingerprint to data at its time of creation, information security becomes a kludge of encryption keys, virtual networks, and administrative headaches, none of which are particularly daunting to an earnest hacker. If data were provided some sort of inviolable checksum or Message Digest in its header to guarantee that its contents didn't change except as a result of valid and appropriate application work, securing the data would be considerably easier.
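A sketch of the idea, using SHA-256 as a stand-in for whatever digest a real scheme would adopt:

```python
import hashlib

def fingerprint(payload: bytes) -> str:
    """Digest recorded in the header when the data is created or validly updated."""
    return hashlib.sha256(payload).hexdigest()

def is_unchanged(payload: bytes, recorded_digest: str) -> bool:
    """True only if the content matches what its owner last wrote."""
    return fingerprint(payload) == recorded_digest

record = b"patient file 1138"
stamp = fingerprint(record)
print(is_unchanged(record, stamp))         # True: content intact
print(is_unchanged(record + b"x", stamp))  # False: tampering or corruption
```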
      Consequences of Inconsistent Data Naming
You can see problems caused by the lack of common data-naming schemes daily in the reduced performance and cost-efficiency of all storage platforms. From a strategic point of view, the lack of an effective data-naming methodology places mission-critical data, and ultimately entire business operations, at risk.
It should come as no surprise that a growing number of industry visionaries have begun to view the absence of a coherent data-naming scheme as one of the most important problems confronting IT this century. I've had conversations about this with leading information security analysts such as Robin Bloor of Bloor Research in Austin, Texas and Bletchley, UK; storage industry luminaries including STK's Randy Chalfant, Horison Information Strategies CEO Fred Moore, and Avamar Technologies CEO Kevin Daly; and notable "out-of-the-box" technology thinkers such as father-of-the-VAX-turned-independent-consultant Richard Lary.
In each case, our discussions of IT challenges lead to the same conclusion: Data naming is a prerequisite to taking data-storage technology in particular, and IT in general, to the next level. It's more important than issues of SAN plumbing, tape versus disk mirroring, network convergence, or any of the other "holy wars" in today's trade press. Perhaps that's because data-naming schemes seem to be the stuff of "geek meets": debate fodder for gatherings of engineers and propeller heads laboring in obscure and rarified technology associations. They aren't, well, sexy.
      Change in the Works
With the release of EMC's Centera platform, however, this may be changing. Earlier this year, EMC responded to the burgeoning requirements of customers confronting new regulatory mandates for long-term data storage in the healthcare and financial industries by announcing Centera. The Hopkinton, Mass.-based company would sell inexpensive storage boxes built on commodity drives, but augmented with a specialized data-naming scheme. EMC had correctly perceived that much of the data floating around corporate information systems is relatively static and unchanging. Overlaying this data with a content-addressing scheme would enable it to be tracked through its migration from one platform to another over time, a boon for anyone who's struggled to find a single piece of data in a growing haystack of storage.
In effect, EMC had seized upon new Health Insurance Portability and Accountability Act (HIPAA) and Securities and Exchange Commission (SEC) record-keeping rules, which require that certain types of data be maintained for a long time and in a manner that assures their accessibility, integrity, and non-repudiation, to create a new product offering. While Centera provides only a minimum of the functionality that a full-fledged "point-of-creation" data-naming strategy needs to deliver, and a proprietary scheme at that, it was (and is) a watershed in the movement of data naming into the realm of practical business solutioneering.
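Centera's actual scheme is proprietary, but the general idea of content addressing can be sketched in a few lines of Python: the address is a digest of the content itself, so it remains valid wherever the data migrates. The names here are assumptions for illustration, not Centera's API:

```python
import hashlib

# The object's address is derived from its content, so the same identifier
# keeps working as the data moves from platform to platform.
store: dict = {}

def put(payload: bytes) -> str:
    address = hashlib.sha256(payload).hexdigest()
    store[address] = payload  # the digest, not a file path, locates the data
    return address

addr = put(b"2001 annual report, final")
assert store[addr] == b"2001 annual report, final"
print(addr[:16] + "...")  # stable address, wherever the bits physically live
```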
      What Goes into Naming Data?
We need to answer two key questions to arrive at a non-proprietary data-naming methodology. First, what attributes are required to make data self-describing in the first place? Second, how do we best imprint data with the taxonomy produced by answering the first question?
To discern what kind of information would make data truly self-describing, we can look to disaster recovery planning for guidance. In preparing for the possibility of a disaster, we first need to go through an intensive process of data classification so we can identify which data assets need to be accorded priority during a restore process.
We try our best to determine what data belongs to what application and what application belongs to what business process. That lets us determine whether the data inherits any critical aspects from the application or from the business process it supports. Only in that way can we allocate scarce resources cost-effectively, provisioning recoverability and security resources first to the data sets that support our most critical business operations. Without such an analysis, we're forced to try to recover everything at the same time: an increasingly expensive and trouble-prone approach. (For my suggestions on a data-attribute-naming scheme, see "Five Proposed Rules for Naming Data Attributes" below.)
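As a sketch of that inheritance (the process names and priority values are illustrative assumptions):

```python
# Data inherits its recovery priority from the business process it supports.
PROCESS_PRIORITY = {"order-fulfillment": 1, "payroll": 2, "facilities": 3}

def restore_priority(business_process: str) -> int:
    """Lower numbers restore first; unclassified data restores last."""
    return PROCESS_PRIORITY.get(business_process, 99)

datasets = ["facilities", "order-fulfillment", "payroll", "unknown"]
print(sorted(datasets, key=restore_priority))
```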
A data-naming taxonomy is only a beginning, of course. I encourage you to contribute your own suggestions about what descriptors would be useful in a data-naming methodology. If you provide an original and worthwhile contribution, in fact, I'll send you a free copy of my next book, The Holy Grail of Network Storage Management (Prentice Hall PTR). In that book, I'm dedicating a chapter to describing a data-naming taxonomy that vendors and users can freely use to design their own data-naming schemes. E-mail your suggestions to me at [email protected].
        
      
Five Proposed Rules for Naming Data Attributes

Using the data-classification effort described here as guidance for a data-naming taxonomy, useful attributes in a naming scheme would include:

1. A reference to the application and operating system used in data creation
We need to know how the data was created and what semantics have been imposed on it that will determine our ability to read it correctly. This is especially important with long-term data storage, because applications tend to come and go, and operating systems change and evolve over time, potentially affecting our ability to read the data at a later date.

2. A reference to the business process supported by the data
Data "inherits" its criticality to an organization based on the criticality of the business process that uses it (or the regulatory mandate that it fulfills), so it would be useful to have some mechanism for associating data with the business process it supports.

3. A "useful life" indication
We tend to hold on to electronic data for a much longer period of time than we need to. Old, infrequently accessed, or stale data needs to be purged from online storage once it exceeds its useful life or, in hierarchical storage management (HSM) schemes, migrated to offline or near-line storage. A data-naming scheme should provide a "created on" date and a "discard by" date to imbue the data with a useful-life parameter.

4. Error-checking
As data moves from platform to platform over its useful life, we need to ensure that its contents aren't changed, whether intentionally by hackers or unintentionally by compression algorithms, random bit flipping, or other noise. This involves more than a simple checksum operation or byte count. The movement of data from one media type (say, a hard disk) to another (say, a CD-ROM) will change the size of a file by virtue of file system semantics. Thus, the mechanism used to track data integrity needs to do more than a simple comparison of file lengths to ensure data validity and non-repudiability.

5. Content addressing
We need to be able to track and find data over time as it migrates across platforms. For greater efficiency, the data-naming scheme should provide data with a unique identifier that remains consistent even as data changes location. This would solve a problem with many "application-centric" SRM tools today: coping with infrastructure changes. Today, migrating from one platform to another typically carries with it the tedious step of reconstructing the carefully built tables that associate applications and data. A content-addressing scheme would solve this problem.
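Pulling the five rules together, here is a hypothetical sketch, in Python, of a record carrying all five attributes. Every field name is my assumption, not a standard; it's offered only to show how little machinery the taxonomy actually requires:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class NamedData:
    """Hypothetical record carrying the five proposed attributes."""
    source_application: str  # rule 1: application used in data creation
    source_os: str           # rule 1: operating system used in data creation
    business_process: str    # rule 2: business-process association
    created_on: str          # rule 3: start of useful life
    discard_by: str          # rule 3: end of useful life
    digest: str              # rule 4: integrity check beyond a byte count
    content_address: str     # rule 5: location-independent identifier

def name_data(payload: bytes, app: str, os_name: str, process: str,
              created_on: str, discard_by: str) -> NamedData:
    digest = hashlib.sha256(payload).hexdigest()
    # Here the content address doubles as the digest; a real scheme might
    # separate the two so the address survives valid updates.
    return NamedData(app, os_name, process, created_on,
                     discard_by, digest, digest)

record = name_data(b"claim #42", "claims-app", "unix", "claims-processing",
                   "2002-11-01", "2009-11-01")
print(record)
```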