Reducing the Risk from Data that Lives Forever

These three rules of thumb will help you avoid the risk of keeping corporate data forever (and ever).

By Jim McGann

We all know that when users create a document, it is saved on their hard drive -- which becomes the document's primary home. When it's attached to an outgoing e-mail message, the e-mail server becomes the document's second home. When the IT department takes over, our hypothetical document is replicated many times over. E-mail servers have replication components, which make a failsafe copy, and, of course, the user's desktop and the e-mail server are backed up for disaster recovery purposes -- producing yet another copy.

All these processes create additional resting places for the same document and e-mail. Ultimately, it is not unusual to have dozens of copies of the same document and e-mail within a company.

Many of us still think that when you delete a document from a hard drive, it is gone forever. As the recent News of the World scandal showed us, getting rid of the data from its primary position or even destroying a user's PC will not make the document disappear. Deleting an e-mail message, shredding a hard drive, or even decommissioning an entire e-mail server does not purge corporate data.

All that redundant data can be lethal to a corporation if it is not handled correctly and disposed of according to corporate policy. The News of the World tried to rid itself of the data source along with news editor Ian Edmondson and his PC. The company naively thought that recycling the PC and deleting the data would eliminate the evidence, but that was not the case.

Several years ago, simply deleting data would effectively eliminate evidence: documents that had moved beyond the readily accessible PC hard drive could not easily be retrieved, so courts would accept the argument that the data could not be produced for litigation. It was easy to find a lawyer or judge who was baffled by technology and didn't know one could delve deep into corporate IT networks and infrastructures.

Times have changed. Judges and lawyers are savvier about technology, and expert witnesses are called to expose the truth and uncover all the spots where data hides. Corporations, too, are learning that they must become serious about implementing data retention policies and must proactively manage data. It's up to the IT and legal departments to create corporate policies that limit the online and offline locations where data is held. Companies can't assume the keep-everything mentality is the safest bet.

What Can IT Do?

There are three main guidelines to consider when managing data:

Know what you have: Many organizations still don't have a clear understanding of the content their users generate. All critical content is stored on corporate networks and is backed up for safekeeping. Because of the volume of corporate data and the complexity of the infrastructure, it is difficult to understand what exists. Users can consistently create content that becomes the smoking gun of a future lawsuit. Managing this content according to a corporate governance policy becomes very challenging if you don't understand what exists and where it is.

Know where the data is: Data is created by users on their desktops and in e-mails. It's then backed up on the networks and offline for disaster recovery purposes. The backups can be located on tapes or disks stored in outside facilities far away from the corporate offices. The bulk of legacy data is contained on these backup tapes. Most of that data is redundant, and much of it is irrelevant and does not need to be stored.

Don't save everything: Saving all content can be dangerous to the company and its employees. It is estimated that 95 percent of all data created is irrelevant for long-term archiving and can be disposed of rather than stored. Only a small volume of this data (typically less than 1 percent) is critical to litigation or contains valuable intellectual property. However, finding this data is like looking for the proverbial needle in a haystack, and it is a very big haystack.
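The first two guidelines can be illustrated with a minimal sketch. It assumes the content of each copy can be read and hashed; the sample locations and file contents below are hypothetical, and this is not how any particular product builds its index. Grouping copies by a content hash shows both what exists and every place it lives.

```python
import hashlib
from collections import defaultdict


def build_inventory(items):
    """Group (location, content) pairs by a SHA-256 hash of the content."""
    inventory = defaultdict(list)
    for location, content in items:
        digest = hashlib.sha256(content).hexdigest()
        inventory[digest].append(location)
    return dict(inventory)


def redundant_copies(inventory):
    """Keep only the documents that are held in more than one location."""
    return {d: locs for d, locs in inventory.items() if len(locs) > 1}


# Hypothetical sample: one report saved, e-mailed, and backed up, plus one note.
sample = [
    ("desktop:/home/user/report.doc", b"Q3 report"),
    ("mail-server:outbox/msg-1", b"Q3 report"),
    ("backup-tape-0042:/home/user/report.doc", b"Q3 report"),
    ("desktop:/home/user/notes.txt", b"lunch at noon"),
]

inventory = build_inventory(sample)
duplicates = redundant_copies(inventory)
for digest, locations in duplicates.items():
    print(digest[:12], len(locations), "copies:", locations)
```

An exact-hash match only catches byte-identical copies; the same document re-saved in another format would need a different comparison.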

Data Doesn't Have to Live Forever

Once we accept that we have a data hoarding issue and understand we don't need to save everything, we can move forward. We must (1) spend time creating a data retention policy and (2) know where all the data resides and have access to it all. That means being able to search across the entire enterprise, in every area where data is stored.

Legal and IT departments need to work hand in hand to create data retention policies. Once policies are in place, all legacy data needs to be evaluated. New technology has made this process much easier, especially when there could be thousands or hundreds of thousands of legacy backup tapes stored. If any of the data goes unexamined, especially the legacy data, the potential for future litigation issues remains.
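As a rough sketch of what evaluating data against a retention policy might look like, the example below assigns each record a disposition. The categories, retention periods, and the legal-hold flag are all hypothetical placeholders; real values have to come from the legal and IT departments.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    path: str
    category: str
    created: date
    legal_hold: bool = False


# Hypothetical retention periods, in years. Not legal guidance.
RETENTION_YEARS = {"contract": 7, "email": 3, "system_log": 1}
DEFAULT_RETENTION_YEARS = 3


def disposition(record, today):
    """Return "retain" or "dispose" for one record under the policy above."""
    if record.legal_hold:
        return "retain"  # a hold overrides any retention period
    years = RETENTION_YEARS.get(record.category, DEFAULT_RETENTION_YEARS)
    try:
        expiry = record.created.replace(year=record.created.year + years)
    except ValueError:
        # Created on Feb 29 and the expiry year is not a leap year.
        expiry = record.created.replace(year=record.created.year + years, day=28)
    return "retain" if today < expiry else "dispose"


today = date(2012, 1, 1)
old_email = Record("tape-17/mail/msg-9", "email", date(2007, 5, 1))
contract = Record("share/legal/msa.doc", "contract", date(2008, 3, 15))
held_email = Record("tape-17/mail/msg-10", "email", date(2007, 5, 1), legal_hold=True)

for rec in (old_email, contract, held_email):
    print(rec.path, "->", disposition(rec, today))
```

Checking the legal hold first is a deliberate choice in this sketch: data subject to litigation should never be disposed of just because its retention period has lapsed.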

In the last couple of years, new technology has become available to tackle this problem quickly, easily, and affordably. Today, technology can scan tapes and then search and extract specific files and e-mail without the original backup software. This lets you deal only with the relevant files (less than 1 percent of the tape content) and not the bulk of useless content. Direct indexing technology has made tape remediation an achievable project. In significantly less time, an IT department can process tapes in house, find what the legal staff needs, archive it, and make it available when it is needed. This efficient, cost-effective process allows IT departments to recapture tape storage budgets while supporting the legal department with the data it needs.

To minimize data retention without sticker-shock technology costs, consider a solution that is:

Easy to use: The solution must transparently integrate into existing networks and infrastructures and be easy to deploy and manage.

Efficient: It must handle full content and metadata indexing on all types of user data -- from network data to backup tapes. Using one platform that provides an information knowledge layer across all data sources is critical; simply attacking data on desktops misses the legacy content on backup tapes. A single platform that searches and deduplicates content across all sources, and that automates finding the relevant content among the useless system and log files, makes a complex task manageable. Processing speeds should be no slower than 1 TB/hour/node.

Cost effective: With complete knowledge of the data, including legacy content archived on backup tapes, legal teams can now apply policy and remediate a large percentage of historical data. Typically this process will recoup legacy data storage costs and simplify the management of corporate data centers.

In addition, the solution must perform simple or complex queries to quickly and easily find the important data. It must be able to automate queries, monitor data sources, and rapidly deliver sensitive content when it's needed in its native format.
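To see what the 1 TB/hour/node floor mentioned above means in practice, here is a back-of-the-envelope estimate. The tape count, per-tape capacity, and node count are hypothetical figures chosen only for illustration, and the calculation assumes throughput scales linearly with the number of nodes.

```python
def processing_hours(total_tb, nodes, tb_per_hour_per_node=1.0):
    """Estimate wall-clock hours, assuming throughput scales linearly with nodes."""
    if nodes < 1 or tb_per_hour_per_node <= 0:
        raise ValueError("nodes must be >= 1 and throughput must be positive")
    return total_tb / (nodes * tb_per_hour_per_node)


# Hypothetical legacy archive: 10,000 tapes holding 400 GB each.
tapes = 10_000
gb_per_tape = 400
total_tb = tapes * gb_per_tape / 1000

hours = processing_hours(total_tb, nodes=4)
print(f"{total_tb:.0f} TB on 4 nodes: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Real projects also spend time on tape loading and handling, so a figure like this is a lower bound rather than a schedule.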


Data used to be forever, but it doesn't have to be anymore. The number of hiding places where data lives can now be minimized, and only critical data should be kept, according to corporate governance policies. The rules of thumb are: know what you have, know where it is, and don't save everything. Creating corporate data retention policies for legacy data as well as current data will cut exposure to liability and risk, reduce storage costs, and create peace of mind. In an age of ever-growing data, tight budgets, even tighter compliance regulations, and a more litigious society, remediating data is more advantageous than holding on to everything.

Jim McGann is vice president of information discovery at Index Engines, a tape discovery and remediation company based in Holmdel, NJ. McGann is a frequent speaker on electronic discovery and has authored multiple articles for legal technology and information management publications. You can contact the author at
