Five Ways to Streamline Your Data Classification Project

This approach can help you achieve data classification results dramatically faster than conventional methods.

by Raphael Reich

Management of unstructured data on shared file systems, NAS devices, SharePoint sites, and similar locations is challenging for most organizations. All of the spreadsheets, presentations, documents, and multimedia files stored on those systems account for roughly 80 percent of business data, according to analyst firm IDC (see Note 1). This shared data is highly dynamic: new data is constantly added, driving an average growth rate of 57 percent per year (see Note 2), with some organizations doubling their unstructured data volume each year.

The data's relevance is also constantly in flux. Users may need access to data now but not in a few months when their project finishes or when the data itself becomes outdated and stale.

Organizations can quickly become overwhelmed managing and protecting this large, changing pool of data. As a result, more organizations are initiating data classification projects in the hopes of identifying their most sensitive data, remediating any problems, and implementing proper controls. Unfortunately, several challenges prevent data classification deployments from reaching their full potential.

From a business perspective, a lack of actionable results is the primary challenge. Data classification solutions produce a list of files with sensitive content, but what the files mean to the business (including what to do with them) is not inherently obvious. From a technical perspective, the challenge is that data classification solutions scan every file looking for relevant content and are consequently slow to deliver results. Even on subsequent searches, these solutions must look at all files again, making it virtually impossible to keep pace with data growth and change.

The five steps below outline an approach for achieving data classification results dramatically faster than conventional methods.

Step 1: Identify Data Owners

Data owners are at the heart of the process when it comes to managing unstructured data. Because they understand the business importance of data assets, they are critical to creating policies that make business sense. They can help determine who should and should not have access, what type of protection the data should have, and when the data is no longer relevant to the business. When it comes to sensitive data, owners can decide whether it is at risk and what remediation steps are required.

However, this is no easy task because the locations of data and the names of folders, directories, and sites often provide few clues to true data ownership, and file system metadata about data ownership goes stale quickly. Phone calls and e-mail messages -- the most common methods for identifying data owners -- are not efficient or sufficiently reliable to constitute viable processes.

The best way to track data owners is an automated, repeatable process. One of the most effective methods is to track who is actually accessing the data: over time, the top users of each data set emerge, and those users can either confirm ownership or point the organization to the true owner.
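As a minimal sketch of this idea, the snippet below ranks the most frequent users of each folder from access records. It assumes access events are already available as (user, folder) pairs, for example extracted from file-server audit logs; the event format and data are illustrative, not a specific product's API.

```python
from collections import Counter

def likely_owners(access_events, top_n=3):
    """Rank the most frequent users of each folder. The top users are
    good candidates to confirm ownership or name the true owner."""
    counts = {}  # folder -> Counter mapping user -> access count
    for user, folder in access_events:
        counts.setdefault(folder, Counter())[user] += 1
    return {folder: [user for user, _ in c.most_common(top_n)]
            for folder, c in counts.items()}

# Illustrative audit-log records: (user, folder)
events = [("alice", "/finance"), ("alice", "/finance"),
          ("bob", "/finance"), ("carol", "/hr")]
print(likely_owners(events))
```

Run periodically, a process like this keeps ownership candidates current even as file-system ownership metadata goes stale.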

Step 2: Define Data of Interest

Once data owners are identified, organizations should work with them, as well as security and risk managers, to identify the key words, phrases, and patterns that are of business interest. To do so successfully requires investigative work and an understanding of what’s driving the need to find data. In many organizations, regulatory compliance is a driver. Regulations often specify which data is sensitive and what measures are required to protect it. Other common types of information requiring special attention are intellectual property, customer data, and employee information.

As part of defining what’s of interest, it’s helpful to establish different levels of sensitivity based on the type of content your organization needs to manage and protect. Industry best practices suggest constraining the hierarchy to four levels; with more, classification becomes difficult and impractical to manage. To begin, your four levels might be defined as “secret data,” “confidential data,” “private data,” and “public data.”
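A four-level hierarchy of this kind can be expressed as an ordered set of content patterns. The sketch below is illustrative only: the level names follow the article, but the patterns (a hypothetical project codename and a simplified US SSN-like pattern) are placeholder examples, not production-grade detectors.

```python
import re

# Ordered from most to least sensitive; "public" is the default.
SENSITIVITY_PATTERNS = {
    "secret":       [re.compile(r"\bproject\s+zeus\b", re.I)],  # hypothetical codename
    "confidential": [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")],     # simplified SSN-like pattern
    "private":      [re.compile(r"\bsalary\b", re.I)],
}

def classify(text):
    """Return the highest matching sensitivity level, or 'public'."""
    for level in ("secret", "confidential", "private"):
        if any(p.search(text) for p in SENSITIVITY_PATTERNS[level]):
            return level
    return "public"
```

Checking levels from most to least sensitive ensures a file that matches several patterns is assigned its highest classification.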

Step 3: Use Metadata to Focus and Accelerate Your Project

Metadata is data about an organization’s data, such as file sizes, types, and locations. It can be used to focus and accelerate the data classification process by serving as another element of the search: it yields a short list of file locations and an indication of what to expect at each one. For example, finding poorly protected files and then looking inside them for credit card data is a fast way to identify credit card data that is at risk, and permissions metadata lets organizations do just that. Any sensitive data found in overly accessible files has a clear remediation path: restrict access permissions to least privilege (i.e., business need-to-know).

The following are examples of metadata and how it can be used to focus and accelerate data classification:

  • Access permissions: A careful analysis of permissions will tell organizations who can access their sensitive data and which data is overly accessible.
  • Access activity: Data access activity provides important information such as which folders are the most frequently used and which folders are not being used at all. It also indicates which data was recently added or modified. That intelligence is tremendously useful, for example, in reducing the time spent searching. After the initial classification scan is complete, subsequent searches can be restricted to just the data that must be classified (i.e., data that has not yet been searched). For specific users or groups, organizations can determine what data they have been accessing to see who has actually been using the sensitive data.
  • Ownership: Ownership information helps limit searches to data owned by specific people. If organizations are working with individuals to help them get control of their sensitive data, this piece of metadata will narrow sensitive data searches to just the relevant data.
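The metadata-driven shortlisting described above can be sketched as a simple filter applied before the expensive content scan. The record fields here (permission breadth, modification status) are assumptions about what a file-system crawl or audit log would supply, not a specific product's schema.

```python
from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    world_readable: bool       # from permissions metadata
    modified_since_scan: bool  # from access-activity metadata

def scan_candidates(files):
    """Shortlist files before the content scan: prioritize overly
    accessible data, and skip anything unchanged since the last scan."""
    return [f.path for f in files
            if f.world_readable and f.modified_since_scan]

# Illustrative metadata records
inventory = [FileMeta("/finance/q3.xlsx", True, True),
             FileMeta("/hr/reviews.docx", False, True),
             FileMeta("/finance/old.xlsx", True, False)]
print(scan_candidates(inventory))
```

Content scanning then runs only over this shortlist instead of the entire data store, which is what keeps the project's pace ahead of data growth.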

Step 4: Reporting and Remediation

Generating results is obviously an important part of classification projects, but it’s not the final stage. After obtaining results, organizations need to put them into the hands of decision makers -- typically data owners and governance/risk/compliance teams -- so those people can understand the situation and begin formulating remediation strategies and plans.

Data owners are important stakeholders for results because they are typically in the best position to identify exactly what the content is, whether the data is stored in the right place, and who should and should not have access to it. They can also help build a remediation strategy and process, especially once they are armed with specific examples involving their own data. Governance, risk, and compliance staff can provide the oversight needed to ensure data is protected in accordance with the organization’s objectives. These teams can use result reports as the basis of documentation for audit requirements.
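One way to make results actionable, as described above, is to group findings by data owner so each stakeholder receives only the results for data they are responsible for. The finding fields below are illustrative assumptions about what a classification scan would report.

```python
from collections import defaultdict

def reports_by_owner(findings):
    """Group classification findings by data owner, so each owner
    sees only the files and sensitivity levels they must act on."""
    per_owner = defaultdict(list)
    for f in findings:  # each finding: {'owner', 'path', 'level'}
        per_owner[f["owner"]].append((f["path"], f["level"]))
    return dict(per_owner)

# Illustrative scan output
findings = [{"owner": "alice", "path": "/finance/q3.xlsx", "level": "confidential"},
            {"owner": "alice", "path": "/finance/plan.docx", "level": "secret"},
            {"owner": "carol", "path": "/hr/reviews.docx", "level": "private"}]
print(reports_by_owner(findings))
```

The same grouped output can double as the audit documentation that governance, risk, and compliance teams need.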

Step 5: Data Rescanning

Data is constantly growing and changing, hence the need to periodically rescan it so that organizations maintain an accurate view of their sensitive data. Ideally, organizations should limit searches to newly added data, to determine whether it contains sensitive information, and to existing data that has been modified, to determine whether it has gained or lost relevance to the classification project. Organizations should provide data owners and governance, risk, and compliance staff with updated intelligence after each rescan.
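An incremental rescan of this kind can be sketched with standard file-system metadata: walk the tree and keep only files whose modification time is newer than the last completed scan. This assumes modification timestamps are reliable; a production process would also track deletions and use the audit-activity metadata discussed in Step 3.

```python
import os

def files_to_rescan(root, last_scan_time):
    """Return only files added or modified since the last scan,
    so the content scan skips everything already classified."""
    stale = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_scan_time:
                stale.append(path)
    return stale
```

Feeding only this set to the classifier is what keeps rescan times proportional to the rate of change rather than to the total data volume.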


Searching for sensitive data in unstructured data stores requires data classification solutions: this data is simply too voluminous and dynamic to process and manage manually. A solution that allows organizations to implement the five steps outlined above -- especially incorporating metadata into the process -- is critical for achieving actionable results.

Without metadata, data classification projects can take far too long, and the results they produce typically won’t have the context required to remediate problems. Using metadata dramatically cuts the time needed to produce results and provides the context required for problem remediation.


Note 1: IDC, "The Expanding Digital Universe," 2007.

Note 2: Gartner, "Predicts 2006: Storage Technology Evolves Along With Demand."

Raphael Reich is the senior director of marketing for Varonis Systems.
