Proving Grounds: Securing Test Data in Regulatory Environments

In many companies, developers use live data in poorly secured test environments, unmindful of the fallout if that data leaks. Why should your compliance guard be relaxed when developers use test data to design the systems that store and dole out access to such sensitive information? Here are five ways to manage test data in regulated environments.

Do your developers test with real data?

Consider the perils of using real data in test environments. For example, in August, AOL released to the general public 658,000 individuals' spring 2006 search queries, for what it characterized as academic research purposes. Not surprisingly, "search terms revealed medical conditions, illegal activities, illicit interests, financial information, even social security numbers," notes the Privacy Rights Clearinghouse. And even though AOL had replaced people's names with numbers, the actual search terms made tracing many people's identities easy, as multiple newspaper articles demonstrated. Public outrage resulted and AOL removed the data—labeling its release a "screw-up"—yet copies of the data can still be downloaded from the Web.

Or take last year's CardSystems debacle, when the company retained credit card information on 40 million people—in violation of the credit card industry's Payment Card Industry Data Security Standard—for what it characterized as internal research purposes. After attackers accessed the stored information, disclosure of the breach brought regulatory sanctions, and ultimately CardSystems' demise.

The moral: storing sensitive data for testing or research purposes is no safer than storing it for any other reason. Breaches perpetrated by dishonest insiders are especially common, and any poorly secured data is at risk and may also violate privacy regulations. Indeed, numerous privacy regulations (including HIPAA and Gramm-Leach-Bliley in the United States, and laws in Australia, Canada, the European Union, Japan, and the United Kingdom) require companies to restrict access to people's personally identifiable information on a need-to-know basis, and to protect such information regardless of where it's stored.

Yet when studying how sensitive data is stored and used, many organizations overlook their development and testing environments. "Testers, developers, and quality assurance employees—as well as partners, offshore development and support personnel, and consultants who are participating in the testing or QA process—often use actual data to test software applications," notes analyst Rikki Kirzner in a white paper from Hurwitz & Associates.

Compounding that security risk, many development environments are also relatively poorly secured. "Too many organizations believe that since the data is only being used internally, it isn't necessary to adhere to the procedures that govern its use in production," she says. As a result, "application quality assurance and testing is currently one of the areas most vulnerable to data theft, fraud, or unauthorized copying and replication."

The solution to this problem: don't allow developers to use real data. Instead, give them data good enough for development, but sufficiently obscured or faked to no longer be a security threat. With that in mind, here are five ways to manage data for development and testing purposes:

1. Decide Who Needs to Know

Restricting data access on a need-to-know basis is easier said than done. Companies are still trying to decipher privacy regulations at home and abroad and to determine "who can have access to the data, what kind of data can they see and work with, and how are we ensuring sensitive information that resides in those databases is not accessible by unauthorized internal or external parties," notes Brian Babineau, an analyst with Enterprise Strategy Group.

Regardless of what a particular regulation mandates, "the hardest part is securing information from internal resources," he says. "What happens when I'm upgrading my Oracle financials application and I want to test that with real data? I can't let my IT guys see the quarterly results before I report them to Wall Street, because they could buy stock, or short stock," all of which would invite US Securities and Exchange Commission sanctions and shareholder lawsuits.

Simply put, then, all real data used for test purposes is a risk. "The test environment is no different than the production environment, and if you want to manage risk proactively, you have to pay attention to production data," notes Moungi Slim, product manager for Compuware's File and Data Management suite. So, determine who can access sensitive information, and restrict access accordingly.

2. Centralize Control of Data Flows

Developers and testers simply have no need to access or use real data, since fake data, properly constructed, can stand in for the real thing. Accordingly, organizations must "de-identify" data before allowing it to move into development, testing, or QA environments.

Available data-obfuscation techniques include such things as "scrambling, aging, concealing sensitive values, replacement of the original data with meaningful readable data via translation tables, generating fictitious data," and numerous other options, says Kirzner.
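
The techniques Kirzner lists can be illustrated with a few small functions. This is a minimal sketch in Python, not any vendor's implementation; the function names, the CITY_TABLE translation table, and the sample values are all hypothetical.

```python
import random
from datetime import date, timedelta

# Hypothetical translation table: real values map to readable stand-ins.
CITY_TABLE = {"Boston": "Springfield", "Chicago": "Riverton"}

def scramble(value: str, rng: random.Random) -> str:
    """Scrambling: shuffle the characters of a value."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)

def age_date(d: date, days: int) -> date:
    """Aging: shift a date by a fixed number of days."""
    return d + timedelta(days=days)

def conceal(value: str, keep_last: int = 4, mask: str = "X") -> str:
    """Concealing: mask everything except the last keep_last characters."""
    hidden = max(len(value) - keep_last, 0)
    return mask * hidden + value[hidden:]

def translate(value: str, table: dict, default: str = "Unknown") -> str:
    """Translation table: swap a real value for a meaningful substitute."""
    return table.get(value, default)

def fake_phone(rng: random.Random) -> str:
    """Fictitious data: build a number in the 555-01xx range reserved for fiction."""
    return f"555-01{rng.randrange(100):02d}"

rng = random.Random(42)  # seeded only so repeated runs give the same output
print(scramble("Kirkland", rng))
print(age_date(date(2006, 8, 1), 30))
print(conceal("123-45-6789"))
print(translate("Boston", CITY_TABLE))
print(fake_phone(rng))
```

Which technique fits depends on what the application under test actually needs from the field: a readable value, a valid format, or only a placeholder.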

Often the best technique for ensuring developers get the data they need, and that the quality is sufficient, is to "identify a security administrator whose main responsibility is to enforce, monitor, and manage data privacy," says Slim. Ensure this "test-data czar" has both executive-level backing, so other employees take seriously the need to keep private data private, and the tools necessary to obfuscate data before releasing it to developers.

3. Find Sensitive Information

Which stored information is sensitive? When Enterprise Strategy Group surveyed multiple companies—including airlines, credit card companies, and banks—about their confidential information, "over half the companies we talked to said at least 50 percent of the information that they create on a regular basis was confidential information," says Babineau. "That's a pretty exorbitant number, when you think about it."

Confidential information might include intellectual property, extremely business-sensitive information, or pieces of data that are harmless on their own but become personally identifiable when combined. "Many people think confidential information only has to do with external parties, but when you run payroll for your company, there are social security numbers, bank account, routing numbers, employee addresses, phone numbers," says Babineau. "If you don't secure that stuff, you have a problem."

Accordingly, companies must determine every place such sensitive information is stored, and also trace every application or process, including the test bed, that accesses or moves such data, both internally and with business partners. "Everything must be accounted for," says Slim.

4. Enforce a Data Obfuscation Plan

Next, determine how to obfuscate data for each application or process relying on sensitive information. Different applications will have different requirements, so each needs its own data-obfuscation plan. Consistency also counts, says Slim. "If you disguise it in one place you have to disguise it in other places to maintain the integrity of data."
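
One common way to get the consistency Slim describes is to derive each disguised value deterministically from the original, so the same input always yields the same stand-in wherever it appears. The sketch below uses a keyed hash (HMAC) from Python's standard library. It is an illustration of one possible approach, not a method the article's sources prescribe; SECRET_KEY, the pseudonym function, and the sample tables are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key: in practice it would be held by whoever administers
# test data, never by the developers who receive the masked output.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonym(value: str, prefix: str = "CUST") -> str:
    """Map a real identifier to a stable stand-in using a keyed hash."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:12]}"

# Two hypothetical tables that share a customer identifier.
customers = [{"customer_id": "A1001", "city": "Boston"}]
orders = [{"order_id": 1, "customer_id": "A1001"}]

masked_customers = [{**row, "customer_id": pseudonym(row["customer_id"])} for row in customers]
masked_orders = [{**row, "customer_id": pseudonym(row["customer_id"])} for row in orders]

# The disguised identifier matches in both tables, so joins still work.
print(masked_customers[0]["customer_id"] == masked_orders[0]["customer_id"])
```

Because the mapping depends on the key, the same rule can be reapplied on every refresh without storing a lookup table of real values.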

Exactly how data should be disguised depends on the application using the data. For example, one application may just store credit card numbers, meaning they can be left encrypted until needed. Meanwhile, another application may need to see the actual third digit of a credit card number for verification purposes. In that case, perhaps that one digit can be left unencrypted, and fake digits substituted for all the others. Likewise, an application may run correctly even when every first and last name reads "John Doe," which makes it easier to generate fake names and begin obscuring identities.
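
The credit card and "John Doe" examples above might look like the following in Python. This is a rough sketch under stated assumptions: mask_card_number and mask_name are hypothetical helpers, the record layout is invented, and the card number used is a widely published test number rather than a real account.

```python
import random

def mask_card_number(card: str, keep_index: int = 2, rng=None) -> str:
    """Replace every digit with a random one, except the digit at keep_index.

    keep_index counts digits only (0-based), so the default of 2 keeps the
    third digit. Spaces and dashes are preserved so the format still looks right.
    """
    if rng is None:
        rng = random.Random()
    out = []
    digit_pos = 0
    for ch in card:
        if ch.isdigit():
            out.append(ch if digit_pos == keep_index else str(rng.randrange(10)))
            digit_pos += 1
        else:
            out.append(ch)
    return "".join(out)

def mask_name(record: dict) -> dict:
    """Return a copy of the record with every name set to John Doe."""
    masked = dict(record)
    masked["first_name"] = "John"
    masked["last_name"] = "Doe"
    return masked

record = {"first_name": "Ada", "last_name": "Smith", "card": "4111 1111 1111 1111"}
masked = mask_name(record)
masked["card"] = mask_card_number(record["card"])
print(masked)
```

Note that random digits will usually fail a checksum test, so an application that validates card numbers would need a generator that produces valid-looking numbers instead.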

Data obfuscation can be enforced in two ways: with dedicated test-data-obfuscation tools, or with database archiving products, many of which can create database archives containing subsets of information, with specific types of columns or fields transformed or masked. "A lot of the archiving solutions today offer the ability to do masking, so if it sees a specific format (zip code, or social security or credit card number) it can scramble things. You may scramble all the zip codes and addresses for specific users," says Babineau. "That way you have all the right information, it's just not organized in the correct way."
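
The format-based scrambling Babineau describes can be approximated in a few lines. The sketch below is not how any particular archiving product works; the PATTERNS table, function names, and sample rows are assumptions made for illustration. It detects columns whose values all match a known format and shuffles those values across rows.

```python
import random
import re

# Hypothetical format detectors for U.S.-style values.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "zip": re.compile(r"^\d{5}(-\d{4})?$"),
    "card": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def detect_format(values):
    """Return the name of the first format every non-empty value matches, else None."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for name, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in non_empty):
            return name
    return None

def shuffle_sensitive_columns(rows, rng):
    """Shuffle the values of each recognized column across rows."""
    if not rows:
        return []
    masked = [dict(row) for row in rows]
    for column in rows[0]:
        values = [row[column] for row in rows]
        if detect_format(values):
            rng.shuffle(values)
            for row, value in zip(masked, values):
                row[column] = value
    return masked

rows = [
    {"name": "Ann", "zip": "02139", "ssn": "000-00-0001"},
    {"name": "Bob", "zip": "60601", "ssn": "000-00-0002"},
    {"name": "Cy", "zip": "94105", "ssn": "000-00-0003"},
]
print(shuffle_sensitive_columns(rows, random.Random()))
```

Shuffling detaches values from their owners but leaves the real values in the data set, so fields that are sensitive on their own may call for substitution instead.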

Vendors offering such technology, says Babineau, include Applimation, CA, Compuware, HP's Outerbay and Mercury Interactive, Princeton Softech, and Solix Technologies.

5. Automate Test Data Generation

For optimum security, companies must completely remove developers from the data-disguise equation. "You don't want to give them the option of do you disguise the data or not, and if I'm a developer, you don't want to have to set up disguise rules; you just want to grab the data," says Jim Wyne, Compuware's director of field technical support. The ideal, then, is to create automated processes that give anyone who needs access to test data—developers, testers, or QA personnel—the precise obfuscated data they need to do their job.

A test-data czar can generate test data by hand every time it's needed, but for security and compliance purposes, the more automated the controls, the better. "The more of those processes we can put in place that automate that step or that task, the more you're going to meet the regulations or the requirements, the more auditable the data is, and the more safeguarding you're doing to ensure the data is protected," notes Wyne.
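
As a rough illustration of what such automation might look like, the Python sketch below applies a fixed set of disguise rules to every row and records each run in an audit log, so developers only ever receive the masked output. The RULES table, the provision_test_data function, and the column names are hypothetical and stand in for whatever a real tool would configure.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], str]

# Hypothetical rules, defined once by the test-data administrator.
RULES: Dict[str, Rule] = {
    "ssn": lambda value: "000-00-0000",
    "last_name": lambda value: "Doe",
    "salary": lambda value: "0",
}

def provision_test_data(rows: List[dict], rules: Dict[str, Rule], audit_log: List[str]) -> List[dict]:
    """Apply every rule to every row and note the run in the audit log."""
    masked_rows = []
    for row in rows:
        masked = dict(row)
        for column, rule in rules.items():
            if column in masked:
                masked[column] = rule(masked[column])
        masked_rows.append(masked)
    audit_log.append(f"masked {len(masked_rows)} rows; columns: {sorted(rules)}")
    return masked_rows

audit_log: List[str] = []
production_rows = [
    {"last_name": "Smith", "ssn": "000-00-0001", "salary": "52000", "dept": "QA"},
]
test_rows = provision_test_data(production_rows, RULES, audit_log)
print(test_rows)
print(audit_log)
```

The design point is that the person requesting data never chooses whether to disguise it: the rules run on every request, and the log gives auditors a record of what was masked.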

Using automated tools "to create these rules and then refresh and reuse them" also saves time, says Slim, since it ensures consistent, high-quality test data, "without having to reinvent the reality every time."