Q&A: New Approaches for Tackling Big Data Security Issues

Data management expert David Loshin and security expert Wasim Ahmad discuss security threats to big data, and how encryption can protect data as it moves about the enterprise. Second in a two-part series on security.

"Yesterday's data breaches were all about stolen laptops and customer data somehow leaving the enterprise," but today's serious security incidents are very different, says security expert Wasim Ahmad of Voltage Security. "Today's breaches are malware getting inside your network, with access to all the pathways your data travels, then sniffing that out and passing it back to organized crime. We're talking about much more serious network down time, and a much more sophisticated attack." In this interview, the second of two parts on data security, Ahmad and data management expert David Loshin discuss big data threats, and using encryption effectively to protect data in transit.

A consultant and thought leader in BI, data quality, and master data management, Loshin is president of Knowledge Integrity, Inc. He is the author of numerous articles and books on data management, including the best-selling Master Data Management; his most recent book is Practitioner's Guide to Data Quality Improvement. Loshin is a frequent speaker at conferences and other events; he discussed data security issues in a TDWI Webinar on July 20 with Wasim Ahmad, Data Protection and Security: Considerations, Compliance, and Best Practices.

Ahmad is vice president of marketing for Voltage Security; he has over 19 years of experience in enterprise software, application development, and business intelligence, including management positions at CA, Sterling Software, and Synon.

BI This Week: David, what about the security risks of big data in particular? Are there unique concerns there?

David Loshin: I had an interesting conversation a few weeks back about whether data in the cloud is better protected than data that's sitting in your own systems. We drilled down into that in the context of big data, such as that managed by Hadoop and those types of frameworks. We discussed the fact that the people who are developing Hadoop or MapReduce applications are developers. They presume access to the data, and the data is sitting out on a collection of nodes in some massively parallel configuration. Presumably, that's data in an uncontrolled environment, just as we discussed [in part one of this interview], and there is even greater opportunity for exposure.

Essentially, the data needs to be moved over to the framework, be it Hadoop or whatever. It's then exposed as the analysis is being done, and then the results are integrated back in, or reconnected back to, for example, some traditional data warehouse or business intelligence framework.

When you're looking at large collections of data, there's the potential once again for lots of data being exposed. On the other hand, if you're creating a controlled environment where there is no other means of getting access, one might say that there might be an opportunity for increasing the protection, if you're instituting your development framework in the right way.

Wasim Ahmad: Many of our customers, certainly our large enterprise customers, have large datasets and large data stores -- big data, if you will. These get breached in the same way as anything else, really. No one is cutting out thousands of terabytes of data, obviously, because someone would notice. However, the techniques of looking out for a pattern that matches a credit card number or a Social Security number -- they still apply even if the dataset is really large, so we're still seeing that issue of datasets being targeted across the board regardless of size.
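Editor's illustration: the pattern matching Ahmad describes can be sketched in a few lines of Python. This is only a rough sketch of the general technique, not a description of any particular attack tool or product. The names CARD_PATTERN, luhn_valid, and find_card_numbers are hypothetical, and the assumption that a card number is 13 to 19 digits, optionally separated by spaces or hyphens, is ours. The same scan is also how a defender might locate unprotected card numbers in a data store, and its cost grows only linearly with the size of the text.

```python
import re

# Assumption: candidate card numbers are 13 to 19 digits,
# optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return digit strings in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The Luhn check filters out most digit runs that merely resemble card numbers, such as order or tracking numbers, which is why a simple scan stays effective on very large datasets.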

We have a global telecommunications company as a client that has many, many databases, some with details about almost everyone in the U.S. Really, the only way to protect it is to make sure that the sensitive data elements -- people's addresses, dates of birth, Social Security numbers, and bank account details or credit card numbers -- are, in fact, encrypted. That ensures that even if that dataset is being transported, manipulated, or sliced in some way, there's no point at which, if someone came in and grabbed that data, they would get anything that could be matched back to a consumer's identity. Regardless of the size of the data, the same challenges apply.

In the future, as we look at more and more systems that collect consumer information, whether it's in Hadoop-type systems or others, those are going to become very nice targets for organized crime. In the next 10 years, that's the scary side of social networking. With everyone connected online in the cloud, there is all this data that is potentially exposed.

Wasim, you've used the phrase "focusing on the data rather than trying to lock people out." That seems to bring us to encryption technologies. When we talk about encryption as a solution, where should it happen within the structure of an enterprise, especially a data warehousing system?

Wasim Ahmad: If you ask TDWI members whether encryption is taking place in their organizations, I'm sure the answer is yes. However, not all encryption is equal. Not all the places in which you encrypt things guarantee that everything stays protected no matter what. That's really the heart of the issue. What architecture are you using within your systems? That spans not just your data centers now but data that might be held in a cloud somewhere, whether your cloud or someone else's. Then, with the proliferation of mobile devices, and with more and more employees interacting with data through those devices, including your customers and partners, your data is in a lot of places outside the warehouse. Protecting it is a big challenge.

In short, we believe that you should encrypt things as soon as you have some sort of control over them -- that is, once they come into your environment. Maybe someone swiped a credit card, filled out a Web form, or had something scanned in by fax. The data should remain encrypted at all times until it literally needs to be used. In many cases, with certain types of data, you need it, but you don't need to have it in the clear, available for everyone to look at -- that's completely unnecessary. In fact, it turns out that in various business processes, there is a limited set of data that actually needs to be available in the clear. Figuring that out, and which data that is, is pretty important.

We also believe that there are ways of encrypting data by encrypting the database. All of the leading database [vendors] have database encryption built in, or there are solutions out there that can encrypt just data on certain servers. The trouble with these solutions is that everything is nicely protected when you're in the database and you're on the server, but the minute you move it, the data is exposed.

So, you might have data that is stored inside containers -- in a database or a server. The data is protected, but the minute you start using it via an application that accesses it, the data is decrypted and in the clear. On the way to the application, it might also be in the clear. In fact, you really don't know unless you've analyzed it. That's what malware takes advantage of -- the fact that you're moving data around, and during that time, it's vulnerable.

To address that, in addition to protecting the containers, you should also protect the data itself. That way, even if you're moving it, it travels with a little force field of protection around it. To do that, you need new types of encryption. While a lot of the encryption we rely on day-to-day has been around for many years, there have been innovations in encryption that now allow you to do things like encrypting without changing the schema. That means you can protect information and still be able to use that information, process that information, and analyze that information without having to change any of your systems.

What about encryption's impact on performance?

Wasim Ahmad: With some of the newer types of encryption such as format-preserving encryption, you can encrypt things, keep the same format, but also keep the same semantics. That means that things like checksums will still validate. Or, you can apply policies that, say, leave the first four digits of the credit card number in the clear, or the last four digits. Then you can still perform analysis on what type of credit card was used on certain types of transactions in this period of time. You can still make some of the elements that are necessary for analysis available, either in the clear or by the fact that each individual's data elements will always encrypt to the same protected data elements. All of your relationships, your data integrity -- all of those kinds of things stay intact, so you can still identify patterns in the data as you need to.
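Editor's illustration: the properties Ahmad lists (same format, a checksum that still validates, some leading digits left in the clear, and the same input always mapping to the same protected value) can be demonstrated with a small Python sketch. Note the loud caveat: this is a keyed, one-way surrogate built with HMAC, not reversible format-preserving encryption. It cannot be decrypted, and it is not the algorithm used by any product discussed here. The function names, the clear_prefix parameter, and the demo key are all hypothetical.

```python
import hashlib
import hmac


def luhn_check_digit(body: str) -> str:
    """Return the digit that makes body + digit pass the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(body)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def protect_card(card: str, key: bytes, clear_prefix: int = 4) -> str:
    """Replace a card number with a same-length, Luhn-valid, keyed surrogate.

    Illustration only: the surrogate is one-way (no decryption), and taking
    each byte modulo 10 gives a slightly non-uniform digit distribution.
    """
    prefix = card[:clear_prefix]              # digits left in the clear by policy
    n_hidden = len(card) - clear_prefix - 1   # last slot holds the new check digit
    digest = hmac.new(key, card.encode(), hashlib.sha256).digest()
    hidden = "".join(str(b % 10) for b in digest[:n_hidden])
    body = prefix + hidden
    return body + luhn_check_digit(body)


if __name__ == "__main__":
    # Well-known test card number and a made-up key, for demonstration only.
    print(protect_card("4111111111111111", b"demo-key"))
```

Because the same card and key always yield the same surrogate, joins and group-by analysis on the protected column still work, and the clear prefix still supports analysis by card type. A real deployment that needs to recover the original value would use a vetted, reversible format-preserving scheme instead.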

The first step in a data-encryption strategy is figuring out data flows -- where it is, where it's going, and which processes use it. That might include applications running in your data center, applications crunching data in the cloud, or human processes such as customer service representatives accessing a customer profile.

The second step is to really look at the right mechanism to protect your data under all those circumstances. It might be done in a batch operation overnight, you may need to do it in real time, or it may be something that happens as the data comes in, in which case it's stored and you encrypt it at that point, then it remains that way until it's needed.

Figuring all that out will help you in considering the mechanism you want to use to perform encryption and decryption.

Is that where concepts such as key management come into play?

Wasim Ahmad: Yes. Key management is all about which keys you use to lock and unlock the data, whom you need to make them available to, and when.

In older encryption technologies, these keys are actually bits of digital information that you need to possess. They need to get to you, you need to store them, and they expire. Sometimes users move up in the chain of what they are authorized to access, so they need new keys. You have many pieces of digital information floating around, intimately tied with how you're protecting the data. That can become very cumbersome and a big operational overhead, so just understanding what your needs are is important.

There are newer key management technologies that compute keys as you need them -- you don't need to worry about storing them. Your users don't need to worry about having a certificate or those kinds of things. That smooths out their experience and makes it much more natural. It cuts down on the operational side.
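Editor's illustration: the interview does not say which scheme "compute keys as you need them" refers to, so the following Python sketch shows only the general idea of stateless key derivation. A key is recomputed on demand from one protected master secret plus a description of who is asking and why, so per-user keys never have to be stored or distributed. The function name derive_key, its parameters, and the sample values are hypothetical. A production system would more likely use a standardized key derivation function such as HKDF and keep the master secret in dedicated hardware or a key server.

```python
import hashlib
import hmac


def derive_key(master_secret: bytes, identity: str, purpose: str) -> bytes:
    """Compute a 32-byte key for (identity, purpose) from the master secret."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") produce
    # different inputs and therefore different keys.
    context = b"".join(
        len(part).to_bytes(2, "big") + part
        for part in (identity.encode(), purpose.encode())
    )
    return hmac.new(master_secret, context, hashlib.sha256).digest()


if __name__ == "__main__":
    master = b"replace-with-a-real-secret"  # placeholder value for the demo
    key = derive_key(master, "analyst@example.com", "customer-ssn")
    print(len(key))  # 32 bytes, recomputed identically on every call
```

Nothing per-user is stored here: granting access means deciding whether to run the derivation for a given identity and purpose, which is where the operational savings Ahmad mentions would come from.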

Wasim, in a TDWI Webinar earlier this year with David on these issues, you talked about "lowering the economic value of target data." Can you talk about what that means and how it's done?

Wasim Ahmad: The phrase refers to the fact that organized crime is a business, and organized criminals are going after your data because they know that they can monetize it. We know this to be true because we've seen an increase in the types of attacks that affect high-net-worth individuals -- those kinds of things. If you strike at the heart of that economic value by making the data they're after useless to them, they can't make any money off of it. Therefore, they're simply going to go and target somewhere where they can -- which is unfortunate, but from your company's perspective, it means that you have successfully reduced the risks.

How do you do that? Basically, you need to turn the gold that is your sensitive data or your consumer data into straw. One way to do that is to encrypt it, so that even when the hacker has gotten in and grabbed the data through some kind of SQL injection attack or other means, they literally cannot do anything with it because it is scrambled, it is encrypted. That's really the heart of this concept of data-centric protection.

What sort of advice can each of you offer to companies that are just coming to terms with the fact that they need to focus more on data security?

Wasim Ahmad: The first step is acknowledging that your traditional security approaches, even if they were implemented by the security team, are not sufficient. That's not to say that they're wrong, they're bad, or they're not needed. It's just that they're not sufficient to deal with today's threats. Not all data breaches are equal. Yesterday's threats were all about stolen laptops, datasets copied from an exposed laptop, and customer data somehow leaving the enterprise.

Today's breaches aren't like that. Today's breaches are malware getting inside your network and having access to all the pathways that your data travels on, then sniffing that out and passing it back to organized crime. We're talking about much more serious network down time, and a much more sophisticated attack.

Also, you need to realize that your data is no longer just in your data center -- it's in many different places, including clouds and mobile devices, so the approach of having a well-fortified perimeter doesn't really work. You need to look at what the boundary is. The minimal boundary is the data itself. In order to create a data-protection strategy, you need to look at where the data is, where it flows, and which processes it impacts, then look to see how much of it you can encrypt. Ideally, you would encrypt all of it for as long as possible, until it's needed. That will give you the best possible protection.

Also, you need to implement workflow documentation, including working out who actually needs to use the data. That might be business processes or people. Then look at the regulatory and compliance issues that apply. Generally, you'll find that to protect data properly, you're going to go way beyond any specific regulation or compliance best practice.

David, your expertise is data management. What can you add in terms of tightening data security, in particular for TDWI members?

David Loshin: When you're putting together your development plan for a data warehouse, you have to take into account the characteristics of performance and usability of the data, and evaluate your security alternatives in the context of how they impact your performance and usability criteria.

You have to balance data utility or data usability with whatever solution you choose. When you apply your security technique, you don't want to wash out the factors that make the data warehouse usable or valuable. If there are ways to preserve your hierarchies or your relationships, that would be a better alternative than one that has variant mapping characteristics for something like encryption, or applying significant constraints along the way for validation and authentication of individuals who are looking at the data.