How to Secure Data in the Cloud and Beyond

Next-generation tokenization has the potential to help businesses protect sensitive data in the cloud in a more efficient and scalable manner than encryption and first-generation tokenization.

By Ulf Mattsson, CTO, Protegrity

What do McDonald's, Gawker and Microsoft all have in common? All three enterprises suffered from data breaches in late 2010 that compromised sensitive personal customer data. All of these cases were slightly different: McDonald's data was outsourced to a third-party provider which was compromised; Gawker was guilty of poor security practices including a lack of password policy and outdated encryption software; and Microsoft's cloud data centers were configured incorrectly. Despite their unique situations, all of these breaches raised one big question: If databases are inevitably going to be compromised, why isn't the focus on protecting data itself rather than the database infrastructure?

As more enterprises move their data to the cloud and rely on outsourced service providers whose work requires access to sensitive data, it is paramount that the data itself be protected.

The cloud is widely recognized as a disruptive technology that is changing the way everyone (from consumers to small businesses to large enterprises) communicates and does business. It comes as no surprise that adoption of cloud-based data storage is on the rise. A December 2010 Cisco study revealed that 52 percent of IT officials surveyed currently use or plan to use cloud computing. Unfortunately, the biggest hurdle they face in bringing their data to the cloud is security. The survey suggests that despite growing concerns and security breaches, cloud adoption will continue to rise because it presents many opportunities for cost savings, collaboration, efficiency, and mobility.

How can cloud strategies ensure that all possible links to the original data are secure?

My primary recommendation is that all personal customer and employee data be stored in an encrypted form. Although there is no silver bullet solution for doing so, tokenization is an emerging method for securing sensitive data that is being carefully considered and deployed by some of the most innovative companies around the world.

Let's compare how encryption and tokenization secure data and look at the different types of tokenization available.

Differences between Encryption and Tokenization

End-to-end encryption secures sensitive data throughout most of its lifecycle, from capture to disposal, providing strong protection of individual data fields. Although it is a practical approach on the surface, encryption keys are still vulnerable to exposure, which can be dangerous in the riskier cloud environment. Encryption also lacks transparency: applications and databases expect fields of specific data types and lengths, and encrypted output typically matches neither. If the type or length of the encrypted value is incompatible with what a field expects, the application or database cannot read it.

Tokenization solves many of these problems. At the basic level, tokenization is different from encryption in that it is based on randomness, not on a mathematical formula. It eliminates keys by replacing sensitive data with random tokens, so a thief who obtains a token gains nothing of value. The token cannot be reversed or exploited because the only way to get back to the original value is to reference the lookup table that connects the token with the original encrypted value. There is no formula, only a lookup.
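The lookup-only idea can be illustrated with a minimal sketch. The class name `TokenVault` and its methods are hypothetical, the example assumes all-digit values, and it keeps the original value in plain memory purely for brevity; a real token server would store that value encrypted and behind access controls.

```python
import secrets


class TokenVault:
    """Toy in-memory token vault: random tokens plus a lookup table."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        while True:
            # Draw random digits; the token is not computed from the value.
            token = "".join(secrets.choice("0123456789") for _ in value)
            if token not in self._token_to_value and token != value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # The only path back to the original value is this lookup.
        return self._token_to_value[token]


vault = TokenVault()
card_token = vault.tokenize("4111111111111111")
print(card_token)                    # 16 random digits, different on every run
print(vault.detokenize(card_token))  # 4111111111111111
```

Because the token is drawn at random rather than derived from the card number, someone holding only the token has nothing to invert; the relationship exists solely in the table.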

A token by definition looks like the original value in data type and length. These properties enable it to travel inside applications, databases, and other components without modification, resulting in greatly increased transparency. This transparency translates to reduced remediation costs for applications, databases, and other components where sensitive data lives because the tokenized data will match the original data type and length.
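To make the transparency point concrete, here is a small sketch under stated assumptions. The 16-digit column rule and the helper names are hypothetical, and the "ciphertext" is only a stand-in (16 random bytes, hex-encoded, mimicking the shape of one 128-bit cipher block); no real cipher is run.

```python
import re
import secrets

# Hypothetical validation rule for a card-number column: exactly 16 digits.
CARD_COLUMN = re.compile(r"\d{16}")


def fits_card_column(value: str) -> bool:
    """Return True if the value would be accepted by the column rule."""
    return CARD_COLUMN.fullmatch(value) is not None


def random_digit_token(length: int) -> str:
    """Random token with the same type (digits) and length as the original."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


token = random_digit_token(16)
# Stand-in for encrypted output: 16 random bytes rendered as 32 hex characters.
ciphertext_like = secrets.token_bytes(16).hex()

print(fits_card_column(token))            # True
print(fits_card_column(ciphertext_like))  # False: 32 characters, not 16 digits
```

The token passes through the existing field unchanged, while the ciphertext-shaped value would force a schema or application change.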

The cloud is most certainly a high-risk environment because it decreases administrators' ability to control the flow of all sensitive data. As cloud computing introduces risk, encryption keys become more vulnerable and put data at risk. Because tokenization eliminates encryption keys on most systems, thieves have far less chance of doing anything useful with data stolen from those systems. Tokenization adds an extra layer of security that helps keep information safe even if these cloud systems are breached.

First-Generation Tokenization

Currently there are two forms of tokenization available: "first generation" and "next generation 'small footprint' tokenization." First-generation tokenization is available in two flavors: dynamic and static.

-- Dynamic first-generation tokenization is defined by large lookup tables that assign a token value to the original encrypted sensitive data. These tables grow dynamically as they accept new, un-tokenized, sensitive data. Tokens, encrypted sensitive data and other fields that contain "administrative" data expand these tables, increasing the already-large footprints.

-- Static first-generation tokenization is characterized by a pre-populated token lookup table. This approach attempts to reduce the overhead of the tokenization process by pre-populating lookup tables with the anticipated combinations of the original sensitive data and its corresponding token, thereby eliminating the need to generate tokens at run time. Because the token lookup tables are pre-populated, they also carry a large footprint.
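The two flavors described above can be contrasted in a toy sketch. Everything here is illustrative: the function and class names are hypothetical, the values are 4-digit strings so the static table stays small enough to build, and plain dictionaries stand in for what would be encrypted database tables.

```python
import random
import secrets

DIGITS = "0123456789"


def build_static_table(length: int) -> dict:
    """Pre-populate a token for every possible digit string of this length."""
    values = [str(n).zfill(length) for n in range(10 ** length)]
    tokens = values[:]
    random.SystemRandom().shuffle(tokens)  # random one-to-one assignment
    return dict(zip(values, tokens))


class DynamicTable:
    """Assigns a token on first sight, so the table grows with new data."""

    def __init__(self):
        self.value_to_token = {}
        self.used_tokens = set()

    def tokenize(self, value: str) -> str:
        if value not in self.value_to_token:
            token = "".join(secrets.choice(DIGITS) for _ in value)
            while token in self.used_tokens:  # redraw until unused
                token = "".join(secrets.choice(DIGITS) for _ in value)
            self.used_tokens.add(token)
            self.value_to_token[value] = token
        return self.value_to_token[value]


static_table = build_static_table(4)  # every row exists before any data arrives
dynamic_table = DynamicTable()
for pin in ["0420", "1234", "0420"]:
    dynamic_table.tokenize(pin)

print(len(static_table))                  # 10000
print(len(dynamic_table.value_to_token))  # 2
```

The static table pays its full footprint up front, while the dynamic table starts empty and gains a row for every new value it sees.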

Although these approaches are promising, they also introduce great challenges:

-- Latency: Large token tables are not mobile and are deployed centrally. The need to use tokenization throughout the enterprise will introduce latency and thus poor performance and limited scalability.

-- Replication: Dynamic token tables must always be synchronized (also known as replication), an expensive and complex process that may eventually lead to collisions. Complex replication requirements impact the ability to scale performance to meet business needs and to deliver high availability.

-- Practical limitation on the number of data categories that can be tokenized: Consider the large lookup tables that would be needed to tokenize credit cards for a merchant. Now consider the impact of adding social security numbers, e-mail addresses and any other fields that may be deemed sensitive. The use of dynamic or static first-generation tokenization quickly turns into an impractical solution.
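A rough back-of-the-envelope calculation shows how quickly a fully pre-populated table grows. The row counts follow from the number of possible digit strings; the 32 bytes per row is purely an assumption for illustration, not a measured figure from any product.

```python
BYTES_PER_ROW = 32  # assumed size of one (value, token) row; illustration only


def static_table_rows(digits: int) -> int:
    """Number of rows needed to pre-populate every possible digit string."""
    return 10 ** digits


def footprint_gb(rows: int, bytes_per_row: int = BYTES_PER_ROW) -> float:
    """Table size in decimal gigabytes."""
    return rows * bytes_per_row / 10 ** 9


for name, digits in [("9-digit SSN", 9), ("16-digit card number", 16)]:
    rows = static_table_rows(digits)
    print(f"{name}: {rows:,} rows, about {footprint_gb(rows):,.0f} GB")
```

Under that assumption, a table covering every 9-digit value comes to about 32 GB, and one covering every 16-digit value would be roughly ten million times larger, before any further data category is added.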

Next-Generation "Small Footprint" Tokenization

Unlike both first-generation solutions, which use large token servers that are constantly growing with encrypted data, next-generation tokenization compresses and normalizes random data to create a small system footprint. Because the "small footprint" tokenization can be fully distributed and does not require any resource-intensive data replication or synchronization, users benefit from a lower-cost, less-complex, higher-performing data security solution that guarantees no collisions. Let me explain.

Token servers with small footprints enable the distribution of the tokenization process so that token operations can be executed in parallel and closer to the data. Thus, latency is eliminated or greatly reduced depending on the deployment approach used. The smaller footprint also enables the creation of farms of token servers, based on inexpensive commodity hardware, that provide whatever scaling the business requires without the need for complex or expensive replication processes. Any number of data categories (ranging from credit card numbers to medical records) can be tokenized without the penalty of increasing the footprint, and more data types can benefit from the transparent properties that tokens offer.
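The following sketch illustrates only the general properties named above: a fixed-size footprint, no replication, and no collisions. It is not the algorithm of any actual product, and chunk-by-chunk substitution like this is not a secure design on its own. The scheme, the `CHUNK` size, and all function names are assumptions chosen for illustration.

```python
import random

CHUNK = 4  # digits per chunk; each table has 10**4 rows whatever the data volume


def make_tables(num_chunks: int) -> list:
    """Build one random permutation of all CHUNK-digit strings per position."""
    rng = random.SystemRandom()
    tables = []
    for _ in range(num_chunks):
        values = [str(n).zfill(CHUNK) for n in range(10 ** CHUNK)]
        shuffled = values[:]
        rng.shuffle(shuffled)
        tables.append(dict(zip(values, shuffled)))
    return tables


def invert(tables: list) -> list:
    """Reverse each table so tokens can be mapped back to original chunks."""
    return [{token: value for value, token in t.items()} for t in tables]


def substitute(value: str, tables: list) -> str:
    """Replace each chunk of the value using the table for its position."""
    if len(value) != CHUNK * len(tables):
        raise ValueError("value length must match the number of tables")
    chunks = [value[i:i + CHUNK] for i in range(0, len(value), CHUNK)]
    return "".join(table[chunk] for table, chunk in zip(tables, chunks))


tables = make_tables(4)                     # generated once, then copied out
server_a, server_b = tables, list(tables)   # two servers, same static tables

token = substitute("4111111111111111", server_a)
print(token == substitute("4111111111111111", server_b))  # True
print(substitute(token, invert(tables)))                  # 4111111111111111
print(sum(len(t) for t in tables))                        # 40000
```

Because the tables never change after they are generated, each server can hold its own copy and nothing needs to be synchronized. Because each table is a permutation, two different values can never produce the same token. The footprint stays at 40,000 rows no matter how many values are tokenized.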


A holistic solution for data security should be based on centralized data security management that protects sensitive information throughout the entire flow of data across the enterprise, from acquisition to deletion. Next-generation tokenization has the potential to help businesses protect sensitive data in the cloud in a more efficient and scalable manner than encryption and first-generation tokenization, allowing them to lower the costs associated with compliance in ways never before imagined. Although no technology offers a perfect solution, tokenization is certainly worth considering if you want to bring your data to the cloud with confidence.

Ulf Mattsson is the chief technology officer of Protegrity, a company that specializes in enterprise data security management. Ulf created the architecture of the Protegrity Data Security Platform and is commonly considered one of the founding fathers of tokenization. Ulf holds more than 20 patents in the areas of encryption key management, policy-driven data encryption, internal threat protection, data usage control, and intrusion prevention. You can contact the author at