What is data masking?

It’s no secret that the world runs on data. It is critical for product development, key to mastering the supply chain, vital for communication, and the very essence of commerce around the globe. Because data is so important to every part of life, it has gone from a simple way of describing information and transactions to a valuable asset in its own right.

Of course, by becoming such a valuable asset, data has likewise become vulnerable to theft, misuse, and exploitation. At this point, most of us have been touched by the fallout of a data breach in one way or another. We all fully understand the consequences of a lapse in data security — both the personal costs to individuals who have had their data exposed and the financial, legal, and reputational repercussions for businesses that have fallen victim to a data breach.

Organizations need to strike a balance between protecting sensitive data and keeping that data flowing through and between systems. Protection matters most for personally identifiable information (PII) such as social security numbers or tax IDs, credit card numbers or bank information, and health information. The answer is data masking.

Definition of data masking

Data masking is an umbrella term for a range of techniques and strategies to protect classified, proprietary, or sensitive information while still preserving data usability. In other words, you replace the sensitive data with something that isn’t sensitive but has the same format, so you can test systems or build products using realistic, production-like data without putting the original data at risk.

How does data masking work?

In most ways, data masking is like any other data transformation. In the data pipeline, you add a masking component and select the appropriate algorithm for the format of the data and method of transformation. When the data passes through this data masking component, obscured or otherwise anonymized data comes out the other end. This style of masking can be batch or streaming, and can be applied to data in a single database or as it moves from one database to another (for example, from a production environment to a testing environment).
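As a rough illustration of a masking component sitting in a pipeline, here is a minimal Python sketch. The names mask_records, redact, and the sample rows are hypothetical and not taken from any particular product. Because the component is written as a generator, the same code handles a batch (a list) or a stream (any iterator).

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]
Masker = Callable[[str], str]


def redact(value: str) -> str:
    """Overwrite every character with 'X', preserving the length."""
    return "X" * len(value)


def mask_records(records: Iterable[Record], rules: Dict[str, Masker]) -> Iterator[Record]:
    """Yield a masked copy of each record, one record at a time.

    `rules` maps a field name to the masking function for that field.
    Fields without a rule pass through unchanged.
    """
    for record in records:
        yield {
            field: rules[field](value) if field in rules else value
            for field, value in record.items()
        }


# Hypothetical production rows; only "name" is treated as sensitive here.
production_rows = [
    {"id": "1", "name": "Smith", "plan": "pro"},
    {"id": "2", "name": "Nguyen", "plan": "free"},
]

test_rows = list(mask_records(production_rows, {"name": redact}))
print(test_rows)
```

The source rows are never modified, so the masked output can be written to a test environment while production stays as it is.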

Crucially, data masking obscures the actual value while preserving the format of the data. Take an email address. An email address has two parts: the username before the @ symbol and the domain name after the @ symbol. In a valid email address, there are also limitations restricting which characters can be used in both the username and the domain name. For example, “xxxxxxx@xxxxx.xxx” looks like an email address, while “xxxxxxx.xxxxxx@xxx” does not.

While a test environment does not need genuine user email addresses, it will need values that look and operate like real addresses to correctly build and test those processes. In this scenario, we would want to replace the real data — the email address that is stored in the production database — with something that’s fake, but follows the same rules.

This can be a 1:1 replacement, but it doesn't have to be. For example, if one email address in the source data has seven characters in the username and another has nine characters, they might be represented in the testing environment with seven-character and nine-character strings, respectively. But it could also be that every email address, no matter how many characters are in the source username, is replaced with a five-character string.

In some scenarios, you may only want to mask part of the data — for example, masking the username, but retaining the domain.
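The email options described here (length-preserving replacement, fixed-length replacement, and keeping the domain) can be sketched in a few lines of Python. This is a minimal illustration rather than a production algorithm: the function name mask_email, its parameters, and the use of the reserved domain example.com as a stand-in are all assumptions made for the example.

```python
import random
import string


def random_letters(length: int, rng: random.Random) -> str:
    """Return `length` random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def mask_email(
    email: str,
    rng: random.Random,
    keep_length: bool = True,
    keep_domain: bool = False,
    fixed_length: int = 5,
) -> str:
    """Replace the username (and optionally the domain) of an email address.

    keep_length=True  -> masked username has as many characters as the original
    keep_length=False -> masked username always has `fixed_length` characters
    keep_domain=True  -> only the username is masked
    """
    if "@" not in email:
        raise ValueError("expected a value containing '@'")
    username, _, domain = email.rpartition("@")
    size = len(username) if keep_length else fixed_length
    masked_domain = domain if keep_domain else "example.com"
    return f"{random_letters(size, rng)}@{masked_domain}"


rng = random.Random(42)  # seeded only so the demo is repeatable
print(mask_email("jane.doe@corp.example", rng))                     # same-length username
print(mask_email("jane.doe@corp.example", rng, keep_length=False))  # five-character username
print(mask_email("jane.doe@corp.example", rng, keep_domain=True))   # original domain retained
```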

It's really up to the need. When the purpose is to test something like email validation or transformation of the email address data itself within the test environment, it would be important to retain some of those unique qualities of the original data values. But sometimes you just need something simple, like testing what happens when the email address is complete vs. when the email address is blank. In that case, as long as the records have correctly formatted data, the specifics don’t really matter. You don't always need to use a fancy algorithm that will take time to process.

Types of data masking

Depending on an organization’s specific needs and blend of production and non-production environments, they may want to consider several types of data masking. These are the most common:

Static data masking usually describes the masking of data in storage. This can apply to all or part of a production database, and usually involves creating a backup copy.
On-the-fly data masking refers to real-time data masking that happens as part of the process of moving or replicating data from one place to another without exposing the data either in transit or in the destination.
Dynamic data masking is similar to on-the-fly data masking in that it is a real-time function, but instead of masking data as it is moved or copied, it masks the data at the moment it is accessed. For example, in a production database, sensitive data will be masked at query time while remaining in plain text in the database.
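The difference between these three types is mostly a question of when the masking function runs. The following Python sketch contrasts them using in-memory lists as stand-ins for databases. The function names, the "auditor" role, and the social security number format are assumptions chosen for illustration only.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def mask_ssn(value: str) -> str:
    """Replace every digit with 'X' except those in the last four characters."""
    cutoff = len(value) - 4
    return "".join(
        "X" if ch.isdigit() and i < cutoff else ch
        for i, ch in enumerate(value)
    )


def masked_copy(row: Row) -> Row:
    """Return a new row with the sensitive field masked."""
    return {**row, "ssn": mask_ssn(row["ssn"])}


def static_mask(table: List[Row]) -> List[Row]:
    """Static: build a separate, permanently masked copy of stored data."""
    return [masked_copy(row) for row in table]


def on_the_fly_copy(source: Iterable[Row], destination: List[Row]) -> None:
    """On-the-fly: mask each row while it is being moved, so the
    destination never receives the original values."""
    for row in source:
        destination.append(masked_copy(row))


def dynamic_read(table: List[Row], role: str) -> List[Row]:
    """Dynamic: stored rows stay in plain text; masking happens at query
    time unless the caller has a privileged (hypothetical) role."""
    if role == "auditor":
        return [dict(row) for row in table]
    return [masked_copy(row) for row in table]


production = [{"name": "Smith", "ssn": "123-45-6789"}]
print(dynamic_read(production, role="developer"))
```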

Data masking techniques

Within the category of data masking, there are a wide variety of methods for obscuring some or all of the original values in a given dataset. These methods all have their own unique advantages and disadvantages, depending on the needs of the user. For example, some techniques are more secure than others. Some are reversible, while others are much, much more resilient to reverse engineering.

Here are a few of the most common data masking methods:

Substitution. One of the most effective methods for data masking involves simply swapping out real data for different (but authentic-feeling) data of the same type. Depending on the type and complexity of the data, this may involve maintaining two separate databases — the source data plus a lookup database for reference. It does result in records that look and feel real, however, and can be particularly helpful with values that typically include a checksum, such as credit card numbers, where a randomly generated or fully obfuscated value might not pass certain validation tests.
Encryption. Another popular method uses an encryption algorithm to replace the personal data with an apparently random string called ciphertext. In most cases, decryption of the data is only possible for someone with access to both the encryption algorithm and the cryptographic key used to encrypt it in the first place. This is very useful for storing and transferring data securely, but can be less helpful in testing environments that need more realistic-feeling data.
Shuffling or scrambling. A very simple method for masking data takes the characters in a string and mixes them randomly — for example, “Smith” could become “Tmhsi.” This is obviously not the most secure or sophisticated data masking method, but it can be highly effective for certain data types, and particularly in conjunction with more sophisticated data masking for other fields.
Redaction or masking out. The simplest technique of all anonymizes the source data by completely overwriting the string. Under this method, “Smith” would become “XXXXX” or “█████”. This method is quick, easy, and highly secure, but it has limited applications in testing and cannot be reversed.
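Three of these techniques are simple enough to sketch with the Python standard library alone. The example below shows redaction, character shuffling, and a simplified form of substitution for credit card numbers. The substitution here generates a random number with a valid Luhn check digit instead of drawing from a lookup database, which is an assumption made to keep the example short. Encryption is left out because the standard library does not include a symmetric cipher. All function names are hypothetical.

```python
import random


def redact(value: str, fill: str = "X") -> str:
    """Overwrite every character with the fill character."""
    return fill * len(value)


def shuffle_chars(value: str, rng: random.Random) -> str:
    """Return the same characters in a random order."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for position, ch in enumerate(reversed(partial)):
        digit = int(ch)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check whether the last digit is the correct Luhn check digit."""
    return luhn_check_digit(number[:-1]) == number[-1]


def substitute_card_number(original: str, rng: random.Random) -> str:
    """Return random digits of the same length that pass the Luhn check."""
    body = "".join(str(rng.randrange(10)) for _ in range(len(original) - 1))
    return body + luhn_check_digit(body)


rng = random.Random(7)  # seeded only so the demo is repeatable
print(redact("Smith"))
print(shuffle_chars("Smith", rng))
print(substitute_card_number("4111111111111111", rng))
```

A fully redacted or randomly scrambled card number would usually fail checksum validation, which is why the substitute value is built around a computed check digit.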

Benefits of data masking

Secure development

One of the greatest benefits of data masking — and one of the most common applications of masked data — is creating completely secure test and development environments.

Often, you might not want the engineers responsible for developing your applications to have complete access to secure data like credit card numbers or health information. That’s simply not their role. However, in order to build these systems, they need to be able to test them against realistic volumes of realistic data.

By masking the data that appears in non-production environments, you can preserve your internal rules and protocols for security and data protection, while still giving your developers everything they need to build robust systems.

Risk mitigation and protection against data breaches

We've seen so many breaches in recent years. Even the biggest companies have breaches. If hackers and bots are determined to break into your system, it’s very hard to keep them out entirely.

But the good news is that, while data breaches do and will happen, the damage is limited when your most sensitive data (email addresses, passwords, financial or medical records) is masked and secure. That way, even if a hacker is able to access your system and download data, they will be left with meaningless, anonymized strings that are of no practical value.

You should assume that someone at some point is going to try to access your data maliciously. While it may not be possible to fully eliminate risk, it is possible to mitigate the worst of the effects. Data masking can protect your company, your employees, and your customers.

Regulatory compliance

And, finally, there are data privacy regulations such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA) that determine how data can be used, transmitted, and stored.

When it comes to regulatory compliance, data masking is particularly valuable because it gives you so much control over who has access to data, which data they can access, and how you track the movement of data through your systems. You can restrict access to the production systems that hold the real, unmasked data. But in the non-production environments where developers and engineers run their tests, you can operate without the sensitive data and still test the system under realistic conditions.

Data masking use cases

Certain types of data — particularly financial and health information — have specific data privacy regulations that may change the way you think about data masking.

Credit card transactions

Most organizations today process credit card payments of some kind. The Payment Card Industry Data Security Standard (PCI DSS) is a set of security requirements for organizations that handle credit card data, and it protects consumers worldwide.

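Among other specifications, Requirement 3 of the PCI DSS says that stored cardholder data, including account numbers, account holder names, and expiration dates, must be protected, with the account number rendered unreadable so that it is only usable to someone with the proper cryptographic keys. Security or validation codes are treated more strictly still and generally may not be stored at all once a transaction has been authorized.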

Healthcare data

Unsurprisingly, health information is subject to particularly stringent rules around data security. In the US, that is the HIPAA Privacy Rule, which includes a provision that individuals have a right to request access to their protected health information.

For many organizations, this may mean that you must be able to reverse data masking so that the data can be reported back to the individual. In this case, you would use an encryption-based algorithm so that the data is well protected but can still be deciphered by an approved user with the encryption key.

The future of data masking

Today we have a lot of data masking algorithms that give you a great deal of control over how and where you mask your data, which data formats require pseudonymization, and more. But you still need to define which format, which type of masking is appropriate, and which technique you want to use. These algorithms are powerful, but you need to feed a lot of settings into them first.

The future of data masking is intelligent data masking, an AI-driven extension of the processes we already have. With intelligent data masking, the program could do the hard work for you by identifying what type of information you have, what format it’s in, and what type of data masking would work best.

As soon as sensitive data enters the system, it will be identified and secured. For data in the cloud, Talend Data Preparation is already applying some principles of intelligent data masking.
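To make the idea of automatic identification concrete, here is a deliberately simple Python sketch. It is a rule-based stand-in that uses regular expressions, not an AI model and not a description of how any Talend product works. The patterns, labels, and function names are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Hypothetical detectors: a label, a pattern that recognizes the format,
# and the masking function to apply when the pattern matches.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def mask_email_user(value: str) -> str:
    """Mask the username but keep the domain."""
    user, _, domain = value.rpartition("@")
    return "x" * len(user) + "@" + domain


def mask_digits(value: str) -> str:
    """Replace every digit with 'X', keeping separators in place."""
    return re.sub(r"\d", "X", value)


DETECTORS = [
    ("email", EMAIL, mask_email_user),
    ("ssn", SSN, mask_digits),
]


def classify(value: str) -> Optional[str]:
    """Return the label of the first detector whose pattern matches the whole value."""
    for label, pattern, _ in DETECTORS:
        if pattern.fullmatch(value):
            return label
    return None


def auto_mask(value: str) -> str:
    """Mask the value if it is recognized as sensitive; otherwise return it unchanged."""
    for _, pattern, masker in DETECTORS:
        if pattern.fullmatch(value):
            return masker(value)
    return value


for sample in ["jane@corp.example", "123-45-6789", "pro"]:
    print(classify(sample), auto_mask(sample))
```

An intelligent system would replace the hand-written patterns with learned classification, but the overall flow (detect the type, pick a technique, apply it) would be similar.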

Get started with data masking

Every business requires some degree of data masking. It could be the simple anonymization of PII: an email address, a password, an address, a social security number, and so on. It could be the replication of an entire database for development and testing.

Then there’s the data integration process. What are the specifics around when data is being replicated, integrated, moved, transformed from one place to another? Are there different needs for data masking in that process? Can data be masked throughout that process?

Every organization has its own unique development, migration, security, and compliance needs. That’s why Talend offers such incredible flexibility. With Talend, you can use data masking exactly when and how you need it.

Get started with Talend today.