Data Privacy through shuffling and masking – Part 2

By Talend Team


In the first part of this two-part blog series, we took a deep dive into Data Shuffling techniques, which mix up data and can optionally retain logical relationships between columns. In this second part, we focus on Data Masking techniques as one of the main approaches to guaranteeing Data Privacy.

Data Masking

Simply put, masking techniques block the visibility of specific fields or pieces of data. They hide data while preserving its overall format and semantics. Masking creates a structurally similar but inauthentic version of the data by applying specific functions to data fields.

Note that with the most common data masking techniques, the original data cannot be retrieved once it has been masked. However, some encryption-based algorithms can encrypt and decrypt data while preserving its format, as we will see at the very end of this section.

In the following, we first describe some of the numerous data transformation functions used to hide pieces of data. Then we detail the different masking modes and their implications at runtime.

Data Transformation Functions

To mask data, many transformation functions can be applied to the original data. Let’s first dig into the most common ones. This list is not exhaustive, and other transformations can easily be applied to create other inauthentic versions of the data.

Text handling functions

The following table lists some of the available masking routines for text and their effects on the example value “Talend in 2019 is awesome”.
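To give a feel for what such text routines do, here is a minimal Python sketch. The function names `replace_first_chars` and `mask_all_but_last` are hypothetical illustrations, not Talend routines, and the choice to keep letters as letters and digits as digits is an assumption.

```python
import random
import string


def replace_first_chars(value: str, n: int, rng: random.Random) -> str:
    """Replace the first n characters, keeping letters as letters and digits as digits."""
    out = []
    for i, ch in enumerate(value):
        if i >= n:
            out.append(ch)                      # beyond the masked prefix: keep as-is
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)                      # spaces and punctuation are preserved
    return "".join(out)


def mask_all_but_last(value: str, keep: int, mask_char: str = "X") -> str:
    """Hide everything except the last `keep` characters."""
    hidden = max(len(value) - keep, 0)
    return mask_char * hidden + value[hidden:]


original = "Talend in 2019 is awesome"
print(replace_first_chars(original, 6, random.Random()))
print(mask_all_but_last(original, 7))
```

Both helpers keep the length of the input, so downstream systems that validate field sizes still accept the masked value.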

Numeric values handling functions

The following table lists the available masking routines for a column containing numeric values and their effects on the example value 21803.
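As a rough illustration of numeric masking, the Python sketch below shows two possible routines. Both names (`numeric_variance`, `replace_all_digits`) and their exact behavior are assumptions made for this example, not a description of a specific product function.

```python
import random


def numeric_variance(value: int, percent: float, rng: random.Random) -> int:
    """Shift the value by a random rate drawn from [-percent, +percent]."""
    rate = rng.uniform(-percent, percent) / 100.0
    return round(value * (1 + rate))


def replace_all_digits(value: int, rng: random.Random) -> int:
    """Replace every digit with a random one, keeping the number of digits."""
    text = str(abs(value))
    first = rng.choice("123456789")  # avoid a leading zero so the length is kept
    rest = "".join(rng.choice("0123456789") for _ in text[1:])
    masked = int(first + rest)
    return -masked if value < 0 else masked


rng = random.Random()
print(numeric_variance(21803, 10, rng))   # stays within roughly 10% of the input
print(replace_all_digits(21803, rng))     # still a five-digit number
```

A variance-style routine keeps aggregates such as averages roughly meaningful, whereas replacing all digits only keeps the format.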

Date handling functions

The following table lists the available masking routines for a column containing date values and their effects on the example value 05/04/2018.
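For dates, two typical ideas are to shift the date by a bounded random number of days or to keep only the year. The Python sketch below illustrates both. The helper names are hypothetical, and reading 05/04/2018 as day/month/year is an assumption.

```python
import random
from datetime import date, datetime, timedelta


def date_variance(value: date, max_days: int, rng: random.Random) -> date:
    """Shift the date by a random number of days in [-max_days, +max_days]."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


def keep_year_only(value: date) -> date:
    """Keep the year and reset day and month to January 1st."""
    return date(value.year, 1, 1)


# Assumption: the example value uses the day/month/year convention.
original = datetime.strptime("05/04/2018", "%d/%m/%Y").date()
print(date_variance(original, 30, random.Random()))
print(keep_year_only(original))
```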

Pattern handling functions

Specific algorithms can be applied to mask data that follows a specific pattern. This is ideal for masking structured and standardized records such as credit card numbers, Social Security Numbers (SSN), account IDs, IP addresses, etc.

For example, if we want to mask a French Social Security Number, the input values consist of 15 characters, excluding spaces, and use the pattern “s yy mm ll ooo kkk cc” where:

s is the gender: 1 for a male, 2 for a female,
yy are the last two digits of the year of birth,
mm is the month of birth,
ll is the number of the department of origin,
ooo is the commune of origin,
kkk is an order number to distinguish people being born at the same place in the same year and month,
cc is the “control key”.

Specifying exactly how each part is masked, using defined ranges, makes it possible to transform the original data in a consistent manner. For example, you can specify that:

s must be generated between 1 and 2,
yy must be generated between 00 and 99,
mm must be generated between 01 and 12,
etc.

You can also mask specific parts of the input and keep other parts unmasked. For example, you might want to mask all the SSN characters except the first one. This keeps real gender statistics (gender is represented by the first character) while preserving the anonymity of the real person, since the other characters are fully masked.

Of course, the same behavior can be applied to dedicated data types such as emails, phone numbers, addresses, etc.
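The pattern-based approach described above can be sketched in Python as follows. This is only an illustration: the `FIELDS` table and the `mask_ssn` function are hypothetical, the ranges for s, yy and mm come from the text, and the ranges for ll, ooo, kkk and cc are assumptions (in particular, the control key is simply drawn at random rather than recomputed).

```python
import random

# (field name, width, lowest value, highest value)
# Ranges for ll, ooo, kkk and cc are assumptions made for this sketch.
FIELDS = [
    ("s", 1, 1, 2),
    ("yy", 2, 0, 99),
    ("mm", 2, 1, 12),
    ("ll", 2, 1, 99),
    ("ooo", 3, 1, 999),
    ("kkk", 3, 1, 999),
    ("cc", 2, 1, 97),
]


def mask_ssn(ssn: str, rng: random.Random, keep: tuple = ()) -> str:
    """Regenerate each field within its range, except the fields listed in `keep`."""
    digits = ssn.replace(" ", "")
    if len(digits) != 15 or not digits.isdigit():
        raise ValueError("expected 15 digits, excluding spaces")
    parts, pos = [], 0
    for name, width, low, high in FIELDS:
        original = digits[pos:pos + width]
        pos += width
        if name in keep:
            parts.append(original)  # unmasked on purpose
        else:
            parts.append(str(rng.randint(low, high)).zfill(width))
    return " ".join(parts)


# Keep the gender digit, mask everything else.
print(mask_ssn("1 90 04 94 184 376 21", random.Random(), keep=("s",)))
```

Because every field is generated within its own range, the masked value still looks like a plausible SSN, and keeping `s` preserves gender statistics.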

Masking Modes and Runtime Behaviors

When masking data, besides the technical routines applied to transform data, another component is key: the masking mode and the behavior the functions have at runtime.

Depending on the targeted use case, data masking routines can be purely random. But they can also be repeatable from one execution to another on the same dataset. This opens up many possibilities, in particular joins and lookups on masked data. Let’s dig into that…

Random Data Masking

Random masking consists of masking an input value with a randomly generated value. As a consequence, when there are multiple occurrences of the same value in the input dataset, they can be masked to different values. Conversely, different values from the input dataset can be masked to the same value.

For example, the following diagram shows pure random data masking:

The A value is masked to D when it first appears in the input dataset.
The B and C values are masked to E.
The A value is masked to F when it appears in the input dataset for the second time.
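The behavior in this example can be reproduced with a few lines of Python. The sketch below is illustrative only: `random_mask` is a hypothetical helper that draws every masked value independently, which is exactly why the two occurrences of A may end up with different outputs.

```python
import random


def random_mask(values, replacements, rng=None):
    """Mask each value with an independent random draw; earlier draws are not remembered."""
    if rng is None:
        rng = random.Random()
    return [rng.choice(replacements) for _ in values]


inputs = ["A", "B", "C", "A"]
masked = random_mask(inputs, ["D", "E", "F"])
for before, after in zip(inputs, masked):
    print(before, "->", after)
```

Nothing ties an output to its input here, so the result changes from one run to the next and cannot be used for joins.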

The following table shows examples of generated masked values using a “Replace n first chars” function:

Here, two input values are the same. Once masked, the output values are completely different: “newuser” is masked as “uãáìser” the first time it occurs and as “åõzoser” the second time.

The following table shows examples of masked values generated from a French SSN:

Again, two input values are identical. Once masked, the corresponding output values are completely different: “1 90 04 94 184 376 21” is masked as “2 59 04 592 221 47 22” the first time it occurs and as “2 73 03 64 078 284 70” the second time.

Random data masking is a good fit, and might be sufficient, if you only need to hide data while preserving its overall format and semantics, with no further constraint on keeping relationships between initial values and masked values.

Consistent Data Masking

When the same value appears twice in the input data, consistent masking functions output the same masked value. However, two different input values can be replaced with the same masked value in the output.

For example, the following diagram shows consistent masking:

The A value is masked to D, regardless of the number of occurrences in the input dataset.
The B and C values are masked to E.

The following table shows examples of masked values generated using the “Mask email left part of domain” function with consistent items (i.e., the left part of the domain is replaced by one of the items set in the extra parameters list):

Here, the same input values are always masked as the exact same output values: “domain” is always masked as “newcompany” and “company” is always masked as “value”.
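One possible way to obtain this behavior, sketched in Python below, is to derive the choice of replacement item from a keyed hash of the input. This is an assumption made for illustration, not necessarily how the function above is implemented; `consistent_pick`, `mask_email_domain` and the secret `key` are hypothetical names.

```python
import hashlib
import hmac


def consistent_pick(value: str, items: list, key: bytes) -> str:
    """Always pick the same item for the same value; different values may collide."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    return items[int.from_bytes(digest[:8], "big") % len(items)]


def mask_email_domain(email: str, items: list, key: bytes) -> str:
    """Replace the left part of the domain by one of the configured items."""
    if "@" not in email:
        raise ValueError("not an email address")
    local, _, domain = email.rpartition("@")
    left, dot, rest = domain.partition(".")
    return f"{local}@{consistent_pick(left, items, key)}{dot}{rest}"


items = ["newcompany", "value", "example"]
key = b"demo-key"  # hypothetical secret
print(mask_email_domain("jane@domain.com", items, key))
print(mask_email_domain("john@domain.com", items, key))  # same masked domain as above
```

Since the list of items is usually much shorter than the list of possible inputs, two different domains can receive the same replacement, which is what distinguishes consistent masking from bijective masking.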

Consistent data masking can be seen as a first step prior to bijective data masking.

Bijective Data Masking

Bijective masking functions have the following characteristics:

They are consistent masking functions.
They always output two different masked values for two different input values.

For example, the following diagram shows bijective masking:

The A value is masked to D, regardless of the number of occurrences in the input dataset.
The B value is masked to E.
The C value is masked to F.
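For a small, enumerable domain, a bijective mapping can be sketched in Python as a shuffled copy of the domain. This is a toy illustration under that assumption; `build_bijection` is a hypothetical helper, and real bijective masking of large domains such as SSNs needs a different construction.

```python
import random


def build_bijection(domain, seed):
    """Map every value of the domain to a distinct value of the same domain."""
    values = list(domain)
    shuffled = values[:]
    random.Random(seed).shuffle(shuffled)  # a permutation, so no two inputs collide
    return dict(zip(values, shuffled))


mapping = build_bijection(["A", "B", "C", "D", "E", "F"], seed=2019)
for value in ["A", "B", "C", "A"]:
    print(value, "->", mapping[value])
```

Because the mapping is a permutation, it can be inverted, and a value may occasionally be mapped to itself.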

The following table shows examples of masked values generated from a French SSN:

Here, the same input values are always masked as the exact same output values: “1 90 04 94 184 376 21” is always masked as “2 89 05 24 283 319 01”.

Bijective data masking is a good fit if you need to ensure a one-to-one correspondence between initial values and masked values. As we will see in the next section, this property is key if you aim to join or look up several masked datasets while keeping the correct relationships.

Repeatable Data Masking

Repeatable masking maintains consistency between Job executions. A seed is defined so that, for a given combination of input and seed values, the same output masked value is produced.
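The role of the seed can be illustrated with the Python sketch below. It assumes one particular way of achieving repeatability, namely deriving a random generator from the (seed, input) pair; `repeatable_mask` is a hypothetical name and this is not a description of a specific product implementation.

```python
import hashlib
import random


def repeatable_mask(value: str, seed: str) -> str:
    """Replace digits using a generator derived from the seed and the input value."""
    material = hashlib.sha256(f"{seed}|{value}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(material, "big"))
    return "".join(rng.choice("0123456789") if ch.isdigit() else ch for ch in value)


ssn = "1 90 04 94 184 376 21"
first_run = repeatable_mask(ssn, seed="job-seed")
second_run = repeatable_mask(ssn, seed="job-seed")
print(first_run == second_run)  # the same seed and input give the same output
```

Changing the seed changes the masked values, so the seed has to be stored and reused if later executions must stay aligned with earlier ones. Note that this sketch is repeatable and consistent but not bijective.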

Combining bijective and repeatable data masking is very powerful. In particular, it makes it possible to join different datasets on a key that has already been masked.

Let’s say we want to perform Business Intelligence on an insurance database and a healthcare database, but we lack explicit consent to access that data directly. We might still be able to compute some statistics on the data once it has been shuffled and masked.

Since we have to join both databases, we must make sure that the data used for the join is masked in exactly the same way everywhere.

By leveraging bijective data masking, we can ensure the same input value is always masked with the same output value, and vice versa.
By leveraging repeatable masking, we can ensure the above is true… at each job execution.
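The Python sketch below illustrates the join scenario. It uses a keyed hash (HMAC-SHA256) as a stand-in for a bijective, repeatable masking function: it is repeatable for a given seed and collision-free in practice, but it is not format-preserving and not strictly bijective. The sample rows, column names and helper functions are all invented for this example.

```python
import hashlib
import hmac


def pseudonymize(key_value: str, seed: bytes) -> str:
    """Same seed and same input always give the same token."""
    return hmac.new(seed, key_value.encode("utf-8"), hashlib.sha256).hexdigest()[:32]


def mask_rows(rows, seed):
    """Return copies of the rows with the join key replaced by its token."""
    return [{**row, "ssn": pseudonymize(row["ssn"], seed)} for row in rows]


def join_on_ssn(left, right):
    """Inner join of two masked datasets on the masked key."""
    index = {row["ssn"]: row for row in right}
    return [{**row, **index[row["ssn"]]} for row in left if row["ssn"] in index]


# Invented sample data.
insurance = [
    {"ssn": "1 90 04 94 184 376 21", "premium": 420},
    {"ssn": "2 85 11 75 056 102 33", "premium": 310},
]
healthcare = [{"ssn": "1 90 04 94 184 376 21", "visits": 3}]

seed = b"shared-seed"  # hypothetical; must be identical for both datasets and all runs
joined = join_on_ssn(mask_rows(insurance, seed), mask_rows(healthcare, seed))
print(joined)
```

If the two datasets were masked with different seeds, the masked keys would no longer match and the join would return nothing.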

Format-Preserving Encryption (FPE)

Format-Preserving Encryption algorithms are cryptographic algorithms that preserve the format of the input values. They require a secret key to be specified to generate unique masked values.

Since those methods are based on encryption, they offer several advantages:

Once encrypted, the data can be decrypted, meaning that it is possible to unmask the output values (knowing the secret key, of course). Such a masking function is therefore reversible: you can retrieve the original input data.
They are natively bijective and repeatable.

FF1 is a Format-Preserving Encryption algorithm specified by NIST in Special Publication 800-38G.
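To illustrate why encryption-based masking is reversible and keeps the format, here is a toy Feistel construction over digit strings in Python. It is emphatically not FF1 and not secure; it only borrows the general idea of alternating keyed rounds over two halves. The function names and the use of HMAC-SHA256 as the round function are assumptions made for this sketch.

```python
import hashlib
import hmac

ROUNDS = 10  # an even number, so both halves end up back in their original positions


def _round_value(key: bytes, round_no: int, half: str) -> int:
    """Keyed pseudo-random integer derived from the round number and one half."""
    message = bytes([round_no]) + half.encode("ascii")
    return int.from_bytes(hmac.new(key, message, hashlib.sha256).digest(), "big")


def _check(digits: str) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two digits")


def toy_fpe_encrypt(digits: str, key: bytes) -> str:
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else v  # width of the half being rewritten
        c = (int(a) + _round_value(key, i, b)) % 10**m
        a, b = b, str(c).zfill(m)
    return a + b


def toy_fpe_decrypt(digits: str, key: bytes) -> str:
    _check(digits)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _round_value(key, i, a)) % 10**m  # undo the modular addition
        a, b = str(c).zfill(m), a
    return a + b


key = b"demo-secret"  # hypothetical key
masked = toy_fpe_encrypt("190049418437621", key)
print(masked, toy_fpe_decrypt(masked, key))
```

The masked value has the same number of digits as the input, the same key always produces the same output, and decryption with that key restores the original. A production system should rely on a vetted implementation of a standardized algorithm rather than on a sketch like this one.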

Conclusion

We saw several techniques that help ensure data privacy. The key takeaway is that there is no easy, magic solution that will solve all your data privacy concerns.

Depending on the type of data you are working with and the use cases you want to address, some techniques are more relevant than others. A mix of different techniques such as data shuffling sprinkled with a bit of repeatable data masking and a pinch of hashing is often the right path to correctly address such complex data privacy projects.

The good news is that Talend Data Fabric provides all those different technical resources to help you address your data privacy needs!
