Data Privacy through shuffling and masking – Part 1
By Talend Team
Protecting sensitive data can be a challenging task for companies. In a connected world in which data privacy regulations are continually changing, some technics offer strong solutions for staying compliant with the latest requirements such as the California Consumer Privacy Act (CCPA) in the United States or the General Data Protection Regulation (GDPR) in Europe.
In this two-part blog series, we aim to explore some of the techniques giving you the ability to selectively share production quality data across your organization for development, analysis and more without exposing Personally Identifiable Information (PII) to people who are not authorized to see it.
To guarantee Data Privacy, several approaches exist – especially the four following ones:
Choosing one of them – or a mix of them – mainly depends on the type of data you are working with and the functional needs you have. Plenty of literature is already available for what regards Encryption and Hashing techniques. In the first part of this blog two-part series, we will take a deep dive on Data Shuffling techniques. We will cover Data Masking in the second part.
Simply put, shuffling techniques aim to mix up data and can optionally retain logical relationships between columns. It randomly shuffles data from a dataset within an attribute (e.g. a column in a pure flat format) or a set of attributes (e.g. a set of columns).
You can shuffle sensitive information to replace it with other values for the same attribute from a different record.
Those techniques are generally a good fit for analytics use cases guaranteeing that any metrics or KPI computed on the whole dataset would still be perfectly valid. Then, it allows production data to be safely used for purposes such as testing and training since all the statistics distribution stays valid.
One typical use case is the generation of “test data” where there is a need to have data looking like real production data as input for a new project (e.g. for a new environment), but guaranteeing anonymity while ensuring data statistics are kept exactly the same.
Let’s take a first simple example where we want to shuffle the following dataset.
By applying a random shuffling algorithm, we can get:
The different fields from phone and country have correctly been mixed up as the original link between all columns has completely been lost. Per column, all statistics distributions remain valid, but this shuffling method caused some inconsistencies. For instance, the telephone prefixes vary within the same country and the countries are not anymore correctly paired with names.
This method is a good fit and might be sufficient if you need statistics distributions to remain valid per column only. If you need to keep consistency between columns, specific group and partition settings can be used.
Groups can help preserve value association between fields in the same row. Columns that belong to the same group are linked and their values are shuffled together. By grouping the phone and the country columns, we can get:
After the shuffling process, values in the phone and country columns are still associated. But the link between names and countries has completely been lost.
This method is a good fit and might be sufficient if you need to keep functional cross-dependencies between some of your attributes. By definition, the main drawback is that grouped columns are not shuffled between themselves, which let you have access to some of the initial relationships.
Partitions can help preserve dependencies between columns. Data is shuffled within partitions, and values from different partitions are never associated.
By creating a partition for a specific country, we can get:
Here, the name column still stays unchanged. The phone column has been shuffled within each partition (country). Names and phones are now consistent with the country.
In other words, this shuffling process only applied to the rows sharing the same values for the country column.
Sensitive personal information in the input data has been shuffled but data still looks real and consistent. The shuffled data is still usable for purposes other than production.
This method is a good fit and might be sufficient if you need to keep functional dependencies within some of your attributes. By definition, the main drawback is that values from different partitions are never associated which let you have access to some of the initial relationships.
As a reminder, shuffling algorithms randomly shuffle data from a dataset within a column or a set of columns. Groups and partitions can be used to keep logical relationships between columns:
- When using groups, columns are shuffled together, and values from the same row are always associated.
- When using partitions, data is shuffled inside partitions; values from different partitions are never associated.
Shuffling techniques might look like a magical solution for analytics use cases. Be careful. Since the data is not altered per se, it might not prevent sometimes to come back to some of the original values using pure statistical inference.
- Take the example where you have a dataset with the population for French cities and you want to perform data shuffling on it. Since you have about 2.5M individuals in Paris, which is by far the biggest city in France, you will have as much occurrences as the original value after shuffling. It would allow to easily “un-shuffle” this specific value and retrieve the original one.
- Even worse – take the example of a single value representing more than 50% of a dataset. After having been shuffled, some of the input values will have exactly the same output values, as if they had not been shuffled at all!
Ultimately, shuffling techniques can certainly be used in addition of other techniques such as masking, especially if the predominance of a dedicated value has to be broken. This concludes the first part of the two-part series on data privacy techniques. Stay tuned for the second part when we will discuss, in detail, the data masking technique as an effective measure for data privacy and a crucial tool available through Talend Data Fabric.