Data privacy hidden gems in Talend Component Palette: Part 2
Data privacy is fast becoming the main buzzword in technical circles. Not long ago, we thought that the illegal gathering of personally identifiable information from data servers could happen only in James Bond and Mission Impossible movies. But technology is changing rapidly, and in this era of global virtual connectivity, customers' private information is becoming more and more insecure. News of customer data being misused by data analytics companies, data theft from major banks, and so on no longer makes front-page headlines.
Privacy protection policies
The growing outrage against these data thefts has forced lawmakers to think about data privacy laws. The European Union introduced the best-known data privacy law, the General Data Protection Regulation (GDPR), and more and more countries and states are creating similar laws to protect their citizens' data privacy rights. Other prominent data privacy acts include the California Consumer Privacy Act (CCPA), created by the State of California and implemented just a few days ago, and the Personal Information Protection and Electronic Documents Act (PIPEDA), created by Canadian lawmakers. More and more countries are moving in this direction to safeguard the private and confidential data of their citizens.
Meeting the privacy demand
The increasing demand for data privacy has pushed software vendors to add more and more functionality that addresses the concerns of software developers. Today we are going to discuss some of the hidden gems available in the Talend component palette for handling data privacy. This is the second part of the Hidden Gems blog series; if you have missed the first part, please refer to Part 1 of the series.
Talend Data Privacy Components
Talend has created an array of components to handle concerns related to data privacy. Broadly, we can classify them into the following categories.
- Data Encryption and Decryption
- Data Masking and Unmasking
- Data Shuffling
- Data Duplicate Row Generation
In the subsequent sections, we will take a quick look at each of these categories and the Talend components available under each of them. The diagram below shows the full list of components available in the Talend component palette under the Data Privacy group.
Data Encryption and Decryption
Encryption comes in handy when you are handling confidential and sensitive information that needs to be stored in the most secure manner. In Talend, the original data can be converted into unreadable cipher text by the tDataEncrypt component, and the original data can be retrieved using the tDataDecrypt component.
Talend allows developers to select one of the below cryptographic methods for encryption.
A simple example to demonstrate the encryption and decryption capabilities of Talend is shown below.
The input data, containing name, postal code and date of birth, is passed on from the input component.
The required columns are selected in tDataEncrypt as shown below.
The encrypted output is written to a file as shown below.
If developers would like to convert the encrypted data back to its original format, they can easily do so using the tDataDecrypt component, as shown in the job below.
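For readers who want to see what an encrypt-then-decrypt round trip looks like outside the Talend Studio, here is a minimal Java sketch. It is an assumption-laden stand-in, not what tDataEncrypt and tDataDecrypt do internally: it uses AES in GCM mode from the JDK's standard javax.crypto package, and the class name ColumnCipherSketch and its methods are hypothetical names chosen for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper (not a Talend API): encrypts and decrypts a single column value.
final class ColumnCipherSketch {
    private static final int IV_BYTES = 12;   // nonce length commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Creates a random 128-bit AES key; real key management is out of scope here.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns Base64(nonce || ciphertext and tag) so the result fits in a text column.
    static String encrypt(String plainText, SecretKey key) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);  // a fresh nonce for every value
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] cipherText = cipher.doFinal(plainText.getBytes(StandardCharsets.UTF_8));
            byte[] packed = ByteBuffer.allocate(iv.length + cipherText.length)
                    .put(iv).put(cipherText).array();
            return Base64.getEncoder().encodeToString(packed);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(): reads the nonce from the front, then decrypts and verifies the tag.
    static String decrypt(String encoded, SecretKey key) {
        try {
            byte[] packed = Base64.getDecoder().decode(encoded);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, packed, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(packed, IV_BYTES, packed.length - IV_BYTES);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        String encrypted = encrypt("1985-04-12", key);  // e.g. a date of birth column
        System.out.println(encrypted);
        System.out.println(decrypt(encrypted, key));
    }
}
```

The point to take away is that the same key must be available on both sides: whoever runs the decryption step needs the key that was used for encryption, which is why key storage deserves as much care as the data itself.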
Please refer to the links below to understand more about the encryption and decryption components available in Talend.
Data Privacy Requirement
Data Masking and Unmasking
Data masking is the process of hiding the original data by replacing it with random characters, figures or functional substitutes in order to protect the actual sensitive data. It is used to conceal the original confidential data during activities like data testing, user training, etc. The process is widely used when a Talend developer needs to handle personally identifiable information like customer name, address, email, phone number or SSN, or financial information like credit card number, salary, etc.
Talend helps developers perform data masking with two pairs of components:
- tDataMasking and tDataUnMasking components, which perform masking operations on heterogeneous input data. A simple example of this category of operation is shown below.
- tPatternMasking and tPatternUnmasking components, which replace pattern-specific and generic data with random characters from a specified range of date and numeric values or a set of named values. A simple example of pattern-based processing for phone numbers is shown below.
Now, let us discuss each of these functionalities with an example. The first scenario uses the tDataMasking and tDataUnMasking components, where customer information like credit card, name and email is masked.
The data is transmitted from an input file as shown below.
Data masking is done using the tDataMasking component as shown below.
The output after data masking is generated as shown below. All the input records have the ORIGINAL_MARK column set to true, and all the modified records have it set to false.
The data can be unmasked in a similar way using the tDataUnMasking component, as shown in the Talend job below.
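To make the idea of column-level masking concrete, here is a small Java sketch of two typical masking functions: one hides all but the last digits of a credit card number, the other hides the local part of an email address. The class MaskingSketch and its method names are hypothetical and are not the functions offered by tDataMasking; they only illustrate the kind of substitution involved.

```java
import java.util.Arrays;

// Hypothetical masking helpers, written for illustration only.
final class MaskingSketch {
    // Builds a string of n copies of the given character.
    private static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Replaces every digit except the last `keep` digits with 'X'; separators stay in place.
    static String maskCard(String card, int keep) {
        int digits = 0;
        for (char c : card.toCharArray()) {
            if (Character.isDigit(c)) digits++;
        }
        StringBuilder out = new StringBuilder(card.length());
        int seen = 0;
        for (char c : card.toCharArray()) {
            if (Character.isDigit(c)) {
                seen++;
                out.append(seen <= digits - keep ? 'X' : c);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Replaces everything before '@' with 'X'; masks the whole value if there is no '@'.
    static String maskEmailLocalPart(String email) {
        int at = email.indexOf('@');
        if (at < 0) return repeat('X', email.length());
        return repeat('X', at) + email.substring(at);
    }

    public static void main(String[] args) {
        System.out.println(maskCard("4111-1111-1111-1234", 4));
        System.out.println(maskEmailLocalPart("jane.doe@example.com"));
    }
}
```

Note that a substitution like this is one-way: once the characters are replaced with 'X', the original value cannot be recovered from the masked one, so this sketch does not attempt an unmasking counterpart.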
The second scenario uses the tPatternMasking and tPatternUnmasking components, where we will be masking and unmasking phone numbers that follow a specific pattern.
The sample input data component contains phone numbers as shown below.
The masking pattern for phone numbers can be created using the tPatternMasking component as shown below.
The masked output is stored in a file as shown below. All the input records have the ORIGINAL_MARK column set to true, and all the modified records have it set to false.
The data can be unmasked in the same way; the only change is to use the tPatternUnmasking component.
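The idea behind pattern-based masking is that the fixed parts of a format survive while the variable fields are replaced. The Java sketch below shows this for phone numbers: it keeps the punctuation and the first few digits and swaps the remaining digits for random ones. PhonePatternSketch and its mask method are hypothetical names, and the rule used here is an assumption made for illustration, not the pattern configuration of tPatternMasking.

```java
import java.util.Random;

// Hypothetical pattern-preserving masker for phone numbers.
final class PhonePatternSketch {
    // Keeps the first `keepDigits` digits and all non-digit characters;
    // every later digit is replaced by a random digit between 0 and 9.
    static String mask(String phone, int keepDigits, Random random) {
        StringBuilder out = new StringBuilder(phone.length());
        int seen = 0;
        for (int i = 0; i < phone.length(); i++) {
            char c = phone.charAt(i);
            if (Character.isDigit(c)) {
                seen++;
                out.append(seen <= keepDigits ? c : (char) ('0' + random.nextInt(10)));
            } else {
                out.append(c);  // brackets, spaces and dashes define the pattern
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A fixed seed makes the demo repeatable; use an unseeded Random otherwise.
        System.out.println(mask("(415) 555-0132", 3, new Random(42)));
    }
}
```

The masked value still looks like a phone number with the same area code, which is usually what test and training data needs.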
To understand more about the masking components available in Talend, please refer to the links below.
Data Privacy Requirement
Data Shuffling
Data shuffling is the process of moving the sensitive information in a column from one row to another. This method is widely used to quickly create data sets for testing purposes. The tDataShuffling component in the Talend palette helps Talend developers shuffle data.
A quick example of data shuffling using Talend components is shown below.
The input data is loaded from file as shown below.
The data in the content column is as shown below.
The data shuffling is done using the tDataShuffling component as shown below.
In this example, the credit_card column is treated as the first group to shuffle and is assigned group ID 1. Similarly, the lname, fname and mi columns are grouped together under group ID 2.
In many cases, we would like to shuffle the data based on a specific partition. In this example, the data shuffling happens based on the country column.
The shuffled data is loaded into an output file, and a quick review of the data is shown below.
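The grouping and partitioning described above can be sketched in plain Java as follows. Columns in the same group move together from one row to another, and values never cross a partition boundary. ShuffleSketch, its shuffle method and the row helper are hypothetical names, and the algorithm is only one simple way to get this behaviour; it is not a description of how tDataShuffling is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical group-wise shuffler that works within partitions.
final class ShuffleSketch {
    // Small helper to build a row from alternating column names and values.
    static Map<String, String> row(String... keyValues) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) row.put(keyValues[i], keyValues[i + 1]);
        return row;
    }

    static List<Map<String, String>> shuffle(List<Map<String, String>> rows,
                                             List<List<String>> groups,
                                             String partitionColumn,
                                             Random random) {
        // Work on copies so the input rows are left untouched.
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> r : rows) out.add(new LinkedHashMap<>(r));

        // Collect the row indices that belong to each partition value.
        Map<String, List<Integer>> partitions = new LinkedHashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            partitions.computeIfAbsent(rows.get(i).get(partitionColumn),
                    k -> new ArrayList<>()).add(i);
        }

        for (List<Integer> targets : partitions.values()) {
            for (List<String> group : groups) {
                // Each group gets its own permutation of the partition's rows.
                List<Integer> sources = new ArrayList<>(targets);
                Collections.shuffle(sources, random);
                for (int k = 0; k < targets.size(); k++) {
                    for (String column : group) {
                        out.get(targets.get(k)).put(column, rows.get(sources.get(k)).get(column));
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
                row("country", "US", "credit_card", "1111", "lname", "Doe", "fname", "Jane"),
                row("country", "US", "credit_card", "2222", "lname", "Roe", "fname", "Rick"),
                row("country", "FR", "credit_card", "3333", "lname", "Martin", "fname", "Lea"));
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("credit_card"), Arrays.asList("lname", "fname"));
        shuffle(rows, groups, "country", new Random()).forEach(System.out::println);
    }
}
```

Because a random permutation can leave a value in its original row, especially in small partitions, a shuffled data set should still be reviewed before it is treated as anonymous.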
Please refer to the link below to understand more about the shuffling component available in Talend.
Data Privacy Requirement
Data Duplicate Row Generation
Duplicate row generation is performed to quickly create sample data for data quality checks and for functional testing. The tDuplicateRow component in the Talend palette generates duplicate records based on the criteria specified in the component, and these records can then be used for further data processing.
A simple example of tDuplicateRow is shown below.
The input records are loaded from a file that contains 6 records.
The configuration rules to generate duplicate rows are specified in the tDuplicateRow component as shown below.
The output is printed to the console using tLogRow, and you can see that the data has been replicated into multiple output records. The original input records can be recognized by the ORIGINAL_MARK column, whose value is true for these records. All the records generated by the component have ORIGINAL_MARK set to false.
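The ORIGINAL_MARK convention is easy to picture with a short Java sketch. For each input row it emits the original marked true, followed by a number of slightly altered copies marked false. DuplicateRowSketch, its parameters and the adjacent-character swap used to alter the copies are all assumptions made for this illustration; the modification functions and distribution settings of tDuplicateRow are not reproduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical duplicate generator that tags rows with ORIGINAL_MARK.
final class DuplicateRowSketch {
    static List<Map<String, String>> generate(List<Map<String, String>> rows,
                                              int copiesPerRow,
                                              String columnToModify,
                                              Random random) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : rows) {
            // The untouched input row comes first and is marked as original.
            Map<String, String> original = new LinkedHashMap<>(row);
            original.put("ORIGINAL_MARK", "true");
            out.add(original);
            // Each generated copy gets a small typo in one column.
            for (int i = 0; i < copiesPerRow; i++) {
                Map<String, String> copy = new LinkedHashMap<>(row);
                copy.put(columnToModify, swapAdjacent(row.get(columnToModify), random));
                copy.put("ORIGINAL_MARK", "false");
                out.add(copy);
            }
        }
        return out;
    }

    // Swaps two neighbouring characters at a random position to imitate a typing error.
    static String swapAdjacent(String value, Random random) {
        if (value == null || value.length() < 2) return value;
        int i = random.nextInt(value.length() - 1);
        char[] chars = value.toCharArray();
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
        return new String(chars);
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("name", "Jane");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row);
        generate(rows, 2, "name", new Random()).forEach(System.out::println);
    }
}
```

Downstream steps, such as a matching or deduplication test, can then filter on ORIGINAL_MARK to check how many of the generated variants were linked back to their source record.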
Please refer to the link below to understand more about the duplicate row generation component available in Talend.
Data Privacy Requirement
Duplicate Row generation
Talend Data Privacy components enable Talend developers to handle sensitive customer data with more confidence. Gone are the days when you had to write a lot of custom code to complete activities related to data privacy. You can easily achieve the same functionality in Talend with its signature graphical user interface. Till we meet again for another blog topic, enjoy your time using Talend 😊