Data privacy hidden gems in Talend Component Palette: Part 2


  • Nikhil Thampi
    Nikhil Thampi is a Customer Success Architect at Talend whose core expertise is in data integration, database, and data warehousing technologies. He has more than 12 years of IT experience, during which he has helped create technical solutions for customers around the globe. His areas of interest also include cloud, containers, big data, data governance, and machine learning technologies. He is passionate about teaching and about raising awareness of Talend among IT developers, and he is one of the top contributors to the Talend Community site.

Data privacy is becoming more of a buzzword in technical circles day by day. Not long ago, we thought that the illegal gathering of personally identifiable information from data servers happened only in James Bond and Mission Impossible movies. But technology is changing rapidly, and in this era of global virtual connectivity, customers' private information is becoming more and more insecure. News of customer data being misused by data analytics companies, data theft from major banks, and so on no longer makes front-page headlines.

 

Privacy protection policies 

The growing outrage over these data thefts has forced lawmakers to think about data privacy laws. The European Union introduced the most famous of them, the General Data Protection Regulation (GDPR), and more and more countries and states are creating similar laws to protect their citizens' data privacy rights. Other well-known data privacy acts include the California Consumer Privacy Act (CCPA), created by the State of California and implemented just a few days ago, and the Personal Information Protection and Electronic Documents Act (PIPEDA), created by Canadian lawmakers. More and more countries are moving in this direction to safeguard the private and confidential data of their citizens.

Meeting the privacy demand

The increasing demand for data privacy has pushed software vendors to add more and more functionality that addresses the concerns of software developers. Today we are going to discuss some of the hidden gems in the Talend Component Palette that address data privacy concerns. This is the second part of the Hidden Gems blog series; if you missed the first part, please refer to the link here.

 

Talend Data Privacy Components

Talend provides an array of components to handle data privacy concerns. Broadly, they fall into the following categories.

  • Data Encryption and Decryption
  • Data Masking and Unmasking
  • Data Shuffling
  • Data Duplicate Row Generation

Talend Component Palette Data Privacy group

 

In the following sections, we will take a quick look at each of these categories and the Talend components available under them. The diagram below shows the full list of components available in the Data Privacy group of the Talend Component Palette.

talend components

 

Data Encryption and Decryption

Encryption comes in handy when you are handling confidential and sensitive information that needs to be stored as securely as possible. In Talend, the tDataEncrypt component converts the original data into unreadable cipher text, and the tDataDecrypt component retrieves the original data from it.

 

Talend allows developers to select one of the following cryptographic methods for encryption.

  • AES-GCM
  • Blowfish

 

A simple example to demonstrate the encryption and decryption capabilities of Talend is shown below.

 

The input data, containing name, postal code and date of birth, is read from the input component.

 

The columns to encrypt are selected in tDataEncrypt as shown below.

The encrypted output is written to a file as shown below.

If the developer would like to convert the encrypted data back to its original form, they can easily do so with the tDataDecrypt component, as shown in the job below.

Talend component tDataDecrypt

Please refer to the links below to understand more about the encryption and decryption components available in Talend.

Data Privacy Requirement    Talend Component
Data Encryption             tDataEncrypt
Data Decryption             tDataDecrypt

 

Data Masking and Unmasking

Data masking is the process of replacing original data with random characters or figures, or with functional substitutes, to protect the actual sensitive data. It is used to conceal confidential data during activities such as data testing and user training. The process is widely used when a Talend developer needs to handle personally identifiable information such as customer name, address, email, phone number, or SSN, or financial information such as credit card number or salary.

 

Talend helps developers perform data masking with two pairs of components. They are:

  • tDataMasking and tDataUnmasking components, which perform masking operations on heterogeneous input data. A simple example of this category of operation is shown below.

Talend Component Palette Data Privacy

 

  • tPatternMasking and tPatternUnmasking components, which replace pattern-specific and generic data with random characters drawn from a specified range of date and numeric values or from a set of named values. A simple example of pattern-based processing for phone numbers is shown below.

Talend Component Palette Data Privacy

 

Now, let us discuss each of these functionalities with an example. The first scenario uses the tDataMasking and tDataUnmasking components to mask customer information such as credit card number, name and email.

Talend Component Palette Data Privacy

 

The data is read from an input file as shown below.

Talend Component Palette Data Privacy

 

Data masking is configured in the tDataMasking component as shown below.

Talend Component Palette Data Privacy

 

The output after data masking is generated as shown below. Each original input record has the ORIGINAL_MARK column set to true, and each modified (masked) record has it set to false.

Talend Component Palette Data Privacy

 

The data can be unmasked in a similar way using the tDataUnmasking component, as shown in the Talend job below.

Talend Component Palette Data Privacy
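
To make the ORIGINAL_MARK behaviour described above more concrete, here is a minimal Python sketch of format-preserving masking. It is only an illustration of the idea, not how tDataMasking is implemented: the function names mask_value and mask_rows and the sample record are hypothetical, and because the replacement is purely random, this sketch cannot be reversed the way the tDataUnmasking flow can.

```python
import random
import string


def mask_value(value, rng):
    """Replace letters with random letters and digits with random digits.

    Punctuation and spaces are kept, so the overall format survives.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


def mask_rows(rows, columns, seed=None):
    """Emit each original row (ORIGINAL_MARK=True) followed by its masked copy (False)."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        result.append({**row, "ORIGINAL_MARK": True})
        masked = {k: (mask_value(v, rng) if k in columns else v) for k, v in row.items()}
        masked["ORIGINAL_MARK"] = False
        result.append(masked)
    return result


# Hypothetical sample record, for illustration only.
customers = [
    {"name": "Jane Doe", "email": "jane@example.com", "credit_card": "4111-1111-1111-1111"},
]
masked_customers = mask_rows(customers, {"name", "email", "credit_card"}, seed=1)
for record in masked_customers:
    print(record)
```

Keeping the original and the masked record side by side, distinguished only by the flag, makes it easy to filter one or the other downstream.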

 

The second scenario uses the tPatternMasking and tPatternUnmasking components to mask and unmask phone numbers that follow a specific pattern.

Talend Component Palette Data Privacy

 

The sample input data contains phone numbers as shown below.

Talend Component Palette Data Privacy

 

The masking pattern for phone numbers is defined in the tPatternMasking component as shown below.

Talend Component Palette Data Privacy

 

The masked output is written to a file as shown below. Each original input record has the ORIGINAL_MARK column set to true, and each modified (masked) record has it set to false.

Talend Component Palette Data Privacy

 

The data can be unmasked in the same way; the only change is to use the tPatternUnmasking component.

Talend Component Palette Data Privacy
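
The idea of pattern-based masking can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the behaviour of tPatternMasking itself: the pattern layout, the field kinds ("keep", "interval", "enum"), the function mask_by_pattern and the sample phone numbers are all hypothetical, and the random replacement shown here is not reversible.

```python
import random

# Hypothetical pattern for values shaped like "415-555-0132".
# Each field is (kind, width, spec) and is applied left to right.
PHONE_PATTERN = [
    ("interval", 3, (200, 999)),   # area code drawn from a numeric range
    ("keep", 1, None),             # separator is copied unchanged
    ("interval", 3, (0, 999)),
    ("keep", 1, None),
    ("interval", 4, (0, 9999)),
]


def mask_by_pattern(value, fields, rng):
    """Mask a fixed-width value field by field according to the pattern."""
    if len(value) != sum(width for _, width, _ in fields):
        raise ValueError("value does not match the pattern length")
    parts, pos = [], 0
    for kind, width, spec in fields:
        if kind == "keep":
            parts.append(value[pos:pos + width])
        elif kind == "interval":
            low, high = spec
            # Zero-pad so the field keeps its width.
            parts.append(str(rng.randint(low, high)).zfill(width))
        elif kind == "enum":
            # Assumes every named value already has the field's width.
            parts.append(rng.choice(spec))
        else:
            raise ValueError(f"unknown field kind: {kind}")
        pos += width
    return "".join(parts)


phone_rng = random.Random(42)
phones = ["415-555-0132", "212-555-0198"]
masked_phones = [mask_by_pattern(p, PHONE_PATTERN, phone_rng) for p in phones]
print(masked_phones)
```

Describing the value as a sequence of fixed-width fields is what keeps the masked output looking like a valid phone number.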

 

To understand more about the masking components available in Talend, please refer to the links below.

Data Privacy Requirement    Talend Component
Data Masking                tDataMasking
Data Unmasking              tDataUnmasking
Pattern Masking             tPatternMasking
Pattern Unmasking           tPatternUnmasking

 

Data Shuffling

Data shuffling is the process of moving sensitive information in a column from one row to another. This method is widely used to quickly create data sets for testing purposes. The tDataShuffling component in the Talend Palette helps Talend developers perform data shuffling.

A quick example of data shuffling using Talend components is shown below.

Talend Component Palette Data Privacy

The input data is loaded from file as shown below.

Talend Component Palette Data Privacy

 

The data in the content column is as shown below.

Talend Component Palette Data Privacy

 

The data shuffling is configured in the tDataShuffling component as shown below.

Talend Component Palette Data Privacy

 

In this example, the credit_card column is treated as the first group to shuffle and is assigned group ID 1. Similarly, the lname, fname and mi columns are grouped together under group ID 2.

In many cases, we would like to shuffle the data based on a specific partition. In this example, the data shuffling is partitioned by the country column.

Talend Component Palette Data Privacy

 

The shuffled data is written to the output file; a quick review of the data is shown below.

Talend Component Palette Data Privacy
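
The grouping and partitioning ideas from this example can be sketched in Python as follows. This is a rough illustration, not the tDataShuffling algorithm: the function shuffle_rows and the sample rows are hypothetical, and the sketch assumes that columns in the same group move together and that values only move between rows sharing the same partition value.

```python
import random
from collections import defaultdict


def shuffle_rows(rows, groups, partition_key=None, seed=None):
    """Shuffle column groups between rows, staying inside each partition."""
    rng = random.Random(seed)

    # Collect the row indices that belong to each partition value.
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        key = row[partition_key] if partition_key else None
        partitions[key].append(index)

    result = [dict(row) for row in rows]
    for indices in partitions.values():
        for group in groups:
            # Each group gets its own permutation, so groups move independently.
            donors = indices[:]
            rng.shuffle(donors)
            for target, donor in zip(indices, donors):
                for column in group:
                    result[target][column] = rows[donor][column]
    return result


# Hypothetical sample rows using the column names from the example above.
people = [
    {"country": "US", "credit_card": "1111", "lname": "Doe", "fname": "Jane", "mi": "A"},
    {"country": "US", "credit_card": "2222", "lname": "Roe", "fname": "Rick", "mi": "B"},
    {"country": "FR", "credit_card": "3333", "lname": "Martin", "fname": "Luc", "mi": "C"},
    {"country": "FR", "credit_card": "4444", "lname": "Petit", "fname": "Zoe", "mi": "D"},
]
shuffled_people = shuffle_rows(
    people,
    groups=[["credit_card"], ["lname", "fname", "mi"]],
    partition_key="country",
    seed=7,
)
for person in shuffled_people:
    print(person)
```

Because each group is permuted separately, a credit card number can end up next to a different name, while first name, last name and middle initial always stay together.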

 

Please refer to the link below to understand more about the shuffling component available in Talend.

Data Privacy Requirement    Talend Component
Data Shuffling              tDataShuffling

 

 

Data Duplicate Row Generation

Duplicate row generation is used to quickly create sample data for data quality checks and functional testing. The tDuplicateRow component in the Talend Palette generates duplicate records based on the criteria specified in the component, and the output can be used for further data processing.

A simple example of tDuplicateRow is shown below.

Talend Component Palette Data Privacy

 

The input records are loaded from a file that contains 6 records.

Talend Component Palette Data Privacy

 

The rules for generating duplicate rows are specified in the tDuplicateRow component as shown below.

Talend Component Palette Data Privacy

 

 

The output is printed to the console using tLogRow, and you can see that the data has been replicated into multiple output records. The original input records can be recognized by the ORIGINAL_MARK column, which is true for them. All records generated by the component have ORIGINAL_MARK set to false.

Talend Component Palette Data Privacy
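
As a rough illustration of the output shape described above, the following Python sketch emits each input row followed by a few slightly altered copies. It is not the tDuplicateRow implementation: the functions swap_adjacent and generate_duplicates, the choice of swapping two adjacent characters as the modification, and the sample names are all assumptions made for the example.

```python
import random


def swap_adjacent(value, rng):
    """Introduce a typo-like change by swapping two neighbouring characters."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def generate_duplicates(rows, column, copies=2, seed=None):
    """Emit each original row (ORIGINAL_MARK=True) plus altered copies (False)."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        out.append({**row, "ORIGINAL_MARK": True})
        for _ in range(copies):
            altered = swap_adjacent(row[column], rng)
            out.append({**row, column: altered, "ORIGINAL_MARK": False})
    return out


# Six hypothetical input records, matching the size of the example input.
names = ["Alice", "Bruno", "Carla", "Dmitri", "Elena", "Farid"]
input_rows = [{"id": i + 1, "name": n} for i, n in enumerate(names)]
with_duplicates = generate_duplicates(input_rows, "name", copies=2, seed=3)
for row in with_duplicates:
    print(row)
```

Filtering on ORIGINAL_MARK then separates the six source records from the generated ones, which is handy when checking how well a matching or deduplication step performs.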

 

Please refer to the link below to understand more about the duplicate row generation component available in Talend.

Data Privacy Requirement    Talend Component
Duplicate Row Generation    tDuplicateRow

 

Conclusion

Talend Data Privacy components enable Talend developers to handle sensitive customer data with more confidence. Gone are the days when you had to write a lot of custom code to handle data privacy tasks. You can easily achieve the same functionality in Talend with its signature graphical user interface. Until we meet again for another blog topic, enjoy your time using Talend!

 

 
