Talend Cloud How-to #3: Minimize Turbulence with Data Quality

There is a yin and yang between business and IT when it comes to enterprise data. Business users know their data, and they know what needs to be cleaned or changed to ensure data quality. IT wants to take the rules that business users know about the data and apply them to data jobs. Find out how Talend Data Preparation enables the sharing and collaboration that IT and business have been so urgently seeking. This article shows how easy it is to prepare Salesforce data that has been imported into a Snowflake cloud data warehouse using Talend data quality tools.

In this installment of the “Talend Cloud How-to” series, we are going to improve the Salesforce data we moved to Snowflake in Part 2, using the data preparation capability built into Talend Cloud. We will do some data quality work and clean the data using a simple, spreadsheet-like user interface.

Publishing Data to the Data Preparation Tool

We begin by opening Talend Open Studio, but in the “Output” field, rather than loading the data into Snowflake, we load it into Talend Data Preparation using the tDatasetOutput component in “LiveDataset” mode. This component is designed specifically to load data into Talend’s Data Preparation tool.

[Screenshot: the tDatasetOutput component configured in the job]

There are different ways to load data into Talend Data Preparation: you can manually load a local file, load directly from databases, load from an S3 bucket on AWS, or pull straight from Salesforce. For this demo, we are going to use the “LiveDataset” mode. Selecting “LiveDataset” means that any time we open the data in the data preparation tool, it fetches the most recent data from Salesforce, capturing the latest orders, customers, and contact information, or whatever else your Talend Job retrieves. To push this out to the cloud, we simply right-click and hit “Publish”.

Next, we go to the Talend Cloud home screen and choose Talend Data Preparation from the Launch button (below) to open the data preparation tool.

[Screenshot: the Talend Cloud home screen with the Launch button]

Using the Preparation Tool to Ensure Data Quality

On the left, you see “PREPARATIONS” and “DATASETS”. We are going to start in “DATASETS” and build a preparation from a Talend Job. There are multiple ways to add data to the data sets, which you can see in the drop-down menu under “Add Dataset”.

[Screenshot: the Add Dataset drop-down menu]

Once the data sets are imported and stored, a user can mark them as certified. For this demo, we are using the World Beauty (a fictitious company) data set, which contains our Salesforce customer orders.

Checking the Semantic Types of the Data

When we click on the World Beauty data set, Talend Cloud runs the process in the background, fetches the Salesforce data, and displays it in a spreadsheet-like format inside the data preparation tool. Across the top of each column, it also shows the results of a feature that deduces the column’s semantic type: for example, quantity and list price are detected as decimals, another column as a date, and others as first name, last name, and email. Underneath each semantic type is a green quality bar that indicates how much of the data in that column matches the suggested type; if data quality issues are detected, the bar turns orange.

[Screenshot: semantic types and quality bars above each column]
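
To make the idea concrete, here is a minimal sketch of how semantic-type detection and a quality bar could work. The regex rules and the majority-match logic below are illustrative assumptions, not Talend’s actual implementation.

```python
import re

# Hypothetical detection rules; Talend's built-in semantic types are far richer.
SEMANTIC_TYPES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "decimal": re.compile(r"^-?\d+(\.\d+)?$"),
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}

def infer_type(values):
    """Pick the semantic type matching the most values in a column."""
    best_type, best_ratio = "text", 0.0
    for name, pattern in SEMANTIC_TYPES.items():
        ratio = sum(bool(pattern.match(v)) for v in values) / len(values)
        if ratio > best_ratio:
            best_type, best_ratio = name, ratio
    # best_ratio is the fraction the green/orange quality bar visualizes
    return best_type, best_ratio

emails = ["ann@example.com", "bob@example.com", "not-an-email"]
print(infer_type(emails))  # ('email', 0.666...): the bar would show ~67% valid
```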

On the right side are the actions we can take to improve data quality; this is where we choose what we want to fix. The first thing we want to fix is the data in the “Quantity” column, because a quantity should always be an integer, never a fraction, and contain no decimals. Under Actions, we select “Remove” and a suggestion comes up to remove the fraction part, which is exactly what we want to do. Mousing over that suggestion previews what the numbers in the Quantity column will look like, and applying it updates the Quantity view to integers. Note that we are not actually changing the Salesforce data at this point; the fix applies only to the real-time view.
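
Outside the tool, this cleanup is a one-liner. Here is a rough pandas equivalent; the column name and sample values are assumed for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [1.0, 2.5, 3.75]})  # sample data

# Drop the fractional part so every quantity is a whole number,
# mirroring the "remove the fraction part" suggestion in the tool.
df["Quantity"] = df["Quantity"].astype(int)
print(df["Quantity"].tolist())  # [1, 2, 3]
```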

Data Masking to Ensure Privacy

There is one more data quality issue we want to address. We don’t want any specific, unique identifiers in the system, so we need to mask the emails, as those are a unique identifier. On the right side, there is a “Mask” option. We have the option to mask the first part, which puts X’s over the first half of the email, leaving the domain, and we apply that data mask. You will also notice the dates are in two different formats, so with a couple of clicks we standardize them; in this case, we format them all to the ISO 8601 date standard.
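
Conceptually, the two fixes amount to masking the local part of each address and normalizing mixed date formats. The pandas sketch below (column names and the exact masking rule are assumptions) shows the idea, not Talend’s precise logic.

```python
import pandas as pd
from dateutil import parser  # ships as a pandas dependency

df = pd.DataFrame({
    "Email": ["jane.doe@example.com", "bob@example.com"],
    "OrderDate": ["03/15/2019", "2019-04-02"],  # two different formats
})

def mask_email(email: str) -> str:
    """X out the part before the @, keeping the domain intact."""
    local, domain = email.split("@", 1)
    return "X" * len(local) + "@" + domain

df["Email"] = df["Email"].map(mask_email)

# Normalize mixed date formats to the ISO 8601 standard (YYYY-MM-DD).
df["OrderDate"] = df["OrderDate"].map(lambda d: parser.parse(d).strftime("%Y-%m-%d"))
print(df)
```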

Joining a Data Set

The next step, a join, is a little trickier. This data set has risk categories for different products, and chemical risk is one example. We want to join to another file, provided by the FDA, to include the chemical risks in this dataset. We have already saved the FDA Chemical Risk file to the corporate data lake on S3.

[Screenshot: the FDA Chemical Risk file in the S3 data lake]

To do this, we click the icon with the overlapping double rings, and a new window appears at the bottom of the page. A file may already be selected; if it’s the wrong one, we just click the “+” to pull up our active datasets. We see all the datasets below, and we can also see which ones are certified. We find the CASN dataset, which is the one we need: it lives on S3 and is certified.

[Screenshot: selecting the certified CASN dataset from S3]

Assigning Risk Levels to the Dataset

In this view, we select the column to join on and the columns to be added to the dataset view. Here we add the activity level, which is the risk-level field from the dataset we are pulling in. We click “Continue” to join the two data sets on the highlighted field.

[Screenshot: choosing the join column and the columns to add]
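
Under the hood, this is a standard lookup join. A minimal pandas analogue might look like the following, with illustrative column names standing in for the real Salesforce and FDA fields.

```python
import pandas as pd

orders = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Chemical": ["C100", "C200"],
})
# Stand-in for the FDA chemical risk lookup file from S3
fda_risk = pd.DataFrame({
    "CASN": ["C100", "C200"],
    "ActivityLevel": ["I", "III"],
})

# A left join keeps every order row and pulls in the risk field,
# matching on the highlighted key column.
joined = orders.merge(fda_risk, left_on="Chemical", right_on="CASN", how="left")
print(joined)
```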

Once the data has been joined, we see the new field, which we rename to “RiskLevel” to be more descriptive. There are letter codes under the new “RiskLevel” column, which we want to translate into more obvious risk levels: high, medium, or low risk.

[Screenshot: the joined data with the renamed RiskLevel column]

We then double-click a cell, type what we want to change it to, and check “Apply to all cells”. We translate all the codes to LOW, MEDIUM, and HIGH. In our case, “I” indicates high risk, so we change all “I” values to HIGH.
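
The recoding itself is just a lookup table applied to every cell. In the sketch below, only the “I” → HIGH mapping comes from this walkthrough; the other code assignments are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"RiskLevel": ["I", "III", "II"]})

# Translate letter codes into readable labels, applied to all cells.
# Only "I" -> "HIGH" is given above; "II" and "III" are assumed here.
RISK_LABELS = {"I": "HIGH", "II": "MEDIUM", "III": "LOW"}
df["RiskLevel"] = df["RiskLevel"].map(RISK_LABELS)
print(df["RiskLevel"].tolist())  # ['HIGH', 'LOW', 'MEDIUM']
```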

The preparation combines the data with a recipe, applying these actions in real time on top of the view for you. Each action contributes to data quality, ensuring the data is standardized and usable.

Exporting the Data Preparation Job

There is an option to export the data from the preparation to various places, but for this demo, we save the preparation so we can ultimately load the cleansed data into Snowflake. We save it as “v1” and put it into the “Compliance Certified” folder, which is shared among different users.

[Screenshot: saving the preparation to the Compliance Certified folder]

Now we go back to the job we built at the very beginning and add in the data preparation steps. In the job, we replace the tDatasetOutput component with a tDataprepRun component, which runs the preparation we just created. We give it some context variables that relate back to the data preparation job.

[Screenshot: the tDataprepRun component with its context variables]

Using tMap for Schema Mapping

We keep the version as “Current state”. Clicking “Fetch schema” pulls in all the fields from the data preparation. Certain fields come from Salesforce into that data preparation, but we also split some columns out in the first Salesforce-to-Snowflake tutorial; when we pull a preparation from the cloud, it automatically pulls that schema in, so we need to map the fields to the output using a tMap. If we click on “Map to Snowflake”, we can use our tMap to map all the fields into the Snowflake database, as you see below.

[Screenshot: mapping fields to Snowflake in the tMap editor]
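
In spirit, the tMap step is a declarative field mapping from the preparation’s schema onto the Snowflake table’s columns. A rough stand-alone analogue (all field names here are hypothetical) could look like this:

```python
import pandas as pd

prepared = pd.DataFrame({
    "FirstName": ["Jane"], "LastName": ["Doe"], "RiskLevel": ["HIGH"],
})

# One-to-one field wiring, the same mapping the tMap editor shows visually.
FIELD_MAP = {"FirstName": "FIRST_NAME", "LastName": "LAST_NAME", "RiskLevel": "RISK_LEVEL"}
to_snowflake = prepared.rename(columns=FIELD_MAP)[list(FIELD_MAP.values())]
print(to_snowflake.columns.tolist())  # ['FIRST_NAME', 'LAST_NAME', 'RISK_LEVEL']
```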

We are ready to publish this to the cloud, so we right-click and follow the prompts. Once the information loads, we click “Finish” to complete the publishing.

[Screenshot: publishing the job to Talend Cloud]

Checking Our Data Preparation Work

Next, we go to our Talend Integration Cloud instance to check the data. It brings us into the most recent environment we have been using (the PROD environment), where we find the process that has been published. In Part 2 of the Cloud Flight Plan series, we created a flow, and we will now go back into that flow.

[Screenshot: the published flow in Talend Integration Cloud]

We then need to assign some context variables, so we go into “GO LIVE” and select which environment, which Salesforce account, and which Snowflake account we want to use. Then we can either schedule the job or run it now; we choose to run it now to make sure everything is working.

[Screenshot: the GO LIVE settings for environment, Salesforce, and Snowflake accounts]

Loading the Prepared Data into Snowflake

Choosing to run it now brings us back to an overview of the flow, showing, for example, the history of runs and the logs. This flow runs the job we built in Talend Studio: it takes the data from Salesforce, runs it through the steps of the preparation, and loads it into Snowflake.

Next, we navigate to the user dashboard in our Snowflake UI. We re-run the query and can now see that all the data has been loaded: the chemical risk levels (HIGH, MEDIUM, and LOW) have been set properly, and our data looks great and is ready for the final reports in Tableau, which brings us to the next installment in the Cloud Flight Plan series: Operationalize Your Data Flight Plan.

[Screenshot: the query results in the Snowflake UI]
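
If you prefer to script the same spot-check, a sketch using the Snowflake Python connector might look like this; the connection details, table, and column names below are placeholders rather than the ones from this demo.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials and object names; substitute your own.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()
# Verify the prepared risk levels landed in Snowflake.
cur.execute("SELECT RISK_LEVEL, COUNT(*) FROM ORDERS GROUP BY RISK_LEVEL")
for risk_level, count in cur.fetchall():
    print(risk_level, count)
conn.close()
```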

We have now built a process that bakes data quality in and runs entirely on our remote engine: it pulls the data from Salesforce, cleans it with the data preparation tool in Talend Cloud, and pushes it out to Snowflake, where we can hook it up to our reports.

Last Updated: August 29th, 2019