DOWNLOAD : The Cloud Data Integration Checklist from TDWI

Data Preparation and Wrangling Best Practices – Part 2

Data Preparation and Wrangling Best Practices – Part 2

  • Rekha Sree
    Rekha Sree is a Customer Success Architect, using her expertise in Data Integration, Data Warehouse and Big Data to help drive customer success at Talend. Prior to joining Talend, Rekha worked at Target Corporation India Pvt Ltd for more than a decade using her vast knowledge in building their enterprise and analytical data warehouse.

This is the second part of my blog series on data preparation and data wrangling best practices. For those of you who are new to this blog, please refer to Part 1 of the same series ‘Data Preparation and Wrangling Best Practices – Part 1’ and for those who are following my blog series, a big thank you for all the support and feedback! In my last blog, we saw the first ten best practices when working with your data sets in Talend Data Preparation. In this blog, I want to touch on some advanced best practices. So with that, let’s jump right in!

Best Practice 11 – Identify Outliers

Outliers are problematic because they can seriously compromise the outcome of a data set. Outliers are data points that are distant from the rest of the distribution in your data set. They are either very large or very small values compared with the rest of the dataset. When faced with outliers, the most common strategy is to delete them. However, it depends on the individual project requirements.

The image below shows data preparation identification of outliers on the Talend Data Preparation UI:

Figure 1: Quick identification of outliers in Talend Data Preparation

A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense, this one outlier renders the mean useless. Keep an eye out for these!

Practice 12 – Deal with Missing values

A missing value in your dataset could cause potential risk to the data being analyzed.  As a best practice, I always recommend handling the missing values rather than ignoring them. The best way to resolve missing values in your dataset  really depends on the project but you could either:

  • Replace the missing values with a proper value.
  • Replace it with a flag to indicate a blank.
  • Delete the row/record entirely.

Figure 2: Dealing with missing values

Best Practice 13 – Share and Reuse Preparations

Re-usability is music to the ears of anyone that lives in the coding world. It saves a lot of effort, time and eases out the whole Software Development Life Cycle (SDLC). With Talend Data Preparation, users can share the preparations/datasets to either individual users or a group of users. As a best practice, always create a shared folder and place all the shareable data preparations in the folder.

Figure 3: folder ‘ASIA_DEFAULTERS’ is being shared with ‘User1’ by owner ‘Rekha Sree’

Best Practice 14 – Mask Sensitive Data

When manipulating sensitive data, such as names, addresses, credit card numbers or social security numbers, you will probably want to mask the data. To protect the original data while having a functional substitute, you can use the mask data (obfuscation) function directly in Talend Data Preparation.

Figure 4: Masking function using Talend Data Preparation

Best Practice 15 –  Use Versioning

Adding versions to your preparation is a good way to see the differences that have been made to the dataset over time. It also ensures that your data is always the same state of a preparation that is used in Talend Jobs, even if the preparation is still being worked on. Versions can be used in Data Integration as well as Big Data Jobs.

Figure 5: Versioning in Data Preparation

Best Practice 16 – Check the Log File Location

Using logs in Talend Data Preparation allows you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation will log in two different places, namely the console, and in a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:

  • <Data_Preparation_Path>/data/logs/app.log for Talend Data Preparation.
  • AppData/Roaming/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on Windows.
  • Library/Application Support/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on MacOS.

As a best practice, it is recommended to change the default locations. The location can be configured by editing the logging.file property of the application.properties file.

Best Practice 17 –  Know the Data Storage Location

Depending on the version of Talend Data Preparation that you are using, your data is stored in different locations. Here is a quick overview:

  • Talend Data Preparation
    • If you are a subscription user, nothing is saved directly on your computer. Sample data is cached temporarily on the remote Talend Data Preparation server, in order to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.
  • Talend Data Preparation Free Desktop
    • Talend Data Preparation Free Desktop is meant to be able to work locally on your computer, without the need of an internet connection. Therefore, when using a dataset from a local file such as a CSV or Excel file, the data is copied locally, in one of the following folders depending on your operating system:
    • Windows: C:\Users\<your_user_name>\AppData\Roaming\Talend\dataprep\store
    • OS X: /Users/<your_user_name>/Library/Application Support/Talend/dataprep/store

Best Practice 18 – Always Backup

Backing up Talend Data Preparation and Talend Data Dictionary on regular basis is important to recover from a data loss scenario or any other causes of data corruption or deletion. Here is how to backup both:

  • Talend Data Preparation
    • To have a copy of Talend Data Preparation instance, backup MongoDB, the folders containing your data, the configuration files and the logs.
  • Talend Data Dictionary
    • Talend Data Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users and all the modifications done on existing ones. To back up a Talend Dictionary Service instance, you need to back up the MongoDB database and the changes made to the predefined semantic types.

Best Practice 19 – Know Your Dataset

As you would be dealing with the raw data and probably the very data kind of input, it is recommended to build knowledge while you are analyzing the data. Here are some suggestions:

  • Discover and learn data relationships within and across sources; find out how the data fits together; use analytics to discover patterns
  • Define the data. Collaborate with other business users to define shared rules, business policies, and ownership
  • Build knowledge with a catalog, glossary, or metadata repository
  • Gain high-level insights. Get the big picture of the data and its context

Best Practice 20 – Document Knowledge

While it is important to build and enhance the knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for

  • Business terminology
  • Source data lineage
  • History of changes applied during cleansing
  • Relationships to other data
  • Data usage recommendations
  • Associated data governance policies
  • Identified data stewards

Best Practice 21 – Create a Data Dictionary

A data dictionary is a metadata description of the features included in the dataset. As you analyze and understand the data it is recommended to have it stored in the data dictionary. It will help the other users to identify the data they are working with and to establish the relation between various data.

Figure 6: Data Dictionary in Data Preparation

Best Practice 22 – Set Up a Promotion Pipeline

When using Talend Data Preparation, you should consider is to set up one instance for each environment of your production chain. Talend only supports promoting a preparation between identical product versions. To promote your preparation from one environment to the other, you have to export it from the source environment, and then import it back to your target environment. For the import to work, a dataset with the same name and schema as the one which the export was based on must exist on the target environment.

Best Practice 23 – Hybrid Preparation Environments

Sometimes the transformation are either too complex or bulky to be created in a simple form. To help you in such scenarios, Talend offers you a hybrid preparation environment. You could use either the dedicated Talend Preparation service or the Talend jobs to create data preparations. Leverage tDatasetOutput component as output in Create mode.

Figure 7: Creating a dataset from Talend Studio

Best Practice 24 – Operationalizing a Recipe

The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.

Figure 8: Executing Talend data preparation in a Talend Studio job

Figure 9: Executing Talend data preparation in a Talend Studio job using dynamic selection

Note: In order to use the tDataprepRun component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio.

Best Practice 25 – Live Dataset

What if your business doesn’t need sample data but real live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full components palette and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored in the Talend Data Preparation server for as long as the file exists, a live dataset only retrieves this sample data temporarily.

It is possible to retrieve the result of Talend Integration Cloud flows that were executed on a Talend Cloud engine, as well as remote engines. Live datasets

  • Use a preparation as part of Data integration flow or Talend Spark Batch or Streaming job in Talend Studio.
  • The live dataset feature allows you to create a Job in Talend Studio, execute it on demand via Talend Integration Cloud as a Flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.

Note: In order to create Live datasets, you must have the 6.4.1 version of Talend Studio, patched with at least the 0.19.3 version of the Talend Data Preparation components.

Conclusion

And with this, I come to an end of the two-part series blog. I hope these best practices are helpful and you would embed these while working with data preparation.    

Join The Conversation

1 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

  1. Giribabu Madathala says:

    Very well documented!