Data Preparation and Wrangling Best Practices – Part 2
This is the second part of my blog series on data preparation and data wrangling best practices. For those of you who are new to this blog, please refer to Part 1 of the same series ‘Talend Data Preparation. In this blog, I want to touch on some advanced best practices. So with that, let’s jump right in!’ and for those who are following my blog series, a big thank you for all the support and feedback! In my last blog, we saw the first ten best practices when working with your data sets in
Best Practice 11 – Identify Outliers
Outliers are problematic because they can seriously compromise the outcome of a data set. Outliers are data points that are distant from the rest of the distribution in your data set. They are either very large or very small values compared with the rest of the dataset. When faced with outliers, the most common strategy is to delete them. However, it depends on the individual project requirements.
The image below shows data preparation identification of outliers on the Talend Data Preparation UI:
A single outlier can have a huge impact on the value of the mean. Because the mean is supposed to represent the center of the data, in a sense, this one outlier renders the mean useless. Keep an eye out for these!
Practice 12 – Deal with Missing values
A missing value in your dataset could cause potential risk to the data being analyzed. As a best practice, I always recommend handling the missing values rather than ignoring them. The best way to resolve missing values in your dataset really depends on the project but you could either:
- Replace the missing values with a proper value.
- Replace it with a flag to indicate a blank.
- Delete the row/record entirely.
Best Practice 13 – Share and Reuse Preparations
Re-usability is music to the ears of anyone that lives in the coding world. It saves a lot of effort, time and eases out the whole Software Development Life Cycle (SDLC). With Talend Data Preparation, users can share the preparations/datasets to either individual users or a group of users. As a best practice, always create a shared folder and place all the shareable data preparations in the folder.
Best Practice 14 – Mask Sensitive Data
When manipulating sensitive data, such as names, addresses, credit card numbers or social security numbers, you will probably want to mask the data. To protect the original data while having a functional substitute, you can use the mask data (obfuscation) function directly in Talend Data Preparation.
Best Practice 15 – Use Versioning
Adding versions to your preparation is a good way to see the differences that have been made to the dataset over time. It also ensures that your data is always the same state of a preparation that is used in Talend Jobs, even if the preparation is still being worked on. Versions can be used in Data Integration as well as Big Data Jobs.
Best Practice 16 – Check the Log File Location
Using logs in Talend Data Preparation allows you to analyze and debug the activity of Talend Data Preparation. By default, Talend Data Preparation will log in two different places, namely the console, and in a log file. The location of this log file depends on the version of Talend Data Preparation that you are using:
- <Data_Preparation_Path>/data/logs/app.log for Talend Data Preparation.
- AppData/Roaming/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on Windows.
- Library/Application Support/Talend/dataprep/logs/app.log for Talend Data Preparation Free Desktop on MacOS.
As a best practice, it is recommended to change the default locations. The location can be configured by editing the logging.file property of the application.properties file.
Best Practice 17 – Know the Data Storage Location
Depending on the version of Talend Data Preparation that you are using, your data is stored in different locations. Here is a quick overview:
- Talend Data Preparation
- If you are a subscription user, nothing is saved directly on your computer. Sample data is cached temporarily on the remote Talend Data Preparation server, in order to improve the product responsiveness. In addition, CSV and Excel datasets are stored permanently on the remote Talend Data Preparation server.
- Talend Data Preparation Free Desktop
- Talend Data Preparation Free Desktop is meant to be able to work locally on your computer, without the need of an internet connection. Therefore, when using a dataset from a local file such as a CSV or Excel file, the data is copied locally, in one of the following folders depending on your operating system:
- Windows: C:\Users\<your_user_name>\AppData\Roaming\Talend\dataprep\store
- OS X: /Users/<your_user_name>/Library/Application Support/Talend/dataprep/store
Best Practice 18 – Always Backup
Backing up Talend Data Preparation and Talend Data Dictionary on regular basis is important to recover from a data loss scenario or any other causes of data corruption or deletion. Here is how to backup both:
- Talend Data Preparation
- To have a copy of Talend Data Preparation instance, backup MongoDB, the folders containing your data, the configuration files and the logs.
- Talend Data Dictionary
- Talend Data Dictionary Service stores all the predefined semantic types used in Talend Data Preparation. It also stores all the custom types created by users and all the modifications done on existing ones. To back up a Talend Dictionary Service instance, you need to back up the MongoDB database and the changes made to the predefined semantic types.
Best Practice 19 – Know Your Dataset
As you would be dealing with the raw data and probably the very data kind of input, it is recommended to build knowledge while you are analyzing the data. Here are some suggestions:
- Discover and learn data relationships within and across sources; find out how the data fits together; use analytics to discover patterns
- Define the data. Collaborate with other business users to define shared rules, business policies, and ownership
- Build knowledge with a catalog, glossary, or metadata repository
- Gain high-level insights. Get the big picture of the data and its context
Best Practice 20 – Document Knowledge
While it is important to build and enhance the knowledge, it is equally important to document the gained knowledge. In particular, every project must maintain a document for
- Business terminology
- Source data lineage
- History of changes applied during cleansing
- Relationships to other data
- Data usage recommendations
- Associated data governance policies
- Identified data stewards
Best Practice 21 – Create a Data Dictionary
A data dictionary is a metadata description of the features included in the dataset. As you analyze and understand the data it is recommended to have it stored in the data dictionary. It will help the other users to identify the data they are working with and to establish the relation between various data.
Best Practice 22 – Set Up a Promotion Pipeline
When using Talend Data Preparation, you should consider is to set up one instance for each environment of your production chain. Talend only supports promoting a preparation between identical product versions. To promote your preparation from one environment to the other, you have to export it from the source environment, and then import it back to your target environment. For the import to work, a dataset with the same name and schema as the one which the export was based on must exist on the target environment.
Best Practice 23 – Hybrid Preparation Environments
Sometimes the transformation are either too complex or bulky to be created in a simple form. To help you in such scenarios, Talend offers you a hybrid preparation environment. You could use either the dedicated Talend Preparation service or the Talend jobs to create data preparations. Leverage tDatasetOutput component as output in Create mode.
Best Practice 24 – Operationalizing a Recipe
The tDataprepRun component allows you to reuse an existing preparation made in Talend Data Preparation, directly in a data integration Job. In other words, you can operationalize the process of applying a preparation to input files that have the same model.
Note: In order to use the tDataprepRun component with Talend Data Preparation Cloud, you must have the 6.4.1 version of Talend Studio.
Best Practice 25 – Live Dataset
What if your business doesn’t need sample data but real live data for analysis? Because the Job is designed in Talend Studio, you can take advantage of the full components palette and their Data Quality or Big Data capabilities. Unlike a local file import, where the data is stored in the Talend Data Preparation server for as long as the file exists, a live dataset only retrieves this sample data temporarily.
It is possible to retrieve the result of Talend Integration Cloud flows that were executed on a Talend Cloud engine, as well as remote engines. Live datasets
- Use a preparation as part of Data integration flow or Talend Spark Batch or Streaming job in Talend Studio.
- The live dataset feature allows you to create a Job in Talend Studio, execute it on demand via Talend Integration Cloud as a Flow, and retrieve a dataset with the sample data directly in Talend Data Preparation Cloud.
Note: In order to create Live datasets, you must have the 6.4.1 version of Talend Studio, patched with at least the 0.19.3 version of the Talend Data Preparation components.
And with this, I come to an end of the two-part series blog. I hope these best practices are helpful and you would embed these while working with data preparation.