Data Preparation and Wrangling Best Practices – Part 1
Talend Data Preparation Cloud is a self-service application that enables information workers to cut hours out of their workday by simplifying and expediting the time-consuming process of preparing data for analysis or other data-driven tasks. If you are brand new to data preparation, take some time to go through my earlier blog An Introduction to Data Preparation or What Is Data Preparation? to get the basics and learn a little bit about how it can come in handy as a self-service data preparation tool. In this blog, I want to highlight some best practices that I’ve come across as I've worked with Talend Data Preparation. So, without further delay let's jump into the topic.
Data Preparation Best Practice 1 – Folder Naming Convention
A “best practice” for naming conventions really depends on the person or organization. However, following some sort of naming convention structure each and every time makes it significantly easier for subsequent users of the data to understand what the system is doing and how to fix or extend the source code for new business needs. In my experience, the best practice is primarily to follow the naming standards agreed upon the folders. Here are a few suggestions to consider when coming up with naming conventions:
- Use CamelCase
- Use underscores to separate information
- Avoid whitespaces
- Use only alphanumeric characters
- Avoid general folder names
- Avoid short forms
Data Preparation Best Practice 2 – Datasets, Preparation & Context Naming Convention
Typically, preparations and datasets are tied to a specific project. Hence the naming conventions for preparations and datasets could be set either globally at the organization level or at the project level. You should do your best to ensure that the naming conventions are strictly followed. Here are a few tips from my own experience:
- Use the "Extracted Source" name
- Use a Prefix or Suffix dataset extracted date
- Ensure business usage rules are applied
Now, let's talk about context variables. Context variables are user-defined variables provided by Talend whose value can be changed at runtime. Providing the values of the context variables at runtime allows jobs to be executed in different ways with different parameters. Context variables should also follow standard naming conventions. Here are a couple more suggestions around context variables:
- Create additional contexts for project specific requirements
- Limit the number of additional contexts you create. Limit use to less than three new additional contexts per project
- Ensure context variables are descriptive
- Avoid 1-character context variable, for example, a, b, c.
- Avoid generic names like var1, var2, etc
Data Preparation Best Practice 3 – Folder Structure
Folder structures are used to group items of similar categories or behavior. As this is completely related to individual needs, I recommend having folder structures defined in the project’s initial phases. The screenshot below shows an example of a folder structure that might be used in a bank. Here the folders are divided by the unit of the module. Some recommendations for folder structures are things like business modules, data sources, rules applied or intake areas.
Data Preparation Best Practice 4 – Pick the Right Data
There’s a saying that I quite like that goes, “It’s not about having a lot of data, it’s about having the right data”. Data selection is about finding the data that’s needed right now, but it should also make it easier to find data later when similar needs arise. Listed below are some of the best practices for picking up the right data:
- Explore and find the best suited data for a specific purpose
- Avoid data with multiple null or same or repeated values
- Select values close to the source and avoid calculated or derived values
- Avoid intermediate values
- Extract data across multiple platforms
- Determine data suitability (eg: discovery, reporting and monitoring)
- Filter data to select a subject that meets rules and conditions
- Know the source of data so that you can source it repeatedly
Data Preparation Best Practice 5 – Understand the Data
Understanding data is very important to assessing data quality and accuracy. It is also important to check how the data fits within your organization’s governance rules and policies. Once you understand the data, you can then determine the right level of quality for the purpose of the data. A couple of key points to remember here are:
- Learn the data, file and database formats
- Use visualization capabilities to examine the current state of the data
- Use profiling to generate data quality metrics and statistical analysis of the data
- Be aware of the limitations of the data
Data Preparation Best Practice 6 – Verify Datatypes and Formats
Data preparation always starts with a raw data file. Raw data files come in many different shapes and sizes. Mainframe data is different than PC data, spreadsheet data is formatted differently than web data, and so forth. We live in the age of big data, so naturally, there is going to be a lot more variance in source files. Hence it becomes important to make sure that you can read the files in correct format. You also want to ensure that the data types used are accurate. Here, you need to actually look at what each field contains. For example, it would good to check that if a field is listed as a number, it actually contains a number, not the phone number or postal codes or a character filed doesn't contain all number data. This may seem like a no-brainer, but you’d be surprised by how often it is overlooked.
Note: with the usage of data dictionary you can also set the type needed for every column. We will take a look at this in detail little later.
Data Preparation Best Practice 7 – Improve the Data
Once you have assessed data quality and accuracy and have also determined the right level of quality for the purpose of the data, it's now time to improve the data. There are numerous ways to go about this, but here are a few key checkpoints I make sure to take:
- Note missing data
- Perform identity resolution
- Refine and merge-purge the data
Data Preparation Best Practice 8 – Integrate Data
A very powerful feature of Talend’s Data Preparation tool is the ability to integrate data sets. This takes the data preparation processes to the next level by enabling you to perform simple join and lookup functions while preparing rules. As a best practice integrate data to suit the following needs:
- Validate new sources
- Blend data with data from other sources
- Restructure data according to needed format for BI, integration and blending and analysis.
- Transpose the data
Data Preparation Best Practice 9 – Transform Data
Talend Data Preparation is a very powerful tool that gives business users the power to transform the data. Most of the simple data transformation tasks can now be applied with a few simple clicks. As a best practice, I recommend transforming data to:
- Create generalized rules to transform data.
- Apply transformation functions to structured and unstructured data
- Enrich the data
- Determine levels of aggregation needed to answer business questions
- Filter the data to tailor it for reports or analysis
- Incorporate formulas for manipulation requirements
The picture below shows data transformations applied to a dataset.
Data Preparation Best Practice 10 – Verify Data Accuracy
While making preparations, ensure that the data being used is accurate and that it makes sense. This is quite an important step and requires some knowledge of the subject area that the dataset is related to. There isn't really a cut-and-dried approach to verifying data accuracy.
The basic idea is to formulate some properties that you think the data should exhibit and test the data to see if those properties hold. Essentially, you're trying to figure out whether the data really is what you've been told it is. In the example: the "id" field always has to be in an 18 digit number and hence there is a preparation to validate the "id" length.
In this blog, we’ve started with Data preparation best practices. Hopefully, it has been useful. My next blog in this series will focus on some of the advanced topics in data preparation so keep watching and happy reading. Until next time!