What’s new in Talend Data Preparation 1.3?
In early October 2016, we introduced the 1.3 release of Talend Data Preparation for its Free Desktop edition as well as for the Commercial edition.
If you haven’t heard of Data Preparation before, I would suggest you watch this video before continuing this article and downloading the tool.
If you are a new user and you wish to discover the software, you can download and begin your data preparation journey with the Free Desktop version of Talend Data Preparation on our website.
In this blog post, we walk you through four major features, available in both the Free Desktop and Commercial editions, that have a direct impact on user productivity. Let's take a closer look at them.
Interact with Large Data Volumes through Selective Sampling
Data Preparation is an interactive experience. Because data is exposed to data workers in a spreadsheet-like user interface, they can quickly identify the actions needed to fix its quality, enrich it, and shape it to fit their context.
This experience works fine with relatively small sets of data, but the challenge is to make it scale with larger sets. Data sampling is critical to address this challenge, and this is a feature that we introduced in our commercial version. The latest release of Talend Data Preparation brings this capability to a new level with selective sampling. It allows the data worker to specify the sample that they want to interact with.
Suppose, for example, that you want to cleanse 32,000 rows of contact data from Salesforce.com, and the US state column in particular. By default, Talend Data Preparation retrieves a sample of the data set for interactive preparation. Through its semantic dictionary, it not only understands that one column refers to a state but also draws the user's attention to the values that are invalid for that data type. The user can then select the rows with an invalid state within that sample, correct ‘Texas’ to ‘TX’ in a single cell, and apply the fix to all the rows. However, there might be other invalid values in the state column that were not included in the sample. Through selective sampling, Talend Data Preparation fetches more rows that match the current filter on invalid states to refine the preparation. This makes it possible to correct all the invalid data, for example by surfacing a data quality issue related to the state of Iowa (IA). Selective sampling: optimized data accuracy.
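The idea behind selective sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Talend's implementation: the `head_sample` and `selective_sample` helpers, the `VALID_STATES` set, and the assumption that the default sample is simply the first rows of the data set are all hypothetical.

```python
from itertools import islice

# Hypothetical, truncated list of valid two-letter state codes.
VALID_STATES = {"TX", "IA", "CA", "NY"}


def is_invalid_state(row):
    """Return True when the row's state is not a known two-letter code."""
    return row["state"] not in VALID_STATES


def head_sample(rows, size):
    """Assumed default sample: the first `size` rows of the data set."""
    return list(islice(rows, size))


def selective_sample(rows, predicate, size):
    """Scan the whole data set and keep up to `size` rows matching the filter."""
    return list(islice((row for row in rows if predicate(row)), size))


# Toy data set: one invalid value early on, another far beyond the default sample.
contacts = (
    [{"state": "TX"}] * 3
    + [{"state": "Texas"}]
    + [{"state": "IA"}] * 20
    + [{"state": "Iowa"}]
)

# Invalid values visible in the default sample versus in the selective sample.
in_default_sample = [r["state"] for r in head_sample(contacts, 5) if is_invalid_state(r)]
in_selective_sample = [r["state"] for r in selective_sample(contacts, is_invalid_state, 5)]
```

In this toy data set, the default sample only exposes the ‘Texas’ value, while the selective sample also pulls in the ‘Iowa’ row that sits outside the first rows.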
Fix Data Across Columns Faster
Because Talend Data Preparation can automatically discover the semantics of your data (for example, understanding that the first column of your data set is a first name, the second a last name, the third an e-mail address, and the fourth a phone number), it can automatically highlight the invalid data that doesn’t conform to those data types. This capability can be very helpful in improving the productivity of data workers when fixing errors in their datasets.
The latest release of Talend Data Preparation lets you immediately point out the set that needs to be fixed by applying a filter on all the rows with invalid or empty values in one simple action. When combined with smart sampling, this function is extremely useful to manage data quality in large datasets.
In the following video, the user wishes to keep only business e-mails in a marketing leads list. After extracting the e-mail parts, he deletes every ‘gmail.com’ and ‘yahoo.com’ e-mail address from the data set in a single operation. Multi-filter: time saved, personal productivity increased.
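For readers who think in code, the same clean-up can be approximated as follows. This is a minimal sketch, not what the product does internally: the `email_domain` and `keep_business_leads` helpers and the `email` field name are assumptions made for illustration.

```python
# Domains treated as personal e-mail providers in this example.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com"}


def email_domain(address):
    """Extract the part after the last '@', lower-cased and trimmed."""
    return address.rpartition("@")[2].strip().lower()


def keep_business_leads(leads):
    """Drop every lead whose e-mail domain is a personal provider."""
    return [
        lead for lead in leads
        if email_domain(lead["email"]) not in PERSONAL_DOMAINS
    ]


leads = [
    {"email": "ann@example.com"},
    {"email": "bob@gmail.com"},
    {"email": "carol@Yahoo.com"},
]
business_leads = keep_business_leads(leads)
```

Both personal domains are removed in one pass because the filter tests membership in a set of values rather than one value at a time.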
Another productivity accelerator provided by Talend Data Preparation is the ability to avoid repetitive actions when you need to implement the same standardization on multiple columns of information. Many of our 30,000 early adopters had this on their wishlist: the ability to select multiple columns by using <Ctrl><Click> or <Shift><Click> and apply functions across these columns.
In the following video, the user notes that 2 columns are date columns and that both contain unnormalized data. Talend Data Preparation allows the user to standardize both columns at once. The user selects both columns and applies the “change date format” function a single time. Cleansing time divided by 2.
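Applying one function across several selected columns can be pictured with the sketch below. It is an illustration only: `normalize_date`, `change_date_format`, the list of accepted input formats, and the column names are hypothetical and do not reflect the product's actual date parsing.

```python
from datetime import datetime

# Hypothetical input formats found in the unnormalized columns.
INPUT_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y.%m.%d")


def normalize_date(value, target="%Y-%m-%d"):
    """Rewrite a date string in the target format; leave other values untouched."""
    if not isinstance(value, str):
        return value
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime(target)
        except ValueError:
            continue  # try the next candidate format
    return value  # unparseable values are kept as-is


def change_date_format(rows, columns, target="%Y-%m-%d"):
    """Apply the same date normalization to every selected column in one pass."""
    selected = set(columns)
    return [
        {key: normalize_date(val, target) if key in selected else val
         for key, val in row.items()}
        for row in rows
    ]


rows = [{"name": "Ann", "signup": "05/03/2016", "renewal": "2017.03.05"}]
cleaned = change_date_format(rows, ["signup", "renewal"])
```

The function is called once with both column names, which mirrors selecting two columns and applying the function a single time.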
Work with Locations, IBAN and Temperatures
When working with ISO2 country codes (with the commercial version), your data is displayed in the form of a world map in the chart tab. Like any chart in this tab, it is interactive, which means that you can click on a value to drill down. We also introduced an interactive map of the United States in the commercial version when working with two-letter US state codes.
IBANs are supported, and we deliver more than just checking the pattern and standardizing the formatting: the IBAN validation algorithm is embedded. Our data masking capability also fully applies to this very sensitive data.
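For context, the standard IBAN check is a mod-97 checksum: move the first four characters to the end, replace each letter with a number (A=10 through Z=35), and verify that the resulting integer leaves a remainder of 1 when divided by 97. The sketch below shows that general check. It is not Talend's code, the `is_valid_iban` name is hypothetical, and country-specific length and structure rules are deliberately left out.

```python
def is_valid_iban(iban):
    """Check an IBAN with the standard mod-97 checksum (no per-country rules)."""
    compact = iban.replace(" ", "").upper()
    # IBANs are ASCII alphanumeric and between 15 and 34 characters long.
    if not (compact.isascii() and compact.isalnum() and 15 <= len(compact) <= 34):
        return False
    # Country code (2 letters) followed by 2 check digits.
    if not (compact[:2].isalpha() and compact[2:4].isdigit()):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Base-36 digit values map 0-9 to 0-9 and A-Z to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


valid = is_valid_iban("GB82 WEST 1234 5698 7654 32")
invalid = is_valid_iban("GB82 WEST 1234 5698 7654 33")
```

A pattern check alone would accept both values above, since they have the same shape. Only the checksum tells them apart.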
For those working with weather data or sensor data, there’s also a new “convert temperature” function to switch the measurement unit of your temperature data between Celsius, Fahrenheit, and Kelvin.
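The conversion itself relies on the usual formulas between the three scales. Here is a minimal sketch that goes through Kelvin as a pivot unit. The `convert_temperature` name and the single-letter unit codes are assumptions for illustration, not the product's function signature.

```python
def convert_temperature(value, from_unit, to_unit):
    """Convert between Celsius ('C'), Fahrenheit ('F') and Kelvin ('K')."""
    # Step 1: bring the input value to Kelvin.
    if from_unit == "C":
        kelvin = value + 273.15
    elif from_unit == "F":
        kelvin = (value - 32) * 5 / 9 + 273.15
    elif from_unit == "K":
        kelvin = value
    else:
        raise ValueError(f"unknown unit: {from_unit!r}")

    # Step 2: convert from Kelvin to the requested unit.
    if to_unit == "C":
        return kelvin - 273.15
    if to_unit == "F":
        return (kelvin - 273.15) * 9 / 5 + 32
    if to_unit == "K":
        return kelvin
    raise ValueError(f"unknown unit: {to_unit!r}")


boiling_f = convert_temperature(100, "C", "F")   # boiling point of water
freezing_k = convert_temperature(32, "F", "K")   # freezing point of water
```

Using one pivot unit keeps the number of formulas at two per scale instead of one per pair of scales. Results are floating-point values, so compare them with a small tolerance.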
Design and Maintain your Dataset Preparation
Designing a preparation is an ad-hoc experience. In some cases, especially when working with a preparation that needs dozens of steps, you might want to add a new step, but then realize that it needs to be applied earlier in your preparation sequence. Now you can dynamically move this step up to the right position in the sequence, and even reorder the preparation steps at any time while maintaining your preparation. This makes maintenance of complex preparations with dozens of steps much easier and is particularly useful when standardizing data against a lookup file.
Here the user wants to identify the store brand products in a full list of products. As usual, he uses the lookup function to blend the core data set, the product catalog, with an external data set listing the store brand products. In theory, a single step is enough to get the requested result. In this case, however, some values remain unmatched after the lookup runs, because of white spaces in some cells. So the user cleanses these white spaces and then reorders the steps of the recipe so that the cleansing runs before the lookup. Reorder: optimized combination of cleansing steps.
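Why the order of steps matters can be shown with a small sketch in which a recipe is modeled as an ordered list of functions. Everything here is hypothetical and only illustrates the principle: the `run_recipe` and `move_step` helpers, the `STORE_BRANDS` lookup set, and the product names are made up for the example.

```python
# Hypothetical external data set listing the store brand products.
STORE_BRANDS = {"Acme Cola", "Acme Chips"}


def trim_names(rows):
    """Cleansing step: remove leading and trailing white spaces from names."""
    return [{**row, "name": row["name"].strip()} for row in rows]


def lookup_store_brand(rows):
    """Lookup step: flag products whose name appears in the store brand list."""
    return [{**row, "store_brand": row["name"] in STORE_BRANDS} for row in rows]


def run_recipe(steps, rows):
    """Apply each step of the recipe in order."""
    for step in steps:
        rows = step(rows)
    return rows


def move_step(steps, from_index, to_index):
    """Return a new recipe with one step moved to another position."""
    reordered = list(steps)
    reordered.insert(to_index, reordered.pop(from_index))
    return reordered


products = [{"name": "Acme Cola "}, {"name": "Fizzy Pop"}]

# The cleansing step was added after the lookup: the match has already failed.
recipe = [lookup_store_brand, trim_names]
before = run_recipe(recipe, products)

# Moving the cleansing step ahead of the lookup fixes the match.
after = run_recipe(move_step(recipe, 1, 0), products)
```

In the first run, ‘Acme Cola ’ is compared with its trailing space and is not flagged. After the reorder, the name is trimmed before the lookup and the product is flagged as a store brand.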
And now it’s your turn to play
The magic of open source comes from the voice of the community. One year after its launch, Talend Data Preparation has been downloaded more than 40,000 times, and we are getting great feedback in our forum. This inspires our roadmap and fuels the rapid evolution of Talend Data Preparation. Please keep sharing what you like, dislike, or would like to see in future versions of the product, as well as your use cases. You’ll be rewarded with a better product.