In its latest Market Guide for Self-Service Data Preparation, Gartner predicts that “by 2019, data and analytics organizations that provide agile, curated internal and external datasets for a range of content authors will realize twice the business benefits of those that do not”.
Organizations today are swimming in data, but most companies are only utilizing a fraction of what they collect. By implementing a self-service data preparation strategy, companies can enable more widespread use of data throughout their organization and move towards creating a data-driven culture. But it’s not easy. IT leaders who aspire to become data heroes and transform their organization into a data-driven business tend to feel threatened when they learn that data workers spend most of their time—an estimated 500 hours and $22,000 per year—collecting, correcting and formatting the data before they can turn them into insights. Additionally, there is the concern about the risk of uncontrolled proliferation of data sources.
The newest release of Talend Data Preparation provides an alternative. It not only empowers customers to achieve demonstrable business benefits faster and easier but also enables them to expand the reach of their modern data platform to a broader audience. This is designed to allow everyone within the organization—from IT developers to business information workers, including data analysts, stewards or scientists—to benefit from increased access to corporate information to inform their day-to-day tasks. Additionally, Talend Data Preparation is designed to help balance the need for broader data access and collaboration to improve employee productivity and business insight, with IT-controlled data governance.
Desktop, On Demand or Enterprise?
I’m regularly asked about the differences between the various versions of Data Preparation that Talend offers. I’d like to take this opportunity to provide some guidance on the ‘flavors’ of Talend Data Preparation so you can decide which is the best fit for your business information workers’ needs.
In a nutshell:
– If you want to get hands-on with self-service data preparation, and/or to be self-sufficient in your data-driven tasks, look at the Free Desktop version, or alternatively the On Demand AWS version if you have a cloud-first strategy.
– If you are working as a team to maximize the value of data in your activities, or if you aim to establish managed self-service access to data for a community of data workers, you should definitely consider the subscription version.
Desktop: A Personal Productivity Booster!
Think of the Free Desktop as a personal productivity tool. Business workers install it on their Mac or PC to fix and work on personal datasets they have at their disposal (e.g. a tradeshow leads list, a monthly financial forecast, or a compensation measurement tracker). This type of data is typically available in an Excel or CSV file. Once the file is ‘cleaned up’ using Talend Data Preparation, it can be exported as a CSV or Excel file, or into Tableau. As long as your desktop resources can handle the data volume, you’ll be fine with the Open Source version of Talend Data Preparation. In our experience, business users are typically able to work interactively with tens of thousands of rows, which is why we have set a 30,000-row limit by default. But your desktop resources might fall short when you try to handle larger datasets.
On-Demand: Access through the Cloud!
We also recently introduced a version of Talend Data Preparation for Amazon Web Services. It is a free, single-user edition that doesn’t require any installation on your desktop: you just connect to it remotely through your browser, and it provides capabilities that are very similar, if not identical, to the Free Desktop version. If you’re familiar with Amazon Web Services and have an active user account, this version is worth a try. Stay tuned for an upcoming blog that dives deeper into the capabilities of the AWS version.
Enterprise: Enabling Data Collaboration and Governance
The enterprise or subscription version of Talend Data Preparation delivers a governed, self-service platform for the entire company. This version provides role-based access and collaboration capabilities for sharing and reusing dataset preparations between data workers. You can see it in action in this video.
Through the Talend platform, it can connect to almost any data source in your enterprise and expose that data as a self-service dataset in batch or real time. As mentioned earlier in this article, the enterprise edition of Talend Data Preparation can work on large datasets through server-based processing and sampling. Last but not least, any user-defined preparation can be pushed back to the Talend Data Fabric platform. There it can be connected to every cloud or on-premises data source across the enterprise and combined with high-end capabilities provided by the Fabric, such as data masking, advanced mapping, or complex matching. The preparation can then run on a scheduled basis or be applied to real-time data flows.
Onboarding with this version is pretty straightforward. If you are an existing Talend customer, you are entitled to two free named-user licenses as part of your Talend subscription. We are also offering a half-day on-demand training session and a 2-day quick start consulting package to learn how to implement, administer and use Talend Data Preparation. As a special offer for early adopters, the on-demand training session is free of charge throughout 2016.
If you are a new user and you wish to discover the software, you can download and experiment with the Free Desktop version of Talend Data Preparation on our website.
Whatever version you choose to utilize, there is a laundry list of business benefits that your organization stands to gain. Let’s now take a deeper look into some of these.
Interact with Large Data Volumes through Selective Sampling
Data Preparation is an interactive experience. Because data is exposed to data workers in a spreadsheet-like user interface, they can quickly identify the actions needed to fix its quality, then enrich and shape it to fit their context.
This experience works fine with relatively small sets of data, but the challenge is to make it scale with larger sets. Data sampling is critical to address this challenge, and this is a feature that we introduced in our commercial version. The latest release of Talend Data Preparation brings this capability to a new level with selective sampling. It allows the data worker to specify the sample that they want to interact with.
Suppose, for example, you want to cleanse your 32,000-row contact data set from Salesforce.com, and more particularly the US state column. By default, Talend Data Preparation will retrieve a sample of the data set for interactive preparation. Through its semantic dictionary, it not only understands that one column refers to a state but also draws the user’s attention to the invalid values for that datatype. The user can then select the rows with an invalid state within that sample, correct ‘Texas’ to ‘TX’ in a single cell and apply the fix to all the rows. But there might be other invalid values in the state column that were not included in the sample. Through selective sampling, Talend Data Preparation selects more rows that match the current filter on invalid states to refine the preparation. This makes it possible to correct all invalid data, for example by surfacing a data quality issue with the state of Iowa (IA) that the first sample missed. Selective sampling: optimized data accuracy.
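To make the idea concrete, here is a minimal Python sketch of the difference between a default sample and a selective sample. It is not Talend's implementation: the function names, the assumption that the default sample is simply the first rows of the data set, and the truncated list of valid state codes are all illustrative.

```python
from itertools import islice

# Hypothetical, deliberately truncated list of valid two-letter state codes.
VALID_STATES = {"CA", "IA", "NY", "TX"}


def is_invalid_state(row):
    """Filter predicate: True when the state is not a known two-letter code."""
    return row["state"] not in VALID_STATES


def head_sample(rows, size):
    """Assumed default sample: the first `size` rows of the data set."""
    return list(islice(rows, size))


def selective_sample(rows, predicate, size):
    """Sample built only from rows matching the active filter, scanning the full data set."""
    return list(islice((row for row in rows if predicate(row)), size))


# Ten contacts; the invalid 'Iowa' value sits outside the first five rows.
contacts = [{"id": i, "state": "TX"} for i in range(10)]
contacts[2]["state"] = "Texas"
contacts[8]["state"] = "Iowa"

sample = head_sample(contacts, 5)
invalid_in_sample = [row for row in sample if is_invalid_state(row)]
invalid_everywhere = selective_sample(contacts, is_invalid_state, 5)
```

In this toy data set the default sample only exposes ‘Texas’, while the selective sample also brings ‘Iowa’ into view, so both values can be corrected in the same preparation.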
Fix Data Across Columns Faster
Because Talend Data Preparation can automatically discover the semantics of your data (for example, understand that the first column of your data set is a first name, the second a last name, the third an e-mail and the fourth a phone number), it can automatically highlight the invalid data that doesn’t conform to those data types. This capability can be very helpful in improving the productivity of data workers when fixing errors in their datasets.
The latest release of Talend Data Preparation lets you immediately point out the set that needs to be fixed by applying a filter on all the rows with invalid or empty values in one simple action. When combined with smart sampling, this function is extremely useful to manage data quality in large datasets.
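As a rough illustration of what a filter on invalid or empty values does, the Python sketch below keeps only the rows that need attention. The column names and the two regular expressions are simplified assumptions made for this example; they stand in for the product's semantic dictionary and are not how Talend Data Preparation validates data.

```python
import re

# Hypothetical, simplified validators keyed by column name.
VALIDATORS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?[0-9][0-9 .()-]{6,}"),
}


def cell_is_bad(column, value):
    """A cell is bad when it is empty or fails the validator for its column."""
    if value is None or not value.strip():
        return True
    pattern = VALIDATORS.get(column)
    return pattern is not None and pattern.fullmatch(value.strip()) is None


def rows_needing_fixes(rows):
    """Keep only the rows with at least one invalid or empty cell."""
    return [
        row for row in rows
        if any(cell_is_bad(column, value) for column, value in row.items())
    ]


leads = [
    {"first_name": "Ada", "email": "ada@example.com", "phone": "+1 555 010 0000"},
    {"first_name": "", "email": "bob@example.com", "phone": "555 010 0001"},
    {"first_name": "Cy", "email": "not-an-email", "phone": "555 010 0002"},
]
to_fix = rows_needing_fixes(leads)
```

Here `to_fix` contains the second and third leads: one has an empty first name, the other an e-mail that fails the pattern.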
In the following video, the user wishes to keep only business e-mails in a marketing leads list. After extracting the e-mail parts, the user deletes every ‘gmail.com’ and ‘yahoo.com’ e-mail address from the data set in a single operation. Multi-filter: time saved, personal productivity increased.
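The same two-step logic (extract the domain, then drop the unwanted ones in one pass) can be sketched in a few lines of Python. The function names and the `email` field are assumptions made for illustration, not product APIs.

```python
# Domains treated as personal mailboxes in this example.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com"}


def email_domain(address):
    """Return the lower-cased part after the last '@', or '' if there is none."""
    _, separator, domain = address.rpartition("@")
    return domain.lower() if separator else ""


def keep_business_leads(leads):
    """Drop every lead whose e-mail domain is in PERSONAL_DOMAINS, in one pass."""
    return [
        lead for lead in leads
        if email_domain(lead["email"]) not in PERSONAL_DOMAINS
    ]


leads = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@GMAIL.com"},
    {"name": "Cy", "email": "cy@yahoo.com"},
]
business_leads = keep_business_leads(leads)
```

Only Ada remains in `business_leads`; the comparison is case-insensitive, so ‘GMAIL.com’ is treated like ‘gmail.com’.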
Another productivity accelerator provided by Talend Data Preparation is the ability to avoid repetitive actions when you need to implement the same standardization on multiple columns of information. Many of our 30,000 early adopters had this on their wishlist: the ability to select multiple columns by using <Ctrl><Click> or <Shift><Click> and apply functions across these columns.
In the following video, the user notes that two columns are date columns and that both contain unnormalized data. Talend Data Preparation allows the user to standardize both columns at once: the user selects both columns and applies the “change date format” function a single time. Cleansing time divided by 2.
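For readers who think in code, applying one function across several selected columns looks roughly like the Python sketch below. The accepted input formats, the column names and the function names are assumptions chosen for this example and do not describe the product's internals.

```python
from datetime import datetime

# Input formats this sketch tries, in order (an assumption for the example).
INPUT_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def normalize_date(value, output_format="%Y-%m-%d"):
    """Parse `value` with the first matching input format and re-emit it."""
    for input_format in INPUT_FORMATS:
        try:
            return datetime.strptime(value.strip(), input_format).strftime(output_format)
        except ValueError:
            continue
    return value  # leave unparseable values untouched so they can be reviewed


def change_date_format(rows, columns, output_format="%Y-%m-%d"):
    """Apply the same date normalization to every selected column at once."""
    return [
        {
            column: normalize_date(value, output_format) if column in columns else value
            for column, value in row.items()
        }
        for row in rows
    ]


orders = [
    {"id": "1", "ordered": "05/03/2016", "shipped": "20160412"},
    {"id": "2", "ordered": "2016-03-07", "shipped": "unknown"},
]
cleaned = change_date_format(orders, columns={"ordered", "shipped"})
```

Both date columns are standardized by a single call, and values that match none of the known formats are left as they are instead of being guessed.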
Work with Locations, IBAN and Temperatures
When working with two-letter ISO country codes (ISO2) in the commercial version, your data is displayed in the form of a world map in the chart tab. Like any chart in this tab, it is interactive, which means that you can click on a value to drill down. We also introduced an interactive map of the United States in the commercial version for working with two-letter US state codes.
IBANs are supported, and we deliver more than just pattern control and format standardization: the IBAN validation algorithm is embedded. Our data masking capability also fully applies to this very sensitive data.
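For reference, the standard IBAN check is a mod-97 test on the rearranged number. The Python sketch below shows that public algorithm, not Talend's code; the function names are illustrative, and the sketch deliberately skips country-specific length rules.

```python
import re

# Generic IBAN shape: country code, two check digits, 11 to 30 alphanumerics.
IBAN_SHAPE = re.compile(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}")


def compact_iban(iban):
    """Remove spaces and upper-case the IBAN."""
    return iban.replace(" ", "").upper()


def is_valid_iban(iban):
    """Mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35) and test remainder == 1."""
    compact = compact_iban(iban)
    if IBAN_SHAPE.fullmatch(compact) is None:
        return False
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(character, 36)) for character in rearranged)
    return int(digits) % 97 == 1


def format_iban(iban):
    """Standardize the display form into groups of four characters."""
    compact = compact_iban(iban)
    return " ".join(compact[i:i + 4] for i in range(0, len(compact), 4))
```

With the commonly published sample IBAN ‘GB82 WEST 1234 5698 7654 32’, `is_valid_iban` returns `True`; changing the last digit to 3 makes it return `False`, even though the pattern still looks correct.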
For those working with weather data or sensor data, there’s also a new “convert temperature” function to switch the measurement unit of your temperature data between Celsius, Fahrenheit, and Kelvin.
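The conversion itself is simple arithmetic. As a hedged sketch, the Python function below uses Celsius as a pivot unit; the function name and the one-letter unit codes are assumptions for this example rather than the product's parameters.

```python
# Each unit is converted to Celsius first, then from Celsius to the target unit.
TO_CELSIUS = {
    "C": lambda t: t,
    "F": lambda t: (t - 32) * 5 / 9,
    "K": lambda t: t - 273.15,
}
FROM_CELSIUS = {
    "C": lambda t: t,
    "F": lambda t: t * 9 / 5 + 32,
    "K": lambda t: t + 273.15,
}


def convert_temperature(value, from_unit, to_unit):
    """Convert `value` between 'C', 'F' and 'K' (hypothetical unit codes)."""
    return FROM_CELSIUS[to_unit](TO_CELSIUS[from_unit](value))


boiling_f = convert_temperature(100, "C", "F")   # water boils at 100 °C
freezing_k = convert_temperature(32, "F", "K")   # water freezes at 32 °F
```

Using one pivot unit keeps the table at two small dictionaries instead of one formula per pair of units.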
Design and Maintain your Dataset Preparation
Designing a preparation is an ad-hoc experience. In some cases, especially when working with a preparation that needs dozens of steps, you might want to add a new step, but then realize that it needs to be applied earlier in your preparation sequence. Now you can dynamically move this step up to the right position, and even reorder the preparation steps at any time while maintaining your preparation. This makes maintenance of complex preparations with dozens of steps much easier and is particularly useful when standardizing data against a lookup file.
Here the user wants to identify the store brand products in a full list of products. As usual, the user applies the lookup function to blend the core data set (the products catalog) with an external data set listing the store brand products. In theory, a single step is needed to get the requested result. But this time there are still unmatched values after the lookup runs, because some cells contain white spaces. So the user cleanses these white spaces and then reorders the steps of the recipe so that the cleansing runs before the lookup. Reorder: optimized combination of cleansing steps.
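The effect of step order can be reproduced with a small Python sketch that models a recipe as an ordered list of functions. Everything here (product names, function names, the `move_step` helper) is hypothetical and only illustrates why the cleansing step has to run before the lookup.

```python
# Hypothetical external data set listing the store brand products.
STORE_BRANDS = {"Acme Cola", "Acme Chips"}


def lookup_store_brand(rows):
    """Lookup step: flag products found in the store brand list (exact match)."""
    return [{**row, "store_brand": row["product"] in STORE_BRANDS} for row in rows]


def trim_product(rows):
    """Cleansing step: remove leading and trailing white spaces."""
    return [{**row, "product": row["product"].strip()} for row in rows]


def run_recipe(steps, rows):
    """Apply the steps in order, each one working on the previous result."""
    for step in steps:
        rows = step(rows)
    return rows


def move_step(steps, from_index, to_index):
    """Return a copy of the recipe with one step moved to a new position."""
    reordered = list(steps)
    reordered.insert(to_index, reordered.pop(from_index))
    return reordered


catalog = [{"product": "Acme Cola "}, {"product": "Fizz Up"}, {"product": " Acme Chips"}]

recipe = [lookup_store_brand, trim_product]   # cleansing added after the lookup
unmatched = run_recipe(recipe, catalog)

fixed_recipe = move_step(recipe, 1, 0)        # move cleansing ahead of the lookup
matched = run_recipe(fixed_recipe, catalog)
```

With the original order no product is flagged as a store brand, because the lookup compares values that still carry white spaces. After the reorder, ‘Acme Cola’ and ‘Acme Chips’ both match.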