Delivering on Data Science with Talend: Getting Quality Data
Today, we are in the information age, with a tremendous amount of data being created (as much as 90% of it in the last two years alone). This data comes from a wide range of sources and takes many different forms: human-generated documents and social media communications; the transactional data we use to run our businesses; and streams of data from an ever-growing proliferation of sensors.
It has been said that data is the new soil in which discoveries grow, and the potential for Data Scientists to make breakthroughs and to drive positive outcomes using machine learning and deep learning is unprecedented. But new opportunities always come with challenges.
In this blog series, I'll show you how Talend can help solve common challenges with data science. First, let's start by focusing on how to get clean and relevant data available.
Breaking the 80/20 rule of Data Science
One thing you may have heard a hundred times while working on data science projects is the 80/20 rule: 80% of a data science effort is spent on data preparation (getting clean, relevant data into the right format and the right place), and only 20% on actual analysis. In other words, four days of each business week go to gathering data, while only one day is spent running algorithmic models. Data scientists themselves confirmed this rule in a recent report from CrowdFlower (now known as Figure Eight).
But what if data scientists already had the data they needed?
Ingest any type of data
Let’s start with this thought: even the most experienced data scientist won’t help you much without access to data. Moreover, they must be able to get all the data. If there are 20 years of customer data sitting in a mainframe, or an MQTT topic where sensor data is published, they must be able to collect it in order to unlock the value and potential of those information systems.
According to the CrowdFlower survey, a data scientist spends at least one day of the week just collecting data. That’s where Talend’s data integration capabilities come in handy, with more than 900 connectors and components that let you connect to databases, business and cloud applications, data formats and metadata, protocols and messaging, cloud services, and much more.
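To make the ingestion idea concrete, here is a minimal, hedged sketch in plain Python (not Talend's actual API) of landing two heterogeneous sources, a CSV export from a transactional system and a JSON message such as one published on a sensor topic, into a single uniform record list. The sample payloads and the `_source` provenance tag are illustrative assumptions, not anything from the original post.

```python
import csv
import io
import json

# Hypothetical sample payloads: a CSV export from a transactional
# database and a JSON message from a sensor topic.
csv_export = "customer_id,city\n42,Paris\n7,Berlin\n"
sensor_msg = '{"sensor_id": "s-01", "temperature": 21.5}'

def ingest_csv(text):
    """Parse a CSV export into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def ingest_json(text):
    """Parse a JSON message into a single dict record."""
    return [json.loads(text)]

# Land both sources in one uniform record list, tagging provenance.
records = [dict(r, _source="crm") for r in ingest_csv(csv_export)]
records += [dict(r, _source="sensors") for r in ingest_json(sensor_msg)]
```

A real pipeline would of course pull from live connectors rather than inline strings, but the shape is the same: normalize every source into one record format so downstream quality and analysis steps see consistent data.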
And it doesn’t stop there: Talend recently announced the acquisition of Stitch, which has developed a simple, frictionless way for users to move data from cloud sources to a cloud data warehouse quickly and easily.
Stitch enables Talend to immediately compete in a new and rapidly growing market segment for low-cost, self-service cloud data warehouse ingestion services, and current customers will benefit from using both Talend and Stitch products in the near future. If you want to know more about Stitch, visit the Stitch website and get started connecting your data in less than two minutes.
Driving Data Quality
Now that data scientists can access and collect the data they need with data integration tools, they’ll have to face the data quality challenge, because many organizations’ data lakes have turned into dumping grounds. Coming back to the CrowdFlower report, data scientists often spend two to three days of their week cleaning and preparing their data.
A data scientist’s time is precious. Talend Data Quality can help them work to their full potential with a suite that includes data masking capabilities as well as self-service applications for Data Preparation and Data Stewardship. Talend brings a unified platform that makes data integration and data quality a team sport, through collaboration and by empowering business users to build up the company’s ground truth.
But more importantly, with Talend you can automate, scale, and industrialize your data integration, quality, and anonymization processes, making your data scientists’ lives easier by providing consistently high-quality data.
Because in the end, having a lot of data is good but not enough: for data science, data quality is key to building performant machine learning models.
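The kinds of cleaning and anonymization steps described above can be sketched in a few lines of plain Python. This is an illustrative assumption of what such a pipeline does (deduplication, format standardization, and hash-based masking of PII), not Talend's implementation; the sample records and the `mask` helper are hypothetical.

```python
import hashlib

# Hypothetical raw customer records with duplicates, inconsistent
# formats, and an email field that should be masked before analysis.
raw = [
    {"id": 1, "email": "ada@example.com", "country": "fr"},
    {"id": 1, "email": "ada@example.com", "country": "FR"},   # duplicate
    {"id": 2, "email": "alan@example.com", "country": " uk "},
]

def mask(value):
    """Irreversibly mask a sensitive value (simple hash-based anonymization)."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

clean, seen = [], set()
for rec in raw:
    if rec["id"] in seen:          # drop duplicate entities
        continue
    seen.add(rec["id"])
    clean.append({
        "id": rec["id"],
        "email": mask(rec["email"]),                # anonymize PII
        "country": rec["country"].strip().upper(),  # standardize format
    })
```

Running an automated pass like this consistently, rather than ad hoc per project, is what turns two to three days of weekly cleaning into a repeatable, industrialized process.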
Cataloging Your Data With a ... Data Catalog
Even when they can get their hands on the right data, data scientists need to spend time exploring and understanding it. For example, they might not know what a set of fields in a table is referring to at first glance, or data may be in a format that can’t be easily understood or analyzed. There is usually little to no metadata to help, and they may need to seek advice from the data’s owners to make sense of it.
Talend offers tools to automate and simplify data discovery, curation, and governance. Intelligent search capabilities help data scientists find the data they need, while metadata such as tags, comments, and quality metrics help them decide whether a data set will be useful to them and how best to extract value from it.
With Talend Data Catalog, our goal is to deliver trusted data at scale in the digital era. We do this by empowering organizations to create a single source of trusted data. Talend Data Catalog achieves this objective by:
- First, it crawls your data landscape, using machine learning and smart semantics to automatically discover all your data
- Second, it orchestrates data governance, so data curation becomes a team sport in which you collaborate to improve data accessibility, accuracy, protection, and business relevance
- Third, it lets data consumers find, understand, use, and share trusted data faster. Data Catalog makes it easy to search for data, visually present data relationships, and verify a data set’s validity before sharing it with peers.
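The "find and decide" step above can be pictured with a toy catalog: each data set carries metadata (tags, owner, a quality score) that a consumer can search and rank. This is a minimal sketch under assumed data, not Talend Data Catalog's actual API; the `find` function and quality scores are hypothetical.

```python
# Hypothetical minimal catalog: each data set carries metadata
# (tags, owner, a quality score) that consumers can search.
catalog = [
    {"name": "crm_customers", "tags": {"customer", "pii"},
     "owner": "sales-ops", "quality": 0.92},
    {"name": "iot_readings", "tags": {"sensor", "timeseries"},
     "owner": "platform", "quality": 0.75},
    {"name": "web_clickstream", "tags": {"customer", "events"},
     "owner": "marketing", "quality": 0.60},
]

def find(catalog, tag, min_quality=0.0):
    """Return data sets carrying a tag, best quality first."""
    hits = [d for d in catalog
            if tag in d["tags"] and d["quality"] >= min_quality]
    return sorted(hits, key=lambda d: d["quality"], reverse=True)

customer_sets = find(catalog, "customer", min_quality=0.5)
```

The point is that tags and quality metrics let a data scientist judge a data set's usefulness before opening it, which is exactly the time sink a catalog is meant to remove.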
Integrated data governance gives data scientists confidence that they are permitted to use a given data set and that the models and results they produce are used responsibly by others in the organization.
All this will help break the 80/20 rule and shift it toward 20/80. Data scientists could reclaim much of the time they currently spend on cleansing and devote it to what they do best: building more effective predictive models, with more time to refine and compare them.
And all of this can be done from Talend Cloud, with serverless execution using Talend Cloud engines. Stay tuned for my next post, where we will discuss how to scale and reduce costs using serverless technologies, and how to deploy and leverage machine learning models in enterprise solutions with Talend and Databricks.