“My momma always said, ‘Life was like a box of chocolates. You never know what you’re gonna get.’” Even if everyone’s life remains full of surprises, what applied to Forrest Gump in Robert Zemeckis’s 1994 movie shouldn’t apply to your data strategy. As you take the very first steps in your data strategy, you first need to know what’s inside your data. This part is critical, and it requires the right tools and methodology to step up your data-driven strategy.
Why Data Discovery?
With the increased affordability and accessibility of data storage over recent years, data lakes have grown in popularity. This has left IT teams with a growing number of diverse datasets, known and unknown, polluting the data lake in volume and variety every day. As a consequence, everyone is facing a data backlog. It can take weeks for IT teams to publish new data sources in a data warehouse or data lake. At the same time, it takes hours for line-of-business workers or data scientists to find, understand and put all that data into context. IDC found that only 19 percent of the time spent by data professionals and business users can really be dedicated to analyzing information and delivering valuable business outcomes.
Given this new reality, the challenge is to overcome these obstacles by bringing clarity, transparency and accessibility to your data, and to extract value from legacy systems and new applications alike. Wherever the data resides (in a traditional data warehouse or hosted in a cloud data lake), you need to establish proper data screening so you can get the full picture of the data flowing in and out of your organization.
Know Your Data
When it’s time to get started working on your data, it’s critical to start exploring the different data sources you wish to manage. The good news is that the newly released Talend Data Catalog coupled with the Talend Data Fabric is here to help.
As mentioned in this post, Talend Data Catalog will intelligently discover all the data coming into your data lake so you get an instant picture of what’s going on in any of your datasets.
One of the many interesting use cases of Talend Data Catalog is to identify and screen any datasets that contain sensitive data so that you can further reconcile them and apply data masking, for example, so that the relevant people across the organization can still use them. This will help reduce the burden on any data team wishing to operationalize regulatory compliance across all data pipelines. To discover more about how Talend Data Catalog can help you comply with GDPR, take a look at this Talend webcast.
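To make the screen-then-mask idea concrete, here is a minimal sketch in plain Python. It is not how Talend Data Catalog works internally: the patterns, the `detect_sensitive_columns` and `mask` helpers, and the 0.8 threshold are all assumptions chosen for illustration, and a real catalog relies on much richer semantic detection.

```python
import re

# Hypothetical detection rules for illustration only.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}

def detect_sensitive_columns(rows, threshold=0.8):
    """Return {column: label} for columns whose non-empty values mostly match a pattern."""
    flagged = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [str(r.get(col)) for r in rows if r.get(col) not in (None, "")]
        if not values:
            continue
        for label, pattern in PATTERNS.items():
            hits = sum(1 for v in values if pattern.match(v))
            if hits / len(values) >= threshold:
                flagged[col] = label
                break
    return flagged

def mask(value, keep=2):
    """Keep the first `keep` characters and replace the rest with '*'."""
    text = str(value)
    return text[:keep] + "*" * max(len(text) - keep, 0)

# Made-up sample records.
rows = [
    {"name": "Ada", "contact": "ada@example.com"},
    {"name": "Grace", "contact": "grace@example.org"},
]
flagged = detect_sensitive_columns(rows)
masked = [{k: mask(v) if k in flagged else v for k, v in r.items()} for r in rows]
print(flagged)
print(masked)
```

Screening first and masking only the flagged columns keeps the rest of the dataset usable as-is, which is the point of sharing masked data across the organization.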
Auto Profiling for All with Data Catalog
The auto-profiling capabilities of Talend Data Catalog facilitate data screening for non-technical people within your organization. Simply put, the data catalog will provide you with automated discovery and intelligent documentation of the datasets in your data lake. It comes with easy-to-use profiling capabilities that will help you quickly assess data at a glance. With trusted, auto-profiled datasets, you will have powerful visual profiling indicators, so users can easily find the right data in a few clicks.
Not only can Talend Data Catalog bring all of your metadata together in a single place, but it can also automatically draw the links between datasets and connect them to a business glossary. In a nutshell, this allows organizations to:
- Automate the data inventory
- Leverage smart semantics for auto-profiling, relationship discovery and classification
- Document and drive usage now that the data has been enriched and becomes more meaningful
Go further with Data Profiling
Data profiling is a technology that will enable you to explore your datasets in depth and accurately assess multiple data sources based on the six dimensions of data quality. It will help you identify whether and how your data is inaccurate, inconsistent, or incomplete.
Let’s put this in context. Think about a doctor’s exam to assess a patient’s health. Nobody wants to undergo surgery without a precise and close examination first. The same applies to data profiling: you need to understand your data before fixing it. Because data often comes into the organization inoperable, in hidden formats, or unstructured, an accurate diagnosis will give you a detailed overview of the problem before you fix it. This will save time for you, your team and your entire organization, because you will have mapped this potential minefield in advance.
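As a rough illustration of what such a diagnosis can look like, the sketch below profiles a single column in plain Python. It is an assumption-laden example, not a Talend API: the `shape` and `profile_column` helpers and the sample ZIP codes are invented here, and the sketch covers only completeness, distinct values and format patterns.

```python
from collections import Counter

def shape(value):
    """Reduce a value to a pattern: letters become 'A', digits become '9', other characters are kept."""
    return "".join(
        "A" if c.isalpha() else "9" if c.isdigit() else c for c in str(value)
    )

def profile_column(values):
    """Compute a few basic profiling indicators for one column."""
    total = len(values)
    present = [v for v in values if v not in (None, "")]
    return {
        # Share of rows that actually carry a value.
        "completeness": len(present) / total if total else 0.0,
        # Number of different non-empty values.
        "distinct": len(set(present)),
        # Most frequent formats; rare formats often point to bad records.
        "top_patterns": Counter(shape(v) for v in present).most_common(3),
    }

# Made-up sample column with a missing value, an empty string and a truncated code.
zip_codes = ["94105", "10001", "", "9410", "94105", None]
report = profile_column(zip_codes)
print(report)
```

In this sample, the lone four-digit pattern stands out against the dominant five-digit one, which is the kind of signal that tells you where to look before any cleansing starts.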
Easy profiling for power users with Talend Data Preparation: Data profiling shouldn’t be complicated. Rather, it should be simple, fast and visual. For use cases such as Salesforce data cleansing, you may wish to gauge your data quality by delegating some of the basic data profiling activities to business users. They will then be able to do quick profiling on their favorite datasets. With tools like Talend Data Preparation, you will have powerful yet simple built-in profiling capabilities to explore datasets and assess their quality with the help of indicators, trends and patterns.
Advanced profiling for data engineers: Using Talend Data Quality in the Talend Studio, data engineers can connect to data sources, analyze their structure (catalogs, schemas, and tables), and store the description of their metadata in the Studio's metadata repository. Then, they can define the available data quality analyses, including database content analysis, column analysis, table analysis, redundancy analysis, correlation analysis, and more. These analyses carry out the data profiling processes that determine the content, structure, and quality of highly complex data structures. The analysis results are then displayed visually as well.
To go further into data profiling, take a look at this webcast: An Introduction to Talend Open Studio for Data Quality.
Keep in mind that your data strategy should first and foremost start with data discovery. Failing to profile your data would put your entire data strategy at risk. It’s really about surveying the ground to make sure your data house is built on solid foundations.