Five Pillars for Succeeding in Big Data Governance and Metadata Management with Talend
In the last installment of this series, we looked at six keys to turning your Big Data initiative into a sustainable success with Data Governance. These six keys were identified by TDWI in its recent report, “Governing Big Data and Hadoop”. As an independent report, it highlights the challenges and best practices, but it doesn’t explain Talend’s specific ability to address those challenges. Part II of this blog series explains how each of the core components of the Talend Data Fabric unified platform can address the six pressing issues outlined in Part I. We call them the five pillars for managing metadata with Talend.
Metadata by Design with Talend Studio
Without metadata, there is no way to create a holistic and actionable view of the information supply chain. This view is a prerequisite not only for managing change and providing auditability and traceability on data flows, but also for increasing data accessibility through easy-to-use access mechanisms such as search or visual maps. Although metadata can be reverse-engineered in some cases, it is much easier to collect, process, maintain and track metadata at its source, as soon as it is created.
When using Talend, every data flow is designed in a visual, metadata-driven environment. Not only does this speed up development and deployment, but once the data flows are running, it also provides detailed views of the information supply chain: where the data comes from, where it is stored, what the dependencies between the different data points are, and so on.
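To make the idea of an information supply chain concrete, here is a minimal sketch in Python. It is not Talend's repository format or API: the `FLOWS` list, the dataset names and both functions are hypothetical, and they only illustrate how metadata recorded at design time (each flow's inputs and output) is enough to answer "where does this data come from?" without inspecting any code.

```python
from collections import defaultdict

# Hypothetical design-time metadata: each data flow declares its inputs and its output.
FLOWS = [
    {"name": "load_customers", "inputs": ["crm.customers"], "output": "staging.customers"},
    {"name": "load_orders", "inputs": ["erp.orders"], "output": "staging.orders"},
    {"name": "build_sales_mart",
     "inputs": ["staging.customers", "staging.orders"],
     "output": "mart.sales"},
]


def build_upstream_index(flows):
    """Map each dataset to the datasets it is directly derived from."""
    upstream = defaultdict(set)
    for flow in flows:
        upstream[flow["output"]].update(flow["inputs"])
    return upstream


def lineage(dataset, upstream):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    index = build_upstream_index(FLOWS)
    print(sorted(lineage("mart.sales", index)))
```

Because the dependencies are declared rather than buried in hand-written jobs, the same index could also be walked in the other direction to assess the impact of a change to a source table.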
This is crucial in the Big Data world because many powerful data processing environments, such as MapReduce or Spark, are not metadata-driven, unlike more traditional data management standards such as SQL. Without tools like Talend Open Studio that provide a high level of abstraction through a zero-coding approach, your Hadoop data-driven processes can become very difficult to manage, govern and secure. Talend Open Studio and its centralized repository maintain an always up-to-date version of your data flows that you can share across data designers and developers, and export into other tools like Cloudera Navigator, Apache Atlas or Talend Metadata Manager to expose them to a wider audience of data workers. More on this last point later in this post.
Talend also allows developers to unify all the disciplines of data management (Data Integration, Big Data Management, Application Integration, Cloud Integration, Data Quality and MDM, Self-Service Data Preparation) in a single platform. This allows IT to deliver a global view of data flows for both data at rest and data in motion, both traditional and big data, residing either on premises or in the cloud.
Synchronize Your Metadata across your Data Platforms with Talend Metadata Bridge
Talend Metadata Bridge allows developers to import and export metadata from Talend Studio (and similarly from Talend Metadata Manager), as well as access metadata from virtually any data platform. With more than 100 connectors provided, Talend Metadata Bridge helps harvest metadata from modelling tools like Erwin or Embarcadero; ETL tools like Informatica or IBM DataStage; SQL and NoSQL databases; Hadoop; popular BI and Data Discovery tools like Tableau, Qlik or BusinessObjects; as well as XML or COBOL structures.
The bridges allow developers to design data structures once and propagate them across various tools and platforms repeatedly. You can then easily enforce standards, propagate changes, and facilitate migrations, since data formats can be translated from virtually any third-party tool or platform to Talend. For example, you can take an Oracle table and import it into Talend, and then propagate it to another third-party platform such as Redshift. Talend Big Data can also easily offload a traditional ETL job into a native Hadoop process.
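The Oracle-to-Redshift example can be sketched as follows. This is an illustration of the "design once, propagate everywhere" principle only, not how Talend Metadata Bridge is implemented: the `Column` class, the function names and the three type mappings are assumptions chosen for the example, and a real bridge covers many more types and edge cases (for instance, byte versus character lengths).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    """Hypothetical neutral description of a column harvested from Oracle."""
    name: str
    oracle_type: str
    length: Optional[int] = None
    precision: Optional[int] = None
    scale: Optional[int] = None


def to_redshift_type(col):
    """Translate one Oracle column type using a small, illustrative mapping."""
    base = col.oracle_type.upper()
    if base == "VARCHAR2" and col.length is not None:
        return f"VARCHAR({col.length})"
    if base == "NUMBER" and col.precision is not None:
        return f"DECIMAL({col.precision},{col.scale or 0})"
    if base == "DATE":
        # Oracle DATE carries a time component, so TIMESTAMP is used here.
        return "TIMESTAMP"
    raise ValueError(f"No mapping defined for {col.oracle_type} column {col.name}")


def to_redshift_ddl(table, columns):
    """Render a CREATE TABLE statement from the neutral column descriptions."""
    body = ",\n".join(f"  {c.name.lower()} {to_redshift_type(c)}" for c in columns)
    return f"CREATE TABLE {table} (\n{body}\n);"


if __name__ == "__main__":
    customers = [
        Column("ID", "NUMBER", precision=10),
        Column("NAME", "VARCHAR2", length=100),
        Column("CREATED_AT", "DATE"),
    ]
    print(to_redshift_ddl("customers", customers))
```

Keeping the structure in a neutral form is what makes the propagation repeatable: a change to the source definition is made once and every target definition is regenerated from it.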
Tackle Hadoop Governance Challenges with Talend Big Data
By design, Hadoop accelerates data proliferation. Also, unlike traditional databases that provide a single point of reference for data, data manipulations, and their related metadata, Hadoop combines multiple storage and data processing options. Additionally, as part of its high availability strategy, Hadoop tends to replicate data across many nodes and to create intermediary copies of raw data between processing steps. All of these factors pose significant threats to data governance, and data lineage therefore becomes critical to provide traceability and auditability of data flows inside Hadoop.
But the beauty of Hadoop is that it’s an open and extensible community-based framework. Its weaknesses trigger innovation projects to address the issues and turn them into strengths. Apache Atlas and Cloudera Navigator are the most common Hadoop extensions to address the specific challenges of data governance within Hadoop.
Talend Big Data seamlessly integrates with Cloudera Navigator or Apache Atlas (for Hortonworks) and exposes the detailed metadata for its data flows to each of these third-party data governance environments. Through this capability, Talend enriches those environments with data lineage that goes into much greater depth than would be available if the data flows were hand-coded directly in Hadoop or Spark. Thanks to Cloudera Navigator and Apache Atlas, the metadata generated by Talend can be connected to other data points, searched, visualized as maps for data lineage, and shared with potentially any authorized user in the Hadoop environment, beyond Talend developers and administrators. They also make metadata more actionable by triggering actions (such as auto classification of metadata, definition of retention policies…) for specific datasets based on arrival or scheduled intervals.
As an example, Talend was the first vendor to deliver field-level data lineage for Spark in Cloudera Navigator, a critical capability for big data use cases in heavily regulated environments such as financial services or life sciences.
Democratize the Data Lake with Superior Data Accessibility
Until now, business users may have perceived data governance as an administrative constraint rather than a value-add, but in truth it can bring many benefits. For example, would you consume food from a retail store without first reading the label and ensuring it was properly packaged? It is crucial to know the name, the origin, the ingredients, the weight and quantity, the nutrition facts, and so on, before consuming any food item. The same principles should apply to data.
Talend provides a Business Glossary in Talend Metadata Manager to allow data stewards to maintain the business definitions for all data, link them to the tools and environments where the data can be accessed (such as Hive tables in Hadoop or Tableau dashboards), and finally expose them to business users. Similarly, Talend Data Preparation provides its own dataset inventory to allow anyone to access, cleanse and shape data on a self-service basis. Because self-service is a key part of Talend’s market vision, stay tuned for more innovations in this area.
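As a rough picture of what a business glossary holds, the sketch below links business terms to the places where the data can be found and lets a user search them by keyword. Everything in it is hypothetical (the `GlossaryTerm` class, the sample terms and the asset identifiers); it does not reflect the Talend Metadata Manager data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryTerm:
    """Hypothetical glossary entry: a business definition plus linked assets."""
    name: str
    definition: str
    assets: List[str] = field(default_factory=list)


# Sample entries a data steward might maintain (made up for this example).
GLOSSARY = [
    GlossaryTerm(
        "Active Customer",
        "A customer with at least one order in the last 12 months.",
        ["hive://sales.active_customers", "tableau://dashboards/customer-overview"],
    ),
    GlossaryTerm(
        "Net Revenue",
        "Gross revenue minus returns and discounts.",
        ["hive://finance.net_revenue"],
    ),
]


def search(glossary, keyword):
    """Return terms whose name or definition contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        term for term in glossary
        if needle in term.name.lower() or needle in term.definition.lower()
    ]


if __name__ == "__main__":
    for term in search(GLOSSARY, "customer"):
        print(term.name, "->", ", ".join(term.assets))
```

Like a food label, each entry tells the business user what the data means and where to find it before they consume it.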
Manage and Monitor Data Flows beyond Hadoop with Talend Metadata Manager
Gone are the days of thinking it was feasible to manage every data source in one place. Legacy systems are here to stay; enterprise apps such as those from Microsoft, SAP and Oracle will continue to operate core business processes; cloud applications will continue to proliferate; and traditional data warehouses and departmental BI will coexist with more modern data platforms like Hadoop for some time.
Not only does this increase the need for environments such as Talend Data Fabric to manage the data flows across those environments, but it also drives the need for a platform that provides a holistic view of the information supply chain, wherever data resides. Organizations that operate in heavily regulated environments go so far as to mandate these capabilities for their audit trails.
Talend Metadata Manager gives your organization visibility and control of metadata so you can manage risk and compliance in enterprise-wide integration projects with end-to-end traceability. Metadata Manager connects all your metadata, whether it’s managed in Hadoop, in Talend, or in potentially any data platform supported by the aforementioned Metadata Bridge, with a visual information supply chain that provides full data lineage and auditability. Talend then turns this holistic view into a language and data map that everyone understands, including business users and the people responsible for data integrity, usability, and compliance.
Over time, we will continue to share key findings from this TDWI report, and further elaborate on how to evolve from a traditional, authoritative, top-down approach to data governance to a modern, bottom-up, collaborative data governance structure where best practices can be crowdsourced from the people who know the data best. You can also register for our upcoming webinar, where David Stodder and I will present key takeaways from the report. I'd be happy to get your feedback as well, through this blog or through my Twitter account (@jmichel_franco).