How to create a business glossary on Talend Data Catalog using API Services and Data Stewardship
We often hear people talk about managing their data through a single means such as a data lake, MDM, or data governance. Modern data management is not only about managing data but also about making that data useful for the business. It is also about being able to relate frequently used business terminology to data in the systems. Many large enterprises spend months discovering and identifying the impact of a change on their entire data supply chain.
For example, a change to a business term can cause a domino effect on dependent systems. Companies usually spend a large amount of time identifying the impact of such a change on downstream systems, and even that does not guarantee a 100% success rate. Incorrect impact analysis usually results in breaks in the lineage and the propagation of incorrect information across the data supply chain. Talend Data Catalog is a one-source platform that can help you maintain a single source of truth, with the flexibility of using APIs to add or remove terms and relate them to data within the systems.
With the introduction of the Talend Data Catalog API, we can now use REST API calls to automate actions on business terms, such as creating, updating, and deleting them.
The sample job shown in Figure 5 demonstrates how to use Talend Data Catalog's REST APIs from a Talend DI Job to set attributes and custom attributes for new business terms in the Talend Data Catalog glossary model as needed. Additionally, we can rely on Talend Data Stewardship to accept or reject changes made to terms in the business glossary.
Figure 1 below shows the Swagger documentation of the available Talend Data Catalog APIs. As the documentation shows, these APIs provide a rich feature set for accessing and manipulating metadata content programmatically.
Talend Data Stewardship for business terms
Data stewardship plays a critical role in building a successful data-driven glossary across the enterprise. Data stewards make a significant contribution by cleaning, refining, and approving data. Talend provides a data stewardship portal and Studio components so that stewards can validate and approve the terms that should be part of the enterprise glossary. The work of a data steward is organized around two core concepts: campaigns and tasks. There are four types of campaigns: Arbitration, Resolution, Merging, and Grouping.
To begin, we will use a Resolution campaign and create a data model for it in the data stewardship portal. The data model needs attributes such as name, glossarypath, categorypath, and description, as shown in Figure 2. Data stewards can then explore the data that relates to their tasks and resolve the tasks one by one or for a whole set of records.
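As a rough illustration, the four attributes of the campaign's data model can be pictured as a simple record. The Python sketch below is not part of Talend: the class name `GlossaryTermRecord` and the sample values are assumptions made for this example, and only the attribute names come from the data model described above.

```python
from dataclasses import dataclass, asdict


@dataclass
class GlossaryTermRecord:
    """One candidate business term, mirroring the campaign's data model."""
    name: str
    glossarypath: str
    categorypath: str
    description: str


# Sample values are invented for illustration only.
record = GlossaryTermRecord(
    name="Customer",
    glossarypath="/Enterprise Glossary",
    categorypath="/Sales",
    description="A party that buys goods or services.",
)

# A plain dict is a convenient shape for passing the record to other steps.
record_dict = asdict(record)
```

Keeping the field names identical to the data model attributes avoids a separate mapping step when the record is later handed to a stewardship task.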
In the example below, we have created a new Talend Data Integration job that fetches business terms and their descriptions from a database and assigns them to data stewards for approval. As shown in Figure 3, we can leverage an enterprise database that has a predefined table of terms and their definitions and push those terms to data stewards for changes or approval, or we can supply the terms through a file.
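Outside of Talend Studio, the "fetch terms from a table" step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `business_terms`, its column layout, and the in-memory SQLite database are all hypothetical stand-ins for an enterprise database.

```python
import sqlite3


def fetch_terms(connection):
    """Return one dict per row of the (hypothetical) business_terms table."""
    cursor = connection.execute(
        "SELECT name, glossarypath, categorypath, description "
        "FROM business_terms ORDER BY name"
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# In-memory stand-in for an enterprise database (assumption for this example).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business_terms "
    "(name TEXT, glossarypath TEXT, categorypath TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO business_terms VALUES (?, ?, ?, ?)",
    [
        ("Customer", "/Enterprise Glossary", "/Sales",
         "A party that buys goods or services."),
        ("Churn", "/Enterprise Glossary", "/Marketing",
         "Loss of customers over a period."),
    ],
)

# Each record has the same keys as the stewardship data model attributes.
records = fetch_terms(conn)
```

Each resulting record would correspond to one task submitted to the stewardship campaign.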
In the tDataStewardshipTaskOutput component, enter the URL of the data stewardship portal and the corresponding user credentials. Create the schema columns so that they match the attributes of the data model defined in data stewardship. Select Resolution as the campaign type. You can assign the tasks to a particular steward or select “No Assignee” as shown below.
Create another job with components that connect to Talend Data Catalog through REST API calls, as shown in Figure 5. Then select the terms approved by data stewards and add them to the data catalog glossary. You can also export all terms as a CSV file using the export API call, and update or add custom attributes on the terms. Finally, close the REST API connection and, if you wish, delete the tasks in data stewardship.
As shown in Figure 6, create a tRestClient connection to access the data catalog portal. Provide the correct URL, set the HTTP method to GET, and set the Accept Type to JSON. Provide the query parameters user and password, and set forceLogin to “true”.
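To make the shape of this login call concrete, here is a minimal Python sketch that only builds the GET URL and does not send it. The query parameter names `user`, `password`, and `forceLogin` come from the step above; the host and path in the example are placeholders, so check the Swagger documentation of your own installation for the real login endpoint.

```python
from urllib.parse import urlencode


def build_login_url(login_endpoint, user, password):
    """Build the GET URL for a login call with the three query parameters."""
    query = urlencode({
        "user": user,
        "password": password,   # urlencode escapes special characters
        "forceLogin": "true",
    })
    return f"{login_endpoint}?{query}"


# Placeholder endpoint and credentials (assumptions for this example).
login_url = build_login_url(
    "https://tdc.example.com/placeholder/login",
    user="steward",
    password="p@ss word",
)
```

Building the query with `urlencode` rather than string concatenation keeps passwords that contain characters such as `@`, `&`, or spaces from corrupting the URL.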
Extract the JSON fields from the response and map the token to the “Session_token” column, so that the session token can be reused for later calls, as shown in Figure 7.
Store the access token in a global variable and add another key, “id”, that holds the object_id of the glossary in Talend Data Catalog, as shown in Figure 8.
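The two steps above (extract the token, then keep it together with the glossary id) can be sketched in Python as follows. The response layout `{"result": {"token": ...}}` and the id value are assumptions made for this example; inspect the actual login response to find where the token sits. The dictionary `global_map` simply stands in for the job's global variables.

```python
import json


def extract_session_token(response_body, token_path=("result", "token")):
    """Walk a JSON document along token_path and return the value found there."""
    value = json.loads(response_body)
    for key in token_path:
        value = value[key]  # raises KeyError if the layout differs
    return value


# Hypothetical login response; the real field names may differ.
sample_body = '{"result": {"token": "abc123"}}'

# Stand-in for the job's global variables: the token plus the glossary id.
global_map = {
    "Session_token": extract_session_token(sample_body),
    "id": "GLOSSARY_OBJECT_ID",  # placeholder for the glossary's object_id
}
```

Letting a missing key raise an error is deliberate here: continuing the job with an empty token would only fail later and less clearly.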
Stewards should select the tasks assigned to them, then approve or reject each one by clicking the corresponding row and validating their choice, as shown below in Figure 9.
Use the tDataStewardshipTaskInput component to select all the approved terms, either for a particular steward or for any assignee, whose State is set to “Resolved” or to a custom state such as “Ready to Publish”, as shown below in Figure 10.
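The selection logic in this step amounts to a filter on task state and, optionally, on assignee. The Python sketch below illustrates that filter on plain dictionaries; the task layout with `state` and `assignee` keys and the sample tasks are assumptions for the example, not the component's actual output schema.

```python
PUBLISHABLE_STATES = {"Resolved", "Ready to Publish"}


def select_approved(tasks, assignee=None):
    """Keep tasks in a publishable state, optionally for a single assignee."""
    return [
        task for task in tasks
        if task["state"].strip() in PUBLISHABLE_STATES
        and (assignee is None or task.get("assignee") == assignee)
    ]


# Invented sample tasks for illustration.
tasks = [
    {"name": "Customer", "state": "Resolved", "assignee": "alice"},
    {"name": "Churn", "state": "New", "assignee": "alice"},
    {"name": "Revenue", "state": "Ready to Publish", "assignee": "bob"},
]

all_approved = select_approved(tasks)
alice_approved = select_approved(tasks, assignee="alice")
```

Stripping whitespace from the state before comparing guards against values that were stored with a stray leading or trailing space.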
Use a tREST component to add a term to the data catalog glossary. As shown below in Figure 11, provide the correct URL, pass the API token as a parameter, and set the accept type to JSON.
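For readers who want to see what such a call might look like outside of Talend Studio, the Python sketch below prepares a request without sending it. Several details are assumptions rather than documented Talend Data Catalog behavior: the endpoint URL is a placeholder, the token parameter name `api_key` is hypothetical, and the use of POST with a JSON body is a guess at a typical "create" call. Consult the Swagger documentation for the real endpoint, method, and payload.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_add_term_request(endpoint, token, term, token_param="api_key"):
    """Prepare (but do not send) a JSON request that adds one glossary term."""
    url = f"{endpoint}?{urlencode({token_param: token})}"
    body = json.dumps(term).encode("utf-8")
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    # POST is an assumption; the real API may expect a different method.
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder endpoint, token, and term (assumptions for this example).
request = build_add_term_request(
    "https://tdc.example.com/placeholder/terms",
    token="abc123",
    term={"name": "Customer", "description": "A party that buys goods or services."},
)
```

Once the endpoint and payload are confirmed against your installation, the prepared request could be sent with `urllib.request.urlopen(request)`.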
Once you have added terms to the glossary, you can either export them as a CSV file using a tREST component, as shown in Figure 12, or add a custom attribute to a term, as shown in Figure 13.
In summary, the Talend Data Catalog REST API gives the business a lot of flexibility to populate business terminology into the Talend Data Catalog glossary by various means. It also provides a platform for incorporating data governance and regulatory compliance with the involvement of the right stakeholders and data stewards. This blog is a starting point for exploring ways to use the Talend Data Catalog REST APIs.