Talend Data Catalog provides intelligent data discovery that delivers a single source of trusted data into a centralized data catalog. Talend Data Catalog provides the capability for doing impact analysis and/or tracing lineage by harvesting Talend data integration Jobs. For example, you can find the use of a specific attribute or column from the source to the destination of the data flow within the scope of a Talend data integration Job.
This blog post explores the methods to design Talend data integration Jobs that help maximize the benefits of using Talend Data Catalog, focusing primarily on a file system (Windows) and relational databases as either data source or target storage for data. Please note, however, that Talend Data Catalog supports harvesting data from a variety of sources, and not just file systems and relational databases.
If you’re interested in basic guidelines on how to develop better Talend data integration Jobs, Dale Anderson has written a four-part series on Talend “Job Design Patterns” and Best Practices. This blog post offers some guidelines that are particularly relevant when it is necessary to harvest the DI Jobs in Talend Data Catalog.
Note: This article uses Talend Studio 7.1.1 and Talend Data Catalog version 7.1
Best practice #1: Using “Repository” Property Type Instead of “Built–In”
For most of the components in Talend Studio, there is a ‘Property Type’ attribute that allows the user to define a data input physical path source. This source could be defined as part of the component – as ‘Built-In’ OR defined as part of metadata repository – as ‘Repository’ (Repository helps in using the data input in other DI Jobs). For Talend Data Catalog, I suggest configuring the Property Type to ‘Repository’ for better results. Let take the example of the tMap component. The sample Job below is a very simple Job that stores data from a CSV file into MySQL database.
tMAp is expanded here:
Ensuring the tFileInputDelimited and tDBOutputs components have the Property Type defined as ‘Repository’ (as highlighted in image below). This is necessary for the correct lineage or impact trace.
Below is an example of what this would look like in Talend Data Catalog.
Below is a view of Data Flow to and from a Talend DI Job, with highlighted areas of connection from data source and data storage (target) models.
Lastly, let’s look at a working data impact analysis of column ‘lastname’ getting stored as ‘customername’ due to the tMap configuration in the Talend DI Job, highlighting the tMap concatenation function.
Best practice #2: Using data mapping specifications in custom code components like tJavaRow
Talend data integration components like tJava or tJavaRow allow users to write custom code to implement a choice of logic. This custom code could either break lineage or produce everything-to-everything dependencies (Cartesian product), in Talend Data Catalog. To avoid such consequences, it is recommended to create data mapping specifications in the Documentation section of the component. Here is an example explaining the recommended process, with a sample Job that uses tJavaRow for a simple conditional check of quantity ordered by a customer, given the quantity is less than 2 for any row assign null value (as less than is considered negligible in the given use case).
If the data mappings are not included, there could be different results depending on the custom code, either lineage could break or a cartesian product of mappings could form. The image below depicts the possible scenarios in case of missing mapping specifications.
Let’s understand the cartesian product with an example. Consider a Job without mappings — just a custom code for the example use case.
For this Job, without the mappings defined (and not all columns used for processing), results in a cartesian product. See the image below.
That, in turn, results in a lineage break.
You can add the data mapping specification to the Documentation section of the component, as highlighted in the image below.
This solves the lineage issue — refer to the images below.
Best practice #3: Using context variables for dynamically generated SQL queries
It is a very common practice to do string concatenations while forming SQL queries programmatically. Talend DI Jobs allow for such string concatenations in SQL related components like tDBRow and custom code components like tJavaRow. Depending on the component where the concatenation is done to form a SQL query, this could create difficulties for tracing lineage. The solution here is to use context variables for the dynamic part of the SQL query within SQL components instead of custom code. To understand this scenario better, let’s use an example Talend Integration Job. The Job below is a simple Job with only a tDBInput component using a context variable defined for table name — refer to the image below.
And the data lineage works well without issues.
The second sample Job contains a tJavaRow where the SQL query is generated with use concatenation, refer to the image below. This uses the same context variable for the table name as the first Job.
With tDBRow having the output row ‘sql’, this is defined for a query. Refer to the image below:
Though the data flow depicts links throughout, the data lineage breaks. Refer to the images below:
Best practice #4: Using files instead of lists for iterating through lists of datasets or data inputs
Consider a use case where a set of SQL queries need to be executed as part of a Talend data integration Job. Such a situation requires iterating through the list of SQL queries, either listed manually – for example using tFixedFlowInput component — or using an external template. This situation could break lineage as Talend Data Catalog cannot access SQL queries. It is a similar problem with a list of datasets.
The recommended approach here is to use tFileOutput components for saving individual SQL statements in separate SQL script files. And then, you harvest the SQL scripts to trace lineage or do impact analysis.
Let us walk through an example with a set of SQL update statements defined in tFixedFlowInput, that are executed using tDBRow. Notice the flow of the Job in the images below. There is no data input/output to the Talend DI Job here; this Job is intended to execute a set of SQL statements. For this kind of scenario, it is recommended to store the SQL statement into a .sql file, and harvest the formed SQL script for lineage.
Note that the tDBRow is deactivated in the images below. It is not necessary to have an activated component while harvesting the DI Job in Talend Data Catalog, though it is needed for successful DI Job execution.
There are different bridges supported by Talend Data Catalog to harvest SQL scripts written for a specific database like Teradata, Microsoft SQL Server, etc. We will use the Oracle bridge for the MySQL database used in this example. Refer to the images below with the import settings:
Though a DI Job without the .sql file would execute the SQL statements (and implement a use case), it would be difficult to trace lineage for it. Refer to the image below:
To conclude, these four guidelines are basic ways to start a better data governance journey with Talend Data Catalog as well as effective use of Talend Data Catalog. There are other best practices that might be useful, depending on use case–to–use case or particular scenario–to–scenarios. Sometimes certain guidelines help with a particular use case, perhaps for a successfully tracing lineage or doing an impact analysis, and at other times there could be a set of different practices for a different use case. This blog is a starting point for exploring ways to make use of Talend Data Catalog for harvesting Talend data integration Jobs.