
How to Implement a Job Metadata Framework using Talend


  • Vijay Madduru
    Vijay is a Strategic Architect at Talend who has implemented several Enterprise Information Management solutions spanning technologies such as Big Data, Master Data Management, Data Integration, Data Quality, and Enterprise Data Warehousing. Previously, he held senior leadership positions in product development and professional services at leading business intelligence and data management product companies.

Today, data integration projects are not just about moving data from point A to point B; there is much more to them. The ever-growing volume of data and the speed at which it changes present many challenges in managing the end-to-end data integration process. To address these challenges, it is paramount to track the data's journey from source to target in terms of start and end timestamps, job status, business area, subject area, and the individuals responsible for a specific job. In other words, metadata is becoming a major player in data workflows. In this blog, I want to review how to implement a job metadata framework using Talend. Let’s get started!

Metadata Framework: What You Need to Know

Centralized management and monitoring of this job metadata is crucial to data management teams. An efficient and flexible job metadata framework rests on two things: a metadata-driven model and the job metadata itself.

A typical Talend Data Integration job performs the following tasks to extract data from source systems and load it into target systems:

  1. Extracting data from source systems
  2. Transforming the data, which involves:
    • Cleansing source attributes
    • Applying business rules
    • Data quality checks
    • Filtering, sorting, and deduplication
    • Data aggregations
  3. Loading the data into the target systems
  4. Monitoring, logging, and tracking the ETL process

Figure 1: ETL process
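The four steps above can be sketched as a generic run wrapper that records basic job metadata around the extract, transform, and load calls. This is illustrative Python only, not Talend-generated code; the function and field names are my own assumptions.

```python
# Illustrative sketch: a generic ETL run wrapper that records job metadata
# (start/end timestamps, status, duration) around the extract/transform/load
# steps. All names here are hypothetical, not Talend APIs.
from datetime import datetime

def run_etl_job(job_name, extract, transform, load):
    """Run one ETL job and return a dict of its run metadata."""
    run = {"job_name": job_name, "start": datetime.now(), "status": "Pending"}
    try:
        rows = extract()        # 1. extract from source systems
        rows = transform(rows)  # 2. cleanse, apply rules, dedupe, aggregate
        load(rows)              # 3. load into the target systems
        run["status"] = "Complete"
    finally:
        run["end"] = datetime.now()  # 4. track the process end-to-end
        run["duration_minutes"] = (run["end"] - run["start"]).total_seconds() / 60
    return run
```

In a real deployment the returned metadata would be persisted to the run-details table described later, rather than kept in memory.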

Over the past few years, job metadata has evolved into an essential component of any data integration project. What happens when you don’t have job metadata in your data integration jobs? You get incorrect ETL statistics and logging, and errors that occur during the data integration process become difficult to handle. A successful Talend Data Integration project depends on how well the job metadata framework is integrated with the enterprise data management process.

Job Metadata Framework

The job metadata framework is a metadata-driven model that integrates well with the Talend product suite. Talend provides a set of components for capturing statistics and logging information while the data integration process is in flight.

Remember, the primary objective of this blog is to provide an efficient way to manage the ETL operations with a customizable framework. The framework includes the Job management data model and the Talend components that support the framework.

Figure 2: Job metadata model

Primarily, the Job Metadata Framework model includes:

  • Job Master
  • Job Run Details
  • Job Run Log
  • File Tracker
  • Database High Water Mark Tracker for extracting the incremental changes

This framework is designed to allow production support teams to monitor job refresh cycles and investigate issues such as job failures and discrepancies in the data loads. Let’s go through each piece of the framework step by step.

Talend Jobs

Talend_Jobs is a Job Master Repository table that manages the inventory of all the jobs in the Data Integration domain.

| Attribute | Description |
| --- | --- |
| JobID | Unique identifier for a specific job |
| JobName | The name of the job per the naming convention (<type>_<subject area>_<table_name>_<target_destination>) |
| BusinessAreaName | Business unit / department or application area |
| JobAuthorDomainID | Job author information |
| Notes | Additional information related to the job |
| LastUpdateDate | The last updated date |
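As a concrete sketch, the Job Master table above could be created as follows. SQLite is used purely for illustration, and the column types are my assumptions; the blog does not prescribe a DDL.

```python
# Hypothetical DDL for the Talend_Jobs master table described above,
# using SQLite for illustration. Column types are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Talend_Jobs (
        JobID             INTEGER PRIMARY KEY,
        JobName           TEXT NOT NULL,  -- <type>_<subject area>_<table_name>_<target_destination>
        BusinessAreaName  TEXT,
        JobAuthorDomainID TEXT,
        Notes             TEXT,
        LastUpdateDate    TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# Register one hypothetical job in the inventory.
conn.execute(
    "INSERT INTO Talend_Jobs (JobName, BusinessAreaName) VALUES (?, ?)",
    ("DI_Sales_Orders_DW", "Sales"),
)
```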

Talend Job Run Details

Talend_Job_Run_Details registers every run of a job and its sub jobs, with statistics and run details such as job status, start time, end time, and the total duration of the main job and its sub jobs.

| Attribute | Description |
| --- | --- |
| ID | Unique identifier for a specific job run |
| BusinessAreaName | Business unit / department or application area |
| JobAuthorDomainID | Job author information |
| JobID | Unique identifier for a specific job |
| JobName | The name of the job per the naming convention (<type>_<subject area>_<table_name>_<target_destination>) |
| SubJobID | Unique identifier for a specific sub job |
| SubJobName | The name of the sub job per the naming convention (<type>_<subject area>_<table_name>_<target_destination>) |
| JobStartDate | Main job start timestamp |
| JobEndDate | Main job end timestamp |
| JobRunTimeMinutes | Main job total execution duration |
| SubJobStartDate | Sub job start timestamp |
| SubJobEndDate | Sub job end timestamp |
| SubJobRunTimeMinutes | Sub job total execution duration |
| SubJobStatus | Sub job status (Pending / Complete) |
| JobStatus | Main job status (Pending / Complete) |
| LastUpdateDate | The last updated date |
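One common pattern for populating this table is to insert a "Pending" row when the job starts and update it with the end timestamp, duration, and final status when the job completes. The sketch below shows that pattern against a subset of the columns; it is a hypothetical illustration, not Talend-supplied code.

```python
# Hypothetical sketch: registering a run in Talend_Job_Run_Details
# (subset of columns). A 'Pending' row is inserted at job start and
# updated with end timestamp, duration, and status on completion.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Talend_Job_Run_Details (
        ID                INTEGER PRIMARY KEY,
        JobID             INTEGER,
        JobName           TEXT,
        JobStartDate      TEXT,
        JobEndDate        TEXT,
        JobRunTimeMinutes REAL,
        JobStatus         TEXT
    )
""")

def start_run(job_id, job_name):
    """Insert a 'Pending' run row; return its run ID and start time."""
    start = datetime.now()
    cur = conn.execute(
        "INSERT INTO Talend_Job_Run_Details "
        "(JobID, JobName, JobStartDate, JobStatus) VALUES (?, ?, ?, 'Pending')",
        (job_id, job_name, start.isoformat()),
    )
    return cur.lastrowid, start

def finish_run(run_id, start):
    """Mark the run Complete, with end timestamp and duration in minutes."""
    end = datetime.now()
    conn.execute(
        "UPDATE Talend_Job_Run_Details SET JobEndDate = ?, "
        "JobRunTimeMinutes = ?, JobStatus = 'Complete' WHERE ID = ?",
        (end.isoformat(), (end - start).total_seconds() / 60, run_id),
    )
```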

Talend Job Run Log

Talend_Job_Run_Log logs all errors that occur during a particular job execution. It extracts the details from the Talend components specially designed for catching logs (tLogCatcher) and statistics (tStatCatcher).

Figure 3: Error logging and Statistics

The tLogCatcher component in Talend operates as a log function triggered by Java exceptions or by the tDie or tWarn components during processing. In order to catch exceptions coming from the job, the tCatch function needs to be enabled on all the components.

The tStatCatcher component gathers the job processing metadata at the job level.

| Attribute | Description |
| --- | --- |
| runID | Unique identifier for a specific job run |
| JobID | Unique identifier for a specific job |
| Moment | The time when the message is caught |
| Pid | The process ID of the Job |
| parent_pid | The parent process ID |
| root_pid | The root process ID |
| system_pid | The system process ID |
| project | The name of the project |
| Job | The name of the Job |
| job_repository_id | The ID of the Job file stored in the repository |
| job_version | The version of the current Job |
| context | The name of the current context |
| priority | The priority sequence |
| Origin | The name of the component, if any |
| message_type | Begin or End |
| message | The error message generated by the component when an error occurs. This is an After variable and functions only if the "Die on error" check box is cleared. |
| Code | The error or warning code |
| duration | Execution time of a Job or component with the "tStatCatcher Statistics" check box selected |
| Count | Record counts |
| Reference | Job references |
| Thresholds | Log thresholds for managing error-handling workflows |
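To make the log-catching idea concrete, here is a minimal Python analogue of the pattern: run a step, and on failure record the origin component, message, and timestamp into a run log mirroring a few of the columns above. This is a hedged illustration of the concept, not how tLogCatcher is implemented.

```python
# Hypothetical analogue of the tLogCatcher pattern: execute a step and,
# if it raises, append an error entry (moment, origin, message) to a
# run-log list mirroring a few columns of Talend_Job_Run_Log.
from datetime import datetime

run_log = []

def logged_step(origin, func, *args):
    """Execute a step; append an error entry to run_log if it raises."""
    try:
        return func(*args)
    except Exception as exc:
        run_log.append({
            "moment": datetime.now().isoformat(),
            "origin": origin,               # component name, e.g. "tMap_1"
            "message": str(exc),            # the caught error message
        })
        return None
```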

Talend High Water Mark Tracker

Talend_HWM_Tracker helps in processing delta and incremental changes for a particular table. The high water mark tracker is helpful when Change Data Capture is not enabled and changes are extracted based on specific columns such as "last_updated_date_time" or "revision_date_time". In some cases, the high water mark relates to the highest sequence number when records are processed based on a sequence number.

| Attribute | Description |
| --- | --- |
| Id | Unique identifier for a specific source table |
| jobID | Unique identifier for a specific job |
| job_name | The name of the Job |
| table_name | The name of the source table |
| environment | The source table environment |
| database_type | The source table database type |
| hwm_datetime | High water mark field (datetime) |
| hwm_integer | High water mark field (number) |
| hwm_Sql | High water mark SQL statement |
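The high-water-mark extraction logic itself is simple: read the stored mark, pull only the rows changed after it, and advance the mark to the maximum value just processed. The sketch below illustrates that logic with hypothetical row dicts; it is not a Talend component.

```python
# Illustrative high-water-mark incremental extract (not a Talend API):
# select rows changed after the stored mark, then advance the mark to
# the maximum change timestamp just processed.
def incremental_extract(rows, hwm):
    """Return (rows changed after hwm, new high water mark)."""
    changed = [r for r in rows if r["last_updated"] > hwm]
    new_hwm = max((r["last_updated"] for r in changed), default=hwm)
    return changed, new_hwm
```

On the next run, `new_hwm` would be read back from Talend_HWM_Tracker, so each extract picks up only the delta since the previous run.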

Talend File Tracker

Talend_File_Tracker registers all the transactions related to file processing. The transaction details include source file location, destination location, file name pattern, file name suffix, and the name of the last file processed.

| Attribute | Description |
| --- | --- |
| Id | Unique identifier for a specific source file |
| jobID | Unique identifier for a specific job |
| job_name | The name of the Job |
| environment | The file server environment |
| file_name_pattern | The file name pattern |
| file_input_location | The source file location |
| file_destination_location | The target file location |
| file_suffix | The file suffix |
| latest_file_name | The name of the last file processed for a specific file |
| override_flag | The override flag to re-process the file with the same name |
| update_datetime | The last updated date |
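The decision the file tracker enables can be sketched in a few lines: a file is processed only if its name differs from the last one processed, or the override flag is set to allow re-processing a file with the same name. The function and key names below are hypothetical.

```python
# Hypothetical sketch of the file-tracker check described above.
def should_process(file_name, tracker):
    """tracker: dict with 'latest_file_name' and 'override_flag' keys."""
    if tracker.get("override_flag"):
        return True  # override set: re-process even a same-named file
    return file_name != tracker.get("latest_file_name")
```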

Conclusion

This brings us to the end of implementing a job metadata framework using Talend. The following are the key takeaways from this blog:

  1. The need for, and importance of, a job metadata framework
  2. The data model to support the framework
  3. The customizability of the data model to support different types of job patterns

As always – let me know if you have any questions below and happy connecting!
