
Provisioning and executing Talend ETL Jobs on Serverless platforms using Apache Airflow


  • Ramkumar Chinta
    Ramkumar Chinta is a Customer Success Engineer at Talend, and his core expertise is in ESB, Data Integration and Cloud technologies. He has more than 13 years of IT experience and is a Certified Java/J2EE Developer. His areas of interest also include Web services, Microservices, Docker & Kubernetes. Prior to joining Talend, Ramkumar worked on the design and development of business solutions for leading Telcos at Techmahindra India Pvt Ltd.

Talend 7.1 brings many new features. One that deserves special discussion is containerization, which opens the door to designing and implementing new architectures, notably microservices. With a Maven plugin, the Studio can now build and publish standard Jobs as Docker images to a Docker registry. Talend Jobs built as Docker images give us portability and ease of deployment on Linux, Windows and macOS, and let us choose between on-premises and cloud infrastructures.

This blog illustrates, with two examples, how we can run containerized Talend ETL Jobs on the Amazon cloud leveraging the Container Services (CaaS) EKS and Fargate. If you are interested in containerizing Talend microservices and orchestrating them on Kubernetes, please read my KB article.

Scheduling and Orchestrating containerized Talend Jobs with Airflow

While we are comfortable running our containerized Talend Jobs with the docker run command or as Container Services on ECS, more complex pipelines raise additional challenges. We need to

  • run several jobs with a specific dependency/relationship
  • run jobs sequentially or in parallel
  • conditionally skip or retry jobs depending on whether an upstream job succeeded or failed
  • monitor running/failed jobs
  • monitor the execution times of all the tasks across several runs

Airflow is an open source project for programmatically creating complex workflows as directed acyclic graphs (DAGs) of tasks. It offers a rich user interface that makes it easy to visualize complex pipelines and the tasks within them (our Talend jobs/containers), and to monitor and troubleshoot those tasks.

 

 Publish Talend ETL Jobs to Amazon ECR

In Talend Studio, I created two standard ETL Jobs, tmap_1 and tmap_2, and published them to Amazon Elastic Container Registry (ECR).

Title: Talend jobs published to Amazon ECR

 

Example 1:

Provision and execute ETL Jobs on Amazon EKS

The Airflow KubernetesPodOperator provides integration with Kubernetes using the Kubernetes Python client library. The operator communicates with the Kubernetes API server, generates a request to provision a container on the Kubernetes cluster, launches a Pod, executes the Talend job, monitors it, and terminates the Pod upon completion.


Title: Logical Architecture

 

A simple DAG with two Talend Jobs, tmap_1 and tmap_2

Title: DAG with KubernetesOperator

 

A Graph view of DAG run with Kubernetes tasks and execution status

Title: DAG Graph view with Execution status

 

Kubernetes Dashboard with Talend Jobs

Title: Kubernetes Dashboard - Pods overview

 

Example 2:

Provision and execute standard ETL Jobs on Amazon Fargate

The Airflow ECSOperator launches and executes a task on an ECS cluster. In this blog, the Fargate launch type is discussed since it supports the pay-as-you-go model.


Title: Logical Architecture

In the previous example with the KubernetesPodOperator, we defined our task, and where it should pull the Docker image from, directly in our DAG. The ECSOperator, however, eases our effort by retrieving Task Definitions directly from the Amazon ECS service.

Using the AWS console, I created two Task Definitions with the Fargate launch type and the repository URLs of my Docker images tmap_1 and tmap_2.
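For reference, the essential fields of such a Fargate Task Definition look roughly like the fragment below; the account id, execution role ARN and repository URL are placeholders, and the cpu/memory values are assumptions for a small job.

```json
{
  "family": "Task_tmap_1",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "tmap_1",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/tmap_1:latest",
      "essential": true
    }
  ]
}
```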


Title: Fargate Task Definitions

Then I created my DAG in Airflow leveraging the ECSOperator.

Observe that the definition of the DAG is much simpler and requires only the name of the task_definition, which was created in ECS in the previous step. The operator communicates with ECS using the cluster name and subnet settings.


Title: DAG with ECSOperator

 

A Graph view of DAG run with ECS tasks and execution status


Title: DAG Graph view with Execution status

 

When the DAG runs, the operator provisions containers for tmap_1 and tmap_2, executes the jobs, and after completion stops and deprovisions the containers.


Title: Running container - Task_tmap_1  


Title: Provisioning container - Task_tmap_2

 

Conclusion

Thanks to the Maven and assembly features of Talend Studio, we can also build our Talend Jobs as Docker images. The two examples above illustrate how, with orchestration tools like Airflow, we can construct complex workflows from containerized jobs and provision and deprovision containers on EKS and Fargate without worrying about managing the infrastructure.

 

References

Airflow on Kubernetes (Part 1): A Different Kind of Operator

Airflow concepts

Docker Apache Airflow

 
