Thanks to all the dedicated and avid readers who have been following my journey transitioning from Informatica PowerCenter to Talend. If you are new here and haven’t read my previous posts, you can start with “Part - 1” and “Part - 2”. The first two parts of this series provided an overview of the architectural differences between PowerCenter and Talend and of what goes on under the hood when the data is processed. They also provided a primer on Talend Studio – our unified development environment – and a mapping between some of the most popular PowerCenter transformations and their equivalent Talend components.
Based on the feedback I have received so far, the third part of this series will focus on SDLC; scalability & high availability; and parallelization. Each of these is a big topic in its own right, and my goal is not to drill down into the details. It is my sincere hope, though, that after reading this blog you will be able to appreciate how Talend’s architecture allows you to automate, deploy, scale and parallelize your data integration platform.
SDLC (software development life cycle) is a software engineering process that divides software development work into distinct phases to improve design, product management and project management. There are various SDLC methodologies that you will be familiar with as a software developer – waterfall, iterative, prototyping, and most recently agile. Once you pick your SDLC process as an organization, the next step is to automate as much of it as possible. The goal is to get the software into the hands of QA testers, or of end users in the production environment, as soon as possible after the developer has completed their work. Related concepts such as build automation, continuous integration (the ability for multiple developers to integrate their code with that of others more frequently) and version control systems help manage this automation. The aim is to establish a regular, continuous build and deployment cycle followed by automated end-to-end testing that verifies the current code base.
PowerCenter provides its own proprietary version control system, but it falls short in the areas of automated testing and continuous integration and deployment. Talend provides native integration with industry-standard version control systems like Git and SVN. If your organization is already invested in Git or SVN, you can use that single repository to store the code for your data integration jobs as well as other artifacts. Also, your developers don’t need to learn another version control system just for the data integration jobs they build.
Talend also provides complete support for Continuous Integration and Deployment which makes the entire SDLC process much more efficient. You can read all about it here – Talend Software Development Life Cycle Best Practices Guide.
One of the key considerations in selecting a data integration platform is its ability to scale. As the number of sources or targets increases, or the volume of data being processed grows exponentially, the architecture of the platform should accommodate that growth without much administrative overhead.
If you look at the simplified versions of the PowerCenter and Talend architectures (Figure 1), you will notice the similarities. PowerCenter’s Integration Service performs the heavy-lifting integration operations that move data from sources to targets. The Job Server in the Talend architecture performs this same function. It is this component that needs to scale with an increasing number of sources and targets and with growing data volumes.
Scalability can be achieved in multiple ways (Figure 2):
The pros and cons of the above-mentioned scaling approaches are discussed extensively on the Internet, and you should spend time researching and understanding them. You need to pick the right option based on your requirements and budget.
Talend also provides the option of grouping a set of physical servers into a “virtual server” (Figure 2). A virtual server is a group of physical servers from which the best-rated server is automatically preferred at Job execution time. Once you assign an execution task to a virtual server, Talend determines the best physical server to execute the task and sends the request there. The decision on which server to pick is based on a rating system that uses information on the CPU, RAM and disk usage of each of these physical servers.
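To make the idea of a rating-based pick more concrete, here is a minimal sketch in Python. It is not Talend's actual rating algorithm, which this post does not describe; the weights, the scoring formula, the `ServerStats` type and the server names are all hypothetical and only illustrate how CPU, RAM and disk usage could be combined into a single rating.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Usage snapshot for one physical server (hypothetical structure)."""
    name: str
    cpu_used: float   # fraction of CPU in use, 0.0 to 1.0
    ram_used: float   # fraction of RAM in use, 0.0 to 1.0
    disk_used: float  # fraction of disk in use, 0.0 to 1.0


def rating(server, weights=(0.5, 0.3, 0.2)):
    """Weighted share of free resources; higher is better.

    The weights are an assumption made for this sketch, not Talend's values.
    """
    w_cpu, w_ram, w_disk = weights
    return (w_cpu * (1 - server.cpu_used)
            + w_ram * (1 - server.ram_used)
            + w_disk * (1 - server.disk_used))


def pick_server(servers):
    """Return the best-rated physical server in the virtual server group."""
    if not servers:
        raise ValueError("a virtual server needs at least one physical server")
    return max(servers, key=rating)


# A "virtual server" made of three physical servers (made-up numbers).
virtual_server = [
    ServerStats("jobserver-a", cpu_used=0.90, ram_used=0.70, disk_used=0.40),
    ServerStats("jobserver-b", cpu_used=0.20, ram_used=0.35, disk_used=0.50),
    ServerStats("jobserver-c", cpu_used=0.55, ram_used=0.60, disk_used=0.10),
]

best = pick_server(virtual_server)
print(best.name)  # jobserver-b has the most free capacity under these weights
```

With these made-up numbers the lightly loaded jobserver-b wins. In a real deployment the rating would be recomputed at execution time, so the preferred server can change from one run to the next.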
If you have additional grid computing requirements, you should consider using our Big Data Platform that leverages Hadoop for grid, clustering and high availability.
Parallelization can be implemented to (i) meet high throughput requirements, and/or (ii) troubleshoot processing bottlenecks. There are several mechanisms to enable code parallelization in Talend and they fall into one of two categories:
Let’s look at the options to parallelize process flows –
You can achieve parallelization of data flows in a couple of different ways –
If you want to read up on the details of all these options, you can find them in Talend Help, or ask your CSM for a link to the expert session recording of a deep dive into parallelization.
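The difference between parallelizing process flows and parallelizing data flows can be sketched in a few lines of Python. This is a conceptual illustration only and does not use Talend components; the function names (`load_customers`, `load_orders`, `transform`) and the round-robin partitioning are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# --- Process-flow parallelization: independent subjobs run side by side ---

def load_customers():
    return "customers loaded"   # stand-in for an independent subjob


def load_orders():
    return "orders loaded"      # stand-in for another independent subjob


def run_process_flows_in_parallel(subjobs):
    """Start every subjob at once and wait for all of them to finish."""
    with ThreadPoolExecutor(max_workers=len(subjobs)) as pool:
        futures = [pool.submit(job) for job in subjobs]
        return [f.result() for f in futures]


# --- Data-flow parallelization: one flow, rows split across workers ---

def partition(rows, n):
    """Round-robin split of the rows into n chunks."""
    return [rows[i::n] for i in range(n)]


def transform(chunk):
    """Stand-in for the per-row work done inside a single data flow."""
    return [row * 2 for row in chunk]


def run_data_flow_in_parallel(rows, workers):
    """Partition the rows, transform each chunk in parallel, then merge."""
    chunks = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform, chunks))
    # Merge the partial results and restore a deterministic order.
    return sorted(value for chunk in results for value in chunk)


print(run_process_flows_in_parallel([load_customers, load_orders]))
print(run_data_flow_in_parallel([1, 2, 3, 4, 5], workers=2))
```

In the first case the unit of parallelism is a whole subjob; in the second it is a slice of the rows moving through one flow, which is why a partition step and a merge step appear around the transformation.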
For PowerCenter developers moving to Talend, here’s a guide that maps PowerCenter parallelization options to Talend.
That concludes this blog series on my journey from PowerCenter to Talend. It’s been a very positive experience and I have been able to appreciate the benefits of embracing a modern data integration paradigm. Hope you had as much fun reading it as I had putting it together. I would love to hear from you – so feel free to leave your thoughts and comments below.