Thanks to all the dedicated readers who have been following my journey of transitioning from Informatica PowerCenter to Talend. If you are new here and haven’t read any of my previous posts, you can start with “Part – 1” and “Part – 2”. The first two parts of this series provided an overview of the architectural differences between PowerCenter and Talend and of what goes on under the hood when the data is processed. They also provided a primer on Talend Studio – our unified development environment – and a mapping between some of the most popular PowerCenter transformations and their equivalent Talend components.
Based on the feedback I have received so far, the third part of this series will focus on SDLC; scalability & high availability; and parallelization. Each of these is a big topic in its own right and my goal is not to drill down into the details. It is, however, my sincere hope that after reading this blog you will be able to appreciate how Talend’s architecture allows you to automate, deploy, scale and parallelize your data integration platform.
Software Development Life Cycle a.k.a. SDLC
SDLC is a software engineering process that divides the software development work into distinct phases to improve design, product management and project management. There are various SDLC methodologies that you will be familiar with as a software developer – waterfall, iterative, prototyping, and most recently agile. Once you pick your SDLC process as an organization, the next step is to automate as much of it as possible. The goal is to get the software into the hands of QA testers, or of end users in the production environment, as soon as possible after the developer has completed their work. Related concepts such as build automation, continuous integration (the ability for multiple developers to integrate their code with that of others more frequently) and version control systems help manage this automation. The aim is to design a regular and continuous build and deployment followed by automated end-to-end testing to verify the current code base.
PowerCenter provides its own proprietary version control system, but it falls short in the areas of automated testing and continuous integration and deployment. Talend provides native integration with industry-standard version control systems like Git and SVN. If your organization is already invested in Git or SVN, you can use that single repository to store the code for your data integration jobs as well as other artifacts. Also, your developers don’t need to learn another version control system just for the data integration jobs they build.
Talend also provides complete support for Continuous Integration and Deployment which makes the entire SDLC process much more efficient. You can read all about it here – Talend Software Development Life Cycle Best Practices Guide.
Scalability and High Availability
One of the key considerations in selecting a data integration platform is its ability to scale. As the number of sources or targets increases, or the volume of data being processed grows, the architecture of the platform should accommodate that growth without much administrative overhead.
If you look at the simplified versions of the PowerCenter and Talend architectures (Figure 1), you will notice the similarities. PowerCenter’s Integration Service performs the heavy-lifting integration operations that move data from sources to targets. The Job Server in the Talend architecture performs this same function. It is this component that needs to scale with an increasing number of sources, targets and data volumes.
Scalability can be achieved in multiple ways (Figure 2):
- Vertical Scaling – replacing the current server on which the Job Server runs with a bigger and more robust server with more processors and more RAM, or
- Horizontal Scaling – adding more similar servers that each run one or more Job Servers, or
- a combination of the two above
The pros and cons of these scaling approaches are discussed extensively on the Internet, and you should spend time researching and understanding them. You need to pick the right option based on your requirements and budget.
Talend also provides the option of grouping a set of physical servers into a “virtual server” (Figure 2). A virtual server is a group of physical servers from which the best-rated server is automatically preferred at Job execution time. Once you assign an execution task to a virtual server, Talend determines the best physical server to execute the task and sends the request there. The decision on which server to pick is based on a rating system that uses information on the CPU, RAM and disk usage of each of these physical servers.
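To make the rating idea concrete, here is a minimal Java sketch of picking the best-rated server from a group. It is not Talend’s actual algorithm: the `JobServer` class, the `rating` formula and its weights are all assumptions made up for illustration, since this post only says that the rating draws on CPU, RAM and disk usage.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: Talend's real rating formula is not described in this post.
class VirtualServerSketch {

    // Hypothetical snapshot of one physical server's resource usage (0.0 to 1.0).
    static final class JobServer {
        final String name;
        final double cpuUsage;
        final double ramUsage;
        final double diskUsage;

        JobServer(String name, double cpuUsage, double ramUsage, double diskUsage) {
            this.name = name;
            this.cpuUsage = cpuUsage;
            this.ramUsage = ramUsage;
            this.diskUsage = diskUsage;
        }
    }

    // Assumed rating: weighted free capacity, higher is better. The weights are invented.
    static double rating(JobServer s) {
        return 0.5 * (1 - s.cpuUsage) + 0.3 * (1 - s.ramUsage) + 0.2 * (1 - s.diskUsage);
    }

    // Returns the server with the highest rating in the group.
    static JobServer pickBest(List<JobServer> servers) {
        return servers.stream()
                .max(Comparator.comparingDouble(VirtualServerSketch::rating))
                .orElseThrow(() -> new IllegalArgumentException("virtual server has no physical servers"));
    }

    public static void main(String[] args) {
        List<JobServer> virtualServer = Arrays.asList(
                new JobServer("js-1", 0.90, 0.60, 0.40),
                new JobServer("js-2", 0.20, 0.30, 0.50),
                new JobServer("js-3", 0.50, 0.80, 0.10));
        System.out.println("Run task on: " + pickBest(virtualServer).name);
    }
}
```

With the sample numbers above, `js-2` has the most free capacity and is the one selected. In a real deployment the usage figures would be refreshed before each execution request rather than hard-coded.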
If you have additional grid computing requirements, you should consider using our Big Data Platform that leverages Hadoop for grid, clustering and high availability.
Parallelization
Parallelization can be implemented to (i) meet high throughput requirements, and/or (ii) troubleshoot processing bottlenecks. There are several mechanisms to enable code parallelization in Talend and they fall into one of two categories:
- Process Flow Parallelization – the ability to run multiple jobs/subjobs in parallel
- Data Flow Parallelization – the ability to break down a set of rows within a subjob into smaller sets – each of which can be processed by a separate thread or process.
Let’s look at the options to parallelize process flows –
- Execution Plan – multiple jobs/tasks can be configured to run in parallel from the TAC. Each task in the plan can run on its own job server.
- Multiple Job Flows – once you enable “Multi-Thread Execution” in your job, all subjobs run in parallel.
- tParallelize component – this is very similar to the previous option, except that the parallel execution is orchestrated by a component rather than a job-wide setting. The key advantage of this approach is that it gives you control over which parts of your job execute in parallel.
- Parent/Child Jobs – you can use the tRunJob component to call a child job. This child job can run on its own JVM if the “Use an independent process to run subjob” option is selected.
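Since Talend jobs compile down to Java, the general idea behind running subjobs in parallel can be pictured with plain Java threads. The sketch below is not the code Talend Studio generates; the `subjob` helper and the job names are hypothetical stand-ins, and a standard `ExecutorService` plays the role of the parallel orchestration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: not the code Talend generates for multi-threaded jobs.
class ParallelSubjobsSketch {

    // Hypothetical stand-in for a subjob: returns a status string when it finishes.
    static Callable<String> subjob(String name) {
        return () -> {
            // The real read/transform/write work would happen here.
            return name + " done";
        };
    }

    // Starts every subjob on its own thread and waits until all of them have finished.
    static List<String> runInParallel(List<Callable<String>> subjobs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, subjobs.size()));
        try {
            List<String> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the submitted subjobs.
            for (Future<String> f : pool.invokeAll(subjobs)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("a subjob failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> subjobs = Arrays.asList(
                subjob("load_customers"), subjob("load_orders"), subjob("load_products"));
        System.out.println(runInParallel(subjobs));
    }
}
```

The thread pool here corresponds loosely to running subjobs side by side inside one JVM. The independent-process option of tRunJob is different in kind: the child job gets its own JVM, which isolates memory at the cost of process start-up overhead.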
You can achieve parallelization of data flows in a couple of different ways –
- Auto Parallel – you can right-click on a component and choose “Set Parallelization”. This prompts you to select the number of threads and (optionally) a hash key for partitioning the data.
- Manual Parallel – you can use Talend components – tPartitioner, tCollector, tDepartitioner and tRecollector – to achieve the same thing as auto parallel above.
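The partition, process and collect pattern behind both options can be sketched in a few lines of Java. This is an illustration of the general technique under stated assumptions, not the internals of tPartitioner or tRecollector: the rows are plain strings, the whole row serves as the hash key, and upper-casing stands in for the real transformation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a generic hash-partitioning sketch, not Talend's implementation.
class HashPartitionSketch {

    // Splits rows into one bucket per thread using the hash of the key (here, the row itself).
    static List<List<String>> partition(List<String> rows, int threads) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String row : rows) {
            // floorMod keeps the bucket index non-negative even for negative hash codes.
            buckets.get(Math.floorMod(row.hashCode(), threads)).add(row);
        }
        return buckets;
    }

    // Processes each bucket on its own thread, then merges the partial results.
    static List<String> processInParallel(List<String> rows, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> bucket : partition(rows, threads)) {
                futures.add(pool.submit(() -> {
                    List<String> out = new ArrayList<>();
                    for (String row : bucket) {
                        out.add(row.toUpperCase()); // placeholder transformation
                    }
                    return out;
                }));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("a partition failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("alice", "bob", "carol", "alice", "dave");
        System.out.println(processInParallel(rows, 3));
    }
}
```

Hashing on a key guarantees that rows with the same key land in the same bucket, which matters for steps such as aggregation or deduplication. The merged output is grouped by bucket, so the original row order is not preserved.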
If you want to read up on the details of all these options, you can find them in Talend Help, or ask your CSM for a link to the recording of an expert session that takes a deep dive into parallelization.
For PowerCenter developers moving to Talend, here’s a guide that maps PowerCenter parallelization options to Talend.
That concludes this blog series on my journey from PowerCenter to Talend. It’s been a very positive experience and I have been able to appreciate the benefits of embracing a modern data integration paradigm. Hope you had as much fun reading it as I had putting it together. I would love to hear from you – so feel free to leave your thoughts and comments below.