Talend Performance Tuning Strategy
As a Customer Success Architect with Talend, I spend a significant amount of my time helping customers with optimizing their data integration tasks – both on the Talend Data Integration Platform and the Big Data Platform. While most of the time the developers have a robust toolkit of solutions to address different performance tuning scenarios, a common pattern I notice is that there is no well-defined strategy for addressing root causes for performance issues. Sometimes not having a strategy fixes some of the immediate issues but in the longer term, the same performance issues resurface because the core issues in the original design were not addressed. And that’s why I recommend customers to have a structured approach to performance tuning of their data integration tasks. One of the key benefits of having a strategy is that it is repeatable – irrespective of what your data integration tasks do, how simple or complicated they are, and the volume of data that is being moved as part of the integration.
Where is the bottleneck?
The first step in a performance tuning strategy is to identify the source of the bottleneck. There could be bottlenecks in various steps of your design. The goal is not to address all the bottlenecks at the same time but to tackle them one at a time. The strategy is to identify the biggest bottleneck first, find the root causes creating the bottleneck, find a solution and implement it. Once the solution has been implemented, we look for the next biggest bottleneck and address it. We keep iterating through all the bottlenecks until we have reached an optimal solution.
Here’s an example to help you understand. You have a Talend Data Integration Standard job that reads from an Oracle OLTP database, transforms it in tMap and loads it into a Netezza data warehouse.
If this task is not meeting your performance requirements, my recommendation would be to break down this task into three different parts:
- Read from Oracle
- Transform in Talend, and
- Write to Netezza
One or more of the tasks listed above maybe causing a slowdown of your process. Our goal is to address one at a time. A simple way of finding out what is causing the bottleneck is to create three test Talend jobs to replicate the functionality of the one Talend job. This would look something like this:
1. Job 1 – Read from Oracle: this job would read from Oracle using tOracleInput and write to a file in the local file system of the Talend Job Server using tFileOutputDelimited. Run this job and capture the throughput (rows/second). If the throughput numbers do not look reasonable, the query from Oracle source is one of your bottlenecks.
2. Job 2 – Transformation: Read the file created in Job 1 using tFileInputDelimited, apply your tMap transformations and write out to another file using tFileOutputDelimited into the same local file system. How do the throughput numbers look? Are they much faster or much slower or the same compared to Job 1?
3. Job 3 – Write to Netezza: Read the file created in Job2 and load it into the Netezza database and look at the throughput numbers. How do they compare to Job 1 and Job 2?
There are couple of things you need to pay attention to when running these jobs:
- First, these test jobs should be writing to and reading from a local file system – this is to make sure we eliminate any possible network latency.
- The second thing - throughput (the rate at which data is read/transformed/written) – is a more accurate measure of performance than elapsed time. Our goal is to reduce elapsed time and we address that by increasing the throughput at each stage of the data integration pipeline.
Let’s assume that this was the outcome of running our tests:
Based on the scenario above, we can easily point to Netezza being the bottleneck in our scenario since it has the lowest throughput*.
If the outcome was something like below, we can conclude that we have bottlenecks both in the read from Oracle and write to Netezza and we need to address both*.
* In my simple use case above, I assume that the row lengths do not change across the entire pipeline i.e. if we read 10 columns from Oracle, the same 10 columns are passed through the Transform and Write jobs. However, in real life scenarios, we do add or drop columns as part of the pipeline and we need to pick alternate measures of throughput like MBs/sec.
Let’s eliminate those bottlenecks
In the previous section, I talked about identifying “where” the bottleneck is. In this section, we will provide you a summary of “how” we can eliminate the different types of bottlenecks.
- If your source is a relational database, you can work with your database administrators to make sure that the query is optimized and executing based on the best query plan. They can also provide optimizer hints to improve the throughput of the query. They should also be able to add new indexes for queries that have an GROUP BY or ORDER BY clause.
- For Oracle and some other databases, Talend allows you to configure the Cursor Size in the t<DB>Input component. The Cursor Size defines the fetch size of the result set. Once the result set is retrieved from the database, it is stored in memory for faster processing. The ideal size for this is defined by your dataset and requirements. You can also work with the database administrators to increase the network packet size which allows for larger packets of data to be transported over the network at one time.
- For very large reads, create parallel read partitions as multiple subjobs using multiple t<DB>Input components with non-overlapping where clauses. Pick columns that are indexed for the where clauses – this will enable an equal distribution of data across the multiple reads. Each of these subjobs can run in parallel by enabling “Multi thread execution” in the job properties
- For files sources stored on the network shared storage, please make sure that there is no network latency between the server on which Talend Job server is running and the file system in which the files are hosted. The file system should ideally be dedicated to storing and managing files for your data integration tasks. In one of my assignments, the file system where the source files were stored were shared with mail server backups – so when the nightly email backups would run, our reads from the filesystem would significantly slow down. Work with your storage architect to eliminate all such bottlenecks.
- Most modern relational databases support bulk loading. With bulk loaders, Talend bypasses the database log and thus improves performance. For some of the databases, we also provide the option to use named pipes with external loaders. This eliminates the need to write the intermediate files to disk.
- Sometimes dropping indexes and key constraints before load helps with the performance. You can recreate the indexes and constraints after the load has successfully completed
- For updates, having database indexes on the same columns as the ones that are defined as Keys in the t<DB>Output component will improve performance
- For file targets on a network shared storage, follow the guidelines from above for source files stored on network shared storage
- Reduce the volume of data being processed by Talend by eliminating the unnecessary rows and columns early in the pipeline. You can do this by using tFilterRows and tFilterColumns components
- For some memory intensive components like tMap and tSortRow, Talend provides the option to store intermediate results on disk. A fast disk that’s local to the Job Server is recommended. This reduces the requirement for adding more memory as data volumes grow.
- Sometimes transformation bottlenecks happen because of a large monolithic job that tries to do many things at once. Break down such large jobs into smaller jobs that are more efficient for data processing.
There are some additional optimization techniques to address bottlenecks at a job level (like Parallelization, ELT, Memory optimization etc.) that are not discussed as part of this blog but you can find information on them and other techniques on Talend Job Design Patterns and Best Practices – Part 1, Part 2, Part 3 and Part 4.
The key element to successfully tune your jobs for optimal performance is to identify and eliminate bottlenecks. The first step in the performance tuning is to identify the source of bottlenecks. And yes, it does involve creating additional test jobs. But don’t be discouraged that you have to put in additional effort and time to build these out. They are well worth the effort based on my experience doing this for 20+ years. A strategic and repeatable approach to performance and tuning is a lot more efficient than a tactical trial and error method. You can also incorporate the lessons learnt into your process and improve it over time. I hope this article gets you started in your performance tuning journey and wish you the best.
Ready to get started with Talend?
More related articles
- Big Data and Agriculture: A Complete Guide
- Big Data and Privacy: What Companies Need to Know
- 5 Ways to Optimize Your Big Data
- Big Data for Supply Chain Management
- Big Data in Government
- Defining Big Data Analytics for the Cloud
- Big Data in Media and Telco: 6 Applications and Use Cases
- 7 Ways Big Data is Changing E-commerce
- How Big Data is Changing Healthcare
- 2 Key Challenges of Streaming Data and How to Solve Them
- Big Data for Small Business: A Complete Guide
- How Does Big Data Work for HR Analytics?
- What is Hadoop?
- What is MapReduce?
- What is Big Data? Free Guide and Definition
- Beginner’s Guide to Batch Processing
- An Intro to Apache Spark Partitioning
- What is Apache Spark?
- Big Data in Finance - Your Guide to Financial Data Analysis
- Stream Processing Defined
- Big Data in Marketing 101
- Big Data in Retail: Common Benefits and 7 Real-Life Examples