Big Data – Four Performance Strategies
In today's blog, I am previewing the latest performance benchmark that Talend R&D and Talend Labs have run, based on the TPC-H benchmark tests.
As ever, it is Talend’s mission to provide easy-to-use big data integration tools with the industry’s highest-performing, most scalable integration code running natively on Hadoop.
As a part of this mission, we put every product release through a rigorous set of performance and scalability tests, including a performance benchmark developed by the Transaction Processing Performance Council, known as TPC-H.
In the latest release of Talend Big Data, we have implemented some key performance strategies and optimisations in Talend Studio, so that the Java code it generates for MapReduce is already optimised. In previous versions these optimisations were possible, but it was incumbent upon the Talend developer to implement them, or even to know that the patterns and good-practice approaches existed.
Talend has taken the time to embed the following optimisations in the Studio at design time. The generated output, deployed natively onto the Hadoop nodes, delivers a performance uplift of 67 percent compared with version 5.4.1.
- Move less data
  - Improve performance, reduce errors, remove latency and ensure consistency of execution
- Execute natively
  - Generate code that executes natively within Hadoop to remove any redundant network, parsing, unpacking, wait-time, CPU or disk cycles
  - Remove any need to traverse logical environments, store data, execute logic or use the network to execute a query
- Optimise at design time
  - Build ‘know-how’ and developer hints, tips and tricks into the tool
- Reduce serialization and deserialization
  - In specific scenarios, use a RawComparator to compare keys byte by byte, as opposed to deserializing the intermediate keys to perform the comparison
- Focus on overall productivity, not just ‘raw throughput’
  - Performance is perceived differently across a myriad of perspectives and roles
  - Everything from training developers to troubleshooting production environments has an impact on end-to-end project delivery and the performance therein
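To make the RawComparator point concrete, here is a minimal sketch of the idea in plain Java. It is not the code Talend Studio generates, and it does not depend on Hadoop. It only mirrors the shape of Hadoop's `RawComparator.compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)` method. The class name `RawKeyCompare`, its helper methods and the key encoding (non-negative ints written as 4 big-endian bytes) are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;

// Hypothetical, standalone illustration: no Hadoop classes are used here.
final class RawKeyCompare {

    // Assumed key encoding: a non-negative int written as 4 big-endian bytes.
    static byte[] serialize(int key) {
        return ByteBuffer.allocate(4).putInt(key).array();
    }

    // Baseline: rebuild both int values from their bytes, then compare them.
    static int compareDeserialized(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int k1 = ByteBuffer.wrap(b1, s1, l1).getInt();
        int k2 = ByteBuffer.wrap(b2, s2, l2).getInt();
        return Integer.compare(k1, k2);
    }

    // Raw approach: compare the serialized bytes in place, as unsigned values,
    // without rebuilding the keys. For the assumed encoding, this gives the
    // same ordering as comparing the int values themselves.
    static int compareRaw(byte[] b1, int s1, int l1,
                          byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(l1, l2);
    }

    public static void main(String[] args) {
        byte[] a = serialize(42);
        byte[] b = serialize(1000);
        // Both strategies should agree on the ordering of the two keys.
        System.out.println(Integer.signum(compareDeserialized(a, 0, 4, b, 0, 4)));
        System.out.println(Integer.signum(compareRaw(a, 0, 4, b, 0, 4)));
    }
}
```

In a sort that compares intermediate keys millions of times, skipping the per-comparison decode step is where the saving comes from. The byte-wise shortcut is only valid when the serialized form sorts the same way as the values, which is why it applies to specific scenarios rather than to every key type.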
The details of this TPC-H benchmark will be published as part of the 5.5.1 release.
In the meantime, ask your other integration vendors how they implement the performance strategies above in a graphical tool without the need for expert knowledge and man-years of experience... see what answers they can give :-)