When It Comes to Big Data – Speed Matters


Talend vs Informatica – The Big Data Benchmark

If you’ve spoken to a Talend sales representative or read some of my team’s marketing material, then you’ve undoubtedly heard our claims that when it comes to Big Data, Talend offers some significant speed advantages over the competition.

As an example, here’s a slide we used as part of our Talend 6 media deck.

Concerned that some folks might dismiss this content as marketing hype, I thought it would make sense to gather more concrete evidence to substantiate our claims. We engaged MCG Global Services, a leader in information management, to run benchmark tests on our behalf comparing Talend Big Data Integration against Informatica Big Data Edition.

I believe MCG did a really nice job on the benchmark, defining a common set of use cases and questions that would be highly relevant to many organizations.

Questions included:

- What impact do customers’ views of pages and products on our website have on sales? How many pages do they view before making a purchase decision (whether online or in-store)? (Use Case 1)

- How do our coupon promotional campaigns impact our product sales or service utilization? Do customers who view or receive our coupon promotion come to our website and buy more products, or additional products, than they would have without the coupon? (Use Case 2)

- How much does our recommendation engine influence or drive product sales? Do customers tend to buy additional products based on these recommendations? (Use Case 3)

As you’ll note below, the benchmark confirms our speed advantage claims. If you are interested in a more detailed view of the conditions and outcomes of the benchmark, you may download the full benchmark here.

 

Here’s a snapshot of the overall gains with Talend and how they increase as data volumes rise.

In the case of Talend versus Informatica, it’s relatively straightforward to explain why the gap is so startling. By leveraging the in-memory capabilities of Apache Spark, Talend users can integrate datasets at much faster rates. Spark uses fast Remote Procedure Calls for efficient task dispatching and scheduling. It also executes tasks in a thread pool rather than in a pool of Java Virtual Machine processes. This enables Spark to schedule and execute tasks at rates measured in milliseconds, whereas MapReduce scheduling takes seconds and sometimes minutes in busy clusters.
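To make the thread-pool point a little more tangible, here is a rough analogy in Python. It is not Spark or MapReduce code, and it is not part of the MCG benchmark. It simply contrasts handing small tasks to threads that already exist with starting a fresh process for every task. The function names (`tiny_task`, `run_in_thread_pool`, `run_in_fresh_processes`, `timed`) and the workload are hypothetical, and Python interpreters stand in for JVM processes.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_task(x: int) -> int:
    """Stand-in for a short-lived task (hypothetical workload)."""
    return x * x


def run_in_thread_pool(inputs, workers=4):
    """Dispatch every task to worker threads inside the current process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(tiny_task, inputs))


def run_in_fresh_processes(inputs):
    """Start a new interpreter process per task, paying startup cost each time."""
    results = []
    for x in inputs:
        completed = subprocess.run(
            [sys.executable, "-c", f"print({x} * {x})"],
            capture_output=True,
            text=True,
            check=True,
        )
        results.append(int(completed.stdout))
    return results


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    inputs = list(range(8))
    pooled, pooled_s = timed(run_in_thread_pool, inputs)
    spawned, spawned_s = timed(run_in_fresh_processes, inputs)
    assert pooled == spawned  # same answers, different dispatch cost
    print(f"thread pool:      {pooled_s * 1000:.2f} ms")
    print(f"process per task: {spawned_s * 1000:.2f} ms")
```

Running the script prints the elapsed time for each approach on your machine. The absolute numbers depend on hardware and operating system, and they say nothing about either product, but the per-task process startup cost is usually easy to see.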

Informatica Big Data Edition doesn’t support Spark directly, so how Hive-on-Spark behaves and performs depends on the Hadoop engine and how it is configured.

Again, if you want to learn more about the benchmark tests, you may download the full report here.

Related Resources

With Talend, Speed Up Your Big Data Integration Projects

Products Mentioned

Talend Big Data

Comments

John Haddad
So that readers are not misled by this benchmark report, I’d like to point out a few deficiencies.

First and foremost, this benchmark compares the new version of Talend released in September 2015 to an older version of Informatica Big Data Edition that was released over two years ago. The new Informatica Big Data Management (BDM) version 10 was released in November 2015 (just two months after the release of Talend’s current version) and provides many new capabilities over the former Big Data Edition as it relates to integration, governance, and security. If this were truly an independent benchmark, then MCG Global Services should have contacted Informatica for some guidance.

Second, what this benchmark primarily points out are some differences in performance of MapReduce vs. Spark, which are already well known. The new version of Informatica BDM uses the Informatica Blaze engine on YARN (read more about the Informatica Blaze engine in the blog referenced at the end of this comment), which has been shown to run 2-3 times faster than Spark and 11-20 times faster than MapReduce for batch ETL transaction processing. The new Informatica Blaze engine also addresses current deficiencies in the Spark project as it relates to multi-tenancy and resource utilization that can result in performance degradation (this is also discussed in the blog referenced in this comment).

Third, the datasets used in this benchmark (only 2GB) are hardly on the order of what most organizations consider to be Big Data. The benchmarks are not run on what would even be considered worthy of a proof-of-concept (POC) environment. A single node (4 CPU, 30.5 GB memory, and 200 GB storage) VM server is certainly not considered a real-world Big Data cluster. MCG Global Services chose a custom benchmark instead of the industry standard TPC benchmarks, which provide vendor-neutral evaluation for performance and price-to-performance ratio.

To learn more about Big Data benchmarks in the real world, please read the blog, “Top Three Reasons Why We Love Informatica Big Data Management?” at http://infa.media/1V7Ty5l
