Talend vs Informatica – The Big Data Benchmark
If you’ve spoken to a Talend sales representative or read some of my team’s marketing material, then you’ve undoubtedly heard our claims that when it comes to Big Data, Talend offers some significant speed advantages over the competition.
As an example, here’s a slide we used as part of our Talend 6 media deck.
Concerned that some folks might dismiss this content as marketing hype, I thought it would make sense to gather some more concrete evidence to substantiate our claims. We engaged MCG Global Services, a leader in information management, to conduct benchmark tests on our behalf comparing Talend Big Data Integration against Informatica Big Data Edition.
I believe MCG did a really nice job of running the benchmark and of defining a common set of use cases and questions that would be highly relevant to many organizations:
- What impact do customers’ views of pages and products on our website have on sales? How many pages do they view before making a purchase decision (whether online or in-store)? (Use Case 1)
- How do our coupon promotional campaigns impact our product sales or service utilization? Do customers who view or receive a coupon promotion come to our website and buy more products, or additional products they might not have bought without the coupon? (Use Case 2)
- How much does our recommendation engine influence or drive product sales? Do customers tend to buy additional products based on these recommendations? (Use Case 3)
As you’ll note below, the benchmark confirms our speed advantage claims. If you are interested in a more detailed view of the conditions and outcomes of the benchmark, you may download the full benchmark here.
Here’s a snapshot of the overall gains with Talend and how they increase as data volumes rise.
In the case of Talend versus Informatica, it’s relatively straightforward to explain why the gap is so startling. By leveraging the in-memory capabilities of Apache Spark, Talend users can integrate datasets at much faster rates. Spark uses fast Remote Procedure Calls for efficient task dispatching and scheduling. It also executes tasks on a thread pool rather than on a pool of Java Virtual Machine processes. This enables Spark to schedule and execute tasks at a rate measured in milliseconds, whereas MapReduce scheduling takes seconds, and sometimes minutes, in busy clusters.
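To get a feel for why handing work to a thread pool is so much cheaper than starting a process for each task, here is a minimal Python sketch. It is only an analogy: it is not Spark's scheduler, not MapReduce, and not part of the MCG benchmark. The names `tiny_task`, `run_on_thread_pool`, `run_with_process_per_task` and the `NUM_TASKS` value are all made up for illustration, and the timings it prints will vary from machine to machine.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 20  # hypothetical number of small tasks to dispatch


def tiny_task(n):
    """Stand-in for a short unit of work (illustrative only)."""
    return n * n


def run_on_thread_pool(num_tasks):
    """Dispatch every task to worker threads that live inside one process."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        start = time.perf_counter()
        results = list(pool.map(tiny_task, range(num_tasks)))
        elapsed = time.perf_counter() - start
    return results, elapsed


def run_with_process_per_task(num_tasks):
    """Launch a brand-new interpreter process for every single task."""
    results = []
    start = time.perf_counter()
    for n in range(num_tasks):
        completed = subprocess.run(
            [sys.executable, "-c", f"print({n} * {n})"],
            capture_output=True,
            text=True,
            check=True,
        )
        results.append(int(completed.stdout))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    _, thread_time = run_on_thread_pool(NUM_TASKS)
    _, process_time = run_with_process_per_task(NUM_TASKS)
    print(f"thread pool:      {thread_time / NUM_TASKS * 1000:.3f} ms per task")
    print(f"process per task: {process_time / NUM_TASKS * 1000:.3f} ms per task")
```

Both functions compute the same results; the only difference is how each task gets started. The per-task cost of process startup is the kind of overhead that a thread-based executor avoids.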
Informatica Big Data Edition, by contrast, doesn’t support Spark directly, so how Hive-on-Spark behaves and performs depends on the Hadoop engine and how it is configured.
Again, if you want to learn more about the benchmark tests, you may download the full report here.