It has been a couple of weeks since I got back from the Hadoop Summit in San Jose and I wanted to share a few highlights that I believe validate the direction Talend has taken over the past couple of years.
Coming out of the Summit, I really felt that as an industry we are beginning to move beyond delivering exciting, innovative technologies for Hadoop insiders and toward solutions that address real business problems. These next-generation solutions focus strongly on Enterprise requirements: scalability, elasticity, hybrid deployment, security and robust overall governance.
From my perspective (biased of course!), the dominant themes at the Summit gravitated around:
– Lambda Architecture and typical use cases it enables
– Cloud, tools and ease of dealing with Big Data
– Machine Learning
In this blog post, I’ll focus on the first one: the Lambda Architecture.
Business use cases that require a mix of machine learning, batch and real-time data processing are not new; they have been around for many years. For example:
– How do I stop fraud before it occurs?
– How can I make my customers feel like “royalty” and push personalized offers to reduce shopping cart abandonment?
– How can I prevent driving risks based on real-time hazards and driver profiles?
The good news is that technologies have greatly improved and, with almost endless computing power at a fraction of yesterday’s cost, these use cases are not science fiction anymore.
The Lambda Architecture (see below) is a typical way to address some of those use cases.
Lambda Architecture (based on Nathan Marz’s design)
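To make the idea more concrete, here is a minimal sketch of the three layers usually associated with the Lambda Architecture: a batch layer that recomputes a view from the full dataset, a speed layer that folds in recent events incrementally, and a serving layer that merges both to answer a query. It is plain Python rather than Spark, and every name in it (`master_dataset`, `recent_events`, `build_batch_view` and so on) is a hypothetical one chosen for illustration only.

```python
from collections import Counter

# Hypothetical event log of (user_id, purchase_amount) pairs.
master_dataset = [("alice", 30), ("bob", 20), ("alice", 5)]  # already on disk
recent_events = [("bob", 10), ("carol", 7)]                  # not yet in a batch run


def build_batch_view(events):
    """Batch layer: recompute totals from the full master dataset."""
    view = Counter()
    for user, amount in events:
        view[user] += amount
    return view


def update_realtime_view(view, event):
    """Speed layer: fold a single new event into the real-time view."""
    user, amount = event
    view[user] += amount
    return view


def query(batch_view, realtime_view, user):
    """Serving layer: answer a query by merging the two views."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)


batch_view = build_batch_view(master_dataset)
realtime_view = Counter()
for event in recent_events:
    update_realtime_view(realtime_view, event)

print(query(batch_view, realtime_view, "bob"))  # prints 30
```

In a real deployment the batch view would be rebuilt periodically and the real-time view would be discarded once the batch run has caught up; the sketch leaves that hand-off out.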
Spark (the champion) stands out from the crowd because of its ability to address both batch and near real-time (micro-batch in the case of Spark) data processing with great performance through its in-memory approach.
Spark is also continuously improving its platform by adding key components to appeal to more data scientists (on top of MLlib for machine learning, SparkR was added in the 1.4 release) and to expand its Hadoop footprint.
Spark projects in the Enterprise are on the rise and are slowly replacing MapReduce for batch processing in the minds of developers. IBM’s recent endorsement and commitment to put 3,500 researchers and developers on Spark-related projects will probably accelerate Spark adoption in the hearts of Enterprise architects.
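For readers new to the term, micro-batching means cutting a continuous stream into small, fixed-length windows and processing each window as a tiny batch job. The sketch below illustrates only that grouping step in plain Python. It is not Spark's API, and the function name `micro_batches`, the sample events and the one-second interval are assumptions made for the example.

```python
def micro_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into consecutive fixed-length windows."""
    batches = {}
    for timestamp_ms, value in events:
        # Every event falls into the window that starts at a multiple of interval_ms.
        window_start = (timestamp_ms // interval_ms) * interval_ms
        batches.setdefault(window_start, []).append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical stream of (arrival time in ms, value) pairs.
events = [(200, 5), (900, 3), (1100, 7), (2500, 1), (2700, 4)]

for window_start, values in micro_batches(events, interval_ms=1000):
    # Each window is then handled like a small batch job, here a simple sum.
    print(window_start, sum(values))
```

With a one-second interval the five events end up in three windows, and the same batch-style logic (here a sum) runs once per window.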
But, because there’s a champion, there must also be a contender…
This year, I was particularly impressed by the new Apache Flink project, which attempts to address some of Spark’s drawbacks, such as:
– Not being a first-class YARN citizen yet
– Being micro-batch (good in 95% of the cases) rather than pure streaming
– Memory management that could be easier
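One way to see why micro-batch is fine for most cases but not all: in a deliberately simplified model, an event in a micro-batch system waits until its window closes before it is processed, whereas a pure streaming engine can handle it on arrival. The sketch below computes that waiting time. It ignores processing and scheduling time entirely, and the helper name `micro_batch_delay`, the arrival times and the one-second interval are assumptions for illustration.

```python
def micro_batch_delay(arrival_ms, interval_ms):
    """Time an event waits for its window to close (simplified model)."""
    window_end = (arrival_ms // interval_ms + 1) * interval_ms
    return window_end - arrival_ms


# Hypothetical arrival times in milliseconds.
arrivals_ms = [200, 900, 1100, 2500]

delays = [micro_batch_delay(t, interval_ms=1000) for t in arrivals_ms]
print(delays)  # prints [800, 100, 900, 500]
```

In this model the added delay is bounded by the batch interval, which is acceptable for many dashboards and alerts but can matter when a decision has to be made within milliseconds.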
If you look at Flink’s “marchitecture”, you can almost draw a one-to-one link between its modules and Spark’s. It is the same story when it comes to their APIs: they are very similar.
So where is Talend in all of this?
With our Talend 5.6 platform, we delivered a few Spark components in Tech Preview. Since then, we have doubled down on our Spark investments, and our upcoming 6.0 release will see many new components to support almost any use case, batch or real-time. From a batch perspective, with 6.0, it will be easier to convert your MapReduce jobs into Spark jobs and gain significant performance improvements along the way.
It’s worth highlighting that the well-known tMap component will be available for Spark Batch and Streaming, allowing advanced Spark transformation, filtering and data routing from single or multiple sources to single or multiple destinations.
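As a rough picture of what "transformation, filtering and routing from multiple sources to multiple destinations" means in practice, here is a conceptual sketch in plain Python. It is not Talend's API and not the code Talend generates. The function `map_and_route`, the sample `orders` and `customers` data, the output names and the threshold are all hypothetical.

```python
# Hypothetical inputs: a main flow of orders and a lookup flow of customers.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 250.0},
    {"order_id": 2, "customer_id": "c2", "amount": 40.0},
    {"order_id": 3, "customer_id": "c9", "amount": 75.0},  # no matching customer
]
customers = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}


def map_and_route(orders, customers, threshold=100.0):
    """Join each order with its customer, then route rows to named outputs."""
    outputs = {"high_value": [], "standard": [], "rejects": []}
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            # Filtering: rows without a lookup match go to a reject output.
            outputs["rejects"].append(order)
            continue
        # Transformation: enrich the order with a field from the lookup source.
        row = {**order, "customer_name": customer["name"]}
        # Routing: choose the destination based on a condition on the row.
        target = "high_value" if row["amount"] >= threshold else "standard"
        outputs[target].append(row)
    return outputs


routed = map_and_route(orders, customers)
for name, rows in routed.items():
    print(name, [row["order_id"] for row in rows])
```

Two sources go in (orders and customers) and three destinations come out, which is the shape of job the paragraph above describes.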
As always, and because we believe native code running directly on the cluster is better than going through proprietary layers, we are generating native Spark code, allowing our customers to benefit from the continuous performance improvements of their Hadoop data processing frameworks.