Talend with Spark and Hadoop

An acceleration engine for your integration platform

The speed and scale of data processing unleashed by Apache Spark on Hadoop bring the promise of Big Data closer than ever. Talend Big Data provides the platform to take advantage of it today.

Big Data wants big storage and fast processing

Hadoop made collection and storage of massive amounts of data affordable. Spark unlocked the speed and scale to process it. Talend provides a single data integration platform to connect these innovative technologies to the decision-making applications and tools transforming every industry.



IT needs a single platform to connect it together

Talend is the first Big Data integration platform built on Apache Spark and Hadoop. Graphical tools and wizards generate native code so you can start working with Apache Spark, Spark Streaming, Apache Hadoop, and NoSQL databases today.

  • Talend Big Data jobs running Spark are 5x faster than MapReduce* to deliver real-time results.
  • Talend-optimized connectors and components combine in-memory analytics, machine learning, and caching to deliver high-performance jobs without tuning Spark by hand.
  • Talend visual tools enable you to build Spark jobs faster than hand coding to run on Hadoop, standalone, or in the cloud.
  • Transform your MapReduce jobs to Spark with the push of a button in Talend.
* Validated by independent TPC-H integration benchmarks.



Optimize for the speed and scale of Spark on Hadoop

Talend generates native code that takes advantage of the Spark features behind the speed and scale of Big Data and the Internet of Things.

  • Optimal management of distributed computing: partition up front for better performance.
  • Optimal management of memory: cache data in memory for reuse by multiple subjobs, or use Tachyon to cache data outside the Spark JVM for use by other applications.
  • Unmatched performance without storing in HDFS: massively parallel streaming of data straight from the data source (Oracle, Teradata, MySQL, Cassandra, etc.), with data kept in memory for reuse in compressed columnar storage.
  • Mix messaging and batch at scale: Talend connectors for Kafka and more provide an end-to-end distributed solution for large-scale messaging systems.
  • A new category of connectors native to Spark enables ingestion from RDBMS.
  • An end-to-end data mapping pipeline keeps data and processing in Spark.
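
To make "partition up front" concrete, here is a minimal sketch in plain Python. It is not Talend-generated code and does not use the Spark API. The function names and the sample order data are hypothetical, chosen only for illustration. The idea it shows: once records are partitioned by key, each partition can be aggregated on its own and the partial results merged without moving data between partitions.

```python
from collections import defaultdict
from zlib import crc32


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition chosen by a stable hash of its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        # crc32 is used instead of hash() so the assignment is the same on every run.
        index = crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def sum_by_key(partition):
    """Aggregate one partition on its own; no other partition is consulted."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Hypothetical sample data: (customer_id, order_amount) pairs.
orders = [("alice", 30), ("bob", 12), ("alice", 5), ("carol", 8), ("bob", 20)]

partitions = partition_by_key(orders, num_partitions=3)
per_partition_totals = [sum_by_key(p) for p in partitions]

# A key never spans two partitions, so partial results merge with a plain update.
totals = {}
for partial in per_partition_totals:
    totals.update(partial)
```

In a real cluster each partition would be processed on a different node; the sketch only shows why choosing the partitioning key before the aggregation avoids a later reshuffle.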



Leverage the full power of Spark Machine Learning

Spark runs batch and streaming workloads in a single runtime, which makes it a natural fit for the Lambda Architecture. Talend provides a single tool and code base to build batch and real-time applications using high-speed messaging, real-time data ingestion and processing, and fast NoSQL connectivity capabilities.

  • The Lambda architecture, implemented on Spark, provides the blueprint for data pipelines that combine historical batch data with real-time clickstream, geolocation or sensor data.
  • Talend helps you build the intelligent data pipelines, powered by Spark Machine Learning, that connect real-time and batch data to feed real-time analytics.
  • Pre-built drag-and-drop developer components leverage Spark MLlib (machine learning library) algorithms for logistic and linear regression, image classification, text analysis, decision tree classification, gradient-boosted tree forecasting, random forest, ALS, and Naïve Bayes, as well as clustering algorithms such as K-Means.
  • Data scientists can do everything in a single tool with appropriate tracking and governance to build Spark-based real-time analytics models for recommendations, customer segmentation, forecasting, classification, and regression analysis.
  • Use Talend's continuous delivery tools to put data science models into production with fast and frequent iterations for massive learning on batch processed data.
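
The Lambda-style pipeline described above, which combines historical batch data with real-time events, can be sketched in a few lines of plain Python. This is an illustration of the pattern only, under stated assumptions: it is not Talend or Spark code, and the function names, the page-view events, and the in-memory views are all hypothetical.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer: recompute page-view counts from the full historical data set."""
    return Counter(event["page"] for event in historical_events)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally count an event that arrived after the last batch run."""
    speed_view[event["page"]] += 1


def query(batch_view, speed_view, page):
    """Serving layer: answer with the batch count plus the recent real-time count."""
    return batch_view[page] + speed_view[page]


# Hypothetical clickstream: events already in the batch store, and events still streaming in.
historical = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
recent = [{"page": "/home"}, {"page": "/docs"}]

batch_view = build_batch_view(historical)
speed_view = Counter()
for event in recent:
    update_speed_view(speed_view, event)

home_views = query(batch_view, speed_view, "/home")
```

When the next batch run absorbs the recent events, the speed view is reset, so each event is counted by exactly one layer at query time.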



Stay current with the most up-to-date Hadoop distributions for Spark

Talend is the only data integration platform that supports the latest Hadoop distributions. Native Spark connectors in Talend optimize data feeds from external sources into Spark so you can ingest, load in parallel, and accelerate use of data.



Run on affordable, commoditized hardware, and deploy to your existing Hadoop cluster.

Manage the elasticity of your AWS EMR cluster within your job using Talend Studio.

Deliver Spark in the cloud via Google, Amazon, IBM, Oracle, Microsoft Azure, and Databricks.

Get started with over 100 drag-and-drop Spark components.

Track data used and apply security policies in Cloudera Navigator.



© 2016 Talend All rights reserved.