Talend with Spark and Hadoop

An acceleration engine for your integration platform

The speed and scale of data processing unleashed by Apache Spark on Hadoop bring the promise of Big Data closer than ever. Talend Big Data provides the platform to take advantage of it today.


Big Data wants big storage and fast processing

Hadoop made collection and storage of massive amounts of data affordable. Spark unlocked the speed and scale to process it. Talend provides a single data integration platform to connect these innovative technologies to the decision-making applications and tools transforming every industry.

Connect everything on a single platform

Talend is the first big data integration platform built on Apache Spark and Hadoop. Talend Studio provides graphical tools and wizards that generate native code so you can start working with Apache Spark, Spark Streaming, Apache Hadoop, and NoSQL databases today.

  • Talend Big Data jobs running on Spark are 5x faster than MapReduce,* providing real-time results
  • Talend optimized connectors and components combine in-memory analytics, machine learning, and caching to deliver high-performance jobs without tuning Spark by hand
  • Talend visual tools enable you to build Spark jobs faster than hand coding to run on Hadoop, standalone, or in the cloud
  • Transform your MapReduce jobs to Spark with the push of a button in Talend

* Validated by independent TPC-H integration benchmarks

Optimize for the speed and scale of Spark on Hadoop

Talend generates native code to optimize the features of Spark that deliver the speed and scale of big data and the Internet of Things.

  • Optimal management of distributed computing: partition up front for better performance
  • Unmatched performance: massively parallel streaming of data straight from the source, with data kept in memory for reuse using compressed columnar storage
  • Mix messaging and batch at scale with Talend connectors for Kafka and more, for an end-to-end distributed solution for large-scale messaging systems
  • A new category of JDBC connectors native to Spark enables ingestion from an RDBMS using partitioned parallel reads
  • In-memory windowing helps compare data values over a set period of time
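The in-memory windowing bullet above can be sketched in plain Python (illustrative only, not Talend or Spark code; names such as `make_window_checker` and `window_seconds` are invented for the example): keep only the readings that fall within a sliding time window, and compare each new value against the window's average.

```python
from collections import deque

def make_window_checker(window_seconds, threshold):
    """Return a function that flags a reading when it deviates from the
    average of the values seen in the last `window_seconds` seconds."""
    window = deque()  # (timestamp, value) pairs currently inside the window

    def check(timestamp, value):
        # evict readings that have slid out of the time window
        while window and timestamp - window[0][0] > window_seconds:
            window.popleft()
        window.append((timestamp, value))
        avg = sum(v for _, v in window) / len(window)
        return abs(value - avg) > threshold  # True = anomalous vs. recent history

    return check

check = make_window_checker(window_seconds=60, threshold=10.0)
check(0, 20.0)    # first reading: window average equals the value -> False
check(30, 21.0)   # close to the 60-second average -> False
check(45, 50.0)   # far above the recent average -> True
```

A real Spark Streaming window would distribute this state across the cluster, but the comparison logic is the same idea.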

Leverage the full power of Spark Machine Learning

Spark can combine batch and streaming in a single run-time, and Talend provides a single tool and code base to build batch and real-time applications using high-speed messaging, real-time data ingestion and processing, and fast NoSQL connectivity capabilities.

  • You can combine historical data with real-time clickstream, geolocation, or sensor data
  • Talend helps you build the intelligent data pipelines, powered by Spark Machine Learning, that connect real-time and batch data to feed real-time analytics
  • Pre-built drag-and-drop developer components leverage Spark machine learning classifiers for logistic and linear regression, image classification, text analysis, decision tree classification, gradient-boosted tree forecasting, random forests, ALS, and Naïve Bayes, as well as clustering algorithms such as K-Means
  • Developers and data scientists can do everything in a single tool with appropriate tracking and governance to build Spark-based real-time analytics models for recommendations, customer segmentation, forecasting, classification, and regression analysis
  • Talend's continuous delivery tools put data science models into production with fast and frequent iterations for massive learning on processed data
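To make one of the algorithms named above concrete, here is a minimal K-Means clustering sketch in plain Python (not Talend's Spark ML components, just the underlying idea: group points, for example customers by two behavioral measures, around centroids that move to the mean of their assigned points).

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-Means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customer-segmentation data: two well-separated groups
customers = [(1.0, 1.2), (0.8, 1.0), (8.0, 8.5), (8.2, 7.9)]
centroids, clusters = kmeans(customers, k=2)
```

Spark MLlib runs the same iteration in parallel across partitions of a much larger dataset; this sketch only shows the algorithm's shape.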

Stay current with the most up-to-date Hadoop distributions for Spark

Talend is the only data integration platform that supports the latest Hadoop distributions. Native Spark connectors in Talend optimize data feeds from external sources into Spark so you can ingest, load in parallel, and accelerate use of data.
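The partitioned parallel load described above can be sketched in plain Python (an illustrative stand-in for Spark's JDBC partitioned reads, not Talend connector code; the in-memory `table` and function names are invented for the example): split the source table's key range into partitions and fetch each range on its own worker.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "table" standing in for an RDBMS: id -> row
table = {i: {"id": i, "value": i * 10} for i in range(100)}

def read_partition(lo, hi):
    """Fetch one key-range partition, as a partitioned JDBC read would
    with a WHERE id >= lo AND id < hi predicate."""
    return [row for i, row in table.items() if lo <= i < hi]

def parallel_read(num_partitions, lower=0, upper=100):
    """Split [lower, upper) into num_partitions ranges and read them concurrently."""
    step = (upper - lower) // num_partitions
    bounds = [
        (lower + i * step,
         lower + (i + 1) * step if i < num_partitions - 1 else upper)
        for i in range(num_partitions)
    ]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        parts = pool.map(lambda b: read_partition(*b), bounds)
    return [row for part in parts for row in part]

rows = parallel_read(num_partitions=4)  # four workers, four key ranges
```

In Spark, the analogous settings are the partition column and bounds passed to the JDBC reader, with each partition fetched by a separate executor task.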

Run on affordable, commoditized hardware, and deploy to your existing Hadoop cluster.

Manage the elasticity of your AWS EMR cluster within your job using Talend Studio.

Deliver Spark in the cloud via Google, Amazon, IBM, Oracle, and Microsoft Azure.

Get started with over 100 drag-and-drop Spark components.

Track data used and apply security policies in Cloudera Navigator and Hortonworks Atlas.