Talend Big Data Advanced – Spark Batch

Talend provides a development environment that lets you interact with many Big Data sources and targets without having to learn and write complicated code.

This course covers Big Data batch Jobs that use the Spark framework.

Duration: 1 day (7 hours)
Target audience: Anyone who wants to use Talend Studio to interact with Big Data systems
Prerequisites: Completion of Talend Big Data Basics
Course objectives

After completing this course, you will be able to:

  • Develop a Big Data batch Job using the Spark framework
  • Execute Spark Jobs in YARN client and cluster mode
  • Enable Spark history server event logging
  • Copy data from a local file to HDFS
  • Copy data from MySQL to HDFS
  • Create a Hive table and copy data from HDFS to it
  • Import tweets to HDFS
  • Join, sort, and aggregate data
  • Use caches for faster processing
  • Query data from a Hive table using HiveQL
  • Query data from Spark datasets using Spark SQL
Course agenda

Spark in context

  • Concepts

Introduction to Spark

  • Developing and configuring a Big Data batch Job to use the Spark framework
  • Executing a Big Data Spark batch Job
  • Tracking a Big Data Spark batch Job execution

Sentiment analysis use case

  • Using the Twitter application programming interface (API) with Talend components
  • Loading tweets into HDFS
  • Processing tweets with a Big Data batch Job using the Spark framework
  • Enabling Spark history server event logging
  • Executing a Big Data Spark batch Job in YARN cluster mode
  • Deploying and scheduling Job execution from Talend Administration Center (TAC)

Download analysis use case

  • Retrieving RDBMS data from a Big Data Spark batch Job
  • Loading data into a Hive table and HDFS
  • Executing HiveQL queries from a Big Data Spark batch Job
  • Using caches for faster Spark batch Job processing
  • Performing download analysis with a Big Data Spark batch Job
  • Executing a Spark SQL query on data read from a NoSQL HBase table