Talend provides a development environment that enables you to interact with many Big Data sources and targets without having to understand or write complicated code.

Talend Big Data Basics introduces the Talend components, shipped with several Talend products, that let you interact with Big Data systems.

Duration: 2 days (14 hours)
Target audience: Anyone who wants to use Talend Studio to interact with Big Data systems
Prerequisites: Completion of Introduction to Talend Studio, Talend Data Integration Basics, or Talend Data Integration Advanced
Course objectives
After completing this course, you will be able to:
  • Create cluster metadata
  • Create HDFS and Hive metadata
  • Connect to your cluster to use HDFS, HBase, Hive, Pig, Sqoop, and MapReduce
  • Read and write data in HDFS (HDFS, HBase)
  • Read and write tables in HDFS (Hive)
  • Process tables stored in HDFS with Hive
  • Process data stored in HDFS with Pig
  • Process data stored in HDFS with Big Data batch Jobs
Course agenda

Big Data in context

  • Concepts

Connecting to the Hadoop cluster

  • Creating cluster metadata in the repository
  • Creating HDFS metadata in the repository

Reading and writing data in HDFS

  • Storing a file in HDFS
  • Storing multiple files in HDFS
  • Reading data from HDFS
  • Storing sparse datasets with HBase
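For orientation, here is a minimal sketch of what storing a file in HDFS and reading it back amount to at the API level, using the standard Hadoop FileSystem Java API. Talend generates comparable Java code from the graphical Job design; the NameNode URI and file path below are hypothetical placeholders.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode URI; in a Talend Job this comes from the cluster metadata.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
            Path path = new Path("/user/student/customers.csv"); // hypothetical path

            // Store a file in HDFS.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("id;name\n1;Alice\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the data back from HDFS.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }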

Processing Hive data in standard Jobs

  • Creating Hive connection metadata
  • Saving data as Hive tables
  • Processing Hive tables using a standard Job
  • Profiling Hive tables using Data Quality analyses
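For context, the sketch below creates and queries a Hive table over JDBC, which is roughly what Talend's Hive components do inside a standard Job. The HiveServer2 URL and user name are hypothetical, and the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTableDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 URL; in Talend it is stored as Hive connection metadata.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "student", "");
                 Statement stmt = conn.createStatement()) {

                // Save data as a Hive table backed by delimited files in HDFS.
                stmt.execute("CREATE TABLE IF NOT EXISTS customers (id INT, name STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'");

                // Process the table with a query.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT name, COUNT(*) AS cnt FROM customers GROUP BY name")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " -> " + rs.getLong("cnt"));
                    }
                }
            }
        }
    }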

Processing data with MapReduce

  • Processing data stored in HDFS with Pig using standard Jobs
  • Processing data stored in HDFS with Big Data batch Jobs
  • Migrating a standard Job to a Big Data batch Job
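To illustrate the kind of program a Big Data batch Job running on the MapReduce framework boils down to, here is the classic word-count example written against the org.apache.hadoop.mapreduce API. The input and output paths are hypothetical placeholders.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE); // emit (word, 1) for each token
                    }
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/student/input"));    // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/user/student/output")); // hypothetical
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compiled against the Hadoop client libraries, a program like this is packaged as a JAR and submitted to the cluster with the hadoop jar command; Talend handles the equivalent build-and-submit steps for you.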

Big Data use case: Clickstream

  • Setting up a development environment
  • Loading data files into HDFS
  • Enriching logs
  • Computing statistics
  • Understanding MapReduce Jobs
  • Using Talend Studio to configure a resource request to YARN
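As a rough sketch of the last item, these are standard Hadoop properties that govern the YARN container resources a MapReduce Job requests; the values shown are illustrative only, and Talend Studio lets you set such properties through the cluster or Job configuration rather than in code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class YarnResourceRequest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Standard Hadoop properties controlling the containers requested from YARN.
            // The values below are examples, not recommendations.
            conf.set("mapreduce.map.memory.mb", "2048");           // memory per map container
            conf.set("mapreduce.reduce.memory.mb", "4096");        // memory per reduce container
            conf.set("mapreduce.map.cpu.vcores", "1");             // virtual cores per map task
            conf.set("yarn.app.mapreduce.am.resource.mb", "1536"); // ApplicationMaster memory

            Job job = Job.getInstance(conf, "clickstream statistics");
            // ... mapper/reducer setup as in the word-count sketch above ...
        }
    }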