Talend Data Preparation with Big Data

Talend Data Preparation is a self-service application that enables information workers to prepare data for analysis and other data-driven tasks. This course is designed to help you immediately access your data lake using Talend Data Preparation, and to combine preparation and integration tools to correct Big Data files stored in a Hadoop Distributed File System (HDFS).

You learn how to create datasets from data stored on HDFS and export clean data to the cluster, and you build on your knowledge of Data Preparation by cleaning up Big Data files. You also learn how to use Talend Studio to execute preparations on the Hadoop cluster using the Spark framework.

Duration: Half day (4 hours)
Target audience: Anyone who wants to use Talend Data Preparation to clean up and structure Big Data files
Prerequisites: Completion of Talend Data Integration Basics, Talend Data Preparation for Implementers, and Talend Big Data Basics

Course objectives

After completing this course, you will be able to:

  • Create datasets from data stored on HDFS
  • Create preparations to clean up Big Data files
  • Export preparations to HDFS
  • Execute a user-defined data preparation in a Spark batch Job
  • Execute a user-defined data preparation in a Spark streaming Job

Course agenda

Talend Data Preparation in a Big Data context

  • Concepts and purpose

Getting started

  • Monitoring the Hadoop cluster
  • Creating cluster metadata
  • Generating data on the cluster
  • Monitoring Big Data Jobs

Processing data on HDFS

  • Creating a dataset from an HDFS source
  • Updating a preparation
  • Exporting the preparation to HDFS

Running a preparation in a Big Data batch Job

  • Setting up a Spark batch Job
  • Updating the Spark batch Job

Running a preparation in a Big Data streaming Job

  • Importing streaming Jobs
  • Creating a dataset and preparation
  • Using a preparation in a Big Data streaming Job

Troubleshooting

  • Basic troubleshooting instructions