Talend provides a development environment that lets you interact with many source and target Big Data stores without having to learn and write complicated code.

This course covers the implementation of machine learning algorithms in Big Data batch Jobs using the Spark framework.

Duration: 1 day (7 hours)
Target audience: Anyone who wants to use Talend Studio to industrialize machine learning algorithms
Prerequisites: Completion of Talend Data Quality Essentials or Talend Big Data Basics

Course objectives

After completing this course, you will be able to:

  • Connect to a Hadoop cluster from a Talend Job
  • Use context variables and metadata
  • Read and write files in HDFS in a Big Data batch Job
  • Configure a Big Data batch Job to use the Spark framework
  • Create and test recommendation models
  • Create and test classification models
  • Use a machine learning algorithm to deduplicate data

Course agenda

Machine learning in context

  • Concepts

SMS classification use case

  • Exploring an SMS classification use case – decision trees
  • Creating an SMS classification model
  • Testing the SMS classification model

Movie recommendation use case

  • Exploring a movie recommendation use case – alternating least squares
  • Building a movie recommendation model
  • Testing the movie recommendation model

Iris classification use case

  • Exploring an iris flower classification use case – Naïve Bayes classifier
  • Building an iris classification model
  • Testing the iris classification model

Child care deduplication use case

  • Exploring a child care use case and dataset – matching
  • Setting up the environment
  • Pairing data
  • Building a matching model
  • Using the matching model
  • Merging groups of duplicates