Big Data Virtual Workshop

Introduction

The exercises included in this training manual are designed to give you a solid introduction to Talend Studio’s Advanced Analytics capabilities. You will work hands-on with the Talend Real-time Big Data Platform and interact with a Cloudera cluster and Cassandra and MySQL databases, among other technologies. At the end of this training session, you will be able to:

  • Manage metadata within Talend Studio and build a Standard data integration job in Talend Studio, utilizing some basic Data Quality and Data Masking components.
  • Build a Decision Tree Model using Spark and Spark MLlib, executed on a Cloudera Cluster.
  • Complete a Spark Streaming job that utilizes the Decision Tree Model for real-time analysis and results.
  • Execute an end-to-end Risk-Assessment Scenario using the jobs built through the exercises in this tutorial to determine, at login, whether a user is a candidate for a targeted marketing campaign.

Everything you need is available in your virtual environment so let’s get started!

Setup

With your virtual training environment up and running you can launch Talend Studio. To do so click on the Talend icon on the left bar of your desktop. Follow these steps the first time you run it:

  1. First you need to configure a connection. Click on Manage Connection and enter your email address and then click OK.

  2. Next you need to select the project you want to open. For this training exercise, choose 'ADV_ANALYTICS_WORKSHOP'.

  3. Once Talend Studio opens, close the Welcome screen. You will then be presented with a pop-up to install additional packages. Keep Required third-party libraries selected, also select Optional third-party libraries, and click Finish.

  4. Accept all third-party licenses that require acceptance: click the 'I accept the terms of the selected license agreement' radio button and click Accept All.

  5. Let the downloads complete before continuing (be patient; the downloads can take a while).

Scenario

Throughout this tutorial, you will be building an end-to-end integration for an online banking firm. The firm is trying to mitigate its exposure and risk by targeting credit offers only to those customers who are deemed low risk and most likely to accept the offer.

You will start by ingesting a customer data file and loading it into the banking firm’s MySQL database as well as into HDFS. Using the data in HDFS, you will then train a machine learning Decision Tree Model that can be used in a real-time Spark Streaming job to determine a candidate’s eligibility for a targeted marketing campaign.

Finally, using some pre-existing web services and a mocked-up bank web portal, you can validate the results of your integration job to make sure it is running as expected.

 

Exercise 1

Summary

In this exercise we will learn some basics about Talend Studio and build a Standard Talend Job to cleanse, standardize and mask data so it can be shared and used in later exercises.

Purpose

Become familiar with Talend Studio by importing metadata into the repository and building a Standard Data Integration Job using Talend data quality and data masking components to ensure data privacy.

Procedure

Note: You can find a completed example to check your work under Job Designs -> Standard -> AdvAnalyticsWorkshop -> Exercise_01_Example

Import source metadata

  1. In the Repository, under Metadata, right-click on File Delimited and select Create File Delimited.
  2. Complete the form to document this metadata. We will name this object ‘Customers’. Filling out the Purpose and Description fields helps in understanding the job later, so it is recommended that you complete them during job creation. Click Next.
  3. Browse to find the source file (Customers.csv) that has already been stored on your virtual machine. It can be found in /home/talend/workshop/data
  4. Select the file and then click OK. Back in the Metadata wizard, you should see sample data displayed in the File Viewer. Click Next.
  5. Choose Comma from the ‘Field Separator’ dropdown. In the Preview tab, check the box to ‘Set heading row as column names’, then click Refresh Preview to confirm the data is being parsed correctly. Click Next.

  6. Review the Schema Description and correct any incorrect information:
    1. Change the ‘Date Pattern’ of the dateofbirth field. Delete the existing ‘Date Pattern’ value, then press Ctrl+Space to see the available options and choose MM-dd-yyyy (a short illustration of this pattern follows step 7).
    2. Change the ssn field type to String.
    3. Make changes to the following field lengths:

       Field              Length
       userId             10
       firstName          15
       lastName           20
       street             50
       city               30
       ssn                11
       email              50
       hasmortgage        7
       haspersonalloan    7
       numcampaigncalls   5

  7. Click Finish to import the Customers metadata into Talend Studio.
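
The ‘Date Pattern’ selected in step 6.1 follows the familiar Java SimpleDateFormat conventions, so you can sanity-check it with a few lines of plain Java. This is illustrative only; the sample value below is made up and does not come from Customers.csv.

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DatePatternCheck {
        public static void main(String[] args) throws ParseException {
            // MM-dd-yyyy: two-digit month, two-digit day, four-digit year
            SimpleDateFormat fmt = new SimpleDateFormat("MM-dd-yyyy");
            Date dob = fmt.parse("04-23-1976");   // hypothetical dateofbirth value
            System.out.println(dob);
        }
    }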

Build a Standard Job

  1. In the Repository, navigate to Job Designs -> Standard -> AdvAnalyticsWorkshop -> Exercise_1. Double-Click on the Job ‘Exercise_01_CleanAndLoadData’ to open this empty job in the Design Window.
  2. Add the Source file that we just imported into the Talend Repository by navigating to Metadata -> File delimited. Drag and drop the Customers metadata object onto the Designer window. When the Components pop-up appears, choose tFileInputDelimited and click OK.
  3. You can preview the source data right within Talend Studio by right-clicking on the new component and selecting Data Viewer. You should notice that the phone numbers are not standardized and that some of the email addresses are missing information. We will address these data quality issues prior to loading into our MySQL Data Warehouse. You should also notice that PII (SSN, email, first and last name, etc.) is readily available. We will mask this information before sharing the data with our data scientists. Close the Data Viewer.

  4. In the Palette along the right side of Talend Studio you will find all available components, categorized by work process. We want to add some Data Quality Components to our job. Within the Palette, select Data Quality. Drag and drop the tVerifyEmail component onto the Job Designer and place it to the right of the Customers component. Connect the two components and name the connection ‘customer_input’.
  5. Again, in the Palette, navigate to Data Quality -> Standardization. Drag and drop the tStandardizePhoneNumber component onto the Job Designer. Place it to the right of the tVerifyEmail component and connect the two components. Name the connection ‘standardize_data’.
  6. Configure the tVerifyEmail Component by double-clicking on it to open the Component tab. In the Component tab’s Basic Settings, click the ‘Column to Validate’ dropdown and choose email. This will validate the email field of our Source file based on the default regular expression populated within the component settings.

  7. Configure the tStandardizePhoneNumber component. Choose phone1 from the ‘Phone Number’ dropdown. Next, click the checkbox to ‘Customize the Country Code’ and leave the default value of “US”. Finally, in the ‘Phone number format for output’ dropdown, choose National.
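
For reference, the ‘National’ output format renders a region-specific number such as (415) 555-2671 for the US. The sketch below reproduces that formatting with Google’s libphonenumber library; it is an illustration only and makes no claim about how the Talend component is implemented internally.

    import com.google.i18n.phonenumbers.NumberParseException;
    import com.google.i18n.phonenumbers.PhoneNumberUtil;
    import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
    import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

    public class PhoneFormatSketch {
        // Illustrative only: country code "US" and output format National,
        // mirroring the tStandardizePhoneNumber settings chosen above.
        public static String toNationalFormat(String rawNumber) throws NumberParseException {
            PhoneNumberUtil util = PhoneNumberUtil.getInstance();
            PhoneNumber parsed = util.parse(rawNumber, "US");
            return util.format(parsed, PhoneNumberFormat.NATIONAL);   // e.g. "(415) 555-2671"
        }
    }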

  8. Next add a tMap Component to the Job Designer. You can do this by clicking anywhere within the Designer Window and typing “tMap”. Once you see the tMap component appear in the selection window, double-click it to add it to the Job Designer. Connect the tStandardizePhoneNumber component to the new tMap component. Name this connection ‘std_data_in’.
  9. Double-click on the tMap component to open the tMap Editor. In the tMap Editor:
    1. Click the green plus sign (+) at the top right to add a new output. Name the new output ‘std_data_out’ and click OK. Drag and drop the first 21 columns (down to and including the conversion field) from the left input table to the new output table.
    2. In the ‘std_data_out’ output table, replace the expression value for the email field with the following expression:
      • std_data_in.VerificationLevel == "VALID" ? std_data_in.email : null
    3. In the ‘std_data_out’ output table, delete the expression value of the phone1 field and press Enter. Now drag and drop the StandardizedPhoneNumber field from the left input table to the phone1 expression of the ‘std_data_out’ output table.
    4. Click OK to save and close the tMap Editor.
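
The ternary expression in step 9.2 simply nulls out any email address that tVerifyEmail did not mark as VALID. In plain Java terms (illustrative only, using the field names from this exercise):

    // Keep the address only when the verification level is "VALID";
    // otherwise write null so invalid emails are dropped downstream.
    public static String keepEmailIfVerified(String verificationLevel, String email) {
        return "VALID".equals(verificationLevel) ? email : null;
    }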

  10. Add these additional components to the Job Designer in order from left to right:
    1. tReplicate
    2. tDataMasking
    3. tDataShuffling
    4. tHDFSOutput
  11. Connect the tMap component output to the tReplicate component by right-clicking on the tMap and selecting Row -> std_data_out and then clicking on the tReplicate component. Using the chart below, connect the remaining components:
      Connect...        To                Name Connection
      tReplicate        tDataMasking      data_out1
      tDataMasking      tDataShuffling    masked_data
      tDataShuffling    tHDFSOutput       data_output


  12. Configure the tDataMasking component. This component allows you to mask specific pieces of Personally Identifiable Information while still maintaining the integrity of the field. Below the ‘Modifications’ table, click the green plus sign (+) three times to add three rows. Configure each row using the table below (a plain-Java approximation of these functions follows the table):
      Input Column    Function                                     Extra Parameters
      email           Mask email local part by character           "x"
      ssn             Replace n first chars                        "3, x"
      ssn             Replace characters between two positions     "5, 6, x"
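
The sketch below approximates the three masking functions configured above, assuming the position parameters are 1-based and inclusive. Talend’s built-in implementations may differ in edge-case handling; treat this as an illustration of the intended output (for example, 123-45-6789 becomes xxx-xx-6789).

    // Approximations only; not Talend's internal code.
    public class MaskingSketch {

        // "Mask email local part by character": jane.doe@example.com -> xxxxxxxx@example.com
        public static String maskEmailLocalPart(String email, char maskChar) {
            int at = (email == null) ? -1 : email.indexOf('@');
            if (at < 0) return email;
            StringBuilder sb = new StringBuilder(email);
            for (int i = 0; i < at; i++) sb.setCharAt(i, maskChar);
            return sb.toString();
        }

        // "Replace n first chars" with parameters "3, x": 123-45-6789 -> xxx-45-6789
        public static String replaceFirstNChars(String value, int n, char maskChar) {
            if (value == null) return null;
            StringBuilder sb = new StringBuilder(value);
            for (int i = 0; i < n && i < sb.length(); i++) sb.setCharAt(i, maskChar);
            return sb.toString();
        }

        // "Replace characters between two positions" with "5, 6, x" (positions 5 and 6):
        // xxx-45-6789 -> xxx-xx-6789
        public static String replaceBetweenPositions(String value, int start, int end, char maskChar) {
            if (value == null) return null;
            StringBuilder sb = new StringBuilder(value);
            for (int i = start - 1; i < end && i < sb.length(); i++) sb.setCharAt(i, maskChar);
            return sb.toString();
        }
    }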


  13. Configure the tDataShuffling component. This component allows you to shuffle data within columns and groups of columns. For example, we will shuffle city, state, and zip so that the locations remain valid combinations but become associated with different records. Below the ‘Shuffling columns’ table, click the green plus sign (+) five times to add five rows. Configure each row using the table below (a conceptual sketch of group-wise shuffling follows the table):
      Column      Group ID
      firstName   1
      lastName    2
      city        3
      state       3
      zip         3
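
Conceptually, every column in the same shuffling group is reordered with one shared permutation, which is why city, state, and zip stay a valid combination while firstName and lastName move independently. Below is a minimal sketch of that idea, not Talend’s implementation.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class GroupShuffleSketch {
        // columns: the values of every column in one shuffling group,
        // each inner list holding one value per record, all the same length.
        public static List<List<String>> shuffleGroup(List<List<String>> columns) {
            int rowCount = columns.get(0).size();
            List<Integer> order = new ArrayList<>();
            for (int i = 0; i < rowCount; i++) order.add(i);
            Collections.shuffle(order);                           // one permutation per group
            List<List<String>> shuffled = new ArrayList<>();
            for (List<String> col : columns) {
                List<String> newCol = new ArrayList<>(rowCount);
                for (int idx : order) newCol.add(col.get(idx));   // same order for every column
                shuffled.add(newCol);
            }
            return shuffled;
        }
    }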


  14. Configure the tHDFSOutput component.
    1. If prompted, install the required tHDFSOutput external jars by clicking Install and then Download and install all modules available.

    2. In the tHDFSOutput component configuration, choose Repository from the ‘Property Type’ dropdown. Then click the ellipsis and traverse down the tree to select HadoopHDFS from the Repository Content pop-up. This will populate the Hadoop Cluster content into the component. In the ‘File Name’ field, enter "/user/talend/workshop/marketing_campaign.csv" and from the ‘Action’ dropdown, select Overwrite. Finally, click the checkbox to ‘Include Header’ in the HDFS Output file.

  15. Finally, in the Repository, navigate to Metadata -> Db Connections and drag and drop the ‘MySQL’ connection onto the Job Designer. From the Components pop-up, select tMySQLOutput and click OK. Add a second connection from the tReplicate component to the new MySQL component and name the connection ‘data_out2’.
  16. Configure the MySQL component. Update the ‘Table’ parameter to “Customers” and select Drop table if exists and create from the ‘Action on table’ dropdown.
  17. Save the Job. The completed Job should look like this:

Execute the Job

  1. Execute the job and review the output.
    1. To review the data written to the MySQL Database, right-click on the MySQL component and choose Data Viewer. Notice the Phone Numbers have been standardized, and invalid emails have been removed.
    2. To review the data written to HDFS, within the Repository, navigate to Job Designs -> Standard and open the HDFS_DataViewer Job. This is a pre-configured job that reads the newly created HDFS file and writes it to the Job Execution Window. Run the job and review the results. You will notice that the email address and ssn have been masked. If you compare this output to the data in the MySQL database, you should also notice that firstName and lastName, along with the combination of city, state, and zip, have been shuffled. Your data scientists can now analyze this cleansed, realistic data for machine learning purposes without compromising customer identity.

    Exercise 2

    Summary

    In this exercise you will build and configure a job to train a machine learning model based on the Decision Tree algorithm. With Talend, this process is straightforward.

    Purpose

    Learn to train a machine learning model with Talend Studio components.

    Pre-Requisites

    • Complete Exercise 1 or Execute the Exercise_1_Example job to load the MySQL Database and generate the source file on HDFS.

    Procedure

    Note:  You can find a completed example to check your work under Job Designs -> Big Data Batch -> AdvAnalyticsWorkshop -> Exercise_02_Example

    Build the Job

    1. In the Studio Repository, navigate to Job Designs -> Big Data Batch -> AdvAnalyticsWorkshop -> Exercise_2. Double-Click on the job ‘Exercise_02_Train_MarketingCampaignData’ to open this empty job in the Design Window.
    2. Now in the Studio Repository, navigate to Metadata -> Hadoop Cluster -> Hadoop Connection 0.1 -> HDFS. In the HDFS folder, drag and drop the HadoopHDFS object onto the Job Designer. In the Components Pop-up, select tHDFSConfiguration and click OK. (Note: If you receive a pop-up for Hadoop Configuration Update Confirmation, Click OK.)
    3. From the Repository, under Metadata -> File delimited, drag and drop the Customers object onto the Job Designer. Confirm the box is checked to ‘Define a storage configuration component’ and keep the default value of the tHDFSConfiguration component. Next click on the ellipsis next to the ‘Folder/File’ parameter and in the HDFS Browse pop-up, navigate to /user/talend/workshop and select the marketing_campaign.csv file and click OK. Change the ‘Field separator’ to a comma (“,”).

    4. Add a tFilterColumns component and connect it to the Customers input component. Click on Edit Schema in the tFilterColumns component and add only the fields listed below to the Output Schema and click OK:
      1. age
      2. jobtype
      3. maritalstatus
      4. educationlevel
      5. indefault
      6. hasmortgage
      7. haspersonalloan
      8. numcampaigncalls
      9. priorcampaignoutcome
      10. conversion

    5. Add a tModelEncoder component and connect it to the tFilterColumns component. Configure the tModelEncoder component by first clicking on the Sync Columns button in the component configuration. Then choose to Edit Schema and add two new columns to the Output Schema. When done, click OK.
       Column       Data Type
       MyFeatures   Vector
       MyLabels     Double


    6. In the ‘Transformations’ table of the tModelEncoder component, add a transformation and select RFormula (Spark 1.5+) from the ‘Transformation’ dropdown list. Add the below code as the ‘Parameters’ value:
      • featuresCol=MyFeatures;labelCol=MyLabels;formula=conversion ~ age + jobtype + maritalstatus + educationlevel + indefault + hasmortgage + haspersonalloan + numcampaigncalls + priorcampaignoutcome
    7. Finally, add a tDecisionTreeModel component and connect it to the tModelEncoder. Configure the tDecisionTreeModel component by checking the box to ‘Define a storage configuration component’ and keeping the default value of tHDFSConfiguration. Update the ‘Features and Labels’ settings by selecting MyFeatures from the ‘Features Column’ dropdown and MyLabels from the ‘Label Column’ dropdown. Update the ‘HDFS Folder’ location to "/user/talend/workshop/decisiontree". Leave the remaining default values.
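
    For readers curious what steps 5 through 7 correspond to in raw Spark MLlib, the sketch below chains an RFormula encoder and a decision tree classifier and saves the model to the same HDFS folder. Talend generates its own Spark code for this job, so treat this only as a rough equivalent; the file path, formula, and column names are taken from this exercise.

        import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
        import org.apache.spark.ml.classification.DecisionTreeClassifier;
        import org.apache.spark.ml.feature.RFormula;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;

        public class TrainDecisionTreeSketch {
            public static void main(String[] args) throws java.io.IOException {
                SparkSession spark = SparkSession.builder().appName("TrainSketch").getOrCreate();

                // Read the masked customer file produced in Exercise 1
                Dataset<Row> customers = spark.read()
                        .option("header", "true")
                        .option("inferSchema", "true")
                        .csv("/user/talend/workshop/marketing_campaign.csv");

                // RFormula plays the role of tModelEncoder: build the feature vector and label
                RFormula formula = new RFormula()
                        .setFormula("conversion ~ age + jobtype + maritalstatus + educationlevel"
                                + " + indefault + hasmortgage + haspersonalloan"
                                + " + numcampaigncalls + priorcampaignoutcome")
                        .setFeaturesCol("MyFeatures")
                        .setLabelCol("MyLabels");
                Dataset<Row> encoded = formula.fit(customers).transform(customers);

                // Train the decision tree and persist it where tDecisionTreeModel would
                DecisionTreeClassifier tree = new DecisionTreeClassifier()
                        .setFeaturesCol("MyFeatures")
                        .setLabelCol("MyLabels");
                DecisionTreeClassificationModel model = tree.fit(encoded);
                model.write().overwrite().save("/user/talend/workshop/decisiontree");

                spark.stop();
            }
        }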

    Execute the Job

    1. Save and Execute the job. If prompted, choose to Download and install all modules available. The job will use the Source Customers file to train a decision tree machine learning model, saving the resultant model on HDFS.

     

    Exercise 3

    Summary

    In this exercise you will finish building a Spark Streaming job. This job will use the Decision Tree Model that was created in Exercise 2 to determine if a user is a candidate for a targeted marketing campaign.

    Purpose

    Complete a Spark Streaming job in Talend Studio and execute the use-case.

    Pre-Requisites

    • Complete Exercise 1 or Execute the Exercise_1_Example job to load the MySQL Database and generate the source file on HDFS.
    • Complete Exercise 2 or Execute the Exercise_2_Example job to train a Decision Tree Model

    Procedure

    Note:  You can find a completed example to check your work under Job Designs -> Big Data Streaming -> AdvAnalyticsWorkshop -> Exercise_03_Example

    Finish Spark Streaming Job

    1. In the Talend Repository, navigate to Job Designs -> Big Data Streaming -> AdvAnalyticsWorkshop -> Exercise_3. Open the job Exercise_03_RealtimeTargetMarketing. This job has already been started but is missing some key components. You should notice that it contains connection configuration components for Cassandra and HDFS. It also has a Kafka input component as well as a tReplace component and a Cassandra output component, all already configured.
    2. Add three additional components to the Job Designer.
      1. tWindow
      2. tExtractDelimitedFields
      3. tPredict

    3. Connect the Kafka input component to the tWindow component and configure the tWindow component. Set the ‘Window duration’ to 2000 and check the box to ‘Define the slide duration’. Set this value to 2000 as well.
    4. Connect the tWindow component to the tExtractDelimitedFields component and configure the tExtractDelimitedFields component. In the ‘Schema’ dropdown, select Repository, then click on the ellipsis to select the Repository item. In the ‘Repository Content’ pop-up, navigate to Generic schemas -> PredictionSchema. Select the ‘metadata’ object and click OK. Update the ‘Field Separator’ to a comma (“,”).
    5. Connect the tExtractDelimitedFields component to the tPredict component and configure the tPredict component. Check the box to ‘Define a storage configuration component’ and keep the default value. In the ‘Model Type’ dropdown, select Decision Tree Model and then update the ‘HDFS Folder’ path by clicking on the ellipsis and traversing the HDFS directory path to the decision tree model directory. It should be ‘/user/talend/workshop/decisiontree’.
    6. Finally, connect the tPredict component to the tReplace component. When asked if you would like to get the schema of the Target component, click No. Save the job. The final job should look like the below image:
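
    Because the tWindow component uses equal window and slide durations, the Kafka stream is processed in non-overlapping micro-batches, and tPredict applies the model trained in Exercise 2 to each batch. The sketch below shows roughly what that scoring step looks like in raw Spark MLlib; Talend generates its own streaming code, and the encodedBatch parameter (a Dataset that already carries the MyFeatures vector column) is a stand-in used here only for illustration.

        import org.apache.spark.ml.classification.DecisionTreeClassificationModel;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;

        public class PredictSketch {
            // Illustrative scoring step only (not Talend's generated code). In a real
            // job the model would be loaded once, not per batch.
            public static Dataset<Row> scoreBatch(Dataset<Row> encodedBatch) {
                DecisionTreeClassificationModel model =
                        DecisionTreeClassificationModel.load("/user/talend/workshop/decisiontree");
                return model.transform(encodedBatch);   // appends a "prediction" column
            }
        }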

    Execute the Complete Demo Scenario

    In this training, you have completed 3 exercises. In the first exercise you built a Standard Data Integration job to cleanse, standardize and mask a source file of Customer data. In the second exercise you used that file to train a Decision Tree Model to determine if a specific user should receive a targeted marketing campaign ad. The final exercise completed a Spark Streaming Job to use the trained model for targeted marketing of customers as they log into the system. To execute the full scenario, complete the following steps.

    1. Initialize the environment and start two pre-configured web service jobs. Navigate to Job Designs -> Standard -> Setup_and_Services.
      1. Open and Execute job A_EnvironmentSetup. This job creates a Kafka topic and initializes a Cassandra table.
      2. Open and Execute job B_LoginService. This job starts a web service that retrieves User Profile information from the MySQL database that was populated as part of Exercise 1 and provides it back to the web interface. This job will remain running throughout the demo.
      3. Open and Execute job C_AdService. This job starts a web service that checks whether a targeted marketing decision for the user is already stored in Cassandra. If the user has no score, their information is retrieved from the MySQL database and sent to the Kafka queue, where the Realtime Target Marketing job completed in Exercise 3 determines whether the user should be targeted with a specific marketing ad. The results are displayed on the web interface. This job will remain running throughout the demo.
    2. Execute the job Exercise_03_RealtimeTargetMarketing completed in Exercise 3. This job will take a few minutes to initialize as it is submitted to the Hadoop Cluster.
    3. Open Firefox web browser and navigate to the training home page. There will be a link to the Real-Time Target Marketing Campaign. Click the Access button.
    4. Enter a User Id (any number from 1-1546) and click the Login button. The Login button calls a web service that feeds the Spark Streaming job, which determines whether the user logging in should receive a targeted marketing campaign. After a few seconds, the results should appear.
    5. Continue by clicking Logout and then entering different user IDs to see whether targeted marketing is available. Try user IDs 569, 864, 912, and 1427, and note the results. If you log in twice with the same user ID, the decision is displayed immediately, because targeted ads are stored in Cassandra for quicker retrieval.

     

    Conclusion

    In this hands-on training environment, you gained real-life experience with the capabilities of the Talend Real-time Big Data Platform. Through the exercises you built a complete end-to-end solution to a complex business problem using some of the latest technology in advanced analytics and machine learning. With Talend, you enjoy the benefits of a future-proof platform that enables you to embrace any innovation. Talend also simplifies and automates advanced analytics integration projects with on-demand serverless Spark and machine learning capabilities. Further, Talend leverages the full power of cloud and advanced analytics with native code generation. For more information and to explore more of what Talend can do for your business, visit www.talend.com.

Last Updated: August 7th, 2019