In this tutorial, you create a Big Data Batch Job that runs on YARN, reads data from HDFS, sorts it, and displays it in the console.
This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4. It reuses the HDFS connection metadata created in the tutorial entitled “Creating Cluster Connection Metadata.”
1. Create a new Big Data Batch Job using the MapReduce framework
For Big Data processing, Talend Studio allows you to create Batch Jobs and Streaming Jobs running on Spark or MapReduce. In this case, you’ll create a Big Data Batch Job running on MapReduce.
- Ensure that the Integration perspective is selected.
- To ensure that the Hadoop cluster connection and the HDFS connection metadata have been created in the Project Repository, expand Hadoop Cluster.
- In the Repository, expand Job Designs, right-click Big Data Batch, and click Create Big Data Batch Job.
- In the Name field, type ReadHDFS_YARN. From the Framework list, select MapReduce. In the Purpose field, type Read and sort customer data. In the Description field, type Read and sort customer data stored in HDFS from a Big Data Batch Job running on YARN. Then click Finish.
The Job appears in the Repository under Job Designs > Big Data Batch and it opens in the Job Designer.
2. Read data from HDFS and configure execution on YARN
Note: For Jobs running on YARN, the tFileInputDelimited component allows you to read data from HDFS. You will configure this component to read your data.
- From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer. In the Components list, select tFileInputDelimited and click OK. The Hadoop Configuration Update Confirmation window opens.
- To allow the Studio to update the Hadoop configuration so that it corresponds to your cluster metadata, click OK.
- In the Run view, click Hadoop Configuration and check that the execution is configured with the HDFS connection metadata available in the Repository.
Now, you can read data from HDFS and your Job is configured to execute on your cluster.
3. Configure the tFileInputDelimited component to read your data from HDFS
Customer data is stored as a delimited file on HDFS. You can configure the tFileInputDelimited component to read the data.
- To open the Component view of the tFileInputDelimited component, double-click the component.
- To open the schema editor, click Edit schema.
- To add columns to the schema, click the [+] icon three times and type the field names as CustomerID, FirstName, and LastName.
- To change the Type of the CustomerID column, click its Type field and select Integer. Click OK to save the schema.
Alternative method: Use metadata from the Repository to configure the schema. To learn more about this, watch the tutorial: “Creating and Using Metadata.”
- To provide the location of the file to be read, click […] next to the Folder/File field, browse to find “user/student/CustomersData”, and click OK.
The tFileInputDelimited component is now configured to read customer data from HDFS. Other file types such as Avro, JSON, and XML are also supported, and files do not have to be delimited.
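To make the effect of the schema concrete, the following is a minimal Python sketch of what reading a delimited file against the three-column schema amounts to. It is not Talend's generated code. The semicolon separator, the function name parse_delimited, and the sample rows are assumptions for illustration only; the tutorial does not show the contents of CustomersData or state its field separator.

```python
import csv
import io

# Schema defined in the schema editor: CustomerID (Integer), FirstName, LastName.
SCHEMA = [("CustomerID", int), ("FirstName", str), ("LastName", str)]


def parse_delimited(text, delimiter=";"):
    """Parse delimited text into a list of dicts typed according to SCHEMA.

    The ';' delimiter is an assumption; use whatever separator the file has.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        rows.append(
            {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}
        )
    return rows


# Hypothetical sample content standing in for the real file.
SAMPLE = "3;Ada;Lovelace\n1;Grace;Hopper\n2;Alan;Turing\n"
customers = parse_delimited(SAMPLE)
print(customers[0])
```

Declaring CustomerID as Integer matters for the next step: a numeric column can be sorted by value rather than character by character.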
4. Sort customer data based on the customer ID value, in ascending order
- Add a tSortRow component.
- Connect the tFileInputDelimited component, named MyHadoopCluster_HDFS, to the tSortRow component using a Main row.
- To open the Component view of the tSortRow component, double-click the component.
- To configure the schema, click Sync columns.
- To add a new criterion to the Criteria table, click the [+] icon. In the Schema column, select CustomerID. In the sort num or alpha? column, select num, and in the Order asc or desc? column, select asc.
The tSortRow component is now configured.
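The choice between num and alpha changes the result, as the following Python sketch illustrates. It only mimics the sort criteria conceptually and is not how tSortRow is implemented; the function sort_rows and the sample rows are hypothetical.

```python
def sort_rows(rows, column, numeric=True, ascending=True):
    """Sort rows (dicts) on one column, numerically or alphabetically."""
    if numeric:
        key = lambda row: int(row[column])   # "num": compare by value
    else:
        key = lambda row: str(row[column])   # "alpha": compare character by character
    return sorted(rows, key=key, reverse=not ascending)


# Hypothetical customer rows, not taken from the tutorial's data file.
customers = [
    {"CustomerID": 10, "FirstName": "Ada", "LastName": "Lovelace"},
    {"CustomerID": 9, "FirstName": "Grace", "LastName": "Hopper"},
    {"CustomerID": 100, "FirstName": "Alan", "LastName": "Turing"},
    {"CustomerID": 2, "FirstName": "Edsger", "LastName": "Dijkstra"},
]

num_order = [r["CustomerID"] for r in sort_rows(customers, "CustomerID")]
alpha_order = [r["CustomerID"] for r in sort_rows(customers, "CustomerID", numeric=False)]

print(num_order)    # [2, 9, 10, 100]
print(alpha_order)  # [10, 100, 2, 9]
```

With an alphabetical sort, 10 and 100 come before 2 because the comparison starts with the first character, which is why num is selected for CustomerID.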
5. Display the sorted result in the console using a tLogRow component
- Add a tLogRow component and connect the tSortRow component to it using a Main row.
- To open the Component view of the tLogRow component, double-click the component.
- In the Mode panel, select Table.
Your Job is now ready to run. It reads data from HDFS, sorts it, and displays it in the console.
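As a rough picture of what a table-style console display involves, the Python sketch below renders rows in a bordered, column-aligned grid. The exact layout that tLogRow prints may differ; the function format_table, the border characters, and the sample rows are assumptions for illustration.

```python
def format_table(rows, columns):
    """Render rows (dicts) as a bordered text table with aligned columns."""
    # Each column is as wide as its longest cell or its header.
    widths = [
        max([len(col)] + [len(str(row[col])) for row in rows]) for col in columns
    ]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(values):
        cells = (f" {str(v).ljust(w)} " for v, w in zip(values, widths))
        return "|" + "|".join(cells) + "|"

    out = [border, line(columns), border]
    out += [line([row[col] for col in columns]) for row in rows]
    out.append(border)
    return "\n".join(out)


# Hypothetical, already sorted rows; the real output depends on your data.
sorted_customers = [
    {"CustomerID": 1, "FirstName": "Grace", "LastName": "Hopper"},
    {"CustomerID": 2, "FirstName": "Alan", "LastName": "Turing"},
]
table = format_table(sorted_customers, ["CustomerID", "FirstName", "LastName"])
print(table)
```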