In this tutorial, you create a Big Data Batch Job using the Spark framework, read data from HDFS, sort it, and display it in the console.
This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4. It reuses the HDFS connection metadata created in the tutorial entitled “Creating Cluster Connection Metadata.”
1. Create a new Big Data Batch Job using the Spark framework
For Big Data processing, Talend Studio allows you to create Batch Jobs and Streaming Jobs running on Spark or MapReduce. In this case, you’ll create a Big Data Batch Job running on Spark.
- Ensure that the Integration perspective is selected.
- Ensure that the Hadoop cluster connection and the HDFS connection metadata have been created in the Project Repository.
- In the Repository, expand Job Designs, right-click Big Data Batch, and click Create Big Data Batch Job.
- In the Name field, type ReadHDFS_Spark. In the Framework list, ensure that Spark is selected. In the Purpose field, type Read and sort customer data. In the Description field, type Read and sort customer data stored in HDFS from a Big Data Batch Job running on Spark. Then click Finish.
The Job appears in the Repository under Job Designs > Big Data Batch and it opens in the Job Designer.
2. Use HDFS metadata definition to configure the connection to HDFS and the execution on Spark
Spark can connect to different storage systems such as HDFS, Amazon S3, or Cassandra. To read your data from HDFS, first configure the connection to HDFS by using the HDFS connection metadata available in the Repository. The same metadata is also used to configure the execution of your Job on Spark.
- From the Repository, under Metadata > HadoopCluster > MyHadoopCluster > HDFS, click MyHadoopCluster_HDFS and drag it to the Job Designer. In the Components list, select tHDFSConfiguration and click OK. The Hadoop Configuration Update Confirmation window opens.
- To allow the Studio to update the Spark configuration so that it corresponds to your cluster metadata, click OK.
- In the Run view, click Spark Configuration and check that the execution is configured with the HDFS connection metadata available in the Repository.
You can configure your Job in Spark local mode, Spark Standalone, or Spark on YARN. Local mode is used to test a Job during the design phase. Choosing Spark Standalone or Spark on YARN depends on the version of Spark installed on your cluster. For this tutorial, it’s Spark on YARN.
To configure it, the Studio uses the HDFS connection metadata, which also configures the cluster version, distribution, and resource manager address.
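To make the three execution modes more concrete, here is a minimal Python sketch of the master URL forms that Spark itself uses for local mode, Spark Standalone, and Spark on YARN. It is not what the Studio generates; the function name `spark_master_url` and the host name in the usage lines are hypothetical, chosen only for illustration.

```python
# Sketch only: maps the three modes named above to Spark master URL forms.
# The function name and the example host are hypothetical.

def spark_master_url(mode, host=None, port=7077, local_threads="*"):
    """Return a Spark master URL string for the given execution mode."""
    if mode == "local":
        # Local mode runs in a single JVM; "*" asks for one thread per core.
        return f"local[{local_threads}]"
    if mode == "standalone":
        if host is None:
            raise ValueError("Standalone mode needs the master host name")
        # 7077 is the default port of a Spark Standalone master.
        return f"spark://{host}:{port}"
    if mode == "yarn":
        # On YARN, the resource manager address comes from the Hadoop
        # configuration rather than from the master URL. Older Spark 1.x
        # releases used "yarn-client" or "yarn-cluster" instead of "yarn".
        return "yarn"
    raise ValueError(f"Unknown mode: {mode}")


print(spark_master_url("local"))
print(spark_master_url("standalone", host="spark-master.example.com"))
print(spark_master_url("yarn"))
```

In the Studio you do not type these values yourself: the Spark Configuration tab is filled in from the cluster metadata, as described above.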
3. Configure the tFileInputDelimited component to read your data from HDFS
Now, you can connect to HDFS and your Job is configured to execute on your cluster. For Jobs running on Spark, the tFileInputDelimited component allows you to read data from various file storage systems.
- In the Job Designer, add a tFileInputDelimited.
- To open the component view of the tFileInputDelimited component, double-click the component.
- In the Storage panel, make sure that the tHDFSConfiguration component is selected as the storage configuration component.
- To open the schema editor, click Edit schema.
- To add columns to the schema, click the [+] icon three times and type the column names as CustomerID, FirstName, and LastName.
- To change the Type for the CustomerID column, click the Type field and click Integer. Click OK to save the schema.
Alternative method: Use metadata from the Repository to configure the schema. To learn more about this, watch the tutorial: “Creating and Using Metadata”.
- To provide the location of the file to be read, click […] next to the Folder/File field, browse to find “user/student/CustomersData” and click OK.
The tFileInputDelimited component is now configured to read customer data from HDFS.
Other file types such as AVRO, JSON, and XML are also supported, and files do not have to be delimited.
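As a rough illustration of what this configuration means, the following plain-Python sketch parses delimited text into records typed according to the three-column schema defined above. It is not the code the Studio generates. The sample rows are made up, and the ";" field separator is an assumption, because the tutorial does not show the file contents or the separator setting.

```python
import csv
import io

# Made-up sample rows; the real file is in HDFS and is not shown in this tutorial.
SAMPLE = "3;Ada;Lovelace\n1;Grace;Hopper\n2;Alan;Turing\n"

# Schema from the tutorial: column name plus a Python type standing in
# for the Talend type (Integer for CustomerID, String for the others).
SCHEMA = [("CustomerID", int), ("FirstName", str), ("LastName", str)]


def read_delimited(text, schema, delimiter=";"):
    """Parse delimited text into a list of dicts typed according to schema."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(schema):
            raise ValueError(f"Expected {len(schema)} fields, got {len(fields)}")
        # Apply each column's type to the matching field.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


customers = read_delimited(SAMPLE, SCHEMA)
for customer in customers:
    print(customer)
```

Declaring CustomerID as an integer at read time is what allows a later numeric sort to work on values rather than on text.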
4. Sort customer data based on the customer ID value in ascending order
- Add a tSortRow.
- Connect the tFileInputDelimited component, named MyHadoopCluster_HDFS, to the tSortRow component using a Main row.
- To open the component view of the tSortRow component, double-click the component.
- To configure the schema, click Sync columns.
- To add new criteria to the Criteria table, click the [+] icon. In the Schema column, select CustomerID. In the sort num or alpha? column, select num, and in the Order asc or desc? column, select asc.
The tSortRow component is now configured.
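The choice between num and alpha matters because the two orderings differ as soon as IDs have different numbers of digits. The short Python sketch below shows the difference on made-up IDs, then sorts a few made-up customer rows by CustomerID in ascending order. It only illustrates the sort criteria; it is not how tSortRow is implemented.

```python
# Made-up customer IDs as they would appear in a text file: all strings.
raw_ids = ["10", "9", "100", "1"]

# "alpha" ordering compares character by character, so "10" sorts before "9".
alpha_sorted = sorted(raw_ids)

# "num" ordering compares numeric values.
num_sorted = sorted(raw_ids, key=int)

print("alpha:", alpha_sorted)
print("num:  ", num_sorted)

# Made-up rows using the schema from this tutorial.
rows = [
    {"CustomerID": 3, "FirstName": "Ada", "LastName": "Lovelace"},
    {"CustomerID": 1, "FirstName": "Grace", "LastName": "Hopper"},
    {"CustomerID": 2, "FirstName": "Alan", "LastName": "Turing"},
]

# Ascending numeric sort on CustomerID; reverse=True would give descending order.
sorted_rows = sorted(rows, key=lambda row: row["CustomerID"])
for row in sorted_rows:
    print(row)
```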
5. Display the sorted data in the console using a tLogRow component
- Add a tLogRow component and connect it to the tSortRow component using a Main row.
- To open the component view of the tLogRow component, double-click the component.
- In the Mode panel, select Table.
Your Job is now ready to run. It reads data from HDFS, sorts it, and displays it in the console.
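To give an idea of what a table-style console display involves, here is a small Python sketch that renders rows as a bordered text table. The exact layout that tLogRow prints is not reproduced here; the function name `format_table` and the sample rows are hypothetical.

```python
def format_table(rows, columns):
    """Render a list of dicts as a bordered, left-aligned text table."""
    cells = [[str(row[col]) for col in columns] for row in rows]
    # Each column is as wide as its longest value or its header.
    widths = [
        max([len(col)] + [len(line[i]) for line in cells])
        for i, col in enumerate(columns)
    ]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def render(values):
        return "|" + "|".join(f" {v:<{w}} " for v, w in zip(values, widths)) + "|"

    lines = [border, render(columns), border]
    lines += [render(line) for line in cells]
    lines.append(border)
    return "\n".join(lines)


# Made-up rows, already sorted by CustomerID.
rows = [
    {"CustomerID": 1, "FirstName": "Grace", "LastName": "Hopper"},
    {"CustomerID": 2, "FirstName": "Alan", "LastName": "Turing"},
    {"CustomerID": 3, "FirstName": "Ada", "LastName": "Lovelace"},
]

table = format_table(rows, ["CustomerID", "FirstName", "LastName"])
print(table)
```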
6. Run the Job and observe the result in the console
To run the Job, in the Run view, open the Basic Run tab and click Run. In the Job Designer, note that the Job percentage is 100% at the end of the execution.
The sorted data is displayed in the console.