Talend and Apache Spark: A Technical Primer
In my years at Talend as a Support Engineer, before I moved into the Customer Success Architect team, customers often asked about Talend’s capabilities with Apache Spark. When we talk about Spark, the first thing that comes to mind is spark-submit, the command we use to submit Spark jobs. So the question naturally comes up of how a Talend Spark job compares to a regular spark-submit. In this blog, we are going to cover the different modes Apache Spark offers, the ones used by Talend, and how Talend works with Apache Spark.
An Intro to Apache Spark Jobs
Apache Spark has two different types of jobs that you can submit: Spark Batch and Spark Streaming. Spark Batch operates under a batch processing model, where a data set is collected over a period of time and then sent to a Spark engine for processing.
Spark Streaming, on the other hand, operates under a streaming model, where data is sent to a Spark engine piece by piece and processed in near real time. Talend supports both job types. Within Talend Studio, depending on your license, you will be given the option of “Big Data Batch” to create Spark Batch Jobs and “Big Data Streaming” to create Spark Streaming Jobs.
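To make the difference between the two models concrete, here is a minimal sketch in plain Python. It does not use Spark at all; the function names and the sample numbers are made up for illustration. The batch function sees the whole collected data set at once, while the streaming function consumes records in small groups and reports a running result after each group, which is roughly how Spark Streaming handles incoming data.

```python
def process_batch(records):
    """Batch model: the whole collected data set is processed in one pass."""
    return sum(records)


def process_stream(record_source, batch_size):
    """Streaming model: consume records in small groups and yield a
    running total after each group (hypothetical helper, not a Spark API)."""
    total = 0
    chunk = []
    for record in record_source:
        chunk.append(record)
        if len(chunk) == batch_size:
            total += sum(chunk)
            chunk = []
            yield total
    if chunk:
        # Flush whatever is left when the source ends.
        yield total + sum(chunk)


records = [3, 1, 4, 1, 5, 9, 2]  # made-up sample data
batch_total = process_batch(records)
running_totals = list(process_stream(iter(records), batch_size=3))

print("batch result:", batch_total)
print("streaming results:", running_totals)
```

Both approaches end at the same total; the streaming version simply makes intermediate results available before all of the data has arrived.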
Diving Deeper on Talend and Apache Spark
Before we proceed any further, I want to introduce some key concepts that are going to be used throughout the rest of this blog:
- Spark Driver: Responsible for sending your application over to the Spark Master, and for creating and running your Spark Context
- Spark Master: Responsible for requesting resources from YARN as defined by the Spark Driver and finding the hosts that will run your job
- Spark Executor: A process started on the worker nodes that runs your job submission in memory or on disk
To begin with, here is some context on how Spark jobs work, whether you use spark-submit or Talend. In Spark jobs, there is always a “driver” that sets up and coordinates your Spark job. The Spark driver sets up the configuration that is going to be used by your job, such as the Spark master to connect to or how much memory to allocate to your Spark executors. Talend therefore does the equivalent of a spark-submit, in the sense that there is always a Spark driver that sets up and coordinates your Spark job.
Now, when you do a spark-submit from within the Hadoop cluster, some of the configuration information is retrieved from your cluster configuration files. Since Talend Studio is not always on a Hadoop cluster, we need to provide this information within our job in the Studio so that it is aware of the settings it can use.
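As a rough illustration of what "providing this information" means, the sketch below assembles a spark-submit command line in which every setting is passed explicitly instead of being read from cluster configuration files. The flags `--master`, `--deploy-mode`, `--class` and `--conf` are standard spark-submit options; the helper function, the jar name, the class name and the memory values are assumptions made up for this example, and this is not the code Talend generates.

```python
import shlex


def build_spark_submit(app_jar, main_class, master, deploy_mode=None, conf=None):
    """Assemble a spark-submit argument list from explicit settings
    (hypothetical helper for illustration)."""
    args = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        args += ["--deploy-mode", deploy_mode]
    args += ["--class", main_class]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


example_args = build_spark_submit(
    app_jar="my_job.jar",            # hypothetical jar name
    main_class="com.example.MyJob",  # hypothetical main class
    master="yarn",
    deploy_mode="client",
    conf={
        "spark.executor.memory": "4g",    # example value
        "spark.executor.instances": "2",  # example value
    },
)

# Print the command as it would be typed in a shell.
print(shlex.join(example_args))
```

In a Talend job, the same kind of information (which master to connect to, how much executor memory to request) is entered in the job's Spark configuration rather than on a command line.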
As for defining the data transformations that take place in the Spark job, in Talend this happens when the job is compiled, which is identical to what happens with spark-submit. As with spark-submit, Talend also starts the job as the “driver” defined above, although the job is not run in the driver but on Spark executors at the cluster level. Once the job is started, Talend monitors it by listening to events happening at the Hadoop cluster level to report how the job is progressing, which is similar to what happens when you use spark-submit.
Whether you use spark-submit or a Talend job to submit your job to Spark, there are three modes offered, depending on your Hadoop cluster configuration. Based on the Spark documentation (http://spark.apache.org/docs/latest/cluster-overview.html), here are the three modes:
1. Standalone: In this mode, there is a Spark master that the Spark driver submits the job to, and Spark executors running on the cluster to process the jobs.
2. YARN client mode: Here, the Spark worker daemons allocated to each job are started and stopped within the YARN framework. The Spark driver, as described above, runs on the same system that you are running your Talend job from.
3. YARN cluster mode: In this mode, the Spark master and the Spark executors run inside the YARN framework. They start and stop with the job. In this case, the Spark driver also runs inside YARN at the Hadoop cluster level.
Now that we have defined the modes offered by Spark, we are going to examine what Talend offers. Here are the different modes supported in Talend:
1. Local: When selected, the job will spin up a Spark framework locally to run the job. Your local machine is going to be used as the Spark master and also as a Spark executor to perform the data transformations.
2. Standalone: In this mode, as defined above, Talend will connect to the Spark master defined in your Hadoop cluster and then run the job.
3. YARN client mode: As defined above, Talend Studio will run the Spark “driver” to orchestrate your job from the location where your job is started, and then send the orchestration to the YARN framework for execution and allocation of resources. This is the available selection for Hadoop distros like Hortonworks, Cloudera, MapR, Amazon EMR and so on.
4. YARN cluster: This mode is currently only supported for HDInsight and Cloudera Altus within Talend. In this mode, as mentioned above, Talend will run the Spark “driver” inside YARN at the Hadoop cluster level.
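For readers who think in spark-submit terms, the sketch below shows how these four modes would typically be expressed as `--master` and `--deploy-mode` values. The master URL formats (`local[*]`, `spark://host:port`, `yarn`) come from the Spark documentation; the mode labels, the function and the host name are assumptions invented for this example and do not reflect how Talend stores the setting internally.

```python
def spark_submit_target(mode, master_host=None, master_port=7077):
    """Return the --master and --deploy-mode values for a named mode.

    The mode labels are hypothetical names used only in this sketch.
    """
    if mode == "local":
        # Run Spark inside the local JVM, using all available cores.
        return {"master": "local[*]", "deploy_mode": None}
    if mode == "standalone":
        if master_host is None:
            raise ValueError("standalone mode needs the Spark master host")
        # 7077 is the default port of a standalone Spark master.
        return {"master": f"spark://{master_host}:{master_port}", "deploy_mode": None}
    if mode == "yarn-client":
        # Driver stays on the submitting machine; executors run in YARN.
        return {"master": "yarn", "deploy_mode": "client"}
    if mode == "yarn-cluster":
        # Driver and executors both run inside YARN.
        return {"master": "yarn", "deploy_mode": "cluster"}
    raise ValueError(f"unknown mode: {mode}")


for name in ("local", "yarn-client", "yarn-cluster"):
    print(name, spark_submit_target(name))

# The host name below is a placeholder.
print("standalone", spark_submit_target("standalone", master_host="spark-master.example.com"))
```

The only difference between the two YARN entries is where the driver runs, which is exactly the distinction drawn in the lists above.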
Top three questions on Talend and Apache Spark, Answered:
- Does Talend submit all the libraries and files needed for the job to the Spark Master at a single moment, or does some processing still run in the Studio?
Answer: Not all of the libraries are necessarily sent to the Spark Master at one single moment. There is always a possibility that the executors will call back to the Spark “driver” so that it sends the necessary libraries at that point in time. The “driver” also keeps running while the job is being processed, in order to wait for the job to finish and then report its status. However, as in any Spark job, some processing may end up happening on the “driver” end. A good example of this is a tLogRow in your job, which has to collect all the information from the cluster and then print it to the console, or the tCacheIn and tCacheOut components, where the “driver” stores the metadata about where the files are located in memory. In this respect, Talend Studio is no different from running a spark-submit.
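The tLogRow case can be pictured with a small simulation in plain Python. It does not use Spark or Talend; threads stand in for executors, and all names and sample rows are made up. The point it illustrates is that the transformation runs per partition, but printing to one console requires every row to be gathered in a single place first, which is the role the driver plays.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_partition(rows):
    """Work done on an 'executor': transform one partition of rows."""
    return [row.upper() for row in rows]


def run_job(partitions):
    """Transform partitions in parallel, then gather the rows in one place."""
    # Threads stand in for executors working on separate partitions.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        transformed = list(pool.map(transform_partition, partitions))
    # The 'driver' step: pull every row from every partition into one list.
    return [row for part in transformed for row in part]


partitions = [["a", "b"], ["c"], ["d", "e"]]  # made-up sample data
rows_on_driver = run_job(partitions)

# Printing happens where the rows were collected, not on the executors.
for row in rows_on_driver:
    print(row)
```

With a large data set, that gathering step is what makes console logging expensive, because all rows have to fit on and pass through one machine.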
- If I want to write a file that won’t be in HDFS, but local to the Studio or JobServer, without providing a storage configuration, will the file be written?
Answer: A storage configuration needs to be provided to the job. This is recommended for job stability, and it is preferable to use one storage configuration for the whole job. It is not possible to use two HDFS storage locations that belong to two different clusters or use two different authentication methods. Combining HDFS and S3, on the other hand, would work. The reason the job cannot write to the local filesystem is that the Spark workers don’t have visibility into the local filesystem of the “driver”, which is also why the Spark “driver” opens a server to send the libraries to the cluster. If this needs to be achieved, the best option is to write a temporary file in HDFS and then have a DI job that uses tHDFSGet to bring that file back to the local machine. This behavior is also no different from spark-submit.
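Outside of Talend, the "write to HDFS, then bring the file back" step is commonly done with the Hadoop command-line client. The sketch below builds an `hdfs dfs -get` command, which copies a file from HDFS to the local filesystem, and only runs it if an `hdfs` client is found on the machine. It is a command-line analogue of the approach described above, not what tHDFSGet does internally, and both paths are hypothetical.

```python
import shutil
import subprocess


def hdfs_get_command(hdfs_path, local_path):
    """Build the Hadoop CLI command that copies a file from HDFS to local disk."""
    return ["hdfs", "dfs", "-get", hdfs_path, local_path]


def fetch_to_local(hdfs_path, local_path):
    """Run the copy if the hdfs client is installed; return True on success."""
    if shutil.which("hdfs") is None:
        return False  # no Hadoop client available on this machine
    completed = subprocess.run(hdfs_get_command(hdfs_path, local_path))
    return completed.returncode == 0


# Both paths are placeholders for illustration.
command = hdfs_get_command("/tmp/job_output/part-00000", "./part-00000")
print(" ".join(command))
```

The Spark job itself only ever writes to the shared storage; the copy to the local machine is a separate step that runs where the local filesystem is visible.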
- Why does the Environment tab in the Spark Master web interface take me to the IP address of the Studio?
Answer: This tab should always take you to the IP address of the Spark driver. If you run spark-submit on the cluster itself, you won’t notice a redirection, but if you run it from a machine outside the cluster, you will. In this case, Talend Studio is no different from a spark-submit.
Hopefully, the information above has demonstrated that running jobs on Talend is no different from performing a spark-submit. Talend makes it easy to code with Spark, gives you the ability to write jobs for both Spark Batch and Spark Streaming, and lets you use the Spark jobs you design for both. I invite you to start writing your own Spark jobs using Talend and experience how easy it is to get jobs running against your Hadoop cluster or standalone Apache Spark.