How to Configure ELK Stack for Telemetrics on Apache Spark


In this blog, I want to go over how to set up and deploy a Talend Spark Streaming job into a new Elastic Stack instance. Spark is the engine of choice for near real-time processing, not only for Talend but also for many organizations who have a need for large-scale lightning fast data processing. The Elastic Stack is a highly versatile and widely adopted suite of tools built for monitoring that works perfectly for this scenario.

The ELK Stack is comprised of Elasticsearch for indexing, Logstash of aggregating data, and Kibana for visualization. In the following tutorial we will install and configure the stack to read Spark Streaming metrics and display them as visualizations. Talend for Real-time Big Data is used to create the Spark Streaming processes. What we are trying to achieve overall is holistic insight into all our processes as an organization. Holistic insight is important because a single application may have hundreds of loosely coupled and interconnected processes and systems. Having the ability to recognize issues, like bottlenecks, data loss, underutilized resources, etc. without needing to switch between multiple pages, software, or charts, saves individuals and the organization measurable time and effort.

The way that we integrate Talend and Spark into this single solution is through JMX. JMX is a generic interface, not tightly coupled with the Elastic Stack and not exclusive to Talend or Spark, therefore it can be used as an interface from any Java based application to many monitoring tools. What this gives you is all available metrics for a Spark Streaming job synchronized, indexed, and visualized in near real-time to your analytics or monitoring tool of choice.


Getting Started with Logstash

Unzip to C:\elk\logstash-5.3.0.

Test the install

If everything is ok it will print the following with the current date and time:

2017-03-30T17:12:32.876Z MYHOSTNAME hello world


The JMX plugin is specifically for retrieving data from a running Java program. This does not come with Logstash and is maintained by a thirdparty.



First, edit the logstash configuration file.

Add the following lines to jmx.conf:


 Folder location where configurations are stored. REQUIRED


 The interval between two jmx metrics retrieval in seconds.


 Used as the type in Elasticsearch


 Number of threads used to retrieve metrics


Next, create a file for the plugin configuration. Add the following lines to jmx-input-conf:


 Can be found by opening Jconsole and selecting MBeans


 The values under the MBean that you would like Logstash to collect. Each is collected as a separate entry.


 Appears in Elasticsearch as a field value


A quick note before we continue, the object names will change depending on Spark version and distribution. Wildcards are possible, for instance we can have an object_name “kafka.consumer:type=*” to retrieve all of the objects under the kafka.consumer Mbean. This configuration may need to be changed for the tutorial after inspecting Jconsole during the Talend and Spark section.




Unzip to C:\elk\elasticsearch-5.0.0:

1)    START


Here we add an Analyzer to the field we are using to search our logs in Kibana. The reason for this is so that certain strings can be searched easier. The stop analyzer type breaks out the field based on characters like period and slash. This works well for the metric_path.

Run the commands using CURL or a REST client. Replace 2017.03.30 with the current date.

To reset an Elasticsearch Index run a XDELETE on the index and then run the XPUT again.

When you see the date 2017.03.30 replace it with the current date. Logstash automatically uses the date as a template for Elasticsearch. This can be changed in the jmx.conf.


Unlocking the Power of Apache Spark in Talend

This process requires Talend for Real-time Big Data. We must create a Streaming Spark Job to run this scenario.


Create the file C:\Talend\Spark\

Add the line:


The most important aspect of the job creation is to have a Kafka input and Kafka output. These two components are what the object_name property of the jmx-input-conf file is based on. These jobs do require that Kafka is installed.

The prebuilt Jobs can be downloaded HERE


Update the Spark Configuration for the job. Add the new items to Advanced properties:





Update the Advanced settings for the job. Check the Use specific JVM arguments box to add the following parameters:


After the jobs have been built, and settings changed, we can run them locally. When the Spark Streaming job is running you can open Jconsole to view the metrics.

From Run or CMD type Jconsole. This will bring up the application window if the Java environment variables are configured correctly.

Connect to a Remote Process with the URL localhost:54321


Unzip to C:\elk\kibana-5.0.0-windows-x86

1)    START

Go to http://localhost:5601/

When you first start Kibana you should be prompted to setup the Index Patterns. This should be gathered semi-autonomously from Elasticsearch. This is the reason we focus on Kibana last. If the other processes are not running and collecting information there will not be an initial index Pattern to select. Once the index has been set, we can open the Discover page to query the logs.


For each visualization, the METRIC and BUCKET will be the same. Set them to the following:


 Sum of metric_value_number


 Date Histogram of @timestamp

In the Search field, we need to enter filters for each chart that is being created. Enter the following filters, one per visualization:

  1. metric_path:(incoming AND byte AND rate)
  2. metric_path:(outgoing AND byte AND rate)
  3. metric_path:OneMinuteRate

Save each of the Visualizations so you can add them to a dashboard.

This is an example dashboard that incorporates 4 metrics from our Talend Spark Streaming job:


Looking at a black box wondering what is going on inside and how it’s affecting your operations is about as effective as looking at 10 separate dashboards and trying to combine the figures in your head. While each application generally has its own monitoring tools, having a single solution for attaining insight is the goal. A lot of organizations see this need, and have chosen the Elastic Stack to solve this challenge.

Metrics for Spark Streaming are available natively, but have their own separate interface which is not generally utilized by anyone outside of IT.  And while Talend provides monitoring capabilities for its own solution out of the box (including an Elastic Stack), it is much better for an organization to integrate its data management platform metrics into their own monitoring suite.


Example URL for Elasticsearch jmx mapping:


Reset Elasticsearch

To remove all of the indexed logs you need to delete the actual index from the REST API. Then reimplement the analyzer. Make sure you stop the Talend Spark Process first, so that no logs are written while you are attempting this.

Reset Kafka

To reset Kafka, you must stop the program and remove the log files. The location of these files are set via configuration and can be changed.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>