How to Configure ELK Stack for Telemetrics on Apache Spark
In this blog, I want to go over how to set up a new Elastic Stack instance and use it to monitor a Talend Spark Streaming job. Spark is the engine of choice for near real-time processing, not only for Talend but also for many organizations that need large-scale, lightning-fast data processing. The Elastic Stack is a highly versatile and widely adopted suite of monitoring tools that is well suited to this scenario.
The ELK Stack consists of Elasticsearch for indexing, Logstash for aggregating data, and Kibana for visualization. In the following tutorial we will install and configure the stack to read Spark Streaming metrics and display them as visualizations. Talend for Real-time Big Data is used to create the Spark Streaming processes. What we are trying to achieve overall is holistic insight into all our processes as an organization. Holistic insight is important because a single application may have hundreds of loosely coupled and interconnected processes and systems. Being able to recognize issues such as bottlenecks, data loss, and underutilized resources without switching between multiple pages, applications, or charts saves individuals and the organization measurable time and effort.
We integrate Talend and Spark into this single solution through JMX. JMX is a generic interface: it is not tightly coupled with the Elastic Stack and is not exclusive to Talend or Spark, so it can connect any Java-based application to many monitoring tools. This gives you all available metrics for a Spark Streaming job, synchronized, indexed, and visualized in near real-time in your analytics or monitoring tool of choice.
Getting Started with Logstash
Unzip logstash-5.3.0.zip to C:\elk\logstash-5.3.0.
Test the install
If everything is OK, Logstash prints the following with the current date and time:
2017-03-30T17:12:32.876Z MYHOSTNAME hello world
1) INSTALL THE JMX PLUGIN
The JMX plugin is specifically for retrieving data from a running Java program. It does not ship with Logstash and is maintained by a third party.
2) CREATE THE CONFIGURATION FILES
First, edit the Logstash configuration file.
Add the following lines to jmx.conf:
Folder location where configurations are stored. REQUIRED
The interval between two jmx metrics retrieval in seconds.
Used as the type in Elasticsearch
Number of threads used to retrieve metrics
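The four settings described above correspond to options of the logstash-input-jmx plugin (path, polling_frequency, type, and nb_thread). As a minimal sketch, the Python helper below renders a jmx.conf that uses them. The folder path, the 15-second interval, the type name "jmx", the thread count, and the Elasticsearch output on localhost:9200 are all assumptions chosen for illustration, not values taken from the original configuration.

```python
def render_jmx_conf(conf_dir, polling_seconds=15, es_type="jmx", threads=4,
                    es_host="localhost:9200"):
    """Return the text of a Logstash pipeline that polls JMX and ships to Elasticsearch."""
    # Use forward slashes so the path needs no backslash escaping in the config string.
    conf_dir = conf_dir.replace("\\", "/")
    lines = [
        "input {",
        "  jmx {",
        f'    path => "{conf_dir}"',                    # folder holding the plugin's JSON files (required)
        f"    polling_frequency => {polling_seconds}",  # seconds between two metric retrievals
        f'    type => "{es_type}"',                     # used as the type in Elasticsearch
        f"    nb_thread => {threads}",                  # threads used to retrieve metrics
        "  }",
        "}",
        "output {",
        "  elasticsearch {",
        f'    hosts => ["{es_host}"]',                  # assumed local Elasticsearch instance
        "  }",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical folder; adjust to wherever the plugin configuration is stored.
conf_text = render_jmx_conf(r"C:\elk\logstash-5.3.0\jmx")
print(conf_text)
```

Writing conf_text to jmx.conf gives Logstash a single pipeline file to load; the values can be tuned without touching the structure.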
Next, create a file for the plugin configuration. Add the following lines to jmx-input-conf:
Can be found by opening Jconsole and selecting MBeans
The values under the MBean that you would like Logstash to collect. Each is collected as a separate entry.
Appears in Elasticsearch as a field value
A quick note before we continue: the object names will change depending on the Spark version and distribution. Wildcards are possible. For instance, the object_name "kafka.consumer:type=*" retrieves all of the objects under the kafka.consumer MBean. You may need to adjust this configuration for the tutorial after inspecting JConsole during the Talend and Spark section.
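To make the structure of the plugin configuration file concrete, here is a hedged Python sketch that builds it as JSON. The keys host, port, queries, object_name, attributes, and object_alias are the ones the logstash-input-jmx plugin reads. The attribute name "Value" and the alias "kafka.consumer" are hypothetical placeholders; the real names should be taken from the MBeans tab in JConsole. The port matches the JMX port used later in this tutorial.

```python
import json


def build_jmx_input_conf(host, port, object_name, attributes, object_alias):
    """Build the JSON document that tells the JMX plugin what to poll."""
    return {
        "host": host,
        "port": port,
        "queries": [
            {
                "object_name": object_name,    # found under MBeans in JConsole
                "attributes": attributes,      # each attribute is collected as a separate entry
                "object_alias": object_alias,  # appears in Elasticsearch as a field value
            }
        ],
    }


jmx_input_conf = build_jmx_input_conf(
    host="localhost",
    port=54321,
    object_name="kafka.consumer:type=*",  # wildcard: every object under kafka.consumer
    attributes=["Value"],                 # hypothetical attribute name
    object_alias="kafka.consumer",        # hypothetical alias
)
print(json.dumps(jmx_input_conf, indent=2))
```

Generating the file from code makes it easy to add one query per MBean once the actual object names for your Spark version are known.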
3) START LOGSTASH WITH CONFIGURATIONS
Unzip elasticsearch-5.0.0.zip to C:\elk\elasticsearch-5.0.0:
2) CREATE AN ANALYZER
Here we add an analyzer to the field we use to search our logs in Kibana, so that certain strings can be searched more easily. The stop analyzer breaks the field into terms at characters like period and slash. This works well for the metric_path field.
Run the commands using curl or a REST client. Replace 2017.03.30 with the current date.
To reset an Elasticsearch index, send a DELETE request for the index (curl -XDELETE) and then run the PUT (curl -XPUT) again.
Wherever the date 2017.03.30 appears, replace it with the current date, because by default Logstash includes the date in the name of the Elasticsearch index it writes to. This can be changed in jmx.conf. https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-manage_template.
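The date substitution and the reset sequence can be sketched in Python as follows. This is an illustration under stated assumptions, not the exact commands from the original post: it assumes Logstash's default index naming (logstash-YYYY.MM.dd), an Elasticsearch 5.x instance on localhost:9200, a document type named "jmx", and that the analyzer is applied to metric_path as a text field. The requests are only built here, not sent.

```python
import json
from datetime import date
from urllib.request import Request


def index_name(day):
    """Name of the daily index under Logstash's default naming scheme."""
    return "logstash-" + day.strftime("%Y.%m.%d")


def reset_requests(day, base_url="http://localhost:9200", doc_type="jmx"):
    """Build the DELETE and PUT requests that reset the index and re-create the analyzer."""
    url = f"{base_url}/{index_name(day)}"
    # Elasticsearch 5.x mapping layout: mappings -> type -> properties.
    body = {
        "mappings": {
            doc_type: {
                "properties": {
                    "metric_path": {"type": "text", "analyzer": "stop"}
                }
            }
        }
    }
    delete_req = Request(url, method="DELETE")
    put_req = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return delete_req, put_req


delete_req, put_req = reset_requests(date.today())
print(delete_req.get_method(), delete_req.full_url)
print(put_req.get_method(), put_req.full_url)
```

Passing each request to urllib.request.urlopen would send it; on a fresh install only the PUT is needed, since there is no index to delete yet.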
Unlocking the Power of Apache Spark in Talend
This process requires Talend for Real-time Big Data. We must create a Spark Streaming Job to run this scenario.
1) CONFIGURE THE SPARK METRICS
Create the file C:\Talend\Spark\metrics.properties
Add the line:
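As a minimal sketch, assuming the intent is to enable Spark's built-in JMX sink for all metric instances, the file could be generated as follows. The property *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink is the standard entry from Spark's metrics configuration template; whether it is the exact line intended here is an assumption. The helper name and the example path argument are illustrative.

```python
from pathlib import Path

# Standard Spark metrics entry that registers the JMX sink for every instance ("*").
JMX_SINK_LINE = "*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink"


def write_metrics_properties(path):
    """Create the metrics.properties file containing the JMX sink entry."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target.write_text(JMX_SINK_LINE + "\n", encoding="utf-8")
    return target
```

For example, write_metrics_properties(r"C:\Talend\Spark\metrics.properties") creates the file at the location used in this tutorial.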
2) CREATE THE JOBS
The most important aspect of the job creation is to have a Kafka input and Kafka output. These two components are what the object_name property of the jmx-input-conf file is based on. These jobs do require that Kafka is installed.
The prebuilt Jobs can be downloaded HERE
3) MODIFY THE SETTINGS
Update the Spark Configuration for the job. Add the new items to Advanced properties:
"-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=54321 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
Update the Advanced settings for the job. Check the Use specific JVM arguments box to add the following parameters:
4) RUN THE JOBS
After the jobs have been built and the settings changed, we can run them locally. While the Spark Streaming job is running, you can open JConsole to view the metrics.
From the Run dialog or a command prompt, type jconsole. This brings up the application window if the Java environment variables are configured correctly.
Connect to a Remote Process with the URL localhost:54321
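If JConsole cannot connect, it helps to confirm that the job is actually listening on the JMX port. The Python sketch below reads the port out of the JVM arguments shown earlier and attempts a plain TCP connection. It only checks that the port accepts connections, not that JMX itself is healthy; the helper names are illustrative.

```python
import re
import socket

# The JVM arguments configured for the job in this tutorial.
JVM_ARGS = (
    "-Dcom.sun.management.jmxremote "
    "-Dcom.sun.management.jmxremote.port=54321 "
    "-Dcom.sun.management.jmxremote.authenticate=false "
    "-Dcom.sun.management.jmxremote.ssl=false"
)


def jmx_port(jvm_args):
    """Extract the JMX remote port from a JVM argument string, or None if absent."""
    match = re.search(r"-Dcom\.sun\.management\.jmxremote\.port=(\d+)", jvm_args)
    return int(match.group(1)) if match else None


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("localhost", jmx_port(JVM_ARGS)) while the Spark Streaming job is running should return True; a False result usually means the job is not running or the JVM arguments were not applied.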
Unzip kibana-5.0.0-windows-x86.zip to C:\elk\kibana-5.0.0-windows-x86
Go to http://localhost:5601/
When you first start Kibana, you should be prompted to set up the index patterns. Kibana suggests these semi-automatically based on the indices it finds in Elasticsearch, which is why we focus on Kibana last: if the other processes are not running and collecting information, there will be no initial index pattern to select. Once the index pattern has been set, we can open the Discover page to query the logs.
CREATE THE VISUALIZATIONS
For each visualization, the METRIC and BUCKET will be the same. Set them to the following:
Sum of metric_value_number
Date Histogram of @timestamp
In the Search field, we need to enter filters for each chart that is being created. Enter the following filters, one per visualization:
- metric_path:(incoming AND byte AND rate)
- metric_path:(outgoing AND byte AND rate)
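For readers who want to see what such a visualization asks of Elasticsearch, here is a hedged Python sketch that builds the filter string and a roughly equivalent search body: a date histogram on @timestamp with a sum of metric_value_number in each bucket. The one-minute interval and the aggregation names are assumptions, and the "interval" key follows the Elasticsearch 5.x syntax used in this tutorial. Kibana generates its own request, so treat this as an approximation.

```python
def metric_path_filter(*terms):
    """Build the Lucene filter used in the Kibana search field."""
    return "metric_path:(" + " AND ".join(terms) + ")"


def visualization_query(*terms, interval="1m"):
    """Build a search body: sum of metric_value_number per time bucket."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"query_string": {"query": metric_path_filter(*terms)}},
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": "@timestamp", "interval": interval},
                "aggs": {"total": {"sum": {"field": "metric_value_number"}}},
            }
        },
    }


incoming = visualization_query("incoming", "byte", "rate")
outgoing = visualization_query("outgoing", "byte", "rate")
print(incoming["query"]["query_string"]["query"])
print(outgoing["query"]["query_string"]["query"])
```

Because the stop analyzer splits metric_path into individual terms, each word in the filter can match independently of its position in the path.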
Save each of the Visualizations so you can add them to a dashboard.
This is an example dashboard that incorporates 4 metrics from our Talend Spark Streaming job:
Looking at a black box wondering what is going on inside and how it's affecting your operations is about as effective as looking at 10 separate dashboards and trying to combine the figures in your head. While each application generally has its own monitoring tools, having a single solution for attaining insight is the goal. A lot of organizations see this need, and have chosen the Elastic Stack to solve this challenge.
Metrics for Spark Streaming are available natively, but they have their own separate interface, which is not generally used by anyone outside of IT. And while Talend provides monitoring capabilities for its own solution out of the box (including an Elastic Stack), it is much better for an organization to integrate its data management platform metrics into its own monitoring suite.
Example URL for Elasticsearch jmx mapping:
To remove all of the indexed logs, delete the index itself through the REST API, then re-create the analyzer. Make sure you stop the Talend Spark process first, so that no logs are written while you are doing this.
To reset Kafka, you must stop the program and remove the log files. The location of these files is set in the Kafka configuration and can be changed.