Talend and Splunk: Aggregate, Analyze and Get Answers from Your Data Integration Jobs
Log management solutions play a crucial role in an enterprise's layered security framework. Without them, firms have little visibility into the actions and events inside their infrastructure that could lead to a data breach or signal a security compromise in progress.
Splunk, often described as the “Google for log files”, is a heavyweight enterprise tool that was among the first log analysis products and has long been a market leader. Many customers will therefore want to see how Talend can integrate with their enterprise Splunk deployment and take advantage of Splunk's out-of-the-box features.
Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and visualizations. It has an API that allows for data to be captured in a variety of ways.
Splunk's core offering collects and analyzes high volumes of machine-generated data. It uses a standard API to connect directly to applications and devices. It was developed in response to the demand for comprehensible and actionable data reporting for executives outside a company's IT department.
Splunk has several products, but in this blog we will only work with Splunk Enterprise to aggregate, analyze and get answers from your Talend job logs. I’ll also cover an alternative approach in which developers log customized events to a specific index using the Splunk Java SDK. Let’s get started!
Intro to Talend Log Server
Let’s start by introducing you to the Talend Log Server. Simply put, this is a logging engine based on the Elastic stack (ELK): Elasticsearch, which is developed alongside Logstash, a data-collection and log-parsing engine, and Kibana, an analytics and visualization platform.
These technologies streamline the capture and storage of logs from Talend Administration Center, MDM Server, ESB Server and Tasks running through the Job Conductor. Together they form a tool for managing events and Job logs. Talend supports the basic installation, but features such as HA and read/write APIs are beyond the scope of Talend support.
To understand how to configure Talend logging modules with an external Elastic stack, please read this article.
Configure Splunk to Monitor Job Logs
Now that you have a good feel for the Talend Log Server, let’s set up Splunk to actually monitor and collect data integration job logs. After you log into your Splunk deployment, the Home page appears. To add data, click Add Data. The Add Data page appears. If your Splunk deployment is a self-service Splunk Cloud deployment, from the system bar, click Settings > Add Data.
The Monitor option lets you monitor one or more files, directories, network streams, scripts, Event Logs (on Windows hosts only), performance metrics, or any other type of machine data that the Splunk Enterprise instance has access to. When you click Monitor, Splunk Web loads a page that starts the monitoring process.
Select a source from the left pane by clicking it once. The page that is displayed depends on the source you selected. In our case we want to monitor Talend job execution logs, so select "Files & Directories". The page updates with a field where you enter a file or directory name and specify how Splunk software should monitor that file or directory. Follow the on-screen prompts to complete the selection of the source object that you want to monitor. Click Next to proceed to the next step in the Add Data process.
To start, log in to your Talend Studio and create a simple job that reads a string from a context variable, extracts the first three characters, and displays both the actual and the extracted string.
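As a rough illustration of what such a job does, here is a minimal plain-Java sketch. It is not the code Talend Studio generates: the class name, the `firstThree` helper, and the `inputString` context variable name are all assumptions made for this example, and a plain `Map` stands in for the Talend context.

```java
import java.util.Map;

// Hypothetical stand-in for the simple job described above. In Studio, this
// logic would more likely sit in a tJava component reading a context variable.
class SubstringJobSketch {

    // Returns the first three characters, or the whole string if it is shorter.
    static String firstThree(String input) {
        if (input == null) {
            return "";
        }
        return input.substring(0, Math.min(3, input.length()));
    }

    public static void main(String[] args) {
        // "inputString" is an assumed context variable name.
        Map<String, String> context = Map.of("inputString", "Talend");
        String actual = context.get("inputString");
        String extracted = firstThree(actual);
        // Both values go to the job log, which is what Splunk would monitor.
        System.out.println("actual=" + actual + " extracted=" + extracted);
    }
}
```

The length guard matters because a bare `substring(0, 3)` throws on inputs shorter than three characters, which would show up in the monitored log as a job failure rather than a result.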
Now that we’ve gotten everything set up, we’ll want to leverage the Splunk SDK to create custom events (one for each flow in the Talend job) and send them to the Splunk server. A user routine is written to make the Splunk calls and register each event to an index. The Splunk SDK jar is set up as a dependency of the user routine so that the routine can use the Splunk SDK methods.
Here is how to quickly build the sample Talend Job:
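The routine itself is not shown here, so the following is only a sketch of the shape it might take. It formats each event as key="value" pairs, which Splunk can typically turn into fields at search time, and hands the result to a pluggable sink. Everything named here (the class, `formatEvent`, `logEvent`, the `job` and `step` keys) is hypothetical. To keep the sketch runnable without the SDK jar, the Splunk call is left out: in a real routine the sink would presumably wrap the SDK's event submission on an index obtained from a connected service.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical user routine; all names are illustrative, not Talend's or Splunk's.
class SplunkEventRoutineSketch {

    // Escapes backslashes and double quotes so values stay inside their quotes.
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds one event line of key="value" pairs, keeping the map's iteration order.
    static String formatEvent(String jobName, String step, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append("job=\"").append(escape(jobName)).append("\"");
        sb.append(" step=\"").append(escape(step)).append("\"");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            sb.append(' ').append(field.getKey())
              .append("=\"").append(escape(field.getValue())).append("\"");
        }
        return sb.toString();
    }

    // Formats the event and passes it to the sink. Assumption: in the real
    // routine the sink would submit the text to a Splunk index via the SDK.
    static void logEvent(Consumer<String> sink, String jobName, String step,
                         Map<String, String> fields) {
        sink.accept(formatEvent(jobName, step, fields));
    }

    public static void main(String[] args) {
        // An in-memory list stands in for the Splunk index in this sketch.
        List<String> captured = new ArrayList<>();
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("rows", "42");
        logEvent(captured::add, "demo_job", "read_employee", fields);
        System.out.println(captured.get(0));
    }
}
```

Keeping the formatting separate from the submission makes the routine easy to exercise inside Studio before any Splunk connection details are available.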
- The Splunk configuration is created as a context and passed to the routine via a tJava component
- The job starts and its respective event is logged
- Employee data is read and its respective event is logged
- Department data is read and its respective event is logged
- The Employee and Department datasets are joined to form a de-normalized dataset and its respective event is logged
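The steps above can be sketched in plain Java as follows. This is an illustration of the flow only, not the generated Talend job: the `Employee` and `Department` classes, their fields, the step names, and the choice of an inner join are assumptions, and events are collected in a list where the real job would send them to Splunk through the user routine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the sample job flow; all names are illustrative.
class JobFlowSketch {

    static final class Employee {
        final int id; final String name; final int deptId;
        Employee(int id, String name, int deptId) {
            this.id = id; this.name = name; this.deptId = deptId;
        }
    }

    static final class Department {
        final int id; final String name;
        Department(int id, String name) { this.id = id; this.name = name; }
    }

    // Inner join on department id, producing "employee,department" rows.
    // Employees whose department is unknown are dropped (an assumption).
    static List<String> join(List<Employee> employees, List<Department> departments) {
        Map<Integer, String> deptNames = new HashMap<>();
        for (Department d : departments) {
            deptNames.put(d.id, d.name);
        }
        List<String> rows = new ArrayList<>();
        for (Employee e : employees) {
            String deptName = deptNames.get(e.deptId);
            if (deptName != null) {
                rows.add(e.name + "," + deptName);
            }
        }
        return rows;
    }

    // Runs the flow and returns one event per step, in the order listed above.
    static List<String> run(List<Employee> employees, List<Department> departments) {
        List<String> events = new ArrayList<>();
        events.add("step=job_start");
        events.add("step=read_employee rows=" + employees.size());
        events.add("step=read_department rows=" + departments.size());
        List<String> joined = join(employees, departments);
        events.add("step=join rows=" + joined.size());
        return events;
    }

    public static void main(String[] args) {
        List<Employee> employees = List.of(
                new Employee(1, "Ann", 10), new Employee(2, "Bob", 20));
        List<Department> departments = List.of(
                new Department(10, "Sales"), new Department(20, "IT"));
        run(employees, departments).forEach(System.out::println);
    }
}
```

Logging a row count with each step gives the Splunk search something to chart or alert on, for example a join that suddenly yields far fewer rows than the employee read.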
Switch back to Splunk and search on the index used in the above job; you’ll be able to see the events published from the job.
Using the exercise and process above, it is clear that Talend can seamlessly connect to Enterprise Splunk and push customized events and complete job log files to Splunk.