After years of experience developing enterprise data and software solutions, I often find that one of the most neglected areas is monitoring and alert systems. Normally, the focus of IT is on business functionality, development, and testing methodologies. However, the importance of server monitoring and alert systems can’t be overlooked. Unfortunately, that importance is often realized only when it’s too late and services go down, causing a huge business impact.
In this blog, I’d like to talk a little bit about a few key strategies that IT teams should think about when implementing systems for monitoring and alerts.
Server Monitoring 101: Why It’s Important
For a moment, imagine that you are an administrator of a large IT infrastructure and your responsibility includes ensuring all systems are running smoothly. Or, imagine that you are a service provider and one of your services is down and customer complaints begin rolling in. What are your options to start debugging? And more importantly, where do you start?
Server monitoring is the process of tracking a server’s system resources, such as CPU usage, memory consumption, I/O, network, disk usage, and processes. It helps you understand how those resources are being used, which in turn improves capacity planning and the end-user experience. Talend’s tool set and connector ecosystem provide comprehensive monitoring and alerting for our entire product stack as well as for the software processes developed using Talend.
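To make the idea concrete, here is a minimal sketch of sampling a few of these resources using only the Python standard library. The function name collect_snapshot and the dictionary keys are my own illustrative choices, not part of any Talend or Nagios API, and a real deployment would rely on a monitoring agent or plugin rather than a hand-rolled script.

```python
import os
import shutil


def collect_snapshot(path="."):
    """Return a small dictionary of resource readings for the local machine."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    used_pct = round(100.0 * usage.used / usage.total, 1) if usage.total else 0.0
    snapshot = {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_used_pct": used_pct,
    }
    # Load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            snapshot["load_avg_1min"] = os.getloadavg()[0]
        except OSError:
            pass  # the platform could not report a load average
    return snapshot


for name, value in collect_snapshot().items():
    print(f"{name}: {value}")
```

Readings like these only become useful for capacity planning once they are collected regularly and compared over time, which is the job a monitoring tool takes over.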
There are many tools available on the market for server monitoring, but for this blog we will focus on Nagios, as it’s open source and widely used. Nagios is a tool for system monitoring. Its modular approach makes it flexible: it does not perform any host or service checks on its own but relies on plugins. Nagios uses these plugins to continuously check the health of monitored hosts and services. The two main categories monitored by Nagios are hosts and services. Talend’s runtime supports Nagios, Hyperic HQ, and any generic JMX-based monitoring system. The following diagram depicts how Nagios functionality is integrated into Talend ESB.
Hosts are physical machines (servers where Talend modules are installed, routers, workstations, printers, and so on), while services are functionalities. For example, a Talend runtime, or even the services deployed on these servers, can be defined as a service to be monitored.
Nagios output is informative and easy to interpret: it reports a status for each server rather than raw values. There are four main states to describe the status: OK, WARNING, CRITICAL, and UNKNOWN.
The thresholds that separate these states are defined by the administrator.
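As a rough illustration of how administrator-defined limits turn a reading into a status, the sketch below follows the common Nagios plugin convention of four states (OK, WARNING, CRITICAL, UNKNOWN) mapped to exit codes 0 to 3. The function names, the "DISK" label, and the threshold values are assumptions made for this example only.

```python
# Exit codes that Nagios plugins conventionally return.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a reading to a state code; higher readings are treated as worse."""
    if value is None:
        return UNKNOWN  # the check could not obtain a reading
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(label, value, warn, crit):
    """Return the one-line status text together with the state code."""
    state = classify(value, warn, crit)
    shown = "n/a" if value is None else f"{value}%"
    return f"{label} {STATE_NAMES[state]} - {shown}", state


# Hypothetical reading: disk usage at 87% with thresholds of 80 and 90.
line, code = report("DISK", 87, warn=80, crit=90)
print(line, f"(exit code {code})")
```

A real plugin would print the status line and then terminate with the state code as its process exit status, which is how Nagios learns the result of the check.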
Another benefit is that a report shows how many services are up and running and how many are in a warning or critical state. This type of reporting offers a good overview of your infrastructure status. Nagios offers not only a core system for monitoring but also extended monitoring through separate standard plugins. For Talend ESB, the jmx4perl and Jolokia plugins are used to fetch service status.
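Jolokia exposes JMX data over HTTP as JSON, so a status fetch boils down to sending a small request document and reading the value out of the reply. The sketch below builds a Jolokia "read" request and parses a response without touching the network. The sample response is invented for illustration, and it queries the standard JVM memory MBean because the MBean names used by Talend ESB are not given here.

```python
import json


def build_read_request(mbean, attribute):
    """Return the JSON body for a Jolokia 'read' operation."""
    return json.dumps({"type": "read", "mbean": mbean, "attribute": attribute})


def extract_value(response_text):
    """Return the 'value' field of a Jolokia response, or None on error."""
    response = json.loads(response_text)
    if response.get("status") != 200:
        return None
    return response.get("value")


# Illustrative response, shaped like the reply a Jolokia agent gives to a read.
sample = json.dumps({
    "request": {"type": "read", "mbean": "java.lang:type=Memory",
                "attribute": "HeapMemoryUsage"},
    "value": {"init": 134217728, "committed": 201326592,
              "max": 1908932608, "used": 52428800},
    "status": 200,
})

heap = extract_value(sample)
print(build_read_request("java.lang:type=Memory", "HeapMemoryUsage"))
print(f"heap used: {heap['used']} of {heap['max']} bytes")
```

In practice the request body would be posted to the agent’s HTTP endpoint, and the returned value would be compared against thresholds to produce a Nagios status.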
The strength of Nagios is its flexibility and simplicity. Its ability to monitor any IT infrastructure against SLAs for non-functional requirements such as availability or performance makes it an ideal tool. It also has a mechanism to react automatically to problems and a powerful notification system. Furthermore, many dashboards can be built on top of Nagios. Sarah Wells at the Financial Times has described the problems they faced with alerts and notifications and how they solved them using dashboards, other visual equipment (screens, lights, etc.), and tools like Nagios and Graphite/Grafana. Below are the main building blocks of a Nagios system.
Commands are the entry point for querying the system. They define how Nagios should perform each type of check.
Time periods specify the frequency and schedule with which the system should perform checks.
Contacts and contact groups are the people or entities who should be notified, along with information on how and when they should be contacted.
Hosts are physical machines, along with information on who should be contacted, how checks should be performed, and when.
Services are the various functionalities or resources to monitor on a specific host, for example, services developed in Talend.
Host and service escalations, as the name suggests, are rules that should be triggered at certain events. They define the specific time periods after which additional people should be notified.
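The escalation idea can be sketched in a few lines. As I understand Nagios, an escalation rule covers a range of notification numbers (the first_notification and last_notification directives, where 0 means no upper limit), and while a rule matches, its contact groups receive the notification instead of the default ones. The class, function, and group names below are hypothetical and only model that behavior; they are not Nagios code.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    first_notification: int  # first notification number this rule covers
    last_notification: int   # last number covered; 0 means no upper limit
    contact_groups: tuple    # groups notified while the rule is active

    def applies_to(self, notification_number):
        if notification_number < self.first_notification:
            return False
        return self.last_notification == 0 or notification_number <= self.last_notification


def recipients(notification_number, default_groups, escalations):
    """Return the contact groups for the n-th notification of a problem."""
    escalated = []
    for rule in escalations:
        if rule.applies_to(notification_number):
            for group in rule.contact_groups:
                if group not in escalated:
                    escalated.append(group)
    # With no matching rule, the object's normal contact groups are used.
    return escalated or list(default_groups)


# Hypothetical setup: admins first, managers from the 3rd notification onward.
rules = [
    Escalation(3, 5, ("esb-admins", "managers")),
    Escalation(6, 0, ("esb-admins", "managers", "on-call-director")),
]
for n in (1, 3, 6):
    print(n, recipients(n, ["esb-admins"], rules))
```

Because notifications are re-sent at a fixed interval while a problem persists, a range of notification numbers effectively corresponds to a period of time, which is how escalations bring in additional people the longer an outage lasts.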
After encountering multiple “system down” situations in my long IT career, I always recommend making “monitoring and alert” part of any Software Development Life Cycle. It is as important as development, testing, and deployment. Having a dedicated process in place for monitoring resources is not only useful for identifying problems; it can also save you from running into them. Nagios handles warnings and critical situations differently, which makes it possible to recognize potentially problematic situations quickly and proactively. As mentioned earlier in this article, Talend supports monitoring via JMX and is compatible with JMX-compliant monitoring tools. A comprehensive guide to Nagios integration for services developed in Talend can be found under ‘Nagios Integration’ in the Talend Help Center.
References:
Wojciech Kocjan, Learning Nagios 3.0.
Sarah Wells, Financial Times, Monitoring a Microservices Architecture at Financial Times.