In a recent post we mentioned that we were working on a new Nagios build that would form the base for our Nagios monitoring for the IBM i. We are now very close to shipping the product and thought we would share some of the details with you.
The original plan to use the NEMS for Linux product fell by the wayside as we tried to implement changes that we felt would be required by the users we were targeting. We know the Raspberry Pi solution would work for a number of the smaller IBM i shops, especially those who are only interested in monitoring the IBM i and not all of the other platforms Nagios can monitor. This meant we needed to build a solution that could run on any Linux-based system, either as a standalone instance or in a VM/container.
The first decision was the Linux distribution we would use. As we have been using Debian internally for many years, we decided it would be our distribution of choice for this solution. The version we use is Version 10, which is classed as stable and supported. We are not Linux experts, so building on the limited knowledge we already have helped us understand the elements that are layered on top of it, especially when something did not work.
Next we looked at which applications we would need to install to ensure we had everything required to run the monitoring services. NEMS for Linux came with a number of monitoring and configuration options, but we wanted to deliver a simplified solution that provides the right applications, ones that are current and maintained.
A number of packages make up our monitoring solution on the Linux side. We went with NConf for the configuration interface, Nagios Core to provide the monitoring services, NagiosTV as the monitoring interface (we could have stuck with just the output generated within Nagios Core, but we felt NagiosTV provided a better user experience) and Cockpit for the terminal services and limited system monitoring.
Access
We have added a front end to our build so the user does not have to remember the URLs for each of the elements that make up the solution. As with anything we design, we wanted to make it as simple and easy to use as possible, as you can see below.
Configuration
Nagios configuration can be carried out using a number of tools provided by the community. We decided to go with the NConf package even though it has not been updated for over 10 years and is not actively maintained (something we will address in the future). We had to make a few minor changes to the package code before it would compile due to a couple of minor errors, and we have pushed those changes to our fork of the code on GitHub. NConf uses a MySQL-compatible database (MariaDB) to store all of the configuration options and attributes. This took a little time to get used to, but now that we have gained some experience we have been able to implement Pushover messaging in the notification process, which falls in line with what we did for our EM4i product. Overall we feel it is a very suitable package for configuring Nagios.
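To give a feel for what a Pushover notification involves, here is a minimal sketch of the HTTP request a Nagios notification command could send. It is not the code in our build: the function name, the placeholder token and user key, the host and service names, and the message layout are all assumptions made for illustration. Only the messages endpoint and the token, user, title and message fields come from Pushover's public API.

```python
from urllib.parse import urlencode
from urllib.request import Request

PUSHOVER_URL = "https://api.pushover.net/1/messages.json"


def build_pushover_request(token, user, host, service, state, output):
    """Build (but do not send) a Pushover request for a Nagios service alert."""
    fields = {
        "token": token,  # application API token (placeholder in this sketch)
        "user": user,    # user or group key (placeholder in this sketch)
        "title": f"{state}: {host}/{service}",  # hypothetical title layout
        "message": output,
    }
    # Supplying data makes urllib issue a POST with a form-encoded body.
    return Request(PUSHOVER_URL, data=urlencode(fields).encode("utf-8"))


if __name__ == "__main__":
    # Host, service and output values below are made up for the example.
    req = build_pushover_request(
        "APP_TOKEN", "USER_KEY", "ibmi-prod", "ASP usage",
        "CRITICAL", "System ASP at 92%",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # urllib.request.urlopen(req) would actually send the message.
```

In a Nagios setup the host, service, state and output values would typically be filled in from notification macros by the command definition that calls the script.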
Monitoring
Nagios Core provides a graphical user interface to view the status and setup of the various Nagios elements, and while it is pretty solid it does not offer all of the bells and whistles that people want from a monitoring solution. It is installed as part of Nagios Core, so no extra packages are required to access it. (We have used the Nagios XI product, which is Nagios Core with the additional monitoring and configuration elements already built in, but as it is a chargeable option and they would not provide any development licenses we did not explore it too much.) As you can see from the images below, Nagios Core is pretty dated in terms of its design, but it is practical and usable for day-to-day reviews. It provides no configuration interfaces. If you look closely at the screen below, which shows example output, you will see many of the checks we have built specifically for the IBM i.
Live Monitoring
For the monitoring interface we went with NagiosTV. Having used the version installed with NEMS for Linux we felt it was pretty slick, but the version we installed in our build has a lot more capabilities and is fully supported by the original developer. The basic concept is to show only problems to the operator. Checks that do not return a critical or warning status are not reported on the monitoring screen, but the returned information can still be accessed via the Nagios Core screens. The package was developed with the intention of having a single monitor (TV) in the control room that displays any problems across a number of platforms. It also has voice notifications, which could help reduce the need to constantly look at the screen at customers with few operators. As you can see below, there are lots of filtering options to ensure you only see the information you need. It cannot be used to resolve issues at this time, but that is something we are investigating.
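The "only show problems" idea can be sketched in a few lines. This is an illustration of the concept rather than NagiosTV's own code (NagiosTV is a browser application and works differently internally). The CheckResult type, the function name and the sample check names are assumptions; the numeric states are the standard Nagios plugin return codes.

```python
from dataclasses import dataclass

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class CheckResult:
    host: str
    service: str
    state: int
    output: str


def problems_only(results, show_states=(WARNING, CRITICAL)):
    """Keep only results in the states to display, highest return code first."""
    shown = [r for r in results if r.state in show_states]
    return sorted(shown, key=lambda r: r.state, reverse=True)


if __name__ == "__main__":
    # Hypothetical check results for a single IBM i partition.
    results = [
        CheckResult("ibmi-prod", "CPU", OK, "CPU at 12%"),
        CheckResult("ibmi-prod", "ASP usage", WARNING, "System ASP at 81%"),
        CheckResult("ibmi-prod", "MSGW jobs", CRITICAL, "2 jobs in MSGW"),
    ]
    for r in problems_only(results):
        print(r.host, r.service, r.output)
```

The OK result is dropped from the display list but nothing is discarded at the source, which matches the point above that the full detail remains available in the Nagios Core screens.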
System tools
We did not want to add lots of fancy tools and options to the base install (particularly as we wanted to keep the install image small enough for the Raspberry Pi), but we wanted to make sure you could access the Linux instance, carry out basic performance monitoring and open a terminal without having to resort to setting up SSH, remote terminals and so on. Cockpit seems to fulfill most of those needs in a single package, and as it is browser based it gives us an interface that integrates with the Nagios and NConf interfaces.
IBM i monitoring
The decision to move away from the Java-based requests used in the existing plug-ins provided by IBM seems to be paying off. We began to notice that the history logs were filling up very quickly on the IBM i due to the large volume of jobs being created. Using a client/server process between the Nagios server and the IBM i removes the job startup overhead required for each request and has brought history log growth back to a normal level. We now have over 65 touch points for status on the IBM i, so running every request every 10 minutes would cause roughly 390 new jobs to be launched and ended every hour (65 requests, 6 times an hour). Instead we have 4 long-running jobs that service all of the requests with minimal impact on the system. It is hard to determine exactly how much impact the requests have on the system, but running all of the check commands against our systems at 10-minute intervals seems to have had little to no impact on the performance of other activities.
We did find that the initial start of the Nagios check commands caused some to return with connection errors. When all of the services are submitted to Nagios at once, it runs each request on a separate thread, which meant 65 requests hit the IBM i at almost the same time. After a settling period of a few cycles everything started to respond normally, and the requests became naturally spaced out within Nagios, which resolved the issue.
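The job arithmetic above is easy to check with a short sketch. The function names are our own for this example, and the even spacing shown is only a way to picture what "spaced out" means for 65 checks in a 10-minute window. It is not a description of how Nagios schedules checks internally.

```python
def jobs_per_hour(checks, interval_minutes):
    """Jobs launched per hour if every check run starts (and ends) its own job."""
    return checks * 60 // interval_minutes


def even_offsets(checks, interval_seconds):
    """Start offsets (in seconds) that spread the checks evenly over one interval."""
    return [i * interval_seconds / checks for i in range(checks)]


if __name__ == "__main__":
    print(jobs_per_hour(65, 10))  # 390

    offsets = even_offsets(65, 600)
    # Gap between consecutive checks when spread evenly over 10 minutes.
    print(round(offsets[1] - offsets[0], 2), "seconds apart")
```

Spread evenly, the 65 checks would arrive a little over 9 seconds apart instead of all at once, which is the difference between the initial burst and the settled state described above.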
Configuring the IBM i side of things requires a few parameters to be set before the jobs are started. If you are running HA4i or EM4i, we have special requests that retrieve specific status for those products, so before the services can run we need to know the library each product was installed in. The only other requirements are how many responder jobs you want to load and whether the communication between the Nagios server and the IBM i is to be secure. As we expect most customers will run Nagios monitoring within their own network, setting up secure communications is probably not required, which reduces the effort and skills needed to install it.
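The parameters described above could be pictured as a small settings record. This is purely illustrative: the class and field names are hypothetical and do not reflect how the product actually stores or prompts for its configuration. The default of 4 responder jobs mirrors the number mentioned earlier, and the length check reflects the 10-character limit on IBM i library names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponderSettings:
    """Hypothetical settings record; all field names are illustrative only."""
    responder_jobs: int = 4             # long-running responder jobs to start
    secure: bool = False                # secure the Nagios <-> IBM i traffic
    ha4i_library: Optional[str] = None  # library HA4i is installed in, if used
    em4i_library: Optional[str] = None  # library EM4i is installed in, if used

    def validate(self):
        """Return a list of problems; an empty list means the settings look sane."""
        problems = []
        if self.responder_jobs < 1:
            problems.append("responder_jobs must be at least 1")
        for name in ("ha4i_library", "em4i_library"):
            lib = getattr(self, name)
            # IBM i library names are at most 10 characters long.
            if lib is not None and not (1 <= len(lib) <= 10):
                problems.append(f"{name} must be 1 to 10 characters")
        return problems


if __name__ == "__main__":
    # "HA4I" is a made-up library name for the example.
    settings = ResponderSettings(ha4i_library="HA4I")
    print(settings.validate())  # []
```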
We will be announcing the product in the coming months, after a few beta trials are completed and we have tidied up the manuals and documentation. The move away from the NEMS for Linux solution to our own build has helped us in many ways, not only in understanding how Nagios works and how NConf fits into that, but also in being able to introduce our own additions to each of the products.
AAG/NAG4i will make a great addition to our Availability product suite, providing customers with better management and notification of pending problems that could affect their application availability.
If you are interested in seeing a demo of the products running or want more information, get in touch.