We are almost ready to start shipments of our new AAG (At A Glance) solution for monitoring the IBM i via Nagios. The product is composed of two elements: a Nagios Core configuration running on Linux, and a TCP/IP server running on the IBM i that collects and returns the status information.
The Nagios Core build uses only Open Source components, which can be built from the source and packages available in the various package repositories. On top of this we add programs and scripts that carry out the status retrieval from the IBM i. The following components make up the Open Source solution, which can be used without the IBM i package to retrieve status from other infrastructure elements in your network. The IBM Java solution can be added if you are looking for a totally Open Source option.
- Debian Linux
- phpMyAdmin
- Nagios Core
- Nagios TV
Installing our plugin (it's not an official Nagios plugin) is a simple copy of objects to a directory followed by running some configuration scripts to build the required command and service entries.
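To give a flavour of what the configuration scripts generate, a Nagios command and service entry for one of the IBM i checks might look something like this. The command name check_ng4i, the port, and the check name are illustrative assumptions, not the shipped defaults:

```
# Hypothetical command definition - names and arguments are assumptions
define command {
    command_name    check_ng4i
    command_line    $USER1$/check_ng4i -H $HOSTADDRESS$ -p $ARG1$ -C $ARG2$
}

# Hypothetical service entry that uses the command above
define service {
    use                     generic-service
    host_name               ibmi-prod
    service_description     ASP Usage
    check_command           check_ng4i!8199!asp_usage
}
```

Each IBM i check would get its own service entry of this shape, all funnelling through the one command definition.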
Before you can run the Nagios services you have to install and configure the IBM i side of things. Each IBM i that you are going to monitor will need a copy of NG4i installed; this provides the server programs that collect the status for return to the Nagios server. As with all of the Shield products, it is built as an IBM LPP which is installed using the RSTLICPGM command.
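As a sketch, the install is the standard LPP restore from a save file; the licensed program ID and save file name below are placeholders, not the real product values:

```
RSTLICPGM LICPGM(1NG4I01) DEV(*SAVF) SAVF(QGPL/NG4ISAVF)
```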
We were interested in just how the IBM i and Linux sides would be impacted by running these checks, as we had seen some problems with Java-based checks, particularly on the IBM i side. Both sides of the solution are built in C, so we hoped that would significantly reduce the overhead; we had also opted for a long-running server process on the IBM i instead of spawning a new job for each request.
Monitoring the impact on the IBM i was carried out via WRKACTJOB, simply pressing F10 to restart the statistics at regular intervals. We also added a couple of monitors for the jobs that would be running to the Nagios checks, so we could see what impact they had on temporary storage and memory usage. The problems of history logs filling up constantly and lots of jobs taking up processor time had gone away with this solution.
We do use user space objects in many of the APIs that we call, so the QTEMP storage of the jobs did build up, but once all of the status checks had run at least once the QTEMP storage did not change any further (we always reuse the user spaces in subsequent requests, and they are stored in QTEMP). CPU usage was minimal; a couple of times we saw the jobs take 7% of CPU for a few milliseconds while calling some of the heavier APIs to extract the status, but nothing to indicate that a normal IBM i installation would be even moderately impacted by its use. We opted for 10-minute intervals for the checks, which is probably more frequent than necessary, and we ran all of the checks, so our implementation is probably a lot harder on the system than most customers would end up implementing.
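With Nagios Core's default interval_length of 60 seconds, that 10-minute cadence corresponds to a check_interval of 10 in the service definition. A sketch, with illustrative host, service, and command names:

```
define service {
    use                     generic-service
    host_name               ibmi-prod
    service_description     CPU Utilization
    check_interval          10   ; 10 x interval_length (60s) = 10 minutes
    retry_interval          2    ; re-check every 2 minutes while in a problem state
    check_command           check_ng4i!8199!cpu_util
}
```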
One item we should bring up is the way Nagios runs the checks against the IBM i. When we first installed the product we took the option to run all of the checks at once, which causes every check to be run at the same time via separate jobs. This flooded the target IBM i, which had only three jobs responding to the requests (the default setting), and the Nagios server logged them as failed checks because the remote IBM i did not respond. Over time (probably because the resubmit times spread out due to the delays caused by the responses and failures) the checks all started to respond correctly and the notifications from Nagios declined, eventually to the point where none were flagged. We looked at the responder jobs and found that once the process had settled the requests tended to be handled by a single responder, with occasional calls to one of the others; after a restart of the IBM i NG4i jobs one of the responder jobs was never called again, which suggests the number of responders could be reduced if necessary.
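Nagios Core has main configuration settings that would have avoided that initial flood: it can spread the first round of service checks out over time and cap how many checks run concurrently. A sketch of the relevant nagios.cfg directives, with illustrative values chosen to match the three NG4i responder jobs:

```
# nagios.cfg - illustrative values, not the shipped defaults
# "s" = smart: spread the initial service checks out over time
service_inter_check_delay_method=s
# cap concurrent active checks to match the number of NG4i responder jobs
max_concurrent_checks=3
```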
The Linux server is where we expect the highest utilization, because it polls each of the remote servers to collect the data. We wanted to start with a one-to-one setup, where the Linux server checks a single IBM i, before adding further targets to evaluate the impact.
The Cockpit application we installed for terminal access to the system also comes with system monitoring capabilities; the following screens are captures of its output as the Raspberry Pi appliance we built runs the checks against the IBM i.
As can be seen from the above, a single configured IBM i running all of the checks had very little impact. Next we added another two remote IBM i systems to see the effect on the Pi.
As can be seen above, the CPU sees very little impact, but disk I/O and network traffic did increase significantly due to the number of requests now being processed (over 200 checks against the three systems). There was no real change in memory utilization either, which may be due to the use of shared libraries in the programs we provide.
Based on the above we feel the Raspberry Pi would be more than capable of running checks against all of our IBM i systems/LPARs with no performance problems. We will be checking the VM instance next, but expect those results to be very similar to these.
The downloadable image for the Pi is built and in test; the VM image will come next, after which we hope to be able to provide images via the internet that can be installed onto customer hardware.
More posts will be forthcoming about the Nagios solution, showing some of the checks we are running and how we feel they can help improve IBM i availability. Having the HA4i checks allows a single-pane-of-glass view over many installed instances, reducing the management overhead for the product, especially for MSPs. As the product is used in customer environments we expect additional checks for the IBM i to be requested, but at the moment we feel the 65+ checks we have today will meet most needs.
If you would like to see a demo of the solution running either on a Raspberry Pi or a hosted VM, let us know and we will be glad to set something up. Keep watching the blog for more posts and possibly videos showing how the solution fits together.