We have been asked how AAG would work with a mod-gearman monitoring setup. To show that the AAG checks and services interface with gearman without issue, we implemented a gearman job-server in our monitoring setup. Typically with gearman you would have several remote workers running the checks against your hosts and reporting the results back to the job-server, which in turn returns them to Nagios. However, for simplicity and time’s sake we have used only a single worker running on the same Linux box as our job-server. This still improves check performance over stock Nagios and serves as our proof of concept, but it does not take as much load off the Nagios server as running remote workers would.
A nice feature we also learned about while implementing gearman is that we can now use the “schedule a check of all services on this host” option to mass run checks against a host. This was an issue before, as base Nagios core would attempt to schedule all of the checks to run at the same time and flood the system, causing timeouts and information not being returned. But when we run the mass check using gearman, the services are scheduled with an appropriate amount of time between each! This has made life while testing much easier…
In the following screenshots we’ll go through submitting a mass check against a host, see it go through the gearman server/worker and see the information returned to Nagios…
First, we will use the “schedule a check of all services on this host” option on the host information screen. It is just over halfway down the menu on the right-hand side of the screen.
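For anyone who would rather script this step than click through the web interface, here is a minimal sketch. It assumes the menu option corresponds to the Nagios external command SCHEDULE_HOST_SVC_CHECKS, and that the host is defined in Nagios under the name SAS2. The helper names and the command file path are hypothetical and will differ per install.

```python
import time


def schedule_host_svc_checks(host_name, check_time=None, now=None):
    """Build one Nagios external-command line that schedules a check of
    every service on host_name. Times are Unix timestamps."""
    now = int(time.time()) if now is None else int(now)
    check_time = now if check_time is None else int(check_time)
    # Format: [entry_time] COMMAND;host_name;check_time
    return f"[{now}] SCHEDULE_HOST_SVC_CHECKS;{host_name};{check_time}\n"


def submit(command_line, command_file):
    """Write the command to the Nagios external command file (a named pipe).
    The path is site-specific, so the caller has to supply it."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)


if __name__ == "__main__":
    # "SAS2" is assumed to be the Nagios host_name of our first test system.
    print(schedule_host_svc_checks("SAS2"), end="")
    # submit(line, "/usr/local/nagios/var/rw/nagios.cmd")  # path is an assumption
```

This only builds and prints the command line. The commented-out `submit` call would actually hand it to Nagios.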
For gearman workers to know which checks/hosts to pick up, you must add the hosts to hostgroups and the services to servicegroups. Within the configs you then add all hostgroups and servicegroups to the job-server config file, and add to each worker config only the groups you want that specific worker to pick up. (There are screenshots of the configs further down…) Using the “gearman_top” command we can see stats on each of the hostgroups and servicegroups… if you monitor this screen while submitting a mass check for a host, you will see the number of jobs waiting rise and then fall as the jobs run through the queue. AAG’s check_commands run so efficiently that this screenshot was difficult to capture, as the system would run all 70 checks against a host before I was able to take the screenshot. Using remote workers would, I am sure, add a small amount of time for data transmission.
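If the queues drain too quickly to catch by eye, the same numbers can be polled from a script. The sketch below assumes the job-server answers the plain-text gearman admin command "status" on the default port 4730 with tab-separated lines (function, total, running, available workers) ending in a line containing a single dot. It also assumes, as far as I understand that protocol, that the total includes running jobs, so waiting is total minus running. The function and class names are my own.

```python
import socket
from typing import NamedTuple


class QueueStats(NamedTuple):
    name: str
    waiting: int
    running: int
    workers: int


def parse_status(text):
    """Parse the text reply of the gearman admin 'status' command."""
    stats = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # "." marks the end of the reply
            continue
        name, total, running, workers = line.split("\t")
        total, running = int(total), int(running)
        # Assumption: TOTAL counts running jobs too, so waiting = total - running.
        stats.append(QueueStats(name, total - running, running, int(workers)))
    return stats


def fetch_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Ask a job-server for its queue status and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()
```

Calling `parse_status(fetch_status())` in a loop while the mass check is submitted should show the waiting count for the affected queues climb and then drop back to 0.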
You will see that we have 2 separate hostgroups and servicegroups set up. Using these we separated out our two test systems: SAS2 is the only system in hosts1/services1, whereas SAS3 is in hosts2/services2. This was so that we could use two workers and assign each to a single system. Since we only have one worker, both sets are being tackled by the same worker.
Once the jobs have completed, the number of jobs waiting and jobs running will go back to 0, and as the idle worker processes time out, the number of workers will also go back to the minimum of 1 (which can be set in the config files).
We can run the test again, this time running all checks against SAS3, and we will see that the jobs waiting and running are in the second set of groups instead…
Once all of the jobs have completed on both systems, we can see the information returned to Nagios in the services screen. For this quick demo we ran all 70 checks AAG offers for EM4i, HA4i, shield general and IBM i status points. This is too many to show on one display; however, you can see the overflow for SAS2 on the second screenshot.
As we can see here, AAG has picked up some issues on our system. The top 4 errors are showing us that Shields EM4i is not currently running on the system. This is a check for a check. Using AAG to back up EM4i, we can be sure we don’t miss a message on the system. AAG’s checks for EM4i can also check how long a message has been waiting and respond accordingly with a critical or warning alert.
Further down, we can also see that the update levels of AAG and NG4i (the IBM responder job for AAG) are not in sync. These products work hand in hand and are required to be in sync.
Further down we can also see our maintenance has expired for EM4i and it also lists that there are updates available. Ensure you are on top of your maintenance expiry and update levels of your products with AAG!
This is our config file for gearman’s job-server. The first line dictates where the job-server is; we can set this to either localhost or the loopback IP. We found online recommendations to use the loopback IP, so we did. However, I am not entirely sure why… don’t take my word for it!
At the bottom you will see we have both hosts1 and hosts2 listed for our hostgroups. Same with the servicegroups. This is telling the job-server what checks to pick up and send to the workers.
This is our worker config. You will see here all of the same groups are set. This is because we have only one worker. If you wished to split the systems/groups between different workers you are able to tell this worker to only pick up hosts1 and services1, then tell another worker to pick up hosts2/services2.
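To make the split concrete, here is a small sketch of how the groups could be divided between two workers. It assumes the worker config accepts comma-separated hostgroups= and servicegroups= lines, as ours does. The worker names worker_a and worker_b are hypothetical (we only run one worker), and this is not a copy of our actual config files.

```python
# Groups listed in the job-server config (taken from our setup).
SERVER_GROUPS = {"hosts1", "hosts2", "services1", "services2"}

# Hypothetical split: one worker per test system.
ASSIGNMENTS = {
    "worker_a": (["hosts1"], ["services1"]),  # would handle SAS2
    "worker_b": (["hosts2"], ["services2"]),  # would handle SAS3
}


def worker_group_lines(hostgroups, servicegroups):
    """Return the two group lines for one worker's config file."""
    return [
        "hostgroups=" + ",".join(hostgroups),
        "servicegroups=" + ",".join(servicegroups),
    ]


def render(assignments):
    """Map each worker name to the group lines of its config."""
    return {
        worker: "\n".join(worker_group_lines(hosts, services))
        for worker, (hosts, services) in assignments.items()
    }


def uncovered(server_groups, assignments):
    """Groups the job-server lists that no worker has been told to pick up."""
    covered = set()
    for hosts, services in assignments.values():
        covered.update(hosts)
        covered.update(services)
    return set(server_groups) - covered


if __name__ == "__main__":
    for worker, text in render(ASSIGNMENTS).items():
        print(f"# {worker}\n{text}\n")
    print("groups with no worker:", sorted(uncovered(SERVER_GROUPS, ASSIGNMENTS)))
```

The `uncovered` helper is there because a group listed on the job-server side with no worker assigned to it would leave its jobs sitting in the queue.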
It was our assumption that AAG would run under gearman with no issues, but we don’t like to rely on assumptions. So here is the proof! One thing to note is that the AAG objects need to be on all of the gearman workers, but NG4i is only required on each IBM i system. If you have any questions or know something I missed, shoot me an email any time!