Monitoring the IBM i with Nagios.

Using AAG to monitor your HA4i solution helps ensure your replication is operating smoothly. With the ability to view the status and history across all replication elements, an informed admin will be able to see patterns form that may hint toward an imminent failure or issue. This is the second of three posts in which we outline how to monitor the IBM i using AAG. In our first post we went over how to combine different AAG check commands to ensure an application is running correctly. For this example, we will be using the AAG check commands that were specifically developed to monitor Shield’s HA4i.

While HA4i already has a number of processes that can check the status of the replication processes and fix any problems captured during the check, there are still times when these checks may not run, or may not be reported, due to external problems. For example, if the system is not running, the notifications from the checks will not be sent. Likewise, if a check fails but there is a problem with the notification process, the notification will not be sent.

At one customer we found that the target system had not been running for weeks. The status check process had not been set up to notify when everything was OK, so the absence of notifications was taken to mean that no news was good news! Luckily the journal receivers were still available to work from and recover, but it took a significant amount of time and effort.

AAG helps prevent these kinds of errors because its notifications run outside of the IBM i, so a stopped system or a broken notification process on the IBM i cannot silence the alert.
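The "no news is good news" trap can be illustrated with a small staleness test run on the monitoring side. This is only a sketch of the idea, not how AAG implements it: the function name, the timestamps, and the 15 minute limit are all hypothetical, and the return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL).

```python
from datetime import datetime, timedelta

# Conventional Nagios plugin return codes.
OK, CRITICAL = 0, 2


def freshness_state(last_result_at, now, max_age):
    """Treat silence as a problem: alert when the newest result is too old.

    last_result_at, now: datetime values (hypothetical inputs).
    max_age: timedelta after which a missing result is considered a failure.
    """
    age = now - last_result_at
    if age > max_age:
        return CRITICAL, f"no check result for {age}; monitored system may be down"
    return OK, f"last result received {age} ago"
```

The key design point is that the test runs outside the monitored system, so it still fires when the IBM i itself cannot send anything.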

The majority of the checks for HA4i are run against the *MGT side of the HA4i replication process, because *MGT is the controlling side of the replication. Only the remote journal apply process detail is extracted from the target (*NET) system.

Audit Checks

HA4i audits are critical to ensure your replication is operating smoothly and there are no errors or gaps in your replication configuration. We are often asked which audits should be run and when, because they can be very resource intensive and take a long time to run. We wanted to provide a way to ensure that audits complete within set time periods and that any errors are reported without the user needing to monitor the audits through to completion. AAG adds that final piece: it can confirm the audits run in their designated time window and send a notification if any errors are reported.

The check_HA4i_AUDSTS command can check all of the different types of audits HA4i can run, report on their progress, and notify on any errors or time overruns.

A number of parameters determine whether a notification is required for specific out-of-scope conditions. For example, “Running Severity” and “Not Running Severity” set the severity to raise when an audit is running when it should not be, and vice versa, so we can check that an audit has finished in an appropriate amount of time. The following shows the result of an audit which ran within the required window but set a critical notification due to an error being logged.

Audit Status with 1 critical error
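The interaction between the time window and the two severity parameters can be sketched as follows. This is our reading of the behaviour described above, not the actual check_HA4i_AUDSTS code: the `AuditState` type, the function names, and the rule that any logged error is critical are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class AuditState:
    """Hypothetical snapshot of one audit."""
    running: bool
    errors: int


def in_window(now, start, end):
    """True if now is in [start, end); handles windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def audit_severity(state, now, start, end,
                   running_severity=WARNING, not_running_severity=WARNING):
    # Assumption: a logged error is always critical, as in the example above.
    if state.errors > 0:
        return CRITICAL
    inside = in_window(now, start, end)
    if state.running and not inside:
        return running_severity        # still running after the window closed
    if not state.running and inside:
        return not_running_severity    # expected to be running but is not
    return OK
```

For example, an audit still running at 06:00 against a 01:00 to 05:00 window would return the "Running Severity", which is how a time overrun surfaces.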

New Library and Devices

Another issue a customer brought to our attention was developers adding libraries or devices to the system without notifying the operations team. The customer would then have a mismatch between the production system and the *NET system because the new libraries were not defined to replication. For this we created the check_HA4i_NEWLIB and check_HA4i_NEWDEV commands.

HA4i does not automatically replicate all libraries and devices to the target system. This is because some libraries and devices are temporary in nature, so replicating everything would cause significant overhead to the system for what could be irrelevant data. Also, libraries can be created by users for testing purposes and have no reason to be replicated.

HA4i can capture and log the creation and deletion of libraries and devices. AAG is able to extract the data from the logs and determine whether a notification is required for new objects that have been created but not added to the replication process. These checks would be run at a frequency that ensures any critical objects are picked up and notifications are sent so someone can decide whether they should be added to the replication process. The following is an example of a check against the device log.

Device Log check
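Conceptually, the new library and new device checks compare what has been created against what is defined to replication. The sketch below shows that comparison only. The function names, the prefix-based ignore list for temporary or test objects, and the input lists are all hypothetical and do not reflect the real log format or the parameters of the HA4i checks.

```python
# Conventional Nagios plugin return codes.
OK, WARNING = 0, 1


def unreplicated(created, replicated, ignore_prefixes=()):
    """Return created names that are not replicated and not ignored, sorted."""
    replicated_set = {name.upper() for name in replicated}
    missing = set()
    for name in created:
        name = name.upper()
        if name in replicated_set:
            continue
        if any(name.startswith(prefix) for prefix in ignore_prefixes):
            continue  # e.g. temporary or test objects nobody wants replicated
        missing.add(name)
    return sorted(missing)


def check_new_objects(created, replicated, ignore_prefixes=(), severity=WARNING):
    """Build a (return code, message) pair in the style of a Nagios plugin."""
    missing = unreplicated(created, replicated, ignore_prefixes)
    if missing:
        names = ", ".join(missing)
        return severity, f"{len(missing)} new object(s) not replicated: {names}"
    return OK, "no unreplicated new objects"
```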

HA4i Jobs

HA4i requires that certain jobs are running to allow replication to continue, and there are a couple of ways to check this. First, we could use check_Shield_JOBSRCH to look for each of these jobs separately, which would result in 4 separate services. However, the HA4i-specific check_HA4i_STATUS groups the 4 jobs into one check. Using the latter, as can be seen below, it reports that the Command Server, Email Manager, Sync Manager and Profile Sync jobs are all running as expected.

HA4i Server jobs

Responder Jobs

HA4i uses a client-server process for communicating between the source and target systems, which requires responder jobs to be running on both sides. The number of responder jobs required will vary based on each installation. By setting critical and warning levels based on the number of responder jobs, we receive a notification if the total number of responder jobs is outside of the expected range, as we can see from the example below.

Responder Jobs count
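A range check of this kind can be sketched in a few lines. The function name and the inclusive `(low, high)` range format are assumptions for illustration, not the actual parameters of the responder job check.

```python
# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def count_severity(count, warn_range, crit_range):
    """Classify a job count against inclusive (low, high) ranges.

    A count outside crit_range is CRITICAL; otherwise a count outside
    warn_range is WARNING; anything else is OK.
    """
    crit_low, crit_high = crit_range
    warn_low, warn_high = warn_range
    if not crit_low <= count <= crit_high:
        return CRITICAL
    if not warn_low <= count <= warn_high:
        return WARNING
    return OK
```

With a warning range of 3 to 6 and a critical range of 2 to 8, a count of 4 is OK, a count of 7 is a warning, and a count of 1 or 9 is critical. Checking both ends of the range catches too few responders as well as a runaway number of them.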


Role Swaps

Role swaps are complex functions that many are now automating to occur outside of normal hours. AAG provides the ability to view the status of a role-swap in process or show the final state, so adequate action can be taken if the role-swap failed or was held waiting for a message response. Having AAG run multiple checks during a scheduled role-swap window will ensure it is progressing as expected and provide early notification of any issues that need operator attention.

Role-swap status

As can be seen above, this check also reports the date and time of the last role-swap. This information can be used to verify the role-swap ran and completed on schedule.

Object/Spool File Replication

check_HA4i_OBJ and check_HA4i_SPLF will check the object replication and spool file replication statuses respectively. These checks return the number of objects or spool files waiting to be replicated, plus the difference between the last entry in the audit journal and the last entry read from it. The following shows the information returned for the Object check.

Object replication

The following shows an example of the spool file check and the information returned.

Spool file replication
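The backlog figure these checks report is simple arithmetic: the last entry in the journal minus the last entry read. The sketch below turns that into a Nagios-style status line with performance data after the `|` separator so the value can be graphed over time. The function name, the thresholds, and the wording of the output are hypothetical and are not the real output of check_HA4i_OBJ or check_HA4i_SPLF.

```python
# Conventional Nagios plugin return codes and their labels.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def backlog_check(waiting, last_entry, last_read, warn, crit):
    """Return (code, status line) for a replication backlog.

    waiting: objects or spool files queued for replication.
    last_entry / last_read: journal sequence numbers (assumed integers).
    warn / crit: backlog sizes at which to raise each severity.
    """
    backlog = last_entry - last_read
    if backlog >= crit:
        code = CRITICAL
    elif backlog >= warn:
        code = WARNING
    else:
        code = OK
    # Performance data follows '|' as label=value;warn;crit.
    line = (f"{LABELS[code]} - {waiting} waiting, backlog {backlog} entries"
            f" | waiting={waiting} backlog={backlog};{warn};{crit}")
    return code, line
```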

Remote Journals

If the remote journal links are not active, no data is being sent to the remote system. If a system failure then occurs and the user needs to switch to the target, the database could be so far behind that it is unusable.

The service check_HA4i_IJRN will report back the number of journals that are configured plus any that have the status of *INACTIVE. This prevents issues caused when journals are configured but not actively sending updates to the target system. The service check_HA4i_RJRN will not only check for *INACTIVE journals and return the severity level set by the admin, but will also list the remote journal and apply job for each of the configured journals. The following example shows the information returned for such a check.

Remote Journals
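The inactive journal test boils down to counting the configured journals and flagging any in the *INACTIVE state at an admin-chosen severity. A minimal sketch follows, assuming the journal states arrive as a name-to-status mapping. That input shape and the function name are hypothetical, and the real checks return more detail, such as the apply job for each journal.

```python
# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def remote_journal_check(journals, inactive_severity=CRITICAL):
    """Return (code, message) for a {journal name: status} mapping."""
    inactive = sorted(name for name, status in journals.items()
                      if status == "*INACTIVE")
    total = len(journals)
    if inactive:
        names = ", ".join(inactive)
        return (inactive_severity,
                f"{len(inactive)} of {total} journals *INACTIVE: {names}")
    return OK, f"{total} journals configured, all active"
```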

Sync Manager

AAG offers the ability to monitor the status of HA4i’s Sync Manager, which is used to replicate journaled objects between systems. Sometimes HA4i needs to replicate files and objects that have been corrupted outside of the replication process. The Sync Manager will carry out a save, send, and restore process for any objects added to its queue. These objects can be very large and take a significant amount of time to replicate to the target over the network (some customers have low bandwidth between remote systems), so knowing what is in the queue is important for determining how good your recovery position is.

You can configure AAG with critical and warning levels for the queue depth and for whether the Sync Manager is running.

Sync manager status and Queue depth
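To see why queue contents matter on a slow link, a rough drain-time estimate helps. The sketch below is back-of-the-envelope arithmetic only: it assumes the object sizes in the queue are known and that the whole link is available to the transfer, neither of which the Sync Manager check is stated to report. The function name is hypothetical.

```python
def estimated_drain_seconds(object_sizes_bytes, link_mbps):
    """Estimate the seconds needed to send every queued object.

    object_sizes_bytes: sizes of the queued objects in bytes (assumed known).
    link_mbps: usable bandwidth in megabits per second (10**6 bits/s).
    Ignores save and restore time, compression, and protocol overhead.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    total_bits = sum(object_sizes_bytes) * 8
    return total_bits / (link_mbps * 1_000_000)
```

For instance, a single 1 GB object (1,000,000,000 bytes) over a 10 Mbps link is 8,000,000,000 bits at 10,000,000 bits per second, or about 800 seconds before save and restore time is counted. A queue of several such objects quickly translates into hours of exposure.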

Apply Check

The check_HA4i_APY service will query the *NET system and return information on the running apply jobs, a flag for errors, and the backlog (the difference between the last entry applied and the last entry read). The following is an example of the data returned where everything is running normally.

Apply status and remote journals


HA4i is a first-class High Availability solution. Adding AAG will not only give your admins better visualization of the replication process but also provide notification of any incidents which need attention. Check out our website for more information on HA4i and AAG.


