We have been busy adding all of our internal systems to the AAG monitoring solution, and in the process we noticed that the job we run through the job scheduler to take the backups had not run for a number of days. The cause was a job sitting on a job queue that had not completed, so the saves were backed up behind it, waiting for that job to finish.
This got us thinking about how we should manage the monitoring. We do have monitors that look for jobs and monitors that verify the job queues are not stacking up, but in this instance we wanted to make sure the job schedule entry had run and that it had completed correctly. We decided we needed another check that could reach out and pull back the specifics for any job schedule entry, so we set about building one.
Here are some images of the check as it runs (we are running it too often at the moment, but we will address that later). The first shows the overall status of all of the checks (we are currently running over 200 just against the IBM i LPARs we have set up), and as you can see the status is OK for that job now.
The following shows the information we returned for the job schedule entry.
As you can see from the above, the job is scheduled to run on 06/05/2022, and the last time it ran, on 05/05/2022, it completed successfully. The API we used to extract the status makes a lot more information available, and we can add that to the monitor. If the status does not show that the job was submitted successfully, we can send alerts to the admins so they can fix the problem and run the job again.
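To make the decision logic concrete, here is a minimal sketch in Python of how a check like this could turn the returned details into a status. It is not the code behind our check: the field names (`last_submission_status`, `last_submission_date`, `next_submission_date`), the `"SUCCESS"` value, and the `max_age_days` threshold are all hypothetical, and the 0/1/2 return codes assume the common OK/WARNING/CRITICAL plugin convention. The dates assume the example above is in day/month/year format.

```python
from datetime import date

# Assumed plugin-style status codes (OK / WARNING / CRITICAL).
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate_entry(entry, today, max_age_days=1):
    """Return (code, message) for one job schedule entry.

    `entry` is a dict with hypothetical keys; a real check would fill it
    from whatever the system API returns for the schedule entry.
    """
    status = entry.get("last_submission_status")
    last_run = entry.get("last_submission_date")

    # An entry that has never been submitted is treated as a failure.
    if last_run is None:
        return CRITICAL, "entry has never been submitted"

    # Anything other than a successful submission should alert the admins.
    if status != "SUCCESS":
        return CRITICAL, f"last submission on {last_run} reported {status}"

    # A successful but old submission suggests the entry has stopped running.
    age = (today - last_run).days
    if age > max_age_days:
        return WARNING, f"last successful submission was {age} days ago"

    next_run = entry.get("next_submission_date")
    return OK, f"last submitted {last_run}, next run {next_run}"


if __name__ == "__main__":
    # Values taken from the example above (assumed day/month/year).
    entry = {
        "last_submission_status": "SUCCESS",
        "last_submission_date": date(2022, 5, 5),
        "next_submission_date": date(2022, 5, 6),
    }
    code, message = evaluate_entry(entry, today=date(2022, 5, 6))
    print(code, message)
```

The age test is the part that would have caught our original problem: the entry itself was fine, but nothing had been submitted for several days.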
We are always looking for additional items that need to be monitored. If you have any suggestions or would like to try out the product, let us know.