Handling PWRDWNSYS with Nagios Monitoring

When the system is scheduled to go down for routine maintenance, it’s important that the appropriate actions are taken to notify stakeholders in advance. Too often, our phones light up with alerts simply because someone has powered down the system to perform maintenance without informing the team.

When the alerts arrive, we are often unsure whether the system has genuinely gone down and requires urgent attention, or whether it is simply undergoing a planned restart and will return shortly. There is also uncertainty about how long the downtime will last. This typically leads to emails being sent to stakeholders seeking clarification before any appropriate action, such as suspending monitoring, can be taken. Over weekends, when fewer people are available, this creates unnecessary effort, even if it ultimately results in monitoring being suspended.

We needed an automated way to suspend monitoring whenever a controlled event (such as PWRDWNSYS) is executed, so that it does not generate a flood of unnecessary notifications to the monitoring team. This is where an Exit Point defined in the Registration Facility becomes useful. Using the Exit Point, we can attach a program that suspends monitoring until the system completes its IPL. We also include a call in the QSTRUPPGM to restart monitoring as soon as the system is back up, so that monitoring remains accurate and up to date.

IBM provides an Exit Point (QIBM_QWC_PWRDWNSYS) where you can register a program that is triggered when the PWRDWNSYS command is executed. While there are some considerations around its implementation and behavior, we were able to work through these and achieve a reliable solution. To simplify the process, we have added menu options to the NG4i configuration menu that allow you to register and deregister a program for this Exit Point. Alternatively, you can perform this directly within the Registration Facility and supply your own program—we’ve simply made it easier. Use Option 7 from the NG4iCFG menu to register the provided Exit Program (NG4I30/NG4I306).
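For readers who prefer to register the program directly rather than through the menu, the following is a minimal sketch of the equivalent commands as they might appear in CL source. The format name PWRD0100 and the program number 1 are assumptions; the article does not say which exit point format NG4I306 is written for, so confirm it before use.

```cl
/* Register the supplied exit program against the PWRDWNSYS exit point. */
/* FORMAT(PWRD0100) and PGMNBR(1) are assumptions, not from the article. */
ADDEXITPGM EXITPNT(QIBM_QWC_PWRDWNSYS) FORMAT(PWRD0100) +
             PGMNBR(1) PGM(NG4I30/NG4I306) +
             TEXT('Suspend Nagios checks on PWRDWNSYS')

/* Deregister it again, using the same format and program number. */
RMVEXITPGM EXITPNT(QIBM_QWC_PWRDWNSYS) FORMAT(PWRD0100) PGMNBR(1)
```

The menu option remains the simpler route, since it supplies the correct values for you.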

When you check the Registration Facility for the QIBM_QWC_PWRDWNSYS Exit Point, you will see that the program is correctly registered.
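One way to make this check from a command line is sketched below. From the resulting display, the option to work with exit programs (option 8 on the systems we have seen) lists the programs registered against the exit point.

```cl
/* Display the registration information for the PWRDWNSYS exit point */
WRKREGINF EXITPNT(QIBM_QWC_PWRDWNSYS)
```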

That is all you have to do. When the PWRDWNSYS command is used, this program is called and sends a request to the Nagios server to suspend active checks for this system.

Note: If you use PWRDWNSYS *IMMED, the exit program is called but is often ended before it can take any action. We are requesting more information from IBM on this behaviour, as it does not match what the documentation describes. For now, always use something like *CNTRLD DELAY(30) to ensure the program is called and has enough time to carry out the request.
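Written out in full, a controlled power down along the lines recommended in the note might look like the sketch below. RESTART(*YES) is an assumption for a planned restart and is not part of the note; the delay is in seconds and may need to be longer on your system.

```cl
/* Controlled power down with a 30 second delay, giving the exit  */
/* program time to ask the Nagios server to suspend checks.       */
/* RESTART(*YES) is an assumption: IPL again after powering down. */
PWRDWNSYS OPTION(*CNTRLD) DELAY(30) RESTART(*YES)
```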

You can check the results on the Nagios XI Tactical view where you will see that the services have been disabled.

The services need to be enabled again as part of the QSTRUPPGM program. This is a simple matter of adding NG4I30 to the library list and then calling the ENBHOSTSVC command. The following is an example of some code we use internally to carry out start-up after an IPL. We start the responder jobs before we tell the remote monitoring server to enable the services.

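NG4I:
            /* Skip the NG4i start-up steps if the product is not installed */
            CHKOBJ     OBJ(NG4I30/NG4I) OBJTYPE(*PRDDFN)
            MONMSG     MSGID(CPF9810 CPF9801) EXEC(GOTO CMDLBL(DONE))
            /* Add the product library; ignore "already in library list" */
            ADDLIBLE   LIB(NG4I30)
            MONMSG     MSGID(CPF2103)
            /* Start the responder jobs, then re-enable the Nagios checks */
            SBMJOB     CMD(STRNG4I) JOB(STRNG4I) JOBD(NG4I30/NG4IJOBD) JOBQ(QGPL/QS36EVOKE)
            ENBHOSTSVC
            RMVLIBLE   LIB(NG4I30)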
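The fragment above belongs in whichever program the QSTRUPPGM system value names, and it assumes that program defines a DONE label further down. If you are unsure which program that is on your system, the following command shows it:

```cl
/* Show the program the system calls at start-up after an IPL */
DSPSYSVAL SYSVAL(QSTRUPPGM)
```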

Once this is done, check the Tactical view again and you should see that all services are enabled again.

This gives us the desired outcome: even if we have not been informed of a PWRDWNSYS request, it is automatically caught and handled, reducing the number of false-positive notifications we receive.

Reducing the effort required to maintain and manage your IBM i environment is essential for today’s organizations. With the decreasing availability of IBM i skills—and the rising cost of those resources—it’s more important than ever to leverage tools that minimize overhead and enable teams to use their expertise as effectively as possible.

Want to try out Nagios Monitoring? Give us a call for a free 30-day trial. If you have any questions or need more detail, contact us via our web portal.
