We have removed the remote journals and now have a stable system again. The logs are back to normal, and we are no longer seeing message queues wrap every couple of hours.
The problem was caused by the way we had set up the remote journals. When you set the journal receivers to system managed (MNGRCV(*SYSTEM)) with delete receivers set to *YES (DLTRCV(*YES)), the system runs a process that repeatedly tries to delete the detached receivers from the LOCAL journal. The retry interval depends on what you set up; 10 minutes is the default. Because we had a remote journal attached to the LOCAL journal, the receivers have to be replicated to the remote system before they can be deleted. Once a receiver has been identified as ready for deletion, there is no way to set it back to "do not delete". So when we took the remote system down for maintenance (it has been down for a week because we have not had time to install some new features), the replication could not occur and the deletes were refused. Every time we created a new journal receiver, the newly detached receiver also became eligible for deletion and joined the retry cycle, so the number of failing delete attempts kept growing.
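For reference, a journal set up the way described above would typically have been configured with something like the following. This is a sketch only: MYLIB/MYJRN is a hypothetical journal name, and the delay value simply restates the default mentioned above rather than our actual setting.

```cl
/* Hypothetical journal name MYLIB/MYJRN, for illustration only.   */
/* System-managed receivers, deleted automatically once detached;  */
/* a failed delete is retried after DLTRCVDLY minutes.             */
CHGJRN JRN(MYLIB/MYJRN) MNGRCV(*SYSTEM) DLTRCV(*YES) DLTRCVDLY(10)
```

With a remote journal attached and the target unreachable, each retry for each detached receiver fails, which is the loop we ended up in.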
We did not realize this was occurring until our system slowed to a crawl and we started to look at what was happening! We had over 1,500 QHST log files, and every message queue attached to the journal, plus QSYSOPR, was wrapping every hour or so. DASD utilization was not high, but the constant cycle of sending messages, wrapping message queues, filling and changing QHST logs, and trying to connect to the remote system must have clogged the system up.
IBM's suggested solution of removing the remote journal created an additional headache, because it allowed all of the journal receivers to be deleted. So once we start the target system back up, we now have to copy the databases to the target and set up all of the remote journaling again.
If you are going to take down your target system for any period of time that would put you into a situation like ours, ENSURE you change the journal definition to delete receivers *NO (DLTRCV(*NO)) and do a journal receiver change BEFORE you bring the system down. This will keep you out of the loop we got into. If you forget to do it BEFORE you bring the target down, do it as soon as you can afterwards to minimize the pain. Just inactivating the RJ link will not help. If your target system goes down due to a failure, make sure you change the delete option ASAP.
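As a rough sketch, the change described above can be made with a single CHGJRN on the source system. MYLIB/MYJRN is a hypothetical journal name; substitute your own, and check the result against your own configuration before relying on it.

```cl
/* Hypothetical journal name MYLIB/MYJRN, for illustration only.   */
/* Turn off automatic receiver deletion and attach a new receiver  */
/* in one step, BEFORE the target system is taken down.            */
CHGJRN JRN(MYLIB/MYJRN) JRNRCV(*GEN) DLTRCV(*NO)

/* Display the journal attributes to confirm the delete setting.   */
WRKJRNA JRN(MYLIB/MYJRN)
```

Once the target is back up and the remote journal has caught up, the setting can be returned to DLTRCV(*YES) with another CHGJRN. Bear in mind that receivers will accumulate on the source system while deletion is switched off, so keep an eye on disk usage during the outage.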
I have asked IBM to look at providing a fix that would allow the delete cycle to be suspended or ended should this situation occur again. If they decide it's worth fixing, I will post the PTF details.