Journal Cache and its effects on HA solutions

I was involved in a forum discussion about the use of Journal Caching (option 42) available from IBM and how it could affect the recovery of a system in the event of a failure on the source.

The person who posed the question detailed the performance benefit they saw once option 42 was turned on. They had a batch process which took approximately 8 seconds to run before journaling was started; once journaling was active, the same batch process took approximately 80 seconds to complete. The type of remote journaling was not discussed, so the effect could be exaggerated if they were using synchronous mode. They then turned on journal caching and saw the same batch process complete in approximately 9 seconds.
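
For anyone who wants to reproduce the test, journal caching is enabled per journal with the CHGJRN command once option 42 (HA Journal Performance) is installed. A minimal sketch, using a hypothetical journal MYLIB/MYJRN:

    /* Turn journal caching on for the journal (requires option 42) */
    CHGJRN JRN(MYLIB/MYJRN) JRNCACHE(*YES)

    /* Turn it back off to compare the timings */
    CHGJRN JRN(MYLIB/MYJRN) JRNCACHE(*NO)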

The question they raised was how the use of journal caching would affect recoverability in the event of a failure. An IBM Technote explains the way the system handles the journal cache and can be found here: Technote journal caching. It explains fairly clearly how the process handles the loss of the source system.

A couple of things I found interesting were the explanation of the asynchronous mode of remote journal technology and the fact that the cache holds onto entries until a suitable number are available. I have not seen any information on how to manipulate that holding area, so it looks like IBM is making informed decisions about when it is right to send the data. Remote journaling cannot see the entries in the cache, so if an entry is in the cache it is not going to be on the remote system. The only way to avoid this is to turn on synchronous mode, which for a lot of people is not an option. I like the idea of the management jobs which go out and ensure nothing sits in the cache for too long, and the fact that IBM improved those jobs to be more aggressive about moving data out of the cache is important, especially in an HA environment.
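
To put the synchronous point in context, the delivery mode is set when the remote journal is activated, not on the journal itself. A hedged sketch, assuming a remote journal has already been added for the hypothetical MYLIB/MYJRN against a relational database directory entry called TARGETSYS:

    /* Asynchronous delivery - entries can still be sitting in the cache when the source fails */
    CHGRMTJRN RDB(TARGETSYS) SRCJRN(MYLIB/MYJRN) JRNSTATE(*ACTIVE) DELIVERY(*ASYNC)

    /* Synchronous delivery - the application waits until the entry is on the target */
    CHGRMTJRN RDB(TARGETSYS) SRCJRN(MYLIB/MYJRN) JRNSTATE(*ACTIVE) DELIVERY(*SYNC)

The cost of *SYNC is exactly the kind of slowdown described above, which is why it is not an option for a lot of shops.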

I would suggest anyone who has implemented an HA solution read this Technote; it helps define your recoverability and shows exactly what you should expect when switching to the remote system after the source fails. No amount of testing and planned switching will give you an understanding of what data will be missing when a hard failure occurs, and it will be up to you to decide if the data is in a usable state. A planned switch will inevitably flush the cache because you are stopping all update activity well before the switch is performed.
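
If you want to prove the cache is empty during a planned switch, one approach is to compare the tail of the journal on both systems after all update activity has ended. A minimal sketch, again using the hypothetical MYLIB/MYJRN, dumping the current receiver to an outfile on each side:

    /* Run on the source after the applications are ended */
    DSPJRN JRN(MYLIB/MYJRN) RCVRNG(*CURRENT) OUTPUT(*OUTFILE) OUTFILE(QTEMP/SRCTAIL)

    /* Run on the target against the replicated journal */
    DSPJRN JRN(MYLIB/MYJRN) RCVRNG(*CURRENT) OUTPUT(*OUTFILE) OUTFILE(QTEMP/TGTTAIL)

If the highest sequence number (the JOSEQN field in the default outfile format) matches on both sides, nothing was left behind in the cache.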

One of the biggest challenges will be determining what state the source system was in when the failure occurred; you will have no idea what jobs were active and what data pertained to those jobs.
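
The journal itself is the best forensic tool you have here, because every entry records the job that deposited it. Using the same kind of DSPJRN outfile as the sketch above, run on the target after the failover, you can at least list the jobs that were still depositing entries when the source went down:

    /* Dump the current receiver of the replicated journal to an outfile */
    DSPJRN JRN(MYLIB/MYJRN) RCVRNG(*CURRENT) OUTPUT(*OUTFILE) OUTFILE(QGPL/LASTENT)

    -- SQL over the outfile: which jobs had entries near the end?
    SELECT DISTINCT JOJOB, JOUSER, JONBR FROM QGPL/LASTENT

It will not tell you what those jobs would have written next, but it narrows down where to look.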

Good reading!

Chris…
