System recovery with LVLT4i

One of the questions we are asked most often when discussing the technology within LVLT4i is what a recovery of the customer's system would look like. Obviously every recovery is going to be different, but a basic set of guidelines and suggested technologies can help users imagine how they could deliver a recovered system to their users. We have done some research, looked at some of the technology we have implemented within Shield, and come up with a basic design that could be used to build out a customer's system.

An important factor in this recovery is the amount of data being replicated within the LVLT4i environment. The product was designed as a Library Vault offering which allows customers to have a Recovery Point Objective (RPO) of near zero for any number of critical libraries. Our initial thought was that this would suit the smaller customer who has no need for High Availability but does need to ensure that all of his critical data and objects are stored, as near to the latest update as possible, on a remote system. Recovery could take anything from 4 to 12 hours depending on the system configurations available at the remote site and just how the recovery is to be performed. Tape restores can be used for the base operating system and the client's non-critical objects where time or SLA requirements permit. This can result in a cost saving because of the license agreements IBM makes available (a system restore using a system save from a system that is no longer available allows the license from the original system to be used on the target system; you just need to contact IBM and set up the contractual requirements), although it will extend the recovery of the client's system because of the speed of tape technology.

It could be that there is a viable LPAR set up and waiting for an invocation, so it just needs the relevant changes made before the LVLT4i data recovery steps. This reduces the time taken for recovery but incurs additional costs for the client. In our test environment this is what we used for the initial testing described below.

Our preferred save method is to use image catalogs; they are very fast both for the save and for the subsequent recovery. We have set up a number of image catalogs which are used to save the content of the iASPs used for each client. Every night a save is run that either saves the entire iASP content (weekly and monthly) or just the changed objects (daily). Each daily save is a cumulative save of all objects changed since the last full save (weekly or monthly), so we only need one copy of the daily save volume. There are some objects which IBM adds to the iASP content that can be ignored on the save, as they are not required for the recovery. We opted to use *TAP image catalogs for simplicity: we did not want to be monitoring optical saves and adding new optical volumes when the initial ones filled, and we can also generate physical tapes if required. As part of the recovery process it is important that a save of the changed objects is run as soon as an invocation is declared; this ensures the saves used for the recovery contain the latest copies of the iASP objects.
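As a rough sketch, the setup on the backup system might look something like the following. The device, catalog, and volume names (VRTTAP01, CLIENTX, FULL01, DAILY01) are ours for illustration, not the actual LVLT4i configuration, and volume initialization and selection are omitted for brevity:

  /* One-time setup: a virtual tape device plus a *TAP image catalog */
  CRTDEVTAP  DEVD(VRTTAP01) RSRCNAME(*VRT)
  VRYCFG     CFGOBJ(VRTTAP01) CFGTYPE(*DEV) STATUS(*ON)
  CRTIMGCLG  IMGCLG(CLIENTX) DIR('/imgclg/clientx') TYPE(*TAP) CRTDIR(*YES)
  ADDIMGCLGE IMGCLG(CLIENTX) FROMFILE(*NEW) TOFILE(FULL01) IMGSIZ(100000)
  ADDIMGCLGE IMGCLG(CLIENTX) FROMFILE(*NEW) TOFILE(DAILY01) IMGSIZ(100000)
  LODIMGCLG  IMGCLG(CLIENTX) DEV(VRTTAP01) OPTION(*LOAD)

  /* Weekly/monthly: full save of the client iASP to the virtual tape */
  SAVLIB     LIB(*ALLUSR) DEV(VRTTAP01) ASPDEV(CLIENTX)

  /* Daily: cumulative save of everything changed since the last full save */
  SAVCHGOBJ  OBJ(*ALL) LIB(*ALLUSR) DEV(VRTTAP01) ASPDEV(CLIENTX) REFDATE(*SAVLIB)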

The system where the recovery is to be made is on the same LAN as the client's backup system, which allows a fast transfer of the image catalog entries between the systems (1Gb LAN). We also decided to use NFS mounts to map the client's backup system to the recovery system, so the actual saves exist on the client's backup system, not on the recovery system; we would have had to copy the objects between the systems and then create the catalog entries anyway, so we felt this was an acceptable option. We did have a few issues with the content of the mapped image catalog not being refreshed after an addition was made (it seems the ADDIMGCLGE command copies the actual content, not the link itself), so we have to add the image catalog entry at recovery time. We also had to set the authority for the NFS export as part of the setup; only read access is needed, as the content is only ever read from the recovery system.
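A minimal sketch of the NFS side, assuming the backup system is known to the recovery system as BACKUPSYS and the catalog directory is /imgclg/clientx (both names are ours for illustration):

  /* On the client's backup system: export the catalog directory read-only */
  STRNFSSVR  SERVER(*ALL)
  CHGNFSEXP  OPTIONS('-i -o ro') DIR('/imgclg/clientx')

  /* On the recovery system: mount the export over a matching directory */
  MKDIR      DIR('/imgclg/clientx')
  MOUNT      TYPE(*NFS) MFS('BACKUPSYS:/imgclg/clientx') MNTOVRDIR('/imgclg/clientx')

  /* Because ADDIMGCLGE copies the content rather than linking to it, the
     entries are added at recovery time from the existing image files */
  CRTIMGCLG  IMGCLG(CLIENTX) DIR('/imgclg/clientx') TYPE(*TAP)
  ADDIMGCLGE IMGCLG(CLIENTX) FROMFILE(FULL01) TOFILE(*FROMFILE)
  ADDIMGCLGE IMGCLG(CLIENTX) FROMFILE(DAILY01) TOFILE(*FROMFILE)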

Once we had set up the image catalog on the target system and added the image catalog entries, we were able to view the content and restore it using the standard *RST commands to the relevant libraries on the recovery system. As usual, the IFS has a few features which have to be managed when saving it: we did not want to save the iASP directory itself (such as '/CLIENTx') as part of the save, so we had to adjust the save commands to ensure this did not cause problems on the restore. We could probably have simply managed the removal of the '/CLIENTx' directory on the restore, but this method worked just as well. When you run the SAV command, make sure you do not save the QSYS.LIB objects either; these are stored in the '/CLIENTx' directory.
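For example, the IFS portion of the save and restore might be driven along these lines (the paths are ours for illustration, with '/CLIENTX' standing for the iASP directory):

  /* Save the client's IFS data, omitting the QSYS.LIB objects that
     appear under the iASP directory */
  SAV DEV('/QSYS.LIB/VRTTAP01.DEVD')
      OBJ(('/CLIENTX/*' *INCLUDE) ('/CLIENTX/QSYS.LIB' *OMIT))

  /* On the recovery system, restore the directories to their normal
     location rather than under the iASP directory */
  RST DEV('/QSYS.LIB/VRTTAP01.DEVD')
      OBJ(('/CLIENTX/home/*' *INCLUDE '/home/*'))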

The first action was to restore the data using the last full save; this takes the longest time, as it has to restore everything in the iASP. Once this has completed you can restore the last copy of the changed objects, which results in a perfect copy of the iASP data within the client's recovery system.
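In command terms the sequence is simply the full restore followed by the cumulative changed-object restore. The library name is illustrative; RSTOBJ works a library at a time, so in practice a small CL program would drive it for each library, and the target ASP parameters (RSTASP/RSTASPDEV) may be needed depending on how the recovery system is configured:

  LODIMGCLG IMGCLG(CLIENTX) DEV(VRTTAP01) OPTION(*LOAD)

  /* 1. Restore everything from the last full save volume */
  RSTLIB    SAVLIB(*ALLUSR) DEV(VRTTAP01)

  /* 2. Layer the cumulative changed objects on top, library by library */
  RSTOBJ    OBJ(*ALL) SAVLIB(APPLIB) DEV(VRTTAP01)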

Now that we have the recovery system restored using the data from the backups, we just need to run a few scripts to apply the updates we had captured and replicated to the iASP. This recovers the profiles and system values to the latest values and even sets the user passwords to the last ones captured on the client's source system (the passwords are not readable; they are a capture of the storage IBM uses, so they cannot be read or changed). Other actions such as resetting the configuration objects, adding license keys, and starting communications and subsystems can be done as soon as the restores are finished.
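The scripts which apply the replicated LVLT4i data are part of the product, but the finishing actions are standard commands; a generic sketch (the line and subsystem names are ours for illustration):

  /* Add license keys, bring up communications, then start the subsystems */
  ADDLICKEY LICKEYINP(*PROMPT)
  VRYCFG    CFGOBJ(ETHLINE) CFGTYPE(*LIN) STATUS(*ON)
  STRTCP
  STRSBS    SBSD(QINTER)
  STRSBS    SBSD(QBATCH)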

In our test we went from a bare-bones system to a fully recovered system in a few minutes (we have very little data to recover and used a running system for the recovery, so it is not necessarily a true representation of a real recovery and not something a client should expect). Testing the recovery using the client's data should provide an indication of the recovery time they can achieve in a real disaster and of how much additional effort is required to bring the applications back online for the users.

You can use this kind of process for many other activities, such as testing OS upgrades and application upgrades. Restoring the data to a recovery system does not have to affect the ongoing replication between the client's production system and the target iASP, which means that if a disaster strikes while you are in the middle of a test, your data and objects will still be at the latest level possible.

LVLT4i is not a High Availability product; it has been designed for those clients who need to be able to recover on a rebuilt system in 4 to 12 hours (achieving those times depends on a number of factors which have to be set by the Managed Service Provider). This is probably the same market that the vaulting products compete in, but LVLT4i has an advantage: it can provide an RPO of near-zero data loss, whereas the competing products tend to use backup technology to provide recovery points, which means their RPO is only as good as the last successful save.

If you need this kind of solution (not everyone needs a High Availability solution, but recovery in a reasonable time frame with no data loss is important), get in touch. If you are a customer and do not have a Managed Service Provider who can offer this kind of solution, let us know and we can put you in touch with one who does. If you are a Managed Service Provider and need to provide this kind of technology, call us; we will be happy to discuss the product and its capabilities.

Chris…