An unfortunate system loss at one of our customers highlighted a problem that resulted in the loss of data from a couple of objects. The loss occurred on the target side of the replication environment, so no production data was lost, but it did require an update to our software to ensure it would not be a problem in the future.
The problem comes from the fact that the OS does not force a space object (User Space, User Index etc.) to Auxiliary Storage (DISK) unless the memory it occupies is needed for another object. This means the object may never be written to disk until an IPL occurs (IPLs cause a write of all temporary storage to disk). We had always thought that any object used by a program would be flushed from memory once that program ended, but as IBM confirmed, this is incorrect. The problem is compounded by the fact that we have much more memory available on the IBM i than ever before, so the chance of the memory being reclaimed, and the object being written to disk as a result, is even more remote. In the customer's case the data in the userspaces on disk was over 5 months old (the date of the last IPL).
We raised a PMR with IBM to find out why these objects had never been updated on disk and to ask how we should go about fixing the issue. IBM regards this as working as expected; one of the reasons these objects are so fast to update is that they are not forced to disk immediately. IBM has agreed to update the documentation to reflect this so people realize some of the side effects these kinds of objects carry.
With a User Index you can use an option on the create to have every write to the object forced to disk, either immediately or asynchronously. We use User Indexes in our products and force each write to disk asynchronously. The way we use them avoids the overhead and speed penalty, because the content is only written once and updates are very infrequent, so they are perfect for our purposes, which require fast indexed reads.
Userspaces are different though: they have no inbuilt method to force them to disk. To force userspace content to disk you need to use an API or an MI instruction. The MI instructions have a side effect which we found intriguing. When they are used they do not register a change to the object; they only force the data content to disk, so you need to take additional action to ensure the updates are logged in the Audit journal.
We looked at the MI instructions first and decided that the SETACST MI instruction was probably the best option for us, so we coded it up and tested it against one of our userspaces. We set the write flag to asynchronous so our programs do not wait for the write before continuing (one of the reasons we use userspaces is their speed of update). To verify that the writes have occurred you can use the PEX instructions to log the writes to disk, but we felt that IBM's word was good enough for us, so we did not test using them. This still left the problem of the updates not being logged. To confirm it, we set the Object Audit setting to *CHANGE for the userspace and, after running the MI instruction, checked the audit journal for an update entry, which we did not see. To ensure an entry is logged we needed to use the QUSCHGUS API, which forces the object to disk and causes an update entry to be added to the audit journal. Care has to be taken over where the QUSCHGUS API is used; in our programs, calling it at the end of the program was enough to set the required information and cause the entry to be deposited in the audit journal.
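To make the shape of that call a little more concrete, the sketch below builds the required parameter values for QUSCHGUS, assuming the layout in IBM's API documentation: a 20-character qualified name (10 for the user space, 10 for the library), a 1-based starting position, the data length, the data itself, and a one-character force option ('0' no force, '1' asynchronous, '2' synchronous). It is written in Python purely so the layout can be checked anywhere; the real call has to be made on IBM i from a program that can call the API. The object names MYSPACE and MYLIB and all of the helper names are hypothetical.

```python
from dataclasses import dataclass

# Force options for QUSCHGUS, one character each (per IBM's API documentation).
FORCE_NONE = "0"    # leave paging to the system
FORCE_ASYNC = "1"   # force to auxiliary storage, do not wait for the write
FORCE_SYNC = "2"    # force to auxiliary storage and wait for the write


def qualified_name(name: str, library: str) -> str:
    """Return the 20-character qualified name: object (10) + library (10)."""
    if not 1 <= len(name) <= 10 or not 1 <= len(library) <= 10:
        raise ValueError("name and library must be 1-10 characters")
    return name.ljust(10) + library.ljust(10)


@dataclass(frozen=True)
class ChangeUserSpaceParms:
    """Hypothetical holder for the required QUSCHGUS parameters."""
    space: str    # CHAR(20) qualified user space name
    start: int    # BINARY(4), 1 is the first byte of the space
    length: int   # BINARY(4), number of bytes to change
    data: bytes   # the bytes to write
    force: str    # CHAR(1) force option


def build_parms(name, library, start, data, force=FORCE_ASYNC):
    """Validate the values and pack them in the order the API expects."""
    if start < 1:
        raise ValueError("starting position is 1-based")
    if force not in (FORCE_NONE, FORCE_ASYNC, FORCE_SYNC):
        raise ValueError("unknown force option")
    return ChangeUserSpaceParms(
        qualified_name(name, library), start, len(data), data, force
    )


# Asynchronous force is the default here, matching the choice described above.
parms = build_parms("MYSPACE", "MYLIB", 1, b"last run OK")
```

On a real system the character values would be passed in the job's EBCDIC encoding rather than as Python strings.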
If these objects are part of an application that you are going to replicate via a replication solution for High Availability/Disaster Recovery, the placement of the QUSCHGUS API is probably more important than the use of the MI instruction. Because no entry is written to the audit journal until the API is run, the replication process will never see the object as having been updated, so it will not be replicated to the target system. Running the QUSCHGUS API too often could cause the object to be replicated too frequently, causing backlogs and bottlenecks in the replication process, so it is important to run the API only at important points in the process. The QUSCHGUS API automatically causes the SETACST MI instruction to be run, so you do not need to run both.
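One way to keep the API call at those important points is to track whether the space has changed and only force it at a checkpoint. The sketch below is in Python for illustration only and is not how our products are written; `CheckpointForcer` and the `force` callback are hypothetical, with the callback standing in for whatever code actually calls QUSCHGUS on IBM i.

```python
from typing import Callable


class CheckpointForcer:
    """Collapse many in-memory updates into one forced write per checkpoint."""

    def __init__(self, force: Callable[[], None]) -> None:
        self._force = force   # stand-in for the code that calls QUSCHGUS
        self._dirty = False   # True once the space has unforced changes
        self.forced = 0       # how many times force() was actually called

    def updated(self) -> None:
        """Record that the user space was changed in memory."""
        self._dirty = True

    def checkpoint(self) -> bool:
        """Force the space only if something changed since the last force."""
        if not self._dirty:
            return False
        self._force()
        self._dirty = False
        self.forced += 1
        return True


calls = []
forcer = CheckpointForcer(lambda: calls.append("QUSCHGUS"))
for _ in range(1000):
    forcer.updated()      # many cheap in-memory updates
forcer.checkpoint()       # a single forced write for all of them
forcer.checkpoint()       # nothing changed since, so nothing is forced
```

With this pattern the replication process would see one change per checkpoint rather than one per update. In our programs the checkpoint was simply the end of the program; a longer-running job might need to choose its checkpoints more carefully.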
In a previous post we discussed other replication processes, such as those that only react to changes made by commands. This solution will not help in those environments, as neither the API nor the MI changes will be captured. So if you are using space objects in your applications and need to ensure they are replicated correctly, you need to consider the above and how you can implement a secure method of capturing changes that your replication software will handle.