Why the interest

My interest was piqued by a recent LinkedIn comment about how using the Audit Journal to capture replication requests was a poor design choice. I have been developing High Availability products and solutions for a number of years using the Audit Journal as the trigger to capture an object creation/change/deletion, so I was interested to find out why this product's technology was such a ‘leap ahead in the world of High Availability Disaster Recovery’, according to the writer of the blog post, which can be read here.
Technology used

Initially I looked for information on how the company marketed its replication technology, to see just how advanced it was. I have always known that it relies on a library which has to sit above the QSYS library in the library list, so that command requests can be intercepted and their own processes called, allowing the object creation/change/deletion to be replicated to a target system. At first I could find no real in-depth description of how they were carrying out this process, so my beliefs about the significance of the technology were based entirely on guesswork and on how I would implement a process that works that way. However, after some effort and a lot of searching around the web, I found a patent on Google which has been flagged as intention to grant. I could not access the actual patent, so the images which show the actual process are not available, but a fairly concise description of the method is, and the patent was lodged by employees of the company that wrote the blog post. My thoughts below are based on that described method.
I have to be honest and say that the way they have coded around the challenges associated with having duplicate commands is pretty smooth. Using the QIBM_QCA_RTV_COMMAND exit point to extract the actual command string, and having a single command processing program (CPP) for all of the duplicated commands, is a very effective way of removing the problems associated with IBM command parameter changes and the like. But I do see a number of problems which this does not address; maybe there are other methods, not described in the patent, which would cover these, but without knowing more it is difficult for me to say. The following is probably not an exhaustive list of the issues related to a design such as this, but it attempts to give an overall view of my major concerns.
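To make the single-CPP idea concrete, here is a minimal sketch, in Python rather than CL, of how one command-processing program could service every duplicated command: capture the original command string (which on IBM i would come from the QIBM_QCA_RTV_COMMAND exit point), queue it for replication, and forward it to the real system command. This is my own illustration of the described method, not the vendor's code, and all names are invented.

```python
# Hypothetical sketch: one shared CPP for all duplicated commands.
# The command string stands in for what QIBM_QCA_RTV_COMMAND would return.

replication_queue = []   # stands in for the transport to the target system
executed = []            # stands in for the real QSYS commands actually run

def run_system_command(command_string):
    """Stand-in for invoking the original QSYS command locally."""
    executed.append(command_string)

def shared_cpp(command_string):
    """Single CPP for every duplicated command: capture, replicate, forward."""
    replication_queue.append(command_string)   # send a copy to the target
    run_system_command(command_string)         # run the real command locally

# Any duplicated command funnels through the same CPP, so an IBM parameter
# change never requires per-command maintenance in the duplicate library:
shared_cpp("CRTMNU MENU(MYLIB/MAINMNU) TYPE(*DSPF)")
shared_cpp("CHGOBJOWN OBJ(MYLIB/MAINMNU) OBJTYPE(*MENU) NEWOWN(APPOWN)")
```

The key point the patent method exploits is that the duplicate commands carry no parameter logic of their own; the retrieved command string is passed through untouched.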
Non command based changes

My first concern is for object creations/changes/deletions which do not occur as a result of a command. I use APIs a lot for object creation, deletion and updates, and from what I have seen in the method description none of these would be picked up and replicated to the target system. The IBM i is changing constantly, and so are the programs which are written to address the objects and their content, so commands are not necessarily the only method being used to create, change and delete objects. Maybe there are not a lot of objects being changed using APIs, but it only takes a single missed object which is critical for the application to start up correctly, and everything else is wasted.
Qualified commands in programs

Another concern is the use of qualified commands: a programmer can qualify the command so that it runs his own command instead of a system command, such as FREDS/CRTMNU or even QSYS/CRTMNU. The programmer may work for a vendor who does not supply the source code with the programs, so finding out if this command structure is used can be difficult. Finding the duplicate in FREDS would be as simple as running WRKOBJ *ALL/CRTMNU *CMD and seeing that FREDS/CRTMNU exists. However, you may not be able to see where that command is called, especially if you do not have the source code for all of your programs.
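The bypass can be sketched in a few lines. The vendor's duplicate only intercepts calls that resolve through the library list; a call qualified straight to QSYS, or to a private library such as FREDS, never touches it. This is a simulation of the IBM i command-resolution rule with invented library names (VENDORLIB stands for the intercept library described above):

```python
# Hypothetical sketch: qualified calls bypass the intercepting duplicate.
COMMANDS = {
    "VENDORLIB": {"CRTMNU"},   # vendor's intercepting duplicate, above QSYS
    "QSYS": {"CRTMNU"},        # the real system command
    "FREDS": {"CRTMNU"},       # a programmer's own duplicate
}

def resolve(command, library=None, library_list=("VENDORLIB", "QSYS")):
    """Qualified calls go straight to the named library; unqualified calls
    search the job's library list in order, as IBM i does."""
    if library is not None:
        return f"{library}/{command}"
    for lib in library_list:
        if command in COMMANDS.get(lib, set()):
            return f"{lib}/{command}"
    raise LookupError(command)

print(resolve("CRTMNU"))                   # VENDORLIB/CRTMNU - intercepted
print(resolve("CRTMNU", library="QSYS"))   # QSYS/CRTMNU - bypassed
print(resolve("CRTMNU", library="FREDS"))  # FREDS/CRTMNU - bypassed
```

Only the first call would ever reach the replication process; the other two run without the vendor's CPP seeing them.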
In-house programs

The next concern is the use of your own programs that contain commands which do some processing before calling the system command; maybe you have a program that creates or updates some objects using APIs before it calls a system command to change authority, etc. Even if you did not qualify the system command and it was called as part of the vendor's replication process, on the target it is only going to try to change the authority against an object which does not exist. Are you going to have to verify every program before you can implement this software?
Impact to command processing

I have not done any testing, so I am not sure of the impact the above process adds to running a system command. It may be that the time taken for each command is fairly minimal, so it cannot be seen when tests are done to show how a change to a single object is made and replicated to the target system. What happens when it is being run many times as part of a batch process? Does the time taken to run the above process against every command invocation add up to a significant impact on the system? I do know that we hit a number of problems with this kind of process in the past, when we tried to carry out something similar with the SBMJOB command. If it is a problem, maybe the fact that systems are so much faster now negates that impact anyhow. Personally, using any technique which hijacks another programmer's process is not something which should be encouraged.
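The batch concern is simple arithmetic. A fixed per-command cost that is invisible on one invocation adds up linearly across a batch run; the overhead figure below is invented purely for illustration, since I have done no measurements:

```python
# Back-of-envelope sketch: a tiny, fixed intercept cost per command
# multiplied across a batch workload. The 5 ms figure is an assumption.

def batch_overhead_seconds(command_count, per_command_overhead_ms):
    """Total added wall-clock time for the whole batch, in seconds."""
    return command_count * per_command_overhead_ms / 1000.0

# Invisible on the single-object demo...
print(batch_overhead_seconds(1, 5.0))        # 0.005 seconds
# ...but a batch run issuing 200,000 commands pays over 16 minutes for it:
print(batch_overhead_seconds(200_000, 5.0))  # 1000.0 seconds
```

Whether the real per-command cost is closer to microseconds or milliseconds is exactly what a single-object demonstration cannot tell you.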
Security

Security is one thing that is constantly ignored on the IBM i; we have been led to believe that it is so secure that it's not something we need to worry about. That is changing as we expose some of the newer functionality on the system and allow remote access to use that functionality. We are hearing a lot more from the experts about just how exposed our IBM i systems are due to the belief that we have the most secure system there is. Here we have a program that is going to have complete access to the creation/deletion/change of system objects based on a character string passed in. I do not know the internals of the program, but hopefully they have some level of additional security to ensure the process cannot be compromised.
Capture after change

One of the concerns raised about using the Audit Journal as described in the method is that changes can only be captured after they have occurred, and therefore the object may be locked, so a capture of the changes may not be possible. This may be true for objects which have exclusive locks, but in our experience this is not a regular occurrence, and using a save-while-active request against a locked object successfully captures a copy of the object without too much delay. When we look at all of the objects which are likely to be locked exclusively, and which ones are updated regularly, we see that most of them are the ones which can be journaled through the normal user journals; this could be why we see very little in terms of locking issues at the customers who are running our software.
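The capture-after-change handling amounts to a retry loop: if the object is exclusively locked, back off and reissue a save-while-active style request rather than failing the capture outright. Here is a minimal sketch of that idea; the lock check and the save are simulated (on IBM i the save would be a SAVOBJ with save-while-active), and the object name is made up:

```python
import time

def save_object(obj, is_locked, attempts=3, delay_s=0.0):
    """Try to capture a copy of obj; retry while it is exclusively locked.
    is_locked(obj, attempt) stands in for the real lock check."""
    for attempt in range(1, attempts + 1):
        if not is_locked(obj, attempt):
            return f"saved {obj} on attempt {attempt}"
        time.sleep(delay_s)   # back off before the next save-while-active try
    raise TimeoutError(f"{obj} still locked after {attempts} attempts")

# Simulate an exclusive lock that clears by the second attempt:
print(save_object("MYLIB/ORDERS", lambda obj, n: n < 2))
# prints "saved MYLIB/ORDERS on attempt 2"
```

In practice the retry window is what matters: if exclusive locks are rare and short-lived, as we observe, the capture succeeds with little delay.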
Speed of replication

There have been suggestions in the past that this method of object replication is far faster than using the Audit Journal and that it uses a lot less CPU. Personally, I have yet to see any issues with the timing of object changes in the technology we implemented. Yes, we capture the changes after the change has occurred, and there is a CPU impact with the retrieval of the audit journal entries which can be seen in the active job displays, but consider the following:
- An object is created from source.
- The object is then changed to remove the textual description.
- The object is then changed to remove the source content links.
- It is then changed to add PTF information.
- The owner is changed, and it is then changed to run under the owner profile at run time.
- Maybe the object is changed to have the audit information captured, as it is a legal requirement.
All of this is done for each of the objects in a library as part of an application update. Each of the above steps will be captured individually and sent across to the target system; each of these actions will create a delay in the creation of each object, and each requires a save of the object to a save file and then an IFS object before it is replicated to the target system. You have added journal bloat for each action carried out: each request will have generated an individual entry (maybe including a save of the object in each entry?) in the journal, which has to be transmitted to the target and stored for processing.
With our process we will see each of the requests come through the journal. We will add a replication request for the actual program through a save and restore (we do not try to compile on the target system, even if the source were available, and there are lots of reasons for that). We flag the fact that we are going to replicate the object in this manner, so any additional requests that arrive before the object is replicated to the target are ignored; that way we are not replicating requests that are already captured as part of the initial request. Once the object has been saved, and before it is sent to the remote system, we unflag the object so new requests will be captured.
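The flag-and-coalesce logic above can be sketched as a small in-flight set: once an object is flagged for replication, further audit-journal requests for it are dropped until the save completes, so a single save-and-restore carries all of the accumulated changes. This is an illustrative simplification of our approach, with invented names:

```python
# Sketch of coalescing audit-journal requests behind one pending save.
in_flight = set()   # objects flagged as awaiting capture
sends = []          # saves actually shipped to the target system

def request_replication(obj):
    """Queue obj for replication unless a capture is already pending."""
    if obj in in_flight:
        return False              # change is covered by the pending save
    in_flight.add(obj)
    return True

def complete_save(obj):
    """Object saved and sent; unflag so new requests are captured again."""
    sends.append(obj)
    in_flight.discard(obj)

# Six audit entries for the same object during an application update:
for _ in range(6):
    request_replication("MYLIB/APPPGM")
complete_save("MYLIB/APPPGM")

print(len(sends))   # 1 - a single save covers all six changes
```

The save captures the object as it stands at save time, so the later changes ride along for free; the unflag step at completion is what keeps changes arriving after the save from being lost.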
So is it better to replicate the object as a single request (less CPU, less journal content, less network overhead), or, as the above design shows, to make five or six individual requests to get to the same result? The question of the object being in line with data changes is not even a consideration, as we do not use these objects to address the data content. Remember, if it is a data-holding object such as a file, data area, data queue or IFS object, we are going to mirror the creation of the object as part of the user-journal-based apply.
QAUDIT Journal startup

In the past, one area which concerned me was the start-up time when verifying the start point in the QAUDIT journal. However, with proper management of the receiver chain and the improvements IBM has made in the OS, I no longer have any concerns; the few seconds taken to get the process started are far less costly than alternatives such as that described above. As for CPU overhead when the process is running, I do not think it's a like-for-like comparison: the overhead of the replication process above is being added to the system process itself, such as the generation of the object, rather than sitting in the replication tool's own overhead as it does in the QAUDIT process.
Object synchronicity with Data changes

There has always been a discussion about the synchronicity between the data replication (objects which are journaled to the user journal) and the object replication process. This solution appears to go some way to reduce that concern by using the user journal as the transport method to the target system, but again there are some side effects which are not always discussed. I do not know how this synchronicity is managed in this solution. If the user journal which is used for capture is not the same one which is used to capture the data changes, then all bets are off; it falls into the same bucket as every other technology. If it uses journals which are linked to the data changes, it could also have issues, because the data changes are possibly being generated by processes outside of the object creation processes, so the data changes could be out of sync with the objects. How do they determine which user journal to write the entries to? Just because you change a program does not mean it is going to be used to generate any data by the running application (QUSRPLOBJ), so why would it be important to synchronise them? I also noted that some time ago an article was published by the same company about how they had significantly improved the IFS replication throughput by adding additional feeds to the remote system; if this process relies on that same replication process, then how are they ensuring that a request in one feed never precedes a request it depends on in another feed?
Conclusion

In my opinion this technology is just another option, and it works, but the claims that it is far superior to that implemented by other ISVs are just not justified when you consider the above. Out of all of the ISVs that provide High Availability software, I believe that most use the QAUDIT journal as the basis for triggering object change capture for an obvious reason: if this design were such a significant improvement, surely someone else would have seen the benefits and implemented something similar?
I had hoped that a subsequent post released by the author would shed more light on the benefits they are seeing with the technology (see part-2), but unfortunately I could not see any technological reason for the claims of superiority. Like I said to the author in a separate reply, sometimes the Kool-Aid always tastes better at the source. I am sure he believes the technology is the best, just as I doubt it.
Our products are gaining in popularity, and the technology is always being updated as IBM changes the capabilities of the OS. If we see something which promises to add a significant benefit for our customers, we will endeavor to add it to the products; this technology is not one we see as offering that benefit.