We have been working on a new collection engine for the JobQGenie product for a number of months, driven by a customer who saw some data collection errors with the product. The customer’s environment is, to say the least, a very demanding one: they push over a million jobs through the system every 2-3 weeks. Some of these jobs run for only a few seconds, which doesn’t leave much time to collect the information JobQGenie needs to re-submit them should that be required.
IBM provides a little information about the job via the Exit Points in the OS; the rest has to be collected using APIs for each and every job as it runs. The initial design would sometimes fail to collect some information, either because of our data storage process or because the job simply ended before we had a chance to collect it. Luckily the customer was very keen to iron out the kinks and make it work, because they knew the product would be critical during their role swaps.
The first problem we had to overcome was record update clashes. The data files we use had to be re-aligned so that independent programs could store their data without the possibility of overwriting data added by another program. Because we collect information using multiple programs, all trying to capture as much as possible, we also needed to reduce the chance of missing data that is only available at certain points in the program’s life cycle. Restrictions imposed by the OS for Exit Points played a big part in the design, because we needed to react to the Exit Point information immediately yet were limited by the number of programs that can be attached to the Exit Points! All of this had to be done while keeping the product’s overhead to a minimum; after all, it’s no use running a data collection program which interferes with normal job flow.
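One way to picture the "no overwriting" requirement is a store in which every collecting program writes only to its own row for a job, so two programs never update the same record. The sketch below is purely illustrative and is not how JobQGenie's data files are actually laid out; the class name, collector names, and field names are all hypothetical.

```python
class JobDataStore:
    """Illustrative store: one row per (job, collector), never shared."""

    def __init__(self):
        # Key is (job_id, collector_name); value is that collector's fields.
        self._rows = {}

    def store(self, job_id, collector, fields):
        # A collector only ever touches its own row for the job, so it
        # cannot overwrite what another collector has already saved.
        self._rows.setdefault((job_id, collector), {}).update(fields)

    def merged(self, job_id):
        # Build a combined view at read time, grouped by collector.
        return {
            collector: dict(fields)
            for (jid, collector), fields in self._rows.items()
            if jid == job_id
        }


store = JobDataStore()
# Hypothetical collectors: one driven by an exit point, one by an API call.
store.store("123456/QUSER/NIGHTLY", "exit_point", {"job_queue": "QBATCH"})
store.store("123456/QUSER/NIGHTLY", "api_collector", {"library_list": ["QGPL"]})
print(store.merged("123456/QUSER/NIGHTLY"))
```

The point of the design is that the combined picture of a job is assembled when it is read, rather than by several programs updating one shared record as the job runs.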
Another design issue was the job queue level filtering we had built into the product; the customer was not willing to give up the ability to filter for specific Job Queues. IBM has finally agreed to ship a PTF for V6R1 which allows every notification through the Exit Points to carry the Job Queue information, but as the product runs from V5R3 upwards we had to design alternative methods to capture the related job queue for each job! Eventually the product will only support OS levels which have this feature enabled, which will significantly reduce the hoops we jump through to get the information.
Once we had the job data being collected consistently, we set about delivering tools within the product which allow the customer to automatically re-submit the jobs on the target. The initial design required a user to select each job entry for re-submission, which became a little tiresome when they could have thousands of jobs to process. Next they wanted to be able to reload the jobs by set criteria such as job priority or job queue.
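Selecting jobs by criteria rather than one entry at a time can be sketched as a simple filter over the captured job entries. This is a minimal illustration only, not the product's interface; the `CapturedJob` fields and the `select_for_resubmit` function are hypothetical names, and the job names are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedJob:
    name: str
    job_queue: str
    priority: int


def select_for_resubmit(jobs, job_queue=None, priority=None):
    """Return the captured jobs that match every criterion supplied."""
    selected = []
    for job in jobs:
        if job_queue is not None and job.job_queue != job_queue:
            continue
        if priority is not None and job.priority != priority:
            continue
        selected.append(job)
    return selected


captured = [
    CapturedJob("NIGHTLY01", "QBATCH", 5),
    CapturedJob("INVOICE02", "QBATCH", 3),
    CapturedJob("REPORT03", "QPGMR", 5),
]

# Pick every captured job on one job queue with a given priority.
for job in select_for_resubmit(captured, job_queue="QBATCH", priority=5):
    print(job.name)
```

A criterion that is left as `None` is simply ignored, so the same function covers "everything on this job queue" as well as narrower selections.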
We also beefed up the tools which provide critical information about the data each job has produced, via the journals, and any objects it relates to. The tools allow a user to identify any data repairs which are required prior to a job being re-submitted; after a system failure this information can speed up the recovery process significantly.
The customer has been running the product with the new updates for about 3 months. Our last conversation with them was very gratifying: they said they had stopped checking the product every day because it just ran! The updates are now available from our website download area for other customers who are running the product or wish to trial it.
Here are some statistics about the processing the product did in a normal day.
On Monday 1/4, there were 38,351 jobs processed by JQG.
Between 01:45 and 03:05, a relatively busy time, there were 7,150 jobs/hour, or 479.3875 jobs/minute, or 7.989792 jobs/second.
Between 19:52 and 19:57, there were 11,939 jobs processed, which is 2,387.8 jobs/minute, or 39.79667 jobs/second.
The Average CPU Utilization never went above 0.1% for any of the jobs during the processing periods; most used less than 0.05%.
We have seen it handle larger numbers at this customer during special batch runs (over 100,000 jobs ran overnight), most running for less than a few seconds!
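For anyone who wants to check the rates, the per-minute and per-second figures quoted above for the 19:52 to 19:57 window follow from dividing 11,939 jobs by the five-minute window. The short sketch below does that arithmetic; the `job_rates` function name is ours, for illustration only.

```python
from datetime import datetime


def job_rates(job_count, start, end):
    """Return (jobs per minute, jobs per second) for a same-day HH:MM window."""
    fmt = "%H:%M"
    window = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    seconds = window.total_seconds()
    if seconds <= 0:
        raise ValueError("end must be later than start on the same day")
    return job_count / (seconds / 60), job_count / seconds


per_minute, per_second = job_rates(11939, "19:52", "19:57")
print(f"{per_minute:.1f} jobs/minute, {per_second:.5f} jobs/second")
# prints "2387.8 jobs/minute, 39.79667 jobs/second"
```

The same calculation with 38,351 jobs over the 80 minutes from 01:45 to 03:05 gives 479.3875 jobs/minute, which matches the per-minute figure quoted for that window.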
So we feel we now have a first-class product that can work in some of the most toxic processing environments out there!
If you are running an Availability solution, JobQGenie is a perfect companion which provides that missing part of the puzzle.