When we started to build the Crc Builder, we used the technology we had built into the RAP Auditing product as the base. While the Auditing functions work, we felt the speed needed to improve, so we set about looking for better ways to manage the process.
A forum post from Chris Edmonds got us looking at exactly what we were doing and why. He was looking to implement a file checker using the ZLIB source and wondered whether the process could be used against two files on different systems. He had already built an RPG program that would check a single defined file and was looking for clarification of the results. We then developed the first program for Crc Builder to see what we could offer using the IBM QC3CALHA API.
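For anyone who has not played with it, a single call to the hash API over a buffer looks roughly like the sketch below. This is only a sketch based on the documented parameter list and the ALGD0500/DATA0100 formats; the prototype is written out inline rather than pulled in from the QSYSINC headers, the SHA-1 choice is just for illustration, and it will only bind on IBM i where the Qc3 service program is available, so check the layouts against the IBM documentation before trusting it.

/* Prototype for Qc3CalculateHash (QC3CALHA), transcribed from the IBM
   documentation; on a real system this would come from QSYSINC.        */
void Qc3CalculateHash(void *inputData, int *inputLen, char *inputFmt,
                      void *algDesc, char *algFmt, char *cryptoProvider,
                      char *cryptoDevice, void *hash, void *errCode);

/* ALGD0500: a single 4-byte integer selecting the hash algorithm
   (2 = SHA-1 in the documented values).                           */
typedef struct { int hashAlgorithm; } ALGD0500_t;

/* Minimal error-code structure: bytes provided = 0 asks the API to
   signal an exception rather than return error data.               */
typedef struct { int bytesProvided; int bytesAvailable; } Qus_EC_t;

/* Hash one buffer in a single call - no context token required. */
static void hashBuffer(char *data, int len, char *sha1Out /* 20 bytes */)
{
    ALGD0500_t alg = { 2 };        /* 2 = SHA-1                  */
    Qus_EC_t   err = { 0, 0 };     /* signal exceptions on error */

    Qc3CalculateHash(data, &len, "DATA0100",   /* data passed directly */
                     &alg, "ALGD0500",         /* one-shot, no context */
                     "0",                      /* any service provider */
                     "          ",             /* no specific device   */
                     sha1Out, &err);
}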
The initial results did not look promising because calling the API for every record really crawled along. We then looked at streaming the data into a memory block and passing that in to the API. While this did improve the situation, it still did not run as fast as his implementation, so we decided to look at the Adler32 CRC.
We took the same approach of reading the file a byte at a time into a buffer and passing it to the function supplied by ZLIB once the buffer was full. The results were certainly much faster than the IBM API, but not as fast as Chris was seeing. So we had to look at how we read the file. Using the record functions seemed to be the best way to read the data, but experience had shown us that doing a record-level read and passing each record into the IBM API really sucked: we saw a maximum throughput of 471,000 records, against 1.2 million records in 30 seconds using the blocked memory.
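The adler32 side of it is simply a matter of filling a buffer and handing each full block to the ZLIB function, carrying the running value forward. A minimal sketch against a stream file (the 64KB block size is just an illustration, not the value we settled on):

#include <stdio.h>
#include <zlib.h>

#define BUF_SIZE 65536   /* illustrative block size, not a tuned value */

/* Checksum a file by filling a buffer and passing each full block to
   zlib's adler32(), carrying the running value forward between calls. */
unsigned long adlerOfFile(const char *path)
{
    unsigned char buf[BUF_SIZE];
    size_t        got;
    uLong         adler = adler32(0L, Z_NULL, 0);   /* seed value */
    FILE         *fp    = fopen(path, "rb");

    if (fp == NULL)
        return 0;

    while ((got = fread(buf, 1, sizeof(buf), fp)) > 0)
        adler = adler32(adler, buf, (uInt)got);     /* accumulate */

    fclose(fp);
    return adler;
}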
We played with programs that simply read the records to see if the C programs were the problem, and I have to confess that RPG's file processing is far superior to the C implementation. But if you look at where IBM puts its compiler spending, it's not surprising to see that. Come on IBM, get the C functions to perform as fast as the RPG DB functions!
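For reference, the record-level read on the C side uses the ILE C record I/O functions; the loop we were timing was along these lines (the library, file and record length below are just placeholders):

#include <recio.h>

#define RECLEN 128   /* placeholder record length for the test file */

/* Read a physical file a record at a time with the ILE C record I/O
   functions - the kind of loop we were timing against RPG's native read. */
long countRecords(void)
{
    char    record[RECLEN];
    long    count = 0;
    _RFILE *fp = _Ropen("MYLIB/MYFILE", "rr");   /* placeholder names */

    if (fp == NULL)
        return -1;

    /* _Rreadn returns a feedback structure; num_bytes drops to zero
       when there are no more records to read.                        */
    while (_Rreadn(fp, record, sizeof(record), __DFT)->num_bytes > 0)
        count++;

    _Rclose(fp);
    return count;
}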
We also had to implement a context token for the IBM APIs so that we could spread the work over many calls to the API; our original process simply created a list of hashes and then generated a hash over that list for a total hash value. We think this has improved the strength of the check, as the context token allows multiple calls to the hash generation API with the API holding the intermediate value internally between calls.
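The context-token flow looks something like the sketch below: create a context from an ALGD0500 description, pass the token in an ALGD0100 description on each hash call with a final-operation flag on the last one, then destroy the context. As before, the prototypes and formats are transcribed from the IBM documentation rather than taken from the QSYSINC headers, so verify the field layouts before use.

/* Prototypes transcribed from the IBM documentation; normally these
   come from the QSYSINC headers.                                     */
void Qc3CreateAlgorithmContext(void *algDesc, char *algFmt,
                               char *contextToken, void *errCode);
void Qc3CalculateHash(void *inputData, int *inputLen, char *inputFmt,
                      void *algDesc, char *algFmt, char *cryptoProvider,
                      char *cryptoDevice, void *hash, void *errCode);
void Qc3DestroyAlgorithmContext(char *contextToken, void *errCode);

typedef struct { int hashAlgorithm; } ALGD0500_t;             /* create   */
typedef struct { char token[8]; char finalOp; } ALGD0100_t;   /* per call */
typedef struct { int bytesProvided; int bytesAvailable; } Qus_EC_t;

/* Hash a series of blocks through one context so the API carries the
   intermediate value between calls; the hash is only returned on the
   call flagged as the final operation.                                */
void hashBlocks(char **blocks, int *lens, int nblocks, char *sha1Out)
{
    ALGD0500_t create = { 2 };      /* 2 = SHA-1, for illustration */
    ALGD0100_t per;
    Qus_EC_t   err = { 0, 0 };
    int        i;

    Qc3CreateAlgorithmContext(&create, "ALGD0500", per.token, &err);

    for (i = 0; i < nblocks; i++) {
        per.finalOp = (i == nblocks - 1) ? '1' : '0';
        Qc3CalculateHash(blocks[i], &lens[i], "DATA0100",
                         &per, "ALGD0100", "0", "          ",
                         sha1Out, &err);
    }

    Qc3DestroyAlgorithmContext(per.token, &err);
}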
We also ran a lot of tests to find the best way to use the file read functions and the calls to the APIs. We tried using blocking and setting the type to record for the stream functions, and we also experimented with using the internal buffers from the file read instead of copying the data into our own buffers, but that turned out to be a total failure. We didn't seem to get much more out of the process, but if this is going to be used over very large files, a few seconds on our systems could end up as hours on yours.
In the end we have to take two separate routes: for the IBM APIs we will stick with blocked memory, but for the adler32 function we have the option of reading the data a record at a time or sticking with the blocking. Our preference for simplicity would be to go with the blocking, but the benefits of record-level checking seem to outweigh simplicity!
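One thing that makes the adler32 decision easier is that the checksum can be accumulated in pieces, so a record-at-a-time read and a blocked read produce exactly the same value; the only difference is how fast you get there. A quick illustration:

#include <assert.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *data = "The quick brown fox jumps over the lazy dog";
    size_t      len  = strlen(data);
    size_t      pos;

    /* One call over the whole block. */
    uLong whole = adler32(adler32(0L, Z_NULL, 0), (const Bytef *)data, (uInt)len);

    /* The same data fed a few bytes at a time, as if record by record. */
    uLong piecewise = adler32(0L, Z_NULL, 0);
    for (pos = 0; pos < len; pos += 8) {
        size_t chunk = (len - pos < 8) ? (len - pos) : 8;
        piecewise = adler32(piecewise, (const Bytef *)data + pos, (uInt)chunk);
    }

    assert(whole == piecewise);   /* blocked and record-at-a-time agree */
    return 0;
}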
If you need a simple CRC for data checking, adler32 certainly performs the best, but reading through the notes it does have some weaknesses. The IBM hash process definitely gives a stronger check, but it comes at a price!
We should have a new version available for download later this week.
Chris…