A stupid mistakes which took hours to resolve

Now that the CrcBuilder technology has been added to the next version of RAP the product has gone back into unit testing. Initially everything was going well until we forced some data errors into the audit functions. We started to see deadlocks in the programs where each was waiting for something from the other system.

After a lot of code reviews and hours of debug we finally found the culprits. Firstly, we have the target process running through a switch loop to determine which part the the process it is running and which code segment to run through. Somehow we had missed one of the stupid mistakes we had run into in the past, the atoi() function takes a character string and returns an integer based on the content. We use the value to determine the key which has been passed which is a 4 character value.
Here is a simplified version of what we were doing..


int function a(int sock, char*ptr) {
char key[4];

do {
   memcpy(key,ptr,4);
   switch(key) {
      case : 1 {
           do_something();
           send(something_back);
           break;
           }
      case : 2 {
          do_something_else();
          send(something_else_back);
          break;
          }
     default : {
         break;
         }
      }
   recv(ptr);
   }while(key != something);
}

This code worked fine until we changed the data which was passed from the source from pure character data to mixed. Then we started to somehow slip past the expected function and instead of hitting the send() functions we went straight through to the recv() function.

Here is the corrected code


int function a(int sock, char*ptr) {
char key[5]; // larger buffer
memset(key,'',5);  //NULL terminate
do {
   memcpy(key,ptr,4);
   switch(key) {
      case : 1 {
           do_something();
           send(something_back);
           break;
           }
      case : 2 {
          do_something_else();
          send(something_else_back);
          break;
          }
     default : {
         break;
         }
      }
   recv(ptr);
   }while(key != something);
}

Now it all works as it should. The problem is that atoi() works on null terminated strings, if you don’t terminate the string you could get the wrong return value sometimes and not others! Stupid mistake to make especially as we had seen this before, but somehow we missed it this time! The question is why should changing the data passed affect this so dramatically, we had no null termination between the elements before?

The next problem was easier to fix, well we found it quicker anyhow!
We are sending data between the source and target similar to the above and we take the data passed and map it to a data structure before we call API’s with the content. We found that the target system program was passing garbage into the API which of course resulted in the API failing and we had error status messages returned to the source system.

After some investigation we found the culprit, we have implemented the Adler_32 CRC checking to the data which resulted in changes to the structures we passed between systems.
Here are the original structures (Yes we only needed one but somehow during the development we didn’t do that. The module is called from many places which created the mismatch)

// source system
#typedef struct req_x {
         char key[4];
         char Obj_Name[10];
         char Obj_Lib[10];
         char Obj_Type[10];
         char CRC[16];
         } req_t;

int struct_size = 0;
req_t Request;
struct_size = sizeof(Request);

// target system
#typedef _Packed struct resp_x {
         char key[4];
         char Obj_Name[10];
         char Obj_Lib[10];
         char Obj_Type[10];
         char CRC[16];
         } resp_t;
iint struct_size = 0;
req_t Response;
struct_size = sizeof(Response);

The functions worked because originally the size of the _Packed and non _Packed structures were the same size (you cannot pack the character strings). However because we now had a mixed structures they were not.

// source system
#typedef struct req_x {
         char key[4];
         char Obj_Name[10];
         char Obj_Lib[10];
         char Obj_Type[10];
         uLong CRC;
         } req_t;

int struct_size = 0;
req_t Request;
struct_size = sizeof(Request);

// target system
#typedef _Packed struct resp_x {
         char key[4];
         char Obj_Name[10];
         char Obj_Lib[10];
         char Obj_Type[10];
         uLong CRC;
         } resp_t;

int struct_size = 0;
req_t Response;
struct_size = sizeof(Response);

The sizeof() function returned different values on each system, so when the data was moved into the structures from the recv buffer we offset the data in the structure every time due to the difference in structure size.

Now we have only one structure and it is _Packed…

Both of these problems are very stupid and should not have crept into the code, but with about 500,000 lines of code in RAP now, its sometimes easy to miss the obvious ones.

OK enough of the chatter, I have to get back to testing so we can get this release out to the customers!

Chris…

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.