Is HA really that hard to manage?

I was reading the daily industry rags we get by email and noticed a story about a new offering from a company to manage your HA environment for you. If you got the email you will know which one I am talking about.

Anyhow, this got me thinking about some of the content, so here is my take on what they said.

Firstly, you have to understand that this is a business offering a service to help you manage your HA environment, so some of the content is aimed at making you want to take the service! But a couple of things really stood out for me, especially when you look at what HA is really about.

‘implementing and running HA software is not a trivial exercise’

This is a very true statement. HA is all about switching constantly between systems to remove planned as well as unplanned downtime. Planned outages such as OS or application upgrades are where you gain the biggest ROI. Unplanned outages are seen as a remote possibility; I don’t know of any HA implementation where they purposely yank the plug on the system regularly to make sure the target system will work. Switching is a team effort that could involve many departments and staff, so having an external consulting group involved may bring its own challenges.

‘has seen many an HA environment go south due to neglect’

Again I agree, but that is usually due to a lack of understanding and/or manpower to maintain it! The statement above adds weight to this: it can be a very complex environment to manage, and it’s not made any easier by the HA products and their inbuilt ability to be everything to everyone. Having a specialist maintain the environment should help in this area. The problem I see is that the companies involved don’t allocate any resources to the solution; spending $20k a year in house would probably be just as effective. Isn’t one of the major selling points used by the HA ISVs the fact that this takes less than 10 minutes per day to manage?

‘SLAs won’t dictate a minimum amount of time to conduct the switches.’

Hmm, this is where I start to worry. If this is an HA environment, isn’t the real reason you paid out the $100,000s so that you could limit your downtime to an acceptable level? Why would I now accept that, even though I am paying a professional company to manage it for me, I should live with an open-ended recovery time? The release doesn’t mention whether they offer a maximum time, so I may be getting ahead of myself.
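To make the worry concrete, here is a minimal Python sketch of the kind of check I would want an SLA to make possible. The function name, the 60 minute ceiling and the sample swap durations are all made up for illustration; none of them come from the news release.

```python
from typing import List


def swaps_over_limit(durations_min: List[float], max_allowed_min: float) -> List[float]:
    """Return the role swap durations (in minutes) that exceeded the agreed maximum."""
    return [d for d in durations_min if d > max_allowed_min]


# Hypothetical figures: minutes taken by the last four role swaps.
recent_swaps = [35.0, 48.5, 122.0, 41.0]

# Hypothetical SLA ceiling. An open-ended SLA gives you nothing to put here.
MAX_SWITCH_MINUTES = 60.0

breaches = swaps_over_limit(recent_swaps, MAX_SWITCH_MINUTES)
print(f"{len(breaches)} of {len(recent_swaps)} swaps breached the limit: {breaches}")
```

Without a maximum written into the agreement there is no number to compare against, which is really the whole point of the question.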

‘considering the thousands of borderline HA implementations around the world offering little or no protection’

How is an offering which provides no switching time guarantees better than borderline? I am sure the management of the HA environment will be better, but generally the problem lies outside of the operations department. The HA team won’t have any control over the application team, so how will they ensure that team provides HA-ready code and product? Now we have an external group involved, isn’t that going to add complexity and delays?

‘But the cost will never go to zero. Despite all the “autonomics” and other whiz-bangs vendors build to simplify HA management, good HA practice still comes down to having boots on the ground.’

Hmmm, again not what the HA ISVs sell? Autonomics is really there to cover bad management of the application environment. If you had a solid replication tool, it should only change what’s been changed on the source; the reason things go out of whack is external influences such as badly written processes and poor application design. Improve the control at the environment level and you should see the need for autonomics reduce. That can’t be done by having an HA replication tool specialist watching over your systems and fixing up problems as they occur. Or will these boots on the ground be more than that?
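For anyone who hasn’t looked under the covers, the audit side of autonomics boils down to comparing what is on the source with what is on the target and flagging the differences. The Python below is only a rough sketch of that idea, not how any particular HA product does it; the object names and checksum values are invented for the example.

```python
from typing import Dict, List, Tuple


def audit_objects(source: Dict[str, str], target: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare object checksums and return (missing on target, different on target)."""
    missing = sorted(name for name in source if name not in target)
    different = sorted(
        name for name in source
        if name in target and source[name] != target[name]
    )
    return missing, different


# Hypothetical object name -> checksum maps for each system.
source_objs = {"ORDERS": "a1f3", "CUSTMAST": "77c0", "PRICES": "9be2"}
target_objs = {"ORDERS": "a1f3", "CUSTMAST": "0000"}

missing, different = audit_objects(source_objs, target_objs)
print("Missing on target:", missing)
print("Out of sync on target:", different)
```

A check like this tells you that something drifted, not why it drifted, and the why is usually sitting in the application or the processes around it.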

‘Even though we’re monitoring and maintaining it on a regular basis, there are still problems with COM lines dropping. Things get kind of hairy sometimes. We see it all the time. The difference is we correct those issues as they come up, instead of putting them off.’

I would have thought 99.9% of companies do this? But those are infrastructure problems, not HA replication tool problems, or did I miss something?

Overall I applaud this company for the innovative approach they have taken to the problem of poorly managed HA implementations. There is a need for this only because of the complexity built into the HA products, so why not turn this on its head and ask the question “why am I making this so complex?”

The other point is how should HA be defined? We have spent years differentiating the products by how complex they can be and yet how simple they are to manage. HA was pushed as the solution to PLANNED downtime, with the additional coverage of being able to provide DR where a remote location was available (it’s a bit hard to provide TRUE DR where both boxes sit in the same place!). If this is HA, shouldn’t I be switching to the target system on a regular basis and leaving my users on it for extended periods of time? Having your role swaps attended by an HA product specialist (I assume that this is what the company is going to offer you) should help where changing the replication direction and understanding the replication product’s configuration are concerned, but how will they be able to help with your application, infrastructure and user access issues? In all of the role swaps I have been involved with, this has been where the problems exist. The HA product was ready to go well before we had users online and the infrastructure ready. Post switch we could fix up the replication configurations so we picked up the odd missing object or profile error, but the rest was generally managed by others.

Obviously there are some questions still to be answered, and the news release is limited in the information it provides. I would look carefully at what you are trying to achieve before going full steam ahead with what seems like an easy get-out-of-jail-free card. Paying for an HA solution is a big budget item; the project sponsor should have understood the ongoing management costs and planned for them up front. Adding another $20k may not be as easy as it seems, or will your job be the cost saving?

Anyhow, the reason for the post was to give my views on the offering based on the content of the news release. I am sure there are lots of companies out there which would gain some benefit from having a managed service for their replication environment, so this will suit them. I would just make sure you tie down the switching times if they are important to you.