The Decision to Fail
We have a number of automated technologies that we can us to seamlessly move from primary to secondary systems without human intervention. SQL Server incorporates a number of these, and many companies use them to ensure their applications are highly available. However things don’t always go as planned in a disaster and sometimes humans get involved.
Unless you are one of the companies with a very large budget and high risk of business issues when systems failover, you probably have some sort of high availability (HA) or disaster recovery (DR) process that requires human intervention. Log shipping, for example, usually requires that some human reconfigure the application to use secondary servers. Even with Availability Groups, clustering, or database mirroring, you may need to manually fail back to primary systems.
In those cases, it’s not always a clear decision to do so. Many of the switches are disruptive, or have the potential to be disruptive. Cluster fail-overs should not impact the application, but there is a brief period where clients may not connect. Outside of disasters, Management, and often technical people, usually want to schedule any failovers after they have prepared the end user for potential issues, however brief.
In disaster situations, when there hasn’t been a complete failure of a system, you may not want to have unscheduled failovers right away. This week I want to know:
How do you make the decision to fail over from one system to another?
I’m speaking to you, the data professional or the administrator. I would guess that most of you are not the one that ultimately makes the decision to leave your primary systems. Often I’ve found that someone in management has to make the decision, but with input from the technical people. In that case, think about how you present the situation and pros and cons of the failover. Do you give hard numbers, like latency and relative CPU power in failover machines or do you attempt to quantify the effects on the business when secondary systems are in use.
I’ve rarely had a large budget for secondary systems. Network bandwidth, CPU and memory, and more are sometimes sacrificed in secondary systems in order to align the cost of these systems with the risk of needing them. In many cases, we didn’t have automatic failover for many systems because we had to know our primary systems would be down for more than 5 or 6 hours before we would switch to the backup environment.
If you have similar guidelines or processes in place, let us know.
The Voice of the DBA Podcasts
We publish three versions of the podcast each day for you to enjoy.