In many disaster recovery programs that I have reviewed, I find that Recovery Time Objectives are assigned to applications, often in a tiered format. Tier 1 might be applications that require near-instantaneous failover – these are often less “recovered” environments than geographically dispersed, redundant, live/live architectures. Tier 2 might be applications that must be recovered in 4 hours or less; Tier 3 in 8 hours or less; and so on.
The problem that I often encounter is that each individual application and its hardware environment are tested and RTO-validated in a stand-alone environment. Through these one-by-one tests, the IT teams prove to their business partners that each application can be recovered within the parameters of the Tier it has been assigned. And the business community is satisfied that, should a disaster occur in the data center, their applications will indeed be up and running in the requisite timeframe. The problem is that if the entire data center were compromised by a single event, not all applications within each tier are likely to be recovered in the defined timeframe. Yes, each individual application can be brought up in 4 hours or less, but not all 20 (or however many there may be) at once.
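A little arithmetic makes the point. The sketch below uses entirely hypothetical numbers – 20 Tier 2 applications, a 4-hour individually validated RTO, and 5 recovery teams available during a site-wide event – to show how a shared resource pool stretches the real recovery timeline:

```python
import math

# Hypothetical figures for illustration only: each Tier 2 application has
# been individually validated at a 4-hour recovery, but a full data center
# event forces all of them through a limited pool of recovery teams.
apps_in_tier = 20      # Tier 2 applications affected by the event
hours_per_app = 4      # individually validated RTO per application
parallel_teams = 5     # recovery teams available during the event

# Recoveries run in waves: each wave of `parallel_teams` applications
# takes `hours_per_app` hours before the next wave can start.
waves = math.ceil(apps_in_tier / parallel_teams)
elapsed_hours = waves * hours_per_app

print(f"Last Tier 2 application recovered after {elapsed_hours} hours")
```

With these assumed numbers, the last application comes up at hour 16 – four times the RTO each application proved in its stand-alone test, even though every single test "passed."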
Disaster Recovery Plans seldom rank applications within a Tier to identify which should take priority if RTOs are in jeopardy or problems occur during the recovery process. Meanwhile, the individual application users are under the impression that their application(s) will be up within the established timeframes – after all, that has been proven through several tests.
I challenge Disaster Recovery Program owners to ensure (and prove) that all applications within an RTO category can be recovered within that established time when and if the entire data center is compromised. I have personally witnessed, on more than one occasion, IT teams having to work with the business community during the time of failure to determine which Tier X applications are really the most important to get up and running, because resource constraints prevented them from getting all applications up within the established timeframe. These are not fun times, and the business community feels that it has been misled – which it has.
You may need to communicate a bi-Tier ranking for applications: one RTO for the case where only a single application environment (or suite) must be recovered, and a second RTO for the case where the entire data center must be recovered. I do not mean to overcomplicate an already complicated process, but I think we do a disservice to our clients if we allow a false sense of security to develop because we do one-by-one application recovery tests. Make sure the business community understands the recovery differences between a single application failure and an entire data center failure.
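One way to picture the bi-Tier idea is as a record per application carrying both RTOs plus an explicit in-tier priority for sequencing during a site-wide event. This is a minimal sketch; the application names, tiers, hours, and field names are all hypothetical, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class RecoveryTarget:
    application: str
    tier: int
    rto_single_failure_hours: int  # RTO when only this app must be recovered
    rto_site_failure_hours: int    # RTO when the entire data center is down
    tier_priority: int             # sequencing within the tier (1 = first)

# Hypothetical application inventory for illustration.
targets = [
    RecoveryTarget("billing", 2, 4, 12, 2),
    RecoveryTarget("order-entry", 2, 4, 8, 1),
    RecoveryTarget("reporting", 3, 8, 24, 1),
]

# During a site-wide event, recover in tier order, then by in-tier priority.
recovery_sequence = sorted(targets, key=lambda t: (t.tier, t.tier_priority))
for t in recovery_sequence:
    print(f"{t.application}: {t.rto_site_failure_hours}h target (site failure)")
```

Keeping both numbers side by side makes the gap visible to the business up front – order-entry's 4-hour promise for an isolated failure becomes an 8-hour target when the whole site is down – rather than discovering it mid-crisis.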
And, yes, I understand that there are more complicated issues with reads and feeds and inter-application dependencies, but I will tackle that problem on another day in another blog.