Tag Archive for RTO

Another BCP Acronym

Yes, I realize that the last thing we need in Business Continuity Planning practices is another acronym, but, hey, what’s the fun in writing a blog if you can’t cause trouble?  So here goes – another BCP acronym …

I have been stating for a while now, that the BCP Methodology needs to be revisited.  I think that the tried and true practice of conducting BIAs is a bit flawed.  In practice, I think, the methodology attacks middle management and department level areas in the organization without first establishing corporate-wide and senior level objectives for business during a crisis.  When we ask people to establish RTOs and RPOs (more of those lovely acronyms – see the chart below) what are they basing their answers on?  When we ask for impacts of being down, to set those recovery objectives, what business objectives are they being designed to meet?

I think that the BCP Methodology needs to add a step in the beginning of our analyses in which we establish – are you ready for it, here it comes, the new acronym, in three, two, one – our ABOs, Adjusted Business Objectives.  I think part of the fallacy in our current process is that RTOs (or MADs if you prefer that acronym) are set with the assumption that the company is still aiming to hit its established business objectives for the year.  And, I think that is wrong.  During times of crisis, I think management’s expectations of what the company should achieve are adjusted.  During times of crisis, we may not have the same Income Targets, Profit Targets, Sales Targets, Margin Targets, Production Targets, etc.

Every company establishes business objectives for the year – assuming we operate in a normal business environment.  Once that “normal” environment is compromised due to a disaster, I think those business objectives get adjusted.  And, I think it is important to relay that information to the management team that is responding to our BIA questions.  We should be asking what the critical timeframes are for conducting business functions given we need to meet these Adjusted Business Objectives or ABOs.

Department objectives are, I hope, based on meeting the overall corporate objectives.  Once we know our ABOs we can translate that down to the department level and establish more meaningful RTOs, RPOs, MADs and what have yous.

The real challenge here is, however, getting senior management involved enough in the process to establish these ABOs.  One reason I think we don’t do that today is because it is much easier beginning the process with middle management.  The savvy manager, however, I think, is the one that asks, “During a time of crisis, what are my department’s objectives?  What is senior management expecting us to get done throughout the crisis period?”

So, there it is, a new BCP acronym – ABOs – just what we needed … NOT!

ACRONYMS USED IN THIS ARTICLE – FOR THE UNINITIATED

BCP – Business Continuity Planning

BIA – Business Impact Analysis

RTO – Recovery Time Objective

RPO – Recovery Point Objective

MAD – Maximum Acceptable Downtime

 

Business Objectives vs Business Continuity Objectives – The Missing Step

This blog article talks about a step in the Business Continuity Planning (BCP) Methodology that I think is missing – and, I happen to think it is a pretty important step.

One of the greatest challenges in the BCP methodology is in establishing the program’s recovery objectives.  Whether you label them as Maximum Acceptable Downtime (MAD); Recovery Time and Recovery Point Objectives (RTO & RPO); or some other creative acronym unique to your process, these program benchmarks are usually arrived at through a Business Impact Analysis (BIA) process or, at least, through some survey/interview with business managers and subject matter experts to establish what the critical business processes are; within what timeframes they must be recovered; and what resources must be available in certain timeframes to enable our continuity or recovery of those processes.  Does this sound familiar?  Am I right, so far?

But – you knew there was going to be a but – to achieve what end?  I mean, we do a great job defining business continuity objectives, but do we do so against established business objectives?

I always thought that the savvy business manager, when asked to complete a BIA questionnaire would ask the question, “What is Senior Management expecting me to achieve during the business interruption period?”  Sometimes, I think, we get close.  Many times I hear business continuity planning professionals say that the objective is to “survive” the disaster or “keep the company solvent”.  But do we ever define what that means – in business objective terms?

So, forget about operating in disaster situations for a second.  Just think about business as usual objectives.  Most every company and most every department within each company has established business or performance objectives.  There are defined revenue targets, income objectives, margin targets, production objectives, etc.  There is an expected number of widgets to produce per week; sales targets; number of calls handled per hour; items sold; and so on and so on.

What I would want to know, if I were the business manager being asked what my critical processes are and how long can we go without performing those processes, is:  What adjustments are being made to my performance objectives during this incident you are asking me to plan for?  Am I expected to still achieve my revenue target, sales target, income target, margin targets?  Am I still being measured against growth?  How many widgets per day am I expected to still crank out?  If you can tell me what my management is expecting me to produce during this contingency period, I can then tell you what I need to do, when I need to do it and what I need to get it done.

Seems to me, we miss that step.  We make middle management guess at what our business targets are.  And, furthermore, we never ensure that their guesses are consistent with one another.  Each individual manager who completes the BIA makes their own assumptions about what the overall business objectives are during a business interruption event.  Seems a bit risky to me.

I understand why and how this happens.  It is primarily because middle management is more accessible in our planning process.  It is much easier to include middle management in the planning process, feed them the BIA questions and get them to assign MADs, RTOs and RPOs than it is to include Senior Management in the process.  But – there’s that damn word again – how can we really define viable business continuity objectives if we don’t first know our business objectives during time of an event?

I wonder what would happen if we tried?  I wonder … what if you posed that question to upper management?  What if we added that step in our BCP Methodology:  Define adjusted business objectives that must be achieved during a serious business interruption event.  IN BUSINESS TERMS – not in BCP terms.  Interesting.

Anyway, just a thought.  What do you think?

Establishing RTOs

I think there is a common mistake that we, as business continuity planners, make when working with our business partners to determine RTOs for processes and applications that support them.

I think we do a good job in using the findings from our Business Impact Analyses (BIA) to help identify the Most Critical, Critical and Essential business processes (or whatever labels you happen to use) to ensure that these processes are what we recover first, but, I think when we work with these areas to define Recovery Time Objectives (RTO) we do not properly establish the post-disaster performance objectives.  I think that most of us allow our business partners to establish their RTOs based on the assumption that they will be operating at or close to business as usual.

Sure, we instruct them to try to establish the minimum requirements and consider workarounds and such … but, to achieve what end?  How many of us first ask senior management if there will be any changes to our management objectives following a serious business interruption event?  Will revenue or income targets be adjusted?  How much in additional costs and expenses can we incur?  Will response or service targets be adjusted?  Margin targets adjusted?  ROI?  ROE?  Or, will any other management metrics be adjusted because we are in crisis mode of operations?

Although this goes against my overall philosophy of trying to simplify things, I think it would be beneficial to establish three modes of operation when establishing RTOs with our business partners.

  1. Survival Mode
  2. Sustain Mode
  3. Business as Usual Mode

The goal of Survival Mode operations is simply to keep the company solvent.  Forget trying to be profitable; forget growth targets; forget avoiding all penalties, fines and service interruptions – what, minimally, does the company need to do to not jeopardize the solvency of the firm?

The goal of Sustain Mode operations is to satisfy the commitments we have today with our current customer base.  What do we need to do to keep our current customer base satisfied and meet the regulatory and contractual obligations we already have in place?

And the goal of Business as Usual is … well, just what the words say.

I think if we could get senior management to define the management objectives for each mode of operation and how long the company can operate in each mode, the RTOs we establish will be much more realistic.
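One way to picture the three modes is as a small lookup that tells a planner which mode's objectives apply on a given day of the crisis. The Python sketch below is only an illustration: the `Mode` class, the `mode_required_by` function, and the 5-day and 30-day limits are all made-up placeholders, and the real figures would have to come from senior management.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mode:
    name: str
    objective: str
    max_days: Optional[int]  # None means no limit (Business as Usual)


# Placeholder durations; senior management would set the real ones.
MODES: List[Mode] = [
    Mode("Survival", "keep the company solvent", 5),
    Mode("Sustain", "meet commitments to current customers", 30),
    Mode("Business as Usual", "normal annual objectives", None),
]


def mode_required_by(day: int, modes: List[Mode] = MODES) -> Mode:
    """Return the mode whose objectives apply on the given day of the crisis."""
    elapsed = 0
    for mode in modes:
        # The mode applies until its allowed duration has been used up.
        if mode.max_days is None or day < elapsed + mode.max_days:
            return mode
        elapsed += mode.max_days
    return modes[-1]


print(mode_required_by(2).name)   # Survival
print(mode_required_by(10).name)  # Sustain
print(mode_required_by(40).name)  # Business as Usual
```

With something like this agreed up front, an RTO for a process can be asked for per mode ("when must this be back to support Survival?") rather than once, against business as usual.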

I work in many environments testing their RTO capabilities where, when short time-frames are missed, they report this as a failed exercise but, the business areas ultimately say, we could have lived with the delays.  I think our RTOs, in general, are much tighter than they need be if we think about Survival first, then Sustain and then BAU.

I know, I know, I know … for those of you cursing me out; yes, there are some real crucial business processes that legitimately have very short RTOs (or require immediate failover with no downtime), but I think that pool of requirements is much smaller than many of our programs suggest.

So, yes, I think we do a good job focusing on Most Critical job processes, but I don’t think we establish the right mindset in gathering the requirements to support them after a disaster.

I welcome all comments to the contrary or, heavens forbid, in support of this concept.

Recovery Time Objectives: The Bigger Picture

A few of you didn’t take kindly to a blog I wrote a while back that suggested some of us business continuity planners have fallen victim to our own methodology.  Well, get ready to be offended once again.

This time, I want to take a look at the Business Impact Analysis (BIA) process and how we establish Recovery Time Objectives (RTO) – be they for business functions or software and applications.

In this case, I think we have fallen victim to our questionnaires.  Now, of course, some questionnaires are much more detailed and better than others, but I think they all suffer from the problem that we do not put our questions in perspective of the bigger picture.  Ultimately, these questionnaires come down to the question of, “How long can we go without … doing something, or running something?”  Like I said, some questionnaires do a pretty good job of also gathering the justification for the ultimate answer, but…

I think the savvy business manager is the one who everyone else thinks is a pain in the asking.  The savvy business manager will stop short of answering these questions until he or she knows what the corporate position is on business targets during a crisis.  I would resist answering these questions until I knew what the Executive Team’s expectations were for my department.

In other words, I would want to know: during a crisis…

  • Are our revenue targets adjusted?
  • Are profit targets adjusted?
  • Are margin targets adjusted?
  • Or, whatever business metrics I am measured against – are they adjusted?

I think most BIAs start and end with middle management answering individual BIA questionnaires, when, in fact, they should start with Executive Management establishing a Crisis Management Business Plan establishing the acceptable business targets to be achieved during a crisis.  Armed with that information, middle management has a more realistic shot at providing valid answers to our questionnaire.  Right now, every business manager is making their own assumptions about what Senior Management is expecting and these are likely not consistent across the board.

Furthermore, I think most planners simply accept the BIA answers provided with little push back.  Look, I’ve been a planner for a long time – I know exactly how easy it is to be so excited just to get any answers back that you do not dare challenge the results.  But, how often have you seen situations where business managers say they cannot be down for more than 4 hrs and yet close the entire office for a day or more during a snow storm?  Or, there is a function performed by 3 staff members and at time of crisis they say they need all three to be up and running in 4 hours – you mean none of these people ever take a vacation?  Again, it goes back to the original problem – it all depends what they think they need to achieve during a crisis.

Now before you jump down my throat – I do get that during a crisis you may not be functioning the same as normal.  You may be doing some things manually, requiring more labor.  I am just suggesting that sometimes we need to push back a little and have the managers support their answers and make sure they have thought things through logically.
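As a rough illustration of that kind of push back, the Python sketch below flags BIA answers where the stated maximum downtime is shorter than an outage the department has already lived through (the snow-day case above). The `BiaAnswer` fields, the function name, and the sample departments are all hypothetical, and a flag here is only a prompt for a follow-up conversation, not proof that the answer is wrong.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BiaAnswer:
    department: str
    stated_max_downtime_hours: float
    longest_tolerated_outage_hours: float  # e.g. a past snow-day closure


def answers_to_challenge(answers: List[BiaAnswer]) -> List[str]:
    """Return departments whose stated limit is shorter than an outage
    they have already tolerated, as candidates for a follow-up."""
    return [
        a.department
        for a in answers
        if a.stated_max_downtime_hours < a.longest_tolerated_outage_hours
    ]


# Made-up sample data for illustration only.
sample = [
    BiaAnswer("Accounts Payable", 4, 24),  # office closed a full day for snow
    BiaAnswer("Trading Desk", 4, 1),
]
print(answers_to_challenge(sample))  # ['Accounts Payable']
```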

Now on the opposite side of the spectrum, I was working for Comdisco during the World Trade Center bombing in 1993 and I worked very closely with two financial services firms recovering from that event.  On the Monday following the bombing – the first business day following the event – these companies experienced a call and transaction volume almost 10 X their normal volume!  So they, in fact, had some functions in which they really needed more than 100% of the workforce recovered.  I think, as planners, we may need to also push back on some departments to make sure they have taken into consideration the possible changes in work flow and volumes, given the fact that they had a disaster.  Insurance companies are just one example of organizations in which the disaster itself could be a catalyst for increased work activity.

It just seems to me that sometimes, and I don’t mean everyone does this, but sometimes, the BIA really simply becomes a Business Impact information gathering tool and we forget to do that “A” part – we forget to analyze the answers provided.

So, in summary, I think we can sometimes help the process along if we first get Senior Management to establish adjusted business targets for operations during crisis before asking middle management how long they can be down; and, I think we could do a better job challenging some of the answers we get back to our, sometimes, ambiguous questions. 

Okay, there you go, now let me have it and tell me why I’m wrong.

Disaster Recovery: Application Recovery vs. Data Center Recovery

In many disaster recovery programs that I have reviewed, I find that Recovery Time Objectives are assigned to applications, often in a tiered format.  Tier 1 might be applications that have to have near instantaneous failover – these are often less “recovered” environments and more of a geographically displaced redundant, live/live architecture.  Tier 2 might be applications that must be recovered in 4 hours or less; Tier 3 in 8 hours or less; and, so on.

The problem that I often encounter is that each individual application and its hardware environment are often tested and RTO validated in a stand-alone environment.  Doing these one-by-one tests, the IT teams prove to their business partners that each application can be recovered within the parameters of the Tier it has been assigned.  And, the business community is satisfied that, should a disaster occur in the data center, their applications will indeed be up and running in the requisite time-frame.  The problem is that if the entire data center were to be compromised by a single event, not all applications within each tier are likely to be recovered in the defined timeframe.  Yes, each of the individual applications can be brought up in 4 hours or less, but not all 20 (or however many there may be).
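To see why one-by-one tests can mislead, here is a rough Python sketch. It assumes (my assumption for the example, not something measured) that the bottleneck is the number of recovery teams that can work in parallel, and that each application takes as long to recover as it did in its stand-alone test. The function name and every figure are hypothetical.

```python
import heapq


def full_site_recovery_times(recovery_hours, teams):
    """Simulate recovering every application with a limited number of teams.

    recovery_hours: stand-alone recovery time per application, in the order
    the applications will be worked. Returns the completion time of each
    application, measured from the start of the recovery effort.
    """
    if teams < 1:
        raise ValueError("need at least one recovery team")
    # Each heap entry is the hour at which a team next becomes free.
    free_at = [0.0] * teams
    heapq.heapify(free_at)
    finished = []
    for hours in recovery_hours:
        start = heapq.heappop(free_at)  # earliest available team
        done = start + hours
        finished.append(done)
        heapq.heappush(free_at, done)
    return finished


# Hypothetical Tier 2: 20 applications, each proven at 3 hours stand-alone,
# a 4-hour RTO, and 5 teams available during a full data center loss.
times = full_site_recovery_times([3.0] * 20, teams=5)
late = [t for t in times if t > 4.0]
print(len(late), max(times))  # 15 12.0
```

In this made-up case every application passes its individual test, yet 15 of the 20 miss the 4-hour objective when they all have to be recovered at once.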

Disaster Recovery Plans seldom prioritize within a Tier to identify which applications should take priority if RTOs are in jeopardy or problems in the recovery process occur.  Meanwhile, the individual application users are under the impression that their application(s) will be up within the established timeframes – after all, it has been proven through several tests.

I challenge Disaster Recovery Program owners to ensure (and prove) that all applications within an RTO category can be recovered within that established time when and if the entire data center is compromised.  I have personally witnessed, on more than one occasion, IT Teams having to work with the business community during the time of failure to determine which Tier X applications are really the most important to get up and running given they have resource constraints preventing them from getting all applications up within the established timeframe.  These are not fun times and the business community feels that it has been misled, which it has.

You may need to communicate a bi-Tier ranking for applications to indicate a reasonable RTO should only the one application environment (or suite) need to be recovered versus a second RTO should the entire data center need to be recovered.  I do not mean to overcomplicate an already complicated process, but I think we do a disservice to our clients if we allow a false sense of security to occur because we do one-by-one application recovery tests.  Make sure the business community understands the recovery differences between a single application failure and an entire data center failure.

And, yes, I understand that there are more complicated issues with reads and feeds and inter-application dependencies, but I will tackle that problem on another day in another blog.

The Recovery Time Objective Debate Continues

The Recovery Time Objective debate continues over on a LinkedIn discussion board.  Really folks, I don’t know what is so hard to comprehend here!  I think some people are just trying to be difficult as a means to show they are smarter than everyone else.  Me, personally, I prefer the KISS method – Keep It Simply Simple (I know it is usually said another way, but I wanted to avoid labeling people).

Simply put, the RTO measures the time objective for moving from Point A to Point B, where Point A equals the moment when a business process (or technology resource, if used for IT Disaster Recovery purposes) stops functioning and Point B equals the point when the business process (or, you know) must start functioning again to avoid jeopardizing the solvency of the organization.

It is an OBJECTIVE – that word is part of the acronym – why is it so hard to comprehend?

Yes, yes, yes, the event that interrupts the process or service will definitely influence when the recovery process starts, or what recovery tactic you decide to take – but the OBJECTIVE remains the same.  Fine, fine, fine, so you have an emergency response team that is responsible for assessing the damages and determining whether or not to declare a disaster, but the OBJECTIVE remains the same and the clock is ticking.

Hopefully, your proven recovery capability is less than your recovery objective.  In that case, the Recovery Time Objective minus the Proven Time to Recover equals the time your Emergency Response Team has to gather, evaluate the situation, and declare the disaster in order to ensure your RTO is met.

RTO – PTtR = Maximum Time to Declare
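As a worked illustration of that formula, here is a minimal Python sketch. The function name `max_time_to_declare` and the sample figures (an 8-hour RTO and a 5-hour proven recovery time) are assumptions made up for the example.

```python
from datetime import timedelta


def max_time_to_declare(rto: timedelta, proven_time_to_recover: timedelta) -> timedelta:
    """Return RTO - PTtR: the window the response team has to declare."""
    window = rto - proven_time_to_recover
    if window < timedelta(0):
        # Proven capability is slower than the objective: the RTO cannot be
        # met no matter how quickly the disaster is declared.
        raise ValueError("Proven time to recover exceeds the RTO")
    return window


# Hypothetical figures: 8-hour RTO, 5 hours proven in exercises.
window = max_time_to_declare(timedelta(hours=8), timedelta(hours=5))
print(window)  # 3:00:00
```

The error branch covers the case the "hopefully" above warns about: if the proven recovery time is longer than the objective, there is no declaration window at all.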

Your Emergency Response Team needs to be aware of all of these factors while performing their response tasks.

You do not decide the RTO or the PTtR at time of disaster – it is too late.

The RTOs are established in the BIA process.  The PTtRs are established through a series of tests and exercises.

I do not disagree with most of what people are arguing in the discussion thread – I just disagree with the words they are using in the argument.  You are overcomplicating the point and mixing apples with oranges.  Sometimes I think it would be better to just throw out the common terms in use today and come up with new terms at each company that do not have a preconceived notion of what they mean.  Then define the new terms the way you want to use them so everyone in that organization has a common understanding.  That may be throwing out the baby with the bath water, but it might stop me from pulling out what little hair I have remaining while reading this agonizing discussion thread.

Discussion on Recovery Time Objectives

There is a discussion thread in a LinkedIn group on when the clock starts on the Recovery Time Objective. I am not going to use this blog post to repeat and provide support for the answer that I gave on the thread – although I think it is a good response – but, rather, I am going to use this as an opportunity to once again point out what I think is one of the biggest challenges we face as business continuity and disaster recovery professionals – the inconsistencies in jargon.

Recovery Time Objective or RTO is a common and often used term in our field. And, in this discussion thread, intelligent and experienced practitioners are all arguing and posing very different opinions on the use and interpretation of this important measurement.

How the heck are we to expect our customers (even the internal planner has customers) to understand what we mean by terms we throw around when we can’t even agree amongst ourselves what they mean?

Now, I am not here to suggest we reach a common ground on these definitions; I am not sure that is possible in my lifetime. I just think it is important that we never assume our audience has the same understanding we do. Sometimes, I think it is better to use new labels for which our customer does not have a preconceived definition, and define for them what those labels mean.

I have seen many planners miss the mark in meeting customer expectations because they interpreted similar terms differently.

So, make sure you and your customer are all on the same page when it comes to these very common and oft used terms before you get too deep in your planning process.

The Business Impact Analysis (BIA)

I know this is a well known concept and BIAs are part of almost every Business Continuity Program, but I happen to think that many, if not most, people get this wrong – just a little.

Many practitioners, in my way of thinking, overcomplicate or overextend what the BIA is supposed to include and result in.  In many instances, the BIA is performed to establish the equally well known recovery objectives, the Recovery Time Objective (RTO) and Recovery Point Objective (RPO).

One of the issues that I have is what the RTO measures – is it the recovery time objective for a business process or the recovery time objective for an application or IT infrastructure?  Often, people use it for both and I think that can confuse things.  The RTO grew up as a disaster recovery, implying technology recovery, measurement and I think that is where it should stay.  The BIA, by definition, does not measure technology recovery objectives; it measures exactly what the words say – business objectives.

So let’s back up a bit.  What does (should) the BIA do for us?

The BIA measures and records the “impact” on an organization should a “business” process cease to operate.

The BIA answers questions such as:

What is the impact to our company if we cannot settle trades?  Or, what is the impact if we cannot provide customer service?  Or, if we cannot sell tickets?  Or, if we cannot pick raw materials from our inventory warehouse? …

The BIA measures the impact on WHAT you do, not HOW you do it.  The HOW questions come later in the methodology.  Most companies are not in the business of running computers.  They are in the business of providing financial services, or selling insurance, or flying airplanes, or making consumer goods, …  Now, most also rely on technology and business applications to support what they do but these tools are recovery requirements and are looked at downstream in our analysis.

Once we know the impacts of not performing discrete business processes, we can determine how long the company can survive before these impacts become so severe that they jeopardize the solvency of the organization or pass some other pre-established pain threshold.  To avoid confusion with RTOs, I like to call this the Maximum Acceptable Downtime – now you may say I am MAD, but that’s the result of my BIAs.

And that ends the Business Impact Analysis.

Now, focusing on those business processes with the most demanding MADs we can start looking at how we perform those processes; start analyzing the required technology to support those processes; and, start assigning RTOs and RPOs.  This, we might call our Technology Impact Analysis, although I don’t see too many people using that term.

Many times, MADs and RTOs for applications that support that business process are equal, but, then again, many times they are not.

For example:

In conducting a BIA, a trading company may discover that they must be able to execute a commodity trade within 4 hours of a business interruption, i.e. the MAD = 4 hrs.

In defining how they trade commodities, they identify the Commodity Trading Platform (CTP) as an application that supports this activity.  However, in evaluating contingencies, they decide that they could actually execute trades manually, by filling out a manual trade blotter, like the old days, and enter them into the system within 24 hours of the trade.  So, as long as they have a telephone and a pad of paper they can, with great inconvenience, execute trades.  So, the RTO for the phones might be 4 Hrs, but the RTO for the supporting application, the Commodity Trading Platform is 24 Hrs.
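The trading example can be written down as a small rule. The Python sketch below is one possible reading, not a standard formula: a supporting resource with no workaround inherits the process MAD as its RTO, while a resource with a workaround can stay down for as long as that workaround can be sustained. The function name `resource_rto` and its parameters are hypothetical.

```python
from typing import Optional


def resource_rto(process_mad_hours: float,
                 workaround_hours: Optional[float] = None) -> float:
    """Derive an RTO for one supporting resource.

    With no workaround, the resource must be back within the process MAD.
    With a workaround, the resource can stay down for as long as the
    workaround can be sustained (never less than the MAD itself).
    """
    if workaround_hours is None:
        return process_mad_hours
    return max(process_mad_hours, workaround_hours)


mad = 4.0  # commodity trading must resume within 4 hours
rtos = {
    "phones": resource_rto(mad),  # no workaround for the phones
    "Commodity Trading Platform": resource_rto(mad, workaround_hours=24.0),
}
print(rtos)  # {'phones': 4.0, 'Commodity Trading Platform': 24.0}
```

The point of keeping the MAD as an input is that it stays a property of the business process, while each RTO is worked out afterwards, resource by resource.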

Now, if you want to argue that you get that, but in order to not keep going back to your business partners over and over again in the planning process you collect BIA, Recovery Requirements and Technology Objectives information all in the same interview, I can accept that.  But, I think it is important to differentiate the results of the BIA, business process MADs, from the results of the Recovery Requirements Analysis and subsequent disaster recovery requirements.

Even though most planning professionals preach that there is a difference between business continuity and disaster recovery, I think that the distinction often gets blurred in the execution of our methodology.

Just one man’s opinion, for what it is worth.