Tag Archive for disaster recovery

The Blackberry Outage

The current Blackberry outage going on throughout Europe, and now the US, provides an opportunity to discuss two important Business Continuity Planning issues: 

  1. Don’t rely on a single communications device
  2. Ensure you have processes for addressing the backlog

I remember immediately after the events on 9/11 people were touting how well their Blackberries continued to function during the crisis while all other communications tools were failing.  Shortly after, it seems, everyone was running out and buying a Blackberry.  I was not suggesting people not invest in Blackberries, but I was warning people that just because this particular tool was working in this crisis does not mean it will be the one tool working in the next crisis.  One reason the Blackberry worked so well in 2001 was that so few people were using the device, so the infrastructure that supported it was not overburdened during the crisis.  Blackberries also relied on a different technology and a different infrastructure that was not damaged during 9/11 – I was warning anyone who cared to listen (probably no one) that this might not be true during the next crisis.  My point was not that Blackberries won’t always work, but that you should not rely on a single tool or technology for all of your communications channels.  Lo and behold, we now find out that Blackberries are susceptible to network-wide outages just like other communication tools.

In the referenced article, Research in Motion is saying that they have fixed the underlying problem causing the outage but that the backlog of emails and text messages is delaying getting the service fully functional once again.  This is a reminder to make sure that business areas consider the impact of the backlog during times of outage and have procedures in place to address the backlog once their systems are back online.

I have even seen instances when the inability to handle the backlog that would develop was the primary justification for establishing an RTO for some applications.

Procedures for handling the backlog (and re-entering lost transactions where the RPO is not the point of failure) need to be included in each department’s business continuity plan.  For some financial applications, this may include having to post-date transactions to ensure they carry the right effective date.  For some applications that automatically generate the transaction date and time, this may require some additional programming or rebooting servers with different time stamps to ensure the proper entry date.
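To make the effective-date point concrete, here is a minimal Python sketch.  It assumes a hypothetical system that stores two dates on every transaction: the effective date (when the business event really happened) and the entry timestamp (when it was keyed in after recovery).  The names Transaction and reenter_backlog, and the sample dates, are invented for illustration and do not describe any particular application.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Transaction:
    reference: str
    amount: float
    effective_date: date   # when the business event actually took place
    entered_at: datetime   # when it was keyed into the recovered system


def reenter_backlog(
    backlog: List[Tuple[str, float, date]], now: datetime
) -> List[Transaction]:
    """Re-enter queued items oldest first, keeping each original effective date."""
    ordered = sorted(backlog, key=lambda item: item[2])
    return [
        Transaction(reference=ref, amount=amt, effective_date=eff, entered_at=now)
        for ref, amt, eff in ordered
    ]


if __name__ == "__main__":
    # Hypothetical outage: items queued on paper, system restored later.
    recovered_at = datetime(2011, 10, 14, 9, 0)
    queued = [
        ("T-2", 250.0, date(2011, 10, 12)),
        ("T-1", 100.0, date(2011, 10, 11)),
    ]
    for txn in reenter_backlog(queued, recovered_at):
        print(txn.reference, txn.effective_date, txn.entered_at)
```

If an application only has a single, system-generated date field, a sketch like this does not apply as written; that is the case where extra programming may be needed.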

For all applications and business processes that are not immediately failed over, there is the potential for a backlog to develop.  How you handle that backlog must be considered in the recovery and continuity plans.
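A back-of-the-envelope way to size a backlog is sketched below in Python.  It assumes work arrives and is processed at steady hourly rates, which is a simplification, and the function name and the sample figures are made up for illustration.  The idea is simply that the backlog equals the arrival rate times the outage length, and it can only be worked off with whatever capacity is left over after handling new work.

```python
def backlog_clearance_hours(
    arrival_per_hour: float,
    capacity_per_hour: float,
    outage_hours: float,
) -> float:
    """Hours needed after restoration to work off the queue built up during an outage.

    Assumes steady arrival and processing rates, which real workloads rarely have.
    """
    if capacity_per_hour <= arrival_per_hour:
        raise ValueError(
            "capacity must exceed the arrival rate or the backlog never clears"
        )
    backlog = arrival_per_hour * outage_hours           # items queued during the outage
    spare_capacity = capacity_per_hour - arrival_per_hour  # what is left after new work
    return backlog / spare_capacity


if __name__ == "__main__":
    # Hypothetical department: 100 items/hour arrive, 150 items/hour can be processed.
    print(backlog_clearance_hours(100, 150, 8))  # prints 16.0
```

In this made-up case an 8-hour outage takes 16 hours to recover from after the system is back, which is the kind of number that can justify a tighter RTO.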

Business Continuity Planning: Vendor Risks

One of the risks that a lot of companies may benefit from looking at a little closer is that of “vendor risks”.  Vendors can be suppliers or outsourced entities that perform a critical service on behalf of our organization.  We need to ensure that our critical vendors know how to respond in the event of our disaster and we need to know that the vendor can continue to provide materials or support in the event of their disaster.

I know many organizations include “Service Level Agreement” (SLA) clauses in vendor contracts, but I suggest that we may want to go further than that and, every now and then, ask to be shown evidence that they could meet those levels of performance at time of disaster.  How many of you audit or review your vendor’s Business Continuity / Disaster Recovery Plans?  How many participate in their vendor’s Business Continuity / Disaster Recovery tests or exercises?

Many organizations try to mitigate or eliminate vendor risk by engaging multiple vendors to provide a similar product or service.  Just be aware that, sometimes, even though you diversify your vendors, you may not have diversified the infrastructure they depend on.  Lessons learned from the events of 9/11 showed proof of the issues that can arise here.  Many companies felt confident that they were using multiple communication vendors only to discover that they all relied on the same underground infrastructure and same “points of presence” (POPs).  One central office failure; one cable conduit compromised and all vendors were out of service – the diversity did not provide the stability they thought they were getting using multiple vendors.  Even though you may think you have eliminated a potential “Single Point of Failure” (SPoF) by using multiple vendors, make sure you do not still have SPoF in the physical infrastructure they rely on.
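One way to look for this kind of hidden overlap is to list the physical infrastructure each vendor says it depends on and compare the lists.  The Python sketch below does that; the vendor names, the infrastructure labels and the function names are all hypothetical, and it is only as good as the information the vendors are willing to share.

```python
from collections import defaultdict
from typing import Dict, Set


def shared_dependencies(vendor_infra: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return each infrastructure element that more than one vendor relies on."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for vendor, elements in vendor_infra.items():
        for element in elements:
            users[element].add(vendor)
    return {elem: vendors for elem, vendors in users.items() if len(vendors) > 1}


def common_to_all(vendor_infra: Dict[str, Set[str]]) -> Set[str]:
    """Elements every vendor depends on: losing one takes all vendors out."""
    if not vendor_infra:
        return set()
    return set.intersection(*(set(v) for v in vendor_infra.values()))


if __name__ == "__main__":
    # Hypothetical data gathered from two communications vendors.
    infra = {
        "Carrier A": {"central-office-1", "conduit-north"},
        "Carrier B": {"central-office-1", "conduit-south"},
    }
    print(common_to_all(infra))
```

Anything returned by common_to_all is a candidate Single Point of Failure that vendor diversity alone has not removed.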

Another example of this I recently encountered was working with an airport authority in dealing with a potential flooding risk caused by a suspect dam near the airport.  Many of the airlines at the airport had fuel provided by two or more fuel suppliers, but the delivery of the fuel was all through the same single source pipeline that was in the flood zone.  Even though the pipeline itself was underground and, potentially, not susceptible to damage by the flood, the fuel line switching station was above ground and in the flood zone.  The pipeline needed this switching station operable to move the fuel.

I also was once hired to look into the reliability of an offshore outsourced call center.  This facility, located in India, was a state-of-the-art facility in a pretty resilient compound.  The outsourced company felt so secure in their “hardening” of the facility that they did not see the need to invest in contingency operations.  The problem was, however, that the infrastructure that fed power, phone service and other utilities into the compound was very suspect.  Additionally, the employees did not live in the compound and a disaster in the area could easily prevent them from getting to the complex.  My client decided that they needed a contingency should their primary vendor suffer a business interruption event and took the necessary steps to cover this risk.

And, remember to make sure your vendors know what changes in their delivery or performance must be made at time of your disaster.  One simple example – do your mail carriers (US Post Office, UPS, FedEx, others) know where to reroute your mail or where to do pickups from when a particular facility is compromised?

Also make sure if you have vendor personnel on site that you educate them in the evacuation, notification and escalation process.  Are you responsible for accounting for vendor personnel during a disaster, or do you call the vendor and have them account for or alert their employees at time of crisis?  Do not forget those vendors that may not perform a critical service but are on site – such as, cafeteria staff; custodial staff; plant suppliers; landscapers; etc.  Make sure they are notified of an office closure and are included in the process for accounting for who may have been injured or killed in the disaster.

Sometimes it is easy to overlook our vendors in the planning process.  Make sure your program and department managers have adequately accounted for them.

Business Continuity Blogs: No Offense Intended

I am having great fun in challenging myself to come up with nearly daily blogs about business continuity, disaster recovery, crisis management and related topics.  This exercise of typing my thoughts and experiences in a form that almost makes sense, I think, helps me identify opportunities to improve my approach to the planning process even if it does not inspire any of my readers.

I have posted many of my blogs and similar thoughts on Linked-In group pages and have, it seems, unintentionally insulted a few practitioners by suggesting there is room for improvement in the standard methodology and tools that many of us use.

I assure you that I, in no way, mean to suggest planners cannot be successful in engaging the methodology as it is today.  There are, without a doubt, many quality business continuity programs and plans in place throughout the world today.  There are many successful, intelligent, professional planners that do a terrific job in guiding their organization and fellow planners through the process without the need to deviate from standard practices and tools.

I simply try to find opportunities where we may improve a process – even one that appears good enough today.  I think, at times, it is engaging and rewarding to try to think outside the box every now and then, even if in doing so, it just makes us realize that inside the box is the best place to be.  I do not mean to offend anyone by doing so.  I do not mean to suggest everyone has it wrong and I am the only one who has this thing figured out.  No, in fact most of my thoughts are really an admission that I do not have this thing figured out and I invite others to discuss the topics to help me understand.

I do not mean to compare myself to historic figures, but history is full of people who did not accept the standard method of thinking – challenged common knowledge and understanding, much to the dismay of other professionals in their field, and helped advance the practices and level of understanding.  Christopher Columbus and Galileo are two that immediately come to mind.  I am sure there are many others who went against common understanding, down the wrong path and were proven to be total fools – and, I may more likely be one of those – but, like I said, at least I am having fun.  Insulting or aggravating others is not my intent.

Business Continuity Planning: Recovery Requirements

Remember the movie, “The Jerk” with Steve Martin?  Great movie.  Anyway, there is a scene in this movie where Navin Johnson (Steve Martin’s character) loses all his money and walks out of the house saying something like, “I don’t need anything.  I need this ashtray.  But, that’s all I need.  I don’t need anything but this ashtray … oh, and this lamp.  Just this ashtray and this lamp.  I need this. …”.  You get the picture, right?

There have been times when working with business managers in identifying recovery requirements that I have felt like I was in the middle of that scene.  This somehow seems to be especially true when working with trading floor operations.  Often when I first sit down with trading floor managers and ask them, “What do you minimally need to conduct business”, they answer, quite simply, “A phone.  You give me a phone and I can conduct my business from anywhere.”

Oh really?

You mean you have the phone numbers for all your clients, brokers and the trading floors?

Well no.  I need those numbers programmed into the phone.

So it’s not just any phone you need?

No – but, if I had a phone with those numbers programmed in, I can do my job.

So you don’t need to know the state of your client positions?

Well sure, I need that.  But if I had a phone, programmed with numbers and a copy of the start of day report, I can do my job.

And you don’t need access to any market data feeds?

Well, not all of them.  If I had a phone, programmed with numbers and a copy of the start of day report and access to Bloomberg, I could do my job.

And you will remember the trades you transacted to call into the back office.

Well no.  I need my blotter system or trade tickets.  If I had a phone …

Aaaaaahhhhhhhhh.

It got so bad that I would walk into these meetings with a roll of quarters in my pockets.  When they said all they needed was a phone, I would take out the roll of quarters and say, “Okay, take these.  There is a pay phone out on the corner of the street in front of this building – go do your business for the rest of the day and let’s see how that works.”  I know that analogy kind of dates me.  Some of you reading this probably don’t even know what a pay phone is – but it often got my point across.

Once I got their egos to agree that it was more than just their expertise that made them a successful trader, and that they did depend on technology and tools more than they liked to admit, the next challenge was in identifying what was needed under what circumstances.

The difficulty with trading floors and many other business functions these days is the interchangeability of certain tools.  You get into discussions where as long as I have Application A, I can do without B or C; but, if I don’t have A, I need both B and C.  This can get very complicated.  Forget about the complication of also recovering the behind the scene applications and tools that the business managers don’t know about – that complexity is up to the technology teams to figure out.
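The “A, or else both B and C” kind of requirement can be written down as a list of acceptable toolsets, any one of which is enough to operate.  The Python sketch below is one way to capture that; the tool names and function names are placeholders, not anything from a real trading floor.

```python
from typing import FrozenSet, Iterable, List, Set

ToolSet = FrozenSet[str]


def can_operate(available: Iterable[str], acceptable_toolsets: List[ToolSet]) -> bool:
    """True if at least one acceptable toolset is fully available."""
    have = set(available)
    return any(toolset <= have for toolset in acceptable_toolsets)


def missing_for_each_option(
    available: Iterable[str], acceptable_toolsets: List[ToolSet]
) -> List[Set[str]]:
    """For each acceptable toolset, list what would still have to be recovered."""
    have = set(available)
    return [set(toolset - have) for toolset in acceptable_toolsets]


if __name__ == "__main__":
    # Hypothetical desk: Application A alone will do, otherwise both B and C.
    desk_options = [
        frozenset({"phone", "Application A"}),
        frozenset({"phone", "Application B", "Application C"}),
    ]
    print(can_operate({"phone", "Application B", "Application C"}, desk_options))
    print(can_operate({"phone", "Application B"}, desk_options))
    print(missing_for_each_option({"phone", "Application B"}, desk_options))
```

Writing the options down this way also makes it obvious when two desks are counting on the same shared tool.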

Getting the right applications recovered in the right timeframes takes coordination between many different departments that share applications and databases.  I know some organizations try to identify “application owners” in the business community, but, this too, can get complicated.  Easiest example is, who owns email?  In a trading floor environment, there are many market data feeds that are shared between different trading desks.  Defining the ultimate owner can be challenging and cause troubles.

Once you identify the toolset (or toolset options) required to support critical operations you can start researching options for getting them operational in the requisite timeframe.

Be prepared, however, for some business managers to answer your question on what they need with another question, “Well, what can I have?” or, “How much will it cost me to have …?”

I don’t think it is too out of line for some managers to take the position that it is not a matter of what I minimally need in place, it is a matter of what I (or the company) am willing to pay to have in place.  If it is a reasonable expense to provide full functionality in a recovery site, why go through the exercise of asking me what I can get by with?  Give me solution options, and I will tell you what I can afford to invest in.  This often results in a push me-pull you working atmosphere that not all continuity planners are ready for, but one which I think we should be prepared to handle.

Just a few more things for us to think about.  I sure am glad this job is not an easy one, otherwise anybody could do it.

Recovery Time Objectives: The Bigger Picture

A few of you didn’t take kindly to a blog I wrote a while back that suggested some of us business continuity planners have fallen victim to our own methodology.  Well, get ready to be offended once again.

This time, I want to take a look at the Business Impact Analysis (BIA) process and how we establish Recovery Time Objectives (RTO) – be they for business functions or software and applications.

In this case, I think we have fallen victim to our questionnaires.  Now, of course, some questionnaires are much more detailed and better than others, but I think they all suffer from the problem that we do not put our questions in the perspective of the bigger picture.  Ultimately, these questionnaires come down to the question of, “How long can we go without … doing something, or running something?”  Like I said, some questionnaires do a pretty good job of also gathering the justification for the ultimate answer, but…

I think the savvy business manager is the one who everyone else thinks is a pain in the asking.  The savvy business manager will stop short of answering these questions until he or she knows what the corporate position is on business targets during a crisis.  I would resist answering these questions until I knew what the Executive Team’s expectations were for my department.

In other words, I would want to know: during a crisis…

  • Are our revenue targets adjusted?
  • Are profit targets adjusted?
  • Are margin targets adjusted?
  • Or, whatever business metrics I am measured against – are they adjusted?

I think most BIAs start and end with middle management answering individual BIA questionnaires, when, in fact, they should start with Executive Management establishing a Crisis Management Business Plan establishing the acceptable business targets to be achieved during a crisis.  Armed with that information, middle management has a more realistic shot at providing valid answers to our questionnaire.  Right now, every business manager is making their own assumptions about what Senior Management is expecting and these are likely not consistent across the board.

Furthermore, I think most planners simply accept the BIA answers provided with little push back.  Look, I’ve been a planner for a long time – I know exactly how easy it is to be so excited just to get any answers back that you do not dare challenge the results.  But, how often have you seen situations where business managers say they cannot be down for more than 4 hours and yet close the entire office for a day or more during a snowstorm?  Or, there is a function performed by 3 staff members and at time of crisis they say they need all three to be up and running in 4 hours – you mean none of these people ever take a vacation?  Again, it goes back to the original problem – it all depends on what they think they need to achieve during a crisis.

Now before you jump down my throat – I do get that during a crisis you may not be functioning the same as normal.  You may be doing some things manually, requiring more labor.  I am just suggesting that sometimes we need to push back a little and have the managers support their answers and make sure they have thought things through logically.

Now on the opposite side of the spectrum, I was working for Comdisco during the World Trade Center bombing in 1993 and I worked very closely with two financial services firms recovering from that event.  On the Monday following the bombing – the first business day following the event – these companies experienced a call and transaction volume almost 10 times their normal volume!  So they, in fact, had some functions in which they really needed more than 100% of the workforce recovered.  I think, as planners, we may need to also push back on some departments to make sure they have taken into consideration the possible changes in work flow and volumes, given the fact that they had a disaster.  Insurance companies are just one example of organizations in which the disaster itself could be a catalyst for increased work activity.
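The arithmetic behind “more than 100% of the workforce” is easy to sketch.  The Python function below is a rough illustration only: the volume multiplier and the manual-efficiency factor (how productive staff are when working with manual workarounds) are assumptions a department would have to estimate for itself, and the function name is made up.

```python
import math


def staff_needed(
    normal_staff: int,
    volume_multiplier: float,
    manual_efficiency: float = 1.0,
) -> int:
    """Rough headcount needed in a crisis.

    volume_multiplier: crisis workload relative to normal (1.0 means unchanged).
    manual_efficiency: share of normal productivity retained when working
        with workarounds (1.0 means no loss, 0.5 means half as productive).
    """
    if not 0 < manual_efficiency <= 1:
        raise ValueError("manual_efficiency must be greater than 0 and at most 1")
    return math.ceil(normal_staff * volume_multiplier / manual_efficiency)


if __name__ == "__main__":
    # Hypothetical 3-person function facing ten times its normal volume.
    print(staff_needed(3, 10))  # prints 30
```

The same function works in the other direction: a multiplier below 1.0 is a way to challenge a department that insists on recovering everyone within hours.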

It just seems to me that sometimes, and I don’t mean everyone does this, but sometimes, the BIA really simply becomes a Business Impact information gathering tool and we forget to do that “A” part – we forget to analyze the answers provided.

So, in summary, I think we can sometimes help the process along if we first get Senior Management to establish adjusted business targets for operations during crisis before asking middle management how long they can be down; and, I think we could do a better job challenging some of the answers we get back to our, sometimes, ambiguous questions. 

Okay, there you go, now let me have it and tell me why I’m wrong.

Office of Disaster Assistance

Here is one more number to include in your Business Continuity, Emergency Response and Disaster Recovery directory: (202) 205-6734.  This is the number for the Office of Disaster Assistance under the US Small Business Administration (SBA).  And, despite being administered by the SBA, it is available for “businesses of all sizes” according to the Mission Statement included on its website.

If you are not aware of this organization and the things they can do, I urge you to go visit their website.  Check out the page of Current Disaster Declarations to review the events that they provide assistance for.  You might also find some ideas where they could add value to your program by going to their Emergency Preparedness and Disaster Assistance page.

I know many organizations have well developed plans to mitigate losses from disasters; many of you have insurance policies which include loss of business clauses; and, many larger companies have self-insurance reserves to cover losses that may stem from a disaster – but, there still might be circumstances where the Office of Disaster Assistance could be of value.  Anyway, what does it hurt to check them out?

After all, it’s just one more number to add to your directory of resources that could provide assistance; you don’t have to commit to calling it if you don’t need them.

A Call for Ideas and Topics

I am doing the best I can to keep this blog fresh, relevant, timely and topical.  I am glad to say that the blog tools I use indicate that we are getting a relatively good amount of traffic to this page every day.  I am also glad to see that a few of you have added this page to your Facebook “likes” – I am a statistics nut and I love watching numbers grow over time.  Hopefully, one day, we will see 3 digits or more in our fan counts.

If there are any topics in particular you would like to see us tackle on this page, please let me know by adding a comment to this post.  Anything is fair game; inside or outside the box of business continuity, disaster recovery, crisis management, emergency response or whatever tag we wish to apply to our field.  I can even be enticed to discuss other topics as well, but advice to the lovelorn is probably something I should stay away from.

I post a link to many of my blog pages in Linked-In discussion groups.  I think this helps generate traffic to this page, but it also results in most of the discussions and counterpoints being posted there and not on the blog page itself.  I would like to try to get more of you to post your comments here for those who may not be Linked-In members, but, perhaps we can achieve that over time.

If you want to learn more about Safe Harbor Consulting, the company I work for, please visit our website at www.safeharborconsulting.biz.  Please take note of the “.biz” extension to make sure you end up in the right place.  The way I like to tell people to remember that is to repeat the phrase, “Safe Harbor Consulting put the ‘biz’ in business continuity.”  And, yes, I can be corny, if you haven’t already figured that out.

So let’s hear from you …

  • What are some of today’s issues you think we ought to post about?
  • What are the typical, everyday problems and issues you find yourselves dealing with?
  • What are the new and creative planning techniques you think we should be researching?
  • What are recent business/technology interruption events you think we should all study and learn from?

Or just drop a post letting me know who it is that has stopped by to take a look at things.

And, by all means, feel free to add your two-cents worth on any post we have already shared – even if it is to point out how off-base I am with my editorial.  Others have done so already; please join in the fun.

Tests, Exercises and Drills

I know this just adds to the “jargon problem” I so often talk about in my blog posts, but today I am going to use our words to differentiate program testing techniques.

It has become in vogue to say: “We do not ‘test’ our plans; we ‘exercise’ our plans.  Testing implies pass/fail while exercising implies getting stronger.  We do exercises to strengthen our programs.”  (I used to say this, too.  That’s how come I’ve got it down so well.)

Well, that’s great and good – I think exercises are crucial and you do indeed want to strengthen your program – but, I don’t think you should only exercise.  I think there are times when, indeed, you do want to test your programs and give them a pass/fail grade as a means of validating the ability of the plans and solutions to meet your recovery/continuity objectives.

In fact, I think the first thing you want to do is to “test” your program.  Make sure that it works.  Put it to the test.  Once you have proven the solutions and strategies in place do work and can pass the test, then you start to exercise it to strengthen and improve the process.

Furthermore, I think there is a third technique to employ.  Once you have strengthened your program through a series of exercises you may want to start to construct your sessions as drills.  In a drill you simply repeat a proven and strong process over and over again to condition the participants to react in a certain way when the plans are engaged.  In the military and in martial arts, you drill over and over again to change your reflexive actions so when a particular action is required, you behave in a certain way instinctively without having to think about it, or without having to rely on an instruction manual.

So, in summary, I think in a comprehensive program you want to include:

  • Tests – to validate that the solutions and strategies work
  • Exercises – to improve the effectiveness and efficiencies in executing the solutions and strategies
  • Drills – to condition role players to respond and react in a certain way

I think most programs actually do follow this method, without really knowing it.  And, yes, most programs are at the point in the evolution where exercises are the technique they should be using – they are not quite ready for drills, yet.  I simply suggest that we do not necessarily limit our vocabulary to the use of the word ‘exercise’ at the expense of ‘test’ and ‘drills’.

After all, it would be awful if we strengthened our ability to act in a way that couldn’t pass the test.

Disaster Recovery: Application Recovery vs. Data Center Recovery

In many disaster recovery programs that I have reviewed, I find that Recovery Time Objectives are assigned to applications, often in a tiered format.  Tier 1 might be applications that have to have near instantaneous failover – these are often less “recovered” environments and more of a geographically displaced redundant, live/live architecture.  Tier 2 might be applications that must be recovered in 4 hours or less; Tier 3 in 8 hours or less; and, so on.

The problem that I often encounter is that each individual application and its hardware environment are often tested and RTO validated in a stand-alone environment.  Doing these one-by-one tests, the IT teams prove to their business partners that each application can be recovered within the parameters of the Tier it has been assigned.  And, the business community is satisfied that, should a disaster occur in the data center, their applications will indeed be up and running in the requisite time-frame.  The problem is that if the entire data center were to be compromised by a single event, not all applications within each tier are likely to be recovered in the defined timeframe.  Yes, each of the individual applications can be brought up in 4 hours or less, but not all 20 (or however many there may be).

Disaster Recovery Plans seldom prioritize within a Tier to identify which applications should take priority if RTOs are in jeopardy or problems in the recovery process occur.  Meanwhile, the individual application users are under the impression that their application(s) will be up within the established timeframes – after all, it has been proven through several tests.

I challenge Disaster Recovery Program owners to ensure (and prove) that all applications within an RTO category can be recovered within that established time when and if the entire data center is compromised.  I have personally witnessed, on more than one occasion, IT Teams having to work with the business community during the time of failure to determine which Tier X applications are really the most important to get up and running given they have resource constraints preventing them from getting all applications up within the established timeframe.  These are not fun times and the business community feels that they have been misled, which they have.
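A simple simulation can show the gap between stand-alone test results and a full data center recovery.  The Python sketch below assumes a fixed number of recovery teams, each able to work on one application at a time, and recovers applications in priority order.  The App fields, the team count and the 20-application example are all hypothetical, and real recoveries have dependencies this ignores.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class App:
    name: str
    rto_hours: float       # tier target, e.g. 4 for a hypothetical Tier 2
    recovery_hours: float  # time proven in a stand-alone test
    priority: int = 0      # lower number is recovered first


def simulate_recovery(apps: List[App], teams: int) -> Dict[str, float]:
    """Return the hour each app finishes when only `teams` recoveries run at once."""
    if teams < 1:
        raise ValueError("at least one recovery team is required")
    free_at = [0.0] * teams  # min-heap of the hour each team next becomes free
    heapq.heapify(free_at)
    finished: Dict[str, float] = {}
    for app in sorted(apps, key=lambda a: (a.priority, a.rto_hours)):
        start = heapq.heappop(free_at)      # earliest available team takes the app
        end = start + app.recovery_hours
        finished[app.name] = end
        heapq.heappush(free_at, end)
    return finished


def missed_rtos(apps: List[App], teams: int) -> List[str]:
    """Names of apps whose simulated finish time exceeds their RTO."""
    finished = simulate_recovery(apps, teams)
    return [a.name for a in apps if finished[a.name] > a.rto_hours]


if __name__ == "__main__":
    # Twenty hypothetical apps, each proven recoverable in 3 hours on its own.
    tier = [App(f"app-{i}", rto_hours=4, recovery_hours=3) for i in range(20)]
    print(len(missed_rtos(tier, teams=20)))  # prints 0
    print(len(missed_rtos(tier, teams=5)))   # prints 15
```

With one team per application every RTO is met, but with five teams only the first five applications make the 4-hour target, which is exactly the conversation nobody wants to have for the first time during the failure.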

You may need to communicate a bi-Tier ranking for applications to indicate a reasonable RTO should only the one application environment (or suite) need to be recovered versus a second RTO should the entire data center need to be recovered.  I do not mean to overcomplicate an already complicated process, but I think we do a disservice to our clients if we allow a false sense of security to occur because we do one-by-one application recovery tests.  Make sure the business community understands the recovery differences between a single application failure and an entire data center failure.

And, yes, I understand that there are more complicated issues with reads and feeds and inter-application dependencies, but I will tackle that problem on another day in another blog.

Why I Hate the Word ‘Plan’

I have already discussed, and most of us already understand, that business continuity, disaster recovery and crisis management professionals are challenged by use of an inexact and often confusing jargon.  We use terms such as business continuity, disaster recovery, resiliency, hot site, warm site, cold site, recovery time objective, recovery point objective, business impact analysis, contingency plans, etc., that are often used to mean very different things to experienced and practiced professionals – not to mention what they mean to the uninitiated.  It can be confusing and often lead to misunderstandings and gaps between expectations and deliveries.

But, the one word I hate the most.  The word that makes me cringe when I hear it.  The word I try to eliminate from the vocabulary of consultants who work for me is … “plan”.  Such a little, simple word – how can I possibly have such distaste for this common word amongst all those other confusing terms?  Well, I’ll tell you.

What exactly do people mean when they say the word “plan”?  And what exactly do people assume when they hear the word “plan”?

There has been more than one occasion where a consultant went into an organization and had this conversation:

CONSULTANT:  “Do you have business continuity (or disaster recovery) plans?”

CLIENT:  “Yes.”

CONSULTANT:  “Can I see them?”

CLIENT:  “See what?”

CONSULTANT:  “Your plans.”

CLIENT:  “Oh, there is nothing to see, our plan is to …”

What the consultant was meaning to ask was, “Do you have a manual of documented business continuity policies and procedures?”  What the client heard was, “Do you have a business continuity solution in place?”

Then there was this rather uncomfortable moment I had in a corporate board meeting where I was reporting on the company’s business continuity posture:

CEO:  “Okay, Joe, cut to the chase.  You have been here a while, what is your greatest fear that could impact our ability to operate our business?”

JOE:  “A data center disaster.  This is the one disaster that will impact your operations world-wide and bring everything to a halt.”

CEO:  “But, we have taken care of that.  Our IT Director just gave us a presentation last month on his Disaster Recovery Plan in case of a data center disaster.”

JOE:  “Yes, I saw that presentation.  His plan is a plan to build out a recovery capability – but you have no recovery capability today.  His presentation showed a backup site that he recommends be established but isn’t there today.  If your data center goes down today … you are out of business.”

CEO:  (Turning to the IT Director)  “Is that true?  We don’t have a recovery plan today?”

IT DIRECTOR:  “No, we have a plan.  It is just going to take us 15 months to get it up and running if the budget gets approved.”

CEO:  “Oh, that’s not good!  I was under the impression we had a plan in place and not just a plan to build a plan.”

Awkward!

I have seen it time and time again.  The board of directors does what it is told to do: ask if we have a plan in place.  The responsible party gives a nice terse, “Yes” answer and everybody is happy.  Then I come in later and explain, “Well, you might have a recovery ‘plan’ but you don’t have a recovery capability.”

I instruct my consultants:  If you want to know if they have a recovery capability, ask them what their recovery capability is; if you want to know if their capabilities are supported by documented policies and procedures, ask them to see their documented policies and procedures.

My consulting plan is – avoid the word “plan” – and, be more precise by stating what you want that word to mean.