Archive for Disaster Recovery

Disaster Recovery Tests: Please DO Feed the Animals

This past weekend, my 7-year-old son and I visited our own little disaster site in the hopes of doing a little cleanup work – his bedroom!  My challenge was to make it fun enough for him to participate in the effort with as little whining and crying as possible.  It occurred to me that this was very similar to the challenge I, and others, have when trying to get folks to participate in a business continuity and/or disaster recovery test.

Let’s face it, folks – we can really be a pain in the backside to these people who have better and “funner things to do” – as my son put it this weekend.

I know with the budget crunches going on and the all-out efforts to cut costs it is hard to get too creative with this stuff, but I still think it is worth the effort and expense to reward your test/exercise participants with snacks, meals, refreshments and the like, if not also with some other kind of tchotchke item.  In the past, I have seen testers give out tee shirts, coffee mugs, and other stuff as a reward for participating in tests.  One creative planner used to have snacks tied to a theme – ice cream cones over the summer, hot dogs for a test scheduled during the World Series, and so on.

I know this sounds corny and I see many of you rolling your eyes (yeah, this blog technology is scary – I am watching you), but these little gestures go a long way toward winning favor with those we rely on to get tests scheduled and completed.  They can also soften the impact of the failures you will undoubtedly experience along the way.

Well, by singing songs, counting the stuff we put away, making a game out of throwing stuff in the trash and promising a Dairy Queen Blizzard after the job was done – the disaster area that was my son’s bedroom finally got clean.  Now all we need to do is administer CPR to his mother, who fainted when she saw what we had accomplished!

Earthquake in Turkey

Earthquakes in foreign countries and underdeveloped, remote regions certainly have less business continuity impact and garner less of our attention, but the destruction, devastation and loss of life are no less tragic and no less heartbreaking.  Our thoughts and best wishes go out to the people of Turkey and the surrounding areas impacted by the devastating earthquake experienced there over the weekend.

The now-reported 7.2-magnitude earthquake that hit near the cities of Ercis and Van in eastern Turkey over the weekend has resulted in enormous damage to the two cities and numerous villages in the area.  Rescue efforts are still underway as both the death toll and the stories of survival continue to rise.  There have been, and no doubt will continue to be, a number of large aftershocks that will add to the terror and the losses.

We, at Safe Harbor Consulting, will continue to follow the news stories and hope for more accounts of rescues and survival.  If there are stories of lessons learned from this event that might be applicable to other regions of the world, we will attempt to pass those along as well.  For now, we just hope and pray for the best.  We invite you to do so as well.

Disaster Response – Enforcing Time Limits

Do you have a policy in your business continuity, disaster recovery, emergency response and/or crisis management program that establishes a limit on the number of hours responders can work before requiring a mandatory break?  Are you in a position to enforce this policy?  Do you enforce it during recovery tests?

I know that during a time of crisis people rise to the occasion and can sometimes exhibit Superman- (or Superwoman-) like powers, appearing to go strong for many, many hours – but the fact of the matter is, the longer they are active, the less effective they are likely to be and the more errors or poor decisions they are prone to make.

I strongly suggest that your programs – all of them, technology recovery teams as well – have a stipulated policy that no one individual can work for more than 12 straight hours without taking a break.  And, I highly recommend that you have individuals on your team responsible for ensuring that this policy is followed. 

I think a 12-hour-on, 12-hour-off schedule should work fine, requiring only two subject matter experts for each role in the program.  I would prefer three 8-hour shifts – this can still be accomplished with just two individuals – but 12 on / 12 off makes it easier to ensure your primary team member is on during the most important 12 hours of the day or night.

I know it can be difficult making the second-shift team members stay away from the response during the first twelve hours following the disaster, but you need to let them know how important it is that they show up 12 hours into the crisis rested, refreshed and ready to operate.

I also recognize that the 12/12 shift does require some turnover time from one shift to the next, but we need to make sure that the turnover does not drag out too long.  It will be tough to get the first-shift team members to remove themselves from all the activity after 12 hours, but it is for the benefit of the individual and of the organization that they be required to step away from the event and get some rest.  I think it is also important to have them physically removed from the crisis, as much as the situation allows, and put up in a location where they can rest undisturbed and away from all the activity.

I know this is not easy.  It is not easy for me to follow my own rule.  But it really is for the benefit of all that this policy be established and enforced.  I remember the old days, during mainframe recovery tests, when teams of us would go almost 48 hours non-stop in the recovery process.  And, still today, there are technology, network, database and other recovery teams that have few, or even a single, subject matter expert who will work on an issue until it is resolved, no matter how long it takes.  I think it is up to us, as planning professionals, to identify these employee-related single points of failure in our solutions, communicate the problem to management and seek options for remedying this exposure.

If you have technology recovery tests scheduled for more than 12 hours, you need to let it be known that no one individual will be allowed to participate in the test exercise for more than 12 hours – and you need to make sure that the rule is enforced.
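For those who like a little automation behind the policy, below is a minimal sketch (Python, with made-up participant names and a hypothetical ShiftMonitor class of my own invention) of the bookkeeping a test coordinator might use to log when people come on shift and flag anyone who passes the 12-hour mark.  It is only an illustration of the idea, not a prescription for any particular tool.

from datetime import datetime, timedelta

MAX_SHIFT = timedelta(hours=12)  # the policy limit discussed above

class ShiftMonitor:
    def __init__(self, max_shift=MAX_SHIFT):
        self.max_shift = max_shift
        self.on_shift = {}  # participant name -> time they checked in

    def check_in(self, name, when=None):
        self.on_shift[name] = when or datetime.now()

    def check_out(self, name):
        self.on_shift.pop(name, None)

    def violations(self, now=None):
        """Return participants who have been on shift longer than the limit."""
        now = now or datetime.now()
        return [(name, now - start) for name, start in self.on_shift.items()
                if now - start > self.max_shift]

# Example: flag anyone over 12 hours partway through a long recovery exercise.
monitor = ShiftMonitor()
monitor.check_in("DBA on-call", datetime(2011, 10, 22, 6, 0))
monitor.check_in("Network SME", datetime(2011, 10, 22, 20, 0))
for name, elapsed in monitor.violations(now=datetime(2011, 10, 22, 21, 0)):
    print(f"{name} has been on shift {elapsed} -- send them to rest.")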

There are actually companies that provide employee health and well-being services who can help you enforce this rule and provide mental health counseling for employees impacted by and/or participating in crisis situations.  You may want to check them out for advice on how to implement this particular component of your program.

This blog was written in less than 12 hours – just so you know.

Today’s Disaster – Wild Animals on the Loose!

Okay, here’s a new one – a city in lockdown mode because there are wild animals on the loose roaming the city streets!

I can’t help but chuckle imagining the broadcast message that one would send out to their employees telling them the office is closed due to a city lockdown caused by wild animals.

I really have no more to say about this one, other than I just had to share this story with you.  I will have to challenge myself a little harder to come up with a legitimate blog post – but, you can read the story and adjust your plans accordingly for this risk.

Risk Free, Satisfaction Guaranteed Program Review

Safe Harbor Consulting (SHC), a management consulting firm specializing in business continuity, disaster recovery, emergency response and crisis management, is offering a risk-free, satisfaction-guaranteed Program Review.  SHC will review your program documentation, interview employees with key responsibilities in your solutions and review other program material in an effort to discover opportunities to strengthen your programs, improve your strategies and/or expand your solutions.

If, at the completion of the review and following the delivery of the SHC Findings Report, you are not satisfied that we have identified valid, substantial opportunities to advance your program and/or better position your organization’s response and recovery posture, you will not be invoiced for SHC services.

“I have found that having outside experts review program material prior to conducting a Tabletop Exercise or Physical Program Test is an excellent technique for ensuring your program material is in tip-top condition prior to sharing it with internal management and employees,” says Joe Flach, CEO and Lead Consultant at SHC.  “If the material we review is in excellent condition and, other than a few cosmetic fixes, has no real identifiable issues, problems or concerns, then our review will indicate as much and we will not charge you for our efforts.  Only if we discover legitimate opportunities to improve the program or program material, and only if the customer agrees that we have achieved this, will we prepare an invoice for our agreed-upon fees.”

To take advantage of this Risk Free, Satisfaction Guaranteed Program Review offer, please contact Safe Harbor Consulting at (253) 509-0233 or email them at safeharborconsulting@yahoo.com.  To learn more about Safe Harbor Consulting you can visit them at www.safeharborconsulting.biz.

National Failure Day

I found this news story to be rather intriguing and, although a little bit of a stretch to suggest it is business continuity or disaster recovery related, I have been known to stretch things out of proportion a time or two in the past.

Finland is celebrating National Failure Day today (Thursday, October 13) to help stimulate growth in the economy and to combat a culture of being risk-averse and not prone to trying new business ventures out of a fear of failure.

People who have that core characteristic, fear of failure, probably should not pursue a career in business continuity / disaster recovery planning and certainly should not be in charge of managing the testing process for these programs.  But we do try to address that fear in our programs.  That is one reason why we started avoiding the word “test” in our methodology – it implies pass/fail, and who wants to fail?

I try to get people to understand that your recovery and continuity programs do have areas in which you will fail – the testing process is there to discover and fix those prior to the real event.  Hopefully, through testing, we can uncover the most damaging failure points in a controlled testing environment rather than discover them when all Helsinki is breaking loose (see how I kept the Finland theme going there?).

But, alas, this fear of failing tests results in people jury-rigging the test and preparing months in advance – taking special back-ups, installing equipment and software, and so on.  As long as the disaster gives us a month’s warning that it is on the way, we have proven we can recover.  But hey, we didn’t fail the test – yippee.

I commend Finland for their courage to face this fear and try to cultivate a willingness to take chances in order to stimulate a down economy.  Who knows – maybe one day people from Finland will make good business continuity and disaster recovery planners.

I wonder if I can get my organization to celebrate a Company Failure Day during our next test?

The Blackberry Outage

The current Blackberry outage going on throughout Europe, and now the US, provides an opportunity to discuss two important Business Continuity Planning issues: 

  1. Don’t rely on a single communications device
  2. Ensure you have processes for addressing the backlog

I remember that immediately after the events of 9/11 people were touting how well their Blackberries continued to function during the crisis while all other communications tools were failing.  Shortly after, it seems, everyone was running out and buying a Blackberry.  I was not suggesting people not invest in Blackberries, but I was warning them that just because this particular tool worked in that crisis does not mean it will be the one tool working in the next crisis.  One reason the Blackberry worked so well in 2001 was that so few people were using the device; the infrastructure that supported it was not being overburdened during the time of crisis.  Blackberries also rely on a different technology and a different infrastructure that was not damaged during 9/11 – I was warning anyone who cared to listen (probably no one) that this might not be true during the next crisis.  My point was not that Blackberries were bound to fail, but that you should not rely on a single tool or technology for all of your communications channels.  Lo and behold, we now find out that Blackberries are susceptible to network-wide outages just like other communication tools.

In the referenced article, Research in Motion says that they have fixed the underlying problem causing the outage, but that the backlog of emails and text messages is delaying getting the service fully functional once again.  This is a reminder to make sure that business areas consider the impact of the backlog that builds up during an outage and have procedures in place to address it once their systems are back online.

I have even seen instances where the inability to handle the backlog that would develop was the primary justification for establishing an RTO for some applications.

Procedures for handling the backlog (and re-entering lost transactions where the RPO is not point-of-failure) need to be included in each department’s business continuity plan.  For some financial applications, this may mean having to post-date transactions to ensure they carry the right effective date.  For applications that automatically generate the transaction date and time, this may require some additional programming, or restarting servers with adjusted time stamps, to ensure the proper entry date.
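To make the post-dating idea a little more concrete, here is a minimal Python sketch – with invented transaction and ledger structures, not anyone’s actual posting system – showing how a backlog replay routine might preserve each queued transaction’s original effective date instead of stamping it with the date the system came back online.

from dataclasses import dataclass
from datetime import date

@dataclass
class Transaction:
    account: str
    amount: float
    effective_date: date   # the business date the transaction belongs to

def replay_backlog(backlog, ledger):
    """Post queued transactions with their original effective dates,
    not the date the system came back online."""
    for txn in sorted(backlog, key=lambda t: t.effective_date):
        ledger.append({
            "account": txn.account,
            "amount": txn.amount,
            "effective_date": txn.effective_date,   # post-dated to the outage window
            "posted_date": date.today(),            # when we actually processed it
        })
    return ledger

# Example: two days of backlog entered after a two-day outage.
backlog = [
    Transaction("ACME", 1200.00, date(2011, 10, 10)),
    Transaction("ACME", -300.00, date(2011, 10, 11)),
]
print(replay_backlog(backlog, ledger=[]))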

For all applications and business processes that are not immediately failed over, there is the potential for a backlog to develop.  How you handle that backlog must be considered in the recovery and continuity plans.

Business Continuity Planning: Vendor Risks

One of the risks that a lot of companies may benefit from looking at a little more closely is “vendor risk”.  Vendors can be suppliers or outsourced entities that perform a critical service on behalf of our organization.  We need to ensure that our critical vendors know how to respond in the event of our disaster, and we need to know that the vendor can continue to provide materials or support in the event of their disaster.

I know many organizations include “Service Level Agreement” (SLA) clauses in vendor contracts, but I suggest that we may want to go further than that and, every now and then, ask to be shown evidence that they could meet those levels of performance at time of disaster.  How many of you audit or review your vendor’s Business Continuity / Disaster Recovery Plans?  How many participate in their vendor’s Business Continuity / Disaster Recovery tests or exercises?

Many organizations try to mitigate or eliminate vendor risk by engaging multiple vendors to provide a similar product or service.  Just be aware that, sometimes, even though you diversify your vendors, you may not have diversified the infrastructure they depend on.  Lessons learned from the events of 9/11 provided proof of the issues that can arise here.  Many companies felt confident that they were using multiple communication vendors, only to discover that they all relied on the same underground infrastructure and the same “points of presence” (POPs).  One central office failure or one compromised cable conduit and all vendors were out of service – the diversity did not provide the stability they thought they were getting by using multiple vendors.  Even though you may think you have eliminated a potential “Single Point of Failure” (SPoF) by using multiple vendors, make sure you do not still have a SPoF in the physical infrastructure they rely on.
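One way to make that check routine is to map each vendor to the physical infrastructure it depends on and look for overlap.  The little Python sketch below uses invented vendor and facility names purely to illustrate the bookkeeping – the hard part, of course, is getting accurate dependency information out of your vendors in the first place.

from collections import defaultdict

# Hypothetical mapping: vendor -> the physical infrastructure it depends on
# (central offices, cable conduits, points of presence, etc.).
vendor_dependencies = {
    "Carrier A": {"CO-Broad-St", "Conduit-7", "POP-East"},
    "Carrier B": {"CO-Broad-St", "Conduit-7", "POP-West"},
    "Carrier C": {"CO-Hudson",   "Conduit-9", "POP-West"},
}

def shared_single_points_of_failure(deps):
    """Return infrastructure elements relied on by more than one vendor."""
    users = defaultdict(set)
    for vendor, elements in deps.items():
        for element in elements:
            users[element].add(vendor)
    return {element: vendors for element, vendors in users.items()
            if len(vendors) > 1}

for element, vendors in shared_single_points_of_failure(vendor_dependencies).items():
    print(f"{element} is shared by {sorted(vendors)} -- vendor diversity alone won't help here.")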

Another example of this came up recently while I was working with an airport authority dealing with a potential flooding risk caused by a suspect dam near the airport.  Many of the airlines at the airport had fuel provided by two or more fuel suppliers, but the delivery of the fuel was all through the same single-source pipeline that ran through the flood zone.  Even though the pipeline itself was underground and, potentially, not susceptible to damage by the flood, the fuel-line switching station was above ground and in the flood zone.  The pipeline needed this switching station operable to move the fuel.

I was also once hired to look into the reliability of an offshore, outsourced call center.  This facility, located in India, was a state-of-the-art facility in a pretty resilient compound.  The outsourcing company felt so secure in the “hardening” of the facility that they did not see the need to invest in contingency operations.  The problem, however, was that the infrastructure that fed power, phone service and other utilities into the compound was very suspect.  Additionally, the employees did not live in the compound, and a disaster in the area could easily prevent them from getting to the complex.  My client decided that they needed a contingency should their primary vendor suffer a business interruption event and took the necessary steps to cover this risk.

And, remember to make sure your vendors know what changes in their delivery or performance must be made at time of your disaster.  One simple example – do your mail carriers (US Post Office, UPS, FedEx, others) know where to reroute your mail, or where to do pickups, when a particular facility is compromised?

Also make sure that if you have vendor personnel on site, you educate them in the evacuation, notification and escalation process.  Are you responsible for accounting for vendor personnel during a disaster, or do you call the vendor and have them account for or alert their employees at time of crisis?  Do not forget those vendors that may not perform a critical service but are on site – such as cafeteria staff, custodial staff, plant suppliers, landscapers, etc.  Make sure they are notified of an office closure and are included in the process for accounting for who may have been injured or killed in the disaster.

Sometimes it is easy to overlook our vendors in the planning process.  Make sure your program and department managers have adequately accounted for them.

Business Continuity Blogs: No Offense Intended

I am having great fun in challenging myself to come up with nearly daily blogs about business continuity, disaster recovery, crisis management and related topics.  This exercise of typing my thoughts and experiences in a form that almost makes sense, I think, helps me identify opportunities to improve my approach to the planning process even if it does not inspire any of my readers.

I have posted many of my blogs and similar thoughts on LinkedIn group pages and have, it seems, unintentionally insulted a few practitioners by suggesting there is room for improvement in the standard methodology and tools that many of us use.

I assure you that I in no way mean to suggest planners cannot be successful in applying the methodology as it is today.  There are, without a doubt, many quality business continuity programs and plans in place throughout the world today.  There are many successful, intelligent, professional planners who do a terrific job of guiding their organizations and fellow planners through the process without the need to deviate from standard practices and tools.

I simply try to find opportunities where we may improve a process – even one that appears good enough today.  I think, at times, it is engaging and rewarding to try to think outside the box every now and then, even if in doing so, it just makes us realize that inside the box is the best place to be.  I do not mean to offend anyone by doing so.  I do not mean to suggest everyone has it wrong and I am the only one who has this thing figured out.  No, in fact most of my thoughts are really an admission that I do not have this thing figured out and I invite others to discuss the topics to help me understand.

I do not mean to compare myself to historic figures, but history is full of people who did not accept the standard method of thinking – challenged common knowledge and understanding, much to the dismay of other professionals in their field, and helped advance the practices and level of understanding.  Christopher Columbus and Galileo are two that immediately come to mind.  I am sure there are many others who went against common understanding, down the wrong path and were proven to be total fools – and, I may more likely be one of those – but, like I said, at least I am having fun.  Insulting or aggravating others is not my intent.

Business Continuity Planning: Recovery Requirements

Remember the movie, “The Jerk” with Steve Martin?  Great movie.  Anyway, there is a scene in this movie where Navin Johnson (Steve Martin’s character) loses all his money and walks out of the house saying something like, “I don’t need anything.  I need this ashtray.  But, that’s all I need.  I don’t need anything but this ashtray … oh, and this lamp.  Just this ashtray and this lamp.  I need this. …”.  You get the picture, right?

There have been times, when working with business managers to identify recovery requirements, that I have felt like I was in the middle of that scene.  This somehow seems to be especially true when working with trading floor operations.  Often when I first sit down with trading floor managers and ask them, “What do you minimally need to conduct business?”, they answer, quite simply, “A phone.  You give me a phone and I can conduct my business from anywhere.”

Oh really?

You mean you have the phone numbers for all your clients, brokers and the trading floors?

Well no.  I need those numbers programmed into the phone.

So it’s not just any phone you need?

No – but, if I had a phone with those numbers programmed in, I can do my job.

So you don’t need to know the state of your client positions?

Well sure, I need that.  But if I had a phone, programmed with numbers and a copy of the start of day report, I can do my job.

And you don’t need access to any market data feeds?

Well, not all of them.  If I had a phone, programmed with numbers and a copy of the start of day report and access to Bloomberg, I could do my job.

And you will remember the trades you transacted to call into the back office?

Well no.  I need my blotter system or trade tickets.  If I had a phone …

Aaaaaahhhhhhhhh.

It got so bad that I would walk into these meetings with a roll of quarters in my pocket.  When they said all they needed was a phone, I would take out the roll of quarters and say, “Okay, take these.  There is a pay phone on the corner of the street in front of this building – go do your business for the rest of the day and let’s see how that works.”  I know that analogy kind of dates me.  Some of you reading this probably don’t even know what a pay phone is – but it often got my point across.

Once I got their egos to agree that it was more than just their expertise that made them a successful trader, and that they did depend on technology and tools more than they liked to admit, the next challenge was in identifying what was needed under what circumstances.

The difficulty with trading floors and many other business functions these days is the interchangeability of certain tools.  You get into discussions where, as long as I have Application A, I can do without B or C; but if I don’t have A, I need both B and C.  This can get very complicated.  And that is before you even get to the complication of also recovering the behind-the-scenes applications and tools that the business managers don’t know about – that complexity is up to the technology teams to figure out.
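If you want to capture that kind of interchangeability somewhere other than a narrative, a small set of rules can do it.  The Python sketch below encodes the “Application A alone, or else B and C” example from the paragraph above; the application names are placeholders, and the real rules would, of course, have to come from the business managers themselves.

def desk_can_operate(available):
    """Hypothetical toolset rule for one trading desk:
    Application A alone is sufficient; without A, both B and C are required."""
    if "A" in available:
        return True
    return {"B", "C"} <= available

# Quick check of the combinations discussed above.
for recovered in [{"A"}, {"B"}, {"B", "C"}, {"A", "C"}, set()]:
    print(sorted(recovered), "->", "can operate" if desk_can_operate(recovered) else "cannot operate")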

Getting the right applications recovered in the right timeframes takes coordination between the many different departments that share applications and databases.  I know some organizations try to identify “application owners” in the business community, but this, too, can get complicated.  The easiest example: who owns email?  In a trading floor environment, there are many market data feeds that are shared between different trading desks.  Defining the ultimate owner can be challenging and cause trouble.

Once you identify the toolset (or toolset options) required to support critical operations you can start researching options for getting them operational in the requisite timeframe.

Be prepared, however, for some business managers to answer your question on what they need with another question, “Well, what can I have?” or, “How much will it cost me to have …?”

I don’t think it is too out of line for some managers to take the position that “it is not a matter of what I minimally need in place, it is a matter of what I (or the company) am willing to pay to have in place.  If it is a reasonable expense to provide full functionality in a recovery site, why go through the exercise of asking me what I can get by with?  Give me solution options, and I will tell you what I can afford to invest in.”  This often results in a push-me-pull-you working atmosphere that not all continuity planners are ready for, but one I think we should be prepared to handle.

Just a few more things for us to think about.  I sure am glad this job is not an easy one, otherwise anybody could do it.