Archive for Tests

Planning versus Being Prepared

Many organizations engage in business continuity and disaster recovery planning; few organizations are prepared for a business interruption event or a disaster.  There is a difference.

My wife is a terrific party planner.  We just threw a birthday party for our youngest son who turned eleven years old this past Sunday.  My wife “planned” his party weeks in advance, but, until we got the invitations sent, the supplies purchased, the house cleaned, the balloons and decorations put up, the gifts wrapped and the cake baked, we were not “prepared” for the party.

The Allied Forces “planned” the D-Day Invasion months in advance; but, until they recruited and trained the forces, transported troops and equipment to where they were needed, ran simulations, drills and practices, monitored the weather, performed reconnaissance, set up Command Centers and established communications channels and protocols, they were not “prepared” for the invasion.

Simply going through the motions of creating Business Continuity and Disaster Recovery Plans does not necessarily mean your organization is prepared to respond to, operate through or recover from a business interruption event or disaster.  Many organizations have followed the standard, accepted business continuity planning methodology and produced numerous, well-documented plans, yet are NOT prepared for a disaster.  How can this be?  Here are some contributing factors that can result in that kind of dichotomy:

Invalid Planning Assumptions.  Almost every plan written includes a list of planning assumptions in the Introduction or Overview sections.  Many times these “assumptions” are really planning requirements, caveats or downright erroneous assumptions that invalidate the plans and continuity strategies in place.

For example:

  • A plan might include the assumption that employees are trained and have copies of the plans in their homes. This should not be a plan assumption; this should be a program requirement.  This requirement is auditable and should be tracked.  Your plan should not “assume” this to be true; your program should “ensure” that this is true.
  • A plan that utilizes a work from home solution might include the assumption that employees routinely take their laptops home with them every night. Again, this is an example of a program requirement, not a plan assumption.  If your business continuity solution relies on corporate assets, such as laptops, being available in certain employees’ homes at time of a disaster, you need to ensure that these assets are there when needed.
  • Sometimes, plans “assume” that the disaster impacts only the facility that the plan is written for. When the continuity or recovery strategies rely on alternate sites (or employees working from home) that share a common footprint of known risks and threats in the area, that may not be a plausible assumption.  In these cases, it is important that management know “what” they are prepared for.  For example, management might be told that you are prepared for a building outage but not a wide-area outage caused by an earthquake, flood or hurricane.  This could be important information to know if you are in an earthquake, flood or hurricane zone.
  • Many plans include the “assumption” that the strategies and technologies the plan relies upon are available, functional and usable at time of need. Many times, management reads this “assumption” as a “given” when, in fact, these solutions are yet to be implemented, contracted for or proven reliable.

When assessing an organization’s level of preparedness, plan assumptions should not be glossed over, nor should they be accepted as “givens” or truths.  If the viability of your plan is dependent on these assumptions being true, you must have policies and procedures in place to ensure these conditions exist and protocols in place to measure the level to which they are being met.

Dependencies That Can’t Be Depended Upon.  In a related situation, some plans include a list of dependencies that the plan’s execution relies upon.  Sometimes, the reliability of these dependencies is also listed in the plan’s assumptions.

For example:

  • The successful execution of the strategies outlined in the plan might be dependent upon external, single-source suppliers (of services, information or raw material) remaining operational. If these organizations are also at risk of being impacted by the same business interruption event, this might not be a reliable dependency.  You should include the examination of these organizations’ recovery plans in your program’s activities or eliminate this dependency as a single point of failure within your environment.
  • Plans are often dependent on certain individuals or subject matter experts being available to participate in the recovery effort. “People” are often overlooked as single points of failure.  If the successful execution of your recovery solutions relies on one or more particular individuals being available to execute the plan, you are at risk of failure during events that impact the availability of your workforce.  Many companies that have this dependency also state that their plans could be used during a Pandemic event – this is just one type of scenario that puts that dependency at grave risk.
  • Many plans are also dependent on certain technologies and/or applications being accessible at time of an event.  Sometimes, the recovery or continuity of these technologies and applications is within the scope of your plans and sometimes, it is not.  In either case, whether or not this dependency can be relied upon is something that can and should be proven.

Failure to Socialize the Plans.  Even companies with spectacular plans and solutions in place can be unprepared for the events they have planned for due to the lack of training and education of the people who must execute the plans.  Well-written plans and fully enabled solutions can fail to protect the organization from devastation if the people relied upon to execute those plans or utilize the solutions have not been trained in, and have not practiced, their roles before the plans must be implemented.

None of Shakespeare’s plays would be successful if the actors were reading the scripts for the first time on the night of the opening performance.  Documented plans should be treated like scripts; the lines should be memorized and rehearsed well before they are needed.  If your organization is dependent on the documented plans at time of a disaster, then it is quite possible that you are not “prepared” to respond and recover.

Unreliable Testing Practices.  And then there are companies that do routinely practice and rehearse for the event, but are still not “prepared” because of some unreliable testing practices that are commonly used.

Most business continuity and disaster recovery plans are designed to allow an organization to respond to and recover from an incident that occurs without warning and demands an immediate response; yet it takes these organizations months to plan for a test.

If the advance planning for a test is more than an exercise in scheduling resources, your organization may not be prepared for the real deal.  Too often, the time needed to prepare for a test is used to create special back-ups; install or provision equipment; order supplies; coordinate resource availability; or a number of other logistical activities that require time to complete – none of which you will be able to do at time of a disaster that hits without warning.

If your organization plans its tests weeks or months in advance, you need to scrutinize the actions being taken to prepare for the test and question whether or not that activity would be required at time of a real event.

And, too often, organizations execute these tests or rehearsals utilizing a small set of understudies and not the people who will engage at time of the real event (thus, not achieving the socialization mentioned above).  This, too, is something that can be audited and tracked.  Your program should identify anyone who has the potential of being engaged at time of an emergency response, continuity and/or recovery event and ensure that they are trained and routinely participate in recovery tests and exercises.
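Because participation can be audited, the check can be as simple as comparing the list of people named in your plans against the people who actually took part in recent tests.  Here is a minimal Python sketch of that comparison.  The function name, the sample names and the one-year window are all hypothetical; they are assumptions for illustration, not part of any particular tool or standard.

```python
from datetime import date, timedelta

def untested_responders(roster, participation, today, max_age_days=365):
    """Return roster members with no test participation inside the window.

    roster: names of everyone who could be engaged in a real event.
    participation: dict mapping name -> date of most recent test or exercise.
    max_age_days: hypothetical policy window; set it to match your program.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name in roster
        if participation.get(name) is None or participation[name] < cutoff
    )

# Illustrative data only.
roster = {"Alice", "Bob", "Carla", "Dev"}
participation = {
    "Alice": date(2024, 3, 1),
    "Bob": date(2022, 6, 15),   # last took part well outside the window
    "Carla": date(2024, 1, 20),
    # Dev has never taken part in a test
}

gaps = untested_responders(roster, participation, today=date(2024, 6, 1))
print(gaps)
```

Anyone who shows up in the resulting list is a person your plan relies on who has not rehearsed recently, which is exactly the gap that using understudies tends to hide.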

CONCLUSION

So, yes, there are many companies that “plan” for a business interruption event but are far from being “prepared” for a business interruption event.  The ultimate goal is being “prepared”; do not allow yourself to be lulled into a false sense of security just because you have a “plan”.

Conducting a Test – Yes, a “Test”

Yeah, I know, I know … we don’t have “tests” we have “exercises”, because tests imply pass/fail and exercises imply getting stronger.  Yeah, I used to sing that same silly song, too.

I now understand that there are times when you need to test; times when you need to drill; and, times when you need to exercise.  You do need to test your solutions to make sure that they do, in fact, work.  You do want to know if you can “pass the test”.  You may refer to these as validation tests.

I’ll take it a step further and would even like to see us test the people.  I think it would be great to gather our key business continuity players, managers and employees into a room and give them a regular, school-like, no. 2 lead pencil, don’t start until I tell you to, and put down your pencils when instructed, actual test.  Why not?

I would like to ask key players and managers questions that they should know about our Emergency Preparedness, Business Continuity and Disaster Recovery Programs.  These questions might include:

  • If the fire alarm went off right now:
      • What would you do?
      • Where would you find evacuation routes posted?
      • Where would you congregate once outside of the building?
      • Who are your floor wardens?
  • If you received a bomb threat on the phone, what would you do?
  • If you got a call at 2:00 am that the building had burnt down…
      • What would you do?
      • Who would you call?
      • Where would you go to work?
      • Where would you find your Business Continuity Plan?
  • If the Data Center experienced a disaster…
      • What applications that you use would be recovered?
      • In what timeframe?
      • What would be the status of the data?
      • What applications would not be recovered?
      • What business processes would you be expected to continue?
      • What business processes would be temporarily suspended?
  • If you do not know the answers to any of the questions on this test, where would you go to find them?
  • If we experienced a disaster and you weren’t available to participate in the recovery, who has been trained to play your role?  Have you trained them to be successful in this effort?

I could go on, but I want to try to make this an interactive blog and challenge you to post the types of questions you would want to include on this pencil to paper test.  You can do so by posting a comment to this blog.

You can exercise your solutions all you want.  You can physically recover hardware, applications, data, networks, etc., time after time – but there are some things that people need to know when the alarms go off or your ability to execute your plans will be severely hampered.

How do you think your organization would do if given this kind of test?  My bet is most companies would not fare too well.  There are historical cases where adequate recovery capabilities were in place but the people were not educated well enough to implement these capabilities at time of event.  How better to determine our level of preparedness than to give them a test?

You can exercise and get as strong as you want, but if you can’t pass the test … you will fail.

Testing Your Automated Emergency Notification Systems

Do those of you who rely on automated notification systems test the process regularly?  I know quite a few organizations that have invested in these software products and/or Internet-based services, yet never test them.  I think that presents a huge risk if and when you need to implement the service.  There are a number of issues with these products that you should make sure you have vetted besides validating that the phone numbers entered are correct.

One thing you should find out is what displays on a phone’s caller identification when the service is activated.  Some products allow you to customize the display while others may display an 800 number or the name of the service provider.  Many of your employees may see the caller ID displayed, not recognize it and ignore the phone call.  This can result in a very low answer rate at time of crisis.

Also, if your service provides a computer-generated voice message from typed text, you will want to make sure you are comfortable with how numbers and company jargon are interpreted by the voice module.  I have used several systems where you needed to be creative with the use of spaces, periods or commas to ensure the proper flow of the message.  Phone numbers entered as we normally type them were read too fast, or specific company jargon was mispronounced and needed to be phonetically typed for the voice module.  People typing in the messages have to be trained to enter the message so it is read properly by the computer voice system.  One example where this became an issue was with an airline company where the employees regularly typed airport codes into messages.  The message, “Problem on Flt 999 from EWR to LAX”, did not come across well when sent out as a voice message.
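One way to take some of the burden off the people typing the messages is to run the text through a small clean-up step before it goes to the voice module.  The Python sketch below is only an illustration of the idea.  The jargon table is hypothetical (each organization would maintain its own), and the use of commas to slow down digits is an assumption that you would have to test against your own system, since voice modules differ.

```python
import re

# Hypothetical jargon table; each organization would maintain its own.
SPOKEN_FORMS = {
    "Flt": "flight",
    "EWR": "Newark",
    "LAX": "Los Angeles",
}

def prepare_for_voice(message, spoken_forms=SPOKEN_FORMS):
    """Rewrite typed text so a text-to-speech engine may read it more clearly."""
    # Replace known jargon, matching whole alphabetic words only.
    def expand(match):
        word = match.group(0)
        return spoken_forms.get(word, word)
    message = re.sub(r"[A-Za-z]+", expand, message)

    # Separate runs of three or more digits with commas so they are read
    # one digit at a time (the pause behavior is an assumption to verify).
    def spell_digits(match):
        return ", ".join(match.group(0))
    return re.sub(r"\d{3,}", spell_digits, message)

typed = "Problem on Flt 999 from EWR to LAX"
spoken = prepare_for_voice(typed)
print(spoken)  # Problem on flight 9, 9, 9 from Newark to Los Angeles
```

A step like this does not replace training or testing.  You still need to listen to the result on the actual service, because only a test call will tell you how the voice really sounds.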

You may also need to make sure people are trained in the response method to indicate they got the message and, in some cases, to indicate their response posture.  And, be sure that there is no confusion with how to respond if the message is being picked up as a voice mail message.  I worked with one company where the message gave instructions to “Press 1 to listen to the rest of the message” or “Press X to indicate if you can respond”, etc.  This worked great if you received the message live, but pressing numbers while listening to the message in voice mail had no effect.  Employees were confused as heck while trying to follow these instructions in voice mail.  The messages had to be altered to indicate, up front, that if you are listening to the message in voice mail you will not be able to respond directly.

You may also need to test the system to see if there are any problems caused by the call volume.  I have seen, on more than one occasion, cases where business phone numbers were the primary number called and, by issuing an alert, the company PBX was so inundated with incoming calls that it brought the system down.  This, needless to say, was a big problem.

One thing people should know and your management teams should be made aware of is that even for systems that are perfectly implemented and regularly tested, an 80% hit rate on weekends and after hours is still a very good response.  You should test your systems and track the success rate to get a realistic sense of what kind of response you can expect during a real emergency.
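Tracking the success rate does not require anything fancy.  As a rough sketch, assuming you can export each test call as a time window label plus a yes/no flag for whether the person acknowledged it, a few lines of Python will give you the rate per window.  The function name, the window labels and the sample log here are made up for illustration; they are not real results.

```python
from collections import defaultdict

def response_rates(results):
    """Compute the fraction of acknowledged calls per time window.

    results: iterable of (window, acknowledged) pairs, where window is a
    label such as "business hours" or "weekend" and acknowledged is a bool.
    """
    sent = defaultdict(int)
    acked = defaultdict(int)
    for window, acknowledged in results:
        sent[window] += 1
        if acknowledged:
            acked[window] += 1
    return {window: acked[window] / sent[window] for window in sent}

# Illustrative test log, not real data.
log = (
    [("business hours", True)] * 19 + [("business hours", False)] * 1
    + [("weekend", True)] * 16 + [("weekend", False)] * 4
)

rates = response_rates(log)
for window, rate in rates.items():
    print(f"{window}: {rate:.0%}")
```

Keeping these numbers from test to test gives management a realistic baseline, so a weekend result around 80% is understood as a good outcome rather than a failure.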

I understand there may be a cost incurred per phone call and doing too many tests can become expensive.  I am also cognizant of the delicate balance between making sure people know how to send and respond to a message and creating a “cry wolf” syndrome, where too many tests result in people not being responsive to the phone calls.  It is up to us to make sure we come up with the proper schedule and frequency of these tests to ensure the use of this tool is effective and efficient at time of need.

Disaster Recovery Tests: Please DO Feed the Animals

This past weekend, my 7-year-old son and I visited our own little disaster site in the hopes of doing a little cleanup work – his bedroom!  My challenge was to make it fun enough for him to participate in the effort with as little whining and crying as possible.  It occurred to me that this was very similar to the challenge I, and others, have when trying to get folks to participate in a business continuity and/or disaster recovery test.

Let’s face it folks – we can really be a pain in the backside to these people who have better and “funner things to do” – as my son put it this weekend.

I know with the budget crunches going on and the all-out efforts to cut costs it is hard to get too creative with this stuff, but I still think it is worth the effort and expense to reward your test/exercise participants with snacks, meals, refreshments and the like, if not also with some kind of tchotchke.  In the past, I have seen testers give out tee shirts, coffee mugs, and other stuff as a reward for participating in tests.  One creative planner used to have snacks tied to a theme, like ice cream cones over the summer or hot dogs for a test scheduled during the World Series.

I know this sounds corny and I see many of you rolling your eyes (yeah, this blog technology is scary – I am watching you), but these little gestures go a long way with winning good favor with those we rely on to get tests scheduled and completed.  They also can soften the impact of failures you will undoubtedly experience along the way.

Well, by singing songs, counting stuff we put away, making a game out of throwing stuff in the trash and a promise of a Dairy Queen Blizzard after the job was done – the disaster area that was my son’s bedroom finally got clean.  Now all we need to do is administer CPR to his mother who fainted when she saw what we had accomplished!

National Failure Day

I found this news story to be rather intriguing and, although a little bit of a stretch to suggest it is business continuity or disaster recovery related, I have been known to stretch things out of proportion a time or two in the past.

Finland is celebrating National Failure Day today (Thursday, October 13) to help stimulate growth in the economy and combat their culture of being risk averse and not prone to trying new business ventures due to a fear of failure.

People that have that core characteristic, fear of failure, probably should not pursue a career in business continuity / disaster recovery planning and certainly not be in charge of managing the testing process for these programs.  But, we do try to address that fear in our programs.  That is one reason why we started avoiding the word “test” in our methodology, because it implies pass/fail and who wants to fail?

I try to get people to understand that your recovery and continuity programs do have areas in which you will fail – the testing process is to discover and fix those prior to the real event.  Hopefully, through testing, we can uncover the most damaging failure points in a controlled, testing environment rather than discover them when all Helsinki is breaking loose (see how I kept the Finland theme going there?)

But, alas, this fear of failing tests results in people rigging the test by preparing months in advance: taking special back-ups, installing equipment and software, etc.  As long as the disaster gives us a month’s warning that it is on the way, we have proven we can recover.  But hey, we didn’t fail the test – Yippee.

I commend Finland for their courage to face their fear and try to cultivate a willingness to take chances in order to stimulate a down economy.  Who knows, maybe one day, people from Finland will make good business continuity and disaster recovery planners.

I wonder if I can get my organization to celebrate a Company Failure Day during our next test?

Tests, Exercises and Drills

I know this just adds to the “jargon problem” I so often talk about in my blog posts, but today I am going to use our words to differentiate program testing techniques.

It has become in vogue to say: “We do not ‘test’ our plans; we ‘exercise’ our plans.  Testing implies pass/fail while exercising implies getting stronger.  We do exercises to strengthen our programs.”  (I used to say this, too.  That’s how come I’ve got it down so well.)

Well, that’s great and good – I think exercises are crucial and you do indeed want to strengthen your program – but I don’t think exercises alone are enough.  I think there are times when, indeed, you do want to test your programs and give them a pass/fail grade as a means of validating the ability of the plans and solutions to meet your recovery/continuity objectives.

In fact, I think the first thing you want to do is to “test” your program.  Make sure that it works.  Put it to the test.  Once you have proven the solutions and strategies in place do work and can pass the test, then you start to exercise it to strengthen and improve the process.

Furthermore, I think there is a third technique to employ.  Once you have strengthened your program through a series of exercises you may want to start to construct your sessions as drills.  In a drill you simply repeat a proven and strong process over and over again to condition the participants to react in a certain way when the plans are engaged.  In the military and in martial arts, you drill over and over again to change your reflexive actions so when a particular action is required, you behave in a certain way instinctively without having to think about it, or without having to rely on an instruction manual.

So, in summary, I think in a comprehensive program you want to include:

  • Tests – to validate that the solutions and strategies work
  • Exercises – to improve the effectiveness and efficiencies in executing the solutions and strategies
  • Drills – to condition role players to respond and react in a certain way

I think most programs actually do follow this method, without really knowing it.  And, yes, most programs are at the point in the evolution where exercises are the technique they should be using – they are not quite ready for drills, yet.  I simply suggest that we do not necessarily limit our vocabulary to the use of the word ‘exercise’ at the expense of ‘test’ and ‘drills’.

After all, it would be awful if we strengthened our ability to act in a way that couldn’t pass the test.