Managing a Crisis
We recently managed a crisis situation for a rural software development center that depends on a frame relay telecom connection to its end-user customers. One day a lightning storm took down the frame relay connection and more than 40 people could not work.
Naturally, the local telephone company said the phone line was good and the problem had to be a router issue. Does this sound familiar? After lots of work proving the routers were good, the telco found the problem in its frame relay equipment and dozens of people across Minnesota breathed a collective sigh of relief.
This crisis was a big deal because it shut the company down for more than a day. Had it not been resolved, it could have meant bankruptcy.
Pick your metaphor: Crisis situations separate the big dogs from the pups, the men from the boys, the pros from wannabes, the best from the rest. If you find yourself in one of these situations, here are a few prudent steps to follow.
First -- this is critical -- thoroughly assess the situation. What are the specific symptoms? When did the problem start? What changed from when it worked to when it failed? What events occurred around the time of the failure?
Acknowledge everyone’s emotions and the critical nature of the problem. Then take control. This is a severe situation and as far as you’re concerned, nothing else in the world should have any higher priority. You should expect and insist that everyone involved cooperate fully with you. This is a time for firm diplomacy -- don’t be timid, but don’t be overbearing.
Gather the data you need quickly and firmly. Sometimes it makes sense to track down a sequence of events. What occurred, when did it occur, and what was the system's behavior before and after? Also, and this is important, can you reproduce the problem at will?
Many problems seem random because nobody does the detective work to understand the conditions that caused the problem. A few years ago, I ran into a Windows NT system that failed to boot after running "for a while." We were finally able to reproduce the problem at will by changing the attributes of one particular printer. The problem turned out to be registry corruption.
Once you understand the symptoms, eliminate the obvious possible causes first. One time, we spent nearly a day onsite troubleshooting a network problem only to find somebody had unplugged a hub. Whoops -- that was embarrassing!
Next, check the desktop, server, router, and other relevant setups and ensure they are set up per vendor recommendations. Document, or at least understand, any setups that are different, including why they are different and what will happen if they are put back to vendor specs.
These simple steps take care of most problems. For deeper issues, crisis management gets more complicated. The next steps involve rigorous deductive reasoning and lots of persuasion. Come up with ideas for possible causes: Draw upon experiences with similar situations, advice from friends, vendor hotlines and any other credible sources of data. Develop a hypothesis, design a quick experiment to test it, and take appropriate measures depending on the results. Keep doing experiments and record the results. Ideally the results from one experiment will refine your hypothesis and guide the next experiment. Design each experiment to either support or eliminate a possible cause.
For the rural development center, we configured identical Cisco routers at the site and in our Eagan office. When we unplugged our known good Eagan router and installed it at the development center, we reproduced the symptoms. This eliminated any router issues and clearly pointed to a telecom problem.
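For readers who like to keep an experiment log in a script rather than on paper, here is a minimal sketch of the hypothesis-and-experiment loop, using the router swap as sample data. The `Hypothesis` class and the `record_experiment` and `remaining` helpers are hypothetical names made up for illustration; they are not part of any tool we used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    cause: str
    status: str = "open"  # "open", "supported", or "eliminated"
    experiments: List[str] = field(default_factory=list)


def record_experiment(h: Hypothesis, description: str, rules_out: bool) -> None:
    """Log an experiment and update the hypothesis it was designed to test."""
    h.experiments.append(description)
    h.status = "eliminated" if rules_out else "supported"


def remaining(hypotheses: List[Hypothesis]) -> List[str]:
    """Causes that no experiment has eliminated yet."""
    return [h.cause for h in hypotheses if h.status != "eliminated"]


# Illustrative data only: the two candidate causes from the story above.
router = Hypothesis("router fault")
telecom = Hypothesis("telecom fault")

# The known-good router showed the same symptoms, so the router is ruled out.
record_experiment(
    router,
    "swapped in known-good router; symptoms persisted",
    rules_out=True,
)

print(remaining([router, telecom]))
```

The point of the structure is that every experiment is written down against the cause it was meant to test, so the list of open causes shrinks in a way everyone can follow.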
Sometimes problems seem complex. The trick is to reduce the problem to a simple, repeatable sequence. With the registry corruption problem, we drilled down by recording server-based activities, such as adding shares and changing printers over time, until we finally narrowed the problem to a single printer definition.
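If you have a recorded, ordered list of changes and a way to test whether the problem reproduces, the drill-down can be sketched as a simple bisection. This is an illustration, not the procedure we followed on the NT server. It assumes a single culprit, a system that works with no changes applied and fails with all of them, and a failure that persists once the culprit is included. The function name, the change descriptions, and the simulated test are all hypothetical.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str],
                     fails_with: Callable[[Sequence[str]], bool]) -> str:
    """Return the earliest change whose inclusion makes the problem appear."""
    if not changes:
        raise ValueError("need at least one recorded change")
    # Invariant: works with changes[:lo], fails with changes[:hi].
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_with(changes[:mid]):
            hi = mid
        else:
            lo = mid
    return changes[hi - 1]


# Hypothetical activity log; the entries are made up for illustration.
recorded = ["add share A", "add printer 1", "change printer 2", "add share B"]


def simulated_test(applied: Sequence[str]) -> bool:
    # Stand-in for "apply these changes to a test system and try to boot".
    return "change printer 2" in applied


culprit = first_bad_change(recorded, simulated_test)
print(culprit)
```

Each test cuts the remaining list roughly in half, which matters when every test means rebuilding or rebooting a server.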
Throughout the process, it is absolutely vital to remain totally focused, completely in control and in constant communication with all relevant players. In these situations, rumors fly, and people panic and make dumb decisions based on bogus data. It’s your job to stay calm and make sure everyone has up-to-date information on what is really going on and the plans for the next few steps. Good communication is as important as solving the problem, maybe even more important, because the solution is only worthwhile if everyone knows about it.
Sound intense? High pressure? You bet. But the reward for successfully handling one of these situations is a great feeling of satisfaction and the gratitude of your customers. Besides, there's nothing like a good crisis to get the adrenaline pumping.
--Greg Scott, Microsoft Certified Systems Engineer (MCSE), is CTO of Cross Consulting Group (Eagan, Minn.). Contact him at gregscott@scottconsulting.com.