An NT Horror Story

I ran across a major NT customer a few months ago that was nailed by a nasty chain of events. The customer has a large and critical database served by an NT cluster. Over a weekend, while the administrators tried to upgrade the database, the primary system started rebooting spontaneously. Mind you, these were not blue screens of death (BSOD) crashes; these were genuine, spontaneous reboots. Needless to say, this wreaked havoc on their upgrade and they never finished it.

Cluster technology is a wonderful idea for enhancing overall system reliability. Although I’ve complained about the primitive state of NT clusters today, they do a decent job of failover. Without clustering, this problem would have shut down this company for at least several days.

By the time I arrived, most of the excitement was over; they were building a new cluster on all-new hardware and restoring their database from a backup tape. My job was to reconstruct the sequence of events from the prior several days and try to make sense of it.

Here is what we think happened: A power supply in the primary server -- the one hosting the database being upgraded -- went flaky, causing the spontaneous reboots. Sometimes while rebooting, the server failed its Power-On Self-Test (POST) with memory errors.

Any halfway-decent server uses error-correcting code (ECC) memory, which adds a few extra check bits and a clever parity scheme (typically a Hamming code) to detect memory errors and correct any single-bit error on the fly. Any server that holds important data should have ECC memory. In fact, all PCs -- all computers everywhere -- should have ECC memory instead of the junk we use today, and NT should record an event log entry every time the ECC logic corrects a memory error.
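The idea behind ECC can be seen in miniature with a Hamming(7,4) code: three check bits protect four data bits, and the check-bit pattern on readback pinpoints any single flipped bit so it can be flipped back. This is only an illustrative sketch in Python -- real ECC DIMMs use a wider variant of the same scheme (typically 8 check bits over 64 data bits, implemented in the memory controller, not software):

```python
def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                # parity over positions 4,5,6,7
    # Codeword positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(cw):
    """Return (corrected codeword, 1-based position of the flipped bit, or 0)."""
    c = list(cw)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # recheck positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # recheck positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # recheck positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3       # syndrome = error position (0 = clean)
    if pos:
        c[pos - 1] ^= 1              # flip the bad bit back
    return c, pos
```

Flip any one of the seven bits and `correct` repairs it; flip two and the syndrome points at the wrong bit, which is why the multi-bit failures this server saw are beyond what ECC can fix.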

In this case, the memory errors were so bad -- at least during POST -- that the ECC circuitry could not correct them: ECC can detect, but not fix, multi-bit errors.

The situation went downhill from there. After a while, the system started behaving strangely. Notepad could not open files, the system started crashing with BSODs, and things pretty much fell apart.

We think that the flaky power supply fried the memory by feeding it rotten power. The now-bad memory, in turn, corrupted this server’s system disk, which made a bad situation even worse. The only thing saving this customer from total disaster was the backup cluster member, which kept running.

The proper fix was to replace the power supply and memory in the failing server and rebuild its system disk from scratch. This didn’t happen until much later.

Instead -- and this boggles my mind -- when the customer called the hardware vendor to report the POST memory failures, the vendor said to ignore the errors because the server had ECC memory.

I’ve seen lots of dumb and irresponsible actions over the years, but this one stands out. I’m just a skinny bald guy from Minnesota, but it seems to me that if hardware under warranty or service contract fails a self-test, the vendor ought to fix it. After all, isn’t that why we buy service contracts? In this case, the vendor’s lack of service made a bad situation intolerable.

By now, everyone was upset and our customer was alternately demanding and begging for all new hardware from the vendor. In another incredible act of stupidity, the vendor sent the customer to a channel partner, who made the customer buy new hardware. In fairness, some of the new hardware included upgraded storage.

When the new servers arrived, the customer cloned the old system disk and tried it in a new server. This failed because the disk was already corrupted -- but nobody realized that yet.

Finally, in desperation, they decided to rebuild their NT cluster on the new set of servers and restore their database from a backup tape. By the time I arrived, they were deep into rebuilding. A week later, they were back up and running.

After this experience, the customer decided that NT is not scalable or reliable enough to meet its enterprisewide needs, and it now plans to migrate its large, critical applications to Unix servers. Because of service problems, the customer also decided to change hardware vendors, and the original vendor probably has no clue how many millions of dollars in future sales and support contracts this fiasco cost it.

Finally, on a related note, I’ve complained repeatedly in this column about problems with NT scaling, and I’ve campaigned for Microsoft Corp. to dump its "shared-nothing" strategy in favor of a distributed lock manager. It seems the good folks in Redmond may have listened. In a September press release, Microsoft and Compaq Computer Corp. announced they would incorporate this technology into a future release of Windows NT beyond version 5.

--Greg Scott, Microsoft Certified Systems Engineer (MCSE), is president of Scott Consulting Corp. (Eagan, Minn.). Contact him at gregscott@scottconsulting.com.