Allchin Says NT is Cause of Many Unplanned Reboots

LAS VEGAS -- The most common cause of unplanned reboots on Windows NT is, in fact, the core operating system. This revelation was made by Jim Allchin, senior vice president at Microsoft Corp., during a press conference here at Fall Comdex ’99.

Allchin said 65 percent of reboots are planned while the remaining 35 percent are unplanned. According to a chart that Allchin displayed, the core operating system is responsible for 43 percent of the unplanned reboots of Windows NT boxes.
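Taken together, the two figures imply what share of all reboots, planned or not, traces back to the core operating system. The short Python sketch below works through that arithmetic; the function name is made up for illustration, and the calculation assumes the 43 percent applies to the full 35 percent of unplanned reboots, which the chart as described suggests but the article does not spell out.

```python
# Figures as reported from Allchin's chart (percentages).
UNPLANNED_PCT = 35          # share of all reboots that are unplanned
CORE_OS_OF_UNPLANNED = 43   # share of unplanned reboots blamed on the core OS


def core_os_share_of_all_reboots(unplanned_pct, core_os_pct):
    """Return the core OS's share of ALL reboots, in percent.

    Hypothetical helper: assumes core_os_pct is a fraction of the
    unplanned reboots only, not of reboots overall.
    """
    return unplanned_pct * core_os_pct / 100


share = core_os_share_of_all_reboots(UNPLANNED_PCT, CORE_OS_OF_UNPLANNED)
print(f"Core OS share of all reboots: {share:.2f} percent")
```

Under that assumption, roughly 15 percent of all Windows NT reboots would be unplanned reboots caused by the core operating system.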

Allchin said that two years ago Microsoft set out to identify the root causes of NT’s reliability problems. The dilemma was that some customers were unhappy with NT’s reliability while others were achieving much better results.

"We knew that was running it, NASDAQ was running it, the Chicago Board of Trade was running it, and so there were customers that were incredibly happy with the reliability that we were offering," he said.

There were, however, a number of companies complaining. According to Allchin, after obtaining logs for about 5,000 servers, Microsoft found that depending on the way companies manage and operate their servers, there is a 5x difference in uptime.

"Operational practices make a big difference. If you treat it like a mission critical environment, you had a better experience with the system," he said.

For its part, Microsoft has taken several steps to fine-tune the operating system -- not to mention hardware, device drivers, and third-party applications -- to reduce the number of reboots for Windows 2000.

The first thing Microsoft did was to acquire PREfix, a tool for analyzing source code before testing begins.

"We've run this over NT sources and fixed literally thousands of problems that this has been able to discover," Allchin said.

Microsoft found that device drivers are a common cause of reboots, so the company created the Driver Verifier testing tool. The tool helps Windows 2000 expose errors in kernel-mode drivers and activates defenses when interacting with unstable drivers.

"We went to the hardware qualification lab, and we said, ‘You are not to pack a single driver that doesn't go through this driver verifier.’ And so the quality that we're going to get out of these drivers now is better than anything that we've had in the past," he said.

On the security front, Microsoft has focused on attacks against the system. An internal team and outside security analysts are reviewing code and attacking the system as a black box. The company has a full-time code penetration team whose sole purpose is to conduct code reviews every day. Microsoft uses the results to develop best practices and to educate customers and engineers.

Microsoft also beefed up its system testing, in terms of component stress testing and short-term and long-haul testing.

Every day developers build a new version of Windows 2000, and every night they run a stress test on it. According to Allchin, each night's stress testing amounts to the equivalent of about three months of run-time, and it is done on up to 1,500 machines.

Windows 2000 is the first product for which Microsoft has built long-haul stress environments. In this phase of testing, Microsoft determined which categories of server use customers felt were important -- such as Web servers, file servers, print servers, and DHCP/DNS-type environments -- then put stress on machines in those roles and left them up for a long period.

"I think our qualification [to pass stress testing] is 65 million entries inside Active Directory, and 2.3 billion look-ups inside DNS. That's the sort of numbers that we expect to see from these tests," Allchin said.

Additionally, there are now three levels of dumps that occur when a system crashes. For instance, when it ships, Windows 2000 Professional will automatically capture a minidump that is small enough to transmit easily. The company says it has sophisticated tools to diagnose the problem, point to a device driver or another specific cause, and quickly turn around a fix. Or, if a user has done something to cause a problem, the software has a feature called safe mode boot that can help walk them through the steps of what they might have done wrong.

Allchin expects the sum of this work will enable a greater number of customers to achieve higher uptimes with Windows 2000 than with Windows NT 4.0.
