
Maintaining Up-time in an Ever-Changing Environment

A new survey sheds light on the high cost of downtime and reveals how IT departments are struggling to manage complex environments

In a study of more than 300 senior IT professionals completed earlier this year, mValent found that for one-third of respondents, each hour of application downtime costs their Fortune 1000 company over $300,000 on average. Getting back up and running isn't as easy as it once was: 38 percent of respondents said that troubleshooting configuration errors in application servers, middleware, and databases takes more than one day.

According to Jim Hickey, chief marketing officer at mValent, "Our initial hypothesis going into the survey was that when you build a new instance of an application environment—be it an upgrade or an expansion—multiple instances of a server are created in an n-tier environment. You have a database, messaging queues, an application server, and more. In most companies, those technology elements are managed by different people with different sets of expertise."

The study, "Challenges and Priorities for Fortune 1000 Companies," highlights one problem in particular: IT's ability to respond quickly, especially given the risk and cost of complex IT configurations. The survey notes that "On top of this increased complexity and application volume, IT organizations are struggling with the fact that the rate of IT infrastructure change has gone way up—a significant complexity multiplier."

"To get a new instance of the application built, you have to coordinate four to five people," Hickey observes. "Just getting a database configured for a proof-of-concept project I worked on took two days. The configuration may not go well, there may be significant troubleshooting required—and you have to figure out who can fix a problem. Products may be ‘standard,’ but orchestrating the team’s resources inevitably results in a lengthy process."

Are the costs in line with other published reports? "Actually, we've seen research studies showing that it could go as high as $1 million an hour," Hickey explains. What's behind such a large number?

"I think there are two things going on. First, if the applications are revenue-focused, if your application is down or response is slow, then you’ve not earning your revenue because you can’t book the order, and customers who experience slow response time are going to go elsewhere.

"Second, for non-revenue applications, it’s a case of lost productivity. If the application isn’t up, workers are stuck—they’ll get a cup of coffee. It’s pure wasted time. If uptime is five 8s rather than five 9s, then you have a significant hidden cost in your cost structure."

Application-related failures are the leading cause of downtime, according to several analyst surveys, Hickey says; over time, "the other causes of downtime—servers and networks—have become increasingly stable. So while hardware is more reliable, the applications have grown grossly more complicated: they do more, are wider in scope, and have more infrastructure elements behind them, leaving a higher likelihood of failure. The rise of virtualization solved the problem in one way, but it's so easy to cut and paste an environment that you end up with numerous, discretely different snapshots of applications, which makes control difficult."

Given the complex environment, what is IT doing to address the problem? Several initiatives, standards, and best practices are being studied, including elements of the ITIL framework such as change management and configuration management. "IT is also scrambling for tools and automation to try to ameliorate these problems and deliver a higher level of uptime. That drive for automation leads them to a build-vs.-buy decision, and most of our customer prospects who have tried to go it alone come back to us after 18 months, because the internal process is just not their specialty or their work just doesn't scale."

In an age when packaged applications are one way to get up and running quickly, don't such applications make configuration easier? No, says Hickey. "Part of the problem with administering infrastructure and applications is that settings are maintained in many—even hundreds—of files in several locations. They aren't organized and presented in a logical way. A specialist, such as a security administrator, may have to look in a couple of dozen places to see the settings for that specialty, and it's easy to overlook some settings and errors in those settings." That lack of a coherent, unified view makes it easy for configuration errors to creep in.

Then there's the problem of different environments: a test environment must differ from a QA environment, which in turn differs from production. Keeping all those settings correctly configured can be a major headache.

Hickey points out that mValent's Integrity attacks the problem differently. Instead of examining configuration settings file by file across all the environments, Integrity uses over 100 signatures that know where everything is stored; it reads all the configuration files, parses the settings, and stores them in an Oracle database. The data can be presented in a customized way for, say, the security specialist, and the program can compare entire environments (QA vs. production vs. development, for example) and examine differences at the property level or by asset type (for a particular server, for example). (The company released a version last week that supports Microsoft's most popular servers; see http://esj.com/vendor_news/article.aspx?editorialsId=1234.)
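To make the approach concrete, here is a minimal sketch, in Python, of the general technique of property-level configuration comparison across environments. It is an illustration only, not mValent Integrity's implementation or API; the .ini file format, the directory paths, and the function names are assumptions made for this example.

# Minimal sketch: flatten each environment's configuration files into a
# single {"file/section/key": value} map, then diff at the property level.
# Hypothetical illustration; not mValent Integrity's implementation.
import configparser
from pathlib import Path

def collect_settings(env_root: str) -> dict:
    """Walk env_root and flatten every .ini-style file into one map."""
    settings = {}
    for path in Path(env_root).rglob("*.ini"):
        parser = configparser.ConfigParser()
        parser.read(path)
        for section in parser.sections():
            for key, value in parser.items(section):
                settings[f"{path.name}/{section}/{key}"] = value
    return settings

def diff_environments(a_root: str, b_root: str) -> None:
    """Print properties that differ, or that exist in only one environment."""
    a, b = collect_settings(a_root), collect_settings(b_root)
    for prop in sorted(set(a) | set(b)):
        if a.get(prop) != b.get(prop):
            print(f"{prop}: {a.get(prop, '<missing>')} != {b.get(prop, '<missing>')}")

if __name__ == "__main__":
    diff_environments("/envs/qa", "/envs/production")  # hypothetical paths

Flattening everything into one keyed map is what makes both the specialist view (filtering keys relevant to, say, security) and the environment-to-environment diff cheap to express.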

Integrity isn't just a reporting tool, Hickey told Enterprise Systems. "You can remediate any problem through its interface, including correcting a setting on multiple machines or environments at once, and you can roll back changes to a previous version and 'reinstall' a snapshot (you can set the interval between snapshots), clone whole configurations to new environments, or even create and employ templates," which helps eliminate the need to coordinate all the individuals ordinarily involved, as in the troubleshooting example Hickey described earlier.
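The snapshot, rollback, and clone capabilities Hickey lists can likewise be sketched briefly. The class below is a hypothetical illustration of versioned configuration snapshots, not Integrity's API.

# Sketch of versioned snapshots with rollback and cloning for a flattened
# settings map. Hypothetical illustration; not mValent Integrity's API.
import copy
import time

class ConfigStore:
    def __init__(self):
        self.current = {}      # live settings: {"file/section/key": value}
        self.snapshots = []    # history: list of (timestamp, settings) pairs

    def set(self, prop, value):
        self.current[prop] = value

    def snapshot(self):
        """Record the current settings as a new version."""
        self.snapshots.append((time.time(), copy.deepcopy(self.current)))

    def rollback(self, version=-1):
        """Restore ('reinstall') a previously recorded snapshot."""
        _, settings = self.snapshots[version]
        self.current = copy.deepcopy(settings)

    def clone(self):
        """Copy the whole configuration for use in a new environment."""
        twin = ConfigStore()
        twin.current = copy.deepcopy(self.current)
        return twin

Because each snapshot is a complete copy of the settings map, rolling back is a single assignment rather than a hunt through dozens of files.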

After adding up the time involved and applying an average IT salary (from salary.com), the mValent survey estimates that each troubleshooting incident costs over $1,250 in direct labor, but that's a fraction of the total cost to a company. There's also damage to the company's reputation, which may be hard to put a dollar value on.
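As a rough illustration of that arithmetic (the survey does not publish its exact inputs, so these figures are assumptions): if troubleshooting an incident ties up staff for a combined 16 hours, consistent with the "more than one day" figure cited above, at a fully loaded rate of about $78 per hour, the direct labor cost is 16 × $78 = $1,248, or roughly $1,250.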

Given the costs, it's no wonder Hickey says IT must focus on keeping applications and systems up and running, which means, in part, that the environment must be managed carefully. In a complex IT environment, it may all come down to the people and the administration tools they have. The survey found that most Fortune 1000 companies employ more than 11 people to manage their infrastructure (everything from application servers to databases and operating systems) but notes that most organizations believe it's an "unwieldy process."

About the Author

James E. Powell is the former editorial director of Enterprise Strategies (esj.com).
