Q&A: Predictive Fault Management

IT no longer has the luxury of reacting to problems; data centers must now anticipate problems to correct them before they interrupt critical business applications.

To learn more about an emerging technology -- predictive fault management -- we spoke to Oren Teich, vice president of product management at Replicate Technologies. Its Replicate Data Center Analyzer is one of many solutions designed to help data centers get a handle on -- and prevent -- such problems.

Enterprise Systems: What is predictive fault management?

Oren Teich: Traditional data centers rely on closely monitoring every possible aspect of their infrastructure, which adds to the complexity and cost of maintenance. When faults are found, it's an “all hands on deck” fire drill to fix issues as quickly as possible.

Predictive fault management turns this model on its head by leveraging virtualization to enable an entirely new mode of operation. By looking at the entire data center holistically, predictive fault management makes it possible to identify issues in advance of downtime by pointing out where downtime may occur due to migrations or software or hardware failures. This enables data center administrators to work proactively instead of constantly working in fire fighting mode.

What is your definition of a unified data center? Why is there a need for predictive fault management in the unified data center?

The unified data center refers to all the components of the data center taken together. A typical data center today includes networks, storage, and servers, plus virtualization running on those servers with its own network, storage, and server settings. Treating this entire collection holistically is what we mean by the unified data center.

Data centers are becoming more and more complex by the day. As people drive higher and higher consolidation ratios, and IT continues to realize the benefits of virtualization and live migration, the cost of errors is increasing exponentially. A single fault can now cause downtime for tens, if not hundreds, of applications. This new layer of complexity, and the ensuing costs, demands a new mechanism of data center management.

What issues or problems does predictive fault management solve?

Predictive fault management addresses four areas, each of which presents its own issues: resilience, connectivity, security, and capacity. Let me explain each one.

Traditionally, the only way to know if your failover policy is implemented correctly is to test it -- pull a cable or shut off a machine. With predictive fault management, it's possible to know in advance whether your resiliency policy is configured correctly, before a real failure puts it to the test.
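
To make that concrete, here is a minimal sketch of the kind of check a predictive system might run instead of a live pull-the-cable test. The inventory layout, host names, and field names are hypothetical, not RDA's actual data model:

```python
# Hypothetical inventory: each host's management/heartbeat network uplinks.
# A resilient failover setup needs at least two physical uplinks per host,
# ideally terminating on different physical switches.
HOSTS = {
    "esx01": {"mgmt_uplinks": [("vmnic0", "switchA"), ("vmnic4", "switchB")]},
    "esx02": {"mgmt_uplinks": [("vmnic0", "switchA")]},  # single point of failure
}

def check_failover_readiness(hosts):
    """Flag hosts whose failover heartbeat path lacks redundancy."""
    findings = []
    for name, cfg in hosts.items():
        uplinks = cfg["mgmt_uplinks"]
        if len(uplinks) < 2:
            findings.append(f"{name}: only {len(uplinks)} management uplink(s); "
                            "one cable or NIC failure isolates this host")
        elif len({switch for _, switch in uplinks}) < 2:
            findings.append(f"{name}: all management uplinks share one physical "
                            "switch; that switch is a single point of failure")
    return findings

for finding in check_failover_readiness(HOSTS):
    print("RESILIENCE:", finding)
```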

When it comes to connectivity, you must realize that as machines migrate and move, latent issues can cause virtual machines to drop off the network or have intermittent connectivity. This causes applications to fail. Predictive fault management enables administrators to know in advance if any issues are going to cause application downtime.
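
As a rough illustration of such a connectivity check (again with invented names and data): before a VM can safely migrate, every host it might land on must offer the virtual network the VM is attached to.

```python
# Hypothetical model: which virtual networks (port groups) each host offers,
# and which network each VM is attached to.
HOST_NETWORKS = {
    "esx01": {"Prod-VLAN10", "Backup-VLAN20"},
    "esx02": {"Prod-VLAN10"},               # missing Backup-VLAN20
    "esx03": {"Prod-VLAN10", "Backup-VLAN20"},
}
VM_NETWORKS = {"app-server-1": "Prod-VLAN10", "backup-proxy": "Backup-VLAN20"}

def predict_connectivity_faults(host_networks, vm_networks):
    """For each VM, list migration targets where it would drop off the network."""
    faults = []
    for vm, net in vm_networks.items():
        bad_hosts = [h for h, nets in host_networks.items() if net not in nets]
        if bad_hosts:
            faults.append(f"{vm}: network '{net}' missing on {', '.join(bad_hosts)}; "
                          "a migration there silently disconnects the VM")
    return faults

for fault in predict_connectivity_faults(HOST_NETWORKS, VM_NETWORKS):
    print("CONNECTIVITY:", fault)
```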

Security is always in flux, and virtualization opens data centers to a host of new security issues. As just one example, migrating a virtual machine moves its memory state from one machine to another. That memory state includes unencrypted passwords and other sensitive information. A misconfiguration in the data center can expose this information, opening the door to serious security risks.
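
Because that migration stream travels in the clear, one common best-practice check is that the migration network sits on its own isolated VLAN carrying no guest traffic. A minimal sketch of that check, with hypothetical configuration data:

```python
# Hypothetical per-host network configuration. Because live migration sends
# memory state unencrypted, the migration network should sit on a dedicated
# VLAN that carries no guest VM traffic.
HOST_VMKERNEL = {
    "esx01": {"migration_vlan": 50, "vm_traffic_vlans": {10, 20}},
    "esx02": {"migration_vlan": 10, "vm_traffic_vlans": {10, 20}},  # shared!
}

def check_migration_isolation(hosts):
    """Flag hosts whose migration VLAN is shared with guest VM traffic."""
    findings = []
    for name, cfg in hosts.items():
        if cfg["migration_vlan"] in cfg["vm_traffic_vlans"]:
            findings.append(
                f"{name}: migration traffic shares VLAN {cfg['migration_vlan']} "
                "with guest VMs; unencrypted memory state is exposed to guests")
    return findings

for finding in check_migration_isolation(HOST_VMKERNEL):
    print("SECURITY:", finding)
```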

Finally, data center managers always have to think ahead. They must keep asking: does our data center have the capacity to support our needs? Are users experiencing varying performance as machines migrate around? Predictive fault management enables administrators to identify uneven capacity across the unified data center.
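
A simple form of that forward-looking question is the classic N+1 check: if the largest host fails, can the survivors absorb its load? A sketch with made-up capacity figures:

```python
# Hypothetical capacity figures in GB of RAM (the same logic applies to CPU).
HOST_RAM = {"esx01": 128, "esx02": 128, "esx03": 64}
VM_RAM_DEMAND = 240  # total RAM currently committed to running VMs

def n_plus_one_headroom(host_ram, demand):
    """Spare capacity after the single largest host fails (negative = shortfall)."""
    surviving = sum(host_ram.values()) - max(host_ram.values())
    return surviving - demand

headroom = n_plus_one_headroom(HOST_RAM, VM_RAM_DEMAND)
if headroom < 0:
    print(f"CAPACITY: {-headroom} GB short of surviving a largest-host failure")
else:
    print(f"CAPACITY: {headroom} GB of headroom after a largest-host failure")
```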

In addition to solving these problems, are there additional benefits that predictive fault management brings that an IT shop may not expect?

This is an important point because there are other benefits. One of the biggest is helping data center administrators with the complex requirements of their job. Data centers require collaboration among many teams -- security, network, server, storage, and so on. Predictive fault management gives administrators guidance and help in managing the complexity.

How is predictive fault management different from other data center management initiatives?

Predictive fault management leverages virtualization to enable an entirely new mode of operation. Taking advantage of the flexible nature of a virtualized data center, a predictive fault management system is able to instrument and analyze data centers in much deeper detail than traditional management products. Using this deep information, a predictive fault management product is able to look at the entire data center holistically -- across storage, network, and servers -- to identify issues in advance of downtime.

What virtualization issues does predictive fault management solve?

As IT drives higher consolidation ratios using virtualization and realizes the benefits of live migration, the number and cost of errors increase exponentially. A single fault can now cause downtime for tens, if not hundreds, of applications. Predictive fault management addresses this complexity, and the ensuing costs, by analyzing the entire unified data center. The majority of downtime in data centers is due to configuration issues. Predictive fault management continuously analyzes the configuration and as-deployed environment to identify potential issues before they cause downtime.
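
In spirit, continuous configuration analysis is a rule engine re-run against a fresh discovery snapshot. The toy example below illustrates the idea; the rules and data structures are invented for the example, not RDA internals:

```python
def discover():
    """Stand-in for discovery: return a snapshot of the as-deployed environment."""
    return {"datastores": {"ds1": {"paths": 2}, "ds2": {"paths": 1}}}

def rule_storage_multipath(snapshot):
    """Flag datastores reachable over a single storage path."""
    for name, ds in snapshot["datastores"].items():
        if ds["paths"] < 2:
            yield f"datastore {name} has one storage path; a path failure means downtime"

RULES = [rule_storage_multipath]

def analyze():
    """One analysis pass; a real system would re-run this on a schedule."""
    snapshot = discover()
    for rule in RULES:
        for finding in rule(snapshot):
            print("PREDICTED FAULT:", finding)

analyze()
```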

What are some best practices IT can follow to better manage their unified data centers and avoid data center configuration errors?

Communicate, communicate, communicate! Virtualization introduces significant new technology layers, such as virtual switches and virtual storage. These technologies are now in the hands of administrators without the deep domain knowledge that traditional SAN, network, or server administrators have spent years developing. It's more critical than ever with virtualization that the SAN, security, network, and server teams all communicate regularly and clearly.

Using products that enable a clear discovery and view into the unified data center is a key step. Once everyone knows what's going on, it's much easier to talk across teams and quickly identify and resolve any issues that may arise.

What are the technical requirements for deploying a predictive fault management solution within the unified data center?

The beauty of solutions such as RDA is that they solve the very challenges virtualization brings by using virtualization itself. RDA requires nothing beyond a typical virtualized environment. Delivered as two small virtual appliances, it can be deployed quickly and simply as software across the entire infrastructure.

What does your company’s RDA product do, and how does it work?

Replicate Data Center Analyzer (RDA) builds a comprehensive model of a virtualized data center through a unique combination of discovery mechanisms. Combining empirical data from Replicate’s virtual appliance probes with configuration information drawn from existing system management tools, RDA constructs a unified view of the data center across storage, network, and servers. RDA analyzes the integrity of the unified data center using industry best practices and technology dependencies supplied by Replicate in the form of knowledge modules. RDA’s combination of discovery and knowledge-driven integrity analysis provides both predictive fault identification and specific resolution guidance for existing and predicted faults -- eliminating configuration errors in the virtualized data center.
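
One way to picture a knowledge module is as a named check bundled with its remediation guidance, evaluated against the discovered model. The structure below is a hypothetical illustration of that idea, not Replicate's actual format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class KnowledgeModule:
    """A best-practice rule: a check over the discovered model plus fix guidance."""
    name: str
    check: Callable[[dict], List[str]]  # returns descriptions of violations
    remediation: str

def promiscuous_portgroups(model):
    """Return port groups with promiscuous mode enabled (rarely needed, risky)."""
    return [pg for pg, cfg in model["portgroups"].items() if cfg["promiscuous"]]

MODULES = [
    KnowledgeModule(
        name="Port group promiscuous mode",
        check=promiscuous_portgroups,
        remediation="Disable promiscuous mode unless this port group needs packet capture.",
    ),
]

model = {"portgroups": {"Prod-VLAN10": {"promiscuous": False},
                        "Test-VLAN30": {"promiscuous": True}}}
for mod in MODULES:
    for violation in mod.check(model):
        print(f"[{mod.name}] {violation}: {mod.remediation}")
```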

Can you give us an example of a successful predictive fault management/RDA installation and how customers benefited?

A scenario we've seen repeated over and over involves a VMware administrator either setting up or expanding a virtualization footprint. They're bringing on new physical hardware: servers, storage, and parts of the network. When they start using the new infrastructure, they find that critical features such as VMotion aren't working, or that virtual machines are disappearing seemingly at random. Instead of troubleshooting after these issues appear, our customers can use RDA to find out in advance that, for example, their physical switches aren't configured to pass the right traffic down to the new servers. By identifying these errors and warning the user, RDA prevents virtual machine downtime.
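
The switch scenario Teich describes boils down to comparing the VLANs a host's virtual machines need against the VLANs actually trunked on the upstream switch port. A hedged sketch, with invented names and data:

```python
# Hypothetical discovered state: VLANs required by each host's port groups,
# versus VLANs actually trunked on the physical switch port feeding that host.
HOST_REQUIRED_VLANS = {"esx-new-01": {10, 20, 50}, "esx-new-02": {10, 20, 50}}
SWITCH_TRUNKED_VLANS = {"esx-new-01": {10, 20, 50}, "esx-new-02": {10, 20}}

def find_trunk_gaps(required, trunked):
    """Report VLANs a host's VMs need that the physical switch will drop."""
    gaps = {}
    for host, vlans in required.items():
        missing = vlans - trunked.get(host, set())
        if missing:
            gaps[host] = sorted(missing)
    return gaps

for host, vlans in find_trunk_gaps(HOST_REQUIRED_VLANS, SWITCH_TRUNKED_VLANS).items():
    print(f"SWITCH CONFIG: {host} missing VLAN(s) {vlans}; "
          "VMs on those VLANs will drop off the network after migration")
```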