Disaster Antidote: Getting Control of Data Center Power

By Jeffrey S. Klaus

The costs of a data center power outage can be crippling. For a large enterprise, where daily transactions add up to millions of dollars in revenue, the costs of a complete shutdown can exceed $100M in less than one week.

Data center managers and facilities teams aim to avert these losses with the best possible disaster recovery plans. Top priority is typically placed on minimizing the reaction times and achieving near-instant switchover to "hot standby" systems.

Is this enough? Are data center managers able to achieve optimal power utilization in the event of an unexpected power outage at the primary data center? What about managing power under normal conditions? Can power be effectively monitored and controlled to avoid equipment-damaging spikes and outages?

Clearly there are many motivations for getting better control of data center power, not the least of which is cost control as energy prices continue to skyrocket. Server sprawl has made the data center energy bill one of the fastest-increasing components of operational costs. In the past, however, traditional power management has failed to get power under control, and data center managers routinely over-budget power and cooling to accommodate worst-case scenarios and avoid surges that could put assets or services at risk.

Getting an Accurate Baseline

Data center technology providers are stepping up to meet these needs with a variety of new tools. Most of these let IT managers examine the temperature at the air-conditioning units, and perhaps the power consumption for each rack in the data center. However, they lack visibility at the individual server level, and typically base their calculations on modeled or estimated data that can deviate from actual consumption by as much as 40 percent.

Alternatively, data center managers can leverage a new type of holistic energy and cooling management solution that offers more fine-grained levels of monitoring. The latest innovations in this area focus on server inlet temperatures. Middleware vendors aggregate server inlet temperatures as well as real-time power consumption characteristics for servers, blades, power-distribution units (PDUs), uninterruptible power supplies (UPSes) and other data center equipment. The aggregated thermal and power data, combined with return-air temperature at the air-conditioning units, feeds into thermal and energy maps of the data center.

Compared to previous power management approaches based on modeling and estimates, aggregated fine-grained data gathered from actual power usage yields a far more accurate view of the data center, and it can be used to analyze power behavior by rack, row, or room. Logging and evaluating ongoing usage for high-priority groups of resources also leaves data center managers much better prepared to allocate available power in the event of a partial power outage, equipment failure, or even full-scale disaster.
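To make the rack, row, and room roll-up concrete, here is a minimal Python sketch. It assumes each measurement arrives as a simple record carrying its location and measured watts; the Reading type, its field names, and the aggregate_power function are hypothetical names chosen for illustration, not part of any vendor's product.

```python
from collections import defaultdict
from typing import NamedTuple


class Reading(NamedTuple):
    """One measured (not modeled) power sample. Hypothetical record layout."""
    room: str
    row: str
    rack: str
    server: str
    watts: float


def aggregate_power(readings):
    """Sum measured watts at room, row, and rack level."""
    totals = {
        "room": defaultdict(float),
        "row": defaultdict(float),
        "rack": defaultdict(float),
    }
    for r in readings:
        totals["room"][r.room] += r.watts
        # Row and rack keys include their parents so names can repeat per room.
        totals["row"][(r.room, r.row)] += r.watts
        totals["rack"][(r.room, r.row, r.rack)] += r.watts
    return totals
```

Keying rows and racks by their full location is a small design choice that avoids collisions when two rooms reuse the same rack labels.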

Averting Many Problems and Mitigating the Rest

Thermal maps can highlight hot spots as a proactive measure for circumventing potential server or computer room air handler (CRAH) failures. Early identification of hot spots, before they reach critical levels, can minimize the impact on equipment and user services, and enable preventive measures to be taken.
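As a rough illustration of early hot-spot identification, the sketch below flags server inlet temperatures that cross a warning or critical threshold. The function name, the dictionary layout, and the default thresholds of 27 and 32 degrees Celsius are assumptions made for the example; an actual deployment would take its limits from equipment specifications and site policy.

```python
def find_hot_spots(inlet_temps_c, warn_c=27.0, critical_c=32.0):
    """Return (location, temp_c, level) for readings at or above warn_c.

    inlet_temps_c maps a location label to a measured inlet temperature
    in degrees Celsius. Thresholds are illustrative defaults only.
    Results are sorted hottest first so the worst spots surface at the top.
    """
    flagged = []
    for location, temp in inlet_temps_c.items():
        if temp >= critical_c:
            flagged.append((location, temp, "critical"))
        elif temp >= warn_c:
            flagged.append((location, temp, "warning"))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

The warning band is what makes the measure proactive: a reading can be acted on while it is still below the critical level.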

A crude preventive approach would be to simply cap the power consumption of every server or group of servers. However, because performance is directly tied to power, a more intelligent energy management solution dynamically balances power and performance in accordance with the priorities set by the particular business.

Fine-tuning power and server performance requires accurate continuous monitoring of actual power consumption, and the ability to dynamically adjust CPU operating frequencies. This calls for a tightly integrated solution that can interact with the operating system or hypervisor based on threshold alerts. With this level of control, power management solutions can optimally balance server performance and power to avoid dangerous power spikes.
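One simple way to picture threshold-driven frequency adjustment is a step controller: when measured power exceeds the cap, drop to the next lower CPU frequency step; when there is comfortable headroom, step back up. The sketch below is only that picture in code. The function, its parameters, and the 20-watt headroom are hypothetical, and it deliberately leaves out how a real solution would talk to the operating system or hypervisor.

```python
def next_frequency_index(measured_watts, cap_watts, current_index,
                         num_steps, headroom_watts=20.0):
    """Pick the next CPU frequency step (0 = slowest, num_steps - 1 = fastest).

    Steps down one level when measured power is over the cap, steps up one
    level when power is below the cap by more than headroom_watts, and
    otherwise holds. The headroom band avoids flapping between two steps.
    """
    if measured_watts > cap_watts and current_index > 0:
        return current_index - 1
    if (measured_watts < cap_watts - headroom_watts
            and current_index < num_steps - 1):
        return current_index + 1
    return current_index
```

Calling this once per monitoring interval with the latest measured draw gives a basic closed loop; the accuracy of the measurement is what keeps the loop from throttling more than necessary.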

Temperature and power monitoring provides a knowledge base that can be leveraged during a disaster or outage. Armed with accurate power characteristics, data center teams can allocate power and introduce power capping and throttling based on the needs of the high-priority business applications. Lower-priority applications can be disabled temporarily or assigned to servers configured to operate in lower-performance, power-conserving mode.
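The allocation logic described here can be sketched as a greedy pass over workloads in priority order. In this illustration each workload carries two measured figures, its draw at full performance and its draw in power-conserving mode; the Workload type, its fields, and allocate_power are hypothetical names, and a greedy pass is just one reasonable policy among several.

```python
from typing import NamedTuple


class Workload(NamedTuple):
    """Hypothetical record for a group of servers running one application."""
    name: str
    priority: int        # 1 = most important to the business
    full_watts: float    # measured draw at full performance
    capped_watts: float  # measured draw in power-conserving mode


def allocate_power(workloads, budget_watts):
    """Assign each workload a (mode, watts) pair within budget_watts.

    Walks workloads from highest to lowest priority. Each one gets full
    power if it fits in what remains, capped power if only that fits,
    and is otherwise temporarily disabled.
    """
    remaining = budget_watts
    plan = {}
    for w in sorted(workloads, key=lambda item: item.priority):
        if w.full_watts <= remaining:
            plan[w.name] = ("full", w.full_watts)
            remaining -= w.full_watts
        elif w.capped_watts <= remaining:
            plan[w.name] = ("capped", w.capped_watts)
            remaining -= w.capped_watts
        else:
            plan[w.name] = ("disabled", 0.0)
    return plan
```

Because the inputs are measured rather than estimated draws, the plan can run much closer to the restricted budget than a worst-case model would allow.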

Power management best practices also allow data center architects to calculate and configure rack densities that will stay within the power envelopes for normal or restricted levels of operation. Accurate insights into power characteristics can also drive up efficiencies that help extend the life of UPSes during power outages, as measured during proof-of-concept testing of power management solutions in data centers.
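A back-of-the-envelope version of the rack density calculation looks like the following sketch. It assumes a per-rack power envelope, a measured peak draw per server, and a safety reserve; the function name and the 10 percent reserve are illustrative assumptions rather than recommended values.

```python
import math


def servers_per_rack(rack_envelope_watts, server_peak_watts,
                     reserve_fraction=0.1):
    """Return how many servers fit in a rack's power envelope.

    reserve_fraction holds back part of the envelope as a safety margin.
    server_peak_watts should be the measured peak (or the capped draw when
    planning for restricted operation), not a nameplate estimate.
    """
    if server_peak_watts <= 0:
        raise ValueError("server_peak_watts must be positive")
    usable_watts = rack_envelope_watts * (1.0 - reserve_fraction)
    return math.floor(usable_watts / server_peak_watts)
```

Running the same function twice, once with the normal envelope and once with the restricted envelope and capped server draw, shows how many servers per rack can stay online in each mode.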

The Prognosis: Long-Term Data Center Health

Consumption monitoring and management helps operators better define and adjust overall policies for data center power. In real-world use cases we've analyzed, intelligent energy management solutions are identifying opportunities for reducing energy waste by 20 to 40 percent. We have observed, for example, that approximately 10 to 15 percent of all data center servers are idle, yet a typical idle server still draws about 400 watts, for an annual cost of $800 or more per server. Reducing this type of waste extends operation during times of restricted power, and yields significant ongoing reductions in operating cost.
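The arithmetic behind the idle-server figure is easy to check. A server drawing 400 watts around the clock consumes 400 x 8,760 hours, or 3,504 kWh, per year. The article does not state which energy rate underlies the $800 figure, so the price per kWh and the facility overhead multiplier (PUE) in the sketch below are assumptions picked purely for illustration.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def annual_kwh(watts):
    """Energy used in one year by a device drawing a constant wattage."""
    return watts * HOURS_PER_YEAR / 1000


def annual_cost(watts, price_per_kwh, pue=1.0):
    """Yearly cost in dollars, scaled by facility overhead.

    price_per_kwh and pue are assumptions supplied by the caller; pue
    multiplies IT energy to account for cooling and distribution losses.
    """
    return annual_kwh(watts) * pue * price_per_kwh


idle_kwh = annual_kwh(400)                               # 3504.0 kWh
idle_cost = annual_cost(400, price_per_kwh=0.115, pue=2.0)  # assumed inputs
```

With the assumed inputs of $0.115 per kWh and a PUE of 2.0, the result comes to roughly $806 per idle server per year, which is in line with the figure above; a site with cheaper power or lower overhead would see a smaller number.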

Some companies are also taking advantage of power management solutions to introduce energy metering and cost chargebacks, and thereby motivate conservation. Others are using real-time data to identify opportunities to replace expensive intelligent power strips with lower-cost alternatives. In a large site, a $400/strip reduction adds up to big savings.

An intelligent power management solution can also contribute to more efficient facility designs for cooling and air-flow, and can help data center managers make the best possible use of existing floor space and accurately forecast future needs based on expected company growth.

As perhaps the most critical business resource, power will always need to be stringently managed. As a consequence, best practices should continually evolve to take advantage of the latest technology advances. Luckily, the return on investment of industry-leading power management technology puts it within reach of today's budget-sensitive data centers.

Jeffrey S. Klaus is the director of Data Center Manager (DCM) solutions at Intel Corporation, where his team is pioneering power and thermal management middleware. You can contact the author at DCM.Sales@intel.com.