Disaster Recovery: Balancing Performance and Availability
Financial services giant finds happy medium between performance and availability with Big Blue’s new DR service for zSeries
In business continuity planning (BCP) and disaster recovery (DR), there’s often a necessary trade-off between performance and availability. An emphasis on the former almost always comes at the expense of the latter, and vice versa.
Until recently, this was a trade-off that Mia Arter, assistant director of IT with The Principal Financial Group, was forced to accept.
Arter was dissatisfied with The Principal’s existing DR solution—which was outsourced to a prominent DR services provider—but, for lack of suitable technology, felt that she had nowhere else to turn. That changed earlier this year, however, when IBM announced a new eXtended Remote Copy (XRC) feature for its Geographically Dispersed Parallel Sysplex (GDPS) service for zSeries. Thanks to Big Blue’s new DR offering, Arter and The Principal have been able to find a happy medium between the two extremes.
“Our solution was with a third-party vendor, and we had limited connectivity from our non-corporate offices to that location, but our recovery was about four days, and it was strictly ship tapes [to an off-site recovery center] and restore from there,” she explains. IBM offered a DR solution for Principal’s zSeries environment, Arter confirms, but it was overkill. “The recovery solution they first provided, that exceeded our business requirements and actually would have impacted our production time, which was not acceptable to us.”
There’s good reason for that, says John Sing, a business continuance strategist with IBM’s Systems Group. IBM’s GDPS/PPRC (Geographically Dispersed Parallel Sysplex/Peer-to-Peer Remote Copy) service was designed for full data mirroring within a metropolitan area (a 100 km radius). The upside to this approach, says Sing, is that recovery time is minimized. The downside, for many customers, is that performance takes a hit. “This can provide a faster time to restoration than other approaches, and that’s because it has functionality that [is] only possible when you’re keeping everything in lock-step,” Sing explains. “So, today, it still has to be able to be in the 100 km range, and I have to have the capability to accept the performance overhead.”
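The metropolitan distance limit follows from the lock-step behavior Sing describes: in a synchronous mirror, a write is not complete until the remote copy has acknowledged it. As a rough sketch only, assuming signals travel through optical fibre at about 200,000 km per second (roughly 5 microseconds per kilometre) and ignoring switch, protocol, and storage-controller delays, the extra propagation delay per mirrored write can be estimated as follows. The function name and the constant are illustrative assumptions, not IBM figures.

```python
# Rough estimate of the propagation delay that synchronous mirroring adds to a write.
# Assumption: signal speed in optical fibre is about 200,000 km/s (~5 microseconds per km).
FIBRE_SPEED_KM_PER_S = 200_000.0


def round_trip_delay_ms(distance_km: float, round_trips: int = 1) -> float:
    """Propagation delay in milliseconds for `round_trips` exchanges over `distance_km` of fibre."""
    one_way_s = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_s * round_trips * 1000.0


if __name__ == "__main__":
    # Compare a campus-scale link, the metropolitan limit, and a long-haul link.
    for km in (30, 100, 1000):
        print(f"{km:>5} km: about {round_trip_delay_ms(km):.2f} ms added per mirrored write")
```

Under these assumptions the delay grows linearly with distance, which is why a design that waits on every write becomes harder to tolerate as the sites move farther apart.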
Unlike some Wall Street brokerage firms, The Principal doesn’t require near-continuous availability for its zSeries applications, which include a high-volume 401(k) system, along with separate pension, life, and CRM systems. Nevertheless, Arter says her zSeries mainframes typically handle 8.5 million transactions per day, and that customer-initiated transactions (over the Internet) account for approximately one quarter of the 401(k) system’s activity. With that kind of load, four days of downtime, the recovery time objective (RTO) guaranteed by The Principal’s previous DR solution, was unacceptable.
“What we were looking at was meeting our customer needs, and the solution from a third-party vendor could not do that without significant increase in expenses, and bandwidth would have been a huge contributor,” she explains.
Enter GDPS/XRC, which IBM announced earlier this year. Big Blue positions GDPS/XRC as a more scalable, albeit less available, alternative to GDPS/PPRC. “When the primary criteria has to do with extreme amounts of scalability and maintaining extremely high levels of performance at the same time, that is where the global mirror [XRC] version tends to apply, especially when we’re talking over 8 million transactions per day,” Sing explains.
For many applications, GDPS/XRC boasts an RTO of one to two hours, with a recovery point objective (RPO) of less than two minutes. (For the record, IBM promises an RTO of “less than one hour” with GDPS/PPRC.) GDPS/XRC also isn’t constrained by the distance limitations imposed by Big Blue’s metropolitan GDPS offering, says Sing, so that it can effectively span continents.
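To put the RPO figure in perspective, a back-of-the-envelope calculation helps. The sketch below assumes that the 8.5 million daily transactions Arter cites are spread evenly across 24 hours, which real workloads are not, so the result is only an order-of-magnitude illustration. The function name is hypothetical.

```python
# Back-of-the-envelope estimate of how many transactions fall inside an RPO window.
# Assumption: transactions arrive at a uniform rate over a 24-hour day.
SECONDS_PER_DAY = 24 * 60 * 60


def transactions_at_risk(tx_per_day: float, rpo_seconds: float) -> float:
    """Transactions that would arrive during an RPO window at a uniform rate."""
    return tx_per_day / SECONDS_PER_DAY * rpo_seconds


if __name__ == "__main__":
    # 8.5 million transactions per day, two-minute RPO.
    at_risk = transactions_at_risk(8_500_000, 120)
    print(f"about {at_risk:,.0f} transactions within a two-minute window")
```

Because daytime peaks are likely to be well above the daily average, the exposure during business hours would be higher than this uniform estimate suggests.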
For Arter and The Principal, IBM’s new DR offering seemed heaven-sent. “Our cost-benefits study showed that if we continued with the third-party vendor and compared that to if we were our own provider, we could do it cheaper,” she says. “After we went through the matrix of requirements, IBM came out on top as the leading contender because of this [GDPS/XRC].”
That clinched it for The Principal, which tapped IBM to design its new DR infrastructure. The financial services giant purchased new zSeries mainframes and IBM TotalStorage Enterprise Storage Servers and designated its Des Moines, Iowa, campus as its primary site. From there, data is copied over fibre to a secondary zSeries mainframe at a recovery site some 30 km away.
What’s more, The Principal also maintains a tertiary zSeries system in the same facility. “The secondary and tertiary [systems] are part of the GDPS/XRC solution,” explains Arter. “We mirror to the secondary set at that alternate location, and at the time of the disaster we keep all of that data consistent on the secondary site, and we flash copy the data from the secondary to the tertiary [system], and recover actually using that tertiary data.”
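The sequence Arter describes (mirror to the secondary, flash copy to the tertiary, recover from the tertiary) can be pictured with a toy model. This is only an illustration of the order of the steps: every class and function name here is hypothetical, and none of it reflects actual GDPS, XRC, or FlashCopy interfaces.

```python
# Toy model of the three copies Arter describes. All names are hypothetical;
# this illustrates the order of the steps and is not GDPS code.
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    records: dict = field(default_factory=dict)


def mirror(primary: Volume, secondary: Volume) -> None:
    """Normal operation: propagate primary updates to the secondary copy."""
    secondary.records.update(primary.records)


def flash_copy(source: Volume, target: Volume) -> None:
    """At disaster time: take a point-in-time copy of the source onto the target."""
    target.records = dict(source.records)


def recover(tertiary: Volume) -> dict:
    """Resume work against the tertiary copy's data."""
    return tertiary.records


if __name__ == "__main__":
    primary, secondary, tertiary = Volume("primary"), Volume("secondary"), Volume("tertiary")
    primary.records["acct-1"] = 100
    mirror(primary, secondary)        # steady-state mirroring to the alternate site
    flash_copy(secondary, tertiary)   # point-in-time copy when a disaster is declared
    live = recover(tertiary)          # recovery runs on the tertiary data
    live["acct-1"] = 150              # post-recovery work changes only the tertiary copy
    print(secondary.records, tertiary.records)
```

In this toy model, work done after recovery changes only the tertiary copy, while the secondary remains as it was at the moment of the flash copy.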
Mainframe capacity can be prohibitively expensive on one system, let alone three, but The Principal exploits IBM’s Capacity Backup on Demand to help minimize these costs, says Arter: “We utilize a capability called Capacity Backup on Demand, so we only have minimal engine activity, and that controls the mirroring of the data from the primary data center to the secondary.”
What kind of investment did this require? A sizeable one, she grants. “We had to add the secondary and tertiary [mainframes], and we essentially tripled the size of our mainframe disk farm.”
But the investment was more than worth the price in peace of mind alone, Arter argues. “We have 4,000 people performing daily operations and administration in offices throughout the country. They are reliant on the infrastructure provided by the home office, so it’s critical to them that they can recover in 24 hours. This lets us do that.”
IBM’s Sing says that the new GDPS/XRC offering addresses the requirements of companies (like The Principal) that don’t require near-continuous availability, and which are ill-served by other approaches to DR, such as the outsourced recovery center. “In the case of Principal, the business trade-off is that they don’t particularly have a requirement like a national bank might, like in the European Union where you have some of the organizations flowing money between the 14 and 15 nations, so they don’t need that kind of availability,” he explains.
Stephen Swoyer is a Nashville, TN-based freelance journalist who writes about technology.