Q&A: Maximizing System Uptime
How to avoid the top mistakes when managing uptime performance, and how you can maximize system availability.
Continuity planning is a tricky business; there are dozens of interconnected systems that must be included in any recovery, up-to-date data to restore, and crushing time pressures to consider. To learn more about the top mistakes IT makes when creating a continuity plan, we spoke with Chris McAllister, senior director of product management for GoldenGate Software. Chris has over 25 years of experience designing and developing large-scale OLTP applications for the NonStop environment, including travel reservation, customer data warehouse, retail, banking, and ERP systems for Fortune 500 companies.
Enterprise Strategies: I understand you’ve created a list of the top mistakes IT makes when managing uptime performance. What’s at the top of your list?
Chris McAllister: I’d say the most common mistake is that IT doesn’t recognize that planned outages for upgrades, migrations, and system maintenance are a fact of life and can seriously impact stringent service-level agreements for system uptime. The IT group commits to a number of SLAs for mission-critical applications and systems, and those commitments often evolve and change over the lifecycle of the application. In many cases, changes are required by industry-imposed governance regulations.
Beyond planning for unplanned outages, the IT group needs proven, tested processes for all the planned outage windows used for regular system maintenance. This includes installing application fixes and patches, as well as upgrading to new versions and switching out hardware and operating systems to take advantage of new processing power.
Unfortunately, many organizations don’t factor in all these planned outage windows when they agree to stringent availability levels. Achieving 99.95 percent availability is a challenging endeavor for most organizations; in my experience, it requires a proven bi-directional replication solution that supports a wide range of database and hardware platforms. Ultimately, if the data is not available to business end users, they cannot perform any business transactions, and that translates to lost revenue and damaged credibility.
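To put that 99.95 percent figure in perspective, an availability target translates directly into a downtime budget that planned maintenance windows must fit inside. A minimal sketch of the arithmetic (the function name here is illustrative, not part of any product):

```python
def downtime_budget_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of allowed downtime over a period at a given availability target."""
    return (100.0 - availability_pct) / 100.0 * period_hours * 60.0

# 99.95 percent over a 365-day year leaves roughly 4.4 hours of total
# outage time, with planned maintenance windows counted against it.
yearly_minutes = downtime_budget_minutes(99.95, 365 * 24)
print(round(yearly_minutes / 60, 1))  # ~4.4 hours per year
```

A single weekend maintenance window can consume that entire annual budget, which is why planned outages matter as much as unplanned ones.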
What should IT do to maximize system availability?
There’s certainly increased pressure to get a return on investment as quickly as possible. To achieve the highest level of application and system availability, IT groups need to adopt different technology approaches to keep systems continually up and performing optimally. Gone are the days of weekend outages for system maintenance; most if not all corporations need to act globally. With the prevalence of online business, systems need to be always on; customers and partners need access around the clock, so outages of any kind can severely impact revenue, customer satisfaction, and overall brand credibility.
To maximize uptime, many IT groups are using (or considering) active-active or multi-master configurations across multiple data centers. This spreads transaction processing across multiple servers for higher performance, and if one system fails, the other takes the entire processing load and the business continues as usual. It also lets the IT group take full advantage of server processing power by distributing the data load across multiple systems.
Assuming they have factored in conflict detection and resolution with key infrastructure software, they can achieve a greater return on investment and a lower total cost of ownership. They also avoid the scenario of a data center sitting idle, waiting for an unexpected outage.
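Conflict detection and resolution is the crux of any multi-master design: the same row can be updated on two sites before either change replicates. A deliberately simplified last-write-wins sketch (a stand-in for what real replication products do, not GoldenGate's actual mechanism):

```python
from dataclasses import dataclass

@dataclass
class RowVersion:
    key: str
    value: str
    updated_at: float  # commit time at the originating site (epoch seconds)
    site: str          # originating site name, used only to break ties

def resolve(local: RowVersion, remote: RowVersion) -> RowVersion:
    """Last-write-wins resolution: the newer update survives.

    Ties are broken deterministically by site name so that every site,
    applying the same rule, converges on the same winner.
    """
    if remote.updated_at != local.updated_at:
        return remote if remote.updated_at > local.updated_at else local
    return remote if remote.site > local.site else local

# Two sites update the same customer row almost simultaneously:
east = RowVersion("cust:42", "gold", 1000.0, "east")
west = RowVersion("cust:42", "platinum", 1000.5, "west")
print(resolve(east, west).value)  # the later write, "platinum", wins
```

The key property is determinism: both sites must reach the same answer independently, or the copies drift apart.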
Until recently, many IT groups avoided a multi-master scenario because it was deemed too complex and error-prone. However, we have seen a significant increase in demand for this configuration because of the cost savings, and because it forces IT operations to work more closely with the application group. That closer collaboration often brings softer benefits, such as elevating IT awareness across the enterprise and up to the executive and board level.
One challenge of any recovery plan is making sure in advance that it will work.
You have to test and test again to assure the business that the systems work. In our experience, the most reliable data center or business continuity plan involves regular testing and maintenance on the standby systems.
In some situations, real-world testing takes place quarterly, or perhaps three times per year. During these tests, all users are pointed to the secondary (live standby) system to ensure the hardware and software perform optimally and can satisfy ongoing business processes as well as the transaction processing workload.
Regular testing and failover to secondary (DR) sites is important for the IT group’s peace of mind, but it’s also necessary if you want to build trust and confidence among business application end users. This is especially important in key industry sectors such as health care and financial services, where data loss or lengthy outages can mean significant revenue loss or, in the worst case, a life-and-death situation.
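One simple automation worth building into such drills is a gate that confirms the standby is actually accepting connections before any users are redirected to it. A bare-bones reachability check (host and port are placeholders; a real drill would also run application-level queries):

```python
import socket

def standby_ready(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True only if the standby accepts a TCP connection in time.

    This cheap check will not prove the application works, but it does
    catch a standby that is down outright before users are cut over.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running a check like this continuously, not just during drills, also surfaces a dead standby long before it is needed.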
Isn’t maintaining a secondary site expensive? Do you have any suggestions about how organizations can minimize their costs?
Yes, it is true that having a secondary data center located a significant distance away can be an expensive proposition for any business. In our experience, the larger, global organizations don’t have a choice in this matter and it’s really a necessary part of doing business. For mid-size companies, it may be a little more challenging, especially if they only have a few small office locations.
Having said that, there are alternative ways to achieve business continuity, and we have seen organizations outsource their DR site to a third-party vendor that specializes in this offering. In some cases, a business partner can co-locate in a secondary data center and even share capacity on a server. In either case, this is far more reliable than the old-fashioned method of tape or disk backup. Doing nothing at all, however, is far more detrimental to the overall health of the business.
We have seen some great examples of the outsourcing model. One in particular is a health-care provider that spun off its IT group into a for-profit center, which now offers turnkey data management services to smaller hospitals and institutions in the same geographic area at much more reasonable rates. This model is a win-win for the IT group turned service provider as well as for the smaller hospitals, which otherwise could not take full advantage of such state-of-the-art technology.
Additionally, when building a disaster tolerance or continuity plan, you don’t need to have the exact same environment on the standby system. You can absolutely leverage lower-cost hardware and databases, even open source, so the overall cost is far less prohibitive. The most important requirement in that scenario is a replication technology that supports a wide range of databases and platforms.
What other testing tips can you offer?
It’s all about the data and its integrity. At the end of the day, the most important asset for any organization is the data, which the business turns into information. Fast-running, optimal applications and hardware are extremely important, of course, but if there is any data loss or if integrity has been compromised, the business can face severe penalties.
Keeping primary production systems synchronized with backup or live standby systems is ever more important when considering business continuity plans. Testing and validating these systems to make sure the data is fully intact is a best practice IT groups should adopt. This validation can run as often as every hour, or daily, weekly, or monthly, depending on how critical the data is to the business. There are many proven technology solutions today that do this very successfully.
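To illustrate the idea, comparing per-table digests between the primary and standby copies catches silent divergence. A toy sketch using in-memory SQLite in place of real production databases (production tools compute checksums inside the database engine instead):

```python
import hashlib
import sqlite3

def table_digest(conn: sqlite3.Connection, table: str) -> str:
    """Hash every row of a table in a deterministic (primary-key) order.

    Matching digests on the primary and the standby mean the two copies
    of the table agree row for row.
    """
    h = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY 1"):
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

# Simulate a primary and an in-sync standby:
primary = sqlite3.connect(":memory:")
standby = sqlite3.connect(":memory:")
for db in (primary, standby):
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
    db.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
print(table_digest(primary, "accounts") == table_digest(standby, "accounts"))  # True
```

On large tables, real validation tools chunk the comparison by key range so a mismatch can be localized without rehashing everything.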
What’s the one thing IT most often overlooks when creating a continuity plan?
I think we have covered many of the areas that IT often overlooks, but I would like to add a few more. When an organization rolls out new applications that are critical for the business, it often overlooks the continuity aspects of that system or application. Very often the budget is approved and the business owner works with the IT development team to make sure the business processes are followed and supported with the right application functionality and workflow. However, the IT operations team needs to be involved very early in the process so that there is a full understanding of overall performance, capacity, and contingency for new mission-critical applications. All too often, this aspect comes up late in the project, and it can stall the project indefinitely, especially when more budget dollars are required.
Additionally, IT is often asked to do more with a smaller budget and fewer resources, and it seems to me that business users and executives don’t fully appreciate the cost and rigor involved in keeping systems continuously running. It usually takes a severe outage before they fully understand the ramifications, especially if bad PR results from such an outage. Often this is too late, and the business is damaged for a considerable length of time. Continuity plans are like insurance policies: you simply cannot live without them. In my opinion, the business and the IT group need to be in lock-step on this critical area for the continuity of business operations.
Speaking of solutions, what products or services does GoldenGate offer to help in continuity planning?
GoldenGate delivers several high availability and disaster tolerance solutions that provide continuous database availability, unlimited scalability, improved system performance, and complete data recovery, no matter how your business grows or what type of interruptions it faces.
During unplanned outages, the GoldenGate Live Standby solution provides continuous data availability for business-critical systems by capturing transactional data from the primary system and applying it to a standby database in real time. For planned outages such as maintenance, upgrades, and migrations, GoldenGate’s Zero Downtime solutions eliminate downtime entirely. The GoldenGate Active-Active solution enables continuous availability and improves performance through load sharing between two or more multi-master systems.