Tracing MVS COBOL: The Management of Application Failure Recovery in the Production Environment

Calculating the cost of downtime is often a frightening wake-up call for the business bottom line. When applications crash or fail to perform as expected, business managers are keenly interested in understanding the financial impact. Downtime costs can be staggering. For example, participants in a recent study of MVS COBOL application failure and recovery, conducted by Newport Group (Barnstable, Mass.), reported downtime costs as high as $500,000 per hour, with average costs of $74,200 per hour.

For the largest group of participants (43 percent), application failures numbered more than 25 within the last 12 months (see chart "Number of ABENDs Within the Past 12 Months"), or an average of slightly more than two application failures per month. The average recovery time for participants is five hours. Taking the most severe impact of failure from this survey data equates to an approximate annual high-end loss of $37.5 million and an overall average annual loss of $5.6 million. This level of impact on any bottom line is sure to grab the attention of business executives who, in turn, must push for more efficient, proactive solutions.
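The annual figures above are the product of downtime hours and hourly cost. As a rough check, the Python sketch below back-solves the annual downtime implied by the reported numbers. Both loss figures turn out to be consistent with roughly 75 hours of downtime per year. That 75-hour figure and the function names are inferred here for illustration; the survey summary does not state them.

```python
# Hourly downtime costs reported by survey participants (from the text).
HIGH_HOURLY_COST = 500_000
AVG_HOURLY_COST = 74_200

# Annual losses quoted in the text.
HIGH_ANNUAL_LOSS = 37_500_000
AVG_ANNUAL_LOSS = 5_600_000


def implied_downtime_hours(annual_loss_amount, hourly_cost):
    """Back-solve the annual downtime hours implied by a loss figure."""
    return annual_loss_amount / hourly_cost


def annual_loss(downtime_hours, hourly_cost):
    """Annual loss = hours of downtime per year x cost per hour."""
    return downtime_hours * hourly_cost


high_hours = implied_downtime_hours(HIGH_ANNUAL_LOSS, HIGH_HOURLY_COST)
avg_hours = implied_downtime_hours(AVG_ANNUAL_LOSS, AVG_HOURLY_COST)

print(f"Implied downtime (high-end): {high_hours:.1f} hours/year")
print(f"Implied downtime (average):  {avg_hours:.1f} hours/year")
# Assumed 75 hours/year, the value both figures point to.
print(f"Loss at 75 hours, average cost: ${annual_loss(75, AVG_HOURLY_COST):,.0f}")
```

Substituting an organization's own downtime hours and hourly cost into annual_loss gives a comparable estimate.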

Calculating accurate downtime costs is difficult. In doing so, businesses tend to combine lost productivity costs with costs associated with application recovery. However, estimating lost business or lost business opportunity is often a gray area, one that widens significantly for applications exposed to the Web. Even more difficult to measure is the negative effect application failures have on business user confidence and attitude. Today’s business users expect high availability and consistently reliable performance. Anything less leads to frustration and becomes a barrier to productivity. In the final analysis, the criticality of application downtime issues, and of the quest to implement solutions that reduce or eliminate downtime, rests solely on the impact that downtime has on the business bottom line.

The threat of increased downtime costs is magnified for applications with secure hooks to the Web. As businesses move to Web-enable existing legacy applications and/or leverage mainframe databases to underlie Web systems, companies can expect mainframe downtime costs to grow in correlation with increased application accessibility and use. What is important to recognize from recent research, therefore, is the frequency of failures in the less glamorous trenches of Corporate America’s legacy COBOL applications residing on MVS mainframe systems. Although the majority of failures in these environments fail to grab industry headlines, the frequency and cost of failure recovery represent a major challenge to IT departments and a major threat to the business bottom line of the organizations they support.

In defining the scope of application failure recovery issues within MVS COBOL environments, it is worth noting that an average large IT shop may have over 100 million lines of code and regularly release, incorporate and layer new code into existing applications. Time and cost constraints often leave little chance to re-architect applications from the beginning. As code is added, changed or updated, it may take weeks or even months to trigger a certain set of transactional conditions that can manifest as problems or failures. The challenge with this common practice centers on the effect application changes and/or additions will have on future transactions. Even top-tier IT departments underestimate the rippling effect that today’s changes can have on tomorrow’s application functionality.

Compounding this issue is the fact that the individuals who originally created these applications have long since moved on or even retired. In MVS COBOL environments in particular, companies struggle with shrinking staff resources and the lack of technical expertise that typically leads to a reliance on outsourcing maintenance responsibilities.

However, regardless of where the responsibility for application maintenance resides, tracing the sequence of programming statements leading up to an ABEND in a production environment remains a very time-consuming task. In fact, a common misconception is that the root cause of failures can be traced back in a production environment on an immediate basis. This is rarely the case. An MVS program dump only provides a call stack and does not contain statement sequencing information, yet the ability to trace back program statement sequencing is critical to the recovery process. Based on recent research, most IT shops resort to resolving failures and defects by taking the application out of production and checking code in the test environment. This practice places increased pressure on IT departments trying to squeeze a five- or six-hour remediation into a two- or three-hour window before resorting to a system rollback.

Survey Says…

In an effort to gain insight into how organizations respond to the frequency and types of MVS application failures and how these failure instances are managed, research was recently conducted by IDG Research Services Group on behalf of InCert Software Corporation (Cambridge, Mass.), a manufacturer of agent technology software for MVS COBOL application failure and recovery. Specifically, this study aimed to understand issues surrounding the frequency and types of failures companies experience, procedures used to discover root causes of failures and/or problems and tools currently utilized to analyze and recover from failures.

Seven thousand names were randomly selected from IT management databases or subscriber lists of IT publications. From these random names, the survey attracted 693 responses, of which 172 qualified responses were used to draw conclusions. To qualify, respondents had to have MVS systems installed, work on COBOL applications and be familiar with ABENDs or the recovery process. Final results represent a cross-section of industries and company sizes.

To give perspective, survey respondents represented companies with average annual sales of $4.4 billion and an average of 5,700 employees per organization. Primary business segments represented in the survey included insurance/real estate (16 percent), manufacturing (15 percent), communications/utilities/transportation (13 percent) and government (12 percent). Of the survey respondents, 53 percent reported having one MVS mainframe, 47 percent have two or more and close to one third have four or more MVS mainframes installed. The most common COBOL compilers in use are VS COBOL II and COBOL for OS/390. The majority of respondents noted using three or four different COBOL compilers.

Of the nearly 30 questions that were asked of the respondents, two areas in particular provided the most eye-opening results: 1) the frequency and types of failures, and 2) the recovery time versus the batch window. In short, application failures were a common experience for respondents. Fewer than 10 percent said they did not experience any failures, while 43 percent noted more than 25 failure instances in the past 12 months (see chart "Number of ABENDs within the Past 12 Months"). This translates into some type of failure about every two weeks. The majority of respondents attribute failures to application-specific causes (79 percent), as opposed to an MVS system failure (11 percent). Other causes of failure revealed by respondents included human error, databases, CICS and infrastructure problems.

To further compound this problem, the average time to recover from these crashes was five hours (see bar chart "Maximum Recovery Time After an ABEND"). Based on these findings, IT managers must allocate approximately 10 hours each month to failure recovery, yet for most shops less than half of the five hours needed is available in the batch window for any one failure. In fact, the research revealed far less time available than needed to recover from such failures. Specifically, 46 percent reported only 1-2 hours of available batch window recovery time, 25 percent reported 3-4 hours and 9 percent had less than 1 hour to recover (see bar chart "Maximum Batch Slack Time Available for Reruns"). These results shed light on the challenge MVS mainframe shops face in finding and managing time for failure recovery.
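To make the gap concrete, the Python sketch below compares the five-hour average recovery time with each reported batch window bucket. It takes the upper bound of each bucket as the best case. That choice, the two-failures-per-month rounding and the names used are illustrative assumptions, not part of the survey.

```python
AVG_RECOVERY_HOURS = 5   # average recovery time reported in the survey
FAILURES_PER_MONTH = 2   # about 25 failures per year, rounded as in the text

# (label, best-case hours available, share of respondents in percent)
BATCH_WINDOWS = [
    ("less than 1 hour", 1, 9),
    ("1-2 hours", 2, 46),
    ("3-4 hours", 4, 25),
]


def shortfall(recovery_hours, window_hours):
    """Hours of recovery work that do not fit in the available window."""
    return max(0, recovery_hours - window_hours)


monthly_recovery_hours = AVG_RECOVERY_HOURS * FAILURES_PER_MONTH
print(f"Recovery effort per month: {monthly_recovery_hours} hours")

for label, window, share in BATCH_WINDOWS:
    gap = shortfall(AVG_RECOVERY_HOURS, window)
    print(f"{share:>2}% of shops ({label}): short by at least {gap} hours per failure")

# Share of respondents whose best-case window is shorter than an average recovery.
share_short = sum(
    share for _, window, share in BATCH_WINDOWS
    if shortfall(AVG_RECOVERY_HOURS, window) > 0
)
print(f"Respondents with a shortfall: {share_short}%")
```

Even under this best-case reading, none of the three reported buckets can absorb an average recovery.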

A contributing factor to recovery time constraints is the method by which failure instances are resolved. With regard to the tools required for application and system recovery, only 56 percent of survey respondents reported diligent use. In the event of a failure, 79 percent replied that the procedure used most often to discover the cause of the crash was to read the program dump using an automated tool. However, 76 percent then take the application back into a test environment and turn on debuggers. Only 15 percent of the organizations took time to read the native dump, while 51 percent said they rarely or never used the information. Although in some cases it may be the only option available, Newport Group believes taking an application out of production only compounds the time constraints IT shops face in trying to resolve failures.

This introduces the need for proactive failure management and agent technology solutions for MVS COBOL environments. According to Jim Sinur, Research Director for the Application Development Division of GartnerGroup, "Agent technologies will be growing in the future as organizations move towards adaptive, scenario-based systems. Agents are goal-driven functionality that come alive when the proper conditions occur. They stay pretty invisible and light on system impact, if they stay in communication with other agents and base functionality. It’s like having an immune cell in the bloodstream of your computer system. It’s there when you need it, but it floats free until then."

One Solution

In conducting multiple recent focus groups, InCert Software has found that many organizations often do not fix a problem when it occurs. Due to time constraints, they instead eliminate the offending transaction and defer the fix to a later date. Contending with failures in this way results in applications executing with non-current and incomplete data. Having the ability to trace back to the root cause of a failure in production provides a proactive solution that will allow MVS COBOL shops to fix failures as they happen.

Targeting this MVS mainframe environment need, InCert Software’s TraceBack is an agent technology designed to operate on the executable to trace COBOL code sequences backwards from an ABEND to identify the root cause of application failures. TraceBack is not only able to execute trace backs in the test environment, but can do so in a production environment, thus negating the need to replicate failures in a test environment.

InCert started by developing its agent technology, QAgents: monitoring agents inserted into software binaries by their own QAgent engine. The engine is the mechanism that inserts and distributes the agents into the code. The process begins with the engine analyzing and comprehending the binary. This code comprehension includes the construction of a control flow graph to completely understand all possible execution paths through the binary.
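For readers unfamiliar with the term, the Python sketch below shows what constructing a control flow graph involves in the simplest case: split a list of instructions into basic blocks, then record which blocks can follow which. It is a generic textbook illustration on an invented toy instruction format, not a description of how the QAgent engine works.

```python
# Toy instruction list: (opcode, branch target index or None).
# This is an invented mini-format, not real S/390 machine code.
PROGRAM = [
    ("load", None),   # 0
    ("cmp", None),    # 1
    ("br", 5),        # 2: conditional branch to 5, else fall through to 3
    ("add", None),    # 3
    ("jmp", 6),       # 4: unconditional jump to 6
    ("sub", None),    # 5
    ("store", None),  # 6
    ("ret", None),    # 7
]


def find_leaders(program):
    """A leader starts a basic block: the entry point, every branch target,
    and the instruction following any branch, jump or return."""
    leaders = {0}
    for i, (op, target) in enumerate(program):
        if op in ("br", "jmp"):
            leaders.add(target)
        if op in ("br", "jmp", "ret") and i + 1 < len(program):
            leaders.add(i + 1)
    return sorted(leaders)


def build_cfg(program):
    """Return {block_start: [successor_block_starts]}."""
    leaders = find_leaders(program)
    cfg = {}
    for n, start in enumerate(leaders):
        # A block ends just before the next leader, or at the last instruction.
        end = leaders[n + 1] - 1 if n + 1 < len(leaders) else len(program) - 1
        op, target = program[end]
        if op == "jmp":
            successors = [target]
        elif op == "br":
            successors = [target, end + 1]
        elif op == "ret":
            successors = []
        else:
            successors = [end + 1]  # plain fall-through into the next block
        cfg[start] = successors
    return cfg


cfg = build_cfg(PROGRAM)
for block, successors in cfg.items():
    print(f"block@{block} -> {successors}")
```

Each key is the index of the first instruction in a block, and each edge is a possible transfer of control. A block with two successors marks a branch point.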

Understanding the branches between all the blocks of code allows the engine to determine exactly where the agents need to be installed. Once these locations have been determined, the engine distributes the agents with single or multiple pre-defined instructions. During this procedure, the engine saves all branching and registry information to make future agent deployment efficient. Agents reside in the software binaries or load modules and capture information on execution behavior and the sequence of statement execution leading up to an ABEND. Tracing back sequences of statement execution behavior results in the ability to immediately determine the root cause of failures following an abnormal application termination.
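The idea of capturing the sequence of statement execution leading up to a failure can be illustrated with a small "flight recorder". The Python sketch below keeps only the most recent statement identifiers in a fixed-size buffer and returns them in reverse order when a statement fails. The class, function and paragraph names are hypothetical, and this is a conceptual analogy only, not TraceBack's actual mechanism.

```python
from collections import deque


class TraceRecorder:
    """Keeps only the most recent statement IDs, like a flight recorder."""

    def __init__(self, capacity=8):
        # A deque with maxlen silently drops the oldest entry when full.
        self._recent = deque(maxlen=capacity)

    def record(self, statement_id):
        self._recent.append(statement_id)

    def trace_back(self):
        """Return the recorded statements, most recent first."""
        return list(reversed(self._recent))


def run_with_trace(statements, recorder):
    """Execute (statement_id, callable) pairs, recording each before it runs.

    Returns None on success, or the backward trace if a statement fails.
    """
    try:
        for statement_id, action in statements:
            recorder.record(statement_id)
            action()
    except Exception:
        return recorder.trace_back()
    return None


def divide_by_zero():
    return 1 / 0  # stands in for an abnormal termination


# Hypothetical COBOL-style paragraph names, for illustration only.
program = [
    ("0100-READ-INPUT", lambda: None),
    ("0200-VALIDATE", lambda: None),
    ("0300-COMPUTE-RATE", divide_by_zero),
    ("0400-WRITE-OUTPUT", lambda: None),
]

trace = run_with_trace(program, TraceRecorder(capacity=2))
print(trace)
```

The fixed capacity is the design point: recording overhead and memory use stay bounded no matter how long the program runs, while the statements closest to the failure are always retained.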

Further Y2K Risk Protection

This function is pertinent given the uncertainty of the looming Y2K problem. Despite monumental efforts in Y2K remediation, there still exists the fear of what will actually happen when the clock ticks over. This fear is mainly due to the uncertainty of how multiple external interactions will affect application behavior. Charles A. Aquilina, Director of Resolve 2000 for Keane Incorporated, recently stated that "Regardless of the due diligence organizations have taken to make their applications year 2000-compliant, each organization should have contingency plans on how to cater to failure if, and when, it occurs." Keane is providing Crisis Management Centers that will support its clients if the year 2000 date change becomes an issue. Technology geared to uncover the root cause and resolve application failures adds a layer of risk protection against Y2K disasters or a barrage of multiple, even simultaneous, application failures.

Failure management in MVS COBOL environments has by default become a reactive process. Once a crash occurs, IT shops are sent into motion trying to uncover the root cause. It is unsettling that current research indicates failures in MVS COBOL environments are pervasive and will undoubtedly increase further as demand for additional application changes and functionality grows, hooks to the Web are expanded and skill sets in this discipline continue to dwindle. The high cost of application downtime for MVS COBOL applications and the limited window of time available to correct failures lead us to recognize the proactive advantage of being able to trace the root cause of a failure immediately and accurately within the production environment. The automatic notification of an ABEND statement, coupled with the direct, reverse-engineered path to problematic code, alleviates the need to pull applications out of production to trace problems in the test environment.

Current tool offerings for MVS COBOL environments are unable to trace back statement sequences in production. Perhaps the issue can best be summarized by one survey participant who responded to the question of how failures are addressed in production by stating "call the expert to guess at the cause." The ability to remove the guesswork and provide quicker problem resolution using available resources within an organization has the potential to save a company millions of dollars in lost productivity and downtime costs. The significant financial impact, coupled with waning resources of technical expertise in this discipline, has focused many organizations on proactively managing failure recovery.

About the Authors:

Kevin M. Gallagher and Billie Shea are Research Analysts for Newport Group Inc., an independent IT research and reporting firm based in Barnstable, Massachusetts. For further information, visit their Web site at www.newport-group-inc.com.