A Search and Deploy Mission, Part 1:  Production Support/Application Testing and

In the first part of this two-part article, the author discusses the fundamentals of software defect investigation and resolution, and presents a methodical approach to help MVS developers track down and solve production software errors.

Editor’s Note: Part one of this article discusses the fundamentals of software defect investigation and resolution. It is meant to assist MVS developers by presenting a methodical approach to understanding the process of tracking down and solving a production software error. It is written specifically for the IBM mainframe environment (COBOL, PL/I, batch applications), but the approach can be generalized (with limitations) to any 3GL and 4GL linear programming language on any platform. It is not meant to be taken as a universal practice, to be followed systematically in handling every ABEND condition, but more as a set of guidelines or an organizational approach to understanding what needs to be done to solve complex software issues.

When an application ABEND (ABnormal END-of-job) occurs, MVS stops executing your program, closes files and buffers, and generates a single high-level message in the form of a System Completion Code (Sxxx). The System Completion Code is usually written to an output listing file through your //SYSOUT DD SYSOUT=* JCL entry. This completion code indicates why the system has decided to stop executing your application. It is related, but often only loosely, to what is really wrong with your application. Because of this, the System Completion Code represents only the starting point for your analysis of the problem.
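As a rough illustration of treating the completion code as a starting point only, the following Python sketch maps a handful of common system completion codes to their usual meaning and a first thing to check. The code meanings are the standard MVS ones; the table name, the describe function and the "first check" hints are assumptions made for this example, not part of any IBM or vendor tool.

```python
# Hypothetical triage table for a few common MVS system completion codes.
# The "first check" hints are illustrative assumptions, not official guidance.
COMPLETION_CODES = {
    "S0C1": ("Operation exception", "overlaid storage or a bad branch/CALL"),
    "S0C4": ("Protection exception", "a subscript or index outside its table"),
    "S0C7": ("Data exception", "non-numeric data in a numeric field"),
    "S222": ("Job cancelled by operator", "a possible loop or long wait"),
    "S322": ("CPU time limit exceeded", "a possible infinite loop"),
    "SB37": ("Out of space on output dataset", "runaway output or a small allocation"),
}


def describe(code):
    """Return a one-line starting point for the given completion code."""
    code = code.strip().upper()
    # Listings often print codes such as B37 without the leading S.
    if len(code) == 3:
        code = "S" + code
    meaning, first_check = COMPLETION_CODES.get(
        code, ("Unknown code", "the system codes manual for your release")
    )
    return f"{code}: {meaning}. First check: {first_check}"


if __name__ == "__main__":
    for sample in ("S0C7", "b37", "S999"):
        print(describe(sample))
```

The table deliberately stops at a hint: the real cause still has to be found through the research steps described in the rest of the article.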

Other Debugging Assistance

Along with the System Completion Code, the mainframe, or some other system-level symbolic debugging software, usually generates a listing (SYSOUT) that describes:

  • The System Completion Code (and often a short text description of what it designates).
  • The COBOL instruction (statement) or line number, which contained the invalid operation causing MVS to halt execution.
  • A "core-dump" (a hexadecimal printout) of the internal machine storage and registers relevant to the areas of your program surrounding the COBOL instruction which caused MVS to halt execution.

This information is useful to begin understanding and researching the problem, but it is usually far from sufficient to solve the problem, which could be any combination of:

  • Incomplete, incorrect or invalid COBOL procedural logic.
  • A typo such as a misplaced period, or incorrectly specified field.
  • Incorrect or invalid input data.
  • Batch jobs run out of sequence.
  • Input files missing or corrupted (hardware errors).
  • Errors which relate to JCL problems.
  • New business conditions that have escaped into production, etc.

There are as many different ways to analyze and research COBOL ABENDs as there are individual approaches to writing procedural logic. However, if you’ve never done this type of "logic-detective" work on a large scale, the following five-step approach can help you get started with this complex and crucial process: Preparation, Research, Hypothesis, Solution, and Resolution.

As a final note before beginning, it is important to understand that there are really two distinct phases of MVS Production Support:

1.  On-Call ABEND resolution – during which time a technician receives notification that a job or transaction has ABENDed and must be "fixed" within an extremely short timeframe (usually minutes to hours). In this case, the technician’s main concern is to "patch" the problem – get the system back online, or get the batch jobstream into production ("Patch-It").

2.  "Next-Day" problem resolution – when technicians actually track down and solve the problem that caused the ABEND ("Fix-It").

It should be noted that in certain situations the On-Call ABEND resolution must include fixing the problem. However, in my experience, making source-level changes to complex production applications under duress in the middle of the night is the exception rather than the norm.

The steps below represent a process for Fix-It (next-day problem resolution). As such, they are beyond the scope of the emergency measures used to "patch" problems during most On-Call emergencies: the batch cycle window must typically be completed as the highest priority, and many of the Fix-It steps discussed below consume considerable amounts of time.

Preparation – Collect all necessary background information (WHAT happened and WHERE the ABEND occurred).

During this phase, it is important to collect and organize all of the information necessary to begin analyzing the problem. This includes the following procedures:

Print out the ABEND information. Collect all supporting ABEND output (SYSOUT) from the job (CA-ABEND-AID, Job SYSOUT and DISPLAY statements, etc.).

Obtain copies of the source listings: JCL, Program source and all copybooks (or expanded source listing).

From the JCL, learn the dataset names of input and output files accessed by the program (which you may need to browse as part of your research).

Learn the nature of the batch job from system documentation, or from an application business expert (at least at the level of module-flow and file-access).
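The step of learning dataset names from the JCL can be partly mechanized. The Python sketch below pulls the DSN of each DD statement out of a JCL listing that has been saved as plain text. The sample job, its dataset names and the function name are invented for illustration, and the pattern is deliberately simple: it assumes DSN= appears on the same line as the DD statement and does not resolve continuation lines, symbolic parameters or cataloged procedures.

```python
import re

# Hypothetical JCL, invented for this example.
SAMPLE_JCL = """\
//PAYROLL  JOB (ACCT),'NIGHTLY RUN',CLASS=A
//STEP010  EXEC PGM=PAYCALC
//EMPIN    DD DSN=PROD.PAYROLL.EMPLOYEE.MASTER,DISP=SHR
//RATEIN   DD DSN=PROD.PAYROLL.RATES(0),DISP=SHR
//PAYOUT   DD DSN=PROD.PAYROLL.CHECKS(+1),
//            DISP=(NEW,CATLG,DELETE)
//SYSOUT   DD SYSOUT=*
"""

# DD name, then the DD operation, then the first DSN= or DSNAME= value.
DD_PATTERN = re.compile(r"^//(\S+)\s+DD\s+.*?\bDSN(?:AME)?=([^,\s]+)")


def datasets_by_ddname(jcl_text):
    """Map each DD name to the dataset name coded on the same line."""
    result = {}
    for line in jcl_text.splitlines():
        if line.startswith("//*"):  # skip JCL comment lines
            continue
        match = DD_PATTERN.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


if __name__ == "__main__":
    for ddname, dsn in datasets_by_ddname(SAMPLE_JCL).items():
        print(f"{ddname:<8} {dsn}")
```

The DD names found this way can then be matched against the SELECT ... ASSIGN clauses in the program source to see which file each dataset feeds.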

Research – Construct a mental map (understanding) of the program’s execution (HOW the ABEND occurred).

Making the correct WHY determination usually requires a combination of "Static" and "Dynamic" analysis, two complementary research and investigative approaches. [Note: These steps need not be followed in this order. Rather, in time you will develop an "intuition" as to which kind(s) of analysis will be most likely to provide the information you need to solve your problem.]

Static Analysis – The examination of code without execution. Nicholas Zvegintzov (Editor of the Software Maintenance Journal and Software Management Network, www.softwaremanagement.com) has identified five separate Static Analysis methods or techniques:

1. Structural Visualization: The generation of an accurate understanding or mental image of the program’s control structure, or logic-architecture. Starting from the ABEND condition (the statement which caused MVS to halt execution), and using electronic-assisted tools, build an accurate understanding of the code invocation at:

  • The module/file level (System View)
  • Paragraph/Section level (Hierarchy chart)
  • Statement level (Flow chart) (If necessary – i.e., if the code is dense or complex)

Structural Visualization can be done "top-down," by asking open-ended questions, such as learning how a particular routine "hangs together logically," or it can be used "bottom-up," by asking specific close-ended questions about a program, such as "How does this particular paragraph get executed?" or "How did this module get invoked?"

2. Data Flow Analysis: A combination of control structure analysis and data item analysis, which seeks to determine the usage of particular fields throughout a program. Data flow analysis is used to determine (from a given instance of a data item) where the next occurrence(s) of that item exist in your program, and how the data item is used: as a receiving field in a MOVE or mathematical operation, as the sending field in a MOVE statement, or as part of a logic-branch (IF, PERFORM UNTIL/VARYING, etc.).

3. Data Impact Analysis: An expansion of Data Flow Analysis which traces the movement of data from field-to-field throughout a program, or throughout an entire application, including I/O (screens and files). Using Data Impact Analysis, you can identify all fields that might have had an impact on the contents of a field (before the ABEND occurred). And just as importantly, you can learn the effect that changing this field will have on the behavior of the application.

4. Textual or Data Item Usage: Utilized more for application maintenance and enhancement requests, this type of Static Analysis involves searching for "categories" of program-items, such as "List all fields that contain *JUL*, *GREG*, *YR*, *YEAR* (suspect date candidates for Year 2000 conversion), or list all such fields with two digits (numeric) or two-byte (alphanumeric) definitions."

5. Code Partitioning: Again, utilized more for application maintenance, enhancements and application reengineering, Code Partitioning involves mentally organizing and analyzing code by function or process, such that you understand and can distinguish the usage of code by business process. For example: find all code that relates to the calculation of premium renewal payments, or isolate the code that edits a particular file, with an eye towards creating a shared subroutine from the code.
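To make two of these techniques concrete, here is a minimal Python sketch of bottom-up Structural Visualization ("How does this particular paragraph get executed?") and Data Impact Analysis ("Which fields could have affected this one?"). It is not how any commercial analysis package works. It assumes a heavily simplified COBOL fragment in which paragraph names start in the first column, every PERFORM names a paragraph, and data only moves through simple MOVE and ADD ... TO statements. The sample program, its field names and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical, heavily simplified PROCEDURE DIVISION fragment.
SAMPLE_SOURCE = """\
0000-MAIN.
    PERFORM 1000-INIT.
    PERFORM 2000-PROCESS UNTIL WS-EOF = 'Y'.
    STOP RUN.
1000-INIT.
    MOVE ZERO TO WS-TOTAL.
2000-PROCESS.
    MOVE IN-AMOUNT TO WS-AMOUNT.
    ADD WS-AMOUNT TO WS-TOTAL.
    PERFORM 3000-REPORT.
3000-REPORT.
    MOVE WS-TOTAL TO OUT-TOTAL.
"""

NAME = r"[0-9A-Z][0-9A-Z-]*"
PARAGRAPH = re.compile(rf"^({NAME})\.\s*$")
PERFORM = re.compile(rf"\bPERFORM\s+({NAME})")
FEEDS = re.compile(rf"\b(?:MOVE|ADD)\s+({NAME})\s+TO\s+({NAME})")
FIGURATIVE = {"ZERO", "ZEROS", "ZEROES", "SPACE", "SPACES"}


def analyze(source):
    """Return (callers, feeds) built from the simplified source text."""
    callers = defaultdict(set)  # paragraph -> paragraphs that PERFORM it
    feeds = defaultdict(set)    # receiving field -> sending fields
    current = None
    for line in source.splitlines():
        header = PARAGRAPH.match(line)
        if header:
            current = header.group(1)
            continue
        for target in PERFORM.findall(line):
            callers[target].add(current)
        for sender, receiver in FEEDS.findall(line):
            if sender not in FIGURATIVE:
                feeds[receiver].add(sender)
    return callers, feeds


def invocation_path(callers, paragraph):
    """Bottom-up view: one chain of PERFORMs that reaches the paragraph."""
    path = [paragraph]
    while callers.get(path[-1]):
        parent = sorted(callers[path[-1]])[0]
        if parent in path:  # guard against recursive PERFORM chains
            break
        path.append(parent)
    return list(reversed(path))


def impacting_fields(feeds, field):
    """Every field that can flow, directly or indirectly, into the field."""
    seen, stack = set(), [field]
    while stack:
        for sender in feeds.get(stack.pop(), ()):
            if sender not in seen:
                seen.add(sender)
                stack.append(sender)
    return seen


if __name__ == "__main__":
    callers, feeds = analyze(SAMPLE_SOURCE)
    print(" -> ".join(invocation_path(callers, "3000-REPORT")))
    print(sorted(impacting_fields(feeds, "OUT-TOTAL")))
```

If the ABEND-AID listing pointed at the MOVE in 3000-REPORT, the first result shows how control reached that paragraph and the second lists the fields worth inspecting in the dump. Real source needs a real parser: inline PERFORMs, PERFORM ... THRU, COMPUTE, group moves and REDEFINES all defeat these patterns.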

Dynamic Analysis – The examination of code through its execution. Nicholas Zvegintzov has identified four distinct Dynamic Analysis methods:

1. Tracing: Source-level interactive debugging. Watch the program execute statement-by-statement and line-by-line. This is very useful for detailed debugging, particularly of dense or complex instructions. Some software testing tools allow you to trace the program logic backwards from the point of failure, attempting to re-create the sequence of events (COBOL statements) that transpired up to and including the ABEND condition. Tracing is an invaluable method for detailed debugging. However, given the size and scope of production applications, it is generally more practical to trace only specific problem areas of a program.

2. Interactive Execution: Execute (run) a program, stopping at selective breakpoints (for example, pausing execution each time a certain field value changes, or when a value exceeds some threshold), and examining the contents (value) of specific fields. Interactive Execution must be done by (or with) an application analyst who understands how the system is supposed to operate. Interactive Execution is useful for observing control flow, and is often combined with line-by-line tracing by setting selective breakpoints, monitoring values, "running" the application to the breakpoints, and then tracing the code line-by-line.

3. Selective Data State Collection: Execute code and establish a functional summary of the specific data states that it creates. Use these states in subsequent test runs to compare current values with expected values.

4. Coverage: Analyze the number of times each COBOL statement is executed for a given run. This technique is extremely useful for analyzing test data coverage of a given application. And it can be used effectively for debugging if it makes apparent problems such as infinite loops (S222, S322 and B37 ABENDs) or over-loading tables (loading tables beyond the maximum OCCURS clause and overlaying storage, which can cause S0C1, S0C4 and S0C7 ABENDs).
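The Coverage idea can be sketched in a few lines of Python. The example assumes that a testing tool has already produced a trace, that is, the list of statement numbers in the order they executed. The statement numbers, the trace and the threshold below are made up for illustration, with the threshold standing in for the OCCURS limit of a table being loaded.

```python
from collections import Counter


def coverage_report(all_statements, trace, loop_threshold):
    """Count executions per statement and flag the unusual ones.

    all_statements: statement numbers present in the program
    trace:          statement numbers in the order they executed
    loop_threshold: execution count above which a statement is suspect
    """
    counts = Counter(trace)
    never_run = [s for s in all_statements if counts[s] == 0]
    suspicious = [s for s in all_statements if counts[s] > loop_threshold]
    return counts, never_run, suspicious


if __name__ == "__main__":
    # Hypothetical run: statements 110 and 120 load a table defined with
    # OCCURS 10 TIMES, but the trace shows the load loop running 12 times.
    statements = [100, 110, 120, 130, 140]
    trace = [100] + [110, 120] * 12 + [140]
    counts, never_run, suspicious = coverage_report(statements, trace, 10)
    print("executions:", dict(counts))
    print("never executed:", never_run)
    print("over threshold:", suspicious)
```

Statements that never ran point to untested paths; statements that ran more often than the table has entries point to the kind of overlay that later surfaces as an S0C4 or S0C7 somewhere else in the program.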

Use a research and analysis tool to perform Static and/or Dynamic Analysis on the specific areas of the application relating to the ABEND, in order to determine HOW this particular problem occurred in the application, based on WHERE the problem manifested itself to the system (obtained from the ABEND-AID listing of which statement caused the ABEND).

The Why

Hypothesis – Determine WHY the ABEND occurred.

With the research in steps 1 and 2, you should be able to describe WHAT, WHERE and HOW the ABEND occurred (at what point in the program the logic failed, and what sequence of COBOL statements caused the failure).

However, before modifying any logic, you must determine WHY these statements (or sequence of events) caused this particular failure (e.g., "Why did this production input file contain spaces in a numeric field?" "Why did the program’s logic perform the Initialization routine twice?" "Why did the Read routine execute past end-of-file?").

Only through a determination of WHY will you be able to make a change to production business logic safely, and with confidence that your change will resolve the ABEND and will not introduce new ABENDs.

Sometimes it is relatively easy to come to an understanding of WHY certain ABEND conditions occurred. For example, perhaps a period was left off the appropriate termination point for an IF statement, which caused execution to perform an operation out of sequence. Or perhaps an IF NUMERIC test (which should have been coded for all numeric fields in a file) was forgotten. Or a paragraph was performed through the wrong paragraph-exit, or a production job was released before certain files were available (causing I/O errors). These types of ABEND situations can be understood (and usually resolved) fairly quickly. However, this is not always the case.

What if, in the case of the IF statement with the incorrect termination point, the logic as coded correctly processed the first 100,000 records in the file? Making a change to a critical IF condition could very well affect other down-stream processing within the program, wreaking havoc with subsequent routines. Or what if, in the case of the file containing blanks in the numeric fields, the input file was supposed to be "clean" (validated) by this point in the jobstream, having gone through allegedly "exhaustive" edits in prior modules? By simply adding an IF test you may solve your program’s specific ABEND, but you will not have resolved the actual problem, which exists somewhere else in the system. In other words, provincial approaches to resolving production ABENDs are not recommended, as they usually change the problem instead of solving it.
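For the "blanks in a numeric field" case, one way to look beyond the single failing record is to scan the whole input file and see how many records are affected and which ones. The Python sketch below does this for fixed-width records. It assumes the file has been downloaded as plain text and that the field is an unsigned display-numeric field; the record layout, the helper names and the sample data are hypothetical.

```python
def make_record(emp_id, name, amount):
    # Hypothetical layout: id in columns 0-4, name in 5-14, amount in 15-21.
    return f"{emp_id:<5}{name:<10}{amount:<7}"


def find_non_numeric(records, start, length):
    """Return (record number, raw value) for each field that is not all digits.

    Roughly mirrors a COBOL IF ... NOT NUMERIC test on an unsigned
    display field; signed and packed fields need different checks.
    """
    bad = []
    for number, record in enumerate(records, start=1):
        value = record[start:start + length]
        all_digits = value.isascii() and value.isdigit()
        if len(value) != length or not all_digits:
            bad.append((number, value))
    return bad


if __name__ == "__main__":
    records = [
        make_record("00001", "SMITH", "0012500"),
        make_record("00002", "JONES", ""),          # blanks where digits belong
        make_record("00003", "GARCIA", "00A4500"),  # stray letter
    ]
    for number, value in find_non_numeric(records, 15, 7):
        print(f"record {number}: amount field is {value!r}")
```

A single bad record suggests a data entry slip; hundreds of them, or bad records that all come from the same source, suggest that an upstream edit module is not doing what it is believed to do, which is where the real fix belongs.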

It should be noted that a clear understanding of the business functionality automated by this process is usually required to completely resolve WHY something has gone wrong. Calling on business experts or "application/business" experts who understand "the big picture" and the context in which the job executes is the rule rather than the exception in this process.

Developing a clear and accurate determination of WHY a problem that led to an ABEND condition exists may take a considerable amount of time, depending on:

  • The size, complexity and structure of the code.
  • Your familiarity with the program’s business purpose, coupled with your ability to grasp the point of each statement (assuming you didn’t write the code).
  • The type of ABEND and reason for the problem (some are more diabolical than others).
  • The size of the input/output files, and the capabilities of your file editor.

Note that, in addition to an understanding of the reason for the ABEND, the results of your investigation should produce an understanding of how the business rules map onto the source (why a given routine was coded a certain way). This is a prerequisite to understanding what is wrong in the current situation, and leads to the solution to the problem, the fix itself.

Fix the Problem and Test Your Solution

Take the appropriate action to resolve any business or system-wide issues. Depending on how extensive the damage caused by the problem is, or for how long any problems have persisted undetected:

  • Files may have to be restored from backups from a previous point-in-time.
  • Jobs may have to be re-run from a previous point-in-time (synchronized with file generations).
  • Files may have to be modified with "one-shot" programs, written to resolve issues that require "surgery" on the data.

Take the appropriate action to fix the technical (coding) problem: edit the program source, modifying the existing production logic, and/or modify the JCL (if the error included JCL issues).

Compile and Link the new version of the application. Create an "image copy" of the production file system in order to test your fix. Re-run the batch job and analyze the results. Run "Regression Tests" against the new code and analyze for unexpected results.
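A regression test of a batch fix often comes down to comparing the output of the new program with a trusted baseline, record by record. The following Python sketch shows one simple way to do that once both output files are available as lists of text records. The function name and the sample records are assumptions for illustration; a real comparison would normally ignore fields that legitimately differ between runs, such as run dates and timestamps.

```python
from itertools import zip_longest


def compare_outputs(baseline, candidate):
    """Return (record number, baseline record, candidate record) per mismatch.

    A missing record on either side is reported as None.
    """
    differences = []
    pairs = zip_longest(baseline, candidate, fillvalue=None)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            differences.append((number, old, new))
    return differences


if __name__ == "__main__":
    # Hypothetical output records from the production run and the fixed program.
    baseline = ["00001 0012500", "00002 0000000", "00003 0004500"]
    candidate = ["00001 0012500", "00002 0009900", "00003 0004500", "00004 0001000"]
    for number, old, new in compare_outputs(baseline, candidate):
        print(f"record {number}: was {old!r}, now {new!r}")
```

Every reported difference should be explainable by the fix; any difference that cannot be explained is exactly the kind of unexpected result the regression run is meant to expose.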

Resolution – Cutover to production: Promote your changes into production, then schedule and re-run the cycle.

Summary

There are software tools that can assist in the most difficult (time-consuming and labor-intensive) element of the above: determining WHY things happened as they did. These include:

Static Code Analysis packages, which provide "application models" (a dictionary of all application elements and the relationships among all elements). Such tools can speed the process of understanding a problem, visualizing a sequence of execution, seeing the dependencies that exist in the operational mechanics of a system, etc.

Graphical source code debuggers, which allow you to establish breakpoints during program execution, step through code instruction-by-instruction, simultaneously monitor the contents of variables, reset execution at any time, etc. These types of testing facilities raise the level of abstraction, allowing programmer/analysts to ignore irrelevant aspects of the application in order to concentrate on those that are relevant.

(Next Month: Tips and Techniques)


About the Author:

Jonathan Sayles is Senior Technical Consultant, Micro Focus, PLC (Palo Alto, Calif.), and has published books and articles on topics such as relational database, client/server development and application development workbenches.