In-Depth
Batch to the Future
Remember the promise? Re-engineer the business, redesign using client/server, and presto, all those mainframe legacy systems and that boring batch are history?
Of course, it never happened. The passion for client/server for operational systems has cooled. Browser technologies have swept in very fast and look very attractive as a front end to those legacy systems. Mainframes have made a comeback, and MIPS capacity from the new, cheaper, air-cooled CMOS mainframes is going through the roof. You only need to look at the rate at which mainframe MIPS are being installed to see that users are finding more and more to do with these computing megaliths. In fact, figures from the last two years show that shipped capacity has more than doubled: 70 percent growth in 1995-96 and a further 40 percent growth in 1996-97 (1.7 x 1.4, or roughly 2.4 times the starting capacity).
Batch has also made a comeback - as an increasing problem for IT managers, and a problem set to get worse:
- Business units are insisting on systems being up for longer hours to meet their increased global aspirations.
- The advent of the Internet means that users are increasingly accessing the system from outside the corporation - and demanding that the system be up when they want it, often 24 hours a day.
- Business units want to create more sophisticated data warehouses with complex batch extractions from operational systems.
- And, as if all these factors were not enough, the irresistible approach of the dreaded millennium is further consuming our capacity with testing and re-development work.
As a result, the "Batch Window" (the time the main operational systems are down to do batch) is shrinking faster than the MIPS dedicated to batch are increasing. The outcome is the most significant cause of systems being unavailable: batch window overrun.
Batch processing is not going away. In large organizations, the online systems cannot make all the changes required in real time. Batch is an efficient way to make those updates later, and re-synchronize all the systems. So companies have attempted to tackle the batch window problem in a number of ways.
The Technology Option
Batch does not respond well to blunt implements, and the types of technology used in modern computer systems are blunt indeed. When CMOS processors were first introduced, they certainly provided benefits to the end users in terms of reduced cost, size, power and space requirements, but did not match the sheer brute power of the older bipolar (ECL) technologies.
Well, technology has moved on and the latest G5 announcements from IBM have put the CMOS processors on par with their ECL counterparts. The microprocessors at the heart of the S/390 G5 Server can run at speeds of up to 465 MHz and deliver over 900 MIPS in 10 processor systems.
The other approach to processor capacity has been to combine both technologies into hybrid ECL/CMOS systems. Hitachi’s Skyline family is an example of this approach. It can deliver nearly 1000 MIPS in a multiprocessor configuration, with individual processors delivering over 120 MIPS, which makes for a powerful batch machine. But unfortunately, the widely used tactic of building bigger machines by putting more and more processors in parallel doesn’t really help the batch window.
In an OLTP environment, parallel architecture is king: hundreds of users all working at once, lots of discrete transactions all running side by side. But look at the average batch stream - as the name suggests, a stream of sequential jobs each needing to be processed before its successors can execute, each patiently waiting for its precursors and resources to become ready before it runs. Serial processing and running in parallel simply do not go together.
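The point can be put in rough numbers. What follows is only a sketch under simplifying assumptions: the job names and run times are hypothetical, I/O waits are ignored and scheduling is assumed perfect. Elapsed time can never fall below the longest chain of dependent jobs, however many processors are installed.

```python
def elapsed_lower_bound(durations, chain, processors):
    """Lower bound on elapsed minutes for a set of batch jobs.

    durations:  dict of job name -> run time in minutes (hypothetical values)
    chain:      job names that must run strictly one after another
    processors: number of processors available
    """
    total_work = sum(durations.values())
    chain_time = sum(durations[job] for job in chain)
    # Work can be spread across processors, but a chain cannot be split.
    return float(max(total_work / processors, chain_time))


# A purely sequential stream: every job waits for its predecessor.
stream = {"extract": 40, "sort": 25, "update": 60, "report": 15}
order = ["extract", "sort", "update", "report"]

for n in (1, 2, 10):
    print(n, "processors:", elapsed_lower_bound(stream, order, n), "minutes")
```

For this stream the bound is the same with one processor or ten, because the chain accounts for all of the work.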
And it gets worse - CPU capacity is not the whole story anyway. When you really dig down, you discover that the CPU component of batch is usually less than 30 percent of elapsed time, so doubling processor speed improves batch performance by 15 percent at most.
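The arithmetic here is the familiar Amdahl's law calculation. A minimal sketch follows, assuming the CPU share is a fraction of elapsed time and that nothing else changes when the processor gets faster; the function name is illustrative.

```python
def elapsed_after_speedup(cpu_fraction, cpu_speedup):
    """Relative elapsed time once only the CPU-bound part runs faster.

    cpu_fraction: share of elapsed time spent on CPU (0.0 to 1.0)
    cpu_speedup:  factor by which the processor is faster (2.0 = doubled)
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if cpu_speedup <= 0:
        raise ValueError("cpu_speedup must be positive")
    # The non-CPU part (mostly I/O) is untouched; the CPU part shrinks.
    return (1.0 - cpu_fraction) + cpu_fraction / cpu_speedup


new_time = elapsed_after_speedup(0.30, 2.0)
print(f"elapsed time falls to {new_time:.0%} of the original")
print(f"saving: {1.0 - new_time:.0%}")
```

Even an infinitely fast processor would leave 70 percent of the elapsed time in place, which is why attention turns to I/O.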
I/O is the problem. We might have developed an amazing ability to crush more and more DASD capacity into less and less space, but the real issue is how fast you can reach the data you need. That is all about DASD seek times, and they are not changing. Consequently, many companies have already moved to cached DASD. By faking your disk storage in memory you can achieve very significant reductions in I/O times and corresponding improvements in batch performance, while making EMC rich in the bargain.
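How much a cache helps depends on how often the data is found in it. The sketch below uses the standard weighted-average model; the 1 ms and 15 ms service times are illustrative assumptions, not measurements of any particular device.

```python
def effective_io_ms(hit_ratio, cache_ms, disk_ms):
    """Average I/O service time for a given cache hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Hits are served from memory; misses still pay the seek.
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * disk_ms


# Assumed figures for illustration only.
CACHE_MS = 1.0
DISK_MS = 15.0

for ratio in (0.0, 0.5, 0.9):
    avg = effective_io_ms(ratio, CACHE_MS, DISK_MS)
    print(f"hit ratio {ratio:.0%}: {avg:.1f} ms per I/O")
```

The misses dominate: with these assumed figures, a 90 percent hit ratio still leaves more than half of the average time going to the seeks that remain.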
The impact of some classes of batch, such as data extraction for data warehousing, can be reduced by using technology such as SnapShot copy from StorageTek. This allows a copy of the information to be made instantaneously without taking the production systems off-line. This technique, however, is not as useful for systems that have to wait until the batch is complete before they can restart.
The bottom line is that the sledgehammer of hardware technology is not the only tool to crack this particular batch nut. We are going to need to look elsewhere to achieve the goal of compressing an increasing batch workload into a reducing batch window.
Alternate Approaches
As usual in these situations, a little lateral thinking goes a long way. It doesn’t take long to realize that if the machines we are running our batch on are not going to grow as fast as the workload, then we need to do something about the workload. Once you start thinking down that road, life becomes much more accommodating. It soon becomes obvious that there are three principal ways of improving the situation by looking at the workload and how it is processed.
The first of these is scheduling - making sure you are making the best use of the space you have available to process in. It’s no different than the effect every mother has on her 6-year-old’s toy box. Just take all the toys and pack them neatly into the box (instead of letting them sit where they fall). You can fit more toys into the same space. The same is true with scheduling. If you can pack more jobs into the batch window, your efficiency goes way up.
The second is automation - we have demonstrated in every area of human endeavor that a little automation can make a dramatic improvement in performance whether it’s knitting pull-overs or building automobiles. By off-loading from the operations staff the mundane and detailed aspects of the processing, improvement can be achieved and errors reduced.
The third technique is optimization - here we look at how the tasks interact and how techniques like "pipelining" can introduce some parallelism into the process, further improving systems performance and reducing processing time.
Let us see how each of these options can be exploited. The increased confidence of S/390 software vendors is evidenced by start-up companies, such as Beyond Software, and by the expansion into the U.S. of companies such as Germany’s BETA Systems Software. Such companies are forming a new industry of tool developers who are taking over from the hardware technologists the task of delivering the improvements in batch performance that are needed.
The Scheduling Option. Scheduling of batch activity is akin to the standard Gantt chart project management schemes. Scheduling packages have been available for some time from a variety of major software system vendors. They typically provide a set of tools to allow operations teams to visualize the batch window and optimize the ordering and timing of each job to ensure that no resource or timing conflicts occur. They also can perform "what-if" analyses to look for improvement opportunities. Today, scheduling is recognized as a vital first step in the tuning of any large scale processing operation.
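The core calculation behind such a tool can be sketched in a few lines. This is an illustration only, not any vendor's product: the job names, run times and dependencies are hypothetical, and it assumes there are always enough initiators so that a job starts as soon as its predecessors finish.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def earliest_times(durations, depends_on):
    """Return {job: (start, finish)} in minutes from the window opening."""
    graph = {job: set(depends_on.get(job, ())) for job in durations}
    times = {}
    # static_order() yields every job after all of its predecessors.
    for job in TopologicalSorter(graph).static_order():
        start = max((times[p][1] for p in graph[job]), default=0)
        times[job] = (start, start + durations[job])
    return times


def window_needed(durations, depends_on):
    """Minutes from the first job starting to the last job finishing."""
    return max(finish for _, finish in earliest_times(durations, depends_on).values())


durations = {"extract": 40, "sort": 25, "update": 60, "backup": 50, "report": 15}
depends_on = {"sort": ["extract"], "update": ["sort"], "report": ["update", "backup"]}

baseline = window_needed(durations, depends_on)
# "What-if": suppose tuning halves the update job.
what_if = window_needed(dict(durations, update=30), depends_on)
print("baseline window:", baseline, "minutes; after tuning update:", what_if)
```

Shortening the backup job in this example would change nothing, since it is not on the longest chain. That is the kind of answer a "what-if" analysis is meant to surface.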
The Automation Option. Many mainframe sites have automated batch processing and recovery procedures. These systems reduce waiting time and the incidence of human error. Moreover, complex recovery procedures can be implemented without expensive operator training and further opportunity for mishap. These systems have generated significant improvements in productivity and batch reliability. There are a number of packages on the market from companies like BETA Systems, Computer Associates, Cybermation and IBM, which deliver a wide range of automation solutions. The business impact is that batch is more reliable and cost effective. However, automation is not going to reduce the batch workload; it only ensures that the scheduled activity occurs with minimum error and delay.
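In miniature, an automated recovery procedure is a rule that says what to do when a job fails and how many times to try again. The sketch below is a generic illustration, not the behavior of any of the packages named above; the function and job names are hypothetical.

```python
import time


def run_with_recovery(job, recover, max_attempts=3, delay_seconds=0.0):
    """Run job(); on failure call recover(error) and try again.

    The last failure is re-raised so that it can still reach an operator.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as error:
            if attempt == max_attempts:
                raise
            recover(error)  # e.g. free a dataset, reset a checkpoint
            time.sleep(delay_seconds)


# Hypothetical job that fails once because a dataset is still in use.
attempts = {"count": 0}


def nightly_update():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("dataset in use")
    return "completed"


recovery_log = []
result = run_with_recovery(nightly_update, recovery_log.append)
print(result, "after", attempts["count"], "attempts")
```

The recovery happens with no waiting time and no operator intervention, but the job itself is no shorter.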
The Optimization Option. This is proving to be the most promising set of techniques for tackling the batch window problem. IBM’s SmartBatch eliminates having to wait until a job step is complete before starting the next. A "pipe" can be created between the output of one step and the input of the next. The two steps can run in parallel. I/O is avoided and elapsed time is reduced. IBM also has Sysplex enhancements for batch, including the ability to dynamically schedule initiators across the Sysplex.
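The idea of a pipe can be illustrated with two steps connected by an in-memory queue in place of an intermediate file. This is a sketch of the concept only and makes no claim about how SmartBatch is implemented; the step names and record layout are hypothetical.

```python
import queue
import threading

END = object()  # sentinel marking the end of the record stream


def extract_step(pipe, record_count):
    """First step: writes records to the pipe instead of to a file."""
    for i in range(record_count):
        pipe.put({"id": i, "amount": i * 10})
    pipe.put(END)


def summarize_step(pipe, totals):
    """Second step: starts consuming as soon as the first record arrives."""
    while True:
        record = pipe.get()
        if record is END:
            break
        totals["records"] += 1
        totals["amount"] += record["amount"]


def run_piped(record_count):
    pipe = queue.Queue(maxsize=100)  # bounded, so the writer cannot race far ahead
    totals = {"records": 0, "amount": 0}
    writer = threading.Thread(target=extract_step, args=(pipe, record_count))
    reader = threading.Thread(target=summarize_step, args=(pipe, totals))
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return totals


print(run_piped(1000))
```

In standard Python the two threads take turns rather than running on separate processors, so the sketch shows the data flow and the absence of an intermediate file, not the elapsed-time gain itself.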
Closing the Circle
Most significantly, none of these approaches deals with the only constant in an IT department’s life - change.
If we are going to operate at peak efficiency and continuously deliver the level of performance demanded by the industry, we have to address change and its effects. What we now recognize is that we must close the batch performance circle. This is taking a page from the TQM community handbook and seeking continuous process improvement by applying their mantra "monitor, measure, improve."
A new generation of systems brings an integrated set of tools that undertake this task of continuous process improvement. The tool sets address each phase of the process: Pre-Production (planning and scheduling), Production (monitoring and automation) and Post-Production (analysis and reporting).
Each phase is designed to address a set of specific questions, which must be answered if our goals are to be achieved.
Today’s products also allow data centers to allocate work among staff of varying proficiency levels, and often this is crucial because personnel can be stretched to the limit handling the migration of UNIX into the data center and onto the mainframe.
Data center operators today are facing increased demands to manage distributed systems, as well as the mainframe systems. Therefore, the batch window simply needs to become more efficient. Intelligent pre-production and post-production software are integral parts of the methodology which must be put into place today.
The Redesign Option
The most significant change is to move to more radical techniques, such as concurrent batch/online. These approaches can have significant overhead and some shops are wary of the impact of this type of batch on response times. However, this is an essential technique for organizations that demand or will demand 24x7 processing.
Operationally, it is important to be able to dynamically monitor and optimize the impact of concurrent batch, and to provide restart and recovery capabilities. Probably the most important thing for organizations to do is to put batch back as an important area for design and planning. Designing new systems that will take advantage of new batch technologies and that minimize the impact on other systems will be essential.
Batch is a significant problem that has been resistant to change, and resistant to technology. The impact of the Internet and the drive to 24x7 systems are putting significant pressure on IT departments to address this issue. IT shops will have to set development standards that will allow continuous processing, and they will need a set of batch support tools that will allow them to avoid some problems, automate others, and optimize the results. Fortunately, the buoyancy of the mainframe software tools segment is delivering new tools and methodologies that support this trip: Batch to the Future!
Questions Which Need Answers in Every Data Center
Pre-Production:
*Does the staff understand the schedule?
*What are the relationships among the events in the scheduling system?
*Does the schedule include all business requirements?
*Where should new jobs be placed in the schedule?
*What is the critical path in the schedule?
*What effect will new requirements have on the system?
*Are there any unknowns in the scheduled workload?
*Are there over- and/or underutilized periods in the batch window?
Production:
*What is my disaster recovery plan?
*When a job fails, what jobs must wait on this job before they can execute?
*What jobs must run before a desired job can be run?
*How long does it usually take to process from JOB A to JOB B?
*Is the batch processing proceeding as expected?
*Did this job fail before?
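Two of the production questions above, which jobs are held up by a failure and which jobs must run first, come down to walking the dependency chart in one direction or the other. A minimal sketch follows, assuming the schedule is available as a simple mapping of each job to its immediate predecessors; the job names are hypothetical.

```python
def reachable(start, edges):
    """All jobs that can be reached from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invert(depends_on):
    """Turn job -> predecessors into job -> successors."""
    successors = {}
    for job, predecessors in depends_on.items():
        for p in predecessors:
            successors.setdefault(p, []).append(job)
    return successors


depends_on = {"sort": ["extract"], "update": ["sort"], "report": ["update", "backup"]}

# What jobs must run before "update" can be run?
must_run_first = reachable("update", depends_on)
# If "sort" fails, what jobs must wait?
held_up = reachable("sort", invert(depends_on))
print("before update:", sorted(must_run_first))
print("held up by sort:", sorted(held_up))
```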
Post-Production:
*What were the problem processing periods?
*What jobs had errors or caused delays?
*Did any unusual events occur?
*How were these errors addressed?
*What was the actual workload?
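The post-production questions are largely a matter of comparing what was planned with what happened. The sketch below assumes a hypothetical run log with planned and actual run times and a return code per job; real schedulers and system logs record this in their own formats.

```python
def problem_jobs(run_log, tolerance=0.10):
    """Names of jobs that failed or overran their plan by more than tolerance."""
    flagged = []
    for entry in run_log:
        overran = entry["actual_min"] > entry["planned_min"] * (1 + tolerance)
        if entry["return_code"] != 0 or overran:
            flagged.append(entry["job"])
    return flagged


# Hypothetical figures for one night's run.
run_log = [
    {"job": "extract", "planned_min": 40, "actual_min": 42, "return_code": 0},
    {"job": "sort", "planned_min": 25, "actual_min": 24, "return_code": 8},
    {"job": "update", "planned_min": 60, "actual_min": 75, "return_code": 0},
    {"job": "report", "planned_min": 15, "actual_min": 15, "return_code": 0},
]

actual_workload = sum(entry["actual_min"] for entry in run_log)
print("jobs needing attention:", problem_jobs(run_log))
print("actual workload:", actual_workload, "minutes")
```

Feeding the result back into the next night's schedule is what closes the circle.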
ABOUT THE AUTHOR:
David Floyer is Research Director, Commercial Systems and Servers, at International Data Corporation (Framingham, Mass.). He advises clients on the competitive and technological issues surrounding Enterprise Systems and is a consultant to the vendor and user communities.