In-Depth
SAP at World Bank: A Practical Guide to Removing Location and Time Constraints from Your ERP Infrastructure
Today’s infrastructure needs to be resilient, not just to failures in computer hardware, but also to catastrophic failures of entire computer sites. Fortunately, with falling equipment and communication costs, solutions that were cost-prohibitive yesterday are feasible today. That is far from saying the task is easy to achieve.
A robust infrastructure still will not solve problems in application programs or software. There is always a possibility that the database will become corrupt because of a bug in an application program or other software, and will need to be rolled back in time.
Any solution is a function of the specific requirements, the available budget and the available technical options. We therefore do not claim that the solution we implemented at World Bank is applicable to all companies. Instead, we cover the trade-offs among the different choices.
Choices and Voices
It is always useful to define a theoretical solution to the problem, even if it has no practical implementation. The theoretical model lets us evaluate practical alternatives by how close they come to it and what trade-offs they involve.
Figure 1 depicts a utopian computing model that addresses the location constraint issue. The different layers of computing resources (the presentation interface, the application logic, the database and the OS environments) should interact irrespective of where they are located in relation to each other. They should also be fault tolerant, in the sense that if the database environment fails at one location, it can keep running from a different location, with no impact on the other resources.
Of course, such a model does not exist, and it will be a long time until we have a practical implementation, although many vendors have tried to implement their own versions. Users must evaluate these solutions with a clear understanding of the trade-offs. One big problem is that almost every solution tries to address the issue only within the resource layer its vendor owns.
In Figure 2, Notes provides its own database replication model, and is therefore unaffected by disparities in the other layers. For example, you can use NT and UNIX environments on either end, running on different hardware platforms. Because the application is active at both ends, failover is almost instantaneous. The solution, however, consumes network bandwidth, processing power and disk subsystem resources, competing with the application workload itself.
Similarly, you can use OS-level disk mirroring in conjunction with the high-availability feature, as shown in Figure 3. Again, this is transparent to the application layer and the underlying hardware, but it consumes processing power.
The solution in Figure 4 shows an approach where the storage subsystem takes on the task of data replication, while the high-availability option is retained within the processing system (as in Figure 3). This is the solution implemented in the World Bank ERP infrastructure. The benefit is that the storage subsystem can replicate more efficiently, because it has the most information about its own environment, and no processing subsystem resources are used.
In the ideal solution, each new layer would provide its own unique features and also expose information and control to the existing layers, so that each layer can build on the features already provided and add only the incremental value that pertains to it. This would, of course, require open, standard interfaces between the layers, as well as vendor cooperation.
World Bank recently implemented an SAP ERP system, replacing approximately 60 legacy systems. It also enabled real-time global access to its headquarters-centric system from more than 100 locations via its global network. SAP provides a three-tier architecture, with a presentation interface on the desktop, application logic running on multiple application servers, and a database machine primarily for the database software. SAP application logic tolerates the failure of an application server: users are routed to another server, although they must log on again. (Like Notes, SAP can also provide application-level replication, called Application Link Enabling, or ALE. We did not, however, implement this option.)
At World Bank, we have provided location independence for the application logic layer by housing the servers at two sites. This is, however, not an option for the database machine. We have used Oracle as the database engine. To provide location independence for the database, we have used the AIX high-availability option, HACMP, in conjunction with the remote mirroring (SRDF) feature of the EMC storage subsystem.
If site A fails, the high-availability option enables site B to assume the identity of site A. When the site B processor restarts the database, it uses the mirrored copy of the database. The solution therefore requires the integration of two technologies. In normal operation, we also use the site B machine to run our QA and integration test environment, so the investment does not sit idle. In failover mode, the QA environment is terminated automatically to free up computing resources for production.
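The failover behavior described above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Site` class, the `fail_over` function and the workload names are hypothetical and do not represent HACMP or SRDF interfaces. It shows the ordering of events: QA is stopped on the standby site, and production then moves there.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical model of one computer site and what it is running."""
    name: str
    workloads: set = field(default_factory=set)
    healthy: bool = True


def fail_over(primary: Site, standby: Site) -> bool:
    """Move production to the standby site if the primary has failed.

    Returns True when a switch took place.
    """
    if primary.healthy:
        return False  # nothing to do while the primary site is up
    # Free capacity on the standby first: QA yields to production.
    standby.workloads.discard("qa")
    # The standby takes over the production identity; in the real system
    # it would restart the database against the mirrored copy.
    primary.workloads.discard("production")
    standby.workloads.add("production")
    return True


site_a = Site("A", {"production"})
site_b = Site("B", {"qa"})

site_a.healthy = False  # simulate the loss of site A
switched = fail_over(site_a, site_b)
print(switched, sorted(site_b.workloads))
```

In this toy model the standby site ends up running production only, which mirrors the trade-off in the text: the QA environment is sacrificed for as long as the failover lasts.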
The solution would not have been as complex had it not been for the need to provide time independence. As with the location issue, the ideal scenario for time independence is shown in Figure 5. You need to go back to the point in time at which the error occurred, and then move forward in the correct direction from that point. In the database world, this means applying the logs in reverse, so that the before images replace the after images. This simple approach, unfortunately, is not currently used. Instead, the practical implementation calls for taking periodic full backups and archiving the logs. In case of failure, we restore the back-dated copy of the database and then roll the logs forward to the time of failure.
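The practical recovery procedure, restoring a back-dated copy and rolling the logs forward to the time of failure, can be sketched as follows. The `LogRecord` structure, the integer timestamps and the `recover` function are assumptions made for illustration; they are not Oracle structures or commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical archived log entry."""
    timestamp: int    # logical time of the change
    key: str          # item that was changed
    after_image: str  # value after the change


def recover(backup, backup_time, logs, failure_time):
    """Restore the backup, then apply logs made after it and before the failure."""
    db = dict(backup)  # start from the back-dated copy
    for rec in sorted(logs, key=lambda r: r.timestamp):
        if backup_time < rec.timestamp < failure_time:
            db[rec.key] = rec.after_image  # roll forward one change
    return db


backup = {"balance": "100"}  # full backup taken at time 10
logs = [
    LogRecord(11, "balance", "120"),
    LogRecord(12, "balance", "150"),
    LogRecord(13, "balance", "CORRUPT"),  # the faulty change
]

restored = recover(backup, backup_time=10, logs=logs, failure_time=13)
print(restored)
```

The sketch stops just before the faulty change, so the restored state reflects the last good update. The older the backup, the more log records must be replayed, which is one reason frequent full backups matter.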
This approach requires backups to be created. In our environment, we have integrated five separate technologies: the business continuity volume (BCV) feature of the EMC storage subsystem, the Oracle online backup option, SAP’s BackInt feature, IBM’s ADSM backup management software, and IBM’s Magstar Robotics Interface and tape handling subsystem. The combination of these technologies has allowed us to take a full, automated backup with no degradation or stoppage of the application. When the backup operation is initiated, the database briefly marks itself as being backed up and records any subsequent changes in separate redo logs. Our backup script then detaches the mirror copy from the live disk and informs the database engine that the backup is complete.
The operation takes no more than a few minutes. The split mirror copy is then ready to be backed up to tape by a separate server, using ADSM and the Magstar tape subsystem (see Figure 6). Because the operation runs on an independent server, it uses no resources of the database machine and so cannot interfere with its performance. This feature proved extremely useful during the initial data loads, when we could take backups of the database at different points without interrupting the critical data load process.
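The split mirror sequence can be illustrated with a toy model. Everything here is hypothetical: `MirroredVolume`, `split_mirror_backup` and the step names are illustrative and do not correspond to EMC, Oracle, SAP or ADSM interfaces. The resynchronization step at the end is also an assumption, since the text does not describe how the mirror is reattached.

```python
class MirroredVolume:
    """Toy model of a live disk with a detachable mirror copy."""

    def __init__(self):
        self.live = {}
        self.mirror = {}
        self.attached = True

    def write(self, key, value):
        self.live[key] = value
        if self.attached:
            self.mirror[key] = value  # mirror follows the live disk

    def split(self):
        """Detach the mirror and return its frozen contents."""
        self.attached = False
        return dict(self.mirror)

    def resync(self):
        """Assumed step: reattach the mirror and catch it up."""
        self.mirror = dict(self.live)
        self.attached = True


def split_mirror_backup(volume, steps):
    """Record the order of operations and return the frozen copy."""
    steps.append("database: begin backup mode")
    frozen = volume.split()
    steps.append("storage: mirror split")
    steps.append("database: end backup mode")
    return frozen


volume = MirroredVolume()
volume.write("orders", 1)

steps = []
frozen_copy = split_mirror_backup(volume, steps)

volume.write("invoices", 2)  # production keeps running
tape = dict(frozen_copy)     # a separate server copies the mirror to tape

volume.resync()
print(tape, volume.live)
```

The tape holds the state at the moment of the split, while the live disk continues to accept changes. Only the brief split involves the database machine; the slow copy to tape happens elsewhere.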
Lessons Learned from the Implementation
We had limited experience with the high-availability feature of AIX, and no experience with the EMC storage technology features. It was important for us, therefore, to have alternatives available in case the primary options ran into difficulties during implementation. For example, we would have used just the Oracle online backup option if the split mirror mechanism had shown problems. For the remote mirror feature, the fallback would have been OS-level mirroring.
Paper designs also need to be flexible enough to leave room for changes. Initially, we planned to use our failover machine for three purposes: to run as the QA machine during normal operations, to back up the split mirror to tape, and to run as the production machine in failover mode. This overloading of functions created a conflict, because we could not have switched over automatically to failover mode while production backups were being taken. It required a last-minute reconfiguration to offload the backup operation to another server.
We need internal technical staff who understand the implications of the design trade-offs and can operate the design later. Because we integrate different technologies from different vendors, the risk for the integrity of the overall design falls on us.
Future Issues
Our two sites are currently closer together geographically than we would like, because of technology limitations at the time of implementation. This situation will change as the vendors improve the technologies further and communication costs decrease.
In case of a rollback in time, we need to ensure that not only the ERP system is rolled back, but also its interfaces with other systems and other business processes. This requires greater awareness from the business and application sponsors.
Because we provide global access, a robust server infrastructure is not sufficient. We have adequate redundancy in the local network, but we need adequate redundancy in our global network as well.
About the Authors: Raju Kocharekar is the business manager for the Enterprise Computing Unit of the Information Solutions Group at World Bank. He can be reached via e-mail at rkocharekar@worldbank.org.
Deba Patnaik is the team lead for the ERP infrastructure team, and was the senior member of the team during the implementation phase. He can be reached at dpatnaik@worldbank.org.