Nirvana Revisited: Data Access in a Heterogeneous Environment

Remember when Nirvana, to an IT organization, meant standardizing the company on one database architecture and thereby eliminating data redundancy? Well, with the advent of client/server and distributed platforms, IT organizations have had to deal with the reality of numerous databases. Because these environments must also co-exist with legacy applications, most organizations have never achieved the Nirvana of data integration -- nor, given the power of today’s distributed computing environments, should they try to. And now there’s the Internet, which has pushed data propagation to new heights.

Therefore, the objectives today are different. Rather than seeking to implement one, centralized repository of corporate data, IT organizations are looking for ways to coordinate and manage the distributed data they already have.

It’s no easy task: heterogeneous hardware and software platforms abound. The hardware platforms are mainframe, client/server and Web-based. The number of different databases that the enterprise supports is growing rapidly. Because much of this data is actually duplicated (for instance, the Web database houses catalog information that is also in a mainframe database), organizations are faced with complex data extraction and data replication problems every day.

Users have an increasing need to extract and create subsets of databases to support cross-platform application testing, staging data for production, archiving production data, moving updated data between different platforms or creating realistic, time-shifted data for year 2000 testing. Let’s look at some of the data movement issues in two key areas: cross-platform application testing and Web-based data migration.

Cross-Platform Application Testing

From an application testing perspective, many organizations are architecting new client/server systems using either a mainframe or a Windows-based system as the server and Windows-based workstations as the clients. This approach lets the company take advantage of both platforms, combining server performance with a state-of-the-art graphical user interface. In these situations, application unit testing may be performed on Oracle/NT servers, while integration and system testing is performed from NT client workstations against CICS/DB2 servers on OS/390.

In the preceding scenario, the IT organization will need to subset Oracle and DB2 databases in preparation for creating realistic, high-quality test data that can be used across heterogeneous environments. Such a data movement strategy enables the enterprise to use the same test cases across all platforms.

However, when extracting data from a complex relational database, the challenge is to create a referentially intact subset of data. The complexity of enterprise-scale relational databases is daunting. Relational data is, by its very nature, fragmented. If database subsets involved only one table, there would be no problem. However, most subsets involve dozens of tables and those tables are interconnected by hundreds of relationships. These are subtle relationships, many of which cannot be defined to the database, and hence are managed by the application code. Yet the IT organization is chartered with migrating complete subsets of data that accurately reflect all relationships.
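To make the extraction problem concrete, here is a minimal sketch of relationship-driven subset extraction. The schema, the RELATIONSHIPS catalog and the single-column keys are illustrative assumptions; a real tool would read the DBMS catalog and let users declare the relationships that only the application code knows about.

```python
# Sketch: extract a referentially intact subset by walking a relationship
# catalog in both directions from a starting set of rows.
import sqlite3
from collections import deque

# Hypothetical relationship catalog: (child_table, fk_col, parent_table, pk_col).
# In practice this would combine DBMS-declared foreign keys with
# application-managed relationships supplied by the user.
RELATIONSHIPS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_items", "order_id", "orders", "id"),
]

def extract_subset(conn, start_table, start_ids, id_col="id"):
    """Collect row ids per table so that every relationship in the subset
    resolves: children of selected rows are pulled in, and so are the
    parents those rows reference, leaving no dangling foreign keys."""
    subset = {start_table: set(start_ids)}
    queue = deque([start_table])
    while queue:
        table = queue.popleft()
        ids = list(subset[table])
        marks = ",".join("?" * len(ids))
        # Pull in children that reference rows already in the subset...
        for child, fk, parent, pk in RELATIONSHIPS:
            if parent == table:
                rows = conn.execute(
                    f"SELECT {id_col} FROM {child} WHERE {fk} IN ({marks})",
                    ids).fetchall()
                new = {r[0] for r in rows} - subset.get(child, set())
                if new:
                    subset.setdefault(child, set()).update(new)
                    queue.append(child)
        # ...and pull in parents that the subset's rows reference.
        for child, fk, parent, pk in RELATIONSHIPS:
            if child == table:
                rows = conn.execute(
                    f"SELECT {fk} FROM {child} WHERE {id_col} IN ({marks})",
                    ids).fetchall()
                new = ({r[0] for r in rows if r[0] is not None}
                       - subset.get(parent, set()))
                if new:
                    subset.setdefault(parent, set()).update(new)
                    queue.append(parent)
    return subset
```

Starting from a handful of customers, the walk pulls in their orders, those orders' line items, and any parent rows they reference, which is exactly the "all the right data from all the right tables" property discussed below.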

Because many existing data access and movement products are too simplistic, operating on only a limited number of tables and relationships, IT organizations have had to devote enormous amounts of time to writing, testing and debugging customized extract programs.

Nothing less than a referentially intact and complete subset – including all the right data from all the right tables, every time – is acceptable. As it should be. Regardless of the number of tables or relationships involved, the tools an IT organization chooses for its data access and movement strategies should create accurate extracts from an arbitrarily complex data model spanning many tables. This is particularly important when organizations need to create subsets of data for application testing purposes.

The data extraction tools used to create a subset of test data must account for all relationships. Otherwise, the full range of "real world" relationships, whether defined to the database or enforced by the application, cannot be reflected in the subset of test data. The more realistic a test database is, the more useful it is in identifying potential application problems. If a highly realistic test database is created and an adequate number of test cases are used, nearly all problems can be uncovered and addressed prior to deployment.

Unfortunately, many organizations don’t have the time to create test databases that precisely mimic production databases. Some use very small, limited test databases, which may be created manually. These databases do not thoroughly test the application.

Simulating real production conditions is, therefore, vital to the ongoing operation of the business. Realistic testing requires realistic test data. One approach to creating a truly representative test database is simply to clone the production database. While this might seem like a reasonable solution, there is rarely enough disk space to do so, and typically only one clone can exist at a time, limiting the productivity of testers. IT organizations need relationally intact subsets of production databases that can be used to run the same test cases on every platform, whether mainframe, client/server or Web-based.

To test functionality, the database should be small enough to enable rapid test runs, yet large enough to include special test cases and boundary conditions. A cloned database does not test boundary conditions, can increase security exposures and does not enable cross-platform testing. On the other hand, a relationally intact subset of data can take a fraction of the storage space of cloned databases and testers can quickly create and refresh subset databases, thereby increasing staff productivity.

Web-Based Data Migration

As companies have experienced the benefits of heterogeneous data access and movement across the mainframe and client/server environments, they have naturally extended the same functionality to include Web-based platforms. Now companies have to consider Web-based data movement and synchronization strategies.

Today’s businesses forge numerous links via the Internet with employees, customers and suppliers, adding to the plethora of computing platforms and databases. These Web-based business initiatives encompass several flavors of data movement: data made available for customer or employee queries, subsets of data downloaded by employees, customers and suppliers to local, disconnected databases, and data distributed via the Web to remote, occasionally connected applications.

When data is distributed via the Web, it is not unusual to duplicate corporate data at remote sites during the update process. Thus, when updating data via the Web, it is essential to consider some type of data synchronization technology. This technology should help to move and reconcile the distributed corporate data. To successfully Web-enable corporate data, companies must:

Examine the impetus for the project. Understand what is driving your Web-publishing initiatives. Will the application be one-way or bi-directional?

Identify the operational issues. How much data is there, where is it going, what is required to get it there and how often will updates occur?

Determine how the data will be segmented. The problem of segmenting complex relational data for distribution cannot be overstated. Information scattered across dozens of interrelated tables in the corporate database must be segmented uniquely for each user, and when updates occur, they must be distributed accurately to the appropriate users.

Nor can complex corporate databases containing hundreds of interrelated tables simply be de-normalized by "flattening" the data model: flattening compromises referential integrity, creating an inconsistency between the Web and corporate database data models.
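The segmentation concern above can be sketched as a simple change router. The change-record shape, the "territory" segment key and the subscription table are all illustrative assumptions, not any particular product's design:

```python
# Sketch: route changed rows to the remote users whose segment they belong to.
from collections import defaultdict

def route_changes(changes, subscriptions, segment_col="territory"):
    """Group changed rows into one change set per subscribed user.

    changes:       iterable of (table, row_dict, operation) records
    subscriptions: {user: set of segment keys that user receives}
    """
    per_user = defaultdict(list)
    for table, row, op in changes:
        for user, segments in subscriptions.items():
            # A row is distributed to every user subscribed to its segment.
            if row.get(segment_col) in segments:
                per_user[user].append((table, row, op))
    return dict(per_user)
```

A user subscribed to one territory receives only that slice of the update stream, while a user subscribed to several receives the union, without the data model itself being flattened.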

Because most IT organizations know that the Nirvana of a single integrated corporate database does not exist, it would be prudent to anticipate that Web-enabling corporate data is going to be a formidable task. The environments, both hardware and software, are heterogeneous, and bi-directional updates will become a reality. If corporations do not adequately plan for the correct data movement and synchronization strategy, the result will be applications that do not meet corporate business objectives.
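When bi-directional updates do arrive, some reconciliation policy is needed. Below is a minimal sketch of one such policy, last-writer-wins by version counter; the row shape and the tie-breaking rule are assumptions, and real synchronization products offer far richer conflict resolution:

```python
# Sketch: last-writer-wins reconciliation of a central and a remote copy.
def reconcile(central, remote):
    """Merge two {primary_key: row} dicts, keeping the higher-versioned row.

    Each row dict carries a 'version' counter bumped on every update;
    on a tie the central copy wins, a deliberately conservative choice.
    """
    merged = dict(central)
    for key, row in remote.items():
        if key not in merged or row["version"] > merged[key]["version"]:
            merged[key] = row
    return merged
```

Rows updated only at the remote site flow back to the center, rows updated in both places resolve deterministically, and nothing is silently dropped, which is the minimum a synchronization strategy must guarantee.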

Data Nirvana

As companies establish new applications and migrate legacy applications to client/server and Web-based platforms, data movement and synchronization are two important IT functions that need to be thoroughly planned. To achieve some sense of uniformity across platforms, organizations have embarked on complicated data movement strategies. Unfortunately, data movement technology has not always kept up with the pace at which data is distributed.

Effective IT organizations stopped chasing the dream of a single, centralized repository of corporate data a long time ago. These organizations have employed best-of-breed tools to enable data movement interoperability between heterogeneous DBMSs. They have developed strategic corporate applications that positively affect the bottom line because the applications are supported by a data movement and synchronization strategy that can precisely subset complex relational databases and expeditiously update remote databases. In short, they are achieving the goal of "data Nirvana" -- a powerfully effective corporate database -- by leveraging today’s distributed technologies along with leading-edge strategies for data movement.

About the Author:

Steve Gerrard is Vice President of Strategic Planning at Princeton Softech, a wholly owned subsidiary of Computer Horizons Corp.