data directions: Index Index - Who's the Fairest of them All?

With next month's release of V4R4, IBM made the AS/400 into a full-fledged player in the data warehouse marketspace. Oh, yes I know they have tools that have been there for a while, and the scalability of the raw data is the best in its class. However, real heavy data-intensive data warehousing and query retrieval applications had trouble competing with technologies from other major database vendors.

The problem and hence the solution, is in the way the data is stored for these types of applications. In the past we have discussed that data stored for retrieval is not the same as data stored for transaction based systems. At least it should not be if performance is an issue.

Data marts and data warehouses like to store data in what is basically a cube -- a multi-dimensional cube where you can retrieve data from any which way that you can think of. The trick to using this type of approach is twofold. The first you can control somewhat, how to load the data into it. The second is more difficult - how to index it and retrieve the data efficiently.

Along came the data warehouse vendors who recognized this and came up with the approach of the "cubes." Since most native databases did not support this schema, they built their own, or built extensions to better known databases to support their needs.

Recently all the major DB vendors have announced a way to handle these "cubes" in some native way. The idea of course is better efficiency and more control since the functions are built-in. Microsoft announced this in SQL Server V7 and IBM did it with V4R4. IBM's approach was to use a xxxxxxxx index schema that allows the system to store indexes to these multiple arrays more efficiently, and that of course means that the retrieval of the data would be quicker.

The AS/400 uses two schemes for indexing, the binary radix index and the vector index. We have been using the binary index since the days of the System/38 (of course we all know what a binary index is, right?). The vector index is new and gives us some very powerful features as long as you use it for what it was designed. And of course as long as you don't use DDS. What sacrilege! Your standards require that all files -- physical and logical -- have DDS.

What the heck does all this mean? I know from letters (and e-mail) I receive, that many of you are not "bits and bytes" technical, so here is a quick explanation (here come the techies!). Basically a radix index is one where the system keeps track of the physical locations of the data as well as optimizing multiple records with the same keys. They don't really have another record, just counters that tell the system how many locations there are. The big advantage of this over the vector index is that the system can guarantee the physical sequence of the records to your application. Records with the same key can be delivered FIFO or LIFO if you so desire.

A vector index has advantage being able to keep track of multiple paths to the same data. This is where the multidimensional aspect comes into play. It is good for delivering lots of summary data based on various criteria. It does not, however, guarantee the sequence of the transactions within a key sequence. Usually, for queries and other summary applications like data warehousing, this is not a problem since you are summarizing the data anyway.

Well if you are going to use this feature, you will have to do it in SQL. Only with SQL, using the Create Table command, can the user specify that the index should use the vector index. Alright, before you get all upset about this, let's think about what we have here: A new way to index records for mass retrieval. If that means nothing to you, then forget it. There are lots of vendors working as we speak to come out with data warehouse tools that will take advantage of this and you won't even know they are doing it. And realistically, who cares.

Yes, there are those of you who use system level APIs, but the real truth is that most of those are done by vendors who have products where they need it. It was IBM's' way to open up the system, without opening up the system.

The net of this, is that IBM has made the AS/400 a viable player in the data warehouse marketplace (maybe a little expensive, but ...). What you have to do, if you are talking with a data warehouse vendor whose product runs on the AS/400's database, is to make sure that they make use of this facility. It will be the most efficient for their retrievals. If you have a shop (or are part of one) where you have lots of queries over large database files, sorting every which way, then you may want to look into using this yourself.

It is not difficult, but like everything technical, you have to try it to see if it works for you, then decide how you will implement it. Happy Indexing!

John Bussert is president of Swift Technologies (Marengo, Ill.), a company specializing in AS/400 and Windows NT software. jbussert@stecnet.com.