Better Data Management Without Jeopardizing Customer Relations
Experian North America is constantly receiving and transporting data from large data providers, in some cases across platforms. These data providers send data using an electronic data set transfer product that does "on-the-fly" compression. Some data providers also send this data on tape or disk media.
With the popularity of the Internet, many data providers would also like to use FTP as a means of transferring this data. However, FTP does not provide data compression. Since some of these data transfers involve gigabytes of data, the information had to be compressed. As we are all aware, the Internet is not without its issues or concerns. At the top of this list are bandwidth and security. Experian needed to overcome these two issues before feeling comfortable receiving data over the Internet.
With offices in 17 countries, Experian provides clients in more than 50 countries worldwide with products and services to manage the entire customer relationship cycle, ensuring that the right actions are taken at each stage. They help these clients to identify potential new customers and target them with the most suitable offer, set appropriate financial terms and process applications promptly, manage customer accounts efficiently and deliver exceptional customer service, and act on each opportunity to expand relationships further.
The Challenge
The North American division of Experian needed to find a way to better manage the amount of data that they were transferring to and from clients on a daily basis. The biggest challenge was finding a simple solution that would allow each of the clients to use their preferred means of transferring data, taking into account the different platforms that their clients preferred to use. They did not want to dictate to their clients how the data should be transferred and they did not want to have a different solution for each of the transfer methods. In addition, they needed to make sure that whatever solution they implemented would still provide a secure means of sending this information.
To define the ultimate objective, Experian needed to find a cross-platform compatible solution that allowed them to reduce their bandwidth usage while securing their data transfers with encryption and verifying the integrity of the data transferred.
Representatives from Experian downloaded a 30-day trial evaluation from www.asizip.com and started testing it for their needs. Experian was only interested in a fully supported PKZIP because of the critical nature of the data being compressed. The product provided the functionality and reliability that was needed.
There were four main issues that Experian needed to satisfy in order to have a complete solution for their data set transfer. The final solution must work across platforms, reduce bandwidth, secure the data sets, and verify the data integrity.
Work Across Platforms
The data sets Experian handles are typically transferred as binary. FTP is rapidly becoming the preferred means to transfer this data by the providers; therefore the information is FTPed from the data provider’s system to the Connect: Mailbox software acting as an FTP server. This data is used to update the File One Database at Experian.
PKZIP MVS allows for conversion between ASCII and EBCDIC character sets. Compressed text data sets can be transparently moved between IBM Mainframe environments and systems based on the ASCII character set, such as PCs or UNIX. Since there are still other preferred means of transferring this information, such as electronic data set transfer and tape or disk media, PKZIP MVS needed to be able to work with these as well.
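The character set conversion described above can be illustrated with a minimal Python sketch. It assumes EBCDIC code page 037 (Python's `cp037` codec); the article does not say which EBCDIC code page PKZIP MVS uses, and the function names and sample record are hypothetical.

```python
# Round-trip a text record between ASCII and EBCDIC.
# Assumption: code page 037 (US/Canada EBCDIC). The code page that
# PKZIP MVS actually uses is not stated in the article.
EBCDIC_CODEC = "cp037"


def ascii_to_ebcdic(data: bytes) -> bytes:
    """Re-encode ASCII text bytes as EBCDIC bytes."""
    return data.decode("ascii").encode(EBCDIC_CODEC)


def ebcdic_to_ascii(data: bytes) -> bytes:
    """Re-encode EBCDIC text bytes as ASCII bytes."""
    return data.decode(EBCDIC_CODEC).encode("ascii")


record = b"ACCOUNT 0001"                 # hypothetical sample record
ebcdic_record = ascii_to_ebcdic(record)

print(record.hex())                      # ASCII byte values
print(ebcdic_record.hex())               # same text, different byte values
print(ebcdic_to_ascii(ebcdic_record))    # converted back to ASCII
```

The same text has different byte values on each platform, which is why text data sets need conversion while binary data sets must be transferred unchanged.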
PKZIP MVS supports compression from, and extraction to, sequential data sets, PDSs and PDS members, VSAM data sets, and magnetic tapes and cartridges. In addition, PKZIP MVS supports the compression and extraction of MVS Load Libraries. Extracted load modules remain executable so long as their original blocksize is preserved. It also supports compression of and extraction to Generation Data Groups (GDGs). A GDG generation may also be used as an archive.
Because of PKZIP MVS’s compatibility, Experian was able to leave the transfer method decision up to the individual clients. When compressing these data sets, PKZIP MVS can store allocation information, such as RECFM, logical record length, blocksize, units, volumes, extents, etc. with the data set in the ZIP archive. This information assists the Experian staff in recreating the data set during the UNZIP operations, even if they are UNZIPPING the information on other platforms.
Reduce Bandwidth
Data compression is generally achieved by eliminating redundancy within a data stream. This is normally done by identifying areas of the data that contain repeated data patterns and replacing them with smaller coded sequences, a technique known as Lossless Compression.
Lossless compression means exactly that: no information is removed or lost during compression. Since the redundant information is merely replaced with a smaller coded sequence, the compression utility is able to replace each coded sequence with the original text. When the data set is decompressed, all the original information will be present.
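As a small illustration of the lossless property, the following Python sketch compresses a highly repetitive record with the standard `zlib` module (a Deflate implementation, not PKZIP itself) and confirms that decompression restores every byte. The sample data is made up for the example.

```python
import zlib

# Hypothetical, highly redundant sample data.
original = b"NAME=SMITH;STATE=CA;" * 1000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Repeated patterns are replaced by shorter codes, so the output shrinks.
print(len(original), len(compressed))

# Lossless: the restored data is byte-for-byte identical to the original.
print(restored == original)
```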
There is another form of compression that is primarily used for compressing graphics or sound. This other form of compression is known as Lossy. During Lossy compression, the less important information is permanently removed from the file. When the file is restored, the user typically cannot see a difference in the graphic or hear a difference in the sound, but the less important information has been permanently removed from the file.
PKZIP Version 2 uses two separate compression algorithms in a process known as deflation. The first process compresses the data using a form of Lempel-Ziv (LZ) compression called sliding dictionary. Sliding dictionary LZ, also known as LZ77, works by writing tokens that identify where repeating phrases have previously occurred in the data set. Instead of checking the entire data set for matching phrases, it uses only a part of the data set. The term sliding dictionary is used because the algorithm uses a fixed-size block whose address is repeatedly incremented as the data set is read. The second process applies Huffman coding to the output of the first, assigning shorter codes to the tokens that occur most often.
PKZIP uses a 32K block size, which provides a balance between compression ratio and compression speed. As the block size is increased, the length of time required to check for repeating patterns increases. This is an excellent form of data compression which performs well on data sets that contain repeating character sequences.
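The sliding dictionary idea can be sketched in a few lines of Python. This is a deliberately naive illustration, not PKZIP's implementation: the token format, the function names, and the brute-force search are assumptions made for readability, and the second (Huffman coding) stage of deflation is omitted. The 32K window matches the block size mentioned above.

```python
def lz77_compress(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Return tokens: ('lit', byte) or ('ref', distance, length)."""
    tokens = []
    i = 0
    while i < len(data):
        start = max(0, i - window)       # only look back one window
        best_len, best_dist = 0, 0
        # Naive search: try every earlier position inside the window.
        for j in range(start, i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 258):    # cap on match length
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            # Point back at the earlier occurrence instead of repeating it.
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):      # byte-wise copy allows overlap
                out.append(out[-dist])
    return bytes(out)


sample = b"abcabcabcabc"
tokens = lz77_compress(sample)
print(tokens)
print(lz77_decompress(tokens) == sample)
```

Because every position in the window is tried, the search cost grows with the window size, which is the speed-versus-ratio trade-off the 32K figure balances. Production implementations use hash chains or similar structures instead of this brute-force loop.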
The most significant difference between a textbook implementation of LZ compression and the mechanism that PKZIP uses is that when the dictionary becomes full, PKZIP only partially clears it of phrases. Many programs that implement LZ and LZW compression clear the dictionary completely and start the construction of phrases from scratch.
The benefit of only partially clearing the dictionary is faster compression with better results. Because the dictionary is only partially cleared, PKZIP rebuilds it continuously. It spends CPU time replacing redundancies instead of fully clearing the dictionary and beginning again from scratch after each 32K block.
Similar to the PKZIP DOS utility, PKZIP MVS is a high-performance data compression product containing two main programs, PKZIP and PKUNZIP. The PKZIP program is used to compress or store data sets into a ZIP data set or "archive." This process can be comprehensively controlled using options allowing such things as compression speed adjustment, password encryption, and full control over Data Control Block attributes on the output data set (i.e., the archive).
A PKZIP ISPF panel interface is available to simplify the use of the various features of PKZIP MVS. PKZIP can also be executed in batch from JCL, run from ISPF panels, or invoked from a user’s program (COBOL, ASSEMBLER, etc.). PKZIP currently supports the following RECFMs: U, F, FA, FB, FBA, FBM, FBS, FM, V, VA, VB, VBA, VBM, and VM. It also supports SEQ, PDS, VSAM (ESDS, RRDS, KSDS), and GDGs. Soon, PKZIP MVS will also support PDS/E’s and HFS data sets.
One of the last data compression obstacles that Experian needed to overcome was compressing the gigabytes of information that clients wished to FTP to them. PKZIP MVS has the ability to compress data sets with sizes up to 4 Gigabytes uncompressed. If the data set exceeds 4 Gigabytes, PKZIP MVS offers the added ability to use GZIP compression. An add-on to PKZIP MVS version 2.51, GZIP technology breaks the 4 Gigabyte limit of PKZIP MVS. The compression algorithm used by GZIP is very similar to the algorithms used in PKZIP.
Although it uses slightly different terminology to perform the functions, GZIP will still retain the name of the data set, the date and time of the data set, the CRC value, the compression method used on the data set, and the uncompressed size of the data set. In addition, GZIP members can also store platform specific information.
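The fields listed above can be seen by inspecting a GZIP member directly. The sketch below uses Python's standard `gzip` module rather than PKZIP MVS, and the file name, timestamp, and payload are illustrative values only. It reads the fixed header and the trailer as laid out in the GZIP file format specification.

```python
import gzip
import io
import struct
import zlib

payload = b"CUSTOMER RECORD " * 500      # hypothetical data set contents

buf = io.BytesIO()
# filename and mtime are illustrative values, not taken from the article.
with gzip.GzipFile(filename="sample.dat", mode="wb",
                   fileobj=buf, mtime=946684800) as gz:
    gz.write(payload)
member = buf.getvalue()

# Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS.
magic, method, flags, mtime, _xfl, os_code = struct.unpack(
    "<HBBIBB", member[:10])

# Optional original file name (FNAME flag, 0x08), NUL-terminated.
# This sketch assumes no FEXTRA field (0x04) precedes the name.
name = None
if flags & 0x08 and not flags & 0x04:
    end = member.index(b"\x00", 10)
    name = member[10:end].decode("latin-1")

# 8-byte trailer: CRC-32 and uncompressed size, both little-endian.
crc, size = struct.unpack("<II", member[-8:])

print(name, mtime, method, f"{crc:08X}", size)
print(crc == zlib.crc32(payload), size == len(payload))
```

The `os_code` byte in the header is one place where platform-specific information is recorded.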
The data sets that Experian is working with vary in size from several hundred megabytes to several gigabytes. In only a few minutes, PKZIP MVS was able to compress these data sets an average of 75 percent, significantly reducing transfer times and costs.
Secure Data Sets
PKZIP provides the ability to encrypt compressed data, so that a password is required to decompress it. PKZIP uses a standard 40-bit encryption algorithm for this purpose. Please remember that PKZIP is a data compression tool, not an encryption tool.
Verify Data Integrity
When a Cyclic Redundancy Check (CRC) is performed on a data set, the data making up the data set is passed through an algorithm which computes a value based on the contents of the data set. The result is an eight hexadecimal digit number for that data set. A change in the contents of the data set will produce a different CRC value.
The CRC process provides a very good means of determining whether one data set exactly matches another: if the CRC values are the same, then the contents should be the same. PKZIP calculates a CRC value for a data set before it is compressed and stores the value, with the compressed version of the data set, in the ZIP archive. When the data set is extracted by PKUNZIP, a CRC value is computed for the extracted data set and compared with the original CRC value. If the values match, it is extremely likely, though not a mathematical certainty, that the data set has been restored exactly as PKZIP originally found it. If the CRC value stored in the archive and the CRC calculated following extraction do not match, PKUNZIP reports the mismatch. In this circumstance, it is likely that there has been some corruption of the data in the ZIP archive. The corruption of archives is, more often than not, caused by bad data transfers.
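The store-then-compare cycle described above can be reproduced with Python's standard `zipfile` and `zlib` modules. This is a sketch of the general ZIP mechanism, not of PKZIP MVS or PKUNZIP themselves; the member name and data are hypothetical.

```python
import io
import zipfile
import zlib

data = b"ACCOUNT 0001 BALANCE 000123.45\n" * 100   # hypothetical data set

# The CRC-32 of a data set prints as eight hexadecimal digits.
crc_before = zlib.crc32(data)
print(f"{crc_before:08X}")

# Compress into an in-memory ZIP archive; the CRC is stored with the member.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("accounts.dat", data)

# Extract and compare the stored CRC with one computed after extraction.
with zipfile.ZipFile(archive) as zf:
    stored_crc = zf.getinfo("accounts.dat").CRC
    extracted = zf.read("accounts.dat")   # raises BadZipFile on a mismatch
    first_bad = zf.testzip()              # None when every member checks out

crc_after = zlib.crc32(extracted)
print(crc_before == stored_crc == crc_after, first_bad)

# Changing even one byte of the contents produces a different CRC value.
altered = b"X" + data[1:]
print(zlib.crc32(altered) != crc_before)
```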
PKZIP MVS has become a part of Experian’s daily routine. The clients are able to transfer the information using their preferred transfer method, and Experian is confident that the information will arrive intact. PKZIP’s 32-bit Cyclic Redundancy Check validates this.
About the Authors:
Jerome Vitner is a Systems Programmer for Experian North America.
Tait Hamiel is an MVS Specialist for ASCENT SOLUTIONS Inc. (ASi).