Data Assets: Forecasting Large Data Warehouse Applications in a Banking Environment
Bank of America has a large data warehouse application known as the Enterprise Data Warehouse (EDW). The EDW application is a DB2 database used for data mining. It consists of long-running queries that search for patterns and trends to help the bank become more profitable. Data mining queries include analyzing customer accounts at specific bank locations; searching for opportunities to increase the bank’s market share; and developing marketing strategies to offer additional products to customers. Although the bank has a huge customer base and many types of products, it also has competition. The information obtained from the DB2 queries gives the bank valuable time to react to changing customer needs and to target business opportunities. In today’s competitive, fast-changing market, in which the world wants instant everything, the data warehouse gives the bank the window of opportunity it needs to accelerate every aspect of the banking business. [Pritchett, P., "Culture Shift," Pritchett Publishing, 1993, p.5]
Even though a DB2 data warehouse is a valuable resource that can provide the needed "edge" in today’s market, the capacity planning group at Bank of America continually analyzes the DB2 queries and the amount of CPU consumption during prime time. Recently, the bank upgraded the CPU processor to support this important application. The capacity planning group then analyzed the EDW’s DB2 queries with a methodology called principal component analysis. This analysis shows how query blockers can greatly reduce hourly CPU consumption during prime time.
Data Warehouse Queries
The EDW application is installed on a single processor that was recently upgraded to 1606 MIPS. Analyzing and forecasting the EDW application has been challenging for the capacity planning group for the following reasons:
• User queries are unpredictable. [Agosta, L., "The Essential Guide to Data Warehousing," Prentice Hall, 2000, p.270]
• Unanticipated users impact the workload. [Anavi-Chaput, V., and others, "Data Warehousing with DB2 for OS/390," IBM Corp, 1997, p.201]
• Data warehouse queries can last from one to two hours or longer.
• Ad hoc queries require between two and 10 times more CPU per query than a predefined, tuned DB2 query. [Agosta, L., p.294]
Bank of America’s DB2 analysts have determined that during prime time, 95 percent of the DB2 queries are ad hoc. The processor upgrade effected an immediate decrease in query elapsed time; however, overall CPU utilization still ranges between 85 percent and 91 percent during prime time. According to Bill Inmon and others in 1999, "an ad hoc query is dangerous from the standpoint of performance in that it is very easy to submit a query that will consume many resources." [Inmon, W., Rudin, K., Buss, C., Sousa, R., "Data Warehouse Performance," John Wiley & Sons, 1999, p.33]
Bank of America’s EDW application runs in its own LPAR and shares processor resources with another minor LPAR. Although 66.6 percent of all queries used less than 10 CPU seconds, 1,493 nonparallel queries used over 180 seconds of CPU time.
It is important to increase parallelism in order to decrease elapsed time, even though the CPU per parallel query will increase. [Wiorkowski, G., "DB2 for OS/390 Development for Performance," Gabrielle & Associates, 1999, p.614] The bank cannot fully exploit parallelism because of problems with parallelism in DB2 for OS/390 Version 5. Even after the processor upgrade, the EDW application often approaches full CPU capacity during prime time. Analysts who access the EDW application have complained that some queries submitted after 5:00 p.m. did not complete before the processor was IPL’d. When DB2 Version 6 is installed in the near future, both parallelism and analyst satisfaction should increase sharply.
Principal Component Analysis
The capacity planning group continually explores different approaches to analyzing workloads and developing forecasts. Using SAS proc factor, principal component analysis was employed to analyze and forecast the EDW’s CPU usage. According to Hatcher and Stepanski in 1994, principal component analysis has the following advantages [Hatcher, L.H., Stepanski, E.J., "A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics," SAS Institute Inc., Cary, NC, 1994, p.450]:
• Develops a small number of artificial variables.
• Accounts for most of the variance in the observed variables.
• May be used as a predictor.
• Serves as a variable reduction procedure.
• Serves as a data reduction procedure that makes no assumptions.
An eigenvalue represents the amount of variance that is accounted for by a given component. [Hatcher, L.H., Stepanski, E.J., p.455] The cumulative percent of variance accounted for by components 1 through 5 is 89 percent. Therefore, the CPU categories of less than 10 CPU seconds and of 10 to 30 CPU seconds were not statistically significant from a CPU-seconds perspective, and were eliminated.
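The component-selection step above can be sketched in a few lines. The sketch below uses NumPy rather than SAS proc factor, and synthetic hourly data in place of the bank’s actual SMF/DB2 accounting records; the seven CPU-second categories and the 89 percent cutoff are illustrative assumptions, not the article’s actual run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hourly observations: queries per hour in seven
# hypothetical CPU-second categories (e.g. <10, 10-30, ..., >180).
X = rng.normal(size=(200, 7))
X[:, 3:] += X[:, [3]] * 0.8          # induce correlation among the heavy categories

# Principal component analysis on the correlation matrix, as SAS proc
# factor does by default with the principal-components method.
R = np.corrcoef(X, rowvar=False)
eigvals = np.linalg.eigvalsh(R)[::-1]      # eigenvalues, largest first

# Each eigenvalue is the variance accounted for by one component.
cum_pct = np.cumsum(eigvals) / eigvals.sum() * 100
for k, pct in enumerate(cum_pct, start=1):
    print(f"components 1-{k}: {pct:5.1f}% of variance")

# Keep the leading components that together explain ~89% of the variance;
# categories loading only on the discarded components can be dropped.
n_keep = int(np.searchsorted(cum_pct, 89.0) + 1)
print("components retained:", n_keep)
```

With real data, the categories whose variance falls entirely in the discarded trailing components are the ones eliminated from the forecast, as the low-CPU categories were here.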
The application of proc corr then shows the average number of queries per hour and the average number of CPU seconds per category, and proc reg generates the multiple regression formula. This formula produced an average of 666 MIPS per hour for DB2 usage. The accuracy of principal component analysis is verified by comparing the similar weekly averages of MIPS per hour (0800 to 1500, Monday through Friday): 642 for the week of June 5, and 656 for the week of June 12.
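The regression step can be sketched as follows. This is a minimal illustration with NumPy in place of proc reg, using fabricated per-category query counts and per-query CPU costs; the category costs, the 3600-CPU-seconds-per-hour engine assumption, and the resulting MIPS figure are assumptions for illustration and do not reproduce the bank’s 666-MIPS result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: for each prime-time hour, the number of queries
# observed in each of four retained CPU-second categories.
hours = 120
queries = rng.poisson(lam=[15, 5, 2, 1], size=(hours, 4)).astype(float)

# Assumed average CPU seconds per query in each category (illustrative).
true_cost = np.array([45.0, 90.0, 150.0, 400.0])
cpu_seconds = queries @ true_cost + rng.normal(0, 50, hours)

# Fit the multiple regression CPU = b0 + b1*x1 + ... + b4*x4,
# the same model proc reg would estimate.
design = np.column_stack([np.ones(hours), queries])
coef, *_ = np.linalg.lstsq(design, cpu_seconds, rcond=None)

# Convert predicted CPU seconds per hour to average MIPS consumed,
# assuming one engine rated at 1606 MIPS delivers 3600 CPU seconds/hour.
pred_cpu = design @ coef
avg_mips = pred_cpu.mean() / 3600 * 1606
print(f"forecast average DB2 usage: {avg_mips:.0f} MIPS per hour")
```

The fitted coefficients recover the per-category CPU cost per query, which is what makes the formula usable as a forecast: plug in expected query counts per category and read off expected MIPS.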
When DB2 Version 6 is installed in the near future, DB2 analysts want to explore the use of query blockers. Using query blockers to prevent long-running or problem queries from executing during prime time can greatly reduce CPU usage. A query blocker stops a query from going into execution if it is suspected that the query will monopolize the system. [Inmon, W., Rudin, K., Buss, C., Sousa, p.142] Because a long-running query can be reassigned from prime time to another time period, massive amounts of resources can be saved. [Inmon, W., Rudin, K., Buss, C., Sousa, p.142]
Controlling the DB2 queries that use between 120 and 180 CPU seconds, and those that use over 180 CPU seconds, yields significant savings in CPU consumption during prime time. Controlling these queries with query blockers and/or moving long-running queries to the off-shift can have a major resource-saving impact.
The capacity planning group of Bank of America tracks and analyzes parallel and nonparallel queries. Through the use of a methodology called principal component analysis, the capacity planning group can isolate major variables that contribute to high hourly CPU consumption. From principal component analysis, a multiple regression formula can be generated. This formula helps the capacity planner understand the effects of long-running queries during prime time, and the subsequent need for query blockers to control CPU resource consumption.
Prioritization of prime time CPU usage in tandem with periodic hardware and software upgrades will contribute to Bank of America’s ability to profitably compete in today’s market.
About the Author: Rick Isom is a Capacity Planner with the Large Systems Capacity Planning Group at Bank of America (Richmond, Va.). He has over 30 years’ experience in computer operations, computer performance reporting, and capacity planning. He can be reached at (804) 553-5458, or email@example.com.