In-Depth

Parallel Processing with PL/I

As chip processing speeds approach physical limits such as the speed of light, researchers must find other ways to increase processing power. One idea which holds promise in that regard is parallel processing. Parallel processing uses multiple processors to solve an application task in less time. To be executed in parallel, an application task must be broken into independent pieces. The main task calls subtasks, each performing a portion of the overall application task. The main task effectively "forks" to various subtasks and then waits for the subtasks to complete. The wait by the main routine "joins" the subtasks together, and processing resumes in the main routine after the join.
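Before diving into the figures, the minimal shape of fork and join in PL/I looks something like the following sketch (the subtask names are illustrative, and it assumes a compiler with the multitasking facility enabled). The EVENT option on CALL forks a subtask; the WAIT statement performs the join:

FORKJOIN: PROC OPTIONS(MAIN);
   DCL DONE1 EVENT;
   DCL DONE2 EVENT;
   /* FORK: EACH CALL ... EVENT STARTS A SUBTASK */
   CALL SUB1 EVENT(DONE1);
   CALL SUB2 EVENT(DONE2);
   /* JOIN: MAIN WAITS FOR BOTH SUBTASKS */
   WAIT(DONE1, DONE2);
   PUT SKIP LIST('ALL SUBTASKS COMPLETE');

SUB1: PROC;
   PUT SKIP LIST('SUBTASK 1 RUNNING');
END SUB1;

SUB2: PROC;
   PUT SKIP LIST('SUBTASK 2 RUNNING');
END SUB2;

END FORKJOIN;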

Parallel processing is here today on IBM mainframes. Current mainframes have multiple processors per machine, and the PL/I programming language has been outfitted with parallel tasking constructs to take advantage of that fact.

Discussion

The following discussion covers the use of parallel PL/I for CPU bound and I/O bound applications. The distinction between the two types of applications in the context of this article is fairly simple. If most of an application's time is spent processing data stored in program memory, it is a CPU bound application. If most of an application's time is spent performing reads and writes to files, it is classified as an I/O bound application.

Parallel processing on IBM mainframes with PL/I is best illustrated by example. Having been a software developer for over twenty years, I learn the most by examining code which demonstrates a technique. Therefore, a number of parallel PL/I examples will be presented.

The first step in putting together a parallel application is to identify sequences of code which can be executed at the same time. Consider a problem mainframe application programmers routinely solve: table lookup.

Table lookup is commonly used to validate data. If a data item exists in a table, then the item is valid and processing continues. Otherwise, the data item is rejected and the user is typically notified. Table lookup can be a time consuming process. Suppose you had to validate a million rows of input against a million row lookup table. The binary search routine has historically solved this problem in COBOL, but the data in the table must be sorted. How about a parallel algorithm to solve this problem?

Suppose your table contains 450,000 rows or so, as shown in Figure 1. The data can be split, or distributed, into equal pieces for lookup purposes. The objective is to balance the load of work performed by each subtask on a separate processor. As can be seen in Figure 1, the main routine must wait for the longest-running subtask to complete. Therefore, each subtask should be given about the same amount of work.

Figure 1

TLOOK: PROC OPTIONS(MAIN);
   DCL ECB1 EVENT;
   DCL ECB2 EVENT;
   DCL ECB3 EVENT;
   DCL ECB4 EVENT;
   DCL I FIXED BIN(31);
   DCL X(450411) CHAR(7);
   DCL NPTS FIXED BIN(31) INIT(450411);
   DCL LEFT_OVER FIXED BIN(31);
   DCL NPROC FIXED BIN(31) INIT(4);
   DCL AZONE PIC '999999';
   DCL FOUND BIT(1) INIT('0'B);
   DCL HBOUND BUILTIN;
   DCL TIME BUILTIN;

   /* BUILD THE LOOKUP TABLE: 'A000001' ... 'A450411' */
   DO I = 1 TO HBOUND(X, 1);
      AZONE = I;
      X(I) = 'A' || AZONE;
   END;

   PUT SKIP LIST('START TIME: ', TIME());

   /* DISTRIBUTE BLOCK, EVENLY -> BALANCE LOAD */
   LEFT_OVER = NPTS;
   DO I = 1 TO NPROC - 1;
      LEFT_OVER = LEFT_OVER - (NPTS / NPROC);
   END;
   PUT SKIP LIST('FIRST 3 SIZE: ', NPTS / NPROC);
   PUT SKIP LIST('LAST DIST SIZE: ', LEFT_OVER);

   /* FORK: EACH SUBTASK SEARCHES ITS OWN BLOCK OF X */
   CALL LOOK('A187500', 1, NPTS / NPROC, FOUND) EVENT(ECB1);
   CALL LOOK('A187500', NPTS / NPROC + 1, 2 * NPTS / NPROC,
             FOUND) EVENT(ECB2);
   CALL LOOK('A187500', 2 * NPTS / NPROC + 1, 3 * NPTS / NPROC,
             FOUND) EVENT(ECB3);
   CALL LOOK('A187500', 3 * NPTS / NPROC + 1, NPTS,
             FOUND) EVENT(ECB4);

   /* JOIN: WAIT FOR ALL FOUR SUBTASKS TO COMPLETE */
   WAIT(ECB1, ECB2, ECB3, ECB4);
   PUT SKIP LIST('FOUND= ', FOUND);
   PUT SKIP LIST('STOP TIME: ', TIME());

LOOK: PROC(KEY, LO, HI, FOUND);
   DCL KEY CHAR(7);
   DCL LO FIXED BIN(31);
   DCL HI FIXED BIN(31);
   DCL FOUND BIT(1);
   DCL I FIXED BIN(31);
   /* STOP AS SOON AS ANY SUBTASK HAS SET FOUND */
   DO I = LO TO HI WHILE(FOUND = '0'B);
      IF X(I) = KEY THEN FOUND = '1'B;
   END;
END LOOK;

END TLOOK;

The name of the subtask in Figure 1 is LOOK. LOOK is passed a range of items to search for the item being validated. In effect, each of the four subtasks searches a separate portion of the X array. A variable common to all the subtasks is the bit switch FOUND. When any subtask finds a match, the FOUND switch is set to TRUE. The WHILE condition on the loop in the subtask examines FOUND at each iteration. Therefore, when any of the subtasks finds a matching item, all subtasks terminate and control returns to the main routine.

The expected number of compares for a sequential table lookup is N / 2, where N is the number of items in the table. Under the parallel table lookup scheme, the expected number of compares is (N / number of subtasks) / 2. Hence, the parallel algorithm appears to take less time than a sequential table lookup to find a given item. Perhaps the parallel algorithm is even faster than an equivalent binary search? The expected number of compares under binary search is log base 2 of N. For N of around 100 or less and 8 processors working simultaneously, the parallel table lookup beats the binary search with regard to expected number of compares, and the items need not be sorted. Any similar function whose work grows with the number of items involved, and which can be broken into independent pieces, can be performed more quickly under a parallel scheme.
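As a rough check of that arithmetic, the following sketch (illustrative only) prints the three expected compare counts for N = 100 and eight subtasks, using the PL/I LOG2 built-in for the binary search case:

COMPARES: PROC OPTIONS(MAIN);
   DCL N FIXED BIN(31) INIT(100);  /* TABLE SIZE          */
   DCL P FIXED BIN(31) INIT(8);    /* NUMBER OF SUBTASKS  */
   DCL (LOG2, FLOAT) BUILTIN;
   PUT SKIP LIST('SEQUENTIAL: ', N / 2);          /* 50         */
   PUT SKIP LIST('PARALLEL:   ', N / P / 2);      /* ABOUT 6    */
   PUT SKIP LIST('BINARY:     ', LOG2(FLOAT(N))); /* ABOUT 6.64 */
END COMPARES;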

As it turns out, the parallel table lookup is not faster than a sequential search of the entire table. At least not today. The time to reference data across processors offsets the time savings due to the parallel algorithm. But such an algorithm will become quicker as parallel programming constructs on the mainframe grow more efficient at sharing data between processors.

Consider another example. The sample PL/I code in Figure 2 performs an in-memory sum in parallel fashion.

Figure 2

PP: PROC OPTIONS(MAIN);
   DCL ECB1 EVENT;
   DCL ECB2 EVENT;
   DCL I FIXED BIN(15);
   DCL NUMBER_OF_TASKS FIXED BIN(15) INIT(2);
   DCL X(30000) FIXED DEC(15, 2);
   DCL PART_SUM(2) FIXED DEC(15, 2) INIT(0, 0);
   DCL AVERAGE FIXED DEC(15, 2);
   DCL HBOUND BUILTIN;
   DCL DECIMAL BUILTIN;
   DCL TIME BUILTIN;

   DO I = 1 TO HBOUND(X, 1);
      X(I) = I;
   END;

   PUT SKIP LIST('START TIME: ', TIME());

   /* DISTRIBUTE CYCLIC */
   CALL DO_PART_SUM(1, NUMBER_OF_TASKS, PART_SUM(1)) EVENT(ECB1);
   CALL DO_PART_SUM(2, NUMBER_OF_TASKS, PART_SUM(2)) EVENT(ECB2);
   WAIT(ECB1, ECB2);

   PUT SKIP LIST('SUM1= ', PART_SUM(1));
   PUT SKIP LIST('SUM2= ', PART_SUM(2));
   PUT SKIP LIST('N= ', HBOUND(X, 1));
   AVERAGE = (PART_SUM(1) + PART_SUM(2)) / (DECIMAL(HBOUND(X, 1)));
   PUT SKIP LIST('AVERAGE= ', AVERAGE);
   PUT SKIP LIST('STOP TIME: ', TIME());

DO_PART_SUM: PROC(TASK_NUM, NUM_TASKS, SUM) OPTIONS(REENTRANT);
   DCL TASK_NUM FIXED BIN(15);
   DCL NUM_TASKS FIXED BIN(15);
   DCL SUM FIXED DEC(15, 2);
   DCL I FIXED BIN(15) CONTROLLED;
   ALLOCATE I;
   SUM = 0.0;
   DO I = TASK_NUM TO HBOUND(X, 1) BY NUM_TASKS;
      SUM = SUM + X(I);
   END;
   FREE I;
END DO_PART_SUM;

END PP;

The distribution of data for the X array in Figure 2 is cyclic. That is, the items from X which are incorporated into a partial sum are chosen according to a cycle. The first subtask forms a partial sum from X items 1, 3, 5, etc. The second subtask forms a partial sum from X items 2, 4, 6, etc. The two partial sums are added together after the WAIT in the main routine to yield the total sum of the FIXED DECIMAL numbers in the X array.

The first argument to DO_PART_SUM is the starting index position. If the call to DO_PART_SUM were in a loop and the loop index were passed as argument 1, the first subtask could receive the second subtask's starting index position, because PL/I passes arguments by reference. It is important to be careful with data shared between subtasks: such access must be synchronized or eliminated. The PL/I code in Figure 2 eliminates the problem by explicitly passing the starting index values of 1 and 2 rather than the value of an index variable within a loop. The sum is calculated correctly, even when the subtasks execute at the same time during parallel execution.
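To make the hazard concrete, here is a hypothetical loop version of the calls that should be avoided; ECB is an assumed event array. Because the loop index I is passed by reference, a subtask may read its first argument after the main routine has already incremented I:

/* HAZARDOUS SKETCH: SUBTASKS SHARE MAIN'S LOOP INDEX */
DCL ECB(2) EVENT;
DO I = 1 TO NUMBER_OF_TASKS;
   CALL DO_PART_SUM(I, NUMBER_OF_TASKS, PART_SUM(I)) EVENT(ECB(I));
END;  /* MAIN MAY CHANGE I BEFORE A SUBTASK READS IT */
WAIT(ECB(1), ECB(2));

One remedy, besides writing the calls out as Figure 2 does, is to pass (I) in parentheses so the compiler builds a dummy argument and each subtask receives its own copy of the starting index.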

Note also in Figure 2 that the subtask code is REENTRANT. This means each subtask shares the same set of instructions, which is more efficient than loading a separate copy of the code for each subtask. But local variables must be allocated per subtask. Hence, I is CONTROLLED (dynamically allocated). This ensures that the I in subtask 1 is not confused with the I in subtask 2. The shared instructions of a reentrant program reduce the number of loads of multiple copies of a program, but reentrancy is not necessary for effective parallel processing. The code in Figure 1 is not reentrant.

How about the performance of the PL/I code in Figure 2? As it turns out, once again, a sequential version of the sum function is quicker than its parallel equivalent, for the reasons cited previously. Bottom line: with the current state of mainframe technology, CPU bound application functions are executed more quickly in a sequential manner.

Now suppose your subtasks need to share access to data stored in the main routine. Access to the common data in main must be synchronized. For example, you don't want subtask 1 to write over data which subtask 4 has just written when subtask 3 needs to know what subtask 4 just wrote. Therefore, the subtasks must take turns writing to the common data area stored in main.

Synchronization can be accomplished in a number of ways. A semaphore can be employed. Basically, a semaphore is a flag variable in the main routine which is set each time a subtask intends to write a variable shared by the subtasks. The reading and writing of the semaphore must be made atomic. Atomic means the entire set of operations is performed completely or not performed at all. A DB2 transaction involving updates to several DB2 tables is an atomic entity: either all the updates are applied and the transaction is committed, or the transaction is backed out entirely. Atomicity implies an all or nothing proposition. The posting of a semaphore can be made atomic by using the Compare and Swap assembler instruction. (Note that UNIX has a time window between its read and write of a semaphore.)

To use a semaphore, each subtask has to loop while it waits to get the semaphore, and that looping burns CPU cycles. The problem is better solved with the 370 Assembler ENQUEUE macro, which puts a subtask to sleep while it awaits the semaphore. PL/I uses the ENQUEUE macro indirectly via the PL/I WAIT statement. This technique is illustrated in Figure 3. The WAIT on the semaphore synchronizes access to the common variable MSG. Each subtask takes turns writing to MSG, and none of the common data in main (MSG) is ever lost or corrupted.

Figure 3

MULTIE: PROC OPTIONS(MAIN);
   /* MULTIPLE TASKS, TASK IS REENTRANT, */
   /* SHARED DATA(MSG), SEMAPHORE VIA    */
   /* EVENT SEMA_FREE. TASKS WAIT, ENQ?  */
   /* -> TASKS SYNCHRONIZE THEIR ACCESS  */
   /* TO SHARED DATA.                    */
   DCL TIME BUILTIN;
   DCL TIME_STR CHAR(9);
   DCL MSG CHAR(80) INIT(' ');
   DCL ECB1 EVENT;
   DCL ECB2 EVENT;
   DCL ECB3 EVENT;
   DCL SEMA_FREE EVENT;
   DCL COMPLETION BUILTIN;

   COMPLETION(SEMA_FREE) = '1'B;  /* SEMAPHORE STARTS FREE */
   TIME_STR = TIME;
   PUT SKIP LIST('*START TIME=', TIME_STR);

   CALL A_TASK('A') EVENT(ECB1);
   PUT SKIP LIST('*AFTER CALL TASK A');
   CALL A_TASK('B') EVENT(ECB2);
   PUT SKIP LIST('*AFTER CALL TASK B');
   CALL A_TASK('C') EVENT(ECB3);
   PUT SKIP LIST('*AFTER CALL TASK C');

   WAIT(ECB1, ECB2, ECB3);
   PUT SKIP LIST('*AFTER WAIT');
   TIME_STR = TIME;
   PUT SKIP LIST('*STOP TIME=', TIME_STR);

A_TASK: PROC(TASK_ID) OPTIONS(REENTRANT);
   DCL TASK_ID CHAR(1);
   PUT SKIP LIST('IN TASK ', TASK_ID);
   WAIT(SEMA_FREE);
   COMPLETION(SEMA_FREE) = '0'B;  /* CRITICAL SECTION START */
   MSG = 'TASK= ' || TASK_ID;
   PUT SKIP LIST('MSG= ', MSG);
   COMPLETION(SEMA_FREE) = '1'B;  /* CRITICAL SECTION STOP */
END A_TASK;

END MULTIE;

So what is the advantage of using PL/I for parallel processing? Consider an I/O bound application. One I/O is roughly the equivalent of 60,000 CPU instructions, and a program often waits on an I/O controller. If subtask 1 does something useful while subtask 2 is waiting on I/O, the expected amount of time to run the application will be less. Any time a task blocks at some point in its processing, another task can be doing useful work. A programmer can take advantage of this fact by coding parallel tasks.

Consider Figure 4. The subtask in Figure 4 performs a simple linear regression on the X and Y arrays. At the end of each iteration, a record is written to SYSPRINT with the result. Parallelizing this function is faster than its sequential equivalent. Waiting for an I/O controller is a waste of time; something else can be done during the wait, like an I/O by a different subtask.

Figure 4

REGP: PROC OPTIONS(MAIN);
   DCL ECB1 EVENT;
   DCL ECB2 EVENT;
   DCL ECB3 EVENT;
   DCL ECB4 EVENT;
   DCL B0 FLOAT DECIMAL(16);
   DCL B1 FLOAT DECIMAL(16);
   DCL SUMXX FLOAT DECIMAL(16);
   DCL SUMXY FLOAT DECIMAL(16);
   DCL XBAR FLOAT DECIMAL(16);
   DCL YBAR FLOAT DECIMAL(16);
   DCL X(25) FLOAT DECIMAL(16);
   DCL Y(25) FLOAT DECIMAL(16);
   DCL N FIXED BIN(15);
   DCL TIME BUILTIN;

   N = 25;
   X(1) = 1000.0;
   X(2) = 3000.0;
   .
   .
   X(25) = 4000.0;
   Y(1) = 16.0;
   Y(2) = 59.0;
   .
   .
   Y(25) = 72.0;

   PUT SKIP LIST('START TIME: ', TIME());

   CALL REG EVENT(ECB1);
   CALL REG EVENT(ECB2);
   CALL REG EVENT(ECB3);
   CALL REG EVENT(ECB4);
   WAIT(ECB1, ECB2, ECB3, ECB4);

   PUT SKIP LIST('STOP TIME: ', TIME());

REG: PROC;
   DCL I FIXED BIN(31);
   DO I = 1 TO 20000;
      CALL AVE(X, XBAR);
      CALL AVE(Y, YBAR);
      CALL SUMVV(X, Y, SUMXY);
      CALL SUMVV(X, X, SUMXX);
      B1 = (SUMXY - (N * XBAR * YBAR)) /
           (SUMXX - (N * XBAR ** 2));
      B0 = YBAR - B1 * XBAR;
      PUT SKIP LIST('B0= ', B0, ' B1= ', B1);
   END;
END REG;

AVE: PROC(ANARRAY, MEAN);
   DCL ANARRAY(25) FLOAT DECIMAL(16);
   DCL MEAN FLOAT DECIMAL(16);
   DCL TOTAL FLOAT DECIMAL(16) INIT(0.0);
   DCL I FIXED BIN(15);
   DCL FLOAT BUILTIN;
   DO I = 1 TO N;
      TOTAL = TOTAL + ANARRAY(I);
   END;
   MEAN = TOTAL / FLOAT(N, 16);
END AVE;

SUMVV: PROC(ANARRAY1, ANARRAY2, TOTAL);
   DCL ANARRAY1(25) FLOAT DECIMAL(16);
   DCL ANARRAY2(25) FLOAT DECIMAL(16);
   DCL TOTAL FLOAT DECIMAL(16);
   DCL I FIXED BIN(15);
   TOTAL = 0.0;
   DO I = 1 TO N;
      TOTAL = TOTAL + ANARRAY1(I) * ANARRAY2(I);
   END;
END SUMVV;

END REGP;

At a major telecommunications company in Colorado, a parallel database archive system was developed in PL/I. The system dumps twenty-eight DB2 tables, seven at a time. The key to the increased efficiency of this system is I/O overlap. While one of the seven concurrent tasks is performing I/O against the database tables, the other six tasks are performing other processing. The total amount of time to dump the twenty-eight DB2 tables is effectively less under the parallel scheme.

At a major computer company in Colorado, a Program Product Acquisition system is being re-engineered to take advantage of inherent parallelism. Basically, the Acquisition system processes a list of files, one after another. The files can be very large, and execution of the present system can take several hours.

The proposed system will process multiple files at the same time. Once again, while one parallel subtask is performing I/O, the other subtasks are performing the various functions of the Acquisition application. Total execution time can be reduced, thereby allowing Acquisition to run within its allotted time period.
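A sketch of the batched fork-and-join pattern underlying both systems follows. The names DUMP_TABLE, NTAB and NCONC are assumptions for illustration, not code from either system; seven subtasks are forked per batch, and the main routine joins each batch before starting the next:

ARCH: PROC OPTIONS(MAIN);
   DCL ECB(7) EVENT;
   DCL (T, J) FIXED BIN(15);
   DCL NTAB FIXED BIN(15) INIT(28);  /* TABLES TO DUMP      */
   DCL NCONC FIXED BIN(15) INIT(7);  /* CONCURRENT SUBTASKS */

   DO T = 1 TO NTAB BY NCONC;
      DO J = 1 TO NCONC;  /* FORK ONE BATCH OF SUBTASKS */
         CALL DUMP_TABLE((T + J - 1)) EVENT(ECB(J));
      END;
      WAIT(ECB(1), ECB(2), ECB(3), ECB(4),
           ECB(5), ECB(6), ECB(7));  /* JOIN THE BATCH */
   END;

DUMP_TABLE: PROC(TABNO) OPTIONS(REENTRANT);
   DCL TABNO FIXED BIN(15);
   /* UNLOAD LOGIC FOR TABLE NUMBER TABNO GOES HERE; */
   /* ITS I/O OVERLAPS THE OTHER SIX SUBTASKS' WORK. */
END DUMP_TABLE;

END ARCH;

Note that the table number is passed as the expression (T + J - 1), so each subtask receives a dummy argument copy rather than a reference to the changing loop indexes, sidestepping the shared-index hazard discussed earlier.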

The following is illustrative of the time savings realized in the two I/O bound parallel application systems just described. The application involves the processing of two files. Each file requires one read, and each file is processed by four instructions.

SEQUENTIAL

Time  Action
  1   File 1 - Execute Instruction 1
  2   File 1 - Execute Instruction 2
  3   File 1 - I/O Wait 1
  4   File 1 - I/O Wait 2
  5   File 1 - I/O 1
  6   File 1 - Execute Instruction 3
  7   File 1 - Execute Instruction 4
  8   File 2 - Execute Instruction 1
  9   File 2 - Execute Instruction 2
 10   File 2 - I/O Wait 1
 11   File 2 - I/O Wait 2
 12   File 2 - I/O 1
 13   File 2 - Execute Instruction 3
 14   File 2 - Execute Instruction 4

MULTIPROCESSOR

Time  Action
  1   Task 1 and 2 Setup
  2   Task 1 - Execute Instruction 1 + Task 2 - Execute Instruction 1
  3   Task 1 - Execute Instruction 2 + Task 2 - Execute Instruction 2
  4   Task 1 - I/O Wait 1 + Task 2 - I/O Wait 1
  5   Task 1 - I/O Wait 2 + Task 2 - I/O Wait 2
  6   Task 1 - I/O 1 + Task 2 - I/O Wait 3
  7   Task 1 - Execute Instruction 3 + Task 2 - I/O 1
  8   Task 1 - Execute Instruction 4 + Task 2 - Execute Instruction 3
  9   Task 2 - Execute Instruction 4
 10   Context Switches Between Tasks 1 and 2

The sequential run takes 14 time units and the multiprocessor run takes 10, so the MULTIPROCESSOR technique is approximately twenty-nine percent faster than the SEQUENTIAL equivalent. For a process which takes hours, the time savings due to parallelism can be significant.

Conclusion

Parallel processing is alive and well on IBM mainframes today, and parallel applications can be written in the PL/I programming language. Parallel I/O bound applications are faster than their sequential equivalents. Parallel CPU bound applications take about the same amount of time as their sequential equivalents, but it's only a matter of time until the IBM mainframe reduces the cost of accessing data across processors. Parallel constructs will show up in COBOL as the advantages of parallelism become better understood and the constructs more effective. So the next time you write a piece of code, take a look at the program from a parallel perspective. You may find something interesting which can save you a significant amount of application processing time.

ABOUT THE AUTHOR:

Dick Brodine is a consultant at Highland Management Inc. (Longmont, Colo.) and has been a teacher, writer and software developer on mainframes and other platforms for 21 years. He holds a Master of Science in Computer Science, a Master of Science in Operations Research and a Master's in Energy Resources. Contact him at vendrpb@us.ibm.com.

