The recursive decoupling method for solving tridiagonal linear systems


1 Loughborough University Institutional Repository The recursive decoupling method for solving tridiagonal linear systems This item was submitted to Loughborough University's Institutional Repository by the/an author. Additional Information: A Master's Thesis submitted in partial fulfilment of the requirements for the award of Master of Philosophy of the Loughborough University of Technology. Metadata Record: Publisher: Giulia Spaletta Please cite the published version.

2 This item was submitted to Loughborough University as an MPhil thesis by the author and is made available in the Institutional Repository ( under the following Creative Commons Licence conditions. For the full text of this licence, please go to:

3 LOUGHBOROUGH UNIVERSITY OF TECHNOLOGY LIBRARY [author/filing title and accession/copy number stamp]

4

5 The Recursive Decoupling Method for Solving Tridiagonal Linear Systems by GIULIA SPALETTA, Dott. A Master's Thesis Submitted in partial fulfilment of the requirements for the award of Master of Philosophy of the Loughborough University of Technology September, 1991 Supervisor: Professor D. J. EVANS, D.Sc. by Giulia Spaletta, 1991

6 Loughborough University of Technology Library [date and accession number stamp]

7 Declaration I declare that this thesis is a record of research work carried out by me, and that it is my own composition. I also certify that neither this thesis nor the original work contained therein has been submitted to this or any other institution for a higher degree. G. SPALETTA

8 Acknowledgements I wish to express my deepest and most sincere gratitude to Professor D. J. Evans for giving me the opportunity to carry out this work, in the first instance; and subsequently for his friendly and unfailing guidance, continuous help and inspiring enthusiasm throughout this research. Finally, for his invaluable advice and infinite patience during the writing of this thesis. Thanks to Professor Evans, I can consider the period I spent studying under his supervision as one of the most fruitful experiences both in my academic career and life. I also wish to thank: - Mrs. J. Poulton for her professional help and constant, friendly presence; - Mr. M. Sofroniou for his interest and fruitful collaboration, for many stimulating discussions, not to mention his typing of the whole of chapter 5 (the core of this thesis) and the improvements he has brought to the original manuscript. Most of all, I wish to express to him my special indebtedness for being such a patient, true friend; - Miss H. Y. Sanossian, Mr. N. M. Bahoshy and Dr. A. Osbaldestin for their active co-operation and constant support as colleagues, but most of all as very good friends; - Dr. W. S. Yousif and Mr. G. S. Samra for all their technical advice (and their infinite patience); - Miss L. Howard for her help in typing part of this thesis; - all my colleagues and the staff of the Department of Computer Studies. Finally, I thank my family for their love and understanding.

9 Abstract

10 Abstract The work presented in this thesis mainly concerns the analysis of parallel algorithms for the solution of tridiagonal linear systems and the design of a new tridiagonal equation solver, which can be run on a MIMD (Multiple Instruction Multiple Data stream) type parallel computer, in particular the Balance 8000 Sequent system at Loughborough University of Technology. In the first chapter, an introduction to the existing computer models is given, together with a brief description of the process that has led from the uniprocessor machine to the development of different parallel architectures. Emphasis is given to MIMD shared memory systems. In this respect, the main characteristics of the Sequent system are presented, as well as the main programming features supported by the Balance Operating System, the Dynix. The second chapter presents the fundamentals of parallel programming on the Balance 8000 computer. Terms and concepts that are specific to multitasking programs are introduced. Also, the two multitasking methods, data partitioning and function partitioning, are outlined. In the same chapter, we investigate problems (such as program dependencies, sharing of data, synchronization of concurrent processes) arising from the adaptation of an application to parallel versions, and the related programming techniques. Some of the parallel programming tools are described, with particular attention to the so-called "data partitioning with Sequent Fortran" and "data partitioning with Dynix". Chapter 3 starts with an outline of the most well known algorithms for the solution of tridiagonal systems, one of which is analysed in more detail in chapter 4. Parameters used to evaluate performance are defined, such as

11 speed-up, efficiency and computational complexity, together with the basic principles of Parallel Numerical Analysis. In the fourth chapter, the Wang tridiagonal system solver is presented. We have considered a variant of this partitioning method suitable for MIMD architectures, and we have modified it to run on the Balance 8000. Test matrices have then been used, in order to evaluate the performance of the Wang routine on the Balance computer and to form a comparison with the new Recursive Decoupling routine of chapter 5. The fifth chapter constitutes the core of the whole thesis. The new algorithm also belongs to the class of partitioning methods, since it is based on repeated partitioning of the coefficient matrix into 2x2 submatrices; this strategy, together with a rank-one updating procedure, allows us to calculate the solution explicitly, by solving independent sets of subsystems. Furthermore, the method turns out to be intrinsically parallel and suitable for solution on multiprocessor architectures. The performance of the Recursive Decoupling routine on the Balance 8000 computer has been tested by using the same example matrices as those used to test the Wang method. The thesis concludes with a chapter summarizing the main results and suggestions for further research. Keywords: Tridiagonal Linear Equations; Shared Memory Parallel Computers; Sequent Balance 8000 Multiprocessor; Partition Method; Recursive Decoupling Method; Parallel Numerical Analysis.

12 Contents

13 Contents
Acknowledgements
Abstract
1. Introduction to Parallel Computers
1.1. Introduction
1.2. A Classification of Computer Models
1.3. Shared Memory Systems
1.4. Parallel Numerical Analysis and the Flynn Classification of Computer Models
1.5. The Balance 8000 Parallel Processing System
2. Principles of Parallel Programming on the Balance 8000
2.1. Introduction to Parallel Programming on the Balance 8000
2.2. Parallelizable Applications: Homogeneous and Heterogeneous Multitasking
2.3. Program Dependencies
2.4. Elements of Parallel Programming
2.5. Parallel Programming Tools
3. Parallel Numerical Analysis: the Tridiagonal Linear Systems Problem
3.1. Introduction
3.2. Performance Evaluation Parameters: Speed-up and Computational Complexity
3.3. Fundamentals of Parallel Numerical Analysis

14
4. The Wang Partitioning Method
4.1. Introduction
4.2. The Wang Algorithm
4.3. The Wang Fortran Routine
4.4. Numerical Experiments and Remarks
5. The Recursive Decoupling Method
5.1. Introduction to the Recursive Decoupling Method
5.2. The Partitioning Process
5.3. The Recursive Decoupling Process
5.4. The Recursive Decoupling Algorithm
5.5. An Analytical Example
5.6. A Numerical Example
5.7. The Recursive Decoupling Routine
5.8. Numerical Experiments and Remarks
6. Conclusions and Further Work
6.1. Conclusions and Suggestions for Further Work
References
Appendix. Programs Listings

15 1. Introduction to Parallel Computers

16 1.1. Introduction In the last few years we have seen an explosion in the interest in parallel processors and parallel programming. The aim of parallel processing is to reduce the elapsed time needed to complete a job. This time will basically depend on the coding style, the architecture of the machine and the hardware implementation. The job of everybody in charge of software development (system designers, compiler and library writers, programmers) is to get the actual time required by the calculations as close as possible to the ideal. Tools have been developed to express the parallelism explicitly, either in the form of subroutine libraries or language extensions; furthermore, studies are still in progress concerning the automatic parallelization of sequential code. To date, the only automatic system available is limited to individual loops. Parallelism at a higher level must still be specified by the programmer. 1.2. A Classification of Computer Models A knowledge of the computer architecture and the hardware implementation is not essential to the programmer. However, when performance becomes critical, a good understanding of the hardware parallelism can be fundamental to the program's tuning. In spite of all the efforts made to write portable programs, some algorithms will run efficiently on certain architectures and poorly on others. The situation is worse for parallel processors than for uniprocessors, due to the wider variety of architectures. 1

17 We can state a classification of different computer models, based on those aspects of the hardware implementation of parallelism that most affect the coding style [16]: 1) shared memory systems (figure 1.1); 2) distributed memory systems, also called message passing systems (figure 1.2 and figure 1.4); 3) hybrid systems (figure 1.3). We are mostly interested in the first type of computer architecture, therefore we shall present a brief study of this kind of parallel machine. FIGURE 1.1. Schematic of a shared memory system (several CPUs connected to a single memory). FIGURE 1.2. Schematic of a distributed memory system: fully interconnected message passing machine. FIGURE 1.3. Schematic of a hybrid machine (CPU-memory pairs linked by a connection network). 2

18 FIGURE 1.4. Distributed memory systems. (a) Ring connection machine. (b) Star connection machine. (c) Mesh machine. (d) Hypercube of order 3. (M: memory.) 3

19 1.3. Shared Memory Systems A shared memory machine has a single global memory accessible to all processors. Each processor may have some local memory (such as the "registers" on the Cray X-MP or the "cache" on the IBM 3090). The data organization inside the memory (global and local memory) is totally transparent to the user. The data access time is independent of the processor making the request. This is not to say that there is no memory contention: problems like page faults, memory bank conflicts, etc., still affect the performance. Algorithms are easy to design for shared memory systems. The data input on these machines is done as if running on a uniprocessor. On the other hand, programs are hard to debug. The most common type of error involves picking up wrong data from a global variable. There is no indication of when the error occurred, so the computing process continues, producing an erroneous final result. Data organization, therefore, is a key to parallel algorithms, even on a shared memory computer. Unfortunately, the most commonly used language for scientific purposes (Fortran) only allows quite simple data structures (just scalars and arrays), inducing the programmer to concentrate on program flow rather than on data management. The latest version of the Fortran language permits the use of a wider variety of structures and mechanisms. The data sharing specification, though, still constitutes a fundamental problem on shared memory systems, a problem that becomes even more critical when the parallelism is nested. 4

20 To simplify the programmer's job in this last case, most parallel processors provide only a single level of parallelism; that is to say, a master process is allowed to spawn subprocesses, while the subprocesses may not themselves spawn processes. Data is either known to all the created processes or is private. As a consequence of everything that has been said so far, the shared memory systems need a few language extensions. Firstly, the need arises to declare which data is private to each processor (local data) and which is known to all processors (global data). Secondly, synchronization is needed to prevent out-of-sequence access of different processors to the shared memory. The following considerations answer the above mentioned problems. The work in a shared memory machine is usually divided up in a so-called "fork-join" style: one process spawns the subprocesses (fork) and waits for them to finish (join). A means to restrict access to the code is needed; it is obtained by introducing the concept of a "critical section", which is a section of code executed by all processors, one at a time (such as in the case of a reduction variable). The concept of a "sequential section" is also introduced, which is a part of the code that has to be executed by only one processor and skipped by all the others. A sequential section is typically used to initialise global data. The easiest way of obtaining synchronization is the JOIN construct. When this is not possible, other constructs have to be used, such as "barriers" or 5

21 "semaphores". All these onepts will be more preisely illustrated in the following paragraphs. Finally, sme the ost of sharing data is very small in shared memory mahines, programmers often tend to parallelize the ode at the Do-loop level. In the ase of independent loop iterations, eah proessor an run a different subset of the loop index range, providing that eah index value is used exatly one. There are basially two ways of parallelizing a Do-loop. One way is to assign the first loop index value to the first arrived proessor, the seond index value to the seond proessor, and so on. Whenever a proessor has ompleted its task (its loop iteration), it returns to the top to get more work. In this way, an automati load balaning is realized. On the other hand, this way of obtaining a parallel Do-loop requires some form of synhronization, to assure that eah proessor gets a unique value of the loop index. A seond way to parallelize a Do-loop is to partition it so that eah proessor will do a ertain set of loop iterations. This way of proeeding is to be preferred if the work is naturally load balaned, and expeially if the synhronization ost is high Parallel Numerial Analysis and the Flynn Classifiation of Computer Models In lassial numerial analysis, a universal omputer model is represented by the Von Neumann mahine; this an be shematized as follows (figure 1.5): 6

22 FIGURE 1.5. Scheme of the Von Neumann machine (processor with Logic & Arithmetic Unit and Control Unit, input, output, memory and program). L. A. U.: Logic & Arithmetic Unit. C. U.: Control Unit. The main features of this universal computer are: a) digital representation of variables; b) serial processing, carried out according to the basic operations of arithmetic and logic; c) the program is a coded version of the algorithm to be implemented; d) data are held in the main memory. The algorithms of classical numerical analysis are then based on the Von Neumann model and entail a large number of elementary operations. This basic serial model has been taken as the starting point for all further developments, until the concept of "parallelism" began to be discussed. Parallelism was to be interpreted in the widest sense, that is not just to build a parallel digital computer, but also to create a body of numerical mathematics 7

23 which exploits the possibilities offered by parallel computers. Furthermore, the question arose as to whether there exists a maximal parallelism for a given range of problems. All these facts led to the need for a "parallel numerical analysis". Connected to this need was the problem of formulating a standard machine model for parallel numerical methods. During the last thirty years, the performance of serial machines has been improved greatly, due to the use of new technology and new designs. Parallel features have been introduced: in the organization of input/output channels; by overlapping the execution of instructions; by using interleaved storage techniques. Starting from these, new developments have been realized, leading to a truly parallel machine. Gains have been obtained, such as: 1) increase of computing speed; 2) possibility of solving problems too complex for serial computers; 3) exploitation of the inherent parallelism of some problems; 4) possibility of calculating a solution in real time. On the other hand, parallel computers present new difficulties, due to a complicated organization of the data and also due to machine dependent optimization for efficiency. At present, there is still no standard model for parallel systems. Such a model could be represented as shown in the following figure: 8

24 FIGURE 1.6. General configuration of a parallel computer with different levels of parallelism (M: memory; P: processor; the diagram shows control networks, a decoding-of-instructions and control unit, processors, stores and a data network). In the above diagram parallelism is possible at different levels: within the control unit; among processors; among the stores; in the data network. The above figure, though, is too general both for the building of a functioning computer and for the development of algorithms. Such a standard diagram can only be taken as a theoretical basis for parallel numerical analysis and parallel computers. 9

25 Depending on which level of parallelism is implemented in the diagram of figure 1.6, we can state the following classification of computers (this classification is due to Flynn [13]): 1) SISD machines: this is the Von Neumann model (Single Instruction - Single Data stream); 2) SIMD machines: array processors, pipeline processors and associative machines belong to this class (Single Instruction - Multiple Data stream); 3) MIMD machines: computers with several data processors and multiple processor systems belong to this class (Multiple Instruction - Multiple Data stream); 4) MISD machines: it has been proven that this type of organization (Multiple Instruction - Single Data stream) is equivalent to that of a Von Neumann machine; therefore the MISD class is considered empty. FIGURE 1.7. Scheme of a SISD computer (control unit, processor, memory). 10

26 FIGURE 1.8. Scheme of a SIMD computer. FIGURE 1.9. Scheme of a MISD computer (with data organisation network). FIGURE 1.10. Scheme of a MIMD computer (with data organisation network). NOTE. C: control unit; P: processor; M: memory. 11

27 In the context of parallel numerical analysis, all these computer models involve problems of rounding errors and their propagation, together with questions of numerical stability of the algorithm used. The SIMD organization, in particular, is suitable for classes of numerical problems such as: matrix operations; numerical integration of differential equations; Monte Carlo methods; pattern recognition. MIMD machines consist of a certain number m of independent processors P1, P2, ..., Pm, each having its own control unit (C1, C2, ..., Cm respectively). All these processors share, among other things, a number of input/output units and a main memory. At every instant each processor can carry out different instructions in parallel, that is to say all processors can operate simultaneously. Unlike the SIMD machines, the MIMD computers are considered as "general purpose" computers, because they are much more flexible than the SIMD ones and a greater variety of problems can be solved with them. As mentioned before, in this work we are only concerned with true multiprocessor shared memory machines; an example of this kind of machine is represented by the Balance 8000 computer. In the following paragraph we will briefly introduce the Balance architecture and the parallel programming capabilities of this system. 12

28 1.5. The Balance 8000 Parallel Processing System The Balance 8000 Sequent system is a multiprocessor shared memory machine and therefore it belongs to the MIMD class. Its main features are the following [24]: a) it is a true multiprocessor, consisting of multiple identical processors (CPUs); each CPU is a general purpose 32 bit microprocessor; b) it is a shared memory machine, i.e. there is a single common memory; an application can consist of multiple processes, all accessing shared data held in the memory; c) it is a tightly coupled machine, i.e. all processors share a single pool of memory; sharing memory is a natural way for two processes (running on different processors) to communicate with each other. Note that a tightly coupled multiprocessor can do more than assign non-interacting processes to different processors. It can also distribute a single process among many processors, so that each processor only executes part of the calculation. This is done, as we will see in the following chapter, to get a "speed-up" (that is, if a process takes time t to run on a uniprocessor, it could take time t/n to run on n processors); d) the Balance system has a symmetric architecture, since all processors are identical and can execute both user code and operating system code; e) there is a single high-speed Common Bus, used by all the processors, the memory modules and the input/output controllers: this is done to simplify the adding of processors, memory and input/output bandwidth; f) programs written for a uniprocessor system can run on the Balance system in such a way that it appears transparent to the user; that is, programs do 13

29 not need to be modified for multiprocessing support. Processors can be added or removed, with no need to modify either the operating system or the user applications; g) dynamic load balancing is provided automatically by the processors, to ensure that all processors are kept busy (in the most efficient possible way) as long as there are executable processes available; h) hardware support for mutual exclusion is provided, to enable the user to lock any section of physical memory, whenever there is the need for exclusive access to shared data structures. The following figure illustrates the components of a typical Balance 8000 system (taken from Sequent Computer Systems, "Balance 8000 System Technical Summary" [26]). [Figure: block diagram of a typical Balance 8000 system.] 14

30 Processors The Balance 8000 computer is designed to employ from two to twelve 32 bit CPUs, in a tightly coupled multiprocessing architecture. The CPUs are packaged two per board. To change the number of CPUs in the system it is necessary only to shut down the system and add or remove one or more dual-processor boards. No changes to the operating system or user applications are required. Memory The Balance 8000 can employ from 2 to 28 Megabytes of primary memory and it can provide 16 Megabytes of virtual address space per process. Memory is packaged in one-board or two-board memory modules. Memory can be added or removed in much the same way as the CPUs. SCSI bus The SCSI bus (Small Computer System Interface bus) is used to connect block-oriented devices, such as disk drives or tape drives, to the system. It supports high-speed, high-volume data transfer between memory and peripherals (disks, tape units). SCED board A Balance 8000 system can include from 1 to 4 SCED boards (SCSI Ethernet Diagnostic controller boards). Each SCED board can serve as host adaptor on a SCSI bus. In any Balance 8000 system one SCED board is designated the "master" SCED board: this master board connects to the system console and provides 15

31 power-up diagnostics. It also provides a power-up monitor for any program running on the main CPU, such as programs to boot the operating system. Multibus interface A Balance 8000 system can include up to 4 Multibus interfaces: they enable the system to incorporate any of a variety of peripherals and custom devices. The Balance 8000 System bus It is a high-performance data bus, tailored to multiprocessing in the sense that it provides the high bus bandwidth needed to support multiple CPUs. The Balance System bus is a 64 bit system bus which carries data among the CPUs, the memory modules and the peripheral subsystems. Network interfaces A Balance 8000 can connect to up to 4 other systems both in local area networks (one per SCED board), using Ethernet, and in wide-area networks, using ordinary telephone lines. The connection in local area networks facilitates communication among users as well as the sharing of files and devices. Each of the four connectable Ethernet local area networks can connect hundreds of systems, over distances of one mile or more. Furthermore, the Balance system networking capabilities include those common to all modern Unix systems. Terminal multiplexor This is a two-board module that resides on the Multibus and can connect to a terminal, printer, modem or other compatible device. 16

32 There can be up to 4 terminal multiplexors per Multibus. Operating system: the Dynix The Dynix operating system is a version of Unix 4.2BSD modified to exploit the Balance parallel architecture; differences between Dynix and Unix 4.2BSD are transparent to the user. Dynix also supports most utilities, libraries and system calls provided by Unix System V and, like other versions of Unix, it is a multi-user operating system. Two or more users can use the system simultaneously, while each user seems to have the system's undivided attention. This is achieved through an operating system technique called multiprogramming: a CPU moves from one process to another many times per second, so that the computer system is allowed to execute multiple unrelated processes (programs) concurrently. All the executable processes wait in a "run queue": when the CPU suspends or terminates the execution of one process, it switches to the process at the head of the run queue. The Dynix operating system uses the same technique, except that multiprogramming on Dynix is enhanced by the Balance multiprocessing architecture: in a Balance system a pool of processors is available to execute processes from the run queue. Dynix balances the system load among the available processors, keeping all processors busy as long as there is enough work available. Note that the Dynix operating system does multiprogramming for all the users automatically. Along with the multiprogramming technique, the Balance system also supports another kind of parallel programming: multitasking. 17

33 Multitasking is a programming technique that allows a single application to consist of multiple closely co-operating processes [9]. As a consequence of multitasking and multiprogramming, we can make the following considerations. By definition, parallel programs execute concurrently, meaning that at any instant the system is executing multiple programs. On a Balance system, parallel programs execute simultaneously: at any instant, the Dynix operating system can be executing multiple instructions from multiple processes (one process per CPU). Thus, parallel programming on a Balance system has two special benefits: multiprogramming yields improved "system throughput" for multiple unrelated programs, that is, each program finishes in about the time it would take on a uniprocessor (which is running that program alone); multitasking yields improved "execution speed" for individual programs, that is, the owner of an application (consisting of multiple processes) sees an improvement in the execution speed of the application itself, beyond what would be possible on a uniprocessor. In the following chapter we will analyze parallel programming on the Balance 8000, using the multitasking technique. 18

34 2. Principles of Parallel Programming on the Balance 8000

35 2.1. Introduction to Parallel Programming on the Balance 8000 As illustrated in section 1.5, the two basic kinds of parallel programming are multitasking and multiprogramming. This chapter is primarily about multitasking, since the Dynix operating system of the Balance 8000 does multiprogramming for all users automatically. Many applications can be converted from sequential algorithms to parallel algorithms with relative ease, yielding linear or quasi-linear performance improvements as more CPUs are dedicated to the task. In addition, certain types of applications can be designed specifically to exploit the Balance multiprocessing architecture. The gain in execution speed that can be achieved by means of the multitasking technique is determined by the following factors: the percentage of the program's time that can be spent executing parallel code (a great number of applications need to spend less than 2-3% of their time executing sequential code); the number of processors available to the application; the hardware contention imposed by multiple processors competing for the same resources (such as the system bus, the system common memory, etc.). Note that on a Balance system the overhead due to this hardware contention is negligible, since most CPU memory operations access cache memory, not the system bus; the overhead in creating multiple processes; the overhead in synchronization and communication among multiple processes. 19

36 In adapting an application for multitasking, therefore, we will aim to run as much of the program in parallel as possible; at the same time, we will aim to balance the computational load as evenly as possible among parallel processes. 2.2. Parallelizable Applications: Homogeneous and Heterogeneous Multitasking We also have to determine whether an application can benefit from parallelization and which kind of multitasking technique is the most suitable. A parallel application, in fact, consists of two or more processes executing simultaneously. These processes can be multiple instances of the same program ("homogeneous multitasking" or "data partitioning") or they may be distinct but co-operating programs ("heterogeneous multitasking" or "function partitioning"). Homogeneous multitasking basically consists of running the same code on each CPU. Multiple identical processes are created and work on different portions of the data structure simultaneously. Data partitioning, therefore, applies to applications performing many iterations on large data structures (e.g. matrix multiplications, Fourier transformations). The entire data structure can be divided up evenly among processes before they start work (static load balancing), or each process can work on one portion at a time, going back for more work when it finishes (dynamic load balancing). 20

37 Heterogeneous multitasking, on the contrary, assigns different code to each CPU; that is, all the processes work simultaneously on a shared data set but each process handles a different task. Applications performing many different operations on the same data set are candidates for function partitioning (e.g. flight simulation, program compilation). While some applications require function partitioning or a combination of data and function partitioning, most problems adapt more easily to data partitioning. This last method offers some advantages over function partitioning: for example, less programming effort is required to convert a serial program to a parallel algorithm. Furthermore, with data partitioning it is easier to achieve an even load balancing among processors; it is also easier to adapt the programs automatically to the number of available processors. In the remaining part of this chapter, we will only refer to the homogeneous multitasking technique. As far as the decision whether to parallelize a program is concerned, we can point out that many programs spend the majority of their time executing in very few routines (usually just one or two). When converting a program to a parallel version, it is often possible to achieve maximum gain in execution speed simply by parallelizing these few routines. Furthermore, the typical fraction of code that cannot be parallelized turns out to be just 2-3% for most programs (as already mentioned). Typical sections of code that have to be performed serially are those related to initialisation phases and input/output operations. 21

38 2.3. Program Dependencies Once the portions of parallel code have been identified, the next step is to analyse all the possible program dependencies, for any program unit [2, 3, 24]. Some program operations, in fact, may depend on previous operations, while some may be executed in any order. Program dependence analysis, therefore, is needed to carry out all the ordering necessary to guarantee correct results. When a program unit has no dependencies, the statements in that unit can be executed in any order or even simultaneously. Most of the time, this is not the case; we can group the kinds of dependencies into two classes: data dependencies and control dependencies. Within the data dependencies class, we separate: flow dependence; antidependence; output dependence. Flow dependence occurs when one operation sets a data value that is used by a subsequent operation: I) A = B + C II) D = 3 x A Statement (II) depends on the result of statement (I). Antidependence occurs when one operation uses a memory location that is loaded by a subsequent operation: 22

39 I) A = B + C II) C = 3 x B Statement (I) must execute before statement (II), since the first statement uses the current value of the variable C. Output dependence occurs when one operation loads a memory location which is also loaded by a subsequent operation: I) A = B + C II) A = D - 3 Statement (II) must execute after statement (I), or A will contain the wrong value at the end of this program unit. The second class of program dependencies is the control dependencies class; it includes dependencies due to the required flow of control in a program: I) IF (X.GT.0) II) A = B + 3 Statement (II) is conditionally executed, depending on the result of the test in statement (I). It is necessary to identify all the program dependencies within a program unit (and for all program units), in order to transform a given program, loop or subroutine, to run correctly in parallel. It is also necessary to organise the data structure correctly (shared or private) and to provide synchronization points and locking mechanisms for all the processes. 23
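To make the distinction concrete, the following short Fortran program (a hypothetical illustration, not one of the thesis routines) shows an independent loop, a loop with a loop-carried flow dependence, and a reduction variable of the kind that requires a critical section when the loop is run in parallel.

      PROGRAM DEPEND
C     Hypothetical loops illustrating the dependence classes of section 2.3.
      INTEGER N, I
      PARAMETER (N = 8)
      REAL B(N), C(N), X(N), S
      DO 5 I = 1, N
         B(I) = REAL(I)
         C(I) = 2.0 * REAL(I)
    5 CONTINUE
C     Independent loop: iteration I writes X(I) and reads only B(I), C(I);
C     no iteration depends on another, so the iterations may be divided
C     among processes in any order.
      DO 10 I = 1, N
         X(I) = B(I) + C(I)
   10 CONTINUE
C     Loop-carried flow dependence: iteration I reads X(I-1), which is set
C     by iteration I-1; this loop cannot be parallelized in this form.
      DO 20 I = 2, N
         X(I) = X(I-1) + C(I)
   20 CONTINUE
C     Reduction variable: the updates of S must be protected by a lock
C     (a critical section) or accumulated in private partial sums.
      S = 0.0
      DO 30 I = 1, N
         S = S + X(I)
   30 CONTINUE
      WRITE (*,*) 'SUM = ', S
      END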

40 2.4. Elements of Parallel Programming This section introduces some elements of parallel programming that are not common in sequential programming. We have already discussed the multitasking technique and program dependence analysis. What we still need to consider is: the creation of shared and private data; the creation and termination of multiple processes; the division of computing tasks among parallel processes ("scheduling"); the synchronization of parallel processes; the mutual exclusion of parallel processes (lock mechanisms). Let us study all these subjects, one at a time, in the above order. Shared memory and shared data The Dynix operating system allows any number of processes to share a common region of system memory. Any process that has access to a shared-memory region can read or write in that region, in the same way it reads and writes ordinary memory. Shared memory provides a direct and efficient method for co-operating processes to share data. It also simplifies the conversion of sequential algorithms to parallel ones, much more so than message-passing mechanisms or network-based machines. Multitasking programs include both shared and private data. Shared data is accessible by all the processes, while private data is accessible by only one process. 24

41 The following figure 2.1 illustrates the virtual memory contents of a process (16 Megabytes of virtual memory are allocated for each process). FIGURE 2.1. Comparison of virtual memory contents: in the conventional UNIX model a process has a stack, data and (shared) text; in the DYNIX parallel programming model it has a stack, a shared stack, shared data, private data and (shared) text. If the process forks any child processes (as we will see later), each child process inherits access to the parent's shared memory area and shared stack. Both the parent and the child processes can then access the shared data. This mechanism (besides providing an efficient way of interprocess communication) uses less memory than having multiple copies of shared data; it also avoids the overhead of making such copies of shared data. 25

42 Process creation, scheduling and termination In Dynix, as in other Unix-based operating systems, a new process is created by using a system call called a FORK. The new process (child) is a duplicate of the old process (parent): the child process shares the same files and shared memory accessible to the parent process. A process identification number (process id) distinguishes the parent process from all the created child processes: when some child processes are forked, the process id number 0 (zero) is assigned to the parent, while the process id number 1 is assigned to the first child process, the process id number 2 to the second child process, and so on. From this point on (until reaching the JOIN phase), they are separate entities. The fork operation is relatively expensive. Therefore, a parallel application should fork as many processes as it is likely to need at the beginning of the program and terminate them at the end of the program (on completion of the program itself). If a process is not needed during certain code sequences, the process can wait in a busy loop (spinning) or it can relinquish its processor to other applications (until it is needed again). In multitasking programming, tasks can be scheduled among all the created processes using three different techniques: prescheduling; static scheduling; dynamic scheduling. In prescheduling the task division is determined by the programmer before the program is compiled. The programmer assigns a specific task to each process. 26

43 Automatic load balancing, therefore, is not allowed by the prescheduling technique (which only applies to heterogeneous multitasking). In static scheduling the tasks are scheduled by the processes at run time, but they are divided in some predetermined (static) way. The static scheduling procedure for one process is as follows: 1st step) it works out all the tasks that it will do; 2nd step) it does all its tasks; 3rd step) it waits until all other processes finish their work. Static scheduling produces static load balancing: since the division of tasks is statically determined, some processors may stand idle while one processor completes its work. This static technique only applies to homogeneous multitasking. In dynamic scheduling the tasks are scheduled by the processes at run time and they are taken from a task queue. The dynamic scheduling procedure for one process is: 1st step) it waits until there are some tasks to execute; 2nd step) it removes the first task from the task queue and executes it; 3rd step) if there are any more tasks to execute, it goes on to the second step; otherwise, it goes back to the first step. Dynamic scheduling produces dynamic load balancing: all the processes are kept busy as long as there is work to be done; the work-load is evenly distributed among the processes. This dynamic technique applies to both homogeneous and heterogeneous multitasking. Dynamic scheduling, though, entails more overhead than static scheduling: each time a process schedules a task for itself, it must check the shared task queue (to make sure that there is more work to do) and it must remove that task from the queue. A sketch of the two run-time scheduling styles is given below. 27
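As a sketch of the two run-time scheduling styles, the following hypothetical subroutine distributes the N iterations of an independent loop among the processes created by m_fork, first with static (interleaved) scheduling and then with dynamic self-scheduling through the shared counter routine m_next of Table 2.3. The Fortran calling sequences and the exact behaviour of the shared counter are assumptions of this sketch; they are documented in the Sequent Guide to Parallel Programming [24].

      SUBROUTINE SCHED(X, B, C, N)
C     Sketch of static and dynamic run-time scheduling for one process
C     created by m_fork (Fortran bindings of the microtasking library
C     are assumed; see Table 2.3 and [24]).
      INTEGER N, I, ME, NPROCS
      REAL X(N), B(N), C(N)
      INTEGER M_GET_MYID, M_GET_NUMPROCS, M_NEXT

C     Static scheduling: process ME takes iterations ME+1, ME+1+NPROCS, ...
C     The division of work is fixed before the loop starts.
      ME     = M_GET_MYID()
      NPROCS = M_GET_NUMPROCS()
      DO 10 I = ME + 1, N, NPROCS
         X(I) = B(I) + C(I)
   10 CONTINUE
      CALL M_SYNC()

C     Dynamic scheduling: each process repeatedly takes the next unclaimed
C     iteration from a shared counter (assumed here to deliver 1, 2, 3, ...).
   20 I = M_NEXT()
      IF (I .LE. N) THEN
         X(I) = B(I) + C(I)
         GO TO 20
      END IF
      CALL M_SYNC()
      RETURN
      END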

44 Process synchronization, loop scheduling and lock mechanisms Synchronization is fundamental to ensure that each process performs its work without interfering with the other processes. It is not unusual for a looping subprogram (to be executed in parallel) to contain a code section which depends on all the processes having completed execution of the preceding code. All real application programs contain the program dependencies we have studied in section 2.3. We therefore need some synchronization mechanisms to ensure the correct execution of multiple co-operating processes; these mechanisms are basically: barriers; locks and semaphores. A barrier is a synchronization point: on reaching a barrier, a process marks itself as "present"; then it waits for all the other processes to arrive. There are two kinds of barriers. It is possible to synchronise all processes at a single pre-initialised barrier. 28

45 With the second type of barrier, the programmer is allowed to set more than one barrier or to synchronise just a subset of the processes. A lock is the simplest kind of semaphore in the Balance Dynix system. It ensures that only one process at a time can access a shared data structure. A lock has two values: locked or unlocked. Before attempting to access a shared data structure, a process waits until the lock associated with the data structure is unlocked (indicating that no other process is accessing the structure). The process then locks the lock, accesses the shared data and finally unlocks the lock. While a process is waiting for a lock to become unlocked, it "spins" in a loop, producing no work. It is impossible for two processes to acquire a lock at the same time. Even when several processes attempt the same lock at once, only one succeeds, while all the others have to wait (until the first process has released the lock). Semaphores are synchronization mechanisms based on the locking/unlocking principle; they are used to protect order-dependent sections of code and to manage queues. "Counting/queueing" semaphores, for example, are useful for queue management. When several processes are waiting for a lock, the lock will go to the first process that tries to acquire it right after it is unlocked. Counting/queueing semaphores can ensure that the lock is assigned (instead) to the process that has waited the longest for it. If a barrier is used for synchronization, a process is delayed in a spinning state (called a "busy-wait" state) until a set number of processes have reached the barrier. 29
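As an illustration of this locking discipline, the following hypothetical fragment (again assuming the Fortran calling sequences of the m_lock, m_unlock and m_sync routines of Table 2.3) lets each process add a private partial sum into a shared total, one process at a time, and then waits at a barrier before the total is used. The shared variable TOTAL is assumed to be initialised to zero in a sequential section before the processes are forked.

      SUBROUTINE ACCUM(X, N, TOTAL)
C     Hypothetical reduction step executed by every process created with
C     m_fork: the update of the shared variable TOTAL is a critical
C     section, protected by the single microtasking lock (Table 2.3).
      INTEGER N, I, ME, NPROCS
      REAL X(N), TOTAL, PART
      INTEGER M_GET_MYID, M_GET_NUMPROCS

      ME     = M_GET_MYID()
      NPROCS = M_GET_NUMPROCS()

C     Each process first accumulates a private partial sum.
      PART = 0.0
      DO 10 I = ME + 1, N, NPROCS
         PART = PART + X(I)
   10 CONTINUE

C     Critical section: only one process at a time updates TOTAL.
      CALL M_LOCK()
      TOTAL = TOTAL + PART
      CALL M_UNLOCK()

C     Barrier: no process proceeds until all have added their share.
      CALL M_SYNC()
      RETURN
      END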

46 When using a lock or a semaphore the situation is more complex; in this case, in fact, there exist four possibilities concerning what a process should do while it is waiting for its turn to access the locked code section. These four possibilities are: 1) the process does not wait; it performs a different task and checks the lock again later; 2) the process spins in a "busy-wait" state; 3) the process "blocks", that is to say it relinquishes its processor to another job; 4) the process spins for a specified period of time, then it blocks. We complete this paragraph with a consideration affecting input/output handling. Input/output, in parallel programming, is complicated by the need for caution when multiple processes write to the same file. These complications can usually be reduced by performing input/output only during sequential phases or by designating one process as a server to perform the input/output operations. This chapter is concluded by introducing the parallel programming tools supported by the Balance system. 30

47 2.5. Parallel Programming Tools The applications that can be adapted for parallel programming vary greatly in their requirements for data sharing, interprocess communication, synchronization, etc. [4]. To gain optimum speed-up, the programmer must develop an algorithm that meets these requirements (while still exploiting the application's inherent parallelism). To aid in this effort, the Balance system supports programming tools that adapt to the needs of a wide range of applications. We are mostly interested in two of these parallel programming tools: the Fortran Parallel Programming Directives (Sequent Fortran); the Parallel Programming Library (Dynix). We illustrate these two tools in more detail in the following sections and we show how to employ them for data partitioning. Fortran Parallel Programming Directives: data partitioning with Sequent Fortran The Fortran Parallel Programming Directives support parallel execution of Fortran Do-loops. By interpreting these directives, the Sequent Fortran compiler can restructure a Do-loop for parallel execution. The user prepares the program for the preprocessor by inserting a set of directives: these directives identify the loops to be executed in parallel; they also identify the shared and private data within each loop and any critical sections of the loop under consideration. Furthermore, the Fortran Parallel Programming Directives allow the user to control the scheduling of loop iterations among processes and the data division among all processes. 31

48 Ideally, the loop to be chosen for parallel execution should be an "independent" loop (i.e. a loop in which no iteration depends on the operations in any other iteration). Otherwise, it is reasonable to choose a loop which accounts for a large portion of the computation. Finally, in the case of nested loops, choose the outermost loop (if possible). Once it has been determined which loop to prepare for parallel execution, it is necessary to analyse all the variables in that loop and to classify them into one of the following categories: shared variables; local variables; reduction variables; shared ordered variables; shared locked variables. After this analysis phase, the user is ready to use the Fortran Parallel Programming Directives to prepare the loop under consideration for parallel execution; these directives are listed in the following table (Table 2.2): 32

49 TABLE 2.2. Parallel Programming Directives.
DIRECTIVE      DESCRIPTION
C$DOACROSS     Identify Do-loop for parallel execution
C$ORDER        Start loop section which contains a shared ordered variable
C$ENDORDER     End loop section which contains a shared ordered variable
C$             Add Fortran statement for conditional compilation
C$&            Continue parallel programming directive
At this point, the preprocessor handles all the low-level tasks of data partitioning. By interpreting the directives, the preprocessor produces a program that performs the following tasks: sets up shared data structures; creates a set of identical processes; schedules tasks among processes; handles mutual exclusion and process synchronization. All this is done in a way that is totally transparent to the user. For more detailed information about the loop variable classification and the use of the Parallel Programming Directives, refer to [24]. 33
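As an illustration of how the directives of Table 2.2 are applied, the following hypothetical subroutine marks an independent Do-loop for parallel execution. The clause names SHARE, LOCAL and REDUCTION are written here only to match the variable classes listed above; they are an assumption of this sketch, and the exact clause syntax accepted by the Sequent Fortran preprocessor is defined in [24].

      SUBROUTINE PARLOOP(X, B, C, N, S)
C     Hypothetical loop prepared for parallel execution with the Fortran
C     Parallel Programming Directives of Table 2.2.  The clause spellings
C     (SHARE, LOCAL, REDUCTION) are assumed; see [24] for the exact syntax.
      INTEGER N, I
      REAL X(N), B(N), C(N), S
C     X, B, C and N are shared; I is local to each process; S is a
C     reduction variable, which the generated code updates inside a
C     critical section.
C$DOACROSS SHARE(X, B, C, N), LOCAL(I), REDUCTION(S)
      DO 10 I = 1, N
         X(I) = B(I) + C(I)
         S = S + X(I)
   10 CONTINUE
      RETURN
      END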

50 If necessary, the user is allowed to call the Dynix Parallel Programming Library routines (see the next section), in order to preserve the correct data flow within the loop. Parallel Programming Library: data partitioning with Dynix The Sequent Parallel Programming Library is a collection of C routines which allow the programmer to write parallel Fortran programs (as well as C and Pascal programs). This library includes three sets of routines: 1) microtasking routines (microtasking library); 2) routines for general use with data partitioning programs (data partitioning library); 3) routines for memory allocation in data partitioning programs (memory allocation library). By means of these routines, the user is able to: create sets of processes to execute subprograms in parallel; schedule tasks among processes; synchronise processes among tasks; allocate memory for shared data. As a result, programs that use the Parallel Programming Library can be made to balance loads automatically among processors and to adjust the division of tasks at run time (basing the division on the number of available processors). 34

51 Data partitioning with Dynix consists of the creation of multiple independent processes to execute iteration loops in parallel. This is done as follows: a) each loop to be executed in parallel is contained in a subroutine; b) for each loop, the program calls a special function (m_fork), which forks a set of child processes and assigns a copy of the subroutine to each process; c) each forked process executes some of the loop iterations (either static or dynamic scheduling can be used); d) when necessary, the subroutine may contain calls to synchronization routines (m_sync, m_lock, m_unlock, etc.); e) when all the loop iterations have been executed, control returns from the subroutine to the main program (a minimal sketch of this calling structure is given after the tables below). At this point, the program either terminates the parallel processes (by means of the m_kill_procs routine), or it suspends their execution until they are needed again (m_park_procs and m_rele_procs routines), or it leaves the parallel processes to spin in a busy-wait state and uses them later. A complete list of all the routines available in the microtasking library, in the data partitioning library and in the memory allocation library is given in the following three tables (Tables 2.3, 2.4, 2.5). 35

52 TABLE 2.3. Parallel Programming Library Microtasking Routines.
ROUTINE          DESCRIPTION
m_fork           Execute a subprogram in parallel
m_get_myid       Return process identification number
m_get_numprocs   Return number of child processes
m_kill_procs     Terminate child processes
m_lock           Lock a lock
m_multi          End single-process code section
m_next           Increment global counter
m_park_procs     Suspend child process execution
m_rele_procs     Resume child process execution
m_set_procs      Set number of child processes
m_single         Begin single-process code section
m_sync           Check in at barrier
m_unlock         Unlock a lock
Note: the microtasking library is designed "around" the m_fork routine; any other routine belonging to this library should only be used in combination with the m_fork routine. 36

53 TABLE 2.4. Parallel Programming Library Data Partitioning Routines.
ROUTINE            DESCRIPTION
cpus_online        Return number of CPUs on-line
s_init_barrier     Initialise a barrier
S_INIT_BARRIER     C macro
s_init_lock        Initialise a lock
S_INIT_LOCK        C macro
s_lock, s_clock    Lock a lock
S_LOCK, S_CLOCK    C macro
s_unlock           Unlock a lock
S_UNLOCK           C macro
s_wait_barrier     Wait at a barrier
S_WAIT_BARRIER     C macro
Note: the data partitioning library includes a routine to determine the number of available processors; it also includes several synchronization routines and their analogous C preprocessor macros (these macros are faster than the normal function calls, but they can add to the code size). 37

54 TABLE 2.5. Parallel Programming Library Memory Allocation Routines.
ROUTINE           DESCRIPTION
brk, sbrk         Change private data segment size
shbrk, shsbrk     Change shared data segment size
shfree            De-allocate shared data memory
shmalloc          Allocate shared data memory
Note: the memory allocation library consists of routines that allow data partitioning programs to allocate or de-allocate shared memory; these routines also permit a change in the amount of shared and private memory assigned to a process. For more detailed information concerning the use of the Parallel Programming Library, refer to the Sequent Guide to Parallel Programming [24]. Data partitioning with Dynix, as well as data partitioning with Sequent Fortran, requires an analysis of all the variables concerned with the section of code (Do-loop) to be performed in parallel. It is necessary to identify: shared variables, i.e. "read-only" arrays and scalars, or arrays whose elements are referenced by only one loop iteration; private variables, i.e. variables that are initialised in each loop iteration before their values are used; dependent variables (reduction variables, ordered variables, locked variables). 38
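To close this section, here is a minimal sketch of the data partitioning scheme described in steps a) to e) above, using the microtasking routines of Table 2.3 together with cpus_online of Table 2.4. The Fortran calling sequences are assumptions based on those tables and on [24], the loop body is a hypothetical example rather than one of the thesis routines, and the arrays X, B and C are assumed to reside in shared memory (for instance, memory obtained with shmalloc of Table 2.5, or a shared common block as described in [24]) so that the children's updates are visible to the parent.

      PROGRAM MAIN
C     Sketch of data partitioning with the Dynix microtasking library:
C     the parallel loop is placed in subroutine ADDVEC, a set of child
C     processes is forked to execute it, and the processes are then
C     terminated.  Calling sequences are assumed from Table 2.3 and [24];
C     X, B and C are shown as ordinary arrays for brevity, but must be
C     placed in shared memory for the results to reach the parent.
      INTEGER N
      PARAMETER (N = 1000)
      REAL X(N), B(N), C(N)
      INTEGER I, NPROCS, CPUS_ONLINE
      EXTERNAL ADDVEC

      DO 10 I = 1, N
         B(I) = REAL(I)
         C(I) = 1.0
   10 CONTINUE

C     Use one process per available processor (cpus_online, Table 2.4).
      NPROCS = CPUS_ONLINE()
      CALL M_SET_PROCS(NPROCS)

C     Fork the child processes; each one executes a copy of ADDVEC.
      CALL M_FORK(ADDVEC, X, B, C, N)

C     Terminate the child processes on completion.
      CALL M_KILL_PROCS()

      WRITE (*,*) 'X(N) = ', X(N)
      END

      SUBROUTINE ADDVEC(X, B, C, N)
C     Loop body executed by every process: static (interleaved)
C     scheduling of the iterations, as in section 2.4.
      INTEGER N, I, ME, NP
      REAL X(N), B(N), C(N)
      INTEGER M_GET_MYID, M_GET_NUMPROCS

      ME = M_GET_MYID()
      NP = M_GET_NUMPROCS()
      DO 20 I = ME + 1, N, NP
         X(I) = B(I) + C(I)
   20 CONTINUE
      RETURN
      END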


Acoustic Links. Maximizing Channel Utilization for Underwater

Acoustic Links. Maximizing Channel Utilization for Underwater Maximizing Channel Utilization for Underwater Aousti Links Albert F Hairris III Davide G. B. Meneghetti Adihele Zorzi Department of Information Engineering University of Padova, Italy Email: {harris,davide.meneghetti,zorzi}@dei.unipd.it

More information

Zippy - A coarse-grained reconfigurable array with support for hardware virtualization

Zippy - A coarse-grained reconfigurable array with support for hardware virtualization Zippy - A oarse-grained reonfigurable array with support for hardware virtualization Christian Plessl Computer Engineering and Networks Lab ETH Zürih, Switzerland plessl@tik.ee.ethz.h Maro Platzner Department

More information

Formal Verification by Model Checking

Formal Verification by Model Checking Formal Verifiation by Model Cheking Jonathan Aldrih Carnegie Mellon University Based on slides developed by Natasha Sharygina 15-413: Introdution to Software Engineering Fall 2005 3 Formal Verifiation

More information

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425)

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425) Automati Physial Design Tuning: Workload as a Sequene Sanjay Agrawal Mirosoft Researh One Mirosoft Way Redmond, WA, USA +1-(425) 75-357 sagrawal@mirosoft.om Eri Chu * Computer Sienes Department University

More information

Cross-layer Resource Allocation on Broadband Power Line Based on Novel QoS-priority Scheduling Function in MAC Layer

Cross-layer Resource Allocation on Broadband Power Line Based on Novel QoS-priority Scheduling Function in MAC Layer Communiations and Networ, 2013, 5, 69-73 http://dx.doi.org/10.4236/n.2013.53b2014 Published Online September 2013 (http://www.sirp.org/journal/n) Cross-layer Resoure Alloation on Broadband Power Line Based

More information

Graph-Based vs Depth-Based Data Representation for Multiview Images

Graph-Based vs Depth-Based Data Representation for Multiview Images Graph-Based vs Depth-Based Data Representation for Multiview Images Thomas Maugey, Antonio Ortega, Pasal Frossard Signal Proessing Laboratory (LTS), Eole Polytehnique Fédérale de Lausanne (EPFL) Email:

More information

mahines. HBSP enhanes the appliability of the BSP model by inorporating parameters that reet the relative speeds of the heterogeneous omputing omponen

mahines. HBSP enhanes the appliability of the BSP model by inorporating parameters that reet the relative speeds of the heterogeneous omputing omponen The Heterogeneous Bulk Synhronous Parallel Model Tiani L. Williams and Rebea J. Parsons Shool of Computer Siene University of Central Florida Orlando, FL 32816-2362 fwilliams,rebeag@s.uf.edu Abstrat. Trends

More information

HEXA: Compact Data Structures for Faster Packet Processing

HEXA: Compact Data Structures for Faster Packet Processing Washington University in St. Louis Washington University Open Sholarship All Computer Siene and Engineering Researh Computer Siene and Engineering Report Number: 27-26 27 HEXA: Compat Data Strutures for

More information

1. Introduction. 2. The Probable Stope Algorithm

1. Introduction. 2. The Probable Stope Algorithm 1. Introdution Optimization in underground mine design has reeived less attention than that in open pit mines. This is mostly due to the diversity o underground mining methods and omplexity o underground

More information

Abstract. Key Words: Image Filters, Fuzzy Filters, Order Statistics Filters, Rank Ordered Mean Filters, Channel Noise. 1.

Abstract. Key Words: Image Filters, Fuzzy Filters, Order Statistics Filters, Rank Ordered Mean Filters, Channel Noise. 1. Fuzzy Weighted Rank Ordered Mean (FWROM) Filters for Mixed Noise Suppression from Images S. Meher, G. Panda, B. Majhi 3, M.R. Meher 4,,4 Department of Eletronis and I.E., National Institute of Tehnology,

More information

DETECTION METHOD FOR NETWORK PENETRATING BEHAVIOR BASED ON COMMUNICATION FINGERPRINT

DETECTION METHOD FOR NETWORK PENETRATING BEHAVIOR BASED ON COMMUNICATION FINGERPRINT DETECTION METHOD FOR NETWORK PENETRATING BEHAVIOR BASED ON COMMUNICATION FINGERPRINT 1 ZHANGGUO TANG, 2 HUANZHOU LI, 3 MINGQUAN ZHONG, 4 JIAN ZHANG 1 Institute of Computer Network and Communiation Tehnology,

More information

Interconnection Styles

Interconnection Styles Interonnetion tyles oftware Design Following the Export (erver) tyle 2 M1 M4 M5 4 M3 M6 1 3 oftware Design Following the Export (Client) tyle e 2 e M1 M4 M5 4 M3 M6 1 e 3 oftware Design Following the Export

More information

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger.

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger. - 1 - S 21 Diretory-based Administration of Virtual Private Networks: Poliy & Configuration Charles A Kunzinger kunzinge@us.ibm.om - 2 - Clik here Agenda to type page title What is a VPN? What is VPN Poliy?

More information

The Minimum Redundancy Maximum Relevance Approach to Building Sparse Support Vector Machines

The Minimum Redundancy Maximum Relevance Approach to Building Sparse Support Vector Machines The Minimum Redundany Maximum Relevane Approah to Building Sparse Support Vetor Mahines Xiaoxing Yang, Ke Tang, and Xin Yao, Nature Inspired Computation and Appliations Laboratory (NICAL), Shool of Computer

More information

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks Abouberine Ould Cheikhna Department of Computer Siene University of Piardie Jules Verne 80039 Amiens Frane Ould.heikhna.abouberine @u-piardie.fr

More information

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks International Journal of Advanes in Computer Networks and Its Seurity IJCNS A Load-Balaned Clustering Protool for Hierarhial Wireless Sensor Networks Mehdi Tarhani, Yousef S. Kavian, Saman Siavoshi, Ali

More information

Evaluation of Benchmark Performance Estimation for Parallel. Fortran Programs on Massively Parallel SIMD and MIMD. Computers.

Evaluation of Benchmark Performance Estimation for Parallel. Fortran Programs on Massively Parallel SIMD and MIMD. Computers. Evaluation of Benhmark Performane Estimation for Parallel Fortran Programs on Massively Parallel SIMD and MIMD Computers Thomas Fahringer Dept of Software Tehnology and Parallel Systems University of Vienna

More information

Space- and Time-Efficient BDD Construction via Working Set Control

Space- and Time-Efficient BDD Construction via Working Set Control Spae- and Time-Effiient BDD Constrution via Working Set Control Bwolen Yang Yirng-An Chen Randal E. Bryant David R. O Hallaron Computer Siene Department Carnegie Mellon University Pittsburgh, PA 15213.

More information

Direct-Mapped Caches

Direct-Mapped Caches A Case for Diret-Mapped Cahes Mark D. Hill University of Wisonsin ahe is a small, fast buffer in whih a system keeps those parts, of the ontents of a larger, slower memory that are likely to be used soon.

More information

Multiple-Criteria Decision Analysis: A Novel Rank Aggregation Method

Multiple-Criteria Decision Analysis: A Novel Rank Aggregation Method 3537 Multiple-Criteria Deision Analysis: A Novel Rank Aggregation Method Derya Yiltas-Kaplan Department of Computer Engineering, Istanbul University, 34320, Avilar, Istanbul, Turkey Email: dyiltas@ istanbul.edu.tr

More information

High-level synthesis under I/O Timing and Memory constraints

High-level synthesis under I/O Timing and Memory constraints Highlevel synthesis under I/O Timing and Memory onstraints Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eri Senn, Eri Martin To ite this version: Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eri Senn,

More information

Drawing lines. Naïve line drawing algorithm. drawpixel(x, round(y)); double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx; double y = y0;

Drawing lines. Naïve line drawing algorithm. drawpixel(x, round(y)); double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx; double y = y0; Naïve line drawing algorithm // Connet to grid points(x0,y0) and // (x1,y1) by a line. void drawline(int x0, int y0, int x1, int y1) { int x; double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx;

More information

Self-Adaptive Parent to Mean-Centric Recombination for Real-Parameter Optimization

Self-Adaptive Parent to Mean-Centric Recombination for Real-Parameter Optimization Self-Adaptive Parent to Mean-Centri Reombination for Real-Parameter Optimization Kalyanmoy Deb and Himanshu Jain Department of Mehanial Engineering Indian Institute of Tehnology Kanpur Kanpur, PIN 86 {deb,hjain}@iitk.a.in

More information

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks Unsupervised Stereosopi Video Objet Segmentation Based on Ative Contours and Retrainable Neural Networks KLIMIS NTALIANIS, ANASTASIOS DOULAMIS, and NIKOLAOS DOULAMIS National Tehnial University of Athens

More information

Chapter 2: Introduction to Maple V

Chapter 2: Introduction to Maple V Chapter 2: Introdution to Maple V 2-1 Working with Maple Worksheets Try It! (p. 15) Start a Maple session with an empty worksheet. The name of the worksheet should be Untitled (1). Use one of the standard

More information

THROUGHPUT EVALUATION OF AN ASYMMETRICAL FDDI TOKEN RING NETWORK WITH MULTIPLE CLASSES OF TRAFFIC

THROUGHPUT EVALUATION OF AN ASYMMETRICAL FDDI TOKEN RING NETWORK WITH MULTIPLE CLASSES OF TRAFFIC THROUGHPUT EVALUATION OF AN ASYMMETRICAL FDDI TOKEN RING NETWORK WITH MULTIPLE CLASSES OF TRAFFIC Priya N. Werahera and Anura P. Jayasumana Department of Eletrial Engineering Colorado State University

More information

Parallelizing Frequent Web Access Pattern Mining with Partial Enumeration for High Speedup

Parallelizing Frequent Web Access Pattern Mining with Partial Enumeration for High Speedup Parallelizing Frequent Web Aess Pattern Mining with Partial Enumeration for High Peiyi Tang Markus P. Turkia Department of Computer Siene Department of Computer Siene University of Arkansas at Little Rok

More information

arxiv: v1 [cs.db] 13 Sep 2017

arxiv: v1 [cs.db] 13 Sep 2017 An effiient lustering algorithm from the measure of loal Gaussian distribution Yuan-Yen Tai (Dated: May 27, 2018) In this paper, I will introdue a fast and novel lustering algorithm based on Gaussian distribution

More information

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Arne Hamann, Razvan Rau, Rolf Ernst Institute of Computer and Communiation Network Engineering Tehnial University of Braunshweig,

More information

COMP 181. Prelude. Intermediate representations. Today. Types of IRs. High-level IR. Intermediate representations and code generation

COMP 181. Prelude. Intermediate representations. Today. Types of IRs. High-level IR. Intermediate representations and code generation Prelude COMP 181 Intermediate representations and ode generation November, 009 What is this devie? Large Hadron Collider What is a hadron? Subatomi partile made up of quarks bound by the strong fore What

More information

PROJECT PERIODIC REPORT

PROJECT PERIODIC REPORT FP7-ICT-2007-1 Contrat no.: 215040 www.ative-projet.eu PROJECT PERIODIC REPORT Publishable Summary Grant Agreement number: ICT-215040 Projet aronym: Projet title: Enabling the Knowledge Powered Enterprise

More information

Capturing Large Intra-class Variations of Biometric Data by Template Co-updating

Capturing Large Intra-class Variations of Biometric Data by Template Co-updating Capturing Large Intra-lass Variations of Biometri Data by Template Co-updating Ajita Rattani University of Cagliari Piazza d'armi, Cagliari, Italy ajita.rattani@diee.unia.it Gian Lua Marialis University

More information

Facility Location: Distributed Approximation

Facility Location: Distributed Approximation Faility Loation: Distributed Approximation Thomas Mosibroda Roger Wattenhofer Distributed Computing Group PODC 2005 Where to plae ahes in the Internet? A distributed appliation that has to dynamially plae

More information

Reading Object Code. A Visible/Z Lesson

Reading Object Code. A Visible/Z Lesson Reading Objet Code A Visible/Z Lesson The Idea: When programming in a high-level language, we rarely have to think about the speifi ode that is generated for eah instrution by a ompiler. But as an assembly

More information

Series/1 GA File No i=:: IBM Series/ Battery Backup Unit Description :::5 ~ ~ >-- ffi B~88 ~0 (] II IIIIII

Series/1 GA File No i=:: IBM Series/ Battery Backup Unit Description :::5 ~ ~ >-- ffi B~88 ~0 (] II IIIIII Series/1 I. (.. GA34-0032-0 File No. 51-10 a i=:: 5 Q 1 IBM Series/1 4999 Battery Bakup Unit Desription B88 0 (] o. :::5 >-- ffi "- I II1111111111IIIIII1111111 ---- - - - - ----- --_.- Series/1 «h: ",

More information

Reducing Runtime Complexity of Long-Running Application Services via Dynamic Profiling and Dynamic Bytecode Adaptation for Improved Quality of Service

Reducing Runtime Complexity of Long-Running Application Services via Dynamic Profiling and Dynamic Bytecode Adaptation for Improved Quality of Service Reduing Runtime Complexity of Long-Running Appliation Servies via Dynami Profiling and Dynami Byteode Adaptation for Improved Quality of Servie ABSTRACT John Bergin Performane Engineering Laboratory University

More information

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments 21st Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communiations 1 RAC 2 E: Novel Rendezvous Protool for Asynhronous Cognitive Radios in Cooperative Environments Valentina Pavlovska,

More information

Approximate logic synthesis for error tolerant applications

Approximate logic synthesis for error tolerant applications Approximate logi synthesis for error tolerant appliations Doohul Shin and Sandeep K. Gupta Eletrial Engineering Department, University of Southern California, Los Angeles, CA 989 {doohuls, sandeep}@us.edu

More information

Alleviating DFT cost using testability driven HLS

Alleviating DFT cost using testability driven HLS Alleviating DFT ost using testability driven HLS M.L.Flottes, R.Pires, B.Rouzeyre Laboratoire d Informatique, de Robotique et de Miroéletronique de Montpellier, U.M. CNRS 5506 6 rue Ada, 34392 Montpellier

More information

Implementing Load-Balanced Switches With Fat-Tree Networks

Implementing Load-Balanced Switches With Fat-Tree Networks Implementing Load-Balaned Swithes With Fat-Tree Networks Hung-Shih Chueh, Ching-Min Lien, Cheng-Shang Chang, Jay Cheng, and Duan-Shin Lee Department of Eletrial Engineering & Institute of Communiations

More information

Torpedo Trajectory Visual Simulation Based on Nonlinear Backstepping Control

Torpedo Trajectory Visual Simulation Based on Nonlinear Backstepping Control orpedo rajetory Visual Simulation Based on Nonlinear Bakstepping Control Peng Hai-jun 1, Li Hui-zhou Chen Ye 1, 1. Depart. of Weaponry Eng, Naval Univ. of Engineering, Wuhan 400, China. Depart. of Aeronautial

More information

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments Establishing Seure Ethernet LANs Using Intelligent Swithing Hubs in Internet Environments WOEIJIUNN TSAUR AND SHIJINN HORNG Department of Eletrial Engineering, National Taiwan University of Siene and Tehnology,

More information

Reading Object Code. A Visible/Z Lesson

Reading Object Code. A Visible/Z Lesson Reading Objet Code A Visible/Z Lesson The Idea: When programming in a high-level language, we rarely have to think about the speifi ode that is generated for eah instrution by a ompiler. But as an assembly

More information

EXODUS II: A Finite Element Data Model

EXODUS II: A Finite Element Data Model SAND92-2137 Unlimited Release Printed November 1995 Distribution Category UC-705 EXODUS II: A Finite Element Data Model Larry A. Shoof, Vitor R. Yarberry Computational Mehanis and Visualization Department

More information

Sparse Certificates for 2-Connectivity in Directed Graphs

Sparse Certificates for 2-Connectivity in Directed Graphs Sparse Certifiates for 2-Connetivity in Direted Graphs Loukas Georgiadis Giuseppe F. Italiano Aikaterini Karanasiou Charis Papadopoulos Nikos Parotsidis Abstrat Motivated by the emergene of large-sale

More information

Real-Time Control for a Turbojet Engine

Real-Time Control for a Turbojet Engine A Multiproessor mplementation of Real-Time Control for a Turbojet Engine Phillip L. Shaffer ABSTRACT: A real-time ontrol program for a turbojet engine has been implemented on a four-proessor omputer, ahieving

More information

Test Case Generation from UML State Machines

Test Case Generation from UML State Machines Test Case Generation from UML State Mahines Dirk Seifert To ite this version: Dirk Seifert. Test Case Generation from UML State Mahines. [Researh Report] 2008. HAL Id: inria-00268864

More information

Multi-hop Fast Conflict Resolution Algorithm for Ad Hoc Networks

Multi-hop Fast Conflict Resolution Algorithm for Ad Hoc Networks Multi-hop Fast Conflit Resolution Algorithm for Ad Ho Networks Shengwei Wang 1, Jun Liu 2,*, Wei Cai 2, Minghao Yin 2, Lingyun Zhou 2, and Hui Hao 3 1 Power Emergeny Center, Sihuan Eletri Power Corporation,

More information

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar Plot-to-trak orrelation in A-SMGCS using the target images from a Surfae Movement Radar G. Golino Radar & ehnology Division AMS, Italy ggolino@amsjv.it Abstrat he main topi of this paper is the formulation

More information

Performance Improvement of TCP on Wireless Cellular Networks by Adaptive FEC Combined with Explicit Loss Notification

Performance Improvement of TCP on Wireless Cellular Networks by Adaptive FEC Combined with Explicit Loss Notification erformane Improvement of TC on Wireless Cellular Networks by Adaptive Combined with Expliit Loss tifiation Masahiro Miyoshi, Masashi Sugano, Masayuki Murata Department of Infomatis and Mathematial Siene,

More information

Flow Demands Oriented Node Placement in Multi-Hop Wireless Networks

Flow Demands Oriented Node Placement in Multi-Hop Wireless Networks Flow Demands Oriented Node Plaement in Multi-Hop Wireless Networks Zimu Yuan Institute of Computing Tehnology, CAS, China {zimu.yuan}@gmail.om arxiv:153.8396v1 [s.ni] 29 Mar 215 Abstrat In multi-hop wireless

More information

Uplink Channel Allocation Scheme and QoS Management Mechanism for Cognitive Cellular- Femtocell Networks

Uplink Channel Allocation Scheme and QoS Management Mechanism for Cognitive Cellular- Femtocell Networks 62 Uplink Channel Alloation Sheme and QoS Management Mehanism for Cognitive Cellular- Femtoell Networks Kien Du Nguyen 1, Hoang Nam Nguyen 1, Hiroaki Morino 2 and Iwao Sasase 3 1 University of Engineering

More information

Tackling IPv6 Address Scalability from the Root

Tackling IPv6 Address Scalability from the Root Takling IPv6 Address Salability from the Root Mei Wang Ashish Goel Balaji Prabhakar Stanford University {wmei, ashishg, balaji}@stanford.edu ABSTRACT Internet address alloation shemes have a huge impat

More information

An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index

An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index IJCSES International Journal of Computer Sienes and Engineering Systems, ol., No.4, Otober 2007 CSES International 2007 ISSN 0973-4406 253 An Optimized Approah on Applying Geneti Algorithm to Adaptive

More information

Fuzzy Meta Node Fuzzy Metagraph and its Cluster Analysis

Fuzzy Meta Node Fuzzy Metagraph and its Cluster Analysis Journal of Computer Siene 4 (): 9-97, 008 ISSN 549-3636 008 Siene Publiations Fuzzy Meta Node Fuzzy Metagraph and its Cluster Analysis Deepti Gaur, Aditya Shastri and Ranjit Biswas Department of Computer

More information

'* ~rr' _ ~~ f' lee : eel. Series/1 []J 0 [[] "'l... !l]j1. IBM Series/1 FORTRAN IV. I ntrod uction ...

'* ~rr' _ ~~ f' lee : eel. Series/1 []J 0 [[] 'l... !l]j1. IBM Series/1 FORTRAN IV. I ntrod uction ... ---- --- - ----- - - - --_.- --- Series/1 GC34-0132-0 51-25 PROGRAM PRODUCT 1 IBM Series/1 FORTRAN IV I ntrod ution Program Numbers 5719-F01 5719-F03 0 lee : eel II 11111111111111111111111111111111111111111111111

More information

Improved flooding of broadcast messages using extended multipoint relaying

Improved flooding of broadcast messages using extended multipoint relaying Improved flooding of broadast messages using extended multipoint relaying Pere Montolio Aranda a, Joaquin Garia-Alfaro a,b, David Megías a a Universitat Oberta de Catalunya, Estudis d Informàtia, Mulimèdia

More information

The Tofu Interconnect D

The Tofu Interconnect D 2018 IEEE International Conferene on Cluster Computing The Tofu Interonnet D Yuihiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouihi Hirai, Toshiyuki Shimizu Next Generation Tehnial

More information

The Implementation of RRTs for a Remote-Controlled Mobile Robot

The Implementation of RRTs for a Remote-Controlled Mobile Robot ICCAS5 June -5, KINEX, Gyeonggi-Do, Korea he Implementation of RRs for a Remote-Controlled Mobile Robot Chi-Won Roh*, Woo-Sub Lee **, Sung-Chul Kang *** and Kwang-Won Lee **** * Intelligent Robotis Researh

More information

1. The collection of the vowels in the word probability. 2. The collection of real numbers that satisfy the equation x 9 = 0.

1. The collection of the vowels in the word probability. 2. The collection of real numbers that satisfy the equation x 9 = 0. C HPTER 1 SETS I. DEFINITION OF SET We begin our study of probability with the disussion of the basi onept of set. We assume that there is a ommon understanding of what is meant by the notion of a olletion

More information

Automated System for the Study of Environmental Loads Applied to Production Risers Dustin M. Brandt 1, Celso K. Morooka 2, Ivan R.

Automated System for the Study of Environmental Loads Applied to Production Risers Dustin M. Brandt 1, Celso K. Morooka 2, Ivan R. EngOpt 2008 - International Conferene on Engineering Optimization Rio de Janeiro, Brazil, 01-05 June 2008. Automated System for the Study of Environmental Loads Applied to Prodution Risers Dustin M. Brandt

More information

Compilation Lecture 11a. Register Allocation Noam Rinetzky. Text book: Modern compiler implementation in C Andrew A.

Compilation Lecture 11a. Register Allocation Noam Rinetzky. Text book: Modern compiler implementation in C Andrew A. Compilation 0368-3133 Leture 11a Text book: Modern ompiler implementation in C Andrew A. Appel Register Alloation Noam Rinetzky 1 Registers Dediated memory loations that an be aessed quikly, an have omputations

More information

CA Release Automation 5.x Implementation Proven Professional Exam (CAT-600) Study Guide Version 1.1

CA Release Automation 5.x Implementation Proven Professional Exam (CAT-600) Study Guide Version 1.1 Exam (CAT-600) Study Guide Version 1.1 PROPRIETARY AND CONFIDENTIAL INFORMATION 2016 CA. All rights reserved. CA onfidential & proprietary information. For CA, CA Partner and CA Customer use only. No unauthorized

More information

INTERPOLATED AND WARPED 2-D DIGITAL WAVEGUIDE MESH ALGORITHMS

INTERPOLATED AND WARPED 2-D DIGITAL WAVEGUIDE MESH ALGORITHMS Proeedings of the COST G-6 Conferene on Digital Audio Effets (DAFX-), Verona, Italy, Deember 7-9, INTERPOLATED AND WARPED -D DIGITAL WAVEGUIDE MESH ALGORITHMS Vesa Välimäki Lab. of Aoustis and Audio Signal

More information

Automatic Generation of Transaction-Level Models for Rapid Design Space Exploration

Automatic Generation of Transaction-Level Models for Rapid Design Space Exploration Automati Generation of Transation-Level Models for Rapid Design Spae Exploration Dongwan Shin, Andreas Gerstlauer, Junyu Peng, Rainer Dömer and Daniel D. Gajski Center for Embedded Computer Systems University

More information

A Formal Hybrid Analysis Technique for Composite Web Services Verification

A Formal Hybrid Analysis Technique for Composite Web Services Verification A Formal Hybrid Analysis Tehnique for Composite Web Servies Verifiation MAY HAIDAR 1,2, HICHAM H. HALLAL 1 1 Computer Siene Department / Department of Eletrial Engineering Fahad Bin Sultan University P.O

More information

represent = as a finite deimal" either in base 0 or in base. We an imagine that the omputer first omputes the mathematial = then rounds the result to

represent = as a finite deimal either in base 0 or in base. We an imagine that the omputer first omputes the mathematial = then rounds the result to Sientifi Computing Chapter I Computer Arithmeti Jonathan Goodman Courant Institute of Mathemaial Sienes Last revised January, 00 Introdution One of the many soures of error in sientifi omputing is inexat

More information

Scheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiprocessor

Scheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiprocessor Sheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiproessor Orlando Moreira NXP Semiondutors Researh Eindhoven, Netherlands orlando.moreira@nxp.om Frederio Valente Universidade

More information

Uncovering Hidden Loop Level Parallelism in Sequential Applications

Uncovering Hidden Loop Level Parallelism in Sequential Applications Unovering Hidden Loop Level Parallelism in Sequential Appliations Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Sott Mahlke Advaned Computer Arhiteture Laboratory University of Mihigan, Ann Arbor,

More information

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes Redued-Complexity Column-Layered Deoding and Implementation for LDPC Codes Zhiqiang Cui 1, Zhongfeng Wang 2, Senior Member, IEEE, and Xinmiao Zhang 3 1 Qualomm In., San Diego, CA 92121, USA 2 Broadom Corp.,

More information

Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introduction Information Retrieval... 8

Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introduction Information Retrieval... 8 Contents Contents...I List of Tables...VIII List of Figures...IX 1. Introdution... 1 1.1. Internet Information...2 1.2. Internet Information Retrieval...3 1.2.1. Doument Indexing...4 1.2.2. Doument Retrieval...4

More information

An Alternative Approach to the Fuzzifier in Fuzzy Clustering to Obtain Better Clustering Results

An Alternative Approach to the Fuzzifier in Fuzzy Clustering to Obtain Better Clustering Results An Alternative Approah to the Fuzziier in Fuzzy Clustering to Obtain Better Clustering Results Frank Klawonn Department o Computer Siene University o Applied Sienes BS/WF Salzdahlumer Str. 46/48 D-38302

More information